Uniform random sampling not recommended for large graph size estimation

Cited by: 6

Authors:

Lu, Jianguo [1]

Wang, Hao [1]

Affiliation:

[1] Univ Windsor, Sch Comp Sci, Windsor, ON, Canada

Source:

INFORMATION SCIENCES | 2017 / Volume 421

Funding:

Natural Sciences and Engineering Research Council of Canada;

Keywords:

NETWORKS;

DOI:

10.1016/j.ins.2017.08.030

Chinese Library Classification:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

The norm in data size estimation is to use uniform random samples whenever possible. Tremendous effort has gone into obtaining uniform random samples using methods such as Metropolis-Hastings random walk or importance sampling [2]. This paper shows that, contrary to common practice, uniform random sampling should be avoided when PPS (probability proportional to size) sampling is available for large data. To develop intuition for the sampling process, we discuss the sampling and estimation problem in the context of graphs. The size is the number of nodes in the graph; uniform random sampling corresponds to uniform random node (RN) sampling; and PPS sampling is approximated by random edge (RE) sampling. In this setting, we show that for large graphs RE sampling outperforms RN sampling by a ratio proportional to the normalized graph degree variance. This result is particularly important in the era of big data, when data are typically large and scale-free [3], resulting in large degree variance. We derive the result by giving the variances of the RN and RE estimators. Each step of the derivation is supported and demonstrated by simulation studies assuming power law distributions. We then use 18 real-world networks to verify the result. Furthermore, we show that the performance of random walk (RW) sampling is data dependent and can be significantly worse than that of RN and RE. More specifically, RW can estimate the size of online social networks but not of Web graphs, owing to the difference in graph conductance. Crown Copyright (C) 2017 Published by Elsevier Inc. All rights reserved.
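
The RN and RE settings described in the abstract can be illustrated with a minimal Python sketch. It assumes the standard collision-counting (birthday-paradox) size estimators, which the abstract does not spell out, so this may differ from the estimators actually analysed in the paper. All function names here (count_collisions, estimate_size_rn, estimate_size_re, sample_rn, sample_re) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def count_collisions(sample):
    """Number of unordered sample pairs that hit the same node."""
    return sum(c * (c - 1) // 2 for c in Counter(sample).values())


def estimate_size_rn(sample):
    """Size estimate from uniform random node (RN) samples.

    Assumed estimator: (number of sample pairs) / (number of collisions).
    Returns None when no collision was observed.
    """
    n = len(sample)
    collisions = count_collisions(sample)
    if collisions == 0:
        return None
    return n * (n - 1) / 2 / collisions


def estimate_size_re(sample, degree):
    """Size estimate from degree-proportional (PPS) node samples.

    Assumed estimator: the RN formula multiplied by
    mean(d) * mean(1/d) over the sample, which corrects for the
    bias toward high-degree nodes. `degree` maps node -> degree.
    """
    n = len(sample)
    collisions = count_collisions(sample)
    if collisions == 0:
        return None
    mean_d = sum(degree[v] for v in sample) / n
    mean_inv_d = sum(1 / degree[v] for v in sample) / n
    return n * (n - 1) / 2 / collisions * mean_d * mean_inv_d


def sample_rn(nodes, n, rng):
    """Draw n nodes uniformly at random, with replacement."""
    return rng.choices(nodes, k=n)


def sample_re(edges, n, rng):
    """Draw n nodes by picking a random edge, then a random endpoint.

    Each node is selected with probability proportional to its degree.
    """
    return [rng.choice(rng.choice(edges)) for _ in range(n)]


if __name__ == "__main__":
    # Toy star-like graph; real experiments would use far larger graphs.
    rng = random.Random(0)
    edges = [(0, i) for i in range(1, 50)] + [(i, i + 1) for i in range(1, 49)]
    degree = Counter(v for e in edges for v in e)
    nodes = sorted(degree)
    print("true size:", len(nodes))
    print("RN estimate:", estimate_size_rn(sample_rn(nodes, 40, rng)))
    print("RE estimate:", estimate_size_re(sample_re(edges, 40, rng), degree))
```

On a toy graph like this a single run is noisy; comparing the two estimators in the spirit of the abstract would mean repeating the sampling many times and comparing the variance of the estimates.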

Pages: 136-153

Page count: 18
