d-blink: Distributed End-to-End Bayesian Entity Resolution

Cited by: 14
Authors
Marchant, Neil G. [1]
Kaplan, Andee [2]
Elazar, Daniel N. [3]
Rubinstein, Benjamin I. P. [1]
Steorts, Rebecca C. [4, 5]
Affiliations
[1] Univ Melbourne, Sch Comp & Informat Syst, Parkville, Vic 3010, Australia
[2] Colorado State Univ, Dept Stat, Ft Collins, CO 80523 USA
[3] Australian Bur Stat, Methodol Div, Belconnen, ACT, Australia
[4] Duke Univ, Dept Stat Sci & Comp Sci, Durham, NC USA
[5] US Census Bur DRB CBDRB FY 20309, Durham, NC USA
Funding
Australian Research Council
Keywords
Auxiliary variable; Distributed computing; Markov chain Monte Carlo; Partially collapsed Gibbs sampling; Record linkage;
DOI
10.1080/10618600.2020.1825451
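Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Discipline classification codes
020208 ; 070103 ; 0714 ;
Abstract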
Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing models are severely limited in practice, as standard inference algorithms scale quadratically in the number of records. While scaling can be managed by fitting the model on separate blocks of the data, such a naive approach may induce significant error in the posterior. In this article, we propose a principled model for scalable Bayesian ER, called "distributed Bayesian linkage" or d-blink, which jointly performs blocking and ER without compromising posterior correctness. Our approach relies on several key ideas, including: (i) an auxiliary variable representation that induces a partition of the entities and records into blocks; (ii) a method for constructing well-balanced blocks based on k-d trees; (iii) a distributed partially collapsed Gibbs sampler with improved mixing; and (iv) fast algorithms for performing Gibbs updates. Empirical studies on six datasets, including a case study on the 2010 Decennial Census, demonstrate the scalability and effectiveness of our approach. Supplementary materials for this article are available online.
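As a rough illustration of idea (ii), the sketch below splits a set of records into equally sized blocks by recursive median splits, cycling through the attributes in the manner of a k-d tree. The abstract does not spell out the construction, so this is a generic k-d style partition and not the authors' partition function; the name `kd_partition`, the tuple record layout, and the toy records are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: a tuple of string attributes (surname, birth year).
Record = Tuple[str, ...]


def kd_partition(records: Sequence[Record], depth: int, level: int = 0) -> List[List[Record]]:
    """Split records into up to 2**depth blocks of near-equal size.

    At each level the records are sorted on one attribute (cycling through
    the attributes, as a k-d tree does) and cut at the median position.
    """
    if depth == 0 or len(records) <= 1:
        return [list(records)]
    attr = level % len(records[0])                 # attribute used at this level
    ordered = sorted(records, key=lambda r: r[attr])
    mid = len(ordered) // 2                        # median cut keeps the halves balanced
    return (kd_partition(ordered[:mid], depth - 1, level + 1)
            + kd_partition(ordered[mid:], depth - 1, level + 1))


# Toy data, invented for illustration only.
records = [
    ("smith", "1980"), ("smyth", "1980"), ("jones", "1975"), ("jonas", "1975"),
    ("brown", "1990"), ("braun", "1990"), ("lee", "1985"), ("li", "1985"),
]
blocks = kd_partition(records, depth=2)
for i, block in enumerate(blocks):
    print(i, block)
```

A positional median cut like this can place records with identical attribute values in different blocks, so it only illustrates the balancing idea. A blocking scheme that must preserve the posterior would need a rule defined on attribute values rather than on sort positions.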
Pages: 406-421
Number of pages: 16