A Type-Based Blocking Technique for Efficient Entity Resolution over Large-Scale Data

Cited by: 1
Authors
Zhu, Hui-Juan [1]
Zhu, Zheng-Wei [1]
Jiang, Tong-Hai [2, 3]
Cheng, Li [2, 3]
Shi, Wei-Lei [2, 3]
Zhou, Xi [2, 3]
Zhao, Fan [2, 3]
Ma, Bo [2, 3]
Affiliations
[1] Changzhou Univ, Sch Informat Sci & Engn, Changzhou 213164, Peoples R China
[2] Chinese Acad Sci, Xinjiang Tech Inst Phys & Chem, Urumqi 830011, Peoples R China
[3] Xinjiang Lab Minor Speech & Language Informat Pro, Urumqi 830011, Peoples R China
DOI
10.1155/2018/2094696
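Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Subject classification code
0808; 0809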
Abstract
In data integration, entity resolution (ER) is an important technique for improving data quality. Existing research typically assumes that the target dataset contains only string-type data and uses a single similarity metric. For larger, high-dimensional datasets, redundant information needs to be verified using traditional blocking or windowing techniques. In this work, we propose a novel ER method using a hybrid approach that combines type-based multiblocks, a varying window size, and more flexible similarity metrics. In our new ER workflow, we reduce the search space for entity pairs through constraints on redundant attributes and matching likelihood. We develop a reference implementation of the proposed approach and validate its performance using a real-life dataset from an Internet of Things project. We evaluate the data processing system using five standard metrics: effectiveness, efficiency, accuracy, recall, and precision. Experimental results indicate that the proposed approach could be a promising alternative for entity resolution and could feasibly be applied to real-world data cleaning for large datasets.
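The abstract names three ingredients (type-based multiblocks, a per-block window, and type-aware similarity) without giving implementation details. The following is only a minimal sketch of how such a pipeline might be wired together, not the authors' method. Every name in it (block_key, candidate_pairs, resolve), the bucketing rules, the window sizes, and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def attr_similarity(a, b):
    """Choose a similarity metric by attribute type (illustrative choice)."""
    if is_number(a) and is_number(b):
        hi = max(abs(a), abs(b))
        return 1.0 if hi == 0 else max(0.0, 1.0 - abs(a - b) / hi)
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def block_key(value):
    """Type-based blocking key: numeric bucket of width 10, or 2-char prefix."""
    if is_number(value):
        return ("num", int(value // 10))
    return ("str", str(value).lower()[:2])


def sort_key(value):
    return value if is_number(value) else str(value).lower()


def candidate_pairs(records, windows):
    """Build one set of blocks per attribute, then slide a window in each block.

    `windows` maps attribute name -> window size (assumed per-attribute here).
    """
    pairs = set()
    for field, w in windows.items():
        blocks = {}
        for idx, rec in enumerate(records):
            blocks.setdefault(block_key(rec[field]), []).append(idx)
        for ids in blocks.values():
            ids.sort(key=lambda i: sort_key(records[i][field]))
            for pos, i in enumerate(ids):
                # Compare only with the next (w - 1) records in sorted order.
                for j in ids[pos + 1 : pos + w]:
                    pairs.add((min(i, j), max(i, j)))
    return pairs


def resolve(records, windows, threshold=0.85):
    """Return candidate pairs whose mean attribute similarity meets the threshold."""
    fields = list(windows)
    matches = []
    for i, j in candidate_pairs(records, windows):
        score = sum(attr_similarity(records[i][f], records[j][f]) for f in fields)
        if score / len(fields) >= threshold:
            matches.append((i, j))
    return sorted(matches)


# Hypothetical IoT-style records, invented for this example.
records = [
    {"name": "Sensor A-01", "temp": 21.5},
    {"name": "sensor a-01", "temp": 21.4},
    {"name": "Gateway 7", "temp": 35.0},
]
windows = {"name": 3, "temp": 2}
print(resolve(records, windows))
```

In this sketch the third record never shares a block with the first two, so it is never compared with them; that pruning of the pair search space is the point of blocking, while the window bounds the number of comparisons inside each block.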
Pages: 12