A Type-Based Blocking Technique for Efficient Entity Resolution over Large-Scale Data

Cited by: 1
Authors
Zhu, Hui-Juan [1]
Zhu, Zheng-Wei [1]
Jiang, Tong-Hai [2, 3]
Cheng, Li [2, 3]
Shi, Wei-Lei [2, 3]
Zhou, Xi [2, 3]
Zhao, Fan [2, 3]
Ma, Bo [2, 3]
Affiliations
[1] Changzhou Univ, Sch Informat Sci & Engn, Changzhou 213164, Peoples R China
[2] Chinese Acad Sci, Xinjiang Tech Inst Phys & Chem, Urumqi 830011, Peoples R China
[3] Xinjiang Lab Minor Speech & Language Informat Pro, Urumqi 830011, Peoples R China
DOI
10.1155/2018/2094696
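Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline classification code
0808; 0809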
Abstract
In data integration, entity resolution (ER) is an important technique for improving data quality. Existing studies typically assume that the target dataset contains only string-type data and use a single similarity metric. For larger, high-dimensional datasets, redundant information needs to be verified using traditional blocking or windowing techniques. In this work, we propose a novel ER method using a hybrid approach that combines type-based multiblocks, a varying window size, and more flexible similarity metrics. In our new ER workflow, we reduce the search space for entity pairs by applying constraints on redundant attributes and matching likelihood. We develop a reference implementation of the proposed approach and validate its performance using a real-life dataset from an Internet of Things project. We evaluate the data processing system using five standard metrics: effectiveness, efficiency, accuracy, recall, and precision. Experimental results indicate that the proposed approach could be a promising alternative for entity resolution and could feasibly be applied to real-world data cleaning for large datasets.
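The general idea named in the abstract (blocking by attribute type, a sliding window over sorted records, and a similarity metric chosen per data type) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `similarity`, `candidate_pairs`, and `resolve`, the fixed window size, the 0.85 threshold, and the sample records are all assumptions made for illustration, and the paper's varying window size is simplified here to a constant.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Hypothetical type-aware metric: relative closeness for numbers,
    character-based ratio for everything else (compared as lowercase strings)."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        scale = max(abs(a), abs(b))
        if scale == 0:
            return 1.0
        return max(0.0, 1.0 - abs(a - b) / scale)
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def candidate_pairs(records, key_attrs, window=3):
    """One blocking pass per attribute: group values by type, sort each group,
    and pair every record with its next (window - 1) neighbours."""
    pairs = set()
    for attr in key_attrs:
        by_type = {}
        for idx, rec in enumerate(records):
            value = rec.get(attr)
            if value is None:
                continue
            # Strings are sorted case-insensitively; other types use their value.
            sort_key = value.lower() if isinstance(value, str) else value
            by_type.setdefault(type(value).__name__, []).append((sort_key, idx))
        for block in by_type.values():
            block.sort()
            for i, (_, left) in enumerate(block):
                for _, right in block[i + 1 : i + window]:
                    pairs.add((min(left, right), max(left, right)))
    return pairs


def resolve(records, key_attrs, threshold=0.85, window=3):
    """Score only the candidate pairs and keep those above the threshold."""
    matches = []
    for left, right in sorted(candidate_pairs(records, key_attrs, window)):
        shared = [a for a in records[left] if a in records[right]]
        if not shared:
            continue
        score = sum(
            similarity(records[left][a], records[right][a]) for a in shared
        ) / len(shared)
        if score >= threshold:
            matches.append((left, right))
    return matches


# Made-up sample records, loosely in the spirit of IoT device data.
records = [
    {"name": "Temp Sensor A1", "reading": 20.5},
    {"name": "temp sensor A1", "reading": 20.4},
    {"name": "Pressure Gauge", "reading": 101.3},
]
print(resolve(records, ["name", "reading"], window=2))
```

With a window of 2, each record is compared only with its immediate neighbour in each sorted block, so the number of comparisons grows roughly linearly with the number of records per blocking pass rather than quadratically.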
Pages: 12
Related papers
50 in total
  • [1] Landmarks-based Blocking Method For Large-scale Entity Resolution
    Herath, Samudra
    Roughan, Matthew
    Glonek, Gary
    2020 IEEE 7TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA 2020), 2020: 773-774
  • [2] Blocking for Large-Scale Entity Resolution: Challenges, Algorithms, and Practical Examples
    Papadakis, George
    Palpanas, Themis
    2016 32ND IEEE INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2016: 1436-1439
  • [3] Efficient Interactive Training Selection for Large-Scale Entity Resolution
    Wang, Qing
    Vatsalan, Dinusha
    Christen, Peter
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PART II, 2015, 9078: 562-573
  • [4] Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering
    Nie, Tiezheng
    Lee, Wang-chien
    Shen, Derong
    Yu, Ge
    Kou, Yue
    WEB-AGE INFORMATION MANAGEMENT, WAIM 2014, 2014, 8485: 138-149
  • [5] Boosting the Efficiency of Large-Scale Entity Resolution with Enhanced Meta-Blocking
    Papadakis, George
    Papastefanatos, George
    Palpanas, Themis
    Koubarakis, Manolis
    BIG DATA RESEARCH, 2016, 6: 43-63
  • [6] Active Learning for Large-Scale Entity Resolution
    Qian, Kun
    Popa, Lucian
    Sen, Prithviraj
    CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2017: 1379-1388
  • [7] Flexpath: Type-Based Publish/Subscribe System for Large-scale Science Analytics
    Dayal, Jai
    Bratcher, Drew
    Eisenhauer, Greg
    Schwan, Karsten
    Wolf, Matthew
    Zhang, Xuechen
    Abbasi, Hasan
    Klasky, Scott
    Podhorszki, Norbert
    2014 14TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID), 2014: 246-255
  • [8] Incremental Blocking for Entity Resolution over Web Streaming Data
    Araujo, Tiago Brasileiro
    Stefanidis, Kostas
    Santos Pires, Carlos Eduardo
    Nummenmaa, Jyrki
    da Nobrega, Thiago Pereira
    2019 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE (WI 2019), 2019: 332-336
  • [9] Parallel Meta-blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data
    Efthymiou, Vasilis
    Papadakis, George
    Papastefanatos, George
    Stefanidis, Kostas
    Palpanas, Themis
    PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, 2015: 411-420
  • [10] Entity Relation Mining in Large-Scale Data
    Li, Jingnan
    Cai, Yi
    Wang, Qixuan
    Hu, Shuyue
    Wang, Tao
    Min, Huaqing
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2015, 2015, 9052: 109-121