A Type-Based Blocking Technique for Efficient Entity Resolution over Large-Scale Data

Cited by: 1
Authors
Zhu, Hui-Juan [1]
Zhu, Zheng-Wei [1]
Jiang, Tong-Hai [2, 3]
Cheng, Li [2, 3]
Shi, Wei-Lei [2, 3]
Zhou, Xi [2, 3]
Zhao, Fan [2, 3]
Ma, Bo [2, 3]
Affiliations
[1] Changzhou Univ, Sch Informat Sci & Engn, Changzhou 213164, Peoples R China
[2] Chinese Acad Sci, Xinjiang Tech Inst Phys & Chem, Urumqi 830011, Peoples R China
[3] Xinjiang Lab Minor Speech & Language Informat Pro, Urumqi 830011, Peoples R China
DOI
10.1155/2018/2094696
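Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract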
In data integration, entity resolution (ER) is an important technique for improving data quality. Existing research typically assumes that the target dataset contains only string-type data and uses a single similarity metric. For larger, high-dimensional datasets, redundant information needs to be verified using traditional blocking or windowing techniques. In this work, we propose a novel ER method that uses a hybrid approach, combining type-based multiblocks, a varying window size, and more flexible similarity metrics. In our new ER workflow, we reduce the search space for entity pairs by applying constraints on redundant attributes and matching likelihood. We develop a reference implementation of the proposed approach and validate its performance using a real-life dataset from an Internet of Things project. We evaluate the data processing system using five standard metrics: effectiveness, efficiency, accuracy, recall, and precision. Experimental results indicate that the proposed approach could be a promising alternative for entity resolution and could feasibly be applied to real-world data cleaning for large datasets.
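The abstract names the ingredients (type-based blocks, a comparison window, type-aware similarity) without giving details, so the following is only a minimal sketch of how such ingredients can fit together, not the authors' implementation. Every name in it (`block_key`, `candidate_pairs`, `resolve`), the blocking keys, the fixed window, the averaging rule, and the threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def is_number(value):
    """True for int/float values (illustrative notion of a numeric type)."""
    return isinstance(value, (int, float))


def block_key(value):
    """Assumed type-dependent blocking key: numeric bucket or string prefix."""
    if is_number(value):
        return ("num", round(value, -1))  # buckets of width 10
    return ("str", str(value).strip().lower()[:3])


def similarity(a, b):
    """Assumed type-dependent similarity in [0, 1]."""
    if is_number(a) and is_number(b):
        scale = max(abs(a), abs(b))
        return 1.0 if scale == 0 else max(0.0, 1.0 - abs(a - b) / scale)
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def candidate_pairs(records, attributes, window=3):
    """Build one set of blocks per attribute, then pair records that fall
    within `window` positions of each other inside a sorted block."""
    pairs = set()
    for attr in attributes:
        blocks = {}
        for rid, rec in records.items():
            blocks.setdefault(block_key(rec[attr]), []).append(rid)
        for ids in blocks.values():
            # All values in one block share a kind, so the sort key is homogeneous.
            ids.sort(key=lambda rid: records[rid][attr]
                     if is_number(records[rid][attr])
                     else str(records[rid][attr]).lower())
            for i, left in enumerate(ids):
                for right in ids[i + 1:i + window]:
                    pairs.add((min(left, right), max(left, right)))
    return pairs


def resolve(records, attributes, window=3, threshold=0.8):
    """Keep candidate pairs whose mean attribute similarity meets the threshold."""
    matches = []
    for left, right in sorted(candidate_pairs(records, attributes, window)):
        score = sum(similarity(records[left][a], records[right][a])
                    for a in attributes) / len(attributes)
        if score >= threshold:
            matches.append((left, right))
    return matches


# Hypothetical sensor records, invented for this example.
records = {
    1: {"name": "Sensor Alpha", "reading": 100.0},
    2: {"name": "sensor alpha", "reading": 101.0},
    3: {"name": "Gateway Beta", "reading": 250.0},
}
matches = resolve(records, ["name", "reading"])
```

Because pairs are generated only inside blocks and only within the window, record 3 is never compared with records 1 or 2, which is the kind of search-space reduction the abstract refers to. A varying window size, as mentioned in the abstract, would replace the fixed `window` argument with a rule that adapts per block; that rule is not specified here.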
Pages: 12