Near Duplicate Detection in Relational Databases

Authors
Bayrak, Ahmet Tugrul [1]
Yilmaz, Aykut Inan [1]
Yilmaz, Kemal Burak [1]
Duzagac, Remzi [1]
Yildiz, Olcay Taner [2]
Affiliations
[1] ETSTUR, Veri Bilimi & Analit Bolumu, Istanbul, Turkey
[2] Isik Univ, Bilgisayar Muhendisligi Bolumu, Sile Istanbul, Turkey
Keywords
Machine Learning; Similarity Functions; Duplicate Record Detection
DOI
Not available
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Subject classification codes
0808; 0809
Abstract
As data volumes grow, the number of duplicate records in relational databases gradually increases. Duplicate records can cause inconsistencies in reports and analyses. To reduce the effects of this problem, we aim to detect duplicate records using machine learning algorithms with features derived from the similarity between records. We detected 28,412 duplicate records among 9,301,467 records. The detected duplicate rows were removed from the data source, making the data more consistent.
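The abstract does not say which similarity functions or which learning algorithm were used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes hypothetical records with `name` and `email` fields, a hand-labelled set of known duplicate pairs, character-level similarity from Python's standard `difflib`, and a one-threshold classifier (a decision stump) learned from the labelled pairs.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical rows; the field names and values are illustrative only.
records = [
    {"id": 1, "name": "Ahmet Yilmaz", "email": "ahmet.yilmaz@example.com"},
    {"id": 2, "name": "Ahmet Yilmz", "email": "ahmet.yilmaz@example.com"},
    {"id": 3, "name": "Zeynep Kaya", "email": "zkaya@example.org"},
    {"id": 4, "name": "Zeynep  Kaya", "email": "z.kaya@example.org"},
    {"id": 5, "name": "Murat Demir", "email": "murat@example.net"},
]
# Assumed training labels: pairs a human has marked as duplicates.
known_duplicates = {(1, 2), (3, 4)}

FIELDS = ("name", "email")


def pair_features(r1, r2):
    """One similarity value in [0, 1] per compared field."""
    return [
        SequenceMatcher(None, r1[f].lower(), r2[f].lower()).ratio()
        for f in FIELDS
    ]


def pair_score(r1, r2):
    """Collapse the feature vector into a single score (plain mean)."""
    feats = pair_features(r1, r2)
    return sum(feats) / len(feats)


def fit_threshold(scores, labels):
    """Decision stump: pick the cut-off with the best training accuracy.

    Candidates are midpoints between consecutive distinct scores, so at
    least two distinct scores are required.
    """
    ordered = sorted(set(scores))
    candidates = [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]

    def accuracy(t):
        hits = sum((s >= t) == y for s, y in zip(scores, labels))
        return hits / len(scores)

    return max(candidates, key=accuracy)


# Build every record pair, its score, and its training label.
pairs = list(combinations(records, 2))
scores = [pair_score(a, b) for a, b in pairs]
labels = [(a["id"], b["id"]) in known_duplicates for a, b in pairs]

threshold = fit_threshold(scores, labels)


def is_duplicate(r1, r2):
    """Classify a pair with the learned threshold."""
    return pair_score(r1, r2) >= threshold


detected = {(a["id"], b["id"]) for a, b in pairs if is_duplicate(a, b)}
print(detected)
```

Comparing every pair is quadratic in the number of records, so a table of the size reported in the abstract would need some way of limiting the candidate pairs first (for example, only comparing rows that share a key prefix); the sketch leaves that step out.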
Pages: 4