Near Duplicate Detection in Relational Databases

Cited by: 0
Authors
Bayrak, Ahmet Tugrul [1]
Yilmaz, Aykut Inan [1]
Yilmaz, Kemal Burak [1]
Duzagac, Remzi [1]
Yildiz, Olcay Taner [2]
Affiliations
[1] ETSTUR, Veri Bilimi & Analit Bolumu, Istanbul, Turkey
[2] Isik Univ, Bilgisayar Muhendisligi Bolumu, Sile Istanbul, Turkey
Keywords
Machine Learning; Similarity Functions; Duplicate Record Detection
DOI
Not available
Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline classification code
0808; 0809
Abstract
As data volume grows, the number of duplicate records in relational databases gradually increases. Duplicate records can cause inconsistencies in reports and analyses. To reduce the effects of this problem, we aim to detect duplicate records using machine learning algorithms with features derived from the similarity between records. We detected 28,412 duplicate records among 9,301,467 records. The detected duplicate rows were removed from the data source, making the data more consistent.
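The abstract does not say which similarity functions, columns, or learning algorithm were used, so the following is only a minimal sketch of the general idea: compute one similarity score per field for a pair of records, then train a classifier on labeled pairs. Everything here is an assumption for illustration: the field names (`name`, `email`, `city`), the toy records, the use of `difflib.SequenceMatcher` as the similarity function, and the small hand-written logistic regression standing in for whatever model the authors trained.

```python
import math
from difflib import SequenceMatcher

# Hypothetical columns; the abstract does not name the actual fields.
FIELDS = ("name", "email", "city")


def similarity_features(rec_a, rec_b):
    """Return one string-similarity score in [0, 1] per field for a record pair."""
    return [
        SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio()
        for f in FIELDS
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=2000, lr=0.5):
    """Fit logistic regression weights with plain batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def is_duplicate(model, rec_a, rec_b):
    """Classify a record pair as duplicate when the predicted probability is >= 0.5."""
    w, b = model
    x = similarity_features(rec_a, rec_b)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5


# Toy records invented for this sketch (not data from the paper).
ahmet = {"name": "Ahmet Yilmaz", "email": "ahmet.yilmaz@example.com", "city": "Istanbul"}
ahmet_typo = {"name": "Ahmet Yilmz", "email": "ahmet.yilmaz@example.com", "city": "istanbul"}
kemal = {"name": "Kemal Demir", "email": "kemal.demir@example.com", "city": "Ankara"}
kemal_short = {"name": "K. Demir", "email": "kemal.demir@example.com", "city": "Ankara"}
zeynep = {"name": "Zeynep Kaya", "email": "zeynep.kaya@example.org", "city": "Izmir"}

# Labeled pairs: 1 = duplicate, 0 = distinct.
labeled_pairs = [
    (ahmet, ahmet_typo, 1),
    (kemal, kemal_short, 1),
    (ahmet, zeynep, 0),
    (kemal, zeynep, 0),
]
X = [similarity_features(a, b) for a, b, _ in labeled_pairs]
y = [label for _, _, label in labeled_pairs]
model = train_logistic(X, y)

# Score an unseen pair: the same person entered with different spacing and casing.
zeynep_variant = {"name": "Zeynep  Kaya", "email": "zeynep.kaya@example.org", "city": "IZMIR"}
print(is_duplicate(model, zeynep, zeynep_variant))
```

On a table with millions of rows, comparing every pair is infeasible, so a real pipeline would first narrow the candidate pairs (for example by grouping on a shared key) before scoring them; the abstract does not state how the authors handled this.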
Pages: 4