Near Duplicate Detection in Relational Databases

Cited by: 0
Authors
Bayrak, Ahmet Tugrul [1]
Yilmaz, Aykut Inan [1]
Yilmaz, Kemal Burak [1]
Duzagac, Remzi [1]
Yildiz, Olcay Taner [2]
Affiliations
[1] ETSTUR, Veri Bilimi & Analit Bolumu, Istanbul, Turkey
[2] Isik Univ, Bilgisayar Muhendisligi Bolumu, Sile Istanbul, Turkey
Keywords
Machine Learning; Similarity Functions; Duplicate Record Detection
DOI
Not available
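Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline classification code
0808; 0809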
Abstract
As data volume grows, the number of duplicate records in relational databases gradually increases. Duplicate records can cause inconsistencies in reports and analyses. To reduce the effects of this problem, we aim to detect duplicate records using machine learning algorithms with features derived from the similarity between records. We detected 28,412 duplicate records among 9,301,467 records. The detected duplicate rows were removed from the data source, making the data more consistent.
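The abstract does not say which similarity functions or which learning algorithm were used, so the following is only a minimal sketch of the general idea: compute one similarity score per field for a pair of records, then decide from those features whether the pair is a duplicate. The record fields, the sample rows, the function names, and the 0.85 cut-off are all hypothetical, and the mean-score threshold is a plain stand-in for a trained classifier, not the authors' method.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity_features(a, b, fields):
    """Return one similarity score in [0, 1] per field for a record pair."""
    features = []
    for field in fields:
        # Normalise case and surrounding whitespace before comparing.
        left = str(a.get(field, "")).strip().lower()
        right = str(b.get(field, "")).strip().lower()
        features.append(SequenceMatcher(None, left, right).ratio())
    return features


def candidate_pairs(records, fields, threshold=0.85):
    """Yield (i, j, features) for pairs whose mean similarity meets the threshold.

    Assumption: the threshold rule stands in for a trained classifier
    that would normally consume the feature vector.
    """
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        feats = similarity_features(a, b, fields)
        if sum(feats) / len(feats) >= threshold:
            yield i, j, feats


# Hypothetical sample rows; the second is a misspelled copy of the first.
records = [
    {"name": "Grand Hotel Istanbul", "city": "Istanbul"},
    {"name": "Grand Hotel Istambul", "city": "istanbul"},
    {"name": "Blue Bay Resort", "city": "Antalya"},
]
duplicates = list(candidate_pairs(records, ["name", "city"]))
for i, j, feats in duplicates:
    print(i, j, feats)
```

Comparing every pair is quadratic in the number of records, so at the scale reported in the abstract some form of blocking or indexing would be needed to limit which pairs are compared.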
Pages: 4