OpenK: An Elastic Data Cleansing System with A Clustering-based Data Anomaly Detection Approach

Authors
Tran Khanh Dang [1 ]
Dinh Khuong Nguyen [1 ]
Luc Minh Tuan [2 ]
Affiliations
[1] Ho Chi Minh Univ Technol HCMUT, VNU HCM, Ho Chi Minh City, Vietnam
[2] Ton Duc Thang Univ, Ctr Appl Informat Technol, Ho Chi Minh City, Vietnam
Keywords
Data Cleansing; Levenshtein Distance; Jaro Distance Similarity; Fingerprints; Data Anomaly Detection;
DOI
10.1109/ACOMP53746.2021.00023
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large amounts of data are generated every second over the Internet, and thus making a proper decision is crucial. Even if we have collected a lot of data, extracting the beneficial knowledge inside the data is a challenging task. The reasons for this difficulty include: 1) data are not normally clean, especially when they are obtained from different sources, and 2) data can be redundant or duplicated. Therefore, it is necessary to have a data cleaning process to detect anomalies within the data. Additionally, the process can identify inconsistencies or duplication at an early stage. In this study, we introduce OpenK, an efficient elastic data cleansing system based on clustering methods. Data are clustered based on similarity metrics generated by different techniques such as nearest neighbour (e.g., Levenshtein, Damerau-Levenshtein, and Hamming distances), similarity measurement (e.g., Jaro and Jaro-Winkler distance similarity), and key collision (e.g., fingerprints and N-gram fingerprints). Our prototype can run on the Windows operating system with an installed Azure Cosmos DB version to support a friendly web-based interface and a wide array of input formats. Experimental results show that our tool outperforms existing software in terms of efficiency and practical perspectives.
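As a rough illustration of two of the technique families named in the abstract, the sketch below shows a key-collision fingerprint and a Levenshtein edit distance in Python. It is not the authors' implementation: the function names `fingerprint`, `levenshtein`, and `cluster_by_fingerprint` are hypothetical, and the normalisation steps (accent stripping, lowercasing, punctuation removal, token sorting) are assumptions about how a fingerprint key is commonly built.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a key-collision key (assumed recipe, not taken from the paper)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, turn punctuation into spaces, and split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", stripped.lower()).split()
    # De-duplicate and sort tokens so word order does not matter.
    return " ".join(sorted(set(tokens)))


def levenshtein(a: str, b: str) -> int:
    """Edit distance with insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_by_fingerprint(values):
    """Group values whose fingerprints collide; keep groups of two or more."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]
```

Key collision is cheap because each value is hashed once into a group, whereas nearest-neighbour measures such as Levenshtein compare pairs of values and so cost more on large columns.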
Pages: 120-127
Page count: 8