OpenK: An Elastic Data Cleansing System with A Clustering-based Data Anomaly Detection Approach

Authors
Tran Khanh Dang [1 ]
Dinh Khuong Nguyen [1 ]
Luc Minh Tuan [2 ]
Affiliations
[1] Ho Chi Minh Univ Technol HCMUT, VNU HCM, Ho Chi Minh City, Vietnam
[2] Ton Duc Thang Univ, Ctr Appl Informat Technol, Ho Chi Minh City, Vietnam
Keywords
Data Cleansing; Levenshtein Distance; Jaro Distance Similarity; Fingerprints; Data Anomaly Detection
DOI
10.1109/ACOMP53746.2021.00023
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large amounts of data are generated every second over the Internet, so making sound decisions from them is crucial. Even when a lot of data has been collected, extracting the useful knowledge inside it is a challenging task. The reasons for this difficulty include: 1) data are not normally clean, especially when they are obtained from different sources, and 2) data can be redundant or duplicated. Therefore, a data cleaning process is needed to detect anomalies within the data and to identify inconsistencies or duplication at an early stage. In this study, we introduce OpenK, an efficient elastic data cleansing system based on clustering methods. Data are clustered based on similarity metrics generated by different techniques: nearest neighbour (e.g., Levenshtein, Damerau-Levenshtein, and Hamming distances), similarity measurement (e.g., Jaro and Jaro-Winkler distance similarity), and key collision (e.g., fingerprints and N-gram fingerprints). Our prototype runs on the Windows operating system with Azure Cosmos DB installed, and it supports a friendly web-based interface and a wide array of input formats. Experimental results show that our tool outperforms existing software in terms of efficiency and practicality.
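The abstract names the technique families but gives no implementation details. As an illustration only, and not the authors' implementation, the following Python sketch shows one common way key-collision (fingerprint) clustering and a nearest-neighbour Levenshtein distance can be written. The function names `fingerprint`, `levenshtein`, and `cluster_by_fingerprint` are hypothetical, and the normalisation steps (accent stripping, lowercasing, punctuation removal, token sorting) are assumptions about what a fingerprint key typically involves.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Hypothetical key-collision fingerprint: normalise, tokenise, dedupe, sort."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and remove ASCII punctuation.
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Unique whitespace-separated tokens in sorted order form the key.
    return " ".join(sorted(set(text.split())))


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_by_fingerprint(values):
    """Group values whose fingerprints collide; keep groups with 2+ members."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return [group for group in clusters.values() if len(group) > 1]
```

In this sketch, values that differ only in case, punctuation, accents, or word order fall into the same cluster, while the edit distance can be used to compare values whose fingerprints do not collide exactly.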
Pages: 120-127
Page count: 8