OpenK: An Elastic Data Cleansing System with A Clustering-based Data Anomaly Detection Approach

Authors
Tran Khanh Dang [1]
Dinh Khuong Nguyen [1]
Luc Minh Tuan [2]
Affiliations
[1] Ho Chi Minh Univ Technol HCMUT, VNU HCM, Ho Chi Minh City, Vietnam
[2] Ton Duc Thang Univ, Ctr Appl Informat Technol, Ho Chi Minh City, Vietnam
Keywords
Data Cleansing; Levenshtein Distance; Jaro Distance Similarity; Fingerprints; Data Anomaly Detection
DOI
10.1109/ACOMP53746.2021.00023
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large amounts of data are generated every second over the Internet, and making proper decisions based on these data is therefore crucial. Even when a lot of data has been collected, extracting the useful knowledge it contains is a challenging task. The reasons for this difficulty include: 1) data are not normally clean, especially when they are obtained from different sources, and 2) data can be redundant or duplicated. A data cleaning process is therefore necessary to detect anomalies within the data and to identify inconsistencies or duplication at an early stage. In this study, we introduce OpenK, an efficient elastic data cleansing system based on clustering methods. Data are clustered based on similarity metrics generated by different techniques such as: nearest neighbour (e.g., Levenshtein, Damerau-Levenshtein, and Hamming distances), similarity measurement (e.g., Jaro and Jaro-Winkler Distance Similarity), and key collision (e.g., Fingerprints and N-gram fingerprints). Our prototype runs on the Windows operating system with an installed version of Azure Cosmos DB, and supports a user-friendly web-based interface and a wide array of input formats. Experimental results show that our tool outperforms existing software in terms of efficiency and practicality.
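
To illustrate two of the technique families named in the abstract, the following is a minimal Python sketch of a key-collision fingerprint and the Levenshtein edit distance. It is not OpenK's implementation, which this record does not describe. The function names (fingerprint, levenshtein, cluster_by_fingerprint) and the normalization steps (accent stripping, lowercasing, punctuation removal, token sorting) are assumptions chosen for illustration.

```python
import string
import unicodedata
from collections import defaultdict

# Illustrative only: these helpers are hypothetical and are not taken from OpenK.


def fingerprint(value: str) -> str:
    """Build a key so that superficially different spellings collide."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.strip().lower()
    # Replace ASCII punctuation with spaces so "City,Ho" still splits in two.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Deduplicate and sort tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_by_fingerprint(values):
    """Group raw values whose fingerprints are identical."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return dict(clusters)
```

In a sketch like this, values that share a fingerprint fall into one candidate cluster cheaply, while an edit distance such as Levenshtein can then be applied to catch near-duplicates (for example, single-character typos) that key collision alone misses.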
Pages: 120-127
Page count: 8