OpenK: An Elastic Data Cleansing System with A Clustering-based Data Anomaly Detection Approach

Cited by: 0
Authors
Tran Khanh Dang [1]
Dinh Khuong Nguyen [1]
Luc Minh Tuan [2]
Affiliations
[1] Ho Chi Minh Univ Technol HCMUT, VNU HCM, Ho Chi Minh City, Vietnam
[2] Ton Duc Thang Univ, Ctr Appl Informat Technol, Ho Chi Minh City, Vietnam
Keywords
Data Cleansing; Levenshtein Distance; Jaro Distance Similarity; Fingerprints; Data Anomaly Detection
DOI
10.1109/ACOMP53746.2021.00023
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large amounts of data are generated every second over the Internet, which makes proper decision-making crucial. Even when a lot of data has been collected, extracting the beneficial knowledge inside the data is a challenging task. The reasons for this difficulty include: 1) data are usually not clean, especially when they are obtained from different sources, and 2) data can be redundant or duplicated. Therefore, a data cleaning process is needed to detect anomalies within the data and to identify inconsistencies or duplication at an early stage. In this study, we introduce OpenK, an efficient elastic data cleansing system based on clustering methods. Data are clustered based on similarity metrics generated by different techniques such as nearest neighbour (e.g., Levenshtein, Damerau-Levenshtein, and Hamming distances), similarity measurement (e.g., Jaro and Jaro-Winkler distance similarity), and key collision (e.g., fingerprints and N-gram fingerprints). Our prototype runs on the Windows operating system with an installed AzureCosmosDB version to support a friendly web-based interface and a wide array of input formats. Experimental results show that our tool outperforms existing software in terms of efficiency and practical perspectives.
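To make two of the technique families named in the abstract concrete, the sketch below shows a key-collision fingerprint and a Levenshtein distance in Python. It is an illustration only and not the authors' implementation: the function names `fingerprint`, `levenshtein`, and `cluster_by_fingerprint` are hypothetical, and the normalization steps (ASCII folding, lowercasing, punctuation removal, token sorting) are an assumption modeled on the commonly used fingerprint keying method.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Key-collision key: normalized, de-duplicated, sorted tokens."""
    # Fold accented characters to ASCII, then lowercase and trim.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii").lower().strip()
    # Drop punctuation, keeping only word characters and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cluster_by_fingerprint(values):
    """Group values whose fingerprints collide; keep groups of size > 1."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]
```

Under these assumptions, values that differ only in case, punctuation, or token order fall into the same cluster, while the edit distance can flag near-duplicates that the fingerprint misses, such as single-character typos.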
Pages: 120-127
Number of pages: 8