OpenK: An Elastic Data Cleansing System with A Clustering-based Data Anomaly Detection Approach

Cited by: 0

Authors:

Tran Khanh Dang [1]

Dinh Khuong Nguyen [1]

Luc Minh Tuan [2]

Affiliations:

[1] Ho Chi Minh Univ Technol HCMUT, VNU HCM, Ho Chi Minh City, Vietnam

[2] Ton Duc Thang Univ, Ctr Appl Informat Technol, Ho Chi Minh City, Vietnam

Source:

2021 15TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND APPLICATIONS (ACOMP 2021) | 2021

Keywords:

Data Cleansing; Levenshtein Distance; Jaro Distance Similarity; Fingerprints; Data Anomaly Detection;

DOI:

10.1109/ACOMP53746.2021.00023

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Large amounts of data are generated every second over the Internet, which makes sound decision-making crucial. Even when a lot of data has been collected, extracting the useful knowledge it contains is a challenging task. The reasons for this difficulty include: 1) data are not normally clean, especially when they are obtained from different sources, and 2) data can be redundant or duplicated. A data cleaning process is therefore necessary to detect anomalies within the data and to identify inconsistencies or duplication at an early stage. In this study, we introduce OpenK, an efficient elastic data cleansing system based on clustering methods. Data are clustered based on similarity metrics generated by different techniques: nearest neighbour (e.g., Levenshtein, Damerau-Levenshtein, and Hamming distances), similarity measurement (e.g., Jaro and Jaro-Winkler distance similarity), and key collision (e.g., fingerprints and N-gram fingerprints). Our prototype runs on the Windows operating system with Azure Cosmos DB installed, and it supports a user-friendly web-based interface and a wide array of input formats. Experimental results show that our tool outperforms existing software in terms of efficiency and practicality.
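As an illustration of two of the technique families named in the abstract, the sketch below shows a key-collision fingerprint and a Levenshtein distance in Python. It is not taken from the paper: the function names (`fingerprint`, `levenshtein`, `cluster_by_key`) and the exact normalisation steps are assumptions chosen for illustration, and OpenK's actual implementation may differ.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a key-collision key (assumed normalisation, not OpenK's own):
    trim, lowercase, strip accents and ASCII punctuation, then sort the
    unique whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_by_key(values):
    """Group values whose fingerprints collide; keep groups of 2 or more."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return [group for group in clusters.values() if len(group) > 1]
```

For example, `cluster_by_key(["Ho Chi Minh City", "ho chi minh city.", "Hanoi"])` groups the first two values as likely duplicates, while `levenshtein` can be used as a nearest-neighbour metric for values whose keys do not collide exactly.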


Pages: 120-127

Number of pages: 8
