A data-driven text similarity measure based on classification algorithms

Cited by: 0
Authors
Institutions
[1] Cho, Su Gon
[2] Kim, Seoung Bum
Source
Kim, Seoung Bum (sbkim1@korea.ac.kr) | Year 1600 / Vol. University of Cincinnati / No. 24
Funding
National Research Foundation, Singapore;
Keywords
Application problems - Classification accuracy - Classification algorithm - Comparative experiments - Machine learning repository - Similarity measure - Text similarity - University of California;
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
Measuring text similarity is fundamental to many text mining applications. This paper proposes a new method, based on classification algorithms, for measuring the similarity between two texts. Specifically, a sentence-term matrix describing the frequency of terms that occur in a collection of sentences is constructed and used to measure the classification accuracy between two texts. The idea rests on the observation that similar texts are difficult to distinguish from each other, which should lead to a low classification accuracy between them. In comparative experiments against several widely used text similarity measures, using real data from the Machine Learning Repository at the University of California, Irvine, the proposed method outperformed the existing similarity measures across the entire range of term selection filters. © International Journal of Industrial Engineering.
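The idea in the abstract can be illustrated with a minimal sketch. The abstract does not name the classifier, the tokenizer, or the validation scheme, so everything below is an assumption made for illustration: sentences are split on punctuation, each sentence becomes one sparse row of a sentence-term count matrix, and a leave-one-out 1-nearest-neighbour classifier with cosine similarity tries to tell which text each sentence came from. The function names are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter
from math import sqrt


def split_sentences(text):
    """Split on sentence-ending punctuation (a deliberately naive rule)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def term_counts(sentence):
    """One row of the sentence-term matrix, stored sparsely as term -> count."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-count rows."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def classification_accuracy(text_a, text_b):
    """Leave-one-out 1-NN accuracy at telling sentences of A from sentences of B.

    Assumption for this sketch: a lower accuracy is read as higher similarity.
    """
    rows = [(term_counts(s), 0) for s in split_sentences(text_a)]
    rows += [(term_counts(s), 1) for s in split_sentences(text_b)]
    if len(rows) < 2:
        raise ValueError("need at least two sentences in total")
    correct = 0
    for i, (vec, label) in enumerate(rows):
        # Hold out sentence i and predict its source from the nearest other sentence.
        others = [r for j, r in enumerate(rows) if j != i]
        _, predicted = max(others, key=lambda r: cosine(vec, r[0]))
        correct += predicted == label
    return correct / len(rows)
```

Under these assumptions, two texts with disjoint vocabularies are separated perfectly, while a text compared with itself is classified below chance, because each held-out sentence finds its exact copy in the other text. How the accuracy is mapped to a similarity score, and which term selection filters are applied, is left open here since the abstract does not specify it.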
Related papers
50 in total
  • [31] Data-driven methods for equity similarity prediction
    Yaros, John Robert
    Imielinski, Tomasz
    QUANTITATIVE FINANCE, 2015, 15 (10) : 1657 - 1681
  • [32] Analysis of web data classification methods based on semantic similarity measure
    Ramesh, Kante
    Mohanasundaram, R.
    INFORMATION SECURITY JOURNAL, 2023, 32 (05) : 315 - 330
  • [33] A Text Similarity Measure Based on Suffix Tree
    Huang, Chenghui
    Liu, Yan
    Xia, Shengzhong
    Yin, Jian
    INFORMATION-AN INTERNATIONAL INTERDISCIPLINARY JOURNAL, 2011, 14 (02) : 583 - 592
  • [34] Data-Driven Regular Expressions Evolution for Medical Text Classification Using Genetic Programming
    Liu, Jiandong
    Bai, Ruibin
    Lu, Zheng
    Ge, Peiming
    Aickelin, Uwe
    Liu, Daoyun
    2020 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2020,
  • [35] A Data-driven Crowd Simulation Model based on Clustering and Classification
    Zhao, Mingbi
    Turner, Stephen John
    Cai, Wentong
    17TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON DISTRIBUTED SIMULATION AND REAL TIME APPLICATIONS (DS-RT 2013), 2013, : 125 - 134
  • [36] A logical approach to data-driven classification
    Osswald, R
    Petersen, W
    KI 2003: ADVANCES IN ARTIFICIAL INTELLIGENCE, 2003, 2821 : 267 - 281
  • [37] Data-Driven Classification of Screwdriving Operations
    Aronson, Reuben M.
    Bhatia, Ankit
    Jia, Zhenzhong
    Guillame-Bert, Mathieu
    Bourne, David
    Dubrawski, Artur
    Mason, Matthew T.
    2016 INTERNATIONAL SYMPOSIUM ON EXPERIMENTAL ROBOTICS, 2017, 1 : 244 - 253
  • [38] Data-driven signal detection and classification
    Sayeed, AM
    1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS - VOL V: STATISTICAL SIGNAL AND ARRAY PROCESSING, APPLICATIONS, 1997, : 3697 - 3700
  • [39] A Data-Driven Measure of Effective Connectivity Based on Renyi's α-Entropy
    Panche, Ivan De La Pava
    Alvarez-Meza, Andres M.
    Orozco-Gutierrez, Alvaro
    FRONTIERS IN NEUROSCIENCE, 2019, 13
  • [40] A new image distortion measure based on a data-driven multisensor organization
    Martinez-Baena, J
    Fdez-Valdivia, J
    Garcia, JA
    Fdez-Vidal, XR
    PATTERN RECOGNITION, 1998, 31 (08) : 1099 - 1116