Text similarity detection method based on NLP

Cited by: 0
Authors
Dai X. [1, 2]
Liu S. [1 ]
Gong D. [1 ]
Affiliations
[1] School of Economics and Management, Beijing Jiaotong University, Beijing
[2] China InfoCom Media Group, Beijing
Source
Funding
National Natural Science Foundation of China
Keywords
Analytic hierarchy process; Feature word extraction; Pearson correlation coefficient; Text similarity; Word position weight
DOI
10.11959/j.issn.1000-436x.2021192
CLC number
Subject classification number
Abstract
Current text similarity detection methods ignore document structure information and lack semantic relevance. To solve these problems, a text-oriented similarity detection method was proposed. First, the analytic hierarchy process (AHP) was used to calculate word position weights for feature word extraction. Second, the Pearson correlation coefficient was used to measure the semantic correlation between words, and this correlation served as the weight in a generalized Dice coefficient for calculating similarity. Experimental results show that the proposed method can improve the precision of feature word extraction and the accuracy of similarity calculation. © 2021, Editorial Board of Journal on Communications. All rights reserved.
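The two standard measures named in the abstract can be illustrated with a minimal Python sketch. The `pearson` and `dice` functions follow the textbook definitions. The `soft_dice` function and its `vectors` argument are hypothetical: the abstract does not give the paper's generalized Dice formula, so this is only one plausible way to use a Pearson correlation between word vectors as a matching weight, not the authors' method.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # undefined for a constant sequence; treated as 0 here
    return cov / sqrt(vx * vy)


def dice(a, b):
    """Classic Dice coefficient of two sets: 2|A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def soft_dice(a, b, vectors):
    """Hypothetical weighted Dice (an assumption, not the paper's formula).

    Each word is credited with its best match in the other set. Identical
    words score 1; otherwise the score is the Pearson correlation of their
    vectors, clipped at 0. With no vectors this equals the classic Dice.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0

    def weight(w, v):
        if w == v:
            return 1.0
        if w not in vectors or v not in vectors:
            return 0.0
        return max(0.0, pearson(vectors[w], vectors[v]))

    def best(src, dst):
        return sum(max(weight(w, v) for v in dst) for w in src)

    return (best(a, b) + best(b, a)) / (len(a) + len(b))
```

With toy vectors in which "car" and "automobile" are perfectly correlated, `soft_dice({"car", "road"}, {"automobile", "road"}, vectors)` credits the synonym pair, whereas the classic `dice` only counts the shared word "road".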
Pages: 173-181
Page count: 8
Related papers
37 in total
  • [1] YANG Z X, CHEN Z F, ZHANG P, et al., An information intelligent search method for computer forensics based on text similarity, Proceedings of the 2020 4th International Conference on Cryptography, Security and Privacy, pp. 79-83, (2020)
  • [2] ALMEIDA C, SANTOS D., Text similarity using word embeddings to classify misinformation, pp. 63-68, (2003)
  • [3] SEKI K., Cross-lingual text similarity exploiting neural machine translation models, Journal of Information Science, 47, 3, pp. 404-418, (2021)
  • [4] LIANG H Z, LIN K B, ZHU S Z., Short text similarity hybrid algorithm for a Chinese medical intelligent question answering system, Technology-Inspired Smart Learning for Future Education, pp. 129-142, (2020)
  • [5] PRAKOSO D W, ABDI A, AMRIT C., Short text similarity measurement methods: a review, Soft Computing, 25, 6, pp. 4699-4723, (2021)
  • [6] IRVING R W, FRASER C B., Two algorithms for the longest common subsequence of three (or more) strings, Combinatorial Pattern Matching, pp. 214-229, (1992)
  • [7] DAMERAU F J., A technique for computer detection and correction of spelling errors, Communications of the ACM, 7, 3, pp. 171-176, (1964)
  • [8] JACCARD P., The distribution of the flora in the alpine zone, New Phytologist, 11, 2, pp. 37-50, (1912)
  • [9] DICE L., Measures of the amount of ecologic association between species, Ecology, 26, 3, pp. 297-302, (1945)
  • [10] DEZA M M, DEZA E., Encyclopedia of distances, (2009)