NATURAL TEXT: MATHEMATICAL METHODS OF ATTRIBUTION

Cited by: 1
Authors
Popov, Vladimir V. [1, 2]
Shtelmakh, Tatyana V. [2]
Affiliations
[1] Volgograd State Univ, Sci Phys & Math, Prosp Univ Sky 100, Volgograd 400062, Russia
[2] Volgograd State Univ, Dept Comp Sci & Expt Math, Prosp Univ Sky 100, Volgograd 400062, Russia
Keywords
natural text; pseudo-text; text filtering; Zipf's law; n-grams; the rate of appearance of new words; "bag of words" model of the text; graph model of the text
DOI
10.15688/jvolsu2.2019.2.13
Chinese Library Classification
H [Language, Writing]
Discipline classification code
05
Abstract
The article proposes two algorithms for filtering substandard texts. The first is based on the observation that the frequencies of n-grams in a good-quality text obey Zipf's law, whereas the law no longer holds once the words of the text are rearranged. Comparing the frequency characteristics of the source text with those of the text obtained by permuting its words allows conclusions to be drawn about the quality of the source text. The second algorithm calculates and compares the rate at which new words appear in good-quality and in randomly generated texts. In a good text this rate is, as a rule, uneven, whereas in randomly generated texts the unevenness is smoothed out, which makes it possible to detect low-quality texts. Both methods are statistical and rely on calculating various frequency characteristics of the text. Unlike the "bag of words" model, a graph model of the text (in which the vertices are words or word forms and the edges are pairs of words) and models with higher-order structures (which use the frequency characteristics of n-grams with n > 2) take into account the mutual arrangement of pairs and triples of words within a common part of the text, for example within one sentence or one n-gram.
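The abstract gives no implementation details, so the following Python sketch is only one possible reading of the two ideas, not the authors' algorithms. It assumes that agreement with Zipf's law can be summarized by the least-squares slope of log frequency against log rank, and that the unevenness of the new-word rate can be summarized by the mean absolute change between consecutive fixed-size windows. All function names, the window size, and both summary statistics are hypothetical choices made for illustration.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def ngram_counts(tokens, n=2):
    """Count the contiguous n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: a slope near -1 is read as rough agreement with Zipf's law.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        return 0.0
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def shuffle_contrast(tokens, n=2, seed=0):
    """Return the n-gram Zipf slopes of the text and of a word-shuffled copy."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    original_slope = zipf_slope(ngram_counts(tokens, n))
    shuffled_slope = zipf_slope(ngram_counts(shuffled, n))
    return original_slope, shuffled_slope


def new_word_curve(tokens, window=50):
    """Number of previously unseen distinct words in each consecutive window."""
    seen = set()
    curve = []
    for start in range(0, len(tokens), window):
        new_words = {w for w in tokens[start:start + window] if w not in seen}
        curve.append(len(new_words))
        seen.update(new_words)
    return curve


def roughness(curve):
    """Mean absolute change between neighbouring windows (hypothetical measure)."""
    if len(curve) < 2:
        return 0.0
    steps = [abs(b - a) for a, b in zip(curve, curve[1:])]
    return sum(steps) / len(steps)
```

Under these assumptions, one might call `shuffle_contrast(tokenize(text))` and treat a large gap between the two slopes as a sign of natural word order, and compare `roughness(new_word_curve(tokens))` for the source text against the same value for a shuffled or randomly generated text. Any decision threshold would have to be calibrated on real data; none is given in the abstract.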
Pages: 147-158
Number of pages: 12