NATURAL TEXT: MATHEMATICAL METHODS OF ATTRIBUTION

Cited by: 1
Authors
Popov, Vladimir V. [1, 2]
Shtelmakh, Tatyana V. [2]
Affiliations
[1] Volgograd State Univ, Sci Phys & Math, Prosp Univ Sky 100, Volgograd 400062, Russia
[2] Volgograd State Univ, Dept Comp Sci & Expt Math, Prosp Univ Sky 100, Volgograd 400062, Russia
Keywords
natural text; pseudo-text; text filtering; Zipf's law; n-grams; the rate of appearance of new words; "bag of words" model of the text; graph model of the text
DOI
10.15688/jvolsu2.2019.2.13
Chinese Library Classification
H [Language and Writing]
Discipline classification code
05
Abstract
The article proposes two algorithms for filtering substandard texts. The first relies on the observation that n-gram frequencies in a good-quality text obey Zipf's law, whereas the law no longer holds once the words of the text are rearranged. Comparing the frequency characteristics of the source text with those of the text obtained by permuting its words allows conclusions to be drawn about the quality of the source text. The second algorithm calculates and compares the rate at which new words appear in good-quality and randomly generated texts. In a good text this rate is, as a rule, uneven, whereas in randomly generated texts the unevenness is smoothed out, which makes it possible to detect low-quality texts. Both methods are statistical and are based on calculating various frequency characteristics of the text. Unlike the "bag of words" model, a graph model of the text (in which the vertices are words or word forms and the edges are pairs of words) and models with higher-order structures (which use the frequency characteristics of n-grams with n > 2) take into account the mutual arrangement of pairs and triples of words within a common part of the text, for example one sentence or one n-gram.
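The two statistics described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`rank_frequency`, `zipf_slope`, `shuffled_copy`, `new_words_per_window`), the whitespace-token input, the least-squares slope on a log-log scale, and the fixed window size are all assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def rank_frequency(words, n=2):
    """Return n-gram counts sorted in descending order (rank 1 first)."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return sorted(counts.values(), reverse=True)


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is the classic Zipf pattern; this fitting choice is
    an assumption of the sketch, not something stated in the abstract.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def shuffled_copy(words, seed=0):
    """Return the same tokens in a random order (the permuted text)."""
    permuted = list(words)
    random.Random(seed).shuffle(permuted)
    return permuted


def new_words_per_window(words, window):
    """Count first-time words in each consecutive full window of tokens."""
    seen = set()
    rates = []
    for start in range(0, len(words) - window + 1, window):
        before = len(seen)
        seen.update(words[start:start + window])
        rates.append(len(seen) - before)
    return rates
```

Under these assumptions, one would compare `zipf_slope(rank_frequency(words))` with the same value computed on `shuffled_copy(words)`, and inspect how much `new_words_per_window(words, window)` varies from window to window. Any thresholds for calling a text substandard would have to come from the article itself.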
Pages: 147-158
Page count: 12