Comparing text corpora via topic modelling

Cited by: 1
Authors
Krasnov, Fedor [1]
Shvartsman, Mikhail [2]
Dimentov, Alexander [2]
Affiliations
[1] Gazpromneft Sci & Technol Ctr, 75-79 Liter D Moika River Emb, St Petersburg 190000, Russia
[2] Russian State Lib, Natl Elect Informat Consortium, 4-5 Letnikovskaia St, Moscow 115114, Russia
Keywords
topic modelling; text classification; ARTM; additive regularisation of topic models; PLSA; random forest; comparing text collections;
DOI
10.1504/IJDMMM.2022.10050161
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A method is developed for the comparative analysis of the content of full-text patent collections. Named T4C, the approach is based on topic modelling and machine learning and extends comparative text mining. T4C was inspired by the possibility of extracting precise topics from a joint collection of texts and then analysing the parts of the collection in terms of those topics. Different aspects of the meta-information of the full-text patent collection are considered. Whether a patent belongs to a particular country can be identified with an accuracy of 97.5% by using supervised machine learning. By studying how patents vary with time, those belonging to a specific period can be identified with an accuracy of 85% for a given country. Also developed is a visual representation of the thematic correlation between groups of patents. In terms of the text composition of patent descriptions, Chinese patents differ fundamentally from US patents. The T4C method is valid for structured medium-sized collections of texts in English. The experimental results are used to manage the patenting process at GazpromNeft STC.
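The core idea in the abstract (fit topics on a joint collection, then compare parts of the collection by their topic profiles) can be illustrated with a minimal sketch. It is not the authors' T4C implementation, which this record does not give. It assumes plain PLSA (one of the listed keywords) fitted by EM, a hypothetical four-document toy corpus, hypothetical group labels "A" and "B", and hypothetical function names `plsa` and `group_topic_profile`. The supervised classification step (random forest) is omitted.

```python
import random
from collections import Counter


def normalise(xs, floor=1e-12):
    """Scale values to sum to 1; the tiny floor only guards against division by zero."""
    xs = [x + floor for x in xs]
    total = sum(xs)
    return [x / total for x in xs]


def plsa(docs, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on tokenised docs. Returns (theta, phi, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    # theta[d][t] = p(topic t | doc d); phi[t][w] = p(word w | topic t)
    theta = [normalise([rng.random() for _ in range(n_topics)]) for _ in docs]
    phi = [dict(zip(vocab, normalise([rng.random() for _ in vocab])))
           for _ in range(n_topics)]
    for _ in range(n_iter):
        new_theta = [[0.0] * n_topics for _ in docs]
        new_phi = [dict.fromkeys(vocab, 0.0) for _ in range(n_topics)]
        for d, cnt in enumerate(counts):
            for w, n in cnt.items():
                # E-step: p(t | d, w) is proportional to theta[d][t] * phi[t][w]
                post = normalise([theta[d][t] * phi[t][w] for t in range(n_topics)])
                for t in range(n_topics):
                    new_theta[d][t] += n * post[t]
                    new_phi[t][w] += n * post[t]
        # M-step: renormalise the expected counts
        theta = [normalise(row) for row in new_theta]
        phi = [dict(zip(vocab, normalise([row[w] for w in vocab]))) for row in new_phi]
    return theta, phi, vocab


def group_topic_profile(theta, labels):
    """Average the document-topic rows within each group (e.g. country or period)."""
    sums, sizes = {}, Counter(labels)
    for row, g in zip(theta, labels):
        acc = sums.setdefault(g, [0.0] * len(row))
        for t, v in enumerate(row):
            acc[t] += v
    return {g: [v / sizes[g] for v in acc] for g, acc in sums.items()}


# Hypothetical toy corpus: two groups of documents modelled jointly.
docs = [
    "drill well pump pressure well".split(),
    "pump pressure valve drill".split(),
    "signal antenna receiver signal".split(),
    "antenna receiver signal frequency".split(),
]
labels = ["A", "A", "B", "B"]

theta, phi, vocab = plsa(docs, n_topics=2)
profiles = group_topic_profile(theta, labels)
for group, profile in sorted(profiles.items()):
    print(group, [round(v, 3) for v in profile])
```

The per-group profiles are the kind of feature a classifier could be trained on, or that could be correlated between groups. How T4C actually builds its features and visualisations is described in the paper itself, not in this record.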
Pages: 203-216
Number of pages: 15