Comparing text corpora via topic modelling

Cited by: 1

Authors:

Krasnov, Fedor [1]

Shvartsman, Mikhail [2]

Dimentov, Alexander [2]

Affiliations:

[1] Gazpromneft Sci & Technol Ctr, 75-79 Liter D Moika River Emb, St Petersburg 190000, Russia

[2] Russian State Lib, Natl Elect Informat Consortium, 4-5 Letnikovskaia St, Moscow 115114, Russia

Source:

INTERNATIONAL JOURNAL OF DATA MINING MODELLING AND MANAGEMENT | 2022 / Vol. 14 / Issue 03

Keywords:

topic modelling; text classification; ARTM; additive regularisation of topic models; PLSA; random forest; comparing text collections;

DOI:

10.1504/IJDMMM.2022.10050161

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

A method is developed for comparative analysis of the content of full-text patent collections. Named T4C, the approach is based on topic modelling and machine learning and extends comparative text mining. T4C was inspired by the possibility of extracting precise topics from a joint collection of texts and then analysing the parts of the collection in terms of those topics. Different aspects of the meta-information of the full-text patent collection are considered. Using supervised machine learning, the country to which a patent belongs can be identified with an accuracy of 97.5%. By studying how patents vary with time, those belonging to a specific period can be identified with an accuracy of 85% for a given country. A visual representation of the thematic correlation between groups of patents is also developed. In terms of the text composition of patent descriptions, Chinese patents differ fundamentally from US patents. The T4C method is valid for structured, medium-sized collections of texts in English. The experimental results are used to manage the patenting process at GazpromNeft STC.
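
The general idea in the abstract (fit one topic model on the joint collection, then analyse the sub-collections through their topic distributions with a supervised classifier) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' T4C implementation: the keywords name ARTM and PLSA, whereas this sketch swaps in scikit-learn's LatentDirichletAllocation as a stand-in topic model, and the toy documents, the group labels "A" and "B", and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy snippets standing in for two patent sub-collections.
docs = [
    "drilling fluid viscosity well bore pressure",
    "well bore casing cement drilling pressure",
    "drilling mud pump pressure well",
    "catalyst reactor hydrocarbon cracking temperature",
    "reactor catalyst bed temperature hydrocarbon",
    "hydrocarbon cracking catalyst zeolite reactor",
]
labels = np.array(["A", "A", "A", "B", "B", "B"])  # illustrative group labels

# Step 1: fit a single topic model on the joint collection.
# LDA is an assumption here; the paper's keywords refer to ARTM and PLSA.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # document-topic matrix, one row per document

# Step 2: train a random forest to predict the group from topic weights.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(theta, labels)
predicted = clf.predict(theta)

# Step 3: compare the groups through their mean topic profiles.
profile = {g: theta[labels == g].mean(axis=0) for g in np.unique(labels)}
for group, weights in profile.items():
    print(group, np.round(weights, 3))
```

On a real collection the classifier would be evaluated on held-out documents rather than on its own training data; the accuracies quoted in the abstract come from the paper, not from this sketch.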

Pages: 203-216

Page count: 15

50 items in total

[21] Joint dynamic topic model for recognition of lead-lag relationship in two text corpora
Yandi Zhu
Xiaoling Lu
Jingya Hong
Feifei Wang
Data Mining and Knowledge Discovery, 2022, 36 : 2272 - 2298
[22] Topic Discovery via Convex Polytopic Model: A Case Study with Small Corpora
Wu, King Keung
Meng, Helen
Yam, Yeung
2018 9TH IEEE INTERNATIONAL CONFERENCE ON COGNITIVE INFOCOMMUNICATIONS (COGINFOCOM), 2018, : 367 - 372
[23] On the Assessment of Text Corpora
Pinto, David
Rosso, Paolo
Jimenez-Salazar, Hector
NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, 2010, 5723 : 281 - +
[24] Inferring Concept Hierarchies from Text Corpora via Hyperbolic Embeddings
Le, Matt
Roller, Stephen
Papaxanthos, Laetitia
Kiela, Douwe
Nickel, Maximilian
57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 3231 - 3241
[25] Biomedical Text Categorization Based on Ensemble Pruning and Optimized Topic Modelling
Onan, Aytug
COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE, 2018, 2018
[26] Modelling human judgments of semantic similarities and topic assignments for text documents
Navarro, DJ
AUSTRALIAN JOURNAL OF PSYCHOLOGY, 2004, 56 : 211 - 211
[27] Clustering Prominent Named Entities in Topic-Specific Text Corpora Completed Research Full Papers
Alsudais, Abdulkareem
Tchalian, Hovig
25TH AMERICAS CONFERENCE ON INFORMATION SYSTEMS (AMCIS 2019), 2019,
[28] Maintaining Topic Models for Growing Corpora
Kuhr, Felix
Bender, Magnus
Braun, Tanya
Moeller, Ralf
2020 IEEE 14TH INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC 2020), 2020, : 451 - 458
[29] Sentiment Detection of Short Text via Probabilistic Topic Modeling
Wu, Zewei
Rao, Yanghui
Li, Xin
Li, Jun
Xie, Haoran
Wang, Fu Lee
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2015, 2015, 9052 : 76 - 85
[30] A survey on news text visualization via probabilistic topic modeling
Tang, Siliang
Cheng, Lu
Shao, Jian
Wu, Fei
Lu, Weiming
Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2015, 27 (05): : 771 - 782