Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning

Cited by: 0

Authors:

Guo Chen

Jing Chen

Yu Shao

Lu Xiao

Affiliations:

[1] Nanjing University of Science and Technology, Department of Information Management

[2] Northwest Engineering Corporation Limited, Information Centre

[3] Nanjing University of Finance and Economics, School of Journalism

Source:

Scientometrics | 2023 / Vol. 128

Keywords:

Domain analysis; Bibliographic dataset; Noise reduction; PU-learning

DOI:

Not available

Chinese Library Classification (CLC) number:

Subject classification number:

Abstract:

Constructing a bibliographic dataset is fundamental to domain analysis in bibliometric research. However, irrelevant documents (so-called “impurities”) in the initial domain dataset are inevitable and difficult to identify, and eliminating them requires considerable human effort. To solve this problem, we propose a weakly supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm to clean the initial bibliographic dataset automatically. The basic idea is to use a batch of “absolutely positive sample sets” already available in the dataset to obtain a collection of “reliable negative sample sets,” from which a training set can be constructed for downstream supervised classification. We conducted a comparative experiment using reports in the Artificial Intelligence (AI) domain from the US National Technical Reports Library (NTIS) as an example, comparing schemes with different variables to explore how various technical aspects influence the final noise reduction performance. Our approach achieved significant improvements over the similarity-based unsupervised baseline: recall rose from 0.3742 to 0.8103, and precision rose from 0.6621 to 0.7383. We found that the choice of document representation algorithm is crucial, whereas the classification strategy and s_ratio in PU-Learning are not. Our approach requires no manually annotated data and can therefore help bibliometric researchers construct high-quality bibliographic datasets.
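The two-step idea described in the abstract (derive reliable negatives from known positives, then train a supervised classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words representation, the cosine ranking, the nearest-centroid classifier, the function names, and the `neg_fraction` parameter are all assumptions chosen for illustration, and `neg_fraction` is not claimed to correspond to the paper's s_ratio. The evaluation helper also assumes that "positive" means a relevant document that should be kept.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (a stand-in for any document representation)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average a non-empty list of sparse term-count vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pu_denoise(positive_docs, unlabeled_docs, neg_fraction=0.3):
    """Return one boolean per unlabeled document: True = keep, False = noise."""
    pos_vecs = [vectorize(d) for d in positive_docs]
    unl_vecs = [vectorize(d) for d in unlabeled_docs]

    # Step 1: treat the unlabeled documents least similar to the positive
    # centroid as "reliable negatives" (hypothetical selection rule).
    pos_centroid = centroid(pos_vecs)
    ranked = sorted(range(len(unl_vecs)),
                    key=lambda i: cosine(unl_vecs[i], pos_centroid))
    n_neg = max(1, int(len(unl_vecs) * neg_fraction))
    neg_centroid = centroid([unl_vecs[i] for i in ranked[:n_neg]])

    # Step 2: supervised classification trained on positives vs. reliable
    # negatives; a nearest-centroid classifier is used here for brevity.
    return [cosine(v, pos_centroid) > cosine(v, neg_centroid) for v in unl_vecs]


def precision_recall(predicted, actual):
    """Precision and recall, where True means 'relevant document'."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    precision = tp / sum(predicted) if any(predicted) else 0.0
    recall = tp / sum(actual) if any(actual) else 0.0
    return precision, recall


positives = [
    "neural network learning algorithm",
    "machine learning neural model",
    "learning algorithm for intelligent agents",
]
unlabeled = [
    "deep neural network learning",
    "soil erosion river sediment",
    "river sediment transport measurement",
    "intelligent agents planning algorithm",
    "bridge concrete corrosion",
]
keep = pu_denoise(positives, unlabeled, neg_fraction=0.4)
print(keep)  # [True, False, False, True, False]
```

In this toy run no labeled negatives are supplied; the two river-sediment documents are picked as reliable negatives purely because they share no vocabulary with the positive set. Any stronger document representation or classifier could be swapped into the same two steps.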


Pages: 1187-1204

Number of pages: 17

