Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning

Authors
Guo Chen
Jing Chen
Yu Shao
Lu Xiao
Affiliations
[1] Nanjing University of Science and Technology, Department of Information Management
[2] Northwest Engineering Corporation Limited, Information Centre
[3] Nanjing University of Finance and Economics, School of Journalism
Source
Scientometrics | 2023 / Vol. 128
Keywords
Domain analysis; Bibliographic dataset; Noise reduction; PU-learning
DOI
Not available
Abstract
Constructing a bibliographic dataset is fundamental to domain analysis in bibliometric research. However, irrelevant documents (so-called "impurities") in the initial domain dataset are inevitable and difficult to identify, and eliminating them requires considerable human effort. To solve this problem, we propose a weakly supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm that cleans the initial bibliographic dataset automatically. The basic idea is to use a batch of "absolutely positive sample sets" already available in the dataset to obtain a collection of "reliable negative sample sets," from which a training set can be constructed for downstream supervised classification. We conducted a comparative experiment using reports in the Artificial Intelligence (AI) domain of the US National Technical Reports Library (NTIS) as an example, comparing schemes with different variables to explore how various technical aspects influence the final noise reduction performance. Our approach achieved significant improvements over the similarity-based unsupervised baseline: recall rose from 0.3742 to 0.8103, and precision rose from 0.6621 to 0.7383. We found that the choice of document representation algorithm is crucial, whereas the classification strategy and s_ratio in PU-Learning are not. Our approach needs no manually annotated data and can therefore help bibliometric researchers construct high-quality bibliographic datasets.
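The two-step idea in the abstract (known positives yield reliable negatives, which then feed a supervised classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words representation, the centroid-similarity ranking used to pick reliable negatives, the nearest-centroid classifier, and every name here (`pu_clean`, `neg_fraction`, the toy documents) are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (a stand-in for any document representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Sum of vectors; cosine is scale-invariant, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def pu_clean(positives, unlabeled, neg_fraction=0.3):
    """Return the unlabeled documents judged relevant (hypothetical helper)."""
    pos_centroid = centroid([vectorize(d) for d in positives])
    unl_vecs = [vectorize(d) for d in unlabeled]

    # Step 1: the unlabeled documents least similar to the positive centroid
    # are treated as "reliable negatives" (assumed selection rule).
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: cosine(unl_vecs[i], pos_centroid))
    n_neg = max(1, int(len(unlabeled) * neg_fraction))
    neg_centroid = centroid([unl_vecs[i] for i in ranked[:n_neg]])

    # Step 2: supervised step, here a nearest-centroid classifier trained on
    # the positives and the reliable negatives.
    return [doc for doc, vec in zip(unlabeled, unl_vecs)
            if cosine(vec, pos_centroid) > cosine(vec, neg_centroid)]

# Toy data, invented for illustration only.
positives = ["neural network learning model",
             "machine learning algorithm training"]
unlabeled = ["deep neural network training",
             "soil erosion river sediment",
             "learning algorithm for robot planning",
             "river flood sediment transport"]

cleaned = pu_clean(positives, unlabeled)
print(cleaned)
```

On this toy input the two hydrology documents are dropped as noise and the two AI-related documents are kept. In practice the representation and the classifier would be swapped for stronger components, which the abstract reports as the factor that matters most.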
Pages: 1187-1204
Page count: 17