Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning

Cited by: 0
Authors
Guo Chen
Jing Chen
Yu Shao
Lu Xiao
Affiliations
[1] Nanjing University of Science and Technology, Department of Information Management
[2] Northwest Engineering Corporation Limited, Information Centre
[3] Nanjing University of Finance and Economics, School of Journalism
Source
Scientometrics | 2023 / Vol. 128
Keywords
Domain analysis; Bibliographic dataset; Noise reduction; PU-learning
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
Constructing a bibliographic dataset is fundamental to domain analysis in bibliometric research. However, irrelevant documents (so-called “impurities”) in the initial domain dataset are inevitable and difficult to identify, and eliminating them requires considerable human effort. To solve this problem, we propose a weakly supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm to clean the initial bibliographic dataset automatically. The basic idea is to use a batch of “absolutely positive” samples already available in the dataset to obtain a collection of “reliable negative” samples, from which a training set can be constructed for downstream supervised classification. We conducted a comparative experiment using the Artificial Intelligence (AI) domain of the US National Technical Reports Library (NTIS) reports as an example. We compared schemes that vary individual settings to explore how each technical choice affects the final noise reduction performance. Our approach improved substantially on the similarity-based unsupervised baseline: recall rose from 0.3742 to 0.8103, and precision rose from 0.6621 to 0.7383. We found that the choice of document representation algorithm is crucial, whereas the classification strategy and the s_ratio setting in PU-Learning are not. Our approach needs no manually annotated data and can therefore help bibliometric researchers construct high-quality bibliographic datasets.
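The two-step idea in the abstract (known positives, then reliable negatives, then a supervised classifier) can be illustrated with a minimal sketch. This is not the authors' implementation. Bag-of-words counts and cosine similarity stand in for the document representations the paper compares, a nearest-centroid rule stands in for the downstream classifier, and every name here, including the `neg_fraction` parameter, is hypothetical (it is not claimed to correspond to the paper's s_ratio).

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (a stand-in for any document representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of vectors; cosine is scale-invariant, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def pu_denoise(positive_docs, unlabeled_docs, neg_fraction=0.3):
    """Return one boolean per unlabeled document: True means 'keep as relevant'.

    neg_fraction is a hypothetical knob: the share of unlabeled documents,
    taken from the least-similar end, that is treated as reliable negatives.
    """
    pos_vecs = [vectorize(d) for d in positive_docs]
    unl_vecs = [vectorize(d) for d in unlabeled_docs]
    pos_centroid = centroid(pos_vecs)

    # Step 1: the unlabeled documents least similar to the known positives
    # are taken as reliable negatives.
    ranked = sorted(range(len(unl_vecs)),
                    key=lambda i: cosine(unl_vecs[i], pos_centroid))
    n_neg = max(1, int(len(ranked) * neg_fraction))
    neg_centroid = centroid([unl_vecs[i] for i in ranked[:n_neg]])

    # Step 2: classify every unlabeled document with the training signal
    # built in step 1 (here: whichever centroid it is closer to).
    return [cosine(v, pos_centroid) >= cosine(v, neg_centroid)
            for v in unl_vecs]


positive = ["neural network learning model",
            "machine learning classification algorithm"]
unlabeled = ["deep neural network model training",
             "soil erosion river sediment",
             "learning algorithm for classification",
             "river flood sediment transport"]
keep = pu_denoise(positive, unlabeled)
print(keep)  # [True, False, True, False]
```

In this toy run the two hydrology-style documents are flagged as noise without any manually labeled negative examples, which is the property the abstract emphasizes.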
Pages: 1187 - 1204
Number of pages: 17
Related papers
50 records in total
  • [1] Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning
    Chen, Guo
    Chen, Jing
    Shao, Yu
    Xiao, Lu
    SCIENTOMETRICS, 2023, 128 (02) : 1187 - 1204
  • [2] Positive-unlabeled learning for open set domain adaptation
    Loghmani, Mohammad Reza
    Vincze, Markus
    Tommasi, Tatiana
    PATTERN RECOGNITION LETTERS, 2020, 136 : 198 - 204
  • [3] AdaSampling for Positive-Unlabeled and Label Noise Learning With Bioinformatics Applications
    Yang, Pengyi
    Ormerod, John T.
    Liu, Wei
    Ma, Chendong
    Zomaya, Albert Y.
    Yang, Jean Y. H.
    IEEE TRANSACTIONS ON CYBERNETICS, 2019, 49 (05) : 1932 - 1943
  • [4] Spotting Fake Reviews using Positive-Unlabeled Learning
    Li, Huayi
    Liu, Bing
    Mukherjee, Arjun
    Shao, Jidong
    COMPUTACION Y SISTEMAS, 2014, 18 (03) : 467 - 475
  • [5] Computational Identification of Lysine Glutarylation Sites Using Positive-Unlabeled Learning
    Ju, Zhe
    Wang, Shi-Yun
    CURRENT GENOMICS, 2020, 21 (03) : 204 - 211
  • [6] Predicting drug-target interaction using positive-unlabeled learning
    Lan, Wei
    Wang, Jianxin
    Li, Min
    Liu, Jin
    Li, Yaohang
    Wu, Fang-Xiang
    Pan, Yi
    NEUROCOMPUTING, 2016, 206 : 50 - 57
  • [7] Robust Positive-Unlabeled Learning via Noise Negative Sample Self-correction
    Zhu, Zhangchi
    Wang, Lu
    Zhao, Pu
    Du, Chao
    Zhang, Wei
    Dong, Hang
    Qiao, Bo
    Lin, Qingwei
    Rajmohan, Saravan
    Zhang, Dongmei
    PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023, 2023, : 3663 - 3673
  • [8] Distantly Supervised Named Entity Recognition using Positive-Unlabeled Learning
    Peng, Minlong
    Xing, Xiaoyu
    Zhang, Qi
    Fu, Jinlan
    Huang, Xuanjing
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 2409 - 2419
  • [9] A domain-specific language for describing machine learning datasets
    Giner-Miguelez, Joan
    Gomez, Abel
    Cabot, Jordi
    JOURNAL OF COMPUTER LANGUAGES, 2023, 76
  • [10] Foundations for improved vaccine correlate of risk analysis using positive-unlabeled learning
    Kelkar, Natasha S.
    Morrison, Kyle S.
    Ackerman, Margaret E.
    HUMAN VACCINES & IMMUNOTHERAPEUTICS, 2023, 19 (01)