Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning

Authors
Guo Chen
Jing Chen
Yu Shao
Lu Xiao
Affiliations
[1] Nanjing University of Science and Technology, Department of Information Management
[2] Northwest Engineering Corporation Limited, Information Centre
[3] Nanjing University of Finance and Economics, School of Journalism
Source
Scientometrics | 2023 / Vol. 128
Keywords
Domain analysis; Bibliographic dataset; Noise reduction; PU-learning
DOI
Not available
Abstract
Constructing a bibliographic dataset is fundamental to domain analysis in bibliometric research. However, irrelevant documents (so-called “impurities”) in the initial domain dataset are inevitable and difficult to identify, and eliminating them requires considerable human effort. To solve this problem, we propose a weakly supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm to clean the initial bibliographic dataset automatically. The basic idea is to use a batch of “absolutely positive sample sets” already available in the dataset to obtain a collection of “reliable negative sample sets,” from which a training set can be constructed for downstream supervised classification. We conducted a comparative experiment using reports in the Artificial Intelligence (AI) domain of the US National Technical Reports Library (NTIS) as an example. We compared schemes with different variables to explore how various technical choices influence the final noise reduction performance. Our approach achieved significant improvements over the similarity-based unsupervised baseline: recall rose from 0.3742 to 0.8103, and precision rose from 0.6621 to 0.7383. We found that the choice of document representation algorithm is crucial, whereas the classification strategy and s_ratio in PU-Learning are not. Our approach needs no manually annotated data and can therefore help bibliometric researchers construct high-quality bibliographic datasets.
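The two-step idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a Rocchio-style nearest-centroid classifier over bag-of-words counts, and it assumes that s_ratio means the fraction of unlabeled documents taken as reliable negatives (the abstract does not define it). The function name `pu_clean` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (a stand-in for the paper's document representations)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of vectors; cosine similarity ignores scale, so no averaging is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def pu_clean(positives, unlabeled, s_ratio=0.3):
    """Return one boolean per unlabeled document: True = keep as relevant.

    s_ratio is ASSUMED here to be the share of unlabeled documents
    selected as reliable negatives.
    """
    pos_centroid = centroid(vectorize(d) for d in positives)
    unl_vecs = [vectorize(d) for d in unlabeled]

    # Step 1: the unlabeled documents least similar to the known positives
    # are treated as reliable negatives.
    ranked = sorted(range(len(unl_vecs)),
                    key=lambda i: cosine(unl_vecs[i], pos_centroid))
    n_neg = max(1, int(len(unl_vecs) * s_ratio))
    neg_centroid = centroid(unl_vecs[i] for i in ranked[:n_neg])

    # Step 2: supervised step, here a nearest-centroid decision.
    return [cosine(v, pos_centroid) >= cosine(v, neg_centroid)
            for v in unl_vecs]


# Toy data, invented for illustration only.
positives = [
    "neural network learning model",
    "machine learning algorithm training",
    "deep neural network training",
]
unlabeled = [
    "neural network model training",
    "soil erosion river sediment",
    "machine learning model",
    "river sediment transport flow",
]
labels = pu_clean(positives, unlabeled, s_ratio=0.5)
```

In this toy run the two hydrology documents share no terms with the positives, so they become the reliable negatives and are the only ones flagged as noise. A real pipeline would replace the bag-of-words vectors and the centroid rule with the representation and classifier of choice.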
Pages: 1187-1204
Page count: 17