Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning

Authors
Guo Chen
Jing Chen
Yu Shao
Lu Xiao
Affiliations
[1] Nanjing University of Science and Technology, Department of Information Management
[2] Northwest Engineering Corporation Limited, Information Centre
[3] Nanjing University of Finance and Economics, School of Journalism
Source
Scientometrics | 2023 / Vol. 128
Keywords
Domain analysis; Bibliographic dataset; Noise reduction; PU-learning
DOI
Not available
Abstract
Constructing a bibliographic dataset is fundamental for domain analysis in bibliometric research. However, irrelevant documents (so-called “impurities”) in the initial domain dataset are inevitable and difficult to identify, and eliminating them requires considerable human effort. To solve this problem, we propose a weakly supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm that cleans the initial bibliographic dataset automatically. The basic idea is to use a batch of “absolutely positive samples” already available in the dataset to obtain a collection of “reliable negative samples,” from which a training set can be constructed for downstream supervised classification. We conducted a comparative experiment using Artificial Intelligence (AI) domain reports from the US National Technical Reports Library (NTIS) as an example. We compared schemes with different variables to explore the influence of various technical aspects on the final noise reduction performance. Our approach achieved significant improvements over the similarity-based unsupervised baseline: recall rose from 0.3742 to 0.8103, and precision rose from 0.6621 to 0.7383. We found that the choice of document representation algorithm is crucial, whereas the classification strategy and s_ratio in PU-Learning are not. Our approach needs no manually annotated data and can thus help bibliometric researchers construct high-quality bibliographic datasets.
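The two-step idea in the abstract (known positives, then reliable negatives, then a supervised classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy two-dimensional vectors, the cosine-to-centroid ranking, the nearest-centroid classifier, and the names `pu_denoise` and `neg_fraction` are all assumptions made for illustration. In particular, `neg_fraction` is a hypothetical parameter and is not claimed to match the paper's s_ratio, which the abstract does not define.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def pu_denoise(positives, unlabeled, neg_fraction=0.3):
    """Two-step PU sketch (assumed design, not the paper's exact method).

    Returns (keep, neg_idx): keep[i] is True if unlabeled[i] is judged
    relevant; neg_idx lists the indices chosen as reliable negatives.
    """
    pos_center = centroid(positives)
    # Step 1: the unlabeled documents least similar to the known positives
    # are treated as reliable negatives.
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: cosine(unlabeled[i], pos_center))
    n_neg = max(1, int(len(unlabeled) * neg_fraction))
    neg_idx = ranked[:n_neg]
    neg_center = centroid([unlabeled[i] for i in neg_idx])
    # Step 2: a nearest-centroid classifier trained on positives versus
    # reliable negatives labels every unlabeled document.
    keep = [cosine(doc, pos_center) >= cosine(doc, neg_center)
            for doc in unlabeled]
    return keep, sorted(neg_idx)

# Toy document vectors (hypothetical): first axis ~ "on-topic" signal.
positives = [[1.0, 0.1], [0.9, 0.2]]
unlabeled = [[0.8, 0.3], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
keep, neg_idx = pu_denoise(positives, unlabeled, neg_fraction=0.5)
print(keep, neg_idx)
```

In practice the vectors would come from a document representation model, which the abstract reports as the most influential design choice, and the second step could use any supervised classifier.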
Pages: 1187-1204
Page count: 17
Related papers
  • [21] TLATR: Automatic Topic Labeling Using Automatic (Domain-Specific) Term Recognition
    Truica, Ciprian-Octavian
    Apostol, Elena-Simona
    IEEE Access, 2021, 9 : 76624 - 76641
  • [22] Learning and using domain-specific heuristics in ASP solvers
    Balduccini, Marcello
    AI COMMUNICATIONS, 2011, 24 (02) : 147 - 164
  • [23] Predicting disease-associated circular RNAs using deep forests combined with positive-unlabeled learning methods
    Zeng, Xiangxiang
    Zhong, Yue
    Lin, Wei
    Zou, Quan
    BRIEFINGS IN BIOINFORMATICS, 2020, 21 (04) : 1425 - 1436
  • [24] Enhancing landslide susceptibility mapping using a positive-unlabeled machine learning approach: a case study in Chamoli, India
    Zhang, Danrong
    Jindal, Dipali
    Roy, Nimisha
    Vangla, Prashanth
    Frost, J. David
    GEOENVIRONMENTAL DISASTERS, 2024, 11 (01)
  • [25] Automatic Ontology Learning from Domain-specific Short Unstructured Text Data
    Xu, Yiming
    Rajpathak, Dnyanesh
    Gibbs, Ian
    Klabjan, Diego
    PROCEEDINGS OF THE 12TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT (KMIS), VOL 3, 2020, : 29 - 39
  • [26] Learning Named Entity Tagger using Domain-Specific Dictionary
    Shang, Jingbo
    Liu, Liyuan
    Gu, Xiaotao
    Ren, Xiang
    Ren, Teng
    Han, Jiawei
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 2054 - 2064
  • [28] Automatic Heterogeneous Runtime Using Signal Processing Domain-Specific and Parallel Patterns
    Zaidi, Yaseen
    Winberg, Simon
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2025, 53 (02)
  • [29] An automatic method to generate domain-specific investigator networks using PubMed abstracts
    Yu, Wei
    Yesupriya, Ajay
    Wulf, Anja
    Qu, Junfeng
    Gwinn, Marta
    Khoury, Muin J.
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2007, 7 (1)
  • [30] Domain-Specific Relation Extraction Using Distant Supervision Machine Learning
    Aljamel, Abduladem
    Osman, Taha
    Acampora, Giovanni
    2015 7TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT (IC3K), 2015, : 92 - 103