Large Scale Semi-Automated Labeling of Routine Free-Text Clinical Records for Deep Learning

Cited by: 14
Authors
Trivedi, Hari M. [1]
Panahiazar, Maryam [2]
Liang, April [3]
Lituiev, Dmytro [2]
Chang, Peter [1]
Sohn, Jae Ho [1]
Chen, Yunn-Yi [4]
Franc, Benjamin L. [1]
Joe, Bonnie [1]
Hadley, Dexter [2]
Affiliations
[1] Univ Calif San Francisco, Dept Radiol & Biomed Imaging, San Francisco, CA 94143 USA
[2] Univ Calif San Francisco, Inst Computat Hlth Sci, San Francisco, CA 94143 USA
[3] Univ Calif San Francisco, Sch Med, San Francisco, CA USA
[4] Univ Calif San Francisco, Dept Pathol, San Francisco, CA 94140 USA
Keywords
IBM Watson; Machine learning; Artificial intelligence; Deep learning; Natural language processing (NLP); Pathology; Mammography; CANCER; CLASSIFICATION; ARCHITECTURE; MAMMOGRAPHY; MASSES;
DOI
10.1007/s10278-018-0105-8
Chinese Library Classification (CLC) number
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline classification code
1002; 100207; 1009
Abstract
Breast cancer is a leading cause of cancer death among women in the USA. Screening mammography is effective in reducing mortality, but has a high rate of unnecessary recalls and biopsies. While deep learning can be applied to mammography, it requires large-scale labeled datasets, which are difficult to obtain. We aim to remove many barriers to dataset development by automatically harvesting data from existing clinical records using a hybrid framework combining traditional NLP and IBM Watson. An expert reviewer manually annotated 3521 breast pathology reports with one of four outcomes: left positive, right positive, bilateral positive, negative. Traditional NLP techniques using seven different machine learning classifiers were compared to IBM Watson's automated natural language classifier (NLC). Techniques were evaluated using precision, recall, and F-measure. Logistic regression outperformed all other traditional machine learning classifiers and was used for subsequent comparisons. Both traditional NLP and Watson's NLC performed well for cases under 1024 characters, with weighted average F-measures above 0.96 across all classes. Performance of traditional NLP was lower for cases over 1024 characters, with an F-measure of 0.83. We demonstrate a hybrid framework using traditional NLP techniques combined with IBM Watson to annotate over 10,000 breast pathology reports for development of a large-scale database to be used for deep learning in mammography. Our work shows that traditional NLP and IBM Watson perform extremely well for cases under 1024 characters and can accelerate the rate of data annotation.
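The evaluation named in the abstract (precision, recall, and a weighted average F-measure across the four outcome classes, reported separately for reports under and over 1024 characters) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the F-measure is assumed to be the usual F1 weighted by each class's share of the true labels, and the treatment of reports of exactly 1024 characters is an assumption, since the abstract only says "under" and "over".

```python
from collections import Counter

# The four outcome classes named in the abstract.
LABELS = ["left positive", "right positive", "bilateral positive", "negative"]


def per_class_scores(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall, f1)} for each class."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[label] = (precision, recall, f1)
    return scores


def weighted_f_measure(y_true, y_pred, labels=LABELS):
    """Average per-class F1, weighted by the number of true cases per class."""
    support = Counter(y_true)
    total = sum(support[label] for label in labels)
    if total == 0:
        return 0.0
    scores = per_class_scores(y_true, y_pred, labels)
    return sum(scores[label][2] * support[label] for label in labels) / total


def split_by_length(reports, threshold=1024):
    """Split report texts into (short, long) by character count.

    Assumption: a report of exactly `threshold` characters counts as long.
    """
    short = [r for r in reports if len(r) < threshold]
    long_ = [r for r in reports if len(r) >= threshold]
    return short, long_


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    truth = ["negative", "negative", "left positive", "right positive"]
    predicted = ["negative", "left positive", "left positive", "right positive"]
    print(per_class_scores(truth, predicted))
    print(weighted_f_measure(truth, predicted))
```

Weighting by class support keeps a rare class such as bilateral positive from dominating the average, which is one plausible reason to report a weighted rather than a simple mean F-measure.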
Pages: 30-37
Page count: 8