Development and Validation of a Natural Language Processing Algorithm for Extracting Clinical and Pathological Features of Breast Cancer From Pathology Reports

Authors
Munzone, Elisabetta [1]
Marra, Antonio [2]
Comotto, Federico [3]
Guercio, Lorenzo [3]
Sangalli, Claudia Anna [4]
Lo Cascio, Martina [5]
Pagan, Eleonora [6]
Sangalli, Davide [5]
Bigoni, Ilaria [3]
Porta, Francesca Maria [7]
D'Ercole, Marianna [7]
Ritorti, Fabiana [3]
Bagnardi, Vincenzo [6]
Fusco, Nicola [7, 8]
Curigliano, Giuseppe [2, 8]
Affiliations
[1] IRCCS, European Inst Oncol, Div Med Senol, Milan, Italy
[2] IRCCS, European Inst Oncol, Div Early Drug Dev Innovat Therapies, Milan, Italy
[3] Reply SPA, Turin, Italy
[4] IRCCS, European Inst Oncol, Clin Trial Off, Milan, Italy
[5] IRCCS, European Inst Oncol, Cent Management Informat Syst & Technol, Milan, Italy
[6] Univ Milano Bicocca, Dept Stat & Quantitat Methods, Milan, Italy
[7] IRCCS, European Inst Oncol, Div Pathol, Milan, Italy
[8] Univ Milan, Dept Oncol & Hemato Oncol, Milan, Italy
DOI
10.1200/CCI.24.00034
Chinese Library Classification number
R73 [Oncology]
Discipline classification code
100214
Abstract
PURPOSE: Electronic health records (EHRs) are valuable information repositories that offer insights for enhancing clinical research on breast cancer (BC) using real-world data. The objective of this study was to develop a natural language processing (NLP) model specifically designed to extract structured data from BC pathology reports written in natural language.
METHODS: During the initial phase, the algorithm's development cohort comprised 193 pathology reports from 116 patients with BC from 2012 to 2016. A rule-based NLP algorithm was applied to extract 26 variables for analysis and was compared with the manual extraction of data performed by both a data entry specialist and an oncologist. Following the first approach, the data set was expanded to include 513 reports, and a Named Entity Recognition (NER)-NLP model was trained and evaluated using K-fold cross-validation.
RESULTS: The first approach led to a concordance analysis, which revealed an 82.9% agreement between the algorithm and the oncologist, whereas the concordance between the data entry specialist and the oncologist was 90.8%. The second training approach introduced the definition of an NER-NLP model, in which the accuracy showed remarkable potential (97.8%). Notably, the model demonstrated remarkable performance, especially for parameters such as estrogen receptor, progesterone receptor, human epidermal growth factor receptor 2, and Ki-67 (F1-score 1.0).
CONCLUSION: The present study aligns with the rapidly evolving field of artificial intelligence (AI) applications in oncology, seeking to expedite the development of complex cancer databases and registries. The results of the model are currently undergoing postprocessing procedures to organize the data into tabular structures, facilitating their utilization in real-world clinical and research endeavors.
A high-accuracy NLP model was developed to extract structured data from breast cancer pathology reports.
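As a rough illustration of the first approach described in the abstract (rule-based extraction followed by a concordance check against manual annotation), the following minimal Python sketch may help. It is not the authors' algorithm: the abstract does not give the actual rules, the report language, or the exact concordance formula, so the regular expressions, the sample report text, the `extract` and `concordance` helpers, and the manual values are all hypothetical. Concordance is assumed here to be simple percent agreement across variables.

```python
import re

# Hypothetical patterns for four of the variables named in the abstract.
# The study's real rules (26 variables) are not given in the source.
PATTERNS = {
    "ER": re.compile(r"\bER\b[^\d]{0,20}(\d{1,3})\s*%"),
    "PR": re.compile(r"\bPR\b[^\d]{0,20}(\d{1,3})\s*%"),
    "Ki-67": re.compile(r"\bKi-?67\b[^\d]{0,20}(\d{1,3})\s*%"),
    "HER2": re.compile(r"\bHER2\b[^\d]{0,20}([0-3])\+"),
}


def extract(report: str) -> dict:
    """Return the first matched value per variable, or None if absent."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(report)
        result[name] = match.group(1) if match else None
    return result


def concordance(a: dict, b: dict) -> float:
    """Fraction of shared variables on which two annotations agree."""
    keys = a.keys() & b.keys()
    return sum(a[k] == b[k] for k in keys) / len(keys)


# Invented report text and manual annotation, for illustration only.
report = "Invasive ductal carcinoma. ER: 95%, PR: 80%, Ki-67: 20%, HER2: 2+."
extracted = extract(report)
manual = {"ER": "95", "PR": "80", "Ki-67": "25", "HER2": "2"}

print(extracted)
print(f"Concordance: {concordance(extracted, manual):.1%}")
```

In this toy case the extractor and the manual annotation disagree only on Ki-67, so the agreement is three out of four variables. The abstract's reported figures (82.9% and 90.8%) come from the study's own data and cannot be reproduced by this sketch.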
Pages: 9
Related papers
50 in total
  • [31] Discovering social determinants of health from case reports using natural language processing: algorithmic development and validation
    Shaina Raza
    Elham Dolatabadi
    Nancy Ondrusek
    Laura Rosella
    Brian Schwartz
    BMC Digital Health, 1 (1):
  • [32] Mining Clinical Notes for Physical Rehabilitation Exercise Information: Natural Language Processing Algorithm Development and Validation Study
    Sivarajkumar, Sonish
    Gao, Fengyi
    Denny, Parker
    Aldhahwani, Bayan
    Visweswaran, Shyam
    Bove, Allyn
    Wang, Yanshan
    JMIR MEDICAL INFORMATICS, 2024, 12
  • [33] Clinical accuracy of information extracted from prostate needle biopsy pathology reports using natural language processing.
    Wong, Risa Liang
    Sagar, Medha
    Hoffman, Jacob
    Huang, Claire
    Lerma, Angelica
    Kanabolo, Diboro
    Caldwell, Joshua
    Gore, John L.
    JOURNAL OF CLINICAL ONCOLOGY, 2021, 39 (15)
  • [34] Identification of findings suspicious for breast cancer based on natural language processing of mammogram reports
    Jain, NL
    Friedman, C
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 1997, : 829 - 833
  • [35] Extracting Intrauterine Device Usage from Clinical Texts using Natural Language Processing
    Shi, Jianlin
    Mowery, Danielle
    Chapman, Wendy
    Zhang, Mingyuan
    Sanders, Jessica
    Gawron, Lori
    2017 IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS (ICHI), 2017, : 568 - 571
  • [36] Natural Language Processing for Surveillance of Cervical and Anal Cancer and Precancer: Algorithm Development and Split-Validation Study
    Oliveira, Carlos R.
    Niccolai, Patrick
    Ortiz, Anette Michelle
    Sheth, Sangini S.
    Shapiro, Eugene D.
    Niccolai, Linda M.
    Brandt, Cynthia A.
    JMIR MEDICAL INFORMATICS, 2020, 8 (11)
  • [37] Identification of Prediabetes Discussions in Unstructured Clinical Documentation: Validation of a Natural Language Processing Algorithm
    Schwartz, Jessica L.
    Tseng, Eva
    Maruthur, Nisa M.
    Rouhizadeh, Masoud
    JMIR MEDICAL INFORMATICS, 2022, 10 (02)
  • [38] Using Natural Language Processing for Extracting Information from Portable Chest X-Ray Reports
    Wang, D. Y.
    Hwang, T. S.
    Rubin, D.
    Chambers, J.
    South, B. R.
    Goldstein, M. K.
    JOURNAL OF THE AMERICAN GERIATRICS SOCIETY, 2013, 61 : S103 - S103
  • [39] Near Real-time Natural Language Processing for the Extraction of Abdominal Aortic Aneurysm Diagnoses From Radiology Reports: Algorithm Development and Validation Study
    Gaviria-Valencia, Simon
    Murphy, Sean P.
    Kaggal, Vinod C.
    McBane II, Robert D.
    Rooke, Thom W.
    Chaudhry, Rajeev
    Alzate-Aguirre, Mateo
    Arruda-Olson, Adelaide M.
    JMIR MEDICAL INFORMATICS, 2023, 11
  • [40] Successful Development of a Natural Language Processing Algorithm for Pancreatic Neoplasms and Associated Histologic Features
    Harrison, Jon Michael
    Yala, Adam
    Mikhael, Peter
    Roldan, Jorge
    Ciprani, Debora
    Michelakos, Theodoros
    Bolm, Louisa
    Qadan, Motaz
    Ferrone, Cristina
    Fernandez-del Castillo, Carlos
    Lillemoe, Keith Douglas
    Santus, Enrico
    Hughes, Kevin
    PANCREAS, 2023, 52 (04) : E219 - E223