Learning regular expressions for clinical text classification

Cited by: 66
Authors
Bui, Duy Duc An [1, 2]
Zeng-Treitler, Qing [1, 2]
Affiliations
[1] Univ Utah, Dept Biomed Informat, Salt Lake City, UT 84112 USA
[2] VA Salt Lake City Hlth Care Syst, Salt Lake City, UT USA
Keywords
RECORDS; SUPPORT
DOI
10.1136/amiajnl-2013-002411
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Objectives: Natural language processing (NLP) applications typically use regular expressions that have been developed manually by human experts. Our goal is to automate both the creation and utilization of regular expressions in text classification.

Methods: We designed a novel regular expression discovery (RED) algorithm and implemented two text classifiers based on RED. The RED+ALIGN classifier combines RED with an alignment algorithm, and RED+SVM combines RED with a support vector machine (SVM) classifier. Two clinical datasets were used for testing and evaluation: the SMOKE dataset, containing 1091 text snippets describing smoking status; and the PAIN dataset, containing 702 snippets describing pain status. We performed 10-fold cross-validation to calculate accuracy, precision, recall, and F-measure metrics. In the evaluation, an SVM classifier was trained as the control.

Results: The two RED classifiers achieved 80.9-83.0% in overall accuracy on the two datasets, which is 1.3-3% higher than SVM's accuracy (p<0.001). Similarly, small but consistent improvements were observed in precision, recall, and F-measure when RED classifiers were compared with SVM alone. More significantly, RED+ALIGN correctly classified many instances that were misclassified by the SVM classifier (8.1-10.3% of the total instances and 43.8-53.0% of SVM's misclassifications).

Conclusions: Machine-generated regular expressions can be effectively used in clinical text classification. The regular expression-based classifier can be combined with other classifiers, like SVM, to improve classification performance.
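As a rough illustration of the kind of classifier and metrics the abstract refers to, the following Python sketch labels text snippets with regular expressions and computes per-class precision, recall, and F-measure. It is not the RED algorithm: the patterns are written by hand rather than learned, and the label names (`smoker`, `past-smoker`, `non-smoker`, `unknown`), the `RULES` table, and the function names `classify` and `prf` are all assumptions made for this example, not taken from the paper.

```python
import re

# Hypothetical hand-written rules. RED learns its expressions automatically
# from labelled snippets; this sketch only shows how such expressions could
# be applied once they exist. Rules are tried in order, first match wins.
RULES = [
    ("non-smoker", re.compile(r"\b(?:never smoked|non-?smoker|denies (?:smoking|tobacco))\b", re.I)),
    ("past-smoker", re.compile(r"\b(?:quit smoking|former smoker|ex-?smoker)\b", re.I)),
    ("smoker", re.compile(r"\b(?:current smoker|smokes|packs? (?:per|a) day)\b", re.I)),
]


def classify(snippet, default="unknown"):
    """Return the label of the first rule whose pattern matches the snippet."""
    for label, pattern in RULES:
        if pattern.search(snippet):
            return label
    return default


def prf(gold, predicted, label):
    """Precision, recall, and F-measure for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy snippets invented for illustration; not drawn from the SMOKE dataset.
    snippets = [
        "Patient is a former smoker, quit smoking in 2005.",
        "Denies tobacco use.",
        "Smokes 1 pack per day.",
        "No acute complaints today.",
    ]
    for text in snippets:
        print(classify(text), "<-", text)
```

In a cross-validated setup like the one described, `prf` would be applied to the held-out fold of each of the 10 splits and the results averaged; the sketch leaves the splitting out to stay short.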
Pages: 850-857
Number of pages: 8