Learning regular expressions for clinical text classification

Cited by: 66

Authors:

Duy Duc An Bui [1,2]

Zeng-Treitler, Qing [1,2]

Affiliations:

[1] Univ Utah, Dept Biomed Informat, Salt Lake City, UT 84112 USA

[2] VA Salt Lake City Hlth Care Syst, Salt Lake City, UT USA

Source:

Journal of the American Medical Informatics Association | 2014 / Volume 21 / Issue 05

Keywords:

RECORDS; SUPPORT;

DOI:

10.1136/amiajnl-2013-002411

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Objectives: Natural language processing (NLP) applications typically use regular expressions that have been developed manually by human experts. Our goal is to automate both the creation and utilization of regular expressions in text classification.

Methods: We designed a novel regular expression discovery (RED) algorithm and implemented two text classifiers based on RED. The RED+ALIGN classifier combines RED with an alignment algorithm, and RED+SVM combines RED with a support vector machine (SVM) classifier. Two clinical datasets were used for testing and evaluation: the SMOKE dataset, containing 1091 text snippets describing smoking status; and the PAIN dataset, containing 702 snippets describing pain status. We performed 10-fold cross-validation to calculate accuracy, precision, recall, and F-measure metrics. In the evaluation, an SVM classifier was trained as the control.

Results: The two RED classifiers achieved 80.9-83.0% in overall accuracy on the two datasets, which is 1.3-3% higher than SVM's accuracy (p<0.001). Similarly, small but consistent improvements have been observed in precision, recall, and F-measure when RED classifiers are compared with SVM alone. More significantly, RED+ALIGN correctly classified many instances that were misclassified by the SVM classifier (8.1-10.3% of the total instances and 43.8-53.0% of SVM's misclassifications).

Conclusions: Machine-generated regular expressions can be effectively used in clinical text classification. The regular expression-based classifier can be combined with other classifiers, like SVM, to improve classification performance.
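To make the general idea concrete, here is a minimal Python sketch of regular-expression-based snippet classification together with the precision, recall, and F-measure calculation the abstract mentions. It is not the RED algorithm: the patterns in `PATTERNS` are hand-written for illustration (RED learns its expressions from labeled data), and the label names, the helper functions `classify` and `prf`, and the example snippets are all hypothetical, not taken from the SMOKE dataset.

```python
import re

# Hypothetical, hand-written patterns for illustration only.
# In the paper these expressions are discovered automatically by RED.
PATTERNS = {
    "non-smoker": [
        re.compile(r"\b(denies|never)\b.*\bsmok", re.I),
        re.compile(r"\bnon-?smoker\b", re.I),
    ],
    "past-smoker": [re.compile(r"\b(quit|former|ex-?)\s*smok", re.I)],
    "smoker": [re.compile(r"\bsmokes?\b.*\b(pack|ppd|cigarette)", re.I)],
}


def classify(text):
    """Return the first label with a matching pattern, else 'unknown'."""
    for label, patterns in PATTERNS.items():
        if any(p.search(text) for p in patterns):
            return label
    return "unknown"


def prf(gold, pred, label):
    """Precision, recall, and F-measure for a single label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Made-up snippets with made-up gold labels (not from the SMOKE dataset).
SNIPPETS = [
    ("Patient denies smoking.", "non-smoker"),
    ("Former smoker, quit 10 years ago.", "past-smoker"),
    ("Smokes 1 pack per day.", "smoker"),
    ("Never smoked.", "non-smoker"),
    ("Patient has never quit smoking.", "smoker"),  # the rules get this wrong
]

gold = [label for _, label in SNIPPETS]
pred = [classify(text) for text, _ in SNIPPETS]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

if __name__ == "__main__":
    print("accuracy:", accuracy)
    print("non-smoker P/R/F:", prf(gold, pred, "non-smoker"))
```

The last snippet is deliberately one that the fixed rule order misclassifies, which illustrates why a pattern-based classifier might be paired with a second classifier such as an SVM, as the abstract suggests.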


Pages: 850-857

Page count: 8
