A system for de-identifying medical message board text

Cited by: 0
Authors
Adrian Benton
Shawndra Hill
Lyle Ungar
Annie Chung
Charles Leonard
Cristin Freeman
John H Holmes
Affiliations
[1] University of Pennsylvania School of Medicine
[2] University of Pennsylvania
[3] The Wharton School
[4] University of Pennsylvania School of Engineering and Applied Science
Source
Keywords
Word List; Name Entity Recognition; Entity Recognition; Message Board; Doxy;
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
Users seeking support and information on a wide range of medical conditions have made millions of public posts to medical message boards. It has been shown that these posts can be used to gain a greater understanding of patients’ experiences and concerns. As investigators continue to explore large corpora of medical discussion board data for research purposes, protecting the privacy of the members of these online communities becomes an important challenge. Existing entity recognition methods designed for more structured text are not sufficient because message posts present additional challenges: the posts contain many typographical errors, a larger variety of possible names, terms and abbreviations specific to Internet posts or to a particular message board, and mentions of the authors’ personal lives. The main contribution of this paper is a system that automatically de-identifies the authors of message board posts, taking these challenges into account. We demonstrate our system on two different message board corpora, one on breast cancer and another on arthritis. We show that our approach significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text such as operative reports, pathology reports, discharge summaries, or newswire.
Related papers
50 in total
  • [1] A system for de-identifying medical message board text
    Benton, Adrian
    Hill, Shawndra
    Ungar, Lyle
    Chung, Annie
    Leonard, Charles
    Freeman, Cristin
    Holmes, John H.
    BMC BIOINFORMATICS, 2011, 12
  • [2] Privacy Guarantees for De-identifying Text Transformations
    Adelani, David Ifeoluwa
    Davody, Ali
    Kleinbauer, Thomas
    Klakow, Dietrich
    INTERSPEECH 2020, 2020, : 4666 - 4670
  • [3] De-identifying free text of Japanese electronic health records
    Kajiyama, Kohei
    Horiguchi, Hiromasa
    Okumura, Takashi
    Morita, Mizuki
    Kano, Yoshinobu
    JOURNAL OF BIOMEDICAL SEMANTICS, 2020, 11 (01)
  • [4] An integrated framework for de-identifying unstructured medical data
    Gardner, James
    Xiong, Li
    DATA & KNOWLEDGE ENGINEERING, 2009, 68 (12) : 1441 - 1451
  • [5] De-identifying free text of Japanese electronic health records
    Kohei Kajiyama
    Hiromasa Horiguchi
    Takashi Okumura
    Mizuki Morita
    Yoshinobu Kano
    Journal of Biomedical Semantics, 11
  • [6] De-Identifying Swedish EHR Text Using Public Resources in the General Domain
    Chomutare, Taridzo
    Yigzaw, Kassaye Yitbarek
    Budrionis, Andrius
    Makhlysheva, Alexandra
    Godtliebsen, Fred
    Dalianis, Hercules
    DIGITAL PERSONALIZED HEALTH AND MEDICINE, 2020, 270 : 148 - 152
  • [7] De-identifying an EHR Database - Anonymity, Correctness and Readability of the Medical Record
    Pantazos, Kostas
    Lauesen, Soren
    Lippert, Soren
    USER CENTRED NETWORKED HEALTH CARE, 2011, 169 : 862 - 866
  • [8] Heuristics for de-identifying health data
    El Emam, Khaled
    IEEE SECURITY & PRIVACY, 2008, 6 (04) : 58 - 61
  • [9] Heuristics for de-identifying health data
    El Emam, Khaled
    IEEE Security and Privacy, 2008, 6 (04) : 58 - 61
  • [10] De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields
    Hercules Dalianis
    Sumithra Velupillai
    Journal of Biomedical Semantics, 1