Research Report: Building a Wide Reach Corpus for Secure Parser Development

Cited by: 3
Authors
Allison, Tim [1]
Burke, Wayne [1]
Constantinou, Valentino [1]
Goh, Edwin [1]
Mattmann, Chris [1]
Mensikova, Anastasija [1]
Southam, Philip [1]
Stonebraker, Ryan [1]
Timmaraju, Virisha [1]
Affiliation
[1] CALTECH, Jet Prop Lab, Pasadena, CA 91125 USA
Source
2020 IEEE SYMPOSIUM ON SECURITY AND PRIVACY WORKSHOPS (SPW 2020) | 2020
Keywords
LangSec; language-theoretic security; file corpus creation; file forensics; text extraction; parser resources; DIGITAL FORENSICS; PDF
DOI
10.1109/SPW50608.2020.00066
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Computer software that parses electronic files is often vulnerable to maliciously crafted input data. Rather than relying on developers to implement ad hoc defenses against such data, the language-theoretic security (LangSec) philosophy offers formally correct and verifiable input handling throughout the software development lifecycle. Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reach corpora of their target file format in order to identify key edge cases or common deviations from the format's specification. In this research report, we provide the details of several methods we have used to gather approximately 30 million files, extract features, and make these features amenable to search and use in analytics. Additionally, we document the opportunities and limitations of some popular open-source datasets and annotation tools, which will benefit researchers who need to efficiently gather a large file corpus for LangSec parser development.
Pages: 318-326
Page count: 9