Research Report: Building a Wide Reach Corpus for Secure Parser Development

Cited by: 3
Authors
Allison, Tim [1]
Burke, Wayne [1]
Constantinou, Valentino [1]
Goh, Edwin [1]
Mattmann, Chris [1]
Mensikova, Anastasija [1]
Southam, Philip [1]
Stonebraker, Ryan [1]
Timmaraju, Virisha [1]
Affiliations
[1] CALTECH, Jet Prop Lab, Pasadena, CA 91125 USA
Source
2020 IEEE SYMPOSIUM ON SECURITY AND PRIVACY WORKSHOPS (SPW 2020) | 2020
Keywords
LangSec; language-theoretic security; file corpus creation; file forensics; text extraction; parser resources; DIGITAL FORENSICS; PDF;
DOI
10.1109/SPW50608.2020.00066
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Computer software that parses electronic files is often vulnerable to maliciously crafted input data. Rather than relying on developers to implement ad hoc defenses against such data, the language-theoretic security (LangSec) philosophy offers formally correct and verifiable input handling throughout the software development lifecycle. Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reach corpora of their target file format in order to identify key edge cases or common deviations from the format's specification. In this research report, we provide the details of several methods we have used to gather approximately 30 million files, extract features, and make these features amenable to search and use in analytics. Additionally, we provide documentation on opportunities and limitations of some popular open-source datasets and annotation tools that will benefit researchers who need to efficiently gather a large file corpus for the purposes of LangSec parser development.
Pages: 318 - 326
Page count: 9
Related papers
50 in total
  • [11] Corpus and the Research of English Textbooks Development
    Zhang, Hongqin
    SOCIAL SCIENCE AND EDUCATION, 2013, 9 : 530 - 535
  • [12] Working group report on building secure knowledge systems
    Applied Knowledge Group, Reston, United States
    J Eng Appl Sci, (309-311):
  • [13] Working group report on building secure knowledge systems
    Davis, BC
    Wood, BJ
    SIXTH IEEE WORKSHOPS ON ENABLING TECHNOLOGIES: INFRASTRUCTURE FOR COLLABORATIVE ENTERPRISES, PROCEEDINGS, 1997, : 309 - 311
  • [14] Development of a secure medical research environment
    Alaoui, A
    Levine, B
    Cleary, K
    Mun, SK
    2000 IEEE EMBS INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY APPLICATIONS IN BIOMEDICINE, PROCEEDINGS, 2000, : 44 - 49
  • [15] Building a Synthetic Biomedical Research Article Citation Linkage Corpus
    Roy, Sudipta Singha
    Mercer, Robert E.
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 5665 - 5672
  • [16] Building more secure software with improved development processes
    Howard, M
    IEEE SECURITY & PRIVACY, 2004, 2 (06) : 63 - 65
  • [17] RESEARCH ON ECOLOGICAL BUILDING AND SUSTAINABLE BUILDING DEVELOPMENT
    Chen, Nan
    FRESENIUS ENVIRONMENTAL BULLETIN, 2021, 30 (03) : 2998 - 3004
  • [18] Corpus linguistics for writing development: A guide for research
    Albelihi, Hani Hamd
    SYSTEM, 2023, 115
  • [19] Corpus linguistics for writing development: A guide for research
    Li, Qidi
    Yan, Jianwei
    Durrant, Philip
    JOURNAL OF ENGLISH FOR ACADEMIC PURPOSES, 2023, 66
  • [20] Corpus linguistics for writing development: A guide for research
    Sun, Shuyi Amelia
    JOURNAL OF SECOND LANGUAGE WRITING, 2023, 59