Research Report: Building a Wide Reach Corpus for Secure Parser Development

Cited by: 3
Authors
Allison, Tim [1]
Burke, Wayne [1]
Constantinou, Valentino [1]
Goh, Edwin [1]
Mattmann, Chris [1]
Mensikova, Anastasija [1]
Southam, Philip [1]
Stonebraker, Ryan [1]
Timmaraju, Virisha [1]
Affiliation
[1] Caltech, Jet Propulsion Laboratory, Pasadena, CA 91125 USA
Keywords
LangSec; language-theoretic security; file corpus creation; file forensics; text extraction; parser resources; digital forensics; PDF
DOI
10.1109/SPW50608.2020.00066
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Computer software that parses electronic files is often vulnerable to maliciously crafted input data. Rather than relying on developers to implement ad hoc defenses against such data, the language-theoretic security (LangSec) philosophy offers formally correct and verifiable input handling throughout the software development lifecycle. Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reach corpora of their target file format in order to identify key edge cases or common deviations from the format's specification. In this research report, we provide the details of several methods we have used to gather approximately 30 million files, extract features, and make these features amenable to search and use in analytics. Additionally, we document the opportunities and limitations of some popular open-source datasets and annotation tools, which will benefit researchers who need to efficiently gather a large file corpus for the purposes of LangSec parser development.
Pages: 318-326
Number of pages: 9
    Rollins, R
    DECISION SCIENCES INSTITUTE 1998 PROCEEDINGS, VOLS 1-3, 1998, : 1011 - 1011