Research Report: Building a Wide Reach Corpus for Secure Parser Development

Cited by: 3
Authors
Allison, Tim [1]
Burke, Wayne [1]
Constantinou, Valentino [1]
Goh, Edwin [1]
Mattmann, Chris [1]
Mensikova, Anastasija [1]
Southam, Philip [1]
Stonebraker, Ryan [1]
Timmaraju, Virisha [1]
Affiliations
[1] CALTECH, Jet Prop Lab, Pasadena, CA 91125 USA
Source
2020 IEEE SYMPOSIUM ON SECURITY AND PRIVACY WORKSHOPS (SPW 2020) | 2020
Keywords
LangSec; language-theoretic security; file corpus creation; file forensics; text extraction; parser resources; DIGITAL FORENSICS; PDF;
DOI
10.1109/SPW50608.2020.00066
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Computer software that parses electronic files is often vulnerable to maliciously crafted input data. Rather than relying on developers to implement ad hoc defenses against such data, the language-theoretic security (LangSec) philosophy offers formally correct and verifiable input handling throughout the software development lifecycle. Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reach corpora of their target file format in order to identify key edge cases or common deviations from the format's specification. In this research report, we provide the details of several methods we have used to gather approximately 30 million files, extract features, and make these features amenable to search and use in analytics. Additionally, we document the opportunities and limitations of some popular open-source datasets and annotation tools, which will benefit researchers who need to efficiently gather a large file corpus for LangSec parser development.
Pages: 318-326
Page count: 9