A Context-free Markup Language for Semi-structured Text

Cited by: 5
Authors
Xi, Qian [1]
Walker, David [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
Keywords
Domain-specific Languages; Tool Generation; Ad Hoc Data; PADS; ANNE
DOI
10.1145/1806596.1806622
Chinese Library Classification number
TP31 [Computer Software]
Subject classification codes
081202; 0835
Abstract
An ad hoc data format is any nonstandard, semi-structured data format for which robust data processing tools are not easily available. In this paper, we present ANNE, a new kind of markup language designed to help users generate documentation and data processing tools for ad hoc text data. More specifically, given a new ad hoc data source, an ANNE programmer edits the document to add a number of simple annotations, which serve to specify its syntactic structure. Annotations include elements that specify constants, optional data, alternatives, enumerations, sequences, tabular data, and recursive patterns. The ANNE system uses a combination of user annotations and the raw data itself to extract a context-free grammar from the document. This context-free grammar can then be used to parse the data and transform it into an XML parse tree, which may be viewed through a browser for analysis or debugging purposes. In addition, the ANNE system generates a PADS/ML description [19], which may be saved as lasting documentation of the data format or compiled into a host of useful data processing tools. In addition to designing and implementing ANNE, we have devised a semantic theory for the core elements of the language. This semantic theory describes the editing process, which translates a raw, unannotated text document into an annotated document, and the grammar extraction process, which generates a context-free grammar from an annotated document. We also present an alternative characterization of system behavior by drawing upon ideas from the field of relevance logic. This secondary characterization, which we call relevance analysis, specifies a direct relationship between unannotated documents and the context-free grammars that our system can generate from them. Relevance analysis allows us to prove important theorems concerning the expressiveness and utility of our system.
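The grammar extraction step described in the abstract can be illustrated with a minimal sketch. It assumes a toy, in-memory representation of an annotated document; the class names (`Const`, `Field`, `Opt`, `Seq`), the token names (`INT`, `WORD`), and the function `extract` are hypothetical and are not ANNE's actual annotation syntax or algorithm. The sketch only shows how annotations for constants, optional data, and sequences could each contribute one production to a context-free grammar.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical annotation nodes for illustration only (not ANNE syntax).

@dataclass
class Const:
    text: str            # literal text that must appear verbatim

@dataclass
class Field:
    name: str            # nonterminal name chosen by the annotator
    token: str           # illustrative base token, e.g. "INT" or "WORD"

@dataclass
class Opt:
    name: str
    body: "Node"         # data that may be absent

@dataclass
class Seq:
    name: str
    parts: List["Node"]  # parts that appear one after another

Node = Union[Const, Field, Opt, Seq]


def extract(node: Node, rules: List[str]) -> str:
    """Append one production per annotation to rules; return node's symbol."""
    if isinstance(node, Const):
        return f'"{node.text}"'  # constants become quoted terminals
    if isinstance(node, Field):
        rules.append(f"{node.name} -> {node.token}")
        return node.name
    if isinstance(node, Opt):
        inner = extract(node.body, rules)
        rules.append(f"{node.name} -> {inner} | <empty>")
        return node.name
    if isinstance(node, Seq):
        symbols = [extract(part, rules) for part in node.parts]
        rules.append(f"{node.name} -> {' '.join(symbols)}")
        return node.name
    raise TypeError(f"unknown annotation: {node!r}")


# An annotated log entry: an integer code, a space, and an optional message.
entry = Seq("entry", [
    Field("code", "INT"),
    Const(" "),
    Opt("note", Field("msg", "WORD")),
])

rules: List[str] = []
start = extract(entry, rules)
for rule in rules:
    print(rule)
```

Productions are emitted bottom-up, so the rule for the start symbol `entry` is printed last. A real system would also need alternatives, enumerations, tables, and recursion, which this sketch omits.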
Pages: 221-232
Page count: 12