A Context-free Markup Language for Semi-structured Text

Cited by: 5
Authors
Xi, Qian [1]
Walker, David [1]
Affiliations
[1] Princeton Univ, Princeton, NJ 08544 USA
Keywords
Domain-specific Languages; Tool Generation; Ad Hoc Data; PADS; ANNE
DOI
10.1145/1806596.1806622
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
An ad hoc data format is any nonstandard, semi-structured data format for which robust data processing tools are not easily available. In this paper, we present ANNE, a new kind of markup language designed to help users generate documentation and data processing tools for ad hoc text data. More specifically, given a new ad hoc data source, an ANNE programmer edits the document to add a number of simple annotations, which serve to specify its syntactic structure. Annotations include elements that specify constants, optional data, alternatives, enumerations, sequences, tabular data, and recursive patterns. The ANNE system uses a combination of user annotations and the raw data itself to extract a context-free grammar from the document. This context-free grammar can then be used to parse the data and transform it into an XML parse tree, which may be viewed through a browser for analysis or debugging purposes. In addition, the ANNE system generates a PADS/ML description [19], which may be saved as lasting documentation of the data format or compiled into a host of useful data processing tools.

In addition to designing and implementing ANNE, we have devised a semantic theory for the core elements of the language. This semantic theory describes the editing process, which translates a raw, unannotated text document into an annotated document, and the grammar extraction process, which generates a context-free grammar from an annotated document. We also present an alternative characterization of system behavior by drawing upon ideas from the field of relevance logic. This secondary characterization, which we call relevance analysis, specifies a direct relationship between unannotated documents and the context-free grammars that our system can generate from them. Relevance analysis allows us to prove important theorems concerning the expressiveness and utility of our system.
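
The grammar extraction step described in the abstract can be illustrated with a small toy model. The sketch below is not ANNE's syntax or algorithm: the node classes (`Const`, `Field`, `Opt`, `Seq`), the `guess_token` heuristic, and the sample log line are all hypothetical names invented here. It only shows the general idea of combining user annotations with the raw data to produce grammar productions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


# Hypothetical annotation nodes; these are NOT ANNE's real elements.
@dataclass(frozen=True)
class Const:
    text: str  # literal text that always appears


@dataclass(frozen=True)
class Field:
    name: str    # nonterminal name chosen by the annotator
    sample: str  # raw text that the annotation wraps


@dataclass(frozen=True)
class Opt:
    body: "Node"  # data that may be absent


@dataclass(frozen=True)
class Seq:
    name: str
    items: Tuple["Node", ...]


Node = Union[Const, Field, Opt, Seq]


def guess_token(sample: str) -> str:
    """Crude stand-in for inferring a terminal class from the raw data."""
    if sample.isdigit():
        return "INT"
    if sample.isalpha():
        return "WORD"
    return "STRING"


def extract(node: Node, rules: Dict[str, List[str]]) -> str:
    """Return the grammar symbol for `node`, adding productions to `rules`."""
    if isinstance(node, Const):
        return repr(node.text)
    if isinstance(node, Field):
        rules[node.name] = [guess_token(node.sample)]
        return node.name
    if isinstance(node, Opt):
        # EBNF-style shorthand for an optional symbol.
        return extract(node.body, rules) + "?"
    if isinstance(node, Seq):
        rules[node.name] = [extract(item, rules) for item in node.items]
        return node.name
    raise TypeError(f"unknown annotation node: {node!r}")


def format_grammar(rules: Dict[str, List[str]]) -> str:
    return "\n".join(f"{lhs} ::= {' '.join(rhs)}" for lhs, rhs in rules.items())


# A made-up annotated line: "web01 200 ok", where the trailing note is optional.
annotated = Seq("entry", (
    Field("host", "web01"),
    Const(" "),
    Field("status", "200"),
    Opt(Seq("note", (Const(" "), Field("msg", "ok")))),
))

rules: Dict[str, List[str]] = {}
start = extract(annotated, rules)
print(format_grammar(rules))
```

In this toy model the annotations supply the structure (sequence, optional part, constant separators), while the wrapped sample text supplies the terminal classes, mirroring at a very small scale the division of labor the abstract attributes to annotations and raw data.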
Pages: 221-232
Page count: 12