Joining Extractions of Regular Expressions

Cited by: 22
Authors
Freydenberger, Dominik D. [1]
Kimelfeld, Benny [2]
Peterfreund, Liat [2]
Affiliations
[1] Loughborough University, Loughborough, Leicestershire, England
[2] Technion - Israel Institute of Technology, Haifa, Israel
Funding
Israel Science Foundation
Keywords
Information Extraction; Document Spanners; Regular Expressions; Unions of Conjunctive Queries; Polynomial Delay; KNOWLEDGE-BASE; SYSTEM;
DOI
10.1145/3196959.3196967
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via the relational algebra, as studied in the context of "document spanners," Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. Such queries have been investigated in prior work on document spanners, but little is known about the (combined) complexity of their evaluation. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness already holds for single-character text. Yet, the upper bounds from the relational world do not carry over. Unlike in the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ.
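As a rough illustration of what the abstract means by extracting span relations and joining them, the following is a minimal brute-force sketch, not the paper's algorithm. It assumes a toy formula shape `prefix · x{body} · suffix` with a single capture variable, and it represents spans as 0-based half-open Python index pairs; the helper names `all_spans`, `extract`, and `natural_join` are hypothetical and chosen only for this example.

```python
import re

def all_spans(text):
    """Every span (i, j) with 0 <= i <= j <= len(text), as half-open index pairs."""
    n = len(text)
    return [(i, j) for i in range(n + 1) for j in range(i, n + 1)]

def extract(text, var, prefix, body, suffix):
    """Evaluate the toy formula  prefix . var{body} . suffix  by brute force.

    Unlike re.finditer, this returns every span for which the whole text
    splits into a prefix, a captured body, and a suffix that each match.
    """
    return [
        {var: (i, j)}
        for (i, j) in all_spans(text)
        if re.fullmatch(prefix, text[:i], re.DOTALL)
        and re.fullmatch(body, text[i:j], re.DOTALL)
        and re.fullmatch(suffix, text[j:], re.DOTALL)
    ]

def natural_join(r, s):
    """Join two span relations (lists of variable-to-span dicts) on shared variables."""
    out = []
    for t in r:
        for u in s:
            if all(t[v] == u[v] for v in t.keys() & u.keys()):
                out.append({**t, **u})
    return out

text = "aab"
r = extract(text, "x", ".*", "a+", ".*")  # x is a nonempty run of a's
s = extract(text, "x", ".*", ".+", "b")   # x is nonempty and followed by exactly "b"
joined = natural_join(r, s)               # spans satisfying both atoms
```

Here each atom is materialized in full before joining. That is feasible in this sketch only because a one-variable formula yields at most quadratically many spans; the abstract's point is that regex formulas in general can define relations too large to instantiate this way.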
Pages: 137-149
Page count: 13