Joining Extractions of Regular Expressions

Cited by: 22
Authors
Freydenberger, Dominik D. [1]
Kimelfeld, Benny [2]
Peterfreund, Liat [2]
Affiliations
[1] Loughborough University, Loughborough, Leicestershire, England
[2] Technion - Israel Institute of Technology, Haifa, Israel
Funding
Israel Science Foundation
Keywords
Information Extraction; Document Spanners; Regular Expressions; Unions of Conjunctive Queries; Polynomial Delay; KNOWLEDGE-BASE; SYSTEM
DOI
10.1145/3196959.3196967
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via the relational algebra, as studied in the context of "document spanners," Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. Such queries have been investigated in prior work on document spanners, but little is known about the (combined) complexity of their evaluation. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness already holds for single-character text. Yet the upper bounds from the relational world do not carry over. Unlike in the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Nevertheless, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ.
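The blow-up mentioned in the abstract can be illustrated with a small brute-force sketch. This is not the paper's algorithm: it assumes 0-based half-open spans, models a single-variable regex formula of the shape `.* x{pattern} .*` by testing every span of the text, and uses hypothetical helper names (`all_spans`, `extract`, `join`) introduced only for this illustration.

```python
import re
from itertools import product


def all_spans(text):
    """All half-open spans (i, j) with 0 <= i <= j <= len(text)."""
    n = len(text)
    return [(i, j) for i in range(n + 1) for j in range(i, n + 1)]


def extract(var, pattern, text):
    """Span relation for one variable: every span whose substring
    fully matches `pattern` (a stand-in for `.* var{pattern} .*`)."""
    rx = re.compile(pattern)
    return [{var: (i, j)} for (i, j) in all_spans(text)
            if rx.fullmatch(text[i:j])]


def join(r, s):
    """Naive natural join: keep pairs of tuples that agree on shared variables."""
    out = []
    for t, u in product(r, s):
        if all(t[v] == u[v] for v in t.keys() & u.keys()):
            out.append({**t, **u})
    return out


text = "aaa"
xs = extract("x", r"a*", text)   # every span of "aaa" matches a*
ys = extract("y", r"a*", text)
print(len(xs))                   # number of spans over a 3-character text
print(len(join(xs, ys)))         # no shared variable, so a full cross product
```

With k independent variables the materialized relation in this sketch grows as the k-th power of the number of spans, which is why instantiating each atom before joining can already be too expensive.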
Pages: 137-149
Page count: 13