Loughborough University
Leicestershire, UK
LE11 3TU
+44 (0)1509 263171

Loughborough University Institutional Repository

Please use this identifier to cite or link to this item: https://dspace.lboro.ac.uk/2134/26421

Title: Joining extractions of regular expressions
Authors: Freydenberger, Dominik D.; Kimelfeld, Benny; Peterfreund, Liat
Issue Date: 2018
Publisher: © Association for Computing Machinery (ACM)
Citation: FREYDENBERGER, D.D., KIMELFELD, B. and PETERFREUND, L., 2018. Joining extractions of regular expressions. IN: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, Texas, USA, 10th-15th June 2018, pp. 137-149.
Abstract: Regular expressions with capture variables, also known as “regex formulas,” extract relations of spans (interval positions) from text. These relations can be further manipulated via the relational algebra as studied in the context of “document spanners,” Fagin et al.’s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. Such queries have been investigated in prior work on document spanners, but little is known about the (combined) complexity of their evaluation. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness already holds for single-character text. Yet, the upper bounds from the relational world do not carry over. Unlike in the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ.
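Illustration (not part of the original record): the abstract's notion of a span relation and a join over it can be sketched in a few lines of Python. This is a naive brute-force sketch under stated assumptions, not the authors' evaluation algorithm: spans are modelled as 0-based half-open index pairs, each atom is a unary relation over a single variable x, and the helper names `spans` and `natural_join` are hypothetical.

```python
import re
from itertools import product

def spans(pattern, doc):
    """Return every half-open span (i, j) such that doc[i:j] fully matches pattern."""
    compiled = re.compile(pattern)
    n = len(doc)
    return {
        (i, j)
        for i in range(n + 1)
        for j in range(i, n + 1)
        if compiled.fullmatch(doc[i:j])
    }

def natural_join(left, right):
    """Join two relations, given as lists of {variable: span} dicts, on shared variables."""
    joined = []
    for a, b in product(left, right):
        if all(a[v] == b[v] for v in a.keys() & b.keys()):
            joined.append({**a, **b})
    return joined

doc = "ab"
# Two unary relations over variable x (hypothetical atoms, chosen for illustration).
r1 = [{"x": s} for s in sorted(spans("a+", doc))]
r2 = [{"x": s} for s in sorted(spans(".", doc))]
result = natural_join(r1, r2)

# A variable that may bind any span of a length-n text has (n+1)(n+2)/2 choices,
# so k such independent variables yield that count raised to the power k.
num_spans = len(spans(".*", doc))
```

Materializing each relation before joining, as this sketch does, is exactly the step the abstract identifies as potentially intractable, since the number of tuples can grow exponentially with the number of variables.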
Description: This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in International Conference on Management of Data, Houston, Texas, USA, 10th-15th June 2018, http://doi.org/10.1145/3196959.3196967
Version: Accepted for publication
DOI: 10.1145/3196959.3196967
URI: https://dspace.lboro.ac.uk/2134/26421
Publisher Link: http://doi.org/10.1145/3196959.3196967
ISBN: 9781450347068
Appears in Collections: Conference Papers and Presentations (Computer Science)

Files associated with this item:

File      Description       Size       Format
main.pdf  Accepted version  513.62 kB  Adobe PDF



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.