- research-article, November 2024
Revisiting Weighted Information Extraction: A Simpler and Faster Algorithm for Ranked Enumeration
Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 5, Article No. 222, Pages 1–19. https://doi.org/10.1145/3695840
Information extraction from textual data, where the query is represented by a finite transducer and the task is to enumerate all results without repetition, and its extension to the weighted case, where each output element has a weight and the output ...
- research-article, May 2024
Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraïssé Games for FC
Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 2, Article No. 80, Pages 1–18. https://doi.org/10.1145/3651143
Despite considerable research on document spanners, little is known about the expressive power of generalized core spanners. In this paper, we use Ehrenfeucht-Fraïssé games to obtain general inexpressibility lemmas for the logic FC (a finite model ...
- research-article, June 2022
Efficient Enumeration for Annotated Grammars
PODS '22: Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Pages 291–300. https://doi.org/10.1145/3517804.3526232
We introduce annotated grammars, an extension of context-free grammars which allows annotations on terminals. Our model extends the standard notion of regular spanners, and is more expressive than the extraction grammars recently introduced by ...
- research-article, June 2022
Document Spanners - A Brief Overview of Concepts, Results, and Recent Developments
PODS '22: Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Pages 139–150. https://doi.org/10.1145/3517804.3526069
The information extraction framework of document spanners was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, J. ACM 2015) as a formalisation of the query language AQL, which is used in IBM's information extraction engine SystemT. ...
- research-article, June 2022
Query Evaluation over SLP-Represented Document Databases with Complex Document Editing
PODS '22: Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Pages 79–89. https://doi.org/10.1145/3517804.3524158
It is known that the query result of a regular spanner over a single document D can be enumerated after O(|D|) preprocessing and with constant delay in data complexity (Florenzano et al., ACM TODS 2020, Amarilli et al., ACM TODS 2021). It has been shown ...
- research-article, June 2021
Spanner Evaluation over SLP-Compressed Documents
PODS '21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Pages 153–165. https://doi.org/10.1145/3452021.3458325
We consider the problem of evaluating regular spanners over compressed documents, i.e., we wish to solve evaluation tasks directly on the compressed data, without decompression. As compressed forms of the documents we use straight-line programs (SLPs) --...
- research-article, June 2019
Complexity Bounds for Relational Algebra over Document Spanners
PODS '19: Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Pages 320–334. https://doi.org/10.1145/3294052.3319699
We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document ...
- research-article, May 2018
Document Spanners for Extracting Incomplete Information: Expressiveness and Complexity
PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Pages 125–136. https://doi.org/10.1145/3196959.3196968
Rule-based information extraction has lately received a fair amount of attention from the database community, with several languages appearing in the last few years. Although information extraction systems are intended to deal with semistructured data, ...
- research-article, May 2018
Joining Extractions of Regular Expressions
PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Pages 137–149. https://doi.org/10.1145/3196959.3196967
Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via the relational Algebra as studied in the context of "document spanners,...
- column, May 2016
A Relational Framework for Information Extraction
Information Extraction commonly refers to the task of populating a relational schema, having predefined underlying semantics, from textual content. This task is pervasive in contemporary computational challenges associated with Big Data. In this article ...
- research-article, May 2015
Document Spanners: A Formal Approach to Information Extraction
Journal of the ACM (JACM), Volume 62, Issue 2, Article No. 12, Pages 1–51. https://doi.org/10.1145/2699442
An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this article, we develop a foundational framework where the central construct is what we call a document spanner (or just spanner for short)...
- tutorial, June 2014
Database principles in information extraction
PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, Pages 156–163. https://doi.org/10.1145/2594538.2594563
Information Extraction commonly refers to the task of populating a relational schema, having predefined underlying semantics, from textual content. This task is pervasive in contemporary computational challenges associated with Big Data. This tutorial ...
- research-article, June 2014
Cleaning inconsistencies in information extraction via prioritized repairs
PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, Pages 164–175. https://doi.org/10.1145/2594538.2594540
The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature ...