Declarative Cleaning of Inconsistencies in Information Extraction

Authors: Ronald Fagin, Benny Kimelfeld, Frederick Reiss, Stijn Vansummeren

ACM Transactions on Database Systems (TODS), Volume 41, Issue 1

Article No.: 6, Pages 1 - 44

https://doi.org/10.1145/2877202

Published: 07 April 2016

Abstract

The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature and structure (from machine logs to informal natural language), it is notoriously difficult to write IE programs that unambiguously extract the sought information. For example, during extraction, an IE program could annotate a substring as both an address and a person name. When this happens, the extracted information is said to be inconsistent, and some way of removing inconsistencies is crucial to compute the final output. Industrial-strength IE systems like GATE and IBM SystemT therefore provide a built-in collection of cleaning operations to remove inconsistencies from extracted relations. These operations, however, are collected in an ad hoc fashion through use cases. Ideally, we would like to allow IE developers to declare their own policies. But existing cleaning operations are defined in an algorithmic way, and hence it is not clear how to extend the built-in operations without requiring low-level coding of internal or external functions.

We embark on the establishment of a framework for declarative cleaning of inconsistencies in IE through principles of database theory. Specifically, building upon the formalism of document spanners for IE, we adopt the concept of prioritized repairs, which has been recently proposed as an extension of the traditional database repairs to incorporate priorities among conflicting facts. We show that our framework captures the popular cleaning policies, as well as the POSIX semantics for extraction through regular expressions. We explore the problem of determining whether a cleaning declaration is unambiguous (i.e., always results in a single repair) and whether it increases the expressive power of the extraction language. We give both positive and negative results, some of which are general and some of which apply to policies used in practice.
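To make the idea of cleaning by priorities concrete, here is a minimal Python sketch. It is not the formalism of the article: the `Span` type, the overlap-based notion of conflict, the containment-based priority in `prefers`, and the greedy procedure `greedy_clean` are all assumptions chosen for illustration. The sketch treats two extracted spans as conflicting when they overlap, prefers a span that strictly contains another, and greedily builds one conflict-free subset by repeatedly keeping a span that no remaining conflicting span is preferred over.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """A hypothetical extracted annotation over character offsets."""
    start: int   # inclusive offset
    end: int     # exclusive offset
    label: str   # e.g. "Person" or "Address"


def overlaps(a: Span, b: Span) -> bool:
    """Assumed conflict relation: two spans conflict if they share a character."""
    return a.start < b.end and b.start < a.end


def prefers(a: Span, b: Span) -> bool:
    """Assumed priority: a is preferred over b if a strictly contains b."""
    contains = a.start <= b.start and b.end <= a.end
    return contains and (a.end - a.start) > (b.end - b.start)


def greedy_clean(spans: List[Span]) -> List[Span]:
    """Build one conflict-free subset, guided by the priority relation.

    At each step, keep a span that is not dominated by any remaining
    conflicting span, then discard everything that conflicts with it.
    Ties are broken by tuple order, which is an arbitrary choice.
    """
    remaining = sorted(set(spans))
    kept: List[Span] = []
    while remaining:
        candidates = [
            s for s in remaining
            if not any(overlaps(t, s) and prefers(t, s) for t in remaining)
        ]
        # Strict containment is acyclic, so at least one candidate exists.
        chosen = min(candidates)
        kept.append(chosen)
        remaining = [
            s for s in remaining
            if s != chosen and not overlaps(s, chosen)
        ]
    return sorted(kept)


if __name__ == "__main__":
    extracted = [
        Span(0, 10, "Person"),    # a substring annotated as a person name
        Span(0, 20, "Address"),   # a longer span covering the same text
        Span(25, 30, "Person"),   # an unrelated, non-conflicting span
    ]
    for span in greedy_clean(extracted):
        print(span)
```

In this sketch, when two conflicting spans are incomparable under the priority (for example, the same substring labeled both "Person" and "Address"), the outcome depends only on the arbitrary tie-break. That is a small-scale picture of the ambiguity question raised above: a declaration is unambiguous only if the priorities alone determine a single result.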
