[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Declarative Cleaning of Inconsistencies in Information Extraction

Published: 07 April 2016 Publication History

Abstract

The population of a predefined relational schema from textual content, commonly known as Information Extraction (IE), is a pervasive task in contemporary computational challenges associated with Big Data. Since the textual content varies widely in nature and structure (from machine logs to informal natural language), it is notoriously difficult to write IE programs that unambiguously extract the sought information. For example, during extraction, an IE program could annotate a substring as both an address and a person name. When this happens, the extracted information is said to be inconsistent, and some way of removing inconsistencies is crucial to compute the final output. Industrial-strength IE systems like GATE and IBM SystemT therefore provide a built-in collection of cleaning operations to remove inconsistencies from extracted relations. These operations, however, are collected in an ad hoc fashion through use cases. Ideally, we would like to allow IE developers to declare their own policies. But existing cleaning operations are defined in an algorithmic way, and hence it is not clear how to extend the built-in operations without requiring low-level coding of internal or external functions.
We embark on the establishment of a framework for declarative cleaning of inconsistencies in IE through principles of database theory. Specifically, building upon the formalism of document spanners for IE, we adopt the concept of prioritized repairs, which has been recently proposed as an extension of the traditional database repairs to incorporate priorities among conflicting facts. We show that our framework captures the popular cleaning policies, as well as the POSIX semantics for extraction through regular expressions. We explore the problem of determining whether a cleaning declaration is unambiguous (i.e., always results in a single repair) and whether it increases the expressive power of the extraction language. We give both positive and negative results, some of which are general and some of which apply to policies used in practice.

References

[1]
Jitendra Ajmera, Hyung-Il Ahn, Meena Nagarajan, Ashish Verma, Danish Contractor, Stephen Dill, and Matthew Denesuk. 2013. A CRM system for social media: Challenges and experiences. In WWW. 49--58.
[2]
Douglas E. Appelt and Boyan Onyshkevych. 1998. The common pattern specification language. In Proceedings of the TIPSTER Text Program: Phase III. 23--30.
[3]
Marcelo Arenas, Leopoldo E. Bertossi, and Jan Chomicki. 1999. Consistent query answers in inconsistent databases. In PODS. 68--79.
[4]
Edward Benson, Aria Haghighi, and Regina Barzilay. 2011. Event discovery in social media feeds. In ACL. 389--398.
[5]
Leopoldo E. Bertossi, Solmaz Kolahi, and Laks V. S. Lakshmanan. 2013. Data cleaning and query answering with matching dependencies and matching functions. Theor. Comput. Syst. 52, 3 (2013), 441--482.
[6]
George Beskales, Ihab F. Ilyas, Lukasz Golab, and Artur Galiullin. 2014. Sampling from repairs of conditional functional dependency violations. VLDB J. 23, 1 (2014), 103--128.
[7]
Jens Bleiholder and Felix Naumann. 2008. Data fusion. ACM Comput. Surv. 41, 1 (2008).
[8]
Philip Bohannon, Michael Flaster, Wenfei Fan, and Rajeev Rastogi. 2005. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 143--154.
[9]
Laura Chiticariu, Vivian Chu, Sajib Dasgupta, Thilo W. Goetz, Howard Ho, Rajasekar Krishnamurthy, Alexander Lang, Yunyao Li, Bin Liu, Sriram Raghavan, and others. 2011. The SystemT IDE: An integrated development environment for information extraction rules. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 1291--1294.
[10]
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, and Shivakumar Vaithyanathan. 2010. SystemT: An algebraic approach to declarative information extraction. In ACL. 128--137.
[11]
Laura Chiticariu, Yunyao Li, and Frederick R. Reiss. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems!. In EMNLP. 827--832.
[12]
Rus Cox. 2009. Regular Expression Matching: the Virtual Machine Approach. Digression: POSIX Submatching. Retrieved from http://swtch.com/ rsc/regexp/regexp2.html.
[13]
Hamish Cunningham. 2002. GATE, a general architecture for text engineering. Comput. Humanities 36, 2 (2002), 223--254.
[14]
H. Cunningham, D. Maynard, and V. Tablan. 2000. JAPE: A Java Annotation Patterns Engine (Second Edition). Research Memorandum CS--00--10. Department of Computer Science, University of Sheffield.
[15]
Gerald DeJong. 1982. An overview of the FRUMP system. In Strategies for Natural Language Processing, Wendy G. Lehnert and Martin H. Ringle (Eds.). Lawrence Erlbaum Associates, 149--176.
[16]
Maximilian Dylla, Iris Miliaraki, and Martin Theobald. 2013. A temporal-probabilistic database model for information extraction. Proc. VLDB 6, 14 (2013), 1810--1821.
[17]
Ronald Fagin, Benny Kimelfeld, Yunyao Li, Sriram Raghavan, and Shivakumar Vaithyanathan. 2011. Rewrite rules for search database systems. In PODS. 271--282.
[18]
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. 2014. Cleaning inconsistencies in information extraction via prioritized repairs. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’14). ACM, New York, NY, 164--175.
[19]
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. 2015. Document spanners: A formal approach to information extraction. J. ACM 62, 2 (2015), 12.
[20]
Wenfei Fan. 2008. Dependencies revisited for improving data quality. In PODS. 159--170.
[21]
Wenfei Fan, Hong Gao, Xibei Jia, Jianzhong Li, and Shuai Ma. 2011a. Dynamic constraints for record matching. VLDB J. 20, 4 (2011), 495--520.
[22]
Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyuan Yu. 2011b. Interaction between record matching and data repairing. In SIGMOD Conference. 469--480.
[23]
David A. Ferrucci and Adam Lally. 2004. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng. 10, 3--4 (2004), 327--348.
[24]
Glenn Fowler. 2003. An interpreation of the POSIX Regex Standard (2003). http://gsf.cococlyde.org/download/re-interpretation.tgz.
[25]
Dayne Freitag. 1998. Toward general-purpose learning for information extraction. In COLING-ACL. 404--408.
[26]
Qiang Fu, Jian-Guang Lou, Yi Wang, and Jiang Li. 2009. Execution anomaly detection in distributed systems through unstructured log analysis. In ICDM. 149--158.
[27]
Ralph Grishman and Beth Sundheim. 1996. Message understanding conference-6: A brief history. In COLING. 466--471.
[28]
Markus Holzer, Martin Kutrib, and Andreas Malcher. 2008. Multi-head finite automata: Characterizations, concepts and open problems. In CSP (EPTCS), Vol. 1. 93--107.
[29]
Institute of Electrical and Electronic Engineers and The Open Group. 2013. The Open Group Base Specifications Issue 7. (2013). IEEE Std 1003.1, 2013 Edition.
[30]
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML. 282--289.
[31]
Ville Laurikari. 2001. Efficient Submatch Addressing for Regular Expressions. Master’s thesis. Helsinki University of Technology.
[32]
T. R. Leek. 1997. Information Extraction Using Hidden Markov Models. Master’s thesis. UC San Diego.
[33]
Bin Liu, Laura Chiticariu, Vivian Chu, H. V. Jagadish, and Frederick Reiss. 2010. Automatic rule refinement for information extraction. Proc. VLDB 3, 1 (2010), 588--597.
[34]
Shuai Ma, Wenfei Fan, and Loreto Bravo. 2014. Extending inclusion dependencies with conditions. Theor. Comput. Sci. 515 (2014), 64--95.
[35]
Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In ICML. 591--598.
[36]
Feng Niu, Christopher Ré, AnHai Doan, and Jude W. Shavlik. 2011. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. Proc. VLDB 4, 6 (2011), 373--384.
[37]
Satoshi Okui and Taro Suzuki. 2010. Disambiguation in regular expression matching via position automata with augmented transitions. In CIAA (Lecture Notes in Computer Science), Michael Domaratzki and Kai Salomaa (Eds.), Vol. 6482. 231--240.
[38]
Hoifung Poon and Pedro Domingos. 2007. Joint inference in information extraction. In AAAI. AAAI Press, 913--918.
[39]
Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. 2008. An algebraic approach to rule-based information extraction. In ICDE. 933--942.
[40]
Ellen Riloff. 1993. Automatically constructing a dictionary for information extraction tasks. In AAAI. 811--816.
[41]
Sudeepa Roy, Laura Chiticariu, Vitaly Feldman, Frederick R. Reiss, and Huaiyu Zhu. 2013. Provenance-based dictionary refinement in information extraction. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 457--468.
[42]
Warren Shen, AnHai Doan, Jeffrey F. Naughton, and Raghu Ramakrishnan. 2007. Declarative information extraction using datalog with embedded extraction predicates. In VLDB. 1033--1044.
[43]
Stephen Soderland. 1999. Learning information extraction rules for semi-structured and free text. Mach. Learn. 34, 1--3 (1999), 233--272.
[44]
Stephen Soderland, David Fisher, Jonathan Aseltine, and Wendy G. Lehnert. 1995. CRYSTAL: Inducing a conceptual dictionary. In IJCAI. 1314--1321.
[45]
Slawek Staworko, Jan Chomicki, and Jerzy Marcinkowski. 2012. Prioritized repairing and consistent query answering in relational databases. Ann. Math. Artif. Intell. 64, 2--3 (2012), 209--246.
[46]
Stijn Vansummeren. 2006. Type inference for unique pattern matching. ACM Trans. Program. Lang. Syst. 28, 3 (2006), 389--428.
[47]
Hua Xu, Shane P. Stenner, Son Doan, Kevin B. Johnson, Lemuel R. Waitman, and Joshua C. Denny. 2010. Application of information technology: MedEx: A medication information extraction system for clinical narratives. JAMIA 17, 1 (2010), 19--24.
[48]
Huaiyu Zhu, Sriram Raghavan, Shivakumar Vaithyanathan, and Alexander Löser. 2007. Navigating the intranet with high precision. In WWW. 491--500.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 41, Issue 1
Invited Paper from ICDT 2015, SIGMOD 2014, EDBT 2014 and Regular Papers
April 2016
287 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2897141
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 April 2016
Accepted: 01 November 2015
Revised: 01 September 2015
Received: 01 January 2015
Published in TODS Volume 41, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tag

  1. Information Extraction

Qualifiers

  • Research-article
  • Research
  • Refereed

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)10
  • Downloads (Last 6 weeks)1
Reflects downloads up to 09 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Enumerating grammar-based extractionsDiscrete Applied Mathematics10.1016/j.dam.2023.08.014341(372-392)Online publication date: Dec-2023
  • (2021)Database Principles and Challenges in Text AnalysisACM SIGMOD Record10.1145/3484622.348462450:2(6-17)Online publication date: 31-Aug-2021
  • (2020)Efficient Enumeration Algorithms for Regular Document SpannersACM Transactions on Database Systems10.1145/335145145:1(1-42)Online publication date: 8-Feb-2020
  • (2020)On Multiple Semantics for Declarative Database RepairsProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389721(817-831)Online publication date: 11-Jun-2020
  • (2020)Ontology-based Document Spanning Systems for Information ExtractionInternational Journal of Semantic Computing10.1142/S1793351X2040001214:01(3-26)Online publication date: 9-Jun-2020
  • (2019)Complexity Bounds for Relational Algebra over Document SpannersProceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3294052.3319699(320-334)Online publication date: 25-Jun-2019
  • (2019)Split-Correctness in Information ExtractionProceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3294052.3319684(149-163)Online publication date: 25-Jun-2019
  • (2019)A Formal Framework for Coupling Document Spanners with Ontologies2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE)10.1109/AIKE.2019.00036(155-162)Online publication date: Jun-2019
  • (2018)Employing Semantic Context for Sparse Information Extraction AssessmentACM Transactions on Knowledge Discovery from Data10.1145/320140712:5(1-36)Online publication date: 27-Jun-2018
  • (2018)Constant Delay Algorithms for Regular Document SpannersProceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3196959.3196987(165-177)Online publication date: 27-May-2018
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media