research-article

Information extraction as a filtering task

Authors:

Henning Wachsmuth,

Gregor Engels

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management

Pages 2049 - 2058

https://doi.org/10.1145/2505515.2505557

Published: 27 October 2013

Abstract

Information extraction is usually approached as an annotation task: Input texts run through several analysis steps of an extraction process in which different semantic concepts are annotated and matched against the slots of templates. We argue that such an approach lacks an efficient control of the input of the analysis steps. In this paper, we hence propose and evaluate a model and a formal approach that consistently put the filtering view in the focus: Before spending annotation effort, filter those portions of the input texts that may contain relevant information for filling a template and discard the others. We model all dependencies between the semantic concepts sought for with a truth maintenance system, which then efficiently infers the portions of text to be annotated in each analysis step. The filtering view enables an information extraction system (1) to annotate only relevant portions of input texts and (2) to easily trade its run-time efficiency for its recall. We provide our approach as an open-source extension of Apache UIMA and we show the potential of our approach in a number of experiments.
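The filtering view described in the abstract can be illustrated with a minimal sketch. It is not the authors' Apache UIMA extension and it does not implement a truth maintenance system; it only assumes a purely conjunctive template (every concept is required) and uses sentences as the filtering unit. The regex-based annotators, the function names, and the sample text are all hypothetical and chosen for illustration only.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical toy annotators: each returns all matches of one concept in a text portion.
ANNOTATORS: Dict[str, Callable[[str], List[str]]] = {
    "Organization": lambda s: re.findall(r"\b[A-Z][a-z]+ (?:Inc|Corp|AG)\b", s),
    "Time": lambda s: re.findall(r"\b(?:19|20)\d{2}\b", s),
    "Money": lambda s: re.findall(r"\$\d+(?:\.\d+)? (?:million|billion)", s),
}


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_with_filtering(
    text: str, required: List[str]
) -> Tuple[List[Dict[str, List[str]]], int]:
    """Run one analysis step per required concept, but only on sentences that
    are still relevant, i.e. that contained every concept sought so far.
    Returns the filled templates and the number of annotator calls made."""
    relevant = [(sentence, {}) for sentence in split_sentences(text)]
    calls = 0
    for concept in required:
        kept = []
        for sentence, slots in relevant:
            calls += 1
            found = ANNOTATORS[concept](sentence)
            if found:  # keep the sentence; otherwise it is discarded for later steps
                kept.append((sentence, {**slots, concept: found}))
        relevant = kept
    return [slots for _, slots in relevant], calls


TEXT = (
    "Acme Corp was founded in 1999. It is located in Berlin. "
    "In 2012, Acme Corp reported revenues of $3.5 million. Analysts were surprised."
)
templates, calls = extract_with_filtering(TEXT, ["Organization", "Time", "Money"])
print(templates)  # the single sentence that fills all three slots
print(calls)      # annotator calls made with filtering
```

On this sample, annotating every sentence with every annotator would take 12 calls (3 concepts on 4 sentences), whereas the filtered run makes 8, because sentences without an organization are never passed to the later steps. Using a larger filtering unit, such as paragraphs instead of sentences, would discard less text and so favor recall over run-time in this sketch.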

Cited By

Aquino I, M. dos Santos M, Dorneles C, T. Carvalho J (2024). Extracting Information from Brazilian Legal Documents with Retrieval Augmented Generation. Anais Estendidos do XXXIX Simpósio Brasileiro de Banco de Dados (SBBD Estendido 2024), pages 280-287. Online publication date: 14-Oct-2024.
https://doi.org/10.5753/sbbd_estendido.2024.244241

Index Terms

Information extraction as a filtering task
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Theory of computation
  1. Computational complexity and cryptography
    1. Complexity classes

Published In

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management

October 2013

2612 pages

ISBN:9781450322638

DOI:10.1145/2505515

General Chairs: Qi He (LinkedIn, USA), Arun Iyengar (IBM T.J. Watson Research Center, USA)

Program Chairs: Wolfgang Nejdl (L3S Research Center, Germany), Jian Pei (Simon Fraser University, Canada), Rajeev Rastogi (Amazon, India)

Copyright © 2013.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM'13

CIKM'13: 22nd ACM International Conference on Information and Knowledge Management

October 27 - November 1, 2013

San Francisco, California, USA

Acceptance Rates

CIKM '13 Paper Acceptance Rate 143 of 848 submissions, 17%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Article Metrics

Total Citations: 1
Total Downloads: 236
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 03 Mar 2025
