DOI: 10.1145/2505515.2505557

Information extraction as a filtering task

Published: 27 October 2013

Abstract

Information extraction is usually approached as an annotation task: input texts run through several analysis steps of an extraction process, in which different semantic concepts are annotated and matched against the slots of templates. We argue that such an approach lacks efficient control over the input of the analysis steps. In this paper, we therefore propose and evaluate a model and a formal approach that consistently put the filtering view in focus: before spending annotation effort, filter those portions of the input texts that may contain relevant information for filling a template, and discard the others. We model all dependencies between the sought semantic concepts with a truth maintenance system, which then efficiently infers the portions of text to be annotated in each analysis step. The filtering view enables an information extraction system (1) to annotate only relevant portions of input texts and (2) to easily trade run-time efficiency for recall. We provide our approach as an open-source extension of Apache UIMA and demonstrate its potential in a number of experiments.
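
The filtering idea described in the abstract can be illustrated with a minimal sketch. It assumes that the text portions are sentences and that a template is filled only when all of its concepts occur in the same portion. The regex-based annotators, the concept names, and the function `filtered_extraction` are hypothetical and chosen for illustration only; the sketch uses a simple conjunctive filter as a stand-in and is neither the paper's truth maintenance system nor the Apache UIMA API.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical annotators (assumption): each reports whether its concept
# occurs in a given portion of text. Real analysis steps would be far costlier.
ANNOTATORS: Dict[str, Callable[[str], bool]] = {
    "Money": lambda s: re.search(r"[$€]\s?\d", s) is not None,
    "Time": lambda s: re.search(r"\b(19|20)\d{2}\b", s) is not None,
    "Organization": lambda s: re.search(r"\b[A-Z][a-z]+ (Inc|Corp|AG)\b", s) is not None,
}


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; the sentence is the assumed portion size."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filtered_extraction(text: str, required: List[str]) -> Tuple[List[str], int]:
    """Run one analysis step per required concept, but only on portions
    that are still relevant, i.e. where all earlier concepts were found.
    Returns the surviving portions and the number of annotator calls."""
    relevant = split_sentences(text)
    calls = 0
    for concept in required:
        survivors = []
        for portion in relevant:
            calls += 1
            if ANNOTATORS[concept](portion):
                survivors.append(portion)
        relevant = survivors  # discard portions that cannot fill the template
    return relevant, calls


TEXT = (
    "Acme Inc reported strong growth. "
    "Revenue reached $5 million in 2012. "
    "The weather was nice in 2012. "
    "Analysts at Foo Corp expect $9 million by 2014."
)

portions, calls = filtered_extraction(TEXT, ["Money", "Time", "Organization"])
print(portions)  # portions that contain all three concepts
print(calls)     # annotator calls actually spent
```

In this toy input, annotating every sentence with every annotator would take 12 calls (4 sentences times 3 concepts), whereas the filtered run spends fewer because later steps see only the surviving sentences. Choosing a larger portion size (for example, paragraphs instead of sentences) would discard less text, which hints at how run-time efficiency might be traded for recall.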

Cited By

  • (2024) Extracting Information from Brazilian Legal Documents with Retrieval Augmented Generation. Anais Estendidos do XXXIX Simpósio Brasileiro de Banco de Dados (SBBD Estendido 2024), pages 280-287. DOI: 10.5753/sbbd_estendido.2024.244241. Online publication date: 14-Oct-2024

Published In

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
October 2013
2612 pages
ISBN: 9781450322638
DOI: 10.1145/2505515
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. filtering
  2. information extraction
  3. relevance
  4. run-time efficiency
  5. truth maintenance

Qualifiers

  • Research-article

Conference

CIKM'13: 22nd ACM International Conference on Information and Knowledge Management
October 27 - November 1, 2013
San Francisco, California, USA

Acceptance Rates

CIKM '13 Paper Acceptance Rate: 143 of 848 submissions (17%)
Overall Acceptance Rate: 1,861 of 8,427 submissions (22%)
