DOI: 10.1145/2463676.2465284
research-article

Provenance-based dictionary refinement in information extraction

Published: 22 June 2013

Abstract

Dictionaries of terms and phrases (e.g., common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system depend on dictionary entries in arbitrarily complex ways, and removing a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results.
In this paper, we study the dictionary refinement problem and address the above challenges. Using the provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study the complexity of this problem, and give efficient algorithms. We also propose solutions to address incomplete labeling of the results, where we estimate the missing labels assuming a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view maintenance in general relational settings.
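
The trade-off described in the abstract can be made concrete with a toy example. The sketch below is not the paper's algorithm: it is a minimal brute-force illustration under several assumptions of ours. Each result's provenance is simplified to a set of dictionary entries (the result disappears if any of them is removed), every result is already labeled, quality is measured as F1, and recall is computed relative to the correct results in the original output. The data, the entry names, and the functions `f_score` and `best_refinement` are all hypothetical.

```python
from itertools import combinations

# Hypothetical labeled output: (is_correct, provenance).
# Provenance is simplified to the set of dictionary entries the result
# depends on; the result disappears if ANY of them is removed.
RESULTS = [
    (True, {"smith"}),
    (True, {"smith"}),
    (True, {"april", "smith"}),
    (False, {"april"}),
    (False, {"april"}),
    (False, {"april"}),
    (False, {"chelsea"}),
    (True, {"chelsea"}),
]


def f_score(results, removed):
    """F1 of the results that survive removing `removed` from the dictionary."""
    total_correct = sum(1 for ok, _ in results if ok)
    surviving = [ok for ok, prov in results if not (prov & removed)]
    kept_correct = sum(surviving)
    if kept_correct == 0:
        return 0.0
    precision = kept_correct / len(surviving)
    recall = kept_correct / total_correct
    return 2 * precision * recall / (precision + recall)


def best_refinement(results):
    """Try every subset of entries to remove (exponential; illustration only)."""
    entries = sorted(set().union(*(prov for _, prov in results)))
    best_set, best_f = frozenset(), f_score(results, frozenset())
    for k in range(1, len(entries) + 1):
        for subset in combinations(entries, k):
            score = f_score(results, frozenset(subset))
            if score > best_f:  # strict: smaller removal sets win ties
                best_set, best_f = frozenset(subset), score
    return best_set, best_f


removed, score = best_refinement(RESULTS)
print(sorted(removed), round(score, 3))
```

On this toy data, removing "april" sacrifices one correct result yet raises F1 from about 0.667 to 0.75, while removing "chelsea" as well would lower it again. The exhaustive search grows exponentially with the number of entries, so it only serves to show what is being optimized.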




      Published In

      SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
      June 2013
      1322 pages
      ISBN: 9781450320375
      DOI: 10.1145/2463676

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. information extraction
      2. optimization
      3. provenance
      4. refinement

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS'13

      Acceptance Rates

      SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
      Overall Acceptance Rate 785 of 4,003 submissions, 20%


      Cited By

      • (2024) From Data to Insight: Transforming Online Job Postings into Labor-Market Intelligence. Information, 15(8):496. DOI: 10.3390/info15080496. Online publication date: 20-Aug-2024
      • (2019) Predictable and Consistent Information Extraction. Proceedings of the ACM Symposium on Document Engineering 2019, pages 1-10. DOI: 10.1145/3342558.3345391. Online publication date: 23-Sep-2019
      • (2018) Web Information Extraction. Encyclopedia of Database Systems, pages 4620-4629. DOI: 10.1007/978-1-4614-8265-9_459. Online publication date: 7-Dec-2018
      • (2017) A survey on provenance. The VLDB Journal, 26(6):881-906. DOI: 10.1007/s00778-017-0486-1. Online publication date: 1-Dec-2017
      • (2017) Web Information Extraction. Encyclopedia of Database Systems, pages 1-9. DOI: 10.1007/978-1-4899-7993-3_459-2. Online publication date: 27-Jan-2017
      • (2016) Data Driven Discovery of Attribute Dictionaries. Transactions on Computational Collective Intelligence XXI - Volume 9630, pages 69-96. DOI: 10.5555/3090176.3090180. Online publication date: 1-Jan-2016
      • (2016) Declarative Cleaning of Inconsistencies in Information Extraction. ACM Transactions on Database Systems, 41(1):1-44. DOI: 10.1145/2877202. Online publication date: 7-Apr-2016
      • (2016) Long-tail Vocabulary Dictionary Extraction from the Web. Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 625-634. DOI: 10.1145/2835776.2835778. Online publication date: 8-Feb-2016
      • (2016) Data Driven Discovery of Attribute Dictionaries. Transactions on Computational Collective Intelligence XXI, pages 69-96. DOI: 10.1007/978-3-662-49521-6_4. Online publication date: 2016
