DOI: 10.1145/2594538.2594563
tutorial

Database principles in information extraction

Published: 18 June 2014

Abstract

Information Extraction commonly refers to the task of populating a relational schema, having predefined underlying semantics, from textual content. This task is pervasive in contemporary computational challenges associated with Big Data. This tutorial gives an overview of the algorithmic concepts and techniques used for performing Information Extraction tasks, and describes some of the declarative frameworks that provide abstractions and infrastructure for programming extractors. In addition, the tutorial highlights opportunities for research impact through principles of data management, illustrates these opportunities through recent work, and proposes directions for future research.
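
The task described above, populating a relational schema from text, can be illustrated with a minimal sketch. The example below is not taken from the tutorial: the relation `BornIn(person, year)`, the regular expression, and the sample sentence are all hypothetical, chosen only to show how a rule-based extractor might turn matches into tuples while recording the character span of each extracted field.

```python
import re
from typing import List, NamedTuple, Tuple

# Half-open character interval [start, end) into the source document.
Span = Tuple[int, int]


class BornIn(NamedTuple):
    """One row of a hypothetical relation BornIn(person, year)."""
    person: str
    person_span: Span
    year: int
    year_span: Span


# Hypothetical extraction rule (an assumption for illustration):
# a capitalized two-word name, the phrase "was born in", and a 4-digit year.
BORN_IN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) was born in (?P<year>\d{4})"
)


def extract_born_in(document: str) -> List[BornIn]:
    """Scan the document and emit one tuple per match of the rule."""
    rows = []
    for m in BORN_IN.finditer(document):
        rows.append(
            BornIn(
                person=m.group("person"),
                person_span=m.span("person"),
                year=int(m.group("year")),
                year_span=m.span("year"),
            )
        )
    return rows


if __name__ == "__main__":
    text = "Ada Lovelace was born in 1815. Alan Turing was born in 1912."
    for row in extract_born_in(text):
        print(row)
```

Keeping the spans alongside the extracted values means each tuple can be traced back to the exact text that produced it, which is useful when extracted facts later need to be compared or cleaned.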


Published In

PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2014
300 pages
ISBN:9781450323758
DOI:10.1145/2594538
  • General Chair: Richard Hull
  • Program Chair: Martin Grohe
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. database inconsistency
  2. database repairs
  3. document spanners
  4. finite-state transducers
  5. information extraction
  6. prioritized repairs
  7. regular expressions

Qualifiers

  • Tutorial

Conference

SIGMOD/PODS'14

Acceptance Rates

PODS '14 Paper Acceptance Rate 22 of 67 submissions, 33%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 24 Jan 2025

Cited By

  • (2020) Efficient Enumeration Algorithms for Regular Document Spanners. ACM Transactions on Database Systems 45:1 (1-42). DOI: 10.1145/3351451. Online publication date: 8-Feb-2020
  • (2019) A Formal Framework for Coupling Document Spanners with Ontologies. 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE) (155-162). DOI: 10.1109/AIKE.2019.00036. Online publication date: Jun-2019
  • (2018) Constant Delay Algorithms for Regular Document Spanners. Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (165-177). DOI: 10.1145/3196959.3196987. Online publication date: 27-May-2018
  • (2018) Document Spanners for Extracting Incomplete Information. Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (125-136). DOI: 10.1145/3196959.3196968. Online publication date: 27-May-2018
  • (2018) Cost-effective conceptual design using taxonomies. The VLDB Journal: The International Journal on Very Large Data Bases 27:3 (369-394). DOI: 10.1007/s00778-018-0501-1. Online publication date: 1-Jun-2018
  • (2017) Cost-Effective Conceptual Design Over Taxonomies. Proceedings of the 20th International Workshop on the Web and Databases (35-40). DOI: 10.1145/3068839.3068841. Online publication date: 14-May-2017
  • (2017) CoType. Proceedings of the 2017 ACM International Conference on Management of Data (52-54). DOI: 10.1145/3055167.3055184. Online publication date: 14-May-2017
  • (2017) Building Structured Databases of Factual Knowledge from Massive Text Corpora. Proceedings of the 2017 ACM International Conference on Management of Data (1741-1745). DOI: 10.1145/3035918.3054781. Online publication date: 9-May-2017
  • (2015) Extending Datalog Intelligence. Web Reasoning and Rule Systems (1-10). DOI: 10.1007/978-3-319-22002-4_1. Online publication date: 22-Jul-2015
