Abstract
This paper describes a hybrid statistical and knowledge-based information extraction model that extracts entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labour by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (trainable extraction grammar), can be adapted to any IE domain by writing a suitable set of rules in an SCFG (stochastic context-free grammar)-based extraction language and training them using an annotated corpus. The system does not contain any purely linguistic components, such as a PoS tagger or shallow parser, but allows external linguistic components to be used if necessary. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and smaller amounts of training data. We also demonstrate the robustness of our system under conditions of poor training-data quality.
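As a rough illustration of what "training" the rules of a stochastic context-free grammar on an annotated corpus can mean, the sketch below estimates rule probabilities by relative frequency: each rule's probability is its count divided by the total count of rules with the same left-hand side. This is a generic textbook estimator, not TEG's rule language or its actual training procedure, neither of which is specified in the abstract. The nonterminals (`Person`, `Acquisition`), the rule tuples, and the function name are all hypothetical.

```python
from collections import Counter


def estimate_rule_probabilities(observed_rules):
    """Relative-frequency estimate of P(rhs | lhs) for each observed rule.

    observed_rules: iterable of (lhs, rhs) pairs, one per rule application
    read off an annotated corpus. rhs is a tuple of symbols.
    """
    rule_counts = Counter(observed_rules)

    # Total number of applications for each left-hand side.
    lhs_totals = Counter()
    for (lhs, _rhs), count in rule_counts.items():
        lhs_totals[lhs] += count

    # Normalise so the probabilities of all rules sharing a
    # left-hand side sum to 1.
    return {
        (lhs, rhs): count / lhs_totals[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Hypothetical rule applications, as if collected from annotated sentences.
observed = [
    ("Person", ("FirstName", "LastName")),
    ("Person", ("FirstName", "LastName")),
    ("Person", ("Title", "LastName")),
    ("Person", ("LastName",)),
    ("Acquisition", ("Company", "acquired", "Company")),
]

probs = estimate_rule_probabilities(observed)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

In a setup like this, the hand-written rules fix the shape of the grammar and the annotated corpus only supplies the numbers, which is one way to read the abstract's claim that a small rule set plus modest training data can stand in for extensive manual rule writing.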
Author information
Ronen Feldman is a senior lecturer at the Mathematics and Computer Science Department of Bar-Ilan University in Israel, and the Director of the Data Mining Laboratory. He received his B.Sc. in Math, Physics and Computer Science from the Hebrew University, his M.Sc. in Computer Science from Bar-Ilan University, and his Ph.D. in Computer Science from Cornell University in NY. He was an Adjunct Professor at NYU Stern Business School. He is the founder of ClearForest Corporation, a Boston-based company specializing in the development of text mining tools and applications. He has given more than 30 tutorials on text mining and information extraction and authored numerous papers on these topics. He is currently finishing his book “The Text Mining Handbook”, to be published by Cambridge University Press.
Benjamin Rosenfeld is a research scientist at ClearForest Corporation. He received his B.Sc. in Mathematics and Computer Science from Bar-Ilan University. He is the co-inventor of the DIAL information extraction language.
Moshe Fresko is finalizing his Ph.D. in the Computer Science Department at Bar-Ilan University in Israel. He received his B.Sc. in Computer Engineering from Bogazici University, Istanbul, Turkey, in 1991, and his M.Sc. in 1994. He is also an adjunct lecturer at the Computer Science Department of Bar-Ilan University and functions as the Information-Extraction Group Leader in the Data Mining Laboratory.
Cite this article
Feldman, R., Rosenfeld, B. & Fresko, M. TEG—a hybrid approach to information extraction. Knowl Inf Syst 9, 1–18 (2006). https://doi.org/10.1007/s10115-005-0204-y
DOI: https://doi.org/10.1007/s10115-005-0204-y