Abstract
Given an information extraction (IE) system that performs an extraction task against texts in one language, it is natural to consider how to modify the system to perform the same task against texts in a different language. More generally, there may be a requirement to do the extraction task against texts in an arbitrary number of different languages and to present results to a user who has no knowledge of the source language from which the information has been extracted. To minimise the language-specific alterations that need to be made in extending the system to a new language, it is important to separate the task-specific conceptual knowledge the system uses, which may be assumed to be language independent, from the language-dependent lexical knowledge the system requires, which unavoidably must be extended for each new language. In this paper we describe how the architecture of the LaSIE system, an IE system designed to do monolingual extraction from English texts, has been modified to support a clean separation between conceptual and lexical information. This separation allows hard-to-acquire, domain-specific conceptual knowledge to be represented only once, and hence to be reused in extracting information from texts in multiple languages, while standard lexical resources can be used to extend language coverage. Preliminary experiments with extending the system to French are described.
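The separation described above can be illustrated with a minimal sketch. It is not the LaSIE implementation: the class names `Concepticon` and `Lexicon`, the function `extract`, the toy concept hierarchy, and the English and French word lists are all hypothetical, chosen only to show how one language-independent concept store might be shared by several per-language lexicons.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concepticon:
    """Language-independent concept hierarchy (hypothetical, for illustration)."""
    parents: dict = field(default_factory=dict)  # concept id -> parent id or None

    def add(self, concept: str, parent: Optional[str] = None) -> None:
        self.parents[concept] = parent

    def is_a(self, concept: Optional[str], ancestor: str) -> bool:
        # Walk up the hierarchy until the ancestor or the root is reached.
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self.parents.get(concept)
        return False


@dataclass
class Lexicon:
    """Language-dependent mapping from surface words to concept ids."""
    language: str
    entries: dict  # lower-cased word -> concept id

    def concept_for(self, word: str) -> Optional[str]:
        return self.entries.get(word.lower())


def extract(tokens, lexicon: Lexicon, concepticon: Concepticon, target: str):
    """Return (token, concept) pairs whose concept falls under `target`."""
    results = []
    for token in tokens:
        concept = lexicon.concept_for(token)
        if concept is not None and concepticon.is_a(concept, target):
            results.append((token, concept))
    return results


# Conceptual knowledge is written once (toy hierarchy, assumed for the example).
concepticon = Concepticon()
concepticon.add("entity")
concepticon.add("organization", "entity")
concepticon.add("company", "organization")
concepticon.add("person", "entity")

# Only the lexicons differ per language (toy entries, assumed for the example).
english = Lexicon("en", {"firm": "company", "company": "company", "manager": "person"})
french = Lexicon("fr", {"société": "company", "entreprise": "company", "directeur": "person"})

en_hits = extract(["The", "firm", "hired", "a", "manager"], english, concepticon, "organization")
fr_hits = extract(["La", "société", "a", "nommé", "un", "directeur"], french, concepticon, "organization")
```

In this sketch, adding a new language means supplying only another `Lexicon`; the `Concepticon` and the extraction logic are left untouched, which is the reuse the abstract argues for.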
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gaizauskas, R., Humphreys, K., Azzam, S., Wilks, Y. (1997). Concepticons vs. lexicons: An architecture for multilingual information extraction. In: Pazienza, M.T. (ed.) Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. SCIE 1997. Lecture Notes in Computer Science, vol 1299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63438-X_3
DOI: https://doi.org/10.1007/3-540-63438-X_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63438-6
Online ISBN: 978-3-540-69548-6
eBook Packages: Springer Book Archive