Concepticons vs. lexicons: An architecture for multilingual information extraction

Robert Gaizauskas¹,
Kevin Humphreys¹,
Saliha Azzam¹ &
Yorick Wilks¹

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1299)

Included in the following conference series:

International Summer School on Information Extraction

Abstract

Given an information extraction (IE) system that performs an extraction task against texts in one language, it is natural to consider how to modify the system to perform the same task against texts in a different language. More generally, there may be a requirement to do the extraction task against texts in an arbitrary number of different languages and to present results to a user who has no knowledge of the source language from which the information has been extracted. To minimise the language-specific alterations that need to be made in extending the system to a new language, it is important to separate the task-specific conceptual knowledge the system uses, which may be assumed to be language independent, from the language-dependent lexical knowledge the system requires, which unavoidably must be extended for each new language. In this paper we describe how the architecture of the LaSIE system, an IE system designed to do monolingual extraction from English texts, has been modified to support a clean separation between conceptual and lexical information. This separation allows hard-to-acquire, domain-specific conceptual knowledge to be represented only once, and hence to be reused in extracting information from texts in multiple languages, while standard lexical resources can be used to extend language coverage. Preliminary experiments with extending the system to French are described.
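
The separation described in the abstract can be illustrated with a minimal sketch. This is not LaSIE's implementation: the class names (`Concept`, `Concepticon`), the `extract` function, the concept names, and the tiny English and French word lists are all hypothetical, chosen only to show one language-independent concept store being shared by several per-language lexicons.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Concept:
    """A language-independent node in a concept hierarchy (hypothetical)."""
    name: str
    parent: Optional[str] = None


@dataclass
class Concepticon:
    """Task-specific conceptual knowledge, written once for all languages."""
    concepts: Dict[str, Concept] = field(default_factory=dict)

    def add(self, name: str, parent: Optional[str] = None) -> None:
        self.concepts[name] = Concept(name, parent)

    def is_a(self, name: str, ancestor: str) -> bool:
        """Follow parent links to test whether `name` falls under `ancestor`."""
        current: Optional[str] = name
        while current is not None:
            if current == ancestor:
                return True
            current = self.concepts[current].parent
        return False


# One lexicon per language: surface word -> concept name (assumed shape).
Lexicon = Dict[str, str]


def extract(tokens: List[str], lexicon: Lexicon,
            concepticon: Concepticon, target: str) -> List[str]:
    """Return the concepts of tokens whose concept falls under `target`."""
    found = []
    for token in tokens:
        concept = lexicon.get(token.lower())
        if concept is not None and concepticon.is_a(concept, target):
            found.append(concept)
    return found


# Conceptual knowledge is defined once.
concepticon = Concepticon()
concepticon.add("event")
concepticon.add("succession_event", parent="event")
concepticon.add("organisation")

# Only the lexicons differ between languages (toy entries, invented here).
english: Lexicon = {"appointed": "succession_event", "company": "organisation"}
french: Lexicon = {"nommé": "succession_event", "société": "organisation"}

en_result = extract("Smith was appointed by the company".split(),
                    english, concepticon, "event")
fr_result = extract("Smith a été nommé par la société".split(),
                    french, concepticon, "event")
print(en_result, fr_result)
```

In this sketch, adding a new language means supplying only another word-to-concept mapping; the concept hierarchy and the extraction logic stay unchanged, which is the reuse the abstract argues for.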

Author information

Authors and Affiliations

Department of Computer Science, University of Sheffield, UK
Robert Gaizauskas, Kevin Humphreys, Saliha Azzam & Yorick Wilks

Editor information

Maria Teresa Pazienza

Cite this paper

Gaizauskas, R., Humphreys, K., Azzam, S., Wilks, Y. (1997). Concepticons vs. lexicons: An architecture for multilingual information extraction. In: Pazienza, M.T. (ed.) Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. SCIE 1997. Lecture Notes in Computer Science, vol 1299. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63438-X_3

DOI: https://doi.org/10.1007/3-540-63438-X_3
Published: 30 July 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63438-6
Online ISBN: 978-3-540-69548-6
eBook Packages: Springer Book Archive
