Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining

César de Pablo-Sánchez¹,
Isabel Segura-Bedmar¹,
Paloma Martínez¹ &
Ana Iglesias-Maqueda¹

Abstract

Named Entity Recognition and Classification (NERC) is an important component of applications such as Opinion Tracking, Information Extraction, and Question Answering. When these applications must work in several languages, NERC becomes a bottleneck because its development requires language-specific tools and resources such as lists of names or annotated corpora. This paper presents a lightly supervised system that acquires lists of names and linguistic patterns from large raw text collections in Western languages, starting with only a few seeds per class selected by a human expert. Experiments have been carried out with English and Spanish news collections and with the Spanish Wikipedia. Evaluation of NE classification on standard datasets shows that the NE lists achieve high precision and that contextual patterns increase recall significantly. The system would therefore be helpful for applications where annotated NERC data are not available, such as those that have to deal with several Western languages or with information from different domains.
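
The abstract does not spell out the acquisition procedure, so the following is only a minimal sketch of a generic seed-based bootstrapping loop, not the authors' system. It assumes that candidate names are runs of capitalized tokens, that a "pattern" is simply the lowercased word immediately to the left of a name, and that a pattern is trusted once it co-occurs with at least `min_support` distinct known names of a single class. The function names, the `min_support` threshold, and the toy sentences are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def capitalized_spans(tokens):
    # Yield (start, end) for each maximal run of capitalized tokens.
    i = 0
    while i < len(tokens):
        if tokens[i][0].isupper():
            j = i
            while j < len(tokens) and tokens[j][0].isupper():
                j += 1
            yield i, j
            i = j
        else:
            i += 1


def bootstrap(sentences, seeds, rounds=2, min_support=2):
    # seeds maps a class label to a few example names (assumed input format).
    names = {cls: set(examples) for cls, examples in seeds.items()}
    patterns = {cls: set() for cls in seeds}

    # Collect (left-context word, candidate name) pairs once.
    occurrences = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        for start, end in capitalized_spans(tokens):
            if start == 0:
                continue  # sentence-initial capitals are ambiguous; skip them
            left = tokens[start - 1].lower()
            occurrences.append((left, " ".join(tokens[start:end])))

    for _ in range(rounds):
        # Step 1: find contexts supported by known names.
        support = defaultdict(lambda: defaultdict(set))
        for left, name in occurrences:
            for cls, known in names.items():
                if name in known:
                    support[left][cls].add(name)
        for left, by_class in support.items():
            if len(by_class) == 1:  # context seen with one class only
                (cls, supporters), = by_class.items()
                if len(supporters) >= min_support:
                    patterns[cls].add(left)

        # Step 2: use trusted contexts to propose new names.
        votes = defaultdict(set)
        for left, name in occurrences:
            for cls, trusted in patterns.items():
                if left in trusted:
                    votes[name].add(cls)
        for name, classes in votes.items():
            unknown = all(name not in known for known in names.values())
            if unknown and len(classes) == 1:
                names[next(iter(classes))].add(name)

    return names, patterns


sentences = [
    "The president of France met the mayor of Madrid.",
    "She lives in Madrid and works in Paris.",
    "He lives in Berlin.",
    "Yesterday the minister Smith spoke with minister Jones.",
    "Later minister Brown arrived in Lisbon.",
]
seeds = {"LOC": ["Madrid", "Paris"], "PER": ["Smith", "Jones"]}
names, patterns = bootstrap(sentences, seeds)
```

In this toy run the context "in" is supported by two seed locations and "minister" by two seed persons, so both become patterns and pull in the unseen names that follow them. "France" is left unclassified because its only context, "of", is supported by a single seed. This mirrors, in a very reduced form, the trade-off the abstract reports between the precision of name lists and the recall gained from contextual patterns.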

Author information

Authors and Affiliations

Department of Computer Science, Universidad Carlos III de Madrid, 28911, Leganés, Madrid, Spain
César de Pablo-Sánchez, Isabel Segura-Bedmar, Paloma Martínez & Ana Iglesias-Maqueda

Corresponding author

Correspondence to César de Pablo-Sánchez.

About this article

Cite this article

de Pablo-Sánchez, C., Segura-Bedmar, I., Martínez, P. et al. Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining. Knowl Inf Syst 35, 87–109 (2013). https://doi.org/10.1007/s10115-012-0502-0

Received: 23 May 2011
Revised: 07 December 2011
Accepted: 23 April 2012
Published: 16 May 2012
Issue Date: April 2013
DOI: https://doi.org/10.1007/s10115-012-0502-0
