Abstract
Named Entity Recognition and Classification (NERC) is an important component of applications such as Opinion Tracking, Information Extraction, and Question Answering. When these applications must work in several languages, NERC becomes a bottleneck because its development requires language-specific tools and resources such as lists of names or annotated corpora. This paper presents a lightly supervised system that acquires lists of names and linguistic patterns from large raw text collections in western languages, starting with only a few seeds per class selected by a human expert. Experiments have been carried out with English and Spanish news collections and with the Spanish Wikipedia. Evaluation of NE classification on standard datasets shows that NE lists achieve high precision and reveals that contextual patterns increase recall significantly. The system would therefore be helpful for applications where annotated NERC data are not available, such as those that have to deal with several western languages or with information from different domains.
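The general idea of growing name lists and contextual patterns from a few seeds can be illustrated with a toy bootstrapping loop. This is a minimal sketch and not the system described in the paper: the function name `bootstrap`, the use of the single preceding word as a "pattern", the `min_hits` threshold, and the example sentences are all assumptions made for illustration.

```python
from collections import Counter


def bootstrap(sentences, seeds, rounds=2, min_hits=2):
    """Toy seed-based bootstrapping (hypothetical, for illustration only).

    Alternates between two steps:
      1. learn patterns: words that precede known names at least min_hits times
      2. learn names: capitalized tokens that follow an accepted pattern word
    """
    names = set(seeds)
    patterns = set()
    tokenized = [s.split() for s in sentences]  # naive whitespace tokenizer
    for _ in range(rounds):
        # Step 1: count the word immediately before each known name.
        counts = Counter()
        for tokens in tokenized:
            for prev, tok in zip(tokens, tokens[1:]):
                if tok in names:
                    counts[prev.lower()] += 1
        patterns |= {w for w, c in counts.items() if c >= min_hits}
        # Step 2: accept capitalized tokens that follow a learned pattern word.
        for tokens in tokenized:
            for prev, tok in zip(tokens, tokens[1:]):
                if prev.lower() in patterns and tok[:1].isupper():
                    names.add(tok)
    return names, patterns


# Invented example data: two seed names for a single "location" class.
sentences = [
    "She lives in Madrid",
    "He works in Paris",
    "They met in Lisbon",
    "She moved to Madrid",
    "We flew to Lisbon",
    "I drove to Berlin",
]
names, patterns = bootstrap(sentences, seeds={"Madrid", "Paris"})
```

In this toy run the first round learns the context word "in" from the two seeds and picks up "Lisbon"; the enlarged name list then supports the context word "to" in the second round, which in turn yields "Berlin". A realistic system would need richer patterns, scoring, and handling of ambiguity across classes.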
de Pablo-Sánchez, C., Segura-Bedmar, I., Martínez, P. et al. Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining. Knowl Inf Syst 35, 87–109 (2013). https://doi.org/10.1007/s10115-012-0502-0
DOI: https://doi.org/10.1007/s10115-012-0502-0