Developing a Robust Part-of-Speech Tagger for Biomedical Text

Yoshimasa Tsuruoka,
Yuka Tateishi,
Jin-Dong Kim,
Tomoko Ohta,
John McNaught,
Sophia Ananiadou &
Jun’ichi Tsujii

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3746)

Included in the following conference series:

Panhellenic Conference on Informatics

Abstract

This paper presents a part-of-speech tagger that is specifically tuned for biomedical text. We built the tagger using maximum entropy modeling and a state-of-the-art tagging algorithm. The tagger was trained on a corpus containing both newspaper articles and biomedical documents so that it would work well on various types of biomedical text. Experimental results on the Wall Street Journal corpus, the GENIA corpus, and the PennBioIE corpus showed that adding training data from a different domain does not hurt the performance of a tagger, and that our tagger achieves high precision (97% to 98%) on all three corpora. We also evaluated the robustness of the tagger using recent MEDLINE articles.
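
The idea of training one maximum entropy tagger on pooled newspaper and biomedical data can be illustrated with a small sketch. This is not the authors' implementation: the feature set, the toy sentences, the stochastic-gradient training loop, and the greedy left-to-right decoding are all assumptions made for illustration, and the tagging algorithm used in the paper is not described in this abstract.

```python
import math
from collections import defaultdict

def features(words, i, prev_tag):
    """Hypothetical feature set: word identity, 3-char suffix, previous tag, bias."""
    w = words[i].lower()
    return [f"w={w}", f"suf3={w[-3:]}", f"prev={prev_tag}", "bias"]

def predict_probs(weights, tags, feats):
    """Maximum entropy (softmax) distribution over tags for one token."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in tags}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def train(sentences, epochs=30, lr=0.5):
    """Fit the model by stochastic gradient ascent on the log-likelihood."""
    tags = sorted({t for sent in sentences for _, t in sent})
    weights = defaultdict(float)  # (feature, tag) -> weight
    for _ in range(epochs):
        for sent in sentences:
            words = [w for w, _ in sent]
            prev = "<s>"
            for i, (_, gold) in enumerate(sent):
                feats = features(words, i, prev)
                probs = predict_probs(weights, tags, feats)
                for t in tags:
                    # Gradient: observed minus expected feature count.
                    grad = (1.0 if t == gold else 0.0) - probs[t]
                    for f in feats:
                        weights[(f, t)] += lr * grad
                prev = gold  # condition on the gold previous tag in training
    return weights, tags

def tag(weights, tags, words):
    """Greedy left-to-right decoding (an assumption, not the paper's algorithm)."""
    prev, out = "<s>", []
    for i in range(len(words)):
        probs = predict_probs(weights, tags, features(words, i, prev))
        prev = max(tags, key=lambda t: probs[t])
        out.append(prev)
    return out

# Toy stand-ins for the two domains (invented examples, Penn Treebank-style tags).
NEWS = [[("The", "DT"), ("market", "NN"), ("fell", "VBD")]]
BIO = [[("The", "DT"), ("protein", "NN"), ("binds", "VBZ"), ("DNA", "NN")]]

# Pool both domains into a single training set, as the abstract describes.
weights, tags = train(NEWS + BIO)
print(tag(weights, tags, ["The", "protein", "binds", "DNA"]))
```

Pooling here is plain concatenation of the two training sets; whether the paper weights or balances the domains differently cannot be determined from the abstract.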

Author information

Authors and Affiliations

CREST, JST (Japan Science and Technology Agency), Honcho 4-1-8, Kawaguchi-shi, Saitama, 332-0012, Japan
Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim & Tomoko Ohta
University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan
Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta & Jun’ichi Tsujii
School of Informatics, University of Manchester, P.O.Box 88, Sackville St, Manchester, M60 1QD, UK
John McNaught & Jun’ichi Tsujii
School of Computing, Science and Engineering, Salford University, Salford, Greater Manchester, M5 4WT, UK
Sophia Ananiadou
The National Centre for Text Mining, P.O.Box 88, Sackville St, Manchester, M60 1QD, UK
John McNaught & Sophia Ananiadou

Editor information

Editors and Affiliations

Department of Computer and Communication Engineering, University of Thessaly, Glavani 37, 382 21, Volos, Greece
Panayiotis Bozanis
Department of Computer and Communication Engineering, University of Thessaly, 382 21, Volos, Greece
Elias N. Houstis

About this paper

Cite this paper

Tsuruoka, Y. et al. (2005). Developing a Robust Part-of-Speech Tagger for Biomedical Text. In: Bozanis, P., Houstis, E.N. (eds) Advances in Informatics. PCI 2005. Lecture Notes in Computer Science, vol 3746. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573036_36

DOI: https://doi.org/10.1007/11573036_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29673-7
Online ISBN: 978-3-540-32091-3
eBook Packages: Computer Science, Computer Science (R0)
