research-article

Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches

Published: 16 June 2023

Abstract

Automatic part-of-speech (POS) tagging is a preprocessing step for many natural language processing tasks, such as named entity recognition, speech processing, information extraction, word sense disambiguation, and machine translation. It has already achieved promising results for English and other European languages. For Indian languages, however, and for Odia in particular, it is not yet well explored, both because supporting tools and resources are lacking and because the language is morphologically rich. We were unable to locate an open source POS tagger for Odia, and only a handful of attempts have been made to develop one. The main contribution of this work is an Odia POS tagger built with statistical approaches, namely the maximum entropy Markov model and the conditional random field (CRF), and with deep learning based approaches, namely the convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM). We use a publicly accessible corpus annotated with the Bureau of Indian Standards (BIS) tagset. Most languages around the globe, however, use datasets annotated with the Universal Dependencies (UD) tagset, so for uniformity the Odia dataset should use the same tagset. Following the BIS and UD guidelines, we therefore constructed a mapping from the BIS tagset to the UD tagset. The maximum entropy Markov model, CRF, Bi-LSTM, and CNN models are trained on the Indian Languages Corpora Initiative corpus with both the BIS and UD tagsets. To prepare a baseline system, we experimented with various feature sets as input to the statistical models and observed the impact of each constructed feature set. The deep learning based model combines the Bi-LSTM network, the CNN network, the CRF layer, character sequence information, and a pre-trained word vector. Seven different combinations of neural sequence labeling models are implemented, and their performance measures are investigated. The Bi-LSTM model with the character sequence feature and pre-trained word vector achieved 94.58% accuracy.
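
The BIS-to-UD conversion described in the abstract amounts to a lookup from each fine-grained BIS tag to one of the 17 UD categories. The sketch below is illustrative only. The BIS tag names follow the published BIS tagset, but the specific pairs shown are assumptions made for this example and are not the mapping table constructed in the paper. The function name `bis_to_ud` and the placeholder tokens are hypothetical.

```python
# Illustrative BIS -> UD lookup. The pairs below are assumptions for this
# sketch, not the mapping table constructed by the authors.
BIS_TO_UD = {
    "N_NN": "NOUN",     # common noun
    "N_NNP": "PROPN",   # proper noun
    "PR_PRP": "PRON",   # personal pronoun
    "DM_DMD": "DET",    # deictic demonstrative
    "V_VM": "VERB",     # main verb
    "V_VAUX": "AUX",    # auxiliary verb
    "JJ": "ADJ",        # adjective
    "RB": "ADV",        # adverb
    "PSP": "ADP",       # postposition
    "CC_CCD": "CCONJ",  # coordinating conjunction
    "CC_CCS": "SCONJ",  # subordinating conjunction
    "QT_QTC": "NUM",    # cardinal quantifier
    "RP_INJ": "INTJ",   # interjection
    "RD_SYM": "SYM",    # symbol
    "RD_PUNC": "PUNCT", # punctuation
}


def bis_to_ud(tagged_sentence, fallback="X"):
    """Convert (token, BIS tag) pairs to (token, UD tag) pairs.

    Tags missing from the table fall back to UD's catch-all tag X.
    """
    return [(tok, BIS_TO_UD.get(tag, fallback)) for tok, tag in tagged_sentence]


# Placeholder tokens; a real sentence would come from the annotated corpus.
example = [("w1", "N_NNP"), ("w2", "N_NN"), ("w3", "V_VM"),
           ("w4", "RD_PUNC"), ("w5", "UNKNOWN_TAG")]
converted = bis_to_ud(example)
```

A plain dictionary keeps the mapping easy to audit against the two sets of guidelines. Any BIS tag that needs context to resolve (for example, one that could map to more than one UD category) would need a rule on top of this lookup.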

A Appendix

A.1 Confusion Matrix of Different Deep Learning Models

ADJ4,9819180088092447632960000520
ADP111,961140153015401612000020
ADV24191,028082602250121300200120
AUX00015100020000000940
CCONJ0192403,60830047014700970280
DET300230133,6280791330970034000
INTJ0000004417020000000
NOUN962137151051183534,931114848079602005825
NUM12060040514,335220400020
PART3212902930082142,48570040300
PRON1551506140025203,152207000
PROPN68000000720230127,829000206
PUNCT0000000000009,9050000
SCONJ00160714101203200729001
SYM0000000000000035900
VERB243445703012210300012,0500
X002027082000110000119
ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X
Table A.1. Confusion Matrix of the CharCNN + CNN + CRF Model
ADJ5,3828110043066830351348000330
ADP92,0295010012501270000 0
ADV28221,069011200194012700100140
AUX00017900000000000680
CCONJ0211303,68924026011400680180
DET3509093,6650811019860033000
INTJ0000004517001000000
NOUN74112167054119535,71193575257602104777
NUM21050050624,322120900000
PART3215401627064132,53740000220
PRON213707100035303,184009000
PROPN67000000609210117,946000204
PUNCT0000000000009,9050000
SCONJ004057250902000778000
SYM0000000000000035900
VERB302677702990150000012,0510
X004000076000190100123
ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X
Table A.2. Confusion Matrix of the CharCNN + Bi-LSTM + CRF Model
ADJ5,1058180066085039542457000520
ADP121,976200120013602011000000
ADV27181,052014150209010130021090
AUX00016800000000000800
CCONJ0121703,641240500141100800230
DET330270183,5690979361110047000
INTJ0000004815000000000
NOUN890133168147117035,264958872650017054118
NUM15000030554,348150000000
PART3413002319090182,50610000290
PRON160220689034003,194009000
PROPN69230000760170127,7830002011
PUNCT0000000000009,9060000
SCONJ0012062280800000765000
SYM0000000000000035900
VERB2600176602760110000012,0820
X003026046000110000154
ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X
Table A.3. Confusion Matrix of the CharCNN + CNN Model
ADJ5,3847140028068423381751000270
ADP62,0381006001080118000000
ADV2591,1180101701730129007080
AUX00020100000000000470
CCONJ061403,7241903406500520120
DET310120123,63601011222900031000
INTJ0000005013000000000
NOUN582748403477036,22081414447701603629
NUM13000040504,356110000000
PART256401815074112,55300000243
PRON14090572037103,226006000
PROPN6500000069419097,870000173
PUNCT0000000000009,9060000
SCONJ006062150700000785000
SYM0000000000000035900
VERB2608156502940110200012,0570
X003003066000120000138
ADJ ADP ADV AUX CCONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X
Table A.4. Confusion Matrix of the CharCNN + Bi-LSTM Model
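
Overall accuracy and per-tag precision and recall can be derived from a confusion matrix of this kind. The following is a minimal sketch, assuming rows are gold tags and columns are predicted tags (the tables here do not state the orientation). It uses a small made-up three-tag matrix, not figures from the paper, and the function names are hypothetical.

```python
def accuracy(matrix):
    """Fraction of tokens on the diagonal (correctly tagged)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_tag_scores(matrix, labels):
    """Return {label: (precision, recall)}.

    Assumes matrix[i][j] counts tokens with gold tag i predicted as tag j.
    """
    scores = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        gold_total = sum(matrix[i])                 # row sum
        pred_total = sum(row[i] for row in matrix)  # column sum
        recall = true_pos / gold_total if gold_total else 0.0
        precision = true_pos / pred_total if pred_total else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy data for illustration only.
labels = ["NOUN", "VERB", "ADJ"]
toy = [
    [8, 1, 1],  # gold NOUN
    [0, 9, 1],  # gold VERB
    [2, 0, 8],  # gold ADJ
]
acc = accuracy(toy)
scores = per_tag_scores(toy, labels)
```

If the tables are instead oriented with predictions in rows, precision and recall swap roles while accuracy is unchanged.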

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6
      June 2023
      635 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3604597

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 16 June 2023
      Online AM: 30 March 2023
      Accepted: 18 March 2023
      Revised: 17 March 2023
      Received: 22 July 2022
      Published in TALLIP Volume 22, Issue 6

      Author Tags

      1. Part of speech (POS)
      2. conditional random field (CRF)
      3. deep learning
      4. word embedding
