Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches

Authors: Tusarkanta Dalai, Tapas Kumar Mishra, Pankaj K. Sa

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6

Article No.: 167, Pages 1 - 24

https://doi.org/10.1145/3588900

Published: 16 June 2023

Abstract

Automatic part-of-speech (POS) tagging is a preprocessing step in many natural language processing tasks, such as named entity recognition, speech processing, information extraction, word sense disambiguation, and machine translation. It has already achieved promising results for English and other European languages. However, for Indian languages, and for the Odia language in particular, it is not yet well explored because of the lack of supporting tools and resources and the morphological richness of the language. We were unable to locate an open source POS tagger for the Odia language, and only a handful of attempts have been made to develop one. The main contribution of this research work is to present statistical approaches, namely the maximum entropy Markov model and the conditional random field (CRF), as well as deep learning based approaches, including the convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM), to develop an Odia POS tagger. A publicly accessible corpus annotated with the Bureau of Indian Standards (BIS) tagset is used in our work. However, most languages around the globe use datasets annotated with the Universal Dependencies (UD) tagset. Hence, to maintain uniformity, the Odia dataset should use the same tagset. Thus, following the BIS and UD guidelines, we constructed a mapping from the BIS tagset to the UD tagset. The maximum entropy Markov model, CRF, Bi-LSTM, and CNN models are trained using the Indian Languages Corpora Initiative corpus with the BIS and UD tagsets. We experimented with various feature sets as input to the statistical models to prepare a baseline system and observed the impact of the constructed feature sets. The deep learning based model includes the Bi-LSTM network, the CNN network, the CRF layer, character sequence information, and a pre-trained word vector. Seven different combinations of neural sequence labeling models are implemented, and their performance measures are investigated. The Bi-LSTM model with the character sequence feature and pre-trained word vector achieved an accuracy of 94.58%.
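The tagset conversion described in the abstract can be pictured as a simple lookup from BIS labels to UD labels. The sketch below is only an illustration under stated assumptions: the entries in `BIS_TO_UD` are a small hypothetical subset chosen for this example, not the mapping constructed by the authors, and the function name `map_tags`, the placeholder tokens, and the fallback to `X` for unmapped tags are likewise assumptions.

```python
# Hypothetical subset of a BIS -> UD mapping, for illustration only.
# The authors' actual mapping is not reproduced on this page.
BIS_TO_UD = {
    "N_NN": "NOUN",      # common noun
    "N_NNP": "PROPN",    # proper noun
    "V_VM": "VERB",      # main verb
    "V_VAUX": "AUX",     # auxiliary verb
    "JJ": "ADJ",         # adjective
    "RB": "ADV",         # adverb
    "PSP": "ADP",        # postposition
    "CC_CCD": "CCONJ",   # coordinating conjunction
    "CC_CCS": "SCONJ",   # subordinating conjunction
    "QT_QTC": "NUM",     # cardinal number
    "RD_PUNC": "PUNCT",  # punctuation
    "RD_SYM": "SYM",     # symbol
}


def map_tags(tagged_tokens, mapping=BIS_TO_UD, fallback="X"):
    """Convert (token, BIS tag) pairs to (token, UD tag) pairs.

    Tags missing from the mapping fall back to `fallback` (an assumption).
    """
    return [(token, mapping.get(tag, fallback)) for token, tag in tagged_tokens]


# Placeholder tokens w1..w4 stand in for the words of a tagged sentence.
sentence = [("w1", "N_NNP"), ("w2", "N_NN"), ("w3", "V_VM"), ("w4", "RD_PUNC")]
converted = map_tags(sentence)
```

A dictionary lookup only covers one-to-one or many-to-one correspondences; cases where a single BIS tag must be split across several UD tags depending on context would need additional rules.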

A Appendix

A.1 Confusion Matrix of Different Deep Learning Models

Table A.1. Confusion Matrix of the CharCNN + CNN + CRF Model

	ADJ	ADP	ADV	AUX	CCONJ	DET	INTJ	NOUN	NUM	PART	PRON	PROPN	PUNCT	SCONJ	SYM	VERB	X
ADJ	4,981	9	18	0	0	88	0	924	47	63	29	60	0	0	0	52	0
ADP	11	1,961	14	0	15	3	0	154	0	16	12	0	0	0	0	2	0
ADV	24	19	1,028	0	8	26	0	225	0	12	13	0	0	20	0	12	0
AUX	0	0	0	151	0	0	0	2	0	0	0	0	0	0	0	94	0
CCONJ	0	19	24	0	3,608	30	0	47	0	14	7	0	0	97	0	28	0
DET	30	0	23	0	13	3,628	0	79	13	30	97	0	0	34	0	0	0
INTJ	0	0	0	0	0	0	44	17	0	2	0	0	0	0	0	0	0
NOUN	962	137	151	0	51	183	5	34,931	114	84	80	796	0	20	0	582	5
NUM	12	0	6	0	0	4	0	51	4,335	22	0	4	0	0	0	2	0
PART	32	12	9	0	29	30	0	82	14	2,485	7	0	0	4	0	30	0
PRON	15	5	15	0	6	140	0	25	2	0	3,152	2	0	7	0	0	0
PROPN	68	0	0	0	0	0	0	720	23	0	12	7,829	0	0	0	20	6
PUNCT	0	0	0	0	0	0	0	0	0	0	0	0	9,905	0	0	0	0
SCONJ	0	0	16	0	71	41	0	12	0	3	2	0	0	729	0	0	1
SYM	0	0	0	0	0	0	0	0	0	0	0	0	0	0	359	0	0
VERB	24	3	4	4	5	7	0	301	2	21	0	3	0	0	0	12,050	0
X	0	0	2	0	2	7	0	82	0	0	0	11	0	0	0	0	119

Table A.2. Confusion Matrix of the CharCNN + Bi-LSTM + CRF Model

	ADJ	ADP	ADV	AUX	CCONJ	DET	INTJ	NOUN	NUM	PART	PRON	PROPN	PUNCT	SCONJ	SYM	VERB	X
ADJ	5,382	8	11	0	0	43	0	668	30	35	13	48	0	0	0	33	0
ADP	9	2,029	5	0	1	0	0	125	0	12	7	0	0	0	0		0
ADV	28	22	1,069	0	11	20	0	194	0	12	7	0	0	10	0	14	0
AUX	0	0	0	179	0	0	0	0	0	0	0	0	0	0	0	68	0
CCONJ	0	21	13	0	3,689	24	0	26	0	11	4	0	0	68	0	18	0
DET	35	0	9	0	9	3,665	0	81	10	19	86	0	0	33	0	0	0
INTJ	0	0	0	0	0	0	45	17	0	0	1	0	0	0	0	0	0
NOUN	741	121	67	0	54	119	5	35,711	93	57	52	576	0	21	0	477	7
NUM	21	0	5	0	0	5	0	62	4,322	12	0	9	0	0	0	0	0
PART	32	15	4	0	16	27	0	64	13	2,537	4	0	0	0	0	22	0
PRON	21	3	7	0	7	100	0	35	3	0	3,184	0	0	9	0	0	0
PROPN	67	0	0	0	0	0	0	609	21	0	11	7,946	0	0	0	20	4
PUNCT	0	0	0	0	0	0	0	0	0	0	0	0	9,905	0	0	0	0
SCONJ	0	0	4	0	57	25	0	9	0	2	0	0	0	778	0	0	0
SYM	0	0	0	0	0	0	0	0	0	0	0	0	0	0	359	0	0
VERB	30	2	6	7	7	7	0	299	0	15	0	0	0	0	0	12,051	0
X	0	0	4	0	0	0	0	76	0	0	0	19	0	1	0	0	123

Table A.3. Confusion Matrix of the CharCNN + CNN Model

	ADJ	ADP	ADV	AUX	CCONJ	DET	INTJ	NOUN	NUM	PART	PRON	PROPN	PUNCT	SCONJ	SYM	VERB	X
ADJ	5,105	8	18	0	0	66	0	850	39	54	24	57	0	0	0	52	0
ADP	12	1,976	20	0	12	0	0	136	0	20	11	0	0	0	0	0	0
ADV	27	18	1,052	0	14	15	0	209	0	10	13	0	0	21	0	9	0
AUX	0	0	0	168	0	0	0	0	0	0	0	0	0	0	0	80	0
CCONJ	0	12	17	0	3,641	24	0	50	0	14	11	0	0	80	0	23	0
DET	33	0	27	0	18	3,569	0	97	9	36	111	0	0	47	0	0	0
INTJ	0	0	0	0	0	0	48	15	0	0	0	0	0	0	0	0	0
NOUN	890	133	168	1	47	117	0	35,264	95	88	72	650	0	17	0	541	18
NUM	15	0	0	0	0	3	0	55	4,348	15	0	0	0	0	0	0	0
PART	34	13	0	0	23	19	0	90	18	2,506	1	0	0	0	0	29	0
PRON	16	0	22	0	6	89	0	34	0	0	3,194	0	0	9	0	0	0
PROPN	69	2	3	0	0	0	0	760	17	0	12	7,783	0	0	0	20	11
PUNCT	0	0	0	0	0	0	0	0	0	0	0	0	9,906	0	0	0	0
SCONJ	0	0	12	0	62	28	0	8	0	0	0	0	0	765	0	0	0
SYM	0	0	0	0	0	0	0	0	0	0	0	0	0	0	359	0	0
VERB	26	0	0	17	6	6	0	276	0	11	0	0	0	0	0	12,082	0
X	0	0	3	0	2	6	0	46	0	0	0	11	0	0	0	0	154

Table A.4. Confusion Matrix of the CharCNN + Bi-LSTM Model

	ADJ	ADP	ADV	AUX	CCONJ	DET	INTJ	NOUN	NUM	PART	PRON	PROPN	PUNCT	SCONJ	SYM	VERB	X
ADJ	5,384	7	14	0	0	28	0	684	23	38	17	51	0	0	0	27	0
ADP	6	2,038	10	0	6	0	0	108	0	11	8	0	0	0	0	0	0
ADV	25	9	1,118	0	10	17	0	173	0	12	9	0	0	7	0	8	0
AUX	0	0	0	201	0	0	0	0	0	0	0	0	0	0	0	47	0
CCONJ	0	6	14	0	3,724	19	0	34	0	6	5	0	0	52	0	12	0
DET	31	0	12	0	12	3,636	0	101	12	22	90	0	0	31	0	0	0
INTJ	0	0	0	0	0	0	50	13	0	0	0	0	0	0	0	0	0
NOUN	582	74	84	0	34	77	0	36,220	81	41	44	477	0	16	0	362	9
NUM	13	0	0	0	0	4	0	50	4,356	11	0	0	0	0	0	0	0
PART	25	6	4	0	18	15	0	74	11	2,553	0	0	0	0	0	24	3
PRON	14	0	9	0	5	72	0	37	1	0	3,226	0	0	6	0	0	0
PROPN	65	0	0	0	0	0	0	694	19	0	9	7,870	0	0	0	17	3
PUNCT	0	0	0	0	0	0	0	0	0	0	0	0	9,906	0	0	0	0
SCONJ	0	0	6	0	62	15	0	7	0	0	0	0	0	785	0	0	0
SYM	0	0	0	0	0	0	0	0	0	0	0	0	0	0	359	0	0
VERB	26	0	8	15	6	5	0	294	0	11	0	2	0	0	0	12,057	0
X	0	0	3	0	0	3	0	66	0	0	0	12	0	0	0	0	138
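Confusion matrices like those in Tables A.1 to A.4 can be turned into overall accuracy and per-tag precision and recall. The following is a minimal sketch, assuming that rows hold the gold tags and columns the predicted tags; the extracted page does not state the orientation, so this is an assumption. The function names are hypothetical, and the example input is only the AUX/VERB sub-block of Table A.1, not a full table.

```python
def parse_rows(text):
    """Parse tab-separated rows of the form 'TAG<TAB>n1<TAB>n2...'.

    Counts may contain comma thousands separators, e.g. '12,050'.
    """
    labels, matrix = [], []
    for line in text.strip().splitlines():
        cells = line.split("\t")
        labels.append(cells[0])
        matrix.append([int(cell.replace(",", "")) for cell in cells[1:]])
    return labels, matrix


def accuracy(matrix):
    """Fraction of tokens on the diagonal (correctly tagged)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_tag_scores(labels, matrix):
    """Return {tag: (precision, recall)}, assuming rows are gold tags."""
    scores = {}
    for i, tag in enumerate(labels):
        row_total = sum(matrix[i])                  # gold occurrences
        col_total = sum(row[i] for row in matrix)   # predicted occurrences
        recall = matrix[i][i] / row_total if row_total else 0.0
        precision = matrix[i][i] / col_total if col_total else 0.0
        scores[tag] = (precision, recall)
    return scores


# AUX/VERB sub-block taken from Table A.1 (all other tags omitted).
SUB_BLOCK = "AUX\t151\t94\nVERB\t4\t12,050"
labels, matrix = parse_rows(SUB_BLOCK)
scores = per_tag_scores(labels, matrix)
```

Under the stated orientation, the sub-block shows that most AUX errors are tokens tagged as VERB. Applying the same functions to a full table requires every cell to be filled; a blank cell would need explicit handling before parsing.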
