Lessons from building a Persian written corpus: Peykare

Mahmood Bijankhan¹,
Javad Sheykhzadegan²,
Mohammad Bahrani³ &
…
Masood Ghayoomi⁴


Abstract

This paper discusses lessons learned while building ‘Peykare’, a written language resource for contemporary Persian. After defining five linguistic varieties and 24 registers based on these varieties, we collected texts for Peykare to support linguistic analysis, including the study of cross-register differences. For the tokenization of Persian, we propose a descriptive generalization to normalize the orthographic variations found in texts. To annotate Peykare, we follow the EAGLES guidelines, which yield a hierarchy in the part-of-speech tags, and we apply a semi-automatic approach to annotation. We also give special attention to the Ezafe construction and to homographs, both of which are important in Persian text analysis.

Acknowledgments

This project was funded by the Higher Council for Informatics of Iran and the University of Tehran under the contract number 190/3554. Masood Ghayoomi was funded by the German research council DFG under the contract number MU 2822/3-1. Our special gratitude also goes to Dr. Ali Darzi at the University of Tehran who cooperated with us in the project and the anonymous reviewers for their helpful comments. However, the responsibility for the content of this study lies with the authors alone.

Author information

Authors and Affiliations

Department of Linguistics, The University of Tehran, Tehran, Iran
Mahmood Bijankhan
Research Center for Intelligent Signal Processing, Tehran, Iran
Javad Sheykhzadegan
Computer Engineering Department, Sharif University of Technology, Tehran, Iran
Mohammad Bahrani
German Grammar Group, Freie Universität Berlin, Berlin, Germany
Masood Ghayoomi

Corresponding author

Correspondence to Mahmood Bijankhan.

About this article

Cite this article

Bijankhan, M., Sheykhzadegan, J., Bahrani, M. et al. Lessons from building a Persian written corpus: Peykare. Lang Resources & Evaluation 45, 143–164 (2011). https://doi.org/10.1007/s10579-010-9132-x

Published: 03 November 2010
Issue Date: May 2011
DOI: https://doi.org/10.1007/s10579-010-9132-x
