TypeCraft collaborative databasing and resource sharing for linguists

Dorothee Beermann¹ &
Pavel Mihaylov²

Abstract

Interlinear Glossed Text (IGT) is a well-established data format within philology and the structural and generative fields of linguistics. The best-known format for an IGT is the one found in linguistic publications, where one line of text is followed by one line of glosses and one line of free translation. Although they serve different functions, IGTs are ubiquitous in linguistic research and publications. Yet they have also been criticised for being fabricated and unreliable in some of their uses. However that may be, IGTs represent linguistic knowledge, and for less-resourced languages in particular they are often the only structured data available. Under the auspices of the Digital Humanities, linguists increasingly focus on the advantages of Semantic Web technologies. Presenting the modules and procedures of the web-based linguistic application TypeCraft (TC), we outline how the creation of IGTs can become an integral part of a shared linguistic methodology. Linguistic services have the potential to allow efficient data management, and their strength lies in facilitating new forms of collaboration beyond social networking. They pave the way towards what one might call shared methodologies. In this paper we discuss the linguistic value of web-based technology. By presenting the functionalities of TC and giving a detailed summary of online linguistic data creation and retrieval, we present external and internal criteria for a single-system evaluation of TC centred on usage objectives.


Notes

See also Tognini-Bonelli (2001) speaking for the field of corpus linguistics, and work by Bird (2009).
FLEx is the successor of the (Shoebox)Toolbox system and belongs to the SIL group of linguistic software.
ELAN is developed by the Language Archives group at the Max Planck Institute for Psycholinguistics.
For more information about SOA for linguists see Dima et al. (2012).
In this article, IGTs are shown in examples (1)–(4), (5) and in Fig. 4.
An in-depth description of the function of interlinear glossing, including a short historical overview, can be found in Lehmann (2004b).
A Kwa language spoken in Ghana (ISO 639-3: aka).
Due to the prosodic structure of Akan these two morphemes cannot always be distinguished by tone.
Christaller (1933) uses mã as the citation form of the verb, while today the form ma is mostly used. We thank Per Baumann from the University in Zürich for clarifying this point.
As Kofi Abrefa from the University of Cape Coast, Ghana, points out, the present tense of mã has a high tone, as in Kofi má me sika, meaning ‘Kofi gives him/her money’. The past tense, however, has a low tone on both vowels, as in Kofi mààno sika ‘Kofi gave him/her money.’ In a complex sentence, however, the tone marking indicates subordination in addition to the Tense, Aspect, Mood (TAM) features, so that má à ‘gave’ is now marked by a high tone on the first vowel and a low tone on the second, meaning ‘when he gave money …’. This holds for Asante but not for Fante, where the Asante-like subordinate tonal pattern also occurs in main clauses.
In a project together with Uninett Sigma AS, which manages the national infrastructure for computational science in Norway, we worked on testing scalable (elastic) deployment of TC in a unified cloud solution offered by Uninett. Part of the project was scalability testing. To this end we created a test instance of the present TC system and imported 3,148 texts of varying length, with 19,484 sentences in total, from the POS-tagged People’s Daily Corpus from Fujitsu Research and Development Center and Peking University. The work is still ongoing at the time of publication.
Paunaka is a Southern Arawakan language spoken in Bolivia, ISO 639-3: pnk (pending Associated Change request number: 2011-056).
Lower and Upper Tanana are Athabaskan languages spoken in the USA, in east-central Alaska, in the Tanana River area. ISO 639-3: tau.
A Bantu language spoken in Uganda. ISO 639-3: lug.
ISO 639-3: por.
A Bantu language composed of the languages Nkore (ISO 639-3: nyn) and Kiga (ISO 639-3: cgg), spoken in Uganda.
MediaWiki is a free, open-source wiki package written in PHP, originally for use on Wikipedia: http://www.mediawiki.org.
http://typecraft.org/typecraft.xsd.
http://media.cidles.eu/poio/poio-api/ last accessed: 09.26.13.
CIDLeS is an acronym for the Interdisciplinary Centre for Social and Language Documentation, Minde Portugal, http://www.cidles.eu.
The IPA editor was created by T. P. Szynalski and is accessible at http://ipa.typeit.org.
TC also allows sentence-level annotation, which we will not discuss further here. We simply note that sentence-level tagging allows the flagging of construction-level properties, such as syntactic argument structure, situation type, aspect, modality, force and evidentiality.
last accessed: 03.19.2013.
http://typecraft.org/tc2wiki/Annotating_Akan.
http://typecraft.org/tc2wiki/Typological_Features_Template_for_Ga.
http://typecraft.org/tc2wiki/Current_events#User_survey.
A Bantu language spoken in Uganda.
Date of query 06-02-2013.
http://moin.delph-in.net/.
http://nltk.org.

References

Amar, M., David, S., Panckhurst, R., & Whistlecroft, L. (2008). Classification procedures for software evaluation. In Paper presented at Language Resources and Evaluation Conference (LREC).
Ameka, F., Dench, A., & Evans, N. (Eds.). (2006). In Grammaticography, the art and craft of writing grammars (pp. 41–68). Mouton De Gruyter.
Ameka, F. K. (2003). Multiverb constructions in a West African areal typological perspective. In D. Beermann & L. Hellan (Eds.), Online proceedings of TROSS Trondheim Summer School 2003.
Appiah Amfo, N. (2007). Akan demonstratives. In D. L. Payne & J. Peña (Eds.), Selected proceedings of the 37th annual conference on African linguistics.
Beermann, D., & Prange, A. (2006). Glossing language online. In A. Palmer & E. Ponvert (Eds.), Proceedings of the Texas linguistics society X conference computational linguistics for less-studied languages. Stanford: CSLI Publications.
Beermann, D., Mihaylov, P., & Sloetjes, H. (2012). Linking annotations. Steps towards tool-chaining in language documentation. In E. Hinrichs, H. Neuroth, & P. Wittenburg (Eds.), Proceedings service-oriented architectures (SOAs) for the humanities: Solutions and impacts. Joint CLARIN-D/DARIAH Workshop at Digital Humanities Conference 2012.
Bender, E. M., Ghodke, S., Baldwin, T., & Dridan, R. (2012). From database to treebank: On enhancing hypertext grammars with grammar engineering and treebank search. In S. Nordhoff (Ed.), Electronic Grammaticography, Mouton De Gruyter.
Bird, S. (2009). Last words: Natural language processing and linguistic fieldwork. Computational Linguistics, 35(3), 469–474.
Bird, S., & Liberman, M. (2001). A formal framework for linguistic annotation. Speech Communication, 33(1–2), 23–60.
Bird, S., & Simons, G. (2003). Seven dimensions of portability of language documentation and description. Language, 79(3), 557–582.
Bouda, P., & Helmbrecht, J. (2012). From corpus to grammar: How DoBeS corpora can be exploited for descriptive linguistics. In S. Nordhoff (Ed.), Electronic Grammaticography, Mouton De Gruyter.
Bow, C., Hughes, B., & Bird, S. (2003). Towards a general model of interlinear text. In Proceedings of EMELD workshop 2003: Digitizing and annotating texts and field recordings. Electronic Metastructure for Endangered Language Data (EMELD) Project.
Burdick, A., Drucker, J., Lunenfeld, P., Presner, T., & Schnapp, J. (2012). Digital humanities. Cambridge: MIT Press.
Chiarcos, C., Nordhoff, S., & Hellmann, S. (2012). Interoperability of corpora and annotations. In Linked data in linguistics: Representing and connecting language data and language metadata. New York: Springer.
Christaller, J. (1933). Dictionary of the Asante and Fante language called Tshi (Twi). Basel: Evangelical Missionary Society.
Dima, E., Hinrichs, E., Hinrichs, M., Kislev, A., Trippel, T., & Zastrow, T. (2012). Integration of WebLicht into the CLARIN infrastructure. In Proceedings service-oriented architectures (SOAs) for the humanities: Solutions and impacts. Joint CLARIN-D/DARIAH Workshop at Digital Humanities Conference 2012.
Drubig, H. B. (2000). Towards a typology of focus and focus constructions, manuscript. Germany: University of Tübingen.
Dryer, M. S., & Haspelmath, M. (2011). The world atlas of language structures online. http://wals.info/.
Farrar, S., & Langendoen, T. (2003). A linguistic ontology for the semantic web. GLOT International, 7(3), 97–100.
Ferreira, V., Bouda, P., & Lopes, A. (2012). Poio API: An annotation framework to bridge language documentation and natural language processing. In F. Mambrini, M. C. Passarotti, & C. Sporleder (Eds.), Proceedings of the 2nd workshop on annotation of corpora for research in the humanities ACRH-2.
Gibbon, D. (2008). Efficient language documentation: Creation of local multipliers. LSA 2008 Annual Meeting Tutorial.
Gippert, J., Himmelmann, N., & Mosel, U. (Eds.). (2006). Linguistic annotation. Mouton De Gruyter.
Gold, M. K. (Ed.). (2012). Debates in the digital humanities. Minneapolis: Minnesota Press.
Haspelmath, M. (2001). Explaining the ditransitive person-role constraint: A usage-based approach. Manuscript, Max-Planck-Institut für evolutionäre Anthropologie.
Haspelmath, M. (2012). Framework-free grammatical theory. In B. Heine & H. Narrog (Eds.), The Oxford handbook of grammatical analysis. Oxford: Oxford University Press.
Hellan, L. (2010). From descriptive annotation to grammar specification. In Proceedings of the 4th linguistic annotation workshop. ACL, W10-1826, 172–176.
Himmelmann, N. P. (1998). Documentary and descriptive linguistics. Linguistics, 36, 161–195.
Hinrichs, E., Neuroth, H., & Wittenburg, P. (2012). Service-oriented architectures (SOAs) for the humanities: Solutions and impacts. Conference proceedings. http://www.citeulike.org/user/AlisonBabeu/article/11031309.
Ide, N., & Suderman, K. (2007). GrAF: A graph-based format for linguistic annotations. In Proceedings of the linguistic annotation workshop, held in conjunction with ACL, Prague.
Ide, N., & Suderman, K. (2012). Bridging the gaps: Interoperability for language engineering architectures using GrAF. Language Resources and Evaluation, 46(1), 75–89.
Keller, F. (2000). Gradience in grammar: Experimental and computational aspects of degrees of grammaticality. PhD thesis, University of Edinburgh.
Keller, F., & Asudeh, A. (2007). Constraints on linguistic coreference: Structural vs. pragmatic factors. In J. Moore & K. Stenning (Eds.), Proceedings of the 23rd annual conference of the cognitive science society.
Kertész, A., & Rákosi, C. (2012). Data and evidence in linguistics: A plausible argumentation model. Cambridge: Cambridge University Press.
Lehmann, C. (2004). Data in linguistics. The Linguistic Review, 21(3–4), 175–210.
Lehmann, C. (2004b). Interlinear morphemic glossing. In G. E. Booij, C. Lehmann & J. Mugdan (Eds.), Morphologie: Ein internationales Handbuch zur Flexion und Wortbildung, vol. 2 (pp. 1834–1856). Berlin/New York: De Gruyter.
Lewis, W. (2006). ODIN: A model for adapting and enriching legacy infrastructure. In e-science, 2nd IEEE international conference on e-Science and grid computing (e-Science’06), p. 137.
Lewis, W. D., & Xia, F. (2010). Developing ODIN: A multilingual repository of annotated language data for hundreds of the world’s languages. Literary and Linguistic Computing, 25(3), 303–319.
Lyons, J. (1977). Semantics. Cambridge University Press.
Mosel, U. (2006). Sketch grammar. In J. Gippert, N. Himmelmann, & U. Mosel (Eds.), Essentials of language documentation, Mouton De Gruyter.
Nordhoff, S. (2008). Electronic reference grammars for typology: Challenges and solutions. Language Documentation & Conservation, 2, 296–324.
Nordhoff, S. (Ed.). (2012). Electronic grammaticography. Published as a Special Publication of the Language Documentation and Conservation Department of Linguistics, UHM.
Osam, E. K. (1994). Aspects of Akan grammar: A functional perspective. PhD thesis, University of Oregon, Eugene.
Paul, L. M., & Simons, G. F. (Eds.). (2013). Ethnologue: Languages of the world. Online version: http://www.ethnologue.com/.
Schultze-Berndt, E. (2006). Linguistic annotation. In J. Gippert, N. Himmelmann & U. Mosel (Eds.), Essentials of language documentation. Mouton De Gruyter.
Schütze, C. T. (1996). The empirical base of linguistics. Grammaticality judgments and linguistic methodology. Chicago: The University of Chicago Press.
Schmidt, T. (2010). Linguistic tool development between community practices and technology standards. In Proceedings of the LREC workshop language resource and language technology standards state of the art, emerging needs, and future developments. European Language Resources Association (ELRA), Valletta, Malta.
Schroeter, R., & Thieberger, N. (2006). EOPAS, the EthnoER online representation of interlinear text. In L. Barwick & N. Thieberger (Eds.), Sustainable data from digital fieldwork. Sydney: Sydney University Press.
Sedlak, P. A. S. (1975). Direct-indirect object word order: A cross-linguistic analysis. Working Papers on Language Universals 18, 117–164.
Simons, G. F. (2008). Linguistics as a community activity: The paradox of freedom through standards. In W. Lewis, S. Karimi, H. Harley, & S. Farrar (Eds.), Time and again: Papers in honor of D. Terence Langendoen, John Benjamins (pp. 91–117).
Stewart, J. M. (1963). Some restrictions on objects in Twi. Journal of African Languages, 2, 145–149.
Taylor, C. (1985). Nkore-Kiga. Croom Helm descriptive grammars. London: Croom Helm.
Thieberger, N., & Berez, A. (2012). Linguistic data management. In N. Thieberger (Ed.), The Oxford handbook of linguistic fieldwork. Oxford: Oxford Handbooks in Linguistics.
Tognini-Bonelli, E. (2001). Corpus linguistics at work. Amsterdam: Benjamins.
Wunderlich, D. (2005). Towards a structural typology of verb classes. Germany: Manuscript University of Berlin.

Acknowledgments

We would like to thank the three anonymous reviewers of this paper for their invaluable comments.

Author information

Authors and Affiliations

Department of Modern Foreign Languages, Norwegian University of Science and Technology, Trondheim, Norway
Dorothee Beermann
Ontotext AD, 1784, Sofia, Bulgaria
Pavel Mihaylov

Corresponding author

Correspondence to Dorothee Beermann.

About this article

Cite this article

Beermann, D., Mihaylov, P. TypeCraft collaborative databasing and resource sharing for linguists. Lang Resources & Evaluation 48, 203–225 (2014). https://doi.org/10.1007/s10579-013-9257-9

Published: 15 November 2013
Issue Date: June 2014
DOI: https://doi.org/10.1007/s10579-013-9257-9
