The goal of this research is to design and build a lexical database for adjectives in the Turkish language. I used text data from online newspapers and magazines available on the Internet as a resource for lexical information, and collected a textual corpus of about one million running words. These texts are in the ISO-8859-9 character encoding, which is a superset of the ASCII character set. The lexicon contains syntactic category, semantic category, gradability, and thesaurus information about adjectives, as well as selectional restrictions. It supports a variety of Natural Language Processing applications such as parsing, text generation, natural language understanding, and information retrieval. The lexicon is implemented as a relational database using Microsoft Access. The lexicon was built from the textual corpus semiautomatically, using a series of AWK and C programs. One of the programs automatically extracts noun-derived adjectives, verb-derived adjectives, and intensified adjectives by looking for a list of suffixes in a word. Another program performs KWIC (Key Word In Context) indexing, which allows browsing the texts and extracting adjectives from them manually. Kemal Oflazer, director of the TU-LANG project, helped by running the TU-LANG morphological analyzer to assign part-of-speech tags to some of the texts that I collected from the Internet. I used those tagged texts, along with other untagged texts, to extract adjectives and related information. I extracted about 140,000 adjectives from a corpus of one million words. Information about syntactic classification (simple, compound, derived), semantic classification (qualificative, determinative), and gradability is extracted from the texts automatically. Information regarding selectional restrictions (human, animate, inanimate, abstract, concrete) and thesaurus information (synonyms, antonyms) is entered manually.
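The suffix-based extraction step described above can be illustrated with a minimal sketch. It is written in Python for brevity, not in the AWK and C used for the actual programs, and it is not the author's implementation. The suffix list (the vowel-harmony variants of -lI and -sIz, which commonly derive adjectives from nouns), the function name `candidate_adjectives`, and the `min_stem_len` threshold are all assumptions made for illustration; the abstract does not say which suffixes were used.

```python
# Hypothetical suffix list: vowel-harmony variants of -lI ("with") and
# -sIz ("without"). The source does not list the suffixes it searched for.
NOUN_DERIVED_SUFFIXES = ("lı", "li", "lu", "lü", "sız", "siz", "suz", "süz")


def candidate_adjectives(words, suffixes=NOUN_DERIVED_SUFFIXES, min_stem_len=2):
    """Return (word, suffix) pairs for words that end in a listed suffix.

    Words are assumed to be lower-cased already. A word is kept only if
    at least `min_stem_len` characters remain once the suffix is removed.
    """
    hits = []
    for word in words:
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                hits.append((word, suffix))
                break  # report each word once, with the first matching suffix
    return hits


sample = ["tuzlu", "evsiz", "kitap", "güzel", "li"]
print(candidate_adjectives(sample))
```

Pure suffix matching over-generates, so the result is best read as a list of candidates that still needs review. Corpus files in ISO-8859-9 can be read in Python by passing `encoding="iso-8859-9"` to `open`.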
I also manually entered qualificative subcategory information for qualificative adjectives, as well as all determinative adjectives and color adjectives. The final lexical database has about 2,400 adjective entries, some of which have multiple senses. I also wrote a graphical user interface program in Visual Basic for the lexical database. This program allows a user to browse the lexicon, add new adjectives, remove adjectives, or modify the information associated with an adjective.
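The KWIC (Key Word In Context) indexing mentioned in the abstract can be sketched as follows. This is a generic illustration in Python, not the program written for this work; the function name `kwic`, the whitespace tokenization, and the `width` parameter are assumptions chosen for the example.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    `tokens` is a list of words and `width` is the number of context
    words kept on each side of the keyword.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Naive whitespace tokenization, used here only to keep the example short.
text = "bu çok güzel bir ev ve güzel bir bahçe"
for left, key, right in kwic(text.split(), "güzel", width=2):
    print(f"{left:>12} [{key}] {right}")
```

Aligning the keyword in a fixed column, as the `print` call does, makes it easier to scan the neighboring words when deciding by hand whether an occurrence is an adjective.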