Designing and building an adjective lexicon for Turkish based on a million word corpus

January 1999

Author:
Yasar Erenler
Adviser:
Martha W. Evens

Publisher:

Illinois Institute of Technology
3300 South Federal Street Chicago, IL
United States

ISBN: 978-0-599-42154-7

Order Number: AAI9940650

Pages: 130

Abstract

The goal of this research is to design and build a lexical database for adjectives in the Turkish language. I used text from online newspapers and magazines available on the Internet as a source of lexical information, collecting a corpus of about one million running words. The texts are in the ISO-8859-9 character encoding, which is a superset of the ASCII character set. The lexicon contains syntactic category, semantic category, gradability, and thesaurus information about adjectives, as well as selectional restrictions. It supports a variety of natural language processing applications, such as parsing, text generation, natural language understanding, and information retrieval. The lexicon is implemented as a relational database in Microsoft Access. It was built from the corpus semiautomatically, using a series of AWK and C programs. One program automatically extracts noun-derived adjectives, verb-derived adjectives, and intensified adjectives by matching each word against a list of suffixes. Another program performs KWIC (Key Word In Context) indexing, which allows browsing the texts and extracting adjectives manually. Kemal Oflazer, director of the TU-LANG project, helped by running the TU-LANG morphological analyzer to put part-of-speech tags on some of the texts that I collected from the Internet. I used those tagged texts, together with other untagged texts, to extract adjectives and related information. I extracted about 140,000 adjectives from the corpus of one million words. Information about syntactic classification (simple, compound, derived), semantic classification (qualificative, determinative), and gradability is discovered from the texts automatically. Information about selectional restrictions (human, animate, inanimate, abstract, concrete) and thesaurus information (synonyms, antonyms) is entered manually.
I also manually entered qualificative subcategory information for qualificative adjectives, all determinative adjectives, and color adjectives. The final lexical database has about 2,400 adjective entries, some of which have multiple senses. I also wrote a graphical user interface program in Visual Basic for the lexical database. This program allows a user to browse the lexicon, add new adjectives, remove adjectives, or modify the information stored about an adjective.
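
The KWIC (Key Word In Context) indexing mentioned in the abstract can be illustrated with a minimal sketch. The thesis programs were written in AWK and C and are not reproduced here; the Python below is only an illustration. The function name `kwic`, the `width` parameter, and the sample sentence are all assumptions made for this example. The sample is round-tripped through ISO-8859-9 bytes to mirror the encoding of the corpus texts.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Matching is exact and case-sensitive; Turkish-aware case folding
    (dotted/dotless i) is deliberately left out of this sketch.
    """
    tokens = re.findall(r"\w+", text)  # \w is Unicode-aware for str input
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


# Hypothetical sample, held as ISO-8859-9 bytes like the corpus texts.
raw = "güzel bir ev ve büyük bir bahçe, güzel bir gün".encode("iso-8859-9")
sample = raw.decode("iso-8859-9")

for left, hit, right in kwic(sample, "güzel", width=2):
    # Right-align the left context so the keyword forms a column.
    print(f"{left:>12} [{hit}] {right}")
```

Each row places the keyword between a fixed number of neighbouring words, which is what makes it practical to scan a column of hits and judge by eye whether a candidate word is being used as an adjective.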

Recommendations

A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics
Word Stemming is a widely used mechanism in the fields of Natural Language Processing, Information Retrieval, and Language Modeling. Language-independent stemmers discover classes of morphologically related words from the ambient ...
Building a training corpus for word sense disambiguation in English-to-Vietnamese machine translation
COLING-MTIA '02: Proceedings of the 2002 COLING workshop on Machine translation in Asia - Volume 16

The most difficult task in machine translation is the elimination of ambiguity in human languages. A certain word in English as well as Vietnamese often has different meanings which depend on their syntactical position in the sentence and the actual ...
A 300 MB Turkish Corpus and Word Analysis
ADVIS '02: Proceedings of the Second International Conference on Advances in Information Systems

In order to determine some properties of a language, a corpus of that language should be created. To analyze Turkish language, at first, a Turkish corpus having ∼300 MB capacity and more than 44 million words was prepared by using 10 different web sites ...
