Designing and building an adjective lexicon for Turkish based on a million word corpus
Publisher:
  • Illinois Institute of Technology
  • 3300 South Federal Street Chicago, IL
  • United States
ISBN:978-0-599-42154-7
Order Number:AAI9940650
Pages:130
Abstract

The goal of this research is to design and build a lexical database of adjectives in the Turkish language. I used text from online newspapers and magazines available on the Internet as the source of lexical information, collecting a corpus of about one million running words. These texts are in the ISO-8859-9 character encoding, which is a superset of the ASCII character set. The lexicon contains syntactic category, semantic category, gradability, and thesaurus information about adjectives, as well as selectional restrictions. It supports a variety of Natural Language Processing applications such as parsing, text generation, natural language understanding, and information retrieval. The lexicon is implemented as a relational database using Microsoft Access.
The lexicon was built from this corpus semiautomatically, using a series of AWK and C programs. One program automatically extracts noun-derived adjectives, verb-derived adjectives, and intensified adjectives by checking each word against a list of suffixes. Another program performs KWIC (Key Word In Context) indexing, which allows browsing the texts and manually extracting adjectives. Kemal Oflazer, director of the TU-LANG project, helped me by running the TU-LANG morphological analyzer to put part-of-speech tags on some of the texts that I collected from the Internet. I used those tagged texts, along with other untagged texts, to extract adjectives and related information. I extracted about 140,000 adjectives from the corpus of one million words.
Information about syntactic classification (simple, compound, derived), semantic classification (qualificative, determinative), and gradability is discovered from the texts automatically. Information regarding selectional restrictions (human, animate, inanimate, abstract, concrete) and thesaurus information (synonyms, antonyms) is entered manually.
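The two extraction aids described above, suffix matching and KWIC indexing, can be illustrated with a minimal sketch. It is written in Python rather than the AWK and C of the original programs, and it is not the thesis code. The suffix list is an assumption: the abstract does not enumerate the suffixes used, so the sketch takes the common Turkish noun-to-adjective suffixes -lı/-li/-lu/-lü ("with") and -sız/-siz/-suz/-süz ("without"). The function names `tokenize`, `suffix_candidates`, and `kwic` are hypothetical.

```python
import re

# Assumed suffix list (not taken from the thesis): common Turkish suffixes
# that derive adjectives from nouns, in all vowel-harmony variants.
NOUN_DERIVED_SUFFIXES = ("lı", "li", "lu", "lü", "sız", "siz", "suz", "süz")


def tokenize(text):
    """Split text into word tokens made of letters only.

    Case is left untouched because Python's default lowercasing does not
    follow Turkish rules for dotted and dotless i.
    """
    return re.findall(r"[^\W\d_]+", text)


def suffix_candidates(tokens, suffixes=NOUN_DERIVED_SUFFIXES, min_stem=2):
    """Return (token, suffix) pairs for tokens ending in a listed suffix.

    These are only candidates: plain suffix matching overgenerates, so the
    output still needs manual review.
    """
    found = []
    for tok in tokens:
        for suf in suffixes:
            if tok.endswith(suf) and len(tok) - len(suf) >= min_stem:
                found.append((tok, suf))
                break  # one matching suffix per token is enough
    return found


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    sample = tokenize("tuzlu yemek sevmem ama tuzsuz çorba da olmaz")
    print(suffix_candidates(sample))
    for left, key, right in kwic(sample, "tuzsuz", width=2):
        print(f"{left:>20} [{key}] {right}")
```

The `min_stem` threshold is a design choice for the sketch: it skips very short words that merely happen to end in one of the suffix strings.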
I also manually entered qualificative subcategory information for qualificative adjectives, all determinative adjectives, and color adjectives. The final lexical database has about 2,400 adjective entries, some of which have multiple senses. I also wrote a graphical user interface program in Visual Basic for the lexical database. This program allows a user to browse the lexicon, add new adjectives, remove adjectives, or modify the information related to an adjective.
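As a rough illustration of how such a lexicon could be laid out relationally, the following sketch uses Python's built-in `sqlite3` module as a stand-in for Microsoft Access. The table names, column names, and helper functions are all hypothetical; the abstract does not give the actual schema. Only the category values (simple/compound/derived, qualificative/determinative, and the five selectional restrictions) come from the text above. Senses are stored in a separate table because some entries have multiple senses.

```python
import sqlite3

# Hypothetical schema: one row per adjective, one row per sense.
SCHEMA = """
CREATE TABLE adjective (
    id INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL UNIQUE,
    syntactic_class TEXT NOT NULL
        CHECK (syntactic_class IN ('simple', 'compound', 'derived'))
);
CREATE TABLE sense (
    id INTEGER PRIMARY KEY,
    adjective_id INTEGER NOT NULL
        REFERENCES adjective(id) ON DELETE CASCADE,
    semantic_class TEXT NOT NULL
        CHECK (semantic_class IN ('qualificative', 'determinative')),
    gradable INTEGER NOT NULL DEFAULT 0,
    restriction TEXT
        CHECK (restriction IN
               ('human', 'animate', 'inanimate', 'abstract', 'concrete'))
);
"""


def open_lexicon():
    """Create an empty in-memory lexicon with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE
    conn.executescript(SCHEMA)
    return conn


def add_adjective(conn, lemma, syntactic_class, senses):
    """Insert an adjective with (semantic_class, gradable, restriction) senses."""
    cur = conn.execute(
        "INSERT INTO adjective (lemma, syntactic_class) VALUES (?, ?)",
        (lemma, syntactic_class),
    )
    adj_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO sense (adjective_id, semantic_class, gradable, restriction)"
        " VALUES (?, ?, ?, ?)",
        [(adj_id, sem, int(grad), restr) for sem, grad, restr in senses],
    )
    conn.commit()
    return adj_id


def remove_adjective(conn, lemma):
    """Delete an adjective; its senses are removed by the cascade."""
    conn.execute("DELETE FROM adjective WHERE lemma = ?", (lemma,))
    conn.commit()


def count_senses(conn, lemma):
    """Return how many senses are stored for the given lemma."""
    row = conn.execute(
        "SELECT COUNT(*) FROM sense JOIN adjective"
        " ON adjective.id = sense.adjective_id WHERE adjective.lemma = ?",
        (lemma,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    db = open_lexicon()
    # Illustrative entry only, not data from the thesis.
    add_adjective(db, "tatlı", "derived",
                  [("qualificative", True, "concrete"),
                   ("qualificative", True, "human")])
    print(count_senses(db, "tatlı"))
```

The add and remove helpers mirror the browse, add, remove, and modify operations that the graphical interface is described as offering.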

Contributors
  • Illinois Institute of Technology