TIARA 2.0: an interactive tool for annotating discourse structure and text improvement
Discourse structure annotation aims at analysing how discourse units (e.g. sentences or clauses) relate to each other and what roles they play in the overall discourse. Several annotation tools for discourse structure have been developed. However, ...
Statistical quality estimation for partially subjective classification tasks through crowdsourcing
When constructing a large-scale data resource, the quality of artifacts has great significance, especially when they are generated by creators through crowdsourcing. A widely used approach is to estimate the quality of each artifact based on ...
COLLIE: a broad-coverage ontology and lexicon of verbs in English
Progress on deep language understanding is inhibited by the lack of a broad coverage lexicon that connects linguistic behavior to ontological concepts and axioms. We have developed COLLIE-V, a deep lexical resource for verbs, with the coverage of ...
The WASABI song corpus and knowledge graph for music lyrics analysis
We present the WASABI Song Corpus, a large corpus of songs enriched with metadata extracted from music databases on the Web, and resulting from the processing of song lyrics and from audio analysis. More specifically, given that lyrics encode an ...
Between welcome culture and border fence: A dataset on the European refugee crisis in German newspaper reports
Newspaper reports provide a rich source of information on the unfolding of public debates, which can serve as a basis for inquiry in political science. Such debates are often triggered by critical events, which attract public attention and incite ...
Investigating the role of swear words in abusive language detection tasks
Swearing plays a ubiquitous role in everyday conversations among humans, in both oral and textual communication, and occurs frequently in social media texts, which are typically characterized by informal language and spontaneous writing. Such occurrences can be ...
EventDNA: a dataset for Dutch news event extraction as a basis for news diversification
News organizations increasingly tailor their news offering to the reader through personalized recommendation algorithms. However, automated recommendation algorithms reflect a commercial logic based on calculated relevance to the user, rather than ...
Usage disambiguation of Turkish discourse connectives
This paper describes a rule-based approach and a machine learning approach to disambiguate the discourse usage of connectives in Turkish, which not only has single and phrasal connectives as most languages do, but also suffixal connectives that ...
The impact of preprocessing on word embedding quality: a comparative study
Data preprocessing is among the principal stages in virtually all text-based tasks. In this light, recent approaches have employed word embeddings in the majority of text-based tasks, wherein word co-occurrences are used as the basis of word ...
Spelling errors made by people with dyslexia
In this paper, we present a review of studies that have collected and annotated errors produced by people with dyslexia from corpora of written texts (six studies involving English, Spanish, German and French). Such resources are useful for ...
Nonverbal communication with emojis in social media: dissociating hedonic intensity from frequency
As a popular means of nonverbal communication in social media, emojis provide quick predictions about public sentiments towards social events. Previous analyses of emojis reported that people use positive emojis more frequently than negative ...
Managing, storing, and sharing long-form recordings and their annotations
The technique of long-form recordings via wearables is gaining momentum in different fields of research, notably linguistics and neurology. This technique, however, poses several technical challenges, some of which are amplified by the ...
Manipuri–English comparable corpus for cross-lingual studies
This paper presents Mni-EnCC, a temporally aligned Manipuri–English comparable corpus, to facilitate cross-lingual studies between Manipuri and English. Mni-EnCC has been created by collating text from two publicly published news sources in ...
The ParlaMint corpora of parliamentary proceedings
- Tomaž Erjavec,
- Maciej Ogrodniczuk,
- Petya Osenova,
- Nikola Ljubešić,
- Kiril Simov,
- Andrej Pančur,
- Michał Rudolf,
- Matyáš Kopp,
- Starkaður Barkarson,
- Steinþór Steingrímsson,
- Çağrı Çöltekin,
- Jesse de Does,
- Katrien Depuydt,
- Tommaso Agnoloni,
- Giulia Venturi,
- María Calzada Pérez,
- Luciana D. de Macedo,
- Costanza Navarretta,
- Giancarlo Luxardo,
- Matthew Coole,
- Paul Rayson,
- Vaidas Morkevičius,
- Tomas Krilavičius,
- Roberts Darģis,
- Orsolya Ring,
- Ruben van Heusden,
- Maarten Marx,
- Darja Fišer
This paper presents the ParlaMint corpora, which contain transcriptions of the sessions of 17 European national parliaments and total half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are ...
Resources for Turkish natural language processing: A critical survey
This paper presents a comprehensive survey of corpora and lexical resources available for Turkish. We review a broad range of resources, focusing on the ones that are publicly available. In addition to providing information about the available ...