slovak-nlp/resources: A curated list of resources such as tools and datasets useful for the processing of Slovak language

Resources

A curated list of resources for the processing of the Slovak language.

Pages

Corpora, datasets, vocabularies

Web

  • The Slovak portion of the FineWeb2 dataset.
  • 14.1 billion words across more than 26.5 million documents.
  • Sourced from 96 CommonCrawl snapshots, spanning the summer of 2013 to April 2024, and processed using datatrove.
  • Created for training the mT5 model.
  • Contains a 67 GB Slovak part.
  • Available in Tensorflow format and HuggingFace JSON format.
  • Can be downloaded from allenai/allennlp#5056 using git LFS.
  • automatic POS (SNK)
  • source: web
  • can be downloaded from Clarin
  • deduplicated
  • source: Common Crawl
  • automatic lemmatization, MSD, POS (AUT, ensemble lemmatization+POS)
  • source: web
  • no annotation
  • Twitter part

Question Answering

  • Manually annotated clone of SQuAD 2.0
  • Contains "unanswerable questions"
  • 92k items
  • Part of INCLUDE
  • 131 manually created math questions. One choice of 4 is correct.
  • Machine translation of the SQuAD 2.0 database
  • 140k annotated items
  • Slovak version of the Question to Declarative Sentence (QA2D).
  • Machine-translated using DeepL service.
  • https://arxiv.org/abs/2312.10171
  • 70k questions and answers
  • 5 000 yes-no questions
  • Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
  • Machine translated
  • Can be used also for summarization
  • 785K question-answer pairs for the Slovak language
  • Each example consists of the question and corresponding answer
  • Data are parsed from Common Crawl
  • Questions are divided into: frequently asked questions (FAQ) and Community Question Answering (CQA)
  • 178K question-answer pairs for the Slovak language
  • Data obtained from FAQ pages gathered from Common Crawl
  • Data also contains metadata, such as topic, question type

Morpho-syntactic

  • HArmonized Multi-LanguagE Dependency Treebank
  • Integrates 42 treebanks
  • The corpus contains texts of legal regulations (current and past) in Slovak. In addition to automatic lemmatization and morphological annotation, the corpus is also syntactically annotated.
  • Citation: GARABÍK, Radovan: Corpus of Slovak legislative documents. Jazykovedný časopis, 2022, Vol. 73, No 2, pp. 175-189.
  • 45 million tokens
  • form, lemma, POS+MSD (SNK)
  • source: SNK
  • form, lemma, POS+MSD (SNK)
  • source: SNK
  • tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
  • manual annotation
  • format: conllu, PDT tagset
  • source: SNK

Reference:

  • Gajdošová, Katarína; Šimková, Mária, et al., 2016, Slovak Dependency Treebank, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-1822.

A conversion of the Slovak Dependency Treebank into the Universal Dependencies tagset.

  • GitHub page
  • tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
  • manual annotation
  • format: conllu, UD tagset
  • source: SNK

Reference:

  • Zeman, Daniel. (2017). Slovak Dependency Treebank in Universal Dependencies. Journal of Linguistics/Jazykovedný časopis. 68. DOI: 10.1515/jazcas-2017-0048.
  • tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
  • format: conllu
  • source: Slovak UD, SNK
  • form, lemma, POS (Multext East)
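
Several of the treebanks above are distributed in the `conllu` format, in which each token line carries ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines start with `#`, and a blank line ends a sentence. The following is a minimal reading sketch, not the loader of any listed resource. The sample sentence and the names `Token`, `parse_conllu` and `SAMPLE` are illustrative assumptions, and the XPOS column is left empty (`_`) rather than guessing SNK tags.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One CoNLL-U token line: the ten standard columns, kept as strings."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of Token tuples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        current.append(Token(*fields))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample sentence, not taken from any corpus listed above.
SAMPLE = (
    "# text = Mám psa.\n"
    "1\tMám\tmať\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tpsa\tpes\tNOUN\t_\t_\t1\tobj\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)

sentences = parse_conllu(SAMPLE)
forms_and_lemmas = [(t.form, t.lemma) for t in sentences[0]]
```

For real work, a maintained CoNLL-U library is likely a better choice than this sketch, since it also handles multiword tokens and empty nodes.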

Parallel

  • Corpus of the Šariš dialect
  • 4.7k examples.
  • authors: Viktória Ondrejová and Marek Šuppa
  • 62 languages, 1,782 bitexts
  • Slovak part contains 100 mil. tokens
  • source: Europarl
  • speech, vectors, language
  • automatic POS (SNK)
  • source: Acquis, Europarl, EU-journal, EC-Europa, OPUS
  • automatic POS (SNK)
  • source: Acquis, Europarl, EU-journal, EC-Europa, OPUS
  • Dataset of various English-Slovak legal texts within the agenda of the Ministry.
  • Plain text format aligned at the sentence level; size: 112580 words.
  • It was converted into a 2895-TU English-Slovak resource in TMX format.
  • Dataset of various English-Slovak legal texts within the agenda of the Ministry.
  • Plain text format aligned at the sentence level.
  • Size: 105791 words.
  • It was converted into a 2609-TU English-Slovak resource in TMX format.
  • Parallel web Corpus with Slovak Part
  • 3.3 mil sentences English-Slovak
  • Unsupervised processing of Wikipedia to obtain parallel corpora
  • Used LASER embeddings.
  • 85 different languages, 1620 language pairs, 134M parallel sentences, out of which 34M are aligned with English
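
Two of the parallel resources above are described as English-Slovak translation units (TUs) in TMX format. TMX is an XML format in which each `tu` element holds one `tuv` per language, and each `tuv` holds the text in a `seg` element. The sketch below shows one way to extract sentence pairs with the Python standard library. The sample document, the function name `read_tmx_pairs`, and the language codes `en` and `sk` are assumptions for illustration; the actual files may use different language codes or header values.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# ElementTree exposes the xml:lang attribute under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text: str, src: str = "en", tgt: str = "sk") -> List[Tuple[str, str]]:
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs: List[Tuple[str, str]] = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language subtag, e.g. "en-GB" -> "en".
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


# Hypothetical one-unit TMX document, not taken from the listed datasets.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The law enters into force today.</seg></tuv>
      <tuv xml:lang="sk"><seg>Zákon nadobúda účinnosť dnes.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx_pairs(SAMPLE_TMX)
```

Units that lack either language are skipped silently, which keeps the output aligned at the cost of hiding incomplete entries.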

Semantic textual similarity

  • Machine translated by OPUS-en-sk model
  • Sentence similarity dataset in which each example contains two sentences and a floating-point target between 0 and 5, where a higher number means higher similarity. The dataset contains train: 5 749, validation: 1 500 and test: 1 379 examples.
  • Referenced from this report by J. Agarský.
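
The record layout described above (two sentences plus a similarity target between 0 and 5) can be sketched as a small validated type. The field names `sentence1`, `sentence2` and `score`, and the class name `StsExample`, are assumptions and may not match the column names in the published dataset; only the score range and split sizes come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StsExample:
    """One sentence-similarity example with a target in the range [0, 5]."""
    sentence1: str
    sentence2: str
    score: float

    def __post_init__(self) -> None:
        # Reject targets outside the documented 0..5 range.
        if not 0.0 <= self.score <= 5.0:
            raise ValueError(f"score must be between 0 and 5, got {self.score}")

    def normalized(self) -> float:
        """Scale the score to [0, 1], e.g. for comparison with cosine similarity."""
        return self.score / 5.0


# Split sizes as stated in the dataset description.
SPLIT_SIZES = {"train": 5749, "validation": 1500, "test": 1379}
TOTAL_EXAMPLES = sum(SPLIT_SIZES.values())

# Hypothetical example pair, not taken from the dataset.
example = StsExample("Pes beží po lúke.", "Po lúke beží pes.", 4.5)
```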

Sentiment

  • Corpus of sentiment in Slovak social media.
  • The dataset contains 34 006 manually annotated comments from the social media platform Facebook.
  • Specifically, it includes:
    • 20 668 comments labeled as "negative"
    • 9 581 comments labeled as "neutral"
    • 3 779 comments labeled as "positive"
  • Author: Zuzana Sokolová
  • is a multilingual dataset consisting of 33,104 unique news articles
  • sourced from 12 online news platforms across the Visegrád Group (V4) countries — Poland, Czech Republic, Slovakia, and Hungary — published between 1998 and 2025. The goal of the dataset is to analyze media narratives surrounding nuclear energy in Central Europe.
  • 30 percent in Slovak
  • annotated by language models for overall sentiment toward nuclear energy, sentiment of the headline (pessimistic ↔ optimistic), and degree of sensationalism / alarmism in the headline
  • 5k items
  • positive and negative class
  • Reference: Samuel Pecar, Marian Simko, and Maria Bielikova. 2019. Improving Sentiment Classification in Slovak Language. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 114–119, Florence, Italy. Association for Computational Linguistics.
  • Unknown/undocumented source
  • positive/negative
  • source: Twitter
  • 3 categories - positive, negative, neutral
  • The dataset contains a total of 1 588 comments in Slovak from various Facebook pages. The texts are annotated with 5 categories.
  • Machine translated
  • Sentiment analysis dataset, binary classification task: positive sentiment, negative sentiment. It includes reviews from 7 categories with positive, neutral and negative sentiment labels.
  • Source: SlovakBERT auxiliary repository by Matúš Pikuliak, Štefan Grivalský, Martin Konôpka, Miroslav Blšták, Martin Tamajka, Viktor Bachratý, Marián Šimko, Pavol Balážik, Michal Trnka, and Filip Uhlárik, 2021
  • Referenced from this report by J. Agarský.
  • CSFD Movie Reviews
  • 25k items
  • Collection of multiple datasets across various languages
  • Contains also the data for Slovak
  • positive (29K) / neutral (13K) / negative (14K) labels

Hate Speech

  • Corpus of toxic speech in social networks
  • The dataset contains 8 840 manually annotated comments from Facebook. Specifically, it includes:
    • 4 420 comments labeled as "toxic" (value 1)
    • 4 420 comments labeled as "non-toxic" (value 0)
  • Author: Zuzana Sokolová
  • 13k items
  • Crowdsourced hate and offensive speech in Facebook comments
  • binary classification

Fact Checking

  • Multilingual fact checking database with Slovak part
  • Contains 28k posts in 27 languages from social media, 206k fact-checks in 39 languages written by professional fact-checkers, as well as 31k connections between these two groups.

9.1k Czech, 2.8k Polish and 12.6k Slovak labeled claims with reasoning: demagog.zip (~16.5 MB)

  • Machine translated facts with evidence represented as references to Wikipedia pages.
  • 350k items

Biases

  • GEST dataset used to measure gender-stereotypical reasoning in language models and machine translation systems.

Instructions

  • 4.2 M records for the Slovak language
  • Machine translated using the NLLB 3.3B model

Named Entity Recognition

  • Manually annotated set
  • Diploma thesis at Comenius University
  • PER, ORG, LOC, MISC annotations
  • approx. 7k sentences.
  • 8.48k train, 1k dev and 2k test sentences from Universal Dependencies
  • Human annotated by 2 annotators as part of the Universal NER project
  • PER, ORG, LOC annotations
  • AYA Instruction format
  • A Large Multilingual Dataset for Entity Linking
  • Slovak part has 41.0M tokens and 1366.4k entities
  • named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART,MANUFACTURED
  • automatically annotated Wikipedia for Named Entities
  • massively multilingual
  • Slovak part has 500k sentences.
  • Reference: Al-Rfou, Rami, et al. "Polyglot-NER: Massive multilingual named entity recognition." Proceedings of the 2015 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2015.

Translated version of the original CoNLL-2003 dataset (translated from English to Slovak via Google Translate)

  • CNEC 2.0 Czech model machine translated to Slovak & filtered
  • CNEC entity hierarchy
  • source: JÚĽŠ SAV

Translated version of the original CoNLL-2003 dataset (translated from English to Slovak via Google Translate)

Spelling

  • Corpus of spelling errors created from edits in Wikipedia
  • Spelling errors are sorted into 5 categories

Wordnet

Summarization

  • Multilingual Summarization Dataset
  • Slovak part has 1.3k rows.
  • 200k of news article summaries
  • Reference: Marek Suppa and Jergus Adamec. 2020. A Summarization Dataset of Slovak News Articles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6725–6730, Marseille, France. European Language Resources Association.
  • 1788 records for the Slovak language
  • Dataset created from European laws - EurLex

Natural Language Inference

  • 103K samples
  • Translated version of ANLI using Google Translate

Topic Classification

  • Topic classification dataset consisting of 205 languages
  • 1K samples with 7 categories (science, geography, entertainment, politics, health, travel, sports)
  • Data obtained from the FLORES-200

Multilingual Benchmarks

  • Translated version of the MMLU dataset across 30 languages, including Slovak
  • Translated version of the HellaSwag dataset across 35 languages, including Slovak
  • Translated using GPT-3.5-Turbo
  • Translated version of the ARC dataset across 35 languages, including Slovak
  • Translated using GPT-3.5-Turbo
  • Translated version of the TruthfulQA dataset across 35 languages, including Slovak
  • Translated using GPT-3.5-Turbo
