Resources

A curated list of resources for processing the Slovak language.

Models and tools

Pages

Slovak resources by Essential Data
Slovak speech and language processing at KEMT FEI TUKE with tools, demos and language resources.
Slovak National Corpus

Corpora, datasets, vocabularies

Web

FineWeb2

The Slovak portion of the FineWeb2 dataset.
14.1 billion words across more than 26.5 million documents.
Sourced from 96 CommonCrawl snapshots, spanning the summer of 2013 to April 2024, and processed using datatrove.

Multilingual C4

Created for training the mT5 model.
Contains a 67 GB Slovak part.
Available in Tensorflow format and HuggingFace JSON format.
Can be downloaded from allenai/allennlp#5056 using git LFS.

Common Crawl

SkTenTen

automatic POS (SNK)
source: web
can be downloaded from Clarin

Oscar

deduplicated
source: Common Crawl

Aranea

automatic lemmatization, MSD, POS (AUT, ensemble lemmatization+POS)
source: web

HC Corpora

no annotation
Twitter part

HPLT dataset(s)

web corpora
no annotation
no deduplication
lemmatized, POS+MSD+syntactically tagged, deduplicated corpus: https://www.juls.savba.sk/hpltskcorp.html

Question Answering

SK QUAD

Manually annotated clone of SQUAD 2.0
Contains "unanswerable questions"
92k items

Exam Slovak MathBio

Part of INCLUDE
131 manually created math questions. Each question has four choices, exactly one of which is correct.

Slovak SQUAD

Machine translation of SQUAD 2.0 Database
140k annotated items

qa2d-sk

Slovak version of the Question to Declarative Sentence (QA2D).
Machine-translated using DeepL service.
https://arxiv.org/abs/2312.10171
70k questions and answers

Slovak BoolQ

5 000 yes-no questions
Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
Machine translated
Can also be used for summarization
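
As a rough illustration of the (question, passage, answer) triplet with an optional title, the record could be modelled as below. This is only a sketch: the class name `BoolQExample`, the field names, the `to_prompt` helper and the sample sentence are assumptions made for illustration and are not taken from the dataset itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoolQExample:
    """One yes-no example; field names are hypothetical, not the dataset schema."""
    question: str
    passage: str
    answer: bool
    title: Optional[str] = None  # optional additional context


def to_prompt(example: BoolQExample) -> str:
    """Join the available fields into a single text prompt."""
    parts = []
    if example.title:
        parts.append(f"Title: {example.title}")
    parts.append(f"Passage: {example.passage}")
    parts.append(f"Question: {example.question}")
    return "\n".join(parts)


# Made-up sample record, for illustration only.
example = BoolQExample(
    question="Je Bratislava hlavné mesto Slovenska?",
    passage="Bratislava je hlavné mesto Slovenskej republiky.",
    answer=True,
    title="Bratislava",
)
print(to_prompt(example))
```

Keeping the title optional means records without page titles can share the same structure.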

MQA

785K question-answer pairs for the Slovak language
Each example consists of the question and corresponding answer
Data are parsed from Common Crawl
Questions are divided into: frequently asked questions (FAQ) and Community Question Answering (CQA)

WebFAQ

178K question-answer pairs for the Slovak language
Data obtained from FAQ pages gathered from Common Crawl
Data also contain metadata, such as topic and question type

Morpho-syntactic

HamleDT

HArmonized Multi-LanguagE Dependency Treebank
Integrates 42 treebanks

Corpus of Slovak legislative documents (Korpus právnych predpisov v slovenčine)

The corpus contains the texts of legal regulations (both current and past) in Slovak. In addition to automatic lemmatization and morphological annotation, the corpus is also syntactically annotated.
Citation: GARABÍK, Radovan: Corpus of Slovak legislative documents. Jazykovedný časopis, 2022, Vol. 73, No 2, pp. 175-189.
45 million tokens

Morphological vocabulary (old version)

form, lemma, POS+MSD (SNK)
source: SNK

Morphological vocabulary (web demo)

form, lemma, POS+MSD (SNK)
source: SNK

Slovak Dependency Treebank

tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
manual annotation
format: conllu, PDT tagset
source: SNK

Reference:

Gajdošová, Katarína; Šimková, Mária et al., 2016, Slovak Dependency Treebank, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-1822.

Slovak Universal Dependencies

A conversion of the Slovak Dependency Treebank into the Universal Dependencies tagset.

GitHub page
tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
manual annotation
format: conllu, UD tagset
source: SNK

Reference:

Zeman, Daniel. (2017). Slovak Dependency Treebank in Universal Dependencies. Journal of Linguistics/Jazykovedný časopis. 68. 10.1515/jazcas-2017-0048.
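
The treebanks above are distributed in the conllu format, which stores one token per line in ten tab-separated columns, with blank lines between sentences and `#` comment lines. A minimal reading sketch follows; the function name `parse_conllu` and the sample sentence are made up for illustration, and the XPOS column is left as `_` rather than guessing SNK tags.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():          # blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment/metadata
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:                       # file may lack a trailing blank line
        sentences.append(current)
    return sentences


# Made-up sample sentence, for illustration only.
SAMPLE = "\n".join([
    "# text = Mám psa.",
    "1\tMám\tmať\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tpsa\tpes\tNOUN\t_\t_\t1\tobj\t_\t_",
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "",
])

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["lemma"], token["upos"])
```

The `id` column is kept as a string so that multiword-token ranges and empty nodes do not break parsing.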

Artificial Treebank with Ellipsis

tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
format: conllu
source: Slovak UD, SNK

MULTEXT-East free lexicons 4.0

form, lemma, POS (Multext East)

Parallel

ŠarišSet

Corpus of the Šariš dialect
4.7k examples.
authors: Viktória Ondrejová and Marek Šuppa

OpenSubtitles

62 languages, 1,782 bitexts
Slovak part contains 100 mil. tokens

VoxPopuli

source: Europarl
speech, vectors, language

Czech-Slovak Parallel Corpus

automatic POS (SNK)
source: Acquis, Europarl, EU-journal, EC-Europa, OPUS

English-Slovak Parallel Corpus

automatic POS (SNK)
source: Acquis, Europarl, EU-journal, EC-Europa, OPUS

English-Slovak parallel corpus of texts from The Ministry of Justice of the Slovak Republic

Dataset of various English-Slovak legal texts within the agenda of the Ministry,
in plain text format aligned at the sentence level. Size: 112580 words.
It was converted into a 2895-TU English-Slovak resource in TMX format.

English-Slovak parallel corpus of texts from The Ministry of Culture of the Slovak Republic

Dataset of various English-Slovak legal texts within the agenda of the Ministry,
in plain text format aligned at the sentence level.
Size: 105791 words.
It was converted into a 2609-TU English-Slovak resource in TMX format.
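
Both Ministry corpora above are described as TMX resources made of translation units (TUs). The sketch below shows one way such a file could be read with the Python standard library. It assumes the usual TMX layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child); the function name `read_tmx_pairs` and the sample content are made up for illustration and were not checked against the actual files.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# ElementTree exposes the predefined xml:lang attribute under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text: str) -> List[Dict[str, str]]:
    """Return one {language: segment} dict per translation unit."""
    root = ET.fromstring(tmx_text)
    pairs: List[Dict[str, str]] = []
    for tu in root.iter("tu"):
        segments: Dict[str, str] = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        pairs.append(segments)
    return pairs


# Made-up two-language sample, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          creationtool="example" creationtoolversion="1"
          adminlang="en" o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The Ministry issues the decree.</seg></tuv>
      <tuv xml:lang="sk"><seg>Ministerstvo vydáva vyhlášku.</seg></tuv>
    </tu>
  </body>
</tmx>"""

for pair in read_tmx_pairs(SAMPLE_TMX):
    print(pair["en"], "|", pair["sk"])
```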

Paracrawl

Parallel web Corpus with Slovak Part
3.3 mil sentences English-Slovak

WikiMatrix

Unsupervised processing of Wikipedia to obtain parallel corpora
Used LASER embeddings.
85 different languages, 1620 language pairs, 134M parallel sentences, out of which 34M are aligned with English

Semantic textual similarity

STSB-sk

Machine translated by OPUS-en-sk model
Sentence similarity dataset. Each example contains two sentences with a floating-point number between 0 and 5 as the target, where a higher number means higher similarity. The dataset contains train: 5 749, validation: 1 500 and test: 1 379 examples.
Referenced from this report by J. Agarský.
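
A small sketch of how such a sentence pair and its 0 to 5 target could be represented, together with the split sizes listed above. The class and field names (`StsPair`, `sentence1`, `sentence2`, `score`) are assumptions made for illustration, not the dataset's actual column names, and rescaling to the 0 to 1 range is a common convenience rather than something the dataset prescribes.

```python
from dataclasses import dataclass

# Split sizes as listed in the description above.
SPLIT_SIZES = {"train": 5749, "validation": 1500, "test": 1379}


@dataclass(frozen=True)
class StsPair:
    """One similarity example; field names are hypothetical."""
    sentence1: str
    sentence2: str
    score: float  # target similarity on the 0..5 scale


def normalized_score(pair: StsPair, max_score: float = 5.0) -> float:
    """Rescale the 0..5 target to the 0..1 range."""
    if not 0.0 <= pair.score <= max_score:
        raise ValueError(f"score {pair.score} outside 0..{max_score}")
    return pair.score / max_score


total_examples = sum(SPLIT_SIZES.values())
print(total_examples)  # 8628

# Made-up pair, for illustration only.
pair = StsPair("Muž hrá na gitare.", "Muž hrá na hudobnom nástroji.", 2.5)
print(normalized_score(pair))  # 0.5
```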

Sentiment

SentiSK

Corpus of sentiment in Slovak social media.
The dataset contains 34 006 manually annotated comments from the social media platform Facebook.
Specifically, it includes:
- 20 668 comments labeled as "negative"
- 9 581 comments labeled as "neutral"
- 3 779 comments labeled as "positive"
Author: Zuzana Sokolová
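
The class counts above are clearly imbalanced, which matters when training a classifier. The sketch below computes the class shares and inverse-frequency class weights, a common balancing heuristic that is used here as an assumption and is not described by the dataset author. Note that the three listed counts sum to 34 028, slightly more than the stated total of 34 006; the sketch uses the per-class counts as listed.

```python
from typing import Dict

# Per-class counts exactly as listed in the description above.
LABEL_COUNTS = {"negative": 20668, "neutral": 9581, "positive": 3779}


def label_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of all comments that falls into each class."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def inverse_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weights of the form total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


print(sum(LABEL_COUNTS.values()))  # 34028
for label, share in label_shares(LABEL_COUNTS).items():
    print(f"{label}: {share:.1%}")
for label, weight in inverse_frequency_weights(LABEL_COUNTS).items():
    print(f"{label}: weight {weight:.2f}")
```

With these weights the rare "positive" class counts for more per example than the dominant "negative" class.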

The Nuclear News V4 Dataset

A multilingual dataset consisting of 33,104 unique news articles sourced from 12 online news platforms across the Visegrád Group (V4) countries (Poland, Czech Republic, Slovakia, and Hungary), published between 1998 and 2025. The goal of the dataset is to analyze media narratives surrounding nuclear energy in Central Europe.
30 percent in Slovak
Annotated by language models for overall sentiment toward nuclear energy, sentiment of the headline (pessimistic ↔ optimistic), and degree of sensationalism / alarmism in the headline

Sentiment Analysis Data for the Slovak Language

5k items
positive and negative class
Reference: Samuel Pecar, Marian Simko, and Maria Bielikova. 2019. Improving Sentiment Classification in Slovak Language. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 114–119, Florence, Italy. Association for Computational Linguistics.

Slovak_sentiment

Unknown/undocumented source
positive/negative

Twitter sentiment for 15 European languages

source: Twitter
3 categories - positive, negative, neutral

SentiGrade

The dataset contains a total of 1 588 comments in Slovak from various Facebook pages. The texts are annotated with 5 categories.

STS2-sk

Machine translated
Sentiment analysis dataset, binary classification task: positive sentiment, negative sentiment. It includes reviews from 7 categories with positive, neutral and negative sentiment labels.
Source: SlovakBERT auxiliary repository by Matúš Pikuliak, Štefan Grivalský, Martin Konôpka, Miroslav Blšták, Martin Tamajka, Viktor Bachratý, Marián Šimko, Pavol Balážik, Michal Trnka, and Filip Uhlárik, 2021
Referenced from this report by J. Agarský.

sk csfd movie reviews

CSFD Movie Reviews
25k items

Massive Multilingual Sentiment Corpora

Collection of multiple datasets across various languages
Also contains data for Slovak
positive (29K) / neutral (13K) / negative (14K) labels

Hate Speech

ToxicSK

Corpus of toxic speech in social networks
The dataset contains 8 840 manually annotated comments from Facebook. Specifically, it includes:
- 4 420 comments labeled as "toxic" (value 1)
- 4 420 comments labeled as "non-toxic" (value 0)
Author: Zuzana Sokolová

Hate Speech Slovak

13k items
Crowdsourced hate and offensive speech in Facebook comments
binary classification

Fact Checking

MultiClaim

Multilingual fact checking database with Slovak part
Contains 28k posts in 27 languages from social media, 206k fact-checks in 39 languages written by professional fact-checkers, as well as 31k connections between these two groups.

Demagog

9.1k Czech, 2.8k Polish and 12.6k Slovak labeled claims with reasoning: demagog.zip (~16.5 MB)

qacg-sk

Machine translated facts with evidence represented as references to Wikipedia pages.
350k items

Biases

GEST

GEST dataset used to measure gender-stereotypical reasoning in language models and machine translation systems.

Instructions

SlovAlpaca

Machine translation of the Stanford Alpaca
40k annotations

AYA Collection

4.2 M records for the Slovak language
Machine translated using the NLLB 3.3B model

Named Entity Recognition

Contextualized Language Model-based Named Entity Recognition in Slovak Texts

Manually annotated set
Diploma thesis at Comenius University
PER, ORG, LOC, MISC annotations
approx. 7k sentences.
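
Token-level annotations with types such as PER, ORG, LOC and MISC are often stored as BIO tags (`B-PER`, `I-PER`, `O`). Whether this particular dataset uses that scheme is an assumption made here, not something stated above. Under that assumption, entity spans could be recovered roughly as follows; the function name `bio_to_spans` and the sample sentence are made up for illustration.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the span that is currently open, if any.
        if current_tokens and current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)   # continue the open span
        else:                              # "O" or anything else
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Made-up sample sentence, for illustration only.
tokens = ["Ján", "Novák", "pracuje", "v", "Bratislave", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
```

An `I-` tag whose type differs from the open span is treated as the start of a new span, which makes the function tolerant of slightly malformed tag sequences.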

Universal NER (UNER) Slovak SNK

8.48k train, 1k dev and 2k test sentences from Universal Dependencies
Human annotated by 2 annotators as part of the Universal NER project
PER, ORG, LOC annotations
AYA Instruction format

WikiGold

10k manually annotated items from Wikipedia
Repository: https://github.com/NaiveNeuron/WikiGoldSK
Paper: https://arxiv.org/abs/2304.04026

DaMuEL

A Large Multilingual Dataset for Entity Linking
Slovak part has 41.0M tokens and 1366.4k entities
named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED)

Polyglot NER

automatically annotated Wikipedia for Named Entities
massively multilingual
Slovak part has 500k sentences.
Reference: Al-Rfou, Rami, et al. "Polyglot-NER: Massive multilingual named entity recognition." Proceedings of the 2015 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2015.

WikiANN

Cross-lingual Name Tagging and Linking for 282 Languages
annotation extracted from wikipedia links.
Slovak part has 40k sentences

ju-bezdek/conll2003-SK-NER

translated version of the original CONLL2003 dataset (translated from English to Slovak via Google Translate)

CNEC 2.0 cs2sk

CNEC 2.0 Czech model machine translated to Slovak & filtered
CNEC entity hierarchy
source: JÚĽŠ SAV

Spelling

CHIBI

corpus of spelling errors created from edits in Wikipedia
spelling errors are sorted into 5 categories

Wordnet

Slovak 6D47 Wordnet

Summarization

Eur Lex Sum

Multilingual Summarization Dataset
Slovak part has 1.3k rows.

A Summarization Dataset of Slovak News Articles

200k news article summaries
Reference: Marek Suppa and Jergus Adamec. 2020. A Summarization Dataset of Slovak News Articles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6725–6730, Marseille, France. European Language Resources Association.

Slovak Gigaword

Synthetic Gigaword dataset translated to Slovak using SeamlessM4T-v2.
3.78 mil. summaries
Same as https://huggingface.co/datasets/Plasmoxy/gigatrue but translated to Slovak.

FMI_Summarization

1788 records for the Slovak language
Dataset created from European laws - EurLex

Natural Language Inference

ANLI_SK

103K samples
Translated version of ANLI using Google Translate

Topic Classification

Sib-200

Topic classification dataset consisting of 205 languages
1K samples with 7 categories (science, geography, entertainment, politics, health, travel, sports)
Data obtained from FLORES-200

Multilingual Benchmarks

Okapi MMLU

Translated version of the MMLU dataset across 30 languages, including Slovak

Multilingual Hellaswag

Translated version of the HellaSwag dataset across 35 languages, including Slovak
Translated using GPT-3.5-Turbo

Multilingual ARC

Translated version of the ARC dataset across 35 languages, including Slovak
Translated using GPT-3.5-Turbo

Multilingual TruthfulQA

Translated version of the TruthfulQA dataset across 35 languages, including Slovak
Translated using GPT-3.5-Turbo
