A curated list of resources for processing the Slovak language.
- Slovak resources by Essential Data
- Slovak speech and language processing at KEMT FEI TUKE with tools, demos and language resources.
- Slovak National Corpus
- The Slovak portion of the FineWeb2 dataset.
- 14.1 billion words across more than 26.5 million documents.
- Sourced from 96 CommonCrawl snapshots, spanning summer 2013 to April 2024, and processed using datatrove.
- Created for training the mT5 model.
- The Slovak part is 67 GB.
- Available in TensorFlow format and HuggingFace JSON format.
- Can be downloaded from allenai/allennlp#5056 using git LFS.
- automatic POS (SNK)
- source: web
- can be downloaded from Clarin
- deduplicated
- source: Common Crawl
- automatic lemmatization, MSD, POS (AUT, ensemble lemmatization+POS)
- source: web
- no annotation
- twitter part
- web corpora
- no annotation
- no deduplication
- lemmatized, POS+MSD+syntactically tagged, deduplicated corpus: https://www.juls.savba.sk/hpltskcorp.html
- Manually annotated clone of SQuAD 2.0
- Contains "unanswerable questions"
- 92k items
- Part of INCLUDE
- 131 manually created math questions. One choice of 4 is correct.
- Machine translation of the SQuAD 2.0 database
- 140k annotated items
- Slovak version of the Question to Declarative Sentence (QA2D).
- Machine-translated using DeepL service.
- https://arxiv.org/abs/2312.10171
- 70k questions and answers
- 5 000 yes-no questions
- Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context.
- Machine translated
- Can also be used for summarization
- 785K question-answer pairs for the Slovak language
- Each example consists of the question and corresponding answer
- Data are parsed from Common Crawl
- Questions are divided into: frequently asked questions (FAQ) and Community Question Answering (CQA)
- 178K question-answer pairs for the Slovak language
- Data obtained from FAQ pages gathered from Common Crawl
- Data also contains metadata, such as topic, question type
- HArmonized Multi-LanguagE Dependency Treebank
- Integrates 42 treebanks
- The corpus contains texts of legal regulations (current and past) in Slovak. In addition to automatic lemmatization and morphological annotation, the corpus is also syntactically annotated.
- Citation: GARABÍK, Radovan: Corpus of Slovak legislative documents. Jazykovedný časopis, 2022, Vol. 73, No 2, pp. 175-189.
- 45 million tokens
- form, lemma, POS+MSD (SNK)
- source: SNK
- form, lemma, POS+MSD (SNK)
- source: SNK
- tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
- manual annotation
- format: conllu, PDT tagset
- source: SNK
Reference:
- Gajdošová, Katarína; Šimková, Mária et al., 2016, Slovak Dependency Treebank, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-1822.
A conversion of the Slovak Dependency Treebank to the Universal Dependencies tagset.
- GitHub page
- tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
- manual annotation
- format: conllu, UD tagset
- source: SNK
Reference:
- Zeman, Daniel. (2017). Slovak Dependency Treebank in Universal Dependencies. Journal of Linguistics/Jazykovedný časopis. 68. 10.1515/jazcas-2017-0048.
- tokenization, segmentation, UPOS, XPOS (SNK), UD, lemma
- format: conllu
- source: Slovak UD, SNK
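
The treebanks above are distributed in the CoNLL-U format, which stores one token per line in ten tab-separated columns. The following is a minimal sketch of reading such data with the Python standard library only. The sample sentence and the function name `parse_conllu` are illustrative assumptions, not taken from any of the listed corpora, and the XPOS and FEATS columns are left as `_` placeholders.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]

# Hypothetical sample sentence (not from a real treebank).
SAMPLE = (
    "# text = Peter číta knihu.\n"
    "1\tPeter\tPeter\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tčíta\tčítať\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tknihu\tkniha\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
lemmas = [token["lemma"] for token in sentences[0]]
print(lemmas)
```

For real work a dedicated CoNLL-U reader is probably a better choice, since this sketch does not treat multiword token ranges or empty nodes specially.
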
- form, lemma, POS (Multext East)
- Corpus of the Šariš dialect
- 4.7k examples.
- authors: Viktória Ondrejová and Marek Šuppa
- 62 languages, 1,782 bitexts
- Slovak part contains 100 mil. tokens
- source: Europarl
- speech, vectors, language
- automatic POS (SNK)
- source: Acquis, Europarl, EU-journal, EC-Europa, OPUS
- automatic POS (SNK)
- source: Acquis, Europarl, EU-journal, EC-Europa, OPUS
- Dataset of various English-Slovak legal texts within the agenda of the Ministry.
- Plain text format aligned at the sentence level; size: 112580 words.
- It was converted into an English-Slovak resource of 2895 TUs in TMX format.
- Dataset of various English-Slovak legal texts within the agenda of the Ministry.
- Plain text format aligned at the sentence level.
- Size: 105791 words.
- It was converted into an English-Slovak resource of 2609 TUs in TMX format.
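
TMX files such as the two resources above store each translation unit (TU) as a `<tu>` element with one `<tuv>` per language. The following is a minimal sketch of extracting English-Slovak sentence pairs with the Python standard library. The inline sample document, its sentences, and the function name `read_pairs` are assumptions made for illustration and do not come from the datasets themselves.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit TMX document (not real dataset content).
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>This Act enters into force on 1 January.</seg></tuv>
      <tuv xml:lang="sk"><seg>Tento zákon nadobúda účinnosť 1. januára.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>The Ministry issues the decree.</seg></tuv>
      <tuv xml:lang="sk"><seg>Ministerstvo vydáva vyhlášku.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text: str, src: str = "en", tgt: str = "sk") -> List[Tuple[str, str]]:
    """Return (source, target) segment pairs from a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs: List[Tuple[str, str]] = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        # Keep only units that have both languages.
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


pairs = read_pairs(SAMPLE_TMX)
for english, slovak in pairs:
    print(english, "|", slovak)
```

Real TMX files may use regional language codes such as `en-GB` or `sk-SK`; in that case the `src` and `tgt` arguments would need to match those codes exactly.
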
- Parallel web Corpus with Slovak Part
- 3.3 mil sentences English-Slovak
- Unsupervised processing of Wikipedia to obtain parallel corpora
- Used LASER embeddings.
- 85 different languages, 1620 language pairs, 134M parallel sentences, out of which 34M are aligned with English
- Machine translated by OPUS-en-sk model
- Sentence similarity dataset: each example contains two sentences and a floating-point target between 0 and 5, where a higher number means higher similarity. The dataset contains 5 749 train, 1 500 validation and 1 379 test examples.
- Referenced from this report by J. Agarský.
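
The sentence similarity format described above (two sentences plus a score between 0 and 5) can be sketched as a small Python record. The class name `STSExample`, its field names, and the example sentences are hypothetical; only the score range and the split sizes come from the description above.

```python
from dataclasses import dataclass

MAX_SCORE = 5.0  # upper bound of the similarity scale stated above


@dataclass(frozen=True)
class STSExample:
    """One sentence pair with a gold similarity score in [0, 5]."""
    sentence1: str
    sentence2: str
    score: float

    def __post_init__(self) -> None:
        if not 0.0 <= self.score <= MAX_SCORE:
            raise ValueError(f"score {self.score} is outside [0, {MAX_SCORE}]")

    def normalized(self) -> float:
        """Rescale the score to [0, 1], e.g. for cosine-similarity training."""
        return self.score / MAX_SCORE


# Hypothetical example pair (not taken from the dataset).
example = STSExample("Muž hrá na gitare.", "Muž hrá na hudobnom nástroji.", 2.5)

# Split sizes as listed in the dataset description.
split_sizes = {"train": 5749, "validation": 1500, "test": 1379}
total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

print(example.normalized(), total)
```

Rescaling to the unit interval is one common convention when fine-tuning sentence encoders; whether it suits a given model is an assumption to check against that model's training setup.
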
- Corpus of sentiment in Slovak social media.
- The dataset contains 34 006 manually annotated comments from the social media platform Facebook.
- Specifically, it includes:
- 20 668 comments labeled as "negative"
- 9 581 comments labeled as "neutral"
- 3 779 comments labeled as "positive"
- Author: Zuzana Sokolová
- A multilingual dataset consisting of 33,104 unique news articles
- sourced from 12 online news platforms across the Visegrád Group (V4) countries — Poland, Czech Republic, Slovakia, and Hungary — published between 1998 and 2025. The goal of the dataset is to analyze media narratives surrounding nuclear energy in Central Europe.
- 30 percent in Slovak
- annotated by language models for overall sentiment toward nuclear energy, sentiment of the headline (pessimistic ↔ optimistic), and degree of sensationalism / alarmism in the headline
- 5k items
- positive and negative class
- Reference: Samuel Pecar, Marian Simko, and Maria Bielikova. 2019. Improving Sentiment Classification in Slovak Language. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 114–119, Florence, Italy. Association for Computational Linguistics.
- Unknown/undocumented source
- positive/negative
- source: Twitter
- 3 categories - positive, negative, neutral
- The dataset contains a total of 1 588 comments in Slovak from various Facebook pages. The texts are annotated with 5 categories.
- Machine translated
- Sentiment analysis dataset, binary classification task: positive sentiment, negative sentiment. It includes reviews from 7 categories with positive, neutral and negative sentiment labels.
- Source: SlovakBERT auxiliary repository by Matúš Pikuliak, Štefan Grivalský, Martin Konôpka, Miroslav Blšták, Martin Tamajka, Viktor Bachratý, Marián Šimko, Pavol Balážik, Michal Trnka, and Filip Uhlárik, 2021
- Referenced from this report by J. Agarský.
- CSFD Movie Reviews
- 25k items
- Collection of multiple datasets across various languages
- Also contains data for Slovak
- positive (29K) / neutral (13K) / negative (14K) labels
- Corpus of toxic speech in social networks
- The dataset contains 8 840 manually annotated comments from Facebook. Specifically, it includes:
- 4 420 comments labeled as "toxic" (value 1)
- 4 420 comments labeled as "non-toxic" (value 0)
- Author: Zuzana Sokolová
- 13k items
- Crowdsourced hate and offensive speech in Facebook comments
- binary classification
- Multilingual fact checking database with Slovak part
- Contains 28k posts in 27 languages from social media, 206k fact-checks in 39 languages written by professional fact-checkers, as well as 31k connections between these two groups.
- 9.1k Czech, 2.8k Polish and 12.6k Slovak labeled claims with reasoning: demagog.zip (~16.5 MB)
- Machine-translated facts with evidence represented as references to Wikipedia pages.
- 350k items
- GEST dataset used to measure gender-stereotypical reasoning in language models and machine translation systems.
- Machine translation of the Stanford Alpaca
- 40k annotations
- 4.2 M records for the Slovak language
- Machine translated using the NLLB 3.3B model
- Manually annotated set
- Diploma thesis at Comenius University
- PER, ORG, LOC, MISC annotations
- approx. 7k sentences.
- 8.48k train, 1k dev and 2k test sentences from Universal Dependencies
- Human annotated by 2 annotators as part of the Universal NER project
- PER, ORG, LOC annotations
- AYA Instruction format
- 10k manually annotated items from Wikipedia
- Repository: https://github.com/NaiveNeuron/WikiGoldSK
- Paper: https://arxiv.org/abs/2304.04026
- A Large Multilingual Dataset for Entity Linking
- Slovak part has 41.0M tokens and 1366.4k entities
- named entity types (PER, ORG, LOC, EVENT, BRAND, WORK_OF_ART, MANUFACTURED
- automatically annotated Wikipedia for Named Entities
- massively multilingual
- Slovak part has 500k sentences.
- Reference: Al-Rfou, Rami, et al. "Polyglot-NER: Massive multilingual named entity recognition." Proceedings of the 2015 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2015.
- Cross-lingual Name Tagging and Linking for 282 Languages
- annotation extracted from Wikipedia links.
- Slovak part has 40k sentences
- Translated version of the original CoNLL-2003 dataset (translated from English to Slovak via Google Translate)
- CNEC 2.0 Czech model machine translated to Slovak & filtered
- CNEC entity hierarchy
- source: JÚĽŠ SAV
- Translated version of the original CoNLL-2003 dataset (translated from English to Slovak via Google Translate)
- corpus of spelling errors created from edits in Wikipedia
- spelling errors are sorted into 5 categories.
- Multilingual Summarization Dataset
- Slovak part has 1.3k rows.
- 200k of news article summaries
- Reference: Marek Suppa and Jergus Adamec. 2020. A Summarization Dataset of Slovak News Articles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6725–6730, Marseille, France. European Language Resources Association.
- Synthetic Gigaword dataset translated to Slovak using SeamlessM4T-v2.
- 3.78 mil. summaries
- Same as https://huggingface.co/datasets/Plasmoxy/gigatrue but translated to Slovak.
- 1788 records for the Slovak language
- Dataset created from European laws - EurLex
- 103K samples
- Translated version of ANLI using Google Translate
- Topic classification dataset consisting of 205 languages
- 1K samples with 7 categories (science, geography, entertainment, politics, health, travel, sports)
- Data obtained from the FLORES-200
- Translated version of the MMLU dataset across 30 languages, including Slovak
- Translated version of the HellaSwag dataset across 35 languages, including Slovak
- Translated using GPT-3.5-Turbo
- Translated version of the ARC dataset across 35 languages, including Slovak
- Translated using GPT-3.5-Turbo
- Translated version of the TruthfulQA dataset across 35 languages, including Slovak
- Translated using GPT-3.5-Turbo