Datasets

Descriptions

The following table describes all the datasets in the benchmark, along with their main score, task type, languages covered, and some statistics for each dataset. The domains follow the categories used in the Universal Dependencies project.

| Dataset | Description | Main Score | Languages | Type | Domains | Number of Documents | Mean Length of Documents (characters) |
|---|---|---|---|---|---|---|---|
| Angry Tweets | A sentiment dataset with 3 classes (positiv, negativ, neutral) for Danish tweets | Accuracy | da | Classification | social | 1047 | 156.15 (std: 82.02) |
| Bornholm Parallel | Danish Bornholmsk Parallel Corpus. Bornholmsk is a Danish dialect spoken on the island of Bornholm, Denmark. Historically it is a part of East Danish, which was also spoken in Scania and Halland, Sweden. | F1 | da, da-bornholm | BitextMining | poetry, wiki, fiction, web, social | 1000 | 44.36 (std: 41.22) |
| DKHate | Danish tweets annotated for hate speech as either offensive or not | Accuracy | da | Classification | social | 329 | 88.18 (std: 168.30) |
| Da Political Comments | A dataset of Danish political comments rated for sentiment | Accuracy | da | Classification | social | 7206 | 69.60 (std: 62.85) |
| DaLAJ | A Swedish dataset for linguistic acceptability. Available as a part of Superlim. | Accuracy | sv | Classification | fiction, non-fiction | 888 | 120.77 (std: 67.95) |
| DanFEVER | A Danish dataset intended for misinformation research. It follows the same format as the English FEVER dataset. | Ndcg_at_10 | da | Retrieval | wiki, non-fiction | 8897 | 124.84 (std: 168.53) |
| LCC | The Leipzig Corpora Collection, annotated for sentiment | Accuracy | da | Classification | legal, web, news, social, fiction, non-fiction, academic, government | 150 | 118.73 (std: 57.82) |
| Language Identification | A dataset for Nordic language identification. | Accuracy | da, sv, nb, nn, is, fo | Classification | wiki | 3000 | 78.23 (std: 48.54) |
| Massive Intent | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Accuracy | da, nb, sv | Classification | spoken | 15021 | 34.65 (std: 16.99) |
| Massive Scenario | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Accuracy | da, nb, sv | Classification | spoken | 15021 | 34.65 (std: 16.99) |
| NoReC | A Norwegian dataset for sentiment classification on reviews | Accuracy | nb | Classification | reviews | 2048 | 89.62 (std: 61.21) |
| NorQuad | Human-created questions for Norwegian Wikipedia passages. | Ndcg_at_10 | nb | Retrieval | non-fiction, wiki | 2602 | 502.19 (std: 875.23) |
| Norwegian courts | Nynorsk and Bokmål parallel corpus from Norwegian courts. Norway has two standardised written languages. Bokmål is a variant closer to Danish, while Nynorsk was created to resemble regional dialects of Norwegian. | F1 | nb, nn | BitextMining | legal, non-fiction | 456 | 82.11 (std: 49.48) |
| Norwegian parliament | Norwegian parliament speeches annotated with the party of the speaker (Sosialistisk Venstreparti vs Fremskrittspartiet) | Accuracy | nb | Classification | spoken | 2400 | 1897.51 (std: 1988.62) |
| SNL Clustering | Webscraped articles from the Norwegian lexicon 'Det Store Norske Leksikon'. Uses article categories as clusters. | V_measure | nb | Clustering | non-fiction, wiki | 2048 | 1101.30 (std: 2168.35) |
| SNL Retrieval | Webscraped articles and ingresses from the Norwegian lexicon 'Det Store Norske Leksikon'. | Ndcg_at_10 | nb | Retrieval | non-fiction, wiki | 2600 | 1001.43 (std: 2537.83) |
| ScaLA | A linguistic acceptability task for Danish, Norwegian Bokmål, Norwegian Nynorsk and Swedish. | Accuracy | da, nb, sv, nn | Classification | fiction, news, non-fiction, spoken, blog | 8192 | 102.45 (std: 55.49) |
| SweFAQ | A Swedish QA dataset derived from FAQs | Ndcg_at_10 | sv | Retrieval | non-fiction, web | 1024 | 195.44 (std: 209.33) |
| SweReC | A Swedish dataset for sentiment classification on reviews | Accuracy | sv | Classification | reviews | 2048 | 318.83 (std: 499.57) |
| SwednClustering | The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000-2020. The articles are filtered to resemble the CNN/DailyMail dataset regarding textual structure. This dataset uses the category labels as clusters. | V_measure | sv | Clustering | non-fiction, news | 2048 | 1619.71 (std: 2220.36) |
| SwednRetrieval | News Article Summary Semantic Similarity Estimation. | Ndcg_at_10 | sv | Retrieval | non-fiction, news | 3070 | 1946.35 (std: 3071.98) |
| TV2Nord Retrieval | News articles and corresponding summaries extracted from the Danish newspaper TV2 Nord. | Ndcg_at_10 | da | Retrieval | news, non-fiction | 4096 | 784.11 (std: 982.97) |
| Twitterhjerne | Danish questions asked on Twitter with the hashtag #Twitterhjerne ('Twitter brain') and their corresponding answers. | Ndcg_at_10 | da | Retrieval | social | 340 | 138.23 (std: 82.41) |
| VG Clustering | Articles and their classes (e.g. sports) from VG news articles extracted from Norsk Aviskorpus. | V_measure | nb | Clustering | non-fiction, news | 2048 | 1009.65 (std: 1597.60) |
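
The length statistics in the last column above can be reproduced for any list of documents with a few lines of Python. This is only a minimal sketch: the helper name `length_stats` and the toy documents are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the benchmark reports the population or the sample standard deviation.

```python
from statistics import mean, pstdev


def length_stats(documents):
    """Return (mean, std) of document lengths measured in characters.

    Assumption: population standard deviation (pstdev). Swap in
    statistics.stdev if the sample standard deviation is wanted.
    """
    lengths = [len(doc) for doc in documents]
    return mean(lengths), pstdev(lengths)


# Hypothetical toy documents, not taken from any benchmark dataset.
docs = ["Hej verden", "God morgen", "Tak"]
avg, std = length_stats(docs)

# Format the result the same way as the table column.
print(f"{avg:.2f} (std: {std:.2f})")
```

A standard deviation larger than the mean, as seen for several datasets above, indicates a skewed length distribution with a few very long documents.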

Dataset Licenses

| Dataset | License |
|---|---|
| Angry Tweets | CC-BY-4.0 |
| Bornholm Parallel | CC-BY-4.0 |
| DKHate | CC-BY-4.0 |
| Da Political Comments | |
| DaLAJ | CC-BY-4.0 |
| DanFEVER | CC-BY-4.0 |
| LCC | CC-BY-4.0 |
| Massive Scenario | CC-BY-4.0 |
| NoReC | CC-BY-NC-4.0 |
| NorQuad | CC0-1.0 |
| Norwegian courts | MIT |
| Norwegian parliament | CC-BY-4.0 |
| SNL Clustering | CC-BY-NC |
| SNL Retrieval | CC-BY-NC |
| ScaLA | CC-BY-SA-4.0 |
| SweFAQ | CC-BY-4.0 |
| SweReC | CC-BY-4.0 |
| SwednClustering | CC-BY-4.0 |
| SwednRetrieval | CC-BY-4.0 |
| TV2Nord Retrieval | Apache 2.0 |
| Twitterhjerne | CC-BY-4.0 |
| VG Clustering | CC-BY-NC |

Dataset Disclaimer

  • We do not own or host any of the datasets which we use for this benchmark.
  • We only refer to existing datasets that we believe we are free to redistribute. If any doubt arises about the legality of any of our file downloads, we will take them down right away after being contacted.

Notice and take down policy

Notice: Should you consider that data used by the dataset contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • Contact the 'Scandinavian Embedding Benchmark' at the following ticket service: https://frontoffice.chcaa.au.dk/hc/en-us/requests/new

We will comply with legitimate requests by removing the affected sources from the next release of the benchmark.