Datasets

Descriptions

The following table describes every dataset in the benchmark, along with its main score, its task type, the languages it covers, and some statistics. The domains follow the categories used in the Universal Dependencies project.

Dataset	Description	Main Score	Languages	Type	Domains	Number of Documents	Mean Length of Documents (characters)
Angry Tweets	A sentiment dataset with 3 classes (positiv, negativ, neutral) for Danish tweets	Accuracy	da	Classification	social	1047	156.15 (std: 82.02)
Bornholm Parallel	Danish Bornholmsk Parallel Corpus. Bornholmsk is a Danish dialect spoken on the island of Bornholm, Denmark. Historically, it is part of East Danish, which was also spoken in Scania and Halland, Sweden.	F1	da, da-bornholm	BitextMining	poetry, wiki, fiction, web, social	1000	44.36 (std: 41.22)
DKHate	Danish tweets annotated for hate speech as either offensive or not	Accuracy	da	Classification	social	329	88.18 (std: 168.30)
Da Political Comments	A dataset of Danish political comments rated for sentiment	Accuracy	da	Classification	social	7206	69.60 (std: 62.85)
DaLAJ	A Swedish dataset for linguistic acceptability. Available as a part of Superlim.	Accuracy	sv	Classification	fiction, non-fiction	888	120.77 (std: 67.95)
DanFEVER	A Danish dataset intended for misinformation research. It follows the same format as the English FEVER dataset.	Ndcg_at_10	da	Retrieval	wiki, non-fiction	8897	124.84 (std: 168.53)
LCC	The Leipzig Corpora Collection, annotated for sentiment	Accuracy	da	Classification	legal, web, news, social, fiction, non-fiction, academic, government	150	118.73 (std: 57.82)
Language Identification	A dataset for Nordic language identification.	Accuracy	da, sv, nb, nn, is, fo	Classification	wiki	3000	78.23 (std: 48.54)
Massive Intent	MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages	Accuracy	da, nb, sv	Classification	spoken	15021	34.65 (std: 16.99)
Massive Scenario	MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages	Accuracy	da, nb, sv	Classification	spoken	15021	34.65 (std: 16.99)
NoReC	A Norwegian dataset for sentiment classification on reviews	Accuracy	nb	Classification	reviews	2048	89.62 (std: 61.21)
NorQuad	Human-created questions for Norwegian Wikipedia passages.	Ndcg_at_10	nb	Retrieval	non-fiction, wiki	2602	502.19 (std: 875.23)
Norwegian courts	Nynorsk and Bokmål parallel corpus from Norwegian courts. Norway has two standardised written languages. Bokmål is a variant closer to Danish, while Nynorsk was created to resemble regional dialects of Norwegian.	F1	nb, nn	BitextMining	legal, non-fiction	456	82.11 (std: 49.48)
Norwegian parliament	Norwegian parliament speeches annotated with the party of the speaker (`Sosialistisk Venstreparti` vs `Fremskrittspartiet`)	Accuracy	nb	Classification	spoken	2400	1897.51 (std: 1988.62)
SNL Clustering	Web-scraped articles from the Norwegian lexicon 'Det Store Norske Leksikon'. Uses article categories as clusters.	V_measure	nb	Clustering	non-fiction, wiki	2048	1101.30 (std: 2168.35)
SNL Retrieval	Web-scraped articles and ingresses from the Norwegian lexicon 'Det Store Norske Leksikon'.	Ndcg_at_10	nb	Retrieval	non-fiction, wiki	2600	1001.43 (std: 2537.83)
ScaLA	A linguistic acceptability task for Danish, Norwegian Bokmål, Norwegian Nynorsk and Swedish.	Accuracy	da, nb, sv, nn	Classification	fiction, news, non-fiction, spoken, blog	8192	102.45 (std: 55.49)
SweFAQ	A Swedish QA dataset derived from FAQs	Ndcg_at_10	sv	Retrieval	non-fiction, web	1024	195.44 (std: 209.33)
SweReC	A Swedish dataset for sentiment classification on reviews	Accuracy	sv	Classification	reviews	2048	318.83 (std: 499.57)
SwednClustering	The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000–2020. The articles are filtered to resemble the CNN/DailyMail dataset with regard to textual structure. This dataset uses the category labels as clusters.	V_measure	sv	Clustering	non-fiction, news	2048	1619.71 (std: 2220.36)
SwednRetrieval	News Article Summary Semantic Similarity Estimation.	Ndcg_at_10	sv	Retrieval	non-fiction, news	3070	1946.35 (std: 3071.98)
TV2Nord Retrieval	News articles and corresponding summaries extracted from the Danish newspaper TV2 Nord.	Ndcg_at_10	da	Retrieval	news, non-fiction	4096	784.11 (std: 982.97)
Twitterhjerne	Danish questions asked on Twitter with the hashtag #Twitterhjerne ('Twitter brain') and their corresponding answers.	Ndcg_at_10	da	Retrieval	social	340	138.23 (std: 82.41)
VG Clustering	Articles and their classes (e.g. sports) from VG news articles extracted from Norsk Aviskorpus.	V_measure	nb	Clustering	non-fiction, news	2048	1009.65 (std: 1597.60)
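
The last two columns of the table (number of documents and mean document length in characters, with its standard deviation) can be reproduced for any list of texts. The following is a minimal sketch and not the benchmark's own code: the function names `describe_documents` and `format_stats` are hypothetical, and the use of the population standard deviation is an assumption, since the table does not state which variant was used.

```python
import statistics


def describe_documents(documents):
    """Return (count, mean length, std of length), with lengths in characters."""
    lengths = [len(doc) for doc in documents]
    mean_length = statistics.mean(lengths)
    # Assumption: population standard deviation. The benchmark may instead
    # use the sample standard deviation (statistics.stdev).
    std_length = statistics.pstdev(lengths)
    return len(lengths), mean_length, std_length


def format_stats(documents):
    """Format the statistics in the same style as the table above."""
    count, mean_length, std_length = describe_documents(documents)
    return f"{count} documents, {mean_length:.2f} (std: {std_length:.2f})"


if __name__ == "__main__":
    # Toy documents for illustration only; not taken from any benchmark dataset.
    toy_documents = ["Hej verden", "God morgen", "Tak"]
    print(format_stats(toy_documents))
```

Because lengths are counted with `len`, the figures are in Unicode code points rather than bytes or tokens, which matters for characters such as å, ø and æ.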

Dataset Disclaimer

We do not own or host any of the datasets that we use for this benchmark.
We only refer to existing datasets that we believe we are free to redistribute. If any doubt arises about the legality of any of our file downloads, we will take them down right away once we have been contacted.

Notice and take down policy

Notice: Should you consider that data used by the dataset contains material that is owned by you and should therefore not be reproduced here, please:
Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
Clearly identify the copyrighted work claimed to be infringed.
Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
And contact the 'Scandinavian Embedding Benchmark' at the following ticket service: https://frontoffice.chcaa.au.dk/hc/en-us/requests/new

We will comply with legitimate requests by removing the affected sources from the next release of the benchmark.