Datasets
Descriptions
The following table describes every dataset in the benchmark, along with its main score, its task type, the languages it covers, and some summary statistics. The domains follow the categories used in the Universal Dependencies project.
Dataset | Description | Main Score | Languages | Type | Domains | Number of Documents | Mean Length of Documents (characters) |
---|---|---|---|---|---|---|---|
Angry Tweets | A sentiment dataset with 3 classes (positiv, negativ, neutral) for Danish tweets | Accuracy | da | Classification | social | 1047 | 156.15 (std: 82.02) |
Bornholm Parallel | Danish Bornholmsk Parallel Corpus. Bornholmsk is a Danish dialect spoken on the island of Bornholm, Denmark. Historically, it is part of East Danish, which was also spoken in Scania and Halland, Sweden. | F1 | da, da-bornholm | BitextMining | poetry, wiki, fiction, web, social | 1000 | 44.36 (std: 41.22) |
DKHate | Danish tweets annotated for hate speech as either offensive or not offensive | Accuracy | da | Classification | social | 329 | 88.18 (std: 168.30) |
Da Political Comments | A dataset of Danish political comments rated for sentiment | Accuracy | da | Classification | social | 7206 | 69.60 (std: 62.85) |
DaLAJ | A Swedish dataset for linguistic acceptability. Available as a part of Superlim. | Accuracy | sv | Classification | fiction, non-fiction | 888 | 120.77 (std: 67.95) |
DanFEVER | A Danish dataset intended for misinformation research. It follows the same format as the English FEVER dataset. | Ndcg_at_10 | da | Retrieval | wiki, non-fiction | 8897 | 124.84 (std: 168.53) |
LCC | The Leipzig Corpora Collection, annotated for sentiment | Accuracy | da | Classification | legal, web, news, social, fiction, non-fiction, academic, government | 150 | 118.73 (std: 57.82) |
Language Identification | A dataset for Nordic language identification. | Accuracy | da, sv, nb, nn, is, fo | Classification | wiki | 3000 | 78.23 (std: 48.54) |
Massive Intent | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Accuracy | da, nb, sv | Classification | spoken | 15021 | 34.65 (std: 16.99) |
Massive Scenario | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Accuracy | da, nb, sv | Classification | spoken | 15021 | 34.65 (std: 16.99) |
NoReC | A Norwegian dataset for sentiment classification on reviews | Accuracy | nb | Classification | reviews | 2048 | 89.62 (std: 61.21) |
NorQuad | Human-created questions for Norwegian Wikipedia passages. | Ndcg_at_10 | nb | Retrieval | non-fiction, wiki | 2602 | 502.19 (std: 875.23) |
Norwegian courts | Nynorsk and Bokmål parallel corpus from Norwegian courts. Norway has two standardised written languages. Bokmål is a variant closer to Danish, while Nynorsk was created to resemble regional dialects of Norwegian. | F1 | nb, nn | BitextMining | legal, non-fiction | 456 | 82.11 (std: 49.48) |
Norwegian parliament | Norwegian parliament speeches annotated with the party of the speaker (Sosialistisk Venstreparti vs Fremskrittspartiet) | Accuracy | nb | Classification | spoken | 2400 | 1897.51 (std: 1988.62) |
SNL Clustering | Web-scraped articles from the Norwegian lexicon 'Det Store Norske Leksikon'. Uses article categories as clusters. | V_measure | nb | Clustering | non-fiction, wiki | 2048 | 1101.30 (std: 2168.35) |
SNL Retrieval | Web-scraped articles and ingresses from the Norwegian lexicon 'Det Store Norske Leksikon'. | Ndcg_at_10 | nb | Retrieval | non-fiction, wiki | 2600 | 1001.43 (std: 2537.83) |
ScaLA | A linguistic acceptability task for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. | Accuracy | da, nb, sv, nn | Classification | fiction, news, non-fiction, spoken, blog | 8192 | 102.45 (std: 55.49) |
SweFAQ | A Swedish QA dataset derived from FAQs | Ndcg_at_10 | sv | Retrieval | non-fiction, web | 1024 | 195.44 (std: 209.33) |
SweReC | A Swedish dataset for sentiment classification on reviews | Accuracy | sv | Classification | reviews | 2048 | 318.83 (std: 499.57) |
SwednClustering | The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000-2020. The articles are filtered to resemble the CNN/DailyMail dataset in terms of textual structure. This dataset uses the category labels as clusters. | V_measure | sv | Clustering | non-fiction, news | 2048 | 1619.71 (std: 2220.36) |
SwednRetrieval | News Article Summary Semantic Similarity Estimation. | Ndcg_at_10 | sv | Retrieval | non-fiction, news | 3070 | 1946.35 (std: 3071.98) |
TV2Nord Retrieval | News articles and corresponding summaries extracted from the Danish newspaper TV2 Nord. | Ndcg_at_10 | da | Retrieval | news, non-fiction | 4096 | 784.11 (std: 982.97) |
Twitterhjerne | Danish questions asked on Twitter with the hashtag #Twitterhjerne ('Twitter brain') and their corresponding answers. | Ndcg_at_10 | da | Retrieval | social | 340 | 138.23 (std: 82.41) |
VG Clustering | Articles and their classes (e.g. sports) from VG news articles extracted from Norsk Aviskorpus. | V_measure | nb | Clustering | non-fiction, news | 2048 | 1009.65 (std: 1597.60) |
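The last two columns of the table report a document count and a mean document length in characters with its standard deviation. As a rough illustration of how such figures can be derived, here is a minimal sketch. The function names `describe_documents` and `format_length` and the sample texts are hypothetical, and the use of the population standard deviation is an assumption; the benchmark's own statistics may be computed differently.

```python
import statistics


def describe_documents(documents: list[str]) -> dict[str, float]:
    """Return the document count and character-length statistics."""
    # len() counts Unicode code points, so 'å' counts as one character.
    lengths = [len(doc) for doc in documents]
    return {
        "n_documents": len(lengths),
        "mean_length": statistics.mean(lengths),
        # Assumption: population standard deviation; the benchmark
        # may use a different estimator.
        "std_length": statistics.pstdev(lengths),
    }


def format_length(stats: dict[str, float]) -> str:
    """Format the statistics in the style of the table's last column."""
    return f"{stats['mean_length']:.2f} (std: {stats['std_length']:.2f})"


# Hypothetical sample documents, not taken from any benchmark dataset.
sample = ["Hej verden", "God morgen", "Tak"]
stats = describe_documents(sample)
print(stats["n_documents"], "|", format_length(stats))
```

A standard deviation larger than the mean, as in several rows above, suggests a long-tailed length distribution with a few very long documents.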
Dataset Licenses
Dataset | License |
---|---|
Angry Tweets | CC-BY-4.0 |
Bornholm Parallel | CC-BY-4.0 |
DKHate | CC-BY-4.0 |
Da Political Comments | |
DaLAJ | CC-BY-4.0 |
DanFEVER | CC-BY-4.0 |
LCC | CC-BY-4.0 |
Massive Scenario | CC-BY-4.0 |
NoReC | CC-BY-NC-4.0 |
NorQuad | CC0-1.0 |
Norwegian courts | MIT |
Norwegian parliament | CC-BY-4.0 |
SNL Clustering | CC-BY-NC |
SNL Retrieval | CC-BY-NC |
ScaLA | CC-BY-SA-4.0 |
SweFAQ | CC-BY-4.0 |
SweReC | CC-BY-4.0 |
SwednClustering | CC-BY-4.0 |
SwednRetrieval | CC-BY-4.0 |
TV2Nord Retrieval | Apache-2.0 |
Twitterhjerne | CC-BY-4.0 |
VG Clustering | CC-BY-NC |
Dataset Disclaimer
- We do not own or host any of the datasets which we use for this benchmark.
- We only refer to existing datasets that we believe we are free to redistribute. If any doubt arises about the legality of any of our file downloads, we will take them down right away once you contact us.
Notice and take down policy
Notice: Should you consider that data used by the benchmark contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
- Contact the 'Scandinavian Embedding Benchmark' at the following ticket service: https://frontoffice.chcaa.au.dk/hc/en-us/requests/new
We will comply with legitimate requests by removing the affected sources from the next release of the benchmark.