Using the CLI¶
SEB comes with a simple CLI for running models. This section shows a minimal example of how to use it; for more detail, see the CLI documentation. To get a list of available commands, run:
%%bash
seb --help
Available commands: run Runs the Benchmark either on specified models or on all registered mod...
For more on a specific command, call seb {command} --help. To run a model using the CLI:
%%bash
seb run -m all-MiniLM-L6-v2 --output-path model_results/
INFO:seb.cli.run:Model registered in SEB. Loading from registry.
Running all-MiniLM-L6-v2: 0%| | 0/1 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on Angry Tweets: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on LCC: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on Bornholm Parallel: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on DKHate: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on Da Political Comments: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on Massive Intent: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on Massive Scenario: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on ScaLA: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on Language Identification: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on NoReC: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on Norwegian parliament: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on VGSummarizationClustering: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on SweReC: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on DaLAJ: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on SweFAQ: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2 on SwednClustering: 0%| | 0/16 [00:00<?, ?it/s]
Running all-MiniLM-L6-v2: 100%|██████████| 1/1 [00:00<00:00, 25.99it/s]
ERROR:seb.benchmark:Error when running VGSummarizationClustering on embed-multilingual-v3.0: Cache for embed-multilingual-v3.0 on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on embed-multilingual-v3.0: Cache for embed-multilingual-v3.0 on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on paraphrase-multilingual-MiniLM-L12-v2: Cache for paraphrase-multilingual-MiniLM-L12-v2 on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on paraphrase-multilingual-MiniLM-L12-v2: Cache for paraphrase-multilingual-MiniLM-L12-v2 on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on paraphrase-multilingual-mpnet-base-v2: Cache for paraphrase-multilingual-mpnet-base-v2 on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on paraphrase-multilingual-mpnet-base-v2: Cache for paraphrase-multilingual-mpnet-base-v2 on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on sentence-bert-swedish-cased: Cache for sentence-bert-swedish-cased on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on sentence-bert-swedish-cased: Cache for sentence-bert-swedish-cased on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on electra-small-nordic: Cache for electra-small-nordic on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on electra-small-nordic: Cache for electra-small-nordic on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on DanskBERT: Cache for DanskBERT on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on DanskBERT: Cache for DanskBERT on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on dfm-encoder-large-v1: Cache for dfm-encoder-large-v1 on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on dfm-encoder-large-v1: Cache for dfm-encoder-large-v1 on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on nb-bert-large: Cache for nb-bert-large on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on nb-bert-large: Cache for nb-bert-large on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on nb-bert-base: Cache for nb-bert-base on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on nb-bert-base: Cache for nb-bert-base on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on bert-base-swedish-cased: Cache for bert-base-swedish-cased on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on bert-base-swedish-cased: Cache for bert-base-swedish-cased on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on electra-small-swedish-cased-discriminator: Cache for electra-small-swedish-cased-discriminator on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on electra-small-swedish-cased-discriminator: Cache for electra-small-swedish-cased-discriminator on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on xlm-roberta-base: Cache for xlm-roberta-base on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on xlm-roberta-base: Cache for xlm-roberta-base on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on dfm-sentence-encoder-large-1: Cache for dfm-sentence-encoder-large-1 on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on dfm-sentence-encoder-large-1: Cache for dfm-sentence-encoder-large-1 on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on dfm-sentence-encoder-large-exp1: Cache for dfm-sentence-encoder-large-exp1 on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on dfm-sentence-encoder-large-exp1: Cache for dfm-sentence-encoder-large-exp1 on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on dfm-sentence-encoder-small-v1: Cache for dfm-sentence-encoder-small-v1 on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on dfm-sentence-encoder-small-v1: Cache for dfm-sentence-encoder-small-v1 on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on dfm-sentence-encoder-medium-v1: Cache for dfm-sentence-encoder-medium-v1 on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on dfm-sentence-encoder-medium-v1: Cache for dfm-sentence-encoder-medium-v1 on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on dfm-sentence-encoder-large-exp2-no-lang-align: Cache for dfm-sentence-encoder-large-exp2-no-lang-align on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on dfm-sentence-encoder-large-exp2-no-lang-align: Cache for dfm-sentence-encoder-large-exp2-no-lang-align on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on e5-small: Cache for e5-small on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on e5-small: Cache for e5-small on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on e5-base: Cache for e5-base on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on e5-base: Cache for e5-base on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on e5-large: Cache for e5-large on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on e5-large: Cache for e5-large on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on multilingual-e5-base: Cache for multilingual-e5-base on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on multilingual-e5-base: Cache for multilingual-e5-base on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on multilingual-e5-large: Cache for multilingual-e5-large on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on multilingual-e5-large: Cache for multilingual-e5-large on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on e5-mistral-7b-instruct: Cache for e5-mistral-7b-instruct on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on e5-mistral-7b-instruct: Cache for e5-mistral-7b-instruct on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on sonar-dan: Cache for sonar-dan on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on sonar-dan: Cache for sonar-dan on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on sonar-swe: Cache for sonar-swe on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on sonar-swe: Cache for sonar-swe on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on sonar-nob: Cache for sonar-nob on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on sonar-nob: Cache for sonar-nob on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on sonar-nno: Cache for sonar-nno on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on sonar-nno: Cache for sonar-nno on SwednClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running VGSummarizationClustering on text-embedding-ada-002: Cache for text-embedding-ada-002 on VGSummarizationClustering does not exist. Set run_model=True to run the model.
ERROR:seb.benchmark:Error when running SwednClustering on text-embedding-ada-002: Cache for text-embedding-ada-002 on SwednClustering does not exist. Set run_model=True to run the model.
Benchmark Results
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Rank ┃ Model                   ┃ Average Score ┃ Average Rank ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ 1    │ multilingual-e5-small   │ 0.53          │ 9.72         │
│ 2    │ NEW: all-MiniLM-L6-v2   │ 0.40          │ 22.12        │
│ 3    │ embed-multilingual-v3.0 │ nan           │ 5.39         │
└──────┴─────────────────────────┴───────────────┴──────────────┘
To run the benchmark on all models, or only on a subset of tasks, see the CLI documentation.
Running a task¶
To run a task, you need to fetch the task and a model, and then run the model on the task.
import seb
model = seb.get_model("jonfd/electra-small-nordic")
task = seb.get_task("DKHate")
# initialize benchmark with tasks
benchmark = seb.Benchmark(tasks=[task])
# benchmark the model
benchmark_result = benchmark.evaluate_model(model)
benchmark_result # examine output
BenchmarkResults(meta=ModelMeta(name='electra-small-nordic', description=None, huggingface_name='jonfd/electra-small-nordic', reference='https://huggingface.co/jonfd/electra-small-nordic', languages=['da', 'nb', 'sv', 'nn'], open_source=True, embedding_size=256), task_results=[TaskResult(task_name='DKHate', task_description='Danish Tweets annotated for Hate Speech either being Offensive or not', task_version='1.0.3.dev0', time_of_run=datetime.datetime(2023, 7, 30, 13, 55, 38, 480327), scores={'da': {'accuracy': 0.5945288753799393, 'f1': 0.4912211182797449, 'ap': 0.8950480900418238, 'accuracy_stderr': 0.07818347662767612, 'f1_stderr': 0.05511334661624392, 'ap_stderr': 0.013877821318913264, 'main_score': 0.5945288753799393}}, main_score='accuracy')])
benchmark_result[0] # examine the results for the first task
TaskResult(task_name='DKHate', task_description='Danish Tweets annotated for Hate Speech either being Offensive or not', task_version='1.0.3.dev0', time_of_run=datetime.datetime(2023, 7, 30, 13, 55, 38, 480327), scores={'da': {'accuracy': 0.5945288753799393, 'f1': 0.4912211182797449, 'ap': 0.8950480900418238, 'accuracy_stderr': 0.07818347662767612, 'f1_stderr': 0.05511334661624392, 'ap_stderr': 0.013877821318913264, 'main_score': 0.5945288753799393}}, main_score='accuracy')
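The `scores` dictionary in the output above is keyed first by language and then by metric. As a minimal sketch of how to pull out the headline number per language, the snippet below works on a plain dictionary copied from that output rather than on a live `TaskResult`. The helper name `main_scores_by_language` is hypothetical and not part of `seb`.

```python
# Scores copied from the DKHate output above: {language: {metric: value}}.
scores = {
    "da": {
        "accuracy": 0.5945288753799393,
        "f1": 0.4912211182797449,
        "ap": 0.8950480900418238,
        "main_score": 0.5945288753799393,
    }
}


def main_scores_by_language(
    scores: dict[str, dict[str, float]], main_score: str
) -> dict[str, float]:
    """Return the value of the task's main metric for every language."""
    return {lang: metrics[main_score] for lang, metrics in scores.items()}


# 'accuracy' is the main_score reported for DKHate in the output above.
print(main_scores_by_language(scores, main_score="accuracy"))
# {'da': 0.5945288753799393}
```

With a real result object, the same dictionary should be reachable through the `scores` field shown in the printed `TaskResult`.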
Reproducing the Benchmark¶
To reproduce the benchmark, run the following:
models = [seb.get_model("all-MiniLM-L6-v2")]
# for simplicity, we will only run it with one model, but you could run it with multiple models:
# models = seb.get_all_models()
full_benchmark = seb.Benchmark()
results = full_benchmark.evaluate_models(models=models)
Running all-MiniLM-L6-v2: 100%|██████████| 1/1 [00:00<00:00, 175.16it/s]
This runs the full benchmark on all the specified models and all the registered datasets. Note that all benchmark results are cached and included as part of the package, so you won't have to rerun results that have already been computed.
mdl_result_on_benchmark = results[0] # results for the first model
mdl_result_on_benchmark[0] # results for the first task
TaskResult(task_name='DKHate', task_description='Danish Tweets annotated for Hate Speech either being Offensive or not', task_version='1.1.0', time_of_run=datetime.datetime(2023, 7, 31, 15, 19, 48, 879189), scores={'da': {'accuracy': 0.5504559270516718, 'f1': 0.4487544754943351, 'ap': 0.8825715897823836, 'accuracy_stderr': 0.08179003177509295, 'f1_stderr': 0.04439449341359171, 'ap_stderr': 0.008146255235874632, 'main_score': 0.5504559270516718}}, main_score='accuracy')
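The results are indexed first by model and then by task, as the two indexing steps above show. The following sketch flattens that nesting into a `{model: {task: main score}}` mapping. It uses small stand-in dataclasses that mirror only the fields visible in the printed output (`meta.name`, `task_results`, `task_name`, `scores`); they are not the real `seb` classes, and averaging over languages is an assumption made for illustration.

```python
from dataclasses import dataclass


# Stand-ins mirroring only the fields visible in the printed output;
# these are NOT the real seb classes.
@dataclass
class FakeModelMeta:
    name: str


@dataclass
class FakeTaskResult:
    task_name: str
    scores: dict[str, dict[str, float]]


@dataclass
class FakeBenchmarkResults:
    meta: FakeModelMeta
    task_results: list[FakeTaskResult]


def summarize(results: list[FakeBenchmarkResults]) -> dict[str, dict[str, float]]:
    """Map model name -> task name -> main score, averaged over languages."""
    summary: dict[str, dict[str, float]] = {}
    for model_result in results:
        per_task: dict[str, float] = {}
        for task_result in model_result.task_results:
            per_language = [m["main_score"] for m in task_result.scores.values()]
            per_task[task_result.task_name] = sum(per_language) / len(per_language)
        summary[model_result.meta.name] = per_task
    return summary


# Values copied from the DKHate result printed above.
results = [
    FakeBenchmarkResults(
        meta=FakeModelMeta(name="all-MiniLM-L6-v2"),
        task_results=[
            FakeTaskResult(
                task_name="DKHate",
                scores={"da": {"main_score": 0.5504559270516718}},
            )
        ],
    )
]
summary = summarize(results)
print(summary)
```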
Adding a model¶
The benchmark uses a registry to add models. A model in seb consists of two things: 1) a metadata object (seb.ModelMeta) describing the model, and 2) a loader for the model itself, which is an object with an encode method as described by seb.ModelInterface. Here is a minimal example of how to add a new model:
from typing import Any

import numpy as np
from sentence_transformers import SentenceTransformer

import seb

model_name = "sentence-transformers/all-MiniLM-L6-v2"


class MyEncoder(seb.Encoder):
    """
    A custom model for SEB that uses the SentenceTransformer library.
    """

    def __init__(self):
        self.model = SentenceTransformer(model_name)

    def encode(  # type: ignore
        self,
        sentences: list[str],
        *,
        task: seb.Task,
        **kwargs: Any,
    ) -> np.ndarray:
        if task.name == "DKHate":  # allows you to embed differently based on the task
            emb = self.model.encode(sentences, batch_size=32, **kwargs)
        else:
            emb = self.model.encode(sentences, batch_size=32, **kwargs)  # here we just do the same for all tasks
        return emb


@seb.models.register(model_name)  # add the model to the registry
def create_my_model() -> seb.SebModel:
    hf_name = model_name

    # create metadata
    meta = seb.ModelMeta(
        name=hf_name.split("/")[-1],
        huggingface_name=hf_name,
        reference=f"https://huggingface.co/{hf_name}",  # f-string so that hf_name is interpolated
        languages=[],
        embedding_size=384,
    )
    return seb.SebModel(
        encoder=MyEncoder(),
        meta=meta,
    )
Note that if you want to use the CLI with one of your own models, you can import registered functions from a file specified using the --code flag.
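As a rough sketch, the command line for such a run can be assembled in Python from the flags that appear in this section (`-m`, `--output-path`, `--code`). The file name `my_models.py` and the helper `build_seb_run_command` are hypothetical, and the exact way the flags combine is an assumption; check `seb run --help` for the authoritative syntax.

```python
import subprocess
from typing import Optional


def build_seb_run_command(
    model_name: str, output_path: str, code_path: Optional[str] = None
) -> list[str]:
    """Assemble the argument list for `seb run` using the flags shown in this section."""
    args = ["seb", "run", "-m", model_name, "--output-path", output_path]
    if code_path is not None:
        # Assumption: --code takes the path of the file with the registration code.
        args += ["--code", code_path]
    return args


command = build_seb_run_command(
    "sentence-transformers/all-MiniLM-L6-v2",  # the name used with the register decorator above
    "model_results/",
    code_path="my_models.py",  # hypothetical file containing the registration code
)
print(" ".join(command))

# Uncomment to actually run the benchmark (requires seb to be installed):
# subprocess.run(command, check=True)
```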