Pipeline Components#

spacy-wrap currently includes only two pipeline components: the "sequence_classification_transformer" for sequence classification and the "token_classification_transformer" for token classification (including named entity recognition). The components are implemented as subclasses of spacy.pipeline.Pipe and can be added to a spaCy pipeline using the nlp.add_pipe method, where each component is configured through a config dictionary.
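
A minimal sketch of this pattern is shown below. The model name is taken from the sequence classification example later in this section; the full set of config options is documented per component below.

>>> import spacy
>>> import spacy_wrap  # importing registers the components with spaCy
>>>
>>> nlp = spacy.blank("en")
>>> # each component is referenced by its registered string name and configured via a dict
>>> nlp.add_pipe(
...     "sequence_classification_transformer",
...     config={"model": {"name": "distilbert-base-uncased-finetuned-sst-2-english"}},
... )
>>> "sequence_classification_transformer" in nlp.pipe_names
True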

Sequence Classification Transformer#

spacy_wrap.pipeline_component_seq_clf.make_sequence_classification_transformer(nlp: Language, name: str, model: Model[List[Doc], FullTransformerBatch], set_extra_annotations: Callable[[List[Doc], FullTransformerBatch], None], max_batch_items: int, doc_extension_trf_data: str, doc_extension_prediction: str, labels: List[str] | None = None, assign_to_cats: bool = True)[source]#

Construct a SequenceClassificationTransformer component, which lets you plug a model from the Huggingface transformers library into spaCy so you can use it in your pipeline. The component will add a Doc extension with the name specified in the config/arguments, which you can use to access the transformer’s output.

Parameters:
  • nlp (Language) – The current nlp object.

  • name (str) – The name of the component instance.

  • model (Model[List[Doc], FullTransformerBatch]) – A thinc Model object wrapping the transformer. Usually you will want to use the SequenceClassificationTransformer layer for this.

  • set_extra_annotations (Callable[[List[Doc], FullTransformerBatch], None]) – A callback to set additional information onto the batch of Doc objects. The doc._.{doc_extension_trf_data}, doc._.{doc_extension_prediction} and doc._.{doc_extension_prediction}_prob attributes are set before the callback is called. By default, no additional annotations are set. A sketch of a custom callback is shown after the example below.

  • max_batch_items (int) – The maximum number of items to process in a batch.

  • doc_extension_trf_data (str) – The name of the Doc extension to add the transformer’s output to.

  • doc_extension_prediction (str) – The name of the Doc extension to add the transformer’s prediction to.

  • labels (List[str]) – An ordered list of the labels the transformer model outputs.

  • assign_to_cats (bool) – Whether to assign the predictions to the doc.cats dictionary. Defaults to True.

Returns:

The constructed component.

Return type:

SequenceClassificationTransformer

Example

>>> import spacy
>>> import spacy_wrap
>>>
>>> nlp = spacy.blank("en")
>>>
>>> config = {
...     "doc_extension_trf_data": "clf_trf_data",  # document extension for the forward pass
...     "doc_extension_prediction": "sentiment",  # document extension for the prediction
...     "model": {
...         "@architectures": "spacy-transformers.SequenceClassificationTransformer.v1",
...         # the name or path of a Hugging Face model
...         "name": "distilbert-base-uncased-finetuned-sst-2-english",
...     },
... }
>>>
>>> nlp.add_pipe("sequence_classification_transformer", config=config)
>>>
>>> doc = nlp("spaCy is a wonderful tool")
>>> doc.cats
{'NEGATIVE': 0.001, 'POSITIVE': 0.999}
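
As referenced above, the set_extra_annotations callback receives the batch of Doc objects together with the FullTransformerBatch from the forward pass, after the component has set doc._.{doc_extension_trf_data}, doc._.{doc_extension_prediction} and doc._.{doc_extension_prediction}_prob. The sketch below only illustrates the expected signature; the extension name "reviewed" and the callback body are made up for illustration, and wiring the callback into a config (typically via a registered function) is not shown.

>>> from typing import List
>>> from spacy.tokens import Doc
>>> from spacy_transformers.data_classes import FullTransformerBatch
>>>
>>> Doc.set_extension("reviewed", default=False)  # illustrative custom extension
>>>
>>> def set_custom_annotations(docs: List[Doc], trf_data: FullTransformerBatch) -> None:
...     # runs once per batch, after doc._.clf_trf_data, doc._.sentiment and
...     # doc._.sentiment_prob (the extensions configured above) have been set
...     for doc in docs:
...         doc._.reviewed = True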

Token Classification Transformer#

spacy_wrap.pipeline_component_tok_clf.make_token_classification_transformer(nlp: Language, name: str, model: Model[List[Doc], FullTransformerBatch], set_extra_annotations: Callable[[List[Doc], FullTransformerBatch], None], max_batch_items: int, doc_extension_trf_data: str, doc_extension_prediction: str, aggregation_strategy: Literal['first', 'average', 'max'], labels: List[str] | None = None, predictions_to: List[Literal['pos', 'tag', 'ents']] | None = None) → TokenClassificationTransformer[source]#

Construct a TokenClassificationTransformer component, which lets you plug a model from the Huggingface transformers library into spaCy so you can use it in your pipeline. One or more subsequent spaCy components can use the transformer outputs as features in their models, with gradients backpropagated to the single shared weights.

Parameters:
  • nlp (Language) – The current nlp object.

  • name (str) – The name of the component instance.

  • model (Model[List[Doc], FullTransformerBatch]) – A thinc Model object wrapping the transformer. Usually you will want to use the ClassificationTransformer layer for this.

  • set_extra_annotations (Callable[[List[Doc], FullTransformerBatch], None]) – A callback to set additional information onto the batch of Doc objects. The doc._.{doc_extension_trf_data}, doc._.{doc_extension_prediction} and doc._.{doc_extension_prediction}_prob attributes are set before the callback is called. By default, no additional annotations are set.

  • max_batch_items (int) – The maximum number of items to process in a batch.

  • doc_extension_trf_data (str) – The name of the doc extension to store the transformer data in.

  • doc_extension_prediction (str) – The name of the doc extension to store the predictions in.

  • aggregation_strategy (Literal["first", "average", "max"]) – The strategy used to aggregate token-level predictions into word-level labels, chosen to correspond to the aggregation strategies used in the TokenClassificationPipeline in Huggingface: https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy. “first”: a word takes the tag of its first token when there is ambiguity. “average”: scores are first averaged across tokens, and then the label with the maximum score is applied. “max”: a word takes the label of its highest-scoring token.

  • labels (List[str]) – An ordered list of the labels the transformer model outputs.

  • predictions_to (Optional[List[Literal["pos", "tag", "ents"]]]) – A list of attributes the predictions should be written to. Defaults to None, in which case it is inferred from the labels: if the labels are UPOS tags, the predictions are written to the “pos” attribute; if the labels are IOB tags, the predictions are written to the “ents” attribute. “tag” is never inferred from the labels, but can be added explicitly. Note that if the “pos” attribute is set, the labels must be UPOS tags, and if the “ents” attribute is set, the labels must be IOB tags. A configuration sketch using this argument is shown after the example below.

Returns:

The constructed component.

Return type:

TokenClassificationTransformer

Example

>>> import spacy
>>> import spacy_wrap
>>>
>>> nlp = spacy.blank("en")
>>> nlp.add_pipe("token_classification_transformer", config={
...     "model": {"name": "vblagoje/bert-english-uncased-finetuned-pos"}}
... )
>>> doc = nlp("My name is Wolfgang and I live in Berlin")
>>> print([tok.pos_ for tok in doc])
['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']
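
The aggregation_strategy and predictions_to arguments described above can likewise be set through the config dictionary. The sketch below assumes an NER checkpoint from the Hugging Face Hub whose labels are IOB tags; the model name "dslim/bert-base-NER" is an illustrative assumption, not an example taken from the library's documentation.

>>> import spacy
>>> import spacy_wrap
>>>
>>> nlp = spacy.blank("en")
>>> config = {
...     # an NER checkpoint whose labels are IOB tags (illustrative assumption)
...     "model": {"name": "dslim/bert-base-NER"},
...     # how word-level labels are derived from token-level predictions
...     "aggregation_strategy": "first",
...     # write predictions to doc.ents explicitly rather than inferring from the labels
...     "predictions_to": ["ents"],
... }
>>> nlp.add_pipe("token_classification_transformer", config=config)
>>> doc = nlp("My name is Wolfgang and I live in Berlin")
>>> doc.ents  # the entity spans predicted by the transformer, e.g. Wolfgang and Berlin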