Token-based#

augmenty.token.casing#

augmenty.token.casing.create_conditional_token_casing_augmenter_v1(conditional: Callable, level: float, lower: bool | None = None, upper: bool | None = None) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter that conditionally cases the first letter of a token based on the conditional. Either lower or upper needs to be specified as True.

Parameters:
  • level (float) – The probability of applying the augmentation.

  • conditional (Callable) – A function which takes a spaCy Token and returns a bool indicating whether the token should be cased.

  • lower (Optional[bool], optional) – If the conditional returns True, should the first letter be lowercased? Defaults to None.

  • upper (Optional[bool], optional) – If the conditional returns True, should the first letter be uppercased? Defaults to None.

Returns:

The augmenter.

Return type:

Callable[[Language, Example], Iterator[Example]]

Example

>>> def is_pronoun(token):
...     if token.pos_ == "PRON":
...         return True
...     return False
>>> aug = augmenty.load("conditional_token_casing_v1", level=1, lower=True,
...                     conditional=is_pronoun)
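The transformation itself is simple to sketch in pure Python, independent of spaCy. The snippet below is illustrative only, not the library implementation; `is_pronoun_word` is a hypothetical string-based stand-in for the POS-based predicate above:

```python
# Illustrative sketch of conditional first-letter lowercasing on plain
# strings. `is_pronoun_word` is a hypothetical stand-in for a POS-based
# predicate on spaCy tokens.
PRONOUNS = {"he", "she", "it", "they", "i", "we", "you"}

def is_pronoun_word(word: str) -> bool:
    return word.lower() in PRONOUNS

def conditional_lowercase(tokens: list) -> list:
    """Lowercase the first letter of every token matching the condition."""
    return [
        t[0].lower() + t[1:] if t and is_pronoun_word(t) else t
        for t in tokens
    ]

print(conditional_lowercase(["He", "walked", "His", "dog"]))
```

Note that only tokens for which the condition holds are touched; "His" is left unchanged because it is not in the toy pronoun set.
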
augmenty.token.casing.create_starting_case_augmenter_v1(level: float) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter which randomly cases the first letter in each token.

Parameters:

level (float) – Probability to randomly case the first letter of a token.

Returns:

The augmenter.

Return type:

Callable[[Language, Example], Iterator[Example]]

Example

>>> import augmenty
>>> from spacy.lang.en import English
>>> nlp = English()
>>> augmenter = augmenty.load("random_starting_case_v1", level=0.5)
>>> texts = ["one two three"]
>>> list(augmenty.texts(texts, augmenter, nlp))
["one Two Three"]

augmenty.token.replace#

augmenty.token.replace.create_token__dict_replace_augmenter_v1(level: float, replace: ~typing.Dict[str, ~typing.List[str]] | ~typing.Dict[str, ~typing.Dict[str, ~typing.List[str]]], ignore_casing: bool = True, getter: ~typing.Callable[[~spacy.tokens.token.Token], str] = <function <lambda>>, keep_titlecase: bool = True) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter that swaps a token with a replacement (e.g. a synonym) based on a dictionary.

Parameters:
  • level (float) – Probability to replace token given that it is in synonym dictionary.

  • replace (Union[Dict[str, List[str]], Dict[str, Dict[str, List[str]]]]) – A dictionary of words and a list of their replacement (e.g. synonyms) or a dictionary denoting replacement based on pos tag.

  • ignore_casing (bool, optional) – Should the lookup ignore casing? Defaults to True.

  • getter (Callable[[Token], str], optional) – A getter function to extract the POS-tag.

  • keep_titlecase (bool) – Should the augmenter keep the title case of the replaced word? Defaults to True.

Returns:

The augmenter.

Return type:

Callable[[Language, Example], Iterator[Example]]

Examples

>>> replace = {"act": ["perform", "move", "action"]}
>>> aug = create_token__dict_replace_augmenter_v1(replace=replace, level=0.10)
>>> # or, with POS-specific replacements:
>>> replace = {"act": {"VERB": ["perform", "move"], "NOUN": ["action", "deed"]}}
>>> aug = create_token__dict_replace_augmenter_v1(replace=replace, level=0.10)
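To make the described behaviour concrete (case-insensitive lookup plus title-case preservation), here is a minimal pure-Python sketch; it is not the library implementation:

```python
import random

# Illustrative sketch of dictionary-based token replacement with
# case-insensitive lookup and title-case preservation.
replace = {"act": ["perform", "move", "action"]}

def dict_replace(tokens, table, level, keep_titlecase=True, seed=0):
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        candidates = table.get(tok.lower())  # case-insensitive lookup
        if candidates and rng.random() < level:
            new = rng.choice(candidates)
            if keep_titlecase and tok.istitle():
                new = new.title()  # "Act" -> e.g. "Perform", not "perform"
            out.append(new)
        else:
            out.append(tok)
    return out

print(dict_replace(["Act", "now"], replace, level=1.0))
```

Tokens absent from the dictionary ("now") are always passed through unchanged.
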
augmenty.token.replace.create_token_replace_augmenter_v1(replace: Callable[[Token], str], keep_titlecase: bool = True) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter which replaces a token based on a replace function.

Parameters:
  • level (float) – Probability to replace a token.

  • replace (Callable[[Token], str]) – A callable which takes a spaCy Token as input and returns the replacement word as a string.

  • keep_titlecase (bool, optional) – If the original token was title-cased, should the replacement also be? Defaults to True.

Returns:

The augmenter.

Return type:

Callable[[Language, Example], Iterator[Example]]

Examples

>>> def remove_vowels(token):
...     vowels = ['a', 'e', 'i', 'o', 'u', 'y']
...     non_vowels = [c for c in token.text if c.lower() not in vowels]
...     return ''.join(non_vowels)
>>> aug = create_token_replace_augmenter(replace=remove_vowels, level=.10)
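Applied to plain strings, the `remove_vowels` callable above behaves as follows (a sketch without spaCy; the real augmenter receives `Token` objects and uses `token.text`):

```python
# Same logic as the doctest example above, operating on plain strings.
vowels = {'a', 'e', 'i', 'o', 'u', 'y'}

def remove_vowels_str(word: str) -> str:
    """Drop every vowel (case-insensitively) from a word."""
    return ''.join(c for c in word if c.lower() not in vowels)

print([remove_vowels_str(w) for w in "the quick brown fox".split()])
```
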
augmenty.token.replace.create_word_embedding_augmenter_v1(level=<class 'float'>, n: int = 10, nlp: ~spacy.language.Language | None = None, keep_titlecase: bool = True, ignore_casing: bool = True) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter which replaces a token with one of its most similar words based on word embeddings.

Parameters:
  • level (float) – Probability to replace a token.

  • n (int, optional) – Number of most similar word vectors to sample from. Defaults to 10.

  • nlp (Optional[Language], optional) – A spaCy text-processing pipeline used for supplying the word vectors if the pipeline applied doesn't contain word vectors. Defaults to None.

  • keep_titlecase (bool, optional) – If the original token was title-cased, should the replacement also be? Defaults to True.

  • ignore_casing (bool, optional) – The word embedding augmenter does not replace a word with the same word. Should this check ignore casing? Defaults to True.

Returns:

The augmenter.

Return type:

Callable[[Language, Example], Iterator[Example]]

Examples

>>> import spacy
>>> nlp = spacy.load('en_core_web_lg')
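The full example requires a pipeline with word vectors (e.g. `en_core_web_lg`). The underlying idea, replacing a token with one of its `n` most similar words by vector similarity, can be sketched without spaCy using a toy embedding table (illustrative only; real usage draws vectors from the pipeline):

```python
import math

# Toy embedding table; real usage draws vectors from the spaCy pipeline.
vectors = {
    "cat": (1.0, 0.1),
    "dog": (0.9, 0.2),
    "kitten": (0.95, 0.05),
    "car": (0.0, 1.0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(word, n=2):
    """Return the n words with the highest cosine similarity, excluding
    the word itself (the augmenter never replaces a word with itself)."""
    v = vectors[word]
    ranked = sorted(
        (w for w in vectors if w != word),
        key=lambda w: cosine(v, vectors[w]),
        reverse=True,
    )
    return ranked[:n]

print(most_similar("cat"))
```

The augmenter then samples one of these `n` candidates as the replacement.
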
augmenty.token.replace.create_wordnet_synonym_augmenter_v1(level: float, lang: str | None = None, respect_pos: bool = True, getter: ~typing.Callable = <function <lambda>>, keep_titlecase: bool = True) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter that swaps a token with a synonym based on WordNet.

Parameters:
  • lang (Optional[str], optional) – Language, supplied as an ISO 639-1 language code. Defaults to None, in which case the lang is based on the language of the spaCy nlp pipeline used. Possible language codes include: “da”, “ca”, “en”, “eu”, “fa”, “fi”, “fr”, “gl”, “he”, “id”, “it”, “ja”, “nn”, “no”, “pl”, “pt”, “es”, “th”.

  • level (float) – Probability to replace token given that it is in synonym dictionary.

  • respect_pos (bool, optional) – Should POS-tag be respected? Defaults to True.

  • getter (Callable[[Token], str], optional) – A getter function to extract the POS-tag.

  • keep_titlecase (bool) – Should the augmenter keep the title case of the replaced word? Defaults to True.

Returns:

The augmenter.

Return type:

Callable[[Language, Example], Iterator[Example]]

Example

>>> english_synonym_augmenter = create_wordnet_synonym_augmenter_v1(level=0.1,
...                                                                 lang="en")

augmenty.token.spacing#

augmenty.token.spacing.create_letter_spacing_augmenter_v1(level: float) Callable[[Language, Example], Iterator[Example]][source]#

Typically casing is used to add emphasis to words, but letter spacing has also been used to add e m p h a s i s to words (e.g. by Grundtvig; Baunvig, Jarvis and Nielbo, 2020). This augmenter randomly adds letter spacing emphasis to words. The augmentation remains human-readable, but is clearly challenging for systems using a whitespace-centric tokenization.

Parameters:

level (float) – The probability to add Grundtvigian letter spacing emphasis to a word.

Returns:

The augmenter.

Return type:

Callable[[Language, Example], Iterator[Example]]
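A minimal sketch of the transformation itself (not the library implementation): with probability `level`, a word is spelled out with spaces between its letters:

```python
import random

def letter_space(tokens, level, seed=0):
    # With probability `level`, add letter spacing (e m p h a s i s)
    # to a word; otherwise pass it through unchanged.
    rng = random.Random(seed)
    return [" ".join(t) if rng.random() < level else t for t in tokens]

print(letter_space(["add", "emphasis", "here"], level=1.0))
```
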

augmenty.token.spacing.create_spacing_insertion_augmenter_v1(level: float, max_insertions: int = 1) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter that randomly adds a space after a chara cter. Tokens are kept the same.

Parameters:
  • level (float) – The probability to add a space after a character.

  • max_insertions (int, optional) – Maximum number of insertions per word. Defaults to 1.

Returns:

The augmenter.

Return type:

Callable[[Language, Example], Iterator[Example]]
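The per-character insertion logic can be sketched as follows (illustrative only, not the library implementation): walk the characters of a word and, with probability `level` after each one, insert a space, stopping after `max_insertions` insertions:

```python
import random

def spacing_insertion(word, level, max_insertions=1, seed=0):
    """With probability `level` after each character, insert a space,
    up to `max_insertions` times per word."""
    rng = random.Random(seed)
    out, inserted = [], 0
    for ch in word:
        out.append(ch)
        if inserted < max_insertions and rng.random() < level:
            out.append(" ")
            inserted += 1
    return "".join(out)

print(spacing_insertion("character", level=0.5))
```

Removing the inserted spaces always recovers the original word, which is why the tokens (and their annotations) can be kept the same.
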

augmenty.token.swap#

augmenty.token.swap.create_token_swap_augmenter_v1(level: float, respect_ents: bool = True, respect_sentences: bool = True) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter that randomly swaps two neighbouring tokens.

Parameters:
  • level (float) – The probability to swap two tokens.

  • respect_ents (bool, optional) – Should the augmenter respect entities? Defaults to True, in which case it will not swap a token inside an entity with a token outside the entity span, unless the entity is a one-word span. If False, it will disregard correcting the entity labels.

  • respect_sentences (bool, optional) – Should it respect end-of-sentence boundaries? Defaults to True, indicating that it will not swap an end-of-sentence token. If False, it will disregard correcting the sentence start, as this becomes arbitrary.

Returns:

The augmenter.

Return type:

Callable[[Language, Example], Iterator[Example]]
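The core swap operation can be sketched in pure Python (illustrative only; the real augmenter additionally handles entity spans and sentence boundaries as described above):

```python
import random

def swap_neighbours(tokens, level, seed=0):
    """With probability `level`, swap a token with its right neighbour."""
    rng = random.Random(seed)
    tokens = list(tokens)
    i = 0
    while i < len(tokens) - 1:
        if rng.random() < level:
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
            i += 2  # skip past the swapped pair so a token moves at most once
        else:
            i += 1
    return tokens

print(swap_neighbours(["one", "two", "three"], level=1.0))
```
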