Token-based#

augmenty.token.casing#

augmenty.token.casing.create_conditional_token_casing_augmenter_v1(conditional: Callable[[Token], bool], level: float, lower: Optional[bool] = None, upper: Optional[bool] = None) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter that conditionally cases the first letter of a token based on the output of the conditional function. Either lower or upper needs to be specified as True.

Parameters:
  • level – The probability of casing the first letter of a token.

  • conditional – A function that takes a token and returns True if the token should be cased.

  • lower – If the conditional returns True, should the first letter be lowercased?

  • upper – If the conditional returns True, should the first letter be uppercased?

Returns:

The augmenter.

Example

>>> def is_pronoun(token):
...     if token.pos_ == "PRON":
...         return True
...     return False
>>> aug = augmenty.load("conditional_token_casing_v1", level=1, lower=True,
...                     conditional=is_pronoun)
augmenty.token.casing.create_starting_case_augmenter_v1(level: float) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter which randomly cases the first letter in each token.

Parameters:

level – Probability to randomly case the first letter of a token.

Returns:

The augmenter.

Example

>>> import augmenty
>>> from spacy.lang.en import English
>>> nlp = English()
>>> augmenter = augmenty.load("random_starting_case_v1", level=0.5)
>>> texts = ["one two three"]
>>> list(augmenty.texts(texts, augmenter, nlp))
["one Two Three"]

augmenty.token.replace#

augmenty.token.replace.create_token_dict_replace_augmenter_v1(level: float, replace: ~typing.Union[~typing.Dict[str, ~typing.List[str]], ~typing.Dict[str, ~typing.Dict[str, ~typing.List[str]]]], ignore_casing: bool = True, getter: ~typing.Callable[[~spacy.tokens.token.Token], str] = <function <lambda>>, keep_titlecase: bool = True) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter that swaps a token with a synonym based on a dictionary.

Parameters:
  • level – Probability of replacing a token, given that it is in the replacement dictionary.

  • replace – A dictionary of words and a list of their replacements (e.g. synonyms), or a dictionary denoting replacements based on POS tag.

  • ignore_casing – When doing the lookup, should the augmenter ignore casing?

  • getter – A getter function to extract the POS tag.

  • keep_titlecase – Should the augmenter keep the title case of the replaced word?

Returns:

The augmenter.

Examples

>>> replace = {"act": ["perform", "move", "action"], }
>>> create_token_dict_replace_augmenter(replace=replace, level=.10)
>>> # or
>>> replace = {"act": {"VERB": ["perform", "move"], "NOUN": ["action", "deed"]}}
>>> create_token_dict_replace_augmenter(replace=replace, level=.10)
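To make the lookup behaviour concrete, here is a hypothetical, augmenty-independent sketch of dictionary-based replacement with `ignore_casing` and `keep_titlecase` (the function `dict_replace` and its exact logic are illustrative assumptions, not part of augmenty; the real augmenter operates on spaCy Example objects):

```python
import random

def dict_replace(tokens, replace, level, ignore_casing=True,
                 keep_titlecase=True, seed=0):
    # Hypothetical sketch: replace each token found in the dictionary
    # with probability `level`, optionally preserving title case.
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        key = tok.lower() if ignore_casing else tok
        if key in replace and rng.random() < level:
            new = rng.choice(replace[key])
            if keep_titlecase and tok.istitle():
                new = new.title()
            out.append(new)
        else:
            out.append(tok)
    return out

print(dict_replace(["Act", "now"], {"act": ["perform"]}, level=1.0))
# → ['Perform', 'now']  (title case is preserved)
```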
augmenty.token.replace.create_token_replace_augmenter_v1(level: float, replace: Callable[[Token], str], keep_titlecase: bool = True) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter which replaces a token based on a replace function.

Parameters:
  • level – Probability of replacing a token.

  • replace – A callable which takes a spaCy Token as input and returns the replacement word as a string.

  • keep_titlecase – If the original token was title-cased, should the replacement be title-cased as well?

Returns:

The augmenter.

Examples

>>> def remove_vowels(token):
...    vowels = ['a','e','i','o','u', 'y']
...    non_vowels = [c for c in token.text if c.lower() not in vowels]
...    return ''.join(non_vowels)
>>> aug = create_token_replace_augmenter(replace=remove_vowels, level=.10)
augmenty.token.replace.create_word_embedding_augmenter_v1(level: float, n: int = 10, nlp: ~typing.Optional[~spacy.language.Language] = None, keep_titlecase: bool = True, ignore_casing: bool = True) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter which replaces a token with one of its most similar words based on word embeddings.

Parameters:
  • level – Probability of replacing a token with one of its most similar words.

  • n – Number of most similar word vectors to sample from.

  • nlp – A spaCy pipeline used to supply the word vectors if the pipeline applying the augmenter does not contain word vectors.

  • keep_titlecase – If the original token was title-cased, should the replacement be title-cased as well?

  • ignore_casing – The word embedding augmenter does not replace a word with the same word. Should this check ignore casing?

Returns:

The augmenter.

Examples

>>> import spacy
>>> nlp = spacy.load('en_core_web_lg')
>>> aug = create_word_embedding_augmenter(nlp=nlp, level=.10)
augmenty.token.replace.create_wordnet_synonym_augmenter_v1(level: float, lang: ~typing.Optional[str] = None, respect_pos: bool = True, getter: ~typing.Callable = <function <lambda>>, keep_titlecase: bool = True) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter that swaps a token with a synonym derived from WordNet.

Parameters:
  • lang – Language supplied as an ISO 639-1 language code. If None, the language is based on the language of the spaCy nlp pipeline used. Possible language codes include: “da”, “ca”, “en”, “eu”, “fa”, “fi”, “fr”, “gl”, “he”, “id”, “it”, “ja”, “nn”, “no”, “pl”, “pt”, “es”, “th”.

  • level – Probability of replacing a token, given that a synonym is found.

  • respect_pos – Should the POS tag be respected?

  • getter – A getter function to extract the POS tag.

  • keep_titlecase – Should the augmenter keep the title case of the replaced word?

Returns:

The augmenter.

Example

>>> english_synonym_augmenter = create_wordnet_synonym_augmenter(level=0.1,
...                                                              lang="en")

augmenty.token.spacing#

augmenty.token.spacing.create_letter_spacing_augmenter_v1(level: float) Callable[[Language, Example], Iterator[Example]][source]#

Typically, casing is used to add emphasis to words, but letter spacing has also been used to add e m p h a s i s to words (e.g. by Grundtvig; Baunvig, Jarvis and Nielbo, 2020). This augmenter randomly adds letter-spacing emphasis to words. The resulting augmentations are human readable, but clearly challenging for systems using a whitespace-centric tokenization.

Parameters:

level – The probability of adding Grundtvigian letter-spacing emphasis to a word.

Returns:

The augmenter.
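As an illustration of the effect, here is a hypothetical, augmenty-independent sketch of letter-spacing emphasis (the function `letter_space` is an illustrative assumption, not part of augmenty; the real augmenter works on spaCy Example objects, not raw strings):

```python
import random

def letter_space(text, level, seed=0):
    # Hypothetical sketch: each whitespace-separated word is, with
    # probability `level`, rewritten with a space between every letter.
    rng = random.Random(seed)
    words = []
    for word in text.split(" "):
        words.append(" ".join(word) if rng.random() < level else word)
    return " ".join(words)

print(letter_space("add emphasis", level=1.0))
# → "a d d e m p h a s i s"
```

With `level=1.0` every word is spaced out; a whitespace tokenizer now sees one token per letter, which is what makes this augmentation challenging for such systems.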

augmenty.token.spacing.create_spacing_insertion_augmenter_v1(level: float, max_insertions: int = 1) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter that randomly adds a space after a character. Tokens are kept the same.

Parameters:
  • level – The probability to add a space after a character.

  • max_insertions – Maximum number of insertions per word.

Returns:

The augmenter.
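A hypothetical, augmenty-independent sketch of the insertion logic for a single word (the function `insert_spaces` is an illustrative assumption, not part of augmenty):

```python
import random

def insert_spaces(word, level, max_insertions=1, seed=0):
    # Hypothetical sketch: after each character (except the last), with
    # probability `level`, insert a space, up to `max_insertions` per word.
    rng = random.Random(seed)
    insertions = 0
    chars = []
    for i, char in enumerate(word):
        chars.append(char)
        if i < len(word) - 1 and insertions < max_insertions and rng.random() < level:
            chars.append(" ")
            insertions += 1
    return "".join(chars)

print(insert_spaces("character", level=1.0))  # → "c haracter"
```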

augmenty.token.swap#

augmenty.token.swap.create_token_swap_augmenter_v1(level: float, respect_ents: bool = True, respect_sentences: bool = True) Callable[[Language, Example], Iterator[Example]][source]#

Creates an augmenter that randomly swaps two neighbouring tokens.

Parameters:
  • level – The probability to swap two tokens.

  • respect_ents – Should the augmenter respect entities? Defaults to True, in which case it will not swap a token inside an entity with a token outside the entity span, unless the entity is a single-token span. If False, it will disregard correcting the entity labels.

  • respect_sentences – Should it respect end-of-sentence boundaries? Defaults to True, indicating that it will not swap an end-of-sentence token. If False, it will disregard correcting the sentence start, as this becomes arbitrary.

Returns:

The augmenter.
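To illustrate the core swap, here is a hypothetical, augmenty-independent sketch over a plain token list (the function `swap_neighbours` is an illustrative assumption, not part of augmenty; it omits the entity- and sentence-boundary handling described above):

```python
import random

def swap_neighbours(tokens, level, seed=0):
    # Hypothetical sketch: with probability `level`, swap each token with
    # its right neighbour; a swapped pair is skipped so no token moves twice.
    rng = random.Random(seed)
    tokens = list(tokens)
    i = 0
    while i < len(tokens) - 1:
        if rng.random() < level:
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
            i += 2
        else:
            i += 1
    return tokens

print(swap_neighbours(["one", "two", "three"], level=1.0))
# → ['two', 'one', 'three']
```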