Span-based

augmenty.span.entities

augmenty.span.entities.create_ent_augmenter_v1(level: float, ent_dict: Dict[str, Iterable[Union[str, List[str], Span, Doc]]], replace_consistency: bool = True, resolve_dependencies: bool = True) → Callable[[Language, Example], Iterator[Example]]

Create an augmenter which replaces an entity based on a dictionary lookup.

Parameters:
  • level – The proportion of entities to be augmented (e.g. 0.1 for 10%).

  • ent_dict – A dictionary whose keys are the entity types you wish to replace (e.g. “PER”) and whose values are iterables of replacement entities. A replacement can be either 1) a list of strings making up the desired entity, i.e. [“Kenneth”, “Enevoldsen”], 2) a string of the desired entity, i.e. “Kenneth Enevoldsen”, which will be split using the tokenizer of the nlp pipeline, or 3) a Span object with the desired entity, in which case all information is passed on except for the dependency tree.

  • replace_consistency – Should an entity always be replaced with the same entity?

  • resolve_dependencies – Attempts to resolve the dependency tree by setting the head of the original entity as the head of the first token in the new entity. The remainder is then passed as

Returns:

The augmenter

Example

>>> from augmenty.span.entities import create_ent_augmenter_v1
>>> ent_dict = {"ORG": [["Google"], ["Apple"]],
...             "PERSON": [["Kenneth"], ["Lasse", "Hansen"]]}
>>> # augment 10% of names
>>> ent_augmenter = create_ent_augmenter_v1(level=0.1, ent_dict=ent_dict)
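
The interplay of level, ent_dict and replace_consistency described above can be illustrated with a small pure-Python sketch. This is not augmenty's implementation: the helper make_replacer is hypothetical, entities are represented as plain lists of strings rather than spaCy Spans, str.split stands in for the pipeline's tokenizer, and the order of the level check and the consistency lookup is an assumption.

```python
import random
from typing import Callable, Dict, Iterable, List, Optional, Tuple, Union

Replacement = Union[str, List[str]]


def make_replacer(
    ent_dict: Dict[str, Iterable[Replacement]],
    level: float,
    replace_consistency: bool = True,
    seed: Optional[int] = None,
) -> Callable[[List[str], str], List[str]]:
    """Hypothetical helper mimicking a dictionary-based entity replacer."""
    rng = random.Random(seed)
    # Remembers which replacement was chosen for each (label, entity) pair.
    cache: Dict[Tuple[str, Tuple[str, ...]], List[str]] = {}

    def replace(tokens: List[str], label: str) -> List[str]:
        # Leave the entity untouched if its type has no replacements
        # or if the random draw falls outside `level`.
        if label not in ent_dict or rng.random() >= level:
            return list(tokens)
        key = (label, tuple(tokens))
        if replace_consistency and key in cache:
            return list(cache[key])
        choice = rng.choice(list(ent_dict[label]))
        # A plain string is split into tokens (whitespace split is an
        # assumption standing in for the nlp tokenizer).
        new_tokens = choice.split() if isinstance(choice, str) else list(choice)
        if replace_consistency:
            cache[key] = new_tokens
        return list(new_tokens)

    return replace


replace = make_replacer({"PERSON": [["Kenneth"], "Lasse Hansen"]}, level=1.0, seed=0)
first = replace(["Mette", "Frederiksen"], "PERSON")
second = replace(["Mette", "Frederiksen"], "PERSON")
```

With replace_consistency enabled, first and second are the same replacement, while an entity whose label is missing from ent_dict is returned unchanged.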

augmenty.span.entities.create_ent_format_augmenter_v1(reordering: List[Optional[int]], formatter: List[Optional[Callable[[Token], str]]], level: float, ent_types: Optional[List[str]] = None) → Callable[[Language, Example], Iterator[Example]]

Creates an augmenter which reorders and formats an entity according to reordering and formatting functions.

Parameters:
  • reordering – The desired order of the tokens, given as a list of indices, where None denotes the remainder. For instance, if this function were used solely on names, [-1, None] would indicate the last name (the last token in the name) followed by the remainder of the name. Similarly, one could use the reordering [3, 1, 2], e.g. indicating last name, first name, middle name. Note that if the entity only includes two tokens, the 3 will be ignored, producing the pattern [1, 2].

  • formatter – A list of functions, each taking a spaCy Token and returning the reformatted str. E.g. the function lambda token: token.text[0] + "." would abbreviate the token and add punctuation. None corresponds to no augmentation.

  • level – The probability of an entity being augmented.

  • ent_types – The entity types which should be augmented. Defaults to None, indicating all entity types.

Returns:

The augmenter

Example

>>> import augmenty
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> abbreviate = lambda token: token.text[0] + "."
>>> # en_core_web_sm labels person entities as "PERSON"
>>> augmenter = augmenty.load("ents_format_v1", reordering=[-1, None],
...                           formatter=[None, abbreviate], level=1,
...                           ent_types=["PERSON"])
>>> texts = ["my name is Kenneth Enevoldsen"]
>>> list(augmenty.texts(texts, augmenter, nlp))
["my name is Enevoldsen K."]
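
The reordering and formatter semantics described above can be sketched in plain Python. This is an illustration under stated assumptions, not augmenty's code: the helpers reorder_tokens and format_tokens are hypothetical, tokens are plain strings instead of spaCy Tokens, indices are treated as 0-based Python indices (negative values count from the end), out-of-range indices are skipped, tokens not covered by the reordering are dropped when no None is present, and formatters are applied by position after reordering (which matches the output shown in the example above).

```python
from typing import Callable, List, Optional, Sequence


def reorder_tokens(tokens: Sequence[str], reordering: Sequence[Optional[int]]) -> List[str]:
    """Reorder tokens; None expands to the tokens not named explicitly."""
    n = len(tokens)
    resolved: List[Optional[int]] = []
    used = set()
    for idx in reordering:
        if idx is None:
            resolved.append(None)
            continue
        pos = idx + n if idx < 0 else idx
        # Skip indices that fall outside the entity or were already used.
        if 0 <= pos < n and pos not in used:
            used.add(pos)
            resolved.append(pos)
    remainder = [i for i in range(n) if i not in used]
    out: List[str] = []
    for pos in resolved:
        if pos is None:
            out.extend(tokens[i] for i in remainder)
            remainder = []  # only the first None expands to the remainder
        else:
            out.append(tokens[pos])
    return out


def format_tokens(
    tokens: Sequence[str],
    formatter: Sequence[Optional[Callable[[str], str]]],
) -> List[str]:
    """Apply the formatter at each position; None leaves the token as-is."""
    out = []
    for i, tok in enumerate(tokens):
        fn = formatter[i] if i < len(formatter) else None
        out.append(tok if fn is None else fn(tok))
    return out


abbreviate = lambda text: text[0] + "."
reordered = reorder_tokens(["Kenneth", "Enevoldsen"], [-1, None])
formatted = " ".join(format_tokens(reordered, [None, abbreviate]))
```

Here reordered places the last token first and formatted abbreviates the second position, giving "Enevoldsen K.".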

augmenty.span.entities.create_per_replace_augmenter_v1(names: Dict[str, List[str]], patterns: List[List[str]], level: float, names_p: Optional[Dict[str, List[float]]] = None, patterns_p: Optional[List[float]] = None, replace_consistency: bool = True, person_tag: str = 'PERSON') → Callable[[Language, Example], Iterator[Example]]

Create an augmenter which replaces a name (PER) with a new one sampled from the names dictionary.

Parameters:
  • names – A dictionary of lists of names to sample from. These could, for example, include first names and last names.

  • patterns – The patterns used to create the names. This should be a list of patterns, where each pattern is a list of strings and each string denotes the list in the names dictionary to sample from.

  • level – The proportion of PER entities to replace.

  • names_p – The probability to sample each name. An empty dictionary “{}” indicates equal probability for each name.

  • patterns_p – The probability to sample each pattern. None indicates equal probability for each pattern.

  • replace_consistency – Should the entity always be replaced with the same entity?

  • person_tag – The tag of the person entity (e.g. “PERSON” or “PER”).

Returns:

The augmenter

Example

>>> from augmenty.span.entities import create_per_replace_augmenter_v1
>>> names = {"firstname": ["Kenneth", "Lasse"],
...          "lastname": ["Enevoldsen", "Hansen"]}
>>> patterns = [["firstname"], ["firstname", "lastname"],
...             ["firstname", "firstname", "lastname"]]
>>> person_tag = "PERSON"
>>> # replace 10% of names:
>>> per_augmenter = create_per_replace_augmenter_v1(names, patterns, level=0.1,
...                                                 person_tag=person_tag)
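
How names, patterns, names_p and patterns_p combine can be sketched with the standard library alone. The helper sample_name below is hypothetical and only illustrates the sampling described in the parameter list: a pattern is drawn first, then one name per slot is drawn from the list the slot refers to. Treating a missing entry in names_p as uniform sampling is an assumption.

```python
import random
from typing import Dict, List, Optional


def sample_name(
    names: Dict[str, List[str]],
    patterns: List[List[str]],
    names_p: Optional[Dict[str, List[float]]] = None,
    patterns_p: Optional[List[float]] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw a pattern, then draw one name for every slot in that pattern."""
    rng = rng or random.Random()
    names_p = names_p or {}
    # weights=None makes random.choices sample uniformly.
    pattern = rng.choices(patterns, weights=patterns_p, k=1)[0]
    return [
        rng.choices(names[key], weights=names_p.get(key), k=1)[0]
        for key in pattern
    ]


names = {"firstname": ["Kenneth", "Lasse"],
         "lastname": ["Enevoldsen", "Hansen"]}
patterns = [["firstname"], ["firstname", "lastname"],
            ["firstname", "firstname", "lastname"]]

rng = random.Random(0)
uniform_sample = sample_name(names, patterns, rng=rng)
# Force the three-token pattern and always pick "Hansen" as the last name.
weighted_sample = sample_name(
    names, patterns,
    names_p={"lastname": [0.0, 1.0]},
    patterns_p=[0.0, 0.0, 1.0],
    rng=rng,
)
```

The weighted call always yields two first names followed by "Hansen", while the uniform call yields between one and three tokens.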