Training using Augmenty#

This tutorial takes you through how to utilize spacy augmenters during training. It builds upon the spacy project for training a part-of-speech tagger and dependency parser.

It will take you through how to adapt the code to allow for training using augmenty, but you can also just go and see the finished project within the tutorials folder.

Note

This example assumes that the reader is familiar with spacy projects.

Setting up the spacy project#

You can download the spacy project using:

python -m spacy project clone pipelines/tagger_parser_ud

Which should get you a folder called tagger_parser_ud. You can now run it to see that everything works, by first fetching the assets:

spacy project assets

And then run the whole training pipeline:

spacy project run all

This should give you something like:

 Running workflow 'all'

================================= preprocess =================================
Running command: mkdir -p corpus/UD_English-EWT
[...]
=================================== train ===================================
Running command: /Users/au561649/.virtualenvs/augmenty/bin/python -m spacy train [...]
[...] Initialized pipeline
============================= Training pipeline ============================= Pipeline: ['tok2vec', 'tagger', 'morphologizer',
'trainable_lemmatizer', 'parser'] Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS TRAIN...  LOSS PARSER  TAG_ACC  POS_ACC  MORPH_ACC  LEMMA_ACC  DEP_UAS  DEP_LAS  SENTS_F  SCORE 
---  ------  ------------  -----------  -------------  -------------  -----------  -------  -------  ---------  ---------  -------  -------  -------  ------
  0       0          0.00       137.44         138.58         138.94       264.16    21.87    24.48      25.75      76.52    14.36     7.38     0.91    0.29
[...]

Once you start seeing the table, feel free to stop the pipeline. We now know that the setup works, and we can adapt it to start using augmenty for augmenting the data.

Adding Augmenty#

To add in augmenty you need to:

  1. Create your desired augmenters

  2. Update the config file (located in configs/default.cfg)

  3. Ensure that the code with the augmenters is loaded in when training

You will also need augmenty installed in your environment (pip install augmenty).

1) Create your desired augmenters#

To create your desired augmenters, you should be aware of what model you are training. For instance, in our case we are training a dependency parser and a part-of-speech tagger. This puts some limitations on which augmenters you can use. For instance, removing a token from a text can lead to invalid dependency annotations, so the token deletion augmentation is not usable. There is an overview of what you can use the augmenters for here.
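To see why token deletion conflicts with dependency annotations, consider a toy example (purely illustrative, not augmenty's internal representation): if each token's syntactic head is stored as an index into the token list, deleting a token shifts positions and can leave heads pointing at the wrong word, or at nothing at all.

```python
# Toy illustration: dependency heads stored as token indices
# break when a token is deleted from the sentence.
tokens = ["The", "dog", "barks"]
heads = [1, 2, 2]  # "The" -> "dog", "dog" -> "barks", "barks" is its own head (root)

# delete "The" (index 0) without re-mapping the head indices
del tokens[0]

# the remaining heads still point at the old positions: head index 2
# is now out of range for the two-token sentence
dangling = [h for h in heads[1:] if h >= len(tokens)]
print(dangling)  # head indices that no longer point at a valid token
```

This is why annotation-aware augmenters (which transform the text and annotations together, or leave token boundaries intact) are the safe choice for parser training.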

For our case, we will create a simple augmenter, which introduces some spelling errors, using two existing augmenters:

# file: augmenters.py

import spacy

import augmenty


# register the augmenter with the name you want to specify in the config
@spacy.registry.augmenters("my_augmenter")
def my_augmenters():
    # create the augmenters you wish to use
    keystroke_augmenter = augmenty.load(
        "keystroke_error_v1",
        keyboard="en_qwerty_v1",
        level=0.05,  # 5% of characters might be too much
    )

    char_swap_augmenter = augmenty.load("char_swap_v1", level=0.03)

    # combine them into a single augmenter to be used for training
    # the order of the augmenters is important, as the first augmenter will be applied first
    return augmenty.combine([keystroke_augmenter, char_swap_augmenter])

Let us quickly check that our augmenter works as intended:

nlp = spacy.blank("en")
augmenter = my_augmenters()

texts = ["This is a test sentence."]

for i in range(10):
    augmented_texts = augmenty.texts(texts, augmenter, nlp=nlp)

    for text in augmented_texts:
        print(text)
This is a test sentdgce.
This is z test sentence.
This is a test sentence.
This ie a test sentence.
This is a tset sentecne.
This is a test sentecne.
This is a tset segtence.
This is a test sentenve.
This is a test sentence.
This is a tsst sentence.
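To build some intuition for what char_swap-style augmentation does, here is a minimal sketch of the idea (an illustration only, not augmenty's actual implementation): walk the text and, with probability level, swap each pair of neighbouring letters.

```python
import random


def char_swap(text: str, level: float, rng: random.Random) -> str:
    """Swap neighbouring alphabetic characters with probability `level`.

    A simplified sketch of character-swap augmentation; augmenty's own
    implementation is annotation-aware and more careful than this.
    """
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < level:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


rng = random.Random(0)
print(char_swap("This is a test sentence.", 0.1, rng))
```

Note that the swaps only reorder characters; the overall character inventory of the text is preserved, which is part of what makes this a comparatively gentle augmentation.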

The registered augmenter then needs to live in a file; in our case that file is augmenters.py.

2) Update the config file#

Then we will need to tell the training process that it should use the augmenter. We do this by changing the config located in configs/default.cfg.

Specifically, we replace the line augmenter = null in the following section:

# file: configs/default.cfg
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

with the lines:

[corpora.train.augmenter]
@augmenters = "my_augmenter"
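Putting the two snippets together, the train corpus section of the config should end up looking like this (a sketch, assembled from the fragments above):

```ini
# file: configs/default.cfg
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0

[corpora.train.augmenter]
@augmenters = "my_augmenter"
```

Note that the augmenter = null line is removed entirely; in spacy configs, a registered function is supplied as a nested section rather than an inline value.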

3) Ensure that the code with the augmenters is loaded in when training#

If you were just to run the command spacy project run train (to start the training) you would get an error stating that the augmenter could not be found.

However, that is easily fixed. The spacy project contains the train command, which specifies what spacy project run train should do. In the code below we see that it calls the python -m spacy train command with a sequence of arguments. Luckily for us, getting our code executed is as simple as adding it as an argument, as seen below:

# file: project.yml
  - name: train
    help: "Train ${vars.treebank}"
    script:
      - >-
        python -m spacy train 
        configs/${vars.config}.cfg
        --output training/${vars.treebank}
        --gpu-id ${vars.gpu} 
        --paths.train corpus/${vars.treebank}/train.spacy 
        --paths.dev corpus/${vars.treebank}/dev.spacy 
        --nlp.lang=${vars.lang}
        --code augmenters.py # <-- we need to add this line for the code to be run and the augmenters to be registered

Conclusion#

That is it. You can now run:

spacy project run train

and the project will train using the augmenter.

Evaluation#

One important thing when evaluating, especially when training with augmentation, is that you evaluate as close as possible to the target. For instance, if you want your model to be able to handle lowercase text, you have to make sure that your evaluation set also contains some lowercased text. Naturally, you can use augmenty for this as well.

However, the augmentations during training do not need to resemble the augmentations during evaluation. In fact, it is quite common to see that a model trained using only a small amount of augmentation (e.g. ~0.5% spelling errors) handles larger degrees of augmentation (e.g. ~5%) notably better, without sacrificing as much performance as if it had been trained using a higher degree of augmentation.
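If you do want the evaluation set itself to reflect such noise, one option is to attach an augmenter to the dev corpus in the same way as for the train corpus. This is a sketch, assuming you reuse the registered name from above (you may well want a separate, milder augmenter for evaluation):

```ini
# file: configs/default.cfg
# only add this if you deliberately want an augmented dev set
[corpora.dev.augmenter]
@augmenters = "my_augmenter"
```

Keep in mind that scores on an augmented dev set are no longer comparable to scores on the clean one, so it is often informative to evaluate on both.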