Customizing Asent#

colab_tut

Asent is built using a series of extension attributes on the spaCy classes, Doc, Token and Span. This allows you to switch out the parts and also improve on one component at a time. In this tutorial, we will move you through how to customize your own pipeline. This will first include a quick approach using the Asent component to implement a Swedish pipeline. Then we will customize the way the pipeline checks if a word is negated.

To read more about the custom spaCy extensions check out their documentation.

Before we start we will need a spaCy pipeline for Swedish. Here we will use the experimental version which you can install using:

# this can take a while
pip install https://huggingface.co/explosion/sv_udv25_swedishtalbanken_trf/resolve/main/sv_udv25_swedishtalbanken_trf-any-py3-none-any.whl

Which can then be loading using spaCy:

import spacy
nlp = spacy.load("sv_udv25_swedishtalbanken_trf")

Attention

This pipeline requires a GPU and torch>=1.9.0

See also

If you don’t wish to use the experimental version by spaCy we recommend pipeline by the Swedish royal library.

Creating the pipeline#

A large part of the customization is made simple using the Asent component. Here we will implement the Swedish Asent pipeline, for this we need a dictionary of rated words and potentially a list of intensifiers, negation and contrastive conjunctions.

We can extract these using the lexicons.get function:

from asent import lexicons
rated_words = lexicons.get("lexicon_se_v1")
negations = lexicons.get("negations_se_v1")
intensifiers = lexicons.get("intensifiers_se_v1")

Before we move on let us check what each of these contains, let us start with the rated_words, this is a dictionary which contain words as well as a human rating of how positive/negative it is:

list(rated_words.items())[140:150] # the start of the dictionary is mostly emoticons
[('motbjudande', -3.1),
 ('avskyr', -2.9),
 ('förmågor', 1.0),
 ('förmåga', 1.3),
 ('ombord', 0.1),
 ('frånvarande', -1.1),
 ('frikänna', 1.6),
 ('befrias', 1.5),
 ('frikänner', 1.6),
 ('missbruk', -3.2)]

negations is simply a list or a set of words, which is considered negations:

list(negations)[:10]
['utan',
 'aldrig',
 'ingenting',
 'sällan',
 'trots',
 'ingenstans',
 'ingen',
 'nej',
 'varken',
 'inte']

Finally, intensifiers is a dictionary of words such intensifies the valence of another words (e.g. “very”). It is associated with a score on how much it intensifies to the following word.

list(intensifiers.items())[:10]
[('absolut', 0.293),
('otroligt', 0.293),
('väldigt', 0.293),
('helt', 0.293),
('betydande', 0.293),
('betydligt', 0.293),
('bestämt', 0.293),
('djupt', 0.293),
('effing', 0.293),
('enorm', 0.293)]

We can now add a sentiment component to the pipeline, using the Asent component:

import asent
from asent.component import Asent

Asent(
    nlp, name="asent_se", lexicon=rated_words, intensifiers=intensifiers, negations=negations, lowercase=True, lemmatize=False
)

# test it out and visualize results
doc = nlp("Jag är enormt lycklig")
asent.visualize(doc)
Model prediction

Do note, that we specified that when the model should lookup in the dictionaries it should lowercase the lookup word (lowercase=True) and it should not lemmatize(lemmatize=False) the word. This should naturally correspond to the lexicons you are using, if your lexicon contains lemmas then you should lemmatize beforehand, if your lexicon is case sensitive you should not lowercase before the look up.

Customizing your Pipeline#

In the following, we will customize our pipeline a bit further. We will especially look at the negations. The current implementation based on Hutto and Gilbert (2014) assumed that the word is negated if one of the three proceeding words is a negation. This is a simplifying assumption which has been shown to generally work well, however, with spaCy performing a dependency parse and part-of-speech tagging we can do better!

Examining an example#

First let us examine an example where it fails:

doc = nlp("Jag är inte glad men jag skulle inte säga att jag är ledsen.")
# I am not happy but I would not say that I am sad.

asent.visualize(doc)
Model prediction
from spacy import displacy

# examine the part-of-speech tags and dependency tree
displacy.render(doc)
A dependency parse

From this we can notice two things:

1) Negation have the PART part-of-speech tag, indicating that it is a particle, which among other things include negations.

2) Negations is related to other words using the advmod, and the words we wish negated is “down the tree” (or down the arrow if you will) from the negated word.

We can even go a bit further and examine the morph extension:

for t in doc:
    print(t, "\t", t.morph)
Jag          Case=Nom|Definite=Def|Gender=Com|Number=Sing|PronType=Prs
är          Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act
inte         Polarity=Neg
glad         Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing
men
jag          Case=Nom|Definite=Def|Gender=Com|Number=Sing|PronType=Prs
skulle       Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Act
inte         Polarity=Neg
säga         VerbForm=Inf|Voice=Act
att
jag          Case=Nom|Definite=Def|Gender=Com|Number=Sing|PronType=Prs
är          Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act
ledsen       Case=Nom|Definite=Ind|Gender=Com|Number=Sing|Tense=Past|VerbForm=Part
.

Where we see that the negation “inte” is denoted by Polarity=Neg, indicating that it is a negation.

from this, there are two things we can change, first instead of looking up the negation, we can examine whether it is a negation or at least that it has the right part-of-speech tag. Secondly, we can implement a method which check if a word is negated using the dependency tree.

Morphology and Part-of-Speech for negations#

Asent check is a word is a negation using the is_negation token extension. We can see this using:

for t in doc:
    print(t, "\t", t._.is_negation)
Jag          False
är          False
inte         True
glad         False
men          False
jag          False
skulle       False
inte         True
säga         False
att          False
jag          False
är          False
ledsen       False
.           False

We will now simply overwrite the extension with one using the morph tag. First, we will create a function which applied to a token returns whether it is a negation. Secondly, we will overwrite the extensions using the token’s set_extention method.

from spacy.tokens import Token

def is_negation(token: Token) -> bool:
    """checks is token is a negation

    Args:
        token (Token): A spaCy token

    Returns:
        bool: a boolean indicating whether the token is a negation
    """
    m_dict = token.morph.to_dict()
    return ("Polarity" in m_dict  # if is has the polarity attribute
            and m_dict["Polarity"] == "Neg")  # and it is negative

Token.set_extension("is_negation", getter=is_negation, force=True)

Now our negations use the morph tag, which in this case provides the same results so the result isn’t that interesting. What we really what it the second part:

Using the dependency tree for negations#

In the following, we will overwrite the is_negated extension used by Asent to check if a word is negated. We can start by examining it:

for t in doc:
    print(t, "\t", t._.is_negated)
Jag          None
är          None
inte         None
glad         inte
men          inte
jag          inte
skulle       None
inte         None
säga         inte
att          inte
jag          inte
är          None
ledsen       None
.           None

Noticably, see that ledsen is not negated, although it should be, but we also clearly see the three following words after the negation is negated as was expected from the heuristic rule.

from typing import Optional

def is_negated(token: Token) -> Optional[Token]:
    """checks is token is negated

    Args:
        token (Token): A spaCy token

    Returns:
        Optional[Token]: return the negation if the token is negated
    """
    # only check if a word is negated if it is rated (it is not meaningful to do otherwise)
    if token._.valence:
        for c in token.children:
            # if the token is modified by a negation
            if c.dep_ == "advmod" and c._.is_negation:
                return c
        # or if its head it negated:
        for c in token.head.children:
            if c.dep_ == "advmod" and c._.is_negation:
                return c


Token.set_extension("is_negated", getter=is_negated, force=True)
for t in doc:
    print(t, "\t", t._.valence,  t._.is_negation, "-",  t._.is_negated)
Jag          0.0 False - None
är          0.0 False - None
inte         0.0 True - None
glad         3.1 False - inte
men          0.0 False - None
jag          0.0 False - None
skulle       0.0 False - None
inte         0.0 True - None
säga         0.0 False - None
att          0.0 False - None
jag          0.0 False - None
är          0.0 False - None
ledsen       -2.1 False - inte
.           0.0 False - None
asent.visualize(doc)
doc[-2]._.polarity
Model prediction
TokenPolarityOutput(polarity=1.554, token=ledsen, span=inte säga att jag är ledsen)

Exercise

You will notice that there is no contrastive conjugation for Swedish, but that the part-of-speech tags do include a tag for it (CCONJ). Overwrite the is_contrastive_conj extension to include contrastive conjugations.