spaCy for Natural Language Processing – a Beginner's Guide

spaCy is a powerful and advanced library that is gaining huge popularity for NLP applications due to its speed, ease of use, accuracy, and extensibility.

I was going through spaCy tutorials and prepared notes for myself. I’m sharing my notes as a blog post here. This is an introductory post that will help you get started with spaCy.

Introduction

At the center of spaCy is the object containing the processing pipeline. This variable is usually called nlp.

To create an English nlp object, you can import the English language class from spacy.lang.en.

You can call the nlp object like a function to analyze text. It contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages that are available in spacy.lang.

# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

The Doc object

When you process a text with the nlp object, spaCy creates a Doc object – short for “document”. The Doc lets you access information about the text in a structured way, and no information is lost.

The Doc behaves like a normal Python sequence and lets you iterate over its tokens or get a token by its index.

# Created by processing a string of text with the nlp object
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Output

Hello
world
!

The Token object

Token objects represent the tokens in a document – for example, a word or a punctuation character.

To get a token at a specific position, you can index into the doc.

Token objects also provide various attributes that let you access more information about the tokens. For example, the .text attribute returns the verbatim token text.

doc = nlp("Hello world!")

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)
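
Output

world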

The Span object

A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.

To create a span, you can use Python’s slice notation. For example, 1:3 will create a slice starting from the token at position 1, up to – but not including! – the token at position 3.

doc = nlp("Hello world!")

# A slice from the Doc is a Span object
span = doc[1:3]

# Get the span text via the .text attribute
print(span.text)

Output

world!

Lexical Attributes

Let’s see some of the available token attributes:

i is the index of the token within the parent document.

text returns the token text.

is_alpha, is_punct and like_num return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation, or whether it resembles a number – for example, the token “10” or the word “ten”.

These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don’t depend on the token’s context.

doc = nlp("Share price is $100.")

print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])

Output

Index:    [0, 1, 2, 3, 4, 5]
Text:     ['Share', 'price', 'is', '$', '100', '.']
is_alpha: [True, True, True, False, False, False]
is_punct: [False, False, False, False, False, True]
like_num: [False, False, False, False, True, False]

A simple spaCy example – Doc, Token and lexical attributes

Let’s write a simple program that tests our learning so far. We’ll use the concepts of doc, token and lexical attributes in this program.

from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In India, more than 50% of people are involved in agriculture. "
    "Contribution of Agriculture to GDP is around 15%."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

Output

Percentage found: 50
Percentage found: 15

In the above example, we first import the English class and assign an instance of it to the nlp variable. We process a string with nlp and store the output in the doc object. We then iterate over the tokens of the doc and use lexical attributes to identify percentages in the text.

Statistical Models

Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.

Models are trained on large datasets of labeled example texts.

They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.

Model Package

import spacy

nlp = spacy.load("en_core_web_sm")

spaCy provides a number of pre-trained model packages you can download using the spacy download command (for example, python -m spacy download en_core_web_sm). The “en_core_web_sm” package is a small English model that supports all core capabilities and is trained on web text.

The spacy.load method loads a model package by name and returns an nlp object.

The package provides the binary weights that enable spaCy to make predictions.

It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.
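
For example, you can inspect which pipeline components the loaded package configured:

# Names of the pipeline components set up by the model package
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner'] – varies by model and spaCy version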

Predicting part-of-speech Tags

Let’s take a look at the model’s predictions. In this example, we’re using spaCy to predict part-of-speech tags, the word types in context.

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("He drank the juice")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

Output

He PRON
drank VERB
the DET
juice NOUN

First, we load the small English model and receive an nlp object.

Next, we’re processing the text “He drank the juice”.

For each token in the doc, we can print the text and the .pos_ attribute, the predicted part-of-speech tag.

In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an integer ID value.
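
A small sketch of the difference, reusing the doc from above (the exact integer ID varies across spaCy versions):

# .pos_ gives the string label, .pos gives the underlying integer ID
print(doc[1].pos_)  # VERB
print(doc[1].pos)   # the integer ID that "VERB" maps to (value varies by version)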

Dependency label scheme

text = ("He drank the juice")
text_doc = nlp(text)
displacy.render(text_doc, style='dep')

To describe syntactic dependencies, spaCy uses a standardized label scheme.

The pronoun “He” is a nominal subject attached to the verb – in this case, to “drank”.

The noun “juice” is a direct object attached to the verb “drank”. It is being drunk by the subject, “he”.

The determiner “the”, also known as an article, is attached to the noun “juice”.
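
You can read these relations off the tokens directly: .dep_ returns the dependency label and .head the parent token. A minimal sketch, reusing text_doc from above:

# Print each token with its dependency label and its head (parent) token
for token in text_doc:
    print(token.text, token.dep_, token.head.text)

# Expected output (with en_core_web_sm):
# He nsubj drank
# drank ROOT drank
# the det juice
# juice dobj drank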

Predicting Named Entities

Named entities are “real world objects” that are assigned a name – for example, a person, an organization or a country.

# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

displacy.render(doc,style='ent')

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
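
Output

Apple ORG
U.K. GPE
$1 billion MONEY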

The doc.ents property lets you access the named entities predicted by the model.

It returns an iterator of Span objects, so we can print the entity text and the entity label using the .label_ attribute.
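
If you're unsure what a label means, spaCy ships a small helper, spacy.explain, that returns a short description:

import spacy

print(spacy.explain("GPE"))  # 'Countries, cities, states'
print(spacy.explain("ORG"))  # 'Companies, agencies, institutions, etc.'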

Rule Based Matching

Why not just regular expressions?

Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.

It’s also more flexible: you can search for texts but also other lexical attributes.

You can even write rules that use the model’s predictions.

For example, find the word “duck” only if it’s a verb, not a noun.
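
As a sketch, such a rule combines a lexical and a predicted attribute in a single token pattern:

# Match "duck" only when the model tagged it as a verb
pattern = [{"LEMMA": "duck", "POS": "VERB"}]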

Match patterns

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

#Match exact token texts
[{"TEXT": "iPhone"}, {"TEXT": "X"}]

#Match lexical attributes
[{"LOWER": "iphone"}, {"LOWER": "x"}]

#Match any token attributes
[{"LEMMA": "buy"}, {"POS": "NOUN"}]

In this example, we’re looking for two tokens with the text “iPhone” and “X”.

We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal “iphone” and “x”. The third pattern matches a token with the lemma “buy” (e.g. “buying”, “bought”) followed by a noun.

Using Matcher

# Import spaCy and the Matcher
import spacy
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", None, pattern)

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

Output

iPhone X 

To use a pattern, we first import the matcher from spacy.matcher.

We also load a model and create the nlp object.

The matcher is initialized with the shared vocabulary, nlp.vocab. You’ll learn more about this later – for now, just remember to always pass it in.

The matcher.add method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is an optional callback. We don’t need one here, so we set it to None. The third argument is the pattern.
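
Note that this is the spaCy v2 signature, which these examples follow. In spaCy v3, the callback moved to a keyword argument and patterns are passed as a list: matcher.add("IPHONE_PATTERN", [pattern]).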

To match the pattern on a text, we can call the matcher on any doc.

When you call the matcher on a doc, it returns a list of tuples.

Each tuple consists of three values: the match ID, the start index and the end index of the matched span.
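
The match ID is a hash of the pattern name; if you need the name back, you can look it up in the shared vocab:

for match_id, start, end in matches:
    # Resolve the hash back to the string name of the pattern
    print(nlp.vocab.strings[match_id], doc[start:end].text)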

Matching lexical attributes

Here’s an example of a more complex pattern using lexical attributes.

pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]
doc = nlp("2018 FIFA World Cup: France won!")
matcher.add("ATTERN", None, pattern)
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

Output

2018 FIFA World Cup:

We’re looking for five tokens:

  • A token consisting of only digits.
  • Three case-insensitive tokens for “fifa”, “world” and “cup”.
  • And a token that consists of punctuation.

The pattern matches the tokens “2018 FIFA World Cup:”.

Using operators and quantifiers

Operators and quantifiers let you define how often a token should be matched. They can be added using the “OP” key.

pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")
matcher.add("ATTERN", None, pattern)
matches = matcher(doc)

for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

Output

bought a smartphone
buying apps

Here, the “?” operator makes the determiner token optional, so it will match a token with the lemma “buy”, an optional article and a noun.
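
Besides “?”, the “OP” key accepts a few more values:

# "OP": "!"  negation: match 0 times
# "OP": "?"  optional: match 0 or 1 times
# "OP": "+"  match 1 or more times
# "OP": "*"  match 0 or more times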

Conclusion

spaCy is one of the most important libraries for Natural Language Processing. It helps you perform exploratory analysis of text data with ease. It gives you the option to add your own entities if they are not present in the pretrained model, and it also allows you to train your own neural network models to better process custom data. We'll cover some of these topics in future blog posts.

In this blog post, you have learnt:

  • Basic building blocks of spaCy – nlp, Doc, Token and Span
  • Statistical models used to predict part-of-speech tags, dependency labels and named entities
  • Rule based matching using Matcher
