Intermediate

The spaCy NLP Pipeline

Understand how spaCy processes text through its pipeline: tokenizer, tagger, parser, NER, and custom components.

Pipeline Architecture

When you call nlp(text), spaCy runs the text through a series of pipeline components in order, each one adding annotations to a shared Doc object:

  1. Tokenizer

    Splits text into tokens (words, punctuation). Rule-based; it always runs first.

  2. Tagger

    Assigns part-of-speech tags (NOUN, VERB, ADJ) to each token using a statistical model.

  3. Parser

    Builds dependency trees showing grammatical relationships between tokens (subject, object, modifier).

  4. NER

    Identifies named entities (people, places, organizations) and their boundaries in the text.

  5. Lemmatizer

    Reduces words to their base form (running → run, better → good).
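
You can inspect and edit this pipeline at runtime via nlp.pipe_names and nlp.add_pipe. A minimal sketch using a blank pipeline, which needs no trained model; a loaded model such as en_core_web_sm would list its own component names, which vary by model and version:

Python — Inspecting the pipeline
```python
import spacy

# A blank English pipeline has a tokenizer but no components
nlp = spacy.blank("en")
print(nlp.pipe_names)  # []

# Components run in the order they were added
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)  # ['sentencizer']
```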

Tokenization

Python — Token attributes
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple isn't looking at buying U.K. startups for $1 billion.")

for token in doc:
    print(f"{token.text:12} {token.lemma_:12} {token.pos_:6} "
          f"{token.dep_:10} {token.is_stop}")

# Apple        Apple        PROPN  nsubj      False
# is           be           AUX    aux        True
# n't          not          PART   neg        True
# looking      look         VERB   ROOT       False
# ...
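
Because the tokenizer is rule-based, you can extend it with special cases for strings it doesn't split the way you want. A sketch on a blank pipeline ("gimme" is just an illustrative example):

Python — Tokenizer special case
```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# By default "gimme" stays a single token
print([t.text for t in nlp("gimme that")])  # ['gimme', 'that']

# Teach the tokenizer to split it; the ORTH values must
# concatenate back to the original string
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']
```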

Part-of-Speech Tagging

Python — POS tags
doc = nlp("The quick brown fox jumps over the lazy dog")

# Coarse-grained POS (pos_, Universal) and fine-grained tags (tag_, Penn Treebank)
for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.tag_:6} "
          f"{spacy.explain(token.tag_)}")

# The        DET    DT     determiner
# quick      ADJ    JJ     adjective
# brown      ADJ    JJ     adjective
# fox        NOUN   NN     noun, singular or mass
# jumps      VERB   VBZ    verb, 3rd person singular present

Dependency Parsing

Python — Dependency tree
doc = nlp("The cat sat on the mat")

for token in doc:
    print(f"{token.text:10} --{token.dep_:10}--> {token.head.text}")

# The        --det       --> cat
# cat        --nsubj     --> sat
# sat        --ROOT      --> sat
# on         --prep      --> sat
# the        --det       --> mat
# mat        --pobj      --> on

# Extract noun chunks (noun phrases)
for chunk in doc.noun_chunks:
    print(f"{chunk.text:20} root={chunk.root.text} head={chunk.root.head.text}")
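
Dependency annotations live on the Doc itself, so to experiment with head/child navigation you can also construct a Doc with hand-written annotations instead of running a trained parser. A sketch (the heads and deps below are supplied manually, not predicted):

Python — Navigating a hand-annotated tree
```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# heads are absolute token indices; the root points at itself
doc = Doc(nlp.vocab,
          words=["The", "cat", "sat"],
          heads=[1, 2, 2],
          deps=["det", "nsubj", "ROOT"])

print(doc[0].head.text)                   # cat
print([t.text for t in doc[2].children])  # ['cat']
print([t.text for t in doc[1].subtree])   # ['The', 'cat']
```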

Sentence Segmentation

Python — Sentence boundaries
text = "This is the first sentence. Here is another! And a third?"
doc = nlp(text)

for sent in doc.sents:
    print(f"[{sent.start}:{sent.end}] {sent.text}")

# [0:6] This is the first sentence.
# [6:10] Here is another!
# [10:14] And a third?
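
In the full pipeline, sentence boundaries come from the dependency parser. If you don't need parsing, the rule-based sentencizer component produces them on its own; a minimal sketch on a blank pipeline:

Python — Rule-based sentence splitting
```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based: splits on ., !, ? by default

doc = nlp("This is the first sentence. Here is another!")
for sent in doc.sents:
    print(sent.text)
# This is the first sentence.
# Here is another!
```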

Similarity and Vectors

Python — Word vectors (requires md or lg model)
nlp = spacy.load("en_core_web_md")  # Need medium+ for vectors

doc1 = nlp("I like cats")
doc2 = nlp("I love dogs")
doc3 = nlp("The stock market crashed")

# Doc similarity averages the token vectors of each doc
print(f"doc1 vs doc2: {doc1.similarity(doc2):.3f}")   # high, ~0.92
print(f"doc1 vs doc3: {doc1.similarity(doc3):.3f}")   # low, ~0.35

# Token-level vectors
token = nlp("king")[0]
print(f"Vector shape: {token.vector.shape}")  # (300,)
print(f"Has vector: {token.has_vector}")
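
Under the hood, .similarity() computes cosine similarity (for a Doc, over the averaged token vectors). A minimal NumPy sketch of the same computation — the three vectors here are made-up stand-ins, not real embeddings:

Python — Cosine similarity by hand
```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors: a.b / (|a| * |b|)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-d stand-ins (real word vectors would be 300-dimensional)
cat = np.array([1.0, 0.9, 0.1])
dog = np.array([0.9, 1.0, 0.2])
stock = np.array([0.0, 0.1, 1.0])

print(f"{cosine_similarity(cat, dog):.3f}")    # high (~0.99)
print(f"{cosine_similarity(cat, stock):.3f}")  # low (~0.14)
```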

Performance tip: If you only need tokenization, disable the components you don't use: nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"]). This can make processing 2-3x faster.
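
The intro mentions custom components: any function that takes a Doc and returns it can be registered with @Language.component and inserted into the pipeline. A minimal sketch (the name "token_counter" is invented for illustration):

Python — Custom pipeline component
```python
import spacy
from spacy.language import Language

@Language.component("token_counter")
def token_counter(doc):
    # A component receives the Doc, may annotate it, and must return it
    print(f"Doc has {len(doc)} tokens")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("token_counter")  # appended at the end by default
# nlp.add_pipe("token_counter", first=True) would run it first instead

doc = nlp("Hello world!")  # prints: Doc has 3 tokens
```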