Intermediate

The spaCy NLP Pipeline

Understand how spaCy processes text through its pipeline: tokenizer, tagger, parser, NER, and custom components.

Pipeline Architecture

When you call nlp(text), spaCy runs the text through a series of pipeline components in order, each one adding annotations to a shared Doc object:

  1. Tokenizer

    Splits text into tokens (words, punctuation). Rule-based; it always runs first.

  2. Tagger

    Assigns part-of-speech tags (NOUN, VERB, ADJ) to each token using a statistical model.

  3. Parser

    Builds dependency trees showing grammatical relationships between tokens (subject, object, modifier).

  4. NER

    Identifies named entities (people, places, organizations) and their boundaries in the text.

  5. Lemmatizer

    Reduces words to their base form (running → run, better → good).
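
You can inspect and edit this pipeline at runtime via nlp.pipe_names and nlp.add_pipe. A minimal sketch using a blank pipeline, which needs no trained model; a loaded model such as en_core_web_sm would list its own component names, which vary by model and version:

Python — Inspecting the pipeline
```python
import spacy

# A blank English pipeline has a tokenizer but no components
nlp = spacy.blank("en")
print(nlp.pipe_names)  # []

# Components run in the order they were added
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)  # ['sentencizer']
```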

Tokenization

Python — Token attributes
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple isn't looking at buying U.K. startups for $1 billion.")

for token in doc:
    print(f"{token.text:12} {token.lemma_:12} {token.pos_:6} "
          f"{token.dep_:10} {token.is_stop}")

# Apple        Apple        PROPN  nsubj      False
# is           be           AUX    aux        True
# n't          not          PART   neg        True
# looking      look         VERB   ROOT       False
# ...
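
Because the tokenizer is rule-based, you can extend it with special cases for strings it doesn't split the way you want. A sketch on a blank pipeline ("gimme" is just an illustrative example):

Python — Tokenizer special case
```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# By default "gimme" stays a single token
print([t.text for t in nlp("gimme that")])  # ['gimme', 'that']

# Teach the tokenizer to split it; the ORTH values must
# concatenate back to the original string
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']
```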

Part-of-Speech Tagging

Python — POS tags
doc = nlp("The quick brown fox jumps over the lazy dog")

# Coarse-grained POS (pos_, Universal) and fine-grained tags (tag_, Penn Treebank)
for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.tag_:6} "
          f"{spacy.explain(token.tag_)}")

# The        DET    DT     determiner
# quick      ADJ    JJ     adjective
# brown      ADJ    JJ     adjective
# fox        NOUN   NN     noun, singular or mass
# jumps      VERB   VBZ    verb, 3rd person singular present

Dependency Parsing

Python — Dependency tree
doc = nlp("The cat sat on the mat")

for token in doc:
    print(f"{token.text:10} --{token.dep_:10}--> {token.head.text}")

# The        --det       --> cat
# cat        --nsubj     --> sat
# sat        --ROOT      --> sat
# on         --prep      --> sat
# the        --det       --> mat
# mat        --pobj      --> on

# Extract noun chunks (noun phrases)
for chunk in doc.noun_chunks:
    print(f"{chunk.text:20} root={chunk.root.text} head={chunk.root.head.text}")
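
Dependency annotations live on the Doc itself, so to experiment with head/child navigation you can also construct a Doc with hand-written annotations instead of running a trained parser. A sketch (the heads and deps below are supplied manually, not predicted):

Python — Navigating a hand-annotated tree
```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# heads are absolute token indices; the root points at itself
doc = Doc(nlp.vocab,
          words=["The", "cat", "sat"],
          heads=[1, 2, 2],
          deps=["det", "nsubj", "ROOT"])

print(doc[0].head.text)                   # cat
print([t.text for t in doc[2].children])  # ['cat']
print([t.text for t in doc[1].subtree])   # ['The', 'cat']
```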

Sentence Segmentation

Python — Sentence boundaries
text = "This is the first sentence. Here is another! And a third?"
doc = nlp(text)

for sent in doc.sents:
    print(f"[{sent.start}:{sent.end}] {sent.text}")

# [0:6] This is the first sentence.
# [6:10] Here is another!
# [10:14] And a third?
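
In the full pipeline, sentence boundaries come from the dependency parser. If you don't need parsing, the rule-based sentencizer component produces them on its own; a minimal sketch on a blank pipeline:

Python — Rule-based sentence splitting
```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based: splits on ., !, ? by default

doc = nlp("This is the first sentence. Here is another!")
for sent in doc.sents:
    print(sent.text)
# This is the first sentence.
# Here is another!
```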

Similarity and Vectors

Python — Word vectors (requires md or lg model)
nlp = spacy.load("en_core_web_md")  # Need medium+ for vectors

doc1 = nlp("I like cats")
doc2 = nlp("I love dogs")
doc3 = nlp("The stock market crashed")

# Doc similarity averages the token vectors of each doc
print(f"doc1 vs doc2: {doc1.similarity(doc2):.3f}")   # high, ~0.92
print(f"doc1 vs doc3: {doc1.similarity(doc3):.3f}")   # low, ~0.35

# Token-level vectors
token = nlp("king")[0]
print(f"Vector shape: {token.vector.shape}")  # (300,)
print(f"Has vector: {token.has_vector}")
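
Under the hood, .similarity() computes cosine similarity (for a Doc, over the averaged token vectors). A minimal NumPy sketch of the same computation — the three vectors here are made-up stand-ins, not real embeddings:

Python — Cosine similarity by hand
```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors: a.b / (|a| * |b|)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-d stand-ins (real word vectors would be 300-dimensional)
cat = np.array([1.0, 0.9, 0.1])
dog = np.array([0.9, 1.0, 0.2])
stock = np.array([0.0, 0.1, 1.0])

print(f"{cosine_similarity(cat, dog):.3f}")    # high (~0.99)
print(f"{cosine_similarity(cat, stock):.3f}")  # low (~0.14)
```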

Performance tip: If you only need tokenization, disable the components you don't use: nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"]). This can make processing 2-3x faster.
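
The intro mentions custom components: any function that takes a Doc and returns it can be registered with @Language.component and inserted into the pipeline. A minimal sketch (the name "token_counter" is invented for illustration):

Python — Custom pipeline component
```python
import spacy
from spacy.language import Language

@Language.component("token_counter")
def token_counter(doc):
    # A component receives the Doc, may annotate it, and must return it
    print(f"Doc has {len(doc)} tokens")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("token_counter")  # appended at the end by default
# nlp.add_pipe("token_counter", first=True) would run it first instead

doc = nlp("Hello world!")  # prints: Doc has 3 tokens
```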