Intermediate
The spaCy NLP Pipeline
Understand how spaCy processes text through its pipeline: tokenizer, tagger, parser, NER, and custom components.
Pipeline Architecture
When you call nlp(text), spaCy runs the text through a series of pipeline components in order:
Tokenizer
Splits text into tokens (words, punctuation). Rule-based; it always runs first.
Tagger
Assigns part-of-speech tags (NOUN, VERB, ADJ) to each token using a statistical model.
Parser
Builds dependency trees showing grammatical relationships between tokens (subject, object, modifier).
NER
Identifies named entities (people, places, organizations) and their boundaries in the text.
Lemmatizer
Reduces words to their base form (running → run, better → good).
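These components live on the Language object, where they can be inspected and extended with custom components. A minimal sketch using a blank pipeline so no trained model download is needed (the component name token_counter is illustrative, not a built-in):

```python
import spacy
from spacy.language import Language

# Blank English pipeline: tokenizer only, no trained model required
nlp = spacy.blank("en")

# "token_counter" is an illustrative name, not a built-in component
@Language.component("token_counter")
def token_counter(doc):
    # A component receives a Doc and must return it (possibly modified)
    doc.user_data["n_tokens"] = len(doc)
    return doc

nlp.add_pipe("token_counter", last=True)
print(nlp.pipe_names)  # component order after the tokenizer

doc = nlp("Hello world!")
print(doc.user_data["n_tokens"])  # 3 tokens: Hello, world, !
```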
Tokenization
Python — Token attributes
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple isn't looking at buying U.K. startups for $1 billion.")
for token in doc:
    print(f"{token.text:12} {token.lemma_:12} {token.pos_:6} "
          f"{token.dep_:10} {token.is_stop}")
# Apple Apple PROPN nsubj False
# is be AUX aux True
# n't not PART neg True
# looking look VERB ROOT False
# ...
Part-of-Speech Tagging
Python — POS tags
doc = nlp("The quick brown fox jumps over the lazy dog")
# pos_ is coarse-grained (Universal POS), tag_ is fine-grained (Penn Treebank)
for token in doc:
    print(f"{token.text:10} {token.pos_:6} {token.tag_:6} "
          f"{spacy.explain(token.tag_)}")
# The DET DT determiner
# quick ADJ JJ adjective
# brown ADJ JJ adjective
# fox NOUN NN noun, singular or mass
# jumps VERB VBZ verb, 3rd person singular present
Dependency Parsing
Python — Dependency tree
doc = nlp("The cat sat on the mat")
for token in doc:
    print(f"{token.text:10} --{token.dep_:10}--> {token.head.text}")
# The --det --> cat
# cat --nsubj --> sat
# sat --ROOT --> sat
# on --prep --> sat
# the --det --> mat
# mat --pobj --> on
# Extract noun chunks (noun phrases)
for chunk in doc.noun_chunks:
    print(f"{chunk.text:20} root={chunk.root.text} head={chunk.root.head.text}")
Sentence Segmentation
Python — Sentence boundaries
text = "This is the first sentence. Here is another! And a third?"
doc = nlp(text)
for sent in doc.sents:
    print(f"[{sent.start}:{sent.end}] {sent.text}")
# [0:6] This is the first sentence.
# [6:10] Here is another!
# [10:14] And a third?
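With a trained model, sentence boundaries come from the parser. If you only need simple rule-based splitting, the built-in sentencizer component works on a blank pipeline with no trained model; a minimal sketch:

```python
import spacy

# A blank pipeline has no sentence boundaries; the rule-based
# "sentencizer" component adds them without a trained parser
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # splits on ., !, ? by default

doc = nlp("This is the first sentence. Here is another! And a third?")
sents = [sent.text for sent in doc.sents]
print(sents)
```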
Similarity and Vectors
Python — Word vectors (requires md or lg model)
nlp = spacy.load("en_core_web_md") # Need medium+ for vectors
doc1 = nlp("I like cats")
doc2 = nlp("I love dogs")
doc3 = nlp("The stock market crashed")
print(f"cats vs dogs: {doc1.similarity(doc2):.3f}") # ~0.92
print(f"cats vs stocks: {doc1.similarity(doc3):.3f}") # ~0.35
# Token-level vectors
token = nlp("king")[0]
print(f"Vector shape: {token.vector.shape}") # (300,)
print(f"Has vector: {token.has_vector}")
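Under the hood, similarity() is cosine similarity over the (averaged) word vectors. A toy illustration of that computation with NumPy, using made-up 3-d vectors in place of spaCy's 300-d ones:

```python
import numpy as np

def cosine(u, v):
    # similarity = u·v / (|u| * |v|)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 3-d vectors standing in for spaCy's 300-d GloVe vectors
cats = np.array([1.0, 0.9, 0.1])
dogs = np.array([0.9, 1.0, 0.2])
stocks = np.array([0.0, 0.1, 1.0])

print(f"{cosine(cats, dogs):.3f}")    # high: similar directions
print(f"{cosine(cats, stocks):.3f}")  # low: near-orthogonal
```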
Performance tip: disable pipeline components you don't need at load time:
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"]). Skipping the parser and NER can make processing 2-3x faster.
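For many documents, nlp.pipe is also much faster than calling nlp(text) in a Python loop, because it streams and batches the texts internally. A minimal sketch using a blank (tokenizer-only) pipeline; the batch_size value is illustrative:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline

texts = ["First document.", "Second document.", "Third document."]

# nlp.pipe streams texts and batches them internally, which is much
# faster than calling nlp(text) one at a time over a large corpus
docs = list(nlp.pipe(texts, batch_size=2))
print([len(doc) for doc in docs])  # [3, 3, 3]
```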
Lilly Tech Systems