Named Entity Recognition (NER)
NER identifies and classifies named entities in text: persons, organizations, locations, dates, monetary values. It is a token-level classification task — every token gets an entity label. Widely used in search, knowledge extraction, document processing, and information retrieval.
NER with HuggingFace Transformers
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
import torch
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# QUICK API — pre-trained NER in 3 lines
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "Elon Musk founded SpaceX in 2002 in Hawthorne, California. OpenAI raised $10 billion from Microsoft."
entities = ner_pipeline(text)
for ent in entities:
    print(f"  [{ent['entity_group']:4s}] '{ent['word']}' (score={ent['score']:.2%})")
# Example output — note that dslim/bert-base-NER is trained on CoNLL-2003,
# so it tags only PER, ORG, LOC, and MISC; dates and monetary values are
# NOT covered (a different model or label set is needed for those):
#   [PER ] 'Elon Musk'  (score=99.8%)
#   [ORG ] 'SpaceX'     (score=99.7%)
#   [LOC ] 'Hawthorne'  (score=99.1%)
#   [LOC ] 'California' (score=99.5%)
#   [ORG ] 'OpenAI'     (score=99.6%)
#   [ORG ] 'Microsoft'  (score=99.8%)
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# BIO TAGGING SCHEME — how NER labels are structured
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# B-{TYPE}: Beginning of entity
# I-{TYPE}: Inside entity (continuation)
# O: Outside (not an entity)
example_sentence = ["Elon", "Musk", "founded", "SpaceX", "in", "2002"]
bio_labels = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-DATE"]
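Decoding BIO tags back into entity spans is a small but common step. A minimal sketch (the function name `decode_bio` is illustrative, not a library API): start a new span on each B- tag, extend it on a matching I- tag, and close it on O or a type mismatch.

```python
def decode_bio(tokens, labels):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):            # beginning of a new entity
            if current:
                entities.append(current)
            current = ([tok], lab[2:])
        elif lab.startswith("I-") and current and lab[2:] == current[1]:
            current[0].append(tok)          # continuation of the same entity
        else:                               # "O" or an inconsistent I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(" ".join(toks), typ) for toks, typ in entities]

print(decode_bio(["Elon", "Musk", "founded", "SpaceX", "in", "2002"],
                 ["B-PER", "I-PER", "O", "B-ORG", "O", "B-DATE"]))
# → [('Elon Musk', 'PER'), ('SpaceX', 'ORG'), ('2002', 'DATE')]
```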
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# CUSTOM NER MODEL — token classification with BERT
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-DATE", "I-DATE"]
num_labels = len(label_list)
class CustomNERModel(torch.nn.Module):
    # Note: cased checkpoints (e.g. "bert-base-cased") usually work better
    # for NER, since capitalization is a strong entity cue.
    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 9):
        super().__init__()
        self.bert = AutoModelForTokenClassification.from_pretrained(
            model_name, num_labels=num_labels
        )

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,  # if provided, computes loss automatically
        )
        return outputs
# Training custom NER:
# 1. Tokenize text → account for subword tokens (a word becomes multiple tokens)
# 2. Align labels: first subword gets true label, others get -100 (ignored)
# 3. Train with CrossEntropyLoss over all non-padding, non-continuation tokens
# 4. Evaluate with seqeval: entity-level precision/recall/F1
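Step 2 (label alignment) can be sketched independently of the tokenizer: with a HuggingFace fast tokenizer, `tokenizer(words, is_split_into_words=True).word_ids()` yields the word index of each subword (or `None` for special tokens). The helper below (`align_labels` is an illustrative name, not a library API) maps those indices to label IDs, masking continuations with -100 so CrossEntropyLoss ignores them.

```python
def align_labels(word_ids, word_labels, label2id):
    """First subword of each word keeps its label; continuation subwords
    and special tokens ([CLS], [SEP], padding) get -100 (ignored by loss)."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:          # special token
            aligned.append(-100)
        elif wid != prev:        # first subword of a word
            aligned.append(label2id[word_labels[wid]])
        else:                    # subword continuation
            aligned.append(-100)
        prev = wid
    return aligned

# "SpaceX" splits into two subwords ("space", "##x") → word_ids [..., 3, 3, ...]
label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-ORG": 3}
print(align_labels([None, 0, 1, 2, 3, 3, None],
                   ["B-PER", "I-PER", "O", "B-ORG"], label2id))
# → [-100, 1, 2, 0, 3, -100, -100]
```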
# Evaluation metric — F1 score per entity type:
# PER: 95.3  ORG: 88.7  LOC: 93.1  Overall: 92.4 (CoNLL-2003 benchmark)
Tip
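Entity-level F1 (what seqeval reports) is stricter than token accuracy: a prediction counts only if both the span boundaries and the type match exactly. A minimal sketch over (type, start, end) tuples, assuming exact-match scoring:

```python
def entity_f1(true_spans, pred_spans):
    """Exact-match entity-level F1 over (type, start, end) tuples."""
    tp = len(set(true_spans) & set(pred_spans))      # exact span+type matches
    prec = tp / len(pred_spans) if pred_spans else 0.0
    rec = tp / len(true_spans) if true_spans else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

true_spans = [("PER", 0, 2), ("ORG", 3, 4)]
pred_spans = [("PER", 0, 2)]          # found the person, missed the org
print(f"{entity_f1(true_spans, pred_spans):.3f}")
# → 0.667  (precision 1.0, recall 0.5)
```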
Practice Named Entity Recognition (NER) in small, isolated examples before integrating it into larger projects. Breaking concepts into small experiments builds genuine understanding faster than reading alone.
Modern NLP = Transformer-based. Pre-train, then fine-tune.
Practice Task
Practice Task — (1) Write a working example of Named Entity Recognition (NER) from scratch without looking at notes. (2) Modify it to handle an edge case (empty input, null value, or error state). (3) Share your solution in the Priygop community for feedback.
Common Mistake
A common mistake with Named Entity Recognition (NER) is skipping edge case testing — empty inputs, null values, and unexpected data types. Always validate boundary conditions to write robust, production-ready AI code.