NLP Exam — FINAL Complete Code Reference
Sessions 2 · 3 · 5 · 6 · 7 · 8 · 9   ·   All Topics Included
spaCy  ·  TextBlob  ·  TF-IDF  ·  LDA  ·  K-Means  ·  Classification  ·  Word2Vec  ·  Chatbot
WHAT WAS ADDED IN THIS FINAL VERSION
⚠️  MISSING / ADDED NOW:  TF-IDF — both for similarity (sklearn) AND as classifier input via Pipeline
⚠️  MISSING / ADDED NOW:  LDA Topic Modeling — full Gensim pipeline with coherence score
⚠️  MISSING / ADDED NOW:  TextBlob Sentiment — single text + multiple texts + dataset loop
⚠️  MISSING / ADDED NOW:  Session 7 Text Classification — Pipeline + DT + LR + RF + report reading
 
Topic	Status	Section in this doc
Session 2 — Tokenization, POS, NER	✅ Complete	Section 1
Session 3 — Web Scraping	✅ Complete	Section 2
Session 3 — Rule-Based Matcher	✅ Complete	Section 3
Session 5 — Gensim BoW (Dictionary)	✅ Complete	Section 4
Session 5 — LDA Topic Modeling	✅ ADDED	Section 5
Session 6 — K-Means Clustering	✅ Complete	Section 6
Session 7 — Text Classification	✅ ADDED	Section 7
Session 8 — TextBlob Sentiment	✅ ADDED	Section 8
Session 8 — TF-IDF (Similarity)	✅ ADDED	Section 9
Session 8 — Word2Vec	✅ Complete	Section 10
Session 9 — Chatbot	✅ Complete	Section 11
 
 
SESSION 2:  Tokenization, POS Tagging & NER
SECTION 1 — Tokenization, POS & NER
1.1  Setup — Load Model & Create Doc
🎯  EXAM TIP:  nlp = the pipeline.  doc = processed text.  token = one unit inside doc.
▶  Imports and doc creation
import spacy
from spacy import displacy
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
 
nlp = spacy.load("en_core_web_sm")
 
# From a text file
with open("review.txt", "r", encoding="utf-8") as f:
    text = f.read()
 
doc = nlp(text)
 
# type(nlp)  -> spacy.lang.en.English
# type(doc)  -> spacy.tokens.doc.Doc
# type(doc[0]) -> spacy.tokens.token.Token
1.2  Sentences and Token Count
▶  Count sentences and tokens
# Number of sentences
sentences = list(doc.sents)
print("Number of sentences:", len(sentences))
 
# Number of tokens
tokens = list(doc)
print("Number of tokens:", len(tokens))
1.3  POS Tagging — Display POS of Every Token
🎯  EXAM TIP:  token.pos_ gives readable string (NOUN, VERB). token.pos gives integer code. Always use pos_ with underscore.
▶  POS for every token
# POS of each token
for token in doc:
    print(token.text, "->", token.pos_)
 
# Explain any POS tag
print(spacy.explain("AUX"))     # auxiliary
print(spacy.explain("PROPN"))   # proper noun
print(spacy.explain("ADP"))     # adposition (prepositions like 'of', 'in')
1.4  Filtered Tokens — No Stop Words, No Punct, No Numbers
🎯  EXAM TIP:  Standard filter pattern — memorise this. Used in almost every NLP task.
▶  Filter tokens
# STANDARD FILTER — use everywhere
filtered = []
for token in doc:
    if token.is_stop == False and token.is_punct == False and token.like_num == False:
        filtered.append((token.text, token.pos_))
 
for item in filtered:
    print(item[0], "->", item[1])
1.5  Count Nouns and Verbs
▶  Count by POS
noun_count = 0
verb_count = 0
 
for token in doc:
    if token.pos_ == "NOUN":
        noun_count += 1
    if token.pos_ == "VERB":
        verb_count += 1
 
print("Nouns:", noun_count)
print("Verbs:", verb_count)
1.6  Named Entity Recognition (NER)
🎯  EXAM TIP:  doc.ents = all entities.  ent.text = the entity string.  ent.label_ = the type.
▶  NER — entities, displacy, DataFrame, plot
# Display all entities
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
 
# Common entity types:
# PERSON = people   GPE = countries/cities/states   ORG = companies, agencies, institutions
# DATE = dates      LOC = non-GPE locations (mountains, rivers)   MONEY = money values
 
# Explain any label
print(spacy.explain("NORP"))   # Nationalities or religious/political groups
print(spacy.explain("GPE"))    # Countries, cities, states
 
# Visual NER display (renders in Jupyter)
displacy.render(doc, style="ent")
 
# DataFrame of entities
entity_data = []
for ent in doc.ents:
    entity_data.append([ent.text, ent.label_, spacy.explain(ent.label_)])
 
df_ner = pd.DataFrame(entity_data, columns=["Entity", "Type", "Explanation"])
print(df_ner)
 
# Plot entity type counts
type_counts = df_ner["Type"].value_counts()
type_counts.plot(kind="bar")
plt.title("Entity Type Distribution")
plt.xlabel("Entity Type")
plt.ylabel("Count")
plt.show()
 
# Most frequent entity type
print("Most common:", type_counts.index[0])
 
# Most important PERSON (by frequency)
persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(Counter(persons))
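
📌  Counter(persons) prints every count; to pull out only the single most frequent PERSON, most_common(1) can be used. A minimal sketch, assuming a made-up entity list in place of doc.ents (the names below are illustrative only):

```python
from collections import Counter

# Hypothetical (text, label) pairs standing in for
# [(ent.text, ent.label_) for ent in doc.ents]
entities = [
    ("Raj Kapoor", "PERSON"),
    ("India", "GPE"),
    ("Satyajit Ray", "PERSON"),
    ("Raj Kapoor", "PERSON"),
    ("Mumbai", "GPE"),
]

persons = [text for text, label in entities if label == "PERSON"]
person_counts = Counter(persons)

# most_common(1) returns a list holding one (item, count) tuple
top_person, top_count = person_counts.most_common(1)[0]
print(top_person, "appears", top_count, "times")
```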
 
 
SESSION 3:  Web Scraping
SECTION 2 — Web Scraping
🎯  EXAM TIP:  4 steps: requests.get() -> BeautifulSoup() -> .get_text() -> re.sub() to clean
▶  Full web scraping pipeline
import requests
import re
from bs4 import BeautifulSoup
import spacy
 
nlp = spacy.load("en_core_web_sm")
 
# Step 1: Fetch the webpage
url = "https://en.wikipedia.org/wiki/Cinema_of_India"
request = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
# <Response [200]>  — 200 = success, 404 = not found, 403 = blocked
 
# Step 2: Parse HTML
scrap = BeautifulSoup(request.content, "html.parser")
 
# Step 3: Extract plain text from main content
content = scrap.find("div", {"class": "mw-parser-output"}).get_text()
 
# Step 4: Clean — remove footnotes like [1] [23] [100]
filtered = re.sub(r"\[\d+\]", " ", content)
# re.sub(pattern, replacement, string)
# \[\d+\] = '[' + one-or-more-digits + ']'
 
final_content = " ".join(filtered.split())
# .split() breaks the text on any run of whitespace (spaces, newlines, tabs)
# " ".join() rebuilds it with single spaces between the pieces
 
# Create doc and use normally
doc = nlp(final_content)
print("Tokens:", len(doc))
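
📌  The two cleaning steps (Step 4 and the whitespace join) can be tried without a network call. A small sketch on an invented string that stands in for the output of .get_text():

```python
import re

# Made-up snippet standing in for the text returned by .get_text()
content = "Indian cinema[1] is large.[23]\n\n  It produces many\tfilms[100] each year."

no_footnotes = re.sub(r"\[\d+\]", " ", content)   # drop [1], [23], [100]
final_content = " ".join(no_footnotes.split())    # collapse whitespace runs

print(final_content)
```

The footnote markers are replaced by a space rather than an empty string so that words on either side of a marker are never glued together; the join step then removes the extra spaces.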
 
 
SESSION 3:  Rule-Based Matcher
SECTION 3 — Rule-Based Matcher
🎯  EXAM TIP:  4-step pattern EVERY TIME: Create Matcher -> Define pattern (list of dicts) -> Add -> Apply
📌  Each dict in the list = rules for ONE token. Multiple dicts = phrase (consecutive tokens).
3.1  Token Matching
▶  Match a single word (case-insensitive)
from spacy.matcher import Matcher
 
matcher1 = Matcher(nlp.vocab)
 
# 'lower' matches the lowercase version — catches Cinema, CINEMA, cinema
pattern1 = [{"lower": "cinema"}]
matcher1.add("Cinema_pattern", [pattern1])
 
matches1 = matcher1(doc)
count = 0
for match_id, start, end in matches1:
    span = doc[start:end]    # extract matched span
    count += 1
    print(span.text)
print("Total matches:", count)
3.2  Phrase Matching (multiple consecutive words)
▶  Match phrase + find sentences with phrase
matcher2 = Matcher(nlp.vocab)
# Three dicts = three consecutive tokens must all match
pattern2 = [{"lower": "cinema"}, {"lower": "of"}, {"lower": "india"}]
matcher2.add("Cinema_of_India", [pattern2])
 
matches2 = matcher2(doc)
for match_id, start, end in matches2:
    print(doc[start:end].text)
 
# Find SENTENCES containing the phrase
count = 0
for sent in doc.sents:
    m = matcher2(sent)
    if m:
        count += 1
        print(sent.text.strip())
print("Sentences with phrase:", count)
3.3  Advanced Matching — POS / Lemma / Length / Entity
▶  All advanced pattern types
# Match NUMBER + 'crore'  (e.g., '200 crore')
matcher3 = Matcher(nlp.vocab)
pattern3 = [{"pos": "NUM"}, {"lower": "crore"}]
matcher3.add("Crore", [pattern3])
 
# Match tokens 12+ characters long
matcher4 = Matcher(nlp.vocab)
pattern4 = [{"length": {">=": 12}}]
matcher4.add("LongTokens", [pattern4])
 
# Match any form of act/create/produce (lemma catches all verb forms)
matcher5 = Matcher(nlp.vocab)
pattern5 = [{"lemma": {"in": ["act", "depict", "direct", "create", "produce", "cast"]}}]
matcher5.add("FilmVerbs", [pattern5])
 
# Match by entity type (GPE = countries, cities)
matcher6 = Matcher(nlp.vocab)
pattern6 = [{"ent_type": "GPE"}]
matcher6.add("GPE_match", [pattern6])
 
# Apply and display (same for all matchers)
for match_id, start, end in matcher3(doc):
    print(doc[start:end].text)
 
for match_id, start, end in matcher5(doc):
    print(doc[start:end].text)
 
Matcher Pattern Key Reference
Key	Matches	Example
{"text": "Cinema"}	Exact text (case-sensitive)	Only 'Cinema', not 'cinema'
{"lower": "cinema"}	Any case version	Cinema, CINEMA, cinema
{"pos": "NUM"}	Part of speech tag	Any number token
{"lemma": "go"}	Root form of word	go, goes, went, gone
{"lemma": {"in": [...]}}	One of several lemmas	["act", "create", "produce"]
{"ent_type": "GPE"}	Entity type	Countries, cities, states
{"length": {">=": 12}}	Token length condition	Tokens 12+ characters
 
 
SESSION 5:  Bag of Words (Gensim)
SECTION 4 — Gensim Bag of Words (BoW)
🎯  EXAM TIP:  Dictionary maps each word to an integer ID.  doc2bow converts word list to [(id, count)] pairs.
📌  BoW ignores word order — only word COUNTS matter. This is the input to LDA.
▶  Full BoW pipeline
from gensim import corpora
 
# Step 1: Filter tokens to get clean lemmas (standard filter)
words = []
for token in doc:
    if token.is_stop == False and token.is_punct == False and token.like_num == False:
        if token.is_alpha == True:
            words.append(token.lemma_)   # always use lemma_ (root form)
 
print("Raw tokens:", len(list(doc)))
print("Filtered words:", len(words))
print("Sample:", words[:10])
 
# Step 2: Create Dictionary — assigns each unique word an integer ID
tokenDict = corpora.Dictionary([words])
# [words] = list of word lists.
# For MULTIPLE documents: [[words_doc1], [words_doc2], [words_doc3]]
 
print("Vocabulary size:", len(tokenDict))
 
# Step 3: Create Bag of Words corpus
bows = []
for word_list in [words]:
    bow = tokenDict.doc2bow(word_list)   # [(word_id, count), ...]
    bows.append(bow)
 
# Display what BoW looks like
print("BoW first 5 entries:")
for word_id, count in bows[0][:5]:
    print(" ID", word_id, "->", tokenDict[word_id], ":", count, "time(s)")
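
📌  To see what Dictionary and doc2bow produce, the same idea can be sketched in plain Python. This is an illustration only, not Gensim's implementation; the word list is made up and the ID order Gensim assigns may differ:

```python
from collections import Counter

# Made-up lemma list standing in for the filtered `words`
words = ["order", "delivery", "late", "order", "refund", "order", "delivery"]

# Give each unique word an integer ID in order of first appearance
# (illustrative scheme; Gensim's own ID assignment may differ)
token2id = {}
for w in words:
    if w not in token2id:
        token2id[w] = len(token2id)

# Build (id, count) pairs, sorted by ID
counts = Counter(token2id[w] for w in words)
bow = sorted(counts.items())

print(token2id)
print(bow)
```

Word order is lost in `bow`: only the ID and how many times it occurred remain.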
 
 
SESSION 5:  LDA Topic Modeling — ADDED
SECTION 5 — LDA Topic Modeling + Coherence Score
🎯  EXAM TIP:  LDA = Latent Dirichlet Allocation. Unsupervised. Discovers hidden topics in text.
🎯  EXAM TIP:  num_topics = YOU decide how many. The coherence score helps you compare choices: try several values and keep the one that scores highest.
📌  Full pipeline: raw text -> spaCy doc -> filter -> Dictionary -> BoW -> LDA -> Coherence -> pyLDAvis
5.1  LDA Model — Build and Print Topics
▶  Build LDA and view topics
import PyPDF2
from gensim import corpora
from gensim.models.ldamodel import LdaModel
 
# --- If reading from PDF ---
text = " "
with open("retailnova_reviews.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)
    for page in reader.pages:
        text = text + page.extract_text()
 
# --- Create doc and filter ---
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
 
words = []
for token in doc:
    if token.is_stop == False and token.is_punct == False and token.like_num == False:
        if token.is_alpha == True:
            words.append(token.lemma_)
 
# --- Dictionary and BoW ---
tokenDict = corpora.Dictionary([words])
bows = []
for word_list in [words]:
    bows.append(tokenDict.doc2bow(word_list))
 
# --- LDA Model ---
lda = LdaModel(
    corpus=bows,          # the BoW corpus
    num_topics=3,         # how many topics to discover (you choose this!)
    id2word=tokenDict,    # dictionary that maps IDs back to words
    random_state=42,      # for reproducibility (same result every run)
    passes=10             # number of full passes over the corpus (more = usually better but slower)
)
 
print(lda)   # shows model summary
 
# View topics — each line shows top 5 words and their weights
topics = lda.print_topics(num_words=5)
for topic_id, topic_words in topics:
    print("Topic", topic_id, "->", topic_words)
 
# Format: '0.050*"order" + 0.045*"delivery" + ...'
# The number before * = WEIGHT (importance) of that word in this topic
# Higher weight = more characteristic of this topic
5.2  Coherence Score — Evaluate Topic Quality
🎯  EXAM TIP:  Higher coherence = better topics. Rule of thumb:  0.6+ = Excellent.  0.4–0.6 = Good.  0.3–0.4 = Moderate.  <0.3 = Poor
▶  Coherence score
from gensim.models.coherencemodel import CoherenceModel
 
coh_model = CoherenceModel(
    model=lda,            # the trained LDA model
    texts=[words],        # the filtered word lists
    dictionary=tokenDict  # the dictionary
)
 
coh_score = coh_model.get_coherence()
print("Coherence Score:", coh_score)
 
# CoherenceModel uses the "c_v" measure by default.
# Interpretation (rule of thumb, not a strict standard):
# 0.6 - 1.0  Excellent — topics are very clear
# 0.4 - 0.6  Good — topics are meaningful
# 0.3 - 0.4  Moderate — topics have some coherence
# 0.0 - 0.3  Poor — topics are random / not useful
5.3  Find Optimal num_topics (Coherence Plot)
▶  Plot coherence vs number of topics
import matplotlib.pyplot as plt
 
scores = []
topic_range = range(2, 8)   # try 2, 3, 4, 5, 6, 7 topics
 
for n in topic_range:
    temp_lda = LdaModel(corpus=bows, num_topics=n, id2word=tokenDict, random_state=42)
    temp_cm = CoherenceModel(model=temp_lda, texts=[words], dictionary=tokenDict)
    score = temp_cm.get_coherence()   # compute once, reuse below
    scores.append(score)
    print("Topics =", n, "-> Coherence =", round(score, 4))
 
plt.plot(list(topic_range), scores, marker="o")
plt.xlabel("Number of Topics")
plt.ylabel("Coherence Score")
plt.title("Coherence Score vs Number of Topics")
plt.show()
 
# Choose num_topics at the PEAK of the curve
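
📌  Reading the peak off the plot can also be done in code. A minimal sketch, assuming invented coherence values (these are not real results):

```python
# Hypothetical coherence scores for 2..7 topics (not real results)
topic_range = range(2, 8)
scores = [0.31, 0.38, 0.45, 0.42, 0.40, 0.37]

best_index = scores.index(max(scores))     # position of the highest score
best_n = list(topic_range)[best_index]     # matching number of topics
print("Best num_topics:", best_n, "with coherence", scores[best_index])
```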
5.4  Term-Topic Score for a Specific Word
▶  Find which topic a word belongs to most
# Find the term-topic scores for the word 'order'
word = "order"
 
if word in tokenDict.token2id:
    word_id = tokenDict.token2id[word]
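    term_scores = lda.get_term_topics(word_id)   # list of (topic_id, probability)
    print(f"Topic scores for '{word}':")
    for topic_id, score in term_scores:
        print("  Topic", topic_id, "->", round(score, 4))
    # Highest score = word belongs to that topic most
    # (the list can be empty if every score is below the model's minimum probability)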
else:
    print(word, "not found in dictionary")
5.5  pyLDAvis Visualization
▶  Interactive topic visualization
import pyLDAvis
from pyLDAvis import gensim_models
 
pyLDAvis.enable_notebook()   # enables display inside Jupyter
 
plot = gensim_models.prepare(lda, corpus=bows, dictionary=tokenDict)
plot   # renders interactive bubble chart in Jupyter
 
# HOW TO READ:
# Left panel: each bubble = one topic
# Bubble SIZE = how often that topic appears in documents
# DISTANCE between bubbles = how different the topics are (far = GOOD)
# Right panel: top words for the selected topic
# Lambda slider: move toward 0 to see more unique/distinctive words
 
 
SESSION 6:  Text Clustering
SECTION 6 — Text Clustering (K-Means)
🎯  EXAM TIP:  Clustering = UNSUPERVISED (no labels given). You discover the groups yourself.
🎯  EXAM TIP:  Classification = SUPERVISED (labels given). You train to predict known categories.
▶  Full K-Means clustering pipeline
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
 
# Load dataset (class used 'bbc-text.csv' with 5 categories)
bbc = pd.read_csv("bbc-text.csv")
print(bbc["category"].value_counts())
 
# Use ONLY text column — no labels for clustering
data = bbc["text"]
 
# --- TF-IDF Vectorization ---
vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(data)
# X is a matrix: rows = documents, columns = words, values = TF-IDF scores
 
# --- Elbow Method to find best K ---
WSS = []   # Within-Cluster Sum of Squares (lower = tighter clusters = better)
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, random_state=10)
    kmeans.fit(X)
    WSS.append(kmeans.inertia_)   # inertia_ stores the WSS value
 
plt.plot(range(1, 10), WSS)
plt.xlabel("Number of Clusters K")
plt.ylabel("WSS / Inertia")
plt.title("Elbow Method — choose K at the sharp bend")
plt.show()
 
# --- Build final model with optimal K (e.g., K=5 for BBC) ---
kmeans = KMeans(n_clusters=5, random_state=10)
kmeans.fit(X)
 
bbc["Label"] = kmeans.labels_    # assign cluster number to each document
print(bbc["Label"].value_counts())
 
# Inspect each cluster — read articles and give the cluster a name manually
print(bbc[bbc["Label"] == 0])   # Cluster 0 — read these to name it
print(bbc[bbc["Label"] == 1])   # Cluster 1
print(bbc[bbc["Label"] == 2])   # Cluster 2
print(bbc[bbc["Label"] == 3])   # Cluster 3
print(bbc[bbc["Label"] == 4])   # Cluster 4
✍️  WRITE THIS IN EXAM:  Clustering is unsupervised — no labels are given. The algorithm groups similar documents based on TF-IDF word frequency patterns. The cluster names are assigned manually by reading sample documents from each cluster.
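
📌  What inertia_ (WSS) measures can be checked by hand on a toy example. A sketch in plain Python, assuming four invented 2-D points and invented cluster labels in place of real TF-IDF rows and kmeans.labels_:

```python
# Toy 2-D points with made-up cluster labels
# (stand-ins for TF-IDF rows and kmeans.labels_)
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
labels = [0, 0, 1, 1]

def centroid(cluster_points):
    """Mean of each coordinate across the points in one cluster."""
    n = len(cluster_points)
    return tuple(sum(coords) / n for coords in zip(*cluster_points))

wss = 0.0
for cluster_id in set(labels):
    members = [p for p, lab in zip(points, labels) if lab == cluster_id]
    center = centroid(members)
    # add the squared Euclidean distance of each member to its cluster centre
    wss += sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in members)

print("WSS:", wss)
```

Each extra cluster can only keep WSS the same or lower it, which is why the elbow plot always slopes downward and the useful K is where the drop flattens.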
 
 
SESSION 7:  Text Classification — ADDED
SECTION 7 — Text Classification (Session 7)
🎯  EXAM TIP:  Pipeline = TF-IDF + Classifier chained together. Always use Pipeline in classification.
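🎯  EXAM TIP:  Logistic Regression usually performs best of the three models here on TF-IDF text features. State this in theory answers. Typical accuracy quoted in class: 75-85% (actual figures depend on the dataset and split).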
7.1  Load Dataset and Prepare
▶  Load 20 Newsgroups and split
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
 
# Load the dataset (default is subset="train"; pass subset="all" for the full corpus)
news = fetch_20newsgroups()
 
# X = text documents (the features/input)
# y = category numbers 0-19 (the labels/output)
X = news["data"]
y = news["target"]
 
print("Total documents:", len(X))
print("Category names:", news["target_names"])
print("Sample label:", y[0], "->", news["target_names"][y[0]])
 
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% held out for testing
    random_state=10
)
 
print("Train:", len(X_train), "| Test:", len(X_test))
7.2  TF-IDF Explained (as used in Pipeline)
📌  TF-IDF = Term Frequency x Inverse Document Frequency. High score = word frequent HERE but rare across ALL docs.
📌  In Pipeline: TF-IDF learns vocabulary from X_train ONLY. When predicting, it converts X_test using the same vocabulary — no cheating.
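
📌  A worked example of the TF x IDF idea using the textbook formula on three invented documents. This is a sketch for intuition only: sklearn's TfidfVectorizer uses a smoothed IDF and L2-normalises each row by default, so its numbers will differ from these.

```python
import math

# Three tiny made-up documents, already tokenised
docs = [
    ["cheap", "flight", "cheap", "hotel"],
    ["flight", "delay"],
    ["hotel", "review"],
]

def tf(term, doc):
    """Share of the document's tokens that are this term."""
    return doc.count(term) / len(doc)

def idf(term, all_docs):
    """log(total documents / documents containing the term)."""
    df = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / df)

def tfidf(term, doc, all_docs):
    return tf(term, doc) * idf(term, all_docs)

cheap_score = tfidf("cheap", docs[0], docs)    # frequent here, absent elsewhere
flight_score = tfidf("flight", docs[0], docs)  # appears in two of three docs
print(round(cheap_score, 4), round(flight_score, 4))
```

"cheap" scores higher than "flight" in the first document because it occurs more often there and in no other document.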
7.3  Model 1 — Decision Tree Classifier
▶  Decision Tree pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
 
# Pipeline chains TF-IDF + Classifier into ONE object
pipe_dt = Pipeline([
    ("tfidf", TfidfVectorizer()),                        # Step 1: convert text to numbers
    ("classification_dt", DecisionTreeClassifier(random_state=10))  # Step 2: classify
])
# Decision Tree: builds yes/no questions on word features
# Problem: overfits text data. Accuracy typically 50-60%.
 
# Train the pipeline
pipe_dt.fit(X_train, y_train)
 
# Predict on test set
y_pred = pipe_dt.predict(X_test)
 
# Evaluate
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
7.4  Model 2 — Logistic Regression (Best for Text)
▶  Logistic Regression pipeline
from sklearn.linear_model import LogisticRegression
 
pipe_lr = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classification_lr", LogisticRegression(random_state=10))
])
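# Logistic Regression: probability-based classifier
# Despite 'regression' in the name, it IS a classifier
# Well suited to high-dimensional sparse data like TF-IDF
# Typical accuracy: 75-85%
# If sklearn prints a ConvergenceWarning, raise the limit: LogisticRegression(max_iter=1000, random_state=10)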
 
pipe_lr.fit(X_train, y_train)
 
y_pred = pipe_lr.predict(X_test)
 
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
7.5  Model 3 — Random Forest
▶  Random Forest pipeline
from sklearn.ensemble import RandomForestClassifier
 
pipe_rf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classification_rf", RandomForestClassifier(random_state=10))
])
# Random Forest = ensemble of many Decision Trees
# Each tree trains on random subset of data + features
# Final prediction = majority vote of all trees
# Typical accuracy: 65-75%
 
pipe_rf.fit(X_train, y_train)
 
y_pred = pipe_rf.predict(X_test)
 
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
7.6  Reading the Classification Report
🎯  EXAM TIP:  F1-Score is usually the most informative single metric because it balances precision and recall, especially when classes are imbalanced.
Metric	Formula	What it means
Precision	TP / (TP + FP)	Of all predicted as X, how many were actually X?
Recall	TP / (TP + FN)	Of all actual X, how many did we find?
F1-Score	2 x P x R / (P + R)	Harmonic mean of Precision and Recall (one number that balances both)
Support	—	Number of actual documents in this class
Accuracy	Correct / Total	Overall percentage of correct predictions
 
Model	Typical Accuracy	Why?
Decision Tree	50–60%	Overfits. Not ideal for high-dimensional text.
Logistic Regression	75–85%	Usually best of the three. Handles sparse TF-IDF vectors well.
Random Forest	65–75%	Good ensemble model, but slower than LR for text.
✍️  WRITE THIS IN EXAM:  Logistic Regression typically performs best of these three models for text classification because it is a linear model that handles high-dimensional sparse vectors (like TF-IDF) efficiently and outputs class probabilities.
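
📌  The formulas in the metrics table can be checked by hand for one class. A small sketch with invented TP / FP / FN counts (not taken from a real report):

```python
# Made-up counts for one class (not from a real classification report)
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)    # of everything predicted as X, how much was X
recall = tp / (tp + fn)       # of everything that really was X, how much we found
f1 = 2 * precision * recall / (precision + recall)

print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
print("F1:", round(f1, 3))
```

F1 lands between precision and recall but closer to the lower of the two, which is the effect of using a harmonic mean.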
 
 
SESSION 8:  Sentiment Analysis — ADDED
SECTION 8 — Sentiment Analysis (TextBlob)
🎯  EXAM TIP:  Polarity: -1 to +1.  Subjectivity: 0 to 1.  MEMORISE these ranges.
8.1  Single Text Sentiment
▶  Basic TextBlob usage
from textblob import TextBlob
 
text = "I watched a movie. It was superb."
 
blob = TextBlob(text)
 
print("Polarity:", blob.sentiment.polarity)       # 0.75
print("Subjectivity:", blob.sentiment.subjectivity)  # 1.0
 
# Polarity = emotional direction of text
#   > 0  -> Positive  (superb, amazing, great)
#   = 0  -> Neutral   (factual statements)
#   < 0  -> Negative  (horrible, terrible, bad)
#   Range: -1.0 to +1.0
 
# Subjectivity = how opinion-based is the text
#   0.0 -> Very factual/objective  ("The lecture is at 10am")
#   1.0 -> Very opinionated        ("This movie is breathtaking!")
#   Range: 0.0 to 1.0
8.2  Multiple Texts — ChatGPT QP Pattern
▶  Sentiment for 4 documents + bar charts
# text1 ... text4 must already hold the four passages from the question paper (as strings)
blob1 = TextBlob(text1)
blob2 = TextBlob(text2)
blob3 = TextBlob(text3)
blob4 = TextBlob(text4)
 
polarity = [
    blob1.sentiment.polarity,
    blob2.sentiment.polarity,
    blob3.sentiment.polarity,
    blob4.sentiment.polarity
]
 
subjectivity = [
    blob1.sentiment.subjectivity,
    blob2.sentiment.subjectivity,
    blob3.sentiment.subjectivity,
    blob4.sentiment.subjectivity
]
 
labels = ["Education", "Advertising", "Customer Care", "Language"]
 
# Polarity bar chart
import matplotlib.pyplot as plt
 
plt.bar(labels, polarity)
plt.title("Polarity Comparison")
plt.ylabel("Polarity (-1 to +1)")
plt.show()
 
# Subjectivity bar chart
plt.bar(labels, subjectivity)
plt.title("Subjectivity Comparison")
plt.ylabel("Subjectivity (0 to 1)")
plt.show()
8.3  Dataset Loop — RetailNova / Movie Reviews Pattern
▶  Sentiment for a list of texts with classification
from textblob import TextBlob
import pandas as pd
import matplotlib.pyplot as plt
 
movie_reviews = [
    "The movie was average, nothing special.",
    "Absolutely loved the movie, every scene was engaging.",
    "A complete waste of time.",
    "Good performances, weak storyline.",
]
 
# Process each review
sentiment_list = []
 
for review in movie_reviews:
    sentiment = TextBlob(review).sentiment    # analyse once, read both scores
    polarity = sentiment.polarity
    subjectivity = sentiment.subjectivity
 
    if polarity > 0:
        label = "Positive"
    elif polarity < 0:
        label = "Negative"
    else:
        label = "Neutral"
 
    sentiment_list.append((review, polarity, label, subjectivity))
 
# Create DataFrame
senti_df = pd.DataFrame(
    sentiment_list,
    columns=["Review", "Polarity", "Sentiment", "Subjectivity"]
)
print(senti_df)
 
# Summary
print(senti_df["Sentiment"].value_counts())
print("Average Polarity:", senti_df["Polarity"].mean())
print("Average Subjectivity:", senti_df["Subjectivity"].mean())
 
# Bar chart
senti_df["Sentiment"].value_counts().plot(kind="bar")
plt.title("Sentiment Distribution")
plt.show()
 
Metric	Range	Interpretation
Polarity	-1.0 to +1.0	> 0 = Positive | = 0 = Neutral | < 0 = Negative
Subjectivity	0.0 to 1.0	0 = Factual/objective | 1 = Opinionated/subjective
✍️  WRITE THIS IN EXAM:  Polarity indicates the emotional tone: positive values mean positive sentiment and negative values mean negative sentiment. Subjectivity measures how opinionated the text is; values closer to 1 indicate personal opinions rather than facts.
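
📌  The Positive / Negative / Neutral rule from the dataset loop can be wrapped in a small helper so it is written only once. A sketch, assuming a hypothetical function name label_sentiment and invented polarity values (not actual TextBlob output):

```python
def label_sentiment(polarity):
    """Map a polarity score in [-1, 1] to a label (same rule as the dataset loop)."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

# Made-up polarity values, not actual TextBlob output
for p in [0.6, -0.25, 0.0]:
    print(p, "->", label_sentiment(p))
```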
 
 
SESSION 8:  TF-IDF Similarity — ADDED
SECTION 9 — TF-IDF Cosine Similarity
🎯  EXAM TIP:  TF-IDF vectorises documents. Cosine similarity compares them. Higher = more similar.
▶  TF-IDF similarity matrix — ChatGPT QP Q4
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
 
# List of all texts to compare
texts = [text1, text2, text3, text4]
labels = ["Education", "Advertising", "Customer Care", "Language"]
 
# Step 1: TF-IDF vectorisation
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(texts)
# Each row = one document as a vector of TF-IDF scores
# TF = how often word appears in THIS doc
# IDF = how rare the word is ACROSS ALL docs
# TF-IDF = high for words frequent here but rare elsewhere
 
# Step 2: Cosine Similarity matrix
sim_matrix = cosine_similarity(tfidf_matrix)
print(sim_matrix)
# sim_matrix[i][j] = similarity between document i and document j
# Diagonal is always 1.0 (doc compared to itself = identical)
# Off-diagonal: higher = more similar
 
# Step 3: Find most similar and most dissimilar pair
# Read the matrix: look for highest value NOT on diagonal
print("Education vs Advertising:", sim_matrix[0][1])
print("Education vs Customer Care:", sim_matrix[0][2])
print("Education vs Language:", sim_matrix[0][3])
✍️  WRITE THIS IN EXAM:  The most similar texts share overlapping vocabulary and themes. The most dissimilar texts use domain-specific terminology with little overlap. All texts share core AI/ChatGPT vocabulary but differ in domain-specific terms.
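
📌  Instead of reading the matrix by eye, the most and least similar pair can be found in code. A minimal sketch, assuming an invented similarity matrix in place of the real sim_matrix (values are hypothetical):

```python
from itertools import combinations

labels = ["Education", "Advertising", "Customer Care", "Language"]

# Hypothetical similarity matrix (symmetric, 1.0 on the diagonal); not real output
sim_matrix = [
    [1.00, 0.20, 0.15, 0.35],
    [0.20, 1.00, 0.30, 0.10],
    [0.15, 0.30, 1.00, 0.25],
    [0.35, 0.10, 0.25, 1.00],
]

# combinations() yields each pair (i, j) with i < j once, so the diagonal is skipped
pairs = [(sim_matrix[i][j], labels[i], labels[j])
         for i, j in combinations(range(len(labels)), 2)]

most_similar = max(pairs)     # tuple with the highest similarity
least_similar = min(pairs)    # tuple with the lowest similarity
print("Most similar:", most_similar)
print("Least similar:", least_similar)
```

The same indexing works on the NumPy array returned by cosine_similarity, since sim_matrix[i][j] is valid for both.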
 
 
SESSION 8:  Word2Vec
SECTION 10 — Word2Vec
🎯  EXAM TIP:  Word2Vec = dense meaning vectors. Similar words have similar vectors. Know: vector_size, min_count, most_similar().
▶  Word2Vec — build and use
from gensim.models import Word2Vec
import spacy
 
nlp = spacy.load("en_core_web_sm")
 
movie_reviews = [
    "The movie was average, nothing special.",
    "Absolutely loved the movie, every scene was engaging.",
]
 
# Step 1: Filter tokens from all reviews
doc_gen = nlp.pipe(movie_reviews)
token_list = []
for text_doc in doc_gen:
    for token in text_doc:
        if token.is_stop == False and token.is_punct == False:
            token_list.append(token.lemma_)
 
# Step 2: Build Word2Vec model
wv_model = Word2Vec(
    sentences=[token_list],   # list of word lists
    vector_size=100,          # each word = 100-dimensional vector
    min_count=1               # include words appearing at least once
)
 
# Step 3: Access a word's vector
print(wv_model.wv["movie"][:5])   # first 5 of 100 dimensions
 
# Step 4: Find most similar words
similar = wv_model.wv.most_similar("movie", topn=5)
for word, score in similar:
    print(word, "->", round(score, 4))
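# Score is cosine similarity: near 1.0 = used in very similar contexts, near 0 = unrelated,
# negative = vectors point in opposite directions (not necessarily opposite meaning).
# With a corpus this small the scores are not meaningful; real use needs far more text.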
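
📌  The score returned by most_similar() is a cosine similarity. A sketch of the calculation in plain Python on invented 2-D vectors (real Word2Vec vectors have vector_size dimensions):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over product of lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 2], [2, 4]))     # same direction
print(cosine([1, 0], [0, 1]))     # perpendicular
print(cosine([1, 2], [-1, -2]))   # opposite direction
```

Only the direction of the vectors matters, not their length: [1, 2] and [2, 4] count as the same direction.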
 
 
SESSION 9:  Chatbot
SECTION 11 — Chatbot (Word2Vec + Cosine Similarity)
🎯  EXAM TIP:  3 helper functions to know: preprocess_text(), get_sentence_vector(), get_response()
▶  Full chatbot code
import spacy
import random
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
 
nlp = spacy.load("en_core_web_sm")
 
# STEP 1: Responses per intent
responses = {
    "greeting": ["Hello!", "Hi there!", "Greetings!"],
    "goodbye": ["Goodbye!", "See you later!", "Take care!"],
    "thanks": ["You are welcome!", "No problem!", "Glad to help!"],
    "default": ["I am not sure. Could you rephrase?"]
}
 
# STEP 2: Training examples per intent
training_data = {
    "greeting": ["hello", "hi", "hey", "good morning", "hi there"],
    "goodbye": ["bye", "goodbye", "see you", "farewell", "take care"],
    "thanks": ["thank you", "thanks", "appreciate it", "thanks a lot"]
}
 
# STEP 3: Flatten all sentences into one list
all_sentences = []
for category in training_data.values():
    for sent in category:
        all_sentences.append(sent)
 
# STEP 4: Tokenize each sentence
tokenized_sentences = []
for sent in all_sentences:
    doc = nlp(sent.lower())
    tokens = []
    for token in doc:
        if token.is_stop == False and token.is_punct == False:
            tokens.append(token.lemma_)
    tokenized_sentences.append(tokens)
 
# STEP 5: Build Word2Vec
w2v_model = Word2Vec(sentences=tokenized_sentences, vector_size=100, min_count=1)
 
# STEP 6: Helper functions
def preprocess_text(text):
    tokens = []
    for token in nlp(text.lower()):
        tokens.append(token.lemma_)
    return tokens
 
def get_sentence_vector(words):
    vectors = []
    for word in words:
        if word in w2v_model.wv:
            vectors.append(w2v_model.wv[word])
    if vectors:
        return np.mean(vectors, axis=0)   # average all vectors
    else:
        return np.zeros(w2v_model.vector_size)
 
def get_response(user_input):
    processed = preprocess_text(user_input)
    input_vector = get_sentence_vector(processed)
    best_match = None
    highest_sim = 0.0   # require similarity > 0; an unknown input (all-zero vector) scores 0 and falls to "default"
    for category, sentences in training_data.items():
        for sentence in sentences:
            sent_vec = get_sentence_vector(preprocess_text(sentence))
            sim = cosine_similarity([input_vector], [sent_vec])[0][0]
            if sim > highest_sim:
                highest_sim = sim
                best_match = category
    if best_match and best_match in responses:
        return random.choice(responses[best_match])
    else:
        return random.choice(responses["default"])
 
# STEP 7: Chat loop
print("Chatbot: Hello! Type bye to exit.")
while True:
    user_input = input("You: ")
    if user_input.lower() in ["bye", "exit", "quit"]:
        print("Chatbot:", random.choice(responses["goodbye"]))
        break
    print("Chatbot:", get_response(user_input))
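
📌  The averaging step inside get_sentence_vector() can be tried without training a model. A sketch in plain Python, assuming a toy dictionary of invented 3-dimensional vectors in place of w2v_model.wv (the function name and numbers are illustrative only):

```python
# Toy 3-dimensional "embeddings" (invented numbers, not trained vectors)
toy_vectors = {
    "hello": [1.0, 0.0, 2.0],
    "friend": [3.0, 2.0, 0.0],
}

def sentence_vector(words, vectors, size=3):
    """Average the vectors of the known words; all zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values dimension by dimension
    return [sum(dim) / len(known) for dim in zip(*known)]

print(sentence_vector(["hello", "friend", "xyz"], toy_vectors))
print(sentence_vector(["xyz"], toy_vectors))
```

Unknown words such as "xyz" are simply skipped, so an input made up only of unknown words becomes an all-zero vector.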
 
 
SECTION 12 — Token Attribute Quick Reference
Attribute	Returns / True When	Example
token.text	Always — original string	'Running', 'NMIMS'
token.lemma_	Always — root form	'run' (from 'running')
token.pos_	Always — POS tag string	'NOUN', 'VERB', 'ADJ'
token.is_stop	Token is a stop word	'the', 'is', 'am', 'a'
token.is_punct	Token is punctuation	'.', ',', '!', '?'
token.is_alpha	All chars are alphabetic	'student', 'NMIMS'
token.is_digit	All chars are digits	'2025' (not '24/12/2025')
token.like_num	Looks like a number (broader)	'2025', '10,000', '1/2', 'five'
token.like_url	Looks like a URL	'www.nmims.edu'
token.like_email	Looks like an email	'student@nmims.edu'
token.is_lower	All lowercase	'hello', 'student'
token.is_upper	All uppercase	'NMIMS', 'AI'
token.is_title	Title case (first cap)	'Today', 'India'
token.is_bracket	Is a bracket character	'(', ')', '[', ']'
token.is_quote	Is a quotation mark	'"', "'"
 
Standard Filter — Use This Everywhere
▶  Standard token filter pattern (MEMORISE)
words = []
for token in doc:
    if token.is_stop == False and token.is_punct == False and token.like_num == False:
        if token.is_alpha == True:
            words.append(token.lemma_)    # always use lemma_ (root form)
 
# Used in full for BoW and LDA topic modeling.
# The Word2Vec and Chatbot code uses the shorter version (is_stop + is_punct only).
 
LAST-MINUTE CHECKLIST — 10 Must-Know Points
🎯  EXAM TIP:  1. Standard filter: not is_stop, not is_punct, not like_num, is_alpha -> append lemma_
🎯  EXAM TIP:  2. Polarity: -1 to +1 (>0=Positive, <0=Negative, =0=Neutral)
🎯  EXAM TIP:  3. Subjectivity: 0 to 1 (0=factual, 1=very opinionated)
🎯  EXAM TIP:  4. Coherence: 0.6+=Excellent, 0.4-0.6=Good, 0.3-0.4=Moderate, <0.3=Poor
🎯  EXAM TIP:  5. LDA args to know: corpus=bows, num_topics=3, id2word=tokenDict, random_state=42
🎯  EXAM TIP:  6. Pipeline: ('tfidf', TfidfVectorizer()), ('clf', Classifier()) — always 2 steps
🎯  EXAM TIP:  7. Logistic Regression = usually best of the three for text classification (75-85% accuracy quoted in class)
🎯  EXAM TIP:  8. Matcher 4-step: Create -> Define pattern (list of dicts) -> Add -> Apply to doc
🎯  EXAM TIP:  9. Clustering = Unsupervised (no labels). Classification = Supervised (has labels).
🎯  EXAM TIP:  10. Word2Vec args: sentences=[token_list], vector_size=100, min_count=1
