A comprehensive repository of 10 practical NLP implementations covering everything from basic text processing to advanced deep learning models.
Quick Start • Installation • Practicals • Contributing • License
| Field | Details |
|---|---|
| Name | PREXIT JOSHI |
| Roll Number | UE233118 |
| Branch | Computer Science and Engineering (CSE) |
| Institute | University Institute of Engineering and Technology, Punjab University (UIET, PU) |
| Email | prexitjoshi@gmail.com |
| GitHub | @intronep666 |
Get started in minutes:
# 1. Clone repository
git clone https://github.com/intronep666/Natural-Language-Processing.git
cd Natural-Language-Processing
# 2. Install dependencies
pip install -r requirements.txt
# 3. Download NLP data
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt')"
# 4. Launch Jupyter
jupyter notebook
# 5. Open and run the practicals!
For detailed setup instructions, see GETTING_STARTED.md.
- Python 3.8 or higher
- pip/conda
- ~2GB disk space (for models)
- Virtual environment (recommended)
# Create virtual environment
python -m venv nlp_env
source nlp_env/bin/activate # On Windows: nlp_env\Scripts\activate
# Install all dependencies
pip install -r requirements.txt
# Download spaCy model
python -m spacy download en_core_web_sm
See GETTING_STARTED.md for detailed setup, troubleshooting, and next steps.
- What is NLP?
- Core Concepts
- NLP Processing Pipeline
- Key Techniques
- Applications
- Challenges
- Tools & Libraries
- Practical Implementations
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and linguistics that focuses on enabling computers to understand, interpret, and generate human language in a meaningful and useful way. It bridges the gap between human communication and computer understanding.
- Communication Bridge: Enables machines to understand human language naturally
- Data Extraction: Extract valuable insights from unstructured text data
- Automation: Automate language-based tasks at scale
- Business Intelligence: Analyze customer feedback, reviews, and sentiment
- Global Reach: Break language barriers through translation
┌───────────────────────────────────────────┐
│           NLP Core Objectives             │
├───────────────────────────────────────────┤
│ 1. Understanding (Comprehension)          │
│ 2. Generation (Producing text)            │
│ 3. Translation (Language to language)     │
│ 4. Analysis (Extracting information)      │
│ 5. Classification (Categorizing text)     │
└───────────────────────────────────────────┘
Breaking down text into smaller units (words, sentences, or subwords).
Example:
Text: "Natural Language Processing is amazing!"
Tokens: ["Natural", "Language", "Processing", "is", "amazing", "!"]
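A minimal sketch of tokenization with spaCy and NLTK (assumes `en_core_web_sm` and NLTK's `punkt` data are already installed, as in the setup above):

```python
import spacy
import nltk

nlp = spacy.load("en_core_web_sm")
text = "Natural Language Processing is amazing!"

# Word tokens via spaCy
spacy_tokens = [token.text for token in nlp(text)]

# Word and sentence tokens via NLTK (uses the 'punkt' tokenizer data)
nltk_tokens = nltk.word_tokenize(text)
sentences = nltk.sent_tokenize("NLP is fun. It powers chatbots.")

print(spacy_tokens)   # ['Natural', 'Language', 'Processing', 'is', 'amazing', '!']
print(nltk_tokens)
print(sentences)
```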
| Stemming | Lemmatization |
|---|---|
| Removes suffixes mechanically | Uses vocabulary and morphology |
| Fast but may oversimplify | Accurate but slower |
| "running", "runs" β "run" | "running", "runs" β "run" |
Common words (the, is, and, etc.) that are often removed for efficiency.
Example:
Original: "The cat is on the mat"
After removal: "cat mat"
Labeling each word with its grammatical role.
The → DET (Determiner)
cat → NN (Noun)
runs → VB (Verb)
quickly → RB (Adverb)
Identifying and classifying named entities in text.
Text: "Apple Inc. is located in Cupertino, California"
Entities:
- "Apple Inc." β Organization
- "Cupertino" β Location
- "California" β Location
Understanding grammatical relationships between words.
"The cat chased the mouse"
nsubj ← chased → obj
subject: "cat"
action: "chased"
object: "mouse"
┌───────────────────┐
│     Raw Text      │
└─────────┬─────────┘
          ↓
┌───────────────────┐
│   Text Cleaning   │ (Remove special characters, lowercasing)
└─────────┬─────────┘
          ↓
┌───────────────────┐
│   Tokenization    │ (Break into tokens)
└─────────┬─────────┘
          ↓
┌───────────────────┐
│   Normalization   │ (Stemming/Lemmatization)
└─────────┬─────────┘
          ↓
┌───────────────────┐
│     Stop Word     │ (Remove common words)
│      Removal      │
└─────────┬─────────┘
          ↓
┌───────────────────┐
│      Feature      │ (Convert to numerical vectors)
│    Extraction     │
└─────────┬─────────┘
          ↓
┌───────────────────┐
│    ML/DL Model    │ (Classification, clustering, etc.)
└─────────┬─────────┘
          ↓
┌───────────────────┐
│    Prediction/    │ (Output results)
│     Analysis      │
└───────────────────┘
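As a rough sketch of this pipeline in code (the sample sentences, the `preprocess` helper, and the choice of spaCy plus scikit-learn are illustrative assumptions, not the repository's exact implementation):

```python
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # 1. Cleaning: lowercase and drop non-letter characters
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # 2-4. Tokenization, lemmatization, and stop-word removal via spaCy
    doc = nlp(text)
    return " ".join(tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_space)

corpus = ["The cats are running in the garden!", "Dogs ran quickly after the ball."]
cleaned = [preprocess(t) for t in corpus]

# 5. Feature extraction: convert cleaned text to TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)        # ready for an ML/DL model
print(vectorizer.get_feature_names_out())
```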
Converts text into a vector of word counts, ignoring word order.
Sentence: "I love NLP, NLP is great"
BoW: {
"I": 1,
"love": 1,
"NLP": 2,
"is": 1,
"great": 1
}
Weighs words based on their importance in a document and corpus.
Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t)
Where:
- TF = frequency of term in document
- IDF = log(total documents / documents containing term)
Sequences of N consecutive words.
Text: "Natural Language Processing"
Unigrams (1-gram):
["Natural"], ["Language"], ["Processing"]
Bigrams (2-gram):
["Natural", "Language"], ["Language", "Processing"]
Trigrams (3-gram):
["Natural", "Language", "Processing"]
- Captures semantic similarity between words
- Two models: CBOW (Continuous Bag of Words) and Skip-gram
- Output: Dense vector for each word
- Count-based embedding using word co-occurrence matrix
- Combines global statistics with local context
- Extension of Word2Vec
- Treats words as bags of character n-grams
- Can generate vectors for out-of-vocabulary words
- Contextual embeddings based on transformer architecture
- Understands context from both directions
- State-of-the-art for many NLP tasks
Determining the emotional tone or sentiment of text.
Positive Sentiment: "This movie is absolutely amazing!"
Negative Sentiment: "I hate waiting in long lines"
Neutral Sentiment: "The temperature is 25 degrees"
Assigning documents to predefined categories.
Common Algorithms:
- Naïve Bayes (probabilistic)
- Support Vector Machine (SVM)
- Neural Networks (Deep Learning)
- LSTM (Long Short-Term Memory)
Grouping similar documents without predefined labels.
Popular Method: K-Means
- Partitions documents into K clusters
- Minimizes within-cluster distance
- Maximizes between-cluster distance
- Long Short-Term Memory networks
- Handle sequential data (text)
- Maintain long-term dependencies
- Excellent for sentiment analysis and text generation
- Siri, Alexa, Google Assistant
- Customer support chatbots
- Conversational AI systems
- Filtering spam messages
- Identifying phishing emails
- Priority inbox management
- Google Translate
- Breaking language barriers
- Real-time translation
- Extract structured data from unstructured text
- Resume parsing
- Document analysis
- Monitoring brand reputation
- Analyzing customer reviews
- Social media monitoring
- Market research
- Search engines
- FAQ automation
- Knowledge base systems
- Search engines (Google, Bing)
- Document ranking
- Semantic search
- Person/Place/Organization identification
- Resume screening
- News article analysis
- Autocomplete (Gmail, predictive text)
- Content generation
- Paraphrasing tools
- News categorization
- Document organization
- Topic modeling
- Lexical Ambiguity: Words with multiple meanings
- "bank" (financial institution vs. river bank)
- Syntactic Ambiguity: Multiple grammatical interpretations
- "I saw the man with the telescope"
- Machines struggle with understanding nuanced meanings
- Sarcasm, idioms, and cultural references are difficult
- Different languages have different structures
- Dialects, slang, and informal speech
- Misspellings and typos
- Limited labeled data for training
- Low-resource languages
- Domain-specific terminology
- Understanding relationships between distant words
- Solved partially by LSTM and Transformers
- Training data may contain biases
- Results in biased models and unfair predictions
- Large language models require significant resources
- Training and inference can be expensive
| Library | Purpose | Features |
|---|---|---|
| NLTK | Natural Language Toolkit | Tokenization, POS tagging, stemming, NER |
| spaCy | Industrial-strength NLP | Fast, efficient, production-ready |
| TextBlob | Simple text processing | Sentiment analysis, POS tagging |
| Gensim | Topic modeling & word embeddings | Word2Vec, Doc2Vec, FastText |
| Transformers | Pre-trained models | BERT, GPT, T5 |
| scikit-learn | Machine learning | Text classification, clustering |
| TensorFlow/PyTorch | Deep learning frameworks | Neural networks, LSTM |
| Dataset | Purpose | Size |
|---|---|---|
| 20 Newsgroups | Text classification | ~19,000 documents |
| Movie Reviews | Sentiment analysis | 1,000 positive + 1,000 negative |
| Wikipedia Corpus | General knowledge | Millions of articles |
| Common Crawl | Web data | Petabytes of text |
| GLUE | Model evaluation | Multiple benchmark tasks |
Overview: A complete end-to-end NLP pipeline demonstrating all fundamental linguistic analysis techniques using two powerful libraries: spaCy and NLTK.
Objectives
- Understand complete text processing workflow
- Learn multiple NLP techniques in one integrated example
- Perform comprehensive linguistic analysis on sample text
Key Topics Covered
| Technique | Description | Library |
|---|---|---|
| Tokenization | Breaking text into individual words and sentences | spaCy |
| POS Tagging | Assigning grammatical roles to words | spaCy |
| Lemmatization | Converting words to base form using vocabulary | spaCy |
| Stemming | Reducing words to root form mechanically | NLTK |
| Stop Word Removal | Filtering common, less meaningful words | spaCy |
| Noun Phrase Chunking | Identifying meaningful noun phrases | spaCy |
| Dependency Parsing | Understanding grammatical relationships | spaCy |
| Named Entity Recognition | Identifying persons, places, organizations | spaCy |
Practical Example
Input: "On May 13, 2025, the Israeli Air Force executed strikes on Gaza's European Hospital"
Processing:
- Tokenization: ["On", "May", "13", ",", "2025", ...]
- POS Tags: DET, PROPN, NUM, PUNCT, NUM, ...
- NER: "May" → DATE, "Israeli Air Force" → ORG, "Gaza" → LOC, "Hospital" → ORG
- Lemmatization: "executed" → "execute"
Learning Outcomes
- Master spaCy and NLTK libraries
- Perform complete linguistic analysis
- Understand relationship between different NLP tasks
- Handle real-world text data
Overview: Explores n-gram models, a foundational technique in NLP for understanding word sequences, calculating probabilities, and predicting word patterns.
Objectives
- Understand tokenization and punctuation removal
- Generate n-grams of varying sizes
- Calculate frequency and probability distributions
Key Topics Covered
| Concept | Definition | Use Case |
|---|---|---|
| Unigrams (1-grams) | Individual words | Word frequency analysis |
| Bigrams (2-grams) | Two consecutive words | Word associations |
| Trigrams (3-grams) | Three consecutive words | Phrase patterns |
| Frequency Counting | How often each n-gram appears | Statistical analysis |
| Probability Calculation | Relative frequency of n-grams | Language modeling |
Practical Example
Text: "NLP is amazing. It is widely used in AI applications"
Unigrams: [NLP, is, amazing, It, widely, used, in, AI, applications]
Frequency: {is: 2, NLP: 1, amazing: 1, ...}
Bigrams: [(NLP, is), (is, amazing), (is, widely), (in, AI), ...]
Probability of "is": 2/10 = 0.20 (count of "is" divided by 10 total tokens)
Trigrams: [(NLP, is, amazing), (is, amazing, It), ...]
Mathematical Foundation
Unigram Probability: P(w) = Count(w) / Total_words
Bigram Probability: P(w2|w1) = Count(w1, w2) / Count(w1)
Language Model: P(w1, w2, w3) = P(w1) × P(w2|w1) × P(w3|w1,w2)
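A small sketch of these calculations with NLTK and `collections.Counter` (the sentence is the one from the example above; variable names are illustrative):

```python
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

text = "NLP is amazing. It is widely used in AI applications"
tokens = [t for t in word_tokenize(text) if t.isalpha()]   # drop punctuation

unigrams = Counter(tokens)
bigrams = Counter(ngrams(tokens, 2))
trigrams = Counter(ngrams(tokens, 3))

total = len(tokens)
p_is = unigrams["is"] / total                                 # P("is") = 2/10
p_widely_given_is = bigrams[("is", "widely")] / unigrams["is"]  # P(widely | is)

print(unigrams.most_common(3), p_is, p_widely_given_is)
```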
Learning Outcomes
- Extract and analyze n-grams from text
- Calculate statistical probabilities
- Understand language modeling foundations
- Prepare for more advanced NLP techniques
Overview: Demonstrates two fundamental feature extraction techniques that convert text into numerical vectors suitable for machine learning algorithms.
Objectives
- Convert text documents into numerical feature vectors
- Understand importance weighting mechanisms
- Compare simple frequency with intelligent weighting
Key Topics Covered
- Simple word count approach
- Represents how often a word appears in a document
- Formula:
TF(t, d) = frequency of term t in document d
Example TF Matrix:
Document 1: "NLP is amazing, NLP is great"
NLP is amazing great
Doc 1 2 2 1 1
Document 2: "Machine learning is powerful"
NLP is learning powerful
Doc 2 0 1 1 1
- Weights terms based on importance across documents
- Reduces weight of common words
- Highlights distinctive terms
Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t)
IDF(t) = log(Total_Documents / Documents_containing_t)
Comparison Example:
Word "is" (appears in most documents):
- TF: 2 (high count)
- IDF: log(4/3) ≈ 0.29 (low importance)
- TF-IDF: 2 × 0.29 ≈ 0.58 (low weight)
Word "NLP" (appears in few documents):
- TF: 2 (high count)
- IDF: log(4/1) ≈ 1.39 (high importance)
- TF-IDF: 2 × 1.39 ≈ 2.78 (high weight) ✅
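A minimal scikit-learn sketch contrasting raw counts with TF-IDF weights (the four documents are illustrative; note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF, so its numbers differ slightly from the hand calculation above):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "NLP is amazing, NLP is great",
    "Machine learning is powerful",
    "NLP makes machines understand language",
    "Deep learning is a subset of machine learning",
]

# Term-frequency (Bag-of-Words) matrix
count_vec = CountVectorizer()
tf_matrix = count_vec.fit_transform(docs)

# TF-IDF weighted matrix
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())
print(tf_matrix.toarray())                # raw counts
print(tfidf_matrix.toarray().round(2))    # common words like "is" get lower weight
```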
Learning Outcomes
- Convert text to numerical vectors
- Understand importance weighting
- Choose appropriate feature extraction method
- Prepare data for ML algorithms
Overview: Comprehensive exploration of modern word embedding techniques that capture semantic and syntactic relationships between words.
Objectives
- Learn multiple word embedding approaches
- Understand semantic relationships
- Compare different embedding methods
Key Topics Covered
- Two architectures: CBOW (Continuous Bag of Words) and Skip-gram
- Predicts words from context (Skip-gram) or context from word (CBOW)
- Vector size: 50-300 dimensions
- Limitation: Cannot handle out-of-vocabulary words
Example:
Word: "king"
Vector: [0.2, -0.4, 0.1, 0.5, -0.2, ...]
Similar words: ["queen", "prince", "emperor"]
Vector distances measure similarity
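A small Gensim sketch of training Word2Vec on a toy corpus (real use needs far more text; the sentences and hyperparameters here are purely illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "prince", "will", "be", "king"],
    ["the", "princess", "will", "be", "queen"],
]

# sg=1 selects Skip-gram; sg=0 would train CBOW instead
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["king"][:5])                    # first few dimensions of the vector
print(model.wv.most_similar("king", topn=3))   # nearest neighbours in vector space
```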
- Count-based approach using global word-word co-occurrence
- Combines global statistics with local context
- Generally more stable than Word2Vec
- Pre-trained models available (Wikipedia, Common Crawl)
Matrix Factorization:
X[i,j] = count of word j in context of word i
GloVe decomposes this matrix into embeddings
- Extension of Word2Vec
- Treats words as bags of character n-grams
- Advantage: Can generate vectors for out-of-vocabulary words
- Better for morphologically rich languages
Example (OOV handling):
Training vocabulary: ["running", "runner", "run"]
Unknown word: "runs" (not in training)
Word2Vec: Cannot create a vector ❌
FastText: Builds a vector from character n-grams such as ["<ru", "run", "uns", "ns>"] ✅
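A Gensim FastText sketch showing that an out-of-vocabulary word still gets a vector composed from its character n-grams (toy corpus and settings, illustrative only):

```python
from gensim.models import FastText

sentences = [["running", "is", "fun"], ["the", "runner", "likes", "to", "run"]]

model = FastText(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# "runs" never appeared in training, but FastText composes a vector
# for it from its character n-grams (e.g. "<ru", "run", "uns", "ns>")
vector = model.wv["runs"]
print(vector.shape)                              # (50,)
print(model.wv.similarity("runs", "running"))
```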
- Contextual embeddings (word meaning changes with context)
- Bidirectional: understands context from both directions
- Pre-trained on massive corpus
- State-of-the-art for many tasks
Contextual Example:
Sentence 1: "I saw the bank by the river"
Sentence 2: "I deposited money at the bank"
Word: "bank"
- Embedding 1: Vector representing financial institution
- Embedding 2: Vector representing river bank
BERT generates different vectors based on context! ✅
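A sketch with Hugging Face Transformers showing that the two occurrences of "bank" get different contextual vectors (the `bert-base-uncased` checkpoint is an assumption, and the model downloads on first run):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Encode the sentence and pull out the hidden state of the "bank" token
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("I saw the bank by the river")
v2 = bank_vector("I deposited money at the bank")

cos = torch.nn.functional.cosine_similarity(v1, v2, dim=0)
print(cos.item())   # well below 1.0: same word, different contextual embeddings
```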
Comparison Table:
| Method | Type | OOV Handling | Speed | Context |
|---|---|---|---|---|
| Word2Vec | Predictive | ❌ | Fast | Static |
| GloVe | Count-based | ❌ | Medium | Static |
| FastText | Hybrid | ✅ | Medium | Static |
| BERT | Neural | ✅ | Slow | Dynamic |
Learning Outcomes
- Train and use Word2Vec models
- Utilize pre-trained GloVe embeddings
- Handle OOV words with FastText
- Implement contextual embeddings with BERT
- Choose embeddings based on task requirements
Overview: Implements two classic supervised learning algorithms for text categorization using the 20 Newsgroups dataset.
Objectives
- Build text classification models
- Compare probabilistic vs. geometric approaches
- Evaluate model performance with multiple metrics
Key Topics Covered
Raw Text
   ↓
TF-IDF Vectorization (convert to numerical features)
   ↓
Train/Test Split (prepare data)
   ↓
Model Training (Naïve Bayes or SVM)
   ↓
Prediction & Evaluation
- Probabilistic classifier based on Bayes' Theorem
- Assumes feature independence (the "naïve" assumption)
- Fast training and prediction
- Works well with text (TF-IDF vectors)
Formula:
P(Category|Document) = P(Document|Category) × P(Category) / P(Document)
For text: P(category|words) ∝ ∏ P(word|category)
Advantages:
- ✅ Fast training
- ✅ Good with high-dimensional data
- ✅ Effective for text
- ✅ Handles missing values well
Disadvantages:
- ❌ Independence assumption too strong
- ❌ May underestimate probabilities
- Geometric classifier finding optimal hyperplane
- Maximizes margin between classes
- Kernel trick for non-linear problems
- Linear kernel works well for text (TF-IDF)
Concept:
┌───────────────────────────────┐
│         Feature Space         │
│                               │
│   ●   ●    Class 1 (Spam)     │
│     ●    ●                    │
│  ─────────────────  ← Margin  │
│  ═════════════════  ← Optimal hyperplane
│  ─────────────────  ← Margin  │
│     ○    ○                    │
│   ○   ○    Class 0 (Ham)      │
└───────────────────────────────┘
Advantages:
- ✅ Effective in high dimensions
- ✅ Memory efficient
- ✅ Versatile (different kernels)
- ✅ Handles complex boundaries
Disadvantages:
- ❌ Slower training on large datasets
- ❌ Requires careful kernel selection
- ❌ Hard to interpret
- 18,846 documents
- 20 categories
- Real-world news articles
- Imbalanced distribution
Categories (sample):
alt.atheism, soc.religion.christian, comp.graphics, sci.med
Evaluation Metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) (of predicted positive, how many correct)
Recall = TP / (TP + FN) (of actual positive, how many caught)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
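A condensed scikit-learn sketch of this workflow on 20 Newsgroups (the category subset keeps the run fast, and the classifiers use default, untuned hyperparameters):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

cats = ["alt.atheism", "soc.religion.christian", "comp.graphics", "sci.med"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

# TF-IDF vectorization fitted on the training split only
vec = TfidfVectorizer(stop_words="english")
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    clf.fit(X_train, train.target)
    preds = clf.predict(X_test)
    print(name)
    print(classification_report(test.target, preds, target_names=test.target_names))
```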
Learning Outcomes
- Implement classification pipelines
- Train Naïve Bayes and SVM classifiers
- Evaluate models with multiple metrics
- Compare algorithm performance
- Make informed algorithm choices
Overview: Unsupervised learning approach to automatically group similar documents into clusters based on their content.
Objectives
- Understand unsupervised learning
- Apply clustering to text documents
- Analyze cluster characteristics
Key Topics Covered
An iterative algorithm that partitions documents into K clusters:
Algorithm Steps:
Step 1: Choose K (number of clusters)
        ↓
Step 2: Randomly initialize K centroids
        ↓
Step 3: Assign each document to nearest centroid (Euclidean distance)
        ↓
Step 4: Recalculate centroids as mean of assigned points
        ↓
Step 5: Repeat steps 3-4 until convergence
        ↓
Step 6: Analyze clusters
Visualization: Iteration 1 (initial random centroids) → Iteration 2 (assignments converging) → Final (stable clusters)
Documents → TF-IDF Vectorization → K-Means → Cluster Analysis
Example Output:
Documents:
1. "Machine learning provides systems ability to learn"
2. "Artificial intelligence and ML are related"
3. "Cricket is popular sport in India"
4. "Indian cricket team won match"
TF-IDF Vector Space (sparse)
   ↓
K-Means with K=2
   ↓
Cluster 0: [Doc 1, Doc 2] - ML/AI related
Cluster 1: [Doc 3, Doc 4] - Sports related
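A minimal scikit-learn sketch of the same flow (the documents and K=2 are the toy values from the example above; `random_state` is fixed only for reproducibility):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Machine learning provides systems ability to learn",
    "Artificial intelligence and ML are related",
    "Cricket is popular sport in India",
    "Indian cricket team won match",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)   # e.g. [0 0 1 1]: ML/AI docs vs. sports docs

# Top terms per cluster: features with the largest centroid weights
terms = vec.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = center.argsort()[::-1][:3]
    print(f"Cluster {i}:", [terms[j] for j in top])
```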
1. Choosing K: How many clusters?
   - Elbow method
   - Silhouette analysis
   - Domain knowledge
2. Convergence: May find local optima
   - Multiple runs with different initializations
   - Select best result
3. Scalability: Slow on very large datasets
   - Mini-batch K-Means
   - Approximate methods
Top Terms per Cluster:
Cluster 0: ["machine", "learning", "model", "data", "algorithm"]
β ML/AI cluster
Cluster 1: ["cricket", "team", "match", "player", "game"]
β Sports cluster
Learning Outcomes
- Implement K-Means clustering
- Vectorize text for clustering
- Determine optimal number of clusters
- Interpret and analyze clusters
- Understand unsupervised learning concepts
Overview: Assigns grammatical roles (parts of speech) to each word, enabling syntactic and semantic analysis.
Objectives
- Learn POS tagging concepts
- Implement using NLTK
- Understand grammatical relationships
Key Topics Covered
Common POS tags in English:
| Tag | Meaning | Examples |
|---|---|---|
| NN | Noun | cat, dog, house |
| VB | Verb | run, jump, eat |
| JJ | Adjective | beautiful, quick, tall |
| RB | Adverb | quickly, carefully, very |
| DET | Determiner | the, a, an |
| IN | Preposition | in, on, at, by |
| PRP | Pronoun | he, she, it, they |
| CD | Cardinal Number | one, two, 42 |
Sentence: "The quick brown fox jumps over the lazy dog"
Words: [The   quick  brown  fox  jumps  over  the  lazy  dog]
         ↓      ↓      ↓     ↓     ↓     ↓     ↓     ↓    ↓
Tags:  [DET    JJ     JJ    NN    VB    IN   DET    JJ   NN]
- Rule-based: Hand-crafted linguistic rules
- Stochastic: Uses probabilistic models
- Neural: Deep learning approaches
- Hybrid: Combination of methods
Example Output:
Sentence: "Prexit submitted the practical on time"
Word POS Tag Description
─────────────────────────────────────────
Prexit NNP Proper Noun
submitted VBD Verb (past tense)
the DT Determiner
practical NN Noun
on IN Preposition
time NN Noun
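A minimal NLTK sketch of this step (it emits Penn Treebank tags such as DT and VBZ, so the output differs slightly from the simplified tag table above; newer NLTK versions may name the tagger resource `averaged_perceptron_tagger_eng`):

```python
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model, if not present

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)

for word, tag in tags:
    print(f"{word:10s} {tag}")
# e.g. ('The', 'DT'), ('quick', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...
```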
- Information extraction
- Parsing and syntax analysis
- Named entity recognition (filter nouns)
- Spell checking (context-aware)
- Machine translation
- Speech recognition (disambiguation)
Learning Outcomes
- Understand linguistic grammatical concepts
- Implement POS tagging with NLTK
- Interpret POS tag sequences
- Prepare data for downstream NLP tasks
- Recognize word roles in sentences
Overview: Introduces neural networks for NLP, specifically LSTM (Long Short-Term Memory) networks for sentiment classification.
Objectives
- Preprocess text for neural networks
- Build and train LSTM models
- Classify sentiment (positive/negative)
Key Topics Covered
Text → Tokenization → Integer Sequences → Padding → Embedding → Neural Network
- Tokenization: Convert words to integers
Vocabulary: {love: 1, this: 2, hate: 3, bad: 4}
Text: "I love this"
Tokens: [1, 2] (numbers replacing words)
- Padding: Make all sequences same length
Original: [[1, 2], [3, 4, 5], [6]]
Padded: [[0, 1, 2],
[3, 4, 5],
[0, 0, 6]] (length=3)
- Embedding: Dense vector representation
Word: "love" (ID: 1)
Embedding: [0.2, -0.4, 0.1, 0.5] (50-300 dimensions)
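A short Keras sketch of these three steps (vocabulary size, sequence length, and embedding dimension are arbitrary illustration values; the classic `Tokenizer`/`pad_sequences` utilities are assumed, and very recent Keras versions expose `TextVectorization` instead):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

texts = ["I love this", "This is the worst", "I hate waiting"]

# 1. Tokenization: map words to integer ids
tok = Tokenizer(num_words=1000, oov_token="<OOV>")
tok.fit_on_texts(texts)
seqs = tok.texts_to_sequences(texts)

# 2. Padding: make every sequence the same length
padded = pad_sequences(seqs, maxlen=5, padding="pre")

# 3. Embedding layer: each id becomes a dense, trainable vector
embedding = Embedding(input_dim=1000, output_dim=50)
print(padded)
print(embedding(padded).shape)   # (3, 5, 50)
```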
Problem: Regular RNNs suffer from vanishing gradient
RNN: h_t = tanh(W_h * h_{t-1} + W_x * x_t)
Problem: Gradient → 0 over many time steps
Long-range dependencies lost
LSTM Solution: Memory cells + gates
Cell State (C_t): "Long-term memory" (relatively unchanged)
Hidden State (h_t): "Short-term output"
Three Gates:
1. Forget Gate: What to forget from previous cell state
2. Input Gate: What new information to add
3. Output Gate: What to output from cell state
LSTM Cell Equations:
Forget Gate:  f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input Gate:   i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Cell Update:  C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
Cell State:   C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Output Gate:  o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Hidden:       h_t = o_t ⊙ tanh(C_t)
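To make the gate equations concrete, here is a single LSTM time step in plain NumPy (randomly initialized weights, toy dimensions, not a trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W_f, W_i, W_c, W_o = (rng.normal(size=(hidden, hidden + inputs)) for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(hidden)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_c @ z + b_c)         # candidate cell update
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

h, C = np.zeros(hidden), np.zeros(hidden)
h, C = lstm_step(rng.normal(size=inputs), h, C)
print(h, C)
```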
Network Architecture:
Input Layer (Embedding)
        ↓
[Embedding Vectors] (text → 50-dim vectors)
        ↓
LSTM Layer
        ↓
[Hidden States] (sequential processing)
        ↓
Dense Layer
        ↓
Output Layer (Sigmoid)
        ↓
Sentiment: [0] Negative or [1] Positive
Text: "I love this product"
Label: positive (1)
Text: "This is the worst"
Label: negative (0)
1. Forward pass: Input → LSTM → Dense → Sigmoid → Prediction
2. Calculate loss: Binary Crossentropy
3. Backpropagation: Compute gradients
4. Update weights: Using Adam optimizer
5. Repeat for multiple epochs
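A compact Keras sketch of this architecture and training loop (toy padded sequences and labels; the layer sizes and epoch count are illustrative, not the notebook's exact settings):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Toy data: padded integer sequences and binary sentiment labels
X = np.array([[0, 1, 2], [0, 3, 4], [1, 2, 5], [0, 3, 6]])
y = np.array([1, 0, 1, 0])

model = Sequential([
    Embedding(input_dim=1000, output_dim=50),   # word ids -> dense vectors
    LSTM(64),                                   # sequential processing
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),             # positive / negative
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=2, verbose=0)
print(model.predict(X).round(2))
```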
Learning Outcomes
- Preprocess text for neural networks
- Understand LSTM architecture
- Build sentiment classification models
- Train deep learning models
- Handle sequential text data
Overview: Enhanced version of LSTM sentiment classification with advanced techniques, including dropout regularization and an improved architecture.
Objectives
- Implement advanced regularization techniques
- Improve model performance
- Handle overfitting in neural networks
Key Topics Covered
Good generalization: training loss and validation loss both keep decreasing together.
Overfitting: training loss keeps decreasing while validation loss starts rising.
Random deactivation of neurons during training to prevent co-adaptation.
Without dropout: every neuron participates in every training step.
With dropout (50%): on each training step, half of the neurons are randomly turned off.
Benefits:
- ✅ Prevents co-adaptation of neurons
- ✅ Forces learning of robust features
- ✅ Acts as ensemble of models
- ✅ Reduces overfitting
Implementation:
Dropout Rate: 0.5 (50% neurons dropped)
After Training: All neurons active, weights × (1 - dropout_rate)
Input Layer (Embedding)
        ↓
LSTM Layer 1 (64 units)
        ↓
Dropout (0.5)  ← Prevents overfitting
        ↓
LSTM Layer 2 (32 units)
        ↓
Dropout (0.5)  ← Additional regularization
        ↓
Dense Layer (16 units, ReLU)
        ↓
Output Layer (1 unit, Sigmoid)
        ↓
Sentiment Prediction
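A Keras sketch of the stacked architecture above with dropout between the recurrent layers (layer sizes follow the diagram, but they are illustrative defaults rather than tuned values):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential([
    Embedding(input_dim=10000, output_dim=100),
    LSTM(64, return_sequences=True),   # first LSTM passes its full sequence onward
    Dropout(0.5),                      # regularization between recurrent layers
    LSTM(32),
    Dropout(0.5),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```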
| Parameter | Purpose | Common Values |
|---|---|---|
| Embedding Dim | Vector size for words | 50, 100, 300 |
| LSTM Units | Hidden state size | 32, 64, 128, 256 |
| Dropout Rate | Fraction to drop | 0.2, 0.5, 0.7 |
| Learning Rate | Optimization step size | 0.001, 0.01, 0.1 |
| Batch Size | Samples per update | 16, 32, 64, 128 |
| Epochs | Training iterations | 10-100 |
Epoch 1/50
Loss: 0.693, Accuracy: 0.50, Val_Loss: 0.691, Val_Accuracy: 0.50
Epoch 2/50
Loss: 0.620, Accuracy: 0.67, Val_Loss: 0.620, Val_Accuracy: 0.65
...
Epoch 50/50
Loss: 0.180, Accuracy: 0.95, Val_Loss: 0.320, Val_Accuracy: 0.88
Learning Outcomes
- Implement regularization techniques
- Build deeper neural networks
- Tune hyperparameters effectively
- Monitor training with metrics
- Improve model generalization
- Understand overfitting and solutions
Overview: A complete real-world NLP application demonstrating spam detection using Bag-of-Words and Multinomial Naïve Bayes.
Objectives
- Develop a practical NLP application
- Preprocess diverse text data
- Classify messages as spam or legitimate (ham)
Key Topics Covered
Binary Classification Task:
- Spam: Unsolicited, marketing, phishing messages
- Ham: Legitimate messages
Real-World Examples:
Spam Messages:
"Congratulations! You won a free lottery"
"Call now to claim your prize"
"Earn money fast by clicking this link"
"URGENT: Verify your account immediately"
Ham Messages:
"This is a meeting reminder"
"Let's have lunch tomorrow"
"Your appointment is scheduled"
"Thanks for your help!"
┌──────────────────────────┐
│     Raw Text Message     │
│   "Congratulations! You  │
│    won a free lottery"   │
└────────────┬─────────────┘
             ↓
┌──────────────────────────┐
│    Text Preprocessing    │
│  • Lowercase             │
│  • Remove special chars  │
│  • Strip whitespace      │
└────────────┬─────────────┘
             ↓
   "congratulations you won
    a free lottery"
             ↓
┌──────────────────────────┐
│    Bag-of-Words (BoW)    │
│     CountVectorizer      │
└────────────┬─────────────┘
             ↓
   {won: 1, free: 1,
    lottery: 1, ...}
             ↓
┌──────────────────────────┐
│  Naïve Bayes Classifier  │
└────────────┬─────────────┘
             ↓
   P(Spam|Words) = ?
   P(Ham|Words)  = ?
             ↓
┌──────────────────────────┐
│     Prediction: SPAM     │
└──────────────────────────┘
Step 1: Original
Input: "Congratulations! You won a free lottery"
Step 2: Lowercase
"congratulations! you won a free lottery"
Step 3: Remove non-letters (punctuation, numbers)
"congratulations you won a free lottery"
Step 4: Strip whitespace
["congratulations", "you", "won", "a", "free", "lottery"]
Vocabulary (from training):
{congratulations: 0, you: 1, won: 2, a: 3, free: 4, lottery: 5, ...}
Message 1: "Congratulations you won a free lottery"
BoW Vector: [1, 1, 1, 1, 1, 1, 0, 0, 0, ...]
Message 2: "Let's have lunch tomorrow"
BoW Vector: [0, 0, 0, 0, 0, 0, 1, 1, 1, ...]
Probability calculation:
P(Spam|Message) = P(Message|Spam) × P(Spam) / P(Message)
For Bag-of-Words:
P(Message|Spam) = ∏ P(word_i|Spam)
Decision:
If P(Spam|Message) > P(Ham|Message) → Classify as SPAM
Else → Classify as HAM
Example:
Message: "Win cash now!"
P(Spam|"win", "cash", "now") =
P("win"|Spam) Γ P("cash"|Spam) Γ P("now"|Spam) Γ P(Spam) / P(Message)
P(win|Spam) = 0.05 (5% of spam contain "win")
P(cash|Spam) = 0.08 (8% of spam contain "cash")
P(now|Spam) = 0.03 (3% of spam contain "now")
P(Spam) = 0.4 (40% of messages are spam)
Result: P(Spam|Message) = 0.8 > 0.2 = P(Ham|Message) β SPAM β
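A minimal end-to-end sketch of this pipeline with scikit-learn (the tiny hand-written message list stands in for a real labeled corpus, and the pipeline name is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "Congratulations! You won a free lottery", "Call now to claim your prize",
    "URGENT: Verify your account immediately", "Earn money fast by clicking this link",
    "This is a meeting reminder", "Let's have lunch tomorrow",
    "Your appointment is scheduled", "Thanks for your help!",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = spam, 0 = ham

# Bag-of-Words + Multinomial Naive Bayes in one pipeline
spam_filter = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
spam_filter.fit(messages, labels)

tests = ["Win cash now!", "Are we meeting today?"]
print(spam_filter.predict(tests))                 # e.g. [1 0]
print(spam_filter.predict_proba(tests).round(2))  # class probabilities
```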
Confusion Matrix:
Predicted Spam Predicted Ham
Actual Spam TP FN
Actual Ham FP TN
Metrics:
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP) (of predicted spam, how many correct)
Recall = TP / (TP + FN) (of actual spam, how many caught)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Example Results:
TP = 95 (correctly identified spam)
FP = 5 (incorrectly marked ham as spam)
FN = 10 (missed spam messages)
TN = 90 (correctly identified ham)
Accuracy = (95 + 90) / 200 = 92.5%
Precision = 95 / (95 + 5) = 95%
Recall = 95 / (95 + 10) = 90.5%
F1-Score = 2 × (0.95 × 0.905) / (0.95 + 0.905) = 0.926
Test 1: "Win cash now!"
Prediction: SPAM (Probability: 92%)
Test 2: "Are we meeting today?"
Prediction: HAM (Probability: 88%)
Test 3: "Claim your free prize"
Prediction: SPAM (Probability: 95%)
Test 4: "See you at the meeting"
Prediction: HAM (Probability: 91%)
- ✅ Simple and interpretable
- ✅ Fast training and prediction
- ✅ Effective for spam detection
- ✅ Works with limited data
- ✅ Easy to update with new messages
- ✅ Good baseline for classification
Challenges:
1. Spam variations: Attackers constantly change messages
2. False positives: Legitimate messages marked as spam
3. False negatives: Spam gets through
4. Language evolution: New words, slang, emojis
5. Multiple languages: Different preprocessing needed
Solutions:
1. Regular model retraining
2. Balanced evaluation metrics
3. Combine with other features (sender, links, etc.)
4. Use ensemble methods
5. Handle multiple languages
Learning Outcomes
- Develop end-to-end NLP application
- Preprocess diverse text data
- Implement practical feature extraction
- Apply Naïve Bayes for binary classification
- Evaluate model performance
- Handle real-world spam detection problem
- Understand practical NLP deployment
After studying these practicals, you will understand:
✅ How to preprocess text data
✅ How to extract meaningful features from text
✅ How to train machine learning models for NLP tasks
✅ How word embeddings capture semantic meaning
✅ How to classify text using various algorithms
✅ How to cluster similar documents
✅ How to build deep learning models (LSTM) for NLP
✅ How to implement real-world NLP applications
2000s: Statistical methods (n-grams, HMMs)
        ↓
2010s: Machine learning (SVM, Naïve Bayes)
        ↓
2013: Word embeddings (Word2Vec)
        ↓
2015: Deep learning (RNN, LSTM)
        ↓
2017: Transformer architecture ("Attention Is All You Need")
        ↓
2018: BERT and contextual embeddings
        ↓
2020+: Large Language Models (GPT-3, T5, ELECTRA)
        ↓
2023+: Multimodal models, RAG, fine-tuning
- Multimodal Learning: Combining text with images, audio, and video
- Few-Shot Learning: Learning from minimal examples
- Retrieval-Augmented Generation (RAG): Combining retrieval with generation
- Domain Adaptation: Transferring knowledge between domains
- Ethical NLP: Fair, transparent, and responsible AI
- Low-Resource Languages: Improving NLP for under-resourced languages
- Efficient Models: Smaller, faster models for edge devices
- Stanford CS224N: NLP with Deep Learning
- Andrew Ng's Deep Learning Specialization
- Hugging Face NLP Course
- "Speech and Language Processing" by Jurafsky & Martin
- "Natural Language Processing with Python" (NLTK Book)
- "Deep Learning for NLP" by Yoav Goldberg
- "Attention is All You Need" (Transformer)
- "BERT: Pre-training of Deep Bidirectional Transformers"
- "Sequence to Sequence Learning with Neural Networks"
For questions or clarifications regarding this summary or the practical implementations:
Email: prexitjoshi@gmail.com
Institution: University Institute of Engineering and Technology, Punjab University (UIET, PU)
Author: PREXIT JOSHI (Roll No. UE233118)
Department: Computer Science and Engineering (CSE)
Natural-Language-Processing/
├── 01_Comprehensive_NLP_Pipeline_Linguistic_Analysis.ipynb
├── 02_N_Gram_Analysis_Tokenization_Probability.ipynb
├── 03_Feature_Extraction_TF_TF-IDF.ipynb
├── 04_Word_Embeddings_Word2Vec_GloVe_FastText_BERT.ipynb
├── 05_Text_Classification_Naive_Bayes_SVM.ipynb
├── 06_K-Means_Text_Clustering.ipynb
├── 07_POS_Tagging_Part_of_Speech.ipynb
├── 08_Text_Processing_LSTM_Sentiment_Classification.ipynb
├── 09_Advanced_LSTM_Sentiment_Classification.ipynb
├── 10_Spam_Detection_Naive_Bayes_Application.ipynb
├── README.md              # This file
├── GETTING_STARTED.md     # Setup and quick start guide
├── CONTRIBUTING.md        # Contribution guidelines
├── CHANGELOG.md           # Version history
├── LICENSE                # MIT License
├── requirements.txt       # Python dependencies
└── .gitignore             # Git ignore rules
- NLTK - Natural Language Toolkit
- spaCy - Industrial-strength NLP
- Gensim - Word embeddings (Word2Vec, FastText)
- Transformers - Pre-trained models (BERT, GPT)
- scikit-learn - Classic ML algorithms
- TensorFlow/Keras - Deep learning framework
- PyTorch - Alternative DL framework
- Pandas - Data manipulation
- NumPy - Numerical computing
- Jupyter - Interactive notebooks
| Metric | Value |
|---|---|
| Total Practicals | 10 |
| Total Code Cells | 100+ |
| Documentation Lines | 1400+ |
| Code Examples | 50+ |
| Diagrams/Visualizations | 30+ |
| Topics Covered | 50+ |
| Estimated Learning Time | 30-40 hours |
We welcome contributions! See CONTRIBUTING.md for:
- How to report bugs
- How to suggest features
- Pull request process
- Coding standards
- Commit message guidelines
# 1. Fork the repository
# 2. Create feature branch
git checkout -b feature/amazing-addition
# 3. Make changes and commit
git commit -m "feat: add amazing NLP feature"
# 4. Push and create PR
git push origin feature/amazing-addition
- Issues: GitHub Issues
- Email: prexitjoshi@gmail.com
- Discussions: GitHub Discussions
This project is licensed under the MIT License - see LICENSE file for details.
MIT License - Free for personal, educational, and commercial use
with attribution required.
If you use this project in your research or work, please cite:
@misc{joshi2025nlp,
title={Natural Language Processing: Comprehensive Practicals},
author={Joshi, Prexit},
year={2025},
url={https://github.com/intronep666/Natural-Language-Processing}
}
Current Version: 1.0.0 (November 29, 2025)
See CHANGELOG.md for detailed version history and planned features.
- NLTK & spaCy Teams for exceptional NLP libraries
- Hugging Face for transformer models and community
- TensorFlow & PyTorch communities
- scikit-learn for ML tools
- All Contributors and supporters
- GitHub Repository: https://github.com/intronep666/Natural-Language-Processing
- Author GitHub: https://github.com/intronep666
- Institution: UIET, PU
Natural Language Processing is a rapidly evolving field that combines linguistics, computer science, and machine learning. From simple text preprocessing to advanced transformer-based models, NLP enables machines to understand and generate human language in increasingly sophisticated ways.
The practical implementations in this repository demonstrate fundamental and advanced NLP concepts, providing hands-on experience with real-world applications and techniques. Whether you're interested in sentiment analysis, text classification, machine translation, or information extraction, NLP offers powerful tools and methodologies to solve complex language-based problems.
This repository is designed to:
- ✅ Provide comprehensive, hands-on learning
- ✅ Cover beginner to intermediate concepts
- ✅ Include well-documented, runnable code
- ✅ Foster community contributions
- ✅ Serve as a portfolio project
Happy Learning!