Intelligent SMS filtering using NLP and Deep Learning
- Overview
- Features
- Demo
- How It Works
- Tech Stack
- Installation
- Usage
- Model Architecture
- Performance
- Dataset
- API Documentation
- Deployment
- Contributing
- License
A machine learning-powered application that identifies spam messages using Natural Language Processing (NLP). The system:
- Analyzes SMS/text messages in real-time
- Classifies text as Spam or Ham (legitimate)
- Provides confidence scores for predictions
- Processes messages instantly with <100ms latency
- Features an intuitive web interface
With over 45% of SMS messages being spam globally, this tool helps:
- Protect users from phishing attempts
- Prevent financial scams
- Filter malicious links and content
- Save time by auto-filtering unwanted messages

- NLP-Based Classification - Advanced text processing
- TF-IDF Vectorization - Smart feature extraction
- Deep Learning Model - TensorFlow/Keras neural network
- Real-Time Prediction - Instant message analysis
- Confidence Scoring - Probability-based results
- Batch Processing - Analyze multiple messages
- Interactive Dashboard - Streamlit web interface
- Multi-Language Support - Detect spam in various languages
- Pattern Recognition - Identify common spam patterns
- URL Detection - Flag suspicious links
- Phone Number Extraction - Identify spam sender patterns
- Export Results - Download classification reports
- API Integration - RESTful API for developers
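
The URL detection and phone number features listed above could be approximated with a couple of regular expressions. The sketch below is illustrative only: the patterns and the `surface_features` helper are assumptions, not the project's actual rules. The key names are borrowed from the `features` object in the API response shown later in this README.

```python
import re

# Hypothetical patterns for illustration; the project's real rules are not shown here.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def surface_features(message: str) -> dict:
    """Extract simple surface signals that often accompany spam."""
    letters = [ch for ch in message if ch.isalpha()]
    capitals = sum(1 for ch in letters if ch.isupper())
    return {
        "has_url": bool(URL_RE.search(message)),
        "has_phone": bool(PHONE_RE.search(message)),
        "exclamation_marks": message.count("!"),
        # Share of alphabetic characters that are upper case (assumed definition).
        "capital_ratio": capitals / len(letters) if letters else 0.0,
    }


print(surface_features("WIN now!!! Visit www.example.com or call +1 800 555 0199"))
```

Signals like these would typically complement the TF-IDF features rather than replace them, since a legitimate message can also contain a link or a phone number.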
# Launch the Streamlit app
streamlit run app.py

| Message | Classification | Confidence |
|---|---|---|
| "Congratulations! You've won $1000. Click here to claim!" | SPAM | 98.7% |
| "Hey, are we still meeting for lunch at 1pm?" | HAM | 95.3% |
| "URGENT: Your account will be suspended. Verify now!" | SPAM | 99.2% |
| "Thanks for the help yesterday. Really appreciate it!" | HAM | 96.8% |

┌──────────────────┐
│ Input Text       │
│ "FREE PRIZE"     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Text Cleaning    │
│ • Lowercase      │
│ • Remove punct.  │
│ • Tokenization   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Feature          │
│ Extraction       │
│ • TF-IDF         │
│ • N-grams        │
│ • Word vectors   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Neural Network   │
│ Classification   │
│ • Dense layers   │
│ • Dropout        │
│ • Softmax output │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Prediction       │
│ SPAM: 98.7%      │
│ HAM: 1.3%        │
└──────────────────┘
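
The last step of the pipeline turns the network's two raw output scores into the SPAM/HAM percentages. A minimal sketch of that softmax step follows. The `[ham, spam]` output order and the `to_prediction` helper are assumptions made for illustration; the result keys mirror those used in the examples in this README.

```python
import math

LABELS = ("HAM", "SPAM")  # assumed output order, not confirmed by the project


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Map two raw scores to a class label and probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "class": LABELS[best],
        "confidence": probs[best],
        "ham_probability": probs[0],
        "spam_probability": probs[1],
    }


result = to_prediction([-2.0, 2.3])
print(f"{result['class']}: {result['confidence']:.1%}")
```

Because the two probabilities always sum to 1, the confidence of a SPAM prediction and its spam probability are the same number.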
from spam_detector import SpamDetector
# Initialize detector
detector = SpamDetector(model_path='models/spam_classifier.h5')
# Analyze single message
message = "Congratulations! You've won a FREE iPhone. Click here now!"
result = detector.predict(message)
print(f"Classification: {result['class']}")
print(f"Confidence: {result['confidence']:.2%}")
print(f"Spam Score: {result['spam_probability']:.2%}")

Output:
Classification: SPAM
Confidence: 98.70%
Spam Score: 98.70%
# 1. Clone repository
git clone https://github.com/ares-coding/spam-message-detection.git
cd spam-message-detection
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. Download NLTK data
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
# 5. Run the app
streamlit run app.py

# Build image
docker build -t spam-detector .
# Run container
docker run -p 8501:8501 spam-detector

streamlit run app.py

Visit http://localhost:8501 and start classifying messages!
from spam_detector import SpamDetector
import pandas as pd
# Initialize detector
detector = SpamDetector()
# Single message prediction
message = "Win a FREE trip to Bahamas! Call now!"
result = detector.predict(message)
print(f"Is Spam: {result['is_spam']}")
print(f"Confidence: {result['confidence']:.2%}")
# Batch prediction
messages = [
    "Hey, want to grab coffee?",
    "URGENT: Your account needs verification",
    "Meeting rescheduled to 3pm tomorrow",
]
results = detector.predict_batch(messages)
df = pd.DataFrame(results)
print(df)

# Start Flask API server
python api.py
# Make prediction request
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"message": "FREE PRIZE! Click now!"}'

Response:
{
  "message": "FREE PRIZE! Click now!",
  "is_spam": true,
  "confidence": 0.987,
  "spam_probability": 0.987,
  "ham_probability": 0.013,
  "detected_patterns": ["FREE", "PRIZE", "Click now"],
  "risk_level": "HIGH"
}

# Classify single message
python classify.py --text "Your message here"
# Classify from file
python classify.py --file messages.txt --output results.csv
# Batch processing
python classify.py --batch input_folder/ --output output_folder/

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    # Input layer: one unit per TF-IDF feature (5000)
    Dense(128, activation='relu', input_shape=(5000,)),
    Dropout(0.5),
    # Hidden layers
    Dense(64, activation='relu'),
    Dropout(0.4),
    Dense(32, activation='relu'),
    Dropout(0.3),
    # Output layer: two softmax units, one per class (ham, spam)
    Dense(2, activation='softmax')
])
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',  # expects one-hot encoded labels
    metrics=['accuracy']
)

| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch Size | 32 |
| Optimizer | Adam |
| Learning Rate | 0.001 |
| Validation Split | 20% |
| Early Stopping | Enabled (patience=3) |
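
Early stopping with a patience of 3 means training halts once the monitored metric has failed to improve for three consecutive epochs. The sketch below shows the usual patience rule in plain Python. It assumes validation loss is the monitored metric, which the table does not state, and `early_stopping_epoch` is a hypothetical helper rather than the project's training code.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the patience counter.
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# Illustrative (made-up) validation losses for a 10-epoch budget.
history = [0.40, 0.31, 0.29, 0.30, 0.32, 0.31, 0.28, 0.27, 0.27, 0.26]
print(early_stopping_epoch(history))
```

With these losses the best value is reached at epoch 3, and the three non-improving epochs that follow end the run at epoch 6, before the full 10 epochs are used.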
TF-IDF Vectorization:
- Max Features: 5000
- N-gram Range: (1, 2) - unigrams and bigrams
- Min Document Frequency: 2
- Max Document Frequency: 95%
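
To make these settings concrete, here is a small pure-Python sketch of how the n-gram range and the document-frequency limits shape the vocabulary. It is a stand-in for illustration and not the project's vectorizer, which presumably relies on a library implementation. The helper names `ngrams` and `build_vocabulary` are assumptions.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of the token list for n in [n_min, n_max]."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def build_vocabulary(docs, min_df=2, max_df=0.95, max_features=5000):
    """Keep terms that are neither too rare nor too common, capped by frequency."""
    doc_freq = Counter()    # number of documents containing each term
    total_freq = Counter()  # total occurrences of each term
    for tokens in docs:
        terms = ngrams(tokens)
        total_freq.update(terms)
        doc_freq.update(set(terms))
    limit = max_df * len(docs)
    kept = [t for t, df in doc_freq.items() if min_df <= df <= limit]
    # Prefer the most frequent terms when the cap applies; ties break alphabetically.
    kept.sort(key=lambda t: (-total_freq[t], t))
    return sorted(kept[:max_features])


docs = [
    ["free", "prize", "now"],
    ["free", "prize", "today"],
    ["call", "now"],
    ["lunch", "today"],
]
print(build_vocabulary(docs))
```

In this toy corpus the bigram "free prize" survives because it appears in two documents, while "call now" and "lunch today" are dropped by the minimum document frequency of 2.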
Text Preprocessing:
- Convert to lowercase
- Remove punctuation and special characters
- Remove stop words (English)
- Tokenization
- Lemmatization
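
The preprocessing steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the stop-word set is a tiny hand-written sample (the project downloads NLTK's English list), only ASCII punctuation is handled, and lemmatization is left out because it needs a lexical resource. `preprocess` is a hypothetical name.

```python
import string

# Tiny illustrative stop-word sample; not the NLTK list the project uses.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "for", "at",
              "you", "your", "we", "and", "of", "in"}

# Replace each ASCII punctuation character with a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(message):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = message.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("URGENT: Your account will be suspended. Verify now!"))
```

The resulting token list is what a vectorizer would then turn into TF-IDF features.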
| Metric | Score |
|---|---|
| Accuracy | 98.0% |
| Precision (Spam) | 97.7% |
| Recall (Spam) | 96.6% |
| F1-Score | 97.1% |
| AUC-ROC | 0.99 |

                Predicted
                HAM    SPAM
Actual  HAM     965      12
        SPAM     18     505

              precision    recall  f1-score   support

         HAM       0.98      0.99      0.98       977
        SPAM       0.98      0.97      0.97       523

    accuracy                           0.98      1500
   macro avg       0.98      0.98      0.98      1500
weighted avg       0.98      0.98      0.98      1500

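
The spam-class metrics follow directly from the four cells of the confusion matrix. The short sketch below recomputes them, treating SPAM as the positive class. `binary_metrics` is a helper name chosen here for illustration.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive (spam) class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Cells taken from the confusion matrix above.
metrics = binary_metrics(tn=965, fp=12, fn=18, tp=505)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.980
# precision: 0.977
# recall: 0.966
# f1: 0.971
```

For a spam filter, the 12 false positives (legitimate messages flagged as spam) are usually the costlier error, which is why precision on the spam class is worth tracking separately from overall accuracy.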
| Operation | Time |
|---|---|
| Single Message | 45ms |
| Batch (100 msgs) | 1.2s |
| Model Load Time | 850ms |
| Preprocessing | 15ms |
| Prediction | 30ms |
- Source: SMS Spam Collection Dataset (UCI ML Repository)
- Total Messages: 5,574
- Ham Messages: 4,827 (86.6%)
- Spam Messages: 747 (13.4%)
- Languages: Primarily English
| Label | Message |
|---|---|
| ham | "Ok lar... Joking wif u oni..." |
| spam | "Free entry in 2 a wkly comp to win FA Cup final tkts..." |
| ham | "U dun say so early hor... U c already then say..." |
| spam | "XXXMobileMovieClub: To use your credit, click the WAP link..." |
Class Balance:
├── HAM: 86.6% (4,827 messages)
└── SPAM: 13.4% (747 messages)
Message Length Distribution:
├── Min: 2 characters
├── Max: 910 characters
├── Avg: 80 characters
└── Median: 62 characters
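
Statistics like these can be derived from any list of labeled messages. The sketch below assumes the data is already loaded as `(label, text)` pairs, since the column layout of the raw file is not described here. `summarize` is a hypothetical helper.

```python
from collections import Counter
from statistics import mean, median


def summarize(rows):
    """Compute class shares and message-length statistics for (label, text) pairs."""
    rows = list(rows)
    counts = Counter(label for label, _ in rows)
    lengths = [len(text) for _, text in rows]
    return {
        "class_share": {label: n / len(rows) for label, n in counts.items()},
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": mean(lengths),
        "median_len": median(lengths),
    }


# Four short made-up rows, only to show the shape of the output.
sample = [("ham", "Ok lar"), ("spam", "Free entry"), ("ham", "U dun say so"), ("ham", "hi")]
print(summarize(sample))

# The class shares reported above follow from the message counts.
print(round(4827 / 5574, 3), round(747 / 5574, 3))  # 0.866 0.134
```

With roughly 87% of messages being ham, a model that always predicts ham would already reach about 87% accuracy, so per-class precision and recall are more informative than accuracy alone.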
Classify a single SMS message.
Request:
{
  "message": "Congratulations! You've won $5000. Click here to claim your prize!"
}

Response:
{
  "message": "Congratulations! You've won $5000...",
  "is_spam": true,
  "confidence": 0.992,
  "spam_probability": 0.992,
  "ham_probability": 0.008,
  "risk_level": "HIGH",
  "detected_patterns": [
    "Congratulations",
    "won",
    "prize",
    "Click here"
  ],
  "features": {
    "has_url": false,
    "has_phone": false,
    "exclamation_marks": 2,
    "capital_ratio": 0.15
  },
  "timestamp": "2025-02-13T10:30:45Z"
}

Classify multiple messages.
Request:
{
  "messages": [
    "Hey, want to meet for coffee?",
    "WIN FREE PRIZES NOW!!!",
    "Your package has been delivered"
  ]
}

Response:
{
  "results": [
    {"index": 0, "is_spam": false, "confidence": 0.954},
    {"index": 1, "is_spam": true, "confidence": 0.998},
    {"index": 2, "is_spam": false, "confidence": 0.923}
  ],
  "summary": {
    "total": 3,
    "spam_count": 1,
    "ham_count": 2,
    "avg_confidence": 0.958
  }
}

spam-message-detection/
├── data/
│   ├── raw/
│   │   └── spam.csv                  # Original dataset
│   ├── processed/
│   │   ├── X_train.npy               # Training features
│   │   ├── X_test.npy                # Test features
│   │   ├── y_train.npy               # Training labels
│   │   └── y_test.npy                # Test labels
│   └── models/
│       ├── spam_classifier.h5        # Trained model
│       └── tfidf_vectorizer.pkl      # TF-IDF vectorizer
├── src/
│   ├── preprocessing.py              # Text preprocessing
│   ├── feature_extraction.py         # TF-IDF vectorization
│   ├── model.py                      # Neural network
│   ├── train.py                      # Training script
│   └── predict.py                    # Prediction functions
├── notebooks/
│   ├── 01_data_exploration.ipynb     # EDA
│   ├── 02_preprocessing.ipynb        # Text cleaning
│   ├── 03_model_training.ipynb       # Model development
│   └── 04_evaluation.ipynb           # Performance analysis
├── api/
│   ├── app.py                        # Flask API
│   ├── schemas.py                    # Pydantic models
│   └── utils.py                      # Helper functions
├── web/
│   └── streamlit_app.py              # Streamlit interface
├── tests/
│   ├── test_preprocessing.py
│   ├── test_model.py
│   └── test_api.py
├── app.py                            # Main Streamlit app
├── classify.py                       # CLI tool
├── requirements.txt                  # Dependencies
├── Dockerfile                        # Docker configuration
└── README.md                         # This file

# Push to GitHub
git push origin main
# Deploy on Streamlit Cloud
# Visit: https://share.streamlit.io

# Create Heroku app
heroku create spam-detector-app
# Deploy
git push heroku main
# Open app
heroku open

# Build
docker build -t spam-detector:latest .
# Run
docker run -d -p 8501:8501 spam-detector:latest
# Access
open http://localhost:8501

# Run all tests
pytest tests/ -v
# Test with coverage
pytest --cov=src tests/
# Test specific module
pytest tests/test_model.py -v
# Generate HTML coverage report
pytest --cov=src --cov-report=html tests/

Contributions welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit changes (git commit -m 'Add AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request

# Install development dependencies
pip install -r requirements-dev.txt
# Install pre-commit hooks
pre-commit install
# Run linting
flake8 src/
black src/

This project is licensed under the MIT License - see LICENSE for details.
Au Amores - Full Stack Developer & ML Engineer
- UCI Machine Learning Repository for the dataset
- TensorFlow and Keras teams
- NLTK contributors
- Streamlit community
@software{spam_message_detection,
  author = {Amores, Au},
  title = {Spam Message Detection using NLP and Deep Learning},
  year = {2025},
  url = {https://github.com/ares-coding/spam-message-detection}
}

- Multi-language spam detection
- WhatsApp/Telegram integration
- Browser extension
- Mobile app (React Native)
- Real-time learning from user feedback
- Explainable AI (LIME/SHAP)
- Email spam detection
- Image-based spam detection
⭐ Star this repository if you found it useful!
Stop spam, stay safe!
Made with 🧠 and ☕ by Ares
