A comprehensive machine learning project for spam email detection using multiple classification algorithms and feature representations. This project compares traditional Bag-of-Words (BoW) with modern Sentence Embeddings (SBERT) approaches, achieving up to 96.85% F1-score and 99.93% AUC.
This repository contains a complete implementation and analysis of spam email classification, developed as part of a thesis project. The project explores:
- Multiple ML Algorithms: Naive Bayes, Logistic Regression, k-NN, SVM
- Feature Representations: Bag-of-Words with CountVectorizer (5000 features) vs. Sentence Transformers (384-dim embeddings)
- Dimensionality Reduction: PCA with various variance retention ratios
- Comprehensive Evaluation: Precision, Recall, F1-Score, AUC-ROC metrics
| Model | Representation | Precision | Recall | F1-Score | AUC | Rank |
|---|---|---|---|---|---|---|
| SVM (Polynomial) | Embeddings (384) | 0.9762 | 0.9609 | 0.9685 | 0.9993 | 🥇 |
| Naive Bayes | BoW (5000) | 0.9609 | 0.9609 | 0.9609 | 0.9962 | 🥈 |
| Logistic Regression | Embeddings (384) | 0.9603 | 0.9453 | 0.9528 | 0.9949 | 🥉 |
| SVM + PCA (90%) | Embeddings+PCA | 0.9915 | 0.9062 | 0.9469 | 0.9984 | 4 |
| k-NN (k=1) | BoW (5000) | 1.0000 | 0.8984 | 0.9465 | 0.9492 | 5 |
| LogReg + PCA (10) | Embeddings+PCA | 0.9040 | 0.8828 | 0.8933 | 0.9889 | 6 |
- Best Overall: SVM Polynomial with Embeddings (F1: 96.85%, AUC: 99.93%)
- Highest Precision: k-NN with k=1 (100.00%, zero false positives)
- Highest Recall: Naive Bayes + SVM tie (96.09%, catches most spam)
- Best AUC: SVM with Polynomial Kernel (99.93%, superior discrimination)
- Best Balance: SVM Polynomial (Precision: 97.62%, Recall: 96.09%)
**Text Preprocessing Pipeline**
- URL and email address removal
- Special character filtering
- Lowercase conversion
- Stopword removal (NLTK)
- Porter stemming
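
In code, this pipeline corresponds roughly to a function like the sketch below (illustrative; the exact regexes and step ordering in the notebook may differ):

```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires: nltk.download('stopwords')
_stopwords = set(stopwords.words('english'))
_stemmer = PorterStemmer()

def preprocess_text(text: str) -> str:
    text = re.sub(r'http\S+|www\.\S+', ' ', text)        # remove URLs
    text = re.sub(r'\S+@\S+', ' ', text)                 # remove email addresses
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)             # filter special characters
    tokens = text.lower().split()                        # lowercase + tokenize
    tokens = [t for t in tokens if t not in _stopwords]  # remove stopwords
    return ' '.join(_stemmer.stem(t) for t in tokens)    # Porter stemming
```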
**Feature Extraction**
- Bag-of-Words with CountVectorizer (5000 vocabulary size)
- Sentence embeddings using `all-MiniLM-L6-v2`
- PCA dimensionality reduction
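
A minimal sketch of the PCA step, assuming `X_train_emb`/`X_test_emb` are the SBERT embedding matrices produced as in the usage example further down:

```python
from sklearn.decomposition import PCA

# Keep enough components to retain 90% of the variance; with a float
# n_components, scikit-learn picks the component count automatically
# (roughly 115 of the 384 embedding dimensions in this project).
pca = PCA(n_components=0.90, random_state=42)
X_train_pca = pca.fit_transform(X_train_emb)
X_test_pca = pca.transform(X_test_emb)
print(pca.n_components_)  # number of retained components
```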
**Classification Models**
- Multinomial Naive Bayes
- Logistic Regression (C=1.0)
- k-Nearest Neighbors (k=1, 3, 5, 11, 21)
- Support Vector Machines (Linear, Polynomial, RBF kernels)
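
These configurations can be instantiated as a simple model grid along the following lines (a sketch; the notebook's actual setup may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# One entry per configuration listed above
models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(C=1.0, max_iter=1000),
    **{f'k-NN (k={k})': KNeighborsClassifier(n_neighbors=k)
       for k in (1, 3, 5, 11, 21)},
    **{f'SVM ({kernel})': SVC(kernel=kernel, probability=True)
       for kernel in ('linear', 'poly', 'rbf')},
}
```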
**Comprehensive Evaluation**
- Confusion matrices
- Precision/Recall/F1-Score
- ROC curves and AUC
- Cross-validation
- Computational performance analysis
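
Each of these metrics can be computed with scikit-learn for any fitted model; a sketch, reusing the `model`/`X_test_bow`/`y_test` names from the usage example below:

```python
from sklearn.metrics import (confusion_matrix, precision_recall_fscore_support,
                             roc_auc_score)

y_pred = model.predict(X_test_bow)
y_score = model.predict_proba(X_test_bow)[:, 1]  # spam probability, for AUC

print(confusion_matrix(y_test, y_pred))
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average='binary')
auc = roc_auc_score(y_test, y_score)
print(f"P={precision:.4f}  R={recall:.4f}  F1={f1:.4f}  AUC={auc:.4f}")
```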
```
spam-recognision/
├── spam_classification_assignment.ipynb   # Main Jupyter notebook
├── emails.csv                             # Dataset (5,728 emails)
├── doc.tex                                # Full LaTeX thesis documentation
├── doc-stock.tex                          # Thesis template
├── references.bib                         # Bibliography
├── hellas.bst                             # Greek BibTeX style
├── media/
│   └── graphs/                            # Generated visualizations
│       ├── confusion_matrix_*.png         # Confusion matrices
│       ├── roc_curves_*.png               # ROC curves
│       ├── *-comparison.png               # Model comparisons
│       └── pca-*.png                      # PCA analysis plots
└── README.md                              # This file
```
- Python 3.8 or higher
- pip package manager
- 4GB RAM minimum (8GB recommended)
- 2GB free disk space
**Clone the repository**

```bash
git clone https://github.com/IBilba/spam-recognision.git
cd spam-recognision
```
**Create a virtual environment**

```bash
# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```
**Install dependencies**

```bash
pip install --upgrade pip
pip install numpy pandas scikit-learn
pip install nltk sentence-transformers
pip install matplotlib seaborn
pip install jupyter ipykernel
pip install tqdm
```
**Download NLTK data**

```bash
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
```
**Launch Jupyter Notebook**

```bash
jupyter notebook
```
**Open the notebook**

- Navigate to `spam_classification_assignment.ipynb`
- Run cells sequentially (Cell → Run All)
**Expected runtime**
- Full execution: ~10-15 minutes (CPU)
- With BERT embeddings: ~5-10 minutes (GPU recommended)
Quick start (BoW + Naive Bayes):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load data
df = pd.read_csv('emails.csv')

# Preprocess (preprocess_text is the cleaning pipeline described above)
from preprocessing import preprocess_text
df['clean_text'] = df['text'].apply(preprocess_text)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_text'], df['spam'],
    test_size=0.2, random_state=42, stratify=df['spam']
)

# Create features (Bag-of-Words)
vectorizer = CountVectorizer(max_features=5000)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Train model
model = MultinomialNB()
model.fit(X_train_bow, y_train)

# Evaluate
y_pred = model.predict(X_test_bow)
print(classification_report(y_test, y_pred))

# Predict on new email
sample = "Congratulations! You won $1000! Click here!"
sample_vec = vectorizer.transform([preprocess_text(sample)])
prediction = model.predict(sample_vec)[0]
print(f"Prediction: {'SPAM' if prediction == 1 else 'HAM'}")
```

Embeddings + SVM:

```python
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC
# Load SBERT model
encoder = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings
X_train_emb = encoder.encode(X_train.tolist(), show_progress_bar=True)
X_test_emb = encoder.encode(X_test.tolist(), show_progress_bar=True)
# Train SVM with polynomial kernel
svm = SVC(kernel='poly', degree=3, C=1.0, probability=True)
svm.fit(X_train_emb, y_train)
# Evaluate
y_pred = svm.predict(X_test_emb)
print(classification_report(y_test, y_pred))
```

The `emails.csv` dataset contains:
- Total: 5,728 emails
- Spam: 1,368 emails (23.88%)
- Ham: 4,360 emails (76.12%)
- Split: Train / Validation / Test
- Features: Raw email text with subject lines
Data Distribution:
- Training set: 4,582 emails (1,114 spam / 3,468 ham)
- Validation set: 572 emails (126 spam / 446 ham)
- Test set: 574 emails (128 spam / 446 ham)
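
A two-stage stratified split along these lines approximately reproduces this partition (a sketch; the project's exact seed and ratios are assumptions, and `df` is loaded as in the usage example above):

```python
from sklearn.model_selection import train_test_split

# 80/10/10 stratified split: first carve off a 20% holdout,
# then halve it into validation and test sets.
train_df, holdout = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['spam'])
val_df, test_df = train_test_split(
    holdout, test_size=0.5, random_state=42, stratify=holdout['spam'])
print(len(train_df), len(val_df), len(test_df))  # ~4582 / ~573 / ~573
```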
Preprocessing:
- Remove URLs, email addresses, and special characters
- Convert to lowercase
- Tokenization
- Remove English stopwords
- Apply Porter stemming
Bag-of-Words (BoW):
- CountVectorizer with vocabulary size 5000
- Word-frequency (count) representation
- Sparse representation (~1% density)
- Captures presence of spam-indicative words
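
The ~1% density figure can be sanity-checked directly on the count matrix (illustrative; reuses `df['clean_text']` from the usage example above):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000)
X_bow = vectorizer.fit_transform(df['clean_text'])
# Fraction of non-zero entries in the sparse document-term matrix
density = X_bow.nnz / (X_bow.shape[0] * X_bow.shape[1])
print(f"density: {density:.2%}")
```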
Sentence Embeddings:
- `all-MiniLM-L6-v2` model
- 384-dimensional dense vectors
- Semantic meaning preservation
- Transfer learning from 1B+ sentence pairs
Experimental design:
- Stratified train/test split
- Hyperparameter tuning (k-values, kernels, PCA components)
- Cross-validation for robustness
- Multiple algorithm comparison
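
Hyperparameter tuning and cross-validation can be combined in one pass; a sketch with an assumed grid and fold count (neither is specified in this README), reusing the embeddings from the usage example above:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Search SVM kernels and regularization strengths with stratified 5-fold CV
grid = GridSearchCV(
    SVC(probability=True),
    param_grid={'kernel': ['linear', 'poly', 'rbf'], 'C': [0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1',
)
grid.fit(X_train_emb, y_train)
print(grid.best_params_, grid.best_score_)
```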
Evaluation:
- Confusion matrices
- Precision, Recall, F1-Score
- ROC curves and AUC
- Statistical significance testing
The complete thesis documentation is available as a PDF compiled from `doc.tex`.
- Python 3.10: Core programming language
- NumPy & Pandas: Data manipulation
- scikit-learn: Machine learning algorithms
- NLTK: Natural language processing
- Sentence-Transformers: State-of-the-art embeddings
- Matplotlib & Seaborn: Visualization
- Jupyter: Interactive development
- **Embeddings + SVM Win**: Modern sentence embeddings with an SVM polynomial kernel achieve the best overall performance (F1: 96.85%, AUC: 99.93%), demonstrating the power of semantic understanding combined with non-linear modeling.
- **Naive Bayes Remains Competitive**: Traditional BoW with Naive Bayes reaches F1 = 96.09% (only 0.76 percentage points behind the SVM), making it an excellent choice for speed-critical applications.
- **Computational Trade-offs**:
  - Naive Bayes: <1s training, near-instant inference (F1: 96.09%)
  - SVM: 15-30s training, ~0.1s per email (F1: 96.85%)
  - BERT embeddings: 5-10 min generation (one-time cost)
- **PCA Impact**: Reducing the embeddings from 384 to 115 dimensions (90% retained variance) maintains 99.84% AUC but lowers F1 from 96.85% to 94.69% (-2.16 points).
- **Optimal Hyperparameters**:
  - k-NN: k=1 achieves perfect precision (100.00%)
  - SVM: polynomial kernel (degree=3) is optimal for embeddings
  - Feature count: 5000 features are sufficient for BoW
This is a proprietary project. Contributions are not accepted at this time. For questions or collaboration inquiries, please contact the author.
This project is under a Proprietary License. You may use this software for personal, educational, or research purposes only. You may NOT modify, distribute, sell, or use it commercially without explicit permission. See the LICENSE file for complete terms.
Μπίτζας Βασίλειος (Vasileios Bitzas)
- GitHub: @IBilba
- Dataset sourced from public spam email collections
- Sentence-Transformers library by Nils Reimers and Iryna Gurevych
- scikit-learn community for excellent ML tools
- NLTK project for NLP resources
⭐ If you found this project helpful, please consider giving it a star!