📧 Spam Email Classification

Python Jupyter scikit-learn License

A comprehensive machine learning project for spam email detection using multiple classification algorithms and feature representations. This project compares traditional Bag-of-Words (BoW) with modern Sentence Embeddings (SBERT) approaches, achieving up to 96.85% F1-score and 99.93% AUC.

🎯 Project Overview

This repository contains a complete implementation and analysis of spam email classification, developed as part of a thesis project. The project explores:

  • Multiple ML Algorithms: Naive Bayes, Logistic Regression, k-NN, SVM
  • Feature Representations: Bag-of-Words with CountVectorizer (5000 features) vs. Sentence Transformers (384-dim embeddings)
  • Dimensionality Reduction: PCA with various variance retention ratios
  • Comprehensive Evaluation: Precision, Recall, F1-Score, AUC-ROC metrics

📊 Key Results

| Model               | Representation   | Precision | Recall | F1-Score | AUC    | Rank |
|---------------------|------------------|-----------|--------|----------|--------|------|
| SVM (Polynomial)    | Embeddings (384) | 0.9762    | 0.9609 | 0.9685   | 0.9993 | 🥇   |
| Naive Bayes         | BoW (5000)       | 0.9609    | 0.9609 | 0.9609   | 0.9962 | 🥈   |
| Logistic Regression | Embeddings (384) | 0.9603    | 0.9453 | 0.9528   | 0.9949 | 🥉   |
| SVM + PCA (90%)     | Embeddings + PCA | 0.9915    | 0.9062 | 0.9469   | 0.9984 | 4    |
| k-NN (k=1)          | BoW (5000)       | 1.0000    | 0.8984 | 0.9465   | 0.9492 | 5    |
| LogReg + PCA (10)   | Embeddings + PCA | 0.9040    | 0.8828 | 0.8933   | 0.9889 | 6    |

πŸ† Best Performers

  • Best Overall: SVM Polynomial with Embeddings (F1: 96.85%, AUC: 99.93%)
  • Highest Precision: k-NN (100.00%, zero false positives)
  • Highest Recall: Naive Bayes + SVM tie (96.09%, catches most spam)
  • Best AUC: SVM with Polynomial Kernel (99.93%, superior discrimination)
  • Best Balance: SVM Polynomial (Precision: 97.62%, Recall: 96.09%)

✨ Features

  • Text Preprocessing Pipeline

    • URL and email address removal
    • Special character filtering
    • Lowercase conversion
    • Stopword removal (NLTK)
    • Porter stemming
  • Feature Extraction

    • Bag-of-Words with CountVectorizer (5000 vocabulary size)
    • Sentence embeddings using all-MiniLM-L6-v2
    • PCA dimensionality reduction
  • Classification Models

    • Multinomial Naive Bayes
    • Logistic Regression (C=1.0)
    • k-Nearest Neighbors (k=1, 3, 5, 11, 21)
    • Support Vector Machines (Linear, Polynomial, RBF kernels)
  • Comprehensive Evaluation

    • Confusion matrices
    • Precision/Recall/F1-Score
    • ROC curves and AUC
    • Cross-validation
    • Computational performance analysis

πŸ“ Project Structure

spam-recognision/
├── spam_classification_assignment.ipynb  # Main Jupyter notebook
├── emails.csv                            # Dataset (5728 emails)
├── doc.tex                               # Full LaTeX thesis documentation
├── doc-stock.tex                         # Thesis template
├── references.bib                        # Bibliography
├── hellas.bst                            # Greek BibTeX style
├── media/
│   └── graphs/                           # Generated visualizations
│       ├── confusion_matrix_*.png        # Confusion matrices
│       ├── roc_curves_*.png              # ROC curves
│       ├── *-comparison.png              # Model comparisons
│       └── pca-*.png                     # PCA analysis plots
└── README.md                             # This file

🚀 Getting Started

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • 4GB RAM minimum (8GB recommended)
  • 2GB free disk space

Installation

  1. Clone the repository

    git clone https://github.com/IBilba/spam-recognision.git
    cd spam-recognision
  2. Create a virtual environment

    # Windows
    python -m venv venv
    venv\Scripts\activate
    
    # macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
  3. Install dependencies

    pip install --upgrade pip
    pip install numpy pandas scikit-learn
    pip install nltk sentence-transformers
    pip install matplotlib seaborn
    pip install jupyter ipykernel
    pip install tqdm
  4. Download NLTK data

    python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"

Quick Start

  1. Launch Jupyter Notebook

    jupyter notebook
  2. Open the notebook

    • Navigate to spam_classification_assignment.ipynb
    • Run cells sequentially (Cell → Run All)
  3. Expected runtime

    • Full execution: ~10-15 minutes (CPU)
    • Sentence-embedding generation (SBERT): ~5-10 minutes of that time (GPU recommended)

💻 Usage

Basic Classification Example

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load data
df = pd.read_csv('emails.csv')

# Preprocess (preprocess_text implements the cleaning pipeline described in the
# Methodology section; adjust the import to wherever you have defined it)
from preprocessing import preprocess_text
df['clean_text'] = df['text'].apply(preprocess_text)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_text'], df['spam'],
    test_size=0.2, random_state=42, stratify=df['spam']
)

# Create features (Bag-of-Words)
vectorizer = CountVectorizer(max_features=5000)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Train model
model = MultinomialNB()
model.fit(X_train_bow, y_train)

# Evaluate
y_pred = model.predict(X_test_bow)
print(classification_report(y_test, y_pred))

# Predict on new email
sample = "Congratulations! You won $1000! Click here!"
sample_vec = vectorizer.transform([preprocess_text(sample)])
prediction = model.predict(sample_vec)[0]
print(f"Prediction: {'SPAM' if prediction == 1 else 'HAM'}")

Advanced: Using Sentence Embeddings

from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

# Load SBERT model
encoder = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
X_train_emb = encoder.encode(X_train.tolist(), show_progress_bar=True)
X_test_emb = encoder.encode(X_test.tolist(), show_progress_bar=True)

# Train SVM with polynomial kernel
svm = SVC(kernel='poly', degree=3, C=1.0, probability=True)
svm.fit(X_train_emb, y_train)

# Evaluate
y_pred = svm.predict(X_test_emb)
print(classification_report(y_test, y_pred))
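
The PCA rows in the results table can be reproduced along these lines. This is a minimal sketch that continues from the snippet above (X_train_emb, X_test_emb, y_train, y_test); the random seed and solver settings are illustrative assumptions, not values taken from the notebook.

from sklearn.decomposition import PCA

# Keep enough principal components to retain 90% of the variance
# (the thesis reports this reduces the 384 dimensions to roughly 115)
pca = PCA(n_components=0.90, random_state=42)
X_train_red = pca.fit_transform(X_train_emb)
X_test_red = pca.transform(X_test_emb)
print(f"Reduced from {X_train_emb.shape[1]} to {pca.n_components_} dimensions")

# Retrain the polynomial-kernel SVM on the reduced embeddings
svm_pca = SVC(kernel='poly', degree=3, C=1.0, probability=True)
svm_pca.fit(X_train_red, y_train)
print(classification_report(y_test, svm_pca.predict(X_test_red)))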

📈 Dataset

The emails.csv dataset contains:

  • Total: 5,728 emails
  • Spam: 1,368 emails (23.88%)
  • Ham: 4,360 emails (76.12%)
  • Split: Train / Validation / Test
  • Features: Raw email text with subject lines

Data Distribution:

  • Training set: 4,582 emails (1,114 spam / 3,468 ham)
  • Validation set: 572 emails (126 spam / 446 ham)
  • Test set: 574 emails (128 spam / 446 ham)
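
A split with these proportions can be produced with two stratified calls to train_test_split. This is a sketch that reproduces the sizes approximately; the random seed and the two-stage scheme are assumptions, not the notebook's exact code.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('emails.csv')
print(df['spam'].value_counts())   # 0 = ham, 1 = spam

# Hold out ~10% for testing, then ~10% of the remainder for validation,
# preserving the spam/ham ratio in every split.
train_val_df, test_df = train_test_split(df, test_size=0.10,
                                         stratify=df['spam'], random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.111,
                                    stratify=train_val_df['spam'], random_state=42)
print(len(train_df), len(val_df), len(test_df))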

🔬 Methodology

1. Text Preprocessing

  • Remove URLs, email addresses, special characters
  • Convert to lowercase
  • Tokenization
  • Remove English stopwords
  • Apply Porter stemming
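
A cleaning function implementing these steps might look like the following. It is a minimal NLTK-based sketch, not the notebook's exact implementation, and it assumes the stopwords and punkt data downloaded during installation.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_text(text):
    """Strip URLs/addresses and punctuation, lowercase, drop stopwords, stem."""
    text = re.sub(r'http\S+|www\.\S+', ' ', text)   # URLs
    text = re.sub(r'\S+@\S+', ' ', text)            # email addresses
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)        # special characters and digits
    tokens = nltk.word_tokenize(text.lower())       # lowercase + tokenize
    tokens = [STEMMER.stem(t) for t in tokens if t not in STOPWORDS]
    return ' '.join(tokens)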

2. Feature Extraction

Bag-of-Words (BoW):

  • CountVectorizer with vocabulary size 5000
  • Word-frequency (term-count) representation
  • Sparse representation (~1% density)
  • Captures presence of spam-indicative words
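
The quoted density is easy to verify on the sparse matrix built in the usage example (X_train_bow is assumed from that snippet):

# CountVectorizer returns a scipy.sparse matrix; density = stored entries / total cells
density = X_train_bow.nnz / (X_train_bow.shape[0] * X_train_bow.shape[1])
print(f"BoW density: {density:.2%}")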

Sentence Embeddings:

  • all-MiniLM-L6-v2 model
  • 384-dimensional dense vectors
  • Semantic meaning preservation
  • Transfer learning from 1B+ sentence pairs

3. Model Training

  • Stratified train/test split
  • Hyperparameter tuning (k-values, kernels, PCA components)
  • Cross-validation for robustness
  • Multiple algorithm comparison
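
As a sketch of the tuning step: the grids mirror the k-values and kernels listed under Features, but the cross-validation settings are assumptions, and the X_train_bow / X_train_emb matrices are taken from the Usage section.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# k-NN: search the neighbourhood sizes used in the study (on the BoW features)
knn_search = GridSearchCV(KNeighborsClassifier(),
                          {'n_neighbors': [1, 3, 5, 11, 21]},
                          scoring='f1', cv=5)
knn_search.fit(X_train_bow, y_train)
print(knn_search.best_params_, knn_search.best_score_)

# SVM: compare kernels on the sentence embeddings
svm_search = GridSearchCV(SVC(C=1.0),
                          {'kernel': ['linear', 'poly', 'rbf']},
                          scoring='f1', cv=5)
svm_search.fit(X_train_emb, y_train)
print(svm_search.best_params_, svm_search.best_score_)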

4. Evaluation

  • Confusion matrices
  • Precision, Recall, F1-Score
  • ROC curves and AUC
  • Statistical significance testing
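
These metrics can all be computed with scikit-learn. A sketch, assuming the fitted svm and the X_test_emb / y_test objects from the Usage section:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, roc_curve, auc

y_pred = svm.predict(X_test_emb)
y_score = svm.predict_proba(X_test_emb)[:, 1]   # requires probability=True on the SVC

print(confusion_matrix(y_test, y_pred))
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')
print(f"Precision={precision:.4f}  Recall={recall:.4f}  F1={f1:.4f}")

fpr, tpr, _ = roc_curve(y_test, y_score)
print(f"AUC={auc(fpr, tpr):.4f}")
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.show()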

📚 Documentation

The complete thesis documentation is available as LaTeX source in doc.tex; compile it (with references.bib and hellas.bst) to produce the full PDF write-up.

πŸ› οΈ Technologies Used

  • Python 3.10: Core programming language
  • NumPy & Pandas: Data manipulation
  • scikit-learn: Machine learning algorithms
  • NLTK: Natural language processing
  • Sentence-Transformers: State-of-the-art embeddings
  • Matplotlib & Seaborn: Visualization
  • Jupyter: Interactive development

📊 Key Insights

  1. Embeddings + SVM Win: Modern sentence embeddings with SVM Polynomial kernel achieve the best performance (F1: 96.85%, AUC: 99.93%), demonstrating the power of semantic understanding combined with non-linear modeling.

  2. Naive Bayes Remains Competitive: Traditional BoW with Naive Bayes achieves F1=96.09% (only -0.76% behind SVM), making it an excellent choice for speed-critical applications.

  3. Computational Trade-offs (see the timing sketch after this list):

    • Naive Bayes: <1s training, instant inference (F1: 96.09%)
    • SVM: 15-30s training, 0.1s per email (F1: 96.85%)
    • BERT embeddings: 5-10 min generation (one-time cost)
  4. PCA Impact: Dimensionality reduction from 384→115 dims (90% variance) maintains 99.84% AUC but reduces F1 from 96.85% to 94.69% (-2.16%).

  5. Optimal Hyperparameters:

    • k-NN: k=1 achieves perfect precision (100.00%)
    • SVM: Polynomial kernel (degree=3) optimal for embeddings
    • Feature count: 5000 features sufficient for BoW
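
The timing figures in point 3 are machine-dependent; a simple wall-clock check (assuming the fitted Naive Bayes model and the BoW matrices from the Usage section) looks like:

import time

start = time.perf_counter()
model.fit(X_train_bow, y_train)
print(f"Training time: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
model.predict(X_test_bow[:1])   # single-email inference
print(f"Per-email inference: {time.perf_counter() - start:.4f}s")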

🤝 Contributing

This is a proprietary project. Contributions are not accepted at this time. For questions or collaboration inquiries, please contact the author.

πŸ“ License

This project is under a Proprietary License. You may use this software for personal, educational, or research purposes only. You may NOT modify, distribute, sell, or use it commercially without explicit permission. See the LICENSE file for complete terms.

👤 Author

Μπίτζας Βασίλειος (Vasileios Bitzas)

πŸ™ Acknowledgments

  • Dataset sourced from public spam email collections
  • Sentence-Transformers library by Nils Reimers and Iryna Gurevych
  • scikit-learn community for excellent ML tools
  • NLTK project for NLP resources

⭐ If you found this project helpful, please consider giving it a star!
