A complete project for multi-class text categorization. Features TF-IDF vectorization, hyperparameter tuning, and a performance comparison of SVM, Ridge, and Random Forest models.

evankost/text-classification-multiclass

Multiclass Text Classification

A machine learning pipeline for multi-class text categorization. This project processes text data to classify articles into four distinct categories: Business, Entertainment, Health, and Technology. It features data cleaning, feature engineering with TF-IDF, and a performance comparison of machine learning models: Support Vector Machines (SVM), Ridge Classifiers, and Random Forests.

Project Overview

This repository implements a Natural Language Processing (NLP) workflow:

  • Text Preprocessing: Cleans and normalizes raw text through lowercasing, noise removal (URLs, HTML tags, special characters), tokenization, chat word expansion, stopword removal, and lemmatization.
  • Feature Extraction: Utilizes TF-IDF vectorization with bigrams, demonstrating its effectiveness over raw frequency-based representations (CountVectorizer) for model convergence and accuracy.
  • Model Training: Trains and evaluates classifiers, handling class imbalances via stratified cross-validation and weighted loss functions.
  • Optimization: Fine-tunes model hyperparameters using grid search (GridSearchCV) and Bayesian optimization (Optuna).
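The workflow above can be sketched as a minimal scikit-learn pipeline. The toy corpus, labels, and parameter grid below are illustrative stand-ins; the notebook's actual data and configuration may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the real train.csv
texts = [
    "stocks rally as markets open higher", "quarterly earnings beat forecasts",
    "new blockbuster film tops the box office", "singer announces a world tour",
    "study links exercise to better sleep", "vaccine trial shows promise",
    "chipmaker unveils a faster processor", "startup releases an open source framework",
] * 3
labels = ["business", "business", "entertainment", "entertainment",
          "health", "health", "technology", "technology"] * 3

# TF-IDF with unigrams + bigrams feeding a linear SVM, tuned over C
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("clf", LinearSVC(C=1.0)),
])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(texts, labels)
print(search.best_params_)
```

Fitting the vectorizer inside the pipeline ensures each cross-validation fold builds its vocabulary only from that fold's training split.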

Data Preprocessing & Feature Engineering

Raw text data often contains noise that hinders model learning. The preprocessing pipeline applies the following transformations:

  • Cleaning: Strips URLs, HTML tags, and non-alphanumeric characters.
  • Normalization: Expands common chat slang using a custom dictionary (chat_words.py), removes NLTK stopwords, and lemmatizes words to their base forms.
  • Vectorization: Transforms the cleaned text into numerical features using TfidfVectorizer. Article titles and contents are vectorized separately, with titles weighted more heavily to account for their higher information density, and the results are combined into a single sparse feature matrix.
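The title/content scheme can be sketched as below; the weighting factor and sample texts are illustrative assumptions, not the notebook's actual values:

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical title/content pairs standing in for the real dataset
titles = ["markets rally", "film tops box office", "vaccine trial results", "new processor launch"]
contents = [
    "stocks climbed as investors cheered strong earnings reports",
    "the studio's latest release drew record opening audiences",
    "researchers reported strong immune responses in phase two",
    "the chip promises faster speeds at lower power draw",
]

# Separate vectorizers so titles and contents each get their own vocabulary
title_vec = TfidfVectorizer(ngram_range=(1, 2))
content_vec = TfidfVectorizer(ngram_range=(1, 2))
X_title = title_vec.fit_transform(titles)
X_content = content_vec.fit_transform(contents)

# Up-weight the information-dense title features (factor 2.0 is illustrative),
# then stack both blocks into one sparse feature matrix
X = sp.hstack([2.0 * X_title, X_content]).tocsr()
print(X.shape)
```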

Modeling & Hyperparameter Tuning

Several algorithms were evaluated using Stratified 5-Fold Cross-Validation, which keeps class proportions consistent across folds and, with all fitting done inside each fold, guards against data leakage:

  1. Linear SVM: A linear classifier optimized via GridSearchCV and Optuna to find the best regularization parameter (C), loss function (hinge), and tolerance.
  2. Ridge Classifier: A linear model utilizing L2 regularization to prevent overfitting on the sparse TF-IDF vectors.
  3. Random Forest: A model that builds multiple decision trees and takes a majority vote to classify the text data.
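The three-way comparison can be sketched as follows, assuming scikit-learn; the toy corpus and hyperparameters are illustrative stand-ins for the project's data and tuned settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the real train.csv
texts = [
    "stocks rally as markets open higher", "quarterly earnings beat forecasts",
    "new blockbuster film tops the box office", "singer announces a world tour",
    "study links exercise to better sleep", "vaccine trial shows promise",
    "chipmaker unveils a faster processor", "startup releases an open source framework",
] * 3
labels = ["business", "business", "entertainment", "entertainment",
          "health", "health", "technology", "technology"] * 3

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# class_weight="balanced" reweights the loss to compensate for class imbalance
models = {
    "Ridge Classifier": RidgeClassifier(class_weight="balanced"),
    "Linear SVM": LinearSVC(class_weight="balanced"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
results = {}
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(pipe, texts, labels, cv=cv)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f} ± {scores.std():.4f}")
```

Vectorizing inside the per-model pipeline (rather than once up front) keeps each fold's vocabulary free of test-split information.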

Results & Comparative Evaluation

The linear models outperformed the ensemble methods for this specific text classification task. The Ridge Classifier and Linear SVM achieved the best balance of accuracy, generalization, and computational efficiency.

| Model | Cross-Validation Accuracy | Precision (Weighted) | Recall (Weighted) | F1-Score (Weighted) |
|---|---|---|---|---|
| Ridge Classifier (TF-IDF) | 0.9769 ± 0.0005 | 0.98 | 0.98 | 0.98 |
| Linear SVM (TF-IDF) | 0.9765 ± 0.0005 | 0.98 | 0.98 | 0.98 |
| Random Forest (TF-IDF) | 0.9268 ± 0.0018 | 0.93 | 0.93 | 0.93 |
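Weighted metrics like those in the table can be computed with scikit-learn's classification_report and f1_score; the toy labels below are illustrative, not the project's actual predictions:

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical ground truth and predictions for a 4-class problem
y_true = ["business", "health", "tech", "tech", "entertainment", "business"]
y_pred = ["business", "health", "tech", "business", "entertainment", "business"]

# Per-class precision/recall/F1 plus weighted averages
print(classification_report(y_true, y_pred, digits=2))

# "weighted" averages per-class F1 scores by class support
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
```

Weighted averaging gives larger classes proportionally more influence, which is why the table's weighted scores track overall accuracy so closely.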

Repository Files

  • Text Classification.ipynb: The main Jupyter notebook containing the end-to-end pipeline, from preprocessing through model evaluation.
  • chat_words.py: A custom Python dictionary used during preprocessing to map internet slang and abbreviations to their expanded forms.
  • train.csv: The labeled training dataset.
  • test_without_labels.csv: The unlabeled test dataset used for generating final predictions.

License

This project is licensed under the MIT License.
