A machine learning pipeline for multi-class text categorization. This project processes text data to classify articles into four distinct categories: Business, Entertainment, Health, and Technology. It features data cleaning, feature engineering with TF-IDF, and a performance comparison of machine learning models: Support Vector Machines (SVM), Ridge Classifiers, and Random Forests.
This repository implements a Natural Language Processing (NLP) workflow:
- Text Preprocessing: Cleans and normalizes raw text through lowercasing, noise removal (URLs, HTML tags, special characters), tokenization, chat word expansion, stopword removal, and lemmatization.
- Feature Extraction: Utilizes TF-IDF vectorization with bigrams, demonstrating its effectiveness over raw frequency-based representations (CountVectorizer) for model convergence and accuracy.
- Model Training: Trains and evaluates classifiers, handling class imbalances via stratified cross-validation and weighted loss functions.
- Optimization: Fine-tunes model hyperparameters using grid search (`GridSearchCV`) and Bayesian optimization (Optuna).
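The two tuning approaches above can be sketched as follows. The toy corpus, the `C` range, and the Optuna search bounds are illustrative stand-ins, not the project's actual data or search space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import LinearSVC

# Toy 4-class corpus standing in for the real articles (illustrative only).
texts = [
    "stock market profit", "company earnings rise", "bank loan rates",
    "movie premiere tonight", "actor wins award", "new album release",
    "doctors treat patients", "new vaccine trial", "hospital care study",
    "new smartphone launch", "software update release", "cloud computing growth",
]
labels = ["business"] * 3 + ["entertainment"] * 3 + ["health"] * 3 + ["technology"] * 3

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# Grid search over the regularization strength C.
grid = GridSearchCV(
    LinearSVC(loss="hinge", max_iter=5000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative range
    cv=3,
    scoring="accuracy",
)
grid.fit(X, labels)
print(grid.best_params_)

# An equivalent Optuna objective; run with
# optuna.create_study(direction="maximize").optimize(objective, n_trials=20).
def objective(trial):
    C = trial.suggest_float("C", 1e-3, 1e2, log=True)
    model = LinearSVC(C=C, loss="hinge", max_iter=5000)
    return cross_val_score(model, X, labels, cv=3).mean()
```

Grid search exhaustively scores every listed value, while Optuna samples the continuous range and concentrates trials near promising regions, which matters once more than one hyperparameter is tuned.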
Raw text data often contains noise that hinders model learning. The preprocessing pipeline applies the following transformations:
- Cleaning: Strips URLs, HTML tags, and non-alphanumeric characters.
- Normalization: Expands common chat slang using a custom dictionary (`chat_words.py`), removes NLTK stopwords, and lemmatizes words to their base forms.
- Vectorization: Transforms the cleaned text into numerical features using `TfidfVectorizer`. Article titles and content are vectorized separately, with titles weighted more heavily to account for their higher information density, and then combined into a single sparse feature matrix.
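A minimal sketch of the cleaning and the separate title/content vectorization described above. The `clean` helper, the two-document corpus, and the `TITLE_WEIGHT` value are illustrative assumptions, not the project's exact code:

```python
import re

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

def clean(text: str) -> str:
    """Lowercase and strip URLs, HTML tags, and special characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # strip special characters
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

titles = ["Markets Rally!", "<b>New Film</b> Premieres"]
bodies = [
    "Stocks rose sharply, see https://example.com for details.",
    "The film opened to strong reviews this weekend.",
]

titles = [clean(t) for t in titles]
bodies = [clean(b) for b in bodies]

# Separate vectorizers for titles and content, both with bigrams.
title_vec = TfidfVectorizer(ngram_range=(1, 2))
body_vec = TfidfVectorizer(ngram_range=(1, 2))

TITLE_WEIGHT = 2.0  # illustrative boost for the short, dense titles
X = hstack([
    title_vec.fit_transform(titles) * TITLE_WEIGHT,
    body_vec.fit_transform(bodies),
]).tocsr()
print(X.shape)  # one row per article, title and body features side by side
```

Scaling the title block before `hstack` lets the downstream linear models see title terms as proportionally stronger signals without changing the feature count.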
Several algorithms were evaluated using Stratified 5-Fold Cross-Validation to prevent data leakage and ensure consistent representation of all classes:
- Linear SVM: A linear classifier optimized via `GridSearchCV` and Optuna to find the best regularization parameter (C), loss function (hinge), and tolerance.
- Ridge Classifier: A linear model that uses L2 regularization to prevent overfitting on the sparse TF-IDF vectors.
- Random Forest: A model that builds multiple decision trees and takes a majority vote to classify the text data.
The linear models outperformed the ensemble methods for this specific text classification task. The Ridge Classifier and Linear SVM achieved the best balance of accuracy, generalization, and computational efficiency.
| Model | Cross-Validation Accuracy | Precision (Weighted) | Recall (Weighted) | F1-Score (Weighted) |
|---|---|---|---|---|
| Ridge Classifier (TF-IDF) | 0.9769 ± 0.0005 | 0.98 | 0.98 | 0.98 |
| Linear SVM (TF-IDF) | 0.9765 ± 0.0005 | 0.98 | 0.98 | 0.98 |
| Random Forest (TF-IDF) | 0.9268 ± 0.0018 | 0.93 | 0.93 | 0.93 |
- `Text Classification.ipynb`: The main Jupyter Notebook containing the pipeline.
- `chat_words.py`: A custom Python dictionary used during the preprocessing phase to map internet slang and abbreviations to their full English words.
- `train.csv`: The labeled training dataset.
- `test_without_labels.csv`: The unlabeled test dataset used for generating final predictions.
This project is licensed under the MIT License.