A machine learning pipeline for multi-class text categorization. This project processes text data to classify articles into four distinct categories: Business, Entertainment, Health, and Technology. It features data cleaning, feature engineering with TF-IDF, and a performance comparison of machine learning models: Support Vector Machines (SVM), Ridge Classifiers, and Random Forests.
This repository implements a Natural Language Processing (NLP) workflow:
- Text Preprocessing: Cleans and normalizes raw text through lowercasing, noise removal (URLs, HTML tags, special characters), tokenization, chat word expansion, stopword removal, and lemmatization.
- Feature Extraction: Utilizes TF-IDF vectorization with bigrams, demonstrating its effectiveness over raw frequency-based representations (CountVectorizer) for model convergence and accuracy.
- Model Training: Trains and evaluates classifiers, handling class imbalances via stratified cross-validation and weighted loss functions.
- Optimization: Fine-tunes model hyperparameters using grid search (`GridSearchCV`) and Bayesian optimization (Optuna).
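The two tuning approaches above can be sketched as follows. The toy corpus, the `C` range, and the Optuna search bounds are illustrative stand-ins, not the project's actual data or search space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import LinearSVC

# Toy 4-class corpus standing in for the real articles (illustrative only).
texts = [
    "stock market profit", "company earnings rise", "bank loan rates",
    "movie premiere tonight", "actor wins award", "new album release",
    "doctors treat patients", "new vaccine trial", "hospital care study",
    "new smartphone launch", "software update release", "cloud computing growth",
]
labels = ["business"] * 3 + ["entertainment"] * 3 + ["health"] * 3 + ["technology"] * 3

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# Grid search over the regularization strength C.
grid = GridSearchCV(
    LinearSVC(loss="hinge", max_iter=5000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative range
    cv=3,
    scoring="accuracy",
)
grid.fit(X, labels)
print(grid.best_params_)

# An equivalent Optuna objective; run with
# optuna.create_study(direction="maximize").optimize(objective, n_trials=20).
def objective(trial):
    C = trial.suggest_float("C", 1e-3, 1e2, log=True)
    model = LinearSVC(C=C, loss="hinge", max_iter=5000)
    return cross_val_score(model, X, labels, cv=3).mean()
```

Grid search exhaustively scores every listed value, while Optuna samples the continuous range and concentrates trials near promising regions, which matters once more than one hyperparameter is tuned.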
Raw text data often contains noise that hinders model learning. The preprocessing pipeline applies the following transformations:
- Cleaning: Strips URLs, HTML tags, and non-alphanumeric characters.
- Normalization: Expands common chat slang using a custom dictionary (`chat_words.py`), removes NLTK stopwords, and lemmatizes words to their base forms.
- Vectorization: Transforms the cleaned text into numerical features using `TfidfVectorizer`. Article titles and content are vectorized separately, with titles weighted more heavily to account for their higher information density, and then combined into a single sparse feature matrix.
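A minimal sketch of the cleaning and the separate title/content vectorization described above. The `clean` helper, the two-document corpus, and the `TITLE_WEIGHT` value are illustrative assumptions, not the project's exact code:

```python
import re

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

def clean(text: str) -> str:
    """Lowercase and strip URLs, HTML tags, and special characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # strip special characters
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

titles = ["Markets Rally!", "<b>New Film</b> Premieres"]
bodies = [
    "Stocks rose sharply, see https://example.com for details.",
    "The film opened to strong reviews this weekend.",
]

titles = [clean(t) for t in titles]
bodies = [clean(b) for b in bodies]

# Separate vectorizers for titles and content, both with bigrams.
title_vec = TfidfVectorizer(ngram_range=(1, 2))
body_vec = TfidfVectorizer(ngram_range=(1, 2))

TITLE_WEIGHT = 2.0  # illustrative boost for the short, dense titles
X = hstack([
    title_vec.fit_transform(titles) * TITLE_WEIGHT,
    body_vec.fit_transform(bodies),
]).tocsr()
print(X.shape)  # one row per article, title and body features side by side
```

Scaling the title block before `hstack` lets the downstream linear models see title terms as proportionally stronger signals without changing the feature count.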
Several algorithms were evaluated using Stratified 5-Fold Cross-Validation to prevent data leakage and ensure consistent representation of all classes:
- Linear SVM: A linear classifier optimized via `GridSearchCV` and Optuna to find the best regularization parameter (C), loss function (hinge), and tolerance.
- Ridge Classifier: A linear model that uses L2 regularization to prevent overfitting on the sparse TF-IDF vectors.
- Random Forest: A model that builds multiple decision trees and takes a majority vote to classify the text data.
The linear models outperformed the ensemble methods for this specific text classification task. The Ridge Classifier and Linear SVM achieved the best balance of accuracy, generalization, and computational efficiency.
| Model | Cross-Validation Accuracy | Precision (Weighted) | Recall (Weighted) | F1-Score (Weighted) |
|---|---|---|---|---|
| Ridge Classifier (TF-IDF) | 0.9769 ± 0.0005 | 0.98 | 0.98 | 0.98 |
| Linear SVM (TF-IDF) | 0.9765 ± 0.0005 | 0.98 | 0.98 | 0.98 |
| Random Forest (TF-IDF) | 0.9268 ± 0.0018 | 0.93 | 0.93 | 0.93 |
- `Text Classification.ipynb`: The main Jupyter Notebook containing the pipeline.
- `chat_words.py`: A custom Python dictionary used during the preprocessing phase to map internet slang and abbreviations to their full English words.
- `train.csv`: The labeled training dataset.
- `test_without_labels.csv`: The unlabeled test dataset used for generating final predictions.
This project is licensed under the MIT License.