A machine learning project for detecting fraudulent job postings using natural language processing (NLP) and classification models. This project was developed for the LIN340 Computational Linguistics course.
This project analyzes job posting data to classify listings as fraudulent or legitimate. It uses a combination of text preprocessing, feature extraction, and machine learning models, with a focus on handling class imbalance and maximizing recall to reduce missed fraud cases.
Data Preprocessing
- Lowercasing, punctuation/special character removal
- Tokenization, stop word removal, lemmatization
- Handling missing values and cleaning categorical columns
- Binary encoding for relevant columns
- Combining text columns into a single feature
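The text-cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the column names in `TEXT_COLUMNS` and the short `STOP_WORDS` set are assumptions, and lemmatization is left out because it needs an external resource such as NLTK's WordNet data.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real pipeline would use a
# fuller list (for example from NLTK or spaCy).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "is", "we", "are"}

# Assumed column names, for illustration only.
TEXT_COLUMNS = ["title", "description", "requirements"]


def clean_text(text):
    """Lowercase, strip punctuation/special characters, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep letters, digits, whitespace
    tokens = text.split()                      # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


def combine_text_columns(row):
    """Join the text columns into one string, treating missing values as empty."""
    parts = [row.get(col) or "" for col in TEXT_COLUMNS]
    return " ".join(p for p in parts if p)


row = {
    "title": "Data Entry Clerk!!!",
    "description": "Earn $5000 a week, working from home.",
    "requirements": None,  # missing value
}
tokens = clean_text(combine_text_columns(row))
print(tokens)
```

Combining the columns before cleaning means every text field passes through the same normalization, so the vectorizer later sees one consistent vocabulary.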
Feature Extraction
- Bag-of-Words (BoW) unigram model
- TF-IDF (unigrams and bigrams)
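As a rough sketch of the two representations listed above, scikit-learn's `CountVectorizer` and `TfidfVectorizer` can produce a unigram BoW matrix and a unigram-plus-bigram TF-IDF matrix. The toy documents are made up for illustration, and the source does not state which vectorizer settings the notebook actually uses beyond the n-gram ranges.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy, already-cleaned documents (illustrative only).
docs = [
    "earn money fast working from home",
    "software engineer position in toronto",
    "earn money fast no experience needed",
]

# Bag-of-Words: raw counts of single words (unigrams).
bow = CountVectorizer(ngram_range=(1, 1))
X_bow = bow.fit_transform(docs)

# TF-IDF: weighted counts over unigrams and bigrams.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

print("BoW shape:", X_bow.shape)
print("TF-IDF shape:", X_tfidf.shape)
print("'earn money' is a TF-IDF feature:", "earn money" in tfidf.vocabulary_)
```

Adding bigrams enlarges the vocabulary but lets the model pick up phrases such as "earn money" that carry more signal together than as separate words.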
Handling Class Imbalance
- SMOTE oversampling for the minority (fraudulent) class
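The idea behind SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function below is a hand-rolled illustration of that idea only; `smote_sketch` and its parameters are hypothetical names, and the source does not say which implementation the project used (imbalanced-learn's `SMOTE` class with `fit_resample` is one common choice).

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy feature vectors for the minority (fraudulent) class.
fraud = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
new_points = smote_sketch(fraud, n_new=4)
balanced_minority = fraud + new_points
print(len(balanced_minority), "minority samples after oversampling")
```

Oversampling is usually applied to the training split only, so that synthetic points never leak into the test set used for evaluation.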
Modeling
- Multinomial Naive Bayes (primary model)
- Logistic Regression (baseline)
- Neural Network (4-layer, with class weights and early stopping)
- 80/20 stratified train/test split
- 5-fold cross-validation
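A minimal sketch of the split and cross-validation setup for the Naive Bayes model, assuming scikit-learn. The toy texts, the `random_state`, and the choice of recall as the cross-validation score are assumptions for illustration; the neural network and the SMOTE step are omitted here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data standing in for the cleaned postings (1 = fraudulent, 0 = legitimate).
legit = [f"software engineer role team {i} benefits office" for i in range(20)]
fraud = [f"earn money fast from home offer {i} wire fee" for i in range(10)]
texts = legit + fraud
labels = [0] * len(legit) + [1] * len(fraud)

# 80/20 split that preserves the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Vectorizer and classifier in one pipeline, so each CV fold fits its own vocabulary.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())

# 5-fold cross-validation on the training split, scored by recall on the fraud class.
recall_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="recall")

model.fit(X_train, y_train)
print("train/test sizes:", len(X_train), len(X_test))
print("mean CV recall:", recall_scores.mean())
```

Keeping the vectorizer inside the pipeline avoids fitting it on validation folds, which would otherwise make the cross-validation scores slightly optimistic.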
Evaluation
- Metrics: Accuracy, Precision, Recall, F1 Score, Confusion Matrix
- Emphasis on high recall to minimize missed fraud cases
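To see why recall is emphasized over accuracy on imbalanced data, the metrics can be computed by hand from confusion-matrix counts. The helper `classification_metrics` and the toy labels below are illustrative, not taken from the notebook.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the derived metrics for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]        # 2 fraudulent postings out of 10
always_legit = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # never flags fraud
cautious = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]      # flags both frauds plus 2 false alarms

m_legit = classification_metrics(y_true, always_legit)
m_cautious = classification_metrics(y_true, cautious)
print(m_legit)
print(m_cautious)
```

Both toy classifiers reach the same accuracy (0.8), but the first misses every fraudulent posting (recall 0.0) while the second catches all of them (recall 1.0) at the cost of lower precision.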
Dataset
- Recruitment Scam Dataset on Kaggle
- Downloaded using kagglehub
- Install dependencies: pip install -r requirements.txt
- Run the notebook: open FakeJobClassifier_EDA_and_Modelling.ipynb in Jupyter or VS Code and run all cells.
- FakeJobClassifier_EDA_and_Modelling.ipynb: Main notebook with all code, analysis, and results.
- FakeJobClassifier_Final_Report.pdf: Full project report, including Abstract, Introduction, Methods, Results, and Discussion. Download the report to view it in full.
- Marvin Roopchan, Aamid Mohsin, Christian Kevin Sidharta
This project is for academic use only.