A comprehensive machine learning project for forecasting SaaS company revenues using advanced regression techniques, hyperparameter optimization, and model explainability.
SaaSRevCast is a predictive analytics pipeline designed to forecast revenue for SaaS (Software as a Service) companies using historical financial and market data. The project implements multiple regression models, performs rigorous hyperparameter tuning with Optuna, and provides interpretable insights through SHAP (SHapley Additive exPlanations) values.
- Time-Series Revenue Forecasting: Predicts SaaS company revenues using temporal data (2020-2024)
- Multiple ML Models: Implements and compares Linear Regression, Random Forest, XGBoost, and Support Vector Regression
- Automated Hyperparameter Tuning: Utilizes Optuna for optimization with 30+ trials
- Model Explainability: SHAP analysis for feature importance and impact visualization
- Feature Engineering: Advanced lag features, growth rates, profit margins, and customer metrics
- Comprehensive Evaluation: RMSE, MAE, and MAPE metrics for model comparison
SaasRevCast/
│
├── data/ # Dataset and results
│   ├── saas_financial_market_dataset.csv # Main dataset (2,500+ records)
│   ├── results_saas_revcast.csv # Model predictions and results
│   ├── model_comparison.csv # Performance metrics comparison
│   └── feature_importance.csv # SHAP feature importance values
│
├── notebooks/ # Jupyter notebooks
│   └── MLPipeline.ipynb # Complete ML pipeline implementation
│
├── paper/ # Research documentation
│   └── Final Term Paper ML.pdf # Complete research paper
│
└── README.md # Project documentation (this file)
The dataset contains financial and operational metrics for multiple SaaS companies across different industries and regions:
- Size: 2,500+ records
- Time Period: 2020-2024
- Companies: Multiple SaaS companies across various industries
- Features:
- Revenue (USD)
- Expenses (USD)
- Profit (USD)
- Customer Count
- Churn Rate
- ARPU (Average Revenue Per User)
- Market Share (%)
- Industry, Region, Founded Year
- Lag Features: revenue_lag_1, market_share_lag_1, customer_count_lag_1, churn_rate_lag_1
- Growth Metrics: revenue_growth (percentage change)
- Financial Ratios: profit_margin, expenses_per_customer
- Log Transformation: log_revenue (target variable) to reduce skewness
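The engineered features listed above could be derived along the following lines. This is a minimal sketch, not the notebook's actual code: the input column names (`company`, `year`, `revenue`, `expenses`, `profit`, `customer_count`, `churn_rate`, `market_share`) are assumptions about the CSV schema, the helper `add_engineered_features` is hypothetical, and both the percent scaling of `revenue_growth` and the use of the natural log are assumed.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag, growth, ratio, and log features per company (hypothetical helper)."""
    # Sort so that shift(1) refers to the previous year of the same company.
    out = df.sort_values(["company", "year"]).copy()
    by_company = out.groupby("company")

    # Lag features: previous-year value of each metric within a company.
    for col in ["revenue", "market_share", "customer_count", "churn_rate"]:
        out[f"{col}_lag_1"] = by_company[col].shift(1)

    # Growth metric: change in revenue versus the previous year, in percent
    # (the scaling is an assumption).
    out["revenue_growth"] = (
        (out["revenue"] - out["revenue_lag_1"]) / out["revenue_lag_1"] * 100
    )

    # Financial ratios.
    out["profit_margin"] = out["profit"] / out["revenue"]
    out["expenses_per_customer"] = out["expenses"] / out["customer_count"]

    # Log-transformed target to reduce right skew (natural log is an assumption).
    out["log_revenue"] = np.log(out["revenue"])
    return out


# Tiny made-up frame, for illustration only.
demo = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B"],
    "year": [2020, 2021, 2022, 2020, 2021],
    "revenue": [100.0, 120.0, 150.0, 200.0, 210.0],
    "expenses": [80.0, 90.0, 100.0, 150.0, 160.0],
    "profit": [20.0, 30.0, 50.0, 50.0, 50.0],
    "customer_count": [10, 12, 15, 20, 21],
    "churn_rate": [0.05, 0.04, 0.04, 0.06, 0.05],
    "market_share": [1.0, 1.2, 1.5, 2.0, 2.1],
})
features = add_engineered_features(demo)
```

The first record of each company has no previous year, so its lag and growth columns are NaN; such rows would need to be dropped or imputed before training.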
- Linear Regression (Baseline)
- Random Forest Regressor (with Optuna tuning)
- XGBoost Regressor
- Support Vector Regression (SVR)
Models are evaluated using:
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- MAPE (Mean Absolute Percentage Error)
Results are saved in data/model_comparison.csv for detailed comparison.
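For reference, the three metrics can be computed as in the short sketch below. The helper name `regression_metrics` and the sample numbers are made up for illustration; they are not results from this project.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute RMSE, MAE, and MAPE (in percent) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    return {
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAE": float(np.mean(np.abs(errors))),
        # MAPE is undefined when a true value is zero; revenue is assumed positive.
        "MAPE": float(np.mean(np.abs(errors / y_true)) * 100),
    }


# Made-up values, for illustration only.
metrics = regression_metrics([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
```

If a model is trained on log_revenue, its predictions would presumably be mapped back with `np.exp` before computing metrics in USD; which scale the reported metrics use is documented in the paper, not here.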
The project uses SHAP (SHapley Additive exPlanations) to provide:
- Feature importance rankings
- Feature impact visualization
- Individual prediction explanations
pip install pandas numpy matplotlib seaborn scikit-learn xgboost optuna shap
- Navigate to the notebooks/ directory
- Open MLPipeline.ipynb in Jupyter Notebook or VS Code
- Run all cells sequentially to:
- Load and preprocess data
- Engineer features
- Train models
- Evaluate performance
- Generate SHAP visualizations
A complete research paper documenting the methodology, experiments, results, and insights is available in the paper/ folder as Final Term Paper ML.pdf.
The paper includes:
- Literature review
- Detailed methodology
- Experimental setup and results
- Model comparison and analysis
- Conclusions and future work
All results are automatically saved to the data/ folder:
- Model predictions
- Performance metrics
- Feature importance scores
- Comparison tables
- Python 3.x
- Data Processing: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-learn, XGBoost
- Optimization: Optuna
- Explainability: SHAP
- Environment: Jupyter Notebook
- Training Set: 2020-2022
- Validation Set: 2023 (for hyperparameter tuning)
- Test Set: 2024 (for final evaluation)
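The year-based split above can be expressed as in the following sketch. The `year` column name and the helper `temporal_split` are assumptions for illustration; the point is that rows are assigned by calendar year rather than shuffled, so no future data leaks into training.

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, year_col: str = "year"):
    """Split rows into train (2020-2022), validation (2023), and test (2024)."""
    train = df[df[year_col].between(2020, 2022)]  # bounds are inclusive
    val = df[df[year_col] == 2023]
    test = df[df[year_col] == 2024]
    return train, val, test


# Made-up rows, for illustration only.
demo = pd.DataFrame({
    "year": [2020, 2021, 2022, 2023, 2024, 2024],
    "revenue": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
train, val, test = temporal_split(demo)
```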
This project is suitable for:
- Revenue forecasting for SaaS businesses
- Financial planning and budgeting
- Investor analysis and due diligence
- Academic research in time-series forecasting
- Learning ML pipelines and model explainability
This project was developed as a machine learning research project. For questions or contributions, refer to the research paper in the paper/ folder, which covers the methodology and references in detail.
This project is for educational and research purposes.
Note: The complete technical details, mathematical formulations, and experimental results are documented in the research paper located in the paper/ folder.