This project implements a Bigram (2-gram) Language Model using Natural Language Processing. It supports two smoothing techniques—Laplace Smoothing and Good-Turing Smoothing—to estimate probabilities for seen and unseen word pairs. The model is interactive and runs in a Streamlit web app.
🚀 Features
- 📁 Upload your own `.txt` corpus
- 🧠 Generate bigrams and count their frequencies
- 📊 Apply Laplace and Good-Turing smoothing
- 🔍 Enter custom bigrams to check their probabilities
- 📈 Compare probabilities side-by-side
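The counting step behind these features can be sketched roughly as follows. This is a minimal illustration, not the code in `ngram_utils.py`: the helper names `tokenize` and `count_bigrams` and the tokenization rule (lowercase, keep letters, digits and apostrophes) are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_bigrams(tokens):
    """Return (unigram_counts, bigram_counts) as Counter objects."""
    unigrams = Counter(tokens)
    # zip pairs each token with its successor, giving the adjacent word pairs
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


# Hypothetical two-sentence corpus, used only for illustration
tokens = tokenize("Language models assign probabilities. Language models need data.")
unigrams, bigrams = count_bigrams(tokens)
print(bigrams[("language", "models")])  # 2
```

Note that this sketch lets bigrams span sentence boundaries, for example ("probabilities", "language"); a real preprocessing step might split on sentences first.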
Install the required Python packages: `pip install -r requirements.txt`
Run the app: `streamlit run app.py`
    ngram_language_model/
    │
    ├── app.py                  # Main Streamlit application
    ├── ngram_utils.py          # Utility functions for preprocessing and probability calculation
    ├── requirements.txt        # Python dependencies
    └── long_sample_corpus.txt  # Example text corpus
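- Laplace Smoothing: Adds 1 to every bigram count so that no bigram, seen or unseen, gets zero probability.
- Good-Turing Smoothing: Re-estimates each count from the frequency of frequencies (how many distinct bigrams occur exactly c times), which shifts some probability mass from seen bigrams to unseen ones.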
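As a rough illustration of the two estimators, here is a small sketch. The function names (`laplace_prob`, `good_turing_prob`) and the toy corpus are hypothetical and not taken from `ngram_utils.py`; the app's actual formulas may differ. Assumptions made here: Laplace uses the unigram count of the first word as the denominator, and Good-Turing is the basic, unsmoothed form that returns a joint probability, falls back to the raw count when N_{c+1} is zero, and spreads the unseen mass N_1/N evenly over all unseen bigram types.

```python
from collections import Counter


def laplace_prob(bigram, unigrams, bigrams, vocab_size):
    """Conditional P(w2 | w1) with add-one smoothing: (c + 1) / (c(w1) + V)."""
    w1, _ = bigram
    return (bigrams[bigram] + 1) / (unigrams[w1] + vocab_size)


def good_turing_prob(bigram, bigrams, vocab_size):
    """Joint P(w1, w2) from basic Good-Turing: c* = (c + 1) * N_{c+1} / N_c."""
    total = sum(bigrams.values())             # N: total bigram tokens
    freq_of_freq = Counter(bigrams.values())  # N_c: bigram types seen exactly c times
    c = bigrams[bigram]
    if c == 0:
        # Assumption: every one of the V^2 possible bigrams not yet seen is equally likely
        unseen_types = vocab_size ** 2 - len(bigrams)
        return freq_of_freq[1] / (total * unseen_types)
    n_c, n_next = freq_of_freq[c], freq_of_freq[c + 1]
    # Assumption: keep the raw count when no bigram occurs c + 1 times
    c_star = (c + 1) * n_next / n_c if n_next else c
    return c_star / total


# Hypothetical toy corpus, used only for illustration
tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)

print(round(laplace_prob(("the", "cat"), unigrams, bigrams, V), 4))  # 0.3333 (seen twice)
print(round(laplace_prob(("cat", "mat"), unigrams, bigrams, V), 4))  # 0.125  (unseen)
print(round(good_turing_prob(("cat", "sat"), bigrams, V), 4))        # 0.0417 (seen once)
print(round(good_turing_prob(("cat", "mat"), bigrams, V), 4))        # 0.0259 (unseen)
```

In this sketch the Laplace value is a conditional probability while the Good-Turing value is a joint one, so the two numbers are not directly comparable without a further normalization step. The raw-count fallback also means the Good-Turing values are not renormalized to sum to exactly 1.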
Example bigrams to try:
- ("language", "models")
- ("speech", "recognition")
- ("deep", "learning") (unseen bigram)
This project is great for:
- Understanding the mechanics of N-gram models
- Seeing the effects of smoothing
- Hands-on exploration of NLP probability models