⚠️ Important Note: This project has been tested on Ubuntu 24.04 and Windows 11, but has not yet been tested on Raspberry Pi. If you plan to deploy on Raspberry Pi, please note that additional performance optimizations may be required.
📖 Chinese version: README.md
EdgeWhisperPi is an offline speech-to-text solution designed specifically for Raspberry Pi, particularly suitable for scenarios requiring real-time speech transcription without relying on network connectivity. This project uses the Whisper model for speech recognition and is optimized for the hardware characteristics of Raspberry Pi 5, ensuring stable speech transcription service even with limited hardware resources.
💻 Development Tools: This project uses Cursor as its primary development environment, leveraging its AI assistance features to accelerate development and practice the Vibe Coding philosophy.
```bash
# Clone the project
git clone https://github.com/sheng1111/EdgeWhisperPi.git
cd EdgeWhisperPi

# Run the installation script
./setup.sh

# Download models
python download_models.py

# Start the service
./run.sh
```

- Operating System: Ubuntu 24.04 (tested) or Raspberry Pi OS (pending testing)
- Python 3.8 or above
- Network connection (only for initial installation)
- Microphone (for real-time recording functionality)
1. Clone the project

   ```bash
   git clone https://github.com/sheng1111/EdgeWhisperPi.git
   cd EdgeWhisperPi
   ```

2. Run the installation script

   ```bash
   ./setup.sh
   ```

   The installation script will automatically:
   - Install system dependencies
   - Create a Python virtual environment
   - Install Python packages
   - Create the necessary folder structure
   - Check the Whisper model files

3. Download the Whisper models (see the sketch below)

   ```bash
   python download_models.py
   ```

   - Automatically downloads the tiny and small models
   - Model file locations: `models/whisper/tiny/model.bin` and `models/whisper/small/model.bin`
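The project's download script is not reproduced here, but as a rough illustration of what it needs to do, the sketch below pulls the CTranslate2 model files into the expected folders. The `Systran/faster-whisper-*` Hugging Face repositories and the `huggingface_hub` dependency are assumptions, not taken from this project:

```python
# Hypothetical sketch of a download script like download_models.py.
# Assumes faster-whisper (CTranslate2) models published on Hugging Face;
# the real script may use a different source.
from pathlib import Path

from huggingface_hub import snapshot_download

MODELS = {
    "tiny": "Systran/faster-whisper-tiny",
    "small": "Systran/faster-whisper-small",
}

for name, repo_id in MODELS.items():
    target = Path("models/whisper") / name
    target.mkdir(parents=True, exist_ok=True)
    # Fetches model.bin, config.json, and tokenizer files into target/.
    snapshot_download(repo_id=repo_id, local_dir=target)
    print(f"{name}: downloaded to {target}")
```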
This project is designed to run on Raspberry Pi 5 (8GB version) with the following key hardware resources:
- CPU: 2.4GHz quad-core 64-bit Arm Cortex-A76 with 512KB L2 cache per core and 2MB shared L3 cache, providing solid performance for small- and medium-model inference
- GPU: VideoCore VII, supporting OpenGL ES 3.1 and Vulkan 1.2, enabling future hardware-accelerated rendering and simple image processing
- RAM: 8GB LPDDR4X-4267 SDRAM, sufficient for loading tiny/small models and performing edge inference
- USB 3.0: Two ports supporting 5Gbps data transfer, suitable for connecting external SSDs, microphones, or other audio devices
- Ethernet and Wi-Fi: Support for Gigabit Ethernet and 802.11ac Wi-Fi, used only for initial setup, not required for deployment
- PCIe 2.0 x1: Expandable with AI accelerators (e.g., a Google Coral module) for future upgrades
```
EdgeWhisperPi/
├── app/                    # Core application modules
│   ├── __init__.py         # Python package initialization
│   ├── transcriber.py      # Audio-to-text core logic
│   ├── recorder.py         # Recording processing and audio capture
│   └── vad.py              # Voice activity detection module
├── ui/                     # User interface related files
│   ├── static/             # Static resources
│   │   ├── css/            # Style sheets
│   │   ├── js/             # JavaScript files
│   │   ├── lib/            # Third-party libraries
│   │   └── images/         # Image resources
│   ├── templates/          # HTML templates
│   │   └── index.html      # Main page template
│   └── app.py              # Flask application main program
├── models/                 # Model files directory
│   └── whisper/            # Whisper models
│       ├── tiny/           # Tiny model
│       ├── base/           # Base model
│       └── small/          # Small model
├── outputs/                # Transcription output and audio storage location
├── run.sh                  # Startup script
├── setup.sh                # Installation script
├── config.py               # System configuration
├── download_models.py      # Model download script
├── requirements.txt        # Python package requirements
├── .gitignore              # Git ignore settings
├── LICENSE                 # License terms
└── README.md               # Project documentation
```
- Network connection allowed for initial dependency installation
- Project can be packaged and copied to other devices via SD card or USB drive after completion
- UI must be simple, beautiful, and intuitive, suitable for non-technical users
- All transcription and processing functions run offline
- Startup process must be simple, allowing non-technical users to complete all initialization and open the UI with just one button press
- All recordings and transcription results must be stored in the `outputs/` folder with timestamp-based naming to ensure consistent record identification across international deployments (see the sketch below)
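One way to satisfy the timestamp rule, sketched as a hypothetical helper (the format and folder handling are illustrative, not the project's actual code):

```python
# Hypothetical helper for timestamp-based naming in outputs/.
from datetime import datetime, timezone
from pathlib import Path

OUTPUT_DIR = Path("outputs")

def timestamped_path(suffix: str) -> Path:
    """Return e.g. outputs/20240101T093000Z.wav. UTC keeps names
    sortable and unambiguous across time zones and deployments."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return OUTPUT_DIR / f"{stamp}{suffix}"

# Usage: a recording and its transcript share one timestamp.
audio_path = timestamped_path(".wav")
text_path = audio_path.with_suffix(".txt")
```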
- 🎙️ Record Button: Start real-time recording and convert to text
- 📁 Audio Upload: Support wav/mp3 upload for transcription
- ⏳ Processing progress indication and completion notification
- 📄 Text result display and copy
- 🚀 Simple Startup: just run `run.sh` once after plugging in the device; the web interface opens automatically (e.g., http://localhost:3535)
- Start the service:

  ```bash
  ./run.sh
  ```

- After the service starts, the browser opens automatically to the web interface (e.g., http://localhost:3535).
- Use the web interface to:
  - Upload audio files
  - Select a recognition model (tiny, base, or small)
  - Start recognition
  - View recognition results
⚙️ Based on actual testing and hardware limitations, the processing pipeline and component combination below have been verified to run stably on a Raspberry Pi 5 (8GB):
```mermaid
graph TD
    A[User Input] --> B{Input Method}
    B -->|Real-time Recording| C[Microphone Recording]
    B -->|File Upload| D[Audio Upload]
    C --> E[Silero VAD<br/>Voice Activity Detection]
    D --> F[Audio Format Check<br/>& Conversion]
    E --> G[Whisper Model]
    F --> G
    G --> H[Text Post-processing]
    H --> I[Display Results]
    I --> J{Select Action}
    J -->|Copy| K[Copy to Clipboard]
    J -->|Download| L[Download Text File]
    J -->|Save| M[Save to History]
```
| Stage | Technology | Load Level | Function Description |
|---|---|---|---|
| Audio Pre-processing | Silero VAD | Low | Voice activity detection, improving transcription quality |
| Speech Recognition | Whisper Tiny/Base (int8/float32) | Medium | Support for multiple languages, adjustable precision |
| Post-processing | Custom rules | Low | Correct common word errors (e.g., names, places) |
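These three stages map directly onto faster-whisper's API; the sketch below shows one possible wiring (the correction rules and input file are illustrative assumptions), relying on the library's `vad_filter` option, which runs Silero VAD internally:

```python
# Sketch of the three-stage pipeline: Silero VAD -> Whisper -> post-processing.
# The model path matches the project layout above; the rules are placeholders.
from faster_whisper import WhisperModel

# Stage 2: load a local CTranslate2 model with the chosen precision.
model = WhisperModel("models/whisper/tiny", device="cpu", compute_type="int8")

# Stage 3: custom rules, e.g. fixing frequently misheard names and places.
CORRECTIONS = {"rasberry": "raspberry"}  # illustrative rule set

def transcribe(path: str) -> str:
    # Stage 1: vad_filter=True runs Silero VAD inside faster-whisper,
    # dropping non-speech audio before recognition.
    segments, _info = model.transcribe(path, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments)
    for wrong, right in CORRECTIONS.items():
        text = text.replace(wrong, right)
    return text

print(transcribe("outputs/example.wav"))  # hypothetical input file
```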
✅ Feasibility Planning Recommendations:
- Real-time speech-to-text (user interacting with the microphone); a microphone-capture sketch follows this list
  - Use the Tiny/Base model
  - Precision options:
    - int8: faster but slightly lower accuracy, suitable for real-time transcription
    - float32: slower but higher accuracy, suitable for high-quality requirements
  - Use Silero VAD for voice activity detection
  - Ensure low latency and stable output
- Uploaded audio file transcription (longer waiting time is acceptable)
  - Use the Base/Small model
  - Precision options:
    - int8: faster but slightly lower accuracy, suitable for quick transcription
    - float32: slower but higher accuracy, suitable for high-quality requirements
  - Use Silero VAD to improve the segmentation of speech intervals
- Use faster-whisper and choose a model based on the usage scenario:
  - tiny: lightest, suitable for real-time speech-to-text, minimal resource usage
  - base: medium size, balanced performance and accuracy, usable for real-time or file transcription
  - small: larger model, suitable for file transcription, provides the best accuracy
  - Each model can run with int8 or float32 precision
- Recommended: use `ffmpeg` for audio pre-processing and format conversion (see the conversion sketch below):
  - Ensure audio is 16-bit PCM WAV (the best input format for Whisper)
  - Automatically convert compressed audio (e.g., mp3) to high-quality input
  - `ffmpeg` can be invoked via `subprocess` to automatically normalize uploaded files
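For the real-time path above, a minimal capture loop might look like the following; the block length, model path, and use of `sounddevice` in blocking mode are assumptions chosen for brevity, not the project's actual recorder:

```python
# Sketch: block-wise real-time transcription from the microphone.
# Assumes a 16 kHz-capable input device; 5 s blocks trade latency for context.
import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000  # Whisper models expect 16 kHz audio
BLOCK_SECONDS = 5

model = WhisperModel("models/whisper/tiny", device="cpu", compute_type="int8")

while True:
    # float32 samples in [-1, 1] can be passed to transcribe() directly.
    audio = sd.rec(int(BLOCK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # block until this chunk has been recorded
    segments, _ = model.transcribe(audio.flatten(), vad_filter=True)
    for seg in segments:
        print(seg.text.strip())
```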
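And for uploads, the `ffmpeg` + `subprocess` combination could be as simple as this (file names are examples; `ffmpeg` must be on the PATH):

```python
# Sketch: normalize an uploaded file to 16-bit PCM mono WAV at 16 kHz.
import subprocess

def to_pcm_wav(src: str, dst: str = "converted.wav") -> str:
    subprocess.run(
        ["ffmpeg", "-y",       # overwrite the output file without asking
         "-i", src,            # any input format ffmpeg understands
         "-ar", "16000",       # resample to 16 kHz
         "-ac", "1",           # downmix to mono
         "-c:a", "pcm_s16le",  # 16-bit PCM, Whisper's preferred input
         dst],
        check=True,            # raise if ffmpeg fails
    )
    return dst

# Usage: to_pcm_wav("upload.mp3") produces "converted.wav" for transcription.
```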
- Run `bash run.sh`: the script starts the Flask UI and uses `xdg-open` (Linux) or Python's `webbrowser` module to open the default browser automatically, so there is no need to enter the URL manually (see the sketch below)
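The Python side of that auto-open behavior can be tiny; this sketch (the delay and port are illustrative assumptions) shows the idea:

```python
# Sketch: open the UI once Flask is up. The port matches the example URL
# above; the fixed one-second delay is a simplification.
import threading
import webbrowser

URL = "http://localhost:3535"

# Fire shortly after startup so the Flask server has time to bind the port.
threading.Timer(1.0, lambda: webbrowser.open(URL)).start()
# ...then start Flask, e.g. app.run(host="0.0.0.0", port=3535)
```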
- Package the entire `EdgeWhisperPi/` project folder after completion
- Copy it to other devices and run it there; nothing needs to be installed or downloaded again
- Desktop shortcuts are recommended so users can start the system with one click
- Backend: Python 3 + Flask
- Frontend: HTML + Tailwind CSS + JavaScript
- Audio Processing: faster-whisper + sounddevice / pyaudio + ffmpeg
- UI Mode: web UI focused, suitable for non-technical users
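To make the Flask + faster-whisper pairing concrete, a minimal upload endpoint could look like this; the route, form field name, and fixed output path are hypothetical, not the project's actual API:

```python
# Hypothetical minimal upload-transcription endpoint (not the project's API).
from faster_whisper import WhisperModel
from flask import Flask, jsonify, request

app = Flask(__name__)
model = WhisperModel("models/whisper/small", device="cpu", compute_type="int8")

@app.post("/transcribe")
def transcribe():
    uploaded = request.files["audio"]    # <input type="file" name="audio">
    path = "outputs/upload.wav"          # timestamped naming in practice
    uploaded.save(path)
    segments, _ = model.transcribe(path, vad_filter=True)
    return jsonify(text=" ".join(seg.text.strip() for seg in segments))

if __name__ == "__main__":
    app.run(port=3535)
```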
- Support for more audio formats
- Local history recording
- Interface beautification and mobile adaptation
- Support for voice hotword enhancement (future upgrade)
- Before first use, make sure the correct Whisper model files have been downloaded and placed in `models/whisper/`
- The tiny model is recommended for testing; the small model requires more system resources
- A network connection is only needed for the initial installation; the service itself runs fully offline
- Press Ctrl+C to stop the service
- Regularly backing up the transcription results in the `outputs/` folder is recommended
If you encounter issues:
- Ensure all dependencies are correctly installed
- Check that the model files are in the correct locations
- Confirm that the virtual environment is activated
- Check the network connection status (for installation problems)
- Verify the microphone permission settings
- Check that audio devices are properly connected
- Initial version release
- Support for real-time speech-to-text functionality (tiny/base models, optional int8/float32 precision)
- Support for audio file upload transcription (base/small models, optional int8/float32 precision)
- Provide clean web interface
- Support offline operation mode
- Optimize Raspberry Pi 5 performance
This project is licensed under the MIT License; see the LICENSE file for details.
