⚠️ Important Note: This project has been tested on Ubuntu 24.04 and Windows 11, but has not yet been tested on Raspberry Pi. If you plan to deploy on Raspberry Pi, please note that additional performance optimizations may be required.
📖 Chinese version: README.md
EdgeWhisperPi is an offline speech-to-text solution designed specifically for Raspberry Pi, particularly suitable for scenarios requiring real-time speech transcription without relying on network connectivity. This project uses the Whisper model for speech recognition and is optimized for the hardware characteristics of Raspberry Pi 5, ensuring stable speech transcription service even with limited hardware resources.
💻 Development Tools: This project uses Cursor as its primary development environment, leveraging its AI assistance features to accelerate development and practice the Vibe Coding philosophy.
```bash
# Clone the project
git clone https://github.com/sheng1111/EdgeWhisperPi.git
cd EdgeWhisperPi

# Run the installation script
./setup.sh

# Download models
python download_models.py

# Start the service
./run.sh
```

- Operating System: Ubuntu 24.04 (tested) or Raspberry Pi OS (pending testing)
- Python 3.8 or above
- Network connection (only for initial installation)
- Microphone (for real-time recording functionality)
1. Clone the project

   ```bash
   git clone https://github.com/sheng1111/EdgeWhisperPi.git
   cd EdgeWhisperPi
   ```

2. Run the installation script

   ```bash
   ./setup.sh
   ```

   The installation script will automatically:
   - Install system dependencies
   - Create a Python virtual environment
   - Install Python packages
   - Create the necessary folder structure
   - Check the Whisper model files

3. Download the Whisper models (see the sketch below)

   ```bash
   python download_models.py
   ```

   - Automatically downloads the tiny and small models
   - Model file locations: `models/whisper/tiny/model.bin` and `models/whisper/small/model.bin`
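The project's download script is not reproduced here, but as a rough illustration of what it needs to do, the sketch below pulls the CTranslate2 model files into the expected folders. The `Systran/faster-whisper-*` Hugging Face repositories and the `huggingface_hub` dependency are assumptions, not taken from this project:

```python
# Hypothetical sketch of a download script like download_models.py.
# Assumes faster-whisper (CTranslate2) models published on Hugging Face;
# the real script may use a different source.
from pathlib import Path

from huggingface_hub import snapshot_download

MODELS = {
    "tiny": "Systran/faster-whisper-tiny",
    "small": "Systran/faster-whisper-small",
}

for name, repo_id in MODELS.items():
    target = Path("models/whisper") / name
    target.mkdir(parents=True, exist_ok=True)
    # Fetches model.bin, config.json, and tokenizer files into target/.
    snapshot_download(repo_id=repo_id, local_dir=target)
    print(f"{name}: downloaded to {target}")
```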
This project is designed to run on Raspberry Pi 5 (8GB version) with the following key hardware resources:
- CPU: 2.4GHz quad-core 64-bit Arm Cortex-A76 with 512KB L2 cache per core and 2MB shared L3 cache, providing solid performance for small- and medium-model inference
- GPU: VideoCore VII, supporting OpenGL ES 3.1 and Vulkan 1.2, enabling future hardware-accelerated rendering and simple image processing
- RAM: 8GB LPDDR4X-4267 SDRAM, sufficient for loading tiny/small models and performing edge inference
- USB 3.0: Two ports supporting 5Gbps data transfer, suitable for connecting external SSDs, microphones, or other audio devices
- Ethernet and Wi-Fi: Support for Gigabit Ethernet and 802.11ac Wi-Fi, used only for initial setup, not required for deployment
- PCIe 2.0 x1: Expandable with AI accelerators (e.g., a Google Coral module) for future upgrades
```
EdgeWhisperPi/
├── app/                    # Core application modules
│   ├── __init__.py         # Python package initialization
│   ├── transcriber.py      # Audio-to-text core logic
│   ├── recorder.py         # Recording processing and audio capture
│   └── vad.py              # Voice activity detection module
├── ui/                     # User interface related files
│   ├── static/             # Static resources
│   │   ├── css/            # Style sheets
│   │   ├── js/             # JavaScript files
│   │   ├── lib/            # Third-party libraries
│   │   └── images/         # Image resources
│   ├── templates/          # HTML templates
│   │   └── index.html      # Main page template
│   └── app.py              # Flask application main program
├── models/                 # Model files directory
│   └── whisper/            # Whisper models
│       ├── tiny/           # Tiny model
│       ├── base/           # Base model
│       └── small/          # Small model
├── outputs/                # Transcription output and audio storage location
├── run.sh                  # Startup script
├── setup.sh                # Installation script
├── config.py               # System configuration
├── download_models.py      # Model download script
├── requirements.txt        # Python package requirements
├── .gitignore              # Git ignore settings
├── LICENSE                 # License terms
└── README.md               # Project documentation
```
- Network connection allowed for initial dependency installation
- Project can be packaged and copied to other devices via SD card or USB drive after completion
- UI must be simple, beautiful, and intuitive, suitable for non-technical users
- All transcription and processing functions run offline
- Startup process must be simple, allowing non-technical users to complete all initialization and open the UI with just one button press
- All recordings and transcription results must be stored in the `outputs/` folder with timestamp-based naming to ensure consistent record identification across international deployments (see the sketch below)
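One way to satisfy the timestamp rule, sketched as a hypothetical helper (the format and folder handling are illustrative, not the project's actual code):

```python
# Hypothetical helper for timestamp-based naming in outputs/.
from datetime import datetime, timezone
from pathlib import Path

OUTPUT_DIR = Path("outputs")

def timestamped_path(suffix: str) -> Path:
    """Return e.g. outputs/20240101T093000Z.wav. UTC keeps names
    sortable and unambiguous across time zones and deployments."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return OUTPUT_DIR / f"{stamp}{suffix}"

# Usage: a recording and its transcript share one timestamp.
audio_path = timestamped_path(".wav")
text_path = audio_path.with_suffix(".txt")
```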
- 🎙️ Record Button: Start real-time recording and convert to text
- 📁 Audio Upload: Support wav/mp3 upload for transcription
- ⏳ Processing progress indication and completion notification
- 📄 Text result display and copy
- 🚀 Simple Startup: just run `run.sh` once after plugging in the device; the web interface opens automatically (e.g., http://localhost:3535)
- Start the service:

  ```bash
  ./run.sh
  ```

- After the service starts, the browser opens automatically to the web interface (e.g., http://localhost:3535).
- Use the web interface to:
  - Upload audio files
  - Select a recognition model (tiny, base, or small)
  - Start recognition
  - View recognition results
⚙️ Based on actual testing and hardware limitations, the processing pipeline and component combination below have been verified to run stably on a Raspberry Pi 5 (8GB):
```mermaid
graph TD
    A[User Input] --> B{Input Method}
    B -->|Real-time Recording| C[Microphone Recording]
    B -->|File Upload| D[Audio Upload]
    C --> E[Silero VAD<br/>Voice Activity Detection]
    D --> F[Audio Format Check<br/>& Conversion]
    E --> G[Whisper Model]
    F --> G
    G --> H[Text Post-processing]
    H --> I[Display Results]
    I --> J{Select Action}
    J -->|Copy| K[Copy to Clipboard]
    J -->|Download| L[Download Text File]
    J -->|Save| M[Save to History]
```
| Stage | Technology | Load Level | Function Description |
|---|---|---|---|
| Audio Pre-processing | Silero VAD | Low | Voice activity detection, improving transcription quality |
| Speech Recognition | Whisper Tiny/Base (int8/float32) | Medium | Support for multiple languages, adjustable precision |
| Post-processing | Custom rules | Low | Correct common word errors (e.g., names, places) |
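These three stages map directly onto faster-whisper's API; the sketch below shows one possible wiring (the correction rules and input file are illustrative assumptions), relying on the library's `vad_filter` option, which runs Silero VAD internally:

```python
# Sketch of the three-stage pipeline: Silero VAD -> Whisper -> post-processing.
# The model path matches the project layout above; the rules are placeholders.
from faster_whisper import WhisperModel

# Stage 2: load a local CTranslate2 model with the chosen precision.
model = WhisperModel("models/whisper/tiny", device="cpu", compute_type="int8")

# Stage 3: custom rules, e.g. fixing frequently misheard names and places.
CORRECTIONS = {"rasberry": "raspberry"}  # illustrative rule set

def transcribe(path: str) -> str:
    # Stage 1: vad_filter=True runs Silero VAD inside faster-whisper,
    # dropping non-speech audio before recognition.
    segments, _info = model.transcribe(path, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments)
    for wrong, right in CORRECTIONS.items():
        text = text.replace(wrong, right)
    return text

print(transcribe("outputs/example.wav"))  # hypothetical input file
```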
✅ Feasibility Planning Recommendations:
- Real-time speech-to-text (user interacting with the microphone); a microphone-capture sketch follows this list
  - Use the Tiny/Base model
  - Precision options:
    - int8: faster but slightly lower accuracy, suitable for real-time transcription
    - float32: slower but higher accuracy, suitable for high-quality requirements
  - Use Silero VAD for voice activity detection
  - Ensure low latency and stable output
- Uploaded audio file transcription (longer waiting time is acceptable)
  - Use the Base/Small model
  - Precision options:
    - int8: faster but slightly lower accuracy, suitable for quick transcription
    - float32: slower but higher accuracy, suitable for high-quality requirements
  - Use Silero VAD to improve the segmentation of speech intervals
- Use faster-whisper and choose a model based on the usage scenario:
  - tiny: lightest, suitable for real-time speech-to-text, minimal resource usage
  - base: medium size, balanced performance and accuracy, usable for real-time or file transcription
  - small: larger model, suitable for file transcription, provides the best accuracy
  - Each model can run with int8 or float32 precision
- Recommended: use `ffmpeg` for audio pre-processing and format conversion (see the conversion sketch below):
  - Ensure audio is 16-bit PCM WAV (the best input format for Whisper)
  - Automatically convert compressed audio (e.g., mp3) to high-quality input
  - `ffmpeg` can be invoked via `subprocess` to automatically normalize uploaded files
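For the real-time path above, a minimal capture loop might look like the following; the block length, model path, and use of `sounddevice` in blocking mode are assumptions chosen for brevity, not the project's actual recorder:

```python
# Sketch: block-wise real-time transcription from the microphone.
# Assumes a 16 kHz-capable input device; 5 s blocks trade latency for context.
import sounddevice as sd
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000  # Whisper models expect 16 kHz audio
BLOCK_SECONDS = 5

model = WhisperModel("models/whisper/tiny", device="cpu", compute_type="int8")

while True:
    # float32 samples in [-1, 1] can be passed to transcribe() directly.
    audio = sd.rec(int(BLOCK_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # block until this chunk has been recorded
    segments, _ = model.transcribe(audio.flatten(), vad_filter=True)
    for seg in segments:
        print(seg.text.strip())
```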
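And for uploads, the `ffmpeg` + `subprocess` combination could be as simple as this (file names are examples; `ffmpeg` must be on the PATH):

```python
# Sketch: normalize an uploaded file to 16-bit PCM mono WAV at 16 kHz.
import subprocess

def to_pcm_wav(src: str, dst: str = "converted.wav") -> str:
    subprocess.run(
        ["ffmpeg", "-y",       # overwrite the output file without asking
         "-i", src,            # any input format ffmpeg understands
         "-ar", "16000",       # resample to 16 kHz
         "-ac", "1",           # downmix to mono
         "-c:a", "pcm_s16le",  # 16-bit PCM, Whisper's preferred input
         dst],
        check=True,            # raise if ffmpeg fails
    )
    return dst

# Usage: to_pcm_wav("upload.mp3") produces "converted.wav" for transcription.
```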
- Run `bash run.sh`: the script starts the Flask UI and uses `xdg-open` (Linux) or Python's `webbrowser` module to open the default browser automatically, so there is no need to enter the URL manually (see the sketch below)
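The Python side of that auto-open behavior can be tiny; this sketch (the delay and port are illustrative assumptions) shows the idea:

```python
# Sketch: open the UI once Flask is up. The port matches the example URL
# above; the fixed one-second delay is a simplification.
import threading
import webbrowser

URL = "http://localhost:3535"

# Fire shortly after startup so the Flask server has time to bind the port.
threading.Timer(1.0, lambda: webbrowser.open(URL)).start()
# ...then start Flask, e.g. app.run(host="0.0.0.0", port=3535)
```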
- Package the entire `EdgeWhisperPi/` project folder after completion
- Copy it to other devices and run it there; nothing needs to be installed or downloaded again
- Desktop shortcuts are recommended so users can start the system with one click
- Backend: Python 3 + Flask
- Frontend: HTML + Tailwind CSS + JavaScript
- Audio Processing: faster-whisper + sounddevice / pyaudio + ffmpeg
- UI Mode: web UI focused, suitable for non-technical users
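To make the Flask + faster-whisper pairing concrete, a minimal upload endpoint could look like this; the route, form field name, and fixed output path are hypothetical, not the project's actual API:

```python
# Hypothetical minimal upload-transcription endpoint (not the project's API).
from faster_whisper import WhisperModel
from flask import Flask, jsonify, request

app = Flask(__name__)
model = WhisperModel("models/whisper/small", device="cpu", compute_type="int8")

@app.post("/transcribe")
def transcribe():
    uploaded = request.files["audio"]    # <input type="file" name="audio">
    path = "outputs/upload.wav"          # timestamped naming in practice
    uploaded.save(path)
    segments, _ = model.transcribe(path, vad_filter=True)
    return jsonify(text=" ".join(seg.text.strip() for seg in segments))

if __name__ == "__main__":
    app.run(port=3535)
```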
- Support for more audio formats
- Local history recording
- Interface beautification and mobile adaptation
- Support for voice hotword enhancement (future upgrade)
- Before first use, make sure the correct Whisper model files have been downloaded and placed in `models/whisper/`
- The tiny model is recommended for testing; the small model requires more system resources
- A network connection is only needed for the initial installation; the service itself runs fully offline
- Press Ctrl+C to stop the service
- Regularly backing up the transcription results in the `outputs/` folder is recommended
If you encounter issues:
- Ensure all dependencies are correctly installed
- Check that the model files are in the correct locations
- Confirm that the virtual environment is activated
- Check the network connection status (for installation problems)
- Verify the microphone permission settings
- Check that audio devices are properly connected
- Initial version release
- Support for real-time speech-to-text functionality (tiny/base models, optional int8/float32 precision)
- Support for audio file upload transcription (base/small models, optional int8/float32 precision)
- Provide clean web interface
- Support offline operation mode
- Optimize Raspberry Pi 5 performance
This project is licensed under the MIT License; see the LICENSE file for details.
