EdgeWhisperPi - Offline Speech-to-Text Device


⚠️ Important Note: This project has been tested on Ubuntu 24.04 and Windows 11, but has not yet been tested on Raspberry Pi. If you plan to deploy on Raspberry Pi, please note that additional performance optimizations may be required.

📖 Chinese version: README.md

Project Introduction

EdgeWhisperPi is an offline speech-to-text solution designed specifically for Raspberry Pi, particularly suitable for scenarios requiring real-time speech transcription without relying on network connectivity. This project uses the Whisper model for speech recognition and is optimized for the hardware characteristics of Raspberry Pi 5, ensuring stable speech transcription service even with limited hardware resources.

Interface Preview

(Screenshot: main interface preview)

💻 Development Tools: This project uses Cursor as the primary development environment, leveraging its AI assistance features to accelerate development and put the Vibe Coding philosophy into practice.

🚀 Quick Start

# Clone the project
git clone https://github.com/sheng1111/EdgeWhisperPi.git
cd EdgeWhisperPi

# Run the installation script
./setup.sh

# Download models
python download_models.py

# Start the service
./run.sh

🖥️ System Requirements

  • Operating System: Ubuntu 24.04 (tested) or Raspberry Pi OS (pending testing)
  • Python 3.8 or above
  • Network connection (only for initial installation)
  • Microphone (for real-time recording functionality)

📦 Installation Steps

  1. Clone the Project

    git clone https://github.com/sheng1111/EdgeWhisperPi.git
    cd EdgeWhisperPi
  2. Run the Installation Script

    ./setup.sh

    The installation script will automatically:

    • Install system dependencies
    • Create Python virtual environment
    • Install Python packages
    • Create necessary folder structure
    • Check Whisper model files
  3. Download Whisper Models

    python download_models.py
    • Automatically downloads tiny and small models
    • Model file locations:
      • models/whisper/tiny/model.bin
      • models/whisper/small/model.bin
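Before starting the service, the presence of the expected model files can be verified with a few lines of Python. The helper below is an illustrative sketch (not part of the repository), using the model paths listed above:

```python
from pathlib import Path

MODEL_ROOT = Path("models/whisper")
EXPECTED = ["tiny", "small"]  # download_models.py fetches these two

def missing_models(root=MODEL_ROOT, names=EXPECTED):
    """Return the names of models whose model.bin file is absent."""
    return [n for n in names if not (root / n / "model.bin").is_file()]
```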

Raspberry Pi 5 8GB Device Specifications Overview

This project is designed to run on Raspberry Pi 5 (8GB version) with the following key hardware resources:

  • CPU: 2.4GHz quad-core 64-bit Arm Cortex-A76 with 512KB L2 cache per core and 2MB shared L3 cache, providing solid inference performance for medium-sized models
  • GPU: VideoCore VII, supporting OpenGL ES 3.1 and Vulkan 1.2, enabling future hardware-accelerated rendering and simple image processing
  • RAM: 8GB LPDDR4X-4267 SDRAM, sufficient for loading tiny/small models and performing edge inference
  • USB 3.0: Two ports supporting 5Gbps data transfer, suitable for connecting external SSDs, microphones, or other audio devices
  • Ethernet and Wi-Fi: Support for Gigabit Ethernet and 802.11ac Wi-Fi, used only for initial setup, not required for deployment
  • PCIe 2.0 x1: Expandable for AI accelerators (e.g., Coral USB) for future upgrades

Project Structure

EdgeWhisperPi/
├── app/                      # Core application modules
│   ├── __init__.py          # Python package initialization
│   ├── transcriber.py       # Audio-to-text core logic
│   ├── recorder.py          # Recording processing and audio capture
│   └── vad.py               # Voice activity detection module
├── ui/                       # User interface related files
│   ├── static/              # Static resources
│   │   ├── css/             # Style sheets
│   │   ├── js/              # JavaScript files
│   │   ├── lib/             # Third-party libraries
│   │   └── images/          # Image resources
│   ├── templates/           # HTML templates
│   │   └── index.html       # Main page template
│   └── app.py               # Flask application main program
├── models/                   # Model files directory
│   └── whisper/             # Whisper models
│       ├── tiny/            # Tiny model
│       ├── base/            # Base model
│       └── small/           # Small model
├── outputs/                  # Transcription output and audio storage location
├── run.sh                    # Startup script
├── setup.sh                  # Installation script
├── config.py                 # System configuration
├── download_models.py        # Model download script
├── requirements.txt          # Python package requirements
├── .gitignore               # Git ignore settings
├── LICENSE                  # License terms
└── README.md                # Project documentation

Functional Requirements

  1. Network connection allowed for initial dependency installation
  2. Project can be packaged and copied to other devices via SD card or USB drive after completion
  3. UI must be simple, beautiful, and intuitive, suitable for non-technical users
  4. All transcription and processing functions run offline
  5. Startup process must be simple, allowing non-technical users to complete all initialization and open the UI with just one button press
  6. All recordings and transcription results must be stored in the outputs/ folder with timestamp-based naming to ensure consistent record identification across international deployments
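Requirement 6 can be sketched as a small helper. The exact naming scheme below is an assumption (UTC timestamps are one way to keep names unambiguous across time zones):

```python
from datetime import datetime, timezone
from pathlib import Path

OUTPUT_DIR = Path("outputs")

def timestamped_name(prefix="recording", ext="wav"):
    """Build a UTC-timestamped output path, e.g. outputs/recording_20250423T101500Z.wav."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return OUTPUT_DIR / f"{prefix}_{stamp}.{ext}"
```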

UI Features

  • 🎙️ Record Button: Start real-time recording and convert to text
  • 📁 Audio Upload: Support wav/mp3 upload for transcription
  • ⏳ Processing progress indication and completion notification
  • 📄 Text result display and copy
  • 🚀 Simple Startup: Just run run.sh once after inserting the device, automatically opens web interface (e.g., http://localhost:3535)

Usage Instructions

  1. Start the service:

    ./run.sh
  2. After the service starts, it automatically opens the browser to the web interface.
  3. Use the web interface:

    • Upload audio files
    • Select a recognition model (tiny, base, or small)
    • Start recognition
    • View recognition results

Model and Audio Processing

Processing Flow Component Integration

⚙️ Based on actual testing and hardware limitations, the following component combination has been verified to run stably on Raspberry Pi 5 8GB device:

| Stage | Technology | Load Level | Function Description |
| --- | --- | --- | --- |
| Audio pre-processing | Silero VAD | Low | Voice activity detection, improving transcription quality |
| Speech recognition | Whisper Tiny/Base (int8/float32) | Medium | Support for multiple languages, adjustable precision |
| Post-processing | Custom rules | Low | Correct common word errors (e.g., names, places) |
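The post-processing stage can be illustrated with a simple rule table. The replacement pairs below are hypothetical examples, not the project's actual rules:

```python
# Illustrative correction rules; the real rule set lives in the project code.
CORRECTIONS = {
    "rasberry": "raspberry",
    "py torch": "pytorch",
}

def postprocess(text, rules=CORRECTIONS):
    """Apply simple word-level corrections to a transcript."""
    for wrong, right in rules.items():
        text = text.replace(wrong, right)
    return text
```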

Feasibility Planning Recommendations:

  • Real-time speech-to-text (user interaction with microphone)

    • Use Tiny/Base model
    • Precision options:
      • int8: Faster but slightly lower accuracy, suitable for real-time transcription
      • float32: Slower but higher accuracy, suitable for high-quality requirements
    • Use Silero VAD for voice activity detection
    • Ensure low latency and stable output
  • Upload audio file transcription (allows longer waiting time)

    • Use Base/Small model
    • Precision options:
      • int8: Faster but slightly lower accuracy, suitable for quick transcription
      • float32: Slower but higher accuracy, suitable for high-quality requirements
    • Use Silero VAD to improve recognition intervals
  • Use faster-whisper, choose different models based on usage scenarios:

    • tiny: Lightest, suitable for real-time speech-to-text, minimal resource usage
    • base: Medium size, balanced performance and accuracy, usable for real-time or file transcription
    • small: Larger model, suitable for file transcription, provides best accuracy
    • Each model can choose int8 or float32 precision
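The scenario-to-model mapping above can be expressed as a small helper; the function and scenario names here are illustrative, not part of the project code:

```python
def pick_model(scenario):
    """Map a usage scenario to a (model, compute_type) pair for faster-whisper."""
    table = {
        "realtime": ("tiny", "int8"),    # low latency, minimal resources
        "balanced": ("base", "int8"),    # usable for real-time or file transcription
        "file": ("small", "float32"),    # best accuracy; longer wait acceptable
    }
    return table[scenario]
```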

Audio Processing Quality Recommendations

  • Recommended to use ffmpeg for audio pre-processing and format conversion:
    • Ensure audio is in 16-bit PCM WAV format (best format for whisper)
    • Automatically convert compressed audio (e.g., mp3) to high-quality input
  • ffmpeg can be used with subprocess to automatically handle uploaded file quality
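A minimal sketch of the ffmpeg-via-subprocess approach described above. It assumes ffmpeg is on PATH; the flag choices follow the 16-bit PCM WAV recommendation:

```python
import subprocess

def build_ffmpeg_cmd(src, dst):
    """ffmpeg command converting any input to 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",       # overwrite the output file without asking
        "-i", src,
        "-ar", "16000",       # 16 kHz sample rate, as expected by Whisper
        "-ac", "1",           # mono
        "-c:a", "pcm_s16le",  # 16-bit PCM
        dst,
    ]

def convert(src, dst):
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```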

Quick Start

bash run.sh
  • This script starts the Flask UI and uses xdg-open (Linux) or the webbrowser module to open the default browser automatically, so there is no need to enter the URL manually
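The browser-opening step can be done portably with the standard-library webbrowser module; a minimal sketch (the port is the example value from the UI description):

```python
import threading
import webbrowser

URL = "http://localhost:3535"  # example port from the UI description

def open_ui(delay=1.5, url=URL):
    """Open the default browser shortly after the Flask server starts."""
    threading.Timer(delay, webbrowser.open, args=(url,)).start()
```

The small delay gives the Flask server time to bind its port before the browser requests the page.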

Offline Deployment

  • Package the entire EdgeWhisperPi/ project folder after completion
  • Copy to other devices for execution, no need to install or download again
  • Recommended to use desktop shortcuts, allowing users to start the system with just one click

Technology Choices

  • Backend: Python 3 + Flask
  • Frontend: HTML + Tailwind CSS + JavaScript
  • Audio Processing: faster-whisper + sounddevice / pyaudio + ffmpeg
  • UI Model: Web UI focused, suitable for non-technical users

Extensibility

  • Support for more audio formats
  • Local history recording
  • Interface beautification and mobile adaptation
  • Support for voice hotword enhancement (future upgrade)

Notes

  • Before first use, ensure correct Whisper model files are downloaded and placed
  • Recommended to use tiny model for testing, small model requires more system resources
  • No network connection is needed during service operation (the network is required only for the initial installation)
  • Press Ctrl+C to stop the service
  • Regular backup of transcription results in outputs/ folder is recommended

Troubleshooting

If encountering issues:

  1. Ensure all dependencies are correctly installed
  2. Check if model files are placed in correct locations
  3. Confirm virtual environment is properly activated
  4. Check network connection status
  5. Verify microphone permission settings
  6. Check if audio devices are properly connected
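Several of these checks can be automated. A diagnostic sketch using only the standard library (the helper is illustrative, not part of the project):

```python
import sys
from pathlib import Path

def run_checks(model_root=Path("models/whisper")):
    """Return a dict of basic environment checks."""
    return {
        "python_ok": sys.version_info[:2] >= (3, 8),   # requirement from above
        "in_venv": sys.prefix != sys.base_prefix,       # virtual environment active?
        "tiny_model": (model_root / "tiny" / "model.bin").is_file(),
        "small_model": (model_root / "small" / "model.bin").is_file(),
    }
```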

📝 Version History

v0.8.0 (2025/04/23)

  • Initial version release
  • Support for real-time speech-to-text functionality (tiny/base models, optional int8/float32 precision)
  • Support for audio file upload transcription (base/small models, optional int8/float32 precision)
  • Provide clean web interface
  • Support offline operation mode
  • Optimize Raspberry Pi 5 performance


📄 License

This project is licensed under the MIT License, see LICENSE file for details.

System Architecture

System Flow Diagram

graph TD
    A[User Input] --> B{Input Method}
    B -->|Real-time Recording| C[Microphone Recording]
    B -->|File Upload| D[Audio Upload]
    
    C --> E[Silero VAD<br/>Voice Activity Detection]
    D --> F[Audio Format Check<br/>& Conversion]
    
    E --> G[Whisper Model]
    F --> G
    
    G --> H[Text Post-processing]
    H --> I[Display Results]
    
    I --> J{Select Action}
    J -->|Copy| K[Copy to Clipboard]
    J -->|Download| L[Download Text File]
    J -->|Save| M[Save to History]

