High-performance Python acceleration engine — CPU, threads, virtual threads, multi-GPU, NPU, ARM/Android/Termux, IoT/SBC, auto-tuning, Prometheus metrics, HTTP/gRPC server, Kubernetes auto-scaling, and maximum optimization mode.
| Module | Description |
|---|---|
| `cpu` | CPU detection, topology, NUMA, affinity, ISA flags, ARM big.LITTLE/DynamIQ, dynamic worker recommendations |
| `threads` | Persistent virtual-thread pool, sliding-window executor, async bridge, process pool |
| `gpu` | Multi-vendor GPU detection (NVIDIA/CUDA, AMD/OpenCL, Intel oneAPI, ARM Adreno/Mali/Immortalis), ranking, multi-GPU dispatch |
| `npu` | NPU detection & inference (OpenVINO, ONNX Runtime, DirectML, CoreML, ARM Hexagon/Samsung NPU/Tensor TPU/MediaTek APU) |
| `virt` | Virtualization detection (Hyper-V, VT-x/AMD-V, KVM, WSL2, Docker, container detection) |
| `memory` | Memory pressure monitoring, automatic worker clamping, reusable buffer pool |
| `profiler` | `@timed`, `@profile_memory` decorators, `Timer` context manager, `Tracker` statistics |
| `benchmark` | Built-in micro-benchmarks (CPU, threads, memory bandwidth, GPU compute) |
| `priority` | OS-level task priority (IDLE → REALTIME) & energy profiles (POWER_SAVER → ULTRA_PERFORMANCE) |
| `max_mode` | Maximum optimization mode — activates ALL resources simultaneously with OS tuning |
| `android` | Android/Termux platform detection, ARM SoC database (25+ chipsets), big.LITTLE, thermal & battery |
| `iot` | IoT / SBC detection (Raspberry Pi, Jetson, BeagleBone, Coral, Hailo, 30+ SoCs) |
| `autotune` | Auto-tuning feedback loop — benchmark → config → re-tune, persistent profiles |
| `metrics` | Prometheus metrics exporter (`/metrics` HTTP endpoint, all subsystems) |
| `server` | JSON HTTP & gRPC server for multi-language integration (Node.js, Go, Java, etc.) |
| `k8s` | Kubernetes operator — pod info, GPU node capacity, auto-scaling, manifest generation |
| `engine` | Unified orchestrator — auto-detects everything and provides a single API |
```bash
pip install pyaccelerate
```

```python
from pyaccelerate import Engine

engine = Engine()
print(engine.summary())

# Submit I/O-bound tasks to the virtual thread pool
future = engine.submit(my_io_func, arg1, arg2)

# Run many tasks with auto-tuned concurrency
engine.run_parallel(process_file, [(f,) for f in files])

# GPU dispatch (auto-fallback to CPU)
results = engine.gpu_dispatch(my_kernel, data_chunks)
```

Activates all available hardware resources in parallel with OS-level tuning:
```python
from pyaccelerate.max_mode import MaxMode

with MaxMode() as m:
    print(m.summary())  # hardware manifest

    # Run CPU + I/O simultaneously
    results = m.run_all(
        cpu_fn=cpu_heavy_task, cpu_items=cpu_data,
        io_fn=io_heavy_task, io_items=io_data,
    )

    # I/O only (thread pool)
    downloaded = m.run_io(download, [(url,) for url in urls])

    # CPU only (process pool)
    computed = m.run_cpu(crunch, [(n,) for n in numbers])

    # Multi-stage pipeline
    results = m.run_pipeline([
        ("download", download_fn, urls),
        ("transform", transform_fn, data),
        ("save", save_fn, output),
    ])
```

Or via the Engine:
```python
engine = Engine()
with engine.max_mode() as m:
    results = m.run_all(...)
```

Control process scheduling and power profiles across Windows, Linux & macOS:
```python
from pyaccelerate.priority import (
    TaskPriority, EnergyProfile,
    set_task_priority, set_energy_profile,
    max_performance, balanced, power_saver,
)

# Quick presets
max_performance()  # HIGH priority + ULTRA_PERFORMANCE energy
balanced()         # Restore defaults
power_saver()      # BELOW_NORMAL + POWER_SAVER

# Fine-grained control
set_task_priority(TaskPriority.ABOVE_NORMAL)
set_energy_profile(EnergyProfile.PERFORMANCE)
```

```bash
pyaccelerate info         # Full hardware report
pyaccelerate benchmark    # Run micro-benchmarks
pyaccelerate gpu          # GPU details
pyaccelerate cpu          # CPU details
pyaccelerate npu          # NPU details
pyaccelerate android      # ARM/Android device details (SoC, clusters, thermal)
pyaccelerate virt         # Virtualization info
pyaccelerate memory       # Memory stats
pyaccelerate status       # One-liner
pyaccelerate priority     # Show current priority/energy
pyaccelerate priority --preset max          # Apply max performance preset
pyaccelerate priority --set high            # Set task priority
pyaccelerate priority --energy performance  # Set energy profile
pyaccelerate max-mode     # Show max-mode hardware manifest
pyaccelerate tune         # Auto-tune: benchmark → optimise → save
pyaccelerate tune --apply # Tune and apply to current process
pyaccelerate tune --show  # Show current tune profile
pyaccelerate metrics      # Start Prometheus /metrics server (:9090)
pyaccelerate metrics --once   # Print metrics and exit
pyaccelerate serve        # Start HTTP/gRPC API server (:8420)
pyaccelerate k8s          # Kubernetes pod & GPU info
pyaccelerate k8s --manifest   # Generate K8s Deployment YAML
pyaccelerate iot          # IoT / SBC board details
pyaccelerate version      # Print version
```

Full hardware detection for ARM devices — phones (Termux, Pydroid), tablets, Raspberry Pi, ARM laptops (Snapdragon X Elite), and ARM servers:
```python
from pyaccelerate.android import (
    is_android, is_termux, is_arm,
    get_device_info, get_soc_info,
    detect_big_little, get_arm_features,
    get_thermal_zones, get_battery_info,
)

if is_arm():
    soc = get_soc_info()
    if soc:
        print(f"{soc.name} ({soc.vendor})")  # Snapdragon 8 Gen 3 (Qualcomm)
        print(f"GPU: {soc.gpu_name}")        # Adreno 750
        print(f"NPU: {soc.npu_name} ({soc.npu_tops} TOPS)")  # Hexagon NPU (73.0 TOPS)

    clusters = detect_big_little()
    # {"Cortex-X4": [0], "Cortex-A720": [1,2,3], "Cortex-A520": [4,5,6,7]}

    features = get_arm_features()
    # ["aes", "asimd", "bf16", "crc32", "neon", "sve", "sve2", ...]
```

Supported SoC families (25+ chipsets in database):
- Qualcomm — Snapdragon 8 Elite, 8/7/6 Gen 1-3, 888, 865, X Elite
- Samsung — Exynos 2500, 2200, 2100, 1380, 990
- Google — Tensor G1–G4
- MediaTek — Dimensity 9300, 9200, 9000, 8300, 1200, 1100, 900
- HiSilicon — Kirin 9010, 9000
- Unisoc — T616
- ARM GPU detection — Adreno, Mali, Immortalis, Xclipse, PowerVR, Maleoon (via SoC DB, sysfs, Vulkan, OpenCL)
- ARM NPU detection — Hexagon, Samsung NPU, Google TPU, MediaTek APU, Da Vinci NPU (via SoC DB, NNAPI, TFLite)
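A cluster map in the shape `detect_big_little()` returns can also drive CPU pinning. A minimal stdlib sketch (Linux-only; the cluster values are hypothetical and this is not a library API):

```python
import os

# Hypothetical cluster map, shaped like detect_big_little() output
clusters = {
    "Cortex-X4": [0],             # prime core
    "Cortex-A720": [1, 2, 3],     # performance cores
    "Cortex-A520": [4, 5, 6, 7],  # efficiency cores
}

# Treat everything except the efficiency cluster as "performance" cores
perf_cores = set(clusters["Cortex-X4"]) | set(clusters["Cortex-A720"])

# Pin the current process to the big cores (Linux exposes sched_*affinity)
if hasattr(os, "sched_setaffinity"):
    allowed = perf_cores & os.sched_getaffinity(0)
    if allowed:
        os.sched_setaffinity(0, allowed)
```

The intersection with `os.sched_getaffinity(0)` keeps the call safe on machines with fewer cores than the map describes.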
Inspired by Java's virtual threads — a persistent ThreadPoolExecutor sized for I/O (cores × 3, cap 32). All I/O-bound work shares this pool instead of creating/destroying threads per operation.
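The sizing heuristic and the sliding-window pattern can be approximated with the stdlib. A rough sketch of the idea, not the library's implementation:

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

# I/O-oriented sizing: cores × 3, capped at 32
max_workers = min((os.cpu_count() or 1) * 3, 32)
pool = ThreadPoolExecutor(max_workers=max_workers)  # one persistent shared pool

def run_window(fn, args_list, max_concurrent=8):
    """Run fn over args_list with at most max_concurrent tasks in flight."""
    gate = threading.Semaphore(max_concurrent)

    def gated(args):
        with gate:  # blocks until a concurrency slot frees up
            return fn(*args)

    # pool.map preserves input order in its results
    return list(pool.map(gated, args_list))

results = run_window(lambda x: x * 2, [(i,) for i in range(10)], max_concurrent=4)
```

The semaphore bounds concurrency independently of the pool size, which is what lets one persistent pool serve many callers with different `max_concurrent` settings.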
```python
from pyaccelerate.threads import get_pool, run_parallel, submit

# Single task
fut = submit(download_file, url)

# Bounded concurrency (sliding window)
run_parallel(process, [(item,) for item in items], max_concurrent=8)
```

Auto-detects GPUs across CUDA, OpenCL, and Intel oneAPI. Distributes workloads with configurable strategies.
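A "score-weighted" strategy presumably hands each device work in proportion to its benchmark score. A toy sketch of that idea (not the library's actual balancer):

```python
def score_weighted_split(chunks, scores):
    """Assign chunks to devices proportionally to their scores.

    Greedy rule: each chunk goes to the device with the lowest
    load-to-score ratio, so a device scoring 3x gets ~3x the chunks.
    """
    assignments = [[] for _ in scores]
    load = [0.0] * len(scores)
    for chunk in chunks:
        i = min(range(len(scores)), key=lambda d: load[d] / scores[d])
        assignments[i].append(chunk)
        load[i] += 1.0
    return assignments

# Two devices, one 3x faster: 8 chunks split 6/2
split = score_weighted_split(list(range(8)), scores=[3.0, 1.0])
```

With scores `[3.0, 1.0]` the greedy rule yields a 6/2 split, matching the 3:1 score ratio.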
```python
from pyaccelerate.gpu import detect_all, dispatch

gpus = detect_all()
results = dispatch(my_kernel, data_chunks, strategy="score-weighted")
```

Zero-config decorators for timing and memory tracking:
```python
import logging

from pyaccelerate.profiler import timed, profile_memory, Tracker

@timed(level=logging.INFO)
def heavy_computation():
    ...

tracker = Tracker("db_queries")
for batch in batches:
    with tracker.measure():
        run_query(batch)
print(tracker.summary())
```

Benchmark your hardware, persist the optimal configuration, and auto-apply it:
```python
from pyaccelerate.autotune import auto_tune, get_or_tune, apply_profile

# Run a full tune cycle (benchmark → save to ~/.pyaccelerate/)
profile = auto_tune()
print(f"Overall score: {profile.overall_score}/100")
print(f"Optimal IO workers: {profile.optimal_io_workers}")
print(f"Optimal CPU workers: {profile.optimal_cpu_workers}")

# Load existing, or re-tune if hardware changed / profile is stale
profile = get_or_tune()

# Apply to the running process (sets workers, priority, energy)
apply_profile()
```

Expose CPU/GPU/NPU/memory/pool metrics in Prometheus format:
```python
from pyaccelerate.metrics import start_metrics_server, get_metrics_text

# Start /metrics endpoint on port 9090
start_metrics_server(port=9090)

# Or get the text for your own framework
text = get_metrics_text()
```

```bash
pyaccelerate metrics --port 9090    # Start server
pyaccelerate metrics --once         # Print and exit
curl http://localhost:9090/metrics  # Scrape
```

Multi-language access to all PyAccelerate features:
```python
from pyaccelerate.server import PyAccelerateServer

with PyAccelerateServer(http_port=8420, grpc_port=50051) as srv:
    print(f"HTTP: {srv.http_url}/api/v1")
    # Block until Ctrl+C
    srv.start(block=True)
```

```bash
pyaccelerate serve --http-port 8420 --grpc-port 50051
curl http://localhost:8420/api/v1/info     # JSON
curl http://localhost:8420/api/v1/cpu
curl http://localhost:8420/api/v1/gpu
curl http://localhost:8420/api/v1/metrics  # Prometheus text
```

Pod detection, GPU node capacity, auto-scaling recommendations & manifest generation:
```python
from pyaccelerate.k8s import (
    is_kubernetes, get_pod_info,
    get_scaling_recommendation, generate_resource_manifest,
)

if is_kubernetes():
    pod = get_pod_info()
    print(f"Pod: {pod.name} | GPU: {pod.gpu_limit}")

    rec = get_scaling_recommendation()
    print(f"Replicas: {rec.recommended_replicas} ({rec.reason})")

    yaml = generate_resource_manifest(name="ml-worker", gpu_per_replica=1)
```

```bash
pyaccelerate k8s             # Show pod & GPU info
pyaccelerate k8s --manifest  # Generate Deployment YAML
pyaccelerate k8s --json      # Machine-readable
```

A zero-dependency Node.js client is included in `bindings/nodejs/`:
```js
const { PyAccelerate } = require('pyaccelerate');

const client = new PyAccelerate('http://localhost:8420');
const info = await client.getInfo();
const metrics = await client.getMetrics();
const bench = await client.runBenchmark();
```

```bash
# Core (CPU + threads + memory + virt)
pip install pyaccelerate

# With NVIDIA GPU support
pip install pyaccelerate[cuda]

# With OpenCL support (AMD/Intel/NVIDIA)
pip install pyaccelerate[opencl]

# With Intel oneAPI support
pip install pyaccelerate[intel]

# All GPU backends
pip install pyaccelerate[all-gpu]

# gRPC server mode
pip install pyaccelerate[grpc]

# Kubernetes integration
pip install pyaccelerate[k8s]

# Development
pip install pyaccelerate[dev]
```

```bash
# CPU-only
docker build -t pyaccelerate .
docker run --rm pyaccelerate info

# With NVIDIA GPU
docker build -f Dockerfile.gpu -t pyaccelerate:gpu .
docker run --rm --gpus all pyaccelerate:gpu info

# Docker Compose
docker compose up pyaccelerate  # CPU
docker compose up gpu           # GPU
```

```bash
git clone https://github.com/GuilhermeP96/pyaccelerate.git
cd pyaccelerate
pip install -e ".[dev]"

# Run tests
pytest -v

# Lint + format
ruff check src/ tests/
ruff format src/ tests/

# Type check
mypy src/

# Build wheel
python -m build
```

```text
pyaccelerate/
├── cpu.py           # CPU detection & topology
├── threads.py       # Virtual thread pool & executors
├── gpu/
│   ├── detector.py  # Multi-vendor GPU enumeration
│   ├── cuda.py      # CUDA/CuPy helpers
│   ├── opencl.py    # PyOpenCL helpers
│   ├── intel.py     # Intel oneAPI helpers
│   └── dispatch.py  # Multi-GPU load balancer
├── npu/
│   ├── detector.py  # NPU detection (Intel, Qualcomm, Apple)
│   ├── onnx_rt.py   # ONNX Runtime inference
│   ├── openvino.py  # OpenVINO inference
│   └── inference.py # Unified inference API
├── virt.py          # Virtualization detection
├── memory.py        # Memory monitoring & buffer pool
├── profiler.py      # Timing & profiling utilities
├── benchmark.py     # Built-in micro-benchmarks
├── priority.py      # OS task priority & energy profiles
├── max_mode.py      # Maximum optimization mode
├── iot.py           # IoT / SBC hardware detection
├── autotune.py      # Auto-tuning feedback loop
├── metrics.py       # Prometheus metrics exporter
├── server.py        # HTTP + gRPC multi-language API
├── k8s.py           # Kubernetes pod & GPU integration
├── engine.py        # Unified orchestrator
├── cli.py           # Command-line interface
└── bindings/
    └── nodejs/      # npm client for Node.js / TypeScript
```
The `examples/` directory contains runnable scripts demonstrating all features:
| Example | Description |
|---|---|
| `example_basic.py` | Engine creation, summary, submit, run_parallel, batch |
| `example_parallel_io.py` | Parallel download/process/write with public UCI ML datasets |
| `example_cpu_bound.py` | Sequential vs thread pool vs process pool comparison |
| `example_max_mode.py` | MaxMode context manager, run_all, run_io, run_cpu, pipeline |
| `example_pipeline.py` | Multi-stage data pipeline (download → analyze → report) |
| `example_priority.py` | TaskPriority levels, EnergyProfile, presets, benchmarking |
```bash
cd examples
python example_basic.py
python example_max_mode.py
python example_priority.py
```

- IoT / SBC detection (Raspberry Pi, Jetson, Coral, Hailo)
- Auto-tuning feedback loop (benchmark → config → re-tune)
- Prometheus metrics exporter
- gRPC server mode for multi-language integration
- Kubernetes operator for auto-scaling GPU workloads
- npm package (Node.js bindings via HTTP API)
Evolved from the acceleration & virtual-thread systems built for:
- adb-toolkit — multi-GPU acceleration, virtual thread pool
- python-gpu-statistical-analysis — GPU compute foundations
MIT — see LICENSE.