Hey! Some Science Guy here. Welcome to HSPMN v3.0 - an LLM architecture built directly for the NVIDIA Blackwell (RTX 5090) GPU. We stop burning cycles on every single token and borrow a trick from the mammalian brain: route the predictable stuff fast, and save the heavy matrix math for the complex problems.
"The brain does not use the full neocortex to process a simple 'hello'. I just apply this exact same rule to my models."
Instead of imposing a monolithic attention pass on every token, each HSPMN block routes work through two parallel streams:

```mermaid
graph TD
    A[Input Stream] -->|Token Embeddings| B{ALF-LB Router}
    B -->|Predictable & Routine 80%| C["Reflexive Stream<br/>(Linear Attn + SwiGLU O(N))"]
    B -->|Complex Anomaly 20%| D["Contextual Stream<br/>(Full SQSK Attention O(N²))"]
    C --> E{Merge & Add}
    D --> E
    E --> F[Output to Next Block]
    style A fill:#2d3436,stroke:#fff,color:#fff
    style B fill:#e17055,stroke:#fff,color:#fff
    style C fill:#74b9ff,stroke:#fff,color:#fff,font-weight:bold
    style D fill:#a29bfe,stroke:#fff,color:#fff,font-weight:bold
    style E fill:#00b894,stroke:#fff,color:#fff
    style F fill:#2d3436,stroke:#fff,color:#fff
```
Imagine processing hundreds of thousands of routine firewall logs per second. A standard transformer runs full quadratic attention over every one of them, allocating massive VRAM tensors to map relations between routine [INFO] pings until it crashes with an Out-of-Memory (OOM) error.
HSPMN sidesteps this limit gracefully: background [INFO] logs are compressed linearly (near-zero memory footprint). But the millisecond a rogue [SQL_INJECTION_ATTACK] is parsed, the router snaps the anomaly into the heavy Contextual Stream for deep, focused reasoning.
Result: sustained parsing of massive sequences where routine data incurs a flat O(1) per-token VRAM cost, dramatically expanding your effective context window without exploding memory.
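To see why the split pays off, here is a back-of-the-envelope cost model in plain Python. It is not from the repo: the 80/20 split comes from the diagram above, and abstract "work units" stand in for real kernel costs, which depend on heads, dims, and hardware.

```python
# Back-of-the-envelope cost model for the 80/20 hybrid split.
# Illustrative only -- real costs depend on heads, dims, and hardware.

def full_attention_cost(n: int) -> int:
    """Every token attends to every token: O(N^2)."""
    return n * n

def hybrid_cost(n: int, sparsity_k: float = 0.2) -> float:
    """80% of tokens take the linear Reflexive Stream (O(N));
    20% take full attention over the selected set (O((kN)^2))."""
    routine = (1 - sparsity_k) * n        # linear-cost tokens
    complex_ = sparsity_k * n             # quadratic-cost tokens
    return routine + complex_ * complex_

n = 4096
print(f"full:   {full_attention_cost(n):>12,}")
print(f"hybrid: {hybrid_cost(n):>12,.0f}")
print(f"speedup ~{full_attention_cost(n) / hybrid_cost(n):.1f}x")
```

The quadratic term shrinks by a factor of `sparsity_k**2`, which is why the routine traffic stops dominating the budget.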
- Hybrid Execution: FlexAttention for training + custom Triton kernels for inference.
- Hardware Sparsity: Custom Triton kernels built ground-up for the Blackwell architecture.
- 328k Context Window: Tested on the RTX 5090 using just 30.24 GB VRAM via a True Sparse KV Cache.
- Silly Fast: 1.33M tokens/sec at BF16 precision.
- ALF-LB Routing: A bias-based routing method without that annoying gradient/Gumbel noise.
- Dual Entropy Loss: Forces strict 0-or-1 token choices while keeping the hardware load totally even across batches.
- Zero Graph Breaks: Native static routing (`torch.topk`), so `torch.compile(fullgraph=True)` actually does its job.
- CUDAGraphs Compatible: Sparsity targets stored as core Python floats (no `.item()` sync!). Captured neatly in precisely 2 partitions.
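The static-routing idea is easy to sketch. The real ALF-LB implementation lives in `hspmn_v3_0.py` and operates on GPU tensors; the version below is a hypothetical, dependency-free illustration of the core mechanic as described above: add a load-balancing bias to the router scores, then take a fixed-size top-k, so selection is deterministic (no Gumbel noise) and the graph shape never changes.

```python
# Minimal sketch of bias-based top-k routing (no Gumbel noise).
# Hypothetical illustration -- not the repo's ALF-LB code.
import math

def route(scores, bias, sparsity_k=0.2):
    """Pick the top ceil(sparsity_k * N) tokens for the Contextual Stream.

    scores: raw router logits, one per token
    bias:   load-balancing offset added before selection (adjusted between
            steps, so no noise flows through the gradient path)
    """
    k = math.ceil(sparsity_k * len(scores))
    biased = [s + bias for s in scores]
    # Static top-k: the winner-set size is fixed, which keeps the graph
    # shape constant for torch.compile / CUDAGraphs capture.
    order = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)
    contextual = set(order[:k])
    return [1 if i in contextual else 0 for i in range(len(scores))]

mask = route([0.1, 2.3, -0.5, 0.9, 1.7, 0.0, -1.2, 0.4], bias=0.0)
print(mask)  # 1 marks tokens sent to the heavy stream
print(sum(mask), "of", len(mask), "tokens routed to full attention")
```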
| Metric | Value | Notes |
|---|---|---|
| Throughput | 1,329,516 tok/s | Batch=64, Seq=4096, Dim=2048 |
| VRAM (throughput) | 12.28 GB | CUDAGraphs 2 partitions |
| Max Context | 335,872 tokens | Batch=1, Dim=2048 (30.24 GB VRAM) |
| Latency | 197.17 ms avg | Full forward pass (P95: 197.81 ms) |
| Training Speed | ~980k tok/s | Real training speed using FlexAttention |
The codebase is strictly modularized into core architectural models, hardware-accelerated execution pipelines, and rigorous validation suites. This ensures a clean separation between the mathematical framework and its runtime components.
Here is the high-level topology of the repository:
```mermaid
graph LR
    Root["HSPMN-v3"]
    subgraph Core ["Core Architecture"]
        A1["hspmn_v3_0.py<br/><small>Main Architecture</small>"]
        A2["hspmn_hf_wrapper.py<br/><small>HuggingFace Wrap</small>"]
        A3["kernels_v3_0.py<br/><small>Triton Magic!</small>"]
    end
    subgraph Runners ["Execution & Training"]
        B1["benchmark_v3_0.py<br/><small>Go Fast</small>"]
        B2["train_v3_0.py<br/><small>Get Smart</small>"]
        B3["utils_v3_0.py<br/><small>Helper Logic</small>"]
    end
    subgraph Testing ["Validation & Tests"]
        C1["test_v3_0.py<br/><small>Unit Tests</small>"]
        C2["test_kernels_v3_0.py"]
        C3["needle_test.py<br/><small>Context Check</small>"]
        C4["verify_models.py"]
    end
    subgraph Docs ["Documentation & Config"]
        D1["README.md<br/><small>You are here</small>"]
        D2["HSPMN_v3_0.tex & .pdf<br/><small>Architecture Paper</small>"]
        D3["requirements.txt"]
        D4["LICENSE"]
    end
    Root --> Core
    Root --> Runners
    Root --> Testing
    Root --> Docs
    %% Colors optimized for dark mode (white/bright text, saturated dark backgrounds)
    style Root fill:#d63031,stroke:#fff,stroke-width:2px,color:#fff,font-weight:bold
    style A1 fill:#0984e3,stroke:#fff,color:#fff
    style A2 fill:#0984e3,stroke:#fff,color:#fff
    style A3 fill:#0984e3,stroke:#fff,color:#fff
    style B1 fill:#00b894,stroke:#fff,color:#fff
    style B2 fill:#00b894,stroke:#fff,color:#fff
    style B3 fill:#00b894,stroke:#fff,color:#fff
    style C1 fill:#e17055,stroke:#fff,color:#fff,font-weight:bold
    style C2 fill:#e17055,stroke:#fff,color:#fff,font-weight:bold
    style C3 fill:#e17055,stroke:#fff,color:#fff,font-weight:bold
    style C4 fill:#e17055,stroke:#fff,color:#fff,font-weight:bold
    style D1 fill:#6c5ce7,stroke:#fff,color:#fff
    style D2 fill:#6c5ce7,stroke:#fff,color:#fff
    style D3 fill:#6c5ce7,stroke:#fff,color:#fff
    style D4 fill:#6c5ce7,stroke:#fff,color:#fff
    classDef default font-family:sans-serif,font-size:14px;
    classDef title font-weight:bold,color:#fff;
```
Running things on bleeding-edge tech like the NVIDIA GB202 (RTX 5090) isn't without quirks. Here's what I fixed under the hood:
- TF32 Math Errors: PyTorch defaults to TF32, which broke our router sigmoid gate math due to reduced precision. Forced FP32 via `set_float32_matmul_precision('highest')`. Boom. Sorted.
- Quantization Noise Gate: Fast MXFP8 math was bleeding noise. I added a `< 0.05` hard floor to protect the routing logic.
- SiLU NaN Errors: Deep padding into Blackwell SiLU kernels crashed them. Fixed with a good old clamp and `nan_to_num`.
- TMA Stride Protection: Replaced `tl.load` with `tl.make_block_ptr` to stop massive L2 cache misses dead in their tracks.
- CUDAGraphs `.item()` Fix: Gutted `tensor.item()` from the router forward path. CUDAGraphs now captures properly since sparsity targets are standard Python floats (`_sparsity_float`).
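The first three fixes are all precision hygiene, and the pattern is simple to show. This is a condensed, hypothetical sketch, not the repo's code: `stabilize_gate` and its exact thresholds are illustrative stand-ins for the guards inside `hspmn_v3_0.py` and the kernels.

```python
import torch

# Force true FP32 matmuls -- TF32's 10-bit mantissa is too coarse for the
# router's sigmoid gate (fix #1).
torch.set_float32_matmul_precision('highest')

def stabilize_gate(gate: torch.Tensor, floor: float = 0.05) -> torch.Tensor:
    """Defensive guards around the routing gate (fixes #2 and #3).

    - clamp + nan_to_num so SiLU edge cases on padded inputs cannot
      propagate NaNs into the router
    - zero out sub-threshold activations so MXFP8 quantization noise
      cannot flip routing decisions
    """
    gate = torch.nan_to_num(gate.clamp(-10.0, 10.0))
    return torch.where(gate.abs() < floor, torch.zeros_like(gate), gate)
```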
Prerequisites: NVIDIA Driver 570+, CUDA 12.8+, Python 3.10+, PyTorch 2.10+ (nightly)
Pro-tip for reproducible benchmarks (OS tuning):

```shell
# GPU: persistence + power limit
sudo nvidia-smi -pm 1 && sudo nvidia-smi -pl 500

# CPU: performance governor + boost
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 1 | sudo tee /sys/devices/system/cpu/cpufreq/boost

# Memory: reduce OS jitter
sudo sysctl vm.swappiness=10
```

```shell
# Clone the repository
git clone https://github.com/NetBr3ak/HSPMN.git
cd HSPMN

# Install dependencies
pip install -r requirements.txt
```

Test how fast your rig really is:

```shell
python benchmark_v3_0.py --mode all
```

For direct integration or testing the core block programmatically:
```python
import torch

from hspmn_v3_0 import HSPMNBlock
from utils_v3_0 import HSPMNConfig

# Initialize configuration
config = HSPMNConfig(dim=2048, num_heads=16, num_kv_heads=4, sparsity_k=0.2)
model = HSPMNBlock(config).cuda().bfloat16()

# Compile the model
model = torch.compile(model, mode="max-autotune", fullgraph=True)

# Process a dummy sequence
x = torch.randn(1, 4096, 2048).cuda().bfloat16()
output, aux_loss, kv_cache = model(x)
print(f"Output shape: {output.shape}")
```

Launch a training run:

```shell
python train_v3_0.py \
    --batch 32 \
    --seq_len 4096 \
    --dim 2048 \
    --steps 1000 \
    --grad_accum 4 \
    --wandb "hspmn-experiment-1"
```

Tear it down to see if it breaks:

```shell
python test_kernels_v3_0.py
python verify_models.py
```

Author: Some Science Guy (Szymon Jędryczko)
License: Proprietary / All Rights Reserved - Non-Commercial Use Only
Source-available for portfolio viewing only. Commercial use, unauthorized modification, reproduction, or distribution is strictly prohibited. But feel free to look around!