English | 简体中文
A structured CUDA programming course: from SGEMM optimization to a production inference engine. Four progressive sub-projects take GPU kernel development from fundamentals to advanced techniques.
| # | Project | Focus | Tech |
|---|---|---|---|
| 01 | SGEMM Tutorial | Matrix multiplication optimization | CUDA C++, Makefile |
| 02 | TensorCraft Core | Header-only kernel library | C++17/20, CMake |
| 03 | HPC Advanced | Advanced HPC techniques | CUDA, CMake, Benchmark |
| 04 | Inference Engine | DL inference engine | CUDA, CMake, pybind11 |
```
01-SGEMM Tutorial (Basics)
        ↓
02-TensorCraft Core (Library Design)
        ↓
03-HPC Advanced (Optimization)
        ↓
04-Inference Engine (Application)
```
```bash
git clone https://github.com/LessUp/cuda-kernel-academy.git
cd cuda-kernel-academy

# Build all sub-projects
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run tests
cd build && ctest --output-on-failure
```

| Option | Default | Description |
|---|---|---|
| BUILD_TENSORCRAFT | ON | Build TensorCraft Core |
| BUILD_HPC_ADVANCED | ON | Build HPC Advanced |
| BUILD_INFERENCE_ENGINE | ON | Build Inference Engine |
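These options can be combined on the configure line to build only a subset of the sub-projects. For example, to build just the SGEMM tutorial and TensorCraft Core (a sketch assuming the option names above; the exact defaults are defined in the top-level CMakeLists.txt):

```bash
# Configure with HPC Advanced and the Inference Engine disabled
cmake -B build -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_HPC_ADVANCED=OFF \
      -DBUILD_INFERENCE_ENGINE=OFF
cmake --build build -j$(nproc)
```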
- GEMM Optimization: Naive → Tiled → Register Blocked → Tensor Core
- Memory Hierarchy: Global → Shared → Register, bank conflict avoidance
- Parallel Patterns: Reduction, scan, histogram, sort
- Kernel Fusion: Bias+Activation, LayerNorm+Residual
- Mixed Precision: FP16/BF16 Tensor Core, INT8 quantization
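The GEMM progression above starts from a naive kernel and moves to shared-memory tiling. As a minimal sketch of the first two stages (kernel names and the tile size are illustrative, not taken from this repo), the key difference is that the tiled version stages sub-matrices of A and B in shared memory so each global-memory element is reused TILE times:

```cuda
#include <cuda_runtime.h>

// Naive SGEMM: one thread per output element, every operand read
// comes straight from global memory.
__global__ void sgemm_naive(int M, int N, int K,
                            const float* A, const float* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Tiled SGEMM: each block cooperatively loads TILE x TILE sub-matrices
// of A and B into shared memory, cutting global-memory traffic.
#define TILE 32
__global__ void sgemm_tiled(int M, int N, int K,
                            const float* A, const float* B, float* C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Cooperative load, zero-padding out-of-range elements.
        As[threadIdx.y][threadIdx.x] = (row < M && t * TILE + threadIdx.x < K)
            ? A[row * K + t * TILE + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t * TILE + threadIdx.y < K && col < N)
            ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

The later stages (register blocking, Tensor Cores) build on the same idea, adding a register-resident accumulator tile per thread and `mma` instructions respectively.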
- CUDA Toolkit 12.x+
- CMake 3.20+
- C++17/20 compiler
- GPU: Volta (SM 7.0) or newer
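A quick way to confirm the GPU requirement is to query the device's compute capability with the CUDA runtime API (a standalone sketch; Volta corresponds to SM major version 7):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No CUDA device found\n");
        return 1;
    }
    // Volta is SM 7.0, so any major version >= 7 is supported.
    std::printf("%s: SM %d.%d %s\n", prop.name, prop.major, prop.minor,
                prop.major >= 7 ? "(supported)" : "(older than Volta)");
    return 0;
}
```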
MIT License