Building efficient AI infrastructure, high-performance systems, and data pipelines.
Focused on engineering practice in AI infrastructure and high-performance computing.
⚡ Focus: CUDA Kernel Optimization · LLM Inference · GPU Computing
🌱 Currently exploring: AI inference acceleration & large-scale pipeline orchestration
📍 Open to: research collaboration and open-source co-development
About · Projects · Experience · Tech Stack · Stats · Contact
I build AI infrastructure and high-performance systems with C++/CUDA, Python, and Go.
My main areas of focus are engineering work in AI infrastructure, GPU kernel optimization, and high-performance computing.
- GPU Kernel Engineering: CUDA/Triton kernel optimization, FlashAttention, GEMM, quantization
- AI Inference Systems: LLM inference engines, model quantization (W8A16/FP8), KV cache
- High-Performance Computing: N-body simulation, ray tracing, image processing pipelines
- Real-time Systems: WebRTC signaling, real-time detection, 3D digital human platform
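In the list above, W8A16 means INT8 weights with 16-bit activations. The sketch below shows one common way to run such a layer. It assumes symmetric per-output-channel scales, and the engines here may instead use group-wise scales or a different memory layout. Python floats stand in for FP16 activations, and the function names are hypothetical.

```python
def quantize_per_channel(weight):
    """Symmetric per-output-channel INT8 quantization (an assumed scheme).

    weight: list of rows, one row per output channel.
    Returns (int8-range rows, one float scale per row).
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Map the largest magnitude in the row to 127; guard all-zero rows.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a16_linear(x, q_rows, scales):
    """y = x @ W^T using INT8 weights and higher-precision activations.

    The dot product with the integer weights is accumulated in floating
    point, and each output channel's scale is applied once at the end.
    """
    return [
        scale * sum(xi * qi for xi, qi in zip(x, q_row))
        for q_row, scale in zip(q_rows, scales)
    ]
```

Applying each channel's scale once after the dot product lets the weights stay in INT8 in memory while the accumulation runs at higher precision.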
- Modern C++17/CUDA AI kernel library: Elementwise, GEMM, FlashAttention, Conv2D, SpMV, FP8 quantization
- Progressive SGEMM optimization, from a naive triple loop to Tensor Cores, reaching 40% of cuBLAS performance
- Transformer kernel fusion: RMSNorm+RoPE fusion, gated MLP fusion, FP8 GEMM with auto-tuning
- CUDA LLM kernel library: FlashAttention (online softmax), FP16/INT8 GEMM with Tensor Cores
- Lightweight LLM inference engine: W8A16 quantization, KV cache, multiple sampling strategies
- 7-level GEMM optimization (naive to Tensor Core), reaching 72% of cuBLAS performance, with an MNIST demo
- WebGPU micro inference engine: Conv2D, kernel fusion, Im2Col, MNIST classification
- Multi-model real-time vision: YOLO/DETR/OWL-ViT/BLIP with WebSocket streaming
- Pure CUDA ray tracer: Phong shading, path tracing, BVH acceleration, warp-divergence optimization
- Million-particle GPU simulation: direct N², Barnes-Hut, and spatial hashing, with CUDA-OpenGL interop
- Real-time fluid simulation of 10K particles using WebGPU compute shaders, with trail effects
- CUDA image processing library: convolution, morphology, geometric transforms, pipeline processing
- DAG-based heterogeneous image pipeline: multi-stream scheduling, pinned memory pool
- 3D digital human platform: Three.js rendering, voice interaction, behavior control, emotion FSM
- Minimal WebRTC demo: Go WebSocket signaling, room management, peer-to-peer media
- End-to-end encrypted note sync: AES-256, 12-word mnemonic, real-time collaboration over WebSocket
- Browser-based memory training: N-back, spaced reinforcement, adaptive difficulty, PWA
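The "online softmax" behind the FlashAttention kernels listed above lets attention be computed one key/value block at a time, without materializing the full row of scores. The sketch below shows that recurrence for a single query. It uses plain Python lists in place of GPU tiles, and the function name and `block_size` are hypothetical. Real kernels also tile over queries and keep the running values in registers or shared memory.

```python
import math


def online_softmax_attention(q, keys, values, block_size=2):
    """softmax(q . K^T / sqrt(d)) @ V for one query, streaming over K/V blocks.

    Keeps a running max m, a running denominator l, and an unnormalized
    output accumulator acc, rescaling them whenever the max increases.
    """
    scale = 1.0 / math.sqrt(len(q))
    m = -math.inf                   # running max of scores seen so far
    l = 0.0                         # running sum of exp(score - m)
    acc = [0.0] * len(values[0])    # running sum of exp(score - m) * v
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        m_new = max(m, max(scores))
        # Rescale earlier partial sums to the new max (exp(-inf) == 0.0 on the first block).
        correction = math.exp(m - m_new)
        l *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        m = m_new
    # Normalize once at the end instead of once per block.
    return [a / l for a in acc]
```

The running max never decreases, so every correction factor `exp(m - m_new)` is at most 1 and the exponentials cannot overflow. The result matches a naive softmax attention up to floating-point rounding.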
Background in communication and information engineering.
Engineering experience with medical imaging, real-time audio/video (RTC), and genomic data.
| Category | Technologies |
|---|---|
| Languages | |
| AI & HPC | |
| System & DevOps | |
| Web & Frontend | |
Feel free to reach out by email for collaboration, technical discussions, or open-source ideas.


