I'm a founding member of the technical staff at UniversalAGI, where I build ML models for physics from first principles. I did research at UC Berkeley's Sky Computing Lab with Joseph E. Gonzalez and Matei Zaharia on efficient inference and systems for LLMs: semantic caching with error guarantees, compound AI orchestration, agentic systems, database query optimization, and sparse attention. Before that, I built production systems at Snowflake and Microsoft and studied computer science at TU Munich and UC Berkeley.
How do you build efficient and reliable systems when the components underneath are probabilistic?
- Optimizing LLM Queries in Relational Workloads: row and column reordering algorithms that maximize KV cache reuse in batch analytics with LLMs; 3.4× faster job completion and 32% lower cost.
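The reordering idea can be sketched in a few lines. This is an illustrative toy, not the paper's algorithm: rows are sorted so that requests sharing a prompt prefix become adjacent, which lets an inference engine with prefix caching reuse KV-cache entries across consecutive requests. The column ordering, prompt template, and reuse proxy below are all assumptions for the sketch.

```python
def build_prompt(row, columns):
    # Serialize columns in a fixed order so shared values form shared prefixes.
    return "".join(f"{c}: {row[c]}\n" for c in columns)

def reorder_rows(rows, columns):
    # Sorting by column values groups identical prefixes together; in practice,
    # columns would be ordered by decreasing value repetition for best reuse.
    return sorted(rows, key=lambda r: tuple(r[c] for c in columns))

def shared_prefix_chars(prompts):
    # Crude proxy for KV-cache reuse: characters shared with the previous prompt.
    reused = 0
    for prev, cur in zip(prompts, prompts[1:]):
        n = 0
        while n < min(len(prev), len(cur)) and prev[n] == cur[n]:
            n += 1
        reused += n
    return reused

rows = [
    {"country": "DE", "city": "Munich"},
    {"country": "US", "city": "Berkeley"},
    {"country": "DE", "city": "Berlin"},
    {"country": "US", "city": "Seattle"},
]
columns = ["country", "city"]

unordered = [build_prompt(r, columns) for r in rows]
ordered = [build_prompt(r, columns) for r in reorder_rows(rows, columns)]
assert shared_prefix_chars(ordered) > shared_prefix_chars(unordered)
```

After sorting, the two DE rows and the two US rows sit next to each other, so each request after the first reuses the serialized prefix of its neighbor.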
- vCache: Semantic Caching with Error Rate Guarantees (Code): the first production-ready semantic cache with mathematical error guarantees; outperforms all baselines on error rate and cache hit rate.
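For context, here is a minimal semantic cache with a fixed global similarity threshold, the naive baseline that vCache improves on by learning per-entry thresholds with bounded error. The bag-of-words embedding and the 0.8 threshold are illustrative stand-ins, not anything from vCache itself.

```python
import math

def embed(text):
    # Hypothetical embedding: bag-of-words counts (a real system would use
    # a neural embedding model).
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):
        # A single hand-tuned threshold: too low causes wrong cache hits,
        # too high wastes LLM calls. vCache replaces this with thresholds
        # learned online per cached entry to bound the error rate.
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def get(self, query):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit: skip the LLM call
        return None        # cache miss: call the LLM, then put()

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("what is the capital of France", "Paris")
assert cache.get("what is the capital of France?") == "Paris"
assert cache.get("best pizza in Naples") is None
```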
- vAttention: Dynamic Sparse Attention for Efficient Inference (Code): the first practical sparse attention method with mathematical accuracy guarantees; matches full-model quality at up to 20× sparsity.
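As a point of reference, here is plain top-k sparse attention, where each query attends only to its k highest-scoring keys. This is an assumption-laden sketch, not vAttention's actual selection criterion, which additionally provides accuracy guarantees for deciding what to keep; here k is a fixed hyperparameter.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, keys, v, k_keep):
    scores = q @ keys.T / np.sqrt(q.shape[-1])         # (n_q, n_k)
    # Keep only the k_keep largest scores per query; mask the rest to -inf
    # so they receive zero attention weight after the softmax.
    kth = np.sort(scores, axis=-1)[:, -k_keep][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    return softmax(masked) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 8))
keys = rng.normal(size=(16, 8))
v = rng.normal(size=(16, 8))

# Keeping all 16 keys recovers dense attention exactly.
dense = softmax(q @ keys.T / np.sqrt(8)) @ v
assert np.allclose(sparse_attention(q, keys, v, 16), dense)

# Keeping 4 of 16 keys (4x sparsity) approximates it at a fraction of the cost.
sparse = sparse_attention(q, keys, v, 4)
```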
- ALTO: Compound AI System Orchestration: automatic optimization of compound AI systems through streaming and parallelism via a nested-ancestry abstraction; 10-30% latency improvements over LangGraph.
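The streaming idea can be sketched with two pipeline stages connected by a queue. This is a minimal illustration, not ALTO's implementation (and it omits the nested-ancestry abstraction entirely): instead of waiting for stage A to finish, A streams partial outputs so stage B overlaps its work with A's, shrinking end-to-end latency toward the slower stage rather than the sum of both.

```python
import queue
import threading
import time

def stage_a(out_q, chunks):
    # Upstream stage: emits partial outputs as they become available
    # (stand-in for streaming tokens from an LLM call).
    for c in chunks:
        time.sleep(0.01)
        out_q.put(c)
    out_q.put(None)  # end-of-stream sentinel

def stage_b(in_q, results):
    # Downstream stage: starts consuming before stage A has finished.
    while (c := in_q.get()) is not None:
        time.sleep(0.01)  # stand-in for downstream processing
        results.append(c.upper())

chunks = ["compound", "ai", "systems"]
q_ab = queue.Queue()
results = []
t_a = threading.Thread(target=stage_a, args=(q_ab, chunks))
t_b = threading.Thread(target=stage_b, args=(q_ab, results))
t_a.start(); t_b.start()
t_a.join(); t_b.join()
assert results == ["COMPOUND", "AI", "SYSTEMS"]
```

With the overlap, total latency approaches max(A, B) plus one chunk of startup delay, rather than A + B when the stages run strictly sequentially.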
- The Danger of Overthinking in Agentic Systems: identifies overthinking in reasoning models, where they favor extended internal reasoning over interaction with the environment; our mitigation strategies improve performance by 30% while reducing costs by 43%.