Projects
Two projects, in different stages. Plus the small things that keep the lab honest.
Live
KVWarden
Tenant fairness on shared inference.
Quiet-tenant TTFT at 1.14× solo, 26× better than FIFO
KVWarden is a scheduler and cache-pressure experiment for shared LLM inference. The first public result is narrow on purpose: a quiet tenant stays near solo latency while a flooder pushes the system. The harness is public; the plots do not hide the quiet tenant in an aggregate.
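To make the "no aggregate" point concrete, here is a minimal sketch of per-tenant reporting. It is not taken from the KVWarden harness: the function name `ttft_ratio`, the tenant labels, and every sample value are hypothetical, and the numbers are synthetic rather than measured.

```python
from statistics import median


def ttft_ratio(shared_ms, solo_ms):
    """Median TTFT under contention divided by median TTFT when running alone."""
    return median(shared_ms) / median(solo_ms)


# Synthetic, made-up samples in milliseconds. NOT KVWarden measurements.
solo_quiet = [100.0, 102.0, 98.0]
shared = {
    "quiet": [110.0, 115.0, 112.0],
    # The flooder issues ten times as many requests as the quiet tenant.
    "flooder": [900.0, 950.0, 1000.0] * 10,
}

# Per-tenant view: each tenant gets its own median.
per_tenant = {name: median(samples) for name, samples in shared.items()}

# Pooled view: one median over every request, regardless of tenant.
pooled = median([s for samples in shared.values() for s in samples])

# The headline-style number: quiet tenant's shared TTFT relative to solo.
quiet_ratio = ttft_ratio(shared["quiet"], solo_quiet)

print("per-tenant median TTFT (ms):", per_tenant)
print("pooled median TTFT (ms):", pooled)
print("quiet tenant vs solo:", round(quiet_ratio, 2))
```

With these toy values the pooled median is a flooder sample, so a single aggregate number says nothing about the quiet tenant. Reporting the ratio per tenant is what keeps that tenant visible.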
In research
mlxd
Tenant-fair LLM inference on Apple Silicon.
mlxd is a planned scheduler and admission layer that sits on top of `mlx_lm.server`. The thesis: MLX today has no notion of tenant identity, so concurrent requests can bleed KV cache between callers. Once correctness is restored, the next gap is fairness. That fairness model has to differ from the KV-block partitioning used on CUDA, because under unified memory the shared resource is bandwidth, not GPU memory.
Sibling to KVWarden under coconut-labs. Shared methodology, separate codebase.
Probe window: 2026-05-19 → 2026-05-23.