Projects

Two projects, in different stages. Plus the small things that keep the lab honest.

Live

KVWarden

Tenant fairness on shared inference.

Quiet-tenant TTFT at 1.14× solo, 26× better than FIFO

KVWarden is a scheduler and cache-pressure experiment for shared LLM inference. The first public result is narrow on purpose: a quiet tenant stays near solo latency while a flooder pushes the system. The harness is public; the plots do not hide the quiet tenant in an aggregate.
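To make the "not hidden in an aggregate" point concrete, here is a minimal sketch of reporting TTFT per tenant next to the pooled number. It is not the KVWarden harness: the function name, the choice of median as the statistic, and every sample value are assumptions invented for illustration, not measurements.

```python
from statistics import median


def ttft_ratio(shared_ttft_ms, solo_ttft_ms):
    """Ratio of a tenant's median TTFT under contention to its solo median."""
    return median(shared_ttft_ms) / median(solo_ttft_ms)


# Hypothetical samples in milliseconds, made up for this example.
solo_quiet = [100.0, 102.0, 98.0]
shared = {
    "quiet": [110.0, 112.0, 108.0],
    "flooder": [900.0, 1100.0, 1000.0],
}

# Per-tenant view: each tenant gets its own number.
per_tenant = {name: median(samples) for name, samples in shared.items()}

# Aggregate view: pooling all samples lets the flooder dominate.
aggregate = median([s for samples in shared.values() for s in samples])

quiet_ratio = ttft_ratio(shared["quiet"], solo_quiet)

print(f"per-tenant medians: {per_tenant}")
print(f"aggregate median:   {aggregate} ms")
print(f"quiet vs solo:      {quiet_ratio:.2f}x")
```

With these made-up samples the pooled median lands far from what the quiet tenant actually experienced, which is why the quiet tenant is reported on its own.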

In research

mlxd

Tenant-fair LLM inference on Apple Silicon.

mlxd is a planned scheduler and admission layer that sits on top of `mlx_lm.server`. The thesis: MLX today has no notion of tenant identity, so concurrent requests can bleed KV cache between callers. Once correctness is restored, the next gap is fairness. The fairness model has to differ from the KV-block partitioning used on CUDA, because under unified memory the shared resource is bandwidth, not GPU memory. Sibling to KVWarden under coconut-labs. Shared methodology, separate codebase.
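As a rough illustration of what tenant-keyed admission can mean, the sketch below runs deficit round-robin over per-tenant queues. It is not mlxd's design and it does not call into `mlx_lm`: the class, its methods, and the abstract `cost` unit (a stand-in for whatever shared resource ends up being metered) are all hypothetical.

```python
from collections import deque


class TenantFairQueue:
    """Deficit round-robin over per-tenant queues (hypothetical sketch)."""

    def __init__(self, quantum):
        self.quantum = quantum  # budget granted to each tenant per round
        self.queues = {}        # tenant id -> deque of (request id, cost)
        self.deficit = {}       # tenant id -> unspent budget

    def submit(self, tenant, request_id, cost):
        self.queues.setdefault(tenant, deque()).append((request_id, cost))
        self.deficit.setdefault(tenant, 0)

    def drain(self):
        """Yield (tenant, request id) in fair order until every queue is empty."""
        while any(self.queues.values()):
            for tenant, queue in list(self.queues.items()):
                if not queue:
                    # An idle tenant does not bank budget for later.
                    self.deficit[tenant] = 0
                    continue
                self.deficit[tenant] += self.quantum
                while queue and queue[0][1] <= self.deficit[tenant]:
                    request_id, cost = queue.popleft()
                    self.deficit[tenant] -= cost
                    yield tenant, request_id


fair = TenantFairQueue(quantum=100)
for i in range(4):
    fair.submit("flooder", f"f{i}", cost=100)  # flooder arrives first
fair.submit("quiet", "q0", cost=100)           # quiet tenant arrives last

order = list(fair.drain())
print(order)
```

Under plain FIFO the quiet tenant's request would wait behind all four flooder requests; here it is served in the first round, right after the flooder's first request.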

Probe window: 2026-05-19 → 2026-05-23.

Tools and experiments

Smaller things, mostly the scaffolding behind the public work.

RSS for new entries: /rss.xml