Projects

Two projects, in different stages. Plus the small things that keep the lab honest.

Live

KVWarden

Tenant fairness on shared inference.

Quiet-tenant TTFT at 1.14× solo, 26× better than FIFO

KVWarden is a scheduler and cache-pressure experiment for shared LLM inference. The first public result is narrow on purpose: a quiet tenant stays near solo latency while a flooder pushes the system. The harness is public; the plots do not hide the quiet tenant in an aggregate.
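To make the "no aggregate" point concrete, here is a minimal sketch of reporting time to first token (TTFT) per tenant instead of pooled. It is not taken from the KVWarden harness: the function names, the tenant labels, and every number below are made up for illustration.

```python
from statistics import median


def per_tenant_slowdown(contended_ms, solo_ms):
    """Median contended TTFT divided by median solo TTFT, per tenant.

    Both arguments map a tenant name to a list of TTFT samples in ms.
    """
    return {
        tenant: median(samples) / median(solo_ms[tenant])
        for tenant, samples in contended_ms.items()
    }


def pooled_median(contended_ms):
    """Median over all requests with tenant identity thrown away."""
    return median(ms for samples in contended_ms.values() for ms in samples)


# Hypothetical samples, NOT measured results.
solo = {"quiet": [100.0, 100.0, 100.0], "flooder": [100.0, 100.0, 100.0]}
contended = {
    "quiet": [120.0, 130.0, 110.0],
    "flooder": [900.0, 1000.0, 1100.0, 950.0, 1050.0],
}

slowdown = per_tenant_slowdown(contended, solo)
pooled = pooled_median(contended)
```

With these invented samples the pooled median is dominated by the flooder's requests, so it says little about the quiet tenant either way. The per-tenant ratio is the number that answers whether the quiet tenant stayed near solo latency.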

Read the launch | Project page | GitHub

In research

mlxd

Tenant-fair LLM inference on Apple Silicon.

mlxd is a planned scheduler and admission layer that sits on top of `mlx_lm.server`. The thesis: MLX today has no tenant identity, so concurrent requests can bleed KV cache between callers. Once correctness is restored, the next gap is fairness, and the fairness model has to differ from the KV-block partitioning used on CUDA because unified memory makes bandwidth, not GPU memory, the shared resource. Sibling to KVWarden under coconut-labs. Shared methodology, separate codebase.
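As a generic illustration of what tenant identity at the admission layer buys over FIFO, the sketch below keeps one queue per tenant and admits in round-robin order. It is an assumption-laden toy, not mlxd's design (which is still in research) and not an `mlx_lm.server` API: the class name, method names, and request labels are hypothetical, and it does not model memory bandwidth at all.

```python
from collections import OrderedDict, deque


class RoundRobinAdmission:
    """Toy admission queue: one FIFO per tenant, tenants served in rotation."""

    def __init__(self):
        # tenant -> pending requests; dict order is the rotation order
        self._queues = OrderedDict()

    def submit(self, tenant, request):
        self._queues.setdefault(tenant, deque()).append(request)

    def admit(self):
        """Return (tenant, request) for the next admission, or None if idle."""
        if not self._queues:
            return None
        tenant, queue = next(iter(self._queues.items()))
        request = queue.popleft()
        # Move the tenant to the back of the rotation if it still has work.
        del self._queues[tenant]
        if queue:
            self._queues[tenant] = queue
        return tenant, request


def drain(admission):
    """Admit until the queue is empty and return the admission order."""
    order = []
    while (item := admission.admit()) is not None:
        order.append(item)
    return order


# A flooder enqueues four requests before the quiet tenant sends one.
rr = RoundRobinAdmission()
for i in range(1, 5):
    rr.submit("flooder", f"f{i}")
rr.submit("quiet", "q1")
order = drain(rr)
```

Under plain FIFO the quiet tenant's request would be admitted fifth, behind the whole flood. In this toy rotation it is admitted second, regardless of how deep the flooder's backlog is.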

Probe window: 2026-05-19 → 2026-05-23.

Project page

Tools and experiments

Smaller things, mostly the scaffolding behind the public work.

Inference Notes (Python)
Small scripts, traces, and notebook fragments behind the public research feed.

RSS for new entries: /rss.xml