LMCache solves a practical inference problem: many requests share the first part of the prompt, but the server recomputes prefill every time. That wastes GPU time.
If your workload has long system prompts, RAG templates, multi-turn conversations, agent tool descriptions, or fixed knowledge context, LMCache is worth evaluating. It moves temporary KV Cache out of the inference engine and turns it into a reusable, observable cache layer. The main target is lower TTFT, or time to first token.
Project:
|
|
Documentation:
|
|
Do You Need LMCache?
LMCache fits best when:
- Prompts are long and share large prefixes.
- RAG repeatedly injects similar documents or templates.
- Multi-turn conversations reuse long history.
- Agent tool descriptions and policy text are long.
- Multiple vLLM instances need shared cache.
- You care about TTFT, not only decode tokens/s.
If requests are short, share little prefix, or are bottlenecked by generation, the benefit may be small.
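One rough way to check the shared-prefix condition before installing anything is to measure how much leading text your logged prompts have in common. The sketch below is an illustration, not part of LMCache: the function name `shared_prefix_ratio` is hypothetical, and it compares characters as a stand-in for tokens, so treat the result as an approximation.

```python
from itertools import combinations
from os.path import commonprefix


def shared_prefix_ratio(prompts):
    """Average over all prompt pairs of (shared leading characters) /
    (length of the shorter prompt). Characters approximate tokens here."""
    pairs = list(combinations(prompts, 2))
    if not pairs:
        return 0.0
    ratios = []
    for a, b in pairs:
        shorter = min(len(a), len(b))
        if shorter == 0:
            ratios.append(0.0)
            continue
        # commonprefix works character by character on plain strings.
        ratios.append(len(commonprefix([a, b])) / shorter)
    return sum(ratios) / len(ratios)
```

A value near 1.0 on a sample of real prompts suggests that most of each prompt is a reusable prefix; a value near 0 suggests little for a KV cache layer to reuse.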
Start with vLLM MP Mode
Two common modes:
- MP mode: LMCache runs as a standalone service and vLLM connects through LMCacheMPConnector.
- In-process mode: LMCache runs inside the vLLM process through LMCacheConnectorV1.
Start with MP mode for practical use. It is easier to observe, manage, and share across instances.
Installation
|
|
Or:
|
|
Pin versions in production and check vLLM, LMCache, and connector compatibility together.
Start LMCache
|
|
--l1-size-gb 20 allocates 20 GB of local cache memory. --eviction-policy LRU evicts the least recently used data when that memory fills up. --chunk-size 16 keeps chunks small, which is useful for demos with short prompts; production usually keeps the default, such as 256.
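To see why chunk size matters for short demo prompts, here is a minimal sketch of the arithmetic. It assumes KV data is stored and reused only in whole chunks, so a trailing partial chunk of a shared prefix cannot be hit. The function name `cacheable_tokens` is hypothetical and not an LMCache API.

```python
def cacheable_tokens(shared_prefix_tokens, chunk_size):
    """Tokens of a shared prefix that fall into complete chunks.

    Assumption: only whole chunks are reusable, so the remainder
    after the last full chunk is always recomputed.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return (shared_prefix_tokens // chunk_size) * chunk_size


# A 1000-token shared prefix under two chunk sizes:
print(cacheable_tokens(1000, 256))  # 768
print(cacheable_tokens(1000, 16))   # 992
# A 100-token demo prompt never fills a 256-token chunk:
print(cacheable_tokens(100, 256))   # 0
```

Under this assumption, a small chunk size makes hits visible on tiny prompts, while a larger one wastes little once prefixes run to thousands of tokens.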
Default ports: ZMQ 5555, HTTP management and metrics 8080.
Start vLLM
|
|
For vLLM 0.20.0 or newer, prefer the LMCache-shipped connector:
|
|
Test Cache Hits
Send two requests with a shared prefix.
|
|
|
|
In the logs, the first request should show Stored ... tokens; the second should show Retrieved ... tokens.
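Besides reading the logs, you can time both requests. The helper below is a sketch that measures time to first token on any lazy iterator of streamed text pieces. It assumes the request is only sent when iteration begins, as with a generator wrapping a streaming HTTP response. The name `time_to_first_token` is hypothetical.

```python
import time


def time_to_first_token(stream):
    """Seconds from the start of iteration until the first non-empty
    piece arrives, or None if the stream ends without producing one."""
    start = time.perf_counter()
    for piece in stream:
        if piece:  # skip empty keep-alive or role-only chunks
            return time.perf_counter() - start
    return None
```

Run it once for each of the two shared-prefix requests. If the cache is working, the second measurement should be noticeably lower than the first on a long prefix.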
What to Watch
- Hit tokens and hit ratio.
- Whether cache comes from CPU RAM, local disk, or remote storage.
- Whether loading is cheaper than recomputing prefill.
- Whether chunk alignment limits the hit.
- Whether TTFT improves on real prompts.
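The first and third items in this list reduce to simple arithmetic. This sketch uses hypothetical helper names, and the inputs are numbers you would collect yourself from logs or metrics; nothing here calls LMCache.

```python
def hit_ratio(hit_tokens, prompt_tokens):
    """Fraction of prompt tokens served from cache (0.0 for an empty prompt)."""
    if prompt_tokens <= 0:
        return 0.0
    return hit_tokens / prompt_tokens


def seconds_saved(load_seconds, recompute_seconds):
    """Positive when loading cached KV is cheaper than recomputing prefill,
    negative when the cache tier is slower than the GPU."""
    return recompute_seconds - load_seconds
```

A high hit ratio with negative seconds saved usually means the cache is being served from a tier that is too slow for this prompt length, which is why the source of each hit is worth tracking.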
Best-Fit Workloads
Long system prompts, fixed RAG templates, agent tool descriptions, and multi-instance inference services are the easiest places to get value. Start with local CPU RAM before trying Redis, S3, NIXL, or other distributed backends. Prove that your prompts are reusable first.
In-Process Mode
For quick testing:
|
|
It is convenient, but the cache is tied to the lifetime of the vLLM process. MP mode is the better structure for long-term production use.
Production Checklist
- Compare TTFT with and without LMCache.
- Record prefill latency separately.
- Track prefix cache hit tokens and hit ratio.
- Watch LMCache memory stability.
- Test behavior after vLLM restarts.
- Check ZMQ, HTTP metrics, and logs under concurrency.
- Use real business prompts, not only demo prompts.
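For the first checklist item, compare distributions rather than single requests. The following sketch summarizes two lists of TTFT samples in seconds, one collected without LMCache and one with it, and reports the relative improvement. The function names are hypothetical, and the p95 uses a simple nearest-rank method.

```python
import math
import statistics


def summarize(samples):
    """Median and nearest-rank p95 of a non-empty list of TTFT samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"median": statistics.median(ordered), "p95": ordered[p95_index]}


def ttft_improvement(baseline, cached):
    """Relative reduction per statistic: 0.5 means TTFT was halved."""
    base, cache = summarize(baseline), summarize(cached)
    return {key: 1 - cache[key] / base[key] for key in base}


print(ttft_improvement([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 1.5, 2.0]))
# {'median': 0.5, 'p95': 0.5}
```

Collect both sample sets from the same real prompts under similar load; otherwise the comparison reflects traffic differences rather than the cache.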
Common Pitfalls
LMCache is not response caching; it caches the KV Cache, so the model still generates every response. Do not judge it only by tokens/s, because it mainly affects TTFT and prefill. Avoid tuning chunk size without measurements, mismatched vLLM, LMCache, and connector versions, and running without observability.
Summary
LMCache is useful when long prompts repeat. If you already use vLLM, start with MP mode locally, confirm hits with two shared-prefix requests, then compare TTFT on real traffic. It is worth production use only when the real hit ratio is high enough.