LMCache Practical Guide: Reusing KV Cache in vLLM Inference Services

A practical LMCache guide: when KV Cache reuse is worth adding, how to run LMCache with vLLM MP mode, how to test cache hits, and what to check before production, including memory, chunk size, hit ratio, and observability metrics.

LMCache solves a practical inference problem: many requests share the first part of the prompt, but the server recomputes prefill every time. That wastes GPU time.

If your workload has long system prompts, RAG templates, multi-turn conversations, agent tool descriptions, or fixed knowledge context, LMCache is worth evaluating. It moves temporary KV Cache out of the inference engine and turns it into a reusable, observable cache layer. The main target is lower TTFT, or time to first token.

Project:

https://github.com/LMCache/LMCache

Documentation:

https://docs.lmcache.ai/

Do You Need LMCache?

LMCache fits best when:

  • Prompts are long and share large prefixes.
  • RAG repeatedly injects similar documents or templates.
  • Multi-turn conversations reuse long history.
  • Agent tool descriptions and policy text are long.
  • Multiple vLLM instances need shared cache.
  • You care about TTFT, not only decode tokens/s.

If requests are short, share little prefix, or are bottlenecked by generation, the benefit may be small.
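One rough way to check the fit before installing anything is to measure how much of their leading text a sample of your prompts share. The sketch below is only an approximation: it splits on whitespace instead of using the model tokenizer, and the function names and sample prompts are made up for illustration.

```python
from itertools import combinations


def common_prefix_len(a, b):
    """Number of leading items two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def mean_shared_prefix_ratio(prompts, tokenize=str.split):
    """Average over all prompt pairs of shared-prefix length / shorter prompt length."""
    token_lists = [tokenize(p) for p in prompts]
    ratios = []
    for a, b in combinations(token_lists, 2):
        shorter = min(len(a), len(b))
        if shorter:
            ratios.append(common_prefix_len(a, b) / shorter)
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical prompts: two share a system prompt, one is unrelated.
sample = [
    "You are a support agent. Follow the policy. Question: reset password",
    "You are a support agent. Follow the policy. Question: refund status",
    "Translate to French: good morning",
]
print(f"mean shared-prefix ratio: {mean_shared_prefix_ratio(sample):.2f}")
```

A ratio near zero on real traffic suggests there is little prefix to reuse. For a closer estimate, pass the model's own tokenizer as `tokenize`.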

Start with vLLM MP Mode

Two common modes:

  • MP mode: LMCache runs as a standalone service and vLLM connects through LMCacheMPConnector.
  • In-process mode: LMCache runs inside the vLLM process through LMCacheConnectorV1.

Start with MP mode for practical use. It is easier to observe, manage, and share across instances.

Installation

uv venv --python 3.12
source .venv/bin/activate
uv pip install lmcache vllm

Or:

python -m venv .venv
source .venv/bin/activate
pip install lmcache vllm

Pin versions in production and check vLLM, LMCache, and connector compatibility together.
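To see what you would be pinning, the installed versions can be read from package metadata. This is a small sketch using only the standard library. It assumes the distribution names match the pip install names used above (`vllm` and `lmcache`), and the helper name is made up.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("vllm", "lmcache")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    # Print requirements-style pins that can be copied into a requirements file.
    print(f"{name}=={ver}" if ver else f"# {name} is not installed")
```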

Start LMCache

lmcache server \
  --l1-size-gb 20 \
  --eviction-policy LRU \
  --chunk-size 16

--l1-size-gb 20 allocates 20 GB of local cache memory. --eviction-policy LRU evicts the least recently used data when that memory fills up. --chunk-size 16 is convenient for demos because even short prompts fill whole chunks; production usually keeps the default, 256.
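To get a feel for how many tokens a cache of a given size can hold, the usual per-token KV size for standard attention is 2 (key and value) × layers × KV heads × head dimension × bytes per element. The sketch below applies that formula. The model dimensions are illustrative assumptions rather than values read from any model, so take the real ones from your model's config.json. It also treats 1 GB as 2**30 bytes and ignores any overhead LMCache adds.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_that_fit(cache_gb, bytes_per_token):
    # Treats 1 GB as 2**30 bytes; ignores allocator and metadata overhead.
    return int(cache_gb * 2**30) // bytes_per_token


# Illustrative dimensions only (assumed, not read from a model config).
per_token = kv_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128)
print(per_token, "bytes of KV cache per token")
print(tokens_that_fit(20, per_token), "tokens fit in a 20 GB cache")
```

Dividing the result by your typical reusable prefix length gives a rough count of how many distinct prefixes can stay resident before LRU eviction starts.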

Default ports: ZMQ 5555, HTTP management and metrics 8080.

Start vLLM

vllm serve Qwen/Qwen3-8B \
  --port 8000 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheMPConnector", "kv_role":"kv_both"}'

For vLLM 0.20.0 or newer, prefer the LMCache-shipped connector:

vllm serve Qwen/Qwen3-8B \
  --port 8000 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheMPConnector", "kv_connector_module_path":"lmcache.integration.vllm.lmcache_mp_connector", "kv_role":"kv_both"}'

Test Cache Hits

Send two requests with a shared prefix.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts",
    "max_tokens": 100,
    "temperature": 0.7
  }'

The second request extends the same prompt:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models",
    "max_tokens": 100,
    "temperature": 0.7
  }'

In the logs, the first request should produce a Stored ... tokens line and the second a Retrieved ... tokens line.
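Log lines confirm a hit, but the number that matters is TTFT. One way to measure it from the client side is to send a streaming request and time the arrival of the first non-empty line. The sketch below is a minimal version of that idea using only the standard library. It assumes the vLLM server from the steps above is listening on localhost:8000 and accepts the OpenAI-style `"stream": true` field. The helper names are made up for illustration.

```python
import json
import time
import urllib.request


def measure_ttft(open_stream, clock=time.perf_counter):
    """Seconds from issuing the request until the first non-empty streamed line."""
    start = clock()
    for line in open_stream():
        if line.strip():
            return clock() - start
    return None  # the stream ended without producing any data


def completion_stream(prompt, url="http://localhost:8000/v1/completions",
                      model="Qwen/Qwen3-8B"):
    """Return a zero-argument callable that opens a streaming completion request."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": 1, "stream": True}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})

    def open_stream():
        # The request is sent when iteration starts, after the timer is running.
        with urllib.request.urlopen(request) as response:
            yield from response

    return open_stream
```

With the server running, call `measure_ttft(completion_stream(prompt))` once for each of the two prompts above and compare the results. A single pair of measurements is noisy, so repeat the run several times before drawing conclusions.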

What to Watch

  • Hit tokens and hit ratio.
  • Whether cache comes from CPU RAM, local disk, or remote storage.
  • Whether loading is cheaper than recomputing prefill.
  • Whether chunk alignment limits the hit.
  • Whether TTFT improves on real prompts.
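The chunk alignment point can be made concrete with a little arithmetic. The sketch below assumes that only complete chunks of the shared prefix are reusable, which is a simplification; actual behavior may depend on the LMCache version and configuration. The token counts are hypothetical.

```python
def reusable_tokens(shared_prefix_tokens, chunk_size):
    # Round the shared prefix down to a whole number of chunks.
    return (shared_prefix_tokens // chunk_size) * chunk_size


def hit_ratio(hit_tokens, prompt_tokens):
    # Fraction of the prompt whose KV cache was served from the cache.
    return hit_tokens / prompt_tokens if prompt_tokens else 0.0


shared, prompt_len = 300, 340  # hypothetical token counts
for chunk in (16, 256):
    hit = reusable_tokens(shared, chunk)
    print(f"chunk={chunk}: hit_tokens={hit}, "
          f"hit_ratio={hit_ratio(hit, prompt_len):.2f}")
```

Under this assumption, a shared prefix shorter than one chunk yields no hit at all, which is why very short demo prompts need a small chunk size to show any reuse.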

Best-Fit Workloads

Long system prompts, fixed RAG templates, agent tool descriptions, and multi-instance inference services are the easiest places to get value. Start with local CPU RAM before trying Redis, S3, NIXL, or other distributed backends. Prove that your prompts are reusable first.

In-Process Mode

For quick testing:

LMCACHE_CHUNK_SIZE=8 vllm serve Qwen/Qwen3-8B \
  --port 8000 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'

It is convenient, but the cache lives and dies with the vLLM process. MP mode is the better structure for long-running production use.

Production Checklist

  • Compare TTFT with and without LMCache.
  • Record prefill latency separately.
  • Track prefix cache hit tokens and hit ratio.
  • Watch LMCache memory stability.
  • Test behavior after vLLM restarts.
  • Check ZMQ, HTTP metrics, and logs under concurrency.
  • Use real business prompts, not only demo prompts.
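For the first item on the checklist, a comparison can be as simple as collecting TTFT samples with and without LMCache and looking at the median and the tail. The sketch below shows one possible shape for that comparison. The sample values are made up purely to demonstrate the output and are not measurements. The p95 uses the nearest-rank method, and both sample lists are assumed to be non-empty.

```python
import statistics


def summarize(samples_ms):
    """Median and nearest-rank 95th percentile of TTFT samples in milliseconds."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return {"p50": statistics.median(ordered), "p95": ordered[rank - 1]}


def compare(baseline_ms, cached_ms):
    """Side-by-side summary with the speedup factor for each percentile."""
    base, cached = summarize(baseline_ms), summarize(cached_ms)
    return {
        key: {"baseline": base[key], "cached": cached[key],
              "speedup": base[key] / cached[key]}
        for key in ("p50", "p95")
    }


# Made-up numbers, only to show the shape of the result.
baseline = [820, 790, 845, 810, 900]
cached = [210, 190, 640, 205, 220]
for key, row in compare(baseline, cached).items():
    print(key, row)
```

Looking at p95 alongside the median matters because cache misses show up in the tail: a good median can hide a share of requests that still pay full prefill cost.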

Common Pitfalls

LMCache is not response caching; it caches KV Cache. Do not judge it only by tokens/s, because it mainly affects TTFT and prefill. Avoid random chunk-size tuning, version mismatch, and running without observability.

Summary

LMCache is useful when long prompts repeat. If you already use vLLM, start with MP mode locally, confirm hits with two shared-prefix requests, then compare TTFT on real traffic. It is worth production use only when the real hit ratio is high enough.
