<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>LMCache on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/lmcache/</link>
        <description>Recent content in LMCache on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Thu, 25 Jun 2026 09:14:41 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/lmcache/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>LMCache Practical Guide: Reusing KV Cache in vLLM Inference Services</title>
        <link>https://knightli.com/en/2026/06/25/lmcache-vllm-kv-cache-guide/</link>
        <pubDate>Thu, 25 Jun 2026 09:14:41 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/06/25/lmcache-vllm-kv-cache-guide/</guid>
<description>&lt;p&gt;LMCache solves a practical inference problem: many requests share the same prompt prefix, yet the server recomputes prefill for that prefix on every request, which wastes GPU time.&lt;/p&gt;
&lt;p&gt;If your workload has long system prompts, RAG templates, multi-turn conversations, agent tool descriptions, or fixed knowledge context, LMCache is worth evaluating. It moves the KV Cache, which is normally temporary state inside the inference engine, into a reusable, observable cache layer. The main goal is a lower time to first token (TTFT).&lt;/p&gt;
&lt;p&gt;Project:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://github.com/LMCache/LMCache
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Documentation:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;https://docs.lmcache.ai/
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;do-you-need-lmcache&#34;&gt;Do You Need LMCache?
&lt;/h2&gt;&lt;p&gt;LMCache fits best when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prompts are long and share large prefixes.&lt;/li&gt;
&lt;li&gt;RAG repeatedly injects similar documents or templates.&lt;/li&gt;
&lt;li&gt;Multi-turn conversations reuse long history.&lt;/li&gt;
&lt;li&gt;Agent tool descriptions and policy text are long.&lt;/li&gt;
&lt;li&gt;Multiple vLLM instances need shared cache.&lt;/li&gt;
&lt;li&gt;You care about TTFT, not only decode tokens/s.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If requests are short, share little of their prefix, or are bottlenecked by decode rather than prefill, the benefit may be small.&lt;/p&gt;
&lt;h2 id=&#34;start-with-vllm-mp-mode&#34;&gt;Start with vLLM MP Mode
&lt;/h2&gt;&lt;p&gt;LMCache integrates with vLLM in two common modes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MP (multiprocess) mode: LMCache runs as a standalone service and vLLM connects to it through &lt;code&gt;LMCacheMPConnector&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In-process mode: LMCache runs inside the vLLM process through &lt;code&gt;LMCacheConnectorV1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For practical use, start with MP mode: a standalone cache service is easier to observe and manage, and several vLLM instances can share it.&lt;/p&gt;
&lt;h2 id=&#34;installation&#34;&gt;Installation
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv venv --python 3.12
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;uv pip install lmcache vllm
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Or:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python -m venv .venv
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install lmcache vllm
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Pin versions in production and check vLLM, LMCache, and connector compatibility together.&lt;/p&gt;
&lt;h2 id=&#34;start-lmcache&#34;&gt;Start LMCache
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;lmcache server --l1-size-gb &lt;span class=&#34;m&#34;&gt;20&lt;/span&gt; --eviction-policy LRU --chunk-size &lt;span class=&#34;m&#34;&gt;16&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;code&gt;--l1-size-gb 20&lt;/code&gt; sets the size of the local cache to 20 GB. &lt;code&gt;--eviction-policy LRU&lt;/code&gt; evicts the least recently used data first when the cache is full. &lt;code&gt;--chunk-size 16&lt;/code&gt; is useful for demos, because short prompts then still fill whole chunks; production usually keeps a larger value, such as the default of 256.&lt;/p&gt;
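&lt;p&gt;To judge whether a 20 GB budget is large enough, a rough capacity estimate helps. The sketch below assumes an unquantized KV Cache, where each token stores one key and one value vector per layer and per KV head. The model dimensions (36 layers, 8 KV heads, head size 128, 2 bytes per value) are assumptions for Qwen3-8B in bf16 and should be checked against the model&#39;s &lt;code&gt;config.json&lt;/code&gt;. The helper names are illustrative, and the budget is treated as GiB.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Rough KV Cache capacity estimate (sketch; all dimensions are assumptions).
kv_bytes_per_token() {
  # args: layers kv_heads head_dim bytes_per_value
  # The factor 2 accounts for storing both K and V.
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

tokens_in_budget() {
  # args: budget_gib bytes_per_token (integer division rounds down)
  echo $(( $1 * 1024 * 1024 * 1024 / $2 ))
}

per_token=$(kv_bytes_per_token 36 8 128 2)
echo "KV bytes per token: $per_token"
echo "Tokens in 20 GiB: $(tokens_in_budget 20 "$per_token")"
&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;With these assumed numbers each token takes 147456 bytes (144 KiB), so 20 GiB holds roughly 145,000 tokens of KV Cache before any overhead that LMCache itself adds.&lt;/p&gt;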
&lt;p&gt;By default the server listens on port &lt;code&gt;5555&lt;/code&gt; for ZMQ and on port &lt;code&gt;8080&lt;/code&gt; for HTTP management and metrics.&lt;/p&gt;
&lt;h2 id=&#34;start-vllm&#34;&gt;Start vLLM
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;vllm serve Qwen/Qwen3-8B --port &lt;span class=&#34;m&#34;&gt;8000&lt;/span&gt; --kv-transfer-config &lt;span class=&#34;s1&#34;&gt;&amp;#39;{&amp;#34;kv_connector&amp;#34;:&amp;#34;LMCacheMPConnector&amp;#34;, &amp;#34;kv_role&amp;#34;:&amp;#34;kv_both&amp;#34;}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For vLLM 0.20.0 or newer, prefer the LMCache-shipped connector:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;vllm serve Qwen/Qwen3-8B --port &lt;span class=&#34;m&#34;&gt;8000&lt;/span&gt; --kv-transfer-config &lt;span class=&#34;s1&#34;&gt;&amp;#39;{&amp;#34;kv_connector&amp;#34;:&amp;#34;LMCacheMPConnector&amp;#34;, &amp;#34;kv_connector_module_path&amp;#34;:&amp;#34;lmcache.integration.vllm.lmcache_mp_connector&amp;#34;, &amp;#34;kv_role&amp;#34;:&amp;#34;kv_both&amp;#34;}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;test-cache-hits&#34;&gt;Test Cache Hits
&lt;/h2&gt;&lt;p&gt;Send two requests whose prompts share a prefix. In the example below, the second prompt extends the first.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl http://localhost:8000/v1/completions -H &lt;span class=&#34;s2&#34;&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; -d &lt;span class=&#34;s1&#34;&gt;&amp;#39;{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;    &amp;#34;model&amp;#34;: &amp;#34;Qwen/Qwen3-8B&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;    &amp;#34;prompt&amp;#34;: &amp;#34;Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;    &amp;#34;max_tokens&amp;#34;: 100,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;    &amp;#34;temperature&amp;#34;: 0.7
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;  }&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;curl http://localhost:8000/v1/completions -H &lt;span class=&#34;s2&#34;&gt;&amp;#34;Content-Type: application/json&amp;#34;&lt;/span&gt; -d &lt;span class=&#34;s1&#34;&gt;&amp;#39;{
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;    &amp;#34;model&amp;#34;: &amp;#34;Qwen/Qwen3-8B&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;    &amp;#34;prompt&amp;#34;: &amp;#34;Qwen3 is the latest generation of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;    &amp;#34;max_tokens&amp;#34;: 100,
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;    &amp;#34;temperature&amp;#34;: 0.7
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s1&#34;&gt;  }&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After the first request the logs should show &lt;code&gt;Stored ... tokens&lt;/code&gt;; after the second they should show &lt;code&gt;Retrieved ... tokens&lt;/code&gt;, which confirms that the shared prefix was served from the cache.&lt;/p&gt;
&lt;h2 id=&#34;what-to-watch&#34;&gt;What to Watch
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Hit tokens and hit ratio.&lt;/li&gt;
&lt;li&gt;Whether hits are served from CPU RAM, local disk, or remote storage.&lt;/li&gt;
&lt;li&gt;Whether loading cached KV is actually faster than recomputing prefill.&lt;/li&gt;
&lt;li&gt;Whether chunk alignment limits how much of a shared prefix can hit.&lt;/li&gt;
&lt;li&gt;Whether TTFT improves on real prompts.&lt;/li&gt;
&lt;/ul&gt;
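&lt;p&gt;The chunk alignment point can be made concrete with a little arithmetic. This sketch assumes that only complete chunks are stored and matched, so a shared prefix is reusable up to the last full chunk boundary. The function name and the token counts are illustrative.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Tokens of a shared prefix that can hit, assuming only full chunks are matched.
reusable_tokens() {
  # args: shared_prefix_tokens chunk_size
  # Integer division drops the partial chunk at the end of the prefix.
  echo $(( $1 / $2 * $2 ))
}

for chunk in 16 256; do
  echo "prefix=300 chunk=$chunk reusable=$(reusable_tokens 300 "$chunk")"
done
echo "prefix=30 chunk=256 reusable=$(reusable_tokens 30 256)"
&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Under this assumption a 300-token shared prefix reuses 288 tokens with a chunk size of 16 and 256 tokens with a chunk size of 256, while a 30-token prefix reuses nothing at 256. That is why short demo prompts need a small chunk size to show any hit.&lt;/p&gt;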
&lt;h2 id=&#34;best-fit-workloads&#34;&gt;Best-Fit Workloads
&lt;/h2&gt;&lt;p&gt;Long system prompts, fixed RAG templates, agent tool descriptions, and multi-instance inference services are the easiest places to get value. Start with local CPU RAM and prove that your prompts are actually reusable before trying Redis, S3, NIXL, or other distributed backends.&lt;/p&gt;
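&lt;p&gt;One cheap way to check reusability before touching any backend is to measure how much of two real prompts is identical from the start. The sketch below counts the common prefix in characters, which is only a rough proxy for shared tokens. The function name and the sample strings are illustrative.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Length in characters of the longest common prefix of two strings (sketch).
# The body runs in a subshell so its variables do not leak into the caller.
common_prefix_len() (
  a=$1
  b=$2
  n=0
  while [ -n "$a" ]; do
    [ -n "$b" ] || break
    # Take the first character of each remaining string.
    ca=${a%"${a#?}"}
    cb=${b%"${b#?}"}
    [ "$ca" = "$cb" ] || break
    # Drop the matched character and count it.
    a=${a#?}
    b=${b#?}
    n=$(( n + 1 ))
  done
  echo "$n"
)

common_prefix_len "You are a support agent. Question: A" "You are a support agent. Question: B"
&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the shared part is short compared with the whole prompt, a KV cache layer has little to reuse, whichever backend stores it.&lt;/p&gt;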
&lt;h2 id=&#34;in-process-mode&#34;&gt;In-Process Mode
&lt;/h2&gt;&lt;p&gt;For quick testing, run LMCache inside the vLLM process:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;LMCACHE_CHUNK_SIZE&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;8&lt;/span&gt; vllm serve Qwen/Qwen3-8B --port &lt;span class=&#34;m&#34;&gt;8000&lt;/span&gt; --kv-transfer-config &lt;span class=&#34;s1&#34;&gt;&amp;#39;{&amp;#34;kv_connector&amp;#34;:&amp;#34;LMCacheConnectorV1&amp;#34;, &amp;#34;kv_role&amp;#34;:&amp;#34;kv_both&amp;#34;}&amp;#39;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This is convenient, but the cache is tied to the lifetime of the vLLM process. MP mode is the better structure for a long-running production deployment.&lt;/p&gt;
&lt;h2 id=&#34;production-checklist&#34;&gt;Production Checklist
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Compare TTFT with and without LMCache.&lt;/li&gt;
&lt;li&gt;Record prefill latency separately.&lt;/li&gt;
&lt;li&gt;Track prefix cache hit tokens and hit ratio.&lt;/li&gt;
&lt;li&gt;Watch LMCache memory usage for stability over time.&lt;/li&gt;
&lt;li&gt;Test behavior after vLLM restarts.&lt;/li&gt;
&lt;li&gt;Check the ZMQ connection, HTTP metrics, and logs under concurrent load.&lt;/li&gt;
&lt;li&gt;Use real business prompts, not only demo prompts.&lt;/li&gt;
&lt;/ul&gt;
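&lt;p&gt;The TTFT comparison and the hit ratio in this checklist each reduce to a simple percentage. The sketch below computes both with &lt;code&gt;awk&lt;/code&gt;, which handles the floating-point math. The function names and sample numbers are hypothetical; real inputs would come from your own measurements and metrics.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Hit ratio: percentage of prompt tokens served from cache.
hit_ratio() {
  # args: hit_tokens prompt_tokens
  awk -v h="$1" -v t="$2" 'BEGIN { if (t == 0) { print "n/a"; exit } printf "%.1f\n", 100 * h / t }'
}

# TTFT reduction: percentage saved relative to the baseline without LMCache.
ttft_reduction() {
  # args: baseline_seconds cached_seconds
  awk -v b="$1" -v c="$2" 'BEGIN { if (b == 0) { print "n/a"; exit } printf "%.1f\n", 100 * (b - c) / b }'
}

# Hypothetical sample values, not measurements.
hit_ratio 288 300
ttft_reduction 0.80 0.20
&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Tracking both numbers together is useful because a high hit ratio with little TTFT reduction suggests that loading cached KV is not much cheaper than recomputing prefill.&lt;/p&gt;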
&lt;h2 id=&#34;common-pitfalls&#34;&gt;Common Pitfalls
&lt;/h2&gt;&lt;p&gt;LMCache is not a response cache: it stores the KV Cache, not generated text. Do not judge it only by tokens/s, because it mainly affects TTFT and prefill. Avoid tuning &lt;code&gt;chunk-size&lt;/code&gt; at random, mismatched vLLM and LMCache versions, and running without observability.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;LMCache is useful when long prompts repeat. If you already use vLLM, start with MP mode locally, confirm hits with two shared-prefix requests, then compare TTFT on real traffic. It is worth running in production only when the hit ratio on real traffic is high enough to lower TTFT.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
