<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Multi-GPU on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/multi-gpu/</link>
        <description>Recent content in Multi-GPU on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Sat, 09 May 2026 15:05:41 +0800</lastBuildDate>
        <atom:link href="https://knightli.com/en/tags/multi-gpu/index.xml" rel="self" type="application/rss+xml" />
        <item>
        <title>A Practical llama.cpp Multi-GPU Benchmarking Approach: Is 2x V100 16GB Faster Than One 32GB Card?</title>
        <link>https://knightli.com/en/2026/05/09/llama-cpp-multi-gpu-offload-performance/</link>
        <pubDate>Sat, 09 May 2026 15:05:41 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/09/llama-cpp-multi-gpu-offload-performance/</guid>
        <description>&lt;p&gt;Short version: adding a second card does not give llama.cpp free performance. If the model already fits fully on one 32GB GPU, 2x V100 16GB is often less convenient than a single 32GB card and may even be slower. If the model does not fit on one 16GB card, the main value of dual GPUs is that the whole model can stay in VRAM, and that benefit can be substantial.&lt;/p&gt;
&lt;h2 id=&#34;first-understand-split-mode&#34;&gt;First, Understand split mode
&lt;/h2&gt;&lt;p&gt;llama.cpp multi-GPU usage mainly revolves around &lt;code&gt;--split-mode&lt;/code&gt; and &lt;code&gt;--tensor-split&lt;/code&gt;. When discussing performance, distinguish these modes first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;layer&lt;/code&gt;: splits layers across GPUs. It is usually the most compatible starting point.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tensor&lt;/code&gt;: splits tensor computation across multiple GPUs. It is closer to true parallel compute, but depends more heavily on inter-GPU bandwidth and backend support.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;row&lt;/code&gt;: an older row-splitting mode that still appears in some setups, but is usually not the first choice for new deployments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In simple terms, &lt;code&gt;layer&lt;/code&gt; is like putting different floors on different cards. During single-token generation, it may not keep both cards fully busy at the same time. &lt;code&gt;tensor&lt;/code&gt; is more like letting both cards work on the same layer together. It has more theoretical parallelism, but inter-GPU communication can become the bottleneck.&lt;/p&gt;
&lt;h2 id=&#34;if-one-32gb-card-can-fit-the-model-dual-16gb-is-not-always-faster&#34;&gt;If One 32GB Card Can Fit the Model, Dual 16GB Is Not Always Faster
&lt;/h2&gt;&lt;p&gt;If the model and KV cache fit fully on one 32GB GPU, a single card is usually steadier and often faster. For hardware in the same generation, such as 1x V100 32GB versus 2x V100 16GB, the dual-card setup does not necessarily win.&lt;/p&gt;
&lt;p&gt;A conservative expectation is that 2x V100 16GB may be 10% to 40% slower than one V100 32GB, especially for single-user chat, Continue Agent, and code Q&amp;amp;A workloads where one request is mainly generating one answer.&lt;/p&gt;
&lt;p&gt;The reason is straightforward: multi-GPU does not simply merge VRAM into one fast pool. With layer splitting, inference moves across GPUs and one card may wait for the other during token generation. With tensor splitting, both cards can compute together, but intermediate results need cross-GPU synchronization, so bandwidth and latency directly affect throughput.&lt;/p&gt;
&lt;p&gt;So if your choice is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1x V100 32GB&lt;/li&gt;
&lt;li&gt;2x V100 16GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;and the target model already fits fully on one 32GB card, the single 32GB card is often the more comfortable option.&lt;/p&gt;
&lt;h2 id=&#34;if-one-16gb-card-cannot-fit-the-model-dual-cards-matter&#34;&gt;If One 16GB Card Cannot Fit the Model, Dual Cards Matter
&lt;/h2&gt;&lt;p&gt;The situation changes completely when the model does not fit on one 16GB card but can fit across two 16GB cards.&lt;/p&gt;
&lt;p&gt;In that case, the value of dual GPUs is very direct:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One 16GB card: may require heavy CPU offload, which can slow things down a lot.&lt;/li&gt;
&lt;li&gt;2x 16GB cards: weights can stay mostly on GPU, which may be much faster than mixed CPU/GPU execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this scenario, 2x V100 16GB is not guaranteed to beat one 32GB card, but it may be several times faster than a single 16GB card with heavy system-memory offload. In other words, the primary value of dual cards is not acceleration but keeping model weights out of slower system RAM.&lt;/p&gt;
&lt;h2 id=&#34;v100-pcie-and-v100-sxm2-are-very-different&#34;&gt;V100 PCIe and V100 SXM2 Are Very Different
&lt;/h2&gt;&lt;p&gt;The easiest thing to overlook in multi-GPU inference is the interconnect.&lt;/p&gt;
&lt;p&gt;If you have V100 SXM2 cards with NVLink, cross-GPU communication bandwidth is much higher: NVIDIA&amp;rsquo;s V100 datasheet lists NVLink interconnect bandwidth of up to 300 GB/s. In that environment, &lt;code&gt;tensor&lt;/code&gt; mode or higher-batch workloads have a better chance of approaching or exceeding single-card performance.&lt;/p&gt;
&lt;p&gt;If you have V100 PCIe cards, expectations should be more conservative. They communicate over PCIe Gen3, for which the listed interconnect bandwidth is 32 GB/s, roughly a tenth of the NVLink figure. That is why dual PCIe cards often provide enough VRAM without doubling speed.&lt;/p&gt;
&lt;p&gt;So when judging whether 2x V100 16GB is worthwhile, do not only add the VRAM to 32GB. Also check whether the cards are PCIe or SXM2/NVLink.&lt;/p&gt;
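&lt;p&gt;One way to check the interconnect on an NVIDIA system is to ask the driver for the GPU topology. This is a sketch that assumes &lt;code&gt;nvidia-smi&lt;/code&gt; is installed and that both cards are visible to the driver; the exact output layout varies by driver version.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Print the GPU-to-GPU connection matrix
nvidia-smi topo -m

# Show per-link NVLink state, if the cards have NVLink at all
nvidia-smi nvlink --status
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the matrix, entries such as &lt;code&gt;NV1&lt;/code&gt; or &lt;code&gt;NV2&lt;/code&gt; indicate NVLink connections, while entries such as &lt;code&gt;PIX&lt;/code&gt;, &lt;code&gt;PHB&lt;/code&gt;, or &lt;code&gt;SYS&lt;/code&gt; indicate paths that go over PCIe.&lt;/p&gt;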
&lt;h2 id=&#34;a-practical-buying-rule&#34;&gt;A Practical Buying Rule
&lt;/h2&gt;&lt;p&gt;If the model fits on one 32GB GPU, choose the single card first. Its latency, stability, and tuning cost are usually better.&lt;/p&gt;
&lt;p&gt;If the model does not fit on one 16GB GPU but can fit on two 16GB GPUs, dual cards are worth using. At that point, the goal is to keep weights on GPU as much as possible, not to expect linear performance scaling.&lt;/p&gt;
&lt;p&gt;If you have dual V100 PCIe cards, start with &lt;code&gt;--split-mode layer&lt;/code&gt; and aim for stable execution with less CPU fallback.&lt;/p&gt;
&lt;p&gt;If you have V100 SXM2 cards with NVLink, it is more worthwhile to benchmark &lt;code&gt;tensor&lt;/code&gt; mode, especially for prefill, larger batches, or concurrent serving.&lt;/p&gt;
&lt;h2 id=&#34;when-to-buy-2x16gb-and-when-to-buy-1x32gb&#34;&gt;When to Buy 2x16GB and When to Buy 1x32GB
&lt;/h2&gt;&lt;p&gt;If you serve only one user and mainly do chat, code completion, Continue Agent, or long-context Q&amp;amp;A, and the target model fits within 32GB, 1x32GB is usually the better choice. It avoids cross-GPU scheduling, has steadier latency, and is easier to debug.&lt;/p&gt;
&lt;p&gt;If you already own one 16GB card and want a lower-cost path to run 30B or 32B models, or higher-bit quantizations, 2x16GB makes sense. It may not double tokens per second, but it can keep weights on GPU that would otherwise be offloaded to the CPU.&lt;/p&gt;
&lt;p&gt;If you are buying from scratch, the priority can look like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single model, single user, latency-sensitive: prefer 1x32GB.&lt;/li&gt;
&lt;li&gt;Model does not fit on one card and budget is limited: consider 2x16GB.&lt;/li&gt;
&lt;li&gt;Machine has NVLink or SXM2: 2x16GB is much more interesting than ordinary PCIe dual cards.&lt;/li&gt;
&lt;li&gt;You want longer context later: do not only count model weights; reserve VRAM for KV cache too.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;practical-advice-for-layer-split-and-tensor-split&#34;&gt;Practical Advice for layer split and tensor split
&lt;/h2&gt;&lt;p&gt;The practical rule is: start with &lt;code&gt;layer&lt;/code&gt;, then benchmark &lt;code&gt;tensor&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;layer&lt;/code&gt; is the default starting point. It splits the model by layer, has better compatibility, and is friendlier to PCIe dual-card systems. The downside is that generation can behave more like a pipeline: at certain moments one card is busy while the other waits.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;tensor&lt;/code&gt; is better suited to machines with strong interconnects, such as V100 SXM2/NVLink. It splits part of the same layer&amp;rsquo;s computation across GPUs, so it has more parallelism in theory, but it also synchronizes across cards more often. On PCIe dual cards, communication overhead may eat the benefit.&lt;/p&gt;
&lt;p&gt;You can start with these tests:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode layer --tensor-split 1/1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode tensor --tensor-split 1/1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode layer --tensor-split 1/0
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;In &lt;code&gt;llama-bench&lt;/code&gt;, commas separate alternative values to test, so the per-GPU ratios of one split are written with &lt;code&gt;/&lt;/code&gt;. The third command is not meant as the long-term configuration. It puts every layer on the first card to give you a single-card reference, so you can see whether dual GPUs are actually faster or only distributing VRAM pressure.&lt;/p&gt;
&lt;h2 id=&#34;why-prefill-and-decode-behave-differently&#34;&gt;Why prefill and decode Behave Differently
&lt;/h2&gt;&lt;p&gt;Local LLM performance should usually be viewed in two stages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;prefill&lt;/code&gt;: processes the input prompt. A typical metric is prompt-processing throughput such as &lt;code&gt;pp512&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;decode&lt;/code&gt;: generates the response token by token. A typical metric is token-generation throughput such as &lt;code&gt;tg128&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;prefill&lt;/code&gt; is more like large-batch matrix computation. With larger batches, it is easier to keep GPUs busy and more likely to benefit from multi-GPU parallelism. &lt;code&gt;decode&lt;/code&gt; generates one token after another. The batch is smaller and synchronization is more frequent, so cross-card communication and scheduling latency are easier to notice.&lt;/p&gt;
&lt;p&gt;That is why you may see dual GPUs improve &lt;code&gt;pp512&lt;/code&gt; while &lt;code&gt;tg128&lt;/code&gt; barely improves or even gets worse. For chat and agent workflows, user experience is closer to &lt;code&gt;tg128&lt;/code&gt;. For long document ingestion, batch prefill, or concurrent serving, &lt;code&gt;pp512&lt;/code&gt; also matters.&lt;/p&gt;
&lt;h2 id=&#34;can-kv-cache-become-a-second-vram-bottleneck&#34;&gt;Can KV cache Become a Second VRAM Bottleneck?
&lt;/h2&gt;&lt;p&gt;Yes. Many people only count model weights and forget KV cache.&lt;/p&gt;
&lt;p&gt;Model weights decide whether the model can load. KV cache decides whether you can use the context length you want. The longer the context, the higher the concurrency, and the larger the batch, the more visible KV cache usage becomes. You may find that the model itself fits in 32GB, but 32K or 64K context pushes VRAM over the limit.&lt;/p&gt;
&lt;p&gt;At minimum, leave VRAM headroom for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;KV cache&lt;/li&gt;
&lt;li&gt;CUDA graph or backend runtime overhead&lt;/li&gt;
&lt;li&gt;prompt batch and ubatch&lt;/li&gt;
&lt;li&gt;desktop, driver, and other process usage&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you use 2x16GB, VRAM is not a fully equivalent 32GB pool. Some buffers, KV cache, or intermediate tensors may still be limited by remaining memory on a single card. When testing long context, use the target &lt;code&gt;--ctx-size&lt;/code&gt; and target concurrency directly instead of only checking whether the model starts.&lt;/p&gt;
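&lt;p&gt;A rough KV cache estimate helps when planning that headroom. The sketch below assumes a standard transformer with grouped-query attention and an unquantized 16-bit KV cache. The model dimensions are hypothetical values chosen for illustration, so read the real ones from your model&amp;rsquo;s metadata.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;KV cache bytes = 2 (K and V) * layers * KV heads * head dim * context tokens * bytes per value

Hypothetical model: 64 layers, 8 KV heads, head dim 128, 2 bytes per value
Per token:   2 * 64 * 8 * 128 * 2 = 262,144 bytes = 256 KiB
32K context: 32,768 * 256 KiB     = 8 GiB
64K context: 65,536 * 256 KiB     = 16 GiB
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Under these assumptions, doubling the context from 32K to 64K adds another 8 GiB, before any runtime buffers are counted.&lt;/p&gt;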
&lt;h2 id=&#34;how-to-benchmark-dual-cards-with-llama-bench&#34;&gt;How to Benchmark Dual Cards with llama-bench
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;llama-bench&lt;/code&gt; is better suited to hardware comparison than chatting directly, because it reports prompt processing and token generation as separate, comparable metrics. The simplest invocation, which runs the default tests, is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-bench -m model.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For dual V100 cards, test at least these sets:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Single-card baseline&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Dual-card layer split (llama-bench separates per-GPU ratios with /)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode layer --tensor-split 1/1
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Dual-card tensor split&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 llama-bench -m model.gguf -ngl &lt;span class=&#34;m&#34;&gt;99&lt;/span&gt; --split-mode tensor --tensor-split 1/1
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Focus on two columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pp512&lt;/code&gt;: prompt processing, more relevant to long inputs and batch prefill.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tg128&lt;/code&gt;: token generation, more relevant to single-user chat and agent responsiveness.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Keep the model, quantization, context length, batch settings, driver version, and llama.cpp version fixed. Run each group several times and compare medians rather than one-off results. Finally, test your real workflow too, such as Continue Agent, an OpenAI-compatible server, or your own RAG requests, because a good benchmark result does not always mean a better interactive experience.&lt;/p&gt;
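&lt;p&gt;Pinning the test sizes and repetition count explicitly makes separate runs easier to compare. The following is a sketch, assuming a &lt;code&gt;llama-bench&lt;/code&gt; build that supports the &lt;code&gt;-p&lt;/code&gt;, &lt;code&gt;-n&lt;/code&gt;, &lt;code&gt;-r&lt;/code&gt;, and &lt;code&gt;-o&lt;/code&gt; options; &lt;code&gt;model.gguf&lt;/code&gt; and the output file name are placeholders. &lt;code&gt;llama-bench&lt;/code&gt; summarizes repetitions as an average with a standard deviation, so comparing medians means collecting results from several invocations yourself.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# 512-token prompt test, 128-token generation test, 5 repetitions each, CSV output
CUDA_VISIBLE_DEVICES=0,1 llama-bench -m model.gguf -ngl 99 -p 512 -n 128 -r 5 --split-mode layer -o csv | tee layer-split.csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Saving each configuration to its own file keeps the single-card and dual-card numbers side by side for later comparison.&lt;/p&gt;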
&lt;h2 id=&#34;one-sentence-conclusion&#34;&gt;One-Sentence Conclusion
&lt;/h2&gt;&lt;p&gt;The main advantage of 2x V100 16GB is VRAM capacity, not guaranteed generation speed. If the model fits on one card, a single 32GB card is usually faster and steadier. If the model does not fit on one 16GB card, dual 16GB cards become valuable because they avoid heavy CPU offload. Whether they are faster depends on split mode, batch size, model size, and whether the two V100 cards are connected through PCIe or NVLink.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp server README&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.mintlify.com/ggml-org/llama.cpp/concepts/backends&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;llama.cpp Compute Backends&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://www.nvidia.com/en-gb/data-center/tesla-v100/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA Tesla V100&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet.pdf&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;NVIDIA V100 Datasheet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings</title>
        <link>https://knightli.com/en/2026/04/19/ollama-multiple-gpu-notes/</link>
        <pubDate>Sun, 19 Apr 2026 00:18:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/19/ollama-multiple-gpu-notes/</guid>
        <description>&lt;p&gt;When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?&lt;/p&gt;
&lt;p&gt;This note summarizes how Ollama behaves with multiple GPUs. The short version:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ollama supports multiple GPUs.&lt;/li&gt;
&lt;li&gt;The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.&lt;/li&gt;
&lt;li&gt;By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.&lt;/li&gt;
&lt;li&gt;If a model does not fit on one GPU, Ollama can spread it across available GPUs.&lt;/li&gt;
&lt;li&gt;GPUs of different models can be detected and used together, but performance and placement may not be ideal.&lt;/li&gt;
&lt;li&gt;SLI / NVLink is not required for multi-GPU use.&lt;/li&gt;
&lt;li&gt;To limit which GPUs Ollama can use, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt;, &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt;, or &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;official-behavior-single-gpu-first-multi-gpu-when-needed&#34;&gt;Official Behavior: Single GPU First, Multi-GPU When Needed
&lt;/h2&gt;&lt;p&gt;Ollama&amp;rsquo;s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.&lt;/p&gt;
&lt;p&gt;The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.&lt;/p&gt;
&lt;p&gt;So do not think of Ollama multi-GPU as &amp;ldquo;more cards automatically means several times faster.&amp;rdquo; A more accurate model is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small model fits on one GPU: usually runs on one GPU.&lt;/li&gt;
&lt;li&gt;Large model does not fit on one GPU: split across multiple GPUs.&lt;/li&gt;
&lt;li&gt;Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use this command to see where the model is loaded:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The &lt;code&gt;PROCESSOR&lt;/code&gt; column may show something like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;100% GPU
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;48%/52% CPU/GPU
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;100% CPU
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you see &lt;code&gt;48%/52% CPU/GPU&lt;/code&gt;, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.&lt;/p&gt;
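&lt;p&gt;&lt;code&gt;ollama ps&lt;/code&gt; reports the CPU/GPU split but not how memory is distributed across individual cards. A sketch for checking that, assuming an NVIDIA system with &lt;code&gt;nvidia-smi&lt;/code&gt; available:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Which models are loaded, and the CPU/GPU split
ollama ps

# Memory in use on each GPU
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If only one GPU shows a large &lt;code&gt;memory.used&lt;/code&gt; value while a model is loaded, the model most likely fit on a single card and was not split.&lt;/p&gt;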
&lt;h2 id=&#34;multi-gpu-is-not-simple-compute-stacking&#34;&gt;Multi-GPU Is Not Simple Compute Stacking
&lt;/h2&gt;&lt;p&gt;Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.&lt;/p&gt;
&lt;p&gt;So multi-GPU benefits usually fall into two categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.&lt;/li&gt;
&lt;li&gt;Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama&amp;rsquo;s default &amp;ldquo;use one GPU when it fits&amp;rdquo; strategy avoids that unnecessary PCIe cost.&lt;/p&gt;
&lt;h2 id=&#34;sli-or-nvlink-is-not-required&#34;&gt;SLI or NVLink Is Not Required
&lt;/h2&gt;&lt;p&gt;Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.&lt;/p&gt;
&lt;p&gt;NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.&lt;/p&gt;
&lt;p&gt;What you should pay attention to is PCIe bandwidth. The difference between &lt;code&gt;x1&lt;/code&gt;, &lt;code&gt;x4&lt;/code&gt;, &lt;code&gt;x8&lt;/code&gt;, and &lt;code&gt;x16&lt;/code&gt; affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.&lt;/p&gt;
&lt;p&gt;Safer rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prefer x16 / x8 over mining-style x1 risers.&lt;/li&gt;
&lt;li&gt;PCIe bandwidth matters more when switching large models frequently.&lt;/li&gt;
&lt;li&gt;If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.&lt;/li&gt;
&lt;li&gt;For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.&lt;/li&gt;
&lt;/ul&gt;
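&lt;p&gt;The negotiated PCIe link of each card can be read from the driver. This is a sketch, assuming NVIDIA GPUs and &lt;code&gt;nvidia-smi&lt;/code&gt;. The reported link generation can drop while a card is idle because of power saving, so check it under load.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;# Current PCIe generation and lane width per GPU, plus the maximum width the card supports
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current,pcie.link.width.max --format=csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;A card that reports a current width of 1 or 4 while its maximum is 16 is sitting in a reduced-lane slot or on a riser.&lt;/p&gt;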
&lt;h2 id=&#34;limit-which-nvidia-gpus-ollama-uses&#34;&gt;Limit Which NVIDIA GPUs Ollama Uses
&lt;/h2&gt;&lt;p&gt;On NVIDIA multi-GPU systems, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; to control which GPUs Ollama can see.&lt;/p&gt;
&lt;p&gt;Temporary run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Use only the second GPU:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Force Ollama not to use NVIDIA GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi -L
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Example output:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then specify the UUID:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If Ollama is installed as a Linux systemd service, put the variable into the service environment:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl edit ollama.service
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Add:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-ini&#34; data-lang=&#34;ini&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;[Service]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;CUDA_VISIBLE_DEVICES=0,1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Reload and restart:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl daemon-reload
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl restart ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;amd-and-vulkan-device-selection&#34;&gt;AMD and Vulkan Device Selection
&lt;/h2&gt;&lt;p&gt;For AMD ROCm, use &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; to control visible GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;ROCR_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To make Ollama ignore ROCm GPUs entirely, set an invalid ID such as &lt;code&gt;-1&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;ROCR_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Ollama&amp;rsquo;s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OLLAMA_VULKAN&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;GGML_VK_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If Vulkan devices cause problems, disable them:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;GGML_VK_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt;. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.&lt;/p&gt;
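&lt;p&gt;As a rough sketch of how that override is usually applied: first check which GFX target each card reports, then set the variable for the server. The value &lt;code&gt;10.3.0&lt;/code&gt; below is only an illustrative assumption, not a recommendation; the correct value depends on your GPU, so check the Ollama GPU docs for your hardware before using it.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# List the GFX targets reported by the installed cards (rocminfo ships with ROCm)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;rocminfo &lt;span class=&#34;p&#34;&gt;|&lt;/span&gt; grep gfx
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Illustrative value only: pick the override that matches your card&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;HSA_OVERRIDE_GFX_VERSION&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;10.3.0 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;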
&lt;h2 id=&#34;exposing-multiple-gpus-in-docker&#34;&gt;Exposing Multiple GPUs in Docker
&lt;/h2&gt;&lt;p&gt;If you run Ollama in Docker on NVIDIA hardware, install &lt;code&gt;nvidia-container-toolkit&lt;/code&gt; on the host first, then pass &lt;code&gt;--gpus&lt;/code&gt; to &lt;code&gt;docker run&lt;/code&gt; to expose devices.&lt;/p&gt;
&lt;p&gt;Expose all GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Expose specific GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus &lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#34;device=0,1&amp;#34;&amp;#39;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You can also combine this with environment variables:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -e &lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If &lt;code&gt;nvidia-smi&lt;/code&gt; cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.&lt;/p&gt;
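&lt;p&gt;A quick way to run that check, assuming the container is named &lt;code&gt;ollama&lt;/code&gt; as in the examples above and was started with &lt;code&gt;--gpus&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Assumes a running container named ollama; -L lists the GPUs visible inside it&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker &lt;span class=&#34;nb&#34;&gt;exec&lt;/span&gt; -it ollama nvidia-smi -L
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;If this lists no devices, or the command is missing, the problem is in the Docker GPU setup rather than in Ollama.&lt;/p&gt;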
&lt;h2 id=&#34;what-is-ollama_sched_spread&#34;&gt;What Is &lt;code&gt;OLLAMA_SCHED_SPREAD&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;In multi-GPU configuration discussions, you may see &lt;code&gt;OLLAMA_SCHED_SPREAD=1&lt;/code&gt; or &lt;code&gt;OLLAMA_SCHED_SPREAD=true&lt;/code&gt;. It is a setting for Ollama&amp;rsquo;s scheduler: the server&amp;rsquo;s own help text describes it as always scheduling a model across all GPUs, so people usually set it when they want a model spread over every visible GPU even if it would fit on fewer.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OLLAMA_SCHED_SPREAD&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Or with systemd:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-ini&#34; data-lang=&#34;ini&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;[Service]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;OLLAMA_SCHED_SPREAD=true&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;It is not a magic switch, though. Enabling it does not give linear token/s scaling, and you can still hit OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The behavior described in the Ollama FAQ still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, multi-GPU splitting becomes useful.&lt;/p&gt;
&lt;p&gt;Treat &lt;code&gt;OLLAMA_SCHED_SPREAD&lt;/code&gt; as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on &lt;code&gt;ollama ps&lt;/code&gt;, logs, and &lt;code&gt;nvidia-smi&lt;/code&gt;.&lt;/p&gt;
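&lt;p&gt;If you want to run the same experiment in Docker, one possible sketch is to pass the variable with &lt;code&gt;-e&lt;/code&gt;. The container name, volume, and port below are simply reused from the earlier Docker examples:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -e &lt;span class=&#34;nv&#34;&gt;OLLAMA_SCHED_SPREAD&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Compare &lt;code&gt;ollama ps&lt;/code&gt; output and per-GPU VRAM with and without the variable before deciding to keep it.&lt;/p&gt;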
&lt;h2 id=&#34;how-to-check-whether-multiple-gpus-are-being-used&#34;&gt;How to Check Whether Multiple GPUs Are Being Used
&lt;/h2&gt;&lt;p&gt;Useful commands:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;watch -n 0.5 nvidia-smi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;View the Ollama service logs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;journalctl -u ollama -f
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If using Docker:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker logs -f ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Watch for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether Ollama discovers compatible GPUs.&lt;/li&gt;
&lt;li&gt;Whether the model shows &lt;code&gt;100% GPU&lt;/code&gt; or a CPU/GPU split.&lt;/li&gt;
&lt;li&gt;Whether each GPU has VRAM allocated.&lt;/li&gt;
&lt;li&gt;Whether VRAM grows on multiple GPUs during model loading.&lt;/li&gt;
&lt;li&gt;Whether generation token/s improves compared with CPU/RAM spillover.&lt;/li&gt;
&lt;li&gt;Whether OOM or model unloading happens frequently.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.&lt;/p&gt;
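&lt;p&gt;Since utilization by itself is a weak signal, it can help to watch per-GPU memory next to it. A minimal sketch for NVIDIA cards, using the query mode of &lt;code&gt;nvidia-smi&lt;/code&gt; (the one-second refresh interval is an arbitrary choice):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# Print one CSV row per GPU every second: index, name, VRAM used/total, utilization&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi --query-gpu&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;index,name,memory.used,memory.total,utilization.gpu --format&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;csv -l &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;VRAM in use on more than one index while a model is loaded suggests the model is actually split across cards, even if utilization on each card looks low.&lt;/p&gt;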
&lt;h2 id=&#34;common-misunderstandings&#34;&gt;Common Misunderstandings
&lt;/h2&gt;&lt;h3 id=&#34;misunderstanding-1-two-12gb-gpus-equal-one-24gb-gpu&#34;&gt;Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU
&lt;/h3&gt;&lt;p&gt;Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the &amp;ldquo;does not fit&amp;rdquo; problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-2-different-gpu-models-cannot-be-mixed&#34;&gt;Misunderstanding 2: Different GPU Models Cannot Be Mixed
&lt;/h3&gt;&lt;p&gt;Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, the smaller VRAM, and the PCIe topology. The most predictable setup is still identical cards: same model, same VRAM size, and same generation, on a well-supported driver.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-3-multi-gpu-is-always-faster-than-single-gpu&#34;&gt;Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU
&lt;/h3&gt;&lt;p&gt;Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-4-nvlink--sli-is-required&#34;&gt;Misunderstanding 4: NVLink / SLI Is Required
&lt;/h3&gt;&lt;p&gt;No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-5-adding-a-gpu-does-not-require-restarting-services&#34;&gt;Misunderstanding 5: Adding a GPU Does Not Require Restarting Services
&lt;/h3&gt;&lt;p&gt;Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.&lt;/p&gt;
&lt;h2 id=&#34;gpu-selection-suggestions&#34;&gt;GPU Selection Suggestions
&lt;/h2&gt;&lt;p&gt;For local inference with Ollama, the rough priorities are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Larger single-GPU VRAM is usually easier to manage.&lt;/li&gt;
&lt;li&gt;Identical GPUs are easier to troubleshoot than mixed GPUs.&lt;/li&gt;
&lt;li&gt;Giving each card a full-width PCIe link (for example x16 rather than x4) makes large-model loading smoother.&lt;/li&gt;
&lt;li&gt;Older cards should be checked for CUDA compute capability or ROCm support first.&lt;/li&gt;
&lt;li&gt;Multi-GPU power, cooling, and chassis airflow must be planned ahead.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For budget second-hand platforms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dual RTX 3090 remains a common high-VRAM option.&lt;/li&gt;
&lt;li&gt;Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all involve trade-offs.&lt;/li&gt;
&lt;li&gt;Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.&lt;/li&gt;
&lt;li&gt;Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Ollama multi-GPU support is best understood as &amp;ldquo;VRAM expansion first, performance acceleration second.&amp;rdquo; If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.&lt;/p&gt;
&lt;p&gt;In practice, use &lt;code&gt;ollama ps&lt;/code&gt; to check where the model is loaded, then use &lt;code&gt;nvidia-smi&lt;/code&gt; or ROCm tools to observe VRAM allocation. For GPU selection, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; on NVIDIA, &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; on AMD ROCm, and &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt; for Vulkan. If running in Docker, first make sure the container can see the GPUs.&lt;/p&gt;
&lt;p&gt;Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Ollama FAQ: How does Ollama load models on multiple GPUs?: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama/blob/main/docs/faq.mdx&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ollama/ollama/blob/main/docs/faq.mdx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama GPU docs: Hardware support / GPU Selection: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama/blob/main/docs/gpu.mdx&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ollama/ollama/blob/main/docs/gpu.mdx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama Docker Hub: &lt;a class=&#34;link&#34; href=&#34;https://hub.docker.com/r/ollama/ollama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://hub.docker.com/r/ollama/ollama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;NVIDIA Container Toolkit: &lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/nvidia-container-toolkit&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/NVIDIA/nvidia-container-toolkit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
