Multi-GPU on KnightLi Blog

A Practical llama.cpp Multi-GPU Benchmarking Approach: Is 2x V100 16GB Faster Than One 32GB Card?

Sat, 09 May 2026 15:05:41 +0800

Short version: llama.cpp multi-GPU offload is not free performance just because you add a second card. If the model already fits fully on one 32GB GPU, 2x V100 16GB is often less convenient than a single 32GB card and may even be slower. If the model does not fit on one 16GB card, the main value of dual GPUs is that the model can stay on GPU, and the benefit can be obvious.

First, Understand split mode

llama.cpp multi-GPU usage mainly revolves around --split-mode and --tensor-split. When discussing performance, distinguish these modes first:

layer: splits layers across GPUs. It is usually the most compatible starting point.
tensor: splits tensor computation across multiple GPUs. It is closer to true parallel compute, but depends more heavily on inter-GPU bandwidth and backend support.
row: an older row-splitting mode that still appears in some setups, but is usually not the first choice for new deployments.

In simple terms, layer is like putting different floors on different cards. During single-token generation, it may not keep both cards fully busy at the same time. tensor is more like letting both cards work on the same layer together. It has more theoretical parallelism, but inter-GPU communication can become the bottleneck.
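To make the "floors on different cards" picture concrete, here is a minimal Python sketch of how a --tensor-split ratio could map whole layers onto GPUs under layer mode. It assumes a plain proportional assignment; the function name assign_layers and the 64-layer model are illustrative only, and llama.cpp's real allocator may differ in detail.

```python
def assign_layers(n_layers: int, split: list[float]) -> list[int]:
    """Return how many layers each GPU would hold for a proportional split.

    Assumption: layers are handed out by cumulative share of the ratio,
    which is a simplification of what a real backend does.
    """
    total = sum(split)
    if total <= 0:
        raise ValueError("split must contain at least one positive weight")
    counts = []
    assigned = 0
    cumulative = 0.0
    for weight in split:
        cumulative += weight
        # Every layer up to this boundary belongs to the GPUs seen so far.
        boundary = round(n_layers * cumulative / total)
        counts.append(boundary - assigned)
        assigned = boundary
    return counts


print(assign_layers(64, [1, 1]))  # even split across two cards
print(assign_layers(64, [1, 0]))  # everything on the first card
```

With a 1,1 ratio each card holds half of the layers, so during decode a token has to pass through the first card's layers before the second card can start. That is the waiting behavior described above.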

If One 32GB Card Can Fit the Model, Dual 16GB Is Not Always Faster

If the model and KV cache fit fully on one 32GB GPU, a single card is usually steadier and often faster. For hardware in the same generation, such as 1x V100 32GB versus 2x V100 16GB, the dual-card setup does not necessarily win.

A conservative expectation is that 2x V100 16GB may be 10% to 40% slower than one V100 32GB, especially for single-user chat, Continue Agent, and code Q&A workloads where one request is mainly generating one answer.

The reason is straightforward: multi-GPU does not simply merge VRAM into one fast pool. With layer splitting, inference moves across GPUs and one card may wait for the other during token generation. With tensor splitting, both cards can compute together, but intermediate results need cross-GPU synchronization, so bandwidth and latency directly affect throughput.

So if your choice is:

1x V100 32GB
2x V100 16GB

and the target model already fits fully on one 32GB card, the single 32GB card is often the more comfortable option.

If One 16GB Card Cannot Fit the Model, Dual Cards Matter

The situation changes completely when the model does not fit on one 16GB card but can fit across two 16GB cards.

In that case, the value of dual GPUs is very direct:

One 16GB card: may require heavy CPU offload, which can slow things down a lot.
2x 16GB cards: weights can stay mostly on GPU, which may be much faster than mixed CPU/GPU execution.

In this scenario, 2x V100 16GB is not guaranteed to beat one 32GB card, but it may be several times faster than a single 16GB card with heavy system-memory offload. In other words, the first value of dual cards is not acceleration. It is avoiding the need to push model weights into slower system RAM.
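The fit-or-spill reasoning can be sketched as a small Python check. The function name placement, the overhead figure, and the example sizes are hypothetical, and the multi-GPU branch is optimistic because it treats the cards as one pool, which they are not in practice.

```python
def placement(weights_gb: float, kv_cache_gb: float, overhead_gb: float,
              card_gb: float, n_cards: int) -> str:
    """Classify where a model would live for a given GPU setup."""
    need = weights_gb + kv_cache_gb + overhead_gb
    if need <= card_gb:
        return "single-gpu"      # everything fits on one card
    if need <= card_gb * n_cards:
        # Optimistic: real buffers cannot always be split evenly across cards.
        return "multi-gpu"
    return "cpu-offload"         # part of the model spills to system RAM


# Hypothetical model: 18 GB of weights, 3 GB of KV cache, 1 GB of overhead.
for label, card_gb, n_cards in [("1x16GB", 16, 1), ("2x16GB", 16, 2), ("1x32GB", 32, 1)]:
    print(label, placement(18, 3, 1, card_gb, n_cards))
```

For this made-up model the single 16GB card lands in cpu-offload, while both 2x16GB and 1x32GB keep it on GPU. That is the case where the second card pays for itself.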

V100 PCIe and V100 SXM2 Are Very Different

The easiest thing to overlook in multi-GPU inference is the interconnect.

If you have V100 SXM2 with NVLink, cross-GPU communication bandwidth is much higher. NVIDIA’s V100 material lists NVLink interconnect bandwidth up to 300GB/s. In that environment, tensor mode or higher-batch workloads have a better chance of approaching or exceeding single-card performance.

If you have V100 PCIe, expectations should be more conservative. V100 PCIe mainly uses PCIe Gen3, and the listed interconnect bandwidth is 32GB/s. That is a very different class from NVLink, which is why dual PCIe cards often provide enough VRAM without doubling speed.

So when judging whether 2x V100 16GB is worthwhile, do not only add the VRAM to 32GB. Also check whether the cards are PCIe or SXM2/NVLink.
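For a sense of scale, the Python sketch below computes the ideal time to move a payload over each link, using the peak figures quoted above (32GB/s and 300GB/s). The 64 MiB payload is an arbitrary illustration, and the calculation ignores per-transfer latency, which matters a lot for the small, frequent transfers in decode.

```python
def transfer_time_us(n_bytes: int, bandwidth_gb_s: float) -> float:
    """Ideal time in microseconds to move n_bytes at bandwidth_gb_s (1 GB = 1e9 bytes)."""
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e6


payload = 64 * 1024 * 1024  # arbitrary 64 MiB example payload
for name, bandwidth in [("PCIe Gen3 (listed peak)", 32.0), ("NVLink (listed peak)", 300.0)]:
    print(f"{name}: {transfer_time_us(payload, bandwidth):.1f} us")
```

On these peak numbers the same payload takes about 2.1 ms over PCIe and about 0.22 ms over NVLink, roughly a 9x gap. Real transfers will be slower on both links.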

A Practical Buying Rule

If the model fits on one 32GB GPU, choose the single card first. Its latency, stability, and tuning cost are usually better.

If the model does not fit on one 16GB GPU but can fit on two 16GB GPUs, dual cards are worth using. At that point, the goal is to keep weights on GPU as much as possible, not to expect linear performance scaling.

If you have dual V100 PCIe cards, start with --split-mode layer and aim for stable execution with less CPU fallback.

If you have V100 SXM2/NVLink, tensor mode is more worth testing, especially for prefill, larger batches, or concurrent serving.

When to Buy 2x16GB and When to Buy 1x32GB

If you serve only one user and mainly do chat, code completion, Continue Agent, or long-context Q&A, and the target model fits within 32GB, 1x32GB is usually the better choice. It avoids cross-GPU scheduling, has steadier latency, and is easier to debug.

If you already own one 16GB card and want a lower-cost path to run 30B or 32B models, or higher-precision quantizations, 2x16GB makes sense. It may not double token/s, but it can keep weights on GPU that would otherwise require CPU offload.

If you are buying from scratch, the priority can look like this:

Single model, single user, latency-sensitive: prefer 1x32GB.
Model does not fit on one card and budget is limited: consider 2x16GB.
Machine has NVLink or SXM2: 2x16GB is much more interesting than ordinary PCIe dual cards.
You want longer context later: do not only count model weights; reserve VRAM for KV cache too.

Practical Advice for layer split and tensor split

The practical rule is: start with layer, then benchmark tensor.

layer is the default starting point. It splits the model by layer, has better compatibility, and is friendlier to PCIe dual-card systems. The downside is that generation can behave more like a pipeline: at certain moments one card is busy while the other waits.

tensor is better suited to machines with strong interconnects, such as V100 SXM2/NVLink. It splits part of the same layer’s computation across GPUs, so it has more parallelism in theory, but it also synchronizes across cards more often. On PCIe dual cards, communication overhead may eat the benefit.

You can start with these tests:

llama-bench -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1
llama-bench -m model.gguf -ngl 99 --split-mode tensor --tensor-split 1,1
llama-bench -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,0

The third command is not meant as the long-term configuration. It gives you a single-card reference, so you can see whether dual GPUs are actually faster or only distributing VRAM pressure.
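If you prefer to script the comparison, a minimal Python sketch can build the same three invocations as argument lists. It only uses the flags shown above; the helper name bench_commands is made up for this example, and actually running the commands (for instance with subprocess.run) is left to you.

```python
import shlex


def bench_commands(model_path: str) -> list[list[str]]:
    """Build the three llama-bench invocations as argv lists."""
    base = ["llama-bench", "-m", model_path, "-ngl", "99"]
    variants = [
        ["--split-mode", "layer", "--tensor-split", "1,1"],   # dual-card layer split
        ["--split-mode", "tensor", "--tensor-split", "1,1"],  # dual-card tensor split
        ["--split-mode", "layer", "--tensor-split", "1,0"],   # single-card reference
    ]
    return [base + variant for variant in variants]


for argv in bench_commands("model.gguf"):
    print(shlex.join(argv))
```

Keeping the shared arguments in one place makes it harder to change the model or -ngl value for one run and forget the others.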

Why prefill and decode Behave Differently

Local LLM performance should usually be viewed in two stages:

prefill: processes the input prompt. A typical metric is prompt-processing throughput such as pp512.
decode: generates the response token by token. A typical metric is token-generation throughput such as tg128.

prefill is more like large-batch matrix computation. With larger batches, it is easier to keep GPUs busy and more likely to benefit from multi-GPU parallelism. decode generates one token after another. The batch is smaller and synchronization is more frequent, so cross-card communication and scheduling latency are easier to notice.

That is why you may see dual GPUs improve pp512 while tg128 barely improves or even gets worse. For chat and agent workflows, user experience is closer to tg128. For long document ingestion, batch prefill, or concurrent serving, pp512 also matters.
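A rough way to see which metric dominates your workload is to combine the two throughputs into one response-time estimate. The Python sketch below does that; the throughput numbers are invented for illustration and are not measurements of any V100 setup.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     pp_tok_s: float, tg_tok_s: float) -> float:
    """Estimate end-to-end time as prefill time plus decode time."""
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s


# Invented throughputs: dual cards help prefill but hurt decode.
single = {"pp_tok_s": 1000.0, "tg_tok_s": 40.0}
dual = {"pp_tok_s": 1500.0, "tg_tok_s": 32.0}

for name, prompt, output in [("chat", 512, 256), ("long document", 16000, 64)]:
    t_single = response_seconds(prompt, output, **single)
    t_dual = response_seconds(prompt, output, **dual)
    print(f"{name}: single {t_single:.1f}s, dual {t_dual:.1f}s")
```

With these made-up numbers the chat request is slower on two cards (about 8.3s versus 6.9s) because decode dominates, while the long-document request is faster (about 12.7s versus 17.6s) because prefill dominates.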

Can KV cache Become a Second VRAM Bottleneck?

Yes. Many people only count model weights and forget KV cache.

Model weights decide whether the model can load. KV cache decides whether you can use the context length you want. The longer the context, the higher the concurrency, and the larger the batch, the more visible KV cache usage becomes. You may find that the model itself fits in 32GB, but 32K or 64K context pushes VRAM over the limit.

At minimum, leave VRAM headroom for:

KV cache
CUDA graph or backend runtime overhead
prompt batch and ubatch
desktop, driver, and other process usage

If you use 2x16GB, VRAM is not a fully equivalent 32GB pool. Some buffers, KV cache, or intermediate tensors may still be limited by remaining memory on a single card. When testing long context, use the target --ctx-size and target concurrency directly instead of only checking whether the model starts.
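To estimate the KV cache before testing, the usual back-of-the-envelope formula is two tensors (K and V) per layer, each sized by KV heads, head dimension, and context length. The Python sketch below assumes an unquantized fp16 cache and standard attention; the model shape (64 layers, 8 KV heads, head dimension 128) is a hypothetical example, so read the real values from your model's metadata.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_elem: int = 2, n_seqs: int = 1) -> int:
    """Approximate KV cache size in bytes for a standard attention model."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem * n_seqs


GIB = 1024 ** 3
for ctx in (8192, 32768, 65536):
    size = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, ctx_tokens=ctx)
    print(f"ctx={ctx}: {size / GIB:.1f} GiB")
```

For this hypothetical shape the cache works out to 8 GiB at 32K context and 16 GiB at 64K, which is how a model that loads comfortably can still run out of VRAM once the context grows.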

How to Benchmark Dual Cards with llama-bench

llama-bench is better than direct chatting for hardware comparison because it separates prompt processing and token generation into comparable metrics. The default example in the official README is:

llama-bench -m model.gguf

For dual V100 cards, test at least these sets:

# Single-card baseline
CUDA_VISIBLE_DEVICES=0 llama-bench -m model.gguf -ngl 99

# Dual-card layer split
CUDA_VISIBLE_DEVICES=0,1 llama-bench -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1

# Dual-card tensor split
CUDA_VISIBLE_DEVICES=0,1 llama-bench -m model.gguf -ngl 99 --split-mode tensor --tensor-split 1,1

Focus on two columns:

pp512: prompt processing, more relevant to long inputs and batch prefill.
tg128: token generation, more relevant to single-user chat and agent responsiveness.

Keep the model, quantization, context length, batch settings, driver version, and llama.cpp version fixed. Run each group several times and compare medians rather than one-off results. Finally, test your real workflow too, such as Continue Agent, an OpenAI-compatible server, or your own RAG requests, because a good benchmark does not always mean better interactive experience.
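Comparing medians is easy to script. The Python sketch below takes several throughput readings per configuration and reports the ratio of medians; the helper name compare_runs and the sample readings are invented for illustration.

```python
from statistics import median


def compare_runs(baseline: list[float], candidate: list[float]) -> float:
    """Return candidate median / baseline median (above 1.0 means candidate is faster)."""
    return median(candidate) / median(baseline)


# Invented tg128 readings in tokens/s; the 50.0 is a deliberate outlier.
single_card = [40.0, 41.0, 39.0]
dual_layer = [30.0, 32.0, 50.0]

ratio = compare_runs(single_card, dual_layer)
print(f"dual/single tg128 ratio: {ratio:.2f}")
```

The median ignores the single outlier run, so one lucky or unlucky measurement does not flip the conclusion.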

One-Sentence Conclusion

The main advantage of 2x V100 16GB is VRAM capacity, not guaranteed generation speed. If the model fits on one card, a single 32GB card is usually faster and steadier. If the model does not fit on one 16GB card, dual 16GB cards become valuable because they avoid heavy CPU offload. Whether they are faster depends on split mode, batch size, model size, and whether the two V100 cards are connected through PCIe or NVLink.

Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings

Sun, 19 Apr 2026 00:18:00 +0800

When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?

This note summarizes how Ollama behaves with multiple GPUs. The short version:

Ollama supports multiple GPUs.
The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.
By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.
If a model does not fit on one GPU, Ollama can spread it across available GPUs.
Mixed GPU models may be visible to Ollama, but performance and placement may not be ideal.
SLI / NVLink is not required for multi-GPU use.
To limit which GPUs Ollama can use, use CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, or GGML_VK_VISIBLE_DEVICES.

Official Behavior: Single GPU First, Multi-GPU When Needed

Ollama’s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.

The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.

So do not think of Ollama multi-GPU as “more cards automatically means several times faster.” A more accurate model is:

Small model fits on one GPU: usually runs on one GPU.
Large model does not fit on one GPU: split across multiple GPUs.
Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.

Use this command to see where the model is loaded:

ollama ps

The PROCESSOR column may show something like:

100% GPU
48%/52% CPU/GPU
100% CPU

If you see 48%/52% CPU/GPU, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.
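If you want to check this from a script, the three PROCESSOR formats shown above can be parsed with a small Python helper. This is a sketch that only handles those three shapes; the function name parse_processor is hypothetical, and other Ollama versions may format the column differently.

```python
import re


def parse_processor(value: str) -> dict[str, int]:
    """Turn a PROCESSOR string into CPU and GPU percentages."""
    value = value.strip()
    # Split placement, for example "48%/52% CPU/GPU".
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", value)
    if split:
        return {"cpu": int(split.group(1)), "gpu": int(split.group(2))}
    # Single-device placement, for example "100% GPU" or "100% CPU".
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", value)
    if single:
        pct, device = int(single.group(1)), single.group(2).lower()
        other = "gpu" if device == "cpu" else "cpu"
        return {device: pct, other: 100 - pct}
    raise ValueError(f"unrecognized PROCESSOR value: {value!r}")


for sample in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
    print(sample, "->", parse_processor(sample))
```

Any result with a nonzero cpu share means part of the model is in system memory.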

Multi-GPU Is Not Simple Compute Stacking

Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.

So multi-GPU benefits usually fall into two categories:

VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.
Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.

If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama’s default “use one GPU when it fits” strategy avoids that unnecessary PCIe cost.

SLI or NVLink Is Not Required

Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.

NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.

What you should pay attention to is PCIe bandwidth. The difference between x1, x4, x8, and x16 affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.

Safer rules:

Prefer x16 / x8 over mining-style x1 risers.
PCIe bandwidth matters more when switching large models frequently.
If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.
For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.

Limit Which NVIDIA GPUs Ollama Uses

On NVIDIA multi-GPU systems, use CUDA_VISIBLE_DEVICES to control which GPUs Ollama can see.

Temporary run:

CUDA_VISIBLE_DEVICES=0,1 ollama serve

Use only the second GPU:

CUDA_VISIBLE_DEVICES=1 ollama serve

Force Ollama not to use NVIDIA GPUs:

CUDA_VISIBLE_DEVICES=-1 ollama serve

The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:

nvidia-smi -L

Example output:

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)

Then specify the UUID:

CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
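Picking UUIDs by hand is error-prone on machines with several cards, so a small Python sketch can extract them from nvidia-smi -L output. It assumes the line format shown in the example output; the helper names parse_gpu_list and visible_devices are made up, and the sample text reuses the placeholder UUIDs rather than real ones.

```python
import re

# Matches lines such as: GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-...)
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[0-9A-Za-z-]+)\)$")


def parse_gpu_list(text: str) -> list[tuple[int, str, str]]:
    """Return (index, name, uuid) for each GPU line in the text."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


def visible_devices(text: str, name_contains: str) -> str:
    """Build a CUDA_VISIBLE_DEVICES value from GPUs whose name matches."""
    return ",".join(uuid for _, name, uuid in parse_gpu_list(text)
                    if name_contains in name)


sample = """GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)"""
print("CUDA_VISIBLE_DEVICES=" + visible_devices(sample, "3090"))
```

In real use you would feed it the captured output of nvidia-smi -L instead of the sample string.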

If Ollama is installed as a Linux systemd service, put the variable into the service environment:

sudo systemctl edit ollama.service

Add:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

AMD and Vulkan Device Selection

For AMD ROCm, use ROCR_VISIBLE_DEVICES to control visible GPUs:

ROCR_VISIBLE_DEVICES=0,1 ollama serve

To force Ollama not to use ROCm GPUs, use an invalid ID:

ROCR_VISIBLE_DEVICES=-1 ollama serve

Ollama’s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use GGML_VK_VISIBLE_DEVICES:

OLLAMA_VULKAN=1 GGML_VK_VISIBLE_DEVICES=0 ollama serve

If Vulkan devices cause problems, disable them:

GGML_VK_VISIBLE_DEVICES=-1 ollama serve

AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as HSA_OVERRIDE_GFX_VERSION. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.

Exposing Multiple GPUs in Docker

If you run Ollama in Docker, NVIDIA setups usually require nvidia-container-toolkit, then --gpus to expose devices.

Expose all GPUs:

docker run -d \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Expose specific GPUs:

docker run -d \
  --gpus '"device=0,1"' \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

You can also combine this with environment variables:

docker run -d \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

If nvidia-smi cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.

What Is OLLAMA_SCHED_SPREAD?

In some multi-GPU configuration discussions, you may see OLLAMA_SCHED_SPREAD=1 or OLLAMA_SCHED_SPREAD=true. It is a scheduler setting that Ollama's built-in help describes as always scheduling a model across all GPUs, and people typically set it when they want a model spread across cards even though it would fit on one.

Example:

OLLAMA_SCHED_SPREAD=1 ollama serve

Or with systemd:

[Service]
Environment="OLLAMA_SCHED_SPREAD=true"

But it is not a magic switch. Enabling it does not imply linear token/s scaling, and it may still run into OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The core FAQ behavior still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, then multi-GPU splitting becomes useful.

Treat OLLAMA_SCHED_SPREAD as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on ollama ps, logs, and nvidia-smi.

How to Check Whether Multiple GPUs Are Being Used

Useful commands:

ollama ps

watch -n 0.5 nvidia-smi

View the Ollama service logs:

journalctl -u ollama -f

If using Docker:

docker logs -f ollama

Watch for:

Whether Ollama discovers compatible GPUs.
Whether the model shows 100% GPU or a CPU/GPU split.
Whether each GPU has VRAM allocated.
Whether VRAM grows on multiple GPUs during model loading.
Whether generation token/s improves compared with CPU/RAM spillover.
Whether OOM or model unloading happens frequently.

GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.
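Since utilization can mislead, VRAM allocation per card is often the clearer signal. The Python sketch below parses the CSV that nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits prints; the helper names, the 1024 MiB threshold, and the sample values are assumptions for illustration.

```python
def parse_vram(csv_text: str) -> dict[int, tuple[int, int]]:
    """Map GPU index to (used_mib, total_mib) from nvidia-smi CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def gpus_in_use(usage: dict[int, tuple[int, int]], min_used_mib: int = 1024) -> list[int]:
    """Return indices of GPUs holding at least min_used_mib of allocations."""
    return [index for index, (used, _) in sorted(usage.items()) if used >= min_used_mib]


# Invented sample output: both cards hold a large share of the model.
sample = "0, 14230, 16384\n1, 13890, 16384\n"
usage = parse_vram(sample)
print("GPUs holding model data:", gpus_in_use(usage))
```

If only one index shows up while a model is loaded, the model is living on a single card no matter what the utilization graph suggests.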

Common Misunderstandings

Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU

Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the “does not fit” problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.

Misunderstanding 2: Different GPU Models Cannot Be Mixed

Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, the smaller VRAM, and PCIe topology. The most predictable setup is still identical cards: same model, same VRAM size, and a driver that supports that generation well.

Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU

Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.

Misunderstanding 4: NVLink / SLI Is Required

No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.

Misunderstanding 5: Adding a GPU Does Not Require Restarting Services

Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.

GPU Selection Suggestions

For Ollama local inference, the rough priority is:

Larger single-GPU VRAM is usually easier to manage.
Identical GPUs are easier to troubleshoot than mixed GPUs.
Full-width PCIe links (x16 or x8) make large-model loading smoother.
Older cards should be checked for CUDA compute capability or ROCm support first.
Multi-GPU power, cooling, and chassis airflow must be planned ahead.

For budget second-hand platforms:

Dual RTX 3090 remains a common high-VRAM option.
Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all involve trade-offs.
Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.
Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.

Summary

Ollama multi-GPU support is best understood as “VRAM expansion first, performance acceleration second.” If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.

In practice, use ollama ps to check where the model is loaded, then use nvidia-smi or ROCm tools to observe VRAM allocation. For GPU selection, use CUDA_VISIBLE_DEVICES on NVIDIA, ROCR_VISIBLE_DEVICES on AMD ROCm, and GGML_VK_VISIBLE_DEVICES for Vulkan. If running in Docker, first make sure the container can see the GPUs.

Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.

References

Ollama FAQ: How does Ollama load models on multiple GPUs?: https://github.com/ollama/ollama/blob/main/docs/faq.mdx
Ollama GPU docs: Hardware support / GPU Selection: https://github.com/ollama/ollama/blob/main/docs/gpu.mdx
Ollama Docker Hub: https://hub.docker.com/r/ollama/ollama
NVIDIA Container Toolkit: https://github.com/NVIDIA/nvidia-container-toolkit