CUDA on KnightLi Blog

Ubuntu 26.04 LTS GPU and Hardware Updates: CUDA, ROCm, DPC++, and More Platform Changes

Sun, 26 Apr 2026 19:35:57 +0800

If the previous article worked as a desktop-focused overview of Ubuntu 26.04 LTS, this one is better read as its hardware and compute-side follow-up. In this 26.04 cycle, Ubuntu pushed a number of AI, GPU computing, and platform compatibility changes into the main archive or formal support scope.

The short version is this: the most important part of this round is not just desktop and kernel upgrades, but that Ubuntu is bringing Intel, NVIDIA, and AMD GPU computing stacks into the distribution in a more systematic way.

1. Intel’s oneAPI DPC++ compiler is now in Ubuntu Archive

Starting with 26.04, Intel’s open-source oneAPI DPC++ compiler is available directly from Ubuntu Archive for building SYCL code. Its runtime also includes adapters for Intel GPUs.

Two related components are also now available from Ubuntu repositories:

oneDPL, the DPC++ library, which provides higher-productivity developer APIs
oneDNN, built with dpclang-6, which can run on Intel GPUs

That means if you are already working with SYCL, heterogeneous computing, or AI workloads on Intel GPUs, Ubuntu now offers a more direct path instead of forcing you to maintain a separate external stack for everything.

Ubuntu also calls out one practical requirement: users need to be in the render group to actually use these Intel GPU-related capabilities.
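
The group requirement can be checked with a few lines of shell. This is a minimal sketch: `has_group` and `check_render_group` are hypothetical helper names introduced here, while `id -nG` and `usermod -aG` are standard tools. A group change only takes effect in a new login session.

```shell
# has_group GROUPS NAME: succeed if NAME appears in the space-separated list GROUPS.
# (has_group is a hypothetical helper name, not part of any Ubuntu tooling.)
has_group() {
    for g in $1; do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}

# Inspect the current user's groups and print a hint if render is missing.
check_render_group() {
    if has_group "$(id -nG)" render; then
        echo "render group: ok"
    else
        echo "render group: missing (try: sudo usermod -aG render \$USER, then log in again)"
    fi
}
```

Calling `check_render_group` in a terminal prints one line describing the current state.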

2. The NVIDIA CUDA toolkit can now be installed directly with `apt`

For many developers and operators, this may be one of the most immediately useful changes in the notes.

Starting with 26.04, the NVIDIA CUDA toolkit can now be installed directly from Ubuntu Archive:

`sudo apt install cuda-toolkit`

The value here is bigger than just saving a few setup steps.

For developers shipping software on Ubuntu, this new model means they can simply declare a dependency on the CUDA runtime, while Ubuntu manages installation and compatibility at the distribution level. That makes CUDA feel more like a native system capability on Ubuntu, rather than an extra software layer that always has to be maintained separately.

3. AMD ROCm 7.1.0 is now in Universe

On the AMD side, Ubuntu Universe now includes ROCm 7.1.0.

These libraries mainly provide:

backend infrastructure for AI training and inference on AMD GPUs
software foundations for machine learning and high performance computing

Canonical also notes that ROCm-related components are continuously tested in its CI/CD pipeline. Beyond autopkgtests, that includes several user-space applications such as:

llama.cpp
pytorch
Blender
Lemonade Server

That detail matters, because it shows Ubuntu is not just dropping packages into the archive. It is validating ROCm as a maintainable software stack.

4. The bigger story is that all three GPU ecosystems are landing

It becomes easier to see the direction of 26.04 when DPC++, CUDA, and ROCm are viewed together:

Intel: bringing SYCL / oneAPI components into official repositories
NVIDIA: giving the CUDA toolkit a distribution-managed installation path
AMD: shipping ROCm 7.1.0 in Universe with ongoing testing

If you work with these kinds of workloads on Ubuntu, this release will probably feel more relevant:

local LLM inference
GPU-accelerated training or fine-tuning
Blender, scientific computing, and HPC
development environments that need to move across different GPU platforms

In other words, Ubuntu is no longer just “a system where you can install a GPU driver.” It is starting to carry a fuller user-space software stack for AI and GPU computing.

5. NVIDIA Dynamic Boost is enabled by default

Since 25.04, Dynamic Boost has been enabled by default on supported NVIDIA laptops.

The idea is straightforward: depending on system load, power can be shifted dynamically between the CPU and GPU. In gaming scenarios, that usually means giving more power to the GPU when needed to extract more performance.

It only applies under two conditions:

the laptop is connected to AC power
the GPU load is high enough

It does not engage while the system is running on battery.
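
The AC-power condition can be checked from sysfs. The sketch below assumes the adapter is exposed under `/sys/class/power_supply` with a `type` of `Mains`, which is common but not universal (some USB-C chargers report a different type). `on_ac_power` is a hypothetical helper name.

```shell
# on_ac_power [DIR]: succeed if any "Mains" supply under DIR reports online=1.
# DIR defaults to the standard sysfs location; the parameter exists only so the
# function can be exercised against a test directory.
on_ac_power() {
    dir="${1:-/sys/class/power_supply}"
    for supply in "$dir"/*; do
        [ -r "$supply/type" ] || continue
        [ "$(cat "$supply/type")" = "Mains" ] || continue
        [ "$(cat "$supply/online" 2>/dev/null)" = "1" ] && return 0
    done
    return 1
}
```

For example, `on_ac_power && echo "on AC"` reports whether the first condition is met. On systems where the NVIDIA driver ships the `nvidia-powerd` service, `systemctl status nvidia-powerd` should show whether the daemon behind Dynamic Boost is running.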

6. Support for new Intel integrated and discrete GPUs keeps moving forward

Ubuntu also continues expanding support for new Intel GPUs, including:

Integrated:

Intel Core Ultra Xe2
Intel Core Ultra Xe3

Discrete:

Intel Arc 5 B570
Intel Arc 5 B580
Intel Arc Pro B50
Intel Arc Pro B60
Intel Arc Pro B65
Intel Arc Pro B70

Ubuntu also highlights several features already available around these devices:

improved GPU and CPU ray tracing performance through Intel Embree, benefiting applications such as Blender 4.2+
hardware video encoding for AVC, JPEG, HEVC, and AV1 on “Battlemage” devices
a new CCS optimization in Intel Compute Runtime
enabled debugging support for Intel Xe GPUs

The interim 25.10 release had already brought in more capabilities that carry over into 26.04, including:

initial support for Intel’s next-generation client platform codenamed Panther Lake through Linux kernel 6.17
improved IOMMU, PCIe subsystem, and multi-GPU support
Mesa 25.2.3 enabling VK_KHR_shader_bfloat16 for Battlemage and Panther Lake
intel-media-driver 25.3.0 adding Panther Lake decode support and VP9 encoding
intel-compute-runtime 25.31 adjusting the Level Zero USM pool and local device memory event allocation behavior
level-zero 1.24 and level-zero-raytracing 1.1.0 bringing broader spec and RTAS extension support

7. Suspend and resume is more stable on NVIDIA desktops too

Starting with 25.10, Ubuntu enables suspend-resume support in the proprietary NVIDIA driver to reduce corruption and freezing when waking a desktop system.

This is not the most visible kind of change, but it matters a lot in everyday use, especially on desktops that stay on for long periods and frequently suspend and resume.

8. ARM, Raspberry Pi, RISC-V, and IBM Z also get significant platform-level changes

Beyond the GPU software stack, the release notes also include several platform-level changes worth calling out separately.

ARM64 desktop platforms

Starting with 25.10, the ARM64 linux-generic kernel provides broader desktop compatibility for ARM64 desktop platforms that boot through UEFI.

A new Raspberry Pi boot layout

One change introduced in 25.10 and refined in 26.04 is a new boot partition layout for Raspberry Pi systems.

Its goal is to improve boot reliability: newly written boot assets are first “tested” before they are committed as the new “known good” set.

The firmware date requirements are the part most users will want to remember:

Pi 3 / 3+ / CM3+ / Zero 2W: no additional action required, the boot firmware is in the image itself
Pi 4 / 400 / CM4: boot firmware must be dated no earlier than 2022-11-25
Pi 5 / 500 / CM5: boot firmware must be dated no earlier than 2025-02-11

You can check it with:

`sudo rpi-eeprom-update`
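
Once the current firmware date is known, comparing it against the minimum is a plain date comparison. The sketch below assumes both dates are written as `YYYY-MM-DD`. The helper names and the model labels (`pi4`, `pi5`, and so on) are hypothetical and only mirror the list above.

```shell
# min_fw_date MODEL: print the minimum boot firmware date for a model label.
# The labels are illustrative; the dates come from the list above.
min_fw_date() {
    case "$1" in
        pi4|pi400|cm4) echo 2022-11-25 ;;
        pi5|pi500|cm5) echo 2025-02-11 ;;
        *) echo "" ;;
    esac
}

# fw_date_ok CURRENT MINIMUM: succeed if CURRENT is no earlier than MINIMUM.
# ISO dates sort lexicographically, so a plain sort is enough.
fw_date_ok() {
    oldest="$(printf '%s\n%s\n' "$1" "$2" | sort | head -n 1)"
    [ "$oldest" = "$2" ]
}
```

For example, `fw_date_ok 2023-01-11 "$(min_fw_date pi4)"` succeeds. The current date itself still has to be read from the `rpi-eeprom-update` output and rewritten in ISO form.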

If the firmware is too old and you are using Ubuntu 24.04 LTS or newer, you can update it like this:

sudo rpi-eeprom-update -a
sudo reboot

Raspberry Pi desktop images now use desktop-minimal

Since 25.10, Ubuntu Desktop images for Raspberry Pi are based on desktop-minimal rather than the full desktop seed.

Ubuntu gives a very concrete benefit here: the default app set is smaller, saving about 777MB on the uncompressed image and on installed systems.

If you want to remove that default app set in bulk after upgrading, you can use:

`sudo apt purge ubuntu-desktop --autoremove`

If you want to keep some of those applications, just mark them as manually installed with apt first.
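
Marking packages as manually installed is done with `apt-mark manual`. The helper below is a dry-run sketch: it only prints the commands in the order they would be run. `keep_then_purge` is a hypothetical name, and the package names passed to it are whatever you choose to keep.

```shell
# keep_then_purge PKG...: print the commands that keep the named packages
# and then remove the rest of the default app set. Nothing is executed.
keep_then_purge() {
    for pkg in "$@"; do
        echo "sudo apt-mark manual $pkg"
    done
    echo "sudo apt purge ubuntu-desktop --autoremove"
}
```

Reviewing the printed commands before running them makes it easier to confirm that nothing you rely on is about to be removed.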

Swap on Raspberry Pi is now handled by cloud-init

Since 25.10, swap file creation on Raspberry Pi desktop images is handled by cloud-init.
If you want to customize swap size before first boot, you can edit user-data on the boot partition directly.

RISC-V requirements have moved up

Starting with 25.10, Ubuntu's RISC-V builds, including 26.04 LTS, require hardware that implements the RVA23S64 ISA profile.

Systems that do not meet that requirement can no longer run Ubuntu 26.04 LTS. If you still have boards based on earlier RVA20 processor cores, you need to stay on Ubuntu 24.04 LTS.

According to Ubuntu, as of April 2026, there is still no real RVA23S64 hardware available. So the only currently supported platform is effectively a QEMU virtualized environment configured with -cpu rva23s64.
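
Whether a local QEMU build knows this CPU model can be checked by filtering its model listing. This is a sketch: `has_cpu_model` is a hypothetical helper, and it assumes that `qemu-system-riscv64 -cpu help` prints one model name per entry.

```shell
# has_cpu_model NAME: read a CPU model listing on stdin and succeed if NAME
# appears as a whole word.
has_cpu_model() {
    grep -qw -- "$1"
}
```

Typical use would be `qemu-system-riscv64 -cpu help | has_cpu_model rva23s64 && echo "rva23s64 available"`.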

IBM Z now requires z15 at minimum

Starting with 26.04, the minimum requirement for the s390x architecture has moved up to z15.

That means:

z14 / LinuxONE II and older systems can no longer install Ubuntu 26.04 LTS
z15 / LinuxONE III and newer systems should see better performance

9. Who should read this first

This article is more useful than the desktop overview if you fall into any of these cases:

you use Ubuntu for CUDA, ROCm, SYCL, or local AI inference
you do development or compute work on Intel, NVIDIA, or AMD GPUs
you maintain Raspberry Pi, ARM64, RISC-V, IBM Z, or other non-x86 platforms
you are especially sensitive to repository availability, driver behavior, runtimes, and platform requirements after an upgrade

10. One-line takeaway

The key point of Ubuntu 26.04 LTS on the hardware and AI stack side is not that one GPU vendor got a standout upgrade. It is that Intel’s DPC++, NVIDIA’s CUDA, and AMD’s ROCm are all entering the Ubuntu ecosystem in a more official, in-repository, and maintainable way.

If you used to think of Ubuntu as “the system first, then I assemble the GPU environment myself,” 26.04 starts to look more like a distribution that is willing to actively carry AI and heterogeneous computing workloads.

What Is NVIDIA nvbandwidth: How to Use This GPU Bandwidth Testing Tool

Fri, 24 Apr 2026 14:41:35 +0800

If you have recently been troubleshooting interconnect performance between multiple NVIDIA GPUs, or you want to verify the real bandwidth between PCIe, NVLink, host memory, and VRAM, NVIDIA/nvbandwidth is a small tool worth knowing about.

It is not a general benchmark utility, and it is not a hidden command inside a large model framework. It is an open-source tool from NVIDIA specifically designed to measure bandwidth and latency for GPU-related memory copies. Instead of only looking at theoretical bandwidth, nvbandwidth is better at answering a practical question: how much bandwidth can this machine and its current GPU interconnects actually deliver right now?

1. What does `nvbandwidth` do

According to the official README, nvbandwidth is a command-line tool for measuring bandwidth on NVIDIA GPUs.

It mainly focuses on transfer performance across different memcpy patterns, such as:

GPU -> GPU
CPU -> GPU
GPU -> CPU
Transfers between GPUs across multiple nodes

These tests are especially useful in scenarios like:

Troubleshooting interconnect bottlenecks in multi-GPU training or inference
Verifying the actual behavior of links such as NVLink, PCIe, and C2C
Comparing transfer differences across servers, topologies, drivers, or CUDA versions
Performing baseline hardware validation before cluster deployment

In short, nvbandwidth is not about model throughput. It is about the lower-level ability to move data.

2. It does not produce just one simple score

Many people think of a bandwidth test as something that ends with a single number, but nvbandwidth provides more detailed output than that.

It reports results as matrices for each test type. For example, in a test like device_to_device_memcpy_write_ce, it shows the bandwidth between each pair of GPUs by row and column. That means you can see more than just a rough system-wide speed estimate. You can also spot:

Which GPU pairs are especially fast
Which paths are clearly limited by PCIe
Whether certain GPU pairs show abnormally low bandwidth
Whether the multi-GPU topology matches your expectations

If you are working with an 8-GPU server, a dual-socket platform, or a multinode system, this matrix-style output is often more useful than a single average number.
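
A matrix like this is easy to scan automatically. The sketch below assumes the input is only the matrix itself: a header row of column GPU indices followed by one row per source GPU, with `N/A` on the diagonal. `slow_pairs` is a hypothetical helper, and the threshold is whatever you consider abnormally low for your links.

```shell
# slow_pairs THRESHOLD: read a bandwidth matrix on stdin (header row of column
# GPU indices, then one row per source GPU) and print "row->col value" for
# every entry below THRESHOLD. "N/A" cells are skipped.
slow_pairs() {
    awk -v min="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) col[i] = $i; next }
        {
            for (i = 2; i <= NF; i++) {
                if ($i != "N/A" && $i + 0 < min + 0)
                    printf "%s->%s %s\n", $1, col[i - 1], $i
            }
        }
    '
}
```

Feeding it a saved matrix with `slow_pairs 100 < matrix.txt` lists the GPU pairs that fall below 100 GB/s, which is often enough to spot a pair that has fallen back to PCIe.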

3. How to understand `CE` and `SM` copies

The official documentation splits tests into two categories:

CE: copy engine transfers based on memcpy APIs
SM: kernel-based transfers

These two result types are not guaranteed to match exactly, because they represent different copy paths.
If you mainly want to understand regular device-to-device transfer behavior, you will usually look at CE first. If you want to study execution details more closely, then SM is worth checking too.

The README also explains that bandwidth results use the median across multiple test runs by default. Newer versions additionally include variability statistics, which makes it easier to judge how stable the numbers are.
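
To see why a median is a sensible default, it helps to compute one over a few samples that include an outlier. The helper below is an illustration only, not the tool's own code. `median` is a hypothetical name, and averaging the two middle values for an even sample count is an assumption.

```shell
# median: read one number per line on stdin and print the median.
median() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }
    '
}
```

With the samples 270.1, 275.4, 12.0, 274.9, and 276.0, the median is 274.9, so the single slow run at 12.0 does not distort the reported figure the way it would distort a mean.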

4. What environment does it require

nvbandwidth is not a pure binary utility that you simply download and run. It expects a standard CUDA development environment.

The current README lists these basic requirements:

CUDA Toolkit 11.x or newer
A compiler with C++17 support
CMake 3.20+, with 3.24+ recommended
Boost program_options
A usable CUDA device and a compatible driver

The requirements are higher if you want the multinode version. The current README explicitly states:

Multinode builds require CUDA Toolkit 12.3
The driver must be 550 or newer
MPI is required
The nvidia-imex service must be configured

So this is much more of an engineering tool for Linux GPU servers and clusters than something aimed at casual desktop use.

5. How to build and run the single-node version

The single-node build process is straightforward:

cmake .
make

On Ubuntu / Debian, the project also provides a debian_install.sh script that installs common dependencies and builds the project.

After building, you can check the help output first:

`./nvbandwidth -h`

Some commonly used options include:

-l: list available tests
-t: run a specific test by name or index
-p: run tests by prefix
-b: set the memcpy buffer size, default 512 MiB
-i: set the number of benchmark iterations
-j: output JSON
-H: enable huge pages for host memory allocation

If you just want to run the default test suite once, use:

`./nvbandwidth`

If you only want to test one specific item, such as a device-to-device copy:

`./nvbandwidth -t device_to_device_memcpy_read_ce`
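
The `-t` and `-j` options combine naturally when results need to be archived. The wrapper below is a sketch under two assumptions: JSON output is written to standard output, and the binary lives at `./nvbandwidth` unless the hypothetical `NVBW` variable points elsewhere. `run_bw_tests` is likewise a name introduced here.

```shell
# run_bw_tests OUTDIR TEST...: run each named test with JSON output (-j) and
# save it as OUTDIR/TEST.json. NVBW overrides the binary path.
run_bw_tests() {
    outdir="$1"; shift
    mkdir -p "$outdir"
    for t in "$@"; do
        "${NVBW:-./nvbandwidth}" -t "$t" -j > "$outdir/$t.json" || return 1
    done
}
```

For example, `run_bw_tests results device_to_device_memcpy_read_ce device_to_device_memcpy_write_ce` leaves one JSON file per test in `results/`.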

6. Multinode support is one of its standout features

nvbandwidth is not only for single-node multi-GPU testing. It also supports multinode scenarios.

According to the README, the multinode build is done like this:

cmake -DMULTINODE=1 .
make

At runtime, it is typically used together with mpirun, with one process launched per GPU.
The documentation also requires all participating ranks to belong to the same multinode clique, and it recommends mainly running tests with the multinode prefix under MPI.
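
The one-process-per-GPU rule fixes the rank count at nodes times GPUs per node. The helper below only prints a launch line. It assumes Open MPI (`-np`, `--map-by ppr:N:node`) and leaves out the hostfile and any site-specific network options, which you would add yourself. `mpirun_cmd` is a hypothetical name.

```shell
# mpirun_cmd NODES GPUS_PER_NODE: print an Open MPI launch line with one rank
# per GPU, restricted to tests with the "multinode" prefix.
mpirun_cmd() {
    nodes="$1"; gpus="$2"
    echo "mpirun -np $((nodes * gpus)) --map-by ppr:${gpus}:node ./nvbandwidth -p multinode"
}
```

For two nodes with four GPUs each, `mpirun_cmd 2 4` prints a command that launches eight ranks.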

That makes its positioning much closer to high-performance computing and large GPU systems than to simple workstation self-checks.

If you are working with NVLink multinode deployments or more complex platforms such as GB200 / Grace Hopper, the value of nvbandwidth is much higher than it would be on a typical consumer GPU setup.

7. What changed in `v0.9`

As of April 24, 2026, the GitHub Releases page shows that the latest version of nvbandwidth is v0.9, released on April 8, 2026.

The most notable updates in this release include:

Added variability statistics to bandwidth output
Added huge page support for host memory (Windows excluded)
Added pair sampling for device-to-device tests
Added a troubleshooting guide
Unified single-node and multinode execution paths

Two engineering-oriented changes are also worth noting:

Improved CUDA architecture detection without relying as much on direct GPU access
Deprecated Volta (sm_70 / sm_72) support in CUDA Toolkit 13.0+ environments

So if you only looked at early versions before, v0.9 is no longer just a basic bandwidth tester. It is clearly moving toward better automation, troubleshooting, and large-scale system validation.

8. When is it a good fit

nvbandwidth is especially suitable when:

You want to verify real interconnect bandwidth between multiple NVIDIA GPUs
You suspect one GPU is installed in a bandwidth-limited PCIe slot
You want to compare NVLink paths against non-NVLink paths
You are deploying a multinode GPU cluster and need to validate the links
You want test results in JSON for automation pipelines

But if your goal is only to answer questions like “how fast is training” or “how many tokens per second can inference reach,” this tool is not the whole answer.
In that case, you still need workload-level testing with your training framework, inference engine, or real application.

9. How to think about its value

Many GPU performance problems are not really caused by insufficient compute. They happen because the data path is not working as expected.

For example:

GPUs are not using the intended interconnect path
Cross-NUMA access is reducing speed
Certain GPU pairs have abnormal bandwidth
Multinode communication is only partially configured

These issues are often hard to diagnose if you only look at nvidia-smi or model throughput.
A lower-level, matrix-oriented tool like nvbandwidth is useful precisely because it exposes what is happening at the interconnect layer.

So a simple way to think about it is: nvbandwidth is a command-line health check tool for bandwidth on NVIDIA GPU systems.

GitHub project: https://github.com/NVIDIA/nvbandwidth
Releases: https://github.com/NVIDIA/nvbandwidth/releases

llama.cpp GPU Performance Ranking: Full CUDA, ROCm, and Vulkan Scoreboards Explained with pp512 / tg128 / FA

Thu, 23 Apr 2026 10:22:04 +0800

Understanding the Metrics First

What is Q4_0

Q4_0 is a 4-bit quantization format. It does not mean the model is stronger. It means the model is smaller, uses less VRAM, and fits on more devices. Most of these scoreboards standardize on Llama 2 7B, Q4_0 so that GPU-to-GPU comparisons are easier.

What is pp512

pp512 usually means prompt processing 512 tokens, which is the throughput while processing 512 input tokens.

pp = prompt processing
512 = input length is 512 tokens
t/s = tokens per second

This is closer to prompt-ingestion speed, so it is often much higher than generation speed.

What is tg128

tg128 usually means text generation 128 tokens, which is the speed while generating 128 tokens continuously.

tg = text generation
128 = generate 128 tokens continuously
t/s = tokens per second

This is usually closer to the speed users actually feel in interactive usage.

What is FA

FA stands for Flash Attention.

with FA means Flash Attention is enabled
no FA means Flash Attention is disabled

On many GPUs, FA improves pp512 more clearly than tg128, but the gain is not identical across backends, drivers, and GPU architectures.
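
The size of the FA effect can be read directly from the scoreboards by dividing the "with FA" figure by the "no FA" figure. The helper below is a small sketch and `speedup` is a hypothetical name. The example numbers are the RTX 4090 rows from the CUDA tables in this article.

```shell
# speedup BASE NEW: print NEW / BASE with two decimals.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", new / base }'
}
```

For the RTX 4090, `speedup 11992.70 14770.63` gives 1.23 for pp512, while `speedup 186.21 188.96` gives 1.01 for tg128, which matches the pattern described above.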

How to read t/s

t/s means tokens per second. When reading these scoreboards, the key rule is to compare the same type of test with the same settings.

Do not compare pp512 and tg128 as if they were the same thing
Do not mix no FA and with FA
Do not assume CUDA, ROCm, and Vulkan are directly interchangeable
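
One way to stay within these rules is to rank rows from a single table on a single column. The sketch below assumes tab-separated rows in the layout used by the CUDA and ROCm tables (Chip, Memory, pp512, tg128, ...) and orders them by the tg128 mean. `rank_by_tg128` is a hypothetical name.

```shell
# rank_by_tg128: read tab-separated scoreboard rows (Chip, Memory, pp512, tg128, ...)
# on stdin and print "chip<TAB>tg128 mean", fastest first.
rank_by_tg128() {
    awk -F '\t' '{ split($4, a, " "); print a[1] "\t" $1 }' \
        | sort -rn \
        | awk -F '\t' '{ print $2 "\t" $1 }'
}
```

Only rows copied from the same table should be piped in, so that backend and FA setting stay constant across the ranking.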

Quick Takeaways

CUDA is still the strongest overall path in llama.cpp GPU benchmarks, especially on high-end Nvidia GPUs.
ROCm is already delivering strong results on high-end AMD GPUs and Instinct accelerators.
Vulkan has the broadest hardware coverage, including Nvidia, AMD, Intel, older GPUs, and some Apple / Asahi setups.
tg128 is closer to everyday perceived speed, while pp512 is better for judging prompt throughput.

CUDA Scoreboards

Llama 2 7B, Q4_0, no FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
RTX 5090	32 GB / GDDR7 / 512 bit	14073.41 ± 115.16	290.02 ± 1.10	8cf6b42	@totaldev
RTX PRO 6000 Blackwell	96 GB / GDDR7 / 512 bit	14854.63 ± 22.73	274.20 ± 0.14	79c1160	@Tom94
H100 80 GB	80 GB / HBM3 / 5120 bit	9918.34 ± 176.97	267.81 ± 1.54	5143fa8	@Hedede
A100 80 GB	80 GB / HBM2e / 5120 bit	4849.53 ± 8.94	190.88 ± 0.33	5143fa8	@Hedede
RTX 4090 D	24 GB / GDDR6X / 384 bit	10293.86 ± 134.72	189.33 ± 0.19	79c1160	@autonomous-AI-lab
RTX 4090	24 GB / GDDR6X / 384 bit	11992.70 ± 107.99	186.21 ± 0.13	2241453	@lhl
RTX 5080	16 GB / GDDR7 / 256 bit	8297.36 ± 9.50	181.99 ± 0.42	8a4280c	@Hedede
RTX 5070 Ti	16 GB / GDDR7 / 256 bit	6952.38 ± 13.73	176.85 ± 0.07	933414c	@TinyServal
RTX 6000 Ada	48 GB / GDDR6 / 384 bit	9229.23 ± 101.78	176.07 ± 0.26	b8e09f0	@Hedede
RTX 3090 Ti	24 GB / GDDR6X / 384 bit	6567.49 ± 20.30	171.19 ± 3.98	9c35706	@slaren
RTX 3090	24 GB / GDDR6X / 384 bit	5174.69 ± 21.83	158.16 ± 0.21	c76b420	@m18coppola
L40	48 GB / GDDR6 / 384 bit	8870.49 ± 378.76	152.01 ± 0.28	ee09828	@Hedede
RTX 4080 SUPER	16 GB / GDDR6X / 256 bit	8125.15 ± 41.05	148.33 ± 0.20	81086cd	@zacharyarnaise
RTX 4080	16 GB / GDDR6X / 256 bit	8031.64 ± 26.49	142.49 ± 0.16	20638e4	@Ristovski
RTX 3080	10 GB / GDDR6X / 320 bit	5013.86 ± 24.80	139.65 ± 0.99	9c35706	@slaren
RTX A6000	48 GB / GDDR6 / 384 bit	4913.93 ± 6.79	138.73 ± 2.75	4795c91	@Hedede
RTX 4070 Ti SUPER	16 GB / GDDR6X / 256 bit	6924.53 ± 13.87	132.26 ± 0.16	9c35706	@Ristovski
RTX PRO 4000 Blackwell	24 GB / GDDR7 / 192 bit	4992.83 ± 113.52	131.66 ± 0.20	7d77f07	@Hedede
RTX A5000	24 GB / GDDR6 / 384 bit	4028.16 ± 19.14	130.07 ± 2.74	e5155e6	@Hedede
Tesla V100	32 GB / HBM2 / 4096 bit	3042.64 ± 40.71	129.08 ± 0.05	51f5a45	@Hedede
RTX 5070	12 GB / GDDR7 / 192 bit	5184.75 ± 18.70	127.54 ± 0.46	-	@Spyro000
A40	48 GB / GDDR6 / 384 bit	4609.01 ± 10.67	124.11 ± 0.17	3470a5c	@Hedede
A30	24 GB / HBM2e / 3072 bit	2767.10 ± 1.88	124.81 ± 0.16	583cb83	@Hedede
Titan V	12 GB / HBM2 / 3072 bit	2617.46 ± 2.10	108.79 ± 0.05	e56abd2	@Hedede
RTX 2080 Ti	11 GB / GDDR6 / 352 bit	2890.66 ± 2.42	107.51 ± 0.21	9c35706	@ariya
Quadro RTX 6000	24 GB / GDDR6 / 384 bit	2751.18 ± 19.43	102.77 ± 0.04	b8e09f0	@Hedede
Quadro RTX 8000	48 GB / GDDR6 / 384 bit	2709.95 ± 3.35	102.68 ± 0.03	b8e09f0	@Hedede
RTX A4500	20 GB / GDDR6 / 320 bit	2827.20 ± 66.43	97.32 ± 2.80	5cdb27e	@aleksyx
RTX 5060 Ti 16 GB	16 GB / GDDR7 / 128 bit	3737.25 ± 6.79	90.94 ± 0.02	89d1029	@mike-llamacpp
RTX 2070 SUPER	8 GB / GDDR6 / 256 bit	2088.34 ± 1.94	88.06 ± 0.28	bc07349	@phstudy
RTX A4000	16 GB / GDDR6 / 256 bit	2684.06 ± 15.28	83.77 ± 0.37	65349f2	@TinyServal
Titan Xp	12 GB / GDDR5X / 384 bit	1154.96 ± 1.46	76.08 ± 0.08	c4510dc	@Hedede
RTX 3060	12 GB / GDDR6 / 192 bit	2137.50 ± 10.12	75.57 ± 0.07	baa9255	@QuantiusBenignus
Quadro RTX 4000	8 GB / GDDR6 / 256 bit	1536.89 ± 0.90	65.62 ± 0.62	7d77f07	@Hedede
RTX 4060 Ti 8 GB	8 GB / GDDR6 / 128 bit	3394.63 ± 7.44	63.86 ± 0.01	89d1029	@mike-llamacpp
GTX 1080 Ti	11 GB / GDDR5X / 352 bit	1084.41 ± 3.01	62.49 ± 0.06	9c35706	@ariya
RTX A4000 Ada	20 GB / GDDR6 / 160 bit	2779.77 ± 9.91	61.83 ± 0.04	a74a0d6	@sdwolfz
RTX 2060 SUPER	8 GB / GDDR6 / 256 bit	1420.24 ± 1.95	60.04 ± 0.01	5c0eb5e	@ggerganov
Tesla P100	16 GB / HBM2 / 4096 bit	760.80 ± 2.92	58.35 ± 0.00	b8372ee	@Hedede
DGX Spark	128 GB / LPDDR5x	3062.31 ± 11.02	57.21 ± 0.06	5acd455	@ggerganov
Tesla P40	24 GB / GDDR5 / 384 bit	1007.42 ± 1.23	54.74 ± 0.07	c76b420	@m18coppola
RTX 2000 Ada	16 GB / GDDR6 / 128 bit	1956.22 ± 7.74	50.62 ± 0.04	756cfea	@DigitalRudeness
Tesla T4	16 GB / GDDR6 / 256 bit	1219.06 ± 4.18	46.38 ± 0.73	d32e03f	@pt13762104
RTX 4050 Laptop	6 GB / GDDR6 / 96 bit	1725.85 ± 17.85	43.72 ± 0.41	d79d8f3	@TimCabbage
GTX 1660	6 GB / GDDR5 / 192 bit	148.91 ± 0.01	41.35 ± 0.02	9515c61	@ariya
Tesla M40	24 GB / GDDR5 / 384 bit	282.65 ± 0.15	38.04 ± 0.02	97d5117	@Hedede
GTX 1070 Ti	8 GB / GDDR5 / 256 bit	714.44 ± 2.04	37.82 ± 0.02	79c1160	@pebaryan
Jetson AGX Orin	64 GB / LPDDR5 / 256 bit	991.31 ± 1.15	33.58 ± 0.14	c1b1876	@TinyServal
Tesla P4	8 GB / GDDR5 / 256 bit	514.53 ± 3.06	33.29 ± 0.00	c76b420	@m18coppola
P106-100	6 GB / GDDR5 / 192 bit	406.94 ± 0.25	30.40 ± 0.02	5fd160b	@pebaryan
GTX 1060	6 GB / GDDR5 / 192 bit	416.85 ± 1.75	27.79 ± 0.02	5fd160b	@pebaryan
Quadro T1000	4 GB / GDDR5 / 128 bit	79.44 ± 0.01	27.82 ± 0.18	f6da8cb	@hanabu
Quadro P2000	5 GB / GDDR5 / 160 bit	309.30 ± 0.05	23.63 ± 0.00	baa9255	@TinyServal
Quadro P1000	4 GB / GDDR5 / 128 bit	183.40 ± 0.11	13.99 ± 0.13	1e74897	@aleksyx
Tesla K80	12 GB / GDDR5 / 384 bit	133.14 ± 0.55	13.80 ± 0.02	32732f2	@pebaryan

Llama 2 7B, Q4_0, with FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
RTX 5090	32 GB / GDDR7 / 512 bit	14970.15 ± 381.06	300.40 ± 0.28	8cf6b42	@totaldev
RTX PRO 6000 Blackwell	96 GB / GDDR7 / 512 bit	16618.98 ± 20.66	281.11 ± 0.41	5143fa8	@Tom94
H100 80 GB	80 GB / HBM3 / 5120 bit	11263.29 ± 98.34	280.74 ± 1.17	5143fa8	@Hedede
A100 80 GB	80 GB / HBM2e / 5120 bit	5285.96 ± 6.58	200.90 ± 0.12	5143fa8	@Hedede
RTX 4090 D	24 GB / GDDR6X / 384 bit	12506.97 ± 11.51	191.57 ± 0.03	79c1160	@autonomous-AI-lab
RTX 4090	24 GB / GDDR6X / 384 bit	14770.63 ± 102.93	188.96 ± 0.05	2241453	@lhl
RTX 5080	16 GB / GDDR7 / 256 bit	9487.70 ± 21.89	184.68 ± 0.05	8a4280c	@Hedede
RTX 5070 Ti	16 GB / GDDR7 / 256 bit	8419.56 ± 35.50	182.43 ± 0.09	933414c	@TinyServal
RTX 6000 Ada	48 GB / GDDR6 / 384 bit	10576.85 ± 530.21	179.47 ± 0.32	b8e09f0	@Hedede
RTX 3090 Ti	24 GB / GDDR6X / 384 bit	6924.01 ± 10.76	172.26 ± 1.31	9c35706	@slaren
RTX PRO 4500 Blackwell	32 GB / GDDR7 / 256 bit	7251.66 ± 92.40	168.90 ± 0.20	becc481	@Hedede
RTX 3090	24 GB / GDDR6X / 384 bit	5560.06 ± 16.28	161.89 ± 0.18	c76b420	@m18coppola
L40	48 GB / GDDR6 / 384 bit	10097.64 ± 671.22	153.76 ± 0.12	ee09828	@Hedede
RTX 4080 SUPER	16 GB / GDDR6X / 256 bit	9439.01 ± 56.75	147.48 ± 1.41	81086cd	@zacharyarnaise
RTX 4080	16 GB / GDDR6X / 256 bit	9205.93 ± 22.31	143.47 ± 0.02	20638e4	@Ristovski
RTX A6000	48 GB / GDDR6 / 384 bit	5662.39 ± 13.87	144.87 ± 0.18	4795c91	@Hedede
RTX 3080	10 GB / GDDR6X / 320 bit	5569.56 ± 14.04	139.95 ± 0.95	9c35706	@slaren
RTX PRO 4000 Blackwell	24 GB / GDDR7 / 192 bit	5674.44 ± 139.53	136.38 ± 0.13	7d77f07	@Hedede
RTX A5000	24 GB / GDDR6 / 384 bit	4552.15 ± 9.68	135.83 ± 0.11	e5155e6	@Hedede
Tesla V100	32 GB / HBM2 / 4096 bit	2973.78 ± 3.62	134.76 ± 0.02	51f5a45	@Hedede
RTX 4070 Ti SUPER	16 GB / GDDR6X / 256 bit	7612.32 ± 37.35	132.85 ± 0.31	9c35706	@Ristovski
A30	24 GB / HBM2e / 3072 bit	3068.72 ± 0.63	131.93 ± 0.18	583cb83	@Hedede
RTX 5070	12 GB / GDDR7 / 192 bit	5783.44 ± 36.95	128.21 ± 2.52	-	@Spyro000
A40	48 GB / GDDR6 / 384 bit	5256.38 ± 19.39	126.24 ± 0.06	3470a5c	@Hedede
Titan V	12 GB / HBM2 / 3072 bit	2481.25 ± 1.31	112.17 ± 0.01	e56abd2	@Hedede
RTX 2080 Ti	11 GB / GDDR6 / 352 bit	3107.61 ± 4.34	109.17 ± 0.07	9c35706	@ariya
Quadro RTX 6000	24 GB / GDDR6 / 384 bit	3053.96 ± 1.37	104.38 ± 0.04	b8e09f0	@Hedede
Quadro RTX 8000	48 GB / GDDR6 / 384 bit	3052.35 ± 5.64	103.63 ± 0.02	b8e09f0	@Hedede
RTX A4500	20 GB / GDDR6 / 320 bit	3453.10 ± 49.19	103.00 ± 0.25	5cdb27e	@aleksyx
RTX 5060 Ti 16 GB	16 GB / GDDR7 / 128 bit	4195.53 ± 1.98	93.46 ± 0.01	89d1029	@mike-llamacpp
RTX 2070 SUPER	8 GB / GDDR6 / 256 bit	2293.29 ± 5.91	87.71 ± 0.29	bc07349	@phstudy
RTX A4000	16 GB / GDDR6 / 256 bit	2807.83 ± 52.44	85.17 ± 0.66	65349f2	@TinyServal
RTX 3060	12 GB / GDDR6 / 192 bit	2407.67 ± 3.73	76.92 ± 0.03	baa9255	@QuantiusBenignus
Titan Xp	12 GB / GDDR5X / 384 bit	1218.12 ± 1.82	73.84 ± 0.04	c4510dc	@Hedede
Quadro RTX 4000	8 GB / GDDR6 / 256 bit	1662.80 ± 2.04	67.62 ± 0.67	7d77f07	@Hedede
RTX 4060 Ti 8 GB	8 GB / GDDR6 / 128 bit	3803.45 ± 70.80	64.03 ± 0.53	89d1029	@mike-llamacpp
Tesla P100	16 GB / HBM2 / 4096 bit	787.36 ± 3.27	61.99 ± 0.00	b8372ee	@Hedede
GTX 1080 Ti	11 GB / GDDR5X / 352 bit	1138.14 ± 2.02	61.38 ± 0.03	9c35706	@ariya
RTX A4000 Ada	20 GB / GDDR6 / 160 bit	3171.86 ± 4.34	61.37 ± 0.01	a74a0d6	@sdwolfz
RTX 2060 SUPER	8 GB / GDDR6 / 256 bit	1563.77 ± 0.51	61.13 ± 0.05	5c0eb5e	@ggerganov
DGX Spark	128 GB / LPDDR5x	3661.37 ± 38.66	56.74 ± 0.03	5acd455	@ggerganov
Tesla P40	24 GB / GDDR5 / 384 bit	1079.66 ± 0.18	53.73 ± 0.05	c76b420	@m18coppola
RTX 2000 Ada	16 GB / GDDR6 / 128 bit	2250.14 ± 5.91	50.71 ± 0.01	756cfea	@DigitalRudeness
Tesla T4	16 GB / GDDR6 / 256 bit	1309.73 ± 1.02	44.03 ± 0.57	d32e03f	@pt13762104
GTX 1660	6 GB / GDDR5 / 192 bit	154.45 ± 0.52	41.43 ± 0.01	9515c61	@ariya
Tesla M40	24 GB / GDDR5 / 384 bit	290.17 ± 0.11	39.98 ± 0.01	97d5117	@Hedede
GTX 1070 Ti	8 GB / GDDR5 / 256 bit	790.52 ± 2.39	37.87 ± 0.00	79c1160	@pebaryan
Jetson AGX Orin	64 GB / LPDDR5 / 256 bit	1171.96 ± 4.70	35.88 ± 0.18	c1b1876	@TinyServal
Tesla P4	8 GB / GDDR5 / 256 bit	529.53 ± 2.12	33.12 ± 0.03	c76b420	@m18coppola
P106-100	6 GB / GDDR5 / 192 bit	438.49 ± 0.38	30.64 ± 0.06	5fd160b	@pebaryan
GTX 1060	6 GB / GDDR5 / 192 bit	446.19 ± 0.81	28.18 ± 0.01	5fd160b	@pebaryan
Quadro T1000	4 GB / GDDR5 / 128 bit	27.46 ± 0.23	27.46 ± 0.23	f6da8cb	@hanabu
Quadro P2000	5 GB / GDDR5 / 160 bit	311.55 ± 0.19	23.76 ± 0.01	baa9255	@TinyServal
Tesla K80	12 GB / GDDR5 / 384 bit	133.36 ± 0.60	14.27 ± 0.32	32732f2	@pebaryan
Quadro P1000	4 GB / GDDR5 / 128 bit	173.82 ± 0.02	13.65 ± 0.14	1e74897	@aleksyx

Apple Silicon as a Reference Baseline

Discussion #4167 is useful because it established a more unified benchmark format early on. Besides Q4_0, it also includes F16 and Q8_0, which helps explain PP / TG / t/s. The thread explicitly defines:

PP = prompt processing
TG = text generation
t/s = tokens per second

A representative example is the M2 Ultra time-series comparison:

Time	Device	Version / Note	Bandwidth GB/s	GPU Cores	F16 PP	F16 TG	Q8_0 PP	Q8_0 TG	Q4_0 PP	Q4_0 TG
2023-11-21	M2 Ultra	8e672ef	800	76	1401.85	41.02	1248.59	66.64	1238.48	94.27
2024-11-12	M2 Ultra	86ed72d + FA	800	76	1525.95	43.15	1368.18	73.11	1391.78	108.80
2025-08-02	M2 Ultra	5c0eb5e + FA	800	76	1561.35	43.24	1386.97	73.35	1412.42	109.41

Representative Apple Silicon entries shown in the thread:

Device	Q4_0 PP	Q4_0 TG	Q8_0 PP	Q8_0 TG	F16 PP	F16 TG
M1 Pro 16 GPU	266.25	36.41	270.37	22.34	302.14	12.75
M2 Ultra 76 GPU	1238.48	94.27	1248.59	66.64	1401.85	41.02
M3 Max 40 GPU	690.99	65.85	749.37	43.00	794.26	25.27

ROCm / HIP Scoreboards

Llama 2 7B, Q4_0, no FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
Instinct MI300X	192 GB / HBM3 / 8192 bit	11476.40 ± 72.79	232.92 ± 0.53	ee3a9fc	@yeahdongcn
RX 7900 XTX	24 GB / GDDR6 / 384 bit	3552.27 ± 101.96	167.11 ± 0.50	2f0c2db	@Diablo-D3
Instinct MI210	64 GB / HBM2e / 4096 bit	2486.22 ± 9.58	124.51 ± 0.04	8160b38	@65a
Pro W7900	48 GB / GDDR6 / 384 bit	3213.17 ± 80.47	121.18 ± 0.06	8160b38	@65a
RX 7900 XT	20 GB / GDDR6 / 320 bit	3098.38 ± 24.02	116.15 ± 0.06	1e15bfd	@AdamNiederer
RX 9070	16 GB / GDDR6 / 256 bit	2381.77 ± 3.68	114.48 ± 0.60	d0660f2	@andj1210
Instinct MI100	32 GB / HBM2 / 4096 bit	2732.83 ± 1.98	110.48 ± 0.14	9c35706	@firefox42
RX 9070 XT	16 GB / GDDR6 / 256 bit	5055.19 ± 109.58	101.27 ± 0.27	583cb83	@Hadrianneue
RX 7800 XT	16 GB / GDDR6 / 256 bit	2151.81 ± 17.94	100.94 ± 0.10	00131d6	@olegshulyakov
Instinct MI50	32 GB / HBM2 / 4096 bit	1057.24 ± 0.53	98.95 ± 0.25	97d5117	@wtarreau
RX 7900 GRE	16 GB / GDDR6 / 256 bit	1456.98 ± 12.39	96.07 ± 0.10	6fa3b55	@MihaiBojescu
AI PRO R9700	32 GB / GDDR6 / 256 bit	4443.54 ± 339.25	93.84 ± 0.26	bd4ef13	@gogich77
Instinct MI60	32 GB / HBM2 / 4096 bit	1289.11 ± 0.62	91.46 ± 0.13	504af20	@Said-Akbar
RX 6900 XT	16 GB / GDDR6 / 256 bit	1889.84 ± 31.21	88.49 ± 0.00	a972fae	@notgood
Pro VII	16 GB / HBM2 / 4096 bit	1064.99 ± 1.18	87.45 ± 0.04	2739a71	@8XXD8
RX 6800 XT	16 GB / GDDR6 / 256 bit	1447.07 ± 1.36	83.92 ± 0.03	79c1160	@MrLavender
Pro V620	32 GB / GDDR6 / 256 bit	1803.65 ± 2.54	74.66 ± 0.01	5c0eb5e	@samteezy
RX 9060 XT	16 GB / GDDR6 / 256 bit	1419.67 ± 3.64	67.58 ± 0.24	a0e13dc	@lcy0321
RX 5700 XT	8 GB / GDDR6 / 256 bit	354.17 ± 0.18	67.55 ± 0.04	c05e8c9	@daniandtheweb
Instinct MI25	16 GB / HBM2 / 2048 bit	409.83 ± 0.23	63.94 ± 0.06	2739a71	@8XXD8
AI Max+ 395	128 GB / LPDDR5	911.36 ± 1.79	50.01 ± 0.07	e60f241	@firefox42
RX 7600 XT	16 GB / GDDR6 / 128 bit	1099.64 ± 2.05	48.58 ± 0.06	9c35706	@wbruna
RX Vega 64	8 GB / HBM2 / 2048 bit	240.68 ± 0.09	48.46 ± 0.09	ec428b0	@davispuh
Radeon 8060S	System Shared / DDR5	351.36 ± 0.67	47.97 ± 0.33	1d0125b	@hspak
Radeon 880M	System Shared / DDR5	163.25 ± 13.86	12.97 ± 1.63	c55d53a	@Hedede

Llama 2 7B, Q4_0, with FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
Instinct MI300X	192 GB / HBM3 / 8192 bit	11945.97 ± 54.29	218.53 ± 0.09	ee3a9fc	@yeahdongcn
RX 7900 XTX	24 GB / GDDR6 / 384 bit	3874.25 ± 11.92	170.12 ± 0.56	2f0c2db	@Diablo-D3
Pro W7900	48 GB / GDDR6 / 384 bit	3472.86 ± 52.86	127.43 ± 0.12	8160b38	@65a
Instinct MI210	64 GB / HBM2e / 4096 bit	2571.82 ± 2.89	130.18 ± 0.06	8160b38	@65a
RX 9070	16 GB / GDDR6 / 256 bit	2452.68 ± 1.33	115.32 ± 0.52	d0660f2	@andj1210
RX 7900 XT	20 GB / GDDR6 / 320 bit	3261.75 ± 9.09	112.30 ± 0.06	1e15bfd	@AdamNiederer
Instinct MI50	32 GB / HBM2 / 4096 bit	1129.43 ± 0.15	105.82 ± 0.07	97d5117	@wtarreau
Instinct MI100	32 GB / HBM2 / 4096 bit	2755.00 ± 3.68	104.71 ± 0.10	9c35706	@firefox42
AI PRO R9700	32 GB / GDDR6 / 256 bit	4773.07 ± 49.30	97.98 ± 0.13	bd4ef13	@gogich77
RX 7900 GRE	16 GB / GDDR6 / 256 bit	1598.79 ± 11.48	97.53 ± 0.06	6fa3b55	@MihaiBojescu
RX 9070 XT	16 GB / GDDR6 / 256 bit	4903.51 ± 96.36	97.28 ± 0.13	583cb83	@Hadrianneue
RX 7800 XT	16 GB / GDDR6 / 256 bit	2304.63 ± 2.85	95.99 ± 0.21	00131d6	@olegshulyakov
RX 6900 XT	16 GB / GDDR6 / 256 bit	1948.31 ± 13.51	85.04 ± 0.02	a972fae	@notgood
Pro V620	32 GB / GDDR6 / 256 bit	1256.86 ± 0.55	70.83 ± 0.02	5c0eb5e	@samteezy
RX 9060 XT	16 GB / GDDR6 / 256 bit	1479.27 ± 0.71	65.42 ± 0.19	a0e13dc	@lcy0321
RX 5700 XT	8 GB / GDDR6 / 256 bit	314.17 ± 0.29	62.02 ± 0.05	c05e8c9	@daniandtheweb
AI Max+ 395	128 GB / LPDDR5	1003.53 ± 2.91	49.87 ± 0.02	e60f241	@firefox42
Radeon 8060S	System Shared / DDR5	366.08 ± 1.44	48.97 ± 0.15	1d0125b	@hspak
RX 7600 XT	16 GB / GDDR6 / 128 bit	1199.16 ± 1.07	47.65 ± 0.06	9c35706	@wbruna
RX Vega 64	8 GB / HBM2 / 2048 bit	153.17 ± 0.72	42.46 ± 0.40	ec428b0	@davispuh
Radeon 880M	System Shared / DDR5	213.31 ± 14.05	16.16 ± 1.41	c55d53a	@Hedede

Vulkan Scoreboards

Llama 2 7B, Q4_0, no FA

Chip	pp512 t/s	tg128 t/s	Commit	Comments
Nvidia RTX 5090	10381.64 ± 508.84	263.63 ± 0.91	ca71fb9	coopmat2
AMD Radeon RX 7900 XTX	3531.93 ± 31.74	191.28 ± 0.20	2f0c2db
Nvidia RTX 4090	9452.03 ± 187.70	187.97 ± 0.21	4ae88d0	coopmat2
Nvidia RTX 5080	7444.99 ± 20.11	185.10 ± 0.54	f6b533d	coopmat2
Nvidia A100	6389.86 ± 4.83	160.78 ± 0.16	2257758	coopmat2
Nvidia RTX 3090	4298.97 ± 10.59	160.13 ± 0.25	4ae88d0	coopmat2
Nvidia RTX 4080 Super	7101.18 ± 269.79	147.13 ± 5.64	81086cd	coopmat2
Nvidia RTX 3080	4287.11 ± 55.50	139.15 ± 0.05	7c7d6ce	coopmat2
Nvidia RTX A5000	3641.55 ± 9.05	139.89 ± 0.69	4ae88d0	coopmat2
AMD Radeon RX 9070 XT	5036.04 ± 88.16	137.11 ± 0.02	e9fd8dc
Nvidia RTX 5070 Ti	6213.63 ± 27.72	135.63 ± 0.18	d13d0f6	coopmat2
AMD Radeon AI Pro R9700	4036.04 ± 34.58	130.19 ± 0.39	3191462
Nvidia Tesla V100	1391.39 ± 1.19	129.58 ± 0.58	7d77f07
Nvidia RTX 4070 Ti Super	6099.18 ± 154.30	129.45 ± 0.18	4ae88d0	coopmat2
AMD Radeon RX 7900 XT	2941.58 ± 17.17	123.18 ± 0.40	71e74a3
AMD Radeon RX 9070	3164.10 ± 66.84	119.71 ± 3.40	21c17b5
AMD Radeon RX 7800 XT	2017.33 ± 19.30	118.27 ± 0.27	4fdbc1e
AMD Radeon RX 7900 GRE	2336.31 ± 7.52	116.11 ± 0.26	4b2a477
Apple M3 Ultra	1116.83 ± 0.55	115.54 ± 0.78	2d451c8	MoltenVK
Intel Arc Pro B70	3379.00 ± 47.92	112.02 ± 1.08	b863507
Nvidia Titan V	984.36 ± 4.13	108.86 ± 0.28	e56abd2
AMD Radeon Pro VII	1078.54 ± 0.86	107.82 ± 0.14	N/A
AMD Radeon RX 6900 XT	1837.21 ± 25.44	104.60 ± 0.30	a972fae
Intel Arc Pro A60	2261.11 ± 9.53	104.25 ± 0.07	97d5117
AMD Radeon RX 6800 XT	1752.92 ± 1.71	100.32 ± 0.97	N/A
AMD Radeon VII	1059.14 ± 0.56	101.19 ± 0.53	77d6ae4
Nvidia RTX 2080 Ti	1888.24 ± 9.20	97.58 ± 6.60	N/A
AMD Radeon RX 6800	1698.69 ± 0.80	95.61 ± 0.19	4b385bf
AMD Radeon Pro W6800X Duo	687.71 ± 4.33	94.82 ± 0.12	N/A
Nvidia RTX 5060 Ti	3460.92 ± 7.16	93.51 ± 0.15	89f10ba	coopmat2
Nvidia RTX 4070	3179.37 ± 46.16	92.29 ± 0.28	9a48399
AMD Radeon Pro W6800X	510.80 ± 0.13	86.47 ± 0.46	13b4548	MoltenVK
AMD Radeon RX 6700 XT	1051.20 ± 0.98	83.88 ± 0.08	6d75883
AMD Radeon RX 6750 XT	1040.58 ± 0.35	81.98 ± 0.03	228f34c
AMD Radeon Pro V620	1595.32 ± 1.59	81.78 ± 0.06	03d4698
Nvidia RTX 3070	2113.02 ± 7.38	78.71 ± 0.13	1b8fb81
AMD Radeon Instinct MI60	369.26 ± 2.48	78.16 ± 1.40	504af20
Nvidia RTX 3060	1815.70 ± 5.85	75.94 ± 0.80	92c0b38	coopmat2
Apple M4 Max	724.77 ± 20.93	75.02 ± 0.14	1ece0cb6
Nvidia Tesla T10	1692.70 ± 2.05	75.01 ± 0.21	7f76692	coopmat2
Nvidia RTX A4000	2248.14 ± 7.59	73.74 ± 0.08	f5245b5	coopmat2
AMD Radeon RX 5700 XT	529.69 ± 0.26	70.73 ± 0.04	4fdbc1e
AMD Radeon RX 9060 XT	2141.67 ± 6.87	70.54 ± 0.74	ed52f36
Intel Arc B580	620.94 ± 15.33	70.14 ± 0.28	7f76692
AMD Radeon Pro V540	583.88 ± 6.56	69.64 ± 0.24	9da3dcd
AMD Radeon Pro W5700	449.85 ± 0.46	68.55 ± 0.15	23bc779
Intel Arc Pro B60	522.36 ± 3.60	68.55 ± 0.01	516a4ca
Nvidia GTX 1080 Ti	540.69 ± 0.71	64.99 ± 0.08	360d653
Nvidia RTX 2070 Super	1199.13 ± 7.70	64.64 ± 0.20	b7552cf
Nvidia RTX 3070 Mobile	1689.40 ± 19.57	63.64 ± 0.39	ceff6bb	coopmat2
Nvidia Tesla P100	678.14 ± 1.40	63.16 ± 0.06	eec1e33
AMD BC-250	370.66 ± 0.04	62.32 ± 0.32	5886f4f
AMD Radeon RX 6650 XT	1029.52 ± 1.21	62.14 ± 0.02	dbb852b
Nvidia RTX 4060 Mobile	2135.66 ± 23.18	59.53 ± 0.03	a5c07dc	coopmat2
Nvidia Tesla P40	488.06 ± 0.27	59.36 ± 0.16	N/A
Nvidia GTX 1660 Ti Mobile	511.67 ± 2.85	56.60 ± 0.07	b43556e
AMD Radeon Instinct MI25	439.42 ± 0.34	54.69 ± 0.03	2739a71
AMD Radeon RX 6600 XT	574.65 ± 0.86	53.92 ± 0.11	091592d
AMD Ryzen AI Max+ 395	1288.96 ± 6.49	53.59 ± 0.38	7f76692
AMD Radeon RX 7600 XT	840.85 ± 3.02	53.02 ± 0.01	01d8eaa
Intel Arc A770	1073.85 ± 29.68	52.56 ± 0.11	a69d54f
Nvidia GB10	2737.79 ± 19.56	52.28 ± 0.03	b9da444	coopmat2
AMD FirePro S9300 x2	247.26 ± 0.43	51.86 ± 0.11	eec1e33	Split across two GPUs
AMD Radeon RX 6600	761.89 ± 1.76	50.63 ± 0.02	b1c70e2
AMD Radeon RX Vega 56	439.87 ± 0.61	50.23 ± 0.14	92c0b38
Intel Arc B570	913.95 ± 0.90	49.64 ± 0.03	7f76692
Nvidia RTX 3060 Mobile	1059.76 ± 3.54	49.03 ± 0.13	dbb3a47
AMD Radeon RX 6800M	861.99 ± 7.67	48.71 ± 0.71	8e6f8bc
AMD Radeon RX 6600M	605.59 ± 0.65	48.21 ± 0.07	fe5b78c
Intel Arc A770M	875.92 ± 2.16	47.69 ± 0.16	eeee367
Nvidia P104-100	311.90 ± 0.22	46.18 ± 0.05	eec1e33
AMD Radeon RX Vega 64	356.08 ± 0.09	45.73 ± 0.18	ec428b0
Nvidia RTX A2000	1245.19 ± 8.76	45.52 ± 0.54	b1afcab	coopmat2
AMD Radeon RX 7600M XT	459.39 ± 2.34	45.28 ± 0.10	b9ab0a4	eGPU
AMD Radeon Pro V340	375.41 ± 0.24	45.16 ± 0.06	9da3dcd	Split across two GPUs
Nvidia GTX 1070 Ti	297.50 ± 0.54	42.86 ± 1.20	860a9e4	eGPU
Intel Arc A750	1075.94 ± 13.89	42.66 ± 0.18	c1b1876
Nvidia RTX 4050 Mobile	1154.28 ± 15.76	41.89 ± 0.10	d79d8f3
Nvidia GTX 1070	321.57 ± 0.93	41.48 ± 0.09	eec1e33
Intel Arc Pro B50	193.50 ± 0.24	39.99 ± 0.10	7b43f55
Nvidia Tesla M40	92.48 ± 0.02	39.35 ± 1.22	b8372ee
AMD Radeon RX 580	258.03 ± 0.71	39.32 ± 0.03	de4c07f
AMD Radeon RX 470	218.07 ± 0.56	38.63 ± 0.21	e288693
AMD Radeon Pro W5500	315.39 ± 3.76	36.82 ± 0.38	860a9e4
AMD Radeon RX 480	248.66 ± 0.28	34.71 ± 0.14	3b15924
Apple M2 Ultra	205.98 ± 0.02	34.34 ± 0.12	dbb852b	Asahi Linux
Nvidia GTX 980	186.24 ± 0.09	33.90 ± 0.51	860a9e4
Nvidia P106-100	183.78 ± 0.26	29.77 ± 0.04	23bc779
AMD FirePro W8100	155.22 ± 0.17	29.52 ± 0.05	4536363
Nvidia Tesla P4	265.54 ± 0.21	28.03 ± 0.14	24d2ee0
AMD Radeon RX 6500 XT	255.25 ± 0.35	27.81 ± 0.10	g9fdfcd
Apple M3	263.70 ± 0.02	26.39 ± 0.14	b9ab0a4	MoltenVK
AMD FirePro S10000	94.78 ± 0.02	25.32 ± 0.02	914a82d	Split across two GPUs
Nvidia Quadro P2000	169.55 ± 0.17	23.05 ± 0.03	63f8fe0
Intel Core Ultra 200 Series	544.95 ± 4.15	22.49 ± 0.09	cea560f
AMD Ryzen AI 9 300 Series	479.07 ± 0.41	22.41 ± 0.18	N/A
AMD Ryzen 6000 Series	240.89 ± 0.52	21.26 ± 0.08	ee09828
Apple M2 Pro	62.70 ± 0.03	20.95 ± 0.11	1fe0029	Asahi Linux
Nvidia GTX 1050 Ti	136.42 ± 0.67	20.96 ± 0.21	2f0c2db
AMD Ryzen 8000 Series	266.19 ± 1.36	20.53 ± 0.08	a5c07dc
AMD Ryzen 7000 Series	281.62 ± 1.56	19.91 ± 0.07	ebce03e
AMD Ryzen Z1 Extreme	199.36 ± 7.02	18.77 ± 0.02	53ff6b9
AMD FirePro D700	69.95 ± 0.04	16.62 ± 0.01	d3bd719	MoltenVK, running in FP16 mode on FP32 only chip
AMD Radeon Pro WX 4100	78.79 ± 0.10	16.05 ± 0.07	860a9e4
Apple M2	50.79 ± 0.16	13.50 ± 0.02	8c0d6bb	Asahi Linux
Apple M1	38.29 ± 0.00	12.47 ± 0.03	2370665	Asahi Linux
AMD Ryzen 5000 Series	90.55 ± 0.08	10.98 ± 0.07	d84635b
Intel Core 1100 Series	187.20 ± 1.78	10.39 ± 0.04	abb9f3c
AMD Radeon RX 550	52.66 ± 0.49	10.20 ± 0.01	N/A
AMD Ryzen 4000 Series	103.87 ± 0.02	9.63 ± 0.01	4b385bf
Nvidia Tesla K80	89.46 ± 0.10	9.39 ± 0.06	5d46bab	Running on single GPU
Nvidia Tesla K40	64.37 ± 0.09	9.30 ± 0.19	eec1e33
MediaTek Dimensity 9400	38.36 ± 15.15	8.92 ± 0.06	b9ab0a4	GPU supports coopmat but pp512 is faster with it turned off
Intel Core Ultra 100 Series	185.51 ± 0.22	8.21 ± 0.07	1d72c84
AMD Ryzen 3000 Series	48.63 ± 0.10	8.49 ± 0.01	1fe0029
CIX CD8180	2.80 ± 0.01	5.51 ± 0.00	4dca015
Intel Core 1000 Series	25.58 ± 0.00	4.25 ± 0.18	N/A
Intel Core 8000 Series	25.43 ± 0.17	3.35 ± 0.03	c4df49a
Intel N150	28.84 ± 0.02	2.93 ± 0.00	4f63cd7

Llama 2 7B, Q4_0, FA enabled

Chip	pp512 t/s	tg128 t/s	Commit	Comments
Nvidia RTX 5090	11796.38 ± 601.36	273.68 ± 0.52	ca71fb9	coopmat2
AMD Radeon RX 7900 XTX	3332.90 ± 11.47	195.30 ± 0.23	2f0c2db
Nvidia RTX 5080	8054.59 ± 35.68	192.17 ± 0.21	f6b533d	coopmat2
Nvidia RTX 4090	10830.41 ± 36.25	190.10 ± 0.31	4ae88d0	coopmat2
Nvidia A100	7064.40 ± 1.63	170.56 ± 0.02	2257758	coopmat2
Nvidia RTX 3090	4732.33 ± 4.80	162.28 ± 0.21	4ae88d0	coopmat2
Nvidia RTX 4080 Super	8007.37 ± 46.03	150.20 ± 0.26	81086cd	coopmat2
Nvidia RTX 3080	4913.83 ± 21.52	145.74 ± 0.16	7c7d6ce	coopmat2
Nvidia Tesla V100	1411.25 ± 2.12	142.13 ± 0.03	7d77f07
Nvidia RTX A5000	4071.22 ± 13.13	140.43 ± 0.22	4ae88d0	coopmat2
AMD Radeon RX 9070 XT	4911.74 ± 28.52	138.20 ± 0.18	e9fd8dc
Nvidia RTX 5070 Ti	6764.53 ± 11.95	135.65 ± 0.02	d13d0f6	coopmat2
AMD Radeon AI Pro R9700	4333.83 ± 29.36	130.90 ± 0.12	3191462
AMD Radeon RX 7900 XT	3043.93 ± 10.42	124.20 ± 0.09	71e74a3
AMD Radeon RX 7800 XT	2094.64 ± 14.38	119.63 ± 0.13	4fdbc1e
AMD Radeon RX 9070	3277.24 ± 18.17	119.55 ± 0.06	21c17b5
AMD Radeon RX 7900 GRE	2402.07 ± 22.50	116.77 ± 0.08	4b2a477
Apple M3 Ultra	1115.55 ± 0.75	115.99 ± 0.12	2d451c8	MoltenVK
Intel Arc Pro B70	3314.53 ± 17.95	111.63 ± 0.05	b863507
Nvidia Titan V	792.74 ± 4.30	109.21 ± 0.72	e56abd2
AMD Radeon Pro VII	783.94 ± 0.77	108.45 ± 0.48	N/A
AMD Radeon RX 6900 XT	1761.93 ± 4.75	106.15 ± 0.04	a972fae
Nvidia RTX 2080 Ti	1936.25 ± 32.08	100.99 ± 0.24	N/A
AMD Radeon RX 6800 XT	1704.79 ± 0.71	100.50 ± 0.06	N/A
AMD Radeon Pro W6800X Duo	795.28 ± 0.72	100.08 ± 0.02	N/A
Nvidia RTX 5060 Ti	3912.65 ± 5.86	97.01 ± 0.14	89f10ba	coopmat2
AMD Radeon RX 6800	1749.46 ± 3.36	96.65 ± 0.48	4b385bf
Nvidia RTX 4070	4293.57 ± 27.70	91.49 ± 0.89	9a48399	coopmat2
AMD Radeon RX 6750 XT	997.05 ± 0.45	82.29 ± 0.06	228f34c
AMD Radeon RX 6700 XT	1010.90 ± 12.89	81.86 ± 0.19	6d75883
Nvidia RTX 3060	2012.88 ± 10.12	80.59 ± 0.02	92c0b38	coopmat2
AMD Radeon Pro V620	1556.31 ± 2.82	79.24 ± 0.09	03d4698
Nvidia RTX A4000	2482.74 ± 26.05	76.07 ± 0.08	f5245b5	coopmat2
Nvidia Tesla T10	1840.14 ± 1.22	76.05 ± 0.13	7f76692	coopmat2
AMD Radeon RX 5700 XT	538.31 ± 0.35	74.43 ± 0.03	4fdbc1e
Intel Arc B580	419.49 ± 3.37	72.00 ± 0.24	7f76692
Apple M4 Max	557.46 ± 26.87	71.79 ± 4.16	1ece0cb6
AMD Radeon Pro W5700	446.98 ± 0.39	71.30 ± 0.24	23bc779
Intel Arc Pro B60	274.76 ± 0.27	70.54 ± 0.03	516a4ca
AMD Radeon RX 9060 XT	1915.41 ± 7.90	70.52 ± 0.16	ed52f36
Nvidia Tesla P100	685.51 ± 0.88	66.48 ± 0.02	eec1e33
AMD Radeon RX 6650 XT	1088.90 ± 0.40	64.53 ± 0.75	dbb852b
Nvidia GTX 1080 Ti	529.96 ± 0.38	64.63 ± 0.10	360d653
AMD BC-250	356.87 ± 1.24	63.14 ± 0.09	5886f4f
Nvidia RTX 3070 Mobile	1832.07 ± 57.14	62.92 ± 0.37	ceff6bb	coopmat2
Nvidia RTX 4060 Mobile	2358.03 ± 12.17	60.01 ± 0.08	a5c07dc	coopmat2
Nvidia Tesla P40	484.37 ± 0.27	59.22 ± 0.15	N/A
Nvidia GTX 1660 Ti Mobile	514.34 ± 0.88	57.30 ± 0.42	b43556e
AMD Radeon RX 7600 XT	1024.38 ± 7.56	56.11 ± 0.02	01d8eaa
AMD FirePro S9300 x2	243.33 ± 0.22	55.64 ± 0.06	eec1e33	Split across two GPUs
Nvidia GB10	3279.89 ± 26.78	53.64 ± 0.05	b9da444	coopmat2
AMD Radeon RX 6600	808.76 ± 0.15	53.24 ± 0.03	b1c70e2
Intel Arc A770	1119.68 ± 30.25	53.07 ± 0.09	a69d54f
AMD Ryzen AI Max+ 395	1357.07 ± 10.94	53.00 ± 0.13	7f76692
AMD Radeon RX Vega 56	428.54 ± 0.50	52.66 ± 0.03	92c0b38
Intel Arc B570	288.51 ± 0.09	50.49 ± 0.05	7f76692
Nvidia P104-100	325.30 ± 0.25	48.64 ± 0.04	eec1e33
AMD Radeon Pro V340	360.23 ± 0.74	47.54 ± 0.06	9da3dcd	Split across two GPUs
AMD Radeon RX 6800M	784.16 ± 2.76	49.06 ± 0.34	8e6f8bc
AMD Radeon RX Vega 64	320.12 ± 0.22	47.06 ± 0.01	ec428b0
Nvidia RTX A2000	1361.85 ± 3.26	45.69 ± 0.20	b1afcab	coopmat2
Intel Arc A770M	384.74 ± 0.78	45.68 ± 0.06	eeee367
Intel Arc A750	303.37 ± 1.44	43.96 ± 0.03	c1b1876
Nvidia GTX 1070 Ti	292.85 ± 0.23	43.42 ± 0.34	860a9e4	eGPU
Nvidia GTX 1070	330.84 ± 1.02	43.33 ± 0.06	360d653
Nvidia Tesla M40	93.35 ± 0.01	41.68 ± 0.01	b8372ee
Intel Arc Pro B50	132.48 ± 0.04	41.02 ± 0.04	7b43f55
AMD Radeon RX 470	197.26 ± 0.27	37.28 ± 0.11	3769fe6
AMD Radeon RX 480	194.52 ± 0.61	37.23 ± 0.09	0bcb40b
Apple M2 Ultra	198.83 ± 0.85	198.83 ± 0.85	dbb852b	Asahi Linux
Nvidia GTX 980	180.97 ± 0.74	34.16 ± 0.10	860a9e4
Nvidia P106-100	183.40 ± 0.34	30.79 ± 0.32	23bc779
AMD FirePro W8100	140.52 ± 0.34	29.28 ± 0.14	4536363
Nvidia Tesla P4	287.14 ± 0.29	28.37 ± 0.24	24d2ee0
Nvidia Quadro P2000	181.71 ± 0.12	23.77 ± 0.02	63f8fe0
Intel Core Ultra 200 Series	536.48 ± 1.27	23.05 ± 0.04	cea560f
AMD Ryzen AI 9 300 Series	532.59 ± 3.55	22.31 ± 0.06	N/A
AMD Ryzen 6000 Series	277.91 ± 0.37	21.15 ± 0.09	ee09828
Apple M2 Pro	58.86 ± 0.02	20.97 ± 0.03	1fe0029	Asahi Linux
AMD Ryzen 8000 Series	297.39 ± 1.22	20.59 ± 0.38	a5c07dc
AMD Ryzen 7000 Series	312.85 ± 2.51	20.09 ± 0.35	835b2b9
Nvidia GTX 1050 Ti	127.54 ± 1.03	20.08 ± 0.17	2f0c2db
AMD Radeon Pro WX 4100	75.59 ± 0.19	16.56 ± 0.04	860a9e4
Apple M1	35.93 ± 0.00	12.85 ± 0.02	2370665	Asahi Linux
Apple M2	46.81 ± 0.08	12.25 ± 2.30	8c0d6bb	Asahi Linux
AMD Ryzen 5000 Series	79.06 ± 0.01	10.75 ± 0.00	5d195f1
Intel Core 1100 Series	174.77 ± 4.47	10.58 ± 0.03	abb9f3c
Nvidia Tesla K40	64.37 ± 0.02	9.92 ± 0.06	eec1e33
AMD Ryzen 4000 Series	113.32 ± 0.01	9.87 ± 0.01	4b385bf
Nvidia Tesla K80	88.26 ± 0.19	9.49 ± 0.01	5d46bab	Running on single GPU
AMD Ryzen 5 3000 Series	47.41 ± 0.14	8.47 ± 0.01	1fe0029
Intel Core Ultra 100 Series	77.66 ± 2.75	7.75 ± 0.05	2e89f76
Intel Core 8000 Series	25.55 ± 0.04	3.35 ± 0.02	c4df49a
Intel N150	25.59 ± 0.00	2.91 ± 0.00	4f63cd7

How to Use These Tables

Decide whether you care more about tg128 or pp512. For chat and interactive use, tg128 usually matters more. For long prompts and batch throughput, pp512 matters more.
Match the backend you actually use. Nvidia users should usually prioritize CUDA. AMD users should compare ROCm and Vulkan first. Cross-platform users should pay close attention to Vulkan.
Check FA last. On many GPUs, enabling FA improves pp512 more than tg128, so a single headline number can be misleading.
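
The first step above, picking a metric and ranking by it, can be sketched in a few lines of Python. This is only an illustration: it assumes rows are copied as tab-separated text in the same layout as the Vulkan tables (chip, pp512, tg128, commit, optional comment), and `parse_value`, `parse_row`, and `rank` are hypothetical helper names, not part of llama.cpp.

```python
import re

# A cell looks like "3531.93 ± 31.74": mean, then standard deviation.
VALUE = re.compile(r"^\s*([\d.]+)\s*±\s*([\d.]+)\s*$")

def parse_value(cell):
    """Split a 'mean ± stddev' cell into two floats."""
    m = VALUE.match(cell)
    if m is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return float(m.group(1)), float(m.group(2))

def parse_row(line):
    """Parse one tab-separated row: chip, pp512, tg128, commit, [comment]."""
    cols = line.rstrip("\n").split("\t")
    pp_mean, _ = parse_value(cols[1])
    tg_mean, _ = parse_value(cols[2])
    return {"chip": cols[0], "pp512": pp_mean, "tg128": tg_mean}

def rank(rows, metric):
    """Sort rows by the chosen metric, fastest first."""
    return sorted(rows, key=lambda r: r[metric], reverse=True)

# Three rows taken from the Vulkan "no FA" table.
ROWS = [
    "Nvidia RTX 4090\t9452.03 ± 187.70\t187.97 ± 0.21\t4ae88d0\tcoopmat2",
    "AMD Radeon RX 7900 XTX\t3531.93 ± 31.74\t191.28 ± 0.20\t2f0c2db",
    "Nvidia RTX 3090\t4298.97 ± 10.59\t160.13 ± 0.25\t4ae88d0\tcoopmat2",
]

parsed = [parse_row(r) for r in ROWS]
by_tg = [r["chip"] for r in rank(parsed, "tg128")]
by_pp = [r["chip"] for r in rank(parsed, "pp512")]
print("by tg128:", by_tg)
print("by pp512:", by_pp)
```

The two rankings come out in a different order for the same three cards, which is exactly why the metric has to be chosen before reading the table.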

One-Sentence Summary

In llama.cpp benchmarks, pp512, tg128, Q4_0, FA, and CUDA / ROCm / Vulkan describe different dimensions. Once the benchmark context is clear, the tables become much easier to read.

Sources

CUDA discussion #15013: https://github.com/ggml-org/llama.cpp/discussions/15013
Apple Silicon discussion #4167: https://github.com/ggml-org/llama.cpp/discussions/4167
ROCm discussion #15021: https://github.com/ggml-org/llama.cpp/discussions/15021
Vulkan discussion #10879: https://github.com/ggml-org/llama.cpp/discussions/10879

What the Common GPU Inference Benchmark Metrics Actually Mean: FA, pp512, tg128, and Q4_0

Thu, 23 Apr 2026 00:15:00 +0800

As soon as you start looking at local LLM or GPU inference benchmarks, you quickly run into a stack of abbreviations: FA, pp512, tg128, and Q4_0. They all look like performance metrics, but without context they can be surprisingly hard to interpret.

For example, you may see a line like this:

`CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)`

Then right below it, you might also see:

pp512 t/s
tg128 t/s

If you do not unpack what these terms mean, it becomes difficult to understand what the benchmark is actually measuring, or how to compare the results of two different GPUs.

This article is not about which GPU is the better buy. It is specifically about breaking down the most common metrics you see in GPU inference benchmarks.

First, what the whole title line is actually saying

A line like CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA) already tells you most of the test setup.

At minimum, it contains four layers of information:

CUDA: the benchmark is running on the NVIDIA CUDA path
Llama 2 7B: the model being tested is the 7B version of Llama 2
Q4_0: the model uses a 4-bit quantized format
no FA: Flash Attention was disabled in this test

So in practical terms, this kind of title usually means:

“A benchmark of a quantized large model running on an NVIDIA GPU, measured under a specific inference path.”
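
To make the four layers concrete, here is a small Python sketch that splits such a title into its parts. The regular expression only covers the title shape quoted above; the `with FA` variant and the `parse_title` name are assumptions made for illustration.

```python
import re

# Matches "<backend> Scoreboard for <model>, <quant> (<FA state>)".
TITLE = re.compile(
    r"^(?P<backend>\w+) Scoreboard for (?P<model>.+?), "
    r"(?P<quant>Q\w+) \((?P<fa>no FA|with FA|FA)\)$"
)

def parse_title(title):
    """Return backend, model, quantization, and Flash Attention state."""
    m = TITLE.match(title)
    if m is None:
        raise ValueError(f"unrecognized title: {title!r}")
    return {
        "backend": m["backend"],
        "model": m["model"],
        "quant": m["quant"],
        "flash_attention": m["fa"] != "no FA",
    }

info = parse_title("CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)")
print(info)
```

Two scores are only directly comparable when every field except the GPU itself is the same.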

What FA means: Flash Attention

Here, FA stands for Flash Attention.

It is one of the most important acceleration techniques in large-model training and inference, mainly because it optimizes how attention is computed. In Transformer models, attention is already one of the most expensive and memory-bandwidth-heavy parts of the entire pipeline.

A traditional attention implementation often suffers from a few problems:

frequent memory reads and writes
many intermediate results
repeated data movement between VRAM and on-chip cache
rapidly growing overhead as context length increases

What Flash Attention does, in simple terms, is:

reorganize the computation order
reduce how often intermediate results are written back to VRAM
keep more of the work inside faster cache

That gives it three typical advantages:

it is faster
it saves memory
it is mathematically equivalent to standard attention rather than a lower-accuracy shortcut

That is why so many modern inference and training frameworks treat it as a major optimization feature.
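
The "mathematically equivalent" point can be checked with a toy example. The sketch below is not a Flash Attention kernel: it is a pure-Python, single-query illustration of the running-softmax idea, where keys and values are consumed block by block and the full score vector is never stored. All names and the sample data are made up for the demonstration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def standard_attention(q, keys, values):
    """Reference: compute all scores, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def tiled_attention(q, keys, values, block=2):
    """Process keys/values in blocks, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

Q = [1.0, 0.5]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

ref = standard_attention(Q, K, V)
out = tiled_attention(Q, K, V, block=2)
print(max(abs(a - b) for a, b in zip(ref, out)))
```

The printed difference is at the level of floating-point rounding. The result is the same; only the order of the work and the amount of intermediate storage change.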

What no FA means

If FA means Flash Attention, then no FA simply means that Flash Attention was not enabled for this test.

In other words, the benchmark was measured using a more traditional attention implementation.

There are several reasons benchmark tables explicitly label no FA:

to keep a baseline for comparison
to support hardware or software environments where FA is unavailable
to avoid mixing scores from different optimization conditions

So when you see no FA, you should not read it as “this GPU is weak.” A more accurate reading is:

“This score was measured without Flash Attention enabled.”

What Q4_0 means: a quantization format

Q4_0 refers to a 4-bit quantization format.

The original model weights are usually not stored at such low precision. Quantization compresses higher-precision weights into a lower-bit representation so the model becomes easier to run on consumer GPUs.

A rough way to think about it is:

Q: Quantization
4: 4-bit
_0: a specific quantization scheme identifier

Its practical importance is straightforward:

smaller model size
lower VRAM requirements
better chances of fitting on consumer hardware

So Llama 2 7B, Q4_0 does not mean just “a normal 7B model.” It means “a 7B model already compressed using a 4-bit quantization format.”
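
A quick back-of-the-envelope calculation shows why this matters for VRAM. The sketch assumes the ggml-style Q4_0 layout, in which every block of 32 weights shares one 16-bit scale, and a parameter count of roughly 6.74 billion for Llama 2 7B. Both figures are assumptions for the estimate, and real model files differ slightly because not every tensor is stored as Q4_0.

```python
BLOCK_WEIGHTS = 32                     # weights per Q4_0 block (assumed layout)
SCALE_BYTES = 2                        # one fp16 scale per block
QUANT_BYTES = BLOCK_WEIGHTS * 4 // 8   # 32 four-bit values packed into 16 bytes

def q4_0_bits_per_weight():
    """Effective bits per weight, including the per-block scale."""
    return (SCALE_BYTES + QUANT_BYTES) * 8 / BLOCK_WEIGHTS

def estimate_size_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N_PARAMS = 6.74e9  # approximate parameter count (assumption)

bpw = q4_0_bits_per_weight()
q4_gib = estimate_size_gib(N_PARAMS, bpw)
fp16_gib = estimate_size_gib(N_PARAMS, 16)
print(f"Q4_0: {bpw} bits/weight, about {q4_gib:.2f} GiB")
print(f"FP16: 16 bits/weight, about {fp16_gib:.2f} GiB")
```

Under these assumptions the quantized weights need a bit over a quarter of the FP16 footprint, which is the difference between fitting on an 8 GB card and not fitting at all.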

What pp512 t/s means

pp512 usually means:

Prompt Processing 512 tokens

It measures how fast the model processes the input prompt, usually in t/s, meaning tokens per second.

Here, 512 means the prompt length used in the test was 512 tokens.

This metric does not measure output speed. It measures how quickly the model encodes and computes over the input before it starts responding. You can think of it as the speed of the “reading the prompt first” stage.

One important property of this stage is that it is usually much more parallelizable.

Because the input sequence can be processed in batches, the GPU can often keep its compute units highly utilized. That is why pp512 numbers can look extremely high, sometimes almost suspiciously high at first glance.

So if you see something like:

`pp512 ≈ 14000 t/s`

there is no reason to panic. That is measuring prompt-processing throughput, not the speed of token-by-token output generation.

What tg128 t/s means

tg128 usually means:

Text Generation 128 tokens

It measures the average speed of generating 128 tokens, again in t/s.

This metric is much closer to what people intuitively mean when they ask whether a model feels fast, because it is directly measuring the output stage.

But the biggest difference from pp512 is that text generation is usually autoregressive.

That means:

the model must generate the first token
then use that to generate the second
then continue to the third

So this stage cannot be parallelized the way prompt processing can, and it is naturally much slower.

That is why it is perfectly normal to see something like:

pp512 in the thousands or tens of thousands of t/s
tg128 only in the tens or hundreds of t/s

This is not a benchmark error. These two metrics are measuring fundamentally different workloads.
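
To see what the two rates mean in wall-clock terms, divide the token count by the rate. The numbers below are illustrative values in the range that public llama.cpp scoreboards report for a high-end consumer GPU; they are not a measurement of any specific setup.

```python
def phase_seconds(n_tokens, tokens_per_second):
    """Time spent in one phase at a given throughput."""
    return n_tokens / tokens_per_second

PP512_TPS = 3874.25   # illustrative prompt-processing rate
TG128_TPS = 170.12    # illustrative generation rate

prompt_s = phase_seconds(512, PP512_TPS)
generate_s = phase_seconds(128, TG128_TPS)
print(f"512-token prompt: {prompt_s:.3f} s")
print(f"128-token answer: {generate_s:.3f} s")
```

Even though the prompt is four times longer than the answer, generation takes several times longer than prompt processing in this example.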

Why pp512 and tg128 differ so much

This is often the first thing people find confusing when reading a scoreboard.

The short explanation is:

pp512 is closer to measuring parallel throughput, while tg128 is closer to measuring token-by-token generation ability.

To expand on that:

the input stage is easier to parallelize
the output stage depends on sequential token generation
generation is usually more sensitive to memory bandwidth and cache behavior
so generation speed being much lower than prompt-processing speed is entirely normal

That also explains an interesting pattern you sometimes see in GPU comparisons:

one GPU is stronger in pp512
another ends up slightly faster in tg128

That is not contradictory. One metric leans more toward peak compute throughput, while the other reflects the actual memory and latency behavior of the generation path.
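
One common rule of thumb makes the memory-bandwidth point tangible: if each generated token has to read roughly all model weights once, then generation speed cannot exceed memory bandwidth divided by model size. The sketch below applies that simplification to two hypothetical GPUs; the bandwidth figures and the 3.8 GB model size are assumptions, and real results land below this bound.

```python
def tg_upper_bound(bandwidth_gb_s, model_size_gb):
    """Rough ceiling on tokens/s when every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 3.8  # approximate size of a 7B Q4_0 model (assumption)

# Hypothetical cards: B has far more compute in this story, A has more bandwidth.
for name, bandwidth in [("GPU A", 1000.0), ("GPU B", 500.0)]:
    bound = tg_upper_bound(bandwidth, MODEL_GB)
    print(f"{name}: at most about {bound:.0f} t/s")
```

Under this model, a card with less compute but more memory bandwidth can win tg128 while losing pp512.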

How to think about t/s

Here, t/s simply means tokens per second.

It tells you how many tokens the model can process or generate in one second.

But there is one important caveat: a token is not the same thing as a character or a word. It is the unit produced by the model’s tokenizer, and its actual text length can vary a lot across models and languages.

So in practice, t/s is most useful for:

comparing different GPUs on the same model
comparing different parameter settings in the same environment
comparing a framework before and after a specific optimization is enabled

It is much less reliable as a universal “absolute speed” metric across different models, frameworks, and tokenizers.

What to focus on first when reading a scoreboard

If you do not want to get buried under abbreviations every time, start with these questions.

1. What model is being tested

For example, is it Llama 2 7B? Is it the same quantized variant, such as Q4_0? If the model or quantization format changes, direct comparison becomes much less meaningful.

2. Whether key optimizations are enabled

The most common example is FA. If one benchmark uses Flash Attention and the other does not, those scores are not directly comparable.

3. Whether the metric is measuring input speed or output speed

pp512 and tg128 are measuring different stages. One is closer to prompt-reading speed, the other is closer to answer-generation speed.

4. Whether you care about throughput or user feel

If you care more about how quickly a long prompt gets processed, pp512 matters more. If you care more about how fast the model feels while answering, tg128 is usually closer to the real experience.

A more practical way to remember all this

If you want to compress all of these into one short memory aid, you can think of them like this:

Q4_0: the model is compressed into a 4-bit quantized version
FA: whether Flash Attention is enabled
pp512: how fast the model processes a 512-token input
tg128: how fast the model generates a 128-token output
t/s: speed unit, tokens per second

Once those five points are clear, it becomes much easier to judge what a given CUDA Scoreboard is actually measuring.

Closing

GPU benchmark tables often look more complicated than they really are, not because the metrics themselves are mysterious, but because model identity, quantization, optimization flags, and different stages of throughput are all compressed into a few short abbreviations.

Once you unpack terms like FA, Q4_0, pp512, and tg128, these benchmark tables become much easier to read.

What matters is not just remembering a raw score, but knowing:

which model configuration the score came from
whether key optimizations were enabled
whether it measured input or output behavior
whether it reflects compute throughput or something closer to actual generation feel

That makes it much easier to judge what these results really mean.

Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings

Sun, 19 Apr 2026 00:18:00 +0800

When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?

This note summarizes how Ollama behaves with multiple GPUs. The short version:

Ollama supports multiple GPUs.
The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.
By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.
If a model does not fit on one GPU, Ollama can spread it across available GPUs.
Mixed GPU models may be visible to Ollama, but performance and placement may not be ideal.
SLI / NVLink is not required for multi-GPU use.
To limit which GPUs Ollama can use, use CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, or GGML_VK_VISIBLE_DEVICES.

Official Behavior: Single GPU First, Multi-GPU When Needed

Ollama’s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.

The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.

So do not think of Ollama multi-GPU as “more cards automatically means several times faster.” A more accurate model is:

Small model fits on one GPU: usually runs on one GPU.
Large model does not fit on one GPU: split across multiple GPUs.
Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.
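
The three cases above can be written down as a tiny decision function. This is a simplified mental model, not Ollama's actual scheduler: the `place_model` name is hypothetical, the VRAM estimate is taken as given, and picking the GPU with the most free memory when several fit is an assumption made for the sketch.

```python
def place_model(required_gib, free_gib_per_gpu):
    """Classify where a model would land under the simplified rules."""
    fitting = [i for i, free in enumerate(free_gib_per_gpu)
               if free >= required_gib]
    if fitting:
        # Fits on one GPU: keep it there to avoid cross-GPU traffic.
        best = max(fitting, key=lambda i: free_gib_per_gpu[i])
        return {"mode": "single_gpu", "gpus": [best], "cpu_gib": 0}

    all_gpus = list(range(len(free_gib_per_gpu)))
    total = sum(free_gib_per_gpu)
    if total >= required_gib:
        # Too big for one GPU, but the combined VRAM is enough.
        return {"mode": "multi_gpu", "gpus": all_gpus, "cpu_gib": 0}

    # Not enough VRAM in total: the remainder spills to system memory.
    return {"mode": "gpu_plus_cpu", "gpus": all_gpus,
            "cpu_gib": required_gib - total}

print(place_model(8, [24, 12]))    # small model
print(place_model(30, [24, 12]))   # needs both cards
print(place_model(40, [24, 12]))   # spills to system memory
```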

Use this command to see where the model is loaded:

`ollama ps`

The PROCESSOR column may show something like:

100% GPU
48%/52% CPU/GPU
100% CPU

If you see 48%/52% CPU/GPU, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.
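
If you want to check this from a script, the PROCESSOR value is easy to interpret. The helper below is a sketch that only understands the three formats shown above; `gpu_share` is a hypothetical name, and other Ollama versions may print the column differently.

```python
import re

def gpu_share(processor):
    """Return the GPU percentage (0-100) from a PROCESSOR value."""
    text = processor.strip()
    single = re.fullmatch(r"(\d+)% (CPU|GPU)", text)
    if single:
        pct = int(single.group(1))
        return pct if single.group(2) == "GPU" else 100 - pct
    split = re.fullmatch(r"(\d+)%/(\d+)% CPU/GPU", text)
    if split:
        return int(split.group(2))
    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")

for value in ["100% GPU", "48%/52% CPU/GPU", "100% CPU"]:
    share = gpu_share(value)
    note = "fully on GPU" if share == 100 else "spilling to CPU/RAM"
    print(f"{value!r}: {share}% GPU, {note}")
```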

Multi-GPU Is Not Simple Compute Stacking

Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.

So multi-GPU benefits usually fall into two categories:

VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.
Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.

If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama’s default “use one GPU when it fits” strategy avoids that unnecessary PCIe cost.

SLI or NVLink Is Not Required

Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.

NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.

What you should pay attention to is PCIe bandwidth. The difference between x1, x4, x8, and x16 affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.

Safer rules:

Prefer x16 / x8 over mining-style x1 risers.
PCIe bandwidth matters more when switching large models frequently.
If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.
For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.
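
A rough calculation shows why the slot width matters mostly at load time. The per-lane figures below are rounded theoretical values (about 1 GB/s for PCIe 3.0 and about 2 GB/s for PCIe 4.0) and are assumptions for the estimate; real transfers are slower because of protocol overhead and disk speed.

```python
# Approximate one-direction throughput per lane, in GB/s (rounded, assumed).
PER_LANE_GB_S = {3: 1.0, 4: 2.0}

def load_seconds(model_gb, lanes, pcie_gen=3):
    """Best-case time to copy a model into VRAM over the PCIe link."""
    return model_gb / (lanes * PER_LANE_GB_S[pcie_gen])

MODEL_GB = 20  # hypothetical model size

for lanes in (1, 4, 8, 16):
    print(f"PCIe 3.0 x{lanes}: about {load_seconds(MODEL_GB, lanes):.2f} s")
```

On an x1 riser the same model takes sixteen times longer to load than on x16, which adds up quickly if you switch models often.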

Limit Which NVIDIA GPUs Ollama Uses

On NVIDIA multi-GPU systems, use CUDA_VISIBLE_DEVICES to control which GPUs Ollama can see.

Temporary run:

`CUDA_VISIBLE_DEVICES=0,1 ollama serve`

Use only the second GPU:

`CUDA_VISIBLE_DEVICES=1 ollama serve`

Force Ollama not to use NVIDIA GPUs:

`CUDA_VISIBLE_DEVICES=-1 ollama serve`

The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:

`nvidia-smi -L`

Example output:

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)

Then specify the UUID:

`CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve`
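
If you manage several machines, building that value by hand gets tedious. The sketch below parses text in the `nvidia-smi -L` format shown above and joins the UUIDs of the cards you want. The helper names are hypothetical, and the sample text uses the same placeholder UUIDs as the example output; in practice you would feed it the real command output.

```python
import re

LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-zA-Z-]+)\)$")

def parse_gpu_list(text):
    """Turn `nvidia-smi -L` style output into a list of dicts."""
    gpus = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            gpus.append({"index": int(m.group(1)),
                         "name": m.group(2),
                         "uuid": m.group(3)})
    return gpus

def visible_devices(gpus, name_contains):
    """Build a CUDA_VISIBLE_DEVICES value from GPUs whose name matches."""
    return ",".join(g["uuid"] for g in gpus if name_contains in g["name"])

SAMPLE = """\
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)
"""

gpus = parse_gpu_list(SAMPLE)
print("CUDA_VISIBLE_DEVICES=" + visible_devices(gpus, "3090"))
```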

If Ollama is installed as a Linux systemd service, put the variable into the service environment:

`sudo systemctl edit ollama.service`

Add:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

AMD and Vulkan Device Selection

For AMD ROCm, use ROCR_VISIBLE_DEVICES to control visible GPUs:

`ROCR_VISIBLE_DEVICES=0,1 ollama serve`

To force Ollama not to use ROCm GPUs, use an invalid ID:

`ROCR_VISIBLE_DEVICES=-1 ollama serve`

Ollama’s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use GGML_VK_VISIBLE_DEVICES:

`OLLAMA_VULKAN=1 GGML_VK_VISIBLE_DEVICES=0 ollama serve`

If Vulkan devices cause problems, disable them:

`GGML_VK_VISIBLE_DEVICES=-1 ollama serve`

AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as HSA_OVERRIDE_GFX_VERSION. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.

Exposing Multiple GPUs in Docker

If you run Ollama in Docker, NVIDIA setups usually require nvidia-container-toolkit, then --gpus to expose devices.

Expose all GPUs:

docker run -d \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Expose specific GPUs:

docker run -d \
  --gpus '"device=0,1"' \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

You can also combine this with environment variables:

docker run -d \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

If nvidia-smi cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.

What Is `OLLAMA_SCHED_SPREAD`

In some multi-GPU configuration discussions, you may see OLLAMA_SCHED_SPREAD=1 or OLLAMA_SCHED_SPREAD=true. It is related to Ollama’s scheduler and is often used when people want models or requests to be spread more broadly across GPUs.

Example:

`OLLAMA_SCHED_SPREAD=1 ollama serve`

Or with systemd:

[Service]
Environment="OLLAMA_SCHED_SPREAD=true"

But it is not a magic switch. Enabling it does not imply linear token/s scaling, and it may still run into OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The core FAQ behavior still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, then multi-GPU splitting becomes useful.

Treat OLLAMA_SCHED_SPREAD as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on ollama ps, logs, and nvidia-smi.
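If you do want to try the setting on a systemd install, one option is a drop-in file, so that removing the file reverts the change. The sketch below is a minimal illustration. It assumes the service is named `ollama`, that the drop-in directory is `/etc/systemd/system/ollama.service.d`, and that `override.conf` is an acceptable file name. The helper `write_spread_dropin` is hypothetical and takes the target directory as a parameter so it can be tried in a scratch directory first.

```sh
# Hypothetical helper: write a systemd drop-in that sets OLLAMA_SCHED_SPREAD.
# The target directory is a parameter so the output can be inspected safely.
write_spread_dropin() {
  dir="$1"
  mkdir -p "$dir"
  printf '[Service]\nEnvironment="OLLAMA_SCHED_SPREAD=true"\n' > "$dir/override.conf"
}

# Usage as root (paths and service name are assumptions):
#   write_spread_dropin /etc/systemd/system/ollama.service.d
#   systemctl daemon-reload
#   systemctl restart ollama
```

Delete the drop-in and restart the service to return to the default scheduling behavior.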

How to Check Whether Multiple GPUs Are Being Used

Useful commands:

`ollama ps`

`watch -n 0.5 nvidia-smi`

View the Ollama service logs:

`journalctl -u ollama -f`

If using Docker:

`docker logs -f ollama`

Watch for:

Whether Ollama discovers compatible GPUs.
Whether the model shows 100% GPU or a CPU/GPU split.
Whether each GPU has VRAM allocated.
Whether VRAM grows on multiple GPUs during model loading.
Whether generation token/s improves compared with CPU/RAM spillover.
Whether OOM or model unloading happens frequently.

GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.
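Because utilization can mislead, per-GPU VRAM allocation is often the clearer signal. The sketch below filters the CSV output of `nvidia-smi --query-gpu` and prints the indices of GPUs whose used memory is at or above a threshold. The helper name `gpus_in_use` and the 1024 MiB default are illustrative assumptions, not anything Ollama defines.

```sh
# Hypothetical helper: read "index, used_mib, total_mib" CSV lines on stdin
# and print the index of every GPU with at least MIN_MIB of VRAM in use.
gpus_in_use() {
  min_mib="${1:-1024}"   # threshold in MiB (assumed default)
  awk -F', *' -v min="$min_mib" '$2 + 0 >= min + 0 { print $1 }'
}

# Usage while a model is loaded:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#     --format=csv,noheader,nounits | gpus_in_use 1024
```

If only one index is printed while `ollama ps` shows a large model, the model probably fit on a single GPU and was never split.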

Common Misunderstandings

Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU

Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the “does not fit” problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.

Misunderstanding 2: Different GPU Models Cannot Be Mixed

Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, smaller VRAM, and PCIe topology. The most predictable setup is still same model, same VRAM size, and well-supported same-generation drivers.

Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU

Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.
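The decision can be sketched as a rough rule of thumb: compare an estimated memory requirement against the free VRAM on each GPU. This is not how Ollama's scheduler works internally. It is a simplified heuristic, the helper name `placement_hint` is hypothetical, and the required size (weights plus KV cache and overhead) is an estimate you have to supply yourself.

```sh
# Hypothetical heuristic: read "index, free_mib" CSV lines on stdin and say
# whether NEED_MIB fits on one GPU, only across several, or not at all.
placement_hint() {
  need_mib="$1"   # your own estimate of the model footprint in MiB
  awk -F', *' -v need="$need_mib" '
    { free = $2 + 0; total += free; if (free > best) best = free }
    END {
      if (best >= need + 0)       print "single-gpu"
      else if (total >= need + 0) print "multi-gpu"
      else                        print "cpu-spill"
    }'
}

# Usage:
#   nvidia-smi --query-gpu=index,memory.free \
#     --format=csv,noheader,nounits | placement_hint 20000
```

A `single-gpu` result suggests adding cards will not help speed for that model, while `cpu-spill` means even the combined VRAM is likely too small.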

Misunderstanding 4: NVLink / SLI Is Required

No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.

Misunderstanding 5: Adding a GPU Does Not Require Restarting Services

Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.

GPU Selection Suggestions

For Ollama local inference, the rough priority is:

Larger single-GPU VRAM is usually easier to manage.
Identical GPUs are easier to troubleshoot than mixed GPUs.
More PCIe lanes per card make large-model loading smoother.
Older cards should be checked for CUDA compute capability or ROCm support first.
Multi-GPU power, cooling, and chassis airflow must be planned ahead.
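For the compute capability check on NVIDIA cards, recent drivers expose a `compute_cap` field through `nvidia-smi --query-gpu`. The sketch below lists cards that fall below a minimum value. The helper name `cards_below_cc` is hypothetical, and the 5.0 default is an assumption that should be verified against the current Ollama GPU docs.

```sh
# Hypothetical helper: read "name, compute_cap" CSV lines on stdin and print
# the names of cards whose compute capability is below MIN_CC.
cards_below_cc() {
  min_cc="${1:-5.0}"   # assumed minimum; confirm against the Ollama GPU docs
  awk -F', *' -v min="$min_cc" '$2 + 0 < min + 0 { print $1 }'
}

# Usage:
#   nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader | cards_below_cc 5.0
```

An empty result means every listed card meets the minimum you passed in. Driver age and VRAM size still need to be checked separately.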

For budget second-hand platforms:

Dual RTX 3090 remains a common high-VRAM option.
Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all involve trade-offs.
Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.
Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.

Summary

Ollama multi-GPU support is best understood as “VRAM expansion first, performance acceleration second.” If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.

In practice, use ollama ps to check where the model is loaded, then use nvidia-smi or ROCm tools to observe VRAM allocation. For GPU selection, use CUDA_VISIBLE_DEVICES on NVIDIA, ROCR_VISIBLE_DEVICES on AMD ROCm, and GGML_VK_VISIBLE_DEVICES for Vulkan. If running in Docker, first make sure the container can see the GPUs.
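The three selection variables can be wrapped in a tiny helper so the same launch command works across backends. This is only a convenience sketch. The function name `gpu_select_env` and the backend labels `cuda`, `rocm`, and `vulkan` are made up for illustration.

```sh
# Hypothetical helper: map a backend label to the matching device-selection
# variable and print it as a VAR=value pair.
gpu_select_env() {
  backend="$1"   # cuda | rocm | vulkan (labels are illustrative)
  ids="$2"       # comma-separated device list, e.g. 0,1
  case "$backend" in
    cuda)   echo "CUDA_VISIBLE_DEVICES=$ids" ;;
    rocm)   echo "ROCR_VISIBLE_DEVICES=$ids" ;;
    vulkan) echo "GGML_VK_VISIBLE_DEVICES=$ids" ;;
    *)      echo "unknown backend: $backend" >&2; return 1 ;;
  esac
}

# Usage:
#   env "$(gpu_select_env cuda 0,1)" ollama serve
```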

Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.

References

Ollama FAQ: How does Ollama load models on multiple GPUs?: https://github.com/ollama/ollama/blob/main/docs/faq.mdx
Ollama GPU docs: Hardware support / GPU Selection: https://github.com/ollama/ollama/blob/main/docs/gpu.mdx
Ollama Docker Hub: https://hub.docker.com/r/ollama/ollama
NVIDIA Container Toolkit: https://github.com/NVIDIA/nvidia-container-toolkit
