ROCm on KnightLi Blog

AMD ROCm 7.2 + ComfyUI Compatibility Setup: Using a CUDA Alternative on Windows

Fri, 08 May 2026 10:09:05 +0800

For a long time, local AI art and video tools were built around NVIDIA CUDA by default. Stable Diffusion, ComfyUI, AnimateDiff, video super-resolution, LLM inference, and many plugins usually supported CUDA first. AMD GPUs often offered good VRAM value, but Windows users had to rely on DirectML, ZLUDA, Linux ROCm, or community patches. Stability and tutorial consistency were weaker than on NVIDIA.

The ROCm 7.2 series changes that picture in a meaningful way. At CES 2026, AMD announced the Ryzen AI 400 series and tied ROCm, Radeon, Ryzen AI, and Windows AI workflows more closely together. AMD documentation shows that ROCm 7.2.1 updates PyTorch support on Windows for AMD Radeon graphics products and AMD Ryzen AI processors. ComfyUI Desktop also added official AMD ROCm support starting with v0.7.0.

This does not mean AMD has fully caught up with the CUDA ecosystem. It does mean that running ComfyUI on AMD GPUs under Windows is moving from a tinkering-only option to something worth seriously evaluating.

What ROCm 7.2 Brings

ROCm is AMD’s open software stack for GPU computing and machine learning. Its role is similar to NVIDIA CUDA. It includes HIP, compilers, math libraries, deep-learning libraries, profilers, PyTorch integration, and low-level runtime components.

For desktop users, ROCm 7.2 matters in three ways.

First, Windows support is more official. AMD’s Radeon/Ryzen ROCm documentation states that PyTorch on Windows has been updated to ROCm 7.2.1 for AMD Radeon graphics and AMD Ryzen AI processors. This is important for ComfyUI, Hugging Face Transformers, and local inference tools because most upper-layer tools eventually depend on PyTorch.

Second, hardware support is clearer. AMD documentation mentions support for Radeon 9000 series, selected Radeon 7000 series, Ryzen AI Max 300, selected Ryzen AI 400, and selected Ryzen AI 300 APUs. In other words, “AMD GPU” does not automatically mean full support. The exact model still needs to be checked against the compatibility matrix.

Third, ComfyUI now has an official route. In January 2026, the ComfyUI team announced that ComfyUI Desktop for Windows supports AMD ROCm from v0.7.0. For normal users, that matters because it reduces manual environment setup, wheel hunting, and launch-parameter tweaking.

For people looking for a CUDA alternative, these changes matter more than a single benchmark. Long-term usability depends on whether drivers, frameworks, models, plugins, and the frontend connect reliably.

Which Hardware Fits Best

The AMD route should be viewed in three groups.

The first is Radeon 9000 series. It is the newest discrete-GPU line that ROCm 7.2 focuses on, and it should have the highest priority if you are buying an AMD GPU now for local AI.

The second is selected Radeon 7000 series cards. These RDNA 3 GPUs already have some ROCm support, but not every model is equally stable. Before buying, check AMD’s official compatibility matrix and confirm Windows, Linux, PyTorch, and the target tool all support your card.

The third is Ryzen AI APUs. Ryzen AI 400 and Ryzen AI Max 300 bring CPU, GPU, NPU, and shared memory into laptops, mini PCs, and development devices. They are better for lightweight inference, development tests, mobile work, and small ComfyUI workflows. They should not be planned like high-end discrete GPUs for heavy model throughput.

If the goal is smooth mainstream AI art, a discrete GPU is still the safer choice. APUs are attractive for integration and shared memory, but they are not ideal for heavy video generation or large-batch image work.

Recommended Windows Path

For typical Windows users, ComfyUI Desktop should be the first choice. It is the official support path, reduces environment conflicts, and is easier to update with upstream changes.

The basic flow is:

Use Windows 11 and update AMD Software: Adrenalin Edition.
Confirm your GPU or APU is in the AMD ROCm Radeon/Ryzen compatibility matrix.
Install ComfyUI Desktop v0.7.0 or later.
Select or enable the AMD ROCm backend in ComfyUI Desktop.
After first launch, check the console for PyTorch/ROCm information.
Test a basic SDXL or Flux workflow before installing many plugins.

If you use manual ComfyUI, the idea is similar: install Python, install the PyTorch build for the ROCm 7.2 series, then launch main.py. AMD’s official ComfyUI guide notes that after launch you should verify the terminal shows the expected ROCm 7.2.1 PyTorch version.
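The same check can be scripted from inside the Python environment that ComfyUI uses. The following is a minimal sketch, not AMD's or ComfyUI's own verification step: `describe_backend` and `FAKE_ROCM_BUILD` are hypothetical names introduced here, and the code reads only `torch.__version__`, `torch.version.hip`, `torch.version.cuda`, and `torch.cuda.is_available()`. ROCm builds of PyTorch report AMD devices through the `torch.cuda` namespace, which is why that call appears.

```python
from types import SimpleNamespace


def describe_backend(torch_mod):
    """Report which GPU runtime a PyTorch build was compiled against."""
    hip = getattr(torch_mod.version, "hip", None)    # version string on ROCm builds
    cuda = getattr(torch_mod.version, "cuda", None)  # version string on CUDA builds
    if hip:
        backend, runtime = "rocm", hip
    elif cuda:
        backend, runtime = "cuda", cuda
    else:
        backend, runtime = "cpu", None
    return {
        "torch": torch_mod.__version__,
        "backend": backend,
        "runtime": runtime,
        # ROCm builds answer through the torch.cuda namespace as well.
        "gpu_available": bool(torch_mod.cuda.is_available()),
    }


# Stand-in object for demonstration only; the version strings are placeholders,
# not real release numbers.
FAKE_ROCM_BUILD = SimpleNamespace(
    __version__="0.0.0+placeholder",
    version=SimpleNamespace(hip="0.0.0-placeholder", cuda=None),
    cuda=SimpleNamespace(is_available=lambda: True),
)

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(describe_backend(torch))
```

If the reported backend is `cpu` or `cuda` on an AMD machine, the wrong PyTorch wheel is probably installed in that environment.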

Low-VRAM devices can try:

`python main.py --lowvram --disable-pinned-memory`

These options do not always improve speed, but they can reduce memory and VRAM pressure. On 8GB, 12GB, or shared-memory devices, finishing reliably is more important than maximum speed.

Linux Is Still Better For Heavy Users

ROCm on Windows is more usable now, but Linux remains the more mature AMD AI environment. AMD documentation also shows broader Linux support for Radeon across PyTorch, TensorFlow, JAX, ONNX, vLLM, Llama.cpp, and some training workflows.

If you only want ComfyUI image generation, Windows is worth trying.
If you need vLLM, LoRA training, batch video generation, multi-GPU, Docker, automation scripts, or long-running services, Linux is still the stronger choice.

Choose by workload:

Windows: desktop users, ComfyUI Desktop, lightweight image generation, local experimentation.
Linux: developers, heavy AI users, servers, batch processing, and the fuller ROCm ecosystem.
WSL: useful if you want Windows plus Linux tooling, but you must confirm ROCDXG, driver, and hardware support.

Do not treat Windows ROCm as the answer to every problem. It lowers the entry barrier and improves desktop use, while heavy production still depends more on Linux support.

Be Careful With ComfyUI Plugins

ComfyUI’s difficulty is not only the main program. The plugin ecosystem matters. Many nodes assume CUDA, xFormers, Triton, FlashAttention, or specific PyTorch extensions. After switching to AMD ROCm, common problems include:

Plugins calling CUDA-only extensions.
Acceleration libraries without ROCm wheels.
Custom-node install scripts that check for NVIDIA by default.
Video nodes depending on codecs or optical-flow libraries without AMD support.
New model workflows using NVIDIA-optimized settings by default.

Do not start by copying an old NVIDIA ComfyUI directory into an AMD setup. A cleaner approach is to install a fresh environment, verify a base model, and add plugins one by one.

Recommended test order:

Basic text-to-image.
Image-to-image.
LoRA.
ControlNet.
Upscaling and high-res fix.
AnimateDiff or video nodes.
Heavier models such as Flux, SD3, Wan, or HunyuanVideo.

Test after each plugin group. If something breaks, you can identify the likely node or dependency.
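The staged approach can be tracked with a very small checklist helper. This is only a sketch of the bookkeeping, assuming you record each result by hand; `STAGES` repeats the order listed above and `next_step` is a hypothetical helper, not part of ComfyUI.

```python
# Test order copied from the list above.
STAGES = [
    "Basic text-to-image",
    "Image-to-image",
    "LoRA",
    "ControlNet",
    "Upscaling and high-res fix",
    "AnimateDiff or video nodes",
    "Heavier models such as Flux, SD3, Wan, or HunyuanVideo",
]


def next_step(results):
    """Return (status, stage) for the first stage that has not passed.

    `results` maps a stage name to True (passed) or False (failed);
    stages that are absent count as not yet tested.
    """
    for stage in STAGES:
        outcome = results.get(stage)
        if outcome is True:
            continue
        return ("failed" if outcome is False else "untested", stage)
    return ("done", None)


if __name__ == "__main__":
    recorded = {"Basic text-to-image": True, "Image-to-image": True, "LoRA": False}
    print(next_step(recorded))
```

Because stages are checked in order, the first failure points at the plugin group installed most recently.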

Why AMD GPUs Are Attractive For AI Art

The biggest attraction of AMD is VRAM and price. Many users choose AMD not because its AI software ecosystem is already easier than CUDA, but because the same budget often buys more memory, which helps local creation and long experiments.

Large VRAM is practical in ComfyUI:

It can fit larger checkpoints.
It can raise resolution.
It can load more LoRA, ControlNet, and reference-image nodes.
It can reduce the speed loss of low-VRAM mode.
It makes video generation and batch jobs less likely to run out of memory.

If ROCm 7.2 keeps PyTorch and ComfyUI stable on Windows, AMD GPUs become a more realistic CUDA alternative, especially for users who do not want cloud services but want more local VRAM.

Limits You Still Need To Accept

The AMD route is usable, but it is not a no-brainer CUDA replacement.

Main limits include:

Supported models are limited; older and some lower-end cards may not be listed.
Windows framework support is still narrower than Linux.
Many AI tutorials still assume NVIDIA.
Some ComfyUI plugins have only been tested on CUDA.
Community answers are fewer when errors appear.
The same model may perform very differently on different backends.

Before choosing AMD, confirm three things:

Your GPU is in the official compatibility matrix.
Your main tools explicitly support ROCm.
Your key plugins do not depend on CUDA-only extensions.

If all three are acceptable, AMD can be reliable. Otherwise, the money saved on hardware may be spent on environment debugging.

Recommended Setup Strategy

For beginners, use Windows 11 + a supported Radeon 9000/7000 card + ComfyUI Desktop. Follow the official path first and do not install too many third-party nodes immediately.

For developers, prepare a Linux environment. ROCm has a fuller toolchain on Linux and is better for batch tasks, LLM inference, Docker, and automation.

For laptop or mini-PC users, Ryzen AI 400 and Ryzen AI Max platforms are suitable for lightweight local AI. They can handle development, preview, simple image generation, and small-model inference, but should not be planned like high-end discrete GPUs for video generation.

For heavy ComfyUI users, focus on VRAM, driver version, and plugin compatibility. AMD’s memory value is tempting, but if one critical node does not support ROCm, the whole workflow can be affected.

Summary

The ROCm 7.2 series is a meaningful step forward for AMD local AI on Windows. Radeon and Ryzen AI PyTorch support is clearer, and ComfyUI Desktop now offers official ROCm support. This brings AMD GPUs closer to a CUDA alternative that ordinary users can actually try.

But usable does not mean fully compatible. The safer approach is to check the compatibility matrix, use the official install path, test basic ComfyUI first, and then add plugins and complex video workflows gradually. Windows fits lightweight desktop creation; Linux still fits heavy development and production.

If you want the least friction, CUDA remains the mainstream answer.
If you are willing to validate the workflow in exchange for larger VRAM and a more open ecosystem, ROCm 7.2 + ComfyUI is now worth serious testing.


Ubuntu 26.04 LTS GPU and Hardware Updates: CUDA, ROCm, DPC++, and More Platform Changes

Sun, 26 Apr 2026 19:35:57 +0800

If the previous article worked as a desktop-focused overview of Ubuntu 26.04 LTS, this one is better read as its hardware and compute-side follow-up. In this 26.04 cycle, Ubuntu pushed a number of AI, GPU computing, and platform compatibility changes into the main archive or formal support scope.

The short version is this: the most important part of this round is not just desktop and kernel upgrades, but that Ubuntu is bringing Intel, NVIDIA, and AMD GPU computing stacks into the distribution in a more systematic way.

1. Intel's oneAPI DPC++ compiler is now in Ubuntu Archive

Starting with 26.04, Intel’s open-source oneAPI DPC++ compiler is available directly from Ubuntu Archive for building SYCL code. Its runtime also includes adapters for Intel GPUs.

Two related components are also now available from Ubuntu repositories:

oneDPL, the DPC++ library, which provides higher-productivity developer APIs
oneDNN, built with dpclang-6, which can run on Intel GPUs

That means if you are already working with SYCL, heterogeneous computing, or AI workloads on Intel GPUs, Ubuntu now offers a more direct path instead of forcing you to maintain a separate external stack for everything.

Ubuntu also calls out one practical requirement: users need to be in the render group to actually use these Intel GPU-related capabilities.
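Group membership is easy to verify before debugging anything else. The sketch below uses only Python's standard `grp` and `os` modules; `in_group` and `current_process_gids` are hypothetical helper names, and the `usermod` hint in the message is the usual way to add a user to a group rather than something quoted from the Ubuntu notes.

```python
import grp
import os


def in_group(group_name, gids, lookup=grp.getgrnam):
    """Return True if `group_name` exists and its GID is among `gids`."""
    try:
        entry = lookup(group_name)
    except KeyError:
        return False  # the group does not exist on this system
    return entry.gr_gid in gids


def current_process_gids():
    """Primary plus supplementary group IDs of the running process."""
    return set(os.getgroups()) | {os.getegid()}


if __name__ == "__main__":
    if in_group("render", current_process_gids()):
        print("this session is in the render group")
    else:
        print("not in the render group; try `sudo usermod -aG render $USER`, "
              "then log out and back in")
```

The check looks at the running process rather than the group database, so it reflects whether a fresh login has already picked up the change.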

2. The NVIDIA CUDA toolkit can now be installed directly with `apt`

For many developers and operators, this may be one of the most immediately useful changes in the notes.

Starting with 26.04, the NVIDIA CUDA toolkit can now be installed directly from Ubuntu Archive:

`sudo apt install cuda-toolkit`

The value here is bigger than just saving a few setup steps.

For developers shipping software on Ubuntu, this new model means they can simply declare a dependency on the CUDA runtime, while Ubuntu manages installation and compatibility at the distribution level. That makes CUDA feel more like a native system capability on Ubuntu, rather than an extra software layer that always has to be maintained separately.

3. AMD ROCm 7.1.0 is now in Universe

On the AMD side, Ubuntu Universe now includes ROCm 7.1.0.

These libraries mainly provide:

backend infrastructure for AI training and inference on AMD GPUs
software foundations for machine learning and high performance computing

Canonical also notes that ROCm-related components are continuously tested in its CI/CD pipeline. Beyond autopkgtests, that includes several user-space applications such as:

llama.cpp
pytorch
Blender
Lemonade Server

That detail matters, because it shows Ubuntu is not just dropping packages into the archive. It is validating ROCm as a maintainable software stack.

4. The bigger story is that all three GPU ecosystems are landing

It becomes easier to see the direction of 26.04 when DPC++, CUDA, and ROCm are viewed together:

Intel: bringing SYCL / oneAPI components into official repositories
NVIDIA: giving the CUDA toolkit a distribution-managed installation path
AMD: shipping ROCm 7.1.0 in Universe with ongoing testing

If you work with these kinds of workloads on Ubuntu, this release will probably feel more relevant:

local LLM inference
GPU-accelerated training or fine-tuning
Blender, scientific computing, and HPC
development environments that need to move across different GPU platforms

In other words, Ubuntu is no longer just “a system where you can install a GPU driver.” It is starting to carry a fuller user-space software stack for AI and GPU computing.

5. NVIDIA Dynamic Boost is enabled by default

Since 25.04, Dynamic Boost has been enabled by default on supported NVIDIA laptops.

The idea is straightforward: depending on system load, power can be shifted dynamically between the CPU and GPU. In gaming scenarios, that usually means giving more power to the GPU when needed to extract more performance.

It only applies under two conditions:

the laptop is connected to AC power
the GPU load is high enough

It does not engage while the system is running on battery.

6. Support for new Intel integrated and discrete GPUs keeps moving forward

Ubuntu also continues expanding support for new Intel GPUs, including:

Integrated:

Intel Core Ultra Xe2
Intel Core Ultra Xe3

Discrete:

Intel Arc 5 B570
Intel Arc 5 B580
Intel Arc Pro B50
Intel Arc Pro B60
Intel Arc Pro B65
Intel Arc Pro B70

Ubuntu also highlights several features already available around these devices:

improved GPU and CPU ray tracing performance through Intel Embree, benefiting applications such as Blender 4.2+
hardware video encoding for AVC, JPEG, HEVC, and AV1 on “Battlemage” devices
a new CCS optimization in Intel Compute Runtime
enabled debugging support for Intel Xe GPUs

If you are watching follow-up releases, 25.10 also continues to bring in more capabilities, including:

initial support for Intel’s next-generation client platform codenamed Panther Lake through Linux kernel 6.17
improved IOMMU, PCIe subsystem, and multi-GPU support
Mesa 25.2.3 enabling VK_KHR_shader_bfloat16 for Battlemage and Panther Lake
intel-media-driver 25.3.0 adding Panther Lake decode support and VP9 encoding
intel-compute-runtime 25.31 adjusting the Level Zero USM pool and local device memory event allocation behavior
level-zero 1.24 and level-zero-raytracing 1.1.0 bringing broader spec and RTAS extension support

7. Suspend and resume is more stable on NVIDIA desktops too

Starting with 25.10, Ubuntu enables suspend-resume support in the proprietary NVIDIA driver to reduce corruption and freezing when waking a desktop system.

This is not the most visible kind of change, but it matters a lot in everyday use, especially on desktops that stay on for long periods and frequently suspend and resume.

8. ARM, Raspberry Pi, RISC-V, and IBM Z also get significant platform-level changes

Beyond the GPU software stack, the release notes also include several platform-level changes worth calling out separately.

ARM64 desktop platforms

Starting with 25.10, the ARM64 linux-generic kernel provides broader desktop compatibility for ARM64 desktop platforms that boot through UEFI.

A new Raspberry Pi boot layout

One change introduced in 25.10 and refined in 26.04 is a new boot partition layout for Raspberry Pi systems.

Its goal is to improve boot reliability: newly written boot assets are first “tested” before they are committed as the new “known good” set.

The firmware date requirements are the part most users will want to remember:

Pi 3 / 3+ / CM3+ / Zero 2W: no additional action required, the boot firmware is in the image itself
Pi 4 / 400 / CM4: boot firmware must be dated no earlier than 2022-11-25
Pi 5 / 500 / CM5: boot firmware must be dated no earlier than 2025-02-11

You can check it with:

`sudo rpi-eeprom-update`

If the firmware is too old and you are using Ubuntu 24.04 LTS or newer, you can update it like this:

`sudo rpi-eeprom-update -a`
`sudo reboot`
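The date thresholds listed earlier in this section can be encoded as a small comparison. This is a sketch under stated assumptions: the family keys `pi3`, `pi4`, and `pi5` are labels made up for this example, and the firmware date is something you read from the `rpi-eeprom-update` output yourself; the code does not parse that output.

```python
from datetime import date

# Families whose boot firmware ships inside the image (no action needed).
NO_ACTION = {"pi3"}  # Pi 3 / 3+ / CM3+ / Zero 2W

# Minimum boot firmware dates quoted in the requirements list.
MIN_FIRMWARE = {
    "pi4": date(2022, 11, 25),  # Pi 4 / 400 / CM4
    "pi5": date(2025, 2, 11),   # Pi 5 / 500 / CM5
}


def firmware_ok(family, firmware_date):
    """Return True if the firmware date meets the minimum for this family."""
    if family in NO_ACTION:
        return True
    if family not in MIN_FIRMWARE:
        raise ValueError(f"unknown board family: {family!r}")
    return firmware_date >= MIN_FIRMWARE[family]


if __name__ == "__main__":
    print(firmware_ok("pi5", date(2024, 12, 1)))
```

"No earlier than" means a firmware dated exactly on the threshold is acceptable, which is why the comparison uses `>=`.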

Raspberry Pi desktop images now use desktop-minimal

Since 25.10, Ubuntu Desktop images for Raspberry Pi are based on desktop-minimal rather than the full desktop seed.

Ubuntu gives a very concrete benefit here: the default app set is smaller, saving about 777MB on the uncompressed image and on installed systems.

If you want to remove that default app set in bulk after upgrading, you can use:

`sudo apt purge ubuntu-desktop --autoremove`

If you want to keep some of those applications, just mark them as manually installed with apt first.

Swap on Raspberry Pi is now handled by cloud-init

Since 25.10, swap file creation on Raspberry Pi desktop images is handled by cloud-init.
If you want to customize swap size before first boot, you can edit user-data on the boot partition directly.

RISC-V requirements have moved up

Starting with 25.10, the RISC-V build of Ubuntu requires hardware that implements the RVA23S64 ISA profile, and that requirement carries over to 26.04 LTS.

Systems that do not meet that requirement can no longer run Ubuntu 26.04 LTS. If you still have boards based on earlier RVA20 processor cores, you need to stay on the support line provided by Ubuntu 24.04 LTS.

According to Ubuntu, as of April 2026, there is still no real RVA23S64 hardware available. So the only currently supported platform is effectively a QEMU virtualized environment configured with -cpu rva23s64.

IBM Z now requires z15 at minimum

Starting with 26.04, the minimum requirement for the s390x architecture has moved up to z15.

That means:

z14 / LinuxONE II and older systems can no longer install Ubuntu 26.04 LTS
z15 / LinuxONE III and newer systems should see better performance

9. Who should read this first

This article is more useful than the desktop overview if you fall into any of these cases:

you use Ubuntu for CUDA, ROCm, SYCL, or local AI inference
you do development or compute work on Intel, NVIDIA, or AMD GPUs
you maintain Raspberry Pi, ARM64, RISC-V, IBM Z, or other non-x86 platforms
you are especially sensitive to repository availability, driver behavior, runtimes, and platform requirements after an upgrade

10. One-line takeaway

The key point of Ubuntu 26.04 LTS on the hardware and AI stack side is not that one GPU vendor got a standout upgrade. It is that Intel’s DPC++, NVIDIA’s CUDA, and AMD’s ROCm are all entering the Ubuntu ecosystem in a more official, in-repository, and maintainable way.

If you used to think of Ubuntu as “the system first, then I assemble the GPU environment myself,” 26.04 starts to look more like a distribution that is willing to actively carry AI and heterogeneous computing workloads.

How to Fix Ollama Using CPU Instead of GPU

Fri, 24 Apr 2026 18:30:00 +0800

When running local LLMs, one of the most frustrating problems is this: your machine clearly has a GPU, yet Ollama still leans heavily on the CPU, and performance is painfully slow.

The short version is that this is usually not caused by one single issue. The most common causes are:

Ollama is not detecting any usable GPU
The driver, ROCm, or CUDA environment is not set up correctly
The Ollama service was started without the right environment variables
The model is too large and has fallen back to CPU or mixed CPU/GPU loading
On AMD platforms, there may be extra compatibility issues such as ROCm version mismatch, gfx settings, or device visibility problems

The fastest way to troubleshoot it is to go through the checks below in order.

1. First, confirm whether Ollama is really not using the GPU

The most direct check is:

`ollama ps`

Focus on the PROCESSOR column.

100% GPU: the model is fully running on the GPU
100% CPU: the GPU is not being used at all
Results like 48%/52% CPU/GPU: part of the model is in VRAM, and part has spilled into system memory

If you see 100% CPU, the next step is to focus on environment and service configuration.
If you see mixed loading, that does not necessarily mean the GPU is broken. In many cases, it simply means VRAM is not enough.
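If you check this often, the three PROCESSOR patterns described above can be classified mechanically. The sketch below handles only those three forms and reports anything else as unknown; `classify_processor` is a hypothetical helper, and the exact column text may differ between Ollama versions.

```python
import re


def classify_processor(cell):
    """Classify one PROCESSOR cell as (kind, cpu_percent, gpu_percent)."""
    text = cell.strip()
    whole = re.fullmatch(r"100%\s+(GPU|CPU)", text)
    if whole:
        on_gpu = whole.group(1) == "GPU"
        return ("gpu", 0, 100) if on_gpu else ("cpu", 100, 0)
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        # Part of the model sits in system memory, part in VRAM.
        return ("mixed", int(split.group(1)), int(split.group(2)))
    return ("unknown", None, None)


if __name__ == "__main__":
    for sample in ("100% GPU", "100% CPU", "48%/52% CPU/GPU"):
        print(sample, "->", classify_processor(sample))
```

A `cpu` result points toward the environment checks in the later sections, while `mixed` points toward VRAM capacity.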

2. Rule out the most common misunderstanding first: the model does not fit into VRAM

Many people assume that once a GPU is installed, Ollama will always run fully on it. That is not how it works.

If the model is too large, the context is too long, or some other loaded model is already occupying VRAM, Ollama may fall back to:

Partial GPU + partial CPU
Full 100% CPU

At this point, the two simplest tests are:

Try a smaller model first. For example, test with a 4B or 7B model before jumping straight to much larger ones.
Unload other active models and test again. Run `ollama ps` first and make sure nothing else is occupying VRAM.

If smaller models use the GPU but larger ones do not, the real problem is usually VRAM capacity rather than the driver.

3. Check whether the GPU driver and the lower-level runtime are actually working

If even small models run only on CPU, the next step is to check the underlying environment.

NVIDIA

First confirm that the driver is working and the system can see the GPU. A common check is:

`nvidia-smi`

If this already fails, Ollama is very unlikely to use the GPU correctly.

AMD / ROCm

If you are using an AMD GPU, especially with ROCm, start with:

`rocminfo`
`rocm-smi`

If these tools cannot list the device properly, the problem is still below Ollama, so there is no point debugging the application layer yet.

On AMD, the most common issue is not simply “is the driver installed,” but rather:

The ROCm version does not match the OS version
The current GPU architecture has incomplete support
The device exists, but the runtime is not being exposed correctly to Ollama
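The tool checks in this section can be run in one pass. This is a rough sketch that only distinguishes "not installed", "ran but returned an error", and "ran cleanly"; `probe` is a hypothetical helper, and a clean exit does not prove that Ollama itself can use the device.

```python
import shutil
import subprocess

TOOLS = ("nvidia-smi", "rocminfo", "rocm-smi")


def probe(tool, timeout=30):
    """Return 'missing', 'failed', or 'ok' for one diagnostic command."""
    path = shutil.which(tool)
    if path is None:
        return "missing"  # not on PATH, so the driver stack is likely absent
    try:
        result = subprocess.run(
            [path], capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    for name in TOOLS:
        print(f"{name}: {probe(name)}")
```

On a single-vendor machine it is normal for the other vendor's tools to show as `missing`.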

4. Restart the Ollama service, not just your terminal

This is a very common trap.

Many people install drivers, change environment variables, fix ROCm, then just open a new terminal and continue with ollama run. But if Ollama is running as a background service, it may still be using the old environment.

So the safer approach is:

Fully restart the Ollama service
Reboot the machine if necessary

If you are running it as a service on Linux, make sure the service process was actually restarted instead of reusing the old one.

5. Check whether the environment variables are really reaching the service

This matters especially on AMD ROCm systems.

Some machines work fine when commands are run manually in a shell, but the Ollama service still uses only CPU. In that case, the usual reason is that the service process never received the variables you set in your shell.

Common variables to look at include:

`ROCR_VISIBLE_DEVICES`
`HSA_OVERRIDE_GFX_VERSION`

Specifically:

ROCR_VISIBLE_DEVICES limits or selects which GPUs ROCm can see
HSA_OVERRIDE_GFX_VERSION is often used as a compatibility workaround on some AMD platforms

If you only export these variables in the current terminal, but Ollama is started by systemd, a desktop background service, or another daemon, they may not take effect.

In other words, “it looks set in my terminal” does not mean Ollama is actually using it.
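On Linux, one way to see what the running service actually received is to read its environment from `/proc/<pid>/environ`. The following is a sketch under stated assumptions: you already know the Ollama process ID, you have permission to read that file (often root), and `parse_environ` and `service_env` are hypothetical helper names. The sample value in the demo is illustrative only.

```python
from pathlib import Path

WATCHED = ("ROCR_VISIBLE_DEVICES", "HSA_OVERRIDE_GFX_VERSION")


def parse_environ(raw):
    """Decode the NUL-separated KEY=VALUE block from /proc/<pid>/environ."""
    env = {}
    for item in raw.split(b"\0"):
        if b"=" in item:
            key, _, value = item.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def service_env(pid, names=WATCHED):
    """Return the watched variables as the process saw them at startup."""
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    env = parse_environ(raw)
    return {name: env.get(name) for name in names}  # None means not set


if __name__ == "__main__":
    # Illustrative input; a real check would call service_env(<pid>).
    sample = b"PATH=/usr/bin\0HSA_OVERRIDE_GFX_VERSION=11.0.0\0"
    print(parse_environ(sample))
```

A value of `None` for a variable you exported in your shell confirms that the service never received it, and that it has to be set in the service's own configuration instead.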

6. On AMD platforms, focus on ROCm compatibility

Based on the public page metadata, the original video for this topic is tied to the AMD Ryzen AI Max+ 395 (Strix Halo) and AMD ROCm.
In setups like these, whether Ollama can use the GPU depends more heavily on version matching than it does on NVIDIA systems.

Start by checking these:

Whether the installed ROCm version fits the current OS and GPU
Whether the GPU belongs to an architecture with solid ROCm support
Whether you need to set HSA_OVERRIDE_GFX_VERSION
Whether an older Ollama build or older inference runtime is causing compatibility issues

If rocminfo works and the GPU is visible to the system, but Ollama still runs only on CPU, the issue is often in the version combination rather than in model parameters.

7. In Docker, WSL, or remote environments, also check device mapping

If you are not running on bare metal but inside:

Docker
WSL
Remote containers
Virtualized environments

then you need to check one more layer: whether the GPU device is actually being exposed inside that environment.

A typical symptom looks like this:

The host machine can see the GPU
Ollama inside the container or subsystem still uses only CPU

In that case, the issue may not be Ollama itself. The container or subsystem may simply not have GPU access.
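A quick way to test this from inside the container or subsystem is to look for the usual GPU device nodes. The sketch below checks a short, non-exhaustive list of paths that are commonly involved; `GPU_NODES` and `visible_nodes` are hypothetical names, and the presence of a node does not guarantee that permissions or drivers are correct.

```python
import os

# Commonly used device nodes; this list is illustrative, not complete.
GPU_NODES = {
    "/dev/kfd": "ROCm compute interface",
    "/dev/dri": "DRM render nodes (AMD and Intel)",
    "/dev/nvidiactl": "NVIDIA control device",
    "/dev/dxg": "WSL2 GPU paravirtualization device",
}


def visible_nodes(exists=os.path.exists):
    """Return the known GPU device paths that exist in this environment."""
    return [path for path in GPU_NODES if exists(path)]


if __name__ == "__main__":
    found = visible_nodes()
    if not found:
        print("no GPU device nodes visible; check the device mapping")
    for path in found:
        print(f"{path}: {GPU_NODES[path]}")
```

Running the same script on the host and inside the container makes the difference in device mapping obvious.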

8. Check logs last, but check them for the right reason

If you have already gone through the earlier steps, the most effective next move is not endless reinstalling, but looking directly at the Ollama startup and runtime logs.

Focus on two kinds of messages:

Whether a GPU was detected at all
Whether there are driver, library loading, or device initialization errors

If the logs clearly say something like “no compatible GPU found” or “failed to initialize ROCm/CUDA,” the troubleshooting direction becomes much clearer immediately.

Troubleshooting Order

If you only want the shortest path, use this order:

Run ollama ps and confirm whether it is GPU, CPU, or mixed loading
Try a smaller model to rule out VRAM limits
Use nvidia-smi, rocminfo, and rocm-smi to verify the lower-level environment first
Fully restart the Ollama service
Check service environment variables, especially ROCR_VISIBLE_DEVICES and HSA_OVERRIDE_GFX_VERSION on AMD
If you are in Docker or WSL, verify device mapping
Finally, inspect logs for the exact error

Conclusion

When Ollama uses CPU instead of GPU, the root cause usually falls into one of three groups:

The GPU is not being detected at all
The GPU is detectable, but the runtime environment is not reaching Ollama
The GPU is working, but the model is too large and falls back to CPU or mixed memory

Once you separate those three cases, troubleshooting becomes much faster.
If you are on an AMD platform, pay special attention to ROCm version matching, device visibility, and compatibility variables instead of focusing only on the Ollama command itself.

Original video: https://www.bilibili.com/video/BV1cHoYBqE8k/

llama.cpp GPU Performance Ranking: Full CUDA, ROCm, and Vulkan Scoreboards Explained with pp512 / tg128 / FA

Thu, 23 Apr 2026 10:22:04 +0800

Understanding the Metrics First

What is Q4_0

Q4_0 is a 4-bit quantization format. It does not mean the model is stronger. It means the model is smaller, uses less VRAM, and fits on more devices. Most of these scoreboards standardize on Llama 2 7B, Q4_0 so that GPU-to-GPU comparisons are easier.
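The size saving can be worked out from the block layout. In llama.cpp's Q4_0 format, 32 weights share one 16-bit scale, so each block takes 18 bytes, or 4.5 bits per weight. The sketch below applies that to an assumed parameter count of about 6.74 billion for Llama 2 7B. Treat the result as approximate: real model files keep some tensors in other formats, and the context cache needs additional memory at run time.

```python
Q4_0_BLOCK_WEIGHTS = 32  # weights stored per block
Q4_0_BLOCK_BYTES = 18    # 2-byte fp16 scale + 16 bytes of packed 4-bit values


def q4_0_bits_per_weight():
    """Effective storage cost per weight, including the shared scale."""
    return Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS


def q4_0_size_gib(n_params):
    """Approximate weight storage in GiB if every tensor used Q4_0."""
    return n_params * Q4_0_BLOCK_BYTES / Q4_0_BLOCK_WEIGHTS / 2**30


if __name__ == "__main__":
    params = 6.74e9  # assumed parameter count for Llama 2 7B
    print(f"{q4_0_bits_per_weight()} bits per weight")
    print(f"about {q4_0_size_gib(params):.2f} GiB of weights")
```

By comparison, the same weights at 16 bits each would need roughly 3.5 times as much memory, which is why a Q4_0 7B model fits on many 6 GB and 8 GB cards.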

What is pp512

pp512 usually means prompt processing 512 tokens, which is the throughput while processing 512 input tokens.

pp = prompt processing
512 = input length is 512 tokens
t/s = tokens per second

This is closer to prompt-ingestion speed, so it is often much higher than generation speed.

What is tg128

tg128 usually means text generation 128 tokens, which is the speed while generating 128 tokens continuously.

tg = text generation
128 = generate 128 tokens continuously
t/s = tokens per second

This is usually closer to the speed users actually feel in interactive usage.

What is FA

FA stands for Flash Attention.

with FA means Flash Attention is enabled
no FA means Flash Attention is disabled

On many GPUs, FA improves pp512 more clearly than tg128, but the gain is not identical across backends, drivers, and GPU architectures.
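The size of the FA effect can be read straight off the CUDA scoreboards further down. The sketch below computes the percentage gain for two rows whose no-FA and with-FA results were measured on the same commit; `percent_gain` and `ROWS` are names introduced for this example, and the numbers are the mean values copied from those tables, ignoring the reported error margins.

```python
def percent_gain(without_fa, with_fa):
    """Relative change in throughput when Flash Attention is enabled."""
    return (with_fa - without_fa) / without_fa * 100.0


# (no FA, with FA) mean t/s values copied from the CUDA scoreboards.
ROWS = {
    "RTX 5090": {"pp512": (14073.41, 14970.15), "tg128": (290.02, 300.40)},
    "H100 80 GB": {"pp512": (9918.34, 11263.29), "tg128": (267.81, 280.74)},
}

if __name__ == "__main__":
    for chip, tests in ROWS.items():
        for test, (off, on) in tests.items():
            print(f"{chip} {test}: {percent_gain(off, on):+.2f}%")
```

For both chips the pp512 gain is larger than the tg128 gain, which matches the pattern described above.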

How to read t/s

t/s means tokens per second. When reading these scoreboards, the key rule is to compare the same type of test with the same settings.

Do not compare pp512 and tg128 as if they were the same thing
Do not mix no FA and with FA
Do not assume CUDA, ROCm, and Vulkan are directly interchangeable
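To see why the two metrics should not be merged, it helps to turn them into time. The sketch below estimates how long a request takes when prompt processing and generation each run at their benchmark rate. It is a back-of-the-envelope model that ignores model loading, sampling, and other overhead; `estimated_seconds` is a hypothetical helper, and the rates are the RTX 4090 no-FA values from the CUDA scoreboard.

```python
def estimated_seconds(pp_rate, tg_rate, prompt_tokens, generated_tokens):
    """Rough request time: prompt ingestion plus token-by-token generation."""
    prompt_time = prompt_tokens / pp_rate
    generation_time = generated_tokens / tg_rate
    return prompt_time + generation_time


if __name__ == "__main__":
    # RTX 4090, no FA: pp512 = 11992.70 t/s, tg128 = 186.21 t/s
    total = estimated_seconds(11992.70, 186.21, 512, 128)
    print(f"about {total:.2f} s for 512 prompt tokens and 128 generated tokens")
```

In this example, generating 128 tokens takes far longer than ingesting 512, so tg128 dominates the wait a user actually notices.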

Quick Takeaways

CUDA is still the strongest overall path in llama.cpp GPU benchmarks, especially on high-end NVIDIA GPUs.
ROCm is already delivering strong results on high-end AMD GPUs and Instinct accelerators.
Vulkan has the broadest hardware coverage, including NVIDIA, AMD, Intel, older GPUs, and some Apple / Asahi setups.
tg128 is closer to everyday perceived speed, while pp512 is better for judging prompt throughput.

CUDA Scoreboards

Llama 2 7B, Q4_0, no FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
RTX 5090	32 GB / GDDR7 / 512 bit	14073.41 ± 115.16	290.02 ± 1.10	8cf6b42	@totaldev
RTX PRO 6000 Blackwell	96 GB / GDDR7 / 512 bit	14854.63 ± 22.73	274.20 ± 0.14	79c1160	@Tom94
H100 80 GB	80 GB / HBM3 / 5120 bit	9918.34 ± 176.97	267.81 ± 1.54	5143fa8	@Hedede
A100 80 GB	80 GB / HBM2e / 5120 bit	4849.53 ± 8.94	190.88 ± 0.33	5143fa8	@Hedede
RTX 4090 D	24 GB / GDDR6X / 384 bit	10293.86 ± 134.72	189.33 ± 0.19	79c1160	@autonomous-AI-lab
RTX 4090	24 GB / GDDR6X / 384 bit	11992.70 ± 107.99	186.21 ± 0.13	2241453	@lhl
RTX 5080	16 GB / GDDR7 / 256 bit	8297.36 ± 9.50	181.99 ± 0.42	8a4280c	@Hedede
RTX 5070 Ti	16 GB / GDDR7 / 256 bit	6952.38 ± 13.73	176.85 ± 0.07	933414c	@TinyServal
RTX 6000 Ada	48 GB / GDDR6 / 384 bit	9229.23 ± 101.78	176.07 ± 0.26	b8e09f0	@Hedede
RTX 3090 Ti	24 GB / GDDR6X / 384 bit	6567.49 ± 20.30	171.19 ± 3.98	9c35706	@slaren
RTX 3090	24 GB / GDDR6X / 384 bit	5174.69 ± 21.83	158.16 ± 0.21	c76b420	@m18coppola
L40	48 GB / GDDR6 / 384 bit	8870.49 ± 378.76	152.01 ± 0.28	ee09828	@Hedede
RTX 4080 SUPER	16 GB / GDDR6X / 256 bit	8125.15 ± 41.05	148.33 ± 0.20	81086cd	@zacharyarnaise
RTX 4080	16 GB / GDDR6X / 256 bit	8031.64 ± 26.49	142.49 ± 0.16	20638e4	@Ristovski
RTX 3080	10 GB / GDDR6X / 320 bit	5013.86 ± 24.80	139.65 ± 0.99	9c35706	@slaren
RTX A6000	48 GB / GDDR6 / 384 bit	4913.93 ± 6.79	138.73 ± 2.75	4795c91	@Hedede
RTX 4070 Ti SUPER	16 GB / GDDR6X / 256 bit	6924.53 ± 13.87	132.26 ± 0.16	9c35706	@Ristovski
RTX PRO 4000 Blackwell	24 GB / GDDR7 / 192 bit	4992.83 ± 113.52	131.66 ± 0.20	7d77f07	@Hedede
RTX A5000	24 GB / GDDR6 / 384 bit	4028.16 ± 19.14	130.07 ± 2.74	e5155e6	@Hedede
Tesla V100	32 GB / HBM2 / 4096 bit	3042.64 ± 40.71	129.08 ± 0.05	51f5a45	@Hedede
RTX 5070	12 GB / GDDR7 / 192 bit	5184.75 ± 18.70	127.54 ± 0.46	-	@Spyro000
A40	48 GB / GDDR6 / 384 bit	4609.01 ± 10.67	124.11 ± 0.17	3470a5c	@Hedede
A30	24 GB / HBM2e / 3072 bit	2767.10 ± 1.88	124.81 ± 0.16	583cb83	@Hedede
Titan V	12 GB / HBM2 / 3072 bit	2617.46 ± 2.10	108.79 ± 0.05	e56abd2	@Hedede
RTX 2080 Ti	11 GB / GDDR6 / 352 bit	2890.66 ± 2.42	107.51 ± 0.21	9c35706	@ariya
Quadro RTX 6000	24 GB / GDDR6 / 384 bit	2751.18 ± 19.43	102.77 ± 0.04	b8e09f0	@Hedede
Quadro RTX 8000	48 GB / GDDR6 / 384 bit	2709.95 ± 3.35	102.68 ± 0.03	b8e09f0	@Hedede
RTX A4500	20 GB / GDDR6 / 320 bit	2827.20 ± 66.43	97.32 ± 2.80	5cdb27e	@aleksyx
RTX 5060 Ti 16 GB	16 GB / GDDR7 / 128 bit	3737.25 ± 6.79	90.94 ± 0.02	89d1029	@mike-llamacpp
RTX 2070 SUPER	8 GB / GDDR6 / 256 bit	2088.34 ± 1.94	88.06 ± 0.28	bc07349	@phstudy
RTX A4000	16 GB / GDDR6 / 256 bit	2684.06 ± 15.28	83.77 ± 0.37	65349f2	@TinyServal
Titan Xp	12 GB / GDDR5X / 384 bit	1154.96 ± 1.46	76.08 ± 0.08	c4510dc	@Hedede
RTX 3060	12 GB / GDDR6 / 192 bit	2137.50 ± 10.12	75.57 ± 0.07	baa9255	@QuantiusBenignus
Quadro RTX 4000	8 GB / GDDR6 / 256 bit	1536.89 ± 0.90	65.62 ± 0.62	7d77f07	@Hedede
RTX 4060 Ti 8 GB	8 GB / GDDR6 / 128 bit	3394.63 ± 7.44	63.86 ± 0.01	89d1029	@mike-llamacpp
GTX 1080 Ti	11 GB / GDDR5X / 352 bit	1084.41 ± 3.01	62.49 ± 0.06	9c35706	@ariya
RTX A4000 Ada	20 GB / GDDR6 / 160 bit	2779.77 ± 9.91	61.83 ± 0.04	a74a0d6	@sdwolfz
RTX 2060 SUPER	8 GB / GDDR6 / 256 bit	1420.24 ± 1.95	60.04 ± 0.01	5c0eb5e	@ggerganov
Tesla P100	16 GB / HBM2 / 4096 bit	760.80 ± 2.92	58.35 ± 0.00	b8372ee	@Hedede
DGX Spark	128 GB / LPDDR5x	3062.31 ± 11.02	57.21 ± 0.06	5acd455	@ggerganov
Tesla P40	24 GB / GDDR5 / 384 bit	1007.42 ± 1.23	54.74 ± 0.07	c76b420	@m18coppola
RTX 2000 Ada	16 GB / GDDR6 / 128 bit	1956.22 ± 7.74	50.62 ± 0.04	756cfea	@DigitalRudeness
Tesla T4	16 GB / GDDR6 / 256 bit	1219.06 ± 4.18	46.38 ± 0.73	d32e03f	@pt13762104
RTX 4050 Laptop	6 GB / GDDR6 / 96 bit	1725.85 ± 17.85	43.72 ± 0.41	d79d8f3	@TimCabbage
GTX 1660	6 GB / GDDR5 / 192 bit	148.91 ± 0.01	41.35 ± 0.02	9515c61	@ariya
Tesla M40	24 GB / GDDR5 / 384 bit	282.65 ± 0.15	38.04 ± 0.02	97d5117	@Hedede
GTX 1070 Ti	8 GB / GDDR5 / 256 bit	714.44 ± 2.04	37.82 ± 0.02	79c1160	@pebaryan
Jetson AGX Orin	64 GB / LPDDR5 / 256 bit	991.31 ± 1.15	33.58 ± 0.14	c1b1876	@TinyServal
Tesla P4	8 GB / GDDR5 / 256 bit	514.53 ± 3.06	33.29 ± 0.00	c76b420	@m18coppola
P106-100	6 GB / GDDR5 / 192 bit	406.94 ± 0.25	30.40 ± 0.02	5fd160b	@pebaryan
GTX 1060	6 GB / GDDR5 / 192 bit	416.85 ± 1.75	27.79 ± 0.02	5fd160b	@pebaryan
Quadro T1000	4 GB / GDDR5 / 128 bit	79.44 ± 0.01	27.82 ± 0.18	f6da8cb	@hanabu
Quadro P2000	5 GB / GDDR5 / 160 bit	309.30 ± 0.05	23.63 ± 0.00	baa9255	@TinyServal
Quadro P1000	4 GB / GDDR5 / 128 bit	183.40 ± 0.11	13.99 ± 0.13	1e74897	@aleksyx
Tesla K80	12 GB / GDDR5 / 384 bit	133.14 ± 0.55	13.80 ± 0.02	32732f2	@pebaryan

Llama 2 7B, Q4_0, with FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
RTX 5090	32 GB / GDDR7 / 512 bit	14970.15 ± 381.06	300.40 ± 0.28	8cf6b42	@totaldev
RTX PRO 6000 Blackwell	96 GB / GDDR7 / 512 bit	16618.98 ± 20.66	281.11 ± 0.41	5143fa8	@Tom94
H100 80 GB	80 GB / HBM3 / 5120 bit	11263.29 ± 98.34	280.74 ± 1.17	5143fa8	@Hedede
A100 80 GB	80 GB / HBM2e / 5120 bit	5285.96 ± 6.58	200.90 ± 0.12	5143fa8	@Hedede
RTX 4090 D	24 GB / GDDR6X / 384 bit	12506.97 ± 11.51	191.57 ± 0.03	79c1160	@autonomous-AI-lab
RTX 4090	24 GB / GDDR6X / 384 bit	14770.63 ± 102.93	188.96 ± 0.05	2241453	@lhl
RTX 5080	16 GB / GDDR7 / 256 bit	9487.70 ± 21.89	184.68 ± 0.05	8a4280c	@Hedede
RTX 5070 Ti	16 GB / GDDR7 / 256 bit	8419.56 ± 35.50	182.43 ± 0.09	933414c	@TinyServal
RTX 6000 Ada	48 GB / GDDR6 / 384 bit	10576.85 ± 530.21	179.47 ± 0.32	b8e09f0	@Hedede
RTX 3090 Ti	24 GB / GDDR6X / 384 bit	6924.01 ± 10.76	172.26 ± 1.31	9c35706	@slaren
RTX PRO 4500 Blackwell	32 GB / GDDR7 / 256 bit	7251.66 ± 92.40	168.90 ± 0.20	becc481	@Hedede
RTX 3090	24 GB / GDDR6X / 384 bit	5560.06 ± 16.28	161.89 ± 0.18	c76b420	@m18coppola
L40	48 GB / GDDR6 / 384 bit	10097.64 ± 671.22	153.76 ± 0.12	ee09828	@Hedede
RTX 4080 SUPER	16 GB / GDDR6X / 256 bit	9439.01 ± 56.75	147.48 ± 1.41	81086cd	@zacharyarnaise
RTX 4080	16 GB / GDDR6X / 256 bit	9205.93 ± 22.31	143.47 ± 0.02	20638e4	@Ristovski
RTX A6000	48 GB / GDDR6 / 384 bit	5662.39 ± 13.87	144.87 ± 0.18	4795c91	@Hedede
RTX 3080	10 GB / GDDR6X / 320 bit	5569.56 ± 14.04	139.95 ± 0.95	9c35706	@slaren
RTX PRO 4000 Blackwell	24 GB / GDDR7 / 192 bit	5674.44 ± 139.53	136.38 ± 0.13	7d77f07	@Hedede
RTX A5000	24 GB / GDDR6 / 384 bit	4552.15 ± 9.68	135.83 ± 0.11	e5155e6	@Hedede
Tesla V100	32 GB / HBM2 / 4096 bit	2973.78 ± 3.62	134.76 ± 0.02	51f5a45	@Hedede
RTX 4070 Ti SUPER	16 GB / GDDR6X / 256 bit	7612.32 ± 37.35	132.85 ± 0.31	9c35706	@Ristovski
A30	24 GB / HBM2e / 3072 bit	3068.72 ± 0.63	131.93 ± 0.18	583cb83	@Hedede
RTX 5070	12 GB / GDDR7 / 192 bit	5783.44 ± 36.95	128.21 ± 2.52	-	@Spyro000
A40	48 GB / GDDR6 / 384 bit	5256.38 ± 19.39	126.24 ± 0.06	3470a5c	@Hedede
Titan V	12 GB / HBM2 / 3072 bit	2481.25 ± 1.31	112.17 ± 0.01	e56abd2	@Hedede
RTX 2080 Ti	11 GB / GDDR6 / 352 bit	3107.61 ± 4.34	109.17 ± 0.07	9c35706	@ariya
Quadro RTX 6000	24 GB / GDDR6 / 384 bit	3053.96 ± 1.37	104.38 ± 0.04	b8e09f0	@Hedede
Quadro RTX 8000	48 GB / GDDR6 / 384 bit	3052.35 ± 5.64	103.63 ± 0.02	b8e09f0	@Hedede
RTX A4500	20 GB / GDDR6 / 320 bit	3453.10 ± 49.19	103.00 ± 0.25	5cdb27e	@aleksyx
RTX 5060 Ti 16 GB	16 GB / GDDR7 / 128 bit	4195.53 ± 1.98	93.46 ± 0.01	89d1029	@mike-llamacpp
RTX 2070 SUPER	8 GB / GDDR6 / 256 bit	2293.29 ± 5.91	87.71 ± 0.29	bc07349	@phstudy
RTX A4000	16 GB / GDDR6 / 256 bit	2807.83 ± 52.44	85.17 ± 0.66	65349f2	@TinyServal
RTX 3060	12 GB / GDDR6 / 192 bit	2407.67 ± 3.73	76.92 ± 0.03	baa9255	@QuantiusBenignus
Titan Xp	12 GB / GDDR5X / 384 bit	1218.12 ± 1.82	73.84 ± 0.04	c4510dc	@Hedede
Quadro RTX 4000	8 GB / GDDR6 / 256 bit	1662.80 ± 2.04	67.62 ± 0.67	7d77f07	@Hedede
RTX 4060 Ti 8 GB	8 GB / GDDR6 / 128 bit	3803.45 ± 70.80	64.03 ± 0.53	89d1029	@mike-llamacpp
Tesla P100	16 GB / HBM2 / 4096 bit	787.36 ± 3.27	61.99 ± 0.00	b8372ee	@Hedede
GTX 1080 Ti	11 GB / GDDR5X / 352 bit	1138.14 ± 2.02	61.38 ± 0.03	9c35706	@ariya
RTX A4000 Ada	20 GB / GDDR6 / 160 bit	3171.86 ± 4.34	61.37 ± 0.01	a74a0d6	@sdwolfz
RTX 2060 SUPER	8 GB / GDDR6 / 256 bit	1563.77 ± 0.51	61.13 ± 0.05	5c0eb5e	@ggerganov
DGX Spark	128 GB / LPDDR5x	3661.37 ± 38.66	56.74 ± 0.03	5acd455	@ggerganov
Tesla P40	24 GB / GDDR5 / 384 bit	1079.66 ± 0.18	53.73 ± 0.05	c76b420	@m18coppola
RTX 2000 Ada	16 GB / GDDR6 / 128 bit	2250.14 ± 5.91	50.71 ± 0.01	756cfea	@DigitalRudeness
Tesla T4	16 GB / GDDR6 / 256 bit	1309.73 ± 1.02	44.03 ± 0.57	d32e03f	@pt13762104
GTX 1660	6 GB / GDDR5 / 192 bit	154.45 ± 0.52	41.43 ± 0.01	9515c61	@ariya
Tesla M40	24 GB / GDDR5 / 384 bit	290.17 ± 0.11	39.98 ± 0.01	97d5117	@Hedede
GTX 1070 Ti	8 GB / GDDR5 / 256 bit	790.52 ± 2.39	37.87 ± 0.00	79c1160	@pebaryan
Jetson AGX Orin	64 GB / LPDDR5 / 256 bit	1171.96 ± 4.70	35.88 ± 0.18	c1b1876	@TinyServal
Tesla P4	8 GB / GDDR5 / 256 bit	529.53 ± 2.12	33.12 ± 0.03	c76b420	@m18coppola
P106-100	6 GB / GDDR5 / 192 bit	438.49 ± 0.38	30.64 ± 0.06	5fd160b	@pebaryan
GTX 1060	6 GB / GDDR5 / 192 bit	446.19 ± 0.81	28.18 ± 0.01	5fd160b	@pebaryan
Quadro T1000	4 GB / GDDR5 / 128 bit	27.46 ± 0.23	27.46 ± 0.23	f6da8cb	@hanabu
Quadro P2000	5 GB / GDDR5 / 160 bit	311.55 ± 0.19	23.76 ± 0.01	baa9255	@TinyServal
Tesla K80	12 GB / GDDR5 / 384 bit	133.36 ± 0.60	14.27 ± 0.32	32732f2	@pebaryan
Quadro P1000	4 GB / GDDR5 / 128 bit	173.82 ± 0.02	13.65 ± 0.14	1e74897	@aleksyx

Apple Silicon as a Reference Baseline

Discussion #4167 is useful because it established a more unified benchmark format early on. Besides Q4_0, it also includes F16 and Q8_0, which helps explain PP / TG / t/s. The thread explicitly defines:

PP = prompt processing
TG = text generation
t/s = tokens per second

A representative example is the M2 Ultra time-series comparison:

Time	Device	Version / Note	Bandwidth GB/s	GPU Cores	F16 PP	F16 TG	Q8_0 PP	Q8_0 TG	Q4_0 PP	Q4_0 TG
2023-11-21	M2 Ultra	8e672ef	800	76	1401.85	41.02	1248.59	66.64	1238.48	94.27
2024-11-12	M2 Ultra	86ed72d + FA	800	76	1525.95	43.15	1368.18	73.11	1391.78	108.80
2025-08-02	M2 Ultra	5c0eb5e + FA	800	76	1561.35	43.24	1386.97	73.35	1412.42	109.41

Representative Apple Silicon entries shown in the thread:

Device	Q4_0 PP	Q4_0 TG	Q8_0 PP	Q8_0 TG	F16 PP	F16 TG
M1 Pro 16 GPU	266.25	36.41	270.37	22.34	302.14	12.75
M2 Ultra 76 GPU	1238.48	94.27	1248.59	66.64	1401.85	41.02
M3 Max 40 GPU	690.99	65.85	749.37	43.00	794.26	25.27
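
To read the M2 Ultra time series as relative gains rather than raw numbers, here is a minimal sketch. The values are copied from the first and last M2 Ultra rows above; the names `M2_ULTRA`, `percent_gain`, and `gains` are illustrative and not part of llama.cpp.

```python
# (PP, TG) in t/s per quantization, copied from the M2 Ultra rows above.
M2_ULTRA = {
    "2023-11-21": {"F16": (1401.85, 41.02), "Q8_0": (1248.59, 66.64), "Q4_0": (1238.48, 94.27)},
    "2025-08-02": {"F16": (1561.35, 43.24), "Q8_0": (1386.97, 73.35), "Q4_0": (1412.42, 109.41)},
}


def percent_gain(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new / old - 1.0) * 100.0


def gains(before: str, after: str) -> dict:
    """Per-quantization (PP gain %, TG gain %) between two snapshots."""
    result = {}
    for quant, (pp_old, tg_old) in M2_ULTRA[before].items():
        pp_new, tg_new = M2_ULTRA[after][quant]
        result[quant] = (percent_gain(pp_old, pp_new), percent_gain(tg_old, tg_new))
    return result


for quant, (pp, tg) in gains("2023-11-21", "2025-08-02").items():
    print(f"{quant}: PP {pp:+.1f}%  TG {tg:+.1f}%")
```

The same hardware gets faster over time purely through software changes, which is why the commit column matters when comparing rows.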

ROCm / HIP Scoreboards

Llama 2 7B, Q4_0, no FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
Instinct MI300X	192 GB / HBM3 / 8192 bit	11476.40 ± 72.79	232.92 ± 0.53	ee3a9fc	@yeahdongcn
RX 7900 XTX	24 GB / GDDR6 / 384 bit	3552.27 ± 101.96	167.11 ± 0.50	2f0c2db	@Diablo-D3
Instinct MI210	64 GB / HBM2e / 4096 bit	2486.22 ± 9.58	124.51 ± 0.04	8160b38	@65a
Pro W7900	48 GB / GDDR6 / 384 bit	3213.17 ± 80.47	121.18 ± 0.06	8160b38	@65a
RX 7900 XT	20 GB / GDDR6 / 320 bit	3098.38 ± 24.02	116.15 ± 0.06	1e15bfd	@AdamNiederer
RX 9070	16 GB / GDDR6 / 256 bit	2381.77 ± 3.68	114.48 ± 0.60	d0660f2	@andj1210
Instinct MI100	32 GB / HBM2 / 4096 bit	2732.83 ± 1.98	110.48 ± 0.14	9c35706	@firefox42
RX 9070 XT	16 GB / GDDR6 / 256 bit	5055.19 ± 109.58	101.27 ± 0.27	583cb83	@Hadrianneue
RX 7800 XT	16 GB / GDDR6 / 256 bit	2151.81 ± 17.94	100.94 ± 0.10	00131d6	@olegshulyakov
Instinct MI50	32 GB / HBM2 / 4096 bit	1057.24 ± 0.53	98.95 ± 0.25	97d5117	@wtarreau
RX 7900 GRE	16 GB / GDDR6 / 256 bit	1456.98 ± 12.39	96.07 ± 0.10	6fa3b55	@MihaiBojescu
AI PRO R9700	32 GB / GDDR6 / 256 bit	4443.54 ± 339.25	93.84 ± 0.26	bd4ef13	@gogich77
Instinct MI60	32 GB / HBM2 / 4096 bit	1289.11 ± 0.62	91.46 ± 0.13	504af20	@Said-Akbar
RX 6900 XT	16 GB / GDDR6 / 256 bit	1889.84 ± 31.21	88.49 ± 0.00	a972fae	@notgood
Pro VII	16 GB / HBM2 / 4096 bit	1064.99 ± 1.18	87.45 ± 0.04	2739a71	@8XXD8
RX 6800 XT	16 GB / GDDR6 / 256 bit	1447.07 ± 1.36	83.92 ± 0.03	79c1160	@MrLavender
Pro V620	32 GB / GDDR6 / 256 bit	1803.65 ± 2.54	74.66 ± 0.01	5c0eb5e	@samteezy
RX 9060 XT	16 GB / GDDR6 / 256 bit	1419.67 ± 3.64	67.58 ± 0.24	a0e13dc	@lcy0321
RX 5700 XT	8 GB / GDDR6 / 256 bit	354.17 ± 0.18	67.55 ± 0.04	c05e8c9	@daniandtheweb
Instinct MI25	16 GB / HBM2 / 2048 bit	409.83 ± 0.23	63.94 ± 0.06	2739a71	@8XXD8
AI Max+ 395	128 GB / LPDDR5	911.36 ± 1.79	50.01 ± 0.07	e60f241	@firefox42
RX 7600 XT	16 GB / GDDR6 / 128 bit	1099.64 ± 2.05	48.58 ± 0.06	9c35706	@wbruna
RX Vega 64	8 GB / HBM2 / 2048 bit	240.68 ± 0.09	48.46 ± 0.09	ec428b0	@davispuh
Radeon 8060S	System Shared / DDR5	351.36 ± 0.67	47.97 ± 0.33	1d0125b	@hspak
Radeon 880M	System Shared / DDR5	163.25 ± 13.86	12.97 ± 1.63	c55d53a	@Hedede

Llama 2 7B, Q4_0, with FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
Instinct MI300X	192 GB / HBM3 / 8192 bit	11945.97 ± 54.29	218.53 ± 0.09	ee3a9fc	@yeahdongcn
RX 7900 XTX	24 GB / GDDR6 / 384 bit	3874.25 ± 11.92	170.12 ± 0.56	2f0c2db	@Diablo-D3
Pro W7900	48 GB / GDDR6 / 384 bit	3472.86 ± 52.86	127.43 ± 0.12	8160b38	@65a
Instinct MI210	64 GB / HBM2e / 4096 bit	2571.82 ± 2.89	130.18 ± 0.06	8160b38	@65a
RX 9070	16 GB / GDDR6 / 256 bit	2452.68 ± 1.33	115.32 ± 0.52	d0660f2	@andj1210
RX 7900 XT	20 GB / GDDR6 / 320 bit	3261.75 ± 9.09	112.30 ± 0.06	1e15bfd	@AdamNiederer
Instinct MI50	32 GB / HBM2 / 4096 bit	1129.43 ± 0.15	105.82 ± 0.07	97d5117	@wtarreau
Instinct MI100	32 GB / HBM2 / 4096 bit	2755.00 ± 3.68	104.71 ± 0.10	9c35706	@firefox42
AI PRO R9700	32 GB / GDDR6 / 256 bit	4773.07 ± 49.30	97.98 ± 0.13	bd4ef13	@gogich77
RX 7900 GRE	16 GB / GDDR6 / 256 bit	1598.79 ± 11.48	97.53 ± 0.06	6fa3b55	@MihaiBojescu
RX 9070 XT	16 GB / GDDR6 / 256 bit	4903.51 ± 96.36	97.28 ± 0.13	583cb83	@Hadrianneue
RX 7800 XT	16 GB / GDDR6 / 256 bit	2304.63 ± 2.85	95.99 ± 0.21	00131d6	@olegshulyakov
RX 6900 XT	16 GB / GDDR6 / 256 bit	1948.31 ± 13.51	85.04 ± 0.02	a972fae	@notgood
Pro V620	32 GB / GDDR6 / 256 bit	1256.86 ± 0.55	70.83 ± 0.02	5c0eb5e	@samteezy
RX 9060 XT	16 GB / GDDR6 / 256 bit	1479.27 ± 0.71	65.42 ± 0.19	a0e13dc	@lcy0321
RX 5700 XT	8 GB / GDDR6 / 256 bit	314.17 ± 0.29	62.02 ± 0.05	c05e8c9	@daniandtheweb
AI Max+ 395	128 GB / LPDDR5	1003.53 ± 2.91	49.87 ± 0.02	e60f241	@firefox42
Radeon 8060S	System Shared / DDR5	366.08 ± 1.44	48.97 ± 0.15	1d0125b	@hspak
RX 7600 XT	16 GB / GDDR6 / 128 bit	1199.16 ± 1.07	47.65 ± 0.06	9c35706	@wbruna
RX Vega 64	8 GB / HBM2 / 2048 bit	153.17 ± 0.72	42.46 ± 0.40	ec428b0	@davispuh
Radeon 880M	System Shared / DDR5	213.31 ± 14.05	16.16 ± 1.41	c55d53a	@Hedede

Vulkan Scoreboards

Llama 2 7B, Q4_0, no FA

Chip	pp512 t/s	tg128 t/s	Commit	Comments
Nvidia RTX 5090	10381.64 ± 508.84	263.63 ± 0.91	ca71fb9	coopmat2
AMD Radeon RX 7900 XTX	3531.93 ± 31.74	191.28 ± 0.20	2f0c2db
Nvidia RTX 4090	9452.03 ± 187.70	187.97 ± 0.21	4ae88d0	coopmat2
Nvidia RTX 5080	7444.99 ± 20.11	185.10 ± 0.54	f6b533d	coopmat2
Nvidia A100	6389.86 ± 4.83	160.78 ± 0.16	2257758	coopmat2
Nvidia RTX 3090	4298.97 ± 10.59	160.13 ± 0.25	4ae88d0	coopmat2
Nvidia RTX 4080 Super	7101.18 ± 269.79	147.13 ± 5.64	81086cd	coopmat2
Nvidia RTX 3080	4287.11 ± 55.50	139.15 ± 0.05	7c7d6ce	coopmat2
Nvidia RTX A5000	3641.55 ± 9.05	139.89 ± 0.69	4ae88d0	coopmat2
AMD Radeon RX 9070 XT	5036.04 ± 88.16	137.11 ± 0.02	e9fd8dc
Nvidia RTX 5070 Ti	6213.63 ± 27.72	135.63 ± 0.18	d13d0f6	coopmat2
AMD Radeon AI Pro R9700	4036.04 ± 34.58	130.19 ± 0.39	3191462
Nvidia Tesla V100	1391.39 ± 1.19	129.58 ± 0.58	7d77f07
Nvidia RTX 4070 Ti Super	6099.18 ± 154.30	129.45 ± 0.18	4ae88d0	coopmat2
AMD Radeon RX 7900 XT	2941.58 ± 17.17	123.18 ± 0.40	71e74a3
AMD Radeon RX 9070	3164.10 ± 66.84	119.71 ± 3.40	21c17b5
AMD Radeon RX 7800 XT	2017.33 ± 19.30	118.27 ± 0.27	4fdbc1e
AMD Radeon RX 7900 GRE	2336.31 ± 7.52	116.11 ± 0.26	4b2a477
Apple M3 Ultra	1116.83 ± 0.55	115.54 ± 0.78	2d451c8	MoltenVK
Intel Arc Pro B70	3379.00 ± 47.92	112.02 ± 1.08	b863507
Nvidia Titan V	984.36 ± 4.13	108.86 ± 0.28	e56abd2
AMD Radeon Pro VII	1078.54 ± 0.86	107.82 ± 0.14	N/A
AMD Radeon RX 6900 XT	1837.21 ± 25.44	104.60 ± 0.30	a972fae
Intel Arc Pro A60	2261.11 ± 9.53	104.25 ± 0.07	97d5117
AMD Radeon RX 6800 XT	1752.92 ± 1.71	100.32 ± 0.97	N/A
AMD Radeon VII	1059.14 ± 0.56	101.19 ± 0.53	77d6ae4
Nvidia RTX 2080 Ti	1888.24 ± 9.20	97.58 ± 6.60	N/A
AMD Radeon RX 6800	1698.69 ± 0.80	95.61 ± 0.19	4b385bf
AMD Radeon Pro W6800X Duo	687.71 ± 4.33	94.82 ± 0.12	N/A
Nvidia RTX 5060 Ti	3460.92 ± 7.16	93.51 ± 0.15	89f10ba	coopmat2
Nvidia RTX 4070	3179.37 ± 46.16	92.29 ± 0.28	9a48399
AMD Radeon Pro W6800X	510.80 ± 0.13	86.47 ± 0.46	13b4548	MoltenVK
AMD Radeon RX 6700 XT	1051.20 ± 0.98	83.88 ± 0.08	6d75883
AMD Radeon RX 6750 XT	1040.58 ± 0.35	81.98 ± 0.03	228f34c
AMD Radeon Pro V620	1595.32 ± 1.59	81.78 ± 0.06	03d4698
Nvidia RTX 3070	2113.02 ± 7.38	78.71 ± 0.13	1b8fb81
AMD Radeon Instinct MI60	369.26 ± 2.48	78.16 ± 1.40	504af20
Nvidia RTX 3060	1815.70 ± 5.85	75.94 ± 0.80	92c0b38	coopmat2
Apple M4 Max	724.77 ± 20.93	75.02 ± 0.14	1ece0cb6
Nvidia Tesla T10	1692.70 ± 2.05	75.01 ± 0.21	7f76692	coopmat2
Nvidia RTX A4000	2248.14 ± 7.59	73.74 ± 0.08	f5245b5	coopmat2
AMD Radeon RX 5700 XT	529.69 ± 0.26	70.73 ± 0.04	4fdbc1e
AMD Radeon RX 9060 XT	2141.67 ± 6.87	70.54 ± 0.74	ed52f36
Intel Arc B580	620.94 ± 15.33	70.14 ± 0.28	7f76692
AMD Radeon Pro V540	583.88 ± 6.56	69.64 ± 0.24	9da3dcd
AMD Radeon Pro W5700	449.85 ± 0.46	68.55 ± 0.15	23bc779
Intel Arc Pro B60	522.36 ± 3.60	68.55 ± 0.01	516a4ca
Nvidia GTX 1080 Ti	540.69 ± 0.71	64.99 ± 0.08	360d653
Nvidia RTX 2070 Super	1199.13 ± 7.70	64.64 ± 0.20	b7552cf
Nvidia RTX 3070 Mobile	1689.40 ± 19.57	63.64 ± 0.39	ceff6bb	coopmat2
Nvidia Tesla P100	678.14 ± 1.40	63.16 ± 0.06	eec1e33
AMD BC-250	370.66 ± 0.04	62.32 ± 0.32	5886f4f
AMD Radeon RX 6650 XT	1029.52 ± 1.21	62.14 ± 0.02	dbb852b
Nvidia RTX 4060 Mobile	2135.66 ± 23.18	59.53 ± 0.03	a5c07dc	coopmat2
Nvidia Tesla P40	488.06 ± 0.27	59.36 ± 0.16	N/A
Nvidia GTX 1660 Ti Mobile	511.67 ± 2.85	56.60 ± 0.07	b43556e
AMD Radeon Instinct MI25	439.42 ± 0.34	54.69 ± 0.03	2739a71
AMD Radeon RX 6600 XT	574.65 ± 0.86	53.92 ± 0.11	091592d
AMD Ryzen AI Max+ 395	1288.96 ± 6.49	53.59 ± 0.38	7f76692
AMD Radeon RX 7600 XT	840.85 ± 3.02	53.02 ± 0.01	01d8eaa
Intel Arc A770	1073.85 ± 29.68	52.56 ± 0.11	a69d54f
Nvidia GB10	2737.79 ± 19.56	52.28 ± 0.03	b9da444	coopmat2
AMD FirePro S9300 x2	247.26 ± 0.43	51.86 ± 0.11	eec1e33	Split across two GPUs
AMD Radeon RX 6600	761.89 ± 1.76	50.63 ± 0.02	b1c70e2
AMD Radeon RX Vega 56	439.87 ± 0.61	50.23 ± 0.14	92c0b38
Intel Arc B570	913.95 ± 0.90	49.64 ± 0.03	7f76692
Nvidia RTX 3060 Mobile	1059.76 ± 3.54	49.03 ± 0.13	dbb3a47
AMD Radeon RX 6800M	861.99 ± 7.67	48.71 ± 0.71	8e6f8bc
AMD Radeon RX 6600M	605.59 ± 0.65	48.21 ± 0.07	fe5b78c
Intel Arc A770M	875.92 ± 2.16	47.69 ± 0.16	eeee367
Nvidia P104-100	311.90 ± 0.22	46.18 ± 0.05	eec1e33
AMD Radeon RX Vega 64	356.08 ± 0.09	45.73 ± 0.18	ec428b0
Nvidia RTX A2000	1245.19 ± 8.76	45.52 ± 0.54	b1afcab	coopmat2
AMD Radeon RX 7600M XT	459.39 ± 2.34	45.28 ± 0.10	b9ab0a4	eGPU
AMD Radeon Pro V340	375.41 ± 0.24	45.16 ± 0.06	9da3dcd	Split across two GPUs
Nvidia GTX 1070 Ti	297.50 ± 0.54	42.86 ± 1.20	860a9e4	eGPU
Intel Arc A750	1075.94 ± 13.89	42.66 ± 0.18	c1b1876
Nvidia RTX 4050 Mobile	1154.28 ± 15.76	41.89 ± 0.10	d79d8f3
Nvidia GTX 1070	321.57 ± 0.93	41.48 ± 0.09	eec1e33
Intel Arc Pro B50	193.50 ± 0.24	39.99 ± 0.10	7b43f55
Nvidia Tesla M40	92.48 ± 0.02	39.35 ± 1.22	b8372ee
AMD Radeon RX 580	258.03 ± 0.71	39.32 ± 0.03	de4c07f
AMD Radeon RX 470	218.07 ± 0.56	38.63 ± 0.21	e288693
AMD Radeon Pro W5500	315.39 ± 3.76	36.82 ± 0.38	860a9e4
AMD Radeon RX 480	248.66 ± 0.28	34.71 ± 0.14	3b15924
Apple M2 Ultra	205.98 ± 0.02	34.34 ± 0.12	dbb852b	Asahi Linux
Nvidia GTX 980	186.24 ± 0.09	33.90 ± 0.51	860a9e4
Nvidia P106-100	183.78 ± 0.26	29.77 ± 0.04	23bc779
AMD FirePro W8100	155.22 ± 0.17	29.52 ± 0.05	4536363
Nvidia Tesla P4	265.54 ± 0.21	28.03 ± 0.14	24d2ee0
AMD Radeon RX 6500 XT	255.25 ± 0.35	27.81 ± 0.10	g9fdfcd
Apple M3	263.70 ± 0.02	26.39 ± 0.14	b9ab0a4	MoltenVK
AMD FirePro S10000	94.78 ± 0.02	25.32 ± 0.02	914a82d	Split across two GPUs
Nvidia Quadro P2000	169.55 ± 0.17	23.05 ± 0.03	63f8fe0
Intel Core Ultra 200 Series	544.95 ± 4.15	22.49 ± 0.09	cea560f
AMD Ryzen AI 9 300 Series	479.07 ± 0.41	22.41 ± 0.18	N/A
AMD Ryzen 6000 Series	240.89 ± 0.52	21.26 ± 0.08	ee09828
Apple M2 Pro	62.70 ± 0.03	20.95 ± 0.11	1fe0029	Asahi Linux
Nvidia GTX 1050 Ti	136.42 ± 0.67	20.96 ± 0.21	2f0c2db
AMD Ryzen 8000 Series	266.19 ± 1.36	20.53 ± 0.08	a5c07dc
AMD Ryzen 7000 Series	281.62 ± 1.56	19.91 ± 0.07	ebce03e
AMD Ryzen Z1 Extreme	199.36 ± 7.02	18.77 ± 0.02	53ff6b9
AMD FirePro D700	69.95 ± 0.04	16.62 ± 0.01	d3bd719	MoltenVK, running in FP16 mode on FP32 only chip
AMD Radeon Pro WX 4100	78.79 ± 0.10	16.05 ± 0.07	860a9e4
Apple M2	50.79 ± 0.16	13.50 ± 0.02	8c0d6bb	Asahi Linux
Apple M1	38.29 ± 0.00	12.47 ± 0.03	2370665	Asahi Linux
AMD Ryzen 5000 Series	90.55 ± 0.08	10.98 ± 0.07	d84635b
Intel Core 1100 Series	187.20 ± 1.78	10.39 ± 0.04	abb9f3c
AMD Radeon RX 550	52.66 ± 0.49	10.20 ± 0.01	N/A
AMD Ryzen 4000 Series	103.87 ± 0.02	9.63 ± 0.01	4b385bf
Nvidia Tesla K80	89.46 ± 0.10	9.39 ± 0.06	5d46bab	Running on single GPU
Nvidia Tesla K40	64.37 ± 0.09	9.30 ± 0.19	eec1e33
MediaTek Dimensity 9400	38.36 ± 15.15	8.92 ± 0.06	b9ab0a4	GPU supports coopmat but pp512 is faster with it turned off
Intel Core Ultra 100 Series	185.51 ± 0.22	8.21 ± 0.07	1d72c84
AMD Ryzen 3000 Series	48.63 ± 0.10	8.49 ± 0.01	1fe0029
CIX CD8180	2.80 ± 0.01	5.51 ± 0.00	4dca015
Intel Core 1000 Series	25.58 ± 0.00	4.25 ± 0.18	N/A
Intel Core 8000 Series	25.43 ± 0.17	3.35 ± 0.03	c4df49a
Intel N150	28.84 ± 0.02	2.93 ± 0.00	4f63cd7

Llama 2 7B, Q4_0, with FA

Chip	pp512 t/s	tg128 t/s	Commit	Comments
Nvidia RTX 5090	11796.38 ± 601.36	273.68 ± 0.52	ca71fb9	coopmat2
AMD Radeon RX 7900 XTX	3332.90 ± 11.47	195.30 ± 0.23	2f0c2db
Nvidia RTX 5080	8054.59 ± 35.68	192.17 ± 0.21	f6b533d	coopmat2
Nvidia RTX 4090	10830.41 ± 36.25	190.10 ± 0.31	4ae88d0	coopmat2
Nvidia A100	7064.40 ± 1.63	170.56 ± 0.02	2257758	coopmat2
Nvidia RTX 3090	4732.33 ± 4.80	162.28 ± 0.21	4ae88d0	coopmat2
Nvidia RTX 4080 Super	8007.37 ± 46.03	150.20 ± 0.26	81086cd	coopmat2
Nvidia RTX 3080	4913.83 ± 21.52	145.74 ± 0.16	7c7d6ce	coopmat2
Nvidia Tesla V100	1411.25 ± 2.12	142.13 ± 0.03	7d77f07
Nvidia RTX A5000	4071.22 ± 13.13	140.43 ± 0.22	4ae88d0	coopmat2
AMD Radeon RX 9070 XT	4911.74 ± 28.52	138.20 ± 0.18	e9fd8dc
Nvidia RTX 5070 Ti	6764.53 ± 11.95	135.65 ± 0.02	d13d0f6	coopmat2
AMD Radeon AI Pro R9700	4333.83 ± 29.36	130.90 ± 0.12	3191462
AMD Radeon RX 7900 XT	3043.93 ± 10.42	124.20 ± 0.09	71e74a3
AMD Radeon RX 7800 XT	2094.64 ± 14.38	119.63 ± 0.13	4fdbc1e
AMD Radeon RX 9070	3277.24 ± 18.17	119.55 ± 0.06	21c17b5
AMD Radeon RX 7900 GRE	2402.07 ± 22.50	116.77 ± 0.08	4b2a477
Apple M3 Ultra	1115.55 ± 0.75	115.99 ± 0.12	2d451c8	MoltenVK
Intel Arc Pro B70	3314.53 ± 17.95	111.63 ± 0.05	b863507
Nvidia Titan V	792.74 ± 4.30	109.21 ± 0.72	e56abd2
AMD Radeon Pro VII	783.94 ± 0.77	108.45 ± 0.48	N/A
AMD Radeon RX 6900 XT	1761.93 ± 4.75	106.15 ± 0.04	a972fae
Nvidia RTX 2080 Ti	1936.25 ± 32.08	100.99 ± 0.24	N/A
AMD Radeon RX 6800 XT	1704.79 ± 0.71	100.50 ± 0.06	N/A
AMD Radeon Pro W6800X Duo	795.28 ± 0.72	100.08 ± 0.02	N/A
Nvidia RTX 5060 Ti	3912.65 ± 5.86	97.01 ± 0.14	89f10ba	coopmat2
AMD Radeon RX 6800	1749.46 ± 3.36	96.65 ± 0.48	4b385bf
Nvidia RTX 4070	4293.57 ± 27.70	91.49 ± 0.89	9a48399	coopmat2
AMD Radeon RX 6750 XT	997.05 ± 0.45	82.29 ± 0.06	228f34c
AMD Radeon RX 6700 XT	1010.90 ± 12.89	81.86 ± 0.19	6d75883
Nvidia RTX 3060	2012.88 ± 10.12	80.59 ± 0.02	92c0b38	coopmat2
AMD Radeon Pro V620	1556.31 ± 2.82	79.24 ± 0.09	03d4698
Nvidia RTX A4000	2482.74 ± 26.05	76.07 ± 0.08	f5245b5	coopmat2
Nvidia Tesla T10	1840.14 ± 1.22	76.05 ± 0.13	7f76692	coopmat2
AMD Radeon RX 5700 XT	538.31 ± 0.35	74.43 ± 0.03	4fdbc1e
Intel Arc B580	419.49 ± 3.37	72.00 ± 0.24	7f76692
Apple M4 Max	557.46 ± 26.87	71.79 ± 4.16	1ece0cb6
AMD Radeon Pro W5700	446.98 ± 0.39	71.30 ± 0.24	23bc779
Intel Arc Pro B60	274.76 ± 0.27	70.54 ± 0.03	516a4ca
AMD Radeon RX 9060 XT	1915.41 ± 7.90	70.52 ± 0.16	ed52f36
Nvidia Tesla P100	685.51 ± 0.88	66.48 ± 0.02	eec1e33
AMD Radeon RX 6650 XT	1088.90 ± 0.40	64.53 ± 0.75	dbb852b
Nvidia GTX 1080 Ti	529.96 ± 0.38	64.63 ± 0.10	360d653
AMD BC-250	356.87 ± 1.24	63.14 ± 0.09	5886f4f
Nvidia RTX 3070 Mobile	1832.07 ± 57.14	62.92 ± 0.37	ceff6bb	coopmat2
Nvidia RTX 4060 Mobile	2358.03 ± 12.17	60.01 ± 0.08	a5c07dc	coopmat2
Nvidia Tesla P40	484.37 ± 0.27	59.22 ± 0.15	N/A
Nvidia GTX 1660 Ti Mobile	514.34 ± 0.88	57.30 ± 0.42	b43556e
AMD Radeon RX 7600 XT	1024.38 ± 7.56	56.11 ± 0.02	01d8eaa
AMD FirePro S9300 x2	243.33 ± 0.22	55.64 ± 0.06	eec1e33	Split across two GPUs
Nvidia GB10	3279.89 ± 26.78	53.64 ± 0.05	b9da444	coopmat2
AMD Radeon RX 6600	808.76 ± 0.15	53.24 ± 0.03	b1c70e2
Intel Arc A770	1119.68 ± 30.25	53.07 ± 0.09	a69d54f
AMD Ryzen AI Max+ 395	1357.07 ± 10.94	53.00 ± 0.13	7f76692
AMD Radeon RX Vega 56	428.54 ± 0.50	52.66 ± 0.03	92c0b38
Intel Arc B570	288.51 ± 0.09	50.49 ± 0.05	7f76692
Nvidia P104-100	325.30 ± 0.25	48.64 ± 0.04	eec1e33
AMD Radeon Pro V340	360.23 ± 0.74	47.54 ± 0.06	9da3dcd	Split across two GPUs
AMD Radeon RX 6800M	784.16 ± 2.76	49.06 ± 0.34	8e6f8bc
AMD Radeon RX Vega 64	320.12 ± 0.22	47.06 ± 0.01	ec428b0
Nvidia RTX A2000	1361.85 ± 3.26	45.69 ± 0.20	b1afcab	coopmat2
Intel Arc A770M	384.74 ± 0.78	45.68 ± 0.06	eeee367
Intel Arc A750	303.37 ± 1.44	43.96 ± 0.03	c1b1876
Nvidia GTX 1070 Ti	292.85 ± 0.23	43.42 ± 0.34	860a9e4	eGPU
Nvidia GTX 1070	330.84 ± 1.02	43.33 ± 0.06	360d653
Nvidia Tesla M40	93.35 ± 0.01	41.68 ± 0.01	b8372ee
Intel Arc Pro B50	132.48 ± 0.04	41.02 ± 0.04	7b43f55
AMD Radeon RX 470	197.26 ± 0.27	37.28 ± 0.11	3769fe6
AMD Radeon RX 480	194.52 ± 0.61	37.23 ± 0.09	0bcb40b
Apple M2 Ultra	198.83 ± 0.85	198.83 ± 0.85	dbb852b	Asahi Linux
Nvidia GTX 980	180.97 ± 0.74	34.16 ± 0.10	860a9e4
Nvidia P106-100	183.40 ± 0.34	30.79 ± 0.32	23bc779
AMD FirePro W8100	140.52 ± 0.34	29.28 ± 0.14	4536363
Nvidia Tesla P4	287.14 ± 0.29	28.37 ± 0.24	24d2ee0
Nvidia Quadro P2000	181.71 ± 0.12	23.77 ± 0.02	63f8fe0
Intel Core Ultra 200 Series	536.48 ± 1.27	23.05 ± 0.04	cea560f
AMD Ryzen AI 9 300 Series	532.59 ± 3.55	22.31 ± 0.06	N/A
AMD Ryzen 6000 Series	277.91 ± 0.37	21.15 ± 0.09	ee09828
Apple M2 Pro	58.86 ± 0.02	20.97 ± 0.03	1fe0029	Asahi Linux
AMD Ryzen 8000 Series	297.39 ± 1.22	20.59 ± 0.38	a5c07dc
AMD Ryzen 7000 Series	312.85 ± 2.51	20.09 ± 0.35	835b2b9
Nvidia GTX 1050 Ti	127.54 ± 1.03	20.08 ± 0.17	2f0c2db
AMD Radeon Pro WX 4100	75.59 ± 0.19	16.56 ± 0.04	860a9e4
Apple M1	35.93 ± 0.00	12.85 ± 0.02	2370665	Asahi Linux
Apple M2	46.81 ± 0.08	12.25 ± 2.30	8c0d6bb	Asahi Linux
AMD Ryzen 5000 Series	79.06 ± 0.01	10.75 ± 0.00	5d195f1
Intel Core 1100 Series	174.77 ± 4.47	10.58 ± 0.03	abb9f3c
Nvidia Tesla K40	64.37 ± 0.02	9.92 ± 0.06	eec1e33
AMD Ryzen 4000 Series	113.32 ± 0.01	9.87 ± 0.01	4b385bf
Nvidia Tesla K80	88.26 ± 0.19	9.49 ± 0.01	5d46bab	Running on single GPU
AMD Ryzen 5 3000 Series	47.41 ± 0.14	8.47 ± 0.01	1fe0029
Intel Core Ultra 100 Series	77.66 ± 2.75	7.75 ± 0.05	2e89f76
Intel Core 8000 Series	25.55 ± 0.04	3.35 ± 0.02	c4df49a
Intel N150	25.59 ± 0.00	2.91 ± 0.00	4f63cd7

How to Use These Tables

Decide whether you care more about tg128 or pp512. For chat and interactive use, tg128 usually matters more. For long prompts and batch throughput, pp512 matters more.
Match the backend you actually use. Nvidia users should usually prioritize CUDA. AMD users should compare ROCm and Vulkan first. Cross-platform users should pay close attention to Vulkan.
Check FA last. On many GPUs, enabling FA improves pp512 more than tg128, so a single headline number can be misleading.
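
As a rough illustration of the FA point, the sketch below compares the RTX 3090 rows from the two CUDA tables. The cell strings are copied from those tables; `parse_measurement` and `fa_speedup` are hypothetical helper names, not llama.cpp tooling.

```python
def parse_measurement(cell: str) -> tuple[float, float]:
    """Split a table cell such as '5174.69 ± 21.83' into (mean, spread)."""
    mean, _, spread = cell.partition("±")
    return float(mean), float(spread)


def fa_speedup(no_fa: str, with_fa: str) -> float:
    """Ratio of the with-FA mean to the no-FA mean for one metric."""
    return parse_measurement(with_fa)[0] / parse_measurement(no_fa)[0]


# RTX 3090, CUDA backend: no FA vs. with FA.
pp512 = fa_speedup("5174.69 ± 21.83", "5560.06 ± 16.28")
tg128 = fa_speedup("158.16 ± 0.21", "161.89 ± 0.18")

print(f"pp512 speedup with FA: {pp512:.3f}x")
print(f"tg128 speedup with FA: {tg128:.3f}x")
```

On this card the prompt-processing gain is several times larger than the generation gain, so quoting only one of the two numbers would give a skewed picture.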

One-Sentence Summary

In llama.cpp benchmarks, pp512, tg128, Q4_0, FA, and CUDA / ROCm / Vulkan describe different dimensions. Once the benchmark context is clear, the tables become much easier to read.

Sources

CUDA discussion #15013: https://github.com/ggml-org/llama.cpp/discussions/15013
Apple Silicon discussion #4167: https://github.com/ggml-org/llama.cpp/discussions/4167
ROCm discussion #15021: https://github.com/ggml-org/llama.cpp/discussions/15021
Vulkan discussion #10879: https://github.com/ggml-org/llama.cpp/discussions/10879

Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings

Sun, 19 Apr 2026 00:18:00 +0800

When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?

This note summarizes how Ollama behaves with multiple GPUs. The short version:

Ollama supports multiple GPUs.
The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.
By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.
If a model does not fit on one GPU, Ollama can spread it across available GPUs.
Mixed GPU models may be visible to Ollama, but performance and placement may not be ideal.
SLI / NVLink is not required for multi-GPU use.
To limit which GPUs Ollama can use, use CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, or GGML_VK_VISIBLE_DEVICES.

Official Behavior: Single GPU First, Multi-GPU When Needed

Ollama’s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.

The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.

So do not think of Ollama multi-GPU as “more cards automatically means several times faster.” A more accurate model is:

Small model fits on one GPU: usually runs on one GPU.
Large model does not fit on one GPU: split across multiple GPUs.
Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.
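
The three cases above can be written down as a small decision function. This is a deliberately simplified sketch, not Ollama's actual scheduler: `plan_placement` is a hypothetical name, and it assumes the caller already has a single VRAM estimate for the model (weights plus cache) and the free memory per GPU.

```python
def plan_placement(model_gb: float, gpu_free_gb: list[float]) -> str:
    """Classify where a model would land under the simplified rules above."""
    # Case 1: at least one GPU can hold the whole model.
    if any(free >= model_gb for free in gpu_free_gb):
        return "single GPU"
    # Case 2: no single GPU is big enough, but the combined free VRAM is.
    if sum(gpu_free_gb) >= model_gb:
        return "split across GPUs"
    # Case 3: even the combined VRAM is too small.
    return "partial CPU/RAM offload"


print(plan_placement(9.0, [24.0, 8.0]))   # fits on the 24 GB card
print(plan_placement(28.0, [24.0, 8.0]))  # needs both cards
print(plan_placement(40.0, [24.0, 8.0]))  # spills to system memory
```

Real placement also depends on context length, per-GPU overhead, and other models already loaded, so treat the result as a first guess and confirm with ollama ps.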

Use this command to see where the model is loaded:

ollama ps

The PROCESSOR column may show something like:

100% GPU
48%/52% CPU/GPU
100% CPU

If you see 48%/52% CPU/GPU, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.
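
If you want to check this from a script, the PROCESSOR value can be parsed with a short helper. The sketch assumes only the three formats shown above; other Ollama versions may print the column differently, and `gpu_share` is an illustrative name.

```python
import re


def gpu_share(processor: str) -> int:
    """Return the GPU percentage from an `ollama ps` PROCESSOR value."""
    text = processor.strip()
    # Mixed placement, e.g. "48%/52% CPU/GPU": the second number is the GPU share.
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        return int(split.group(2))
    # Single device, e.g. "100% GPU" or "100% CPU".
    single = re.fullmatch(r"(\d+)%\s+(GPU|CPU)", text)
    if single:
        percent, device = int(single.group(1)), single.group(2)
        return percent if device == "GPU" else 0
    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")


for value in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
    print(f"{value!r}: {gpu_share(value)}% on GPU")
```

Anything below 100 means some layers are running from system memory.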

Multi-GPU Is Not Simple Compute Stacking

Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.

So multi-GPU benefits usually fall into two categories:

VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.
Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.

If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama’s default “use one GPU when it fits” strategy avoids that unnecessary PCIe cost.

SLI or NVLink Is Not Required

Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.

NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.

What you should pay attention to is PCIe bandwidth. The difference between x1, x4, x8, and x16 affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.

Safer rules:

Prefer x16 / x8 over mining-style x1 risers.
PCIe bandwidth matters more when switching large models frequently.
If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.
For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.
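
To get a feel for why x1 risers hurt model switching, here is a back-of-the-envelope estimate of the minimum transfer time into VRAM. The per-lane figures are theoretical PCIe maxima and are an assumption added here, not numbers from Ollama; real loads are often limited by disk speed instead. `min_load_seconds` is a hypothetical helper.

```python
# Approximate theoretical throughput per PCIe lane in GB/s, by generation.
GB_PER_S_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}


def min_load_seconds(model_gb: float, pcie_gen: int, lanes: int) -> float:
    """Lower bound on the time to copy model_gb over the given PCIe link."""
    return model_gb / (GB_PER_S_PER_LANE[pcie_gen] * lanes)


for lanes in (1, 4, 8, 16):
    seconds = min_load_seconds(20.0, pcie_gen=3, lanes=lanes)
    print(f"20 GB over PCIe 3.0 x{lanes}: at least {seconds:.1f} s")
```

The gap between x1 and x16 is a factor of 16 on every model switch, while a model that stays resident pays the cost only once.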

Limit Which NVIDIA GPUs Ollama Uses

On NVIDIA multi-GPU systems, use CUDA_VISIBLE_DEVICES to control which GPUs Ollama can see.

Temporary run:

CUDA_VISIBLE_DEVICES=0,1 ollama serve

Use only the second GPU:

CUDA_VISIBLE_DEVICES=1 ollama serve

Force Ollama not to use NVIDIA GPUs:

CUDA_VISIBLE_DEVICES=-1 ollama serve

The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:

nvidia-smi -L

Example output:

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)

Then specify the UUID:

CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
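
Copying UUIDs by hand is error-prone, so here is a minimal sketch that builds the CUDA_VISIBLE_DEVICES value from nvidia-smi -L output. It assumes the line format shown in the example output above; `SAMPLE`, `parse_gpu_list`, and `visible_devices` are illustrative names, and the placeholder UUIDs are the ones from that example.

```python
import re

# Example output copied from above; on a real machine this text would come
# from running `nvidia-smi -L`.
SAMPLE = """\
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)
"""

LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (GPU-[^)]+)\)$")


def parse_gpu_list(text: str) -> list[dict]:
    """Turn `nvidia-smi -L` lines into dicts with index, name, and uuid."""
    gpus = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            gpus.append(
                {"index": int(match.group(1)), "name": match.group(2), "uuid": match.group(3)}
            )
    return gpus


def visible_devices(text: str, name_contains: str) -> str:
    """Comma-separated UUIDs of all GPUs whose name contains the given text."""
    return ",".join(g["uuid"] for g in parse_gpu_list(text) if name_contains in g["name"])


print("CUDA_VISIBLE_DEVICES=" + visible_devices(SAMPLE, "3090"))
```

Selecting by name and emitting UUIDs keeps the setting stable even if the numeric order changes after a reboot or driver update.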

If Ollama is installed as a Linux systemd service, put the variable into the service environment:

sudo systemctl edit ollama.service

Add:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

AMD and Vulkan Device Selection

For AMD ROCm, use ROCR_VISIBLE_DEVICES to control visible GPUs:

ROCR_VISIBLE_DEVICES=0,1 ollama serve

To force Ollama not to use ROCm GPUs, use an invalid ID:

ROCR_VISIBLE_DEVICES=-1 ollama serve

Ollama’s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use GGML_VK_VISIBLE_DEVICES:

OLLAMA_VULKAN=1 GGML_VK_VISIBLE_DEVICES=0 ollama serve

If Vulkan devices cause problems, disable them:

GGML_VK_VISIBLE_DEVICES=-1 ollama serve

AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as HSA_OVERRIDE_GFX_VERSION. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.
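
Since each backend uses a different variable, a launcher script can keep the mapping in one place. The sketch below only builds the environment dictionary; the variable names and the -1 convention come from the commands above, while `VISIBILITY_VARS` and `serve_env` are hypothetical names.

```python
import os
from typing import Optional

# Backend name -> environment variable that limits visible devices.
VISIBILITY_VARS = {
    "cuda": "CUDA_VISIBLE_DEVICES",
    "rocm": "ROCR_VISIBLE_DEVICES",
    "vulkan": "GGML_VK_VISIBLE_DEVICES",
}


def serve_env(backend: str, device_ids: list[str], base: Optional[dict] = None) -> dict:
    """Return an environment for `ollama serve` restricted to device_ids.

    An empty device list hides every device of that backend by using -1.
    """
    env = dict(os.environ if base is None else base)
    env[VISIBILITY_VARS[backend]] = ",".join(device_ids) if device_ids else "-1"
    # Vulkan support is experimental and has to be switched on explicitly.
    if backend == "vulkan" and device_ids:
        env["OLLAMA_VULKAN"] = "1"
    return env


print(serve_env("rocm", ["0", "1"], base={}))
print(serve_env("vulkan", ["0"], base={}))
print(serve_env("cuda", [], base={}))
```

The result can be passed as the env argument of subprocess.run(["ollama", "serve"], env=...) on a machine where Ollama is installed.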

Exposing Multiple GPUs in Docker

If you run Ollama in Docker, NVIDIA setups usually require nvidia-container-toolkit, then --gpus to expose devices.

Expose all GPUs:

docker run -d \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Expose specific GPUs:

docker run -d \
  --gpus '"device=0,1"' \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

You can also combine this with environment variables:

docker run -d \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

If nvidia-smi cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.
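
The nested quoting in --gpus '"device=0,1"' is easy to get wrong when the command is generated by a script. Here is a minimal sketch that builds the same argument list as the examples above; `docker_run_args` is a hypothetical helper, and the volume, port, and image names are the ones used in this section.

```python
import shlex
from typing import Optional


def docker_run_args(gpu_ids: Optional[list[int]] = None,
                    visible_devices: Optional[str] = None) -> list[str]:
    """Build a `docker run` argument list for the Ollama container."""
    if gpu_ids is None:
        gpus = "all"
    else:
        # The literal double quotes are part of the value Docker expects
        # when several device IDs are separated by commas.
        gpus = '"device=' + ",".join(str(i) for i in gpu_ids) + '"'
    args = ["docker", "run", "-d", "--gpus", gpus]
    if visible_devices is not None:
        args += ["-e", "CUDA_VISIBLE_DEVICES=" + visible_devices]
    args += ["-v", "ollama:/root/.ollama", "-p", "11434:11434",
             "--name", "ollama", "ollama/ollama"]
    return args


# shlex.join shows how the list would be typed in a shell.
print(shlex.join(docker_run_args([0, 1])))
print(shlex.join(docker_run_args(visible_devices="0,1")))
```

Passing the list directly to subprocess.run avoids a second layer of shell quoting.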

What Is `OLLAMA_SCHED_SPREAD`

In some multi-GPU configuration discussions, you may see OLLAMA_SCHED_SPREAD=1 or OLLAMA_SCHED_SPREAD=true. It is related to Ollama’s scheduler and is often used when people want models or requests to be spread more broadly across GPUs.

Example:

OLLAMA_SCHED_SPREAD=1 ollama serve

Or with systemd:

[Service]
Environment="OLLAMA_SCHED_SPREAD=true"

But it is not a magic switch. Enabling it does not imply linear token/s scaling, and it may still run into OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The core FAQ behavior still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, then multi-GPU splitting becomes useful.

Treat OLLAMA_SCHED_SPREAD as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on ollama ps, logs, and nvidia-smi.
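If you do experiment with it under systemd, a drop-in file keeps the change easy to revert. The sketch below assumes the unit is named ollama.service and uses the conventional drop-in directory for it. The helper name write_spread_dropin and the override.conf filename are illustrative choices.

```shell
# Write a systemd drop-in that sets OLLAMA_SCHED_SPREAD for the service.
# Usage: write_spread_dropin [dropin_dir]
# The default directory assumes a unit named "ollama.service".
write_spread_dropin() {
  dir="${1:-/etc/systemd/system/ollama.service.d}"
  mkdir -p "$dir" || return 1
  printf '[Service]\nEnvironment="OLLAMA_SCHED_SPREAD=true"\n' > "$dir/override.conf"
}
```

After writing the file as root, run sudo systemctl daemon-reload and sudo systemctl restart ollama so the service picks up the variable. Deleting the file and restarting again returns to the default scheduler behavior.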

How to Check Whether Multiple GPUs Are Being Used

Useful commands:

ollama ps

watch -n 0.5 nvidia-smi

View the Ollama service logs:

journalctl -u ollama -f

If using Docker:

docker logs -f ollama

Watch for:

Whether Ollama discovers compatible GPUs.
Whether the model shows 100% GPU or a CPU/GPU split.
Whether each GPU has VRAM allocated.
Whether VRAM grows on multiple GPUs during model loading.
Whether generation token/s improves compared with CPU/RAM spillover.
Whether OOM or model unloading happens frequently.

GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.
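To look at VRAM allocation per GPU rather than utilization, the CSV query mode of nvidia-smi is easy to script. This is a minimal sketch for NVIDIA GPUs only. The function name gpus_in_use and the 1024 MiB threshold for counting a GPU as "in use" are arbitrary illustrative choices.

```shell
# Read "index, memory.used, memory.total" CSV lines (MiB, no header, no units)
# on stdin and report which GPUs hold more than a threshold of VRAM.
# Usage: nvidia-smi --query-gpu=index,memory.used,memory.total \
#          --format=csv,noheader,nounits | gpus_in_use [threshold_mib]
gpus_in_use() {
  awk -F', *' -v min="${1:-1024}" '
    { printf "GPU %s: %d/%d MiB\n", $1, $2, $3 }
    $2 + 0 > min + 0 { used++ }
    END { printf "%d of %d GPUs above %d MiB\n", used, NR, min }
  '
}
```

Run it once before loading a model and once after. If only one GPU crosses the threshold, the model most likely fit on a single card and was not split.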

Common Misunderstandings

Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU

Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the “does not fit” problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.

Misunderstanding 2: Different GPU Models Cannot Be Mixed

Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, the smaller VRAM, and the PCIe topology. The most predictable setup is still identical cards: same model, same VRAM size, same generation, and a well-supported driver.

Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU

Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.

Misunderstanding 4: NVLink / SLI Is Required

No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.

Misunderstanding 5: Adding a GPU Does Not Require Restarting Services

Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.

GPU Selection Suggestions

For Ollama local inference, the rough priority is:

Larger single-GPU VRAM is usually easier to manage.
Identical GPUs are easier to troubleshoot than mixed GPUs.
Full-width PCIe links to each card make large-model loading smoother.
Older cards should be checked for CUDA compute capability or ROCm support first.
Multi-GPU power, cooling, and chassis airflow must be planned ahead.
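For the older-card check on NVIDIA, recent drivers can report compute capability through the compute_cap query field of nvidia-smi. The following is a minimal sketch. The 5.0 default minimum is an assumption that should be verified against the current Ollama GPU docs, and the function name check_compute_cap is illustrative.

```shell
# Read "name, compute_cap" CSV lines on stdin and flag cards below a minimum.
# Usage: nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader \
#          | check_compute_cap [min_cap]
# The 5.0 default is an assumed minimum; confirm it in the Ollama GPU docs.
check_compute_cap() {
  awk -F', *' -v min="${1:-5.0}" '
    { status = ($2 + 0 >= min + 0) ? "ok" : "too old"
      printf "%s (%s): %s\n", $1, $2, status }
  '
}
```

This only covers the CUDA side. For AMD cards, check the model against the ROCm compatibility matrix instead.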

For budget second-hand platforms:

Dual RTX 3090 remains a common high-VRAM option.
Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all involve trade-offs.
Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.
Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.

Summary

Ollama multi-GPU support is best understood as “VRAM expansion first, performance acceleration second.” If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.

In practice, use ollama ps to check where the model is loaded, then use nvidia-smi or ROCm tools to observe VRAM allocation. For GPU selection, use CUDA_VISIBLE_DEVICES on NVIDIA, ROCR_VISIBLE_DEVICES on AMD ROCm, and GGML_VK_VISIBLE_DEVICES for Vulkan. If running in Docker, first make sure the container can see the GPUs.
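The three selection variables can be wrapped in one small helper so the same launch command works across backends. This is a sketch only: the function name gpu_select_env and the backend labels nvidia, rocm, and vulkan are made up for illustration, while the variable names are the ones listed above.

```shell
# Print the device-selection variable for a backend, e.g. "CUDA_VISIBLE_DEVICES=0,1".
# Usage: gpu_select_env <nvidia|rocm|vulkan> <comma-separated device ids>
gpu_select_env() {
  case "$1" in
    nvidia) var=CUDA_VISIBLE_DEVICES ;;
    rocm)   var=ROCR_VISIBLE_DEVICES ;;
    vulkan) var=GGML_VK_VISIBLE_DEVICES ;;
    *) echo "unknown backend: $1" >&2; return 1 ;;
  esac
  printf '%s=%s\n' "$var" "$2"
}
```

For example, env "$(gpu_select_env nvidia 0,1)" ollama serve starts the server with only the first two NVIDIA GPUs visible.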

Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.

References

Ollama FAQ, "How does Ollama load models on multiple GPUs?": https://github.com/ollama/ollama/blob/main/docs/faq.mdx
Ollama GPU docs, "Hardware support" and "GPU Selection": https://github.com/ollama/ollama/blob/main/docs/gpu.mdx
Ollama Docker Hub: https://hub.docker.com/r/ollama/ollama
NVIDIA Container Toolkit: https://github.com/NVIDIA/nvidia-container-toolkit
