GPU on KnightLi Blog

How to Pick a GPU in April 2026: Which Models to Avoid and Which Ones Are More Worth Considering

Mon, 27 Apr 2026 08:51:10 +0800

If you are getting ready to build a PC, the GPU is the one part where you really should not look only at whether a card is new. By April 2026, some models are already much harder to justify, while others are not perfect but still feel noticeably more reasonable than the alternatives around the same price.

So this article skips theory and goes straight to specific models.

Models I Would Not Prioritize

1. `RTX 5060 Ti 8GB`

The biggest issue with this card is not that it is unusable. The issue is that 8GB already feels caught in an awkward middle ground at this point.

If you mostly play lighter online games at 1080p medium to high settings, it can still do the job. But once you move into any of these areas, the limitation shows up quickly:

Newer AAA games
Higher texture settings
1440p
Mixed use with AI inference, editing, or productivity work

If you are already looking at the RTX 5060 Ti, the safer move is usually to go straight to the 16GB version instead of saving a bit of budget by taking the 8GB one.

In short:

RTX 5060 Ti 8GB: not recommended
RTX 5060 Ti 16GB: clearly more worth considering

2. Expensive older cards, especially `RTX 3080 10GB` and `RTX 3070 Ti` when they are still priced high

The problem with these cards is not that performance is completely bad. The problem is that, in today’s market, buying them often puts you in an awkward spot:

Power draw is not low
They are no longer new
VRAM is not especially generous
Used-market sources are often messy

RTX 3080 10GB is the clearest example. If it is still priced high, it quickly turns into a card that looks strong on paper but feels less balanced in real use.

RTX 3070 Ti follows the same logic. It is not absolutely unbuyable, but if the price gap is not meaningful, you are usually better off looking at something newer, something with more comfortable VRAM, or something more balanced in power and thermals.

3. Older flagships with unclear history, such as `RTX 3090` and `RTX 3080 Ti`

These two cards are easy to want for obvious reasons:

The names still sound strong
Paper performance is not weak
They are very visible in the used market

What you really need to watch out for is where they came from.

If you are buying:

A pulled card
A repaired card
A used card with unclear history

then the risk is usually much higher than with a normal retail card. A card like the RTX 3090 looks attractive because of the 24GB VRAM, but heat, power delivery, silicon condition, and past usage history all become bigger worries than they would be on a straightforward new card.

If you do not already know exactly what you are buying, and you are not planning to spend time checking the card carefully, these older flagships are generally not something I would touch casually.

4. `RTX 5070` when the price is not right

RTX 5070 is not a card that is automatically bad. The catch is that the price has to make sense.

Its awkwardness shows up when the gap between it and the RTX 5070 Ti is not large enough. In that case, a lot of buyers end up feeling oddly unsatisfied.

The pattern usually looks like this:

Buy the 5070: you keep thinking a little more would have gotten you the 5070 Ti
Do not stretch the budget: you still know you bought the “almost” card

So RTX 5070 is not something to ignore entirely, but it is worth considering only when the price is clearly right. If the pricing sits in an uncomfortable middle zone, it quickly becomes a card that makes theoretical sense but does not feel great in practice.

Models That Make More Sense

1. `RTX 5060 Ti 16GB`

If you are already shopping in the midrange, this card is usually the safer choice compared with the 8GB version.

The reasons are simple:

More headroom within the same product family
Less likely to be boxed in by VRAM over the next few years
Easier to live with if you mix gaming and productivity

It may not be the most exciting card at its price, but it is at least the kind of card you are less likely to regret immediately.

2. `RTX 5070 Ti`

If your budget can stretch, this is usually a more complete answer than the RTX 5070.

Its value is not that it dominates every single scenario. Its value is that it feels more like a card that can balance gaming, resolution, and longer-term use all at once.

It makes sense for people who:

Want 1440p high settings
Want the system to last for years
Do not want to start thinking about upgrades too soon

If you are already stuck between the 5070 and 5070 Ti, and the gap is not absurdly large, going straight to the 5070 Ti is often the less annoying decision.

3. Properly priced new cards are usually a better first stop than older high-end cards

If you are not a veteran used-GPU hunter, a simple and effective rule is this:

Prioritize normal retail new cards
Be cautious with older high-end cards that have messy origins

At this point, the more practical approach is often:

Midrange budget: start with RTX 5060 Ti 16GB
A tier higher: focus on RTX 5070 Ti
Consider RTX 5070 only when pricing is clearly favorable

That is usually a better path than gambling on older cards that sound stronger but come with more baggage.

If You Just Want the Short Version

You can remember it like this:

Not really recommended: RTX 5060 Ti 8GB
Not recommended unless priced well: RTX 5070
Be cautious with: RTX 3080 10GB, RTX 3070 Ti, and unclear-source RTX 3090 / RTX 3080 Ti
More worth considering: RTX 5060 Ti 16GB
Easier long-term pick if budget allows: RTX 5070 Ti

Final Line

At this point in the market, the real mistake is usually not spending a bit more. It is buying a card that looks acceptable on paper but always feels just a little compromised in real use.

If you want to minimize regret, RTX 5060 Ti 16GB and RTX 5070 Ti are generally safer than many cards that seem “good enough,” while RTX 5060 Ti 8GB, badly priced RTX 5070, and older high-end cards with unclear history are usually the first ones to cross off.

Ubuntu 26.04 LTS GPU and Hardware Updates: CUDA, ROCm, DPC++, and More Platform Changes

Sun, 26 Apr 2026 19:35:57 +0800

If the previous article worked as a desktop-focused overview of Ubuntu 26.04 LTS, this one is better read as its hardware and compute-side follow-up. In this 26.04 cycle, Ubuntu pushed a number of AI, GPU computing, and platform compatibility changes into the main archive or formal support scope.

The short version is this: the most important part of this round is not just desktop and kernel upgrades, but that Ubuntu is bringing Intel, NVIDIA, and AMD GPU computing stacks into the distribution in a more systematic way.

1. Intel oneAPI DPC++ is now in the Ubuntu Archive

Starting with 26.04, Intel’s open-source oneAPI DPC++ compiler is available directly from Ubuntu Archive for building SYCL code. Its runtime also includes adapters for Intel GPUs.

Two related components are also now available from Ubuntu repositories:

oneDPL, the DPC++ library, which provides higher-productivity developer APIs
oneDNN, built with dpclang-6, which can run on Intel GPUs

That means if you are already working with SYCL, heterogeneous computing, or AI workloads on Intel GPUs, Ubuntu now offers a more direct path instead of forcing you to maintain a separate external stack for everything.

Ubuntu also calls out one practical requirement: users need to be in the render group to actually use these Intel GPU-related capabilities.
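
A quick way to check that requirement from a shell is sketched below. The helper name `has_group` is made up for this example; `id` and `usermod` are standard tools, and a group change only takes effect after you log in again.

```shell
# has_group GROUP "LIST": succeed if GROUP appears in the space-separated LIST.
# (has_group is an illustrative helper, not part of any Ubuntu package.)
has_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# id -nG prints the current user's group names on one line.
if has_group render "$(id -nG)"; then
    echo "render group: ok"
else
    # Print the fix instead of running it; log out and back in afterwards.
    echo "not in render group; try: sudo usermod -aG render $(id -un)"
fi
```

The script only reports the situation and never changes group membership on its own.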

2. The NVIDIA CUDA toolkit can now be installed directly with `apt`

For many developers and operators, this may be one of the most immediately useful changes in the notes.

Starting with 26.04, the NVIDIA CUDA toolkit can now be installed directly from Ubuntu Archive:

`sudo apt install cuda-toolkit`

The value here is bigger than just saving a few setup steps.

For developers shipping software on Ubuntu, this new model means they can simply declare a dependency on the CUDA runtime, while Ubuntu manages installation and compatibility at the distribution level. That makes CUDA feel more like a native system capability on Ubuntu, rather than an extra software layer that always has to be maintained separately.

3. AMD ROCm 7.1.0 is now in Universe

On the AMD side, Ubuntu Universe now includes ROCm 7.1.0.

These libraries mainly provide:

backend infrastructure for AI training and inference on AMD GPUs
software foundations for machine learning and high performance computing

Canonical also notes that ROCm-related components are continuously tested in its CI/CD pipeline. Beyond autopkgtests, that includes several user-space applications such as:

llama.cpp
pytorch
Blender
Lemonade Server

That detail matters, because it shows Ubuntu is not just dropping packages into the archive. It is validating ROCm as a maintainable software stack.

4. The bigger story is that all three GPU ecosystems are landing

It becomes easier to see the direction of 26.04 when DPC++, CUDA, and ROCm are viewed together:

Intel: bringing SYCL / oneAPI components into official repositories
NVIDIA: giving the CUDA toolkit a distribution-managed installation path
AMD: shipping ROCm 7.1.0 in Universe with ongoing testing

If you work with these kinds of workloads on Ubuntu, this release will probably feel more relevant:

local LLM inference
GPU-accelerated training or fine-tuning
Blender, scientific computing, and HPC
development environments that need to move across different GPU platforms

In other words, Ubuntu is no longer just “a system where you can install a GPU driver.” It is starting to carry a fuller user-space software stack for AI and GPU computing.

5. NVIDIA Dynamic Boost is enabled by default

Since 25.04, Dynamic Boost has been enabled by default on supported NVIDIA laptops.

The idea is straightforward: depending on system load, power can be shifted dynamically between the CPU and GPU. In gaming scenarios, that usually means giving more power to the GPU when needed to extract more performance.

It only applies under two conditions:

the laptop is connected to AC power
the GPU load is high enough

It does not engage while the system is running on battery.

6. Support for new Intel integrated and discrete GPUs keeps moving forward

Ubuntu also continues expanding support for new Intel GPUs, including:

Integrated:

Intel Core Ultra Xe2
Intel Core Ultra Xe3

Discrete:

Intel Arc 5 B570
Intel Arc 5 B580
Intel Arc Pro B50
Intel Arc Pro B60
Intel Arc Pro B65
Intel Arc Pro B70

Ubuntu also highlights several features already available around these devices:

improved GPU and CPU ray tracing performance through Intel Embree, benefiting applications such as Blender 4.2+
hardware video encoding for AVC, JPEG, HEVC, and AV1 on “Battlemage” devices
a new CCS optimization in Intel Compute Runtime
enabled debugging support for Intel Xe GPUs

Several changes that first landed in 25.10 also carry over into 26.04, including:

initial support for Intel’s next-generation client platform codenamed Panther Lake through Linux kernel 6.17
improved IOMMU, PCIe subsystem, and multi-GPU support
Mesa 25.2.3 enabling VK_KHR_shader_bfloat16 for Battlemage and Panther Lake
intel-media-driver 25.3.0 adding Panther Lake decode support and VP9 encoding
intel-compute-runtime 25.31 adjusting the Level Zero USM pool and local device memory event allocation behavior
level-zero 1.24 and level-zero-raytracing 1.1.0 bringing broader spec and RTAS extension support

7. Suspend and resume is more stable on NVIDIA desktops too

Starting with 25.10, Ubuntu enables suspend-resume support in the proprietary NVIDIA driver to reduce corruption and freezing when waking a desktop system.

This is not the most visible kind of change, but it matters a lot in everyday use, especially on desktops that stay on for long periods and frequently suspend and resume.

8. ARM, Raspberry Pi, RISC-V, and IBM Z also get significant platform-level changes

Beyond the GPU software stack, the release notes also include several platform-level changes worth calling out separately.

ARM64 desktop platforms

Starting with 25.10, the ARM64 linux-generic kernel provides broader desktop compatibility for ARM64 desktop platforms that boot through UEFI.

A new Raspberry Pi boot layout

One change introduced in 25.10 and refined in 26.04 is a new boot partition layout for Raspberry Pi systems.

Its goal is to improve boot reliability: newly written boot assets are first “tested” before they are committed as the new “known good” set.

The firmware date requirements are the part most users will want to remember:

Pi 3 / 3+ / CM3+ / Zero 2W: no additional action required, the boot firmware is in the image itself
Pi 4 / 400 / CM4: boot firmware must be dated no earlier than 2022-11-25
Pi 5 / 500 / CM5: boot firmware must be dated no earlier than 2025-02-11

You can check it with:

`sudo rpi-eeprom-update`
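
If you want to script the comparison, here is a rough sketch. It assumes the tool prints a `CURRENT:` line that ends with a Unix timestamp in parentheses, which is how recent versions appear to format it; check your own output first. The helper names and the two constants are made up for this example, with the constants derived from the dates listed above.

```shell
# Minimum bootloader dates from the list above, as Unix timestamps (UTC midnight).
MIN_PI4=1669334400   # 2022-11-25
MIN_PI5=1739232000   # 2025-02-11

# current_epoch: read rpi-eeprom-update output on stdin and print the
# timestamp found in parentheses on the CURRENT line (assumed format).
current_epoch() {
    sed -n 's/^ *CURRENT:.*(\([0-9][0-9]*\)).*$/\1/p' | head -n 1
}

# firmware_ok EPOCH MIN: succeed when EPOCH is non-empty and not older than MIN.
firmware_ok() {
    [ -n "$1" ] && [ "$1" -ge "$2" ]
}

# Example on a Pi 5:
#   epoch=$(sudo rpi-eeprom-update | current_epoch)
#   firmware_ok "$epoch" "$MIN_PI5" && echo "firmware new enough" || echo "update needed"
```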

If the firmware is too old and you are using Ubuntu 24.04 LTS or newer, you can update it like this:

`sudo rpi-eeprom-update -a`
`sudo reboot`

Raspberry Pi desktop images now use desktop-minimal

Since 25.10, Ubuntu Desktop images for Raspberry Pi are based on desktop-minimal rather than the full desktop seed.

Ubuntu gives a very concrete benefit here: the default app set is smaller, saving about 777MB on the uncompressed image and on installed systems.

If you want to remove that default app set in bulk after upgrading, you can use:

`sudo apt purge ubuntu-desktop --autoremove`

If you want to keep some of those applications, just mark them as manually installed with apt first.
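
One way to do that marking is `apt-mark manual`. The sketch below only prints the commands so you can review them before running anything. The helper name `plan_purge` and the package names in the usage line are placeholders, not a list taken from Ubuntu's notes.

```shell
# plan_purge [PKG...]: print the commands to keep PKG... and then purge the
# default desktop app set. Nothing is executed; this is a dry run.
plan_purge() {
    if [ "$#" -gt 0 ]; then
        # apt-mark manual stops autoremove from treating these as leftovers.
        echo "sudo apt-mark manual $*"
    fi
    echo "sudo apt purge ubuntu-desktop --autoremove"
}

# Example with placeholder package names:
#   plan_purge some-app another-app
```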

Swap on Raspberry Pi is now handled by cloud-init

Since 25.10, swap file creation on Raspberry Pi desktop images is handled by cloud-init.
If you want to customize swap size before first boot, you can edit user-data on the boot partition directly.

RISC-V requirements have moved up

Starting with 25.10, the RISC-V build of Ubuntu requires hardware that implements the RVA23S64 ISA profile, and Ubuntu 26.04 LTS keeps that requirement.

Systems that do not meet that requirement can no longer run Ubuntu 26.04 LTS. If you still have boards based on earlier RVA20 processor cores, you need to stay on the support line provided by Ubuntu 24.04 LTS.

According to Ubuntu, as of April 2026, there is still no real RVA23S64 hardware available. So the only currently supported platform is effectively a QEMU virtualized environment configured with -cpu rva23s64.

IBM Z now requires z15 at minimum

Starting with 26.04, the minimum requirement for the s390x architecture has moved up to z15.

That means:

z14 / LinuxONE II and older systems can no longer install Ubuntu 26.04 LTS
z15 / LinuxONE III and newer systems should see better performance

9. Who should read this first

This article is more useful than the desktop overview if you fall into any of these cases:

you use Ubuntu for CUDA, ROCm, SYCL, or local AI inference
you do development or compute work on Intel, NVIDIA, or AMD GPUs
you maintain Raspberry Pi, ARM64, RISC-V, IBM Z, or other non-x86 platforms
you are especially sensitive to repository availability, driver behavior, runtimes, and platform requirements after an upgrade

10. One-line takeaway

The key point of Ubuntu 26.04 LTS on the hardware and AI stack side is not that one GPU vendor got a standout upgrade. It is that Intel’s DPC++, NVIDIA’s CUDA, and AMD’s ROCm are all entering the Ubuntu ecosystem in a more official, in-repository, and maintainable way.

If you used to think of Ubuntu as “the system first, then I assemble the GPU environment myself,” 26.04 starts to look more like a distribution that is willing to actively carry AI and heterogeneous computing workloads.

How to Fix Ollama Using CPU Instead of GPU

Fri, 24 Apr 2026 18:30:00 +0800

When running local LLMs, one of the most frustrating problems is this: your machine clearly has a GPU, yet Ollama still leans heavily on the CPU, and performance is painfully slow.

The short version is that this is usually not caused by one single issue. The most common causes are:

Ollama is not detecting any usable GPU
The driver, ROCm, or CUDA environment is not set up correctly
The Ollama service was started without the right environment variables
The model is too large and has fallen back to CPU or mixed CPU/GPU loading
On AMD platforms, there may be extra compatibility issues such as ROCm version mismatch, gfx settings, or device visibility problems

The fastest way to troubleshoot it is to go through the checks below in order.

1. First, confirm whether Ollama is really not using the GPU

The most direct check is:

`ollama ps`

Focus on the PROCESSOR column.

100% GPU: the model is fully running on the GPU
100% CPU: the GPU is not being used at all
Results like 48%/52% CPU/GPU: part of the model is in VRAM, and part has spilled into system memory

If you see 100% CPU, the next step is to focus on environment and service configuration.
If you see mixed loading, that does not necessarily mean the GPU is broken. In many cases, it simply means VRAM is not enough.
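
The three cases can be sketched as a tiny classifier. It assumes the PROCESSOR value looks exactly like the examples above; `classify_processor` is a made-up helper name, and other Ollama versions may word the column differently.

```shell
# classify_processor "VALUE": map a PROCESSOR column value to gpu / cpu / mixed.
classify_processor() {
    case "$1" in
        "100% GPU")       echo "gpu" ;;      # fully in VRAM
        "100% CPU")       echo "cpu" ;;      # GPU not used at all
        *%/*%" CPU/GPU")  echo "mixed" ;;    # split between system memory and VRAM
        *)                echo "unknown" ;;
    esac
}

# Example: classify_processor "48%/52% CPU/GPU"
```

Anything that comes back as `unknown` is worth reading by eye instead of guessing.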

2. Rule out the most common misunderstanding first: the model does not fit into VRAM

Many people assume that once a GPU is installed, Ollama will always run fully on it. That is not how it works.

If the model is too large, the context is too long, or some other loaded model is already occupying VRAM, Ollama may fall back to:

Partial GPU + partial CPU
Full 100% CPU

At this point, the two simplest tests are:

Try a smaller model first
For example, test with a 4B or 7B model before jumping straight to much larger ones.
Unload other active models and test again
Run ollama ps first and make sure nothing else is occupying VRAM.

If smaller models use the GPU but larger ones do not, the real problem is usually VRAM capacity rather than the driver.
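
On NVIDIA cards, a rough way to see how much VRAM is actually free before loading a model is sketched below. The `nvidia-smi` query fields are standard; `free_mib` is a helper name made up here, and it does no more than subtract used from total.

```shell
# free_mib: read "total, used" pairs in MiB (one GPU per line) on stdin
# and print the free MiB for each GPU.
free_mib() {
    awk -F', *' 'NF == 2 { print $1 - $2 }'
}

# Example on a machine with an NVIDIA driver installed:
#   nvidia-smi --query-gpu=memory.total,memory.used --format=csv,noheader,nounits | free_mib
# To unload a model that is holding VRAM (MODEL is a placeholder name):
#   ollama stop MODEL
```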

3. Check whether the GPU driver and the lower-level runtime are actually working

If even small models run only on CPU, the next step is to check the underlying environment.

NVIDIA

First confirm that the driver is working and the system can see the GPU. A common check is:

`nvidia-smi`

If this already fails, Ollama is very unlikely to use the GPU correctly.

AMD / ROCm

If you are using an AMD GPU, especially with ROCm, start with:

`rocminfo`
`rocm-smi`

If these tools cannot list the device properly, the problem is still below Ollama, so there is no point debugging the application layer yet.

On AMD, the most common issue is not simply “is the driver installed,” but rather:

The ROCm version does not match the OS version
The current GPU architecture has incomplete support
The device exists, but the runtime is not being exposed correctly to Ollama
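
Before running any of these tools, it helps to confirm they are installed at all. A minimal sketch, with `have_tool` as a made-up helper name:

```shell
# have_tool NAME: succeed if NAME is an executable command on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report which of the vendor tools mentioned above are present.
for tool in nvidia-smi rocminfo rocm-smi; do
    if have_tool "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: not on PATH"
    fi
done
```

A missing tool tells you the driver or ROCm userspace is not installed. A tool that is present still has to list your GPU before Ollama has any chance of using it.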

4. Restart the Ollama service, not just your terminal

This is a very common trap.

Many people install drivers, change environment variables, fix ROCm, then just open a new terminal and continue with ollama run. But if Ollama is running as a background service, it may still be using the old environment.

So the safer approach is:

Fully restart the Ollama service
Reboot the machine if necessary

If you are running it as a service on Linux, make sure the service process was actually restarted instead of reusing the old one.

5. Check whether the environment variables are really reaching the service

This matters especially on AMD ROCm systems.

Some machines work fine when commands are run manually in a shell, but the Ollama service still uses only CPU. In that case, the usual reason is that the service process never received the variables you set in your shell.

Common variables to look at include:

`ROCR_VISIBLE_DEVICES`
`HSA_OVERRIDE_GFX_VERSION`

Specifically:

ROCR_VISIBLE_DEVICES limits or selects which GPUs ROCm can see
HSA_OVERRIDE_GFX_VERSION is often used as a compatibility workaround on some AMD platforms

If you only export these variables in the current terminal, but Ollama is started by systemd, a desktop background service, or another daemon, they may not take effect.

In other words, “it looks set in my terminal” does not mean Ollama is actually using it.
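
To see what the service process really received, you can inspect its environment directly. The sketch below assumes a Linux system where Ollama runs as a systemd unit named `ollama`; `env_has_var` is a made-up helper name and the version value in the comments is only a placeholder.

```shell
# env_has_var NAME: read a NUL-separated environment block on stdin
# (the format of /proc/<pid>/environ on Linux) and succeed if NAME is set.
env_has_var() {
    tr '\0' '\n' | grep -q "^$1="
}

# Check the running service (assumes the unit is called "ollama"):
#   pid=$(systemctl show -p MainPID --value ollama)
#   sudo cat "/proc/$pid/environ" | env_has_var HSA_OVERRIDE_GFX_VERSION && echo "set"
#
# To set a variable for the service itself, add it to a unit override:
#   sudo systemctl edit ollama.service
#       [Service]
#       Environment="HSA_OVERRIDE_GFX_VERSION=<value for your GPU>"
#   sudo systemctl daemon-reload
#   sudo systemctl restart ollama
```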

6. On AMD platforms, focus on ROCm compatibility

Based on the public page metadata, the original video for this topic covers the AMD Ryzen AI Max+ 395 (Strix Halo) and AMD ROCm.
On setups like these, whether Ollama can use the GPU depends more on version matching than it does on NVIDIA systems.

Start by checking these:

Whether the installed ROCm version fits the current OS and GPU
Whether the GPU belongs to an architecture with solid ROCm support
Whether you need to set HSA_OVERRIDE_GFX_VERSION
Whether an older Ollama build or older inference runtime is causing compatibility issues

If rocminfo works and the GPU is visible to the system, but Ollama still runs only on CPU, the issue is often in the version combination rather than in model parameters.

7. In Docker, WSL, or remote environments, also check device mapping

If you are not running on bare metal but inside:

Docker
WSL
Remote containers
Virtualized environments

then you need to check one more layer: whether the GPU device is actually being exposed inside that environment.

A typical symptom looks like this:

The host machine can see the GPU
Ollama inside the container or subsystem still uses only CPU

In that case, the issue may not be Ollama itself. The container or subsystem may simply not have GPU access.
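
A simple first check is whether the GPU device nodes exist inside the container or subsystem at all. The sketch below uses a made-up helper, `missing_nodes`. The device paths and `docker run` lines in the comments are what I believe the usual ROCm and NVIDIA setups use, so verify them against the current Ollama Docker documentation.

```shell
# missing_nodes PATH...: print each given path that does not exist here.
missing_nodes() {
    for node in "$@"; do
        [ -e "$node" ] || echo "$node"
    done
}

# Run these inside the container or subsystem, not on the host:
#   missing_nodes /dev/kfd /dev/dri            # AMD ROCm
#   missing_nodes /dev/nvidiactl /dev/nvidia0  # NVIDIA
#
# Typical ways to pass the GPU through when starting the container:
#   docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
#   docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
```

No output means every listed node is visible. Any path that is printed points at the mapping layer, not at Ollama.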

8. Check logs last, but check them for the right reason

If you have already gone through the earlier steps, the most effective next move is not endless reinstalling, but looking directly at the Ollama startup and runtime logs.

Focus on two kinds of messages:

Whether a GPU was detected at all
Whether there are driver, library loading, or device initialization errors

If the logs clearly say something like “no compatible GPU found” or “failed to initialize ROCm/CUDA,” the troubleshooting direction becomes much clearer immediately.
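
To cut the log down to the relevant lines, a keyword filter is usually enough. This is a sketch that assumes a systemd-managed install where `journalctl -u ollama` shows the service log; `gpu_log_lines` is a made-up helper, and the keyword list is a starting point, not an official set of messages.

```shell
# gpu_log_lines: keep only log lines on stdin that mention GPU-related keywords.
gpu_log_lines() {
    grep -iE 'gpu|cuda|rocm|nvidia'
}

# Example on a systemd-based Linux install:
#   journalctl -u ollama --no-pager | gpu_log_lines | tail -n 40
```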

Troubleshooting Order

If you only want the shortest path, use this order:

Run ollama ps and confirm whether it is GPU, CPU, or mixed loading
Try a smaller model to rule out VRAM limits
Use nvidia-smi, rocminfo, and rocm-smi to verify the lower-level environment first
Fully restart the Ollama service
Check service environment variables, especially ROCR_VISIBLE_DEVICES and HSA_OVERRIDE_GFX_VERSION on AMD
If you are in Docker or WSL, verify device mapping
Finally, inspect logs for the exact error

Conclusion

When Ollama uses CPU instead of GPU, the root cause usually falls into one of three groups:

The GPU is not being detected at all
The GPU is detectable, but the runtime environment is not reaching Ollama
The GPU is working, but the model is too large and falls back to CPU or mixed memory

Once you separate those three cases, troubleshooting becomes much faster.
If you are on an AMD platform, pay special attention to ROCm version matching, device visibility, and compatibility variables instead of focusing only on the Ollama command itself.

Original video: https://www.bilibili.com/video/BV1cHoYBqE8k/

What Is NVIDIA nvbandwidth: How to Use This GPU Bandwidth Testing Tool

Fri, 24 Apr 2026 14:41:35 +0800

If you have recently been troubleshooting interconnect performance between multiple NVIDIA GPUs, or you want to verify the real bandwidth between PCIe, NVLink, host memory, and VRAM, NVIDIA/nvbandwidth is a small tool worth knowing about.

It is not a general benchmark utility, and it is not a hidden command inside a large model framework. It is an open-source tool from NVIDIA specifically designed to measure bandwidth and latency for GPU-related memory copies. Instead of only looking at theoretical bandwidth, nvbandwidth is better at answering a practical question: how much bandwidth can this machine and its current GPU interconnects actually deliver right now?

1. What does `nvbandwidth` do

According to the official README, nvbandwidth is a command-line tool for measuring bandwidth on NVIDIA GPUs.

It mainly focuses on transfer performance across different memcpy patterns, such as:

GPU -> GPU
CPU -> GPU
GPU -> CPU
Transfers between GPUs across multiple nodes

These tests are especially useful in scenarios like:

Troubleshooting interconnect bottlenecks in multi-GPU training or inference
Verifying the actual behavior of links such as NVLink, PCIe, and C2C
Comparing transfer differences across servers, topologies, drivers, or CUDA versions
Performing baseline hardware validation before cluster deployment

In short, nvbandwidth is not about model throughput. It is about the lower-level ability to move data.

2. It does not produce just one simple score

Many people think of a bandwidth test as something that ends with a single number, but nvbandwidth provides more detailed output than that.

It reports results as matrices for each test type. For example, in a test like device_to_device_memcpy_write_ce, it shows the bandwidth between each pair of GPUs by row and column. That means you can see more than just a rough system-wide speed estimate. You can also spot:

Which GPU pairs are especially fast
Which paths are clearly limited by PCIe
Whether certain GPU pairs show abnormally low bandwidth
Whether the multi-GPU topology matches your expectations

If you are working with an 8-GPU server, a dual-socket platform, or a multinode system, this matrix-style output is often more useful than a single average number.
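
Reading a large matrix by eye gets tedious, so here is a rough filter for slow pairs. It assumes you have copied one matrix into a file: a header row of column GPU indices, then one row per GPU that starts with the row index. The helper name `slow_pairs`, the threshold, and the numbers in the usage example are made up for illustration.

```shell
# slow_pairs MIN: read one bandwidth matrix on stdin and print "row col value"
# for every numeric cell below MIN (GB/s). N/A cells are skipped.
slow_pairs() {
    awk -v min="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) col[i] = $i; next }  # header: column GPU ids
        {
            # $1 is the row GPU id; cell $i belongs to column col[i - 1].
            for (i = 2; i <= NF; i++)
                if ($i != "N/A" && $i + 0 < min + 0)
                    print $1, col[i - 1], $i
        }'
}

# Example with made-up numbers saved in matrix.txt:
#        0      1      2
#  0   N/A 276.07  24.50
#  1 276.36   N/A 275.90
#  2  24.61 275.80   N/A
#   slow_pairs 100 < matrix.txt
```

With the made-up matrix above, only the 0/2 pair is reported in both directions, which is the kind of pattern that points at a PCIe-only path.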

3. How to understand `CE` and `SM` copies

The official documentation splits tests into two categories:

CE: copy engine transfers based on memcpy APIs
SM: kernel-based transfers

These two result types are not guaranteed to match exactly, because they represent different copy paths.
If you mainly want to understand regular device-to-device transfer behavior, you will usually look at CE first. If you want to study execution details more closely, then SM is worth checking too.

The README also explains that bandwidth results use the median across multiple test runs by default. Newer versions additionally include variability statistics, which makes it easier to judge how stable the numbers are.

4. What environment does it require

nvbandwidth is not a standalone binary that you simply download and run. You build it from source, and it expects a standard CUDA development environment.

The current README lists these basic requirements:

CUDA Toolkit 11.x or newer
A compiler with C++17 support
CMake 3.20+, with 3.24+ recommended
Boost program_options
A usable CUDA device and a compatible driver

The requirements are higher if you want the multinode version. The current README explicitly states:

Multinode builds require CUDA Toolkit 12.3
The driver must be 550 or newer
MPI is required
The nvidia-imex service must be configured

So this is much more of an engineering tool for Linux GPU servers and clusters than something aimed at casual desktop use.

5. How to build and run the single-node version

The single-node build process is straightforward:

`cmake .`
`make`

On Ubuntu / Debian, the project also provides a debian_install.sh script that installs common dependencies and builds the project.

After building, you can check the help output first:

`./nvbandwidth -h`

Some commonly used options include:

-l: list available tests
-t: run a specific test by name or index
-p: run tests by prefix
-b: set the memcpy buffer size, default 512 MiB
-i: set the number of benchmark iterations
-j: output JSON
-H: enable huge pages for host memory allocation

If you just want to run the default test suite once, use:

`./nvbandwidth`

If you only want to test one specific item, such as a device-to-device copy:

`./nvbandwidth -t device_to_device_memcpy_read_ce`

6. Multinode support is one of its standout features

nvbandwidth is not only for single-node multi-GPU testing. It also supports multinode scenarios.

According to the README, the multinode build is done like this:

`cmake -DMULTINODE=1 .`
`make`

At runtime, it is typically used together with mpirun, with one process launched per GPU.
The documentation also requires all participating ranks to belong to the same multinode clique, and it recommends mainly running tests with the multinode prefix under MPI.
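
Put together, a launch line might look like the sketch below. It only prints the command, combining `mpirun -n` with the `-p` prefix option described earlier. The helper name `multinode_cmd` is made up, and a real cluster will usually need extra `mpirun` options such as a host file, which are left out here.

```shell
# multinode_cmd NGPUS: print an mpirun command that starts one nvbandwidth
# process per GPU and runs only the tests with the "multinode" prefix.
multinode_cmd() {
    echo "mpirun -n $1 ./nvbandwidth -p multinode"
}

# Example for 4 GPUs:
#   multinode_cmd 4
```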

That makes its positioning much closer to high-performance computing and large GPU systems than to simple workstation self-checks.

If you are working with NVLink multinode deployments or more complex platforms such as GB200 / Grace Hopper, the value of nvbandwidth is much higher than it would be on a typical consumer GPU setup.

7. What changed in `v0.9`

As of April 24, 2026, the GitHub Releases page shows that the latest version of nvbandwidth is v0.9, released on April 8, 2026.

The most notable updates in this release include:

Added variability statistics to bandwidth output
Added huge page support for host memory (Windows excluded)
Added pair sampling for device-to-device tests
Added a troubleshooting guide
Unified single-node and multinode execution paths

Two engineering-oriented changes are also worth noting:

Improved CUDA architecture detection without relying as much on direct GPU access
Deprecated Volta (sm_70 / sm_72) support in CUDA Toolkit 13.0+ environments

So if you only looked at early versions before, v0.9 is no longer just a basic bandwidth tester. It is clearly moving toward better automation, troubleshooting, and large-scale system validation.

8. When is it a good fit

nvbandwidth is especially suitable when:

You want to verify real interconnect bandwidth between multiple NVIDIA GPUs
You suspect one GPU is installed in a bandwidth-limited PCIe slot
You want to compare NVLink paths against non-NVLink paths
You are deploying a multinode GPU cluster and need to validate the links
You want test results in JSON for automation pipelines

But if your goal is only to answer questions like “how fast is training” or “how many tokens per second can inference reach,” this tool is not the whole answer.
In that case, you still need workload-level testing with your training framework, inference engine, or real application.

9. How to think about its value

Many GPU performance problems are not really caused by insufficient compute. They happen because the data path is not working as expected.

For example:

GPUs are not using the intended interconnect path
Cross-NUMA access is reducing speed
Certain GPU pairs have abnormal bandwidth
Multinode communication is only partially configured

These issues are often hard to diagnose if you only look at nvidia-smi or model throughput.
A lower-level, matrix-oriented tool like nvbandwidth is useful precisely because it exposes what is happening at the interconnect layer.

So a simple way to think about it is: nvbandwidth is a command-line health check tool for bandwidth on NVIDIA GPU systems.

GitHub project: https://github.com/NVIDIA/nvbandwidth
Releases: https://github.com/NVIDIA/nvbandwidth/releases

How to Check Whether a Tesla V100 Has ECC Errors

Thu, 23 Apr 2026 11:50:21 +0800

If you have a Tesla V100 on hand and want to do a basic health check first, ECC status is one of the most useful things to look at.

The most direct method is to inspect the card’s detailed information with nvidia-smi.

`nvidia-smi -q`
`# Query GPU 0`
`nvidia-smi -q -i 0`

Focus on the ECC Errors section.

On a card in normal condition, the four common groups of counters under ECC Errors should all be 0 or N/A. If any of them already show a non-zero value, it means the card has seen that type of ECC anomaly before, and you should further evaluate whether it is still suitable for continued use.

Reference output:

nvidia-smi -q
    ECC Mode
        Current                          : Enabled
        Pending                          : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory            : 0
                Register File            : 0
                L1 Cache                 : 0
                L2 Cache                 : 0
                Texture Memory           : N/A
                Texture Shared           : N/A
                CBU                      : N/A
                Total                    : 0
            Double Bit
                Device Memory            : 0
                Register File            : 0
                L1 Cache                 : 0
                L2 Cache                 : 0
                Texture Memory           : N/A
                Texture Shared           : N/A
                CBU                      : 0
                Total                    : 0
        Aggregate
            Single Bit
                Device Memory            : 0
                Register File            : 0
                L1 Cache                 : 0
                L2 Cache                 : 0
                Texture Memory           : N/A
                Texture Shared           : N/A
                CBU                      : N/A
                Total                    : 0
            Double Bit
                Device Memory            : 0
                Register File            : 0
                L1 Cache                 : 0
                L2 Cache                 : 0
                Texture Memory           : N/A
                Texture Shared           : N/A
                CBU                      : 0
                Total                    : 0
    Retired Pages

You can think of it like this:

Volatile is the error count for the current power cycle
Aggregate is the lifetime accumulated error count
Single Bit means correctable errors
Double Bit means uncorrectable errors, which are more serious

If you only want a quick screening rule, remember this:

Most items should be 0
N/A is normal for some not-applicable entries
If Double Bit or the total count is not 0, do not rely only on a seller’s verbal description; it is better to continue with fuller stress testing and stability checks
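
If you would rather not read the counters by eye, the screening rule can be sketched in a few lines of Python. This assumes the text layout shown in the reference output above (Volatile / Aggregate, then Single Bit / Double Bit, each with a Total line); other driver versions may label the counters differently. The function names `ecc_totals` and `needs_deeper_testing` are made up for this example.

```python
import re


def ecc_totals(report):
    """Extract the 'Total' counters from the ECC Errors section of `nvidia-smi -q` text.

    Returns a dict keyed by (scope, kind), e.g. ("Volatile", "Double Bit").
    Totals reported as 'N/A' are returned as None.
    """
    totals = {}
    scope = kind = None
    in_ecc = False
    for raw in report.splitlines():
        line = raw.strip()
        if line == "ECC Errors":
            in_ecc = True
            continue
        if not in_ecc:
            continue
        if line == "Retired Pages":
            break  # end of the ECC Errors section in this layout
        if line in ("Volatile", "Aggregate"):
            scope = line
        elif line in ("Single Bit", "Double Bit"):
            kind = line
        else:
            m = re.match(r"Total\s*:\s*(\S+)", line)
            if m and scope and kind:
                value = m.group(1)
                totals[(scope, kind)] = int(value) if value.isdigit() else None
    return totals


def needs_deeper_testing(totals):
    """Screening rule from the text: any non-zero total means keep testing."""
    return any(v for v in totals.values())


# Shortened, made-up sample in the same layout as the reference output.
sample = """
    ECC Errors
        Volatile
            Single Bit
                Total                    : 0
            Double Bit
                Total                    : 0
        Aggregate
            Single Bit
                Total                    : 3
            Double Bit
                Total                    : 0
    Retired Pages
"""
totals = ecc_totals(sample)
print(totals, needs_deeper_testing(totals))
```

In practice you would feed it the captured output of nvidia-smi -q -i 0 instead of the inline sample.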

This does not replace a complete inspection, but it is enough for a first round of checks after a V100 arrives.

Is Tesla V100 Still Worth Buying: ECC Checks, Cooling Mods, and DIY Pitfalls

Thu, 23 Apr 2026 11:15:10 +0800

If you have been looking at used Tesla V100 cards recently, you have probably seen two very different opinions:

one side says the card is still strong and offers great value
the other says the market is full of traps and DIY users can easily get burned

Both are true.

The point is not that V100 is unbuyable. The point is that you cannot buy it the same way you would buy a normal consumer GPU. What matters is not only whether it boots, and not only whether the seller says “like new” or “pulled from an original server”. What matters is whether the card has been tampered with, what its ECC condition looks like, and whether the cooling and power setup are actually reliable.

This article pulls together the most useful checks for buying and using one in practice.

Quick Takeaways

If you only want the short version, remember these points:

the V100 was produced roughly from 2017 to 2021, and 2021-dated cards are uncommon in the 16 GB version
looking only at “zero ECC” or “original pull” is not enough, because both data and physical condition can be altered
the biggest risk is often not buying an old card, but buying one that was disassembled, reflashed, or paired with a bad cooling setup
for DIY users, the real problem is usually not the core itself, but the adapter board, power delivery, hotspot temperature, and backplate cooling

1. Start with Production Date and Batch Clues

A very practical method is to check the chip date first, then see whether the dates on nearby components match it.

For example, if the chip surface shows 1828, it usually means:

18 = year 2018
28 = week 28

So that chip was produced in week 28 of 2018.
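
The same decoding, written out as a small helper. It assumes a plain four-digit YYWW marking and years in the 2000s; `decode_date_code` is a name invented for this sketch, and real markings may carry extra characters around the date.

```python
def decode_date_code(code):
    """Split a 4-digit YYWW marking such as '1828' into (year, week)."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected 4 digits, got {code!r}")
    year = 2000 + int(code[:2])  # assumes a 20xx production year
    week = int(code[2:])
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range in {code!r}")
    return year, week


print(decode_date_code("1828"))  # (2018, 28)
```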

Besides the chip package, nearby inductors often carry date-related markings too. If the chip date and inductor date are far apart, for example:

chip date is 2017
inductors point to 2020

then you should be cautious. It does not automatically prove the card is bad, but it does suggest it is no longer in a very original state.

On the other hand, if the dates broadly line up, such as:

a 2018 chip with 2018 surrounding components
a late 2019 chip paired with 2020 components

that is much more normal.
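
To make "far apart" less of a gut feeling, you can compare the decoded dates numerically. The sketch below works on (year, week) pairs; the 52-week tolerance and the specific weeks in the examples are assumptions chosen for illustration, not a rule from any datasheet.

```python
def weeks_apart(a, b):
    """Approximate distance in weeks between two (year, week) pairs."""
    (year_a, week_a), (year_b, week_b) = a, b
    return abs((year_a * 52 + week_a) - (year_b * 52 + week_b))


def dates_look_consistent(chip, component, max_gap_weeks=52):
    """True when the component date is within the tolerance of the chip date."""
    return weeks_apart(chip, component) <= max_gap_weeks


# A 2017 chip with 2020 inductors: suspicious.
print(dates_look_consistent((2017, 30), (2020, 10)))
# A late-2019 chip with early-2020 components: plausible.
print(dates_look_consistent((2019, 48), (2020, 5)))
```

Treat a failed check as a reason to look closer, not as proof of a problem, in line with the caution above.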

2. Do Not Only Look at the Chip: Check Inductors, Springs, and Frame

Visual inspection is best broken into a few separate checks.

1. Touch the inductors first

Gently press or touch the inductors. Under normal conditions, none of them should feel loose.

If one of them is already moving, it usually means:

the solder condition is not healthy
the problem may worsen with continued use

Even if the card still works now, that is not a good sign.

2. Check whether the retaining spring has been removed before

There is a useful logic here:

if the seller insists this is an “original server pull”
then the retaining spring generally should not have been casually removed

In a normal factory server environment, people do not usually remove this spring for no reason.

If the spring comes off very easily, the card was probably opened before. If the seller is also claiming it is untouched, that claim deserves skepticism.

3. If the frame comes apart too easily, that is also suspicious

Once the middle frame is removed, if the whole structure separates with almost no effort, that usually means the card has already been disassembled multiple times.

That matters on used V100 cards because reflashing, modification, and repair work often leave exactly these kinds of traces.

3. If the Backplate Separates Too Easily, Suspect a Reflash or Prior Tampering

One especially important detail is that there is a metal plate under the PCB. It is not only for protection; it also helps with heat dissipation.

In a normal original condition, this backplate is usually not easy to remove. Reasons include:

adhesive
a tight structural fit
the design was not meant for repeated disassembly

If the backplate separates from the PCB with only a little force, then you should suspect:

it has been opened before
the card may have had its VBIOS reflashed
there may have been secondary modifications

That does not automatically make it unusable, but it is clearly inconsistent with “original and untouched.”

4. How to Read `ECC`: What Matters Most Is Not Whether It Is Zero, but Whether It Grows

ECC is one of the first things people look at on a V100, and it really needs to be interpreted carefully.

A common method is to use nvidia-smi in detailed mode and check the ECC Errors section.

1. Real-time errors are the most dangerous

The upper section, labeled Volatile in the nvidia-smi output, can be understood as real-time errors.

If those numbers keep increasing while the card is running, that usually means the card is already in an unstable state.

In simple terms:

a card that runs without new errors matters more than a static zero reading
a card that starts increasing errors under stress is much more worrying than one with only historical accumulated counts
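
One way to make "does it grow" concrete is to record the counters before and after a run and diff them. The sketch below assumes you have already collected both readings into dictionaries; the counter names are hypothetical labels, not nvidia-smi field names.

```python
def grown_counters(before, after):
    """Return {name: delta} for every counter that increased between two readings."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


# Made-up readings taken before and after a stress run.
before = {"volatile_single_bit": 0, "volatile_double_bit": 0, "aggregate_single_bit": 7}
after = {"volatile_single_bit": 4, "volatile_double_bit": 0, "aggregate_single_bit": 11}

growth = grown_counters(before, after)
print(growth)
```

An empty result after a long run is the reassuring case, even if the historical aggregate count is not zero.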

2. Lifetime accumulated errors are not always scary

The Aggregate section shows lifetime accumulated errors, meaning how many corrected or uncorrected events happened across the card’s life.

If those values are only:

single digits
or maybe in the teens

that is not automatically a disaster.

If real-time errors do not continue increasing during actual use, the card may still be perfectly usable.

3. The page retirement section deserves more attention

The page retirement section is even more important, because it counts memory pages the driver has taken out of service, either after a double-bit error or after repeated single-bit errors at the same location.

A practical way to think about it is:

single-bit and double-bit categories may each have retired blocks
if the total climbs past 10, you are entering a range where caution is warranted

That does not always mean the card is unusable, but it does suggest reduced effective memory and weaker long-term confidence.
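
The rule of thumb above fits in a tiny helper. The threshold of 10 comes from the text; the function name and the three labels are invented for this sketch.

```python
def retired_page_status(single_bit_pages, double_bit_pages, caution_above=10):
    """Classify a card by its total number of retired pages."""
    total = single_bit_pages + double_bit_pages
    if total == 0:
        return "clean"
    if total <= caution_above:
        return "some retired pages"
    return "caution"


print(retired_page_status(0, 0))
print(retired_page_status(4, 1))
print(retired_page_status(9, 3))
```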

5. Do Not Worship “Zero ECC”: The Data Itself Can Be Manipulated

There is a very practical warning here:

ECC numbers are not inherently sacred.

If a card has:

extremely clean-looking data
but obvious signs of disassembly
and a structure that clearly looks worked on

then you should not trust “zero ECC” by itself.

A useful analogy is an old car that suddenly shows 0 mileage and almost no tire wear after many years. It is hard not to suspect the odometer was touched.

The same idea applies to V100:

numbers that look too perfect are not always good news
what matters is whether the data, the physical condition, and the stress-test behavior all make sense together

6. Stress Testing Is Necessary, but Testing Only the Core Is Not Enough

You can use a tool such as gpu-burn to stress the card for several minutes or longer and watch:

whether it remains stable
whether the card drops out
whether new ECC errors appear
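
While the stress tool runs in another terminal, you can poll a few values instead of watching the screen. The sketch below is one possible approach: it assumes nvidia-smi's CSV query mode with the fields temperature.gpu, power.draw, and ecc.errors.uncorrected.volatile.total, and that unsupported fields come back as non-numeric text. The helper names are made up, and `read_sample` only works on a machine with the NVIDIA driver installed.

```python
import subprocess

QUERY_FIELDS = "temperature.gpu,power.draw,ecc.errors.uncorrected.volatile.total"


def parse_sample(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""

    def to_number(text):
        try:
            return float(text)
        except ValueError:
            return None  # non-numeric placeholder, e.g. an unsupported field

    temp, power, ecc = (to_number(part.strip()) for part in line.split(","))
    return {"temp_c": temp, "power_w": power, "uncorrected_volatile": ecc}


def read_sample(gpu_index=0):
    """Run nvidia-smi once for the given GPU and parse the first output line."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "-i", str(gpu_index),
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


# Parsing a made-up line, so the example runs without a GPU.
print(parse_sample("67, 248.31, 0"))
```

Calling `read_sample()` in a loop with a short sleep gives a simple log of temperature, power, and new uncorrected errors over the run. Note that this only sees what the driver reports, which is exactly the limitation discussed next.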

But there is another important point:

Testing only the core does not prove the entire card is healthy.

A lot of V100 failures do not start with the core. They start with:

overheating in the power-delivery area
insufficient cooling around the backplate
excessive hotspot temperatures
adapter boards and cooling systems that are always operating too close to the edge

So stress testing only proves that “the card can run right now.” It does not prove that “this DIY setup will survive in the long run.”

7. For DIY Users, the Real Failure Point Is Usually Cooling and Power, Not the Purchase Itself

This is probably the most important part of the entire topic.

The core idea is simple:

For DIY users, casually combining an adapter base with a generic cooler is not a robust plan.

That is because V100 is not a normal consumer card. It is a server accelerator with:

high power draw
high heat density
complicated heat distribution

The chip is not the only thing producing heat. The backplate, power area, and connector region also get hot, and sometimes very hot.

1. Do not only watch average GPU temperature

Many monitoring tools show the average card temperature, but the more dangerous number is often the hot spot.

That means:

the visible temperature may only be in the 60s Celsius
while local hotspots may already be over 100C

That is why some DIY V100 builds look “fine” on paper and then suddenly die later.

2. Backplate cooling must be considered

Cooling for the backplate and power area cannot be ignored.

If you only cool the core, but:

the MOS area is neglected
the backplate gets no heat transfer help
the rear side lacks proper thermal design

then the full setup is still incomplete.

3. Cheap improvised water-cooling setups are risky

You should be cautious about the “random adapter board + cheap AIO water cooler” style setup.

The issue is not that it always fails immediately. The issue is that it often has:

uneven water-channel coverage
incomplete cooling for the power-delivery area
poor control of the actual hotspot zones
unpredictable long-term lifespan

8. If You Still Want to DIY, At Least Watch These Points

The most practical recommendations are:

prefer more mature adapter-board solutions with a better track record
do not focus only on the core; the rear power area and backplate need thermal attention too
the water block needs real coverage and even heat handling, not just physical contact
after stress testing, keep watching temperatures, hotspots, and long-term behavior
PSU quality also affects coil whine and overall stability

In other words, the hard part of a DIY V100 build is not “getting it to boot.” The hard part is “keeping it alive and stable afterward.”

9. Coil Whine and Adapter-Board Variance Are Real Problems Too

Two more points are often overlooked.

1. Coil whine may not be fully eliminable

It depends on the individual card, the inductors, capacitors, and the power environment. It is not something you can always solve with one cable or one small accessory.

2. Adapter-board variance is huge

That is why some sellers, even when they are willing to sell a bare card, still emphasize:

bench-testing it first
recording the serial number
doing stress tests
documenting the process

Because a lot of disputes are not caused by the silicon itself. They are caused by the adapter board and cooling solution paired with it afterward.

Closing

So, is Tesla V100 still worth buying? Yes, but only if you understand what you are buying and how you plan to use it afterward.

If you only check:

whether it powers on
whether ECC is all zero
whether the seller says “original pull”

that is nowhere near enough.

The more useful things to verify are:

whether the dates and batch clues line up
whether there are suspicious signs of prior disassembly
whether the backplate and structure were clearly opened before
whether errors increase under stress testing
whether your cooling and power setup are actually trustworthy

Especially for DIY users, the most dangerous part of V100 is often not “buying an old card”, but underestimating how demanding these cards are about cooling, power delivery, and modification quality.

llama.cpp GPU Performance Ranking: Full CUDA, ROCm, and Vulkan Scoreboards Explained with pp512 / tg128 / FA

Thu, 23 Apr 2026 10:22:04 +0800

Understanding the Metrics First

What is Q4_0

Q4_0 is a 4-bit quantization format. It does not mean the model is stronger. It means the model is smaller, uses less VRAM, and fits on more devices. Most of these scoreboards standardize on Llama 2 7B, Q4_0 so that GPU-to-GPU comparisons are easier.

What is pp512

pp512 usually means prompt processing 512 tokens, which is the throughput while processing 512 input tokens.

pp = prompt processing
512 = input length is 512 tokens
t/s = tokens per second

This is closer to prompt-ingestion speed, so it is often much higher than generation speed.

What is tg128

tg128 usually means text generation 128 tokens, which is the speed while generating 128 tokens continuously.

tg = text generation
128 = generate 128 tokens continuously
t/s = tokens per second

This is usually closer to the speed users actually feel in interactive usage.

What is FA

FA stands for Flash Attention.

with FA means Flash Attention is enabled
no FA means Flash Attention is disabled

On many GPUs, FA improves pp512 more clearly than tg128, but the gain is not identical across backends, drivers, and GPU architectures.

How to read t/s

t/s means tokens per second. When reading these scoreboards, the key rule is to compare the same type of test with the same settings.

Do not compare pp512 and tg128 as if they were the same thing
Do not mix no FA and with FA
Do not assume CUDA, ROCm, and Vulkan are directly interchangeable
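
To see why the two numbers should not be mixed, it helps to turn them into time. The sketch below is a back-of-the-envelope estimate that treats both rates as constant, which real workloads do not guarantee (throughput changes with context length and batch size). It uses the RTX 3090 figures from the CUDA no-FA scoreboard in this article; `estimate_seconds` is a name invented here.

```python
def estimate_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Rough wall-clock estimate: prompt phase plus generation phase."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# RTX 3090, CUDA, no FA: pp512 = 5174.69 t/s, tg128 = 158.16 t/s
total = estimate_seconds(512, 128, pp_tps=5174.69, tg_tps=158.16)
prompt_share = (512 / 5174.69) / total
print(f"{total:.2f} s total, {prompt_share:.0%} of it spent on the prompt")
```

Even though the prompt is four times longer than the output, most of the time goes to generation, which is why the tg128 column tracks perceived speed more closely.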

Quick Takeaways

CUDA is still the strongest overall path in llama.cpp GPU benchmarks, especially on high-end Nvidia GPUs.
ROCm is already delivering strong results on high-end AMD GPUs and Instinct accelerators.
Vulkan has the broadest hardware coverage, including Nvidia, AMD, Intel, older GPUs, and some Apple / Asahi setups.
tg128 is closer to everyday perceived speed, while pp512 is better for judging prompt throughput.

CUDA Scoreboards

Llama 2 7B, Q4_0, no FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
RTX 5090	32 GB / GDDR7 / 512 bit	14073.41 ± 115.16	290.02 ± 1.10	8cf6b42	@totaldev
RTX PRO 6000 Blackwell	96 GB / GDDR7 / 512 bit	14854.63 ± 22.73	274.20 ± 0.14	79c1160	@Tom94
H100 80 GB	80 GB / HBM3 / 5120 bit	9918.34 ± 176.97	267.81 ± 1.54	5143fa8	@Hedede
A100 80 GB	80 GB / HBM2e / 5120 bit	4849.53 ± 8.94	190.88 ± 0.33	5143fa8	@Hedede
RTX 4090 D	24 GB / GDDR6X / 384 bit	10293.86 ± 134.72	189.33 ± 0.19	79c1160	@autonomous-AI-lab
RTX 4090	24 GB / GDDR6X / 384 bit	11992.70 ± 107.99	186.21 ± 0.13	2241453	@lhl
RTX 5080	16 GB / GDDR7 / 256 bit	8297.36 ± 9.50	181.99 ± 0.42	8a4280c	@Hedede
RTX 5070 Ti	16 GB / GDDR7 / 256 bit	6952.38 ± 13.73	176.85 ± 0.07	933414c	@TinyServal
RTX 6000 Ada	48 GB / GDDR6 / 384 bit	9229.23 ± 101.78	176.07 ± 0.26	b8e09f0	@Hedede
RTX 3090 Ti	24 GB / GDDR6X / 384 bit	6567.49 ± 20.30	171.19 ± 3.98	9c35706	@slaren
RTX 3090	24 GB / GDDR6X / 384 bit	5174.69 ± 21.83	158.16 ± 0.21	c76b420	@m18coppola
L40	48 GB / GDDR6 / 384 bit	8870.49 ± 378.76	152.01 ± 0.28	ee09828	@Hedede
RTX 4080 SUPER	16 GB / GDDR6X / 256 bit	8125.15 ± 41.05	148.33 ± 0.20	81086cd	@zacharyarnaise
RTX 4080	16 GB / GDDR6X / 256 bit	8031.64 ± 26.49	142.49 ± 0.16	20638e4	@Ristovski
RTX 3080	10 GB / GDDR6X / 320 bit	5013.86 ± 24.80	139.65 ± 0.99	9c35706	@slaren
RTX A6000	48 GB / GDDR6 / 384 bit	4913.93 ± 6.79	138.73 ± 2.75	4795c91	@Hedede
RTX 4070 Ti SUPER	16 GB / GDDR6X / 256 bit	6924.53 ± 13.87	132.26 ± 0.16	9c35706	@Ristovski
RTX PRO 4000 Blackwell	24 GB / GDDR7 / 192 bit	4992.83 ± 113.52	131.66 ± 0.20	7d77f07	@Hedede
RTX A5000	24 GB / GDDR6 / 384 bit	4028.16 ± 19.14	130.07 ± 2.74	e5155e6	@Hedede
Tesla V100	32 GB / HBM2 / 4096 bit	3042.64 ± 40.71	129.08 ± 0.05	51f5a45	@Hedede
RTX 5070	12 GB / GDDR7 / 192 bit	5184.75 ± 18.70	127.54 ± 0.46	-	@Spyro000
A40	48 GB / GDDR6 / 384 bit	4609.01 ± 10.67	124.11 ± 0.17	3470a5c	@Hedede
A30	24 GB / HBM2e / 3072 bit	2767.10 ± 1.88	124.81 ± 0.16	583cb83	@Hedede
Titan V	12 GB / HBM2 / 3072 bit	2617.46 ± 2.10	108.79 ± 0.05	e56abd2	@Hedede
RTX 2080 Ti	11 GB / GDDR6 / 352 bit	2890.66 ± 2.42	107.51 ± 0.21	9c35706	@ariya
Quadro RTX 6000	24 GB / GDDR6 / 384 bit	2751.18 ± 19.43	102.77 ± 0.04	b8e09f0	@Hedede
Quadro RTX 8000	48 GB / GDDR6 / 384 bit	2709.95 ± 3.35	102.68 ± 0.03	b8e09f0	@Hedede
RTX A4500	20 GB / GDDR6 / 320 bit	2827.20 ± 66.43	97.32 ± 2.80	5cdb27e	@aleksyx
RTX 5060 Ti 16 GB	16 GB / GDDR7 / 128 bit	3737.25 ± 6.79	90.94 ± 0.02	89d1029	@mike-llamacpp
RTX 2070 SUPER	8 GB / GDDR6 / 256 bit	2088.34 ± 1.94	88.06 ± 0.28	bc07349	@phstudy
RTX A4000	16 GB / GDDR6 / 256 bit	2684.06 ± 15.28	83.77 ± 0.37	65349f2	@TinyServal
Titan Xp	12 GB / GDDR5X / 384 bit	1154.96 ± 1.46	76.08 ± 0.08	c4510dc	@Hedede
RTX 3060	12 GB / GDDR6 / 192 bit	2137.50 ± 10.12	75.57 ± 0.07	baa9255	@QuantiusBenignus
Quadro RTX 4000	8 GB / GDDR6 / 256 bit	1536.89 ± 0.90	65.62 ± 0.62	7d77f07	@Hedede
RTX 4060 Ti 8 GB	8 GB / GDDR6 / 128 bit	3394.63 ± 7.44	63.86 ± 0.01	89d1029	@mike-llamacpp
GTX 1080 Ti	11 GB / GDDR5X / 352 bit	1084.41 ± 3.01	62.49 ± 0.06	9c35706	@ariya
RTX A4000 Ada	20 GB / GDDR6 / 160 bit	2779.77 ± 9.91	61.83 ± 0.04	a74a0d6	@sdwolfz
RTX 2060 SUPER	8 GB / GDDR6 / 256 bit	1420.24 ± 1.95	60.04 ± 0.01	5c0eb5e	@ggerganov
Tesla P100	16 GB / HBM2 / 4096 bit	760.80 ± 2.92	58.35 ± 0.00	b8372ee	@Hedede
DGX Spark	128 GB / LPDDR5x	3062.31 ± 11.02	57.21 ± 0.06	5acd455	@ggerganov
Tesla P40	24 GB / GDDR5 / 384 bit	1007.42 ± 1.23	54.74 ± 0.07	c76b420	@m18coppola
RTX 2000 Ada	16 GB / GDDR6 / 128 bit	1956.22 ± 7.74	50.62 ± 0.04	756cfea	@DigitalRudeness
Tesla T4	16 GB / GDDR6 / 256 bit	1219.06 ± 4.18	46.38 ± 0.73	d32e03f	@pt13762104
RTX 4050 Laptop	6 GB / GDDR6 / 96 bit	1725.85 ± 17.85	43.72 ± 0.41	d79d8f3	@TimCabbage
GTX 1660	6 GB / GDDR5 / 192 bit	148.91 ± 0.01	41.35 ± 0.02	9515c61	@ariya
Tesla M40	24 GB / GDDR5 / 384 bit	282.65 ± 0.15	38.04 ± 0.02	97d5117	@Hedede
GTX 1070 Ti	8 GB / GDDR5 / 256 bit	714.44 ± 2.04	37.82 ± 0.02	79c1160	@pebaryan
Jetson AGX Orin	64 GB / LPDDR5 / 256 bit	991.31 ± 1.15	33.58 ± 0.14	c1b1876	@TinyServal
Tesla P4	8 GB / GDDR5 / 256 bit	514.53 ± 3.06	33.29 ± 0.00	c76b420	@m18coppola
P106-100	6 GB / GDDR5 / 192 bit	406.94 ± 0.25	30.40 ± 0.02	5fd160b	@pebaryan
GTX 1060	6 GB / GDDR5 / 192 bit	416.85 ± 1.75	27.79 ± 0.02	5fd160b	@pebaryan
Quadro T1000	4 GB / GDDR5 / 128 bit	79.44 ± 0.01	27.82 ± 0.18	f6da8cb	@hanabu
Quadro P2000	5 GB / GDDR5 / 160 bit	309.30 ± 0.05	23.63 ± 0.00	baa9255	@TinyServal
Quadro P1000	4 GB / GDDR5 / 128 bit	183.40 ± 0.11	13.99 ± 0.13	1e74897	@aleksyx
Tesla K80	12 GB / GDDR5 / 384 bit	133.14 ± 0.55	13.80 ± 0.02	32732f2	@pebaryan
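
If you want to re-sort or filter a scoreboard like this yourself, the rows are easy to parse. The sketch below assumes tab-separated rows with cells in the "mean ± spread" form used above; the function names are made up, and the three rows are copied from the table.

```python
def parse_measurement(cell):
    """Turn '14073.41 ± 115.16' into (14073.41, 115.16)."""
    mean, _, spread = cell.partition("±")
    return float(mean), float(spread) if spread.strip() else 0.0


def parse_row(line):
    """Parse one tab-separated scoreboard row into a dict."""
    chip, memory, pp512, tg128, *_ = line.split("\t")
    return {
        "chip": chip,
        "memory": memory,
        "pp512": parse_measurement(pp512),
        "tg128": parse_measurement(tg128),
    }


rows = [
    "RTX 5090\t32 GB / GDDR7 / 512 bit\t14073.41 ± 115.16\t290.02 ± 1.10\t8cf6b42\t@totaldev",
    "H100 80 GB\t80 GB / HBM3 / 5120 bit\t9918.34 ± 176.97\t267.81 ± 1.54\t5143fa8\t@Hedede",
    "RTX 4090\t24 GB / GDDR6X / 384 bit\t11992.70 ± 107.99\t186.21 ± 0.13\t2241453\t@lhl",
]
parsed = [parse_row(r) for r in rows]

by_tg = [r["chip"] for r in sorted(parsed, key=lambda r: r["tg128"][0], reverse=True)]
by_pp = [r["chip"] for r in sorted(parsed, key=lambda r: r["pp512"][0], reverse=True)]
print(by_tg)
print(by_pp)
```

The two orderings differ for these rows: the H100 is ahead of the RTX 4090 on tg128 but behind it on pp512, which is a good reminder to pick the column that matches your workload.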

Llama 2 7B, Q4_0, with FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
RTX 5090	32 GB / GDDR7 / 512 bit	14970.15 ± 381.06	300.40 ± 0.28	8cf6b42	@totaldev
RTX PRO 6000 Blackwell	96 GB / GDDR7 / 512 bit	16618.98 ± 20.66	281.11 ± 0.41	5143fa8	@Tom94
H100 80 GB	80 GB / HBM3 / 5120 bit	11263.29 ± 98.34	280.74 ± 1.17	5143fa8	@Hedede
A100 80 GB	80 GB / HBM2e / 5120 bit	5285.96 ± 6.58	200.90 ± 0.12	5143fa8	@Hedede
RTX 4090 D	24 GB / GDDR6X / 384 bit	12506.97 ± 11.51	191.57 ± 0.03	79c1160	@autonomous-AI-lab
RTX 4090	24 GB / GDDR6X / 384 bit	14770.63 ± 102.93	188.96 ± 0.05	2241453	@lhl
RTX 5080	16 GB / GDDR7 / 256 bit	9487.70 ± 21.89	184.68 ± 0.05	8a4280c	@Hedede
RTX 5070 Ti	16 GB / GDDR7 / 256 bit	8419.56 ± 35.50	182.43 ± 0.09	933414c	@TinyServal
RTX 6000 Ada	48 GB / GDDR6 / 384 bit	10576.85 ± 530.21	179.47 ± 0.32	b8e09f0	@Hedede
RTX 3090 Ti	24 GB / GDDR6X / 384 bit	6924.01 ± 10.76	172.26 ± 1.31	9c35706	@slaren
RTX PRO 4500 Blackwell	32 GB / GDDR7 / 256 bit	7251.66 ± 92.40	168.90 ± 0.20	becc481	@Hedede
RTX 3090	24 GB / GDDR6X / 384 bit	5560.06 ± 16.28	161.89 ± 0.18	c76b420	@m18coppola
L40	48 GB / GDDR6 / 384 bit	10097.64 ± 671.22	153.76 ± 0.12	ee09828	@Hedede
RTX 4080 SUPER	16 GB / GDDR6X / 256 bit	9439.01 ± 56.75	147.48 ± 1.41	81086cd	@zacharyarnaise
RTX 4080	16 GB / GDDR6X / 256 bit	9205.93 ± 22.31	143.47 ± 0.02	20638e4	@Ristovski
RTX A6000	48 GB / GDDR6 / 384 bit	5662.39 ± 13.87	144.87 ± 0.18	4795c91	@Hedede
RTX 3080	10 GB / GDDR6X / 320 bit	5569.56 ± 14.04	139.95 ± 0.95	9c35706	@slaren
RTX PRO 4000 Blackwell	24 GB / GDDR7 / 192 bit	5674.44 ± 139.53	136.38 ± 0.13	7d77f07	@Hedede
RTX A5000	24 GB / GDDR6 / 384 bit	4552.15 ± 9.68	135.83 ± 0.11	e5155e6	@Hedede
Tesla V100	32 GB / HBM2 / 4096 bit	2973.78 ± 3.62	134.76 ± 0.02	51f5a45	@Hedede
RTX 4070 Ti SUPER	16 GB / GDDR6X / 256 bit	7612.32 ± 37.35	132.85 ± 0.31	9c35706	@Ristovski
A30	24 GB / HBM2e / 3072 bit	3068.72 ± 0.63	131.93 ± 0.18	583cb83	@Hedede
RTX 5070	12 GB / GDDR7 / 192 bit	5783.44 ± 36.95	128.21 ± 2.52	-	@Spyro000
A40	48 GB / GDDR6 / 384 bit	5256.38 ± 19.39	126.24 ± 0.06	3470a5c	@Hedede
Titan V	12 GB / HBM2 / 3072 bit	2481.25 ± 1.31	112.17 ± 0.01	e56abd2	@Hedede
RTX 2080 Ti	11 GB / GDDR6 / 352 bit	3107.61 ± 4.34	109.17 ± 0.07	9c35706	@ariya
Quadro RTX 6000	24 GB / GDDR6 / 384 bit	3053.96 ± 1.37	104.38 ± 0.04	b8e09f0	@Hedede
Quadro RTX 8000	48 GB / GDDR6 / 384 bit	3052.35 ± 5.64	103.63 ± 0.02	b8e09f0	@Hedede
RTX A4500	20 GB / GDDR6 / 320 bit	3453.10 ± 49.19	103.00 ± 0.25	5cdb27e	@aleksyx
RTX 5060 Ti 16 GB	16 GB / GDDR7 / 128 bit	4195.53 ± 1.98	93.46 ± 0.01	89d1029	@mike-llamacpp
RTX 2070 SUPER	8 GB / GDDR6 / 256 bit	2293.29 ± 5.91	87.71 ± 0.29	bc07349	@phstudy
RTX A4000	16 GB / GDDR6 / 256 bit	2807.83 ± 52.44	85.17 ± 0.66	65349f2	@TinyServal
RTX 3060	12 GB / GDDR6 / 192 bit	2407.67 ± 3.73	76.92 ± 0.03	baa9255	@QuantiusBenignus
Titan Xp	12 GB / GDDR5X / 384 bit	1218.12 ± 1.82	73.84 ± 0.04	c4510dc	@Hedede
Quadro RTX 4000	8 GB / GDDR6 / 256 bit	1662.80 ± 2.04	67.62 ± 0.67	7d77f07	@Hedede
RTX 4060 Ti 8 GB	8 GB / GDDR6 / 128 bit	3803.45 ± 70.80	64.03 ± 0.53	89d1029	@mike-llamacpp
Tesla P100	16 GB / HBM2 / 4096 bit	787.36 ± 3.27	61.99 ± 0.00	b8372ee	@Hedede
GTX 1080 Ti	11 GB / GDDR5X / 352 bit	1138.14 ± 2.02	61.38 ± 0.03	9c35706	@ariya
RTX A4000 Ada	20 GB / GDDR6 / 160 bit	3171.86 ± 4.34	61.37 ± 0.01	a74a0d6	@sdwolfz
RTX 2060 SUPER	8 GB / GDDR6 / 256 bit	1563.77 ± 0.51	61.13 ± 0.05	5c0eb5e	@ggerganov
DGX Spark	128 GB / LPDDR5x	3661.37 ± 38.66	56.74 ± 0.03	5acd455	@ggerganov
Tesla P40	24 GB / GDDR5 / 384 bit	1079.66 ± 0.18	53.73 ± 0.05	c76b420	@m18coppola
RTX 2000 Ada	16 GB / GDDR6 / 128 bit	2250.14 ± 5.91	50.71 ± 0.01	756cfea	@DigitalRudeness
Tesla T4	16 GB / GDDR6 / 256 bit	1309.73 ± 1.02	44.03 ± 0.57	d32e03f	@pt13762104
GTX 1660	6 GB / GDDR5 / 192 bit	154.45 ± 0.52	41.43 ± 0.01	9515c61	@ariya
Tesla M40	24 GB / GDDR5 / 384 bit	290.17 ± 0.11	39.98 ± 0.01	97d5117	@Hedede
GTX 1070 Ti	8 GB / GDDR5 / 256 bit	790.52 ± 2.39	37.87 ± 0.00	79c1160	@pebaryan
Jetson AGX Orin	64 GB / LPDDR5 / 256 bit	1171.96 ± 4.70	35.88 ± 0.18	c1b1876	@TinyServal
Tesla P4	8 GB / GDDR5 / 256 bit	529.53 ± 2.12	33.12 ± 0.03	c76b420	@m18coppola
P106-100	6 GB / GDDR5 / 192 bit	438.49 ± 0.38	30.64 ± 0.06	5fd160b	@pebaryan
GTX 1060	6 GB / GDDR5 / 192 bit	446.19 ± 0.81	28.18 ± 0.01	5fd160b	@pebaryan
Quadro T1000	4 GB / GDDR5 / 128 bit	27.46 ± 0.23	27.46 ± 0.23	f6da8cb	@hanabu
Quadro P2000	5 GB / GDDR5 / 160 bit	311.55 ± 0.19	23.76 ± 0.01	baa9255	@TinyServal
Tesla K80	12 GB / GDDR5 / 384 bit	133.36 ± 0.60	14.27 ± 0.32	32732f2	@pebaryan
Quadro P1000	4 GB / GDDR5 / 128 bit	173.82 ± 0.02	13.65 ± 0.14	1e74897	@aleksyx
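
With both CUDA tables in hand, the effect of FA can be expressed as a percentage change. The sketch below uses the mean values for two cards copied from the no-FA and with-FA tables above and ignores the ± spread, so small differences should not be over-read; `percent_change` is a name invented here.

```python
def percent_change(no_fa, with_fa):
    """Relative change from the no-FA value to the with-FA value, in percent."""
    return (with_fa / no_fa - 1.0) * 100.0


# (no FA, with FA) mean values copied from the two CUDA tables.
cards = {
    "RTX 4090": {"pp512": (11992.70, 14770.63), "tg128": (186.21, 188.96)},
    "Tesla V100": {"pp512": (3042.64, 2973.78), "tg128": (129.08, 134.76)},
}

changes = {
    chip: {metric: round(percent_change(*pair), 1) for metric, pair in metrics.items()}
    for chip, metrics in cards.items()
}
for chip, result in changes.items():
    print(chip, result)
```

For these two rows the RTX 4090 gains far more on pp512 than on tg128, while the Tesla V100 loses a little pp512 and gains on tg128, which matches the earlier point that the FA gain is not identical across GPU architectures.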

Apple Silicon as a Reference Baseline

Discussion #4167 is useful because it established a more unified benchmark format early on. Besides Q4_0, it also includes F16 and Q8_0, which helps explain PP / TG / t/s. The thread explicitly defines:

PP = prompt processing
TG = text generation
t/s = tokens per second

A representative example is the M2 Ultra time-series comparison:

Time	Device	Version / Note	Bandwidth GB/s	GPU Cores	F16 PP	F16 TG	Q8_0 PP	Q8_0 TG	Q4_0 PP	Q4_0 TG
2023-11-21	M2 Ultra	8e672ef	800	76	1401.85	41.02	1248.59	66.64	1238.48	94.27
2024-11-12	M2 Ultra	86ed72d + FA	800	76	1525.95	43.15	1368.18	73.11	1391.78	108.80
2025-08-02	M2 Ultra	5c0eb5e + FA	800	76	1561.35	43.24	1386.97	73.35	1412.42	109.41

Representative Apple Silicon entries shown in the thread:

Device	Q4_0 PP	Q4_0 TG	Q8_0 PP	Q8_0 TG	F16 PP	F16 TG
M1 Pro 16 GPU	266.25	36.41	270.37	22.34	302.14	12.75
M2 Ultra 76 GPU	1238.48	94.27	1248.59	66.64	1401.85	41.02
M3 Max 40 GPU	690.99	65.85	749.37	43.00	794.26	25.27

ROCm / HIP Scoreboards

Llama 2 7B, Q4_0, no FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
Instinct MI300X	192 GB / HBM3 / 8192 bit	11476.40 ± 72.79	232.92 ± 0.53	ee3a9fc	@yeahdongcn
RX 7900 XTX	24 GB / GDDR6 / 384 bit	3552.27 ± 101.96	167.11 ± 0.50	2f0c2db	@Diablo-D3
Instinct MI210	64 GB / HBM2e / 4096 bit	2486.22 ± 9.58	124.51 ± 0.04	8160b38	@65a
Pro W7900	48 GB / GDDR6 / 384 bit	3213.17 ± 80.47	121.18 ± 0.06	8160b38	@65a
RX 7900 XT	20 GB / GDDR6 / 320 bit	3098.38 ± 24.02	116.15 ± 0.06	1e15bfd	@AdamNiederer
RX 9070	16 GB / GDDR6 / 256 bit	2381.77 ± 3.68	114.48 ± 0.60	d0660f2	@andj1210
Instinct MI100	32 GB / HBM2 / 4096 bit	2732.83 ± 1.98	110.48 ± 0.14	9c35706	@firefox42
RX 9070 XT	16 GB / GDDR6 / 256 bit	5055.19 ± 109.58	101.27 ± 0.27	583cb83	@Hadrianneue
RX 7800 XT	16 GB / GDDR6 / 256 bit	2151.81 ± 17.94	100.94 ± 0.10	00131d6	@olegshulyakov
Instinct MI50	32 GB / HBM2 / 4096 bit	1057.24 ± 0.53	98.95 ± 0.25	97d5117	@wtarreau
RX 7900 GRE	16 GB / GDDR6 / 256 bit	1456.98 ± 12.39	96.07 ± 0.10	6fa3b55	@MihaiBojescu
AI PRO R9700	32 GB / GDDR6 / 256 bit	4443.54 ± 339.25	93.84 ± 0.26	bd4ef13	@gogich77
Instinct MI60	32 GB / HBM2 / 4096 bit	1289.11 ± 0.62	91.46 ± 0.13	504af20	@Said-Akbar
RX 6900 XT	16 GB / GDDR6 / 256 bit	1889.84 ± 31.21	88.49 ± 0.00	a972fae	@notgood
Pro VII	16 GB / HBM2 / 4096 bit	1064.99 ± 1.18	87.45 ± 0.04	2739a71	@8XXD8
RX 6800 XT	16 GB / GDDR6 / 256 bit	1447.07 ± 1.36	83.92 ± 0.03	79c1160	@MrLavender
Pro V620	32 GB / GDDR6 / 256 bit	1803.65 ± 2.54	74.66 ± 0.01	5c0eb5e	@samteezy
RX 9060 XT	16 GB / GDDR6 / 256 bit	1419.67 ± 3.64	67.58 ± 0.24	a0e13dc	@lcy0321
RX 5700 XT	8 GB / GDDR6 / 256 bit	354.17 ± 0.18	67.55 ± 0.04	c05e8c9	@daniandtheweb
Instinct MI25	16 GB / HBM2 / 2048 bit	409.83 ± 0.23	63.94 ± 0.06	2739a71	@8XXD8
AI Max+ 395	128 GB / LPDDR5	911.36 ± 1.79	50.01 ± 0.07	e60f241	@firefox42
RX 7600 XT	16 GB / GDDR6 / 128 bit	1099.64 ± 2.05	48.58 ± 0.06	9c35706	@wbruna
RX Vega 64	8 GB / HBM2 / 2048 bit	240.68 ± 0.09	48.46 ± 0.09	ec428b0	@davispuh
Radeon 8060S	System Shared / DDR5	351.36 ± 0.67	47.97 ± 0.33	1d0125b	@hspak
Radeon 880M	System Shared / DDR5	163.25 ± 13.86	12.97 ± 1.63	c55d53a	@Hedede

Llama 2 7B, Q4_0, with FA

Chip	Memory	pp512 t/s	tg128 t/s	Commit	Thanks to
Instinct MI300X	192 GB / HBM3 / 8192 bit	11945.97 ± 54.29	218.53 ± 0.09	ee3a9fc	@yeahdongcn
RX 7900 XTX	24 GB / GDDR6 / 384 bit	3874.25 ± 11.92	170.12 ± 0.56	2f0c2db	@Diablo-D3
Pro W7900	48 GB / GDDR6 / 384 bit	3472.86 ± 52.86	127.43 ± 0.12	8160b38	@65a
Instinct MI210	64 GB / HBM2e / 4096 bit	2571.82 ± 2.89	130.18 ± 0.06	8160b38	@65a
RX 9070	16 GB / GDDR6 / 256 bit	2452.68 ± 1.33	115.32 ± 0.52	d0660f2	@andj1210
RX 7900 XT	20 GB / GDDR6 / 320 bit	3261.75 ± 9.09	112.30 ± 0.06	1e15bfd	@AdamNiederer
Instinct MI50	32 GB / HBM2 / 4096 bit	1129.43 ± 0.15	105.82 ± 0.07	97d5117	@wtarreau
Instinct MI100	32 GB / HBM2 / 4096 bit	2755.00 ± 3.68	104.71 ± 0.10	9c35706	@firefox42
AI PRO R9700	32 GB / GDDR6 / 256 bit	4773.07 ± 49.30	97.98 ± 0.13	bd4ef13	@gogich77
RX 7900 GRE	16 GB / GDDR6 / 256 bit	1598.79 ± 11.48	97.53 ± 0.06	6fa3b55	@MihaiBojescu
RX 9070 XT	16 GB / GDDR6 / 256 bit	4903.51 ± 96.36	97.28 ± 0.13	583cb83	@Hadrianneue
RX 7800 XT	16 GB / GDDR6 / 256 bit	2304.63 ± 2.85	95.99 ± 0.21	00131d6	@olegshulyakov
RX 6900 XT	16 GB / GDDR6 / 256 bit	1948.31 ± 13.51	85.04 ± 0.02	a972fae	@notgood
Pro V620	32 GB / GDDR6 / 256 bit	1256.86 ± 0.55	70.83 ± 0.02	5c0eb5e	@samteezy
RX 9060 XT	16 GB / GDDR6 / 256 bit	1479.27 ± 0.71	65.42 ± 0.19	a0e13dc	@lcy0321
RX 5700 XT	8 GB / GDDR6 / 256 bit	314.17 ± 0.29	62.02 ± 0.05	c05e8c9	@daniandtheweb
AI Max+ 395	128 GB / LPDDR5	1003.53 ± 2.91	49.87 ± 0.02	e60f241	@firefox42
Radeon 8060S	System Shared / DDR5	366.08 ± 1.44	48.97 ± 0.15	1d0125b	@hspak
RX 7600 XT	16 GB / GDDR6 / 128 bit	1199.16 ± 1.07	47.65 ± 0.06	9c35706	@wbruna
RX Vega 64	8 GB / HBM2 / 2048 bit	153.17 ± 0.72	42.46 ± 0.40	ec428b0	@davispuh
Radeon 880M	System Shared / DDR5	213.31 ± 14.05	16.16 ± 1.41	c55d53a	@Hedede

Vulkan Scoreboards

Llama 2 7B, Q4_0, no FA

Chip	pp512 t/s	tg128 t/s	Commit	Comments
Nvidia RTX 5090	10381.64 ± 508.84	263.63 ± 0.91	ca71fb9	coopmat2
AMD Radeon RX 7900 XTX	3531.93 ± 31.74	191.28 ± 0.20	2f0c2db
Nvidia RTX 4090	9452.03 ± 187.70	187.97 ± 0.21	4ae88d0	coopmat2
Nvidia RTX 5080	7444.99 ± 20.11	185.10 ± 0.54	f6b533d	coopmat2
Nvidia A100	6389.86 ± 4.83	160.78 ± 0.16	2257758	coopmat2
Nvidia RTX 3090	4298.97 ± 10.59	160.13 ± 0.25	4ae88d0	coopmat2
Nvidia RTX 4080 Super	7101.18 ± 269.79	147.13 ± 5.64	81086cd	coopmat2
Nvidia RTX 3080	4287.11 ± 55.50	139.15 ± 0.05	7c7d6ce	coopmat2
Nvidia RTX A5000	3641.55 ± 9.05	139.89 ± 0.69	4ae88d0	coopmat2
AMD Radeon RX 9070 XT	5036.04 ± 88.16	137.11 ± 0.02	e9fd8dc
Nvidia RTX 5070 Ti	6213.63 ± 27.72	135.63 ± 0.18	d13d0f6	coopmat2
AMD Radeon AI Pro R9700	4036.04 ± 34.58	130.19 ± 0.39	3191462
Nvidia Tesla V100	1391.39 ± 1.19	129.58 ± 0.58	7d77f07
Nvidia RTX 4070 Ti Super	6099.18 ± 154.30	129.45 ± 0.18	4ae88d0	coopmat2
AMD Radeon RX 7900 XT	2941.58 ± 17.17	123.18 ± 0.40	71e74a3
AMD Radeon RX 9070	3164.10 ± 66.84	119.71 ± 3.40	21c17b5
AMD Radeon RX 7800 XT	2017.33 ± 19.30	118.27 ± 0.27	4fdbc1e
AMD Radeon RX 7900 GRE	2336.31 ± 7.52	116.11 ± 0.26	4b2a477
Apple M3 Ultra	1116.83 ± 0.55	115.54 ± 0.78	2d451c8	MoltenVK
Intel Arc Pro B70	3379.00 ± 47.92	112.02 ± 1.08	b863507
Nvidia Titan V	984.36 ± 4.13	108.86 ± 0.28	e56abd2
AMD Radeon Pro VII	1078.54 ± 0.86	107.82 ± 0.14	N/A
AMD Radeon RX 6900 XT	1837.21 ± 25.44	104.60 ± 0.30	a972fae
Intel Arc Pro A60	2261.11 ± 9.53	104.25 ± 0.07	97d5117
AMD Radeon RX 6800 XT	1752.92 ± 1.71	100.32 ± 0.97	N/A
AMD Radeon VII	1059.14 ± 0.56	101.19 ± 0.53	77d6ae4
Nvidia RTX 2080 Ti	1888.24 ± 9.20	97.58 ± 6.60	N/A
AMD Radeon RX 6800	1698.69 ± 0.80	95.61 ± 0.19	4b385bf
AMD Radeon Pro W6800X Duo	687.71 ± 4.33	94.82 ± 0.12	N/A
Nvidia RTX 5060 Ti	3460.92 ± 7.16	93.51 ± 0.15	89f10ba	coopmat2
Nvidia RTX 4070	3179.37 ± 46.16	92.29 ± 0.28	9a48399
AMD Radeon Pro W6800X	510.80 ± 0.13	86.47 ± 0.46	13b4548	MoltenVK
AMD Radeon RX 6700 XT	1051.20 ± 0.98	83.88 ± 0.08	6d75883
AMD Radeon RX 6750 XT	1040.58 ± 0.35	81.98 ± 0.03	228f34c
AMD Radeon Pro V620	1595.32 ± 1.59	81.78 ± 0.06	03d4698
Nvidia RTX 3070	2113.02 ± 7.38	78.71 ± 0.13	1b8fb81
AMD Radeon Instinct MI60	369.26 ± 2.48	78.16 ± 1.40	504af20
Nvidia RTX 3060	1815.70 ± 5.85	75.94 ± 0.80	92c0b38	coopmat2
Apple M4 Max	724.77 ± 20.93	75.02 ± 0.14	1ece0cb6
Nvidia Tesla T10	1692.70 ± 2.05	75.01 ± 0.21	7f76692	coopmat2
Nvidia RTX A4000	2248.14 ± 7.59	73.74 ± 0.08	f5245b5	coopmat2
AMD Radeon RX 5700 XT	529.69 ± 0.26	70.73 ± 0.04	4fdbc1e
AMD Radeon RX 9060 XT	2141.67 ± 6.87	70.54 ± 0.74	ed52f36
Intel Arc B580	620.94 ± 15.33	70.14 ± 0.28	7f76692
AMD Radeon Pro V540	583.88 ± 6.56	69.64 ± 0.24	9da3dcd
AMD Radeon Pro W5700	449.85 ± 0.46	68.55 ± 0.15	23bc779
Intel Arc Pro B60	522.36 ± 3.60	68.55 ± 0.01	516a4ca
Nvidia GTX 1080 Ti	540.69 ± 0.71	64.99 ± 0.08	360d653
Nvidia RTX 2070 Super	1199.13 ± 7.70	64.64 ± 0.20	b7552cf
Nvidia RTX 3070 Mobile	1689.40 ± 19.57	63.64 ± 0.39	ceff6bb	coopmat2
Nvidia Tesla P100	678.14 ± 1.40	63.16 ± 0.06	eec1e33
AMD BC-250	370.66 ± 0.04	62.32 ± 0.32	5886f4f
AMD Radeon RX 6650 XT	1029.52 ± 1.21	62.14 ± 0.02	dbb852b
Nvidia RTX 4060 Mobile	2135.66 ± 23.18	59.53 ± 0.03	a5c07dc	coopmat2
Nvidia Tesla P40	488.06 ± 0.27	59.36 ± 0.16	N/A
Nvidia GTX 1660 Ti Mobile	511.67 ± 2.85	56.60 ± 0.07	b43556e
AMD Radeon Instinct MI25	439.42 ± 0.34	54.69 ± 0.03	2739a71
AMD Radeon RX 6600 XT	574.65 ± 0.86	53.92 ± 0.11	091592d
AMD Ryzen AI Max+ 395	1288.96 ± 6.49	53.59 ± 0.38	7f76692
AMD Radeon RX 7600 XT	840.85 ± 3.02	53.02 ± 0.01	01d8eaa
Intel Arc A770	1073.85 ± 29.68	52.56 ± 0.11	a69d54f
Nvidia GB10	2737.79 ± 19.56	52.28 ± 0.03	b9da444	coopmat2
AMD FirePro S9300 x2	247.26 ± 0.43	51.86 ± 0.11	eec1e33	Split across two GPUs
AMD Radeon RX 6600	761.89 ± 1.76	50.63 ± 0.02	b1c70e2
AMD Radeon RX Vega 56	439.87 ± 0.61	50.23 ± 0.14	92c0b38
Intel Arc B570	913.95 ± 0.90	49.64 ± 0.03	7f76692
Nvidia RTX 3060 Mobile	1059.76 ± 3.54	49.03 ± 0.13	dbb3a47
AMD Radeon RX 6800M	861.99 ± 7.67	48.71 ± 0.71	8e6f8bc
AMD Radeon RX 6600M	605.59 ± 0.65	48.21 ± 0.07	fe5b78c
Intel Arc A770M	875.92 ± 2.16	47.69 ± 0.16	eeee367
Nvidia P104-100	311.90 ± 0.22	46.18 ± 0.05	eec1e33
AMD Radeon RX Vega 64	356.08 ± 0.09	45.73 ± 0.18	ec428b0
Nvidia RTX A2000	1245.19 ± 8.76	45.52 ± 0.54	b1afcab	coopmat2
AMD Radeon RX 7600M XT	459.39 ± 2.34	45.28 ± 0.10	b9ab0a4	eGPU
AMD Radeon Pro V340	375.41 ± 0.24	45.16 ± 0.06	9da3dcd	Split across two GPUs
Nvidia GTX 1070 Ti	297.50 ± 0.54	42.86 ± 1.20	860a9e4	eGPU
Intel Arc A750	1075.94 ± 13.89	42.66 ± 0.18	c1b1876
Nvidia RTX 4050 Mobile	1154.28 ± 15.76	41.89 ± 0.10	d79d8f3
Nvidia GTX 1070	321.57 ± 0.93	41.48 ± 0.09	eec1e33
Intel Arc Pro B50	193.50 ± 0.24	39.99 ± 0.10	7b43f55
Nvidia Tesla M40	92.48 ± 0.02	39.35 ± 1.22	b8372ee
AMD Radeon RX 580	258.03 ± 0.71	39.32 ± 0.03	de4c07f
AMD Radeon RX 470	218.07 ± 0.56	38.63 ± 0.21	e288693
AMD Radeon Pro W5500	315.39 ± 3.76	36.82 ± 0.38	860a9e4
AMD Radeon RX 480	248.66 ± 0.28	34.71 ± 0.14	3b15924
Apple M2 Ultra	205.98 ± 0.02	34.34 ± 0.12	dbb852b	Asahi Linux
Nvidia GTX 980	186.24 ± 0.09	33.90 ± 0.51	860a9e4
Nvidia P106-100	183.78 ± 0.26	29.77 ± 0.04	23bc779
AMD FirePro W8100	155.22 ± 0.17	29.52 ± 0.05	4536363
Nvidia Tesla P4	265.54 ± 0.21	28.03 ± 0.14	24d2ee0
AMD Radeon RX 6500 XT	255.25 ± 0.35	27.81 ± 0.10	g9fdfcd
Apple M3	263.70 ± 0.02	26.39 ± 0.14	b9ab0a4	MoltenVK
AMD FirePro S10000	94.78 ± 0.02	25.32 ± 0.02	914a82d	Split across two GPUs
Nvidia Quadro P2000	169.55 ± 0.17	23.05 ± 0.03	63f8fe0
Intel Core Ultra 200 Series	544.95 ± 4.15	22.49 ± 0.09	cea560f
AMD Ryzen AI 9 300 Series	479.07 ± 0.41	22.41 ± 0.18	N/A
AMD Ryzen 6000 Series	240.89 ± 0.52	21.26 ± 0.08	ee09828
Apple M2 Pro	62.70 ± 0.03	20.95 ± 0.11	1fe0029	Asahi Linux
Nvidia GTX 1050 Ti	136.42 ± 0.67	20.96 ± 0.21	2f0c2db
AMD Ryzen 8000 Series	266.19 ± 1.36	20.53 ± 0.08	a5c07dc
AMD Ryzen 7000 Series	281.62 ± 1.56	19.91 ± 0.07	ebce03e
AMD Ryzen Z1 Extreme	199.36 ± 7.02	18.77 ± 0.02	53ff6b9
AMD FirePro D700	69.95 ± 0.04	16.62 ± 0.01	d3bd719	MoltenVK, running in FP16 mode on FP32 only chip
AMD Radeon Pro WX 4100	78.79 ± 0.10	16.05 ± 0.07	860a9e4
Apple M2	50.79 ± 0.16	13.50 ± 0.02	8c0d6bb	Asahi Linux
Apple M1	38.29 ± 0.00	12.47 ± 0.03	2370665	Asahi Linux
AMD Ryzen 5000 Series	90.55 ± 0.08	10.98 ± 0.07	d84635b
Intel Core 1100 Series	187.20 ± 1.78	10.39 ± 0.04	abb9f3c
AMD Radeon RX 550	52.66 ± 0.49	10.20 ± 0.01	N/A
AMD Ryzen 4000 Series	103.87 ± 0.02	9.63 ± 0.01	4b385bf
Nvidia Tesla K80	89.46 ± 0.10	9.39 ± 0.06	5d46bab	Running on single GPU
Nvidia Tesla K40	64.37 ± 0.09	9.30 ± 0.19	eec1e33
MediaTek Dimensity 9400	38.36 ± 15.15	8.92 ± 0.06	b9ab0a4	GPU supports coopmat but pp512 is faster with it turned off
Intel Core Ultra 100 Series	185.51 ± 0.22	8.21 ± 0.07	1d72c84
AMD Ryzen 3000 Series	48.63 ± 0.10	8.49 ± 0.01	1fe0029
CIX CD8180	2.80 ± 0.01	5.51 ± 0.00	4dca015
Intel Core 1000 Series	25.58 ± 0.00	4.25 ± 0.18	N/A
Intel Core 8000 Series	25.43 ± 0.17	3.35 ± 0.03	c4df49a
Intel N150	28.84 ± 0.02	2.93 ± 0.00	4f63cd7

Llama 2 7B, Q4_0, FA enabled

Chip	pp512 t/s	tg128 t/s	Commit	Comments
Nvidia RTX 5090	11796.38 ± 601.36	273.68 ± 0.52	ca71fb9	coopmat2
AMD Radeon RX 7900 XTX	3332.90 ± 11.47	195.30 ± 0.23	2f0c2db
Nvidia RTX 5080	8054.59 ± 35.68	192.17 ± 0.21	f6b533d	coopmat2
Nvidia RTX 4090	10830.41 ± 36.25	190.10 ± 0.31	4ae88d0	coopmat2
Nvidia A100	7064.40 ± 1.63	170.56 ± 0.02	2257758	coopmat2
Nvidia RTX 3090	4732.33 ± 4.80	162.28 ± 0.21	4ae88d0	coopmat2
Nvidia RTX 4080 Super	8007.37 ± 46.03	150.20 ± 0.26	81086cd	coopmat2
Nvidia RTX 3080	4913.83 ± 21.52	145.74 ± 0.16	7c7d6ce	coopmat2
Nvidia Tesla V100	1411.25 ± 2.12	142.13 ± 0.03	7d77f07
Nvidia RTX A5000	4071.22 ± 13.13	140.43 ± 0.22	4ae88d0	coopmat2
AMD Radeon RX 9070 XT	4911.74 ± 28.52	138.20 ± 0.18	e9fd8dc
Nvidia RTX 5070 Ti	6764.53 ± 11.95	135.65 ± 0.02	d13d0f6	coopmat2
AMD Radeon AI Pro R9700	4333.83 ± 29.36	130.90 ± 0.12	3191462
AMD Radeon RX 7900 XT	3043.93 ± 10.42	124.20 ± 0.09	71e74a3
AMD Radeon RX 7800 XT	2094.64 ± 14.38	119.63 ± 0.13	4fdbc1e
AMD Radeon RX 9070	3277.24 ± 18.17	119.55 ± 0.06	21c17b5
AMD Radeon RX 7900 GRE	2402.07 ± 22.50	116.77 ± 0.08	4b2a477
Apple M3 Ultra	1115.55 ± 0.75	115.99 ± 0.12	2d451c8	MoltenVK
Intel Arc Pro B70	3314.53 ± 17.95	111.63 ± 0.05	b863507
Nvidia Titan V	792.74 ± 4.30	109.21 ± 0.72	e56abd2
AMD Radeon Pro VII	783.94 ± 0.77	108.45 ± 0.48	N/A
AMD Radeon RX 6900 XT	1761.93 ± 4.75	106.15 ± 0.04	a972fae
Nvidia RTX 2080 Ti	1936.25 ± 32.08	100.99 ± 0.24	N/A
AMD Radeon RX 6800 XT	1704.79 ± 0.71	100.50 ± 0.06	N/A
AMD Radeon Pro W6800X Duo	795.28 ± 0.72	100.08 ± 0.02	N/A
Nvidia RTX 5060 Ti	3912.65 ± 5.86	97.01 ± 0.14	89f10ba	coopmat2
AMD Radeon RX 6800	1749.46 ± 3.36	96.65 ± 0.48	4b385bf
Nvidia RTX 4070	4293.57 ± 27.70	91.49 ± 0.89	9a48399	coopmat2
AMD Radeon RX 6750 XT	997.05 ± 0.45	82.29 ± 0.06	228f34c
AMD Radeon RX 6700 XT	1010.90 ± 12.89	81.86 ± 0.19	6d75883
Nvidia RTX 3060	2012.88 ± 10.12	80.59 ± 0.02	92c0b38	coopmat2
AMD Radeon Pro V620	1556.31 ± 2.82	79.24 ± 0.09	03d4698
Nvidia RTX A4000	2482.74 ± 26.05	76.07 ± 0.08	f5245b5	coopmat2
Nvidia Tesla T10	1840.14 ± 1.22	76.05 ± 0.13	7f76692	coopmat2
AMD Radeon RX 5700 XT	538.31 ± 0.35	74.43 ± 0.03	4fdbc1e
Intel Arc B580	419.49 ± 3.37	72.00 ± 0.24	7f76692
Apple M4 Max	557.46 ± 26.87	71.79 ± 4.16	1ece0cb6
AMD Radeon Pro W5700	446.98 ± 0.39	71.30 ± 0.24	23bc779
Intel Arc Pro B60	274.76 ± 0.27	70.54 ± 0.03	516a4ca
AMD Radeon RX 9060 XT	1915.41 ± 7.90	70.52 ± 0.16	ed52f36
Nvidia Tesla P100	685.51 ± 0.88	66.48 ± 0.02	eec1e33
AMD Radeon RX 6650 XT	1088.90 ± 0.40	64.53 ± 0.75	dbb852b
Nvidia GTX 1080 Ti	529.96 ± 0.38	64.63 ± 0.10	360d653
AMD BC-250	356.87 ± 1.24	63.14 ± 0.09	5886f4f
Nvidia RTX 3070 Mobile	1832.07 ± 57.14	62.92 ± 0.37	ceff6bb	coopmat2
Nvidia RTX 4060 Mobile	2358.03 ± 12.17	60.01 ± 0.08	a5c07dc	coopmat2
Nvidia Tesla P40	484.37 ± 0.27	59.22 ± 0.15	N/A
Nvidia GTX 1660 Ti Mobile	514.34 ± 0.88	57.30 ± 0.42	b43556e
AMD Radeon RX 7600 XT	1024.38 ± 7.56	56.11 ± 0.02	01d8eaa
AMD FirePro S9300 x2	243.33 ± 0.22	55.64 ± 0.06	eec1e33	Split across two GPUs
Nvidia GB10	3279.89 ± 26.78	53.64 ± 0.05	b9da444	coopmat2
AMD Radeon RX 6600	808.76 ± 0.15	53.24 ± 0.03	b1c70e2
Intel Arc A770	1119.68 ± 30.25	53.07 ± 0.09	a69d54f
AMD Ryzen AI Max+ 395	1357.07 ± 10.94	53.00 ± 0.13	7f76692
AMD Radeon RX Vega 56	428.54 ± 0.50	52.66 ± 0.03	92c0b38
Intel Arc B570	288.51 ± 0.09	50.49 ± 0.05	7f76692
Nvidia P104-100	325.30 ± 0.25	48.64 ± 0.04	eec1e33
AMD Radeon Pro V340	360.23 ± 0.74	47.54 ± 0.06	9da3dcd	Split across two GPUs
AMD Radeon RX 6800M	784.16 ± 2.76	49.06 ± 0.34	8e6f8bc
AMD Radeon RX Vega 64	320.12 ± 0.22	47.06 ± 0.01	ec428b0
Nvidia RTX A2000	1361.85 ± 3.26	45.69 ± 0.20	b1afcab	coopmat2
Intel Arc A770M	384.74 ± 0.78	45.68 ± 0.06	eeee367
Intel Arc A750	303.37 ± 1.44	43.96 ± 0.03	c1b1876
Nvidia GTX 1070 Ti	292.85 ± 0.23	43.42 ± 0.34	860a9e4	eGPU
Nvidia GTX 1070	330.84 ± 1.02	43.33 ± 0.06	360d653
Nvidia Tesla M40	93.35 ± 0.01	41.68 ± 0.01	b8372ee
Intel Arc Pro B50	132.48 ± 0.04	41.02 ± 0.04	7b43f55
AMD Radeon RX 470	197.26 ± 0.27	37.28 ± 0.11	3769fe6
AMD Radeon RX 480	194.52 ± 0.61	37.23 ± 0.09	0bcb40b
Apple M2 Ultra	198.83 ± 0.85	198.83 ± 0.85	dbb852b	Asahi Linux
Nvidia GTX 980	180.97 ± 0.74	34.16 ± 0.10	860a9e4
Nvidia P106-100	183.40 ± 0.34	30.79 ± 0.32	23bc779
AMD FirePro W8100	140.52 ± 0.34	29.28 ± 0.14	4536363
Nvidia Tesla P4	287.14 ± 0.29	28.37 ± 0.24	24d2ee0
Nvidia Quadro P2000	181.71 ± 0.12	23.77 ± 0.02	63f8fe0
Intel Core Ultra 200 Series	536.48 ± 1.27	23.05 ± 0.04	cea560f
AMD Ryzen AI 9 300 Series	532.59 ± 3.55	22.31 ± 0.06	N/A
AMD Ryzen 6000 Series	277.91 ± 0.37	21.15 ± 0.09	ee09828
Apple M2 Pro	58.86 ± 0.02	20.97 ± 0.03	1fe0029	Asahi Linux
AMD Ryzen 8000 Series	297.39 ± 1.22	20.59 ± 0.38	a5c07dc
AMD Ryzen 7000 Series	312.85 ± 2.51	20.09 ± 0.35	835b2b9
Nvidia GTX 1050 Ti	127.54 ± 1.03	20.08 ± 0.17	2f0c2db
AMD Radeon Pro WX 4100	75.59 ± 0.19	16.56 ± 0.04	860a9e4
Apple M1	35.93 ± 0.00	12.85 ± 0.02	2370665	Asahi Linux
Apple M2	46.81 ± 0.08	12.25 ± 2.30	8c0d6bb	Asahi Linux
AMD Ryzen 5000 Series	79.06 ± 0.01	10.75 ± 0.00	5d195f1
Intel Core 1100 Series	174.77 ± 4.47	10.58 ± 0.03	abb9f3c
Nvidia Tesla K40	64.37 ± 0.02	9.92 ± 0.06	eec1e33
AMD Ryzen 4000 Series	113.32 ± 0.01	9.87 ± 0.01	4b385bf
Nvidia Tesla K80	88.26 ± 0.19	9.49 ± 0.01	5d46bab	Running on single GPU
AMD Ryzen 5 3000 Series	47.41 ± 0.14	8.47 ± 0.01	1fe0029
Intel Core Ultra 100 Series	77.66 ± 2.75	7.75 ± 0.05	2e89f76
Intel Core 8000 Series	25.55 ± 0.04	3.35 ± 0.02	c4df49a
Intel N150	25.59 ± 0.00	2.91 ± 0.00	4f63cd7

How to Use These Tables

Decide whether you care more about tg128 or pp512. For chat and interactive use, tg128 usually matters more. For long prompts and batch throughput, pp512 matters more.
Match the backend you actually use. Nvidia users should usually prioritize CUDA. AMD users should compare ROCm and Vulkan first. Cross-platform users should pay close attention to Vulkan.
Check FA last. On many GPUs, enabling FA improves pp512 more than tg128, so a single headline number can be misleading.
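
To make the FA step concrete, here is a minimal Python sketch that parses a few rows copied from the tables above and compares the FA-enabled numbers against the other table. It assumes the table without its own heading in this section is the FA-disabled run of the same test; the helper names `parse_row` and `fa_ratios` are made up for illustration.

```python
# Three rows copied from each table above; cells are tab-separated.
# Assumption: the table without a heading is the FA-disabled run of the same test.
NO_FA_ROWS = [
    "Nvidia RTX 5060 Ti\t3460.92 ± 7.16\t93.51 ± 0.15\t89f10ba\tcoopmat2",
    "AMD Radeon RX 7900 XT\t2941.58 ± 17.17\t123.18 ± 0.40\t71e74a3",
    "Intel Arc B580\t620.94 ± 15.33\t70.14 ± 0.28\t7f76692",
]
FA_ROWS = [
    "Nvidia RTX 5060 Ti\t3912.65 ± 5.86\t97.01 ± 0.14\t89f10ba\tcoopmat2",
    "AMD Radeon RX 7900 XT\t3043.93 ± 10.42\t124.20 ± 0.09\t71e74a3",
    "Intel Arc B580\t419.49 ± 3.37\t72.00 ± 0.24\t7f76692",
]


def parse_row(line):
    """Return (chip, pp512 mean, tg128 mean) from one tab-separated row."""
    cells = line.split("\t")
    pp_mean = float(cells[1].split("±")[0])
    tg_mean = float(cells[2].split("±")[0])
    return cells[0], pp_mean, tg_mean


def fa_ratios(no_fa_rows, fa_rows):
    """Map chip -> (pp512 ratio, tg128 ratio), where ratio = FA / no FA."""
    base = {chip: (pp, tg) for chip, pp, tg in map(parse_row, no_fa_rows)}
    ratios = {}
    for chip, pp, tg in map(parse_row, fa_rows):
        if chip in base:
            ratios[chip] = (pp / base[chip][0], tg / base[chip][1])
    return ratios


for chip, (pp_ratio, tg_ratio) in fa_ratios(NO_FA_ROWS, FA_ROWS).items():
    print(f"{chip}: pp512 x{pp_ratio:.2f}, tg128 x{tg_ratio:.2f}")
```

On these three rows the pp512 ratio moves far more than the tg128 ratio, and on the Arc B580 row it moves downward, which is why it helps to look at both columns before drawing a conclusion about FA.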

One-Sentence Summary

In llama.cpp benchmarks, pp512, tg128, Q4_0, FA, and CUDA / ROCm / Vulkan describe different dimensions. Once the benchmark context is clear, the tables become much easier to read.

Sources

CUDA discussion #15013: https://github.com/ggml-org/llama.cpp/discussions/15013
Apple Silicon discussion #4167: https://github.com/ggml-org/llama.cpp/discussions/4167
ROCm discussion #15021: https://github.com/ggml-org/llama.cpp/discussions/15021
Vulkan discussion #10879: https://github.com/ggml-org/llama.cpp/discussions/10879

What the Common GPU Inference Benchmark Metrics Actually Mean: FA, pp512, tg128, and Q4_0

Thu, 23 Apr 2026 00:15:00 +0800

As soon as you start looking at local LLM or GPU inference benchmarks, you quickly run into a stack of abbreviations: FA, pp512, tg128, and Q4_0. They all look like performance metrics, but without context they can be surprisingly hard to interpret.

For example, you may see a line like this:

`CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)`

Then right below it, you might also see:

`pp512 t/s`
`tg128 t/s`

If you do not unpack what these terms mean, it becomes difficult to understand what the benchmark is actually measuring, or how to compare the results of two different GPUs.

This article is not about which GPU is the better buy. It is specifically about breaking down the most common metrics you see in GPU inference benchmarks.

First, what the whole title line is actually saying

A line like CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA) already tells you most of the test setup.

At minimum, it contains four layers of information:

CUDA: the benchmark is running on the NVIDIA CUDA path
Llama 2 7B: the model being tested is the 7B version of Llama 2
Q4_0: the model uses a 4-bit quantized format
no FA: Flash Attention was disabled in this test

So in practical terms, this kind of title usually means:

“A benchmark of a quantized large model running on an NVIDIA GPU, measured under a specific inference path.”
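
As a small illustration of those four layers, the sketch below splits such a title into its parts with a regular expression. The pattern and the `parse_title` helper are hypothetical and only cover titles shaped like the example; in particular, the assumption that an FA-enabled title would end in `(FA)` is mine, not something the scoreboards guarantee.

```python
import re

# Hypothetical pattern for titles shaped like the example in this article.
TITLE_RE = re.compile(
    r"^(?P<backend>\w+) Scoreboard for (?P<model>.+?), "
    r"(?P<quant>Q\w+) \((?P<fa>no FA|FA)\)$"
)


def parse_title(title):
    """Split a scoreboard title into backend, model, quantization, and FA flag."""
    match = TITLE_RE.match(title)
    if match is None:
        raise ValueError(f"unrecognized title: {title!r}")
    return {
        "backend": match["backend"],
        "model": match["model"],
        "quant": match["quant"],
        "flash_attention": match["fa"] == "FA",
    }


info = parse_title("CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)")
print(info)
```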

What FA means: Flash Attention

Here, FA stands for Flash Attention.

It is one of the most important acceleration techniques in large-model training and inference, mainly because it optimizes how attention is computed. In Transformer models, attention is already one of the most expensive and memory-bandwidth-heavy parts of the entire pipeline.

A traditional attention implementation often suffers from a few problems:

frequent memory reads and writes
many intermediate results
repeated data movement between VRAM and on-chip cache
rapidly growing overhead as context length increases

What Flash Attention does, in simple terms, is:

reorganize the computation order
reduce how often intermediate results are written back to VRAM
keep more of the work inside faster cache

That gives it three typical advantages:

it is faster
it saves memory
it is mathematically equivalent to standard attention rather than a lower-accuracy shortcut

That is why so many modern inference and training frameworks treat it as a major optimization feature.
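
The "same result, different computation order" point can be sketched in a few lines of pure Python. This is only a toy for a single query vector, not how any real Flash Attention kernel is written: it shows that visiting keys and values block by block, while keeping a running maximum and running sum, reproduces the output of the standard all-at-once softmax. The function names and the block size are illustrative assumptions.

```python
import math


def attention_reference(q, keys, values):
    """Standard attention for one query: build every score, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_tiled(q, keys, values, block=2):
    """Same output, but keys/values are consumed one small block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, 0.2, -0.3, 0.4]
keys = [[math.sin(i + j) for j in range(4)] for i in range(5)]
values = [[math.cos(i * j) for j in range(3)] for i in range(5)]
ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block=2)
print(max(abs(a - b) for a, b in zip(ref, tiled)))
```

The tiled version never holds the full list of scores at once, which is the property that lets a real kernel keep its working set in fast on-chip memory.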

What no FA means

If FA means Flash Attention, then no FA simply means that Flash Attention was not enabled for this test.

In other words, the benchmark was measured using a more traditional attention implementation.

There are several reasons benchmark tables explicitly label no FA:

to keep a baseline for comparison
to support hardware or software environments where FA is unavailable
to avoid mixing scores from different optimization conditions

So when you see no FA, you should not read it as “this GPU is weak.” A more accurate reading is:

“This score was measured without Flash Attention enabled.”

What Q4_0 means: a quantization format

Q4_0 refers to a 4-bit quantization format.

The original model weights are usually not stored at such low precision. Quantization compresses higher-precision weights into a lower-bit representation so the model becomes easier to run on consumer GPUs.

A rough way to think about it is:

Q: Quantization
4: 4-bit
_0: a specific quantization scheme identifier

Its practical importance is straightforward:

smaller model size
lower VRAM requirements
better chances of fitting on consumer hardware

So Llama 2 7B, Q4_0 does not mean just “a normal 7B model.” It means “a 7B model already compressed using a 4-bit quantization format.”
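
A rough sketch of what 4-bit block quantization does is shown below. It assumes blocks of 32 weights that share one 16-bit scale, which is how Q4_0 is commonly described; the rounding rule is a deliberate simplification and should not be read as the exact llama.cpp routine. All names are illustrative.

```python
import math

BLOCK = 32        # assumed number of weights per block
SCALE_BITS = 16   # assumed size of the shared per-block scale


def quantize_block(values):
    """Map floats to 4-bit codes in 0..15 plus one shared scale (simplified)."""
    amax = max(abs(v) for v in values)
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(0, min(15, round(v / scale) + 8)) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the shared scale."""
    return [scale * (c - 8) for c in codes]


def bits_per_weight():
    """Storage cost per weight once the shared scale is counted."""
    return (BLOCK * 4 + SCALE_BITS) / BLOCK


weights = [math.sin(i) for i in range(BLOCK)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"bits per weight: {bits_per_weight()}, worst error: {worst:.4f}")
```

Under these assumptions the effective cost is slightly above 4 bits per weight because every block also stores its scale.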

What pp512 t/s means

pp512 usually means:

Prompt Processing 512 tokens

It measures how fast the model processes the input prompt, usually in t/s, meaning tokens per second.

Here, 512 means the prompt length used in the test was 512 tokens.

This metric does not measure output speed. It measures how quickly the model encodes and computes over the input before it starts responding. You can think of it as the speed of the “reading the prompt first” stage.

One important property of this stage is that it is usually much more parallelizable.

Because the input sequence can be processed in batches, the GPU can often keep its compute units highly utilized. That is why pp512 numbers can look extremely high, sometimes almost suspiciously high at first glance.

So if you see something like:

`pp512 ≈ 14000 t/s`

there is no reason to panic. That is measuring prompt-processing throughput, not the speed of token-by-token output generation.

What tg128 t/s means

tg128 usually means:

Text Generation 128 tokens

It measures the average speed of generating 128 tokens, again in t/s.

This metric is much closer to what people intuitively mean when they ask whether a model feels fast, because it is directly measuring the output stage.

But the biggest difference from pp512 is that text generation is usually autoregressive.

That means:

the model must generate the first token
then use that to generate the second
then continue to the third

So this stage cannot be parallelized the way prompt processing can, and it is naturally much slower.

That is why it is perfectly normal to see something like:

pp512 in the thousands or tens of thousands of t/s
tg128 only in the tens or hundreds of t/s

This is not a benchmark error. These two metrics are measuring fundamentally different workloads.
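
To see how the two numbers combine into something a user would feel, here is a back-of-the-envelope sketch. It takes the RTX 5060 Ti row from the FA-enabled table in this feed (pp512 3912.65 t/s, tg128 97.01 t/s) and assumes, as a simplification, that both rates stay constant regardless of context length. The function name is illustrative.

```python
def estimate_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Return (prompt-processing seconds, generation seconds) at fixed rates."""
    prefill = prompt_tokens / pp_tps
    decode = output_tokens / tg_tps
    return prefill, decode


# Rates taken from the RTX 5060 Ti row of the FA-enabled table above.
prefill_s, decode_s = estimate_seconds(512, 128, pp_tps=3912.65, tg_tps=97.01)
print(f"prompt: {prefill_s:.3f} s, generation: {decode_s:.3f} s")
```

Even though the prompt is four times longer than the output here, generation takes roughly ten times as long, which matches the gap between the two columns.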

Why pp512 and tg128 differ so much

This is often the first thing people find confusing when reading a scoreboard.

The short explanation is:

pp512 is closer to measuring parallel throughput, while tg128 is closer to measuring token-by-token generation ability.

To expand on that:

the input stage is easier to parallelize
the output stage depends on sequential token generation
generation is usually more sensitive to memory bandwidth and cache behavior
so generation speed being much lower than prompt-processing speed is entirely normal

That also explains an interesting pattern you sometimes see in GPU comparisons:

one GPU is stronger in pp512
another ends up slightly faster in tg128

That is not contradictory. One metric leans more toward peak compute throughput, while the other reflects the actual memory and latency behavior of the generation path.
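
One common rule of thumb for the generation side is that each new token has to read roughly all of the model weights once, so memory bandwidth sets a ceiling on tg128. The sketch below applies that rule with illustrative numbers: a 7B model at about 4.5 bits per weight and a hypothetical card with 448 GB/s of bandwidth. Both figures are assumptions, and the estimate ignores KV cache reads, compute time, and other overhead.

```python
def decode_ceiling_tps(weight_bytes, bandwidth_bytes_per_s):
    """Upper bound on tokens/s if every token must read all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


weight_bytes = 7e9 * 4.5 / 8   # assumed: 7B parameters at ~4.5 bits each
bandwidth = 448e9              # assumed: hypothetical 448 GB/s card
ceiling = decode_ceiling_tps(weight_bytes, bandwidth)
print(f"bandwidth-bound ceiling: about {ceiling:.0f} t/s")
```

A measured tg128 sitting somewhat below such a ceiling is what you would expect; a card with more compute but the same bandwidth would mostly improve pp512 instead.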

How to think about t/s

Here, t/s simply means tokens per second.

It tells you how many tokens the model can process or generate in one second.

But there is one important caveat: a token is not the same thing as a character or a word. It is the unit produced by the model’s tokenizer, and its actual text length can vary a lot across models and languages.

So in practice, t/s is most useful for:

comparing different GPUs on the same model
comparing different parameter settings in the same environment
comparing a framework before and after a specific optimization is enabled

It is much less reliable as a universal “absolute speed” metric across different models, frameworks, and tokenizers.

What to focus on first when reading a scoreboard

If you do not want to get buried under abbreviations every time, start with these questions.

1. What model is being tested

For example, is it Llama 2 7B? Is it the same quantized variant, such as Q4_0? If the model or quantization format changes, direct comparison becomes much less meaningful.

2. Whether key optimizations are enabled

The most common example is FA. If one benchmark uses Flash Attention and the other does not, those scores are not directly comparable.

3. Whether the metric is measuring input speed or output speed

pp512 and tg128 are measuring different stages. One is closer to prompt-reading speed, the other is closer to answer-generation speed.

4. Whether you care about throughput or user feel

If you care more about how quickly a long prompt gets processed, pp512 matters more. If you care more about how fast the model feels while answering, tg128 is usually closer to the real experience.

A more practical way to remember all this

If you want to compress all of these into one short memory aid, you can think of them like this:

Q4_0: the model is compressed into a 4-bit quantized version
FA: whether Flash Attention is enabled
pp512: how fast the model processes a 512-token input
tg128: how fast the model generates a 128-token output
t/s: speed unit, tokens per second

Once those five points are clear, it becomes much easier to judge what a given CUDA Scoreboard is actually measuring.

Closing

GPU benchmark tables often look more complicated than they really are, not because the metrics themselves are mysterious, but because model identity, quantization, optimization flags, and different stages of throughput are all compressed into a few short abbreviations.

Once you unpack terms like FA, Q4_0, pp512, and tg128, these benchmark tables become much easier to read.

What matters is not just remembering a raw score, but knowing:

which model configuration the score came from
whether key optimizations were enabled
whether it measured input or output behavior
whether it reflects compute throughput or something closer to actual generation feel

That makes it much easier to judge what these results really mean.

A Practical Guide to Common Tensor Formats in LLMs: FP32, FP16, BF16, TF32, and FP8

Wed, 22 Apr 2026 22:40:00 +0800

As soon as you start working with large-model training, inference, or deployment, you quickly run into a familiar set of abbreviations: FP32, FP16, BF16, TF32, and FP8. They may look like small labels on a model page, but their impact is much bigger than a naming difference.

These formats determine how numbers are stored in memory and represented during computation. They directly affect training stability, inference speed, and even how large a model a given GPU can realistically handle.

So if you want to understand precision trade-offs in large models, one of the best places to start is not a benchmark chart for a specific model, but a clear picture of what these tensor formats are and why they were designed the way they are.

What tensor formats actually determine

At its core, a large model is a massive set of matrix operations over huge numbers of parameters, and the tensor format is how those numbers are stored in memory and represented during computation.

The trade-off usually revolves around three dimensions:

precision
VRAM usage
compute speed

This is actually a lot like image formats. Lossless formats preserve more detail, but take more space and load more slowly. Compressed formats discard information that is less noticeable to the eye in exchange for smaller size and faster handling. Large models can accept similar trade-offs because, across extremely large parameter sets, many tiny numerical changes do not significantly affect the final output.

That is why the model world has developed a whole family of precision formats.

How a number is represented

Before getting into the formats, it helps to remember one basic structure. A floating-point number is usually made of three parts:

sign bit: determines positive or negative
exponent bits: determine numerical range
mantissa bits: determine numerical detail

In large models, mantissa precision certainly matters, but many models are even more sensitive to insufficient numerical range, meaning too few exponent bits and a higher risk of overflow or unstable training. A lot of tensor format design is essentially about reallocating a limited number of bits between range and detail.
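
The three-part structure can be inspected directly. The sketch below uses Python's standard `struct` module to split an FP32 value into its fields, then applies the usual IEEE-style formula to show how the exponent width drives the largest finite value. The helper names are made up for illustration.

```python
import struct


def fp32_fields(x):
    """Return (sign, exponent field, mantissa field) of x stored as FP32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


def max_finite(exp_bits, man_bits):
    """Largest finite value of an IEEE-style format with these field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias


print(fp32_fields(1.0))
print(fp32_fields(-2.5))

# (exponent bits, mantissa bits) for three common formats
for name, (e, m) in {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}.items():
    print(f"{name}: max finite about {max_finite(e, m):.4g}")
```

The two 16-bit formats differ by dozens of orders of magnitude in range purely because of how their bits are split between exponent and mantissa.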

The diagram below gives a quick overall view:

FP32: the most stable, but expensive

FP32 is the traditional single-precision floating-point format. It uses 32 bits in total, or 4 bytes.

Its strengths are straightforward:

wide numerical range
high precision
the most stable training behavior

But the downside is just as clear: it consumes a lot of VRAM.

A very rough estimate is:

`VRAM usage ≈ parameter count × bytes per parameter`

If a 27B model stores weights entirely in FP32, the weights alone take roughly:

`27B × 4 bytes ≈ 108GB`

And that still does not include activations, KV cache, optimizer state, or other runtime overhead. So in modern large-model training and inference, FP32 is less the default choice than the stable reference baseline that other formats are compared against.
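
The rough estimate above is easy to turn into a helper. This sketch covers weights only, uses decimal gigabytes, and the names are illustrative.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1}


def weight_gb(param_count, fmt):
    """Weights-only memory in GB; excludes activations, KV cache, optimizer state."""
    return param_count * BYTES_PER_PARAM[fmt] / 1e9


for fmt in BYTES_PER_PARAM:
    print(f"27B in {fmt}: about {weight_gb(27e9, fmt):.0f} GB")
```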

FP16: half the size, but less stable

FP16 compresses each parameter to 2 bytes, cutting memory usage roughly in half compared with FP32.

For the same 27B model, if you only look at weight size:

`27B × 2 bytes ≈ 54GB`

That already explains why many deployment guides place a 27B model around the 50GB VRAM range.

The advantages of FP16 are obvious:

much lower VRAM pressure
higher throughput
widely used in early mixed-precision training

Its weakness is the relatively small exponent range. In large-model training, that makes overflow more likely and often requires extra techniques such as loss scaling, which adds engineering complexity.
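
Both failure modes, and the idea behind loss scaling, can be reproduced with Python's standard `struct` module, which supports IEEE half precision through the `e` format code. This is a toy illustration, not a training recipe: the scale factor is an arbitrary assumption, and note that `struct` raises an error on overflow where GPU hardware would typically produce infinity.

```python
import struct


def to_fp16(x):
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Overflow: FP16 cannot represent values much above 65504.
try:
    to_fp16(70000.0)
    overflowed = False
except OverflowError:
    overflowed = True

# Underflow: a very small gradient silently becomes zero.
tiny_grad = 1e-8
lost = to_fp16(tiny_grad)

# Loss scaling: multiply before storing, divide after reading back.
SCALE = 65536.0  # arbitrary power of two chosen for this example
recovered = to_fp16(tiny_grad * SCALE) / SCALE

print(overflowed, lost, recovered)
```

The scaled value survives the trip through FP16 while the unscaled one does not, which is the whole reason the extra bookkeeping exists.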

So FP16 is still common, but in many scenarios it is no longer the most comfortable option.

BF16: a more practical half precision for the large-model era

BF16 also uses 2 bytes, but it makes a different trade-off from FP16.

It keeps a much larger exponent range, making its dynamic range closer to FP32, while giving up some mantissa precision. That trade-off works especially well for large models, because they are often more sensitive to range than to losing a few mantissa bits.

That is why many training frameworks, many large-model papers, and many real deployment setups prefer BF16.

A simple way to think about it is:

VRAM cost close to FP16
stability closer to FP32
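
Because BF16 has the same sign and exponent layout as FP32, a BF16 value is essentially the top 16 bits of an FP32 value. The sketch below emulates that conversion with round-to-nearest-even, using only the standard library. It skips NaN and infinity handling, and the helper name is illustrative.

```python
import struct


def to_bf16(x):
    """Emulate FP32 -> BF16 -> FP32 by keeping the top 16 bits (rounded)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round to nearest, ties to even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(to_bf16(70000.0))  # beyond the FP16 range, still representable here
print(to_bf16(1.001))    # the small offset is lost to the short mantissa
```

The first value would overflow in FP16 but only loses a little detail in BF16, while the second shows the price: fine differences near 1.0 disappear.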

If one 27B deployment guide asks for roughly 50GB of VRAM while another optimized one gets closer to 30GB, the former often still lives in the FP16/BF16 layer, while the latter has usually moved further toward lower precision or quantization.

TF32: not about saving VRAM, but about accelerating FP32 workflows

TF32 is easy to mistake for yet another memory-saving format, but its role is different.

Roughly speaking, TF32 keeps the same 8-bit exponent as FP32, so the range is unchanged, while shortening the mantissa to 10 bits, the same as FP16.

But it is important to note that TF32 is more like an internal computation format used on the Tensor Core path, rather than something primarily used to store weights like FP16 or BF16.

It is mainly a computation mode NVIDIA provides on newer GPUs. The goal is not to reduce VRAM usage, but to make originally FP32-based training workflows run faster without requiring major code changes.

Its role can be summarized in two points:

externally it still looks like an FP32 workflow
internally it performs faster approximate matrix math

So TF32 mainly solves the problem that FP32 is too slow, not that FP32 uses too much memory. If your question is why the same model can have very different VRAM requirements, TF32 is not the main answer.

FP8: further compression, but much more demanding engineering

Going one step further leads to FP8. It compresses each value to 8 bits, or 1 byte, reducing memory bandwidth and storage cost even more.

It usually appears not as one single format, but as two common variants: E4M3, with 4 exponent bits and 3 mantissa bits, and E5M2, with 5 exponent bits and 2 mantissa bits.

But FP8 comes with an obvious cost: once the bit count gets that low, it becomes very hard to preserve both range and precision at the same time. In practice, different variants are often used for different stages to balance forward passes, backward passes, and gradients.
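
The range-versus-precision squeeze is easy to quantify. The sketch below computes the largest finite value and the relative step size for both variants. It assumes the widely used FP8 convention in which E5M2 follows IEEE-style rules while E4M3 reserves only a single top bit pattern for NaN; other E4M3 definitions exist and would give a different maximum. The function names are illustrative.

```python
def e5m2_max():
    """E5M2 treated as IEEE-style: top exponent code reserved for inf/NaN."""
    bias = 15
    return (2 - 2.0 ** -2) * 2.0 ** bias


def e4m3_max():
    """E4M3 under the assumed convention: only the all-ones pattern is NaN."""
    bias = 7
    return (1 + 6 / 8) * 2.0 ** (15 - bias)


def relative_step(man_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -man_bits


print(f"E5M2: max {e5m2_max():.0f}, step near 1.0 = {relative_step(2)}")
print(f"E4M3: max {e4m3_max():.0f}, step near 1.0 = {relative_step(3)}")
```

One variant buys range and the other buys detail, which is consistent with using them for different stages of training.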

This format family represents a more aggressive strategy:

give up more precision
gain lower storage cost and higher throughput
rely on more mature hardware and frameworks

It has a lot of potential, but for most users, the main practical dividing lines are still FP32, FP16, and BF16.

Why understanding these formats matters

Many people first treat these abbreviations as implementation details on a download page. In practice, though, they change how you think about both training and deployment.

For example, they help explain:

why some training setups care so much about numerical stability
why some inference stacks emphasize quantization and low precision first
why models with similar parameter counts can still have very different deployment requirements
why some formats are better for storing weights while others make more sense as compute paths

If you keep unpacking those questions, they usually lead back to the same issue: how you choose to trade off precision, range, memory, and speed.

That is why understanding FP32, FP16, BF16, TF32, and FP8 is not just about decoding a glossary. It is about understanding what is really being exchanged when you read a training config, choose an inference engine, or compare deployment options.

A practical mental model

If you do not want to memorize all the details right away, it helps to remember them in this order:

FP32: most stable, most expensive
FP16: lower VRAM use, but smaller range
BF16: similar VRAM cost to FP16, but more suitable stability for large models
TF32: mainly solves slow FP32, not VRAM usage
FP8: a more aggressive compression and acceleration route

After that, when you see fp16, bf16, or fp8 on a model download page, or when different deployment guides give wildly different VRAM thresholds, it no longer looks like a difference in wording. Those labels reflect very different precision budgets and engineering choices.

Closing

Tensor formats in large models may look like a discussion about bit widths, but underneath they are really a discussion about engineering trade-offs.

FP32, FP16, BF16, TF32, and FP8 are not simply better or worse than one another. Each one sits at a different point on the trade-off curve between stability, range, precision, memory, and speed.

Once you understand that layer clearly, it becomes much easier to read training papers, tune inference settings, and compare deployment strategies with the right mental model.

A 16GB GPU Can Still Run 35B Models: VRAM Compression Strategies for MoE Models in LM Studio

Wed, 22 Apr 2026 21:47:34 +0800

Many people think of 16GB VRAM as the point where local LLM deployment more or less tops out at 12B to 14B models, and anything larger becomes too painful even with quantization. That view is understandable, but it is not the true ceiling of a 16GB GPU.

If your model choice and parameter setup are good enough, a 16GB GPU does not have to stay limited to “small-parameter” models. One representative approach is to use MoE models inside LM Studio with a sensible offloading strategy, so that 35B-class models can still run at a genuinely usable speed.

01 Why a 16GB GPU is not necessarily limited to 12B to 14B

The core idea is straightforward: VRAM size matters, but model architecture matters just as much.

If you try to cram a standard dense model into a 16GB GPU, you will hit the wall quickly. These models usually involve all parameters during inference, so VRAM pressure and bandwidth pressure rise immediately.

But MoE models are different. Their total parameter count can be large, while only a subset of the expert parameters is activated for each token. Take a 35B-class model as an example: although the total parameter count is high, the number of parameters participating in each inference step is much smaller. All of the weights still have to be stored somewhere, but experts that are idle on a given step do not need to sit in VRAM, so the real VRAM requirement is not as extreme as many people assume.
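
A quick calculation shows the gap between what a MoE model stores and what it touches per token. The numbers are illustrative assumptions: 35B total parameters, about 3B active per token (reading the "A3B" suffix that way), and roughly 4.5 bits per weight for a 4-bit quantization.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB for a given parameter count and bit width."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


BITS = 4.5  # assumed effective bits per weight for a 4-bit quantization

total_gb = weights_gb(35, BITS)   # everything that must be stored somewhere
active_gb = weights_gb(3, BITS)   # roughly what one token actually reads

print(f"stored: about {total_gb:.1f} GB, read per token: about {active_gb:.1f} GB")
```

Under these assumptions the full set of weights is still larger than 16GB, which is why part of it has to live outside VRAM, while the per-token working set is small enough that doing so remains tolerable.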

That is exactly why a 16GB GPU still leaves some room to work with.

02 Key practical takeaway: 35B MoE models can run surprisingly fast

One representative case is a quantized MoE model such as Qwen 3.5 35B A3B. With a 16GB GPU and the right settings in LM Studio, Q6 quantization can reach something above 30 tokens/s, and Q4 can sometimes test even higher.

That result matters not just because the model “runs,” but because the speed is already in a clearly usable range.

As a comparison, large models of a similar scale that are not MoE often run into VRAM overflow and sharply lower speed on a 16GB GPU. In other words, the outcome is not determined by parameter count alone. What matters is how those parameters are actually used during inference.

03 In LM Studio, the key is not just one parameter

If you want this kind of model to run smoothly on a 16GB GPU, the real trick is not luck. It is tuning two parameters correctly:

GPU Offload
the setting that forces part of the expert layers into CPU memory

The first one is easy to understand. GPU Offload is basically something you push as high as possible, so the model prioritizes GPU computation.

The second one is the real key here. It is not the traditional “borrow system memory after VRAM overflows” approach. Instead, it proactively places part of the expert layers into CPU memory to reduce VRAM usage in advance. Since MoE models do not activate every expert on every step anyway, moving some experts into system memory does not hurt overall inference speed as much as many people would expect.

A safer way to tune it is to start within a range and then adjust gradually for your machine:

start with a value for this setting somewhere between 20 and 35
then fine-tune based on VRAM usage and memory pressure

At its core, this method is using system memory to buy back VRAM headroom.

04 It can still run at 128K context, and smaller contexts reduce VRAM further

Another interesting point is that even with the context length pushed to 128K, a 35B-class MoE model can still maintain a relatively high speed.

That tells us something important: the bottleneck of a 16GB GPU is not as rigid as many people imagine. Especially inside a local inference tool like LM Studio, the real question is often not simply “can it run or not,” but rather:

are you willing to trade more system memory for less VRAM usage
are you willing to shorten the context length
are you willing to accept different capability tradeoffs across quantization levels

If the context is reduced further from 128K to 64K or 32K, VRAM pressure can drop even more. That means some 35B-class MoE models may even run, barely, on GPUs with less VRAM, though speed and memory pressure will need to be rebalanced.
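To see why context length moves VRAM so directly, here is a minimal sketch of the usual KV-cache size formula for standard attention. The layer count, KV-head count, and head dimension are hypothetical placeholders, not the configuration of any specific model, and models that use sliding-window or linear-attention layers will need less than this.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV-cache size in GiB for full attention (2 bytes per value = FP16)."""
    # The leading 2 accounts for storing both keys and values.
    total = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total / 2**30


# Hypothetical architecture, chosen only to make the scaling visible.
HYPOTHETICAL = dict(n_layers=40, n_kv_heads=4, head_dim=128)

if __name__ == "__main__":
    for ctx in (131072, 65536, 32768):
        print(f"{ctx:>6} tokens: {kv_cache_gib(context_tokens=ctx, **HYPOTHETICAL):.1f} GiB")
```

With these placeholder values the cache is 10 GiB at 128K, 5 GiB at 64K, and 2.5 GiB at 32K: the cost scales linearly with context length, so halving the context halves this part of the VRAM bill.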

05 The cost of this approach: much higher demands on RAM and virtual memory

This kind of setup is not free performance.

The more VRAM you free up this way, the more system RAM usage rises, and virtual memory pressure rises with it. In other words, you are not removing the cost. You are shifting pressure from the GPU to RAM and disk swap space.

So if you want to try it yourself, it is worth checking a few things first:

whether your system RAM is large enough
whether your virtual memory allocation is large enough
whether too many background applications are already consuming resources

If those conditions are not in place, what you may get is not “35B running fast,” but a machine that is slow across the board.

06 More aggressive quantization is not always better

There is another practical tradeoff here. Lower-bit quantization often saves more VRAM, but that does not automatically make it the best choice.

The practical takeaway is that some models do run faster under Q4, but their original capability can also degrade more. By comparison, Q6 tends to strike a better balance between speed and capability retention. So the right choice depends on what you care about more:

maximum speed and fitting into VRAM
or preserving more of the model’s original capability

Those two priorities do not necessarily lead to the same quantization choice.

07 What kinds of models are worth trying

From this angle, the goal is not to blindly chase bigger parameter counts, but to look first for models that fit this strategy:

models built on MoE architecture
models that are well supported in LM Studio and have complete quantized variants
models with clear advantages in long context or instruction following

And the idea does not stop at one 35B MoE model. It also extends naturally to other directions, such as experimental models with stronger long-context memory, better instruction following, or lighter quantized versions with strong speed performance.

The logic behind this is very consistent: first find models whose architecture fits the “trade memory for VRAM” strategy, and then talk about tuning. Do not start from parameter count alone and decide from there.

08 Short conclusion

If you happen to have a 16GB GPU and assume local LLMs stop at 12B to 14B, that assumption is worth updating.

A more accurate way to put it is:

a 16GB GPU is not automatically ruled out for larger models
dense models and MoE models need to be considered separately
GPU Offload and expert-layer transfer to CPU memory inside LM Studio can significantly change VRAM usage
in practice, you are trading higher memory pressure for larger model scale and better usable speed

This approach will not fit every machine, but it does show one important thing: in local LLM deployment, VRAM is not the only limit. Model architecture and inference configuration matter just as much.

12V-2x6 vs. 12VHPWR: Notes on GPU 16-Pin Power Connector Differences

Sun, 19 Apr 2026 23:21:17 +0800

Among recent high-end GPUs, the power connector that gets discussed most often is probably 12VHPWR and the newer 12V-2x6. Both look like 16-pin connectors, using a 12 + 4 layout, but they are not exactly the same interface.

In simple terms, 12V-2x6 can be understood as a revision of the earlier 12VHPWR design under ATX 3.1 and PCIe CEM 5.1. It keeps the high-power output capability, but uses a more conservative design for insertion detection and terminal structure. The goal is to reduce the risk of the connector continuing to carry load when it is not fully seated.

01 Cable Differences Are Small

The first question many people care about is whether 12V-2x6 and 12VHPWR modular cables can be used interchangeably.

Looking only at the cable itself, the difference is usually not large. The real change is mainly in the board-side connector, that is, the socket on the GPU or on a modular PSU. Both newer 12V-2x6 modular cables and older 12VHPWR modular cables are still intended for 16-pin GPU power delivery.

So compatibility should not be judged only by cable length, wire gauge, or appearance. The GPU-side and PSU-side socket specification, terminal quality, and the power-supply vendor’s official compatibility statement matter more.

02 Key Mechanical Changes

The point of 12V-2x6 is not to completely change the outer shape of the connector, but to adjust the pin structure.

Its 12 main power pins are longer and make contact earlier, while the 4 SENSE signal pins are shorter and make contact later. The logic is straightforward: only when the connector is inserted deeply enough should the SENSE pins conduct correctly, allowing the GPU to identify the intended power capability.

This change targets a typical problem exposed by early 12VHPWR connectors: the plug may look inserted, but may not actually be fully seated. Under high load, insufficient contact can generate heat, and in severe cases may burn the plug or socket.

03 More Conservative SENSE Logic

SENSE0	SENSE1	Initial Power (Power Up)	Max Sustained Power
Ground	Ground	375 W	600 W
Open	Ground	225 W	450 W
Ground	Open	150 W	300 W
Short	Short	100 W	150 W
Open	Open	0 W	0 W

The safety improvement in 12V-2x6 centers on the SENSE logic.

In the newer definition, if SENSE0 and SENSE1 are in the Open floating state, the GPU will not power up normally or will not enter the corresponding high-power input state. In other words, when the connector is not seated properly, the system is more inclined to prevent operation instead of letting the GPU keep drawing power.

This is more conservative than early 12VHPWR. In older designs, even if the SENSE state was not ideal, some cases could still allow a certain level of power input. For high-power GPUs, that tolerance can become a risk.

Shortening the SENSE pins is essentially a way to make “fully inserted” a stricter prerequisite.
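The SENSE table in this section can be read as a simple lookup. The sketch below encodes only the rows listed there. The function name and the choice to treat any unlisted combination as "no power" are illustrative assumptions, not part of the specification text quoted here.

```python
# (SENSE0, SENSE1) -> (initial power in W, max sustained power in W),
# copied from the table in this section.
SENSE_POWER_W = {
    ("ground", "ground"): (375, 600),
    ("open", "ground"): (225, 450),
    ("ground", "open"): (150, 300),
    ("short", "short"): (100, 150),
    ("open", "open"): (0, 0),
}


def allowed_power(sense0: str, sense1: str) -> tuple[int, int]:
    """Return (initial_w, max_sustained_w) for a SENSE pin state."""
    key = (sense0.lower(), sense1.lower())
    # Assumption: anything not in the table is handled like an unseated plug.
    return SENSE_POWER_W.get(key, (0, 0))


if __name__ == "__main__":
    print(allowed_power("Ground", "Ground"))
    print(allowed_power("Open", "Open"))
```

The design point is the last row: when both SENSE pins float, which is what a partly inserted plug looks like once the pins are shortened, the permitted power is zero.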

04 What H++ Means

Newer 12V-2x6 connectors often carry an H++ mark. It indicates that the connector terminals support 9.2A or higher current capability, distinguishing them from earlier 12VHPWR connectors marked H+.

It is worth noting that H++ does not mean the connector’s power limit rises beyond 600W. Whether new or old, the common upper limit for this 16-pin GPU power scheme is still 600W. H++ is better understood as terminal-specification and connector-version identification, not simply “higher wattage.”
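A quick worked calculation shows why the terminal rating and the 600W ceiling are separate questions. It assumes the load is shared evenly across the six +12 V pins (the other six main pins are ground), which is an idealization: a poorly seated plug can push more current through fewer contacts.

```python
RAIL_VOLTAGE = 12.0          # volts
POWER_PINS_12V = 6           # six +12 V pins; the other six main pins are ground
H_PLUS_PLUS_RATING_A = 9.2   # per-terminal rating quoted for H++


def amps_per_pin(watts: float) -> float:
    """Per-pin current if the load is shared perfectly evenly (idealized)."""
    return watts / RAIL_VOLTAGE / POWER_PINS_12V


if __name__ == "__main__":
    per_pin = amps_per_pin(600)
    margin = H_PLUS_PLUS_RATING_A / per_pin
    print(f"{per_pin:.2f} A per pin at 600 W, rating margin x{margin:.2f}")
```

At 600W each pin carries about 8.3A against a 9.2A rating, a margin of roughly 10 percent. That is why the mark describes terminal capability rather than a higher wattage class, and why uneven contact matters so much.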

05 What It Means for PC Building

For everyday PC building, the biggest value of 12V-2x6 is reducing insertion-related risk, but it is not a magic shield.

When using this kind of connector, it is still worth paying attention to a few things:

Fully insert the plug; do not rely only on whether it “looks inserted.”
Avoid bending the cable sharply right next to the GPU connector.
Do not let the side panel force pressure onto the cable.
Prefer original, custom, or adapter cables explicitly supported by the PSU or GPU vendor.
Avoid cheap adapters of unknown origin on high-power GPUs.

If the case is tight, a 90-degree L-shaped cable or vendor-certified custom cable can reduce bending pressure. Still, terminal quality, wire gauge, and vendor certification matter more than appearance.

06 Quick Summary

12V-2x6 is not a connector that is “basically the same as 12VHPWR because it looks the same.” Its real changes are inside the connector structure and detection logic.

You can think of it this way:

The cable form is similar, but board-side connector and terminal design are more important.
The main power pins are longer, while the SENSE pins are shorter.
When the connector is not fully seated, the newer design is more likely to prevent the GPU from entering a working state.
The H++ mark identifies terminals with higher current capability.
The common GPU power limit is still 600W.

If you are building a system with a high-power GPU, 12V-2x6 is indeed more reassuring than early 12VHPWR. But the final safety still depends on whether the plug is fully seated, cable quality, PSU design, and case cable-management space. A better connector standard does not make careless installation safe.

Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings

Sun, 19 Apr 2026 00:18:00 +0800

When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?

This note summarizes how Ollama behaves with multiple GPUs. The short version:

Ollama supports multiple GPUs.
The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.
By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.
If a model does not fit on one GPU, Ollama can spread it across available GPUs.
Mixed GPU models may be visible to Ollama, but performance and placement may not be ideal.
SLI / NVLink is not required for multi-GPU use.
To limit which GPUs Ollama can use, use CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, or GGML_VK_VISIBLE_DEVICES.

Official Behavior: Single GPU First, Multi-GPU When Needed

Ollama’s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.

The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.

So do not think of Ollama multi-GPU as “more cards automatically means several times faster.” A more accurate model is:

Small model fits on one GPU: usually runs on one GPU.
Large model does not fit on one GPU: split across multiple GPUs.
Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.
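The three cases can be written down as a simplified mental model. This is a sketch of the behavior described in the FAQ, not Ollama's actual scheduler code: the function name is hypothetical, and the real logic also has to account for things like KV cache, per-GPU overhead, and layer granularity.

```python
def place_model(required_gib: float, free_gib_per_gpu: list[float]) -> str:
    """Simplified placement decision based on estimated VRAM need."""
    if any(free >= required_gib for free in free_gib_per_gpu):
        return "single GPU"            # fits entirely on one card
    if sum(free_gib_per_gpu) >= required_gib:
        return "spread across GPUs"    # fits only in combined VRAM
    return "partial CPU offload"       # remainder falls back to system memory


if __name__ == "__main__":
    print(place_model(8, [24, 8]))
    print(place_model(40, [24, 24]))
    print(place_model(60, [24, 24]))
```

Note that the second GPU changes nothing in the first case. It only starts to matter once the model stops fitting on one card.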

Use this command to see where the model is loaded:

`ollama ps`

The PROCESSOR column may show something like:

100% GPU
48%/52% CPU/GPU
100% CPU

If you see 48%/52% CPU/GPU, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.
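If you want to check this from a script, the PROCESSOR value is easy to parse. The sketch below assumes the column uses exactly the three shapes shown above. The function name is hypothetical, and the format may differ between Ollama versions.

```python
import re


def parse_processor(value: str) -> dict[str, int]:
    """Turn a PROCESSOR cell into {'CPU': percent, 'GPU': percent}."""
    value = value.strip()
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", value)
    if single:
        pct, device = int(single.group(1)), single.group(2)
        other = "GPU" if device == "CPU" else "CPU"
        return {device: pct, other: 100 - pct}
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", value)
    if split:
        return {"CPU": int(split.group(1)), "GPU": int(split.group(2))}
    raise ValueError(f"unrecognized PROCESSOR value: {value!r}")


if __name__ == "__main__":
    for cell in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
        print(cell, "->", parse_processor(cell))
```

Any non-zero CPU share is the signal to look at VRAM before looking at anything else.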

Multi-GPU Is Not Simple Compute Stacking

Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.

So multi-GPU benefits usually fall into two categories:

VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.
Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.

If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama’s default “use one GPU when it fits” strategy avoids that unnecessary PCIe cost.

SLI or NVLink Is Not Required

Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.

NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.

What you should pay attention to is PCIe bandwidth. The difference between x1, x4, x8, and x16 affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.

Safer rules:

Prefer x16 / x8 over mining-style x1 risers.
PCIe bandwidth matters more when switching large models frequently.
If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.
For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.
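To put rough numbers on the load-time difference, here is a best-case estimate. The per-lane figures are the theoretical PCIe 3.0 and 4.0 rates after encoding overhead. Real transfers are slower, and disk read speed often dominates, so treat this as a lower bound, not a prediction.

```python
# Approximate theoretical throughput per lane in GB/s (after encoding overhead).
GBPS_PER_LANE = {"3.0": 0.985, "4.0": 1.969}


def load_seconds(model_gb: float, pcie_gen: str, lanes: int) -> float:
    """Best-case time to push model_gb over the link into VRAM."""
    return model_gb / (GBPS_PER_LANE[pcie_gen] * lanes)


if __name__ == "__main__":
    for gen, lanes in (("3.0", 1), ("3.0", 4), ("4.0", 8), ("4.0", 16)):
        print(f"PCIe {gen} x{lanes}: {load_seconds(20, gen, lanes):.1f} s for a 20 GB model")
```

A 20 GB model needs about 20 seconds over a PCIe 3.0 x1 riser even in the ideal case, versus well under a second over PCIe 4.0 x16. That gap is invisible if the model stays loaded and painful if you switch models often.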

Limit Which NVIDIA GPUs Ollama Uses

On NVIDIA multi-GPU systems, use CUDA_VISIBLE_DEVICES to control which GPUs Ollama can see.

Temporary run:

`CUDA_VISIBLE_DEVICES=0,1 ollama serve`

Use only the second GPU:

`CUDA_VISIBLE_DEVICES=1 ollama serve`

Force Ollama not to use NVIDIA GPUs:

`CUDA_VISIBLE_DEVICES=-1 ollama serve`

The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:

`nvidia-smi -L`

Example output:

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)

Then specify the UUID:

`CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve`
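On machines with several cards, copying UUIDs by hand gets error-prone. The sketch below builds the CUDA_VISIBLE_DEVICES value from `nvidia-smi -L` output. It assumes the line format shown in the example output, uses that example (placeholder UUIDs included) as sample input, and the helper names are hypothetical.

```python
import re

# Sample input: the example output from this section, placeholder UUIDs included.
SAMPLE = """\
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)
"""

LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[^)]+)\)$")


def parse_gpus(text: str) -> list[dict]:
    """Parse `nvidia-smi -L` style lines into index/name/uuid records."""
    gpus = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            gpus.append({"index": int(match.group(1)),
                         "name": match.group(2),
                         "uuid": match.group(3)})
    return gpus


def visible_devices(text: str, name_contains: str) -> str:
    """Comma-separated UUIDs of the GPUs whose name contains the given text."""
    return ",".join(g["uuid"] for g in parse_gpus(text) if name_contains in g["name"])


if __name__ == "__main__":
    print("CUDA_VISIBLE_DEVICES=" + visible_devices(SAMPLE, "3090"))
```

In real use you would feed it the output of `nvidia-smi -L` (for example via `subprocess.run`) instead of the sample string.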

If Ollama is installed as a Linux systemd service, put the variable into the service environment:

`sudo systemctl edit ollama.service`

Add:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"

Reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

AMD and Vulkan Device Selection

For AMD ROCm, use ROCR_VISIBLE_DEVICES to control visible GPUs:

`ROCR_VISIBLE_DEVICES=0,1 ollama serve`

To force Ollama not to use ROCm GPUs, use an invalid ID:

`ROCR_VISIBLE_DEVICES=-1 ollama serve`

Ollama’s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use GGML_VK_VISIBLE_DEVICES:

`OLLAMA_VULKAN=1 GGML_VK_VISIBLE_DEVICES=0 ollama serve`

If Vulkan devices cause problems, disable them:

`GGML_VK_VISIBLE_DEVICES=-1 ollama serve`

AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as HSA_OVERRIDE_GFX_VERSION. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.

Exposing Multiple GPUs in Docker

If you run Ollama in Docker, NVIDIA setups usually require nvidia-container-toolkit, then --gpus to expose devices.

Expose all GPUs:

docker run -d \
  --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

Expose specific GPUs:

docker run -d \
  --gpus '"device=0,1"' \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

You can also combine this with environment variables:

docker run -d \
  --gpus=all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

If nvidia-smi cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.

What Is `OLLAMA_SCHED_SPREAD`

In some multi-GPU configuration discussions, you may see OLLAMA_SCHED_SPREAD=1 or OLLAMA_SCHED_SPREAD=true. It is a scheduler setting: instead of keeping a model on one GPU whenever it fits, it asks Ollama to spread the model across all visible GPUs.

Example:

`OLLAMA_SCHED_SPREAD=1 ollama serve`

Or with systemd:

[Service]
Environment="OLLAMA_SCHED_SPREAD=true"

But it is not a magic switch. Enabling it does not imply linear token/s scaling, and it may still run into OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The core FAQ behavior still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, then multi-GPU splitting becomes useful.

Treat OLLAMA_SCHED_SPREAD as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on ollama ps, logs, and nvidia-smi.

How to Check Whether Multiple GPUs Are Being Used

Useful commands:

`ollama ps`

`watch -n 0.5 nvidia-smi`

View the Ollama service logs:

`journalctl -u ollama -f`

If using Docker:

`docker logs -f ollama`

Watch for:

Whether Ollama discovers compatible GPUs.
Whether the model shows 100% GPU or a CPU/GPU split.
Whether each GPU has VRAM allocated.
Whether VRAM grows on multiple GPUs during model loading.
Whether generation token/s improves compared with CPU/RAM spillover.
Whether OOM or model unloading happens frequently.

GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.
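Since utilization is unreliable, per-GPU memory is the better signal. The sketch below parses the CSV produced by `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` and reports which GPUs hold a meaningful amount of data. The sample numbers are made up for illustration, and the 1024 MiB threshold and function name are assumptions.

```python
# Made-up sample in the CSV shape nvidia-smi prints for the query above
# (index, memory used in MiB, total memory in MiB).
SAMPLE_CSV = """\
0, 21504, 24576
1, 300, 8192
"""


def gpus_holding_model(csv_text: str, min_used_mib: int = 1024) -> list[int]:
    """Indices of GPUs whose used memory is at or above the threshold."""
    indices = []
    for line in csv_text.strip().splitlines():
        index, used, _total = (field.strip() for field in line.split(","))
        if int(used) >= min_used_mib:
            indices.append(int(index))
    return indices


if __name__ == "__main__":
    print(gpus_holding_model(SAMPLE_CSV))
```

If only one index comes back while you expected a split, the model most likely fit on a single GPU and the default placement kept it there.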

Common Misunderstandings

Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU

Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the “does not fit” problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.

Misunderstanding 2: Different GPU Models Cannot Be Mixed

Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, smaller VRAM, and PCIe topology. The most predictable setup is still same model, same VRAM size, and well-supported same-generation drivers.

Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU

Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.

Misunderstanding 4: NVLink / SLI Is Required

No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.

Misunderstanding 5: Adding a GPU Does Not Require Restarting Services

Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.

GPU Selection Suggestions

For Ollama local inference, the rough priority is:

Larger single-GPU VRAM is usually easier to manage.
Identical GPUs are easier to troubleshoot than mixed GPUs.
More complete PCIe lanes make large-model loading smoother.
Older cards should be checked for CUDA compute capability or ROCm support first.
Multi-GPU power, cooling, and chassis airflow must be planned ahead.

For budget second-hand platforms:

Dual RTX 3090 remains a common high-VRAM option.
Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all need trade-offs.
Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.
Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.

Summary

Ollama multi-GPU support is best understood as “VRAM expansion first, performance acceleration second.” If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.

In practice, use ollama ps to check where the model is loaded, then use nvidia-smi or ROCm tools to observe VRAM allocation. For GPU selection, use CUDA_VISIBLE_DEVICES on NVIDIA, ROCR_VISIBLE_DEVICES on AMD ROCm, and GGML_VK_VISIBLE_DEVICES for Vulkan. If running in Docker, first make sure the container can see the GPUs.

Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.

References

Ollama FAQ: How does Ollama load models on multiple GPUs?: https://github.com/ollama/ollama/blob/main/docs/faq.mdx
Ollama GPU docs: Hardware support / GPU Selection: https://github.com/ollama/ollama/blob/main/docs/gpu.mdx
Ollama Docker Hub: https://hub.docker.com/r/ollama/ollama
NVIDIA Container Toolkit: https://github.com/NVIDIA/nvidia-container-toolkit

How to Check Whether an Ollama Model Is Loaded on GPU

Mon, 06 Apr 2026 10:15:18 +0800

If you want to confirm whether an Ollama model is actually running on the GPU, the most direct way is to check the processor allocation of the currently loaded models.

Command

`ollama ps`

Example Output

NAME        ID            SIZE    PROCESSOR   UNTIL
llama3:70b  bcfb190ca3a7  42 GB   100% GPU    4 minutes from now

How to Read the `PROCESSOR` Column

100% GPU: The model is fully loaded into GPU VRAM.
100% CPU: The model is fully loaded in system memory (no GPU inference).
48%/52% CPU/GPU: The model is split between system memory and GPU VRAM.
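For a quick automated check, the whole table can be parsed by splitting on runs of two or more spaces. This is a sketch under that assumption about the column layout, using the example output above as sample input. The helper names are hypothetical, and the columns may differ between Ollama versions.

```python
import re

# Sample input: the example output from this section.
SAMPLE = """\
NAME        ID            SIZE    PROCESSOR   UNTIL
llama3:70b  bcfb190ca3a7  42 GB   100% GPU    4 minutes from now
"""


def loaded_models(ps_output: str) -> list[dict]:
    """Parse `ollama ps` style output into one dict per loaded model."""
    lines = [line for line in ps_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = re.split(r"\s{2,}", lines[0].strip())
    rows = []
    for line in lines[1:]:
        fields = re.split(r"\s{2,}", line.strip(), maxsplit=len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows


def not_fully_on_gpu(ps_output: str) -> list[str]:
    """Names of models whose PROCESSOR value is anything other than 100% GPU."""
    return [row["NAME"] for row in loaded_models(ps_output)
            if row.get("PROCESSOR") != "100% GPU"]


if __name__ == "__main__":
    print(loaded_models(SAMPLE))
    print(not_fully_on_gpu(SAMPLE))
```

An empty result from `not_fully_on_gpu` means every loaded model is entirely in VRAM.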

Practical Tips

If you expect GPU usage but see 100% CPU, first check GPU drivers, CUDA/ROCm environment, and Ollama runtime settings.
With larger models and limited VRAM, CPU/GPU mixed loading is common.
For performance troubleshooting, run ollama ps before checking speed metrics to locate bottlenecks faster.

Summary

ollama ps is the first step to verify real GPU usage. Focus on the PROCESSOR column to quickly identify where the model is loaded and decide your next optimization action.
