<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>GPU on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/gpu/</link>
        <description>Recent content in GPU on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Mon, 27 Apr 2026 08:51:10 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/gpu/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>How to Pick a GPU in April 2026: Which Models to Avoid and Which Ones Are More Worth Considering</title>
        <link>https://knightli.com/en/2026/04/27/gpu-buying-guide-april-2026-model-picks/</link>
        <pubDate>Mon, 27 Apr 2026 08:51:10 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/27/gpu-buying-guide-april-2026-model-picks/</guid>
        <description>&lt;p&gt;If you are getting ready to build a PC, the GPU is the one part where you really should not look only at whether a card is new. By April 2026, some models are already much harder to justify, while others are not perfect but still feel noticeably more reasonable than the alternatives around the same price.&lt;/p&gt;
&lt;p&gt;So this article skips theory and goes straight to specific models.&lt;/p&gt;
&lt;h2 id=&#34;models-i-would-not-prioritize&#34;&gt;Models I Would Not Prioritize
&lt;/h2&gt;&lt;h2 id=&#34;1-rtx-5060-ti-8gb&#34;&gt;1. &lt;code&gt;RTX 5060 Ti 8GB&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;The biggest issue with this card is not that it is unusable. The issue is that &lt;code&gt;8GB&lt;/code&gt; already feels caught in an awkward middle ground at this point.&lt;/p&gt;
&lt;p&gt;If you mostly play lighter online games at &lt;code&gt;1080p&lt;/code&gt; medium to high settings, it can still do the job. But once you move into any of these areas, the limitation shows up quickly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Newer AAA games&lt;/li&gt;
&lt;li&gt;Higher texture settings&lt;/li&gt;
&lt;li&gt;&lt;code&gt;1440p&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Mixed use with AI inference, editing, or productivity work&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are already looking at the &lt;code&gt;RTX 5060 Ti&lt;/code&gt;, the safer move is usually to go straight to the &lt;code&gt;16GB&lt;/code&gt; version instead of saving a bit of budget by taking the &lt;code&gt;8GB&lt;/code&gt; one.&lt;/p&gt;
&lt;p&gt;In short:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;RTX 5060 Ti 8GB&lt;/code&gt;: not recommended&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RTX 5060 Ti 16GB&lt;/code&gt;: clearly more worth considering&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;2-expensive-older-cards-especially-rtx-3080-10gb-and-rtx-3070-ti-when-they-are-still-priced-high&#34;&gt;2. Expensive older cards, especially &lt;code&gt;RTX 3080 10GB&lt;/code&gt; and &lt;code&gt;RTX 3070 Ti&lt;/code&gt; when they are still priced high
&lt;/h2&gt;&lt;p&gt;The problem with these cards is not that performance is completely bad. The problem is that, in today&amp;rsquo;s market, buying them often puts you in an awkward spot:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Power draw is not low&lt;/li&gt;
&lt;li&gt;They are no longer new&lt;/li&gt;
&lt;li&gt;VRAM is not especially generous&lt;/li&gt;
&lt;li&gt;Used-market sources are often messy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;RTX 3080 10GB&lt;/code&gt; is the clearest example. If it is still priced high, it quickly turns into a card that looks strong on paper but feels less balanced in real use.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;RTX 3070 Ti&lt;/code&gt; follows the same logic. It is not absolutely unbuyable, but if the price gap is not meaningful, you are usually better off looking at something newer, something with more comfortable VRAM, or something more balanced in power and thermals.&lt;/p&gt;
&lt;h2 id=&#34;3-older-flagships-with-unclear-history-such-as-rtx-3090-and-rtx-3080-ti&#34;&gt;3. Older flagships with unclear history, such as &lt;code&gt;RTX 3090&lt;/code&gt; and &lt;code&gt;RTX 3080 Ti&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;These two cards are easy to want for obvious reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The names still sound strong&lt;/li&gt;
&lt;li&gt;Paper performance is not weak&lt;/li&gt;
&lt;li&gt;They are very visible in the used market&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What you really need to watch out for is where they came from.&lt;/p&gt;
&lt;p&gt;If you are buying:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A pulled card&lt;/li&gt;
&lt;li&gt;A repaired card&lt;/li&gt;
&lt;li&gt;A used card with unclear history&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then the risk is usually much higher than with a normal retail card. A card like the &lt;code&gt;RTX 3090&lt;/code&gt; looks attractive because of the &lt;code&gt;24GB&lt;/code&gt; VRAM, but heat, power delivery, silicon condition, and past usage history all become bigger worries than they would be on a straightforward new card.&lt;/p&gt;
&lt;p&gt;If you do not already know exactly what you are buying, and you are not planning to spend time checking the card carefully, these older flagships are generally not something I would touch casually.&lt;/p&gt;
&lt;h2 id=&#34;4-rtx-5070-when-the-price-is-not-right&#34;&gt;4. &lt;code&gt;RTX 5070&lt;/code&gt; when the price is not right
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;RTX 5070&lt;/code&gt; is not a card that is automatically bad. The catch is that the price has to make sense.&lt;/p&gt;
&lt;p&gt;Its awkwardness shows up when the gap between it and the &lt;code&gt;RTX 5070 Ti&lt;/code&gt; is not large enough. In that case, a lot of buyers end up feeling oddly unsatisfied.&lt;/p&gt;
&lt;p&gt;The pattern usually looks like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Buy the &lt;code&gt;5070&lt;/code&gt;: you keep thinking a little more would have gotten you the &lt;code&gt;5070 Ti&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Do not stretch the budget: you still know you bought the &amp;ldquo;almost&amp;rdquo; card&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;RTX 5070&lt;/code&gt; is not something to ignore entirely, but it is &lt;strong&gt;worth considering only when the price is clearly right&lt;/strong&gt;. If the pricing sits in an uncomfortable middle zone, it quickly becomes a card that makes theoretical sense but does not feel great in practice.&lt;/p&gt;
&lt;h2 id=&#34;models-that-make-more-sense&#34;&gt;Models That Make More Sense
&lt;/h2&gt;&lt;h2 id=&#34;1-rtx-5060-ti-16gb&#34;&gt;1. &lt;code&gt;RTX 5060 Ti 16GB&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;If you are already shopping in the midrange, this card is usually the safer choice compared with the &lt;code&gt;8GB&lt;/code&gt; version.&lt;/p&gt;
&lt;p&gt;The reasons are simple:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;More headroom within the same product family&lt;/li&gt;
&lt;li&gt;Less likely to be boxed in by VRAM over the next few years&lt;/li&gt;
&lt;li&gt;Easier to live with if you mix gaming and productivity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It may not be the most explosive card at its price, but it is at least the kind of card you are less likely to regret immediately.&lt;/p&gt;
&lt;h2 id=&#34;2-rtx-5070-ti&#34;&gt;2. &lt;code&gt;RTX 5070 Ti&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;If your budget can stretch, this is usually a more complete answer than the &lt;code&gt;RTX 5070&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Its value is not that it dominates every single scenario. Its value is that it feels more like a card that can balance gaming, resolution, and longer-term use all at once.&lt;/p&gt;
&lt;p&gt;It makes sense for people who:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Want &lt;code&gt;1440p&lt;/code&gt; high settings&lt;/li&gt;
&lt;li&gt;Want the system to last for years&lt;/li&gt;
&lt;li&gt;Do not want to start thinking about upgrades too soon&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are already stuck between the &lt;code&gt;5070&lt;/code&gt; and &lt;code&gt;5070 Ti&lt;/code&gt;, and the gap is not absurdly large, going straight to the &lt;code&gt;5070 Ti&lt;/code&gt; is often the less annoying decision.&lt;/p&gt;
&lt;h2 id=&#34;3-properly-priced-new-cards-are-usually-a-better-first-stop-than-older-high-end-cards&#34;&gt;3. Properly priced new cards are usually a better first stop than older high-end cards
&lt;/h2&gt;&lt;p&gt;If you are not a veteran used-GPU hunter, a simple and effective rule is this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prioritize normal retail new cards&lt;/li&gt;
&lt;li&gt;Be cautious with older high-end cards that have messy origins&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At this point, the more practical approach is often:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Midrange budget: start with &lt;code&gt;RTX 5060 Ti 16GB&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A tier higher: focus on &lt;code&gt;RTX 5070 Ti&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Consider &lt;code&gt;RTX 5070&lt;/code&gt; only when pricing is clearly favorable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is usually a better path than gambling on older cards that sound stronger but come with more baggage.&lt;/p&gt;
&lt;h2 id=&#34;if-you-just-want-the-short-version&#34;&gt;If You Just Want the Short Version
&lt;/h2&gt;&lt;p&gt;You can remember it like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Not really recommended: &lt;code&gt;RTX 5060 Ti 8GB&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Not recommended unless priced well: &lt;code&gt;RTX 5070&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Be cautious with: &lt;code&gt;RTX 3080 10GB&lt;/code&gt;, &lt;code&gt;RTX 3070 Ti&lt;/code&gt;, and unclear-source &lt;code&gt;RTX 3090&lt;/code&gt; / &lt;code&gt;RTX 3080 Ti&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;More worth considering: &lt;code&gt;RTX 5060 Ti 16GB&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Easier long-term pick if budget allows: &lt;code&gt;RTX 5070 Ti&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;final-line&#34;&gt;Final Line
&lt;/h2&gt;&lt;p&gt;At this point in the market, the real mistake is usually not spending a bit more. It is &lt;strong&gt;buying a card that looks acceptable on paper but always feels just a little compromised in real use&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If you want to minimize regret, &lt;code&gt;RTX 5060 Ti 16GB&lt;/code&gt; and &lt;code&gt;RTX 5070 Ti&lt;/code&gt; are generally safer than many cards that seem &amp;ldquo;good enough,&amp;rdquo; while &lt;code&gt;RTX 5060 Ti 8GB&lt;/code&gt;, badly priced &lt;code&gt;RTX 5070&lt;/code&gt;, and older high-end cards with unclear history are usually the first ones to cross off.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Ubuntu 26.04 LTS GPU and Hardware Updates: CUDA, ROCm, DPC&#43;&#43;, and More Platform Changes</title>
        <link>https://knightli.com/en/2026/04/26/ubuntu-26-04-lts-gpu-hardware-ai-updates/</link>
        <pubDate>Sun, 26 Apr 2026 19:35:57 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/26/ubuntu-26-04-lts-gpu-hardware-ai-updates/</guid>
        <description>&lt;p&gt;If the previous article worked as a desktop-focused overview of &lt;code&gt;Ubuntu 26.04 LTS&lt;/code&gt;, this one is better read as its hardware and compute-side follow-up. In this &lt;code&gt;26.04&lt;/code&gt; cycle, Ubuntu pushed a number of AI, GPU computing, and platform compatibility changes into the main archive or formal support scope.&lt;/p&gt;
&lt;p&gt;The short version is this: the most important part of this round is not just desktop and kernel upgrades, but that &lt;strong&gt;Ubuntu is bringing Intel, NVIDIA, and AMD GPU computing stacks into the distribution in a more systematic way&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id=&#34;1-intel-dpc-and-related-components-are-now-in-ubuntu-archive&#34;&gt;1. Intel DPC++ and related components are now in Ubuntu Archive
&lt;/h2&gt;&lt;p&gt;Starting with &lt;code&gt;26.04&lt;/code&gt;, Intel&amp;rsquo;s open-source &lt;code&gt;oneAPI DPC++&lt;/code&gt; compiler is available directly from Ubuntu Archive for building &lt;code&gt;SYCL&lt;/code&gt; code. Its runtime also includes adapters for Intel GPUs.&lt;/p&gt;
&lt;p&gt;Two related components are also now available from Ubuntu repositories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;oneDPL&lt;/code&gt;, the DPC++ library, which provides higher-productivity developer APIs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;oneDNN&lt;/code&gt;, built with &lt;code&gt;dpclang-6&lt;/code&gt;, which can run on Intel GPUs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That means if you are already working with &lt;code&gt;SYCL&lt;/code&gt;, heterogeneous computing, or AI workloads on Intel GPUs, Ubuntu now offers a more direct path instead of forcing you to maintain a separate external stack for everything.&lt;/p&gt;
&lt;p&gt;Ubuntu also calls out one practical requirement: users need to be in the &lt;code&gt;render&lt;/code&gt; group to actually use these Intel GPU-related capabilities.&lt;/p&gt;
&lt;h2 id=&#34;2-the-nvidia-cuda-toolkit-can-now-be-installed-directly-with-apt&#34;&gt;2. The NVIDIA CUDA toolkit can now be installed directly with &lt;code&gt;apt&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;For many developers and operators, this may be one of the most immediately useful changes in the notes.&lt;/p&gt;
&lt;p&gt;Starting with &lt;code&gt;26.04&lt;/code&gt;, the &lt;code&gt;NVIDIA CUDA toolkit&lt;/code&gt; can now be installed directly from Ubuntu Archive:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt install cuda-toolkit
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The value here is bigger than just saving a few setup steps.&lt;/p&gt;
&lt;p&gt;For developers shipping software on Ubuntu, this new model means they can simply declare a dependency on the &lt;code&gt;CUDA runtime&lt;/code&gt;, while Ubuntu manages installation and compatibility at the distribution level. That makes CUDA feel more like a native system capability on Ubuntu, rather than an extra software layer that always has to be maintained separately.&lt;/p&gt;
&lt;h2 id=&#34;3-amd-rocm-710-is-now-in-universe&#34;&gt;3. AMD ROCm 7.1.0 is now in Universe
&lt;/h2&gt;&lt;p&gt;On the AMD side, Ubuntu Universe now includes &lt;code&gt;ROCm 7.1.0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;These libraries mainly provide:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;backend infrastructure for AI training and inference on AMD GPUs&lt;/li&gt;
&lt;li&gt;software foundations for machine learning and high performance computing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Canonical also notes that ROCm-related components are continuously tested in its CI/CD pipeline. Beyond &lt;code&gt;autopkgtests&lt;/code&gt;, that includes several user-space applications such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;llama.cpp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pytorch&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Blender&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Lemonade Server&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That detail matters, because it shows Ubuntu is not just dropping packages into the archive. It is validating ROCm as a maintainable software stack.&lt;/p&gt;
&lt;h2 id=&#34;4-the-bigger-story-is-that-all-three-gpu-ecosystems-are-landing&#34;&gt;4. The bigger story is that all three GPU ecosystems are landing
&lt;/h2&gt;&lt;p&gt;It becomes easier to see the direction of &lt;code&gt;26.04&lt;/code&gt; when &lt;code&gt;DPC++&lt;/code&gt;, &lt;code&gt;CUDA&lt;/code&gt;, and &lt;code&gt;ROCm&lt;/code&gt; are viewed together:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Intel: bringing &lt;code&gt;SYCL&lt;/code&gt; / &lt;code&gt;oneAPI&lt;/code&gt; components into official repositories&lt;/li&gt;
&lt;li&gt;NVIDIA: giving the &lt;code&gt;CUDA toolkit&lt;/code&gt; a distribution-managed installation path&lt;/li&gt;
&lt;li&gt;AMD: shipping &lt;code&gt;ROCm 7.1.0&lt;/code&gt; in Universe with ongoing testing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you work with these kinds of workloads on Ubuntu, this release will probably feel more relevant:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;local LLM inference&lt;/li&gt;
&lt;li&gt;GPU-accelerated training or fine-tuning&lt;/li&gt;
&lt;li&gt;Blender, scientific computing, and HPC&lt;/li&gt;
&lt;li&gt;development environments that need to move across different GPU platforms&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, Ubuntu is no longer just &amp;ldquo;a system where you can install a GPU driver.&amp;rdquo; It is starting to carry a fuller &lt;strong&gt;user-space software stack for AI and GPU computing&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id=&#34;5-nvidia-dynamic-boost-is-enabled-by-default&#34;&gt;5. NVIDIA Dynamic Boost is enabled by default
&lt;/h2&gt;&lt;p&gt;Since &lt;code&gt;25.04&lt;/code&gt;, &lt;code&gt;Dynamic Boost&lt;/code&gt; has been enabled by default on supported NVIDIA laptops.&lt;/p&gt;
&lt;p&gt;The idea is straightforward: depending on system load, power can be shifted dynamically between the CPU and GPU. In gaming scenarios, that usually means giving more power to the GPU when needed to extract more performance.&lt;/p&gt;
&lt;p&gt;It only applies under two conditions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the laptop is connected to AC power&lt;/li&gt;
&lt;li&gt;the GPU load is high enough&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It does not engage while the system is running on battery.&lt;/p&gt;
&lt;h2 id=&#34;6-support-for-new-intel-integrated-and-discrete-gpus-keeps-moving-forward&#34;&gt;6. Support for new Intel integrated and discrete GPUs keeps moving forward
&lt;/h2&gt;&lt;p&gt;Ubuntu also continues expanding support for new Intel GPUs, including:&lt;/p&gt;
&lt;p&gt;Integrated:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Intel Core Ultra Xe2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Intel Core Ultra Xe3&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Discrete:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Intel Arc 5 B570&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Intel Arc 5 B580&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Intel Arc Pro B50&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Intel Arc Pro B60&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Intel Arc Pro B65&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Intel Arc Pro B70&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Ubuntu also highlights several features already available around these devices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;improved GPU and CPU ray tracing performance through Intel Embree, benefiting applications such as &lt;code&gt;Blender 4.2+&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;hardware video encoding for &lt;code&gt;AVC&lt;/code&gt;, &lt;code&gt;JPEG&lt;/code&gt;, &lt;code&gt;HEVC&lt;/code&gt;, and &lt;code&gt;AV1&lt;/code&gt; on &amp;ldquo;Battlemage&amp;rdquo; devices&lt;/li&gt;
&lt;li&gt;a new &lt;code&gt;CCS&lt;/code&gt; optimization in Intel Compute Runtime&lt;/li&gt;
&lt;li&gt;enabled debugging support for Intel Xe GPUs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are watching follow-up releases, &lt;code&gt;25.10&lt;/code&gt; also continues to bring in more capabilities, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;initial support for Intel&amp;rsquo;s next-generation client platform codenamed &lt;code&gt;Panther Lake&lt;/code&gt; through &lt;code&gt;Linux kernel 6.17&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;improved IOMMU, PCIe subsystem, and multi-GPU support&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Mesa 25.2.3&lt;/code&gt; enabling &lt;code&gt;VK_KHR_shader_bfloat16&lt;/code&gt; for Battlemage and Panther Lake&lt;/li&gt;
&lt;li&gt;&lt;code&gt;intel-media-driver 25.3.0&lt;/code&gt; adding Panther Lake decode support and &lt;code&gt;VP9&lt;/code&gt; encoding&lt;/li&gt;
&lt;li&gt;&lt;code&gt;intel-compute-runtime 25.31&lt;/code&gt; adjusting the Level Zero &lt;code&gt;USM&lt;/code&gt; pool and local device memory event allocation behavior&lt;/li&gt;
&lt;li&gt;&lt;code&gt;level-zero 1.24&lt;/code&gt; and &lt;code&gt;level-zero-raytracing 1.1.0&lt;/code&gt; bringing broader spec and RTAS extension support&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;7-suspend-and-resume-is-more-stable-on-nvidia-desktops-too&#34;&gt;7. Suspend and resume is more stable on Nvidia desktops too
&lt;/h2&gt;&lt;p&gt;Starting with &lt;code&gt;25.10&lt;/code&gt;, Ubuntu enables suspend-resume support in the proprietary &lt;code&gt;Nvidia&lt;/code&gt; driver to reduce corruption and freezing when waking a desktop system.&lt;/p&gt;
&lt;p&gt;This is not the most visible kind of change, but it matters a lot in everyday use, especially on desktops that stay on for long periods and frequently suspend and resume.&lt;/p&gt;
&lt;h2 id=&#34;8-arm-raspberry-pi-risc-v-and-ibm-z-also-get-harder-platform-level-changes&#34;&gt;8. ARM, Raspberry Pi, RISC-V, and IBM Z also get harder platform-level changes
&lt;/h2&gt;&lt;p&gt;Beyond the GPU software stack, the release notes also include several platform-level changes worth calling out separately.&lt;/p&gt;
&lt;h3 id=&#34;arm64-desktop-platforms&#34;&gt;ARM64 desktop platforms
&lt;/h3&gt;&lt;p&gt;Starting with &lt;code&gt;25.10&lt;/code&gt;, the &lt;code&gt;ARM64&lt;/code&gt; &lt;code&gt;linux-generic&lt;/code&gt; kernel provides broader desktop compatibility for ARM64 desktop platforms that boot through &lt;code&gt;UEFI&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id=&#34;a-new-raspberry-pi-boot-layout&#34;&gt;A new Raspberry Pi boot layout
&lt;/h3&gt;&lt;p&gt;One change introduced in &lt;code&gt;25.10&lt;/code&gt; and refined in &lt;code&gt;26.04&lt;/code&gt; is a new boot partition layout for Raspberry Pi systems.&lt;/p&gt;
&lt;p&gt;Its goal is to improve boot reliability: newly written boot assets are first &amp;ldquo;tested&amp;rdquo; before they are committed as the new &amp;ldquo;known good&amp;rdquo; set.&lt;/p&gt;
&lt;p&gt;The firmware date requirements are the part most users will want to remember:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Pi 3 / 3+ / CM3+ / Zero 2W&lt;/code&gt;: no additional action required, the boot firmware is in the image itself&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Pi 4 / 400 / CM4&lt;/code&gt;: boot firmware must be dated no earlier than &lt;code&gt;2022-11-25&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Pi 5 / 500 / CM5&lt;/code&gt;: boot firmware must be dated no earlier than &lt;code&gt;2025-02-11&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can check it with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo rpi-eeprom-update
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the firmware is too old and you are using &lt;code&gt;Ubuntu 24.04 LTS&lt;/code&gt; or newer, you can update it like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo rpi-eeprom-update -a
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo reboot
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h3 id=&#34;raspberry-pi-desktop-images-now-use-desktop-minimal&#34;&gt;Raspberry Pi desktop images now use desktop-minimal
&lt;/h3&gt;&lt;p&gt;Since &lt;code&gt;25.10&lt;/code&gt;, Ubuntu Desktop images for Raspberry Pi are based on &lt;code&gt;desktop-minimal&lt;/code&gt; rather than the full &lt;code&gt;desktop&lt;/code&gt; seed.&lt;/p&gt;
&lt;p&gt;Ubuntu gives a very concrete benefit here: the default app set is smaller, saving about &lt;code&gt;777MB&lt;/code&gt; on the uncompressed image and on installed systems.&lt;/p&gt;
&lt;p&gt;If you want to remove that default app set in bulk after upgrading, you can use:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo apt purge ubuntu-desktop --autoremove
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you want to keep some of those applications, just mark them as manually installed with &lt;code&gt;apt&lt;/code&gt; first.&lt;/p&gt;
&lt;h3 id=&#34;swap-on-raspberry-pi-is-now-handled-by-cloud-init&#34;&gt;Swap on Raspberry Pi is now handled by cloud-init
&lt;/h3&gt;&lt;p&gt;Since &lt;code&gt;25.10&lt;/code&gt;, swap file creation on Raspberry Pi desktop images is handled by &lt;code&gt;cloud-init&lt;/code&gt;.&lt;br&gt;
If you want to customize swap size before first boot, you can edit &lt;code&gt;user-data&lt;/code&gt; on the boot partition directly.&lt;/p&gt;
&lt;h3 id=&#34;risc-v-requirements-have-moved-up&#34;&gt;RISC-V requirements have moved up
&lt;/h3&gt;&lt;p&gt;Starting with &lt;code&gt;25.10&lt;/code&gt;, the &lt;code&gt;RISC-V&lt;/code&gt; build of &lt;code&gt;Ubuntu 26.04 LTS&lt;/code&gt; requires hardware that implements the &lt;code&gt;RVA23S64 ISA profile&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Systems that do not meet that requirement can no longer run &lt;code&gt;Ubuntu 26.04 LTS&lt;/code&gt;. If you still have boards based on earlier &lt;code&gt;RVA20&lt;/code&gt; processor cores, you need to stay on the support line provided by &lt;code&gt;Ubuntu 24.04 LTS&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;According to Ubuntu, as of &lt;code&gt;April 2026&lt;/code&gt;, there is still no real &lt;code&gt;RVA23S64&lt;/code&gt; hardware available. So the only currently supported platform is effectively a &lt;code&gt;QEMU&lt;/code&gt; virtualized environment configured with &lt;code&gt;-cpu rva23s64&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id=&#34;ibm-z-now-requires-z15-at-minimum&#34;&gt;IBM Z now requires z15 at minimum
&lt;/h3&gt;&lt;p&gt;Starting with &lt;code&gt;26.04&lt;/code&gt;, the minimum requirement for the &lt;code&gt;s390x&lt;/code&gt; architecture has moved up to &lt;code&gt;z15&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;z14&lt;/code&gt; / &lt;code&gt;LinuxONE II&lt;/code&gt; and older systems can no longer install &lt;code&gt;Ubuntu 26.04 LTS&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;z15&lt;/code&gt; / &lt;code&gt;LinuxONE III&lt;/code&gt; and newer systems should see better performance&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;9-who-should-read-this-first&#34;&gt;9. Who should read this first
&lt;/h2&gt;&lt;p&gt;This article is more useful than the desktop overview if you fall into any of these cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;you use Ubuntu for &lt;code&gt;CUDA&lt;/code&gt;, &lt;code&gt;ROCm&lt;/code&gt;, &lt;code&gt;SYCL&lt;/code&gt;, or local AI inference&lt;/li&gt;
&lt;li&gt;you do development or compute work on Intel, NVIDIA, or AMD GPUs&lt;/li&gt;
&lt;li&gt;you maintain Raspberry Pi, ARM64, RISC-V, IBM Z, or other non-standard x86 platforms&lt;/li&gt;
&lt;li&gt;you are especially sensitive to repository availability, driver behavior, runtimes, and platform requirements after an upgrade&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;10-one-line-takeaway&#34;&gt;10. One-line takeaway
&lt;/h2&gt;&lt;p&gt;The key point of &lt;code&gt;Ubuntu 26.04 LTS&lt;/code&gt; on the hardware and AI stack side is not that one GPU vendor got a standout upgrade. It is that &lt;strong&gt;Intel&amp;rsquo;s DPC++, NVIDIA&amp;rsquo;s CUDA, and AMD&amp;rsquo;s ROCm are all entering the Ubuntu ecosystem in a more official, in-repository, and maintainable way&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If you used to think of Ubuntu as &amp;ldquo;the system first, then I assemble the GPU environment myself,&amp;rdquo; &lt;code&gt;26.04&lt;/code&gt; starts to look more like a distribution that is willing to actively carry AI and heterogeneous computing workloads.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>How to Fix Ollama Using CPU Instead of GPU</title>
        <link>https://knightli.com/en/2026/04/24/fix-ollama-using-cpu-instead-of-gpu/</link>
        <pubDate>Fri, 24 Apr 2026 18:30:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/24/fix-ollama-using-cpu-instead-of-gpu/</guid>
        <description>&lt;p&gt;When running local LLMs, one of the most frustrating problems is this: your machine clearly has a GPU, yet &lt;code&gt;Ollama&lt;/code&gt; still leans heavily on the &lt;code&gt;CPU&lt;/code&gt;, and performance is painfully slow.&lt;/p&gt;
&lt;p&gt;The short version is that this is usually not caused by one single issue. The most common causes are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Ollama&lt;/code&gt; is not detecting any usable GPU&lt;/li&gt;
&lt;li&gt;The driver, &lt;code&gt;ROCm&lt;/code&gt;, or &lt;code&gt;CUDA&lt;/code&gt; environment is not set up correctly&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Ollama&lt;/code&gt; service was started without the right environment variables&lt;/li&gt;
&lt;li&gt;The model is too large and has fallen back to &lt;code&gt;CPU&lt;/code&gt; or mixed &lt;code&gt;CPU/GPU&lt;/code&gt; loading&lt;/li&gt;
&lt;li&gt;On AMD platforms, there may be extra compatibility issues such as &lt;code&gt;ROCm&lt;/code&gt; version mismatch, &lt;code&gt;gfx&lt;/code&gt; settings, or device visibility problems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The fastest way to troubleshoot it is to go through the checks below in order.&lt;/p&gt;
&lt;h2 id=&#34;1-first-confirm-whether-ollama-is-really-not-using-the-gpu&#34;&gt;1. First, confirm whether Ollama is really not using the GPU
&lt;/h2&gt;&lt;p&gt;The most direct check is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Focus on the &lt;code&gt;PROCESSOR&lt;/code&gt; column.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;100% GPU&lt;/code&gt;: the model is fully running on the GPU&lt;/li&gt;
&lt;li&gt;&lt;code&gt;100% CPU&lt;/code&gt;: the GPU is not being used at all&lt;/li&gt;
&lt;li&gt;Results like &lt;code&gt;48%/52% CPU/GPU&lt;/code&gt;: part of the model is in VRAM, and part has spilled into system memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you see &lt;code&gt;100% CPU&lt;/code&gt;, the next step is to focus on environment and service configuration.&lt;br&gt;
If you see mixed loading, that does not necessarily mean the GPU is broken. In many cases, it simply means VRAM is not enough.&lt;/p&gt;
&lt;h2 id=&#34;2-rule-out-the-most-common-misunderstanding-first-the-model-does-not-fit-into-vram&#34;&gt;2. Rule out the most common misunderstanding first: the model does not fit into VRAM
&lt;/h2&gt;&lt;p&gt;Many people assume that once a GPU is installed, &lt;code&gt;Ollama&lt;/code&gt; will always run fully on it. That is not how it works.&lt;/p&gt;
&lt;p&gt;If the model is too large, the context is too long, or some other loaded model is already occupying VRAM, &lt;code&gt;Ollama&lt;/code&gt; may fall back to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partial GPU + partial CPU&lt;/li&gt;
&lt;li&gt;Full &lt;code&gt;100% CPU&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At this point, the two simplest tests are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Try a smaller model first&lt;br&gt;
For example, test with a &lt;code&gt;4B&lt;/code&gt; or &lt;code&gt;7B&lt;/code&gt; model before jumping straight to much larger ones.&lt;/li&gt;
&lt;li&gt;Unload other active models and test again&lt;br&gt;
Run &lt;code&gt;ollama ps&lt;/code&gt; first and make sure nothing else is occupying VRAM.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If smaller models use the GPU but larger ones do not, the real problem is usually VRAM capacity rather than the driver.&lt;/p&gt;
&lt;h2 id=&#34;3-check-whether-the-gpu-driver-and-the-lower-level-runtime-are-actually-working&#34;&gt;3. Check whether the GPU driver and the lower-level runtime are actually working
&lt;/h2&gt;&lt;p&gt;If even small models run only on &lt;code&gt;CPU&lt;/code&gt;, the next step is to check the underlying environment.&lt;/p&gt;
&lt;h3 id=&#34;nvidia&#34;&gt;NVIDIA
&lt;/h3&gt;&lt;p&gt;First confirm that the driver is working and the system can see the GPU. A common check is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If this already fails, &lt;code&gt;Ollama&lt;/code&gt; is very unlikely to use the GPU correctly.&lt;/p&gt;
&lt;h3 id=&#34;amd--rocm&#34;&gt;AMD / ROCm
&lt;/h3&gt;&lt;p&gt;If you are using an &lt;code&gt;AMD GPU&lt;/code&gt;, especially with &lt;code&gt;ROCm&lt;/code&gt;, start with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;rocminfo
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;rocm-smi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If these tools cannot list the device properly, the problem is still below &lt;code&gt;Ollama&lt;/code&gt;, so there is no point debugging the application layer yet.&lt;/p&gt;
&lt;p&gt;On AMD, the most common issue is not simply “is the driver installed,” but rather:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;ROCm&lt;/code&gt; version does not match the OS version&lt;/li&gt;
&lt;li&gt;The current GPU architecture has incomplete support&lt;/li&gt;
&lt;li&gt;The device exists, but the runtime is not being exposed correctly to &lt;code&gt;Ollama&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;4-restart-the-ollama-service-not-just-your-terminal&#34;&gt;4. Restart the Ollama service, not just your terminal
&lt;/h2&gt;&lt;p&gt;This is a very common trap.&lt;/p&gt;
&lt;p&gt;Many people install drivers, change environment variables, fix &lt;code&gt;ROCm&lt;/code&gt;, then just open a new terminal and continue with &lt;code&gt;ollama run&lt;/code&gt;. But if &lt;code&gt;Ollama&lt;/code&gt; is running as a background service, it may still be using the old environment.&lt;/p&gt;
&lt;p&gt;So the safer approach is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fully restart the &lt;code&gt;Ollama&lt;/code&gt; service&lt;/li&gt;
&lt;li&gt;Reboot the machine if necessary&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are running it as a service on Linux, make sure the service process was actually restarted instead of reusing the old one.&lt;/p&gt;
&lt;h2 id=&#34;5-check-whether-the-environment-variables-are-really-reaching-the-service&#34;&gt;5. Check whether the environment variables are really reaching the service
&lt;/h2&gt;&lt;p&gt;This matters especially on &lt;code&gt;AMD ROCm&lt;/code&gt; systems.&lt;/p&gt;
&lt;p&gt;Some machines work fine when commands are run manually in a shell, but the &lt;code&gt;Ollama&lt;/code&gt; service still uses only &lt;code&gt;CPU&lt;/code&gt;. In that case, the usual reason is that the service process never received the variables you set in your shell.&lt;/p&gt;
&lt;p&gt;Common variables to look at include:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ROCR_VISIBLE_DEVICES
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;HSA_OVERRIDE_GFX_VERSION
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Specifically:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; limits or selects which GPUs &lt;code&gt;ROCm&lt;/code&gt; can see&lt;/li&gt;
&lt;li&gt;&lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt; is often used as a compatibility workaround on some AMD platforms&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only &lt;code&gt;export&lt;/code&gt; these variables in the current terminal, but &lt;code&gt;Ollama&lt;/code&gt; is started by systemd, a desktop background service, or another daemon, they may not take effect.&lt;/p&gt;
&lt;p&gt;In other words, “it looks set in my terminal” does not mean &lt;code&gt;Ollama&lt;/code&gt; is actually using it.&lt;/p&gt;
&lt;h2 id=&#34;6-on-amd-platforms-focus-on-rocm-compatibility&#34;&gt;6. On AMD platforms, focus on ROCm compatibility
&lt;/h2&gt;&lt;p&gt;Based on the public page metadata, the original video for this topic is tied to &lt;code&gt;AMD Max+ 395&lt;/code&gt;, &lt;code&gt;strix halo&lt;/code&gt;, and &lt;code&gt;AMD ROCm&lt;/code&gt;.&lt;br&gt;
In setups like these, &lt;code&gt;Ollama&lt;/code&gt; failing to use the GPU is often more dependent on version matching than on NVIDIA systems.&lt;/p&gt;
&lt;p&gt;Start by checking these:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Whether the installed &lt;code&gt;ROCm&lt;/code&gt; version fits the current OS and GPU&lt;/li&gt;
&lt;li&gt;Whether the GPU belongs to an architecture with solid &lt;code&gt;ROCm&lt;/code&gt; support&lt;/li&gt;
&lt;li&gt;Whether you need to set &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Whether an older &lt;code&gt;Ollama&lt;/code&gt; build or older inference runtime is causing compatibility issues&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If &lt;code&gt;rocminfo&lt;/code&gt; works and the GPU is visible to the system, but &lt;code&gt;Ollama&lt;/code&gt; still runs only on &lt;code&gt;CPU&lt;/code&gt;, the issue is often in the version combination rather than in model parameters.&lt;/p&gt;
&lt;h2 id=&#34;7-in-docker-wsl-or-remote-environments-also-check-device-mapping&#34;&gt;7. In Docker, WSL, or remote environments, also check device mapping
&lt;/h2&gt;&lt;p&gt;If you are not running on bare metal but inside:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;WSL&lt;/li&gt;
&lt;li&gt;Remote containers&lt;/li&gt;
&lt;li&gt;Virtualized environments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then you need to check one more layer: whether the GPU device is actually being exposed inside that environment.&lt;/p&gt;
&lt;p&gt;A typical symptom looks like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The host machine can see the GPU&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Ollama&lt;/code&gt; inside the container or subsystem still uses only &lt;code&gt;CPU&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In that case, the issue may not be &lt;code&gt;Ollama&lt;/code&gt; itself. The container or subsystem may simply not have GPU access.&lt;/p&gt;
&lt;h2 id=&#34;8-check-logs-last-but-check-them-for-the-right-reason&#34;&gt;8. Check logs last, but check them for the right reason
&lt;/h2&gt;&lt;p&gt;If you have already gone through the earlier steps, the most effective next move is not endless reinstalling, but looking directly at the &lt;code&gt;Ollama&lt;/code&gt; startup and runtime logs.&lt;/p&gt;
&lt;p&gt;Focus on two kinds of messages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether a GPU was detected at all&lt;/li&gt;
&lt;li&gt;Whether there are driver, library loading, or device initialization errors&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the logs clearly say something like “no compatible GPU found” or “failed to initialize ROCm/CUDA,” the troubleshooting direction becomes much clearer immediately.&lt;/p&gt;
&lt;h2 id=&#34;troubleshooting-order&#34;&gt;Troubleshooting Order
&lt;/h2&gt;&lt;p&gt;If you only want the shortest path, use this order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;ollama ps&lt;/code&gt; and confirm whether it is &lt;code&gt;GPU&lt;/code&gt;, &lt;code&gt;CPU&lt;/code&gt;, or mixed loading&lt;/li&gt;
&lt;li&gt;Try a smaller model to rule out VRAM limits&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;nvidia-smi&lt;/code&gt;, &lt;code&gt;rocminfo&lt;/code&gt;, and &lt;code&gt;rocm-smi&lt;/code&gt; to verify the lower-level environment first&lt;/li&gt;
&lt;li&gt;Fully restart the &lt;code&gt;Ollama&lt;/code&gt; service&lt;/li&gt;
&lt;li&gt;Check service environment variables, especially &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; and &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt; on AMD&lt;/li&gt;
&lt;li&gt;If you are in Docker or WSL, verify device mapping&lt;/li&gt;
&lt;li&gt;Finally, inspect logs for the exact error&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;When &lt;code&gt;Ollama&lt;/code&gt; uses &lt;code&gt;CPU&lt;/code&gt; instead of &lt;code&gt;GPU&lt;/code&gt;, the root cause usually falls into one of three groups:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The GPU is not being detected at all&lt;/li&gt;
&lt;li&gt;The GPU is detectable, but the runtime environment is not reaching &lt;code&gt;Ollama&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The GPU is working, but the model is too large and falls back to &lt;code&gt;CPU&lt;/code&gt; or mixed memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once you separate those three cases, troubleshooting becomes much faster.&lt;br&gt;
If you are on an AMD platform, pay special attention to &lt;code&gt;ROCm&lt;/code&gt; version matching, device visibility, and compatibility variables instead of focusing only on the &lt;code&gt;Ollama&lt;/code&gt; command itself.&lt;/p&gt;
&lt;p&gt;Original video: &lt;a class=&#34;link&#34; href=&#34;https://www.bilibili.com/video/BV1cHoYBqE8k/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://www.bilibili.com/video/BV1cHoYBqE8k/&lt;/a&gt;&lt;/p&gt;
</description>
        </item>
        <item>
        <title>What Is NVIDIA nvbandwidth: How to Use This GPU Bandwidth Testing Tool</title>
        <link>https://knightli.com/en/2026/04/24/nvidia-nvbandwidth-guide/</link>
        <pubDate>Fri, 24 Apr 2026 14:41:35 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/24/nvidia-nvbandwidth-guide/</guid>
        <description>&lt;p&gt;If you have recently been troubleshooting interconnect performance between multiple &lt;code&gt;NVIDIA GPU&lt;/code&gt;s, or you want to verify the real bandwidth between &lt;code&gt;PCIe&lt;/code&gt;, &lt;code&gt;NVLink&lt;/code&gt;, host memory, and VRAM, &lt;code&gt;NVIDIA/nvbandwidth&lt;/code&gt; is a small tool worth knowing about.&lt;/p&gt;
&lt;p&gt;It is not a general benchmark utility, and it is not a hidden command inside a large model framework. It is an open-source tool from NVIDIA specifically designed to measure bandwidth and latency for GPU-related memory copies. Instead of only looking at theoretical bandwidth, &lt;code&gt;nvbandwidth&lt;/code&gt; is better at answering a practical question: &lt;strong&gt;how much bandwidth can this machine and its current GPU interconnects actually deliver right now?&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&#34;1-what-does-nvbandwidth-do&#34;&gt;1. What does &lt;code&gt;nvbandwidth&lt;/code&gt; do
&lt;/h2&gt;&lt;p&gt;According to the official README, &lt;code&gt;nvbandwidth&lt;/code&gt; is a command-line tool for measuring bandwidth on &lt;code&gt;NVIDIA GPU&lt;/code&gt;s.&lt;/p&gt;
&lt;p&gt;It mainly focuses on transfer performance across different &lt;code&gt;memcpy&lt;/code&gt; patterns, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GPU -&amp;gt; GPU&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CPU -&amp;gt; GPU&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GPU -&amp;gt; CPU&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Transfers between GPUs across multiple nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tests are especially useful in scenarios like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Troubleshooting interconnect bottlenecks in multi-GPU training or inference&lt;/li&gt;
&lt;li&gt;Verifying the actual behavior of links such as &lt;code&gt;NVLink&lt;/code&gt;, &lt;code&gt;PCIe&lt;/code&gt;, and &lt;code&gt;C2C&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Comparing transfer differences across servers, topologies, drivers, or CUDA versions&lt;/li&gt;
&lt;li&gt;Performing baseline hardware validation before cluster deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, &lt;code&gt;nvbandwidth&lt;/code&gt; is not about model throughput. It is about the lower-level ability to move data.&lt;/p&gt;
&lt;h2 id=&#34;2-it-does-not-produce-just-one-simple-score&#34;&gt;2. It does not produce just one simple score
&lt;/h2&gt;&lt;p&gt;Many people think of a bandwidth test as something that ends with a single number, but &lt;code&gt;nvbandwidth&lt;/code&gt; provides more detailed output than that.&lt;/p&gt;
&lt;p&gt;It reports results as matrices for each test type. For example, in a test like &lt;code&gt;device_to_device_memcpy_write_ce&lt;/code&gt;, it shows the bandwidth between each pair of GPUs by row and column. That means you can see more than just a rough system-wide speed estimate. You can also spot:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Which GPU pairs are especially fast&lt;/li&gt;
&lt;li&gt;Which paths are clearly limited by &lt;code&gt;PCIe&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Whether certain GPU pairs show abnormally low bandwidth&lt;/li&gt;
&lt;li&gt;Whether the multi-GPU topology matches your expectations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are working with an 8-GPU server, a dual-socket platform, or a multinode system, this matrix-style output is often more useful than a single average number.&lt;/p&gt;
&lt;h2 id=&#34;3-how-to-understand-ce-and-sm-copies&#34;&gt;3. How to understand &lt;code&gt;CE&lt;/code&gt; and &lt;code&gt;SM&lt;/code&gt; copies
&lt;/h2&gt;&lt;p&gt;The official documentation splits tests into two categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CE&lt;/code&gt;: copy engine transfers based on &lt;code&gt;memcpy&lt;/code&gt; APIs&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SM&lt;/code&gt;: kernel-based transfers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These two result types are not guaranteed to match exactly, because they represent different copy paths.&lt;br&gt;
If you mainly want to understand regular device-to-device transfer behavior, you will usually look at &lt;code&gt;CE&lt;/code&gt; first. If you want to study execution details more closely, then &lt;code&gt;SM&lt;/code&gt; is worth checking too.&lt;/p&gt;
&lt;p&gt;The README also explains that bandwidth results use the median across multiple test runs by default. Newer versions additionally include variability statistics, which makes it easier to judge how stable the numbers are.&lt;/p&gt;
&lt;h2 id=&#34;4-what-environment-does-it-require&#34;&gt;4. What environment does it require
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;nvbandwidth&lt;/code&gt; is not a pure binary utility that you simply download and run. It expects a standard CUDA development environment.&lt;/p&gt;
&lt;p&gt;The current README lists these basic requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CUDA Toolkit 11.x&lt;/code&gt; or newer&lt;/li&gt;
&lt;li&gt;A compiler with &lt;code&gt;C++17&lt;/code&gt; support&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CMake 3.20+&lt;/code&gt;, with &lt;code&gt;3.24+&lt;/code&gt; recommended&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Boost program_options&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A usable &lt;code&gt;CUDA&lt;/code&gt; device and a compatible driver&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The requirements are higher if you want the multinode version. The current README explicitly states:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multinode builds require &lt;code&gt;CUDA Toolkit 12.3&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The driver must be &lt;code&gt;550&lt;/code&gt; or newer&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MPI&lt;/code&gt; is required&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;nvidia-imex&lt;/code&gt; service must be configured&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So this is much more of an engineering tool for Linux GPU servers and clusters than something aimed at casual desktop use.&lt;/p&gt;
&lt;h2 id=&#34;5-how-to-build-and-run-the-single-node-version&#34;&gt;5. How to build and run the single-node version
&lt;/h2&gt;&lt;p&gt;The single-node build process is straightforward:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake .
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;make
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;On &lt;code&gt;Ubuntu&lt;/code&gt; / &lt;code&gt;Debian&lt;/code&gt;, the project also provides a &lt;code&gt;debian_install.sh&lt;/code&gt; script that installs common dependencies and builds the project.&lt;/p&gt;
&lt;p&gt;After building, you can check the help output first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./nvbandwidth -h
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Some commonly used options include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-l&lt;/code&gt;: list available tests&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-t&lt;/code&gt;: run a specific test by name or index&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-p&lt;/code&gt;: run tests by prefix&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-b&lt;/code&gt;: set the memcpy buffer size, default &lt;code&gt;512 MiB&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-i&lt;/code&gt;: set the number of benchmark iterations&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-j&lt;/code&gt;: output &lt;code&gt;JSON&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-H&lt;/code&gt;: enable huge pages for host memory allocation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you just want to run the default test suite once, use:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./nvbandwidth
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you only want to test one specific item, such as a device-to-device copy:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./nvbandwidth -t device_to_device_memcpy_read_ce
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;6-multinode-support-is-one-of-its-standout-features&#34;&gt;6. Multinode support is one of its standout features
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;nvbandwidth&lt;/code&gt; is not only for single-node multi-GPU testing. It also supports multinode scenarios.&lt;/p&gt;
&lt;p&gt;According to the README, the multinode build is done like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake -DMULTINODE&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; .
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;make
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;At runtime, it is typically used together with &lt;code&gt;mpirun&lt;/code&gt;, with one process launched per GPU.&lt;br&gt;
The documentation also requires all participating ranks to belong to the same multinode clique, and it recommends mainly running tests with the &lt;code&gt;multinode&lt;/code&gt; prefix under MPI.&lt;/p&gt;
&lt;p&gt;That makes its positioning much closer to high-performance computing and large GPU systems than to simple workstation self-checks.&lt;/p&gt;
&lt;p&gt;If you are working with &lt;code&gt;NVLink&lt;/code&gt; multinode deployments or more complex platforms such as &lt;code&gt;GB200&lt;/code&gt; / &lt;code&gt;Grace Hopper&lt;/code&gt;, the value of &lt;code&gt;nvbandwidth&lt;/code&gt; is much higher than it would be on a typical consumer GPU setup.&lt;/p&gt;
&lt;h2 id=&#34;7-what-changed-in-v09&#34;&gt;7. What changed in &lt;code&gt;v0.9&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;As of &lt;strong&gt;April 24, 2026&lt;/strong&gt;, the GitHub Releases page shows that the latest version of &lt;code&gt;nvbandwidth&lt;/code&gt; is &lt;strong&gt;&lt;code&gt;v0.9&lt;/code&gt;&lt;/strong&gt;, released on &lt;strong&gt;April 8, 2026&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The most notable updates in this release include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Added variability statistics to bandwidth output&lt;/li&gt;
&lt;li&gt;Added huge page support for host memory (&lt;code&gt;Windows&lt;/code&gt; excluded)&lt;/li&gt;
&lt;li&gt;Added pair sampling for device-to-device tests&lt;/li&gt;
&lt;li&gt;Added a troubleshooting guide&lt;/li&gt;
&lt;li&gt;Unified single-node and multinode execution paths&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Two engineering-oriented changes are also worth noting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improved CUDA architecture detection without relying as much on direct GPU access&lt;/li&gt;
&lt;li&gt;Deprecated &lt;code&gt;Volta&lt;/code&gt; (&lt;code&gt;sm_70&lt;/code&gt; / &lt;code&gt;sm_72&lt;/code&gt;) support in &lt;code&gt;CUDA Toolkit 13.0+&lt;/code&gt; environments&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So if you only looked at early versions before, &lt;code&gt;v0.9&lt;/code&gt; is no longer just a basic bandwidth tester. It is clearly moving toward better automation, troubleshooting, and large-scale system validation.&lt;/p&gt;
&lt;h2 id=&#34;8-when-is-it-a-good-fit&#34;&gt;8. When is it a good fit
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;nvbandwidth&lt;/code&gt; is especially suitable when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want to verify real interconnect bandwidth between multiple &lt;code&gt;NVIDIA GPU&lt;/code&gt;s&lt;/li&gt;
&lt;li&gt;You suspect one GPU is installed in a bandwidth-limited &lt;code&gt;PCIe&lt;/code&gt; slot&lt;/li&gt;
&lt;li&gt;You want to compare &lt;code&gt;NVLink&lt;/code&gt; paths against non-&lt;code&gt;NVLink&lt;/code&gt; paths&lt;/li&gt;
&lt;li&gt;You are deploying a multinode GPU cluster and need to validate the links&lt;/li&gt;
&lt;li&gt;You want test results in &lt;code&gt;JSON&lt;/code&gt; for automation pipelines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But if your goal is only to answer questions like &amp;ldquo;how fast is training&amp;rdquo; or &amp;ldquo;how many tokens per second can inference reach,&amp;rdquo; this tool is not the whole answer.&lt;br&gt;
In that case, you still need workload-level testing with your training framework, inference engine, or real application.&lt;/p&gt;
&lt;h2 id=&#34;9-how-to-think-about-its-value&#34;&gt;9. How to think about its value
&lt;/h2&gt;&lt;p&gt;Many GPU performance problems are not really caused by insufficient compute. They happen because the data path is not working as expected.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GPUs are not using the intended interconnect path&lt;/li&gt;
&lt;li&gt;Cross-NUMA access is reducing speed&lt;/li&gt;
&lt;li&gt;Certain GPU pairs have abnormal bandwidth&lt;/li&gt;
&lt;li&gt;Multinode communication is only partially configured&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These issues are often hard to diagnose if you only look at &lt;code&gt;nvidia-smi&lt;/code&gt; or model throughput.&lt;br&gt;
A lower-level, matrix-oriented tool like &lt;code&gt;nvbandwidth&lt;/code&gt; is useful precisely because it exposes what is happening at the interconnect layer.&lt;/p&gt;
&lt;p&gt;So a simple way to think about it is: &lt;strong&gt;&lt;code&gt;nvbandwidth&lt;/code&gt; is a command-line health check tool for bandwidth on NVIDIA GPU systems.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&#34;related-links&#34;&gt;Related links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;GitHub project: &lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/nvbandwidth&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/NVIDIA/nvbandwidth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Releases: &lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/nvbandwidth/releases&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/NVIDIA/nvbandwidth/releases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Check Whether a Tesla V100 Has ECC Errors</title>
        <link>https://knightli.com/en/2026/04/23/check-tesla-v100-ecc-errors/</link>
        <pubDate>Thu, 23 Apr 2026 11:50:21 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/23/check-tesla-v100-ecc-errors/</guid>
        <description>&lt;p&gt;If you have a &lt;code&gt;Tesla V100&lt;/code&gt; on hand and want to do a basic health check first, &lt;code&gt;ECC&lt;/code&gt; status is one of the most useful things to look at.&lt;/p&gt;
&lt;p&gt;The most direct method is to inspect the card&amp;rsquo;s detailed information with &lt;code&gt;nvidia-smi&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi -q
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;c1&#34;&gt;# 查询第 0 块 GPU&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi -q -i &lt;span class=&#34;m&#34;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Focus on the &lt;code&gt;ECC Errors&lt;/code&gt; section.&lt;/p&gt;
&lt;p&gt;On a card in normal condition, the four common groups of counters under &lt;code&gt;ECC Errors&lt;/code&gt; should all be &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;N/A&lt;/code&gt;. If any of them already show a non-zero value, it means the card has seen that type of ECC anomaly before, and you should further evaluate whether it is still suitable for continued use.&lt;/p&gt;
&lt;p&gt;Reference output:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;23
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;24
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;25
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;26
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;27
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;28
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;29
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;30
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;31
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;32
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;33
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;34
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;35
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;36
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;37
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;38
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;39
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;40
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;41
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;42
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;43
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;44
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi -q
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    ECC Mode
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        Current                          : Enabled
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        Pending                          : Enabled
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    ECC Errors
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        Volatile
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            Single Bit
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Device Memory            : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Register File            : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                L1 Cache                 : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                L2 Cache                 : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Texture Memory           : N/A
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Texture Shared           : N/A
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                CBU                      : N/A
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Total                    : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            Double Bit
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Device Memory            : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Register File            : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                L1 Cache                 : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                L2 Cache                 : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Texture Memory           : N/A
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Texture Shared           : N/A
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                CBU                      : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Total                    : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;        Aggregate
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            Single Bit
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Device Memory            : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Register File            : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                L1 Cache                 : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                L2 Cache                 : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Texture Memory           : N/A
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Texture Shared           : N/A
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                CBU                      : N/A
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Total                    : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;            Double Bit
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Device Memory            : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Register File            : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                L1 Cache                 : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                L2 Cache                 : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Texture Memory           : N/A
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Texture Shared           : N/A
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                CBU                      : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;                Total                    : 0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;    Retired Pages
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You can think of it like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Volatile&lt;/code&gt; is the error count for the current power cycle&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Aggregate&lt;/code&gt; is the lifetime accumulated error count&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Single Bit&lt;/code&gt; means correctable errors&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Double Bit&lt;/code&gt; means uncorrectable errors, which are more serious&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only want a quick screening rule, remember this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Most items should be &lt;code&gt;0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;N/A&lt;/code&gt; is normal for some not-applicable entries&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;Double Bit&lt;/code&gt; or the total count is not &lt;code&gt;0&lt;/code&gt;, do not rely only on a seller&amp;rsquo;s verbal description; it is better to continue with fuller stress testing and stability checks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This does not replace a complete inspection, but it is enough for a first round of checks after a &lt;code&gt;V100&lt;/code&gt; arrives.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Is Tesla V100 Still Worth Buying: ECC Checks, Cooling Mods, and DIY Pitfalls</title>
        <link>https://knightli.com/en/2026/04/23/tesla-v100-buying-ecc-cooling-diy-guide/</link>
        <pubDate>Thu, 23 Apr 2026 11:15:10 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/23/tesla-v100-buying-ecc-cooling-diy-guide/</guid>
        <description>&lt;p&gt;If you have been looking at used &lt;code&gt;Tesla V100&lt;/code&gt; cards recently, you have probably seen two very different opinions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one side says the card is still strong and offers great value&lt;/li&gt;
&lt;li&gt;the other says the market is full of traps and DIY users can easily get burned&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both are true.&lt;/p&gt;
&lt;p&gt;The point is not that &lt;code&gt;V100&lt;/code&gt; is unbuyable. The point is that you cannot buy it the same way you would buy a normal consumer GPU. What matters is not only whether it boots, and not only whether the seller says &amp;ldquo;like new&amp;rdquo; or &amp;ldquo;pulled from an original server&amp;rdquo;. What matters is whether the card has been tampered with, what its &lt;code&gt;ECC&lt;/code&gt; condition looks like, and whether the cooling and power setup are actually reliable.&lt;/p&gt;
&lt;p&gt;This article pulls together the most useful checks for buying and using one in practice.&lt;/p&gt;
&lt;h2 id=&#34;quick-takeaways&#34;&gt;Quick Takeaways
&lt;/h2&gt;&lt;p&gt;If you only want the short version, remember these points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;V100&lt;/code&gt; was produced roughly from &lt;code&gt;2017&lt;/code&gt; to &lt;code&gt;2021&lt;/code&gt;, and &lt;code&gt;2021&lt;/code&gt; cards are uncommon in the &lt;code&gt;16G&lt;/code&gt; version&lt;/li&gt;
&lt;li&gt;looking only at &amp;ldquo;zero ECC&amp;rdquo; or &amp;ldquo;original pull&amp;rdquo; is not enough, because both data and physical condition can be altered&lt;/li&gt;
&lt;li&gt;the biggest risk is often not buying an old card, but buying one that was disassembled, reflashed, or paired with a bad cooling setup&lt;/li&gt;
&lt;li&gt;for &lt;code&gt;DIY&lt;/code&gt; users, the real problem is usually not the core itself, but the adapter board, power delivery, hotspot temperature, and backplate cooling&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;1-start-with-production-date-and-batch-clues&#34;&gt;1. Start with Production Date and Batch Clues
&lt;/h2&gt;&lt;p&gt;A very practical method is to check the chip date first, then see whether the dates on nearby components match it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/23/tesla-v100-buying-ecc-cooling-diy-guide/1.png&#34;
	width=&#34;1139&#34;
	height=&#34;670&#34;
	srcset=&#34;https://knightli.com/2026/04/23/tesla-v100-buying-ecc-cooling-diy-guide/1_hu_a8325dae98af3ae7.png 480w, https://knightli.com/2026/04/23/tesla-v100-buying-ecc-cooling-diy-guide/1_hu_40537b27bd676168.png 1024w&#34;
	loading=&#34;lazy&#34;
	
		alt=&#34;Tesla V100&#34;
	
	
		class=&#34;gallery-image&#34; 
		data-flex-grow=&#34;170&#34;
		data-flex-basis=&#34;408px&#34;
	
&gt;&lt;/p&gt;
&lt;p&gt;For example, if the chip surface shows &lt;code&gt;1828&lt;/code&gt;, it usually means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;18&lt;/code&gt; = year &lt;code&gt;2018&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;28&lt;/code&gt; = week &lt;code&gt;28&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So that chip was produced in week &lt;code&gt;28&lt;/code&gt; of &lt;code&gt;2018&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Besides the chip package, nearby inductors often carry date-related markings too. If the chip date and inductor date are far apart, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;chip date is &lt;code&gt;2017&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;inductors point to &lt;code&gt;2020&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then you should be cautious. It does not automatically prove the card is bad, but it does suggest it is no longer in a very original state.&lt;/p&gt;
&lt;p&gt;On the other hand, if the dates broadly line up, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a &lt;code&gt;2018&lt;/code&gt; chip with &lt;code&gt;2018&lt;/code&gt; surrounding components&lt;/li&gt;
&lt;li&gt;a late &lt;code&gt;2019&lt;/code&gt; chip paired with &lt;code&gt;2020&lt;/code&gt; components&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;that is much more normal.&lt;/p&gt;
&lt;h2 id=&#34;2-do-not-only-look-at-the-chip-check-inductors-springs-and-frame&#34;&gt;2. Do Not Only Look at the Chip: Check Inductors, Springs, and Frame
&lt;/h2&gt;&lt;p&gt;Visual inspection is best broken into a few separate checks.&lt;/p&gt;
&lt;h3 id=&#34;1-touch-the-inductors-first&#34;&gt;1. Touch the inductors first
&lt;/h3&gt;&lt;p&gt;Gently press or touch the inductors. Under normal conditions, none of them should feel loose.&lt;/p&gt;
&lt;p&gt;If one of them is already moving, it usually means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the solder condition is not healthy&lt;/li&gt;
&lt;li&gt;the problem may worsen with continued use&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even if the card still works now, that is not a good sign.&lt;/p&gt;
&lt;h3 id=&#34;2-check-whether-the-retaining-spring-has-been-removed-before&#34;&gt;2. Check whether the retaining spring has been removed before
&lt;/h3&gt;&lt;p&gt;There is a useful logic here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if the seller insists this is an &amp;ldquo;original server pull&amp;rdquo;&lt;/li&gt;
&lt;li&gt;then the retaining spring generally should not have been casually removed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In a normal factory server environment, people do not usually remove this spring for no reason.&lt;/p&gt;
&lt;p&gt;If the spring comes off very easily, the card was probably opened before. If the seller is also claiming it is untouched, that claim deserves skepticism.&lt;/p&gt;
&lt;h3 id=&#34;3-if-the-frame-comes-apart-too-easily-that-is-also-suspicious&#34;&gt;3. If the frame comes apart too easily, that is also suspicious
&lt;/h3&gt;&lt;p&gt;Once the middle frame is removed, if the whole structure separates with almost no effort, that usually means the card has already been disassembled multiple times.&lt;/p&gt;
&lt;p&gt;That matters on used &lt;code&gt;V100&lt;/code&gt; cards because reflashing, modification, and repair work often leave exactly these kinds of traces.&lt;/p&gt;
&lt;h2 id=&#34;3-if-the-backplate-separates-too-easily-suspect-a-reflash-or-prior-tampering&#34;&gt;3. If the Backplate Separates Too Easily, Suspect a Reflash or Prior Tampering
&lt;/h2&gt;&lt;p&gt;One especially important detail is that there is a metal plate under the &lt;code&gt;PCB&lt;/code&gt;. It is not only for protection; it also helps with heat dissipation.&lt;/p&gt;
&lt;p&gt;In a normal original condition, this backplate is usually not easy to remove. Reasons include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;adhesive&lt;/li&gt;
&lt;li&gt;a tight structural fit&lt;/li&gt;
&lt;li&gt;the design was not meant for repeated disassembly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the backplate separates from the &lt;code&gt;PCB&lt;/code&gt; with only a little force, then you should suspect:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it has been opened before&lt;/li&gt;
&lt;li&gt;the card may have had its &lt;code&gt;VBIOS&lt;/code&gt; reflashed&lt;/li&gt;
&lt;li&gt;there may have been secondary modifications&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That does not automatically make it unusable, but it is clearly inconsistent with &amp;ldquo;original and untouched.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;4-how-to-read-ecc-what-matters-most-is-not-whether-it-is-zero-but-whether-it-grows&#34;&gt;4. How to Read &lt;code&gt;ECC&lt;/code&gt;: What Matters Most Is Not Whether It Is Zero, but Whether It Grows
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;ECC&lt;/code&gt; is one of the first things people look at on a &lt;code&gt;V100&lt;/code&gt;, and it really needs to be interpreted carefully.&lt;/p&gt;
&lt;p&gt;A common method is to use &lt;code&gt;nvidia-smi&lt;/code&gt; in detailed mode and check the &lt;code&gt;ECC Errors&lt;/code&gt; section.&lt;/p&gt;
&lt;h3 id=&#34;1-real-time-errors-are-the-most-dangerous&#34;&gt;1. Real-time errors are the most dangerous
&lt;/h3&gt;&lt;p&gt;The upper section can be understood as real-time errors.&lt;/p&gt;
&lt;p&gt;If those numbers keep increasing while the card is running, that usually means the card is already in an unstable state.&lt;/p&gt;
&lt;p&gt;In simple terms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a card that runs without new errors matters more than a static zero reading&lt;/li&gt;
&lt;li&gt;a card that starts increasing errors under stress is much more worrying than one with only historical accumulated counts&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;2-lifetime-accumulated-errors-are-not-always-scary&#34;&gt;2. Lifetime accumulated errors are not always scary
&lt;/h3&gt;&lt;p&gt;Another section shows lifetime accumulated errors, meaning how many corrected or uncorrected events happened across the card&amp;rsquo;s life.&lt;/p&gt;
&lt;p&gt;If those values are only:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;single digits&lt;/li&gt;
&lt;li&gt;or maybe in the teens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;that is not automatically a disaster.&lt;/p&gt;
&lt;p&gt;If real-time errors do not continue increasing during actual use, the card may still be perfectly usable.&lt;/p&gt;
&lt;h3 id=&#34;3-the-page-retirement-section-deserves-more-attention&#34;&gt;3. The page retirement section deserves more attention
&lt;/h3&gt;&lt;p&gt;The page retirement section is even more important, because it indicates memory blocks that were retired after uncorrectable errors.&lt;/p&gt;
&lt;p&gt;A practical way to think about it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;single-bit and double-bit categories may each have retired blocks&lt;/li&gt;
&lt;li&gt;if the total climbs past &lt;code&gt;10&lt;/code&gt;, you are entering a range where caution is warranted&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That does not always mean the card is unusable, but it does suggest reduced effective memory and weaker long-term confidence.&lt;/p&gt;
&lt;h2 id=&#34;5-do-not-worship-zero-ecc-the-data-itself-can-be-manipulated&#34;&gt;5. Do Not Worship &amp;ldquo;Zero ECC&amp;rdquo;: The Data Itself Can Be Manipulated
&lt;/h2&gt;&lt;p&gt;There is a very practical warning here:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;ECC&lt;/code&gt; numbers are not inherently sacred.&lt;/p&gt;
&lt;p&gt;If a card has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;extremely clean-looking data&lt;/li&gt;
&lt;li&gt;but obvious signs of disassembly&lt;/li&gt;
&lt;li&gt;and a structure that clearly looks worked on&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then you should not trust &amp;ldquo;zero ECC&amp;rdquo; by itself.&lt;/p&gt;
&lt;p&gt;A useful analogy is an old car that suddenly shows &lt;code&gt;0&lt;/code&gt; mileage and almost no tire wear after many years. It is hard not to suspect the odometer was touched.&lt;/p&gt;
&lt;p&gt;The same idea applies to &lt;code&gt;V100&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;numbers that look too perfect are not always good news&lt;/li&gt;
&lt;li&gt;what matters is whether the data, the physical condition, and the stress-test behavior all make sense together&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;6-stress-testing-is-necessary-but-testing-only-the-core-is-not-enough&#34;&gt;6. Stress Testing Is Necessary, but Testing Only the Core Is Not Enough
&lt;/h2&gt;&lt;p&gt;You can use a tool such as &lt;code&gt;gpu-burn&lt;/code&gt; to stress the card for several minutes or longer and watch:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether it remains stable&lt;/li&gt;
&lt;li&gt;whether the card drops out&lt;/li&gt;
&lt;li&gt;whether new &lt;code&gt;ECC&lt;/code&gt; errors appear&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But there is another important point:&lt;/p&gt;
&lt;p&gt;Testing only the core does not prove the entire card is healthy.&lt;/p&gt;
&lt;p&gt;A lot of &lt;code&gt;V100&lt;/code&gt; failures do not start with the core. They start with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;overheating in the power-delivery area&lt;/li&gt;
&lt;li&gt;insufficient cooling around the backplate&lt;/li&gt;
&lt;li&gt;excessive hotspot temperatures&lt;/li&gt;
&lt;li&gt;adapter boards and cooling systems that are always operating too close to the edge&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So stress testing only proves that &amp;ldquo;the card can run right now.&amp;rdquo; It does not prove that &amp;ldquo;this DIY setup will survive in the long run.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;7-for-diy-users-the-real-failure-point-is-usually-cooling-and-power-not-the-purchase-itself&#34;&gt;7. For DIY Users, the Real Failure Point Is Usually Cooling and Power, Not the Purchase Itself
&lt;/h2&gt;&lt;p&gt;This is probably the most important part of the entire topic.&lt;/p&gt;
&lt;p&gt;The core idea is simple:&lt;/p&gt;
&lt;p&gt;For &lt;code&gt;DIY&lt;/code&gt; users, casually combining an adapter base with a generic cooler is not a robust plan.&lt;/p&gt;
&lt;p&gt;That is because &lt;code&gt;V100&lt;/code&gt; is not a normal consumer card. It is a server accelerator with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;high power draw&lt;/li&gt;
&lt;li&gt;high heat density&lt;/li&gt;
&lt;li&gt;complicated heat distribution&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The chip is not the only thing producing heat. The backplate, power area, and connector region also get hot, and sometimes very hot.&lt;/p&gt;
&lt;h3 id=&#34;1-do-not-only-watch-average-gpu-temperature&#34;&gt;1. Do not only watch average GPU temperature
&lt;/h3&gt;&lt;p&gt;Many monitoring tools show the average card temperature, but the more dangerous number is often the &lt;code&gt;hot spot&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the visible temperature may only be in the 60s Celsius&lt;/li&gt;
&lt;li&gt;while local hotspots may already be over 100C&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why some DIY &lt;code&gt;V100&lt;/code&gt; builds look &amp;ldquo;fine&amp;rdquo; on paper and then suddenly die later.&lt;/p&gt;
&lt;h3 id=&#34;2-backplate-cooling-must-be-considered&#34;&gt;2. Backplate cooling must be considered
&lt;/h3&gt;&lt;p&gt;Cooling for the backplate and power area cannot be ignored.&lt;/p&gt;
&lt;p&gt;If you only cool the core, but:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;MOS&lt;/code&gt; area is neglected&lt;/li&gt;
&lt;li&gt;the backplate gets no heat transfer help&lt;/li&gt;
&lt;li&gt;the rear side lacks proper thermal design&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then the full setup is still incomplete.&lt;/p&gt;
&lt;h3 id=&#34;3-cheap-improvised-water-cooling-setups-are-risky&#34;&gt;3. Cheap improvised water-cooling setups are risky
&lt;/h3&gt;&lt;p&gt;You should be cautious about the &amp;ldquo;random adapter board + cheap AIO water cooler&amp;rdquo; style setup.&lt;/p&gt;
&lt;p&gt;The issue is not that it always fails immediately. The issue is that it often has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;uneven water-channel coverage&lt;/li&gt;
&lt;li&gt;incomplete cooling for the power-delivery area&lt;/li&gt;
&lt;li&gt;poor control of the actual hotspot zones&lt;/li&gt;
&lt;li&gt;unpredictable long-term lifespan&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;8-if-you-still-want-to-diy-at-least-watch-these-points&#34;&gt;8. If You Still Want to DIY, At Least Watch These Points
&lt;/h2&gt;&lt;p&gt;The most practical recommendations are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;prefer more mature adapter-board solutions with a better track record&lt;/li&gt;
&lt;li&gt;do not focus only on the core; the rear power area and backplate need thermal attention too&lt;/li&gt;
&lt;li&gt;the water block needs real coverage and even heat handling, not just physical contact&lt;/li&gt;
&lt;li&gt;after stress testing, keep watching temperatures, hotspots, and long-term behavior&lt;/li&gt;
&lt;li&gt;PSU quality also affects coil whine and overall stability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, the hard part of a DIY &lt;code&gt;V100&lt;/code&gt; build is not &amp;ldquo;getting it to boot.&amp;rdquo; The hard part is &amp;ldquo;keeping it alive and stable afterward.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;9-coil-whine-and-adapter-board-variance-are-real-problems-too&#34;&gt;9. Coil Whine and Adapter-Board Variance Are Real Problems Too
&lt;/h2&gt;&lt;p&gt;Two more points are often overlooked.&lt;/p&gt;
&lt;h3 id=&#34;1-coil-whine-may-not-be-fully-eliminable&#34;&gt;1. Coil whine may not be fully eliminable
&lt;/h3&gt;&lt;p&gt;It depends on the individual card, the inductors, capacitors, and the power environment. It is not something you can always solve with one cable or one small accessory.&lt;/p&gt;
&lt;h3 id=&#34;2-adapter-board-variance-is-huge&#34;&gt;2. Adapter-board variance is huge
&lt;/h3&gt;&lt;p&gt;That is why some sellers, even when they are willing to sell a bare card, still emphasize:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;bench-testing it first&lt;/li&gt;
&lt;li&gt;recording the serial number&lt;/li&gt;
&lt;li&gt;doing stress tests&lt;/li&gt;
&lt;li&gt;documenting the process&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because a lot of disputes are not caused by the silicon itself. They are caused by the adapter board and cooling solution paired with it afterward.&lt;/p&gt;
&lt;h2 id=&#34;closing&#34;&gt;Closing
&lt;/h2&gt;&lt;p&gt;So, is &lt;code&gt;Tesla V100&lt;/code&gt; still worth buying? Yes, but only if you understand what you are buying and how you plan to use it afterward.&lt;/p&gt;
&lt;p&gt;If you only check:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether it powers on&lt;/li&gt;
&lt;li&gt;whether &lt;code&gt;ECC&lt;/code&gt; is all zero&lt;/li&gt;
&lt;li&gt;whether the seller says &amp;ldquo;original pull&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;that is nowhere near enough.&lt;/p&gt;
&lt;p&gt;The more useful things to verify are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether the dates and batch clues line up&lt;/li&gt;
&lt;li&gt;whether there are suspicious signs of prior disassembly&lt;/li&gt;
&lt;li&gt;whether the backplate and structure were clearly opened before&lt;/li&gt;
&lt;li&gt;whether errors increase under stress testing&lt;/li&gt;
&lt;li&gt;whether your cooling and power setup are actually trustworthy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Especially for &lt;code&gt;DIY&lt;/code&gt; users, the most dangerous part of &lt;code&gt;V100&lt;/code&gt; is often not &amp;ldquo;buying an old card&amp;rdquo;, but underestimating how demanding these cards are about cooling, power delivery, and modification quality.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>llama.cpp GPU Performance Ranking: Full CUDA, ROCm, and Vulkan Scoreboards Explained with pp512 / tg128 / FA</title>
        <link>https://knightli.com/en/2026/04/23/llama-cpp-gpu-benchmark-cuda-rocm-vulkan-scoreboard/</link>
        <pubDate>Thu, 23 Apr 2026 10:22:04 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/23/llama-cpp-gpu-benchmark-cuda-rocm-vulkan-scoreboard/</guid>
        <description>&lt;h2 id=&#34;understanding-the-metrics-first&#34;&gt;Understanding the Metrics First
&lt;/h2&gt;&lt;h3 id=&#34;what-is-q4_0&#34;&gt;What is Q4_0
&lt;/h3&gt;&lt;p&gt;Q4_0 is a 4-bit quantization format. It does not mean the model is stronger. It means the model is smaller, uses less VRAM, and fits on more devices. Most of these scoreboards standardize on Llama 2 7B, Q4_0 so that GPU-to-GPU comparisons are easier.&lt;/p&gt;
&lt;h3 id=&#34;what-is-pp512&#34;&gt;What is pp512
&lt;/h3&gt;&lt;p&gt;pp512 usually means prompt processing 512 tokens, which is the throughput while processing 512 input tokens.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;pp = prompt processing&lt;/li&gt;
&lt;li&gt;512 = input length is 512 tokens&lt;/li&gt;
&lt;li&gt;/s = tokens per second
This is closer to prompt-ingestion speed, so it is often much higher than generation speed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;what-is-g128&#34;&gt;What is 	g128
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;g128 usually means 	ext generation 128 tokens, which is the speed while generating 128 tokens continuously.
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;g = text generation&lt;/li&gt;
&lt;li&gt;128 = generate 128 tokens continuously&lt;/li&gt;
&lt;li&gt;/s = tokens per second
This is usually closer to the speed users actually feel in interactive usage.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;what-is-fa&#34;&gt;What is FA
&lt;/h3&gt;&lt;p&gt;FA stands for Flash Attention.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;with FA means Flash Attention is enabled&lt;/li&gt;
&lt;li&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;o FA means Flash Attention is disabled
On many GPUs, FA improves pp512 more clearly than 	g128, but the gain is not identical across backends, drivers, and GPU architectures.&lt;/p&gt;
&lt;h3 id=&#34;how-to-read-s&#34;&gt;How to read 	/s
&lt;/h3&gt;&lt;pre&gt;&lt;code&gt;/s means 	okens per second. When reading these scoreboards, the key rule is to compare the same type of test with the same settings.
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Do not compare pp512 and 	g128 as if they were the same thing&lt;/li&gt;
&lt;li&gt;Do not mix
o FA and with FA&lt;/li&gt;
&lt;li&gt;Do not assume CUDA, ROCm, and Vulkan are directly interchangeable&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;quick-takeaways&#34;&gt;Quick Takeaways
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;CUDA is still the strongest overall path in llama.cpp GPU benchmarks, especially on high-end Nvidia GPUs.&lt;/li&gt;
&lt;li&gt;ROCm is already delivering strong results on high-end AMD GPUs and Instinct accelerators.&lt;/li&gt;
&lt;li&gt;Vulkan has the broadest hardware coverage, including Nvidia, AMD, Intel, older GPUs, and some Apple / Asahi setups.&lt;/li&gt;
&lt;li&gt;g128 is closer to everyday perceived speed, while pp512 is better for judging prompt throughput.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;cuda-scoreboards&#34;&gt;CUDA Scoreboards
&lt;/h2&gt;&lt;h3 id=&#34;llama-2-7b-q4_0-no-fa&#34;&gt;Llama 2 7B, Q4_0, no FA
&lt;/h3&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Chip&lt;/th&gt;
          &lt;th&gt;Memory&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;pp512 t/s&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;tg128 t/s&lt;/th&gt;
          &lt;th&gt;Commit&lt;/th&gt;
          &lt;th&gt;Thanks to&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 5090&lt;/td&gt;
          &lt;td&gt;32 GB / GDDR7 / 512 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14073.41 ± 115.16&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;290.02 ± 1.10&lt;/td&gt;
          &lt;td&gt;8cf6b42&lt;/td&gt;
          &lt;td&gt;@totaldev&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX PRO 6000 Blackwell&lt;/td&gt;
          &lt;td&gt;96 GB / GDDR7 / 512 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14854.63 ± 22.73&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;274.20 ± 0.14&lt;/td&gt;
          &lt;td&gt;79c1160&lt;/td&gt;
          &lt;td&gt;@Tom94&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;H100 80 GB&lt;/td&gt;
          &lt;td&gt;80 GB / HBM3 / 5120 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9918.34 ± 176.97&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;267.81 ± 1.54&lt;/td&gt;
          &lt;td&gt;5143fa8&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;A100 80 GB&lt;/td&gt;
          &lt;td&gt;80 GB / HBM2e / 5120 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4849.53 ± 8.94&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;190.88 ± 0.33&lt;/td&gt;
          &lt;td&gt;5143fa8&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4090 D&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6X / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10293.86 ± 134.72&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;189.33 ± 0.19&lt;/td&gt;
          &lt;td&gt;79c1160&lt;/td&gt;
          &lt;td&gt;@autonomous-AI-lab&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4090&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6X / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11992.70 ± 107.99&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;186.21 ± 0.13&lt;/td&gt;
          &lt;td&gt;2241453&lt;/td&gt;
          &lt;td&gt;@lhl&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 5080&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR7 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8297.36 ± 9.50&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;181.99 ± 0.42&lt;/td&gt;
          &lt;td&gt;8a4280c&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 5070 Ti&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR7 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6952.38 ± 13.73&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;176.85 ± 0.07&lt;/td&gt;
          &lt;td&gt;933414c&lt;/td&gt;
          &lt;td&gt;@TinyServal&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 6000 Ada&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9229.23 ± 101.78&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;176.07 ± 0.26&lt;/td&gt;
          &lt;td&gt;b8e09f0&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 3090 Ti&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6X / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6567.49 ± 20.30&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;171.19 ± 3.98&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@slaren&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 3090&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6X / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5174.69 ± 21.83&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;158.16 ± 0.21&lt;/td&gt;
          &lt;td&gt;c76b420&lt;/td&gt;
          &lt;td&gt;@m18coppola&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;L40&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8870.49 ± 378.76&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;152.01 ± 0.28&lt;/td&gt;
          &lt;td&gt;ee09828&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4080 SUPER&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6X / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8125.15 ± 41.05&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;148.33 ± 0.20&lt;/td&gt;
          &lt;td&gt;81086cd&lt;/td&gt;
          &lt;td&gt;@zacharyarnaise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4080&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6X / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8031.64 ± 26.49&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;142.49 ± 0.16&lt;/td&gt;
          &lt;td&gt;20638e4&lt;/td&gt;
          &lt;td&gt;@Ristovski&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 3080&lt;/td&gt;
          &lt;td&gt;10 GB / GDDR6X / 320 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5013.86 ± 24.80&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;139.65 ± 0.99&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@slaren&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX A6000&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4913.93 ± 6.79&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;138.73 ± 2.75&lt;/td&gt;
          &lt;td&gt;4795c91&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4070 Ti SUPER&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6X / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6924.53 ± 13.87&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;132.26 ± 0.16&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@Ristovski&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX PRO 4000 Blackwell&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR7 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4992.83 ± 113.52&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;131.66 ± 0.20&lt;/td&gt;
          &lt;td&gt;7d77f07&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX A5000&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4028.16 ± 19.14&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;130.07 ± 2.74&lt;/td&gt;
          &lt;td&gt;e5155e6&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla V100&lt;/td&gt;
          &lt;td&gt;32 GB / HBM2 / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3042.64 ± 40.71&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;129.08 ± 0.05&lt;/td&gt;
          &lt;td&gt;51f5a45&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 5070&lt;/td&gt;
          &lt;td&gt;12 GB / GDDR7 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5184.75 ± 18.70&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;127.54 ± 0.46&lt;/td&gt;
          &lt;td&gt;@Spyro000&lt;/td&gt;
          &lt;td&gt;-&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;A40&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4609.01 ± 10.67&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;124.11 ± 0.17&lt;/td&gt;
          &lt;td&gt;3470a5c&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;A30&lt;/td&gt;
          &lt;td&gt;24 GB / HBM2e / 3072 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2767.10 ± 1.88&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;124.81 ± 0.16&lt;/td&gt;
          &lt;td&gt;583cb83&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Titan V&lt;/td&gt;
          &lt;td&gt;12 GB / HBM2 / 3072 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2617.46 ± 2.10&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;108.79 ± 0.05&lt;/td&gt;
          &lt;td&gt;e56abd2&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 2080 Ti&lt;/td&gt;
          &lt;td&gt;11 GB / GDDR6 / 352 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2890.66 ± 2.42&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;107.51 ± 0.21&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@ariya&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro RTX 6000&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2751.18 ± 19.43&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;102.77 ± 0.04&lt;/td&gt;
          &lt;td&gt;b8e09f0&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro RTX 8000&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2709.95 ± 3.35&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;102.68 ± 0.03&lt;/td&gt;
          &lt;td&gt;b8e09f0&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX A4500&lt;/td&gt;
          &lt;td&gt;20 GB / GDDR6 / 320 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2827.20 ± 66.43&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;97.32 ± 2.80&lt;/td&gt;
          &lt;td&gt;5cdb27e&lt;/td&gt;
          &lt;td&gt;@aleksyx&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 5060 Ti 16 GB&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR7 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3737.25 ± 6.79&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;90.94 ± 0.02&lt;/td&gt;
          &lt;td&gt;89d1029&lt;/td&gt;
          &lt;td&gt;@mike-llamacpp&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 2070 SUPER&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2088.34 ± 1.94&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;88.06 ± 0.28&lt;/td&gt;
          &lt;td&gt;bc07349&lt;/td&gt;
          &lt;td&gt;@phstudy&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX A4000&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2684.06 ± 15.28&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;83.77 ± 0.37&lt;/td&gt;
          &lt;td&gt;65349f2&lt;/td&gt;
          &lt;td&gt;@TinyServal&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Titan Xp&lt;/td&gt;
          &lt;td&gt;12 GB / GDDR5X / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1154.96 ± 1.46&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;76.08 ± 0.08&lt;/td&gt;
          &lt;td&gt;c4510dc&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 3060&lt;/td&gt;
          &lt;td&gt;12 GB / GDDR6 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2137.50 ± 10.12&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;75.57 ± 0.07&lt;/td&gt;
          &lt;td&gt;baa9255&lt;/td&gt;
          &lt;td&gt;@QuantiusBenignus&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro RTX 4000&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1536.89 ± 0.90&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;65.62 ± 0.62&lt;/td&gt;
          &lt;td&gt;7d77f07&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4060 Ti 8 GB&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR6 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3394.63 ± 7.44&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;63.86 ± 0.01&lt;/td&gt;
          &lt;td&gt;89d1029&lt;/td&gt;
          &lt;td&gt;@mike-llamacpp&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;GTX 1080 Ti&lt;/td&gt;
          &lt;td&gt;11 GB / GDDR5X / 352 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1084.41 ± 3.01&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;62.49 ± 0.06&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@ariya&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX A4000 Ada&lt;/td&gt;
          &lt;td&gt;20 GB / GDDR6 / 160 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2779.77 ± 9.91&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;61.83 ± 0.04&lt;/td&gt;
          &lt;td&gt;a74a0d6&lt;/td&gt;
          &lt;td&gt;@sdwolfz&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 2060 SUPER&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1420.24 ± 1.95&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;60.04 ± 0.01&lt;/td&gt;
          &lt;td&gt;5c0eb5e&lt;/td&gt;
          &lt;td&gt;@ggerganov&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla P100&lt;/td&gt;
          &lt;td&gt;16 GB / HBM2 / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;760.80 ± 2.92&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;58.35 ± 0.00&lt;/td&gt;
          &lt;td&gt;b8372ee&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;DGX Spark&lt;/td&gt;
          &lt;td&gt;128 GB / LPDDR5x&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3062.31 ± 11.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;57.21 ± 0.06&lt;/td&gt;
          &lt;td&gt;5acd455&lt;/td&gt;
          &lt;td&gt;@ggerganov&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla P40&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR5 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1007.42 ± 1.23&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;54.74 ± 0.07&lt;/td&gt;
          &lt;td&gt;c76b420&lt;/td&gt;
          &lt;td&gt;@m18coppola&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 2000 Ada&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1956.22 ± 7.74&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;50.62 ± 0.04&lt;/td&gt;
          &lt;td&gt;756cfea&lt;/td&gt;
          &lt;td&gt;@DigitalRudeness&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla T4&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1219.06 ± 4.18&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;46.38 ± 0.73&lt;/td&gt;
          &lt;td&gt;d32e03f&lt;/td&gt;
          &lt;td&gt;@pt13762104&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4050 Laptop&lt;/td&gt;
          &lt;td&gt;6 GB / GDDR6 / 96 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1725.85 + 17.85&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;43.72 + 0.41&lt;/td&gt;
          &lt;td&gt;d79d8f3&lt;/td&gt;
          &lt;td&gt;@TimCabbage&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;GTX 1660&lt;/td&gt;
          &lt;td&gt;6 GB / GDDR5 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;148.91 ± 0.01&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;41.35 ± 0.02&lt;/td&gt;
          &lt;td&gt;9515c61&lt;/td&gt;
          &lt;td&gt;@ariya&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla M40&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR5 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;282.65 ± 0.15&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;38.04 ± 0.02&lt;/td&gt;
          &lt;td&gt;97d5117&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;GTX 1070 Ti&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR5 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;714.44 ± 2.04&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;37.82 ± 0.02&lt;/td&gt;
          &lt;td&gt;79c1160&lt;/td&gt;
          &lt;td&gt;@pebaryan&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Jetson AGX Orin&lt;/td&gt;
          &lt;td&gt;64 GB / LPDDR5 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;991.31 ± 1.15&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;33.58 ± 0.14&lt;/td&gt;
          &lt;td&gt;c1b1876&lt;/td&gt;
          &lt;td&gt;@TinyServal&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla P4&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR5 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;514.53 ± 3.06&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;33.29 ± 0.00&lt;/td&gt;
          &lt;td&gt;c76b420&lt;/td&gt;
          &lt;td&gt;@m18coppola&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;P106-100&lt;/td&gt;
          &lt;td&gt;6 GB / GDDR5 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;406.94 ± 0.25&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;30.40 ± 0.02&lt;/td&gt;
          &lt;td&gt;5fd160b&lt;/td&gt;
          &lt;td&gt;@pebaryan&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;GTX 1060&lt;/td&gt;
          &lt;td&gt;6 GB / GDDR5 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;416.85 ± 1.75&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;27.79 ± 0.02&lt;/td&gt;
          &lt;td&gt;5fd160b&lt;/td&gt;
          &lt;td&gt;@pebaryan&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro T1000&lt;/td&gt;
          &lt;td&gt;4 GB / GDDR5 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;79.44 ± 0.01&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;27.82 ± 0.18&lt;/td&gt;
          &lt;td&gt;f6da8cb&lt;/td&gt;
          &lt;td&gt;@hanabu&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro P2000&lt;/td&gt;
          &lt;td&gt;5 GB / GDDR5 / 160 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;309.30 ± 0.05&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;23.63 ± 0.00&lt;/td&gt;
          &lt;td&gt;baa9255&lt;/td&gt;
          &lt;td&gt;@TinyServal&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro P1000&lt;/td&gt;
          &lt;td&gt;4 GB / GDDR5 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;183.40 ± 0.11&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.99 ± 0.13&lt;/td&gt;
          &lt;td&gt;1e74897&lt;/td&gt;
          &lt;td&gt;@aleksyx&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla K80&lt;/td&gt;
          &lt;td&gt;12 GB / GDDR5 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;133.14 ± 0.55&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.80 ± 0.02&lt;/td&gt;
          &lt;td&gt;32732f2&lt;/td&gt;
          &lt;td&gt;@pebaryan&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;llama-2-7b-q4_0-with-fa&#34;&gt;Llama 2 7B, Q4_0, with FA
&lt;/h3&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Chip&lt;/th&gt;
          &lt;th&gt;Memory&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;pp512 t/s&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;tg128 t/s&lt;/th&gt;
          &lt;th&gt;Commit&lt;/th&gt;
          &lt;th&gt;Thanks to&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 5090&lt;/td&gt;
          &lt;td&gt;32 GB / GDDR7 / 512 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14970.15 ± 381.06&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;300.40 ± 0.28&lt;/td&gt;
          &lt;td&gt;8cf6b42&lt;/td&gt;
          &lt;td&gt;@totaldev&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX PRO 6000 Blackwell&lt;/td&gt;
          &lt;td&gt;96 GB / GDDR7 / 512 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16618.98 ± 20.66&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;281.11 ± 0.41&lt;/td&gt;
          &lt;td&gt;5143fa8&lt;/td&gt;
          &lt;td&gt;@Tom94&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;H100 80 GB&lt;/td&gt;
          &lt;td&gt;80 GB / HBM3 / 5120 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11263.29 ± 98.34&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;280.74 ± 1.17&lt;/td&gt;
          &lt;td&gt;5143fa8&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;A100 80 GB&lt;/td&gt;
          &lt;td&gt;80 GB / HBM2e / 5120 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5285.96 ± 6.58&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;200.90 ± 0.12&lt;/td&gt;
          &lt;td&gt;5143fa8&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4090 D&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6X / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12506.97 ± 11.51&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;191.57 ± 0.03&lt;/td&gt;
          &lt;td&gt;79c1160&lt;/td&gt;
          &lt;td&gt;@autonomous-AI-lab&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4090&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6X / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14770.63 ± 102.93&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;188.96 ± 0.05&lt;/td&gt;
          &lt;td&gt;2241453&lt;/td&gt;
          &lt;td&gt;@lhl&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 5080&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR7 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9487.70 ± 21.89&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;184.68 ± 0.05&lt;/td&gt;
          &lt;td&gt;8a4280c&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 5070 Ti&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR7 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8419.56 ± 35.50&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;182.43 ± 0.09&lt;/td&gt;
          &lt;td&gt;933414c&lt;/td&gt;
          &lt;td&gt;@TinyServal&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 6000 Ada&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10576.85 ± 530.21&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;179.47 ± 0.32&lt;/td&gt;
          &lt;td&gt;b8e09f0&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 3090 Ti&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6X / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6924.01 ± 10.76&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;172.26 ± 1.31&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@slaren&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX PRO 4500 Blackwell&lt;/td&gt;
          &lt;td&gt;32 GB / GDDR7 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;7251.66 ± 92.40&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;168.90 ± 0.20&lt;/td&gt;
          &lt;td&gt;becc481&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 3090&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6X / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5560.06 ± 16.28&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;161.89 ± 0.18&lt;/td&gt;
          &lt;td&gt;c76b420&lt;/td&gt;
          &lt;td&gt;@m18coppola&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;L40&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10097.64 ± 671.22&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;153.76 ± 0.12&lt;/td&gt;
          &lt;td&gt;ee09828&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4080 SUPER&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6X / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9439.01 ± 56.75&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;147.48 ± 1.41&lt;/td&gt;
          &lt;td&gt;81086cd&lt;/td&gt;
          &lt;td&gt;@zacharyarnaise&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4080&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6X / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9205.93 ± 22.31&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;143.47 ± 0.02&lt;/td&gt;
          &lt;td&gt;20638e4&lt;/td&gt;
          &lt;td&gt;@Ristovski&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX A6000&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5662.39 ± 13.87&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;144.87 ± 0.18&lt;/td&gt;
          &lt;td&gt;4795c91&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 3080&lt;/td&gt;
          &lt;td&gt;10 GB / GDDR6X / 320 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5569.56 ± 14.04&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;139.95 ± 0.95&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@slaren&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX PRO 4000 Blackwell&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR7 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5674.44 ± 139.53&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;136.38 ± 0.13&lt;/td&gt;
          &lt;td&gt;7d77f07&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX A5000&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4552.15 ± 9.68&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;135.83 ± 0.11&lt;/td&gt;
          &lt;td&gt;e5155e6&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla V100&lt;/td&gt;
          &lt;td&gt;32 GB / HBM2 / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2973.78 ± 3.62&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;134.76 ± 0.02&lt;/td&gt;
          &lt;td&gt;51f5a45&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4070 Ti SUPER&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6X / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;7612.32 ± 37.35&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;132.85 ± 0.31&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@Ristovski&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;A30&lt;/td&gt;
          &lt;td&gt;24 GB / HBM2e / 3072 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3068.72 ± 0.63&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;131.93 ± 0.18&lt;/td&gt;
          &lt;td&gt;583cb83&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 5070&lt;/td&gt;
          &lt;td&gt;12 GB / GDDR7 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5783.44 ± 36.95&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;128.21 ± 2.52&lt;/td&gt;
          &lt;td&gt;@Spyro000&lt;/td&gt;
          &lt;td&gt;-&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;A40&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5256.38 ± 19.39&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;126.24 ± 0.06&lt;/td&gt;
          &lt;td&gt;3470a5c&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Titan V&lt;/td&gt;
          &lt;td&gt;12 GB / HBM2 / 3072 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2481.25 ± 1.31&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;112.17 ± 0.01&lt;/td&gt;
          &lt;td&gt;e56abd2&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 2080 Ti&lt;/td&gt;
          &lt;td&gt;11 GB / GDDR6 / 352 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3107.61 ± 4.34&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;109.17 ± 0.07&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@ariya&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro RTX 6000&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3053.96 ± 1.37&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;104.38 ± 0.04&lt;/td&gt;
          &lt;td&gt;b8e09f0&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro RTX 8000&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3052.35 ± 5.64&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;103.63 ± 0.02&lt;/td&gt;
          &lt;td&gt;b8e09f0&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX A4500&lt;/td&gt;
          &lt;td&gt;20 GB / GDDR6 / 320 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3453.10 ± 49.19&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;103.00 ± 0.25&lt;/td&gt;
          &lt;td&gt;5cdb27e&lt;/td&gt;
          &lt;td&gt;@aleksyx&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 5060 Ti 16 GB&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR7 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4195.53 ± 1.98&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;93.46 ± 0.01&lt;/td&gt;
          &lt;td&gt;89d1029&lt;/td&gt;
          &lt;td&gt;@mike-llamacpp&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 2070 SUPER&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2293.29 ± 5.91&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;87.71 ± 0.29&lt;/td&gt;
          &lt;td&gt;bc07349&lt;/td&gt;
          &lt;td&gt;@phstudy&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX A4000&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2807.83 ± 52.44&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;85.17 ± 0.66&lt;/td&gt;
          &lt;td&gt;65349f2&lt;/td&gt;
          &lt;td&gt;@TinyServal&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 3060&lt;/td&gt;
          &lt;td&gt;12 GB / GDDR6 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2407.67 ± 3.73&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;76.92 ± 0.03&lt;/td&gt;
          &lt;td&gt;baa9255&lt;/td&gt;
          &lt;td&gt;@QuantiusBenignus&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Titan Xp&lt;/td&gt;
          &lt;td&gt;12 GB / GDDR5X / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1218.12 ± 1.82&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;73.84 ± 0.04&lt;/td&gt;
          &lt;td&gt;c4510dc&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro RTX 4000&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1662.80 ± 2.04&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;67.62 ± 0.67&lt;/td&gt;
          &lt;td&gt;7d77f07&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 4060 Ti 8 GB&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR6 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3803.45 ± 70.80&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64.03 ± 0.53&lt;/td&gt;
          &lt;td&gt;89d1029&lt;/td&gt;
          &lt;td&gt;@mike-llamacpp&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla P100&lt;/td&gt;
          &lt;td&gt;16 GB / HBM2 / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;787.36 ± 3.27&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;61.99 ± 0.00&lt;/td&gt;
          &lt;td&gt;b8372ee&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;GTX 1080 Ti&lt;/td&gt;
          &lt;td&gt;11 GB / GDDR5X / 352 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1138.14 ± 2.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;61.38 ± 0.03&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@ariya&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX A4000 Ada&lt;/td&gt;
          &lt;td&gt;20 GB / GDDR6 / 160 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3171.86 ± 4.34&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;61.37 ± 0.01&lt;/td&gt;
          &lt;td&gt;a74a0d6&lt;/td&gt;
          &lt;td&gt;@sdwolfz&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 2060 SUPER&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1563.77 ± 0.51&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;61.13 ± 0.05&lt;/td&gt;
          &lt;td&gt;5c0eb5e&lt;/td&gt;
          &lt;td&gt;@ggerganov&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;DGX Spark&lt;/td&gt;
          &lt;td&gt;128 GB / LPDDR5x&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3661.37 ± 38.66&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;56.74 ± 0.03&lt;/td&gt;
          &lt;td&gt;5acd455&lt;/td&gt;
          &lt;td&gt;@ggerganov&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla P40&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR5 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1079.66 ± 0.18&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.73 ± 0.05&lt;/td&gt;
          &lt;td&gt;c76b420&lt;/td&gt;
          &lt;td&gt;@m18coppola&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RTX 2000 Ada&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2250.14 ± 5.91&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;50.71 ± 0.01&lt;/td&gt;
          &lt;td&gt;756cfea&lt;/td&gt;
          &lt;td&gt;@DigitalRudeness&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla T4&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1309.73 ± 1.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;44.03 ± 0.57&lt;/td&gt;
          &lt;td&gt;d32e03f&lt;/td&gt;
          &lt;td&gt;@pt13762104&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;GTX 1660&lt;/td&gt;
          &lt;td&gt;6 GB / GDDR5 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;154.45 ± 0.52&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;41.43 ± 0.01&lt;/td&gt;
          &lt;td&gt;9515c61&lt;/td&gt;
          &lt;td&gt;@ariya&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla M40&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR5 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;290.17 ± 0.11&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;39.98 ± 0.01&lt;/td&gt;
          &lt;td&gt;97d5117&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;GTX 1070 Ti&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR5 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;790.52 ± 2.39&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;37.87 ± 0.00&lt;/td&gt;
          &lt;td&gt;79c1160&lt;/td&gt;
          &lt;td&gt;@pebaryan&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Jetson AGX Orin&lt;/td&gt;
          &lt;td&gt;64 GB / LPDDR5 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1171.96 ± 4.70&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;35.88 ± 0.18&lt;/td&gt;
          &lt;td&gt;c1b1876&lt;/td&gt;
          &lt;td&gt;@TinyServal&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla P4&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR5 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;529.53 ± 2.12&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;33.12 ± 0.03&lt;/td&gt;
          &lt;td&gt;c76b420&lt;/td&gt;
          &lt;td&gt;@m18coppola&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;P106-100&lt;/td&gt;
          &lt;td&gt;6 GB / GDDR5 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;438.49 ± 0.38&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;30.64 ± 0.06&lt;/td&gt;
          &lt;td&gt;5fd160b&lt;/td&gt;
          &lt;td&gt;@pebaryan&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;GTX 1060&lt;/td&gt;
          &lt;td&gt;6 GB / GDDR5 / 192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;446.19 ± 0.81&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28.18 ± 0.01&lt;/td&gt;
          &lt;td&gt;5fd160b&lt;/td&gt;
          &lt;td&gt;@pebaryan&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro T1000&lt;/td&gt;
          &lt;td&gt;4 GB / GDDR5 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;27.46 ± 0.23&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;27.46 ± 0.23&lt;/td&gt;
          &lt;td&gt;f6da8cb&lt;/td&gt;
          &lt;td&gt;@hanabu&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro P2000&lt;/td&gt;
          &lt;td&gt;5 GB / GDDR5 / 160 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;311.55 ± 0.19&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;23.76 ± 0.01&lt;/td&gt;
          &lt;td&gt;baa9255&lt;/td&gt;
          &lt;td&gt;@TinyServal&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Tesla K80&lt;/td&gt;
          &lt;td&gt;12 GB / GDDR5 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;133.36 ± 0.60&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;14.27 ± 0.32&lt;/td&gt;
          &lt;td&gt;32732f2&lt;/td&gt;
          &lt;td&gt;@pebaryan&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Quadro P1000&lt;/td&gt;
          &lt;td&gt;4 GB / GDDR5 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;173.82 ± 0.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.65 ± 0.14&lt;/td&gt;
          &lt;td&gt;1e74897&lt;/td&gt;
          &lt;td&gt;@aleksyx&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;apple-silicon-as-a-reference-baseline&#34;&gt;Apple Silicon as a Reference Baseline
&lt;/h2&gt;&lt;p&gt;Discussion #4167 is useful because it established a more unified benchmark format early on. Besides Q4_0, it also includes F16 and Q8_0, which helps explain PP / TG / t/s.
The thread explicitly defines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PP = prompt processing&lt;/li&gt;
&lt;li&gt;TG = 	ext-generation&lt;/li&gt;
&lt;li&gt;/s = 	okens per second
A representative example is the M2 Ultra time-series comparison:
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Time&lt;/th&gt;
          &lt;th&gt;Device&lt;/th&gt;
          &lt;th&gt;Version / Note&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Bandwidth GB/s&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;GPU Cores&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;F16 PP&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;F16 TG&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Q8_0 PP&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Q8_0 TG&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Q4_0 PP&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Q4_0 TG&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;2023-11-21&lt;/td&gt;
          &lt;td&gt;M2 Ultra&lt;/td&gt;
          &lt;td&gt;8e672ef&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;800&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;76&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1401.85&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;41.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1248.59&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;66.64&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1238.48&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;94.27&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;2024-11-12&lt;/td&gt;
          &lt;td&gt;M2 Ultra&lt;/td&gt;
          &lt;td&gt;86ed72d + FA&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;800&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;76&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1525.95&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;43.15&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1368.18&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;73.11&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1391.78&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;108.80&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;2025-08-02&lt;/td&gt;
          &lt;td&gt;M2 Ultra&lt;/td&gt;
          &lt;td&gt;5c0eb5e + FA&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;800&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;76&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1561.35&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;43.24&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1386.97&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;73.35&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1412.42&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;109.41&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Representative Apple Silicon entries shown in the thread:&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Device&lt;/td&gt;
          &lt;td&gt;Q4_0 PP&lt;/td&gt;
          &lt;td&gt;Q4_0 TG&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;Q8_0 PP&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;Q8_0 TG&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;F16 PP&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;F16 TG&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;&amp;mdash;&lt;/td&gt;
          &lt;td&gt;&amp;mdash;:&lt;/td&gt;
          &lt;td&gt;&amp;mdash;:&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;&amp;mdash;:&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;&amp;mdash;:&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;&amp;mdash;:&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;&amp;mdash;:&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;M1 Pro 16 GPU&lt;/td&gt;
          &lt;td&gt;266.25&lt;/td&gt;
          &lt;td&gt;36.41&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;270.37&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;22.34&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;302.14&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.75&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;M2 Ultra 76 GPU&lt;/td&gt;
          &lt;td&gt;1238.48&lt;/td&gt;
          &lt;td&gt;94.27&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1248.59&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;66.64&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1401.85&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;41.02&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;M3 Max 40 GPU&lt;/td&gt;
          &lt;td&gt;690.99&lt;/td&gt;
          &lt;td&gt;65.85&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;749.37&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;43.00&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;794.26&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;25.27&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;rocm--hip-scoreboards&#34;&gt;ROCm / HIP Scoreboards
&lt;/h2&gt;&lt;h3 id=&#34;llama-2-7b-q4_0-no-fa-1&#34;&gt;Llama 2 7B, Q4_0, no FA
&lt;/h3&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Chip&lt;/th&gt;
          &lt;th&gt;Memory&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;pp512 t/s&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;tg128 t/s&lt;/th&gt;
          &lt;th&gt;Commit&lt;/th&gt;
          &lt;th&gt;Thanks to&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Instinct MI300X&lt;/td&gt;
          &lt;td&gt;192 GB / HBM3 / 8192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11476.40 ± 72.79&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;232.92 ± 0.53&lt;/td&gt;
          &lt;td&gt;ee3a9fc&lt;/td&gt;
          &lt;td&gt;@yeahdongcn&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 7900 XTX&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3552.27 ± 101.96&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;167.11 ± 0.50&lt;/td&gt;
          &lt;td&gt;2f0c2db&lt;/td&gt;
          &lt;td&gt;@Diablo-D3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Instinct MI210&lt;/td&gt;
          &lt;td&gt;64 GB / HBM2e / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2486.22 ± 9.58&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;124.51 ± 0.04&lt;/td&gt;
          &lt;td&gt;8160b38&lt;/td&gt;
          &lt;td&gt;@65a&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Pro W7900&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3213.17 ± 80.47&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;121.18 ± 0.06&lt;/td&gt;
          &lt;td&gt;8160b38&lt;/td&gt;
          &lt;td&gt;@65a&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 7900 XT&lt;/td&gt;
          &lt;td&gt;20 GB / GDDR6 / 320 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3098.38 ± 24.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;116.15 ± 0.06&lt;/td&gt;
          &lt;td&gt;1e15bfd&lt;/td&gt;
          &lt;td&gt;@AdamNiederer&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 9070&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2381.77 ± 3.68&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;114.48 ± 0.60&lt;/td&gt;
          &lt;td&gt;d0660f2&lt;/td&gt;
          &lt;td&gt;@andj1210&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Instinct MI100&lt;/td&gt;
          &lt;td&gt;32 GB / HBM2 / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2732.83 ± 1.98&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;110.48 ± 0.14&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@firefox42&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 9070 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5055.19 ± 109.58&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;101.27 ± 0.27&lt;/td&gt;
          &lt;td&gt;583cb83&lt;/td&gt;
          &lt;td&gt;@Hadrianneue&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 7800 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2151.81 + 17.94&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;100.94 + 0.10&lt;/td&gt;
          &lt;td&gt;00131d6&lt;/td&gt;
          &lt;td&gt;@olegshulyakov&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Instinct MI50&lt;/td&gt;
          &lt;td&gt;32 GB / HBM2 / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1057.24 ± 0.53&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;98.95 ± 0.25&lt;/td&gt;
          &lt;td&gt;97d5117&lt;/td&gt;
          &lt;td&gt;@wtarreau&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 7900 GRE&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1456.98 ± 12.39&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96.07 ± 0.10&lt;/td&gt;
          &lt;td&gt;6fa3b55&lt;/td&gt;
          &lt;td&gt;@MihaiBojescu&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AI PRO R9700&lt;/td&gt;
          &lt;td&gt;32 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4443.54 ± 339.25&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;93.84 ± 0.26&lt;/td&gt;
          &lt;td&gt;bd4ef13&lt;/td&gt;
          &lt;td&gt;@gogich77&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Instinct MI60&lt;/td&gt;
          &lt;td&gt;32 GB / HBM2 / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1289.11 ± 0.62&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;91.46 ± 0.13&lt;/td&gt;
          &lt;td&gt;504af20&lt;/td&gt;
          &lt;td&gt;@Said-Akbar&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 6900 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1889.84 ± 31.21&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;88.49 ± 0.00&lt;/td&gt;
          &lt;td&gt;a972fae&lt;/td&gt;
          &lt;td&gt;@notgood&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Pro VII&lt;/td&gt;
          &lt;td&gt;16 GB / HBM2 / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1064.99 ± 1.18&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;87.45 ± 0.04&lt;/td&gt;
          &lt;td&gt;2739a71&lt;/td&gt;
          &lt;td&gt;@8XXD8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 6800 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1447.07 ± 1.36&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;83.92 ± 0.03&lt;/td&gt;
          &lt;td&gt;79c1160&lt;/td&gt;
          &lt;td&gt;@MrLavender&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Pro V620&lt;/td&gt;
          &lt;td&gt;32 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1803.65 ± 2.54&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;74.66 ± 0.01&lt;/td&gt;
          &lt;td&gt;5c0eb5e&lt;/td&gt;
          &lt;td&gt;@samteezy&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 9060 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1419.67 ± 3.64&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;67.58 ± 0.24&lt;/td&gt;
          &lt;td&gt;a0e13dc&lt;/td&gt;
          &lt;td&gt;@lcy0321&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 5700 XT&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;354.17 ± 0.18&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;67.55 ± 0.04&lt;/td&gt;
          &lt;td&gt;c05e8c9&lt;/td&gt;
          &lt;td&gt;@daniandtheweb&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Instinct MI25&lt;/td&gt;
          &lt;td&gt;16 GB / HBM2 / 2048 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;409.83 ± 0.23&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;63.94 ± 0.06&lt;/td&gt;
          &lt;td&gt;2739a71&lt;/td&gt;
          &lt;td&gt;@8XXD8&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AI Max+ 395&lt;/td&gt;
          &lt;td&gt;128 GB / LPDDR5&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;911.36 ± 1.79&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;50.01 ± 0.07&lt;/td&gt;
          &lt;td&gt;e60f241&lt;/td&gt;
          &lt;td&gt;@firefox42&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 7600 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1099.64 ± 2.05&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48.58 ± 0.06&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@wbruna&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX Vega 64&lt;/td&gt;
          &lt;td&gt;8 GB / HBM2 / 2048 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;240.68 ± 0.09&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48.46 ± 0.09&lt;/td&gt;
          &lt;td&gt;ec428b0&lt;/td&gt;
          &lt;td&gt;@davispuh&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Radeon 8060S&lt;/td&gt;
          &lt;td&gt;System Shared / DDR5&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;351.36 ± 0.67&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;47.97 ± 0.33&lt;/td&gt;
          &lt;td&gt;1d0125b&lt;/td&gt;
          &lt;td&gt;@hspak&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Radeon 880M&lt;/td&gt;
          &lt;td&gt;System Shared / DDR5&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;163.25 ± 13.86&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.97 ± 1.63&lt;/td&gt;
          &lt;td&gt;c55d53a&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;llama-2-7b-q4_0-with-fa-1&#34;&gt;Llama 2 7B, Q4_0, with FA
&lt;/h3&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Chip&lt;/th&gt;
          &lt;th&gt;Memory&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;pp512 t/s&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;tg128 t/s&lt;/th&gt;
          &lt;th&gt;Commit&lt;/th&gt;
          &lt;th&gt;Thanks to&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Instinct MI300X&lt;/td&gt;
          &lt;td&gt;192 GB / HBM3 / 8192 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11945.97 ± 54.29&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;218.53 ± 0.09&lt;/td&gt;
          &lt;td&gt;ee3a9fc&lt;/td&gt;
          &lt;td&gt;@yeahdongcn&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 7900 XTX&lt;/td&gt;
          &lt;td&gt;24 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3874.25 ± 11.92&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;170.12 ± 0.56&lt;/td&gt;
          &lt;td&gt;2f0c2db&lt;/td&gt;
          &lt;td&gt;@Diablo-D3&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Pro W7900&lt;/td&gt;
          &lt;td&gt;48 GB / GDDR6 / 384 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3472.86 ± 52.86&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;127.43 ± 0.12&lt;/td&gt;
          &lt;td&gt;8160b38&lt;/td&gt;
          &lt;td&gt;@65a&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Instinct MI210&lt;/td&gt;
          &lt;td&gt;64 GB / HBM2e / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2571.82 ± 2.89&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;130.18 ± 0.06&lt;/td&gt;
          &lt;td&gt;8160b38&lt;/td&gt;
          &lt;td&gt;@65a&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 9070&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2452.68 ± 1.33&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;115.32 ± 0.52&lt;/td&gt;
          &lt;td&gt;d0660f2&lt;/td&gt;
          &lt;td&gt;@andj1210&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 7900 XT&lt;/td&gt;
          &lt;td&gt;20 GB / GDDR6 / 320 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3261.75 ± 9.09&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;112.30 ± 0.06&lt;/td&gt;
          &lt;td&gt;1e15bfd&lt;/td&gt;
          &lt;td&gt;@AdamNiederer&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Instinct MI50&lt;/td&gt;
          &lt;td&gt;32 GB / HBM2 / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1129.43 ± 0.15&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;105.82 ± 0.07&lt;/td&gt;
          &lt;td&gt;97d5117&lt;/td&gt;
          &lt;td&gt;@wtarreau&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Instinct MI100&lt;/td&gt;
          &lt;td&gt;32 GB / HBM2 / 4096 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2755.00 ± 3.68&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;104.71 ± 0.10&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@firefox42&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AI PRO R9700&lt;/td&gt;
          &lt;td&gt;32 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4773.07 ± 49.30&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;97.98 ± 0.13&lt;/td&gt;
          &lt;td&gt;bd4ef13&lt;/td&gt;
          &lt;td&gt;@gogich77&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 7900 GRE&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1598.79 ± 11.48&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;97.53 ± 0.06&lt;/td&gt;
          &lt;td&gt;6fa3b55&lt;/td&gt;
          &lt;td&gt;@MihaiBojescu&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 9070 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4903.51 ± 96.36&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;97.28 ± 0.13&lt;/td&gt;
          &lt;td&gt;583cb83&lt;/td&gt;
          &lt;td&gt;@Hadrianneue&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 7800 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2304.63 + 2.85&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;95.99 + 0.21&lt;/td&gt;
          &lt;td&gt;00131d6&lt;/td&gt;
          &lt;td&gt;@olegshulyakov&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 6900 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1948.31 ± 13.51&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;85.04 ± 0.02&lt;/td&gt;
          &lt;td&gt;a972fae&lt;/td&gt;
          &lt;td&gt;@notgood&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Pro V620&lt;/td&gt;
          &lt;td&gt;32 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1256.86 ± 0.55&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;70.83 ± 0.02&lt;/td&gt;
          &lt;td&gt;5c0eb5e&lt;/td&gt;
          &lt;td&gt;@samteezy&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 9060 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1479.27 ± 0.71&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;65.42 ± 0.19&lt;/td&gt;
          &lt;td&gt;a0e13dc&lt;/td&gt;
          &lt;td&gt;@lcy0321&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 5700 XT&lt;/td&gt;
          &lt;td&gt;8 GB / GDDR6 / 256 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;314.17 ± 0.29&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;62.02 ± 0.05&lt;/td&gt;
          &lt;td&gt;c05e8c9&lt;/td&gt;
          &lt;td&gt;@daniandtheweb&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AI Max+ 395&lt;/td&gt;
          &lt;td&gt;128 GB / LPDDR5&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1003.53 ± 2.91&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;49.87 ± 0.02&lt;/td&gt;
          &lt;td&gt;e60f241&lt;/td&gt;
          &lt;td&gt;@firefox42&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Radeon 8060S&lt;/td&gt;
          &lt;td&gt;System Shared / DDR5&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;366.08 ± 1.44&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48.97 ± 0.15&lt;/td&gt;
          &lt;td&gt;1d0125b&lt;/td&gt;
          &lt;td&gt;@hspak&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX 7600 XT&lt;/td&gt;
          &lt;td&gt;16 GB / GDDR6 / 128 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1199.16 ± 1.07&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;47.65 ± 0.06&lt;/td&gt;
          &lt;td&gt;9c35706&lt;/td&gt;
          &lt;td&gt;@wbruna&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;RX Vega 64&lt;/td&gt;
          &lt;td&gt;8 GB / HBM2 / 2048 bit&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;153.17 ± 0.72&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;42.46 ± 0.40&lt;/td&gt;
          &lt;td&gt;ec428b0&lt;/td&gt;
          &lt;td&gt;@davispuh&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Radeon 880M&lt;/td&gt;
          &lt;td&gt;System Shared / DDR5&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;213.31 ± 14.05&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.16 ± 1.41&lt;/td&gt;
          &lt;td&gt;c55d53a&lt;/td&gt;
          &lt;td&gt;@Hedede&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;vulkan-scoreboards&#34;&gt;Vulkan Scoreboards
&lt;/h2&gt;&lt;h3 id=&#34;llama-2-7b-q4_0-no-fa-2&#34;&gt;Llama 2 7B, Q4_0, no FA
&lt;/h3&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Chip&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;pp512 t/s&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;tg128 t/s&lt;/th&gt;
          &lt;th&gt;Commit&lt;/th&gt;
          &lt;th&gt;Comments&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 5090&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10381.64 ± 508.84&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;263.63 ± 0.91&lt;/td&gt;
          &lt;td&gt;ca71fb9&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7900 XTX&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3531.93 ± 31.74&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;191.28 ± 0.20&lt;/td&gt;
          &lt;td&gt;2f0c2db&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 4090&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9452.03 ± 187.70&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;187.97 ± 0.21&lt;/td&gt;
          &lt;td&gt;4ae88d0&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 5080&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;7444.99 ± 20.11&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;185.10 ± 0.54&lt;/td&gt;
          &lt;td&gt;f6b533d&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia A100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6389.86 ± 4.83&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160.78 ± 0.16&lt;/td&gt;
          &lt;td&gt;2257758&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 3090&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4298.97 ± 10.59&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;160.13 ± 0.25&lt;/td&gt;
          &lt;td&gt;4ae88d0&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 4080 Super&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;7101.18 ± 269.79&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;147.13 ± 5.64&lt;/td&gt;
          &lt;td&gt;81086cd&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 3080&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4287.11 ± 55.50&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;139.15 ± 0.05&lt;/td&gt;
          &lt;td&gt;7c7d6ce&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX A5000&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3641.55 ± 9.05&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;139.89 ± 0.69&lt;/td&gt;
          &lt;td&gt;4ae88d0&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 9070 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5036.04 ± 88.16&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;137.11 ± 0.02&lt;/td&gt;
          &lt;td&gt;e9fd8dc&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 5070 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6213.63 ± 27.72&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;135.63 ± 0.18&lt;/td&gt;
          &lt;td&gt;d13d0f6&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon AI Pro R9700&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4036.04 ± 34.58&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;130.19 ± 0.39&lt;/td&gt;
          &lt;td&gt;3191462&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla V100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1391.39 ± 1.19&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;129.58 ± 0.58&lt;/td&gt;
          &lt;td&gt;7d77f07&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 4070 Ti Super&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6099.18 ± 154.30&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;129.45 ± 0.18&lt;/td&gt;
          &lt;td&gt;4ae88d0&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7900 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2941.58 ± 17.17&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;123.18 ± 0.40&lt;/td&gt;
          &lt;td&gt;71e74a3&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 9070&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3164.10 ± 66.84&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;119.71 ± 3.40&lt;/td&gt;
          &lt;td&gt;21c17b5&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7800 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2017.33 ± 19.30&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;118.27 ± 0.27&lt;/td&gt;
          &lt;td&gt;4fdbc1e&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7900 GRE&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2336.31 ± 7.52&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;116.11 ± 0.26&lt;/td&gt;
          &lt;td&gt;4b2a477&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M3 Ultra&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1116.83 ± 0.55&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;115.54 ± 0.78&lt;/td&gt;
          &lt;td&gt;2d451c8&lt;/td&gt;
          &lt;td&gt;MoltenVK&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc Pro B70&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3379.00 ± 47.92&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;112.02 ± 1.08&lt;/td&gt;
          &lt;td&gt;b863507&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Titan V&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;984.36 ± 4.13&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;108.86 ± 0.28&lt;/td&gt;
          &lt;td&gt;e56abd2&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro VII&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1078.54 ± 0.86&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;107.82 ± 0.14&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6900 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1837.21 ± 25.44&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;104.60 ± 0.30&lt;/td&gt;
          &lt;td&gt;a972fae&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc Pro A60&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2261.11 ± 9.53&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;104.25 ± 0.07&lt;/td&gt;
          &lt;td&gt;97d5117&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6800 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1752.92 ± 1.71&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;100.32 ± 0.97&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon VII&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1059.14 ± 0.56&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;101.19 ± 0.53&lt;/td&gt;
          &lt;td&gt;77d6ae4&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 2080 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1888.24 ± 9.20&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;97.58 ± 6.60&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6800&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1698.69 ± 0.80&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;95.61 ± 0.19&lt;/td&gt;
          &lt;td&gt;4b385bf&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro W6800X Duo&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;687.71 ± 4.33&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;94.82 ± 0.12&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 5060 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3460.92 ± 7.16&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;93.51 ± 0.15&lt;/td&gt;
          &lt;td&gt;89f10ba&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 4070&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3179.37 ± 46.16&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;92.29 ± 0.28&lt;/td&gt;
          &lt;td&gt;9a48399&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro W6800X&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;510.80 ± 0.13&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;86.47 ± 0.46&lt;/td&gt;
          &lt;td&gt;13b4548&lt;/td&gt;
          &lt;td&gt;MoltenVK&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6700 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1051.20 ± 0.98&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;83.88 ± 0.08&lt;/td&gt;
          &lt;td&gt;6d75883&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6750 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1040.58 ± 0.35&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;81.98 ± 0.03&lt;/td&gt;
          &lt;td&gt;228f34c&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro V620&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1595.32 ± 1.59&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;81.78 ± 0.06&lt;/td&gt;
          &lt;td&gt;03d4698&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 3070&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2113.02 ± 7.38&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;78.71 ± 0.13&lt;/td&gt;
          &lt;td&gt;1b8fb81&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Instinct MI60&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;369.26 ± 2.48&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;78.16 ± 1.40&lt;/td&gt;
          &lt;td&gt;504af20&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 3060&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1815.70 ± 5.85&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;75.94 ± 0.80&lt;/td&gt;
          &lt;td&gt;92c0b38&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M4 Max&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;724.77 ± 20.93&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;75.02 ± 0.14&lt;/td&gt;
          &lt;td&gt;1ece0cb6&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla T10&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1692.70 ± 2.05&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;75.01 ± 0.21&lt;/td&gt;
          &lt;td&gt;7f76692&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX A4000&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2248.14 ± 7.59&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;73.74 ± 0.08&lt;/td&gt;
          &lt;td&gt;f5245b5&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 5700 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;529.69 ± 0.26&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;70.73 ± 0.04&lt;/td&gt;
          &lt;td&gt;4fdbc1e&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 9060 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2141.67 ± 6.87&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;70.54 ± 0.74&lt;/td&gt;
          &lt;td&gt;ed52f36&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc B580&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;620.94 ± 15.33&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;70.14 ± 0.28&lt;/td&gt;
          &lt;td&gt;7f76692&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro V540&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;583.88 ± 6.56&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;69.64 ± 0.24&lt;/td&gt;
          &lt;td&gt;9da3dcd&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro W5700&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;449.85 ± 0.46&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;68.55 ± 0.15&lt;/td&gt;
          &lt;td&gt;23bc779&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc Pro B60&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;522.36 ± 3.60&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;68.55 ± 0.01&lt;/td&gt;
          &lt;td&gt;516a4ca&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 1080 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;540.69 ± 0.71&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64.99 ± 0.08&lt;/td&gt;
          &lt;td&gt;360d653&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 2070 Super&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1199.13 ± 7.70&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64.64 ± 0.20&lt;/td&gt;
          &lt;td&gt;b7552cf&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 3070 Mobile&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1689.40 ± 19.57&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;63.64 ± 0.39&lt;/td&gt;
          &lt;td&gt;ceff6bb&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla P100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;678.14 ± 1.40&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;63.16 ± 0.06&lt;/td&gt;
          &lt;td&gt;eec1e33&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD BC-250&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;370.66 ± 0.04&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;62.32 ± 0.32&lt;/td&gt;
          &lt;td&gt;5886f4f&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6650 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1029.52 ± 1.21&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;62.14 ± 0.02&lt;/td&gt;
          &lt;td&gt;dbb852b&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 4060 Mobile&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2135.66 ± 23.18&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;59.53 ± 0.03&lt;/td&gt;
          &lt;td&gt;a5c07dc&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla P40&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;488.06 ± 0.27&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;59.36 ± 0.16&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 1660 Ti Mobile&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;511.67 ± 2.85&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;56.60 ± 0.07&lt;/td&gt;
          &lt;td&gt;b43556e&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Instinct MI25&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;439.42 ± 0.34&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;54.69 ± 0.03&lt;/td&gt;
          &lt;td&gt;2739a71&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6600 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;574.65 ± 0.86&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.92 ± 0.11&lt;/td&gt;
          &lt;td&gt;091592d&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen AI Max+ 395&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1288.96 ± 6.49&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.59 ± 0.38&lt;/td&gt;
          &lt;td&gt;7f76692&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7600 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;840.85 ± 3.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.02 ± 0.01&lt;/td&gt;
          &lt;td&gt;01d8eaa&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc A770&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1073.85 + 29.68&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;52.56 + 0.11&lt;/td&gt;
          &lt;td&gt;a69d54f&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GB10&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2737.79 ± 19.56&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;52.28 ± 0.03&lt;/td&gt;
          &lt;td&gt;b9da444&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD FirePro S9300 x2&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;247.26 ± 0.43&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;51.86 ± 0.11&lt;/td&gt;
          &lt;td&gt;eec1e33&lt;/td&gt;
          &lt;td&gt;Split across two GPUs&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6600&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;761.89 ± 1.76&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;50.63 ± 0.02&lt;/td&gt;
          &lt;td&gt;b1c70e2&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX Vega 56&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;439.87 ± 0.61&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;50.23 ± 0.14&lt;/td&gt;
          &lt;td&gt;92c0b38&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc B570&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;913.95 ± 0.90&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;49.64 ± 0.03&lt;/td&gt;
          &lt;td&gt;7f76692&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 3060 Mobile&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1059.76 ± 3.54&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;49.03 ± 0.13&lt;/td&gt;
          &lt;td&gt;dbb3a47&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6800M&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;861.99 ± 7.67&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48.71 ± 0.71&lt;/td&gt;
          &lt;td&gt;8e6f8bc&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6600M&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;605.59 ± 0.65&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48.21 ± 0.07&lt;/td&gt;
          &lt;td&gt;fe5b78c&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc A770M&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;875.92 ± 2.16&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;47.69 ± 0.16&lt;/td&gt;
          &lt;td&gt;eeee367&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia P104-100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;311.90 ± 0.22&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;46.18 ± 0.05&lt;/td&gt;
          &lt;td&gt;eec1e33&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX Vega 64&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;356.08 ± 0.09&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;45.73 ± 0.18&lt;/td&gt;
          &lt;td&gt;ec428b0&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX A2000&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1245.19 ± 8.76&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;45.52 ± 0.54&lt;/td&gt;
          &lt;td&gt;b1afcab&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7600M XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;459.39 ± 2.34&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;45.28 ± 0.10&lt;/td&gt;
          &lt;td&gt;b9ab0a4&lt;/td&gt;
          &lt;td&gt;eGPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro V340&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;375.41 ± 0.24&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;45.16 ± 0.06&lt;/td&gt;
          &lt;td&gt;9da3dcd&lt;/td&gt;
          &lt;td&gt;Split across two GPUs&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 1070 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;297.50 ± 0.54&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;42.86 ± 1.20&lt;/td&gt;
          &lt;td&gt;860a9e4&lt;/td&gt;
          &lt;td&gt;eGPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc A750&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1075.94 ± 13.89&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;42.66 ± 0.18&lt;/td&gt;
          &lt;td&gt;c1b1876&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 4050 Mobile&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1154.28 + 15.76&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;41.89 + 0.10&lt;/td&gt;
          &lt;td&gt;d79d8f3&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 1070&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;321.57 ± 0.93&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;41.48 ± 0.09&lt;/td&gt;
          &lt;td&gt;eec1e33&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc Pro B50&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;193.50 ± 0.24&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;39.99 ± 0.10&lt;/td&gt;
          &lt;td&gt;7b43f55&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla M40&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;92.48 ± 0.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;39.35 ± 1.22&lt;/td&gt;
          &lt;td&gt;b8372ee&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 580&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;258.03 ± 0.71&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;39.32 ± 0.03&lt;/td&gt;
          &lt;td&gt;de4c07f&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 470&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;218.07 ± 0.56&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;38.63 ± 0.21&lt;/td&gt;
          &lt;td&gt;e288693&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro W5500&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;315.39 ± 3.76&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;36.82 ± 0.38&lt;/td&gt;
          &lt;td&gt;860a9e4&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 480&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;248.66 ± 0.28&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;34.71 ± 0.14&lt;/td&gt;
          &lt;td&gt;3b15924&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M2 Ultra&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;205.98 ± 0.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;34.34 ± 0.12&lt;/td&gt;
          &lt;td&gt;dbb852b&lt;/td&gt;
          &lt;td&gt;Asahi Linux&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 980&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;186.24 ± 0.09&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;33.90 ± 0.51&lt;/td&gt;
          &lt;td&gt;860a9e4&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia P106-100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;183.78 ± 0.26&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;29.77 ± 0.04&lt;/td&gt;
          &lt;td&gt;23bc779&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD FirePro W8100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;155.22 ± 0.17&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;29.52 ± 0.05&lt;/td&gt;
          &lt;td&gt;4536363&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla P4&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;265.54 ± 0.21&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28.03 ± 0.14&lt;/td&gt;
          &lt;td&gt;24d2ee0&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6500 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;255.25 ± 0.35&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;27.81 ± 0.10&lt;/td&gt;
          &lt;td&gt;g9fdfcd&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M3&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;263.70 ± 0.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;26.39 ± 0.14&lt;/td&gt;
          &lt;td&gt;b9ab0a4&lt;/td&gt;
          &lt;td&gt;MoltenVK&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD FirePro S10000&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;94.78 ± 0.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;25.32 ± 0.02&lt;/td&gt;
          &lt;td&gt;914a82d&lt;/td&gt;
          &lt;td&gt;Split across two GPUs&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Quadro P2000&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;169.55 ± 0.17&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;23.05 ± 0.03&lt;/td&gt;
          &lt;td&gt;63f8fe0&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Core Ultra 200 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;544.95 ± 4.15&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;22.49 ± 0.09&lt;/td&gt;
          &lt;td&gt;cea560f&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen AI 9 300 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;479.07 ± 0.41&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;22.41 ± 0.18&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 6000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;240.89 ± 0.52&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;21.26 ± 0.08&lt;/td&gt;
          &lt;td&gt;ee09828&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M2 Pro&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;62.70 ± 0.03&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20.95 ± 0.11&lt;/td&gt;
          &lt;td&gt;1fe0029&lt;/td&gt;
          &lt;td&gt;Asahi Linux&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 1050 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;136.42 ± 0.67&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20.96 ± 0.21&lt;/td&gt;
          &lt;td&gt;2f0c2db&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 8000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;266.19 ± 1.36&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20.53 ± 0.08&lt;/td&gt;
          &lt;td&gt;a5c07dc&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 7000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;281.62 ± 1.56&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;19.91 ± 0.07&lt;/td&gt;
          &lt;td&gt;ebce03e&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen Z1 Extreme&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;199.36 ± 7.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;18.77 ± 0.02&lt;/td&gt;
          &lt;td&gt;53ff6b9&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD FirePro D700&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;69.95 ± 0.04&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.62 ± 0.01&lt;/td&gt;
          &lt;td&gt;d3bd719&lt;/td&gt;
          &lt;td&gt;MoltenVK, running in FP16 mode on FP32 only chip&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro WX 4100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;78.79 ± 0.10&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.05 ± 0.07&lt;/td&gt;
          &lt;td&gt;860a9e4&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M2&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;50.79 ± 0.16&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;13.50 ± 0.02&lt;/td&gt;
          &lt;td&gt;8c0d6bb&lt;/td&gt;
          &lt;td&gt;Asahi Linux&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M1&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;38.29 ± 0.00&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.47 ± 0.03&lt;/td&gt;
          &lt;td&gt;2370665&lt;/td&gt;
          &lt;td&gt;Asahi Linux&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 5000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;90.55 ± 0.08&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.98 ± 0.07&lt;/td&gt;
          &lt;td&gt;d84635b&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Core 1100 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;187.20 ± 1.78&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.39 ± 0.04&lt;/td&gt;
          &lt;td&gt;abb9f3c&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 550&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;52.66 ± 0.49&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.20 ± 0.01&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 4000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;103.87 ± 0.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.63 ± 0.01&lt;/td&gt;
          &lt;td&gt;4b385bf&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla K80&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;89.46 ± 0.10&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.39 ± 0.06&lt;/td&gt;
          &lt;td&gt;5d46bab&lt;/td&gt;
          &lt;td&gt;Running on single GPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla K40&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64.37 ± 0.09&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.30 ± 0.19&lt;/td&gt;
          &lt;td&gt;eec1e33&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;MediaTek Dimensity 9400&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;38.36 ± 15.15&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8.92 ± 0.06&lt;/td&gt;
          &lt;td&gt;b9ab0a4&lt;/td&gt;
          &lt;td&gt;GPU supports coopmat but pp512 is faster with it turned off&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Core Ultra 100 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;185.51 ± 0.22&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8.21 ± 0.07&lt;/td&gt;
          &lt;td&gt;1d72c84&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 3000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48.63 ± 0.10&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8.49 ± 0.01&lt;/td&gt;
          &lt;td&gt;1fe0029&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;CIX CD8180&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.80 ± 0.01&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;5.51 ± 0.00&lt;/td&gt;
          &lt;td&gt;4dca015&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Core 1000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;25.58 ± 0.00&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4.25 ± 0.18&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Core 8000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;25.43 ± 0.17&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.35 ± 0.03&lt;/td&gt;
          &lt;td&gt;c4df49a&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel N150&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28.84 ± 0.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.93 ± 0.00&lt;/td&gt;
          &lt;td&gt;4f63cd7&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id=&#34;llama-2-7b-q4_0-fa-enabled&#34;&gt;Llama 2 7B, Q4_0, FA enabled
&lt;/h3&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Chip&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;pp512 t/s&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;tg128 t/s&lt;/th&gt;
          &lt;th&gt;Commit&lt;/th&gt;
          &lt;th&gt;Comments&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 5090&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;11796.38 ± 601.36&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;273.68 ± 0.52&lt;/td&gt;
          &lt;td&gt;ca71fb9&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7900 XTX&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3332.90 ± 11.47&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;195.30 ± 0.23&lt;/td&gt;
          &lt;td&gt;2f0c2db&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 5080&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8054.59 ± 35.68&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;192.17 ± 0.21&lt;/td&gt;
          &lt;td&gt;f6b533d&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 4090&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10830.41 ± 36.25&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;190.10 ± 0.31&lt;/td&gt;
          &lt;td&gt;4ae88d0&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia A100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;7064.40 ± 1.63&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;170.56 ± 0.02&lt;/td&gt;
          &lt;td&gt;2257758&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 3090&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4732.33 ± 4.80&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;162.28 ± 0.21&lt;/td&gt;
          &lt;td&gt;4ae88d0&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 4080 Super&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8007.37 ± 46.03&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;150.20 ± 0.26&lt;/td&gt;
          &lt;td&gt;81086cd&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 3080&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4913.83 ± 21.52&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;145.74 ± 0.16&lt;/td&gt;
          &lt;td&gt;7c7d6ce&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla V100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1411.25 ± 2.12&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;142.13 ± 0.03&lt;/td&gt;
          &lt;td&gt;7d77f07&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX A5000&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4071.22 ± 13.13&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;140.43 ± 0.22&lt;/td&gt;
          &lt;td&gt;4ae88d0&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 9070 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4911.74 ± 28.52&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;138.20 ± 0.18&lt;/td&gt;
          &lt;td&gt;e9fd8dc&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 5070 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;6764.53 ± 11.95&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;135.65 ± 0.02&lt;/td&gt;
          &lt;td&gt;d13d0f6&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon AI Pro R9700&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4333.83 ± 29.36&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;130.90 ± 0.12&lt;/td&gt;
          &lt;td&gt;3191462&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7900 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3043.93 ± 10.42&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;124.20 ± 0.09&lt;/td&gt;
          &lt;td&gt;71e74a3&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7800 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2094.64 ± 14.38&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;119.63 ± 0.13&lt;/td&gt;
          &lt;td&gt;4fdbc1e&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 9070&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3277.24 ± 18.17&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;119.55 ± 0.06&lt;/td&gt;
          &lt;td&gt;21c17b5&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7900 GRE&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2402.07 ± 22.50&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;116.77 ± 0.08&lt;/td&gt;
          &lt;td&gt;4b2a477&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M3 Ultra&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1115.55 ± 0.75&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;115.99 ± 0.12&lt;/td&gt;
          &lt;td&gt;2d451c8&lt;/td&gt;
          &lt;td&gt;MoltenVK&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc Pro B70&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3314.53 ± 17.95&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;111.63 ± 0.05&lt;/td&gt;
          &lt;td&gt;b863507&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Titan V&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;792.74 ± 4.30&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;109.21 ± 0.72&lt;/td&gt;
          &lt;td&gt;e56abd2&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro VII&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;783.94 ± 0.77&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;108.45 ± 0.48&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6900 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1761.93 ± 4.75&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;106.15 ± 0.04&lt;/td&gt;
          &lt;td&gt;a972fae&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 2080 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1936.25 ± 32.08&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;100.99 ± 0.24&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6800 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1704.79 ± 0.71&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;100.50 ± 0.06&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro W6800X Duo&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;795.28 ± 0.72&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;100.08 ± 0.02&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 5060 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3912.65 ± 5.86&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;97.01 ± 0.14&lt;/td&gt;
          &lt;td&gt;89f10ba&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6800&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1749.46 ± 3.36&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;96.65 ± 0.48&lt;/td&gt;
          &lt;td&gt;4b385bf&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 4070&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4293.57 ± 27.70&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;91.49 ± 0.89&lt;/td&gt;
          &lt;td&gt;9a48399&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6750 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;997.05 ± 0.45&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;82.29 ± 0.06&lt;/td&gt;
          &lt;td&gt;228f34c&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6700 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1010.90 ± 12.89&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;81.86 ± 0.19&lt;/td&gt;
          &lt;td&gt;6d75883&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 3060&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2012.88 ± 10.12&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;80.59 ± 0.02&lt;/td&gt;
          &lt;td&gt;92c0b38&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro V620&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1556.31 ± 2.82&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;79.24 ± 0.09&lt;/td&gt;
          &lt;td&gt;03d4698&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX A4000&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2482.74 ± 26.05&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;76.07 ± 0.08&lt;/td&gt;
          &lt;td&gt;f5245b5&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla T10&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1840.14 ± 1.22&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;76.05 ± 0.13&lt;/td&gt;
          &lt;td&gt;7f76692&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 5700 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;538.31 ± 0.35&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;74.43 ± 0.03&lt;/td&gt;
          &lt;td&gt;4fdbc1e&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc B580&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;419.49 ± 3.37&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;72.00 ± 0.24&lt;/td&gt;
          &lt;td&gt;7f76692&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M4 Max&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;557.46 ± 26.87&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;71.79 ± 4.16&lt;/td&gt;
          &lt;td&gt;1ece0cb6&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro W5700&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;446.98 ± 0.39&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;71.30 ± 0.24&lt;/td&gt;
          &lt;td&gt;23bc779&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc Pro B60&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;274.76 ± 0.27&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;70.54 ± 0.03&lt;/td&gt;
          &lt;td&gt;516a4ca&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 9060 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1915.41 ± 7.90&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;70.52 ± 0.16&lt;/td&gt;
          &lt;td&gt;ed52f36&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla P100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;685.51 ± 0.88&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;66.48 ± 0.02&lt;/td&gt;
          &lt;td&gt;eec1e33&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6650 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1088.90 ± 0.40&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64.53 ± 0.75&lt;/td&gt;
          &lt;td&gt;dbb852b&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 1080 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;529.96 ± 0.38&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64.63 ± 0.10&lt;/td&gt;
          &lt;td&gt;360d653&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD BC-250&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;356.87 ± 1.24&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;63.14 ± 0.09&lt;/td&gt;
          &lt;td&gt;5886f4f&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 3070 Mobile&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1832.07 ± 57.14&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;62.92 ± 0.37&lt;/td&gt;
          &lt;td&gt;ceff6bb&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX 4060 Mobile&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2358.03 ± 12.17&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;60.01 ± 0.08&lt;/td&gt;
          &lt;td&gt;a5c07dc&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla P40&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;484.37 ± 0.27&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;59.22 ± 0.15&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 1660 Ti Mobile&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;514.34 ± 0.88&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;57.30 ± 0.42&lt;/td&gt;
          &lt;td&gt;b43556e&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 7600 XT&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1024.38 ± 7.56&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;56.11 ± 0.02&lt;/td&gt;
          &lt;td&gt;01d8eaa&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD FirePro S9300 x2&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;243.33 ± 0.22&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;55.64 ± 0.06&lt;/td&gt;
          &lt;td&gt;eec1e33&lt;/td&gt;
          &lt;td&gt;Split across two GPUs&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GB10&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3279.89 ± 26.78&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.64 ± 0.05&lt;/td&gt;
          &lt;td&gt;b9da444&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6600&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;808.76 ± 0.15&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.24 ± 0.03&lt;/td&gt;
          &lt;td&gt;b1c70e2&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc A770&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1119.68 + 30.25&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.07 + 0.09&lt;/td&gt;
          &lt;td&gt;a69d54f&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen AI Max+ 395&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1357.07 ± 10.94&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;53.00 ± 0.13&lt;/td&gt;
          &lt;td&gt;7f76692&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX Vega 56&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;428.54 ± 0.50&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;52.66 ± 0.03&lt;/td&gt;
          &lt;td&gt;92c0b38&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc B570&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;288.51 ± 0.09&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;50.49 ± 0.05&lt;/td&gt;
          &lt;td&gt;7f76692&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia P104-100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;325.30 ± 0.25&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;48.64 ± 0.04&lt;/td&gt;
          &lt;td&gt;eec1e33&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro V340&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;360.23 ± 0.74&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;47.54 ± 0.06&lt;/td&gt;
          &lt;td&gt;9da3dcd&lt;/td&gt;
          &lt;td&gt;Split across two GPUs&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 6800M&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;784.16 ± 2.76&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;49.06 ± 0.34&lt;/td&gt;
          &lt;td&gt;8e6f8bc&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX Vega 64&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;320.12 ± 0.22&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;47.06 ± 0.01&lt;/td&gt;
          &lt;td&gt;ec428b0&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia RTX A2000&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;1361.85 ± 3.26&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;45.69 ± 0.20&lt;/td&gt;
          &lt;td&gt;b1afcab&lt;/td&gt;
          &lt;td&gt;coopmat2&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc A770M&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;384.74 ± 0.78&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;45.68 ± 0.06&lt;/td&gt;
          &lt;td&gt;eeee367&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc A750&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;303.37 ± 1.44&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;43.96 ± 0.03&lt;/td&gt;
          &lt;td&gt;c1b1876&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 1070 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;292.85 ± 0.23&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;43.42 ± 0.34&lt;/td&gt;
          &lt;td&gt;860a9e4&lt;/td&gt;
          &lt;td&gt;eGPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 1070&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;330.84 ± 1.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;43.33 ± 0.06&lt;/td&gt;
          &lt;td&gt;360d653&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla M40&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;93.35 ± 0.01&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;41.68 ± 0.01&lt;/td&gt;
          &lt;td&gt;b8372ee&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Arc Pro B50&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;132.48 ± 0.04&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;41.02 ± 0.04&lt;/td&gt;
          &lt;td&gt;7b43f55&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 470&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;197.26 ± 0.27&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;37.28 ± 0.11&lt;/td&gt;
          &lt;td&gt;3769fe6&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon RX 480&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;194.52 ± 0.61&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;37.23 ± 0.09&lt;/td&gt;
          &lt;td&gt;0bcb40b&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M2 Ultra&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;198.83 ± 0.85&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;198.83 ± 0.85&lt;/td&gt;
          &lt;td&gt;dbb852b&lt;/td&gt;
          &lt;td&gt;Asahi Linux&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 980&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;180.97 ± 0.74&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;34.16 ± 0.10&lt;/td&gt;
          &lt;td&gt;860a9e4&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia P106-100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;183.40 ± 0.34&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;30.79 ± 0.32&lt;/td&gt;
          &lt;td&gt;23bc779&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD FirePro W8100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;140.52 ± 0.34&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;29.28 ± 0.14&lt;/td&gt;
          &lt;td&gt;4536363&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla P4&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;287.14 ± 0.29&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;28.37 ± 0.24&lt;/td&gt;
          &lt;td&gt;24d2ee0&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Quadro P2000&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;181.71 ± 0.12&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;23.77 ± 0.02&lt;/td&gt;
          &lt;td&gt;63f8fe0&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Core Ultra 200 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;536.48 ± 1.27&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;23.05 ± 0.04&lt;/td&gt;
          &lt;td&gt;cea560f&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen AI 9 300 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;532.59 ± 3.55&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;22.31 ± 0.06&lt;/td&gt;
          &lt;td&gt;N/A&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 6000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;277.91 ± 0.37&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;21.15 ± 0.09&lt;/td&gt;
          &lt;td&gt;ee09828&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M2 Pro&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;58.86 ± 0.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20.97 ± 0.03&lt;/td&gt;
          &lt;td&gt;1fe0029&lt;/td&gt;
          &lt;td&gt;Asahi Linux&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 8000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;297.39 ± 1.22&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20.59 ± 0.38&lt;/td&gt;
          &lt;td&gt;a5c07dc&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 7000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;312.85 ± 2.51&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20.09 ± 0.35&lt;/td&gt;
          &lt;td&gt;835b2b9&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia GTX 1050 Ti&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;127.54 ± 1.03&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;20.08 ± 0.17&lt;/td&gt;
          &lt;td&gt;2f0c2db&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Radeon Pro WX 4100&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;75.59 ± 0.19&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;16.56 ± 0.04&lt;/td&gt;
          &lt;td&gt;860a9e4&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M1&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;35.93 ± 0.00&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.85 ± 0.02&lt;/td&gt;
          &lt;td&gt;2370665&lt;/td&gt;
          &lt;td&gt;Asahi Linux&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Apple M2&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;46.81 ± 0.08&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;12.25 ± 2.30&lt;/td&gt;
          &lt;td&gt;8c0d6bb&lt;/td&gt;
          &lt;td&gt;Asahi Linux&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 5000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;79.06 ± 0.01&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.75 ± 0.00&lt;/td&gt;
          &lt;td&gt;5d195f1&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Core 1100 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;174.77 ± 4.47&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;10.58 ± 0.03&lt;/td&gt;
          &lt;td&gt;abb9f3c&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla K40&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;64.37 ± 0.02&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.92 ± 0.06&lt;/td&gt;
          &lt;td&gt;eec1e33&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 4000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;113.32 ± 0.01&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.87 ± 0.01&lt;/td&gt;
          &lt;td&gt;4b385bf&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Nvidia Tesla K80&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;88.26 ± 0.19&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;9.49 ± 0.01&lt;/td&gt;
          &lt;td&gt;5d46bab&lt;/td&gt;
          &lt;td&gt;Running on single GPU&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;AMD Ryzen 5 3000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;47.41 ± 0.14&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;8.47 ± 0.01&lt;/td&gt;
          &lt;td&gt;1fe0029&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Core Ultra 100 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;77.66 ± 2.75&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;7.75 ± 0.05&lt;/td&gt;
          &lt;td&gt;2e89f76&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel Core 8000 Series&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;25.55 ± 0.04&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;3.35 ± 0.02&lt;/td&gt;
          &lt;td&gt;c4df49a&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Intel N150&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;25.59 ± 0.00&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.91 ± 0.00&lt;/td&gt;
          &lt;td&gt;4f63cd7&lt;/td&gt;
          &lt;td&gt;&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;how-to-use-these-tables&#34;&gt;How to Use These Tables
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;Decide whether you care more about 	g128 or pp512.
For chat and interactive use, 	g128 usually matters more. For long prompts and batch throughput, pp512 matters more.&lt;/li&gt;
&lt;li&gt;Match the backend you actually use.
Nvidia users should usually prioritize CUDA. AMD users should compare ROCm and Vulkan first. Cross-platform users should pay close attention to Vulkan.&lt;/li&gt;
&lt;li&gt;Check FA last.
On many GPUs, enabling FA improves pp512 more than 	g128, so a single headline number can be misleading.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;one-sentence-summary&#34;&gt;One-Sentence Summary
&lt;/h2&gt;&lt;p&gt;In llama.cpp benchmarks, pp512, 	g128, Q4_0, FA, and CUDA / ROCm / Vulkan describe different dimensions. Once the benchmark context is clear, the tables become much easier to read.&lt;/p&gt;
&lt;h2 id=&#34;sources&#34;&gt;Sources
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;CUDA discussion #15013: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/15013&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp/discussions/15013&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Apple Silicon discussion #4167: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/4167&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp/discussions/4167&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;ROCm discussion #15021: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/15021&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp/discussions/15021&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Vulkan discussion #10879: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ggml-org/llama.cpp/discussions/10879&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ggml-org/llama.cpp/discussions/10879&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>What the Common GPU Inference Benchmark Metrics Actually Mean: FA, pp512, tg128, and Q4_0</title>
        <link>https://knightli.com/en/2026/04/23/how-to-read-llm-cuda-scoreboard-fa-pp512-tg128-q4-0/</link>
        <pubDate>Thu, 23 Apr 2026 00:15:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/23/how-to-read-llm-cuda-scoreboard-fa-pp512-tg128-q4-0/</guid>
        <description>&lt;p&gt;As soon as you start looking at local LLM or GPU inference benchmarks, you quickly run into a stack of abbreviations: &lt;code&gt;FA&lt;/code&gt;, &lt;code&gt;pp512&lt;/code&gt;, &lt;code&gt;tg128&lt;/code&gt;, and &lt;code&gt;Q4_0&lt;/code&gt;. They all look like performance metrics, but without context they can be surprisingly hard to interpret.&lt;/p&gt;
&lt;p&gt;For example, you may see a line like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then right below it, you might also see:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pp512 t/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;tg128 t/s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you do not unpack what these terms mean, it becomes difficult to understand what the benchmark is actually measuring, or how to compare the results of two different GPUs.&lt;/p&gt;
&lt;p&gt;This article is not about which GPU is the better buy. It is specifically about breaking down the most common metrics you see in GPU inference benchmarks.&lt;/p&gt;
&lt;h2 id=&#34;first-what-the-whole-title-line-is-actually-saying&#34;&gt;First, what the whole title line is actually saying
&lt;/h2&gt;&lt;p&gt;A line like &lt;code&gt;CUDA Scoreboard for Llama 2 7B, Q4_0 (no FA)&lt;/code&gt; already tells you most of the test setup.&lt;/p&gt;
&lt;p&gt;At minimum, it contains four layers of information:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CUDA&lt;/code&gt;: the benchmark is running on the NVIDIA CUDA path&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Llama 2 7B&lt;/code&gt;: the model being tested is the 7B version of &lt;code&gt;Llama 2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Q4_0&lt;/code&gt;: the model uses a 4-bit quantized format&lt;/li&gt;
&lt;li&gt;&lt;code&gt;no FA&lt;/code&gt;: &lt;code&gt;Flash Attention&lt;/code&gt; was disabled in this test&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So in practical terms, this kind of title usually means:&lt;/p&gt;
&lt;p&gt;&amp;ldquo;A benchmark of a quantized large model running on an NVIDIA GPU, measured under a specific inference path.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;what-fa-means-flash-attention&#34;&gt;What FA means: Flash Attention
&lt;/h2&gt;&lt;p&gt;Here, &lt;code&gt;FA&lt;/code&gt; stands for &lt;code&gt;Flash Attention&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is one of the most important acceleration techniques in large-model training and inference, mainly because it optimizes how attention is computed. In Transformer models, attention is already one of the most expensive and memory-bandwidth-heavy parts of the entire pipeline.&lt;/p&gt;
&lt;p&gt;A traditional attention implementation often suffers from a few problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;frequent memory reads and writes&lt;/li&gt;
&lt;li&gt;many intermediate results&lt;/li&gt;
&lt;li&gt;repeated data movement between VRAM and on-chip cache&lt;/li&gt;
&lt;li&gt;rapidly growing overhead as context length increases&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What &lt;code&gt;Flash Attention&lt;/code&gt; does, in simple terms, is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reorganize the computation order&lt;/li&gt;
&lt;li&gt;reduce how often intermediate results are written back to VRAM&lt;/li&gt;
&lt;li&gt;keep more of the work inside faster cache&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That gives it three typical advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it is faster&lt;/li&gt;
&lt;li&gt;it saves memory&lt;/li&gt;
&lt;li&gt;it is mathematically equivalent to standard attention rather than a lower-accuracy shortcut&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is why so many modern inference and training frameworks treat it as a major optimization feature.&lt;/p&gt;
&lt;h2 id=&#34;what-no-fa-means&#34;&gt;What no FA means
&lt;/h2&gt;&lt;p&gt;If &lt;code&gt;FA&lt;/code&gt; means &lt;code&gt;Flash Attention&lt;/code&gt;, then &lt;code&gt;no FA&lt;/code&gt; simply means that &lt;code&gt;Flash Attention&lt;/code&gt; was not enabled for this test.&lt;/p&gt;
&lt;p&gt;In other words, the benchmark was measured using a more traditional attention implementation.&lt;/p&gt;
&lt;p&gt;There are several reasons benchmark tables explicitly label &lt;code&gt;no FA&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;to keep a baseline for comparison&lt;/li&gt;
&lt;li&gt;to support hardware or software environments where &lt;code&gt;FA&lt;/code&gt; is unavailable&lt;/li&gt;
&lt;li&gt;to avoid mixing scores from different optimization conditions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So when you see &lt;code&gt;no FA&lt;/code&gt;, you should not read it as &amp;ldquo;this GPU is weak.&amp;rdquo; A more accurate reading is:&lt;/p&gt;
&lt;p&gt;&amp;ldquo;This score was measured without Flash Attention enabled.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;what-q4_0-means-a-quantization-format&#34;&gt;What Q4_0 means: a quantization format
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Q4_0&lt;/code&gt; refers to a 4-bit quantization format.&lt;/p&gt;
&lt;p&gt;The original model weights are usually not stored at such low precision. Quantization compresses higher-precision weights into a lower-bit representation so the model becomes easier to run on consumer GPUs.&lt;/p&gt;
&lt;p&gt;A rough way to think about it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q&lt;/code&gt;: Quantization&lt;/li&gt;
&lt;li&gt;&lt;code&gt;4&lt;/code&gt;: 4-bit&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_0&lt;/code&gt;: a specific quantization scheme identifier&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Its practical importance is straightforward:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;smaller model size&lt;/li&gt;
&lt;li&gt;lower VRAM requirements&lt;/li&gt;
&lt;li&gt;better chances of fitting on consumer hardware&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;Llama 2 7B, Q4_0&lt;/code&gt; does not mean just &amp;ldquo;a normal 7B model.&amp;rdquo; It means &amp;ldquo;a 7B model already compressed using a 4-bit quantization format.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;what-pp512-ts-means&#34;&gt;What pp512 t/s means
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;pp512&lt;/code&gt; usually means:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Prompt Processing 512 tokens&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;It measures how fast the model processes the input prompt, usually in &lt;code&gt;t/s&lt;/code&gt;, meaning &lt;code&gt;tokens per second&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here, &lt;code&gt;512&lt;/code&gt; means the prompt length used in the test was &lt;code&gt;512 tokens&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This metric does not measure output speed. It measures how quickly the model encodes and computes over the input before it starts responding. You can think of it as the speed of the &amp;ldquo;reading the prompt first&amp;rdquo; stage.&lt;/p&gt;
&lt;p&gt;One important property of this stage is that it is usually much more parallelizable.&lt;/p&gt;
&lt;p&gt;Because the input sequence can be processed in batches, the GPU can often keep its compute units highly utilized. That is why &lt;code&gt;pp512&lt;/code&gt; numbers can look extremely high, sometimes almost suspiciously high at first glance.&lt;/p&gt;
&lt;p&gt;So if you see something like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pp512 ≈ 14000 t/s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;there is no reason to panic. That is measuring prompt-processing throughput, not the speed of token-by-token output generation.&lt;/p&gt;
&lt;h2 id=&#34;what-tg128-ts-means&#34;&gt;What tg128 t/s means
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;tg128&lt;/code&gt; usually means:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Text Generation 128 tokens&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;It measures the average speed of generating &lt;code&gt;128 tokens&lt;/code&gt;, again in &lt;code&gt;t/s&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This metric is much closer to what people intuitively mean when they ask whether a model feels fast, because it is directly measuring the output stage.&lt;/p&gt;
&lt;p&gt;But the biggest difference from &lt;code&gt;pp512&lt;/code&gt; is that text generation is usually autoregressive.&lt;/p&gt;
&lt;p&gt;That means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the model must generate the first token&lt;/li&gt;
&lt;li&gt;then use that to generate the second&lt;/li&gt;
&lt;li&gt;then continue to the third&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So this stage cannot be parallelized the way prompt processing can, and it is naturally much slower.&lt;/p&gt;
&lt;p&gt;That is why it is perfectly normal to see something like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pp512&lt;/code&gt; in the tens of thousands of &lt;code&gt;t/s&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tg128&lt;/code&gt; only in the hundreds of &lt;code&gt;t/s&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is not a benchmark error. These two metrics are measuring fundamentally different workloads.&lt;/p&gt;
&lt;h2 id=&#34;why-pp512-and-tg128-differ-so-much&#34;&gt;Why pp512 and tg128 differ so much
&lt;/h2&gt;&lt;p&gt;This is often the first thing people find confusing when reading a scoreboard.&lt;/p&gt;
&lt;p&gt;The short explanation is:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pp512&lt;/code&gt; is closer to measuring parallel throughput, while &lt;code&gt;tg128&lt;/code&gt; is closer to measuring token-by-token generation ability.&lt;/p&gt;
&lt;p&gt;To expand on that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the input stage is easier to parallelize&lt;/li&gt;
&lt;li&gt;the output stage depends on sequential token generation&lt;/li&gt;
&lt;li&gt;generation is usually more sensitive to memory bandwidth and cache behavior&lt;/li&gt;
&lt;li&gt;so generation speed being much lower than prompt-processing speed is entirely normal&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That also explains an interesting pattern you sometimes see in GPU comparisons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;one GPU is stronger in &lt;code&gt;pp512&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;another ends up slightly faster in &lt;code&gt;tg128&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is not contradictory. One metric leans more toward peak compute throughput, while the other reflects the actual memory and latency behavior of the generation path.&lt;/p&gt;
&lt;h2 id=&#34;how-to-think-about-ts&#34;&gt;How to think about t/s
&lt;/h2&gt;&lt;p&gt;Here, &lt;code&gt;t/s&lt;/code&gt; simply means &lt;code&gt;tokens per second&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It tells you how many tokens the model can process or generate in one second.&lt;/p&gt;
&lt;p&gt;But there is one important caveat: a &lt;code&gt;token&lt;/code&gt; is not the same thing as a character or a word. It is the unit produced by the model&amp;rsquo;s tokenizer, and its actual text length can vary a lot across models and languages.&lt;/p&gt;
&lt;p&gt;So in practice, &lt;code&gt;t/s&lt;/code&gt; is most useful for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;comparing different GPUs on the same model&lt;/li&gt;
&lt;li&gt;comparing different parameter settings in the same environment&lt;/li&gt;
&lt;li&gt;comparing a framework before and after a specific optimization is enabled&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is much less reliable as a universal &amp;ldquo;absolute speed&amp;rdquo; metric across different models, frameworks, and tokenizers.&lt;/p&gt;
&lt;h2 id=&#34;what-to-focus-on-first-when-reading-a-scoreboard&#34;&gt;What to focus on first when reading a scoreboard
&lt;/h2&gt;&lt;p&gt;If you do not want to get buried under abbreviations every time, start with these questions.&lt;/p&gt;
&lt;h3 id=&#34;1-what-model-is-being-tested&#34;&gt;1. What model is being tested
&lt;/h3&gt;&lt;p&gt;For example, is it &lt;code&gt;Llama 2 7B&lt;/code&gt;? Is it the same quantized variant, such as &lt;code&gt;Q4_0&lt;/code&gt;? If the model or quantization format changes, direct comparison becomes much less meaningful.&lt;/p&gt;
&lt;h3 id=&#34;2-whether-key-optimizations-are-enabled&#34;&gt;2. Whether key optimizations are enabled
&lt;/h3&gt;&lt;p&gt;The most common example is &lt;code&gt;FA&lt;/code&gt;. If one benchmark uses &lt;code&gt;Flash Attention&lt;/code&gt; and the other does not, those scores are not directly comparable.&lt;/p&gt;
&lt;h3 id=&#34;3-whether-the-metric-is-measuring-input-speed-or-output-speed&#34;&gt;3. Whether the metric is measuring input speed or output speed
&lt;/h3&gt;&lt;p&gt;&lt;code&gt;pp512&lt;/code&gt; and &lt;code&gt;tg128&lt;/code&gt; are measuring different stages. One is closer to prompt-reading speed, the other is closer to answer-generation speed.&lt;/p&gt;
&lt;h3 id=&#34;4-whether-you-care-about-throughput-or-user-feel&#34;&gt;4. Whether you care about throughput or user feel
&lt;/h3&gt;&lt;p&gt;If you care more about how quickly a long prompt gets processed, &lt;code&gt;pp512&lt;/code&gt; matters more. If you care more about how fast the model feels while answering, &lt;code&gt;tg128&lt;/code&gt; is usually closer to the real experience.&lt;/p&gt;
&lt;h2 id=&#34;a-more-practical-way-to-remember-all-this&#34;&gt;A more practical way to remember all this
&lt;/h2&gt;&lt;p&gt;If you want to compress all of these into one short memory aid, you can think of them like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q4_0&lt;/code&gt;: the model is compressed into a 4-bit quantized version&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FA&lt;/code&gt;: whether Flash Attention is enabled&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pp512&lt;/code&gt;: how fast the model processes a 512-token input&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tg128&lt;/code&gt;: how fast the model generates a 128-token output&lt;/li&gt;
&lt;li&gt;&lt;code&gt;t/s&lt;/code&gt;: speed unit, tokens per second&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once those five points are clear, it becomes much easier to judge what a given CUDA Scoreboard is actually measuring.&lt;/p&gt;
&lt;h2 id=&#34;closing&#34;&gt;Closing
&lt;/h2&gt;&lt;p&gt;GPU benchmark tables often look more complicated than they really are, not because the metrics themselves are mysterious, but because model identity, quantization, optimization flags, and different stages of throughput are all compressed into a few short abbreviations.&lt;/p&gt;
&lt;p&gt;Once you unpack terms like &lt;code&gt;FA&lt;/code&gt;, &lt;code&gt;Q4_0&lt;/code&gt;, &lt;code&gt;pp512&lt;/code&gt;, and &lt;code&gt;tg128&lt;/code&gt;, these benchmark tables become much easier to read.&lt;/p&gt;
&lt;p&gt;What matters is not just remembering a raw score, but knowing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;which model configuration the score came from&lt;/li&gt;
&lt;li&gt;whether key optimizations were enabled&lt;/li&gt;
&lt;li&gt;whether it measured input or output behavior&lt;/li&gt;
&lt;li&gt;whether it reflects compute throughput or something closer to actual generation feel&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That makes it much easier to judge what these results really mean.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>A Practical Guide to Common Tensor Formats in LLMs: FP32, FP16, BF16, TF32, and FP8</title>
        <link>https://knightli.com/en/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/</link>
        <pubDate>Wed, 22 Apr 2026 22:40:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/</guid>
        <description>&lt;p&gt;As soon as you start working with large-model training, inference, or deployment, you quickly run into a familiar set of abbreviations: &lt;code&gt;FP32&lt;/code&gt;, &lt;code&gt;FP16&lt;/code&gt;, &lt;code&gt;BF16&lt;/code&gt;, &lt;code&gt;TF32&lt;/code&gt;, and &lt;code&gt;FP8&lt;/code&gt;. They may look like small labels on a model page, but their impact is much bigger than a naming difference.&lt;/p&gt;
&lt;p&gt;These formats determine how numbers are stored in memory and represented during computation. They directly affect training stability, inference speed, and even how large a model a given GPU can realistically handle.&lt;/p&gt;
&lt;p&gt;So if you want to understand precision trade-offs in large models, one of the best places to start is not a benchmark chart for a specific model, but a clear picture of what these tensor formats are and why they were designed the way they are.&lt;/p&gt;
&lt;h2 id=&#34;what-tensor-formats-actually-determine&#34;&gt;What tensor formats actually determine
&lt;/h2&gt;&lt;p&gt;At its core, a large model is a massive set of matrix operations over huge numbers of parameters, and the tensor format is how those numbers are stored in memory and represented during computation.&lt;/p&gt;
&lt;p&gt;The trade-off usually revolves around three dimensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;precision&lt;/li&gt;
&lt;li&gt;VRAM usage&lt;/li&gt;
&lt;li&gt;compute speed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is actually a lot like image formats. Lossless formats preserve more detail, but take more space and load more slowly. Compressed formats discard information that is less noticeable to the eye in exchange for smaller size and faster handling. Large models can accept similar trade-offs because, across extremely large parameter sets, many tiny numerical changes do not significantly affect the final output.&lt;/p&gt;
&lt;p&gt;That is why the model world has developed a whole family of precision formats.&lt;/p&gt;
&lt;h2 id=&#34;how-a-number-is-represented&#34;&gt;How a number is represented
&lt;/h2&gt;&lt;p&gt;Before getting into the formats, it helps to remember one basic structure. A floating-point number is usually made of three parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;sign bit: determines positive or negative&lt;/li&gt;
&lt;li&gt;exponent bits: determine numerical range&lt;/li&gt;
&lt;li&gt;mantissa bits: determine numerical detail&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In large models, mantissa precision certainly matters, but many models are even more sensitive to insufficient numerical range, meaning too few exponent bits and a higher risk of overflow or unstable training. A lot of tensor format design is essentially about reallocating a limited number of bits between range and detail.&lt;/p&gt;
&lt;p&gt;The diagram below gives a quick overall view:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/tensor-format-overview.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Overview of the bit layouts of FP32, FP16, BF16, TF32, and FP8&#34;
	
	
&gt;&lt;/p&gt;
&lt;h2 id=&#34;fp32-the-most-stable-but-expensive&#34;&gt;FP32: the most stable, but expensive
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;FP32&lt;/code&gt; is the traditional single-precision floating-point format. It uses 32 bits in total, or 4 bytes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/fp32-layout.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;FP32 bit layout diagram&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;Its strengths are straightforward:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;wide numerical range&lt;/li&gt;
&lt;li&gt;high precision&lt;/li&gt;
&lt;li&gt;the most stable training behavior&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But the downside is just as clear: it consumes a lot of VRAM.&lt;/p&gt;
&lt;p&gt;A very rough estimate is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;VRAM usage ≈ parameter count × bytes per parameter
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If a 27B model stores weights entirely in &lt;code&gt;FP32&lt;/code&gt;, the weights alone take roughly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;27B × 4 bytes ≈ 108GB
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;And that still does not include activations, KV cache, optimizer state, or other runtime overhead. So in modern large-model training and inference, &lt;code&gt;FP32&lt;/code&gt; is no longer the default so much as the most stable baseline format.&lt;/p&gt;
&lt;h2 id=&#34;fp16-half-the-size-but-less-stable&#34;&gt;FP16: half the size, but less stable
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;FP16&lt;/code&gt; compresses each parameter to 2 bytes, cutting memory usage roughly in half compared with &lt;code&gt;FP32&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/fp16-layout.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;FP16 bit layout diagram&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;For the same 27B model, if you only look at weight size:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;27B × 2 bytes ≈ 54GB
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;That already explains why many deployment guides place a 27B model around the 50GB VRAM range.&lt;/p&gt;
&lt;p&gt;The advantages of &lt;code&gt;FP16&lt;/code&gt; are obvious:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;much lower VRAM pressure&lt;/li&gt;
&lt;li&gt;higher throughput&lt;/li&gt;
&lt;li&gt;widely used in early mixed-precision training&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Its weakness is the relatively small exponent range. In large-model training, that makes overflow more likely and often requires extra techniques such as loss scaling, which adds engineering complexity.&lt;/p&gt;
&lt;p&gt;So &lt;code&gt;FP16&lt;/code&gt; is still common, but in many scenarios it is no longer the most comfortable option.&lt;/p&gt;
&lt;h2 id=&#34;bf16-a-more-practical-half-precision-for-the-large-model-era&#34;&gt;BF16: a more practical half precision for the large-model era
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;BF16&lt;/code&gt; also uses 2 bytes, but it makes a different trade-off from &lt;code&gt;FP16&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/bf16-layout.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;BF16 bit layout diagram&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;It keeps a much larger exponent range, making its dynamic range closer to &lt;code&gt;FP32&lt;/code&gt;, while giving up some mantissa precision. That trade-off works especially well for large models, because they are often more sensitive to range than to losing a few mantissa bits.&lt;/p&gt;
&lt;p&gt;That is why many training frameworks, many large-model papers, and many real deployment setups prefer &lt;code&gt;BF16&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A simple way to think about it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VRAM cost close to &lt;code&gt;FP16&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;stability closer to &lt;code&gt;FP32&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If one 27B deployment guide asks for roughly 50GB of VRAM while another optimized one gets closer to 30GB, the former often still lives in the &lt;code&gt;FP16/BF16&lt;/code&gt; layer, while the latter has usually moved further toward lower precision or quantization.&lt;/p&gt;
&lt;h2 id=&#34;tf32-not-about-saving-vram-but-about-accelerating-fp32-workflows&#34;&gt;TF32: not about saving VRAM, but about accelerating FP32 workflows
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;TF32&lt;/code&gt; is easy to mistake for yet another memory-saving format, but its role is different.&lt;/p&gt;
&lt;p&gt;In common terms, you can roughly think of it as a computation format that keeps a large exponent range while shortening mantissa precision.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/tf32-layout.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;TF32 computation format diagram&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;But it is important to note that &lt;code&gt;TF32&lt;/code&gt; is more like an internal computation format used on the Tensor Core path, rather than something primarily used to store weights like &lt;code&gt;FP16&lt;/code&gt; or &lt;code&gt;BF16&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is mainly a computation mode NVIDIA provides on newer GPUs. The goal is not to reduce VRAM usage, but to make originally &lt;code&gt;FP32&lt;/code&gt;-based training workflows run faster without requiring major code changes.&lt;/p&gt;
&lt;p&gt;Its role can be summarized in one sentence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;externally it still looks like an &lt;code&gt;FP32&lt;/code&gt; workflow&lt;/li&gt;
&lt;li&gt;internally it performs faster approximate matrix math&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So &lt;code&gt;TF32&lt;/code&gt; mainly solves the problem that &lt;code&gt;FP32&lt;/code&gt; is too slow, not that &lt;code&gt;FP32&lt;/code&gt; uses too much memory. If your question is why the same model can have very different VRAM requirements, &lt;code&gt;TF32&lt;/code&gt; is not the main answer.&lt;/p&gt;
&lt;h2 id=&#34;fp8-further-compression-but-much-more-demanding-engineering&#34;&gt;FP8: further compression, but much more demanding engineering
&lt;/h2&gt;&lt;p&gt;Going one step further leads to &lt;code&gt;FP8&lt;/code&gt;. It compresses each value into even fewer bits, reducing memory bandwidth and storage cost even more.&lt;/p&gt;
&lt;p&gt;It usually appears not as one single format, but as two common variants: &lt;code&gt;E4M3&lt;/code&gt; and &lt;code&gt;E5M2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/fp8-layout.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;FP8 variant diagram&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;But &lt;code&gt;FP8&lt;/code&gt; comes with an obvious cost: once the bit count gets that low, it becomes very hard to preserve both range and precision at the same time. In practice, different variants are often used for different stages to balance forward passes, backward passes, and gradients.&lt;/p&gt;
&lt;p&gt;This format family represents a more aggressive strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;give up more precision&lt;/li&gt;
&lt;li&gt;gain lower storage cost and higher throughput&lt;/li&gt;
&lt;li&gt;rely on more mature hardware and frameworks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It has a lot of potential, but for most users, the main practical dividing lines are still &lt;code&gt;FP32&lt;/code&gt;, &lt;code&gt;FP16&lt;/code&gt;, and &lt;code&gt;BF16&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;why-understanding-these-formats-matters&#34;&gt;Why understanding these formats matters
&lt;/h2&gt;&lt;p&gt;Many people first treat these abbreviations as implementation details on a download page. In practice, though, they change how you think about both training and deployment.&lt;/p&gt;
&lt;p&gt;For example, they help explain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;why some training setups care so much about numerical stability&lt;/li&gt;
&lt;li&gt;why some inference stacks emphasize quantization and low precision first&lt;/li&gt;
&lt;li&gt;why models with similar parameter counts can still have very different deployment requirements&lt;/li&gt;
&lt;li&gt;why some formats are better for storing weights while others make more sense as compute paths&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you keep unpacking those questions, they usually lead back to the same issue: how you choose to trade off precision, range, memory, and speed.&lt;/p&gt;
&lt;p&gt;That is why understanding &lt;code&gt;FP32&lt;/code&gt;, &lt;code&gt;FP16&lt;/code&gt;, &lt;code&gt;BF16&lt;/code&gt;, &lt;code&gt;TF32&lt;/code&gt;, and &lt;code&gt;FP8&lt;/code&gt; is not just about decoding a glossary. It is about understanding what is really being exchanged when you read a training config, choose an inference engine, or compare deployment options.&lt;/p&gt;
&lt;h2 id=&#34;a-practical-mental-model&#34;&gt;A practical mental model
&lt;/h2&gt;&lt;p&gt;If you do not want to memorize all the details right away, it helps to remember them in this order:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;FP32&lt;/code&gt;: most stable, most expensive&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FP16&lt;/code&gt;: lower VRAM use, but smaller range&lt;/li&gt;
&lt;li&gt;&lt;code&gt;BF16&lt;/code&gt;: similar VRAM cost to &lt;code&gt;FP16&lt;/code&gt;, but more suitable stability for large models&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TF32&lt;/code&gt;: mainly solves slow &lt;code&gt;FP32&lt;/code&gt;, not VRAM usage&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FP8&lt;/code&gt;: a more aggressive compression and acceleration route&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/22/common-tensor-formats-fp32-fp16-bf16-tf32-fp8/tensor-format-summary.svg&#34;
	
	
	
	loading=&#34;lazy&#34;
	
		alt=&#34;Summary chart of common tensor formats&#34;
	
	
&gt;&lt;/p&gt;
&lt;p&gt;After that, when you see &lt;code&gt;fp16&lt;/code&gt;, &lt;code&gt;bf16&lt;/code&gt;, or &lt;code&gt;fp8&lt;/code&gt; on a model download page, or when different deployment guides give wildly different VRAM thresholds, it no longer looks like a difference in wording. Those labels reflect very different precision budgets and engineering choices.&lt;/p&gt;
&lt;h2 id=&#34;closing&#34;&gt;Closing
&lt;/h2&gt;&lt;p&gt;Tensor formats in large models may look like a discussion about bit widths, but underneath they are really a discussion about engineering trade-offs.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;FP32&lt;/code&gt;, &lt;code&gt;FP16&lt;/code&gt;, &lt;code&gt;BF16&lt;/code&gt;, &lt;code&gt;TF32&lt;/code&gt;, and &lt;code&gt;FP8&lt;/code&gt; are not simply better or worse than one another. Each one sits at a different point on the trade-off curve between stability, range, precision, memory, and speed.&lt;/p&gt;
&lt;p&gt;Once you understand that layer clearly, it becomes much easier to read training papers, tune inference settings, and compare deployment strategies with the right mental model.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>A 16GB GPU Can Still Run 35B Models: VRAM Compression Strategies for MoE Models in LM Studio</title>
        <link>https://knightli.com/en/2026/04/22/16gb-gpu-run-35b-moe-models-in-lm-studio/</link>
        <pubDate>Wed, 22 Apr 2026 21:47:34 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/22/16gb-gpu-run-35b-moe-models-in-lm-studio/</guid>
        <description>&lt;p&gt;Many people think of 16GB VRAM as the point where local LLM deployment more or less tops out at 12B to 14B models, and anything larger becomes too painful even with quantization. That view is understandable, but it is not the true ceiling of a 16GB GPU.&lt;/p&gt;
&lt;p&gt;If your model choice and parameter setup are good enough, a 16GB GPU does not have to stay limited to “small-parameter” models. One representative approach is to use &lt;code&gt;MoE&lt;/code&gt; models inside &lt;code&gt;LM Studio&lt;/code&gt; with a sensible unloading strategy, so that 35B-class models can still run at a genuinely usable speed.&lt;/p&gt;
&lt;h2 id=&#34;01-why-a-16gb-gpu-is-not-necessarily-limited-to-12b-to-14b&#34;&gt;01 Why a 16GB GPU is not necessarily limited to 12B to 14B
&lt;/h2&gt;&lt;p&gt;The core idea is straightforward: VRAM size matters, but model architecture matters just as much.&lt;/p&gt;
&lt;p&gt;If you try to cram a standard dense model into a 16GB GPU, you will hit the wall quickly. These models usually involve all parameters during inference, so VRAM pressure and bandwidth pressure rise immediately.&lt;/p&gt;
&lt;p&gt;But &lt;code&gt;MoE&lt;/code&gt; models are different. Their total parameter count can be large, while only part of the expert parameters are activated in a single inference step. Take a 35B-class model as an example: although the total parameter count is high, the actual number of parameters participating in each inference step is much smaller, so its real VRAM requirement is not as extreme as many people assume.&lt;/p&gt;
&lt;p&gt;That is exactly why a 16GB GPU still leaves some room to work with.&lt;/p&gt;
&lt;h2 id=&#34;02-key-practical-takeaway-35b-moe-models-can-run-surprisingly-fast&#34;&gt;02 Key practical takeaway: 35B MoE models can run surprisingly fast
&lt;/h2&gt;&lt;p&gt;One representative case is a quantized &lt;code&gt;MoE&lt;/code&gt; model such as &lt;code&gt;Qwen 3.5 35B A3B&lt;/code&gt;. With a 16GB GPU and the right settings in &lt;code&gt;LM Studio&lt;/code&gt;, &lt;code&gt;Q6&lt;/code&gt; quantization can reach something above 30 &lt;code&gt;tokens/s&lt;/code&gt;, and &lt;code&gt;Q4&lt;/code&gt; can sometimes test even higher.&lt;/p&gt;
&lt;p&gt;That result matters not just because the model “runs,” but because the speed is already in a clearly usable range.&lt;/p&gt;
&lt;p&gt;As a comparison, large models of a similar scale that are not &lt;code&gt;MoE&lt;/code&gt; often run into VRAM overflow and sharply lower speed on a 16GB GPU. In other words, the outcome is not determined by parameter count alone. What matters is how those parameters are actually used during inference.&lt;/p&gt;
&lt;h2 id=&#34;03-in-lm-studio-the-key-is-not-just-one-parameter&#34;&gt;03 In LM Studio, the key is not just one parameter
&lt;/h2&gt;&lt;p&gt;If you want this kind of model to run smoothly on a 16GB GPU, the real trick is not luck. It is tuning two parameters correctly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;GPU Offload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the setting that forces part of the expert layers into CPU memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first one is easy to understand. &lt;code&gt;GPU Offload&lt;/code&gt; is basically something you push as high as possible, so the model prioritizes GPU computation.&lt;/p&gt;
&lt;p&gt;The second one is the real key here. It is not the traditional “borrow system memory after VRAM overflows” approach. Instead, it proactively places part of the expert layers into CPU memory to reduce VRAM usage in advance. Since &lt;code&gt;MoE&lt;/code&gt; models do not activate every expert on every step anyway, moving some experts into memory does not hurt overall inference speed as much as many people would expect.&lt;/p&gt;
&lt;p&gt;A safer way to tune it is to start within a range and then adjust gradually for your machine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;start with related values somewhere between &lt;code&gt;20&lt;/code&gt; and &lt;code&gt;35&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;then fine-tune based on VRAM usage and memory pressure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At its core, this method is using system memory to buy back VRAM headroom.&lt;/p&gt;
&lt;h2 id=&#34;04-it-can-still-run-at-128k-context-and-smaller-contexts-reduce-vram-further&#34;&gt;04 It can still run at 128K context, and smaller contexts reduce VRAM further
&lt;/h2&gt;&lt;p&gt;Another interesting point is that even with the context length pushed to &lt;code&gt;128K&lt;/code&gt;, a 35B-class &lt;code&gt;MoE&lt;/code&gt; model can still maintain a relatively high speed.&lt;/p&gt;
&lt;p&gt;That tells us something important: the bottleneck of a 16GB GPU is not as rigid as many people imagine. Especially inside a local inference tool like &lt;code&gt;LM Studio&lt;/code&gt;, the real question is often not simply “can it run or not,” but rather:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;are you willing to trade more system memory for less VRAM usage&lt;/li&gt;
&lt;li&gt;are you willing to shorten the context length&lt;/li&gt;
&lt;li&gt;are you willing to accept different capability tradeoffs across quantization levels&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the context is reduced further from &lt;code&gt;128K&lt;/code&gt; to &lt;code&gt;64K&lt;/code&gt; or &lt;code&gt;32K&lt;/code&gt;, VRAM pressure can drop even more. That means some 35B-class &lt;code&gt;MoE&lt;/code&gt; models may even run, barely, on GPUs with less VRAM, though speed and memory pressure will need to be rebalanced.&lt;/p&gt;
&lt;h2 id=&#34;05-the-cost-of-this-approach-much-higher-demands-on-ram-and-virtual-memory&#34;&gt;05 The cost of this approach: much higher demands on RAM and virtual memory
&lt;/h2&gt;&lt;p&gt;This kind of setup is not free performance.&lt;/p&gt;
&lt;p&gt;What you need to watch is that once VRAM pressure is compressed further, system RAM usage rises noticeably, and virtual memory pressure rises too. In other words, you are not removing the cost. You are shifting pressure from the GPU to RAM and disk swap space.&lt;/p&gt;
&lt;p&gt;So if you want to try it yourself, it is worth checking a few things first:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether your system RAM is large enough&lt;/li&gt;
&lt;li&gt;whether your virtual memory allocation is large enough&lt;/li&gt;
&lt;li&gt;whether too many background applications are already consuming resources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If those conditions are not in place, what you may get is not “35B running fast,” but an overall machine that becomes slow everywhere.&lt;/p&gt;
&lt;h2 id=&#34;06-more-aggressive-quantization-is-not-always-better&#34;&gt;06 More aggressive quantization is not always better
&lt;/h2&gt;&lt;p&gt;There is another practical tradeoff here. Lower-bit quantization often saves more VRAM, but that does not automatically make it the best choice.&lt;/p&gt;
&lt;p&gt;The practical takeaway is that some models do run faster under &lt;code&gt;Q4&lt;/code&gt;, but their original capability can also degrade more. By comparison, &lt;code&gt;Q6&lt;/code&gt; tends to strike a better balance between speed and capability retention. So the right choice depends on what you care about more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;maximum speed and fitting into VRAM&lt;/li&gt;
&lt;li&gt;or preserving more of the model’s original capability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those two priorities do not necessarily lead to the same quantization choice.&lt;/p&gt;
&lt;h2 id=&#34;07-what-kinds-of-models-are-worth-trying&#34;&gt;07 What kinds of models are worth trying
&lt;/h2&gt;&lt;p&gt;From this angle, the best thing to try is not “blindly chase bigger parameter counts,” but to first look for models that fit this strategy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;models built on &lt;code&gt;MoE&lt;/code&gt; architecture&lt;/li&gt;
&lt;li&gt;models that are well supported in &lt;code&gt;LM Studio&lt;/code&gt; and have complete quantized variants&lt;/li&gt;
&lt;li&gt;models with clear advantages in long context or instruction following&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And the idea does not stop at one 35B &lt;code&gt;MoE&lt;/code&gt; model. It also extends naturally to other directions, such as experimental models with stronger long-context memory, better instruction following, or lighter quantized versions with strong speed performance.&lt;/p&gt;
&lt;p&gt;The logic behind this is very consistent: first find models whose architecture fits the “trade memory for VRAM” strategy, and then talk about tuning. Do not start from parameter count alone and decide from there.&lt;/p&gt;
&lt;h2 id=&#34;08-short-conclusion&#34;&gt;08 Short conclusion
&lt;/h2&gt;&lt;p&gt;If you happen to have a 16GB GPU and assume local LLMs stop at 12B to 14B, that assumption is worth updating.&lt;/p&gt;
&lt;p&gt;A more accurate way to put it is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a 16GB GPU is not automatically ruled out for larger models&lt;/li&gt;
&lt;li&gt;dense models and &lt;code&gt;MoE&lt;/code&gt; models need to be considered separately&lt;/li&gt;
&lt;li&gt;&lt;code&gt;GPU Offload&lt;/code&gt; and expert-layer transfer to CPU memory inside &lt;code&gt;LM Studio&lt;/code&gt; can significantly change VRAM usage&lt;/li&gt;
&lt;li&gt;in practice, you are trading higher memory pressure for larger model scale and better usable speed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach will not fit every machine, but it does show one important thing: in local LLM deployment, VRAM is not the only limit. Model architecture and inference configuration matter just as much.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>12V-2x6 vs. 12VHPWR: Notes on GPU 16-Pin Power Connector Differences</title>
        <link>https://knightli.com/en/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/</link>
        <pubDate>Sun, 19 Apr 2026 23:21:17 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/</guid>
        <description>&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/4.png&#34;
	width=&#34;1516&#34;
	height=&#34;774&#34;
	srcset=&#34;https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/4_hu_defcdc0fe696070d.png 480w, https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/4_hu_d75571d9af707f1a.png 1024w&#34;
	loading=&#34;lazy&#34;
	
		alt=&#34;Comparison&#34;
	
	
		class=&#34;gallery-image&#34; 
		data-flex-grow=&#34;195&#34;
		data-flex-basis=&#34;470px&#34;
	
&gt;&lt;/p&gt;
&lt;p&gt;Among recent high-end GPUs, the power connector that gets discussed most often is probably &lt;code&gt;12VHPWR&lt;/code&gt; and the newer &lt;code&gt;12V-2x6&lt;/code&gt;. Both look like 16-pin connectors, using a &lt;code&gt;12 + 4&lt;/code&gt; layout, but they are not exactly the same interface.&lt;/p&gt;
&lt;p&gt;In simple terms, &lt;code&gt;12V-2x6&lt;/code&gt; can be understood as a revision of the earlier &lt;code&gt;12VHPWR&lt;/code&gt; design under &lt;code&gt;ATX 3.1&lt;/code&gt; and &lt;code&gt;PCIe CEM 5.1&lt;/code&gt;. It keeps the high-power output capability, but uses a more conservative design for insertion detection and terminal structure. The goal is to reduce the risk of the connector continuing to carry load when it is not fully seated.&lt;/p&gt;
&lt;h2 id=&#34;01-cable-differences-are-small&#34;&gt;01 Cable Differences Are Small
&lt;/h2&gt;&lt;p&gt;The first question many people care about is whether &lt;code&gt;12V-2x6&lt;/code&gt; and &lt;code&gt;12VHPWR&lt;/code&gt; modular cables can be used interchangeably.&lt;/p&gt;
&lt;p&gt;Looking only at the cable itself, the difference is usually not large. The real change is mainly on the board-side connector, such as the GPU socket or the modular power-supply backplate socket. Both newer &lt;code&gt;12V-2x6&lt;/code&gt; modular cables and older &lt;code&gt;12VHPWR&lt;/code&gt; modular cables are still intended for 16-pin GPU power delivery.&lt;/p&gt;
&lt;p&gt;So compatibility should not be judged only by cable length, wire gauge, or appearance. The GPU-side and PSU-side socket specification, terminal quality, and the power-supply vendor&amp;rsquo;s official compatibility statement matter more.&lt;/p&gt;
&lt;h2 id=&#34;02-key-mechanical-changes&#34;&gt;02 Key Mechanical Changes
&lt;/h2&gt;&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/1.png&#34;
	width=&#34;1469&#34;
	height=&#34;505&#34;
	srcset=&#34;https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/1_hu_f49f1d41f41658ca.png 480w, https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/1_hu_9451e0053960c4aa.png 1024w&#34;
	loading=&#34;lazy&#34;
	
		alt=&#34;Connector comparison&#34;
	
	
		class=&#34;gallery-image&#34; 
		data-flex-grow=&#34;290&#34;
		data-flex-basis=&#34;698px&#34;
	
&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/2.png&#34;
	width=&#34;1390&#34;
	height=&#34;743&#34;
	srcset=&#34;https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/2_hu_b338f61621a3f769.png 480w, https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/2_hu_dee9d88d14eab23d.png 1024w&#34;
	loading=&#34;lazy&#34;
	
		alt=&#34;Connector comparison&#34;
	
	
		class=&#34;gallery-image&#34; 
		data-flex-grow=&#34;187&#34;
		data-flex-basis=&#34;448px&#34;
	
&gt;&lt;/p&gt;
&lt;p&gt;The point of &lt;code&gt;12V-2x6&lt;/code&gt; is not to completely change the outer shape of the connector, but to adjust the pin structure.&lt;/p&gt;
&lt;p&gt;Its 12 main power pins are longer and make contact earlier, while the 4 SENSE signal pins are shorter and make contact later. The logic is straightforward: only when the connector is inserted deeply enough should the SENSE pins conduct correctly, allowing the GPU to identify the intended power capability.&lt;/p&gt;
&lt;p&gt;This change targets a typical problem exposed by early &lt;code&gt;12VHPWR&lt;/code&gt; connectors: the plug may look inserted, but may not actually be fully seated. Under high load, insufficient contact can generate heat, and in severe cases may burn the plug or socket.&lt;/p&gt;
&lt;h2 id=&#34;03-more-conservative-sense-logic&#34;&gt;03 More Conservative SENSE Logic
&lt;/h2&gt;&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;SENSE0&lt;/th&gt;
          &lt;th&gt;SENSE1&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Initial Power (Power Up)&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;Max Sustained Power&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Ground&lt;/td&gt;
          &lt;td&gt;Ground&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;375 W&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;600 W&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Open&lt;/td&gt;
          &lt;td&gt;Ground&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;225 W&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;450 W&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Ground&lt;/td&gt;
          &lt;td&gt;Open&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;150 W&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;300 W&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Short&lt;/td&gt;
          &lt;td&gt;Short&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;100 W&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;150 W&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Open&lt;/td&gt;
          &lt;td&gt;Open&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;0 W&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;0 W&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The safety improvement in &lt;code&gt;12V-2x6&lt;/code&gt; centers on the SENSE logic.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/3.png&#34;
	width=&#34;1413&#34;
	height=&#34;594&#34;
	srcset=&#34;https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/3_hu_da90f28c8ff5ee55.png 480w, https://knightli.com/2026/04/19/12v-2x6-vs-12vhpwr-gpu-power-connector-notes/3_hu_d32cc33cf944e495.png 1024w&#34;
	loading=&#34;lazy&#34;
	
		alt=&#34;Powered only when fully inserted&#34;
	
	
		class=&#34;gallery-image&#34; 
		data-flex-grow=&#34;237&#34;
		data-flex-basis=&#34;570px&#34;
	
&gt;&lt;/p&gt;
&lt;p&gt;In the newer definition, if &lt;code&gt;SENSE0&lt;/code&gt; and &lt;code&gt;SENSE1&lt;/code&gt; are in the &lt;code&gt;Open&lt;/code&gt; floating state, the GPU will not power up normally or will not enter the corresponding high-power input state. In other words, when the connector is not seated properly, the system is more inclined to prevent operation instead of letting the GPU keep drawing power.&lt;/p&gt;
&lt;p&gt;This is more conservative than early &lt;code&gt;12VHPWR&lt;/code&gt;. In older designs, even if the SENSE state was not ideal, some cases could still allow a certain level of power input. For high-power GPUs, that tolerance can become a risk.&lt;/p&gt;
&lt;p&gt;Shortening the SENSE pins is essentially a way to make &amp;ldquo;fully inserted&amp;rdquo; a stricter prerequisite.&lt;/p&gt;
&lt;h2 id=&#34;04-what-h-means&#34;&gt;04 What H++ Means
&lt;/h2&gt;&lt;p&gt;Newer &lt;code&gt;12V-2x6&lt;/code&gt; connectors often carry an &lt;code&gt;H++&lt;/code&gt; mark. It indicates that the connector terminals support &lt;code&gt;9.2A&lt;/code&gt; or higher current capability, distinguishing them from earlier &lt;code&gt;12VHPWR&lt;/code&gt; connectors marked &lt;code&gt;H+&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It is worth noting that &lt;code&gt;H++&lt;/code&gt; does not mean the connector&amp;rsquo;s power limit rises beyond 600W. Whether new or old, the common upper limit for this 16-pin GPU power scheme is still &lt;code&gt;600W&lt;/code&gt;. &lt;code&gt;H++&lt;/code&gt; is better understood as terminal-specification and connector-version identification, not simply &amp;ldquo;higher wattage.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;05-what-it-means-for-pc-building&#34;&gt;05 What It Means for PC Building
&lt;/h2&gt;&lt;p&gt;For everyday PC building, the biggest value of &lt;code&gt;12V-2x6&lt;/code&gt; is reducing insertion-related risk, but it is not a magic shield.&lt;/p&gt;
&lt;p&gt;When using this kind of connector, it is still worth paying attention to a few things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fully insert the plug; do not rely only on whether it &amp;ldquo;looks inserted.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;Avoid bending the cable sharply right next to the GPU connector.&lt;/li&gt;
&lt;li&gt;Do not let the side panel force pressure onto the cable.&lt;/li&gt;
&lt;li&gt;Prefer original, custom, or adapter cables explicitly supported by the PSU or GPU vendor.&lt;/li&gt;
&lt;li&gt;Avoid cheap adapters of unknown origin on high-power GPUs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the case is tight, a 90-degree L-shaped cable or vendor-certified custom cable can reduce bending pressure. Still, terminal quality, wire gauge, and vendor certification matter more than appearance.&lt;/p&gt;
&lt;h2 id=&#34;06-quick-summary&#34;&gt;06 Quick Summary
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;12V-2x6&lt;/code&gt; is not a connector that is &amp;ldquo;basically the same as 12VHPWR because it looks the same.&amp;rdquo; Its real changes are inside the connector structure and detection logic.&lt;/p&gt;
&lt;p&gt;You can think of it this way:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The cable form is similar, but board-side connector and terminal design are more important.&lt;/li&gt;
&lt;li&gt;The main power pins are longer, while the SENSE pins are shorter.&lt;/li&gt;
&lt;li&gt;When the connector is not fully seated, the newer design is more likely to prevent the GPU from entering a working state.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;H++&lt;/code&gt; mark identifies terminals with higher current capability.&lt;/li&gt;
&lt;li&gt;The common GPU power limit is still &lt;code&gt;600W&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are building a system with a high-power GPU, &lt;code&gt;12V-2x6&lt;/code&gt; is indeed more reassuring than early &lt;code&gt;12VHPWR&lt;/code&gt;. But the final safety still depends on whether the plug is fully seated, cable quality, PSU design, and case cable-management space. A better connector standard does not make careless installation safe.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>Ollama Multi-GPU Notes: VRAM Pooling, GPU Selection, and Common Misunderstandings</title>
        <link>https://knightli.com/en/2026/04/19/ollama-multiple-gpu-notes/</link>
        <pubDate>Sun, 19 Apr 2026 00:18:00 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/19/ollama-multiple-gpu-notes/</guid>
        <description>&lt;p&gt;When running local inference with Ollama, a few questions come up quickly: if I already have one GPU and my motherboard still has empty PCIe slots, does adding more GPUs help? Do the GPUs need to be identical? Can VRAM be combined? Will it accelerate inference like a multi-GPU training framework?&lt;/p&gt;
&lt;p&gt;This note summarizes how Ollama behaves with multiple GPUs. The short version:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ollama supports multiple GPUs.&lt;/li&gt;
&lt;li&gt;The main value of multiple GPUs is usually fitting larger models into available VRAM, not getting linear token/s scaling.&lt;/li&gt;
&lt;li&gt;By default, if a model fits entirely on one GPU, Ollama tends to load it on a single GPU.&lt;/li&gt;
&lt;li&gt;If a model does not fit on one GPU, Ollama can spread it across available GPUs.&lt;/li&gt;
&lt;li&gt;Mixed GPU models may be visible to Ollama, but performance and placement may not be ideal.&lt;/li&gt;
&lt;li&gt;SLI / NVLink is not required for multi-GPU use.&lt;/li&gt;
&lt;li&gt;To limit which GPUs Ollama can use, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt;, &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt;, or &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;official-behavior-single-gpu-first-multi-gpu-when-needed&#34;&gt;Official Behavior: Single GPU First, Multi-GPU When Needed
&lt;/h2&gt;&lt;p&gt;Ollama&amp;rsquo;s FAQ describes the multi-GPU loading logic directly: when loading a new model, Ollama estimates the required VRAM and compares it with currently available GPU memory. If the model can fit entirely on one GPU, it loads the model onto that GPU. If it cannot fit on a single GPU, the model is spread across all available GPUs.&lt;/p&gt;
&lt;p&gt;The reason is performance. Keeping a model on one GPU usually reduces data transfers across the PCIe bus during inference, so it is often faster.&lt;/p&gt;
&lt;p&gt;So do not think of Ollama multi-GPU as &amp;ldquo;more cards automatically means several times faster.&amp;rdquo; A more accurate model is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Small model fits on one GPU: usually runs on one GPU.&lt;/li&gt;
&lt;li&gt;Large model does not fit on one GPU: split across multiple GPUs.&lt;/li&gt;
&lt;li&gt;Still not enough VRAM: part of the model falls back to system memory, and speed drops noticeably.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use this command to see where the model is loaded:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The &lt;code&gt;PROCESSOR&lt;/code&gt; column may show something like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;100% GPU
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;48%/52% CPU/GPU
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;100% CPU
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you see &lt;code&gt;48%/52% CPU/GPU&lt;/code&gt;, part of the model is already in system memory. In that case, adding more GPU memory or using a larger-VRAM GPU is usually more useful than continuing to rely on CPU/RAM.&lt;/p&gt;
&lt;h2 id=&#34;multi-gpu-is-not-simple-compute-stacking&#34;&gt;Multi-GPU Is Not Simple Compute Stacking
&lt;/h2&gt;&lt;p&gt;Local LLM inference is not the same as SLI in games. With Ollama on multiple GPUs, the common pattern is that different layers or tensors are placed on different devices. This can make a larger model fit into the combined available VRAM, but data may still need to move between devices during inference.&lt;/p&gt;
&lt;p&gt;So multi-GPU benefits usually fall into two categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VRAM benefit: larger models fit more easily, or less of the model falls back to CPU/RAM.&lt;/li&gt;
&lt;li&gt;Performance benefit: usually most obvious when a model would otherwise not fit on one GPU or would heavily spill to CPU.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If an 8B or 14B model already fits entirely on a single RTX 3090, forcing it across two GPUs may not be faster. It may even slow down due to cross-GPU transfer overhead. Ollama&amp;rsquo;s default &amp;ldquo;use one GPU when it fits&amp;rdquo; strategy avoids that unnecessary PCIe cost.&lt;/p&gt;
&lt;h2 id=&#34;sli-or-nvlink-is-not-required&#34;&gt;SLI or NVLink Is Not Required
&lt;/h2&gt;&lt;p&gt;Ollama multi-GPU does not depend on SLI. Multiple normal PCIe GPUs can be scheduled as long as the driver and Ollama can detect them.&lt;/p&gt;
&lt;p&gt;NVLink or higher PCIe bandwidth may help in some cross-GPU scenarios, but it is not a requirement. Many used GPU servers and workstations can run multiple GPUs over ordinary PCIe.&lt;/p&gt;
&lt;p&gt;What you should pay attention to is PCIe bandwidth. The difference between &lt;code&gt;x1&lt;/code&gt;, &lt;code&gt;x4&lt;/code&gt;, &lt;code&gt;x8&lt;/code&gt;, and &lt;code&gt;x16&lt;/code&gt; affects how quickly a model is loaded into VRAM. If you frequently switch large models, PCIe bandwidth becomes more important. After a model is loaded, PCIe usually matters less during generation, but cross-GPU splitting can still add overhead.&lt;/p&gt;
&lt;p&gt;Safer rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prefer x16 / x8 over mining-style x1 risers.&lt;/li&gt;
&lt;li&gt;PCIe bandwidth matters more when switching large models frequently.&lt;/li&gt;
&lt;li&gt;If a model stays resident in VRAM for a long time, PCIe bandwidth is less visible.&lt;/li&gt;
&lt;li&gt;For multi-GPU machines, check motherboard PCIe topology and CPU-attached lanes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;limit-which-nvidia-gpus-ollama-uses&#34;&gt;Limit Which NVIDIA GPUs Ollama Uses
&lt;/h2&gt;&lt;p&gt;On NVIDIA multi-GPU systems, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; to control which GPUs Ollama can see.&lt;/p&gt;
&lt;p&gt;Temporary run:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Use only the second GPU:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Force Ollama not to use NVIDIA GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The official docs note that numeric IDs may change order, so GPU UUIDs are more reliable. Check UUIDs first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi -L
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Example output:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;GPU 1: NVIDIA GeForce RTX 3070 (UUID: GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Then specify the UUID:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If Ollama is installed as a Linux systemd service, put the variable into the service environment:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl edit ollama.service
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Add:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-ini&#34; data-lang=&#34;ini&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;[Service]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;CUDA_VISIBLE_DEVICES=0,1&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Reload and restart:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl daemon-reload
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl restart ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;amd-and-vulkan-device-selection&#34;&gt;AMD and Vulkan Device Selection
&lt;/h2&gt;&lt;p&gt;For AMD ROCm, use &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; to control visible GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;ROCR_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;To force Ollama not to use ROCm GPUs, use an invalid ID:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;ROCR_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Ollama&amp;rsquo;s GPU docs also mention experimental Vulkan support. For Vulkan GPUs, use &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OLLAMA_VULKAN&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; &lt;span class=&#34;nv&#34;&gt;GGML_VK_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;0&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If Vulkan devices cause problems, disable them:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;GGML_VK_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;-1 ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;AMD multi-GPU setups are more likely to run into driver, ROCm version, and GFX version compatibility issues. The official docs also mention Linux ROCm driver requirements and compatibility overrides such as &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt;. If you mix different generations of AMD GPUs, first verify that each card works on its own before trying multi-GPU.&lt;/p&gt;
&lt;h2 id=&#34;exposing-multiple-gpus-in-docker&#34;&gt;Exposing Multiple GPUs in Docker
&lt;/h2&gt;&lt;p&gt;If you run Ollama in Docker, NVIDIA setups usually require &lt;code&gt;nvidia-container-toolkit&lt;/code&gt;, then &lt;code&gt;--gpus&lt;/code&gt; to expose devices.&lt;/p&gt;
&lt;p&gt;Expose all GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Expose specific GPUs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus &lt;span class=&#34;s1&#34;&gt;&amp;#39;&amp;#34;device=0,1&amp;#34;&amp;#39;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;You can also combine this with environment variables:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker run -d &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --gpus&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;all &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -e &lt;span class=&#34;nv&#34;&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;0,1 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -v ollama:/root/.ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p 11434:11434 &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  --name ollama &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  ollama/ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If &lt;code&gt;nvidia-smi&lt;/code&gt; cannot see GPUs inside the container, Ollama cannot use them either. Troubleshoot Docker GPU passthrough first, then Ollama.&lt;/p&gt;
&lt;h2 id=&#34;what-is-ollama_sched_spread&#34;&gt;What Is &lt;code&gt;OLLAMA_SCHED_SPREAD&lt;/code&gt;
&lt;/h2&gt;&lt;p&gt;In some multi-GPU configuration discussions, you may see &lt;code&gt;OLLAMA_SCHED_SPREAD=1&lt;/code&gt; or &lt;code&gt;OLLAMA_SCHED_SPREAD=true&lt;/code&gt;. It is related to Ollama&amp;rsquo;s scheduler and is often used when people want models or requests to be spread more broadly across GPUs.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;OLLAMA_SCHED_SPREAD&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; ollama serve
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Or with systemd:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-ini&#34; data-lang=&#34;ini&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;[Service]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Environment&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;OLLAMA_SCHED_SPREAD=true&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;But it is not a magic switch. Enabling it does not imply linear token/s scaling, and it may still run into OOM when multiple models are loaded, VRAM estimates are tight, context length grows, or the KV cache expands. The core FAQ behavior still applies: if one GPU can fully hold the model, one GPU is usually more efficient; if one GPU cannot hold it, then multi-GPU splitting becomes useful.&lt;/p&gt;
&lt;p&gt;Treat &lt;code&gt;OLLAMA_SCHED_SPREAD&lt;/code&gt; as an advanced scheduling experiment, not a required multi-GPU setting. Understand the default behavior first, then adjust based on &lt;code&gt;ollama ps&lt;/code&gt;, logs, and &lt;code&gt;nvidia-smi&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;how-to-check-whether-multiple-gpus-are-being-used&#34;&gt;How to Check Whether Multiple GPUs Are Being Used
&lt;/h2&gt;&lt;p&gt;Useful commands:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;watch -n 0.5 nvidia-smi
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;View the Ollama service logs:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;journalctl -u ollama -f
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If using Docker:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;docker logs -f ollama
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Watch for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether Ollama discovers compatible GPUs.&lt;/li&gt;
&lt;li&gt;Whether the model shows &lt;code&gt;100% GPU&lt;/code&gt; or a CPU/GPU split.&lt;/li&gt;
&lt;li&gt;Whether each GPU has VRAM allocated.&lt;/li&gt;
&lt;li&gt;Whether VRAM grows on multiple GPUs during model loading.&lt;/li&gt;
&lt;li&gt;Whether generation token/s improves compared with CPU/RAM spillover.&lt;/li&gt;
&lt;li&gt;Whether OOM or model unloading happens frequently.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;GPU utilization alone can be misleading. LLM inference does not always keep GPUs fully loaded, especially with multiple GPUs, low batch sizes, small contexts, slow CPUs, or slow PCIe links.&lt;/p&gt;
&lt;h2 id=&#34;common-misunderstandings&#34;&gt;Common Misunderstandings
&lt;/h2&gt;&lt;h3 id=&#34;misunderstanding-1-two-12gb-gpus-equal-one-24gb-gpu&#34;&gt;Misunderstanding 1: Two 12GB GPUs Equal One 24GB GPU
&lt;/h3&gt;&lt;p&gt;Not exactly. Multiple GPUs can place a model across devices, but cross-device access has overhead. It solves the &amp;ldquo;does not fit&amp;rdquo; problem, but it is not equivalent to the speed and stability of one large-VRAM GPU.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-2-different-gpu-models-cannot-be-mixed&#34;&gt;Misunderstanding 2: Different GPU Models Cannot Be Mixed
&lt;/h3&gt;&lt;p&gt;Not necessarily. If the driver, compute capability, and runtime libraries support the cards, Ollama can see multiple GPUs. But mixed setups are usually limited by the slower card, smaller VRAM, and PCIe topology. The most predictable setup is still same model, same VRAM size, and well-supported same-generation drivers.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-3-multi-gpu-is-always-faster-than-single-gpu&#34;&gt;Misunderstanding 3: Multi-GPU Is Always Faster Than Single-GPU
&lt;/h3&gt;&lt;p&gt;Not always. If the model fits completely on one fast GPU, single-GPU may be faster. Multi-GPU is mainly useful for large models, long contexts, or insufficient single-GPU VRAM.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-4-nvlink--sli-is-required&#34;&gt;Misunderstanding 4: NVLink / SLI Is Required
&lt;/h3&gt;&lt;p&gt;No. Ordinary PCIe multi-GPU systems can be used by Ollama. NVLink is not a prerequisite.&lt;/p&gt;
&lt;h3 id=&#34;misunderstanding-5-adding-a-gpu-does-not-require-restarting-services&#34;&gt;Misunderstanding 5: Adding a GPU Does Not Require Restarting Services
&lt;/h3&gt;&lt;p&gt;Not always true. Linux systemd services, Windows background apps, and Docker containers may need to be restarted before they rediscover devices and environment variables.&lt;/p&gt;
&lt;h2 id=&#34;gpu-selection-suggestions&#34;&gt;GPU Selection Suggestions
&lt;/h2&gt;&lt;p&gt;For Ollama local inference, the rough priority is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Larger single-GPU VRAM is usually easier to manage.&lt;/li&gt;
&lt;li&gt;Identical GPUs are easier to troubleshoot than mixed GPUs.&lt;/li&gt;
&lt;li&gt;More complete PCIe lanes make large-model loading smoother.&lt;/li&gt;
&lt;li&gt;Older cards should be checked for CUDA compute capability or ROCm support first.&lt;/li&gt;
&lt;li&gt;Multi-GPU power, cooling, and chassis airflow must be planned ahead.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For budget second-hand platforms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dual RTX 3090 remains a common high-VRAM option.&lt;/li&gt;
&lt;li&gt;Older Tesla cards such as P40 / M40 have large VRAM, but power, cooling, driver support, and performance all need trade-offs.&lt;/li&gt;
&lt;li&gt;Cards such as RTX 4070 / 4070 Ti have good efficiency, but single-card VRAM can be limiting.&lt;/li&gt;
&lt;li&gt;Multiple old 8GB cards can be fun to experiment with, but are not ideal for running large models long-term.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;Ollama multi-GPU support is best understood as &amp;ldquo;VRAM expansion first, performance acceleration second.&amp;rdquo; If the model fits entirely on one GPU, the default single-GPU path is usually faster. If one GPU cannot hold it, multi-GPU can spread the model across devices and avoid heavy CPU/RAM spillover, making larger models usable.&lt;/p&gt;
&lt;p&gt;In practice, use &lt;code&gt;ollama ps&lt;/code&gt; to check where the model is loaded, then use &lt;code&gt;nvidia-smi&lt;/code&gt; or ROCm tools to observe VRAM allocation. For GPU selection, use &lt;code&gt;CUDA_VISIBLE_DEVICES&lt;/code&gt; on NVIDIA, &lt;code&gt;ROCR_VISIBLE_DEVICES&lt;/code&gt; on AMD ROCm, and &lt;code&gt;GGML_VK_VISIBLE_DEVICES&lt;/code&gt; for Vulkan. If running in Docker, first make sure the container can see the GPUs.&lt;/p&gt;
&lt;p&gt;Multi-GPU is not magic. It can help fit larger models, but it does not guarantee linear speedup. The stable route is still to prefer large-VRAM single GPUs or identical multi-GPU setups, while considering driver support, PCIe, power, cooling, and model quantization together.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Ollama FAQ: How does Ollama load models on multiple GPUs?: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama/blob/main/docs/faq.mdx&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ollama/ollama/blob/main/docs/faq.mdx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama GPU docs: Hardware support / GPU Selection: &lt;a class=&#34;link&#34; href=&#34;https://github.com/ollama/ollama/blob/main/docs/gpu.mdx&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/ollama/ollama/blob/main/docs/gpu.mdx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Ollama Docker Hub: &lt;a class=&#34;link&#34; href=&#34;https://hub.docker.com/r/ollama/ollama&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://hub.docker.com/r/ollama/ollama&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;NVIDIA Container Toolkit: &lt;a class=&#34;link&#34; href=&#34;https://github.com/NVIDIA/nvidia-container-toolkit&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/NVIDIA/nvidia-container-toolkit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>How to Check Whether an Ollama Model Is Loaded on GPU</title>
        <link>https://knightli.com/en/2026/04/06/check-ollama-model-loaded-on-gpu/</link>
        <pubDate>Mon, 06 Apr 2026 10:15:18 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/06/check-ollama-model-loaded-on-gpu/</guid>
        <description>&lt;p&gt;If you want to confirm whether an Ollama model is actually running on GPU, the most direct way is checking processor allocation for currently loaded models.&lt;/p&gt;
&lt;h2 id=&#34;command&#34;&gt;Command
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ollama ps
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;example-output&#34;&gt;Example Output
&lt;/h2&gt;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;NAME        ID            SIZE    PROCESSOR   UNTIL
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama3:70b  bcfb190ca3a7  42 GB   100% GPU    4 minutes from now
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;how-to-read-the-processor-column&#34;&gt;How to Read the &lt;code&gt;PROCESSOR&lt;/code&gt; Column
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;code&gt;100% GPU&lt;/code&gt;: The model is fully loaded into GPU VRAM.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;100% CPU&lt;/code&gt;: The model is fully loaded in system memory (no GPU inference).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;48%/52% CPU/GPU&lt;/code&gt;: The model is split between system memory and GPU VRAM.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;practical-tips&#34;&gt;Practical Tips
&lt;/h2&gt;&lt;ol&gt;
&lt;li&gt;If you expect GPU usage but see &lt;code&gt;100% CPU&lt;/code&gt;, first check GPU drivers, CUDA/ROCm environment, and Ollama runtime settings.&lt;/li&gt;
&lt;li&gt;With larger models and limited VRAM, CPU/GPU mixed loading is common.&lt;/li&gt;
&lt;li&gt;For performance troubleshooting, run &lt;code&gt;ollama ps&lt;/code&gt; before checking speed metrics to locate bottlenecks faster.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;ollama ps&lt;/code&gt; is the first step to verify real GPU usage. Focus on the &lt;code&gt;PROCESSOR&lt;/code&gt; column to quickly identify where the model is loaded and decide your next optimization action.&lt;/p&gt;
&lt;!-- ollama-related-links:start --&gt;
&lt;h2 id=&#34;related-posts&#34;&gt;Related Posts
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/google-gemma-4-model-comparison/&#34; &gt;Gemma 4 Model Comparison and Selection&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/05/llm-quantization-guide-fp16-q4-q2/&#34; &gt;LLM Quantization Guide (FP16/Q8/Q5/Q4/Q2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/uninstall-ollama-on-linux/&#34; &gt;Completely Uninstall Ollama on Linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://knightli.com/en/2026/04/06/ollama-model-storage-path-and-migration/&#34; &gt;Ollama Model Storage Path and Migration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- ollama-related-links:end --&gt;
</description>
        </item>
        
    </channel>
</rss>
