<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>GTX 1060 on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/gtx-1060/</link>
        <description>Recent content in GTX 1060 on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Wed, 24 Jun 2026 10:07:45 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/gtx-1060/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>GTX 1060 Running Qwen 35B: Optimizing llama.cpp from 3 tok/s to 17 tok/s</title>
        <link>https://knightli.com/en/2026/06/24/gtx-1060-qwen-35b-llama-cpp-optimization-guide/</link>
        <pubDate>Wed, 24 Jun 2026 10:07:45 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/06/24/gtx-1060-qwen-35b-llama-cpp-optimization-guide/</guid>
        <description>&lt;p&gt;Can a 6GB GTX 1060 run a 35B-class large language model?&lt;/p&gt;
&lt;p&gt;Under the usual assumptions, the first answer is probably &amp;ldquo;not really.&amp;rdquo; A 35B model is huge, 6GB of VRAM is tiny, and even a quantized model can easily run into slow generation, memory pressure, limited context length, or instability after a short run.&lt;/p&gt;
&lt;p&gt;But if the model uses an MoE architecture, and you combine that with &lt;code&gt;llama.cpp&lt;/code&gt; layer offloading, CPU memory, and parameter tuning, the answer becomes more interesting. It will not feel like a high-end GPU setup, but it can move from &amp;ldquo;barely runs&amp;rdquo; to &amp;ldquo;usable for local experiments.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This guide is organized around practical tuning. The goal is not to mythologize the GTX 1060, but to explain where to look, what to tune, and how to identify bottlenecks when running Qwen 35B-like models on low-VRAM hardware.&lt;/p&gt;
&lt;h2 id=&#34;start-with-the-conclusion&#34;&gt;Start with the conclusion
&lt;/h2&gt;&lt;p&gt;For a low-VRAM GPU running a 35B model, the key is not to force everything into VRAM. The goal is to let the GPU handle the parts that benefit most from acceleration.&lt;/p&gt;
&lt;p&gt;The rough workflow is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Get it running first
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-&amp;gt; Understand why the default speed is slow
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-&amp;gt; Tune GPU offloading
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-&amp;gt; Use MoE characteristics to reduce unnecessary load
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-&amp;gt; Fix memory and cache bottlenecks
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-&amp;gt; Increase context length only after that
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-&amp;gt; Finally handle stability
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If you focus only on &amp;ldquo;how do I fit this into VRAM,&amp;rdquo; you can easily tune in the wrong direction. The more practical goal is to make VRAM, system memory, CPU, disk, and context cache work together.&lt;/p&gt;
&lt;h2 id=&#34;prepare-the-environment&#34;&gt;Prepare the environment
&lt;/h2&gt;&lt;p&gt;This kind of setup works best if you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An NVIDIA GPU with around 6GB VRAM, such as a GTX 1060 6GB;&lt;/li&gt;
&lt;li&gt;Enough system RAM, because low RAM makes swap and OOM problems much more likely;&lt;/li&gt;
&lt;li&gt;A CUDA-enabled build of &lt;code&gt;llama.cpp&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;A quantized model file suitable for low-VRAM testing;&lt;/li&gt;
&lt;li&gt;Realistic expectations about speed;&lt;/li&gt;
&lt;li&gt;The ability to monitor VRAM, RAM, and process usage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Start by checking the environment:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;free -h
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-cli --help
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If &lt;code&gt;nvidia-smi&lt;/code&gt; cannot see the GPU, or your &lt;code&gt;llama.cpp&lt;/code&gt; build has no CUDA support, parameter tuning will not deliver the result you want.&lt;/p&gt;
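&lt;p&gt;A minimal preflight sketch for the checks above. The name &lt;code&gt;check_tools&lt;/code&gt; is made up for illustration. It only tests whether each command can be found; it does not prove that the &lt;code&gt;llama.cpp&lt;/code&gt; build has CUDA enabled.&lt;/p&gt;

```bash
# check_tools is a hypothetical helper, not part of llama.cpp.
# It prints one line per tool and returns 1 if any tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if [ -n "$(command -v "$tool")" ]; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

&lt;p&gt;A typical call would be &lt;code&gt;check_tools nvidia-smi free htop&lt;/code&gt;. Any &lt;code&gt;missing&lt;/code&gt; line is worth fixing before you touch model parameters.&lt;/p&gt;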
&lt;h2 id=&#34;step-1-make-the-model-run-first&#34;&gt;Step 1: Make the model run first
&lt;/h2&gt;&lt;p&gt;Do not chase 17 tok/s at the beginning. The first goal is simple: can the model load and produce output?&lt;/p&gt;
&lt;p&gt;A basic command usually looks like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-cli &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -m /path/to/model.gguf &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p &lt;span class=&#34;s2&#34;&gt;&amp;#34;Explain what an MoE model is in three sentences&amp;#34;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -n &lt;span class=&#34;m&#34;&gt;128&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If this fails, do not add GPU options yet. Check:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Whether the model path is correct;&lt;/li&gt;
&lt;li&gt;Whether the quantization format is supported by your &lt;code&gt;llama.cpp&lt;/code&gt; version;&lt;/li&gt;
&lt;li&gt;Whether system RAM is sufficient;&lt;/li&gt;
&lt;li&gt;Whether you downloaded the right model variant;&lt;/li&gt;
&lt;li&gt;Whether the binary supports the model architecture.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once the model can run, start optimizing speed.&lt;/p&gt;
&lt;h2 id=&#34;why-the-default-speed-may-be-only-3-toks&#34;&gt;Why the default speed may be only 3 tok/s
&lt;/h2&gt;&lt;p&gt;On low-VRAM hardware, slow defaults usually come from multiple bottlenecks at once.&lt;/p&gt;
&lt;p&gt;Common cases include:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Bottleneck&lt;/th&gt;
          &lt;th&gt;Symptom&lt;/th&gt;
          &lt;th&gt;Direction&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Too little GPU offload&lt;/td&gt;
          &lt;td&gt;GPU is idle, CPU is busy&lt;/td&gt;
          &lt;td&gt;Increase GPU offload within limits&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Too much offload&lt;/td&gt;
          &lt;td&gt;VRAM overflows or errors appear&lt;/td&gt;
          &lt;td&gt;Reduce offloaded layers&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Memory bandwidth limit&lt;/td&gt;
          &lt;td&gt;CPU is busy but token speed is low&lt;/td&gt;
          &lt;td&gt;Reduce overhead, try a smaller quantization&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Context too large&lt;/td&gt;
          &lt;td&gt;Slow start or RAM spikes&lt;/td&gt;
          &lt;td&gt;Test with a smaller context first&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Swap is active&lt;/td&gt;
          &lt;td&gt;The whole system feels stuck&lt;/td&gt;
          &lt;td&gt;Add RAM, or lower context length and batch size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Poor batch settings&lt;/td&gt;
          &lt;td&gt;Prompt processing is slow&lt;/td&gt;
          &lt;td&gt;Tune batch-related parameters&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Do not look only at the tok/s number. Keep these running while testing:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;watch -n &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt; nvidia-smi
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;htop
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Watch VRAM usage, GPU utilization, CPU load, and system memory together.&lt;/p&gt;
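&lt;p&gt;If you would rather log one number than read the full &lt;code&gt;nvidia-smi&lt;/code&gt; table, a small sketch like this can help. The name &lt;code&gt;vram_percent&lt;/code&gt; is hypothetical, and the function assumes a single input line in the form &lt;code&gt;used, total&lt;/code&gt; in MiB, which is what the query shown in the comment should print on a single-GPU machine.&lt;/p&gt;

```bash
# vram_percent is a hypothetical helper. It expects one line in the form
# "used, total" (both in MiB) and prints the used share as a whole percent,
# truncated rather than rounded.
vram_percent() {
  echo "$1" | awk -F', *' '{ printf "%d\n", ($1 * 100) / $2 }'
}

# Intended use on a machine with an NVIDIA driver installed:
#   vram_percent "$(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits)"
```

&lt;p&gt;Logging this value next to tok/s makes it easier to see which setting pushed VRAM to the limit.&lt;/p&gt;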
&lt;h2 id=&#34;step-2-tune-gpu-offloading&#34;&gt;Step 2: Tune GPU offloading
&lt;/h2&gt;&lt;p&gt;The most common acceleration path in &lt;code&gt;llama.cpp&lt;/code&gt; is offloading part of the model to the GPU.&lt;/p&gt;
&lt;p&gt;A typical parameter is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-ngl &lt;span class=&#34;m&#34;&gt;20&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Or in a fuller command:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-cli &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -m /path/to/model.gguf &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -p &lt;span class=&#34;s2&#34;&gt;&amp;#34;Write a local LLM tuning checklist&amp;#34;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -n &lt;span class=&#34;m&#34;&gt;256&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;\
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  -ngl &lt;span class=&#34;m&#34;&gt;20&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The value &lt;code&gt;20&lt;/code&gt; is not universal. With a low-VRAM GPU, test gradually:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-ngl 10
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-ngl 15
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-ngl 20
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-ngl 25
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After each change, check three things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Whether the process starts normally;&lt;/li&gt;
&lt;li&gt;Whether VRAM is close to full;&lt;/li&gt;
&lt;li&gt;Whether tok/s actually improves.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If VRAM is already full, increasing &lt;code&gt;-ngl&lt;/code&gt; will usually make things less stable, not faster.&lt;/p&gt;
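&lt;p&gt;The gradual test can be sketched as a dry run that prints one command per &lt;code&gt;-ngl&lt;/code&gt; value. The helper name &lt;code&gt;ngl_sweep&lt;/code&gt; is hypothetical; the flags are the ones already used in this guide. Nothing is executed, so you can review the commands before running them one at a time.&lt;/p&gt;

```bash
# ngl_sweep is a hypothetical helper. It only prints commands (a dry run),
# one per -ngl value, so every run shares the same model, prompt, and -n.
ngl_sweep() {
  model="$1"
  shift
  for ngl in "$@"; do
    echo "./llama-cli -m $model -p 'Write a local LLM tuning checklist' -n 256 -ngl $ngl"
  done
}

ngl_sweep /path/to/model.gguf 10 15 20 25
```

&lt;p&gt;Keeping everything except &lt;code&gt;-ngl&lt;/code&gt; identical is the point: any change in tok/s can then be attributed to the offload setting.&lt;/p&gt;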
&lt;h2 id=&#34;step-3-understand-why-moe-matters&#34;&gt;Step 3: Understand why MoE matters
&lt;/h2&gt;&lt;p&gt;MoE models are different from dense models.&lt;/p&gt;
&lt;p&gt;The key idea is that the total parameter count is large, but not every expert is activated for every token. In other words, a 35B label does not always mean every token requires the full 35B worth of computation.&lt;/p&gt;
&lt;p&gt;That is why low-VRAM GPUs can sometimes experiment with these models.&lt;/p&gt;
&lt;p&gt;But two warnings matter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MoE is not free magic; the model file is still large;&lt;/li&gt;
&lt;li&gt;When VRAM is insufficient, system RAM still has to hold a large share of the weights.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So when tuning MoE models, the focus is to put the high-value acceleration work on the GPU and spend VRAM only where it helps most.&lt;/p&gt;
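&lt;p&gt;A back-of-envelope calculation shows why both warnings hold. The numbers here are assumptions for illustration, not figures from any model card: about 4.5 bits per weight for the quantized file, and about 3B parameters active per token. The helper name &lt;code&gt;weights_gib&lt;/code&gt; is hypothetical.&lt;/p&gt;

```bash
# weights_gib is a hypothetical helper for back-of-envelope sizing.
# Arguments: parameter count in billions, average bits per weight.
weights_gib() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * 1000000000 * b / 8 / (1024 * 1024 * 1024) }'
}

weights_gib 35 4.5   # all weights of a 35B model at an assumed 4.5 bits each
weights_gib 3 4.5    # weights read per token if an assumed 3B were active
```

&lt;p&gt;Under these assumptions the full set of weights is about 18.3 GiB, roughly three times what a 6GB card holds, while the weights read for one token are about 1.6 GiB. The file does not shrink, but the per-token work does.&lt;/p&gt;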
&lt;h2 id=&#34;step-4-fix-memory-bottlenecks&#34;&gt;Step 4: Fix memory bottlenecks
&lt;/h2&gt;&lt;p&gt;Many people assume low-VRAM failure is only a VRAM problem. In practice, VRAM, system RAM, and KV cache limits are often hit at the same time.&lt;/p&gt;
&lt;p&gt;If system RAM is close to full, or swap starts being used heavily, speed will drop sharply. Check with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;free -h
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Or:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;vmstat &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Look for frequent swap activity.&lt;/p&gt;
&lt;p&gt;Useful directions include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use a smaller quantized model;&lt;/li&gt;
&lt;li&gt;Reduce context length;&lt;/li&gt;
&lt;li&gt;Lower batch size;&lt;/li&gt;
&lt;li&gt;Stop unrelated background tasks;&lt;/li&gt;
&lt;li&gt;Put the model on a fast SSD;&lt;/li&gt;
&lt;li&gt;Leave enough free system memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If system RAM is too small, a 6GB GPU alone cannot make the experience smooth.&lt;/p&gt;
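&lt;p&gt;To turn the swap check into a number you can log, a sketch like the following reads &lt;code&gt;free -m&lt;/code&gt; style output and prints the used swap in MiB. The name &lt;code&gt;swap_used_mib&lt;/code&gt; is hypothetical, and the function assumes the usual Linux layout where the swap row starts with &lt;code&gt;Swap:&lt;/code&gt; and the third column is the used amount.&lt;/p&gt;

```bash
# swap_used_mib is a hypothetical helper. It reads `free -m` style output
# on stdin and prints the "used" column of the Swap row.
swap_used_mib() {
  awk '$1 == "Swap:" { print $3 }'
}

# Intended use on Linux:
#   free -m | swap_used_mib
```

&lt;p&gt;If this value keeps growing while the model generates, lower the context length or batch size before changing anything else.&lt;/p&gt;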
&lt;h2 id=&#34;step-5-do-not-max-out-context-length-first&#34;&gt;Step 5: Do not max out context length first
&lt;/h2&gt;&lt;p&gt;Many models advertise long-context support, but low-VRAM machines should not start there.&lt;/p&gt;
&lt;p&gt;Begin with a smaller context:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-c &lt;span class=&#34;m&#34;&gt;4096&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If that is stable, try:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;-c &lt;span class=&#34;m&#34;&gt;8192&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;When increasing context length, watch memory and speed.&lt;/p&gt;
&lt;p&gt;Longer context means higher KV cache pressure. On low-VRAM devices, long context usually exposes problems more quickly than short-answer generation.&lt;/p&gt;
&lt;p&gt;If your goal is local Q&amp;amp;A, code snippet explanation, or short summarization, you do not need a very large context at the beginning.&lt;/p&gt;
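&lt;p&gt;A rough estimate shows how KV cache grows with context. This sketch assumes a standard transformer where every layer keeps one K and one V tensor, and the layer count, KV head count, and head dimension below are placeholders, not the real shape of any Qwen model. Models with other attention designs will differ. The name &lt;code&gt;kv_cache_mib&lt;/code&gt; is hypothetical.&lt;/p&gt;

```bash
# kv_cache_mib is a hypothetical helper for a rough KV cache estimate.
# Arguments: layers, KV heads, head dimension, context length, bytes per element.
# The leading 2 counts one K tensor and one V tensor per layer.
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1024 / 1024 ))
}

kv_cache_mib 48 4 128 4096 2   # assumed shape, 2-byte cache entries, -c 4096
kv_cache_mib 48 4 128 8192 2   # same shape, -c 8192
```

&lt;p&gt;With these placeholder values the estimate is 384 MiB at 4096 tokens and 768 MiB at 8192. The cache scales linearly with &lt;code&gt;-c&lt;/code&gt;, so doubling the context doubles this part of the memory bill.&lt;/p&gt;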
&lt;h2 id=&#34;step-6-watch-batch-parameters&#34;&gt;Step 6: Watch batch parameters
&lt;/h2&gt;&lt;p&gt;Batch-related parameters in &lt;code&gt;llama.cpp&lt;/code&gt; affect prompt processing and generation behavior. Names can vary between versions, so check help first:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama-cli --help
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;General ideas:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For long prompts, batch tuning may improve prompt processing;&lt;/li&gt;
&lt;li&gt;When VRAM is tight, too large a batch can reduce stability;&lt;/li&gt;
&lt;li&gt;Do not copy someone else&amp;rsquo;s values blindly. Test on your own machine.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Change only one parameter at a time.&lt;/p&gt;
&lt;p&gt;For example, first fix the model, context, and &lt;code&gt;-ngl&lt;/code&gt;, then try batch settings. Otherwise it is hard to know which change helped.&lt;/p&gt;
&lt;h2 id=&#34;step-7-record-your-five-key-parameters&#34;&gt;Step 7: Record your five key parameters
&lt;/h2&gt;&lt;p&gt;The worst part of local low-VRAM tuning is getting something to run today and forgetting the working settings tomorrow.&lt;/p&gt;
&lt;p&gt;Record these items for each test:&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;Parameter&lt;/th&gt;
          &lt;th&gt;What to record&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;Model file&lt;/td&gt;
          &lt;td&gt;Model name, quantization, file size&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;GPU offload&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;-ngl&lt;/code&gt; or related offload settings&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Context length&lt;/td&gt;
          &lt;td&gt;&lt;code&gt;-c&lt;/code&gt; value&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Batch&lt;/td&gt;
          &lt;td&gt;batch / ubatch and related settings&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;Result&lt;/td&gt;
          &lt;td&gt;tok/s, VRAM, RAM, stability&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A simple note can look like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;model: Qwen-xx-35B-xxx.gguf
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;gpu: GTX 1060 6GB
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ngl: 20
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ctx: 4096
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;batch: default
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;speed: about 17 tok/s
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;status: stable for short text, long context needs more testing
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Next time you change models or machines, you will have a baseline.&lt;/p&gt;
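&lt;p&gt;If you want every note in the same shape, a small formatter is enough. The name &lt;code&gt;run_record&lt;/code&gt; and the log file name mentioned afterwards are hypothetical; the fields mirror the note format used in this guide.&lt;/p&gt;

```bash
# run_record is a hypothetical helper that prints one test note.
# Arguments, in order: model, gpu, ngl, ctx, batch, speed, status.
run_record() {
  printf 'model: %s\ngpu: %s\nngl: %s\nctx: %s\nbatch: %s\nspeed: %s\nstatus: %s\n' "$@"
}

run_record Qwen-xx-35B-xxx.gguf "GTX 1060 6GB" 20 4096 default "about 17 tok/s" "stable for short text"
```

&lt;p&gt;Piping the output through &lt;code&gt;tee -a tuning-log.txt&lt;/code&gt; appends each note to a running log while still showing it on screen.&lt;/p&gt;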
&lt;h2 id=&#34;a-steadier-test-flow&#34;&gt;A steadier test flow
&lt;/h2&gt;&lt;p&gt;Use this order:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run without GPU offload to confirm the model loads;&lt;/li&gt;
&lt;li&gt;Add a low &lt;code&gt;-ngl&lt;/code&gt; value and confirm output;&lt;/li&gt;
&lt;li&gt;Increase &lt;code&gt;-ngl&lt;/code&gt; gradually to find the VRAM limit;&lt;/li&gt;
&lt;li&gt;Fix &lt;code&gt;-ngl&lt;/code&gt;, then tune context length;&lt;/li&gt;
&lt;li&gt;Fix context, then test batch;&lt;/li&gt;
&lt;li&gt;Use the same prompt to compare tok/s;&lt;/li&gt;
&lt;li&gt;Run for 10 to 20 minutes and watch stability;&lt;/li&gt;
&lt;li&gt;Record the final parameters.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Do not use a different prompt every time. Different prompts make speed numbers hard to compare.&lt;/p&gt;
&lt;p&gt;Prepare a fixed test prompt:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Explain in 800 words why MoE models can be suitable for low-VRAM inference, and give three cautions.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Use the same prompt in every round so you can tell whether the optimization really helped.&lt;/p&gt;
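&lt;p&gt;Single runs are noisy, so it can help to repeat the fixed prompt a few times per setting and compare averages. This sketch averages tok/s figures that you have copied by hand, one per line. The name &lt;code&gt;mean_toks&lt;/code&gt; is hypothetical, and the sample numbers are made up to show the call; they are not measurements.&lt;/p&gt;

```bash
# mean_toks is a hypothetical helper. It reads one tok/s figure per line
# on stdin and prints the mean with one decimal place.
mean_toks() {
  awk '{ sum += $1; n += 1 } END { if (n) printf "%.1f\n", sum / n }'
}

# Made-up sample values, only to show the call:
printf '16\n17\n18\n' | mean_toks
```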
&lt;h2 id=&#34;common-failed-attempts&#34;&gt;Common failed attempts
&lt;/h2&gt;&lt;p&gt;These mistakes waste a lot of time when tuning large models on low-VRAM hardware.&lt;/p&gt;
&lt;h3 id=&#34;1-raising-gpu-offload-blindly&#34;&gt;1. Raising GPU offload blindly
&lt;/h3&gt;&lt;p&gt;You see &lt;code&gt;-ngl&lt;/code&gt; improve speed, so you keep increasing it.&lt;/p&gt;
&lt;p&gt;The problem is that a GTX 1060 has only 6GB VRAM. Once you cross the limit, the program may error out, or it may start but become unstable.&lt;/p&gt;
&lt;h3 id=&#34;2-starting-with-very-long-context&#34;&gt;2. Starting with very long context
&lt;/h3&gt;&lt;p&gt;Long context puts heavy pressure on memory and KV cache. Make the model stable with short context first, then expand.&lt;/p&gt;
&lt;h3 id=&#34;3-looking-only-at-average-toks&#34;&gt;3. Looking only at average tok/s
&lt;/h3&gt;&lt;p&gt;tok/s matters, but it is not the only metric.&lt;/p&gt;
&lt;p&gt;Also watch:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First-token latency;&lt;/li&gt;
&lt;li&gt;Prompt processing speed;&lt;/li&gt;
&lt;li&gt;Whether VRAM overflows;&lt;/li&gt;
&lt;li&gt;Whether long runs remain stable;&lt;/li&gt;
&lt;li&gt;Whether the system becomes unusably slow.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;4-not-recording-parameters&#34;&gt;4. Not recording parameters
&lt;/h3&gt;&lt;p&gt;Local inference tuning takes repeated experiments. Without notes, it is easy to forget the settings that worked.&lt;/p&gt;
&lt;h2 id=&#34;what-to-expect-from-a-gtx-1060&#34;&gt;What to expect from a GTX 1060
&lt;/h2&gt;&lt;p&gt;What is an old GPU like the GTX 1060 good for?&lt;/p&gt;
&lt;p&gt;Good for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Learning &lt;code&gt;llama.cpp&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;Testing GGUF models;&lt;/li&gt;
&lt;li&gt;Running short local Q&amp;amp;A;&lt;/li&gt;
&lt;li&gt;Experimenting with local model parameters;&lt;/li&gt;
&lt;li&gt;Trying low-resource MoE inference;&lt;/li&gt;
&lt;li&gt;Deciding whether a model is worth deploying on better hardware.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Not ideal for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;High-concurrency services;&lt;/li&gt;
&lt;li&gt;Heavy long-context use;&lt;/li&gt;
&lt;li&gt;Multiple users at the same time;&lt;/li&gt;
&lt;li&gt;Production-scale RAG;&lt;/li&gt;
&lt;li&gt;Latency-sensitive real-time applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Treat the GTX 1060 as an experiment machine and it is valuable. Treat it as a production LLM server and you will be disappointed.&lt;/p&gt;
&lt;h2 id=&#34;one-sentence-summary&#34;&gt;One-sentence summary
&lt;/h2&gt;&lt;p&gt;Running Qwen 35B-like models on 6GB VRAM is not about stuffing everything into the GPU. It is about coordinating &lt;code&gt;llama.cpp&lt;/code&gt; GPU offload, MoE behavior, system memory, context length, and batch parameters.&lt;/p&gt;
&lt;p&gt;If you have an old GTX 1060, try this order:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Run first -&amp;gt; Tune -ngl -&amp;gt; Watch VRAM -&amp;gt; Control context -&amp;gt; Check RAM -&amp;gt; Test batch -&amp;gt; Record tok/s
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Going from 3 tok/s to 17 tok/s is not magic. It comes from breaking the bottlenecks apart one by one.&lt;/p&gt;
</description>
        </item>
        
    </channel>
</rss>
