When selecting a Llama GGUF model on Hugging Face, you can think of quantization levels like resolution: lower levels need less VRAM/RAM, but quality drops gradually.
Understand 32, 16, and Q levels first
- 32 (often labeled F32): closest to original, uncompressed quality, but hardware demand is extreme.
- 16 (F16): still very close to original quality, at around half the size of 32.
- Q8: a common entry point for quantized models (usually tagged Q8_0).
- Q6, Q5, Q4, Q3, Q2: the lower the number, the lower the resource use and the higher the risk of quality loss.
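The size relationship between these levels is roughly linear in bits per weight. A minimal sketch that estimates file size from parameter count; the bits-per-weight figures below are approximations I'm assuming for illustration (K-quants store per-block scales, so effective bits exceed the nominal number):

```python
# Rough GGUF file-size estimate: parameters * bits-per-weight / 8.
# Bits-per-weight values are approximate, not exact format specs.
BITS_PER_WEIGHT = {
    "F32": 32.0, "F16": 16.0,
    "Q8_0": 8.5, "Q6_K": 6.6,
    "Q5_K_M": 5.7, "Q4_K_M": 4.9,
    "Q3_K_M": 3.9, "Q2_K": 2.6,
}

def approx_size_gb(n_params: float, quant: str) -> float:
    """Approximate model file size in GB for a given quant level."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in ("F32", "F16", "Q8_0", "Q4_K_M"):
    print(f"8B model at {q}: ~{approx_size_gb(8e9, q):.1f} GB")
```

This makes the "16 is about half of 32" relationship concrete: an 8B model is roughly 32 GB at F32, 16 GB at F16, and around 5 GB at Q4_K_M.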
What K_M / K_S means
K_M and K_S are mixed quantization variants:
- most weights stay at the target quantization level
- important tensors keep higher precision (K_M retains more of them than K_S, so it is slightly larger and slightly higher quality)
So at the same level, Qx_K_M or Qx_K_S is usually slightly better than plain Qx.
Practical picking strategy
- If hardware allows, start with Q8.
- If memory is tight, step down through Q6 / Q5 / Q4.
- Try not to go below Q4; Q4_K_M is a common lower bound.
- Below Q4, quality degradation becomes increasingly visible.
Quality order (best to worst)
32
16
– Above this point, quality is effectively the same, but hardware requirements are extreme –
Q8
Q6_K
Q5_K_M
Q5_K_S
Q5
– This is the typical sweet spot –
Q4_K_M
Q4_K_S
Q4
– Below this point, quality loss becomes visible –
Q3_K_M
Q3_K_S
Q3
Q2_K

(In llama.cpp's GGUF naming, Q6 and Q2 ship only as the single K-quants Q6_K and Q2_K; there are no _M/_S variants at those levels.)
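The ordering above can be encoded as a lookup table so two GGUF files can be compared programmatically. The concrete tag names (Q8_0, Q5_0, Q4_0) are llama.cpp's spellings of the document's plain-Qx shorthand, an assumption on my part:

```python
# The document's quality ranking, best first, as concrete GGUF tags.
QUALITY_ORDER = [
    "F32", "F16",
    "Q8_0", "Q6_K", "Q5_K_M", "Q5_K_S", "Q5_0",
    "Q4_K_M", "Q4_K_S", "Q4_0",
    "Q3_K_M", "Q3_K_S", "Q2_K",
]
RANK = {q: i for i, q in enumerate(QUALITY_ORDER)}

def better(a: str, b: str) -> str:
    """Return whichever quant tag ranks higher (lower index = better)."""
    return a if RANK[a] <= RANK[b] else b

print(better("Q4_K_M", "Q5_K_S"))  # Q5_K_S
```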
If you want one short rule: start with Q8 or Q6_K, then move down to Q5_K_M or Q4_K_M only when needed.