Quantisation

Module: tool mastery

What it is

Quantisation reduces model size by using lower-precision numbers for weights. Instead of 16-bit or 32-bit floating point numbers, quantised models might use 8-bit, 4-bit, or even 2-bit integers. This dramatically reduces memory requirements and can speed up inference with minimal accuracy loss.
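To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantisation using NumPy. The function names are invented for illustration; real quantisation schemes (such as the per-block formats behind Q4/Q8 model files) are more elaborate, but the core idea of scaling floats into a small integer range is the same.

```python
import numpy as np

def quantise_int8(weights: np.ndarray):
    """Symmetric per-tensor quantisation: map float weights into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0      # one scale factor for the whole tensor
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4096).astype(np.float32)
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
print("max rounding error:", np.abs(weights - restored).max())
```

The stored tensor shrinks from 4 bytes (or 2 bytes) per weight to 1 byte plus a single scale factor, and the rounding error per weight stays small relative to the weight range, which is why accuracy loss is usually modest.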

Why it matters

Quantisation makes it possible to run large models on consumer hardware. A 70B model that needs about 140GB of memory at 16-bit precision might fit in roughly 35GB at 4-bit quantisation. If you want to run models locally, the quantisation level (Q4, Q8, etc.) determines whether a model fits on your hardware.
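The arithmetic behind those figures is simple: weight memory is roughly the parameter count times the bits per weight, divided by 8. A rough back-of-the-envelope sketch (ignoring activations, KV cache, and quantisation metadata such as scale factors):

```python
def model_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only memory estimate: parameters * bits / 8 bytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"70B at {label}: ~{model_memory_gb(70, bits):.0f} GB")
# 70B at FP16: ~140 GB, Q8: ~70 GB, Q4: ~35 GB, Q2: ~18 GB
```

Running the same calculation for your own hardware's memory budget tells you which quantisation level of a given model you can realistically load.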