LoRA vs QLoRA: What’s the Difference?

15 June


#ai #dev

(channel https://t.me/cybermerlin_pub - more articles and useful material)


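Both methods fine‑tune LLMs without updating all of the model's weights: the base model stays frozen and only small low‑rank adapter matrices are trained. QLoRA goes one step further and quantises the frozen base model to 4‑bit, so that large models fit into limited memory.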
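To make "without updating all of the weights" concrete, here is a minimal sketch that counts trainable parameters for a single linear layer under full fine‑tuning and under a LoRA adapter. The 4096 x 4096 layer size and the rank r=16 are illustrative assumptions, not values taken from a particular model.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one linear layer."""
    full = d_out * d_in          # full fine-tuning: every entry of W is trainable
    lora = r * d_in + d_out * r  # LoRA: A is (r x d_in), B is (d_out x r); W is frozen
    return full, lora


full, lora = lora_param_counts(d_out=4096, d_in=4096, r=16)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r=16):      {lora:,} trainable parameters")
print(f"share trained:    {lora / full:.2%}")
```

Under these assumptions the adapter trains less than 1% of the parameters of that layer, which is where the memory saving during training comes from.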

Comparison

LoRA

- Base model precision: FP16 / BF16 (16‑bit)

- VRAM required (7B model): about 16 GB or more (the 16‑bit weights alone take roughly 14 GB)

- Quality after training: Baseline (100%)

- Training speed: High

- Inference speed: High

- Where it works: Regular GPUs with enough VRAM for the 16‑bit weights (16 GB+ for a 7B model)

QLoRA

- Base model precision: 4‑bit (NF4 / FP4)

- VRAM required (7B model): 6–8 GB

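- Quality after training: roughly 99.5–100% of the LoRA baseline (almost no loss)

- Training speed: Lower (the 4‑bit weights are dequantised on every forward and backward pass)

- Inference speed: Lower while the base stays in 4‑bit (weights are dequantised to 16‑bit on the fly)

- Where it works: Edge GPUs, laptops (4–8 GB; a 7B model needs the upper end of that range)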
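The gap between the two VRAM figures follows mostly from the size of the base weights. A back‑of‑the‑envelope sketch, counting the weights only (activations, adapter gradients, optimiser state and quantisation constants come on top, which is why real totals are higher):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory taken by the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # a 7B model
fp16_gb = weight_memory_gb(n_params, 16)  # LoRA: 16-bit base
nf4_gb = weight_memory_gb(n_params, 4)    # QLoRA: 4-bit base

print(f"16-bit base weights: {fp16_gb:.1f} GB")
print(f"4-bit base weights:  {nf4_gb:.1f} GB")
```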

How QLoRA Works

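1. The base model is quantised to 4‑bit and frozen. The usual data type is NF4 (4‑bit NormalFloat), which is designed for weights that follow a normal distribution.

2. LoRA adapters are trained in 16‑bit precision (FP16 or BF16) on top of this 4‑bit base. Only the adapters receive gradient updates.

3. At inference, you can either load the adapters separately on top of the 4‑bit base or merge them into the base weights. Merging is usually done against a 16‑bit copy of the base, because merging into 4‑bit weights requires re‑quantisation and can cost some accuracy.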
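The steps above can be sketched in plain Python for a single output unit. This is a toy illustration under loud assumptions: the 16 code values are built from normal quantiles to imitate the idea behind NF4 and are not the real NF4 table, the block size of 4 is chosen for readability, and no training loop is shown.

```python
import random
from statistics import NormalDist

BLOCK = 4  # toy block size (assumption; real implementations use larger blocks)

# 16 code values from evenly spaced normal quantiles, scaled to [-1, 1].
# NF4-like in spirit only; NOT the actual NF4 lookup table.
_q = [NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)]
LEVELS = [v / max(abs(x) for x in _q) for v in _q]


def quantise(weights):
    """Blockwise absmax quantisation: return (4-bit codes, per-block scales)."""
    codes, scales = [], []
    for start in range(0, len(weights), BLOCK):
        block = weights[start:start + BLOCK]
        scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        scales.append(scale)
        for w in block:
            codes.append(min(range(16), key=lambda i: abs(LEVELS[i] - w / scale)))
    return codes, scales


def dequantise(codes, scales):
    """Map 4-bit codes back to approximate weights."""
    return [LEVELS[c] * scales[i // BLOCK] for i, c in enumerate(codes)]


def forward(x, codes, scales, A, B):
    """y = dequant(W) . x + B . (A x) for a single output unit."""
    w = dequantise(codes, scales)  # 4-bit base, dequantised on the fly
    base = sum(wi * xi for wi, xi in zip(w, x))
    low_rank = [sum(a * xi for a, xi in zip(row, x)) for row in A]  # A x
    return base + sum(b * h for b, h in zip(B, low_rank))


random.seed(0)
d_in, r = 8, 2
W = [random.gauss(0, 0.02) for _ in range(d_in)]
codes, scales = quantise(W)  # step 1: frozen 4-bit base
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # step 2: trainable
B = [0.0] * r                # zero init, so the adapter starts as a no-op
x = [1.0] * d_in
y = forward(x, codes, scales, A, B)  # step 3: adapters kept separate
print(y)
```

With B initialised to zero the adapter contributes nothing, so the output equals that of the quantised base. Training would then update only A and B.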

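Key point: QLoRA retains high accuracy thanks to the NF4 data type and the 16‑bit adapters. Two further techniques keep memory use down: “double quantisation”, which quantises the quantisation constants themselves, and paged optimisers, which move optimiser state to CPU memory during memory spikes.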
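Double quantisation concerns the memory taken by the quantisation constants rather than by the weights. A small arithmetic sketch, assuming the block sizes described in the QLoRA paper (one FP32 constant per block of 64 weights; with double quantisation, 8‑bit constants plus one FP32 constant per 256 of them):

```python
def constant_overhead_bits(block=64, double_quant=False, inner_block=256):
    """Extra bits per weight spent on storing quantisation constants."""
    if not double_quant:
        return 32 / block  # one FP32 constant per block
    # 8-bit constants, plus one FP32 constant per `inner_block` of those constants
    return 8 / block + 32 / (block * inner_block)


plain = constant_overhead_bits()
double = constant_overhead_bits(double_quant=True)
saved_gb_7b = 7e9 * (plain - double) / 8 / 1e9

print(f"without double quantisation: {plain:.3f} bits per weight")
print(f"with double quantisation:    {double:.3f} bits per weight")
print(f"saving on a 7B model:        {saved_gb_7b:.2f} GB")
```

Under these assumptions the saving is about 0.37 bits per weight: small for a 7B model, more noticeable at 70B.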

When to Choose Which

- LoRA – if you have a GPU with 16+ GB VRAM and you want maximum training speed.

- QLoRA – if you are training on an RTX 3060 (12 GB), a laptop GPU, or a low‑cost cloud T4 (16 GB). QLoRA also makes it possible to fine‑tune a 70B model on a single A100 80 GB; LoRA on a 16‑bit 70B base would need 2–4 A100s, since the weights alone take about 140 GB.
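The 70B claim can be sanity‑checked with a rough lower bound: how many 80 GB GPUs are needed just to hold the base weights. The helper name is hypothetical and the estimate ignores activations, adapter gradients and optimiser state, so real requirements sit above it.

```python
import math


def gpus_for_weights(params_billions, bits_per_param, gpu_gb=80):
    """Return (weights in GB, minimum GPU count to hold the weights alone)."""
    weights_gb = params_billions * bits_per_param / 8
    return weights_gb, math.ceil(weights_gb / gpu_gb)


lora_gb, lora_gpus = gpus_for_weights(70, 16)   # 16-bit base (LoRA)
qlora_gb, qlora_gpus = gpus_for_weights(70, 4)  # 4-bit base (QLoRA)

print(f"LoRA, 70B:  {lora_gb:.0f} GB of weights, at least {lora_gpus} x A100 80 GB")
print(f"QLoRA, 70B: {qlora_gb:.0f} GB of weights, at least {qlora_gpus} x A100 80 GB")
```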

Example (Unsloth – optimised QLoRA)

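from unsloth import FastLanguageModel

# Load a pre-quantised 4-bit base model (this is the "Q" in QLoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    load_in_4bit=True,  # QLoRA
    max_seq_length=2048,
)

# Attach trainable LoRA adapters of rank 16 on top of the frozen base.
model = FastLanguageModel.get_peft_model(model, r=16)

# ... training ...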

LoRA is the classic choice for regular GPUs. QLoRA is a breakthrough that makes fine‑tuning large models accessible on modest hardware, with almost no quality loss. Choose QLoRA when VRAM is limited.