#ai #dev (channel: https://t.me/cybermerlin_pub, more articles and useful material)

Both methods fine-tune LLMs without updating all of the model's weights, but QLoRA goes further: it quantises the base model to 4-bit so that large models fit into limited memory.

LoRA:
- Base model precision: FP16 / BF16 (16-bit)
- VRAM required (7B model): 12–16 GB
- Quality after training: Baseline (100%)
- Training speed: High
- Inference speed: High
- Where it works: Regular GPUs (12 GB+)

QLoRA:
- Base model precision: 4-bit (NF4 / FP4)
- VRAM required (7B model): 6–8 GB
- Quality after training: 99.5–100% of the LoRA baseline (almost no loss)
- Training speed: Lower (due to quantisation/dequantisation overhead)
- Inference speed: Lower (the 4-bit base weights are dequantised on the fly)
- Where it works: Edge GPUs, laptops (4–8 GB)

How QLoRA works:
1. The base model is quantised to 4-bit (NF4, a data type designed for normally distributed weights) and frozen.
2. LoRA adapters are trained in 16-bit precision (FP16 / BF16) on top of this 4-bit base.
3. At inference, you can either merge the adapters into the base weights (merging directly into 4-bit weights can cost some precision) or load them separately.

Key po