🚀 NVIDIA just cranked up LLM speed by 53x! 🤯
Imagine slashing your inference budget by ~98% while keeping accuracy on par with the best models. 🎯💰
📌 Here’s the scoop:
It’s called PostNAS (Post Neural Architecture Search), a game-changing way to upgrade pre-trained models instead of training new ones from scratch.
🧊 Freeze the Knowledge: Start from a strong pre-trained model (think Qwen2.5) and freeze its MLP layers so the learned knowledge stays intact.
🔄 Surgical Replacement: Swap out most of the slow O(n²) full-attention layers for JetBlock, an efficient linear-attention design. 🚀
⚡️ Hybrid Power: A few full-attention layers remain in key spots to maintain deep reasoning abilities.
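To see why the swap in step 2 pays off, here's a toy numpy sketch (not NVIDIA's actual JetBlock, which is a more sophisticated gated linear-attention block): softmax attention materializes an n×n score matrix, while linear attention applies a feature map φ and reorders the matmuls so no n×n matrix ever exists.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the (n, n) score matrix is the O(n^2) bottleneck.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    # Reordered as phi(Q) @ (phi(K).T @ V): cost O(n * d^2), no (n, n) matrix.
    qf, kf = phi(Q), phi(K)
    kv = kf.T @ V                                # (d, d) summary, size independent of n
    z = qf @ kf.sum(axis=0, keepdims=True).T     # (n, 1) normalizer
    return (qf @ kv) / z

n, d = 128, 16
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)  # → (128, 16) (128, 16)
```

Both paths produce an (n, d) output, but the linear path's per-token state is a fixed (d, d) matrix, which is also why it barely needs a KV cache.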
✨ Meet Jet-Nemotron:
- 2,885 tokens/sec ⚡️
- 47x smaller KV cache (just 154 MB)
- Top-notch accuracy at warp speed!
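Why does the KV cache shrink so dramatically? Only the remaining full-attention layers store per-token K/V tensors; linear-attention layers carry a fixed-size state. A back-of-envelope sketch (all layer counts and dims below are illustrative assumptions, not Jet-Nemotron's real config):

```python
def kv_cache_mb(full_attn_layers, kv_heads, head_dim, seq_len, bytes_per=2):
    # 2x for K and V; only full softmax-attention layers cache per-token tensors.
    return 2 * full_attn_layers * kv_heads * head_dim * seq_len * bytes_per / 1e6

# Hypothetical configs: a dense model with 36 full-attention layers
# vs. a hybrid keeping just 2, at a 64K context in fp16.
dense  = kv_cache_mb(full_attn_layers=36, kv_heads=8, head_dim=128, seq_len=65536)
hybrid = kv_cache_mb(full_attn_layers=2,  kv_heads=8, head_dim=128, seq_len=65536)
print(f"dense: {dense:.0f} MB  hybrid: {hybrid:.0f} MB  ratio: {dense/hybrid:.0f}x")
```

The cache size scales linearly with the number of full-attention layers kept, so cutting 36 layers down to 2 cuts the cache 18x in this toy setup; NVIDIA's reported 47x comes from their specific architecture choices.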
🔑 Why it matters:
For businesses: a 53x speedup translates to roughly 98% lower cost for large-scale deployment. AI project ROI is about to get a serious makeover!
For engineers: SOTA-level performance is now attainable even on low-memory devices.
For researchers: Instead of spending millions on pre-training, you can whip up new efficient models with architectural tweaks. 💡🔍