LLMs on a Budget
Note: The book is currently a work in progress, with chapters being released one by one. The first chapter is already available, and the second is scheduled for release at the end of November. By purchasing the book now, you’ll get a discounted price along with immediate access to all published chapters and their accompanying notebooks, plus every future chapter and notebook as it is released until the book is complete. Once you’ve bought the book, you will be notified whenever I publish a new chapter.
Use the code "O4PW2F0" for a 30% discount.
Key Features of LLMs on a Budget
- Learn everything you need to know, step by step, about fine-tuning, quantizing, running, and serving LLMs on consumer hardware.
- Get exclusive access to detailed, fully commented notebooks that walk you through each process.
- Stay ahead of the curve with cutting-edge methods and continuous updates. This book will be regularly updated with the latest techniques through the end of 2025.
About the Author
I’ve authored over 150 articles and developed more than 100 AI notebooks for my newsletter, The Kaitchup - AI on a Budget. With a Ph.D. in natural language processing and extensive industry experience working with large neural models and LLMs, I focus on making AI accessible and affordable. To explore my work further, consider subscribing to The Kaitchup.
Book Description
Large language models (LLMs) are notoriously difficult to fine-tune and run on consumer hardware using standard techniques, often requiring professional-grade GPUs with substantial memory.
This book will show you advanced techniques for fine-tuning, quantizing, running, and serving LLMs on consumer hardware. In clear and accessible language, and with code, it explains parameter-efficient fine-tuning methods such as LoRA and its many variants (QLoRA, DoRA, VeRA, etc.), as well as effective quantization strategies.
You’ll learn how to select the right LLMs and configure hyperparameters to maximize your hardware's potential. The book includes practical examples in detailed Jupyter notebooks, demonstrating the use of models like Llama 3.1 8B (for 16 GB GPUs) and TinyLlama (for 6 GB GPUs).
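As a small preview of that style, here is a minimal sketch of one of the memory-saving steps the book covers: loading an 8B model in 4-bit precision with bitsandbytes so that it fits on a consumer GPU. The model ID and configuration values below are illustrative assumptions, not the book's exact settings.

```python
# Minimal sketch: load an 8B model in 4-bit with bitsandbytes (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-8B"  # assumed checkpoint; swap in any causal LM

# NF4 quantization with bfloat16 compute: each weight is stored in ~4 bits instead of 16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the layers on the available GPU(s)
)

print(model.get_memory_footprint() / 1e9, "GB")
```

In 4-bit, the weights of an 8B model occupy roughly 5-6 GB, which leaves headroom on a 16 GB card for activations and, with QLoRA, for fine-tuning.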
By the end of this book, you will have mastered the art of running LLMs on a budget and will be fully equipped to adapt your skills to the latest LLMs as they emerge in the coming years.
What you will learn
- Fine-tune LLMs using parameter-efficient techniques like LoRA, QLoRA, DoRA, VeRA, and others, with tools such as Hugging Face libraries, Unsloth, and Axolotl (a minimal sketch follows this list).
- Prepare, format, and synthesize datasets for fine-tuning LLMs.
- Quantize LLMs to significantly reduce memory consumption, enabling deployment on low-end hardware.
- Efficiently run and serve LLMs using optimized inference frameworks like vLLM, NVIDIA’s TensorRT-LLM, and Hugging Face's TGI.
- Master the evaluation of LLMs, including how to assess the credibility and value of published benchmark scores, an essential skill for choosing the right LLMs for your applications.
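As a minimal example of the first point above, here is a sketch of LoRA fine-tuning with Hugging Face Transformers, TRL, and PEFT. The model ID, dataset, and hyperparameters are illustrative assumptions (the book's notebooks discuss the real choices), and exact keyword arguments may differ slightly between TRL versions.

```python
# Minimal sketch: LoRA fine-tuning with Transformers, TRL, and PEFT (illustrative settings).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Meta-Llama-3.1-8B"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA: train small low-rank matrices injected into the attention projections,
# while the billions of base weights stay frozen.
peft_config = LoraConfig(
    r=16,                  # adapter rank
    lora_alpha=16,         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Any instruction dataset with a "text" column works here; this one is just an example.
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,  # only the adapter weights are trained
    args=SFTConfig(
        output_dir="./llama3.1-lora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=1e-4,
    ),
)
trainer.train()
trainer.save_model("./llama3.1-lora")  # saves the LoRA adapter, not a full copy of the model
```

Because only the low-rank adapter matrices receive gradients, the optimizer states remain tiny compared with full fine-tuning; combining this with 4-bit quantization of the base model is the idea behind QLoRA, covered later in the book.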
Who this book is for
This book is designed to be accessible to a wide range of learners, regardless of their prior experience in machine learning. It is easy to follow for beginners in AI and LLMs, while also providing valuable insights, tips, and tricks for more experienced practitioners.
No advanced mathematical background is required; all equations are explained in plain English. For those interested in the mathematical foundations, relevant scientific papers are provided for further reading.
A basic understanding of Python is necessary to follow the code examples. While familiarity with Hugging Face libraries (like transformers, TRL, and PEFT) is helpful, it is not required. You won’t need prior knowledge of deep learning frameworks such as PyTorch or TensorFlow to benefit from this book.
Outline
Each chapter is more than 50 pages long.
- Chapter 1: Parameter-Efficient Fine-Tuning for Large Language Models [released]
  - Understanding the Cost of Full Fine-tuning
  - LoRA Adapters: Basics, Cost, Hyperparameters, and Performance
  - How to Code LoRA Fine-tuning, Step by Step:
    - With Hugging Face Transformers, TRL, and PEFT
    - With Unsloth
    - With Axolotl (if you don’t want to code)
  - Load and Merge your LoRA Adapters
  - Using Adapters for Inference with Transformers and vLLM
- Chapter 2: Prepare Your Training Dataset [Release: November 2024]
  - The Common Characteristics of a Good Training Dataset
  - Formatting a Dataset for Instruct Fine-tuning
  - Generate Your Own Synthetic Dataset
    - With GPT-4o mini
    - With an open LLM
- Chapter 3: Quantization for LLMs [Release: December 2024]
  - The Basics of Quantization
  - Quantization-Aware Training vs. Post-Training Quantization
  - Popular Quantization Algorithms
    - GPTQ
    - AWQ
    - GGUF: K-Quants and Imatrix
    - AutoRound
    - Bitsandbytes
    - AQLM
    - HQQ
  - How to Choose a Quantization Algorithm
  - Quantization to a Lower Precision vs. Using a Smaller Model
- Chapter 4: Quantization in Fine-tuning [Release: January 2025]
  - The Basics of QLoRA
  - How to Code QLoRA Fine-tuning, Step by Step:
    - With Hugging Face Transformers, TRL, and PEFT
    - With Unsloth
    - With Axolotl
  - Quantization and Paging of the Optimizer States
  - Benchmarking QLoRA with Different Quantization Techniques
  - Advanced Techniques
    - GaLore and Q-GaLore
    - End-to-end FP8 Fine-tuning
    - rsLoRA
    - DoRA
    - VeRA
    - X-LoRA
- Chapter 5: Running and Serving LLMs [in preparation]
- Chapter 6: Evaluating LLMs [in preparation]