Fine-tuning Alpaca 30b 4-bit on consumer hardware

I want to write about fine-tuning Alpaca 30b 4-bit on consumer hardware, but before I can, I need to give a little background. My basic goal was to figure out "what's the most powerful AI I can customize and run on my shiny new 4090?"

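The answer right now is LLaMA 30b. Normally, fine-tuning this model is impossible on consumer hardware because consumer cards ship with so little VRAM (clever NVIDIA), but two newer techniques change that. Quantization stores the model's weights at lower precision, and LoRA, one of a family of parameter-efficient fine-tuning (PEFT) methods, trains a small set of added weights instead of the whole model. Together they dramatically decrease the VRAM requirements.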

I'll also comment that, if you are currently building a consumer desktop for the purpose of training AI models, stop now and build a dual-3090 system (the 3090 can be bridged with NVLink) instead of a single 4090. Two 3090s give you 48 GB of VRAM in total versus 24 GB on a 4090, which enables more scenarios (for example, fine-tuning LLaMA 65b), and the performance of a dual-3090 is close enough to the performance of a 4090.
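To make the VRAM argument concrete, here is a back-of-the-envelope sketch of how much memory the weights alone would need at different bit widths. The parameter counts are the nominal "30b" and "65b" figures (real checkpoints differ slightly), and the function and variable names are made up for illustration. It ignores activations, optimizer state, and everything else that also has to fit, so treat the output as a lower bound rather than a guarantee.

```python
# Rough estimate of VRAM needed just to hold model weights.
# Assumptions: nominal parameter counts, no overhead for activations or optimizer state.

GIB = 1024 ** 3

MODELS = {"LLaMA 30b": 30e9, "LLaMA 65b": 65e9}   # nominal sizes, not exact
CARDS = {"1x 4090": 24, "2x 3090": 48}            # total VRAM in GiB


def weight_gib(n_params, bits_per_weight):
    """GiB needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8 / GIB


def report():
    """One line per (model, bit width) saying which setups could hold the weights."""
    lines = []
    for name, n_params in MODELS.items():
        for bits in (16, 8, 4):
            need = weight_gib(n_params, bits)
            fits = [card for card, vram in CARDS.items() if need < vram]
            lines.append(
                f"{name} @ {bits}-bit: {need:6.1f} GiB, fits: {', '.join(fits) or 'neither'}"
            )
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Under these assumptions, the 4-bit 30b weights come in well under 24 GiB, while the 4-bit 65b weights only fit in the combined 48 GiB, and then only if the model is split across both cards.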

I link some academic papers along the way, and I recommend that you take the time to read them or, failing that, at least ask GPT-4 to explain them to you as if you were 12, an undergrad, or a grad student.

LLaMA is a large language model created by Meta. It is not licensed for commercial use, so you can only use it for personal or academic purposes. (There is a petition asking Meta to re-license the model here.) What makes LLaMA special is that it's designed to deliver high quality at a relatively small size. This makes it an excellent base for working on consumer hardware: if you start with an efficient model, the other techniques will be even more effective.

Next, I'd like to talk about Alpaca. Alpaca is a set of weights that can be applied to LLaMA to tune it to be good at following instructions. Instruction tuning is how you get from a base language model to a chatbot you can interact with, such as ChatGPT. In essence, Alpaca is an effort to train LLaMA to be like ChatGPT.

Alpaca is not just the results (the weights); it also describes the methodology used to create them. In particular, Alpaca used a method called Self-Instruct to generate a dataset that was then used to fine-tune the LLaMA model. I will talk more about instruct datasets in a separate post, because there are a number to choose from.
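For a feel of what an instruct dataset looks like, here is a small sketch of one record and the prompt it gets rendered into for fine-tuning. The template wording follows the one published with Stanford Alpaca as best I recall it, so treat the exact text as an assumption. The example record and the helper names (build_prompt, training_text) are made up for illustration.

```python
# Sketch of an Alpaca-style instruction record and its training text.
# Assumption: template wording is recalled from the Stanford Alpaca release, not copied from it.

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record):
    """Pick the template based on whether the record carries a non-empty input."""
    template = PROMPT_WITH_INPUT if record.get("input") else PROMPT_NO_INPUT
    return template.format(**record)  # unused keys such as "output" are ignored


def training_text(record):
    """Prompt followed by the desired answer: the text the model is fine-tuned on."""
    return build_prompt(record) + record["output"]


# Hypothetical record, not taken from any real dataset.
example = {
    "instruction": "Translate the sentence to French.",
    "input": "The llama is asleep.",
    "output": "Le lama dort.",
}

if __name__ == "__main__":
    print(training_text(example))
```

At inference time you would send only the prompt part and let the model write everything after "### Response:".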

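Finally, I want to talk about LoRA and quantization. Quantization, which comes in 8-bit and 4-bit flavors, stores the model's weights at lower precision. You can think of it as similar to JPEG, but for large language models: you give up a bit of quality in exchange for a significant reduction in memory requirements. LoRA attacks the fine-tuning side: instead of updating every weight in the model, you freeze the original weights and train a small set of additional ones, which cuts the compute and memory needed for training.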
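Both ideas can be shown with toy numbers. The sketch below uses the simplest possible quantization (symmetric round-to-nearest with a single scale), which is an assumption for illustration; real 8-bit and 4-bit schemes are more sophisticated. It also counts the trainable parameters LoRA would add for one hypothetical 4096 x 4096 weight matrix at rank 8. The matrix size, the rank, and all function names are made up for the example.

```python
import random


def quantize_roundtrip(weights, bits):
    """Round each weight to a signed `bits`-bit integer grid, then map back to floats."""
    levels = 2 ** (bits - 1) - 1               # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(original, restored):
    """Average absolute difference between original and round-tripped weights."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


def lora_trainable_params(d_out, d_in, rank):
    """LoRA freezes the d_out x d_in matrix and trains B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]   # toy stand-in for real weights
err8 = mean_abs_error(weights, quantize_roundtrip(weights, 8))
err4 = mean_abs_error(weights, quantize_roundtrip(weights, 4))

full = 4096 * 4096                               # weights in the frozen matrix
lora = lora_trainable_params(4096, 4096, 8)      # weights LoRA actually trains

if __name__ == "__main__":
    print(f"mean abs error, 8-bit: {err8:.6f}")
    print(f"mean abs error, 4-bit: {err4:.6f}")
    print(f"LoRA trains {lora} of {full} weights ({lora / full:.2%})")
```

The 4-bit round trip loses more detail than the 8-bit one, which is the JPEG-style trade-off, and the LoRA count shows why training only the small added matrices is so much cheaper than updating the full one.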

In my next post, the rubber hits the road, and I will discuss the nitty-gritty of making it happen.