It's been a while, and I've been digging into machine learning.
I watched Andrej Karpathy's excellent video on building a GPT (the architecture behind models like GPT-3) from scratch, based on the paper "Attention Is All You Need".
I implemented it from scratch while following along, then refactored it and added a few features to make it better.
My code is located here:
What does it do?
- Generates fake Shakespearean text (it can be trained on any other text by swapping the training data). A sample:
FRIAR THUM:
Northum, my father: if detering
Come with my replies, tribunes. Come, title,
It by please now.
TYBALT:
Why, the should I do you gow not in this state,
For esignation to fear any oils?
My hand must we this likely demest for unvy's houses,
Wars 'govern all our erested vains of restre:
Let's tender out and content thousand for me:
That I will wish through the that I meet,
Unless the kingly country, perfection
And those grim our stone. Here any mist noble gague.
CLARENCE:
A sail what
- I added an Anaconda environment so the setup is reproducible:
conda env create
conda activate gpt
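For reference, a minimal environment.yml could look like the sketch below. The exact package list is an assumption for illustration, not necessarily the repo's actual file (the environment name "gpt" matches the activate command above):

name: gpt
channels:
  - defaults
dependencies:
  - python=3.10
  - pip
  - pip:
    - torch-directml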
- It uses DirectML, so it works on Windows or WSL with any video card (tested on an AMD Radeon). It can easily be adapted to run on Linux with CUDA or ROCm.
import torch
import torch_directml

dml = torch_directml.device()  # DirectML device handle for the GPU
...
data = torch.tensor(encode(text), dtype=torch.long, device=dml)
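Adapting it to Linux only means changing how the device is picked. A minimal sketch of a portable selector, assuming nothing beyond standard PyTorch plus the optional torch_directml package:

import torch

def pick_device():
    # ROCm builds of PyTorch expose the CUDA API, so this covers NVIDIA and AMD on Linux.
    if torch.cuda.is_available():
        return torch.device('cuda')
    # Fall back to DirectML on Windows/WSL if the package is installed.
    try:
        import torch_directml
        return torch_directml.device()
    except ImportError:
        return torch.device('cpu')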
- It uses config.ini for the hyperparameters, so you can change them without touching the code:
import configparser

# Read the hyperparameters from config.ini and stash them on the config object.
config = configparser.ConfigParser()
config.read('config.ini')
default = config['DEFAULT']
config.batch_size = int(default['batch_size'])
config.block_size = int(default['block_size'])
config.max_iters = int(default['max_iters'])
config.eval_interval = int(default['eval_interval'])
config.learning_rate = float(default['learning_rate'])
config.eval_iters = int(default['eval_iters'])
config.n_embd = int(default['n_embd'])
config.n_head = int(default['n_head'])
config.n_layer = int(default['n_layer'])
config.dropout = float(default['dropout'])

# The vocabulary is derived from the training text itself.
with open('input.txt', 'r', encoding='utf8') as f:
    config.chars = sorted(set(f.read()))
config.vocab_size = len(config.chars)
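The matching config.ini then needs only a [DEFAULT] section. The values below are the ones Karpathy ends up with in the video, shown as an illustration rather than as the repo's actual settings:

[DEFAULT]
batch_size = 64
block_size = 256
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2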
- Split the code into "train.py" and "generate.py", and abstracted the model into its own module. "train.py" saves the model to a file when training finishes, and "generate.py" loads it back:
# train.py: persist the trained weights
torch.save(blm.state_dict(), 'model.pt')

# generate.py: rebuild the model and load the saved weights
blm = BigramLanguageModel().to(dml)
blm.load_state_dict(torch.load('model.pt', map_location=dml))
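From there, generation works the same way as in the video: start from a single-token context and sample autoregressively. A minimal sketch, assuming the model exposes Karpathy's generate(idx, max_new_tokens) method and that decode maps token ids back to characters:

import torch

context = torch.zeros((1, 1), dtype=torch.long, device=dml)  # single start token (id 0, the newline character)
out = blm.generate(context, max_new_tokens=500)[0].tolist()
print(decode(out))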
Implementing this from scratch, based only on watching the video and without copy-pasting, is an excellent way to get familiar with the concepts behind training a large language model, with working around your hardware's limitations, and with hyperparameter tuning. I highly recommend the exercise.