When a16z generously sponsored Dolphin, I had some compute budget, and because the original dolphin-13b was a flop, I had some time to go back to the drawing board. When I was ready to train the next iteration, I reconsidered whether to rent or buy the compute for the build. I ultimately decided to buy, because I have the skill and interest, I'm good at finding deals, and owning a cluster would give me the ability to continue executing on future projects beyond Dolphin, not to mention the satisfaction of building the AI end-to-end, like a baker baking bread from scratch. Artisan AI.
What to build?
I am building 4 servers of 8x AMD Instinct MI100 and 4 servers of 8x NVIDIA GeForce RTX 4090. But for now, I'm just going to get 2 servers up and running at a time. Later, after I've built out all of the servers and tested them, I'll get the full 8-server setup running. (That'll require extra electrical work.)
I went with the MI100s because I got a killer deal from Rhino Technology (who have an excellent sales and technical team; I highly recommend them) on some refurbished Gigabyte G482-Z53 servers that each came preinstalled with 4x MI100s. As a scrappy guy building in my garage, I gotta roll with the deals that I can find. And the servers support 8x PCIe Gen4 x16 slots. Exactly what I needed.
This capability is pretty hard to find, by the way. Usually the PCIe slots are bifurcated and don't get all 16 lanes, and for the bandwidth that training needs, I really want each card to get its full 16 lanes. The servers also came preinstalled with dual AMD EPYC 7742 64-core CPUs (wow!) and 256GB RAM, which is plenty to start with. I had a very good start to my cluster with these servers.
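By the way, if you want to double-check that every card really negotiates its 16 lanes, Linux exposes the link width in sysfs. Here's a little sketch of the kind of check I mean (AMD's PCI vendor ID is 0x1002; adjust the filter for other vendors):

```python
# List the negotiated PCIe link width/speed for every AMD GPU in the box.
# Assumes a standard Linux sysfs layout; AMD's PCI vendor ID is 0x1002.
from pathlib import Path

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    vendor = (dev / "vendor").read_text().strip()
    pci_class = (dev / "class").read_text().strip()
    # Class 0x03xxxx = display controllers (GPUs)
    if vendor == "0x1002" and pci_class.startswith("0x03"):
        width = (dev / "current_link_width").read_text().strip()
        speed = (dev / "current_link_speed").read_text().strip()
        print(f"{dev.name}: x{width} @ {speed}")  # want x16 @ 16 GT/s (Gen4)
```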
So I'm starting with the easy ones, the MI100s. They are easy because they fit in the server, unlike the 4090s, which are larger and won't fit in the chassis.
My inspiration:
The care and feeding of servers
Then I had to do some math. I planned to limit each card to 300 watts. Eight cards is 2,400 watts per server for the GPUs alone, plus the dual EPYCs, fans, and power-supply losses on top of that, so I figured I need 240 volts and 25 amps per server.
For cooling, I got a 25,000 BTU air conditioner (240 V, 30 amps) for each pair of servers.
So for my "building" phase, when I only need to power 2 servers at a time, I will have 2 breakers: one 240 V / 50 amp for the pair of servers, and one 240 V / 30 amp for the air conditioner. Later, when all the servers are ready for operation, I will need 4x 50 amp breakers and 4x 30 amp breakers. Given the amps and the length of the run, I needed 8-gauge, 3-conductor wire, which I got from the hardware store, plus a couple of flush-mounted outlets: a NEMA 6-50 for the pair of servers and a NEMA 6-30 for the air conditioner.
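To sanity-check those numbers, here's the back-of-envelope version. The non-GPU draw per server (CPUs, fans, drives, PSU losses) is a round guess, not a measurement:

```python
# Back-of-envelope power and cooling math for one pair of servers.
# The non-GPU draw per server is an assumption, not a measured figure.

GPUS_PER_SERVER = 8
GPU_POWER_LIMIT_W = 300      # planned per-card power cap
OTHER_DRAW_W = 1_000         # assumed: dual EPYC 7742s, fans, drives, PSU losses
VOLTS = 240
BTU_PER_WATT = 3.412         # 1 W of heat is about 3.412 BTU/hr

server_w = GPUS_PER_SERVER * GPU_POWER_LIMIT_W + OTHER_DRAW_W
server_amps = server_w / VOLTS
pair_btu_hr = 2 * server_w * BTU_PER_WATT

print(f"Per server: {server_w} W, {server_amps:.1f} A at {VOLTS} V")
print(f"Per pair:   {2 * server_w} W, {2 * server_amps:.1f} A (vs. the 50 amp breaker)")
print(f"Heat load:  {pair_btu_hr:,.0f} BTU/hr (vs. the 25,000 BTU air conditioner)")
```

With those assumptions, a pair of servers comes in under the 50 amp breaker and a bit under the air conditioner's 25,000 BTU.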
I found these PDUs that can each handle 50 amps and power 2 servers that each have 3 power supplies. Perfect.
As I have never done this before, it took a few weeks of researching and ordering things to get all of this put together before I was able to power up the first server.
Getting Ubuntu Server installed
Pretty straightforward: just put Ubuntu on a thumb drive and boot from that. Of course, I needed to get a monitor and a keyboard working too. And one of my servers turned out to have a nonfunctional VGA output, so I switched to another server for now.
Getting it on the network
So I didn't wanna run a wire ALL the way to my living room. I bought these TP-Link AC600 wireless adapters, but of course they didn't just work when I plugged them in. I had to first get the server on the network so I could install the driver. So I hooked it up with ethernet to my workstation and used Windows wifi sharing to bridge the wifi to the ethernet, which got the server on the internet. After that, I was able to download updates. I installed the links2 web browser because Ubuntu Server has no GUI. Then I was able to install the driver for the TP-Link AC600 and get it connected. Next I gave its MAC address a reserved IP address in my router's DHCP so I could forward port 22 to it, and set up dynamic DNS so I can SSH to it from the outside. (Of course, I added my public ed25519 key and disabled password login.) This server will act as my bastion; I'll SSH from it to the other servers in my cluster.
Installing drivers, ROCm, and HIP
I was told I should use the Docker image, but I couldn't get that working. Instead, I installed the drivers from the repository.
The trick is this: don't try to install multiple versions of ROCm; just install version 5.7, and install the nightly PyTorch that works with 5.7. That's the combination that made everything work for me. Actually, AMD's install experience is better than NVIDIA's. You also gotta install NVIDIA's CUDA before you install HIP.
https://docs.amd.com/projects/HIP/en/docs-5.3.0/how_to_guides/install.html
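Once ROCm 5.7 and the matching PyTorch nightly are in, it's worth running a quick sanity check before anything else. This is just a sketch; on ROCm builds, PyTorch exposes the AMD cards through the usual torch.cuda API:

```python
# Quick sanity check that PyTorch can see the MI100s through ROCm/HIP.
import torch

print("HIP version:", torch.version.hip)           # None on a CUDA-only build
print("GPUs visible:", torch.cuda.device_count())  # should match the cards in the box

for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))

# Tiny matmul on the first card to confirm kernels actually run.
x = torch.randn(1024, 1024, device="cuda:0")
y = x @ x
torch.cuda.synchronize()
print("Matmul OK, mean:", y.mean().item())
```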
Inference with Oobabooga
Set up Oobabooga as normal, using requirements-rocm.txt.
Then duplicate this drive for the other servers.
Future Plans
- I am going to make any changes required to get Axolotl running on these servers.
- I am going to put a Lustre cluster on these servers: 7x 2 TB SSDs in each server, connected over 100GbE, which should be fast enough to handle the nodes saving and loading checkpoints. This is very important for multinode training. (Rough math in the sketch after this list.)
- I am going to train more versions of dolphin and other models using these servers.
- I am eventually going to train my own base model.
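Here's the rough math behind "fast enough" for the checkpoint plan. The model size, bytes per parameter, and usable link rate are assumptions for illustration, not measurements:

```python
# Back-of-envelope: can 100GbE + Lustre keep up with multinode checkpoints?
# All of these numbers are assumptions for illustration, not measurements.

PARAMS = 13e9                  # e.g. a 13B-parameter model
BYTES_PER_PARAM = 14           # assumed: bf16 weights (2) + fp32 master weights (4)
                               # + Adam moments (8)
LINK_GB_PER_S = 12.5           # 100GbE line rate, ignoring protocol overhead
SERVERS = 8
SSDS_PER_SERVER = 7
SSD_TB = 2

checkpoint_gb = PARAMS * BYTES_PER_PARAM / 1e9
write_seconds = checkpoint_gb / LINK_GB_PER_S
raw_capacity_tb = SERVERS * SSDS_PER_SERVER * SSD_TB

print(f"Full training checkpoint: ~{checkpoint_gb:,.0f} GB")
print(f"Time to move it over one 100GbE link: ~{write_seconds:.0f} s")
print(f"Raw SSD pool: {raw_capacity_tb} TB "
      f"(~{raw_capacity_tb * 1000 / checkpoint_gb:.0f} checkpoints)")
```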