
Practical AI with AMD Instinct MI300X

Eric Hartford - I make AI models like Dolphin and Samantha. https://ko-fi.com/erichartford | BTC 3ENBV6zdwyqieAXzZP2i3EjeZtVwEmAuo4 | ETH 0xcac74542A7fF51E2fb03229A5d9D0717cB6d70C9

When the MI300X launched in December 2023, there was a lot of optimism. We were desperate for an alternative to Nvidia's expensive and scarce H100s, especially one that provided 192 GB of VRAM per GPU compared to the H100's 80 GB. And the MI300X was being rented at reasonable prices compared to the H100.

Once I got my hands on a node, it was a nightmare to work with. The prevailing attitude was "if you're working with non-Nvidia hardware, then you deserve your suffering," and few people were interested in fixing the related issues. Missing packages, broken dependencies, poor documentation, and the dreaded "compile it yourself" instructions (xformers, flash attention, bitsandbytes) that make you question your life choices at 2 AM. I've been there, trust me. Getting PyTorch to recognize the hardware felt like teaching a cat to bark. Driver issues were a given, not an exception. Even simple operations could turn into multi-day archaeological expeditions through GitHub issues, obscure forum posts, and increasingly desperate Google searches.

Even once the software was cobbled together into a working pile of hot mess, it would sometimes crash hours into a training job due to firmware bugs. But I kept at it. Why? Because when you're an independent researcher training models without VC funding or corporate backing, you work with what's available and what you can afford.


Fast forward 18 months - I can honestly say that the MI300X is a joy to work with. Not perfect - I'll be honest about the rough edges - but it works well enough that I'm training and publishing large models on Hugging Face with it. And that's worth documenting, especially since few people have successfully navigated these waters.

My MI300X experience and this guide wouldn't be possible without Hot Aisle, who sponsors my work with access to their excellent MI300X compute infrastructure. They offer on-demand rental of MI300X GPUs at reasonable rates, which has been crucial for independent researchers like myself. If you're looking to try the MI300X without the massive capital investment, their platform is worth checking out. They offer flexible configurations from single-GPU VMs (1x 192GB MI300X with 224GB RAM) to full bare metal servers (8x 192GB MI300X with 2TB RAM), all at $1.99 per GPU hour. They're particularly supportive of developers migrating from CUDA or contributing to open source - mention your use case when signing up and they'll provide free credits to help you get started. Tell them Eric sent you.

Without further ado, I present my full guide to working with AI on the MI300X, with a glorious 1.5 TB of VRAM (8 x 192 GB) per node.

Install ROCm

wget https://repo.radeon.com/amdgpu-install/6.4.2/ubuntu/noble/amdgpu-install_6.4.60402-1_all.deb
sudo apt install ./amdgpu-install_6.4.60402-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME
sudo apt install rocm

nano ~/.bashrc

# Add environment variables to ~/.bashrc
export LLVM_PATH=/opt/rocm/llvm
export ROCM_PATH=/opt/rocm
export HIP_PLATFORM=amd
export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
export PATH=$ROCM_PATH/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
export PYTORCH_ROCM_ARCH="gfx942"
export TRITON_CACHE_DIR=$HOME/.triton/cache
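After logging out and back in (so the render/video group change takes effect), a quick sanity check confirms the stack can see the GPUs. This is a sketch assuming a standard /opt/rocm install; the exact rocminfo output varies by driver version:

```shell
# reload the environment variables added above
source ~/.bashrc

# confirm the ROCm stack sees the accelerators
rocminfo | grep -i "gfx942"   # should list one agent per MI300X
rocm-smi                      # per-GPU VRAM, temperature, and utilization
hipcc --version               # confirms the HIP compiler is on PATH
```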

Supervised Fine Tuning (SFT) a Language Model with Axolotl on MI300X

Create environment, install Axolotl

First off - Axolotl (the software team) is certainly prone to the "wtf are you doing trying to use AMD, good luck with that" attitude. The Nvidia presumption is baked in, hard. Also, Axolotl's setup has a bad habit of clobbering all my carefully hand-installed dependencies (their typical response is "just use the Docker image," which is really not helpful). So first we will create the environment and install Axolotl (which will install all the wrong dependencies), and then later we will override what it installed.

git clone https://github.com/axolotl-ai-cloud/axolotl.git
cd axolotl

# create environment
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv --python 3.11 --seed
source .venv/bin/activate

# install axolotl
pip install -U packaging==23.2 setuptools==75.8.0 wheel ninja
pip install --no-build-isolation .[flash-attn,deepspeed]
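Once the environment is sorted, an Axolotl run is driven by a YAML config. For orientation, here is a minimal hypothetical SFT sketch - the model name, dataset path, and batch sizes are placeholders to adapt, not recommendations:

```yaml
# sft-mi300x.yml - hypothetical minimal SFT config
base_model: Qwen/Qwen2.5-7B    # placeholder model
datasets:
  - path: my_dataset.jsonl     # placeholder dataset
    type: chat_template
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 1
learning_rate: 2e-5
optimizer: adamw_torch
bf16: true
flash_attention: true
output_dir: ./outputs
```

Launched with `axolotl train sft-mi300x.yml` once the rest of this guide's dependencies are in place.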

Install pytorch

I haven’t had problems with this, though sometimes specific tools are picky about versions.

uv pip uninstall torch torchvision torchaudio
uv pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4
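A quick check that the nightly wheel actually targets ROCm and sees the hardware - on ROCm builds, `torch.version.hip` is populated and the `torch.cuda.*` calls are routed to HIP:

```shell
python -c "import torch; print(torch.__version__)"
python -c "import torch; print(torch.version.hip)"            # a HIP version string on ROCm builds, None on CUDA builds
python -c "import torch; print(torch.cuda.is_available())"    # True if the MI300X is visible
python -c "import torch; print(torch.cuda.get_device_name(0))"
```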

Install flash attention

In my experience you need to install from source. However, that’s relatively straightforward.

uv pip install einops ninja packaging
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
python setup.py install

Install xformers

In the past this didn’t work at all. But it seems to work now!

uv pip install xformers --index-url https://download.pytorch.org/whl/rocm6.4

Install bitsandbytes

The nightmare. This one was designed for CUDA from the ground up. AMD maintains their own fork that's older, but works. You need to compile it in your environment.

# this procedure is supposed to work - and it "installs" but fails later when you try to run something that uses it
# git clone -b multi-backend-refactor https://github.com/bitsandbytes-foundation/bitsandbytes.git
# cd bitsandbytes
# sudo apt-get install -y build-essential cmake 
# cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH="gfx942" -S .
# make
# uv pip install -e .

# this procedure actually works, though it's an older version of bitsandbytes.
git clone --recurse https://github.com/ROCm/bitsandbytes
cd bitsandbytes
git checkout rocm_enabled
cmake . -DCOMPUTE_BACKEND=hip -DHIP_TARGETS=gfx942
make -j$(nproc)
pip install .
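Since this is the piece that "installs" fine and then fails at runtime, it's worth a smoke test before kicking off a long job. A sketch - the tiny 8-bit optimizer step below needs a visible GPU:

```shell
# importing bitsandbytes triggers its backend/library detection
python -c "import bitsandbytes as bnb; print(bnb.__version__)"

# tiny 8-bit optimizer smoke test on the GPU
python -c "
import torch
import bitsandbytes as bnb
p = torch.nn.Parameter(torch.randn(64, 64, device='cuda'))
opt = bnb.optim.Adam8bit([p])
p.sum().backward()
opt.step()
print('bitsandbytes 8-bit optimizer OK')
"
```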

Install transformers and deepspeed

No problems.

uv pip install -U transformers accelerate datasets huggingface-hub hf-transfer deepspeed
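Axolotl ships example DeepSpeed configs in its deepspeed_configs/ directory, which you can point to from the training YAML. For reference, a minimal ZeRO stage 2 config looks roughly like this - the "auto" values are the usual placeholders that the launcher fills in, not tuned settings:

```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```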

Install vllm

uv pip install vllm --torch-backend=auto
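Once installed, serving works the same as on CUDA; with eight GPUs in a node, tensor parallelism across all of them is the obvious configuration. A hypothetical invocation (the model name is a placeholder):

```shell
# serve a model across all 8 MI300X GPUs with an OpenAI-compatible API
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 8 \
    --port 8000

# then, from another shell, confirm it's up
curl http://localhost:8000/v1/models
```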