LLM Training — AMD Hardware and Software Stack

Ranko Mosic
3 min read · Oct 2, 2023

The importance of the AMD stack is increasing due to shortages of NVIDIA H100 GPUs and the closed-source, proprietary nature of the CUDA software stack. It means you can develop a specialized new LLM MoE training framework end-to-end, from the Python frontend down to the binaries, without force-fitting it into CUDA, cuDNN, and other closed-source NVIDIA libraries¹.

AMD’s data center GPU lineup is headed by the AMD Instinct series, which is designed specifically for high-performance computing (HPC) and artificial intelligence (AI) workloads.

The AMD Instinct series GPUs are built on AMD’s CDNA (Compute DNA) architecture, which is optimized for data center applications. CDNA-based GPUs are programmed through AMD’s open-source ROCm software platform for GPU computing.

The AMD Instinct MI250 GPU has 128 GB of HBM2e² memory, as opposed to the NVIDIA H100’s 80 GB of HBM3. As we saw in the previous CUDA example, the amount of local GPU memory is one of the key constraints on GPU performance, since data and instructions must be copied into this local memory from the CPU or external storage before the GPU can process them. The MI250’s significantly larger 128 GB allotment helps alleviate this memory bottleneck and could allow it to hold bigger models and larger batches than the H100.
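As a rough back-of-the-envelope sketch of why this matters (the 60B-parameter model and fp16 storage below are illustrative assumptions, not a benchmark of either card):

```cpp
#include <cstdio>

int main() {
    // Illustrative assumption: a 60B-parameter dense model stored in fp16.
    // Training needs considerably more (gradients, optimizer state, activations).
    const double params = 60e9;          // parameter count (assumed)
    const double bytes_per_param = 2.0;  // fp16
    const double gb = 1e9;

    const double weights_gb = params * bytes_per_param / gb;  // 120 GB
    std::printf("fp16 weights alone: %.0f GB\n", weights_gb);
    std::printf("fits on MI250 (128 GB)? %s\n", weights_gb <= 128 ? "yes" : "no");
    std::printf("fits on H100  (80 GB)?  %s\n", weights_gb <= 80 ? "yes" : "no");
    return 0;
}
```

Even before gradients and optimizer state, the weights of such a model would fit in the MI250’s memory but not the H100’s.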

The AMD Matrix Core is a specialized matrix multiplication unit, AMD’s equivalent of Google’s TPU matrix units or NVIDIA’s Tensor Cores.
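These units are normally reached through ROCm libraries rather than programmed by hand. Below is a minimal, hypothetical sketch of a single-precision GEMM through rocBLAS (matrix sizes are illustrative; error checking omitted); LLM training would more typically take the mixed-precision rocblas_gemm_ex path, but the structure is the same:

```cpp
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <vector>

// Minimal rocBLAS SGEMM sketch: C = alpha * A * B + beta * C.
// GEMMs like this are the workloads Matrix Cores accelerate.
int main() {
    const int n = 512;  // illustrative square matrix size
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 1.0f), hC(n * n, 0.0f);

    // Allocate device buffers and copy the inputs over.
    float *dA, *dB, *dC;
    hipMalloc(&dA, n * n * sizeof(float));
    hipMalloc(&dB, n * n * sizeof(float));
    hipMalloc(&dC, n * n * sizeof(float));
    hipMemcpy(dA, hA.data(), n * n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), n * n * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    hipMemcpy(hC.data(), dC, n * n * sizeof(float), hipMemcpyDeviceToHost);
    rocblas_destroy_handle(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
```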

AMD ROCm (Radeon Open Compute) is an open-source platform, similar to NVIDIA’s CUDA, for general-purpose computing on AMD GPUs.

- ROCm provides open-source kernels, drivers and libraries for programming AMD GPUs from languages such as C++ and Python.

- ROCm supports Linux operating systems and uses the open-source, LLVM-based hipcc compiler (the successor to the earlier HCC compiler) to generate code for AMD GPUs.

- It aims to abstract hardware differences and provide a standard programming model across AMD GPUs, as the device-query sketch after this list shows.

- AMD shares the ROCm source code and collaborates openly with developers to improve it.
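As a taste of that common programming model, here is a minimal sketch (assuming a machine with ROCm installed) that enumerates the AMD GPUs visible to the HIP runtime:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Enumerate the GPUs the ROCm/HIP runtime can see and print
// each device's name and total memory.
int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        std::printf("no HIP devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        std::printf("GPU %d: %s, %.1f GiB\n", i, prop.name,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```

Compiled with hipcc, the same source also builds against CUDA on NVIDIA hardware, which previews the portability HIP provides.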

HIP

HIP is a C++ runtime API and kernel language that allows developers to create portable applications for AMD and NVIDIA GPUs from a single source code.


Memory allocation on AMD GPUs and the methods for copying data are quite similar to those used in NVIDIA CUDA, as the sketch below shows.
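A minimal sketch of that correspondence, using a hypothetical vector-add kernel (build with hipcc; error checking omitted for brevity). The HIP calls mirror cudaMalloc, cudaMemcpy, and cudaFree almost name-for-name:

```cpp
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

// Element-wise vector add: each GPU thread handles one element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    // Allocate device memory (cf. cudaMalloc) and copy inputs over
    // (cf. cudaMemcpy with cudaMemcpyHostToDevice).
    float *da, *db, *dc;
    hipMalloc(&da, n * sizeof(float));
    hipMalloc(&db, n * sizeof(float));
    hipMalloc(&dc, n * sizeof(float));
    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Launch the kernel with CUDA-style triple-chevron syntax.
    const int threads = 256;
    vecAdd<<<(n + threads - 1) / threads, threads>>>(da, db, dc, n);

    // Copy the result back and release device memory (cf. cudaFree).
    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("hc[0] = %.1f\n", hc[0]);  // expect 3.0
    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```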

¹ AMD ROCm is also supported by PyTorch (i.e., Meta). Lamini and Hugging Face services use AMD Instinct GPUs.

² With high-bandwidth memory (HBM), the memory chips are stacked and placed in the same package as the processing chip. The extensive bus width provides high bandwidth, while placing the memory chips in the same package as the processor reduces latency and power consumption. However, there are practical limits to the number of memory dies that can be stacked, due to manufacturing constraints, thermal considerations, and signal integrity issues. These limits cap the maximum size of HBM memory.

