Hardware and Software Stacks for LLM Training and Inference

Ranko Mosic
7 min read · Sep 18, 2023

A few short years ago we (and Jeff Dean of Google, a year later) announced the birth of the new ML stack⁵. Let’s see what is out there now and where things are going.

Part 2 AMD Hardware and Software Stack

Part 3 Google Hardware and Software Stack

Part 4 Open Source LLM Software Stack — OpenAI Triton

NVIDIA Hardware / Software Stack

Nvidia Roadmap

CUDA

CUDA has been instrumental in establishing NVIDIA as the leader in the high-end compute market. CUDA is a focal point because higher-level frameworks like TensorFlow or PyTorch eventually generate CUDA code to execute on NVIDIA GPUs. CUDA is one of the relatively rare examples of software that is both proprietary¹ and free.

The workhorse of the coming AGI apocalypse is matrix multiplication code, executed in massively parallel fashion. Hardware-based matrix multiplication using lower-precision data types like bfloat16 is a defining feature of new deep learning hardware. In NVIDIA lingo this hardware unit is called the Tensor Core, and it makes efficient matrix multiplication possible.

The NVIDIA Tensor Core basically performs only one kind of operation: matrix-multiply-and-accumulate on 4×4 matrices.
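
As a rough mental model (this scalar reference is my own illustration, not NVIDIA code; the real hardware multiplies FP16 entries and accumulates across a whole tile at once), the operation a Tensor Core performs is:

// Scalar reference for the Tensor Core primitive D = A * B + C on 4x4 tiles.
// Plain float is used here for readability; on the hardware the multiplications
// are half precision and the accumulation is FP32 (or FP16).
void mma4x4_reference(const float A[4][4], const float B[4][4],
                      const float C[4][4], float D[4][4]) {
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j) {
      float acc = C[i][j];              // start from the accumulator
      for (int k = 0; k < 4; ++k)
        acc += A[i][k] * B[k][j];       // multiply-and-accumulate
      D[i][j] = acc;
    }
}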

Through the warp-level primitive wmma::mma_sync, a CUDA warp can perform a 16 x 16 x 16 half-precision matrix multiplication in a small number of clock cycles³.

GH100 streaming multiprocessor

CUDA WMMA provides a direct way to calculate a 16x16 matrix-multiply-and-accumulate (C = AB) using a single CUDA warp (32 threads).

// Calculate C = AB with NVIDIA Tensor Cores
// Kernel executed by 1 warp (32 threads)
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// 16x16x16 is the standard half-precision WMMA tile size
#define WMMA_M 16
#define WMMA_N 16
#define WMMA_K 16
// Leading dimensions of the (column-major) input matrices
#define MATRIX_M 16
#define MATRIX_K 16

__global__ void tensorOp(float *D, half *A, half *B) {
  // 1. Declare the fragments
  wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> Amat;
  wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> Bmat;
  wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> Cmat;
  // 2. Initialize the output to zero
  wmma::fill_fragment(Cmat, 0.0f);
  // 3. Load the inputs into the fragments
  wmma::load_matrix_sync(Amat, A, MATRIX_M);
  wmma::load_matrix_sync(Bmat, B, MATRIX_K);
  // 4. Perform the matrix multiplication
  wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
  // 5. Store the result from the fragment to global memory
  wmma::store_matrix_sync(D, Cmat, MATRIX_M, wmma::mem_col_major);
}

The nvcc compiler translates² the single mma_sync primitive into sets of HMMA instructions with the same target registers (R8, R4), which are all shared between threads. For one thread group, all HMMA instruction sets contribute to the same positions in matrix C, and each set multiplies and accumulates elements from different parts of the A and B matrices⁷.

 HMMA.1688.F32 R8, R16, R27, R8 
HMMA.1688.F32 R4, R16, R31, R4
HMMA.1688.F32 R8, R28, R30, R8
HMMA.1688.F32 R4, R28, R32, R4

An important point is that the two-dimensional tensors are provided as 1-D arrays. For this reason, we need to declare whether each 1-D array should be interpreted as row-major or column-major.
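
As a small illustration (these helpers are hypothetical, not part of the WMMA API), the declared layout determines how a (row, col) pair maps into the flat 1-D array:

// Hypothetical addressing helpers for an M x N matrix stored as a flat array.
// ld is the leading dimension: N for row-major storage, M for column-major.
__host__ __device__ inline int rowMajorIndex(int row, int col, int ld) {
  return row * ld + col;   // elements of the same row are contiguous
}
__host__ __device__ inline int colMajorIndex(int row, int col, int ld) {
  return col * ld + row;   // elements of the same column are contiguous
}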

NVIDIA cuDNN

cuDNN is a library of efficient implementations of deep learning primitives.

Unfortunately, these libraries only support a restricted set of tensor operations, leaving the implementation of novel primitives to experts.

PyTorch uses cuDNN extensively. NVIDIA Megatron-LM (an LLM pretraining library) does not call cuDNN directly; it relies on cuDNN through its integration with PyTorch⁶.
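
For a sense of what the cuDNN C API looks like underneath these frameworks, here is a minimal sketch of a single convolution forward pass. The function name, tensor shapes, and algorithm choice are illustrative assumptions, and error checking is omitted.

// Minimal sketch: one forward convolution with cuDNN.
// Assumptions: float NCHW tensors, a 3x3 filter, stride 1, "same" padding,
// and the IMPLICIT_GEMM algorithm (which needs no workspace).
#include <cudnn.h>

void convForward(float* x, float* w, float* y,
                 int n, int c, int h, int width, int k) {
  cudnnHandle_t handle;
  cudnnCreate(&handle);

  cudnnTensorDescriptor_t xDesc, yDesc;
  cudnnFilterDescriptor_t wDesc;
  cudnnConvolutionDescriptor_t convDesc;
  cudnnCreateTensorDescriptor(&xDesc);
  cudnnCreateTensorDescriptor(&yDesc);
  cudnnCreateFilterDescriptor(&wDesc);
  cudnnCreateConvolutionDescriptor(&convDesc);

  // Input: n x c x h x width, filter: k x c x 3 x 3, output: n x k x h x width
  cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, width);
  cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, k, c, 3, 3);
  cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                  CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
  cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, k, h, width);

  const float alpha = 1.0f, beta = 0.0f;   // y = alpha * conv(x, w) + beta * y
  cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, w, convDesc,
                          CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
                          /*workspace=*/nullptr, /*workspaceSize=*/0,
                          &beta, yDesc, y);

  cudnnDestroyConvolutionDescriptor(convDesc);
  cudnnDestroyFilterDescriptor(wDesc);
  cudnnDestroyTensorDescriptor(yDesc);
  cudnnDestroyTensorDescriptor(xDesc);
  cudnnDestroy(handle);
}

In practice one would also query workspace requirements and benchmark the available algorithms; that is exactly the kind of tuning PyTorch performs on the user’s behalf.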

NVIDIA NCCL

NCCL has found wide application in deep learning frameworks, where the AllReduce collective⁸ is heavily used for neural network training. Efficient scaling of neural network training is possible with the multi-GPU and multi-node communication provided by NCCL.

The NVIDIA Collective Communications Library (NCCL, pronounced “Nickel”) is a library providing inter-GPU communication primitives that are topology-aware and can be easily integrated into applications.

NCCL implements both collective communication and point-to-point send/receive primitives. It is not a full-blown parallel programming framework; rather, it is a library focused on accelerating inter-GPU communication.

NCCL provides the following collective communication primitives:

  • AllReduce
  • Broadcast
  • Reduce
  • AllGather
  • ReduceScatter

The PyTorch FSDP (Fully Sharded Data Parallel) API, which is necessary for multi-node LLM training, is also built on top of NCCL.

The AllReduce operation performs reductions on data (for example sum, min, max) across devices and writes the result into the receive buffer of every rank.

In an AllReduce operation across k ranks performing a sum, each rank r provides an array Vr of N values and receives an identical array S of N values, where S[i] = V0[i] + V1[i] + … + Vk-1[i].
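
Spelled out as plain C (a reference for the semantics only, not how NCCL actually implements it, which uses ring or tree algorithms over NVLink / InfiniBand):

// Hypothetical reference: after the call, S holds the element-wise sum of the
// k ranks' contributions V[0..k-1]; in a real AllReduce every rank receives
// an identical copy of this S.
void allReduceSumReference(const float* const V[], float* S, int k, int N) {
  for (int i = 0; i < N; ++i) {
    S[i] = 0.0f;
    for (int r = 0; r < k; ++r)
      S[i] += V[r][i];            // S[i] = V0[i] + V1[i] + ... + Vk-1[i]
  }
}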

AllReduce

AllReduce is commonly used to synchronize gradient updates between workers after each mini-batch. This allows data parallel training.

Data Parallel Training

Here is how AllReduce synchronizes gradient updates between workers after each mini-batch during distributed deep learning training:

  1. Each worker independently processes its own mini-batch of data to compute gradients for its replica of the model.
  2. The workers then call AllReduce to collectively combine the gradient tensors.
  3. AllReduce performs an element-wise sum of all gradients across workers; each worker then divides the result by the number of workers to obtain the average.
  4. For example, if two workers had gradients [1, 2, 3] and [4, 5, 6], after the summed AllReduce and division by 2 both would have [2.5, 3.5, 4.5].
  5. Through this element-wise reduction, AllReduce synchronizes the gradients so that every worker now holds the gradient computed over the entire distributed mini-batch.
  6. After AllReduce, each worker applies the same synchronized gradients to update its model replica, keeping all replicas in sync.
  7. This gradient averaging prevents the workers’ models from diverging during distributed training.
  8. By synchronizing after each mini-batch, workers keep their models aligned as training proceeds in parallel across all machines/GPUs.

testResult_t AllReduceRunColl(void* sendbuff, void* recvbuff, size_t count,
                              ncclDataType_t type, ncclRedOp_t op, int root,
                              ncclComm_t comm, cudaStream_t stream) {
  NCCLCHECK(ncclAllReduce(sendbuff, recvbuff, count, type, op, comm, stream));
  return testSuccess;
}

In this example from NCCL’s all_reduce.cu test, the code above performs the requested reduction (for example, a sum) on the sendbuff data and stores the result in recvbuff.
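
For completeness, here is a hedged sketch of standalone NCCL usage, with a single process driving two GPUs; the device count, buffer size, and buffer contents are illustrative assumptions, and error checking is omitted.

// Single-process, multi-GPU AllReduce sketch (assumes 2 GPUs are visible).
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  const int nDev = 2;           // assumption: two GPUs in this node
  const size_t count = 4;       // elements contributed per rank
  int devs[nDev] = {0, 1};
  ncclComm_t comms[nDev];
  cudaStream_t streams[nDev];
  float *sendbuff[nDev], *recvbuff[nDev];

  // One buffer pair and stream per GPU
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaMalloc(&sendbuff[i], count * sizeof(float));
    cudaMalloc(&recvbuff[i], count * sizeof(float));
    cudaMemset(sendbuff[i], 0, count * sizeof(float));  // placeholder data
    cudaStreamCreate(&streams[i]);
  }

  // One communicator per GPU, all created inside this single process
  ncclCommInitAll(comms, nDev, devs);

  // Issue the per-GPU AllReduce calls as one grouped collective
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  // Wait for completion, then clean up
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuff[i]);
    cudaFree(recvbuff[i]);
  }
  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  printf("AllReduce completed on %d GPUs\n", nDev);
  return 0;
}

In data parallel training, frameworks then scale the summed gradients by 1/world_size to average them; newer NCCL versions also provide an ncclAvg reduction op.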

NVIDIA TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference.

NVIDIA H100

The H100 is currently the most powerful NVIDIA GPU. It is based on the new Hopper architecture.

NVIDIA GH200

¹ Published work, together with our experiments, consistently show that bare-metal peak performance is inaccessible to software written in plain CUDA and without a deeper understanding of the hardware compared to what is currently publicly available. There is consensus that the high degree of optimization found in NVIDIA’s libraries such as cuBLAS and cuDNN is inaccessible to authors of CUDA code, even when they use inline PTX assembly. We believe that manufacturers write optimized libraries as low-level SASS code, possibly with assemblers which they do not make available to the public.

² NVIDIA T4; nvcc 11.8.0 -arch=sm_75 -lcublas -lcurand; SASS assembly code.

³ We note that while the NVIDIA Tensor Core implements 4×4 matrix multiplications in hardware, CUDA 9 (and up) WMMA only allows us to compute larger matrix multiplications.

New Hopper FP8 precisions: 2x the throughput and half the footprint of H100 FP16 / BF16.

Each Tensor Core performs a multiplication of two matrices with half-precision floating-point entries and adds the result to an accumulator held in single precision. One motivation for multiplying in half precision is that the matrix entries being multiplied in a neural network are small relative to the value accumulated over previous iterations, so each multiplication result is still small in value. However, that result is accumulated into another value that might be much larger. In addition, deep neural network training is tolerant of precision loss up to a certain degree, so high-precision calculations are not critical for many training runs to complete.
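
A toy sketch (my illustration, not NVIDIA code; it assumes a GPU with native FP16 arithmetic, e.g. compiled with nvcc -arch=sm_70) of why the accumulator precision matters:

#include <cuda_fp16.h>
#include <cstdio>

// Accumulate 100 small FP16 products (0.5 * 0.5 = 0.25) into a large running
// value. In FP16 the spacing between representable numbers near 2048 is 2, so
// each 0.25 update is rounded away; the FP32 accumulator keeps every update.
__global__ void accumDemo(float *out) {
  half  acc16 = __float2half(2048.0f);
  float acc32 = 2048.0f;
  for (int i = 0; i < 100; ++i) {
    half prod = __hmul(__float2half(0.5f), __float2half(0.5f));  // 0.25
    acc16 = __hadd(acc16, prod);      // rounds back to 2048 every time
    acc32 += __half2float(prod);      // accumulates correctly
  }
  out[0] = __half2float(acc16);       // 2048.0
  out[1] = acc32;                     // 2073.0
}

int main() {
  float *out;
  cudaMallocManaged(&out, 2 * sizeof(float));
  accumDemo<<<1, 1>>>(out);
  cudaDeviceSynchronize();
  printf("FP16 accumulator: %.1f\nFP32 accumulator: %.1f\n", out[0], out[1]);
  cudaFree(out);
  return 0;
}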

GPU in a nutshell

The Challenge of Uncertainty in a Fast Moving Field

One challenge for building machine learning accelerator hardware is that the ML research field is moving extremely fast. Chip design projects that are started today often take 18 months to 24 months to finish the design, fabricate the semiconductor parts, get them back and install them into a production datacenter environment. For these parts to be economically viable, they typically must have lifetimes of at least three years. So, the challenge for computer architects building ML hardware is to predict where the fast moving field of machine learning will be in the 2 to 5 year time frame.

⁶ Efforts are being made to surpass the capabilities offered by cuDNN; the OpenAI Triton project is an example.

OpenAI matrix multiplication code adds inline PTX assembly instructions to hardcode shared memory load/store details.

 asm volatile ("{\n\t"
".reg .u32 c<4>, k<4>;\n\t"
"mov.u32 c0, 0;\n\t"
"mov.u32 c1, 0;\n\t"
"mov.u32 c2, 0;\n\t"
"mov.u32 c3, 0;\n\t"
"mov.u32 k0, 0;\n\t"
"mov.u32 k1, 0;\n\t"
"mov.u32 k2, 0;\n\t"
"mov.u32 k3, 0;\n\t"
"@bc ld.global.nc.v4.u32 {c0, c1, c2, c3}, [%0];\n\t"
"@bk ld.global.nc.v4.u32 {k0, k1, k2, k3}, [%1];\n\t"
"bar.sync 0;\n\t"
"st.shared.v4.u32 [%2 + 0*80*2], {c0, c1, c2, c3};\n\t"
"st.shared.v4.u32 [%2 + 32*80*2], {k0, k1, k2, k3};\n\t"
"bar.sync 0;\n\t"
"}" :: "l"(X + offsetC), "l"(E + offsetK), "r"(writeS));

In this case the code loads four elements from X into the c<4> registers and four elements from E into the k<4> registers, and then stores the loaded elements into shared memory.

Essentially this creates potentially faster but brittle, non-portable code.

⁸ NCCL is similar to MPI.

