LLM Training: Google Hardware and Software Stack

Ranko Mosic
3 min read · Oct 29, 2023
Google Stack¹

Why Does Specialized Hardware Make Sense for Deep Learning Models?
Deep learning models have three properties that make them different from many other kinds of more general-purpose computations. First, they are very tolerant of reduced-precision computations. Second, the computations performed by most models are simply different compositions of a relatively small handful of operations like matrix multiplies, vector operations, application of convolutional kernels, and other dense linear-algebra calculations [Vanhoucke et al. 2011]. Third, many of the mechanisms developed over the past 40 years to enable general-purpose programs to run with high performance on modern CPUs, such as branch predictors, speculative execution, hyperthreaded processing cores, and deep cache-memory hierarchies and TLB subsystems, are unnecessary for machine learning computations. So the opportunity exists to build computational hardware that is specialized for dense, low-precision linear algebra, and not much else, but is still programmable at the level of specifying programs as different compositions of mostly linear-algebra-style operations. This confluence of characteristics is not dissimilar from the observations that led to the development of specialized digital signal processors (DSPs) for telecom applications starting in the 1980s [en.wikipedia.org/wiki/Digital_signal_processor]. A key difference, though, is that because of the broad applicability of deep learning to huge swaths of computational problems across many domains and fields of endeavor, this hardware, despite its narrow set of supported operations, can be used for a wide variety of important computations, rather than the more narrowly tailored uses of DSPs.
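The first property, tolerance of reduced precision, can be sketched concretely. Below is a minimal example in JAX (JAX is an assumption here; any framework with reduced-precision dtypes would illustrate the same point) that keeps a dot product entirely in bfloat16, the 16-bit format TPUs are built around:

```python
import jax.numpy as jnp

# Dense linear algebra runs fine at reduced precision; TPUs exploit
# this with hardware bfloat16 matrix units.
x = jnp.linspace(0.0, 1.0, 5, dtype=jnp.bfloat16)
w = jnp.ones((5,), dtype=jnp.bfloat16)
y = jnp.dot(x, w)
print(y.dtype)  # bfloat16
```

The computation stays in bfloat16 end to end; no conversion back to float32 is needed for the result.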

Google has not been shy about staying a node or two behind with its TPU designs, and that is absolutely on purpose, to keep the cost of chip design and production low.

TPU v4
TPU v4 chip

The TensorFlow ecosystem contains a number of compilers and optimizers that operate at multiple levels of the software and hardware stack.

It’s actually more complicated than this

In this diagram, we can see that TensorFlow graphs can be run in a number of different ways. This includes:

  • Sending them to the TensorFlow executor that invokes hand-written op-kernels
  • Converting them to XLA High-Level Optimizer representation (XLA HLO), which in turn can invoke the LLVM compiler for CPU or GPU, or else continue to use XLA for TPU. (Or some combination of the two!)
  • Converting them to TensorRT, nGraph, or another compiler format for a hardware-specific instruction set
  • Converting graphs to TensorFlow Lite format, which is then executed inside the TensorFlow Lite runtime, or else further converted to run on GPUs or DSPs via the Android Neural Networks API (NNAPI) or related tech.
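Exercising these conversion paths end to end requires the corresponding runtimes, but the XLA HLO stage in the second bullet can be inspected directly. A minimal sketch using JAX (an assumption here, since the diagram is about TensorFlow; JAX feeds the same XLA compiler) that lowers a function to the compiler's input IR without executing it:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x * x)

# jit(...).lower(...) stops at the XLA input IR instead of executing,
# so we can print the module the compiler would consume.
hlo_text = jax.jit(f).lower(jnp.arange(4.0)).as_text()
print(hlo_text.splitlines()[0])
```

The printed text is the module XLA optimizes and then code-generates for CPU, GPU, or TPU.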

XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.
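In JAX, the same compiler is one decorator away: the function body below is plain NumPy-style code with no XLA-specific changes, and `jax.jit` hands the whole composition of ops to XLA for fusion and compilation (a sketch; actual speedups depend on the backend):

```python
import jax
import jax.numpy as jnp

@jax.jit  # compile the whole composition of ops with XLA
def predict(w, b, x):
    return jnp.tanh(x @ w + b)

w = jnp.ones((3, 2))
b = jnp.zeros(2)
x = jnp.ones((4, 3))
y = predict(w, b, x)
print(y.shape)  # (4, 2)
```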

JAX
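JAX pairs a NumPy-like API with composable function transformations such as `grad` (automatic differentiation) and `jit` (XLA compilation). A minimal sketch (a hypothetical example, not from the original post) showing `grad`:

```python
import jax
import jax.numpy as jnp

# grad transforms a scalar-valued function into its gradient function.
def loss(w):
    return jnp.sum(w ** 2)

grad_fn = jax.grad(loss)
print(grad_fn(jnp.array([1.0, 2.0])))  # gradient of sum(w^2) is 2w
```

These transformations compose, e.g. `jax.jit(jax.grad(loss))` compiles the gradient computation with XLA.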

Flax

Flax is a high-performance neural network library and ecosystem for JAX that is designed for flexibility: Try new forms of training by forking an example and by modifying the training loop, not by adding features to a framework.

Flax is being developed in close collaboration with the JAX team and comes with everything you need to start your research, including:

  • Neural network API (flax.linen): Dense, Conv, {Batch|Layer|Group} Norm, Attention, Pooling, {LSTM|GRU} Cell, Dropout
  • Utilities and patterns: replicated training, serialization and checkpointing, metrics, prefetching on device
  • Educational examples that work out of the box: MNIST, LSTM seq2seq, Graph Neural Networks, Sequence Tagging
  • Fast, tuned large-scale end-to-end examples: CIFAR10, ResNet on ImageNet, Transformer LM1b

¹ TensorFlow is suspiciously missing from this Google schema. TF is still quite popular on GitHub, though.

Ranko Mosic

Applied AI Consultant Full Stack. GLG Network Expert https://glginsights.com/ . AI tech advisor for VCs, investors, startups.