Diffusion Gen AI Models

Ranko Mosic
3 min read · May 9, 2024


Diffusion models are a family of probabilistic generative models that progressively corrupt data by injecting noise, then learn to reverse this process to generate samples.

Here we link to code samples illustrating major points (DDPM implementation).

Training

We start with a cat image. The diffusion model will learn to generate synthetic cat images from a training set of many images like this one.

Cat image

A digital image is encoded as a matrix of 0–255 values, i.e. integers representing red, green and blue intensities (here center-cropped to a 128x128x3 shape)¹.
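As a concrete sketch of this preprocessing (assuming torchvision and PIL are available; the file name cat.jpg is only a placeholder):

from PIL import Image
from torchvision import transforms

image_size = 128

# center-crop to 128x128 and rescale 0-255 integer pixels to floats in (-1, 1)
preprocess = transforms.Compose([
    transforms.Resize(image_size),
    transforms.CenterCrop(image_size),
    transforms.ToTensor(),                   # (3, 128, 128), floats in [0, 1]
    transforms.Lambda(lambda t: t * 2 - 1),  # rescale to (-1, 1)
])

x_start = preprocess(Image.open("cat.jpg"))  # "cat.jpg" is a placeholder file name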

Progressively more noise is added at each timestep (the forward, q, part of the process), as sketched below:
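A minimal sketch of this step-by-step noising, assuming a simple linear variance schedule (the beta range is an assumed, typical choice) and the preprocessed tensor x_start from above:

import torch

timesteps = 200
betas = torch.linspace(0.0001, 0.02, timesteps)  # assumed linear variance schedule

x = x_start                                      # preprocessed image from the sketch above
for beta_t in betas:
    noise = torch.randn_like(x)
    # q(x_t | x_{t-1}): shrink the previous image slightly and add a little Gaussian noise
    x = torch.sqrt(1 - beta_t) * x + torch.sqrt(beta_t) * noise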

For the reverse direction (the p part), noisy data at randomly sampled timesteps is passed to a noise prediction model (a U-Net, i.e. a convolutional net with a few modifications):

t = torch.randint(0, timesteps, (batch_size,), device=device).long()  # 128 (batch size) random timesteps in [0, timesteps), here 0-199

model = Unet(
    dim=image_size,
    channels=channels,
    dim_mults=(1, 2, 4,),
)

def p_losses(denoise_model, x_start, t, noise=None, loss_type="l1"):
    if noise is None:
        noise = torch.randn_like(x_start)                    # sample Gaussian noise

    x_noisy = q_sample(x_start=x_start, t=t, noise=noise)    # forward-diffuse the clean image to timestep t
    predicted_noise = denoise_model(x_noisy, t)              # the U-Net predicts the added noise

    if loss_type == 'l1':
        loss = F.l1_loss(noise, predicted_noise)
    elif loss_type == 'l2':
        loss = F.mse_loss(noise, predicted_noise)
    elif loss_type == "huber":
        loss = F.smooth_l1_loss(noise, predicted_noise)
    else:
        raise NotImplementedError()

    return loss

loss = p_losses(model, batch, t, loss_type="huber")

denoise_model takes the noisy image (x_noisy) and the random timestep tensor t and outputs predicted_noise. The loss between noise and predicted_noise is then minimized via standard SGD:

loss.backward()    # backpropagate the noise-prediction loss
optimizer.step()   # update the U-Net parameters
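Putting the fragments together, a hedged sketch of one training pass over a dataloader (the optimizer choice, learning rate, dataloader and device setup are assumptions, not taken from the article):

from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=1e-3)        # assumed optimizer and learning rate

for batch in dataloader:                             # dataloader of preprocessed images (assumed)
    batch = batch.to(device)
    batch_size = batch.shape[0]

    # a random timestep for every image in the batch
    t = torch.randint(0, timesteps, (batch_size,), device=device).long()

    loss = p_losses(model, batch, t, loss_type="huber")

    optimizer.zero_grad()                            # reset gradients from the previous step
    loss.backward()
    optimizer.step()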

In learning how to denoise, the model is forced to learn the underlying data structure, i.e. the important general characteristics of cat images. During the prompt-induced generation process, the diffusion model will then produce realistic synthetic cat images.
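For completeness, a hedged sketch of the reverse (sampling) process in the DDPM style; the schedule tensors (betas, sqrt_one_minus_alphas_cumprod, sqrt_recip_alphas, posterior_variance) and the extract helper are assumed to be precomputed as in the schedule sketch further below:

@torch.no_grad()
def p_sample(model, x, t, t_index):
    # look up the schedule values for this timestep
    betas_t = extract(betas, t, x.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(sqrt_one_minus_alphas_cumprod, t, x.shape)
    sqrt_recip_alphas_t = extract(sqrt_recip_alphas, t, x.shape)

    # mean of p(x_{t-1} | x_t): subtract the predicted noise from the current image
    model_mean = sqrt_recip_alphas_t * (
        x - betas_t * model(x, t) / sqrt_one_minus_alphas_cumprod_t
    )

    if t_index == 0:
        return model_mean                     # final step: no noise is added back
    noise = torch.randn_like(x)
    posterior_variance_t = extract(posterior_variance, t, x.shape)
    return model_mean + torch.sqrt(posterior_variance_t) * noise

# start from pure Gaussian noise and denoise step by step
img = torch.randn((1, channels, image_size, image_size), device=device)
for i in reversed(range(timesteps)):
    t = torch.full((1,), i, device=device, dtype=torch.long)
    img = p_sample(model, img, t, i)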

¹RGB values of (255, 0, 0), (0, 255, 0) and (0, 0, 255) represent pure red, green and blue respectively.

The original image has shape (480, 640, 3); its RGB values look like [[[140 25 56], [144 25 67], [146 24 73], …]], for example. These values are then normalized linearly to the (-1, 1) range, for example:

[[[ 0.27843142 0.2941177 0.33333337 … -0.02745098 0.00392163 -0.01176471] [ 0.27843142 0.28627455 0.3176471 … -0.05098039 -0.01176471 0.00392163]

If we follow a single pixel value through the noising step, the pixel at position 0 goes from 0.2784 to 0.2545. The calculation is 0.2784 * 0.9999 + 0.0100 * (-2.3896) = 0.25447616, i.e. pixel start value * sqrt(cumulative alpha) + sqrt(1 - cumulative alpha) * random noise.
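The same arithmetic, written out as a quick sanity check (the constants are the ones quoted above):

sqrt_alphas_cumprod_t = 0.9999             # sqrt(cumulative alpha) at this timestep
sqrt_one_minus_alphas_cumprod_t = 0.0100   # sqrt(1 - cumulative alpha)
pixel, noise = 0.2784, -2.3896

noisy_pixel = pixel * sqrt_alphas_cumprod_t + sqrt_one_minus_alphas_cumprod_t * noise
print(noisy_pixel)                         # ≈ 0.2545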

# forward diffusion
def q_sample(x_start, t, noise=None):
    if noise is None:
        noise = torch.randn_like(x_start)

    # look up sqrt(cumulative alpha) and sqrt(1 - cumulative alpha) for each timestep in the batch
    sqrt_alphas_cumprod_t = extract(sqrt_alphas_cumprod, t, x_start.shape)
    sqrt_one_minus_alphas_cumprod_t = extract(
        sqrt_one_minus_alphas_cumprod, t, x_start.shape
    )

    return sqrt_alphas_cumprod_t * x_start + sqrt_one_minus_alphas_cumprod_t * noise

The amount of noise added at each timestep is controlled via a parameter schedule (alpha, derived from the variance schedule beta), i.e. the schedule's value at that timestep. The process is parallelizable, since each timestep can be sampled independently of the others.
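For reference, a hedged sketch of how the schedule tensors used in q_sample (and in the sampling sketch above) are typically precomputed; the linear beta range is an assumed, common default:

import torch
import torch.nn.functional as F

timesteps = 200

betas = torch.linspace(0.0001, 0.02, timesteps)               # assumed linear variance schedule

alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)                 # cumulative product of alphas
alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value=1.0)

sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)              # used by q_sample
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)
sqrt_recip_alphas = torch.sqrt(1.0 / alphas)                  # used by p_sample
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)

def extract(a, t, x_shape):
    # pick the schedule value for each timestep in t and reshape it for broadcasting
    batch_size = t.shape[0]
    out = a.gather(-1, t.cpu())
    return out.reshape(batch_size, *((1,) * (len(x_shape) - 1))).to(t.device)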

Closed-form formula: no need for repeated noise application.
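The formula referred to above, written out in LaTeX (this is the standard DDPM closed-form result that q_sample implements):

q(x_t \mid x_0) = \mathcal{N}\!\left( x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I} \right),
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, \mathbf{I}),
\quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s .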
Training algorithm
