OpenAI Sora and Open Source Efforts to Catch Up

Ranko Mosic
5 min read · Feb 17, 2024

My mental model of Sora is that it is the “GPT-2 moment” for video generation.

Big Picture

Sora is built on our diffusion³ transformer (DiT) model (published in ICCV 2023) — it’s a diffusion model with a transformer backbone, in short: DiT = [VAE encoder + ViT + DDPM + VAE decoder].
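Schematically, the recipe looks something like the following minimal sketch (stand-in functions only, not Sora's actual code): encode pixels into a latent, run a transformer-based denoising loop in that latent space, then decode back to pixels.

import torch

def vae_encode(x):                     # stand-in for a pretrained VAE encoder
    return torch.randn(x.shape[0], 4, 32, 32)

def dit_denoise(z_t, t):               # stand-in for the ViT-style diffusion backbone
    return torch.zeros_like(z_t)       # predicted noise

def vae_decode(z):                     # stand-in for the pretrained VAE decoder
    return torch.randn(z.shape[0], 3, 256, 256)

x = torch.randn(1, 3, 256, 256)        # image (or video frame)
z = vae_encode(x)                      # compress pixels to a latent
for t in reversed(range(1000)):        # DDPM-style reverse process
    eps = dit_denoise(z, t)            # transformer predicts the noise
    z = z - 0.001 * eps                # schematic update, not the exact DDPM formula
x_hat = vae_decode(z)                  # map the denoised latent back to pixels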

Sora

Just like GPT is an autoregressive language model for text, Sora is probably an autoregressive world simulator.

(Auto-regressive) Long Video Generation: a significant breakthrough in Sora is the ability to generate very long videos. The difference between producing a 2-second video and a 1-minute video is monumental.

In Sora, this is probably achieved through joint frame prediction that allows auto-regressive sampling, yet a major challenge is how to address error accumulation and maintain quality/consistency through time. A very long (and bi-directional) context for conditioning? Or could scaling up simply lessen the issue?

This would imply sequential video generation, i.e. the process is not parallel, which is perhaps one of the reasons Sora is currently limited to 1-minute videos⁴.
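A purely speculative sketch of what such auto-regressive roll-out could look like (every name below is hypothetical; OpenAI has not published the method): generate a chunk of latent frames, then condition the next chunk on a window of previously generated frames.

import torch

def sample_chunk(context, n_frames=16):
    # stand-in for a diffusion sampler conditioned on previously generated frames
    return torch.randn(n_frames, 4, 32, 32)     # latent frames

chunks, context = [], torch.empty(0, 4, 32, 32)  # start with no history
for _ in range(8):                                # 8 chunks of 16 latent frames each
    chunk = sample_chunk(context)
    chunks.append(chunk)
    context = torch.cat(chunks)[-32:]             # condition on a sliding window
video = torch.cat(chunks)                          # errors can accumulate chunk by chunk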

Components

Importantly, Sora is a diffusion transformer.

Sora is based on a few major algorithms:

  • Variational Autoencoder (VAE) encoder / decoder
  • Diffusion
  • Latent diffusion (VAE + diffusion); the VAE is a generative model by itself, but here its primary role is to reduce the size of the space diffusion tackles, from individual pixels to a lower-dimensional latent space (256 x 256 to 32 x 32 in the example below)
  • Diffusion Transformers: transformers applied to visual data; the image is split into patches (tokenized), then processed by transformers (see the sketch after this list)
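A rough illustration of that patchification / tokenization step (plain PyTorch, not the repo's exact code): cutting a 32 x 32 x 4 latent into non-overlapping 4 x 4 patches yields 64 tokens.

import torch

z = torch.randn(1, 4, 32, 32)                   # (batch, channels, height, width) latent
p = 4                                           # patch size
patches = z.unfold(2, p, p).unfold(3, p, p)     # (1, 4, 8, 8, 4, 4): an 8x8 grid of 4x4 patches
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 64, 4 * p * p)
print(tokens.shape)                             # torch.Size([1, 64, 64]) -> 64 tokens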

We introduce Diffusion Transformers (DiTs), a new architecture for diffusion models. DiT is based on the Vision Transformer (ViT) architecture which operates on sequences of patches.

We show that by constructing and benchmarking the DiT design space under the Latent Diffusion Models (LDMs) framework¹, where diffusion models are trained within a VAE's latent space, we can successfully replace the U-Net backbone with a transformer.

Transformer decoder. After the final DiT block, we need to decode our sequence of image tokens into an output noise prediction and an output diagonal covariance prediction.
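A rough sketch of that final projection (illustrative shapes only, assuming the 32 x 32 x 4 latent and 4 x 4 patches from above): one linear layer per token predicts both the noise and the diagonal covariance.

import torch
import torch.nn as nn

hidden_size, p, channels = 768, 4, 4
final = nn.Linear(hidden_size, p * p * 2 * channels)   # 2x output: noise + covariance

tokens = torch.randn(1, 64, hidden_size)               # output of the last DiT block
out = final(tokens)                                     # (1, 64, 128)
noise_pred, cov_pred = out.chunk(2, dim=-1)             # each (1, 64, 64), then un-patchified back to (1, 4, 32, 32)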

Essentially we see image / frame data processed through a number of steps aimed at modeling the data in a computationally tractable way.

Open Source

DiT code by William Peebles — now OpenAI Sora Team Lead.

To run DiT-B/4 model training on Tiny ImageNet⁵:

git clone https://github.com/facebookresearch/DiT.git
cd DiT
wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
unzip tiny-imagenet-200.zip
torchrun --nnodes=1 --nproc_per_node=1 train.py --model DiT-B/4 --data-path /content/DiT/tiny-imagenet-200/train
[2024-02-26 13:28:55] DiT Parameters: 130,475,648
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
[2024-02-26 13:28:55] Dataset contains 100,000 images (/content/DiT/tiny-imagenet-200/train)
[2024-02-26 13:28:55] Training for 1400 epochs...
[2024-02-26 13:28:55] Beginning epoch 0...
[2024-02-26 13:31:47] (step=0000100) Train Loss: 0.4348, Train Steps/Sec: 0.58
[2024-02-26 13:34:51] (step=0000200) Train Loss: 0.2109, Train Steps/Sec: 0.54
[2024-02-26 13:37:59] (step=0000300) Train Loss: 0.2227, Train Steps/Sec: 0.5
vae = AutoencoderKL.from_pretrained(f"stabilityai/sd-vae-ft-{args.vae}").to(device)
x = vae.encode(x).latent_dist.sample().mul_(0.18215)
# x.shape before VAE: torch.Size([32, 3, 256, 256])
# x.shape after VAE:  torch.Size([32, 4, 32, 32])

The Stable Diffusion² VAE projects into latent space and reduces the input shape from 256x256x3 to 32x32x4.
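At sampling time the inverse happens: scale back and decode to pixels. A sketch using the same diffusers AutoencoderKL API (the repo's sample.py does essentially this):

samples = vae.decode(samples / 0.18215).sample    # (32, 4, 32, 32) -> (32, 3, 256, 256)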

Patch size is part of the model definition. DiT_B_4 uses a 4x4 patch size (in the latent space).

def DiT_B_4(**kwargs):
    return DiT(depth=12, hidden_size=768, patch_size=4, num_heads=12, **kwargs)
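With the 32 x 32 latent above and patch_size=4, each image becomes (32 / 4)² = 64 tokens, each projected into hidden_size 768. A bigger patch size such as DiT-S/8 yields only (32 / 8)² = 16 tokens, which is why footnote ⁵ suggests it when memory is tight.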

¹ Latent Diffusion: we apply diffusion models in the latent space of powerful pretrained autoencoders (VAEs).

The Latent Diffusion paper uses a U-Net. The main innovation of DiT is replacing the denoising U-Net with a transformer, thus unlocking scalability.

² Stability AI announced Stable Diffusion 3.0, which could also be described as an open source attempt at Sora (the paper is out, but they haven't shown any videos yet :-) ): "Our architecture (also) builds upon the DiT architecture."

³ Diffusion is a generic stochastic process (borrowed from physics), here implemented as a deep unsupervised learning method that can be applied to any data modality (image, text, audio). Diffusion models are an emerging class of generative models used to obtain novel data.

Recover Structure by Reversing Time
Brownian motion looks the same forward and backwards.

.. then all we have to have the deep supervised network learn is a function which predicts the mean and the covariance of each step in the reverse diffusion process (a form we already know is Gaussian), and that will give us a generative model.
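A schematic version of one such reverse step (a DDPM-style sketch with a stand-in network, not Sora's code): the network outputs the mean and diagonal covariance of a Gaussian, from which the slightly less noisy sample is drawn.

import torch

def predict_mean_and_cov(x_t, t):
    # stand-in for the learned network that predicts the Gaussian parameters
    return 0.99 * x_t, torch.full_like(x_t, 1e-3)

x_t = torch.randn(1, 4, 32, 32)                 # pure noise at t = T
for t in reversed(range(1000)):
    mean, cov = predict_mean_and_cov(x_t, t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    x_t = mean + cov.sqrt() * noise             # sample x_{t-1}; x_0 is the generated sample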

How diffusion process works

Due to their iterative nature and the associated computational costs, as well as the long sampling times during inference, research on formulations for more efficient training and/or faster sampling of these models has increased (i.e. high compute demand is not going anywhere anytime soon).

⁵ If low on memory (torch.cuda.OutOfMemoryError: CUDA out of memory), change train.py (using a smaller model / bigger patch size also helps, for example DiT-S/8):

parser.add_argument("--global-batch-size", type=int, default=256)
to
parser.add_argument("--global-batch-size", type=int, default=32)
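Alternatively, assuming the flag is parsed exactly as in the repo's train.py, the default can be left alone and --global-batch-size 32 appended to the torchrun command above instead.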


Ranko Mosic

Applied AI Consultant Full Stack. GLG Network Expert https://glginsights.com/ . AI tech advisor for VCs, investors, startups.