The Path to an Open-Source State-of-the-Art LLM: Mixture-of-Experts (With a DeepSpeed Colab MoE Demo)

Ranko Mosic
Jul 24, 2023


The release of Llama2 has caused quite a stir: it is a powerful, dense, open-source-ish model with 70 billion parameters, trained on 2 trillion tokens. However, despite its impressive capabilities, it may still lag behind leading proprietary models like GPT-4, PaLM2, and Claude in benchmark performance.

It is worthwhile to investigate where the differences lie, as future releases of Llama are likely to follow suit⁴.

GPT-4 is far larger than Llama 2: it reportedly has approximately 1.76 trillion parameters, compared to 70 billion for Llama 2's largest version. GPT-4 is said to be built from eight models of roughly 220 billion parameters each (8 × 220B ≈ 1.76T), connected by a (sparse¹) Mixture-of-Experts (MoE) architecture.

Fedus et al., 2022:

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example. The result is a sparsely-activated model with an outrageous number of parameters but a constant computational cost.

MoE models build upon the observation that language models can be decomposed into smaller, specialized sub-models, or "experts", that focus on distinct aspects of the input data, thereby enabling more efficient computation and resource allocation.
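
To make the idea concrete, here is a minimal, self-contained sketch of a sparsely-gated MoE layer with top-1 routing. It is illustrative only: the layer width and expert count are arbitrary, and real implementations such as DeepSpeed MoE add load-balancing losses, capacity limits, and expert parallelism across GPUs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy sparsely-gated MoE layer: each input is routed to a single expert."""
    def __init__(self, hidden_size=84, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)])
        self.gate = nn.Linear(hidden_size, num_experts)  # the router

    def forward(self, x):  # x: (batch, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)
        top_score, top_idx = scores.max(dim=-1)  # top-1 expert per example
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e  # examples routed to expert e
            if mask.any():
                # only the selected expert's parameters are used for these examples
                out[mask] = top_score[mask].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(8, 84)
print(TinyMoE()(x).shape)  # torch.Size([8, 84])

Adding more experts grows the parameter count, but each example still passes through exactly one expert, so the compute per example stays roughly flat.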

Introducing MoE models may be counterproductive² in a classic pretrain/fine-tune setting without instruction tuning. However, MoE can dramatically improve performance when instruction tuning is introduced, either on its own or in conjunction with task-specific fine-tuning.

Implementing MoE is a non-trivial exercise that requires modification of the model’s code, as well as adjustments for data and parameter flows.

Using sparsity also requires careful consideration of how the model will be used downstream. If there are lots of machines available to pre-train a model, but far fewer for fine-tuning or serving, then the amount of sparsity (e.g. the number of experts) should be tailored to fit the amount of memory available in the downstream use cases.
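
A rough back-of-the-envelope calculation illustrates the trade-off, assuming a Transformer where MoE replaces the feed-forward block in every layer (the dimensions below are illustrative, not those of any particular model):

# Illustrative only: parameters stored vs. parameters touched per token (top-1 routing).
d_model, d_ff, num_layers = 4096, 16384, 32
ffn_params = 2 * d_model * d_ff  # one feed-forward "expert"

for num_experts in (1, 8, 64):
    stored = num_layers * num_experts * ffn_params  # must fit in accelerator memory
    active = num_layers * 1 * ffn_params            # compute per token stays constant
    print(f"{num_experts:3d} experts: {stored / 1e9:6.1f}B FFN params stored, "
          f"{active / 1e9:4.1f}B active per token")

Memory grows linearly with the number of experts while per-token compute does not, which is exactly why the expert count should be sized to the GPUs available for fine-tuning and serving, not just for pre-training.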

Case Study — DeepSpeed MoE on Colab

DeepSpeed is a deep learning optimization library developed and open sourced by Microsoft, built on top of Nvidia Megatron. DeepSpeed MoE is an implementation of the PR-MoE algorithm.

Here is how to run cifar10_deepspeed.py, an MoE modification of the classic CIFAR10 training script cifar10_tutorial.py, on Google Colab on a single GPU:

!pip install deepspeed
!bash run_ds_moe.sh

run_ds_moe.sh content:

#!/bin/bash

# Number of nodes
NUM_NODES=1
# Number of GPUs per node
NUM_GPUS=1
# Size of expert parallel world (should be less than total world size)
EP_SIZE=1
# Number of total experts
EXPERTS=1

deepspeed --num_nodes=${NUM_NODES} --num_gpus=${NUM_GPUS} cifar10_deepspeed.py \
--log-interval 100 \
--deepspeed \
--deepspeed_config ds_config.json \
--moe \
--ep-world-size ${EP_SIZE} \
--num-experts ${EXPERTS} \
--top-k 1 \
--noisy-gate-policy 'RSample' \
--moe-param-group

For Colab demo purposes only, we have set NUM_GPUS, EXPERTS, and EP_SIZE to 1. With only one expert, this configuration is approximately equivalent to a dense model. In real-life pre-training runs, the number of experts would be increased to match the number of available GPUs.
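
The launch command also references ds_config.json, which is not shown above. A minimal configuration along these lines should be enough for the demo; the values here are illustrative, and the exact file shipped with DeepSpeedExamples may differ. Written from Python so it can be run in a Colab cell:

import json

# Illustrative DeepSpeed config for the Colab demo (values are assumptions,
# not the exact ds_config.json from DeepSpeedExamples).
ds_config = {
    "train_batch_size": 16,
    "steps_per_print": 2000,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 0.001, "betas": [0.8, 0.999], "eps": 1e-8, "weight_decay": 3e-7},
    },
    "fp16": {"enabled": True},
    "wall_clock_breakdown": False,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)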

In order to introduce MoE we modify the original self.fc3 layer.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

We modify the third fully connected layer so that it becomes the expert wrapped by the MoE layer. To accommodate this, we add an additional fully connected layer (fc4) whose input dimension is equal to the output dimension of the MoE layer.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        if args.moe:
            fc3 = nn.Linear(84, 84)
            self.moe_layer_list = []
            for n_e in args.num_experts:
                # create moe layers based on the number of experts
                self.moe_layer_list.append(
                    deepspeed.moe.layer.MoE(
                        hidden_size=84,
                        expert=fc3,
                        num_experts=n_e,
                        ep_size=args.ep_world_size,
                        use_residual=args.mlp_type == 'residual',
                        k=args.top_k,
                        min_capacity=args.min_capacity,
                        noisy_gate_policy=args.noisy_gate_policy))
            self.moe_layer_list = nn.ModuleList(self.moe_layer_list)
            self.fc4 = nn.Linear(84, 10)
        else:
            self.fc3 = nn.Linear(84, 10)

The forward pass is also modified to allow for MoE. From:

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

To:

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        if args.moe:
            for layer in self.moe_layer_list:
                x, _, _ = layer(x)
            x = self.fc4(x)
        else:
            x = self.fc3(x)
        return x

The call below initializes the DeepSpeed engine. model_engine is the DeepSpeed runtime engine, which wraps the client model for distributed training.

model is the nn.Module instance before any wrappers are applied, i.e. our Net definition.

DeepSpeed will also take care of distributed data loading.

# Initialize DeepSpeed to use the following features
# 1) Distributed model
# 2) Distributed data loader
# 3) DeepSpeed optimizer
model_engine, optimizer, trainloader, __ = deepspeed.initialize(
    args=args, model=net, model_parameters=parameters, training_data=trainset)

The DeepSpeed model_engine now takes care of the forward and backward steps:

    for i, data in enumerate(trainloader):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].to(model_engine.local_rank), data[1].to(
            model_engine.local_rank)
        if fp16:
            inputs = inputs.half()
        outputs = model_engine(inputs)
        loss = criterion(outputs, labels)

        model_engine.backward(loss)
        model_engine.step()
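
After training, a quick sanity check on the test set can be run through the same engine. This is a sketch, assuming testloader is the standard CIFAR10 test DataLoader from the original tutorial and fp16 matches the setting in ds_config.json:

# Sketch: accuracy check on the test set (single-GPU case).
correct, total = 0, 0
model_engine.module.eval()
with torch.no_grad():
    for images, labels in testloader:
        images = images.to(model_engine.local_rank)
        if fp16:
            images = images.half()
        outputs = model_engine(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted.cpu() == labels).sum().item()
print(f"Accuracy on the test images: {100 * correct / total:.1f}%")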

This example demonstrates how to effectively leverage distributed DeepSpeed capabilities and implement MoE logic with relative ease.

Llama2 and DeepSpeed

DeepSpeed also offers pre-built scripts for training GPT-3-style models in both dense and MoE configurations, as well as Llama2 pretraining scripts (no MoE scripts so far).

We would follow a similar process to customize and build an MoE LLM based on the Llama2 model and data sources. If building a custom domain LLM, we would also incorporate domain-specific datasets.

The following script (excerpted) trains a 1.3B-parameter GPT-3-style model with 128 experts.

NUM_GPUS=64
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 1 means dense model without MoE
# EP_SIZE=1
EP_SIZE=128

if [[ $EP_SIZE -gt $NUM_GPUS ]]; then
    EP_PARALLEL_SIZE=$NUM_GPUS
else
    EP_PARALLEL_SIZE=$EP_SIZE
fi

DeepSpeed builds on top of the Nvidia Megatron library:

megatron_options=" \
--override-opt_param-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP}

DeepSpeed has developed its own improvement for MoE called PR-MoE (Pyramid-Residual MoE). PR-MoE is a hybrid dense and MoE model built with residual connections, applying experts only where they are most effective.

DeepSpeed PR-MoE
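
The CIFAR example above already exposes the knobs needed to approximate a PR-MoE-style setup on a larger machine: the script's num-experts argument accepts a list (a pyramid, with more experts in later MoE layers), and args.mlp_type set to 'residual' turns on residual experts, as seen in the Net code. A sketch of such a launch (flag values are illustrative, not a tested configuration):

# Sketch: pyramid of experts (4, then 8) with residual experts on 4 GPUs.
deepspeed --num_nodes=1 --num_gpus=4 cifar10_deepspeed.py \
    --log-interval 100 \
    --deepspeed \
    --deepspeed_config ds_config.json \
    --moe \
    --ep-world-size 4 \
    --num-experts 4 8 \
    --mlp-type residual \
    --top-k 1 \
    --noisy-gate-policy 'RSample' \
    --moe-param-group

Passing two values to num-experts stacks two MoE layers in the toy network, which is the pyramid idea in miniature.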

MoE Inference

Mixture of Experts (MoE) models are significantly larger than their dense counterparts, which poses a concern during inference. To ensure low latency, it is crucial for the entire model to fit within GPU memory³. Due to their size, these models often require distributed memory across multiple GPUs. Distillation is a method used to reduce model size while maintaining the quality of generated content. DeepSpeed has developed a variant of distillation called Mixture of Students (MoS).

Staged KD (Knowledge Distillation) in DeepSpeed’s Mixture-of-Students (MoS) framework is a technique used to train large-scale language models more efficiently. It differs from standard knowledge distillation by introducing an additional intermediate teacher-student stage. Here’s an overview of how staged KD is achieved in DeepSpeed MoS and how it differs from standard knowledge distillation:

1. Standard Knowledge Distillation:

— In standard knowledge distillation, there are typically two models involved: a large teacher model and a smaller student model.
— The teacher model is a pre-trained, high-capacity model that possesses rich knowledge and generalization abilities.
— The student model is a smaller model that aims to mimic the behavior and predictions of the teacher model.
— During training, the student model is trained to minimize the discrepancy between its predictions and the soft targets (probability distributions) generated by the teacher model.

2. Staged Knowledge Distillation in DeepSpeed MoS:

— DeepSpeed MoS extends the concept of knowledge distillation by introducing a staged training process with intermediate teacher-student stages.
— The architecture consists of multiple stages, each with a teacher-student pair.
— The initial stage starts with a large teacher model and a smaller student model (similar to standard knowledge distillation).
— In each subsequent stage, the teacher becomes the student from the previous stage, and a new, smaller student model is introduced.
— The training process involves distilling knowledge from the teacher of the previous stage to the student of the current stage, effectively cascading the knowledge from one stage to the next.
— The final stage’s student model is the one used for inference.

The key difference between staged KD in DeepSpeed MoS and standard knowledge distillation lies in the introduction of intermediate stages. By cascading knowledge across stages, DeepSpeed MoS aims to improve the training efficiency and performance of large-scale language models.
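
To make the distillation step concrete, here is a minimal sketch of a single standard KD training step (not DeepSpeed's actual MoS code): the student is trained on a mix of the hard labels and the teacher's softened output distribution, and staged KD simply repeats this with the previous stage's student acting as the next stage's teacher.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target KL loss (temperature T) and hard-label cross entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def kd_step(student, teacher, x, labels, optimizer):
    with torch.no_grad():
        teacher_logits = teacher(x)  # frozen teacher provides soft targets
    loss = kd_loss(student(x), teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()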


¹ Sparse expert models, of which Mixture-of-Experts (MoE) is the most popular variant, are neural networks where a set of the parameters is partitioned into "experts", each with a unique weight.

Sparse models are a generalization of a dense model; a sparse model with a single expert is roughly a dense model. Fundamentally, sparse models make it possible to vastly increase the number of parameters in a model by increasing the number of experts, while keeping the FLOPs per example approximately constant. This can be good or bad depending on the setup and how the model is going to be used later.

At a high level, sparsity is good when you have many accelerators (e.g. GPU/TPU) to host all the additional parameters that come with using sparsity. Typically models are trained with data parallelism, where different machines get different slices of the training/inference data. The machines operating on the different slices of data can then also host many more model parameters. Therefore, sparse models are a good fit when training with data parallelism and/or serving with high throughput: training/serving on many machines that can host all of the parameters.

² We demonstrate that in the absence of instruction tuning, MoE models fall short in performance when compared to dense models on downstream tasks.

Several research papers, especially at larger scales, note that MoE models transferred to new domains (as in fine-tuning) lag their dense counterparts. Fedus et al. (2021) and Narang et al. (2021) compared pre-training perplexity versus fine-tuning performance for dense and sparse models. They noticed that for a given pre-training perplexity, sparse models fine-tuned worse on reasoning tasks, but better on knowledge-heavy tasks. In addition to worse out-of-domain language modeling performance, Artetxe et al. (2021) observed worse fine-tuning compared to dense models on multiple tasks including HellaSwag, PIQA and Winogrande.

³ MoE inference requires more GPUs than inference for comparable dense models. This is because MoE models are typically much larger, and they require more memory to store. DeepSpeed also offers standard inference-related features such as quantization. Here is an example of how to run reduced-precision (dtype=torch.half) inference with DeepSpeed kernel injection:

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-125M',
                     device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=5)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

⁴ If an MoE Llama were to be developed, it would need to be trained from scratch, as converting the existing dense Llama2 model to MoE via fine-tuning would be pointless.

Update, Q4 2023: Mistral released Mixtral 8x7B, a high-quality sparse mixture-of-experts (SMoE) model with open weights.

Furthermore, we suspect that the brute-force, scale-up approach to improving LLM performance is reaching its limit. DeepMind recently announced the development of its next-generation LLM, which will incorporate memory and planning modules. The current generation of LLMs undoubtedly exhibits limitations, the most significant being hallucinations, and these issues cannot be reliably resolved using the existing approach.
