Google Gemini, OpenAI GPT-5, Multimodality and Open Source Attempts to Catch Up

Ranko Mosic
8 min read · Dec 27, 2023
JPEG is worth a thousand words ( ancient proverb ).

Today’s models mostly focus on one sense. Pathways will enable multiple senses.

Pathways

People rely on multiple senses to perceive the world. That’s very different from how contemporary AI systems digest information. Most of today’s models process just one modality of information at a time. They can take in text, or images or speech — but typically not all three at once.

Pathways could enable multimodal models that encompass vision, auditory, and language understanding simultaneously. So whether the model is processing the word “leopard,” the sound of someone saying “leopard,” or a video of a leopard running, the same response is activated internally: the concept of a leopard. The result is a model that’s more insightful and less prone to mistakes and biases.

Last year was the year of LLMs, but even more so the year of Large Multimodal Models ( LMMs ), also called Multimodal Foundation Models. A common theme is that pretrain / finetune algorithms produce text², audio, video etc. representations³ and model their interactions. The end result should be not only improved image, video, audio and text understanding, but emergent world models and grounding, ultimately leading to AGI¹.

What has been proven to work so far ( GPT etc. ) is emergent model capabilities unlocked by massive, scalable compute deployed on massive unlabeled datasets in a self-supervised manner ( i.e. automated methods with minimal or no human in the loop ). Multimodal models often try to emulate a similar approach — to find scalable data preparation, representation, pretrain and fine-tune methods. This will of course multiply the needed compute, which is why Sam Altman is now gunning for a $7T chip venture. MM is a much more complex undertaking than LLM next-word prediction. We now deal with images, video, audio — pretty much any modality⁸ — which not only have to be efficiently jointly represented, but also cross-referenced with other modalities — via some kind of attention mechanism — and then successfully generated.

Gemini is a family of generative AI models developed by Google DeepMind that is designed for multimodal use cases. Gemini models are trained on a dataset that is both multimodal and multilingual. Our pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data.

Gemini Ultra is huge:

Training Gemini Ultra used a large fleet of TPUv4 accelerators across multiple datacenters.

Instruction tuning encompasses supervised fine-tuning (SFT) and reinforcement learning through human feedback (RLHF) using a reward model. We apply instruction tuning in both text and multimodal settings.

Open Source

OpenFlamingo is an open-source attempt to replicate DeepMind's Flamingo.

BLIP-2 is Salesforce's open-sourced vision-to-language generation model, which outperforms ( Google DeepMind's ) Flamingo-80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.
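For reference, a minimal usage sketch, assuming the Hugging Face transformers BLIP-2 integration and the Salesforce/blip2-opt-2.7b checkpoint ( the image path and prompt are placeholders ):

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("leopard.jpg")   # any local image (placeholder path)
prompt = "Question: what animal is in the photo? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))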

GIT ( Generative Image-to-text Transformer ) is an open-sourced Microsoft model.

Compared with the concurrent work of Flamingo (Alayrac et al., 2022), we achieve higher accuracy (+5.4) on TextVQA and lower (-3.29) on VQAv2. Note that Flamingo’s model size is 80B, which is 114 times of ours (0.7B).

¹ AGI is not a clearly defined term. We will also need a world model and action to get there i.e. MM by itself will not do.

Multimodality ultimately implies any-modality input/output. An interesting research project is 4M — an EPFL / Apple collaboration.

4M

Another interesting open-source example is Microsoft's Kosmos-2:

This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence.

² DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192.
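A minimal sketch of that joint token stream ( the offset scheme and random ids below are illustrative assumptions; the paper only specifies the two vocabularies and token counts ):

import numpy as np

TEXT_VOCAB, TEXT_LEN = 16384, 256     # BPE caption tokens
IMAGE_VOCAB, IMAGE_LEN = 8192, 1024   # discrete image tokens (a 32 x 32 grid)

rng = np.random.default_rng(0)
caption_tokens = rng.integers(0, TEXT_VOCAB, size=TEXT_LEN)
image_tokens = rng.integers(0, IMAGE_VOCAB, size=IMAGE_LEN)

# One autoregressive stream: image token ids are shifted past the text
# vocabulary so the two token types never collide.
sequence = np.concatenate([caption_tokens, image_tokens + TEXT_VOCAB])
print(sequence.shape)   # (1280,) -- 256 text tokens followed by 1024 image tokens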

³ CLIP is a very useful and widely used ( even by Google Gemini ) MM embedding model ( yet another hit from OpenAI's A. Radford, of GPT fame ).

We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks.

We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others.
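As an illustration of that zero-shot transfer, a hedged sketch of CLIP-style zero-shot classification — the encoders below are random-vector placeholders standing in for CLIP's actual towers:

import numpy as np

rng = np.random.default_rng(0)
d_e = 128
class_names = ["leopard", "lion", "house cat"]

def encode_text(prompts):
    # placeholder for CLIP's text tower: random unit vectors
    v = rng.normal(size=(len(prompts), d_e))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def encode_image(image_path):
    # placeholder for CLIP's image tower: a random unit vector
    v = rng.normal(size=d_e)
    return v / np.linalg.norm(v)

# Class names become natural-language prompts; the best-matching prompt wins.
prompts = [f"a photo of a {name}" for name in class_names]
scores = encode_text(prompts) @ encode_image("leopard.jpg")   # cosine similarities
print(class_names[int(np.argmax(scores))])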

CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings.

Pix2Seq casts object detection as a language modeling task conditioned on the observed pixel inputs.

With Pix2Seq, we propose a quantization and serialization scheme that converts bounding boxes and class labels into sequences of discrete tokens⁵ (similar to captions), and leverage an encoder-decoder architecture to perceive pixel inputs and generate the sequence of object descriptions. The training objective function is simply the maximum likelihood of tokens conditioned on pixel inputs and the preceding tokens.

A limitation is that the current approach for training Pix2Seq is entirely based on human annotation; by reducing that dependence, the model could benefit from more unlabeled data.

Joint or separate embeddings? CLIP uses encoding, i.e. it projects text and images into a common vector space and uses cosine similarity for matching ( unlike DALL·E, there is no discrete token vocabulary for images ). Numpy-like pseudocode for the core of an implementation of CLIP:

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
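The same computation as a runnable sketch: random features stand in for the image and text encoders, the helper functions are filled in, and all dimensions are illustrative rather than CLIP's real ones:

import numpy as np

n, d_i, d_t, d_e = 4, 512, 256, 128
rng = np.random.default_rng(0)

I_f = rng.normal(size=(n, d_i))      # stand-in for image_encoder(I)
T_f = rng.normal(size=(n, d_t))      # stand-in for text_encoder(T)
W_i = rng.normal(size=(d_i, d_e))    # learned image projection
W_t = rng.normal(size=(d_t, d_e))    # learned text projection
t = np.log(1 / 0.07)                 # learned temperature (CLIP's init value)

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy_loss(logits, labels, axis):
    # log-softmax along `axis`, then negative log-likelihood of the true pair
    z = logits - logits.max(axis=axis, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=axis, keepdims=True))
    idx = np.arange(len(labels))
    picked = log_probs[labels, idx] if axis == 0 else log_probs[idx, labels]
    return -picked.mean()

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(I_f @ W_i)
T_e = l2_normalize(T_f @ W_t)

# scaled pairwise cosine similarities [n, n]; matching pairs on the diagonal
logits = (I_e @ T_e.T) * np.exp(t)

# symmetric contrastive loss over both directions of the similarity matrix
labels = np.arange(n)
loss = (cross_entropy_loss(logits, labels, axis=0) +
        cross_entropy_loss(logits, labels, axis=1)) / 2
print(loss)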

In commonly used object detection datasets, images have variable numbers of objects, represented as sets of bounding boxes and class labels. In Pix2Seq, a single object, defined by a bounding box and class label, is represented as [ymin, xmin, ymax, xmax, class]. However, typical language models are designed to process discrete tokens (or integers) and are unable to comprehend continuous numbers. So, instead of representing image coordinates as continuous numbers, we normalize the coordinates between 0 and 1 and quantize them into one of a few hundred or thousand discrete bins. The coordinates are then converted into discrete tokens as are the object descriptions, similar to image captions, which in turn can then be interpreted by the language model. The quantization process is achieved by multiplying the normalized coordinate (e.g., ymin) by the number of bins minus one, and rounding it to the nearest integer.
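A minimal sketch of that quantization step, assuming 1,000 bins and an illustrative box ( the class token id below is made up ):

import numpy as np

def quantize_box(box_norm, num_bins=1000):
    # normalized [ymin, xmin, ymax, xmax] in [0, 1] -> discrete bin indices
    return np.round(np.asarray(box_norm) * (num_bins - 1)).astype(int).tolist()

def dequantize_box(tokens, num_bins=1000):
    # back to approximate normalized coordinates
    return [tok / (num_bins - 1) for tok in tokens]

box = [0.0, 0.0, 0.5, 0.5]          # a box covering the top-left quarter of the image
class_token = 1042                   # hypothetical token id for a class label
print(quantize_box(box) + [class_token])   # [0, 0, 500, 500, 1042]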

Multimodal Foundation Models

⁷ Gemini is based on the landmark Flamingo project.

Flamingo fuses large language models with powerful visual representations — each separately pre-trained and frozen — by adding novel architectural components in between. Then it is trained on a mixture of complementary large-scale multimodal data coming only from the web, without using any data annotated for machine learning purposes. Following this method, we start from Chinchilla, our recently introduced compute-optimal 70B parameter language model, to train our final Flamingo model, an 80B parameter VLM. After this training is done, Flamingo can be directly adapted to vision tasks via simple few-shot learning without any additional task-specific tuning.

Text generation is performed by a Transformer decoder, conditioned on the visual representations produced by the Perceiver Resampler. We interleave pretrained and frozen text-only LM blocks with blocks trained from scratch that cross-attend to the visual output from the Perceiver Resampler.
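A hedged structural sketch of that interleaving ( class names and sizes below are illustrative, not Flamingo's actual code ): frozen LM blocks alternate with newly trained cross-attention blocks whose tanh gates start at zero, so the frozen LM's behavior is initially unchanged:

import numpy as np

class FrozenLMBlock:
    # stand-in for a pretrained, frozen text-only Transformer block
    def __call__(self, text_states):
        return text_states   # weights frozen; shapes preserved

class GatedCrossAttentionBlock:
    # newly trained block: text tokens attend to the Perceiver Resampler's
    # visual tokens; the tanh gate is initialized at zero
    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.gate = 0.0   # tanh(0) = 0

    def __call__(self, text_states, visual_tokens):
        q = text_states @ self.Wq
        k = visual_tokens @ self.Wk
        v = visual_tokens @ self.Wv
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores -= scores.max(axis=-1, keepdims=True)
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)
        return text_states + np.tanh(self.gate) * (attn @ v)

# interleave: one cross-attention block before each frozen LM block
d_model, n_text, n_visual = 64, 10, 8
rng = np.random.default_rng(0)
text = rng.normal(size=(n_text, d_model))
visual = rng.normal(size=(n_visual, d_model))   # Perceiver Resampler output

blocks = [(GatedCrossAttentionBlock(d_model, seed=i), FrozenLMBlock()) for i in range(4)]
for xattn, lm in blocks:
    text = lm(xattn(text, visual))
print(text.shape)   # (10, 64)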

The Flamingo vision encoder is trained on the huge ALIGN dataset ( a frequency-filtered, Conceptual Captions-style dataset, 200x bigger ) and the smaller LTIP dataset.

⁸ MM even extends to the physical world — RT-2 takes multimodality to the next level by incorporating action tokens in its training dataset and as output, i.e. extending multimodality to robotics and perhaps opening a path to improvements in the currently depressed self-driving arena.

Our main contribution is RT-2, a family of models derived from fine-tuning large vision-language models trained on web-scale data to directly act as generalizable and semantically aware robotic policies⁴.

The vision-language models that we build on in this work take as input one or more images and produce a sequence of tokens, which conventionally represents natural language text.

In this work, we adapt two previously proposed VLMs to act as VLA models: PaLI-X (Chen et al., 2023a) and PaLM-E (Driess et al., 2023).

To enable vision-language models to control a robot, they must be trained to output actions. We take a direct approach to this problem, representing actions as tokens in the model’s output, which are treated in the same way as language tokens.

RT-2

To make RT-2 easily compatible with large, pre-trained vision-language models, our recipe is simple: we represent robot actions as another language, which can be cast into text tokens and trained together with Internet-scale vision-language datasets. In particular, we co-fine-tune (a combination of fine-tuning and co-training where we keep some of the old vision & text data around) an existing vision-language model with robot data. The robot data includes the current image, language command and the robot action at the particular time step. We represent the robot actions as text strings as shown below. An example of such a string could be a sequence of robot action token numbers: “1 128 91 241 5 101 127 217”.
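A hedged sketch of that action-to-text step ( the action layout, value ranges, and 256-bin choice below are illustrative assumptions, not RT-2's exact scheme ):

import numpy as np

def action_to_token_string(action, low, high, num_bins=256):
    # discretize a continuous action vector into integer bins and render it as
    # the space-separated token string appended to the model's text output
    action = np.asarray(action, dtype=float)
    normalized = (action - low) / (high - low)
    bins = np.clip(np.round(normalized * (num_bins - 1)), 0, num_bins - 1)
    return " ".join(str(int(b)) for b in bins)

# Illustrative 8-dim action: terminate flag, 3 position deltas, 3 rotation deltas, gripper.
low = np.array([0.0, -0.1, -0.1, -0.1, -np.pi, -np.pi, -np.pi, 0.0])
high = np.array([1.0, 0.1, 0.1, 0.1, np.pi, np.pi, np.pi, 1.0])
action = np.array([0.0, 0.02, -0.03, 0.05, 0.1, -0.2, 0.3, 1.0])
print(action_to_token_string(action, low, high))
# -> a string of eight integer bin ids, analogous to "1 128 91 241 5 101 127 217"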
