Google BERT — Pre-Training and Fine-Tuning for NLP Tasks

Ranko Mosic
Nov 5, 2018

The recently released BERT paper and code generated a lot of excitement in the ML/NLP community¹.

BERT is a method of pre-training language representations, meaning that we train a general-purpose “language understanding” model on a large text corpus ( BooksCorpus and Wikipedia ) and then use that model for the downstream NLP tasks ( fine-tuning )¹⁴ that we care about.

Models pre-trained with BERT achieved better-than-human performance on SQuAD 1.1 and lead the SQuAD 2.0 leaderboard³. BERT relies on massive compute for pre-training ( 4 days on 4 to 16 Cloud TPUs; pre-training on 8 GPUs would take 40–70 days, i.e. is not feasible ). Fine-tuning BERT also requires a lot of processing power, which makes it less attractive and practical for all but very specific tasks¹⁸. Typical uses are fine-tuning BERT for a particular task or using it for feature extraction.

BERT generates contextual, bidirectional word representations ( the same word gets a different vector in each context ), as opposed to its predecessors ( word2vec, GloVe ), which assign a single static vector per word.

BERT proposes a new training objective: the “masked language model” (MLM)¹³. The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.
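As a toy illustration of the objective ( not the actual BERT pre-processing code; the real procedure, including the 80/10/10 [MASK]/random/unchanged split, is described in the paper ), masking could look like this sketch:

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Toy sketch: hide ~15% of tokens and remember the originals as prediction targets."""
    masked, labels = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok           # the model must predict this original token from context alone
            masked[i] = mask_token
    return masked, labels

tokens = ["the", "chips", "from", "his", "wood", "pile", "refused", "to", "kindle", "a", "fire"]
masked_tokens, mlm_labels = mask_tokens(tokens)   # output varies from run to run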

The basic BERT building block is the Transformer¹⁹ ( as opposed to RNN-based options like BiLSTM ). Central to the Transformer is the notion of attention: contextual co-occurrence statistics¹⁷.

The Transformer is simpler and more parallelizable ( GPU friendly ), i.e. faster than an RNN: it uses only straightforward matrix multiplications and a simple feed-forward neural network of a few layers, with no recurrence and no weight sharing. BERT implements only the Transformer encoder part¹⁶.

A BERT sentence classification demo is available for free on a Colab Cloud TPU. The BERT language model is fine-tuned for the MRPC task ( semantic equivalence of sentence pairs ).

For example, if input sentences are:

Ranko Mosic is one of the world’s foremost experts in the Natural Language Processing arena. In a world where there aren’t that many NLP experts, Ranko is the one.

The model will conclude these two sentences are equivalent ( label = 1 ).

Where embeddings, tokenization and numericalization occur⁴ ( figure from Dissecting BERT )

The BERT pre-training example tokenizes sentences from sample_text.txt ( matching tokens against the contents of vocab.txt and assigning them ids ):

The chips from his wood pile refused to kindle a fire to dry his bed-clothes, and he had recourse to a more provident neighbor’s to supply the deficiency.

The above sample sentence is WordPiece tokenized⁵ ( following initial basic tokenization: converting all tokens to lower case and splitting on punctuation ) into:

['the', 'chips', 'from', 'his', 'wood', 'pile', 'refused', 'to', 'kind', '##le', 'a', 'fire', 'to', 'dry', 'his', 'bed', '-', 'clothes', ',', 'and', 'he', 'had', 'rec', '##ours', '##e', 'to', 'a', 'more', 'provide', '##nt', 'neighbor', "'", 's', 'to', 'supply', 'the', 'deficiency', '.']
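To reproduce this, the tokenizer from the BERT repo ( tokenization.py ) can be used roughly as in the sketch below; the vocab path is a placeholder and the exact API should be checked against the repo:

import tokenization  # tokenization.py from the google-research/bert repository

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

text = ("The chips from his wood pile refused to kindle a fire to dry his bed-clothes, "
        "and he had recourse to a more provident neighbor's to supply the deficiency.")
tokens = tokenizer.tokenize(text)                    # basic tokenization followed by WordPiece
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # e.g. "wood" -> 3536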

Below is the layout of a final record written to the output file. 15% of tokens are randomly masked⁶, a segment id is added ( 0 or 1, i.e. A or B, padded to 128, the maximum sequence length; segments/sentences can contain content from different actual sentences ); sentences are randomly shuffled and a randomized next_sentence label is added.
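Schematically, one such record carries roughly the fields below ( field names as I recall them from create_pretraining_data.py, so treat them as approximate; values are shortened and padding to length 128 is not shown ):

record = {
    "tokens": ["[CLS]", "the", "chips", "from", "his", "[MASK]", "pile", "[SEP]",
               "it", "was", "[MASK]", ".", "[SEP]"],
    "segment_ids": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],   # 0 = segment A, 1 = segment B
    "masked_lm_positions": [5, 10],                            # which positions were masked
    "masked_lm_labels": ["wood", "cold"],                      # original tokens to predict
    "is_random_next": True,                                    # randomized next_sentence label
}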

Embedding starts⁷ with a randomly initialized embedding_table ( modeling.py ); its shape is (30522, 768), i.e. ( vocab_size, embedding vector size ):

Randomly initialized embedding_table

BERT uses tf.nn.embedding_lookup(embedding_table, input_ids) to match each input token_id ( input_id ) with its initial random 768-dimensional embedding:

Part of initial embedding for token_id 3536 ( “wood”) — array of 768 random numbers with stdev 0.02
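In NumPy terms the lookup is plain row indexing; a sketch with the shapes from the article ( the ids other than 3536 are made up ):

import numpy as np

vocab_size, hidden_size = 30522, 768
rng = np.random.default_rng(0)
embedding_table = rng.normal(0.0, 0.02, size=(vocab_size, hidden_size))   # stddev 0.02, as in modeling.py

input_ids = np.array([101, 1996, 3536, 102])     # hypothetical ids; 3536 = "wood" per the article
input_embeddings = embedding_table[input_ids]    # what tf.nn.embedding_lookup returns here: (4, 768)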

The next step is to add a positional embedding to the input embedding. Positional embedding is described in section 3.5 of Attention Is All You Need.
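BERT learns its position embeddings rather than using the fixed sinusoids of the original Transformer; either way, the operation is an element-wise add, sketched below ( the token-type/segment embeddings BERT also adds are omitted ):

import numpy as np

seq_length, hidden_size = 128, 768
token_embeddings = np.random.normal(0.0, 0.02, (seq_length, hidden_size))     # from the lookup above
position_embeddings = np.random.normal(0.0, 0.02, (seq_length, hidden_size))  # learned in BERT, random here
embedding_output = token_embeddings + position_embeddings                     # shape stays (128, 768)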

Multi-head attention starts with the attention mask ( 1.0 for positions we want to attend to and 0.0 for masked positions ); the procedure below returns Tensor("bert/encoder/mul:0", shape=(8, 128, 128)).

Attention Mask
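In NumPy terms the ( 8, 128, 128 ) mask is the per-example padding mask broadcast across the query dimension; a sketch with the shapes from the article:

import numpy as np

batch_size, seq_length = 8, 128
input_mask = np.zeros((batch_size, seq_length), dtype=np.float32)   # 1 = real token, 0 = padding
input_mask[:, :100] = 1.0                                           # hypothetical: 100 real tokens per example

# Broadcast to ( batch, from_seq, to_seq ): every query position may attend to every non-padded key position
attention_mask = np.ones((batch_size, seq_length, 1), dtype=np.float32) * input_mask[:, None, :]
print(attention_mask.shape)   # (8, 128, 128)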

Below is a call to transformer_model:

The BERT Transformer has a configurable ( bert_config.json ) number of self-attention heads ( it is self-attention because from_tensor and to_tensor are the same tensor, layer_input, with shape (1024, 768) ):

from_tensor and to_tensor are transformed into query_layer, key_layer and value_layer via tf.layers.dense¹⁰:
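A NumPy sketch of the projections ( tf.layers.dense is essentially x @ kernel + bias; the bias is omitted here, and 12 heads of width 64 are the BERT-Base values ):

import numpy as np

tokens_total, hidden_size, num_heads = 1024, 768, 12
size_per_head = hidden_size // num_heads          # 64
layer_input = np.random.randn(tokens_total, hidden_size)

W_q = np.random.randn(hidden_size, num_heads * size_per_head) * 0.02
W_k = np.random.randn(hidden_size, num_heads * size_per_head) * 0.02
W_v = np.random.randn(hidden_size, num_heads * size_per_head) * 0.02

query_layer = layer_input @ W_q   # (1024, 768), later split per head into chunks of 64
key_layer   = layer_input @ W_k
value_layer = layer_input @ W_v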

This is the core moment ( Scaled Dot-Product Attention in Figure 2 below ): the dot-product similarity ( attention, i.e. attention_scores ) between query and key is calculated:

Image From Attention Is All You Need

A standard dropout is applied ( with keep probability 1.0 – 0.1 = 0.9 ):

Input tensor before and after dropout: 10% of values are set to zero, the rest are multiplied by 1/0.9 = 1.11111

Finally, scaled dot-product attention is the matrix multiplication of attention_probs and value_layer:
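Pulling the last few steps together, here is a minimal NumPy sketch of scaled dot-product attention for a single head ( scores, softmax, dropout, weighted sum of values ); it mirrors the description above, not the BERT code itself:

import numpy as np

def scaled_dot_product_attention(query, key, value, keep_prob=0.9, rng=None):
    """Single-head sketch: attention_probs @ value, with the scores scaled by sqrt(d_k)."""
    rng = rng or np.random.default_rng(0)
    size_per_head = query.shape[-1]
    attention_scores = query @ key.T / np.sqrt(size_per_head)          # (seq_len, seq_len)
    # softmax over the "to" positions
    exp = np.exp(attention_scores - attention_scores.max(axis=-1, keepdims=True))
    attention_probs = exp / exp.sum(axis=-1, keepdims=True)
    # standard dropout: zero ~10% of the probabilities, rescale the rest by 1/keep_prob
    keep = rng.random(attention_probs.shape) < keep_prob
    attention_probs = attention_probs * keep / keep_prob
    return attention_probs @ value                                      # (seq_len, size_per_head)

q = k = v = np.random.randn(128, 64)    # one head: seq_length=128, size_per_head=64
context_layer = scaled_dot_product_attention(q, k, v)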

We have now completed computation for a single attention layer¹¹.

Next comes the feed-forward part, which is split into three dense layers; only the intermediate step has the gelu activation function, while the outer layers feature dropout and layer normalization¹² ( for faster training ). layer_outputs with shape (1024, 768) are appended to the all_layers list ( each layer_output is one of the Nx layers in Figure 1 above ):

Feed Forward
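A NumPy sketch of the intermediate and output steps ( gelu, residual add, layer normalization ); dropout and the learned layer-norm scale/shift are omitted for brevity, and the 3072 intermediate size is the BERT-Base value:

import numpy as np

def gelu(x):
    # gelu approximation used in the BERT code
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * np.power(x, 3))))

def layer_norm(x, eps=1e-12):
    # normalize over the last axis; learned gamma/beta omitted
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

hidden_size, intermediate_size = 768, 3072            # BERT-Base values from bert_config.json
attention_output = np.random.randn(1024, hidden_size)

W_in = np.random.randn(hidden_size, intermediate_size) * 0.02
W_out = np.random.randn(intermediate_size, hidden_size) * 0.02

intermediate_output = gelu(attention_output @ W_in)                         # only this step uses gelu
layer_output = layer_norm(intermediate_output @ W_out + attention_output)   # residual add + layer norm, (1024, 768)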

layer_outputs are finally brought back to their original shape of ( 32, 128, 768 ).

¹ The NY Times wrote about BERT. In a nutshell, BERT is a humongous encoder: it provides state-of-the-art contextual representations learned from a huge text corpus: Wikipedia/BooksCorpus -> BERT -> word encodings ( model, i.e. weights ).

Oct 2020 Google Update: BERT is now used in almost every English search and we've expanded to dozens of languages.

“Encoder-only” models like BERT are designed to produce a single prediction per input token or a single prediction for an entire input sequence. This makes them applicable for classification or span prediction tasks but not for generative tasks like translation or abstractive summarization.

Hugging Face raised a $40 million Series B ( Mar 11, 2021 ). It develops open-source Transformer-based software.

A number of papers have been published further analyzing Transformers and their limitations.

³ https://twitter.com/stanfordnlp/status/1066742978381639680

⁵ Tokenizes a piece of text into its word pieces. For example, “unaffable” = [“un”, “##aff”, “##able”]; a greedy longest-match-first algorithm is used to perform tokenization with the given vocabulary. Sentences are randomly shuffled. Each token is assigned a token_id ( unique across all segments ); for example, the ‘wood’ token_id is 3536.
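A toy sketch of the greedy longest-match-first idea ( tiny hypothetical vocabulary, not BERT's vocab.txt ):

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece: try the longest prefix first, then shrink."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate     # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]                   # nothing matched: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))    # ['un', '##aff', '##able']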

⁶ Please refer to page 6 of the BERT paper for more details on why and how masking is done.

⁷ Next, we get the embedding for each word in the sequence. Each word of the sequence is mapped to an emb_dim dimensional vector that the model will learn during training. You can think about it as a vector look-up for each token. The elements of those vectors are treated as model parameters and are optimized with back-propagation just like any other weights ( Dissecting BERT ).

¹⁰ This layer implements the operation ( a standard NN layer ): outputs = activation(inputs * kernel + bias) where activation is the activation function passed as the activation argument (if not None), kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only if use_bias is True).

¹¹ The outputs of the multiple attention heads are concatenated:

attention_output = tf.concat(attention_heads, axis=-1)
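For BERT-Base that is 12 heads of width 64 concatenated back to hidden_size = 768; a NumPy sketch:

import numpy as np

attention_heads = [np.random.randn(1024, 64) for _ in range(12)]   # one context layer per head
attention_output = np.concatenate(attention_heads, axis=-1)        # (1024, 768)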

¹² layer_norm:

¹³ This is an example of self-supervised learning

¹⁴ Real-life BERT-based applications ( aside from Google search¹ ) also include sentiment analysis, classification and QA systems.

¹⁵ Interview with BERT first author Jacob Devlin

¹⁶ GPT-2 is using Transformer decoders.

¹⁷ The input sequence is split into vectorized tokens; logically, each token is a query that is correlated with the rest of the tokens, the keys ( and their corresponding values ).

¹⁸ General-purpose pretrained sentence encoders such as BERT are not ideal for real-world conversational AI applications; they are computationally heavy, slow, and expensive to train.

¹⁹ Dec 2, 2020 Update: it seems that DeepMind AlphaFold2, a protein shape prediction algorithm, also uses an attention mechanism: “For the latest version of AlphaFold, used at CASP14, we created an attention-based neural network system, trained end-to-end..”

Jul 16, 2021 Update: it is now confirmed that AlphaFold2 utilizes attention ( the mechanism is similar to Reformer ).

DeepMind AlphaStar is also using Transformers: Observations of player and opponent units are processed using a self-attention mechanism.
