Salesforce decaNLP Multitask Question Answering Network

Ranko Mosic
Jul 23, 2018 · 8 min read

The recently open-sourced decaNLP model and code make it possible to easily deploy a sophisticated, integrated question answering system. While the idea of a unified NLP architecture is not new (the Collobert/Weston 2008 paper received the ICML 2018 Test of Time award; transfer across multiple learning tasks was explored in 1998 in Lifelong Learning Algorithms by S. Thrun), decaNLP provides direct insight into the latest research directed towards general NLP models and their practical uses. It also demonstrates how a number of concepts, techniques and tools, both emerging and established (pretrained embeddings like GloVe, transfer learning, dropout, LSTMs, PyTorch), are put together to perform a meaningful task.

One of the subtasks in decaNLP is based on the SQuAD dataset. I reproduced the decaNLP SQuAD train and evaluation parts on a Google Cloud host with a single GPU (an Nvidia K80, on an instance with 50 GB of RAM), using PyTorch 0.3 (the latest PyTorch release is 0.4, but decaNLP does not support it yet), CUDA 9.0 and cuDNN 7, as per the Dockerfile content. It is not necessary to work with Docker files; the components can also be installed individually via Miniconda and pip, in case you want to do it the harder way. I went down this path in a futile attempt to avoid using a GPU.
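
For reference, the single-task SQuAD training run is launched with a command along these lines (a sketch reconstructed from the decaNLP README; exact flags may differ between versions):

```
python train.py --train_tasks squad --gpus 0
```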

decaNLP is, for now, able to run on a single GPU only (hence --gpus 0 in the command above).

The usual generic data preprocessing steps (file load, train/val/test split) occur here via torchtext, a PyTorch library specialized for NLP tasks (torchtext additionally simplifies tokenization, vocabulary creation, embedding incorporation, etc.).

The preprocessing format mask is defined via:
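
Here is a minimal sketch of the field definition (the argument values are assumptions modeled on decaNLP's train.py and may differ from the actual code):

```python
import torchtext

# reversible field: tokenization can later be undone to recover the original text
FIELD = torchtext.data.ReversibleField(
    batch_first=True,
    init_token='<init>',   # start-of-sequence token prepended to each example
    eos_token='<eos>',     # end-of-sequence token appended to each example
    include_lengths=True,  # also return true sequence lengths, needed for packing
    lower=True)            # lowercase all tokens
```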

torchtext.data.ReversibleField is a subclass of the Field class (itself a subclass of RawField) which adds reversible tokenization capabilities via the revtok library. The Field class defines how data will be preprocessed.

Input SQuAD train data (V2 added on GitHub)
Contexts, questions and answers from the above train file

Contexts are paragraphs taken from the English Wikipedia, and answers are sequences of words copied from the context.

On the first run, the contexts, questions and answers from the training data are tokenized, stripped of extra spaces, etc., and the results are written via torch.save (serialized to a pickle binary-format file in the .cache directory) to be reused in subsequent runs:
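
The caching pattern looks roughly like this (the file names and the preprocess helper are hypothetical, for illustration only):

```python
import os
import torch

def preprocess(rows):
    # hypothetical stand-in for decaNLP's tokenization and cleanup
    return [row.strip().split() for row in rows]

cache_path = os.path.join('.cache', 'squad_train_examples.pt')  # hypothetical name
if os.path.exists(cache_path):
    examples = torch.load(cache_path)    # subsequent runs: load the pickled cache
else:
    os.makedirs('.cache', exist_ok=True)
    with open('train.jsonl') as f:       # hypothetical raw training file
        examples = preprocess(f)         # first run: tokenize, strip spaces, etc.
    torch.save(examples, cache_path)     # serialize into .cache via torch.save
```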

On the first run decaNLP will download 100-dimensional character n-gram embeddings and 300-dimensional GloVe embeddings¹ and concatenate them into a single 400-dimensional vector per token⁵:

100-dimensional embedding for the 4-gram DODO
100-dimensional embedding for the word first
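
In torchtext this corresponds to loading both vector sets and letting build_vocab concatenate them per token (using the 840B GloVe variant here is my assumption):

```python
from torchtext.vocab import GloVe, CharNGram

# 300-d GloVe word vectors + 100-d character n-gram vectors -> 400-d per token
vectors = [GloVe(name='840B', dim=300), CharNGram()]
```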

The generic² preloaded embeddings are transformed into a vocab object for a specific text/task (a vocabulary object that will be used to numericalize a field; torchtext is part of the PyTorch ecosystem), with the vocab.stoi attribute (a collections.defaultdict instance mapping token strings to numerical identifiers) and the vocab.itos attribute (a list of token strings indexed by their numerical identifiers):

The vocabulary is built via the torchtext.data.ReversibleField.build_vocab method, which combines the tokenized training and validation datasets with the preloaded embeddings, i.e. it counts word occurrences and orders the vocabulary by frequency:

self.vocab = Counter({' the ': 2013186, ', ': 1511643, '. ': 1048976, ' of ': 1010833, ' and ': 780201, ' in ': 722761, ' to ': 564613, 'a ': 472420, ': ': 427046, '-': 260389, ' is ': 248163, ' was ': 242769, ' as ': 239669, ' s ': 212798, ') ': 210872, ' (': 202493, ' question ': 198179, ' context ': 197655, ' for ': 197614, '? ': 195444, ' by ': 194364, ' that ': 180701, "'": 180698, ' with ': 175583
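
A sketch of the call itself, continuing the snippets above (the split names are illustrative):

```python
# counts token occurrences across the splits and attaches the 400-d vectors
FIELD.build_vocab(train_split, val_split, vectors=vectors)
```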

vectors: An indexed iterable (or other structure supporting __getitem__) that, given an input index, returns a FloatTensor representing the vector for the token associated with the index. For example, vector[stoi["string"]] should return the vector for "string".
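
For example, fetching the vector of a token after build_vocab chains the two attributes (note that revtok tokens keep their surrounding spaces, as in the Counter dump above):

```python
# token -> index (stoi) -> row of the pretrained vectors tensor
vec = FIELD.vocab.vectors[FIELD.vocab.stoi[' question ']]
print(vec.shape)  # torch.Size([400])
```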

The final vocab tensor will be:

self.vocab =
-0.1115 0.0791 0.0165 … 0.1272 0.4475 -0.0767
-1.2320 -0.2417 -0.8178 … 0.0000 0.0000 0.0000
-0.5703 -0.1804 -1.1885 … 0.1430 -0.9636 -0.4526
… ⋱ …
-0.3704 0.2480 -0.5484 … 0.0000 0.0000 0.0000
-0.1485 -0.0669 -0.7864 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
[torch.FloatTensor of size 87176x400]

During training, decaNLP will display a set of sample questions every 1000 iterations:

The model will be saved to a checkpoint directory, which can later be used for inference/evaluation.

Context and Question Encoding

The central part of decaNLP³ is the decaNLP/models/multitask_question_answering_network.py file, which contains the MultitaskQuestionAnsweringNetwork(nn.Module) class shared by all ten tasks in decaNLP. The steps in the Figure 2 description are executed for any particular task. The MQAN model has two main parts: an encoder (text data, i.e. questions and context, is transformed into numerical data) and a decoder (numerical data is transformed back into text, i.e. answers).

The context and question encoding above (Step 1 in Figure 2) is achieved in a few steps which convert text to a numerical representation suitable for further mathematical transformations. Once encoding is completed we no longer deal with text; we deal only with word representations, i.e. numbers. The encoding of both context and question sequences is designed to capture local and global interdependencies.

A BiLSTM is shared between context and question; they are blended together so that they map into the same space and are conditioned on each other. The combination of the initial word vector and the output of the BiLSTM is then fed to the rest of the model⁴.

PackedLSTM is implemented via the standard PyTorch nn.LSTM:
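
A simplified sketch of such a wrapper (decaNLP's actual PackedLSTM differs in its details):

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class PackedLSTM(nn.Module):
    """nn.LSTM wrapper that skips padding via PyTorch's pack/pad utilities."""
    def __init__(self, d_in, d_hid, bidirectional=True):
        super().__init__()
        self.rnn = nn.LSTM(d_in, d_hid, bidirectional=bidirectional,
                           batch_first=True)

    def forward(self, x, lengths, hidden=None):
        # lengths must be sorted in decreasing order in older PyTorch versions
        packed = pack_padded_sequence(x, lengths, batch_first=True)
        out, hidden = self.rnn(packed, hidden)
        out, _ = pad_packed_sequence(out, batch_first=True)  # back to padded form
        return out, hidden
```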

Alignment — Step 2 from Figure 2

Alignment (Step 2 from Figure 2), by example: take the first question vector and take a dot product with all of the vectors in the context; do a softmax over those dot products to get attention weights, and then use the attention weights to sum over the context vectors. This is done for every single question vector with respect to the set of context vectors, and for every single context vector with respect to the set of question vectors. These weighted-summation representations are then fed into another layer of an LSTM, and the result is called coattention³; context and question are blended together and conditioned on each other (Step 3, Figure 2, the CoattentiveLayer class).
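
In code, the alignment step is just batched dot-product attention (a minimal sketch; decaNLP adds projections on top of this):

```python
import torch
import torch.nn.functional as F

def align(Q, C):
    """Q: batch x m x d question vectors; C: batch x n x d context vectors."""
    scores = torch.bmm(Q, C.transpose(1, 2))  # dot product of every Q/C pair
    weights = F.softmax(scores, dim=-1)       # attention weights over positions
    return torch.bmm(weights, C)              # weighted sums of context vectors
```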

Step 4, Figure 2: the shared BiLSTM is now separated into two channels for the rest of the model

Step 4 is self-attention: each question token is conditioned on the other tokens of the same question, and each context token on the other tokens of the same context.
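
Reusing the align sketch above, self-attention is the case where a sequence attends to itself (X stands for either the context or the question representation):

```python
# each position is summarized in terms of every other position in the
# same sequence: context with context, question with question
self_summary = align(X, X)
```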

Step 5, Figure 2 — Final Encoding Step

Answer Generation — Decoding

Step 6, Figure 2: Answer generation — the three decoder layers match the last three encoder layers

How we choose which word to output: we read in an initial token to get started and then produce an output state using an LSTM.
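
A greedy-decoding sketch of that loop (vocab, embedding, decoder_lstm, out_proj and max_len are hypothetical stand-ins; decaNLP's decoder additionally has pointer mechanisms for copying words from the context and question):

```python
import torch

token = vocab.stoi['<init>']                  # read in the initial token
hidden, answer = None, []
for _ in range(max_len):
    emb = embedding(torch.tensor([[token]]))  # 1 x 1 x d input embedding
    out, hidden = decoder_lstm(emb, hidden)   # produce the next output state
    token = out_proj(out).argmax(-1).item()   # pick the highest-scoring word
    if token == vocab.stoi['<eos>']:          # stop at end-of-sequence
        break
    answer.append(vocab.itos[token])
```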

The whole sequence is started via (shown for the context; it is quite similar for the question and answer):

¹ GloVe paper digest.

² Google’s Jeff Dean doesn’t think the current approach is good enough (15:00): “multitask learning has been a sort of modest area of research in machine learning, but it’s more like multitask learning of three or four things typically, not a thousand or a million … I think that’s kind of the direction we need to head in: really being able to build flexible systems that can do lots of things.

We as a community haven’t really found the right approaches … you kind of used unsupervised data for a while and then when you have a problem you care about you start with that representation that’s been learned in the unsupervised domain and then you try to sort of refine it with the supervised data but that’s pretty unlikely … humans learn from unsupervised … your entire childhood you kind of wander around the world you take in lots of unsupervised data and then occasionally you get some supervised signal …you get this interleaving and refining of understanding the world in exactly the moments you want it; I think that’s really the way to leverage unsupervised data .. to use a lot of it but to interleave it with these rich sort of high-value supervised signals rather than using it as a pre step and then trying to do supervised learning on top of it.”

³ Bryan McCann, the first author and the coder for most of decaNLP, explains decaNLP in more detail in this video (the MQAN part starts around 16:00).

⁵ GloVe is trained with character n-grams as input.

