alternating neural attention for machine reading
Jan 11, 2017
5 minutes read

March 15th 2036: You sit with your child of four years, reading Rudyard Kipling’s The Jungle Book. After reading precisely 20 sentences of the story the child turns its face to you and says:

“Please caregiver, tell me what X represents in this sentence, ‘Yes,’ said X, ‘all the jungle fear Bagheera–all except Mowgli.’”

The child has carefully worded the query in this manner because you are not you, you are a Childcare Robot, and this is the only type of query you encountered in your training set. You respond, “Mowgli,” and the child nods knowingly and smiles. You twist your robo-face into your best imitation of a human smile.

Story Mode

I dumped all the attention outputs from the test set to see what the network was doing. You can pick a story from the drop-down on the left, and drag the slider to see the distributions at each step of the iteration. The labeled answer is listed below as well as the predicted answer from each time step.

Iterative Alternating Neural Attention for Machine Reading

I decided to get my hands dirty and implement one of the papers I had been reading. I picked a paper written by researchers at UdeM and Maluuba, a startup in Montreal (recently acquired by Microsoft).

Machine readers are models that attempt to develop some kind of document understanding that can then be applied to various tasks: summarization, question answering, captioning, sentiment analysis, etc. There are a number of great articles written about recurrent neural networks, so I won’t explain how they work here.

The Model

The model takes as input a document, and a ‘cloze-style’ query which is a sentence with a word missing. The output of the model is a prediction to fill the blank in the query.
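To make the input format concrete, here is a minimal sketch of turning a sentence into a cloze-style query. The `PLACEHOLDER` string and the `make_cloze` helper are names made up for this illustration, not something taken from the paper or from my implementation.

```python
PLACEHOLDER = "XXXXX"  # hypothetical blank marker, chosen for this sketch


def make_cloze(tokens, position):
    """Blank out the token at `position`; return (query_tokens, answer)."""
    answer = tokens[position]
    query = list(tokens)          # copy so the original sentence is untouched
    query[position] = PLACEHOLDER
    return query, answer


tokens = ["Yes", ",", "said", "Mowgli", ",", "all", "the", "jungle", "fear", "Bagheera"]
query, answer = make_cloze(tokens, 3)
print(" ".join(query))
print("answer:", answer)
```

The model sees the preceding story as the document plus `query`, and is trained to output `answer`.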

Bidirectional GRU for encoding the document and query

The document and query are split into tokens, each token is given a unique id, and the ids are used to train word embeddings. A common use of RNNs in NLP is for collapsing a sequence into a single vector. We hope that this vector encodes the meaning of the sentence in some high-dimensional space such that we can do interesting things with it! Many papers, including this one, use bidirectional RNNs in order to capture both the left-to-right and the right-to-left semantics. In this case the outputs of the forward and backward passes are concatenated.
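The encoding step can be sketched in plain NumPy. This is an illustration under my own assumptions, not the code behind the results below: the toy vocabulary, the dimensions, the small Gaussian initialization, and the helper names (`init_gru`, `gru_step`, `run_gru`, `encode_bidirectional`) are all invented for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(input_dim, hidden_dim, rng):
    """Random GRU parameters (small Gaussian init, an assumption of this sketch)."""
    p = {}
    for gate in ("z", "r", "h"):
        p["W" + gate] = rng.normal(0.0, 0.1, size=(input_dim, hidden_dim))
        p["U" + gate] = rng.normal(0.0, 0.1, size=(hidden_dim, hidden_dim))
        p["b" + gate] = np.zeros(hidden_dim)
    return p


def gru_step(p, x, h):
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"] + p["bz"])              # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"] + p["br"])              # reset gate
    h_new = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"] + p["bh"])    # candidate state
    return (1.0 - z) * h + z * h_new


def run_gru(p, xs):
    """Run the GRU over the rows of xs and return one hidden state per token."""
    h = np.zeros(p["bz"].shape[0])
    states = []
    for x in xs:
        h = gru_step(p, x, h)
        states.append(h)
    return np.stack(states)


def encode_bidirectional(p_fwd, p_bwd, xs):
    fwd = run_gru(p_fwd, xs)
    bwd = run_gru(p_bwd, xs[::-1])[::-1]   # flip back to the original token order
    return np.concatenate([fwd, bwd], axis=1)


rng = np.random.default_rng(0)
vocab = {"mowgli": 0, "fear": 1, "bagheera": 2}       # token -> unique id
embeddings = rng.normal(0.0, 0.1, size=(len(vocab), 4))
token_ids = [vocab[t] for t in ["bagheera", "fear", "mowgli"]]
xs = embeddings[token_ids]                            # (3 tokens, 4 dims)

p_fwd = init_gru(4, 5, rng)
p_bwd = init_gru(4, 5, rng)
encoded = encode_bidirectional(p_fwd, p_bwd, xs)
print(encoded.shape)  # (3, 10)
```

Each row of `encoded` is the forward state and the backward state for one token, concatenated, so every token's encoding carries context from both directions.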

Attentive query glimpses

The encoded query vectors are used to generate a glimpse of the query as a whole. The trick in this paper is to build two attention distributions, one over the document and one over the query. These distributions represent the likelihood of each token being useful for predicting the answer. An additional novel feature is that the query and document glimpses are reconstructed iteratively rather than in a single pass. The network alternates between glimpsing at the query and using the information gleaned to create a document glimpse, the idea being that many inference problems require more than one logical hop. The document glimpse at time t is then defined as

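$$\mathbf{d}_t = \sum_{i} d_{i,t} \mathbf{\tilde{d}}_{i}$$

where $d_{i,t}$ is the attention weight on document token $i$ at step $t$, and $\mathbf{\tilde{d}}_{i}$ is the encoding of that token.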

This is the expected value of the token encodings under the document attention distribution. Rather than sampling a token from the distribution, it produces a new vector that is a weighted average of the encodings, which lands near the most-attended tokens when the attention is sharply peaked. We hope that this lets the network exploit the linear structure of the embedding space and enables these glimpse vectors to capture the most important properties of the query/document.
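A glimpse is just a softmax followed by a weighted sum, which a few lines of NumPy can show. This is a simplified sketch: the bilinear scoring with a matrix `A` and bias `a`, the `control` vector, and all the shapes are assumptions I picked for illustration, and the values are random rather than trained.

```python
import numpy as np


def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()


def glimpse(encodings, control, A, a):
    """Attend over the rows of `encodings` and return (weights, weighted sum).

    encodings: (n_tokens, enc_dim), control: (ctrl_dim,),
    A: (enc_dim, ctrl_dim), a: (enc_dim,)
    """
    scores = encodings @ (A @ control + a)   # one score per token
    weights = softmax(scores)                # attention distribution over tokens
    return weights, weights @ encodings      # expected encoding under that distribution


rng = np.random.default_rng(0)
doc = rng.normal(size=(6, 10))       # 6 encoded document tokens
control = rng.normal(size=8)         # whatever vector steers the attention
A = rng.normal(size=(10, 8))
a = np.zeros(10)

weights, d_t = glimpse(doc, control, A, a)
print(weights.round(3), d_t.shape)
```

The same function works for the query glimpse by passing the encoded query tokens instead of the document.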

In order to parameterize and control these glimpses, the network maintains a context vector. Like the sequence encodings described above, it is the hidden state of a GRU, and it steers the query and document glimpses over the iterations. Finally, the last document attention distribution is used to compute the prediction. A word can appear several times in the document, so summing the attention over all of a word’s occurrences gives the total likelihood of that word being the answer, and the word with the highest total is the network’s prediction.
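Putting the pieces together, the alternating loop and the final prediction might look roughly like the sketch below. It makes several simplifications that are mine and not the paper's: a plain tanh recurrence stands in for the GRU that updates the context vector, the attention has no bias terms, and the weights are random, so the predicted word is meaningless. The names `attend`, `read`, and the `params` keys are hypothetical.

```python
import numpy as np


def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()


def attend(encodings, control, A):
    """Attention weights over the rows of `encodings`, plus their weighted sum."""
    weights = softmax(encodings @ (A @ control))
    return weights, weights @ encodings


def read(doc_enc, query_enc, doc_tokens, params, n_steps=3):
    state = np.zeros(params["W_s"].shape[0])          # context vector
    for _ in range(n_steps):
        # 1. glimpse at the query, steered by the current context vector
        _, q_glimpse = attend(query_enc, state, params["A_q"])
        # 2. glimpse at the document, steered by the context and the query glimpse
        doc_control = np.concatenate([state, q_glimpse])
        doc_weights, d_glimpse = attend(doc_enc, doc_control, params["A_d"])
        # 3. update the context vector (tanh recurrence standing in for a GRU)
        step_input = np.concatenate([state, q_glimpse, d_glimpse])
        state = np.tanh(params["W_s"] @ step_input)
    # Sum the last document attention over every occurrence of each word.
    totals = {}
    for token, w in zip(doc_tokens, doc_weights):
        totals[token] = totals.get(token, 0.0) + float(w)
    return max(totals, key=totals.get), totals


rng = np.random.default_rng(0)
E, S = 6, 4                                           # encoding and state sizes
doc_tokens = ["mowgli", "saw", "bagheera", "and", "mowgli", "ran"]
doc_enc = rng.normal(size=(len(doc_tokens), E))
query_enc = rng.normal(size=(4, E))
params = {
    "A_q": rng.normal(size=(E, S)),
    "A_d": rng.normal(size=(E, S + E)),
    "W_s": rng.normal(size=(S, S + 2 * E)),
}
answer, totals = read(doc_enc, query_enc, doc_tokens, params)
print(answer, totals)
```

Note that "mowgli" appears twice in the toy document, so its total is the sum of the attention at both positions.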


Below is a picture from TensorBoard of the likelihood that the network assigns to the correct answer. The x-axis is the probability in [0,1], the y-axis is the density (i.e. over the examples in that batch), and the z-axis is the training step.

You can see that as training proceeds (from back to front), the likelihood distribution for the answer shifts from near zero to near one, demonstrating the improvement over time.

Likelihood of Answer vs. Time

My implementation achieved 65% accuracy on the Children’s Book Test (Named Entity) test set, compared to the paper’s 68.6%. This was after ~8 epochs, at which point the validation loss started to increase. A few things are missing from my implementation, including proper orthogonal initialization of the GRU weights, and undoubtedly much more.

If you play around with the “Story Mode” section above, you may notice that the distribution doesn’t change that much, especially the query distribution.

I was anticipating a more dramatic change in distribution as the network alternates glimpses, but it seems to immediately latch on to names and proper nouns. My initial guess is that the network memorizes those entities because it has read parts of the same stories in the training set. This is a rather large problem, and so these results can’t really demonstrate generalization.

What’s next

I want to apply this model to the other datasets in the original paper: Facebook’s bAbI tasks, DeepMind’s CNN news dataset along with Maluuba’s additional question/answer data. I’m also interested in trying to apply Adaptive Computation Time, which allows an RNN to decide when to halt, rather than using a fixed number of iterations. It seems obvious to me that certain examples will be more difficult than others. Giving the network the ability to halt would allow it to take more computation time on more difficult examples, and maybe give more insight into how the glimpse mechanism works.
