Nathan Schucher
https://nathanschucher.com/index.xml
Recent content on Nathan Schucher
Hugo -- gohugo.io
en-US
Wed, 25 Jan 2017 15:30:59 -0500
probability theory as an extension of Aristotelian logic
https://nathanschucher.com/blog/probability-theory-as-an-extension-of-aristotelian-logic/
Wed, 25 Jan 2017 15:30:59 -0500
<p>Probability has always been a bit of a mystery to me. Manipulating basic probabilities according to the axioms of probability is fine. Venn diagrams representing joint distributions are fine. Reasoning about continuous probability is… fine. The calculations and algebra all make sense, but I never got it. <em>Probability Theory: The Logic of Science</em> by E. T. Jaynes starts from the ground up in a way that makes sense to me. Not only because it has so far avoided measure theory and infinite sets, but also because the goal is to design a robot that can reason (I think that’s pretty rad).</p>
<h1 id="aristotelian-logic">Aristotelian Logic</h1>
<p>Jaynes argues for developing probability theory as an extension of Aristotelian logic rather than building it on infinite set theory.</p>
<p>Aristotelian logic consists of propositions \(\{A, B, C, \ldots\}\) and premises of the form ‘if A then B’, from which conclusions can be drawn using two forms of argument:</p>
<blockquote>
<p>Major Premise: If A is true then B is true</p>
<p>Minor Premise: A is true</p>
<p>Conclusion: B is true</p>
</blockquote>
<p>Or</p>
<blockquote>
<p>Major Premise: If A is true then B is true</p>
<p>Minor Premise: B is false</p>
<p>Conclusion: A is false</p>
</blockquote>
<p>But how do we quantify the following, weaker argument?</p>
<blockquote>
<p>Major Premise: If A is true then B is true</p>
<p>Minor Premise: B is true</p>
<p>Conclusion: A is more plausible</p>
</blockquote>
<p>How much more plausible is A, now that we know B? Real numbers can be used to quantify this, so we need a robot that can reason about propositions (A, B, C, …) and their relative plausibilities.</p>
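<p>The three argument forms above can be checked by brute force. The following is a minimal sketch, assuming ‘if A then B’ is read as material implication; the helper name <code>implies</code> is made up for illustration. It lists every truth assignment consistent with the major premise and then looks at what each minor premise leaves possible.</p>

```python
from itertools import product

def implies(a, b):
    """Material implication: 'if a then b'."""
    return (not a) or b

# Every truth assignment (A, B) that is consistent with the major premise.
worlds = [(a, b) for a, b in product([True, False], repeat=2) if implies(a, b)]

# First argument: A is true, so B must be true in every remaining world.
modus_ponens = all(b for a, b in worlds if a)

# Second argument: B is false, so A must be false in every remaining world.
modus_tollens = all(not a for a, b in worlds if not b)

# Third argument: B is true. Which values of A are still possible?
a_given_b = sorted({a for a, b in worlds if b})

print(modus_ponens, modus_tollens, a_given_b)
```

<p>The first two checks hold in every world, which is why they are valid deductions. In the third case both values of A survive, so logic alone cannot settle it: all we can say is that one way for A to be false has been ruled out.</p>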
<h1 id="degree-of-plausibility-are-represented-by-real-numbers">Degrees of plausibility are represented by real numbers</h1>
<p>We would like to assign a real number representing the plausibility that a proposition A is true given the knowledge that B is true. This is denoted</p>
<p>$$ A | B $$</p>
<p>and is pronounced ‘A given B’. This is a real number! It is not a probability, and it need not lie in \([0, 1]\): in fact we know nothing about it other than that it is a real number.</p>
<h1 id="qualitative-correspondence-with-common-sense">Qualitative correspondence with common sense</h1>
<p>We then <em>choose</em> to view larger real numbers as representing a higher degree of plausibility. This is an intuitive and natural choice, but not a necessary one. We would also like consistency with the rules of Boolean logic: e.g. A+B|C represents the plausibility that at least one of A or B is true given C.</p>
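<p>As for correspondence with common sense, here is an example. Suppose C’ is C updated with new information that makes A more plausible,</p>
<p>$$(A|C') > (A|C)$$</p>
<p>while leaving the plausibility of B given A unchanged,</p>
<p>$$(B|AC') = (B|AC)$$</p>
<p>Then we expect that the plausibility of both A and B being true can only go up, never down:</p>
<p>$$(AB|C') \geq (AB|C)$$</p>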
<h2 id="consistency">Consistency</h2>
<p>Jaynes posits the following consistency axioms for the robot:</p>
<blockquote>
<p>If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.</p>
<p>The robot always takes into account all of the evidence it has relevant to a question. It does not arbitrarily ignore some of the information, basing its conclusions only on what remains. In other words, the robot is completely non-ideological.</p>
<p>The robot always represents equivalent states of knowledge by equivalent plausibility assignments. That is, if in two problems the robot’s state of knowledge is the same (except perhaps for the labeling of the propositions), then it must assign the same plausibilities in both.</p>
</blockquote>
<p>And that’s all. The previous desiderata about real numbers and common sense, along with these consistency requirements, are enough to develop a mathematical theory of plausible reasoning (i.e. probability theory). Notice there is no mention of the familiar axioms of probability theory. Although this is a different formulation, Jaynes insists that his theory agrees with Kolmogorov’s results, without any of the messiness of measure theory and the paradoxes of infinite sets.</p>
<h1 id="what-about-p-x">What about \(P(x)\)?</h1>
<p>Through some relatively straightforward derivation (and references to exhaustive proofs) Jaynes develops the familiar rules of probability. For example, we might like to know what (AB|C) is, and the answer involves finding a function F that satisfies</p>
<p>$$ (AB|C) = F[(B|C), (A|BC)] $$</p>
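<p>As a rough sketch of where this functional equation leads (the symbols \(x, y, z\) and \(w\) are introduced here for illustration, and the details are in Jaynes’ second chapter): applying the rule to \((ABC|D)\) with two different groupings, and demanding that both give the same answer, forces \(F\) to be associative,</p>
<p>$$ F[F(x, y), z] = F[x, F(y, z)] $$</p>
<p>The continuous, monotonic solutions of this equation can be written as \(w(F(x, y)) = w(x)\,w(y)\) for some monotonic function \(w\). Substituting back gives the product rule in its familiar form:</p>
<p>$$ w(AB|C) = w(A|BC)\, w(B|C) $$</p>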
<p>This is all in the first two chapters of the book. The rest goes on to develop familiar distributions and techniques using this robot. I’m still working through it, but the basic premise is compelling.</p>
alternating neural attention for machine reading
https://nathanschucher.com/blog/alternating-neural-attention-for-machine-reading/
Wed, 11 Jan 2017 16:31:57 -0500
<p><em>March 15th 2036</em>: You sit with your child of four years, reading Rudyard Kipling’s The Jungle Book. After reading precisely 20 sentences of the story the child turns its face to you and says:</p>
<p>“Please caregiver, tell me what X represents in this sentence, ‘Yes,’ said X, ‘all the jungle fear Bagheera–all except Mowgli.’”</p>
<p>The child has carefully worded the query in this manner because you are not you, you are a Childcare Robot, and this is the only type of query you encountered in your training set. You respond, “Mowgli,” and the child nods knowingly and smiles. You twist your robo-face into your best imitation of a human smile.</p>
<h2 id="story-mode">Story Mode</h2>
<p>I dumped all the attention outputs from the test set to see what the network was doing. You can pick a story from the drop-down on the left, and drag the slider to see the distributions at each step of the iteration. The labeled answer is listed below as well as the predicted answer from each time step.</p>
<div class="story" id="story-0">
<div>
<select></select>
<label class="glimpse">Glimpse</label>
<input type="range" min="1" max="8" value="1" step="1" />
<label>Answer: <span class="answer"></span></label>
<label>Predicted: <span class="predicted"></span></label>
</div>
<label>Document: </label>
<div class="document"></div>
<label>Query: </label>
<div class="query"></div>
</div>
<h2 id="iterative-alternating-neural-attention-for-machine-reading">Iterative Alternating Neural Attention for Machine Reading</h2>
<p>I decided to get my hands dirty and implement one of the papers I had been reading. I picked a <a href="https://arxiv.org/abs/1606.02245">paper</a> written by researchers at <a href="https://mila.umontreal.ca/">UdeM</a> and a startup in Montreal called Maluuba (<a href="http://www.maluuba.com/blog/2017/1/13/maluuba-microsoft">recently acquired by Microsoft</a>).</p>
<p>Machine readers are models that attempt to develop some kind of document understanding in order to apply it to various tasks: summarization, question answering, captioning, sentiment analysis, etc.
There are a <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">number</a> of <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">great</a> <a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">articles</a> written about <a href="http://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html">recurrent neural networks</a>, so I won’t explain how they work here.</p>
<h2 id="the-model">The Model</h2>
<p>The model takes as input a document, and a ‘cloze-style’ query which is a sentence with a word missing. The output of the model is a prediction to fill the blank in the query.</p>
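<p>To make the input format concrete, here is a small sketch of turning a sentence into a cloze-style query. The function name <code>make_cloze</code>, the <code>XXXXX</code> placeholder and the example sentence are all illustrative assumptions, not taken from my implementation or the dataset.</p>

```python
def make_cloze(sentence, answer, placeholder="XXXXX"):
    """Blank out the first occurrence of `answer` in a whitespace-tokenized sentence."""
    tokens = sentence.split()
    if answer not in tokens:
        raise ValueError(f"{answer!r} does not appear as a token in the sentence")
    blank_at = tokens.index(answer)
    query_tokens = tokens[:blank_at] + [placeholder] + tokens[blank_at + 1:]
    return " ".join(query_tokens), answer

# Made-up example sentence, for illustration only.
query, answer = make_cloze("Baloo taught Mowgli the Law of the Jungle", "Mowgli")
print(query)
print(answer)
```

<p>The model sees the document plus <code>query</code>, and is trained to output <code>answer</code>.</p>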
<h3 id="bidirectional-gru-for-encoding-the-document-and-query">Bidirectional GRU for encoding the document and query</h3>
<p>The document and query are split into tokens, each token is mapped to a unique id, and the ids are used to look up word embeddings that are trained along with the rest of the model.
A common use of RNNs in NLP is for collapsing a sequence into a single vector.
We hope that this vector encodes the meaning of the sentence in some high-dimensional space such that we can do interesting things with it!
Many papers, including this one, use bidirectional RNNs in order to capture both the left-to-right semantics and the right-to-left. In this case the outputs of the forward and backward pass are concatenated.</p>
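<p>A minimal NumPy sketch of this encoding step, under several assumptions: the weights are random and untrained, biases are left out, and the tiny vocabulary, the dimensions and all function names are made up for illustration. It is not the code from my implementation. It runs one GRU left-to-right, another right-to-left, and concatenates the two states at each token.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_dim, hidden_dim):
    """Random, untrained GRU weights (no biases, illustration only)."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {
        "Wz": mat(hidden_dim, input_dim), "Uz": mat(hidden_dim, hidden_dim),
        "Wr": mat(hidden_dim, input_dim), "Ur": mat(hidden_dim, hidden_dim),
        "Wh": mat(hidden_dim, input_dim), "Uh": mat(hidden_dim, hidden_dim),
    }

def gru_step(params, x, h):
    z = sigmoid(params["Wz"] @ x + params["Uz"] @ h)              # update gate
    r = sigmoid(params["Wr"] @ x + params["Ur"] @ h)              # reset gate
    h_tilde = np.tanh(params["Wh"] @ x + params["Uh"] @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_tilde

def run_gru(params, inputs):
    """Return the hidden state after each input, shape (num_tokens, hidden_dim)."""
    h = np.zeros(params["Uz"].shape[0])
    states = []
    for x in inputs:
        h = gru_step(params, x, h)
        states.append(h)
    return np.stack(states)

def encode_bidirectional(forward_params, backward_params, embeddings):
    forward_states = run_gru(forward_params, embeddings)
    # Read the sequence right-to-left, then flip back so rows line up with tokens.
    backward_states = run_gru(backward_params, embeddings[::-1])[::-1]
    return np.concatenate([forward_states, backward_states], axis=1)

vocab = {"mowgli": 0, "feared": 1, "nothing": 2}       # token -> unique id
embedding_table = rng.normal(size=(len(vocab), 4))     # one 4-d embedding per id
token_ids = [vocab[w] for w in ["mowgli", "feared", "nothing"]]
embeddings = embedding_table[token_ids]                # shape (3, 4)

encoded = encode_bidirectional(init_gru(4, 5), init_gru(4, 5), embeddings)
print(encoded.shape)  # (3, 10): 5 forward + 5 backward dimensions per token
```

<p>Each row of <code>encoded</code> is the encoding of one token in context, which is what the attention steps below operate on.</p>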
<h3 id="attentive-query-glimpses">Attentive query glimpses</h3>
<p>The encoded query vectors are used to generate a <em>glimpse</em> of the query as a whole.
The trick in this paper is to build two attention distributions, one over the document and one over the query.
These distributions represent the likelihood of each token being useful for predicting the answer.
An additional novel feature is that the query and document glimpses are rebuilt iteratively rather than computed in a single pass.
The network alternates between glimpsing at the query and using the information gleaned to create a document glimpse,
the idea being that many inference problems require more than one logical hop. The document glimpse at time \(t\) is the sum of the encoded document tokens \(\mathbf{\tilde{d}}_i\), weighted by the attention \(d_{i,t}\) on token \(i\):</p>
<p>$$\mathbf{d}_t = \sum_{i} d_{i,t} \mathbf{\tilde{d}}_{i}$$</p>
<p>This is the expected value of the token encodings under the document attention distribution.
Rather than sampling a token from the distribution, it produces a new vector that sits close to the encodings of the most-attended tokens.
We hope that this lets the network exploit the linear structure of the embedding space, so that these glimpse vectors capture the most important properties of the query/document.</p>
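<p>Here is a small NumPy sketch of the weighted sum above. It assumes the unnormalized attention scores are already given (in the model they are computed from the token encodings and the network’s current state), and the names <code>softmax</code> and <code>glimpse</code> and the toy numbers are made up for illustration.</p>

```python
import numpy as np

def softmax(scores):
    """Turn unnormalized scores into a distribution that sums to one."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

def glimpse(encodings, scores):
    """Attention-weighted sum of token encodings.

    encodings: array of shape (num_tokens, dim)
    scores:    array of shape (num_tokens,), unnormalized
    """
    attention = softmax(scores)
    return attention, attention @ encodings

# Three toy 2-d token encodings.
encodings = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0]])

# Equal scores give uniform attention, so the glimpse is the plain average.
uniform_attention, uniform_glimpse = glimpse(encodings, np.array([0.0, 0.0, 0.0]))

# One dominant score puts nearly all the mass on the first token,
# so the glimpse lands almost on top of that token's encoding.
peaked_attention, peaked_glimpse = glimpse(encodings, np.array([20.0, 0.0, 0.0]))

print(uniform_glimpse, peaked_glimpse)
```

<p>The same function works for the query glimpse: only the encodings and the scores change.</p>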
<p>In order to parameterize and control these glimpses, the network maintains a <em>context</em> vector.
Like the sequence encodings described above, it is a single vector that summarizes what has been read so far, and it is again implemented as a GRU.
This controls the query and document glimpses over the iterations.
Finally, the last document attention distribution is used to compute the prediction.
Because a word can appear multiple times in the document, summing the attention over every occurrence of a word gives the total likelihood of that word being the answer, and the word with the highest total is the network’s prediction.</p>
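<p>The final summation step can be sketched in a few lines of plain Python. The function name <code>predict_answer</code>, the toy document and the attention values are invented for illustration and are not output from the trained network.</p>

```python
from collections import defaultdict

def predict_answer(document_tokens, attention):
    """Sum attention over every occurrence of each word and pick the largest total."""
    totals = defaultdict(float)
    for token, weight in zip(document_tokens, attention):
        totals[token] += weight
    best = max(totals, key=totals.get)
    return best, dict(totals)

# Made-up document and final-step attention weights (they sum to one).
document = ["Bagheera", "saw", "Mowgli", "and", "Mowgli", "ran"]
attention = [0.35, 0.05, 0.25, 0.05, 0.25, 0.05]

answer, totals = predict_answer(document, attention)
print(answer, totals[answer])
```

<p>Note that “Bagheera” has the single largest attention weight here, but “Mowgli” wins once its two occurrences are added together.</p>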
<h2 id="training">Training</h2>
<p>Below is a picture from TensorBoard of the likelihood that the network has assigned to the correct answer. The x-axis is the probability [0,1], the y-axis is the density (i.e. over the number of examples in that batch), and the z-axis is the training step.</p>
<p>You can see that as the training proceeds (from back to front), the likelihood distribution for the answer shifts from near zero to near one, demonstrating the improvement over time.</p>
<p><img src="https://nathanschucher.com/img/answer_probability.png" alt="Likelihood of Answer vs. Time" /></p>
<p>My implementation achieved 65% accuracy on the Children’s Book Test (Named Entity) test set, compared to the paper’s 68.6%.
This was after ~8 epochs, after which the validation loss started to increase.
There are a few things missing from my implementation, including proper orthogonal initialization of the GRU weights, <a href="http://blog.dennybritz.com/2017/01/17/engineering-is-the-bottleneck-in-deep-learning-research/">but undoubtedly much more</a>.</p>
<p>If you play around with the “Story Mode” section above, you may notice that the distribution doesn’t change <em>that</em> much, especially the query distribution.</p>
<p>I was anticipating a more dramatic change in distribution as the network alternates glimpses, but it seems to immediately latch on to names and proper nouns. My initial guess is that the network memorizes those entities because it has read parts of the same stories in the training set. This is a rather large problem, and so these results can’t really demonstrate generalization.</p>
<h2 id="what-s-next">What’s next</h2>
<p>I want to apply this model to the other datasets in the original paper: Facebook’s bAbI tasks, DeepMind’s CNN news dataset along with Maluuba’s additional question/answer data. I’m also interested in trying to apply <a href="https://arxiv.org/abs/1603.08983">Adaptive Computation Time</a>, which allows an RNN to decide when to halt, rather than using a fixed number of iterations. It seems obvious to me that certain examples will be more difficult than others. Giving the network the ability to halt would allow it to take more computation time on more difficult examples, and maybe give more insight into how the glimpse mechanism works.</p>