<h1>Learning About Deep Reinforcement Learning (Slides)</h1>
<p><em>Scott Rome (Math Ph.D. who works in Machine Learning) · 2018-04-18</em></p>
<p>Earlier this month, I gave an introductory talk at Data Philly on deep reinforcement learning. The talk followed the Nature paper on teaching neural networks to play Atari games by Google DeepMind and was intended as a crash course on deep reinforcement learning for the uninitiated. Get the slides below!</p>
<h2 id="abstract">Abstract</h2>
<p>Deep Reinforcement Learning sounds intractable for a layperson. “Reinforcement learning” alone sounds scary (I’m reinforcing WHAT again? Random actions? Bah!), and then you add the word “deep” and most people simply drop off. This talk details what I’ve learned from replicating the original NIPS/Nature papers on Deep Reinforcement Learning for playing Atari games and walks through some Python implementations of the basic steps for reinforcement learning code. The talk uses Keras (TensorFlow backend) and tries (albeit, unsuccessfully) to hide advanced mathematical formulations of the problem. By the end of the talk, we will all have taken a modest step forward in learning deep reinforcement learning.</p>
<h2 id="slides">Slides</h2>
<p>You can access the slides <a href="../files/dataphillytalk/data_philly_04022017.pdf">here</a>.</p>
<h1>Understanding Attention in Neural Networks Mathematically</h1>
<p><em>2018-03-23</em></p>
<p>Attention has gotten plenty of attention lately, after yielding state-of-the-art results in multiple fields of research. From <a href="https://arxiv.org/pdf/1502.03044.pdf">image captioning</a> and <a href="https://arxiv.org/pdf/1409.0473.pdf">language translation</a> to <a href="https://arxiv.org/pdf/1612.07411.pdf">interactive question answering</a>, Attention has quickly become a key tool to which researchers must attend. Some have taken notice and even postulate that <a href="https://arxiv.org/pdf/1706.03762.pdf">attention is all you need</a>. But what is Attention anyway? Should you pay attention to Attention? Attention enables the model to focus in on important pieces of the feature space. In this post, we explain how the Attention mechanism works mathematically and then implement the equations using Keras. We conclude by discussing how to “see” the Attention mechanism at work by identifying important words for a classification task.</p>
<p>Attention manifests differently in different contexts. In some NLP problems, Attention allows the model to put more emphasis on different words during prediction, e.g. from <a href="https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf">Yang et al.</a>:</p>
<p><img src="../images/attention/attention.png" alt="png" /></p>
<p>Interestingly, there is <a href="https://arxiv.org/pdf/1506.07285.pdf">increasing</a> <a href="https://arxiv.org/pdf/1503.08895.pdf">research</a> that calls this mechanism “Memory”, which some claim is a better name for such a versatile layer. Indeed, the Attention layer can allow a model to “look back” at previous examples that are relevant at prediction time, and the mechanism has been used in exactly that way in so-called Memory Networks. A discussion of Memory Networks is outside the scope of this post, so for our purposes we will discuss Attention named as such, and explore the properties suggested by that name.</p>
<h2 id="the-attention-mechanism">The Attention Mechanism</h2>
<p>Mathematically, the Attention mechanism is deceptively simple given its far-reaching applications. I am going to go over the style of Attention that is common in NLP applications. A good overview of different approaches can be found <a href="http://ruder.io/deep-learning-nlp-best-practices/index.html#attention">here</a>. In this case, Attention can be broken down into a few key steps:</p>
<ol>
<li>MLP: A one layer MLP acting on the hidden state of the word</li>
<li>Word-level Context: A vector is dotted with the output of the MLP</li>
<li>Softmax: The resulting vector is passed through a softmax layer</li>
<li>Combination: The attention vector from the softmax is combined with the input state that was passed into the MLP</li>
</ol>
<p>Now we will describe this in equations. The notation generally follows the <a href="https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf">paper</a> on Hierarchical Attention by Yang et al., but includes insight from the notation used in <a href="https://arxiv.org/pdf/1612.07411.pdf">this paper</a> by Li et al. as well.</p>
<p>Let <script type="math/tex">h_{t}</script> be the word representation (either an embedding or the, possibly concatenated, hidden state of an RNN) of dimension <script type="math/tex">K_w</script> for the word at position <script type="math/tex">t</script> in a sentence padded to length <script type="math/tex">M</script>. Then we can describe the steps via</p>
<p>\begin{equation} u_t = \tanh(W h_t + b)\end{equation}
\begin{equation} \alpha_t = \text{softmax}(v^T u_t)\end{equation}
\begin{equation} s = \sum_{t=1}^{M} \alpha_t h_t\end{equation}</p>
<p>If we define <script type="math/tex">W\in \mathbb{R}^{K_w \times K_w}</script> and <script type="math/tex">v,b\in\mathbb{R}^{K_w}</script>, then <script type="math/tex">u_t \in \mathbb{R}^{K_w}</script>, <script type="math/tex">\alpha_t \in \mathbb{R}</script> and <script type="math/tex">s\in\mathbb{R}^{K_w}</script>, where the softmax is taken over the <script type="math/tex">M</script> positions so that the <script type="math/tex">\alpha_t</script> sum to <script type="math/tex">1</script>. (Note: this is the multiplicative application of Attention.) Even though there is a lot of notation, it is still only three equations. How can models improve so much when using Attention?</p>
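<p>To make the three equations concrete, here is a minimal NumPy sketch of the forward pass. All parameters here are random stand-ins for learned weights, chosen only so the dimensions match the definitions above:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

M, K = 5, 8                     # sentence length M, word-representation dimension K_w
h = rng.normal(size=(M, K))     # toy hidden states h_t, one row per word position

# Stand-ins for the learned parameters W, b, v
W = rng.normal(size=(K, K))
b = rng.normal(size=K)
v = rng.normal(size=K)

# 1. MLP: u_t = tanh(W h_t + b), applied to every position at once
u = np.tanh(h @ W.T + b)                        # shape (M, K)

# 2. Word-level context: scalar scores v^T u_t
scores = u @ v                                  # shape (M,)

# 3. Softmax over the M positions
alpha = np.exp(scores) / np.exp(scores).sum()   # shape (M,), sums to 1

# 4. Combination: weighted sum of the original hidden states
s = (alpha[:, None] * h).sum(axis=0)            # shape (K,)
```

Note that the softmax normalizes across positions in the sentence, not across the dimensions of any one word vector.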
<h3 id="how-attention-actually-works-geometrically">How Attention Actually Works (Geometrically)</h3>
<p>To answer that question, we’re going to explain what each step actually does geometrically, and this will illuminate names like “word-level context vector”. I recommend reading my <a href="http://srome.github.io/Visualizing-the-Learning-of-a-Neural-Network-Geometrically/">post</a> on how neural networks bend and twist the input space, so this next bit will make a lot more sense.</p>
<h4 id="the-single-layer-mlp">The Single Layer MLP</h4>
<p>First, remember that <script type="math/tex">h_t</script> is an embedding or representation of a word in a vector space. That’s a fancy way of saying that we pick some N-dimensional space, and we pick a point for every word (hopefully in a clever way). For <script type="math/tex">N=2</script>, we would be drawing a bunch of dots on a piece of paper to represent our words. Every layer of the neural net moves these dots around (and transforms the paper itself!), and this is the key to understanding how attention works. The first piece</p>
<p>\begin{equation}u_t = \tanh(W h_t + b)\end{equation}</p>
<p>takes the word, rotates/scales it, and then translates it. So, this layer rearranges the word representation in its current vector space. The tanh activation then twists and bends (but does not break!) the vector space into a manifold. Because <script type="math/tex">W:\mathbb{R}^{K_w}\to\mathbb{R}^{K_w}</script> (i.e., it does not change the dimension of <script type="math/tex">h_t</script>), no information is lost in this step. If <script type="math/tex">W</script> projected <script type="math/tex">h_t</script> to a lower dimensional space, then we would lose information from our word embedding. Embedding <script type="math/tex">h_t</script> into a higher dimensional space would not give us any new information about <script type="math/tex">h_t</script>, but it would add computational complexity, so no one does that.</p>
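<p>The “no information lost” claim can be checked numerically: a square <script type="math/tex">W</script> (which a randomly initialized matrix is, almost surely, invertible) can be undone exactly, and tanh is invertible via arctanh. This is only a toy check with random values, not part of the actual model:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
W = rng.normal(size=(K, K))   # square, generically invertible
b = rng.normal(size=K)
h = rng.normal(size=K)        # a toy word representation

z = W @ h + b                 # the affine part of the layer
u = np.tanh(z)                # the activation

# Both steps can be inverted, so h is fully recoverable from u
z_back = np.arctanh(u)
h_back = np.linalg.solve(W, z_back - b)
```

If instead `W` had shape `(K-1, K)`, `np.linalg.solve` would fail and `h` could not be recovered: that is exactly the information loss the paragraph describes.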
<h4 id="the-word-level-context-vector">The Word-Level Context Vector</h4>
<p>The word-level context vector is then applied via a dot product:
\begin{equation}v^T u_t\end{equation}
As both <script type="math/tex">v,u_t\in \mathbb{R}^{K_w}</script>, it is <script type="math/tex">v</script>’s job to learn information of this new vector space induced by the previous layer. You can think of <script type="math/tex">v</script> as combining the components of <script type="math/tex">u_t</script> according to their relevance to the problem at hand. Geometrically, you can imagine that <script type="math/tex">v</script> “emphasizes” the important dimensions of the vector space.</p>
<p>Let’s see an example. Say we are working on a classification task: is this word an animal or not. Imagine you have a word embedding defined by the following image:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">vectors</span> <span class="o">=</span> <span class="p">{</span><span class="s">'dog'</span> <span class="p">:</span> <span class="p">[</span><span class="o">.</span><span class="mi">1</span><span class="p">,</span><span class="o">.</span><span class="mi">5</span><span class="p">],</span> <span class="s">'cat'</span><span class="p">:</span> <span class="p">[</span><span class="o">.</span><span class="mi">5</span><span class="p">,</span><span class="o">.</span><span class="mi">48</span><span class="p">],</span> <span class="s">'snow'</span> <span class="p">:</span> <span class="p">[</span><span class="o">.</span><span class="mi">5</span><span class="p">,</span><span class="o">.</span><span class="mo">02</span><span class="p">],</span> <span class="s">'rain'</span> <span class="p">:</span> <span class="p">[</span><span class="o">.</span><span class="mi">4</span><span class="p">,</span> <span class="o">.</span><span class="mo">05</span><span class="p">]</span> <span class="p">}</span>
<span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">vectors</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">xy</span><span class="o">=</span><span class="n">item</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="n">item</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
</code></pre></div></div>
<p><img src="../images/attention/output_1_0.png" alt="png" /></p>
<p>In this toy example, if this were all of our data, we could assume that the x-axis component of the word embedding is NOT useful for the task of classifying animals. So, a choice of <script type="math/tex">v</script> might be the vector <script type="math/tex">[0,1]</script>, which would pluck out the <script type="math/tex">y</script> dimension (clearly the important one) and discard the <script type="math/tex">x</script>. The resulting clustering would look like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Here we apply v, which zeros out the first dimension</span>
<span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">vectors</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">annotate</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">item</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">item</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
</code></pre></div></div>
<p><img src="../images/attention/output_3_0.png" alt="png" /></p>
<p>We can clearly see that <script type="math/tex">v</script> has identified important information in the vector space representing the words. This is what the word-level context vector’s role is. It picks out the important parts of the vector space for our model to focus in on. This means that the previous layer will move the input vector space around so that <script type="math/tex">v</script> can easily pick out the important words! In fact, the softmax values of “dog” and “cat” will be larger in this case than “rain” and “snow”. That’s exactly what we want.</p>
<p>The next layer does not have a particularly geometric flavor to it. The softmax is a monotonic transformation (i.e., the ordering of the importances picked out by <script type="math/tex">v^Tu_t</script> stays the same). This allows the final representation of the input sentence to be a linear combination of the <script type="math/tex">h_t</script>, weighting the word representations by their importance. There is, however, another way to view the Attention output…</p>
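<p>Both properties of the softmax step, monotonicity and normalization, are easy to verify with a few lines of NumPy (the scores here are hypothetical <script type="math/tex">v^Tu_t</script> values):</p>

```python
import numpy as np

scores = np.array([2.0, -1.0, 0.5, 2.5])        # hypothetical v^T u_t values
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over positions

# Monotonic: the ranking of the weights matches the ranking of the raw scores
ranking_preserved = (np.argsort(scores) == np.argsort(alpha)).all()

# Normalized: the weights sum to 1, ready to form a convex combination of the h_t
total = alpha.sum()
```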
<h3 id="probabilistic-interpretation-of-the-output">Probabilistic Interpretation of the Output</h3>
<p>The output is easy to understand probabilistically. Because <script type="math/tex">\alpha=[\alpha_t]_{t=1}^M</script> sums to <script type="math/tex">1</script>, this yields a probabilistic interpretation of Attention. Each weight <script type="math/tex">\alpha_t</script> indicates the probability that the word is important to the problem at hand. There are some versions of Attention (“hard attention”) which actually sample a word according to this probability and return only that word from the Attention layer. In our case, the resulting vector is the expectation of the important words. In particular, if our random variable <script type="math/tex">X</script> is defined to take on the value <script type="math/tex">h_t</script> with probability <script type="math/tex">\alpha_t</script>, then the output of our Attention mechanism is actually <script type="math/tex">\mathbb{E}[X]</script>.</p>
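<p>This expectation view can be checked numerically with toy values: averaging many “hard attention” samples (each sample picks one <script type="math/tex">h_t</script> with probability <script type="math/tex">\alpha_t</script>) recovers the soft Attention output. The weights and hidden states below are random stand-ins:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

M, K = 4, 3
h = rng.normal(size=(M, K))               # toy hidden states h_t
alpha = np.array([0.1, 0.6, 0.2, 0.1])    # attention weights, sum to 1

# Soft attention output: s = sum_t alpha_t h_t = E[X]
s = alpha @ h

# Hard attention: sample an index t with probability alpha_t, return h_t.
# The empirical mean of many such samples approximates E[X].
idx = rng.choice(M, size=200_000, p=alpha)
empirical = h[idx].mean(axis=0)
```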
<h1 id="the-data">The Data</h1>
<p>To demonstrate this, we’re going to use data from a recent <a href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/">Kaggle competition</a> on using NLP to detect toxic comments. The competition aims to label a text with six different (somewhat correlated) labels: “toxic”, “severe_toxic”, “obscene”, “threat”, “insult”, and “identity_hate”. We will expect the Attention layer to give more importance to negative words that could indicate one of those labels has occurred. Let’s do a quick data load with some boilerplate code…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.text</span> <span class="kn">import</span> <span class="n">Tokenizer</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.sequence</span> <span class="kn">import</span> <span class="n">pad_sequences</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">train</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'train.csv'</span><span class="p">)</span>
<span class="n">train</span><span class="o">=</span><span class="n">train</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">list_classes</span> <span class="o">=</span> <span class="p">[</span><span class="s">"toxic"</span><span class="p">,</span> <span class="s">"severe_toxic"</span><span class="p">,</span> <span class="s">"obscene"</span><span class="p">,</span> <span class="s">"threat"</span><span class="p">,</span> <span class="s">"insult"</span><span class="p">,</span> <span class="s">"identity_hate"</span><span class="p">]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">train</span><span class="p">[</span><span class="n">list_classes</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Example of a data point</span>
<span class="n">train</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">159552</span><span class="p">]</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id 07ea42c192ead7d9
comment_text "\n\nafd\nHey, MONGO, why dont you put a proce...
toxic 0
severe_toxic 0
obscene 0
threat 0
insult 0
identity_hate 0
Name: 2928, dtype: object
</code></pre></div></div>
<p>More NLP boilerplate to tokenize and pad the sequences:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">max_features</span> <span class="o">=</span> <span class="mi">20000</span>
<span class="n">maxlen</span> <span class="o">=</span> <span class="mi">200</span>
<span class="n">list_sentences</span> <span class="o">=</span> <span class="n">train</span><span class="p">[</span><span class="s">'comment_text'</span><span class="p">]</span><span class="o">.</span><span class="n">values</span>
<span class="c"># Tokenize</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">Tokenizer</span><span class="p">(</span><span class="n">num_words</span><span class="o">=</span><span class="n">max_features</span><span class="p">)</span>
<span class="n">tokenizer</span><span class="o">.</span><span class="n">fit_on_texts</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">list_sentences</span><span class="p">))</span>
<span class="n">list_tokenized_train</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">(</span><span class="n">list_sentences</span><span class="p">)</span>
<span class="c"># Pad</span>
<span class="n">X_t</span> <span class="o">=</span> <span class="n">pad_sequences</span><span class="p">(</span><span class="n">list_tokenized_train</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">maxlen</span><span class="p">)</span>
</code></pre></div></div>
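<p>For readers unfamiliar with these Keras utilities, here is a rough pure-Python sketch of what <code>Tokenizer</code> and <code>pad_sequences</code> do (not Keras’s actual implementation: real tokenization also lowercases and strips punctuation, but the behavior on this toy corpus matches):</p>

```python
from collections import Counter

texts = ["the cat sat", "the dog barked at the cat"]  # toy corpus

# Rank words by frequency; index 0 is reserved for padding, as in Keras,
# so the most frequent word gets index 1
counts = Counter(w for t in texts for w in t.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

sequences = [[word_index[w] for w in t.split()] for t in texts]

def pad(seq, maxlen):
    # Keras pads and truncates on the left ('pre') by default
    return [0] * (maxlen - len(seq)) + seq[-maxlen:]

X = [pad(s, maxlen=6) for s in sequences]
```

So every comment becomes a fixed-length row of word indices, padded with zeros on the left, which is exactly the shape the <code>Embedding</code> layer below expects.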
<h2 id="keras-architecture">Keras Architecture</h2>
<p>There are many <a href="https://gist.github.com/cbaziotis/7ef97ccf71cbc14366835198c09809d2">versions</a> of Attention out there that implement a custom Keras layer and do the calculations with low-level calls to the Keras backend. In my implementation, I’d like to avoid this and instead build up the Attention layer from standard Keras layers, in an attempt to demystify what is going on.</p>
<h4 id="attention-implementation">Attention Implementation</h4>
<p>For our implementation, we’re going to follow the formulas used above. The approach was inspired by <a href="https://github.com/philipperemy/keras-attention-mechanism/blob/master/attention_lstm.py">this repository</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Activation</span><span class="p">,</span> <span class="n">Concatenate</span><span class="p">,</span> <span class="n">Permute</span><span class="p">,</span> <span class="n">SpatialDropout1D</span><span class="p">,</span> <span class="n">RepeatVector</span><span class="p">,</span> <span class="n">LSTM</span><span class="p">,</span> <span class="n">Bidirectional</span><span class="p">,</span> <span class="n">Multiply</span><span class="p">,</span> <span class="n">Lambda</span><span class="p">,</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Dropout</span><span class="p">,</span> <span class="n">Input</span><span class="p">,</span><span class="n">Flatten</span><span class="p">,</span><span class="n">Embedding</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="kn">import</span> <span class="nn">keras.backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="k">class</span> <span class="nc">Attention</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inp</span><span class="p">,</span> <span class="n">combine</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">return_attention</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
        <span class="c"># Expects inp to be of size (?, number of words, embedding dimension)</span>
        <span class="n">repeat_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">inp</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
        <span class="c"># Map through 1 Layer MLP</span>
        <span class="n">x_a</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="n">repeat_size</span><span class="p">,</span> <span class="n">kernel_initializer</span> <span class="o">=</span> <span class="s">'glorot_uniform'</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">"tanh"</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"tanh_mlp"</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
        <span class="c"># Dot with word-level vector</span>
        <span class="n">x_a</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">kernel_initializer</span> <span class="o">=</span> <span class="s">'glorot_uniform'</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'linear'</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"word-level_context"</span><span class="p">)(</span><span class="n">x_a</span><span class="p">)</span>
        <span class="n">x_a</span> <span class="o">=</span> <span class="n">Flatten</span><span class="p">()(</span><span class="n">x_a</span><span class="p">)</span> <span class="c"># x_a is of shape (?,200,1), we flatten it to be (?,200)</span>
        <span class="n">att_out</span> <span class="o">=</span> <span class="n">Activation</span><span class="p">(</span><span class="s">'softmax'</span><span class="p">)(</span><span class="n">x_a</span><span class="p">)</span>
        <span class="c"># Clever trick to do elementwise multiplication of alpha_t with the correct h_t:</span>
        <span class="c"># RepeatVector will blow it out to be (?,120, 200)</span>
        <span class="c"># Then, Permute will swap it to (?,200,120) where each row (?,k,120) is a copy of a_t[k]</span>
        <span class="c"># Then, Multiply performs elementwise multiplication to apply the same a_t to each</span>
        <span class="c"># dimension of the respective word vector</span>
        <span class="n">x_a2</span> <span class="o">=</span> <span class="n">RepeatVector</span><span class="p">(</span><span class="n">repeat_size</span><span class="p">)(</span><span class="n">att_out</span><span class="p">)</span>
        <span class="n">x_a2</span> <span class="o">=</span> <span class="n">Permute</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">])(</span><span class="n">x_a2</span><span class="p">)</span>
        <span class="n">out</span> <span class="o">=</span> <span class="n">Multiply</span><span class="p">()([</span><span class="n">inp</span><span class="p">,</span><span class="n">x_a2</span><span class="p">])</span>
        <span class="k">if</span> <span class="n">combine</span><span class="p">:</span>
            <span class="c"># Now we sum over the resulting word representations</span>
            <span class="n">out</span> <span class="o">=</span> <span class="n">Lambda</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">K</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">name</span><span class="o">=</span><span class="s">'expectation_over_words'</span><span class="p">)(</span><span class="n">out</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">return_attention</span><span class="p">:</span>
            <span class="n">out</span> <span class="o">=</span> <span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="n">att_out</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">out</span>
</code></pre></div></div>
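<p>The RepeatVector/Permute/Multiply trick in the class above is easiest to verify outside of Keras. A small NumPy check (batch axis dropped, random values standing in for real activations) confirms it computes the same weighted sum as the equations:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

T, D = 200, 120                 # time steps, BiLSTM output dimension
h = rng.normal(size=(T, D))     # hidden states for a single example
alpha = rng.random(T)
alpha = alpha / alpha.sum()     # stand-in for the softmax output, sums to 1

# RepeatVector(D) turns alpha into shape (D, T); Permute([2,1]) swaps it to (T, D),
# so row t holds alpha_t repeated D times
alpha_tiled = np.tile(alpha, (D, 1)).T          # (T, D)

# Multiply, then the summing Lambda (axis=1 in Keras counts the batch axis;
# with the batch axis dropped here, it is axis=0)
out = (h * alpha_tiled).sum(axis=0)

# Identical to the direct weighted sum s = sum_t alpha_t h_t
direct = alpha @ h
```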
<h3 id="model-architecture">Model Architecture</h3>
<p>We are going to train a Bi-Directional LSTM to demonstrate the Attention class. The Bidirectional class in Keras returns a tensor with the same number of time steps as the input tensor, but with the forward and backward passes of the LSTM concatenated. Just to remind the reader, the standard dimensions for this use case in Keras are: (batch size, time steps, word embedding dimension).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lstm_shape</span> <span class="o">=</span> <span class="mi">60</span>
<span class="n">embed_size</span> <span class="o">=</span> <span class="mi">128</span>
<span class="c"># Define the model</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">maxlen</span><span class="p">,))</span>
<span class="n">emb</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">input_dim</span><span class="o">=</span><span class="n">max_features</span><span class="p">,</span> <span class="n">input_length</span> <span class="o">=</span> <span class="n">maxlen</span><span class="p">,</span> <span class="n">output_dim</span><span class="o">=</span><span class="n">embed_size</span><span class="p">)(</span><span class="n">inp</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">SpatialDropout1D</span><span class="p">(</span><span class="mf">0.35</span><span class="p">)(</span><span class="n">emb</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">Bidirectional</span><span class="p">(</span><span class="n">LSTM</span><span class="p">(</span><span class="n">lstm_shape</span><span class="p">,</span> <span class="n">return_sequences</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">dropout</span><span class="o">=</span><span class="mf">0.15</span><span class="p">,</span> <span class="n">recurrent_dropout</span><span class="o">=</span><span class="mf">0.15</span><span class="p">))(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span><span class="p">,</span> <span class="n">attention</span> <span class="o">=</span> <span class="n">Attention</span><span class="p">()(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">"sigmoid"</span><span class="p">)(</span><span class="n">x</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">x</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="s">'binary_crossentropy'</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="s">'adam'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
<span class="n">attention_model</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">inp</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">attention</span><span class="p">)</span> <span class="c"># Model to print out the attention data</span>
</code></pre></div></div>
<p>If we look at the summary, we can trace the dimensions through the model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_31 (InputLayer) (None, 200) 0
__________________________________________________________________________________________________
embedding_31 (Embedding) (None, 200, 128) 2560000 input_31[0][0]
__________________________________________________________________________________________________
spatial_dropout1d_31 (SpatialDr (None, 200, 128) 0 embedding_31[0][0]
__________________________________________________________________________________________________
bidirectional_37 (Bidirectional (None, 200, 120) 90720 spatial_dropout1d_31[0][0]
__________________________________________________________________________________________________
tanh_mlp (Dense) (None, 200, 120) 14520 bidirectional_37[0][0]
__________________________________________________________________________________________________
word-level_context (Dense) (None, 200, 1) 121 tanh_mlp[0][0]
__________________________________________________________________________________________________
flatten_30 (Flatten) (None, 200) 0 word-level_context[0][0]
__________________________________________________________________________________________________
activation_11 (Activation) (None, 200) 0 flatten_30[0][0]
__________________________________________________________________________________________________
repeat_vector_29 (RepeatVector) (None, 120, 200) 0 activation_11[0][0]
__________________________________________________________________________________________________
permute_31 (Permute) (None, 200, 120) 0 repeat_vector_29[0][0]
__________________________________________________________________________________________________
multiply_28 (Multiply) (None, 200, 120) 0 bidirectional_37[0][0]
permute_31[0][0]
__________________________________________________________________________________________________
expectation_over_words (Lambda) (None, 120) 0 multiply_28[0][0]
__________________________________________________________________________________________________
dense_55 (Dense) (None, 6) 726 expectation_over_words[0][0]
==================================================================================================
Total params: 2,666,087
Trainable params: 2,666,087
Non-trainable params: 0
__________________________________________________________________________________________________
</code></pre></div></div>
<h3 id="training">Training</h3>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_t</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">validation_split</span><span class="o">=.</span><span class="mi">2</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">512</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Train on 127656 samples, validate on 31915 samples
Epoch 1/3
127656/127656 [==============================] - 192s 2ms/step - loss: 0.1686 - acc: 0.9613 - val_loss: 0.1388 - val_acc: 0.9639
Epoch 2/3
127656/127656 [==============================] - 183s 1ms/step - loss: 0.0939 - acc: 0.9708 - val_loss: 0.0526 - val_acc: 0.9814
Epoch 3/3
127656/127656 [==============================] - 183s 1ms/step - loss: 0.0492 - acc: 0.9820 - val_loss: 0.0480 - val_acc: 0.9828
&lt;keras.callbacks.History at 0x7f5c60dbc690&gt;
</code></pre></div></div>
<h2 id="identifying-the-words-the-model-attends">Identifying The Words the Model Attends</h2>
<p>Because we stored the output of the softmax layer, we can identify which words the model found important for classifying each sentence. We’ll define a helper function that returns the prediction and pairs each word in the sentence with its corresponding Attention output (i.e., the word’s importance to the prediction).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_word_importances</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
<span class="n">lt</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">texts_to_sequences</span><span class="p">([</span><span class="n">text</span><span class="p">])</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">pad_sequences</span><span class="p">(</span><span class="n">lt</span><span class="p">,</span> <span class="n">maxlen</span><span class="o">=</span><span class="n">maxlen</span><span class="p">)</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">att</span> <span class="o">=</span> <span class="n">attention_model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">p</span><span class="p">,</span> <span class="p">[(</span><span class="n">reverse_token_map</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">word</span><span class="p">),</span> <span class="n">importance</span><span class="p">)</span> <span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">importance</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">att</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="k">if</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">reverse_token_map</span><span class="p">]</span>
</code></pre></div></div>
<p>The data set from Kaggle is about classifying toxic comments. Instead of using examples from the data (which is filled with vulgarities…), I’ll use a relatively benign example to show Attention in action:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># This example is not offensive, according to the model.</span>
<span class="n">get_word_importances</span><span class="p">(</span><span class="s">'ice cream'</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(array([[0.03289272, 0.00060764, 0.00365122, 0.00135027, 0.00818442,
0.00286318]], dtype=float32),
[('ice', 0.24796312), ('cream', 0.18920079)])
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># This example is ALMOST labeled toxic due to the word "worthless", but isn't.</span>
<span class="n">get_word_importances</span><span class="p">(</span><span class="s">'ice cream is worthless'</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(array([[0.4471274 , 0.00662996, 0.07287972, 0.00910207, 0.09902474,
0.02583557]], dtype=float32),
[('ice', 0.028298836),
('cream', 0.024065288),
('is', 0.041414138),
('worthless', 0.8417008)])
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># This example is labeled toxic and it's because of the word "sucks".</span>
<span class="n">get_word_importances</span><span class="p">(</span><span class="s">'ice cream sucks'</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(array([[0.9310674 , 0.04415609, 0.64935035, 0.02346741, 0.4773681 ,
0.07371927]], dtype=float32),
[('ice', 0.0032987772), ('cream', 0.003724447), ('sucks', 0.9874626)])
</code></pre></div></div>
<h1 id="discussion">Discussion</h1>
<p>Attention mechanisms are an exciting technique and an active area of research, with new approaches being published all the time. Attention has been shown to help achieve state of the art performance on many tasks, but it also has the obvious advantage of explaining important components of neural networks to a practitioner. In an academic setting, Attention proves very useful for model debugging by allowing one to inspect predictions and their reasoning. Moreover, being able to indicate why a model made a prediction could be very important in fields like healthcare, where predictions may be vetted by an expert.</p>Adversarial Dreaming with TensorFlow and Keras2018-01-07T00:00:00+00:002018-01-07T00:00:00+00:00http://srome.github.io//Adversarial-Dreaming-with-TensorFlow-and-Keras<p>Everyone has heard of the feats of Google’s “dreaming” neural network. Today, we’re going to define a special loss function so that we can dream adversarially: that is, we will dream in a way that fools the InceptionV3 image classifier into classifying an image of a dreamy cat as a coffeepot.</p>
<p><a href="https://en.wikipedia.org/wiki/DeepDream">DeepDream</a> made headlines a few years ago for its bizarre images generated by neural networks. As with many scientific advancements, a flashy name won out over the precise name. “Dreaming” sounds more exotic than the more boring truth: “purposely maximizing the L2 norm of different layers”. Anyway, the approach was later extended to <a href="https://www.youtube.com/watch?v=oyxSerkkP4o">video</a> by researchers to more unsettling effects. How to “dream” has been covered at length, so for this post, I’m going to repurpose <a href="https://github.com/keras-team/keras/blob/master/examples/deep_dream.py">code from the good people at Keras</a> and only focus on the adversarial side of dreaming. This modified code is included as an appendix below.</p>
<h2 id="adversarial-images">Adversarial Images</h2>
<p>Adversarial images have been getting a lot of attention lately. These are images that fool state of the art image classifiers but that a human would easily assign to the proper class. At present, crafting these images requires access to the classifier you aim to fool, but that could always change in the future. Therefore, researchers are already exploring the different ways one can exploit an image classifier.</p>
<p>There have been several papers exploring these so-called attacks. The aptly named <a href="https://arxiv.org/abs/1710.08864">One pixel attacks</a> change a single pixel to trick a classifier:
<img src="../images/advdream/one_pixel.png" alt="One pixel attacks" /></p>
<p><a href="https://www.cs.cmu.edu/~sbhagava/papers/face-rec-ccs16.pdf">Special glasses</a> have been used to cause a classifier to misclassify celebrities:
<img src="../images/advdream/glasses.png" alt="Misclassifying Reese" /></p>
<p>One can quickly imagine how this can go awry when self-driving cars need to identify obstacles (or pedestrians). With these publications in mind, I would like to walk through another approach that combines “dreaming” to create adversarial images.</p>
<h2 id="setup">Setup</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.applications</span> <span class="kn">import</span> <span class="n">inception_v3</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="c"># Repurposed deepdream.py code from the people at Keras</span>
<span class="kn">from</span> <span class="nn">deepdream_mod</span> <span class="kn">import</span> <span class="n">preprocess_image</span><span class="p">,</span> <span class="n">deprocess_image</span><span class="p">,</span> <span class="n">eval_loss_and_grads</span><span class="p">,</span> <span class="n">resize_img</span><span class="p">,</span> <span class="n">gradient_descent</span>
<span class="n">K</span><span class="o">.</span><span class="n">set_learning_phase</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="c"># Load the model and include the "top" classification layer</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">inception_v3</span><span class="o">.</span><span class="n">InceptionV3</span><span class="p">(</span><span class="n">weights</span><span class="o">=</span><span class="s">'imagenet'</span><span class="p">,</span>
<span class="n">include_top</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Using TensorFlow backend.
</code></pre></div></div>
<h2 id="the-cat">The Cat</h2>
<p>Today, we’re going to take this image of a lazy cat for our experiment. Let’s see what InceptionV3 thinks right now:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">orig_img</span> <span class="o">=</span> <span class="n">preprocess_image</span><span class="p">(</span><span class="s">'lazycat.jpg'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">top_predicted_label</span> <span class="o">=</span> <span class="n">inception_v3</span><span class="o">.</span><span class="n">decode_predictions</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">orig_img</span><span class="p">),</span> <span class="n">top</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Predicted Label: </span><span class="si">%</span><span class="s">s (</span><span class="si">%</span><span class="s">f)'</span> <span class="o">%</span> <span class="p">(</span><span class="n">top_predicted_label</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">top_predicted_label</span><span class="p">[</span><span class="mi">2</span><span class="p">]))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">deprocess_image</span><span class="p">(</span><span class="n">orig_img</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="../images/advdream/output_5_1.png" alt="png" /></p>
<p>The most likely class according to InceptionV3 is an Egyptian_cat. That’s oddly specific (and wrong), but let’s go with it.</p>
<h2 id="select-fake-label">Select Fake Label</h2>
<p>In this case, let’s select a “coffeepot”.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coffeepot</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">orig_img</span><span class="p">))</span>
<span class="n">coffeepot</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">505</span><span class="p">]</span> <span class="o">=</span> <span class="mf">1.</span>
<span class="n">inception_v3</span><span class="o">.</span><span class="n">decode_predictions</span><span class="p">(</span><span class="n">coffeepot</span><span class="p">)</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[[(u'n03063689', u'coffeepot', 1.0),
(u'n15075141', u'toilet_tissue', 0.0),
(u'n02317335', u'starfish', 0.0),
(u'n02391049', u'zebra', 0.0),
(u'n02389026', u'sorrel', 0.0)]]
</code></pre></div></div>
<h2 id="define-the-loss">Define the loss</h2>
<p>Our loss will have a “dreaming” and “adversarial” component.</p>
<h3 id="dreaming-component">Dreaming Component</h3>
<p>To achieve “dreaming”, we fix the weights and perform gradient ascent on the input image itself to maximize the L2 norm of a chosen layer’s output. You can also select multiple layers and maximize a weighted sum of their norms, but in this case we will choose a single layer for simplicity.</p>
<p>The interesting part mathematically is as follows: define the input image as <script type="math/tex">x_0</script> and the output of the <script type="math/tex">n</script>-th layer as <script type="math/tex">f_n(x)</script>. To dream, we perform the gradient ascent step:</p>
<p>\begin{equation}
x_{i+1} = x_{i} + \gamma \frac{\partial}{\partial x} |f_{n}(x)|^2
\end{equation}</p>
<p>where <script type="math/tex">\gamma</script> is the learning rate and the absolute value represents the L2 norm. For our choice of <script type="math/tex">f_n</script>, we are not limited to activation functions. We could use the output before the activation, or even the output of special layers like Batch Normalization.</p>
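<p>As a minimal numpy sketch of this ascent step, with a toy linear “layer” standing in for <script type="math/tex">f_n</script> (all names here are illustrative, not from the code below):</p>

```python
import numpy as np

def layer_output(x, w):
    # Toy stand-in for f_n: a single linear layer applied to a flattened image
    return w @ x

def dream_step(x, w, gamma=0.1):
    # The gradient of |f_n(x)|^2 = |W x|^2 with respect to x is 2 W^T W x;
    # gradient *ascent* moves x along it, inflating the layer's activations
    grad = 2.0 * w.T @ (w @ x)
    return x + gamma * grad

x = np.ones(3)
w = np.eye(3)
for _ in range(5):
    x = dream_step(x, w)
# The layer's L2 norm grows geometrically with each ascent step
```

<p>On a real network, the same loop runs on the image tensor, with the framework’s autodiff supplying the gradient.</p>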
<h3 id="adversarial-component">Adversarial Component</h3>
<p>For the “adversarial” piece, we will also add a component to the loss corresponding to gradient descent toward a selected label: each iteration, we modify the image to make the neural network more confident that the image belongs to the class we chose.</p>
<p>One of the easiest ways to define this is to choose a fake label and minimize the loss between the model’s output for the image and that label. Let us define the model’s output softmax layer as <script type="math/tex">f(x)</script> for an image <script type="math/tex">x</script>, and let <script type="math/tex">o_{fake}</script> be the fake label. Let’s assume we are using a categorical crossentropy loss <script type="math/tex">\ell</script>, but any loss would do. Then, we would want to perform the gradient descent step</p>
<p>\begin{equation}
x_{i+1} = x_i - \gamma \frac{\partial}{\partial x} \ell( f(x), o_{fake}).
\end{equation}</p>
<p>This iteration scheme will adjust the original image to minimize the loss function between the output of the model and the fake label, hence “tricking” the model.</p>
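<p>A minimal numpy sketch of this descent step, operating directly on the logits for clarity (on the real model, the chain rule carries the same gradient back to the image pixels; the names are illustrative):</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adversarial_step(logits, fake_onehot, gamma=0.5):
    # For softmax + categorical crossentropy, d ell / d logits = p - y,
    # so descending the loss pushes probability mass toward the fake label
    p = softmax(logits)
    return logits - gamma * (p - fake_onehot)

logits = np.array([2.0, 0.0, 0.0])  # the model currently favors class 0
fake = np.array([0.0, 1.0, 0.0])    # the label we want it to predict
for _ in range(50):
    logits = adversarial_step(logits, fake)
# After enough steps, the fake class dominates the softmax output
```

<p>This is exactly the effect the image-space update has on the classifier: the prediction drifts toward <script type="math/tex">o_{fake}</script> while the image barely changes.</p>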
<h3 id="final-loss">Final loss</h3>
<p>If we add the two update rules above, we obtain:</p>
<p>\begin{equation}
2 x_{i+1} = 2 x_{i} + \gamma (\frac{\partial}{\partial x} |f_{n}(x)|^2 - \frac{\partial}{\partial x} \ell( f(x), o_{fake}) )
\end{equation}</p>
<p>After massaging the equation, we have the gradient <em>descent</em> rule</p>
<p>\begin{equation}
x_{i+1} = x_{i} - \frac{\gamma }{2} \frac{\partial}{\partial x} \left (\ell( f(x), o_{fake}) - |f_{n}(x)|^2 \right ).
\end{equation}
So, our final loss function for adversarial dreaming to define in Keras is:</p>
<p>\begin{equation}
\ell_{\text{adv dream}} = \ell( f(x), o_{fake}) - |f_{n}(x)|^2.
\end{equation}</p>
<p>For the final loss, we will use a linear combination of a few randomly selected layer output functions for better dreaming effects, but that is a minor technical detail. Let’s see the code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">losses</span>
<span class="n">input_class</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1000</span><span class="p">))</span> <span class="c"># Variable fake class</span>
<span class="c"># dreaming loss, choose a few layers at random and sum their L2 norms</span>
<span class="n">dream_loss</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">variable</span><span class="p">(</span><span class="mf">0.</span><span class="p">)</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="c"># Unused draw: it advances the seeded RNG state, so the layer selection below depends on it</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">layers</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">layers</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">4</span><span class="p">):</span>
<span class="n">x_var</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">output</span>
<span class="n">dream_loss</span> <span class="o">+=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">K</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">K</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="n">x_var</span><span class="p">))</span> <span class="o">/</span> <span class="n">K</span><span class="o">.</span><span class="n">prod</span><span class="p">(</span><span class="n">K</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">K</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">x_var</span><span class="p">),</span> <span class="s">'float32'</span><span class="p">))</span>
<span class="c"># adversarial loss</span>
<span class="n">adversarial_loss</span> <span class="o">=</span> <span class="n">losses</span><span class="o">.</span><span class="n">categorical_crossentropy</span><span class="p">(</span><span class="n">input_class</span><span class="p">,</span> <span class="n">model</span><span class="o">.</span><span class="n">output</span><span class="p">)</span>
<span class="c"># final loss</span>
<span class="n">adversarial_dream_loss</span> <span class="o">=</span> <span class="n">adversarial_loss</span><span class="o">-</span><span class="n">dream_loss</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Compute the gradients of the dream wrt the loss.</span>
<span class="n">dream</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="nb">input</span> <span class="c"># This is the input image</span>
<span class="n">grads</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">gradients</span><span class="p">(</span><span class="n">adversarial_dream_loss</span><span class="p">,</span> <span class="n">dream</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="c"># the signs will achieve the desired effect</span>
<span class="n">grads</span> <span class="o">/=</span> <span class="n">K</span><span class="o">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">K</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">K</span><span class="o">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">grads</span><span class="p">)),</span> <span class="n">K</span><span class="o">.</span><span class="n">epsilon</span><span class="p">())</span> <span class="c"># Normalize for numerical stability</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">adversarial_dream_loss</span><span class="p">,</span> <span class="n">grads</span><span class="p">]</span>
<span class="c"># Function to use during dreaming</span>
<span class="n">fetch_loss_and_grads</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">function</span><span class="p">([</span><span class="n">dream</span><span class="p">,</span> <span class="n">input_class</span><span class="p">],</span> <span class="n">outputs</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="dreaming">Dreaming</h2>
<p>The code to “dream” is straightforward. The image is shrunk to a smaller size and run through a selected number of gradient descent steps. Then, the resulting image is upscaled, with any detail the original image lost during resizing added back in. This is repeated for the selected number of “octaves”, where during each octave the image has a different size. The octave scale determines the size ratio between consecutive octaves.</p>
<p>This is really just a more complicated way to say that the image is passed through the neural net for gradient descent at multiple increasing sizes. The octave concept is a nice geometric rule for deciding what those sizes should be, and it has been found to work well. Sometimes, additional jitter is added to the image at each gradient descent step as well.</p>
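<p>The octave sizes can be computed on their own; a minimal sketch, assuming the 299×299 InceptionV3 input and the settings used in this post (<code>num_octave = 5</code>, <code>octave_scale = 1.2</code>):</p>

```python
original_shape = (299, 299)
num_octave, octave_scale = 5, 1.2

successive_shapes = [original_shape]
for i in range(1, num_octave):
    # Each octave shrinks both dimensions by another factor of octave_scale
    shape = tuple(int(dim / (octave_scale ** i)) for dim in original_shape)
    successive_shapes.append(shape)
successive_shapes = successive_shapes[::-1]  # dream on the smallest image first
print(successive_shapes)
# [(144, 144), (173, 173), (207, 207), (249, 249), (299, 299)]
```

<p>These are exactly the shapes that appear in the “Processing image shape” lines of the log output further down.</p>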
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Shamelessly taken from the Keras example</span>
<span class="n">step</span> <span class="o">=</span> <span class="mf">0.01</span> <span class="c"># Gradient descent step size</span>
<span class="n">num_octave</span> <span class="o">=</span> <span class="mi">5</span> <span class="c"># Number of scales at which to run gradient descent</span>
<span class="n">octave_scale</span> <span class="o">=</span> <span class="mf">1.2</span> <span class="c"># Size ratio between scales</span>
<span class="n">iterations</span> <span class="o">=</span> <span class="mi">15</span> <span class="c"># Number of descent steps per scale</span>
<span class="n">max_loss</span> <span class="o">=</span> <span class="mf">100.</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">copy</span><span class="p">(</span><span class="n">orig_img</span><span class="p">)</span>
<span class="k">if</span> <span class="n">K</span><span class="o">.</span><span class="n">image_data_format</span><span class="p">()</span> <span class="o">==</span> <span class="s">'channels_first'</span><span class="p">:</span>
<span class="n">original_shape</span> <span class="o">=</span> <span class="n">img</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">:]</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">original_shape</span> <span class="o">=</span> <span class="n">img</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">3</span><span class="p">]</span>
<span class="n">successive_shapes</span> <span class="o">=</span> <span class="p">[</span><span class="n">original_shape</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">num_octave</span><span class="p">):</span>
<span class="n">shape</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">([</span><span class="nb">int</span><span class="p">(</span><span class="n">dim</span> <span class="o">/</span> <span class="p">(</span><span class="n">octave_scale</span> <span class="o">**</span> <span class="n">i</span><span class="p">))</span> <span class="k">for</span> <span class="n">dim</span> <span class="ow">in</span> <span class="n">original_shape</span><span class="p">])</span>
<span class="n">successive_shapes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">shape</span><span class="p">)</span>
<span class="n">successive_shapes</span> <span class="o">=</span> <span class="n">successive_shapes</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">original_img</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">copy</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
<span class="n">shrunk_original_img</span> <span class="o">=</span> <span class="n">resize_img</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">successive_shapes</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">for</span> <span class="n">shape</span> <span class="ow">in</span> <span class="n">successive_shapes</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Processing image shape'</span><span class="p">,</span> <span class="n">shape</span><span class="p">)</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">resize_img</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">shape</span><span class="p">)</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">gradient_descent</span><span class="p">(</span><span class="n">img</span><span class="p">,</span>
<span class="n">iterations</span><span class="o">=</span><span class="n">iterations</span><span class="p">,</span>
<span class="n">step</span><span class="o">=</span><span class="n">step</span><span class="p">,</span>
<span class="n">clz</span><span class="o">=</span><span class="n">coffeepot</span><span class="p">,</span>
<span class="n">max_loss</span><span class="o">=</span><span class="n">max_loss</span><span class="p">,</span>
<span class="n">loss_func</span> <span class="o">=</span> <span class="n">fetch_loss_and_grads</span><span class="p">)</span>
<span class="c"># Upscale</span>
<span class="n">upscaled_shrunk_original_img</span> <span class="o">=</span> <span class="n">resize_img</span><span class="p">(</span><span class="n">shrunk_original_img</span><span class="p">,</span> <span class="n">shape</span><span class="p">)</span>
<span class="n">same_size_original</span> <span class="o">=</span> <span class="n">resize_img</span><span class="p">(</span><span class="n">original_img</span><span class="p">,</span> <span class="n">shape</span><span class="p">)</span>
<span class="c"># Add back in detail from original image</span>
<span class="n">lost_detail</span> <span class="o">=</span> <span class="n">same_size_original</span> <span class="o">-</span> <span class="n">upscaled_shrunk_original_img</span>
<span class="n">img</span> <span class="o">+=</span> <span class="n">lost_detail</span>
<span class="n">shrunk_original_img</span> <span class="o">=</span> <span class="n">resize_img</span><span class="p">(</span><span class="n">original_img</span><span class="p">,</span> <span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
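<p>The loop above relies on a successive_shapes list built earlier in the post. A minimal sketch of how such a schedule could be computed (the function name and parameter values here are assumptions, chosen so that five octaves with a scale factor of 1.2 reproduce the shapes printed in the log below):</p>

```python
def build_successive_shapes(base_shape, num_octave=5, octave_scale=1.2):
    # Shrink the base shape by octave_scale per octave, then reverse
    # so dreaming proceeds from the smallest scale up to full size.
    shapes = [tuple(int(dim / (octave_scale ** i)) for dim in base_shape)
              for i in range(num_octave)]
    return shapes[::-1]

print(build_successive_shapes((299, 299)))
# → [(144, 144), (173, 173), (207, 207), (249, 249), (299, 299)]
```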
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>('Processing image shape', (144, 144))
('..Loss value at', 0, ':', array([ 13.78646088], dtype=float32))
...
('..Loss value at', 14, ':', array([ 5.75342369], dtype=float32))
('Processing image shape', (173, 173))
('..Loss value at', 0, ':', array([ 12.50799942], dtype=float32))
...
('..Loss value at', 14, ':', array([ 4.36679077], dtype=float32))
('Processing image shape', (207, 207))
('..Loss value at', 0, ':', array([ 12.14435482], dtype=float32))
...
('..Loss value at', 14, ':', array([-8.84395885], dtype=float32))
('Processing image shape', (249, 249))
('..Loss value at', 0, ':', array([ 6.44068241], dtype=float32))
...
('..Loss value at', 14, ':', array([-9.34126949], dtype=float32))
('Processing image shape', (299, 299))
('..Loss value at', 0, ':', array([ 1.45110321], dtype=float32))
...
('..Loss value at', 14, ':', array([-9.23139477], dtype=float32))
</code></pre></div></div>
<h2 id="outcome">Outcome</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="n">axs</span><span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
<span class="n">top_predicted_label</span> <span class="o">=</span> <span class="n">inception_v3</span><span class="o">.</span><span class="n">decode_predictions</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">orig_img</span><span class="p">),</span> <span class="n">top</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Predicted Label: </span><span class="si">%</span><span class="s">s (</span><span class="si">%</span><span class="s">f)'</span> <span class="o">%</span> <span class="p">(</span><span class="n">top_predicted_label</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">top_predicted_label</span><span class="p">[</span><span class="mi">2</span><span class="p">]))</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">deprocess_image</span><span class="p">(</span><span class="n">orig_img</span><span class="p">))</span>
<span class="n">top_predicted_label</span> <span class="o">=</span> <span class="n">inception_v3</span><span class="o">.</span><span class="n">decode_predictions</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">img</span><span class="p">),</span> <span class="n">top</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Predicted Label: </span><span class="si">%</span><span class="s">s (</span><span class="si">%</span><span class="s">f)'</span> <span class="o">%</span> <span class="p">(</span><span class="n">top_predicted_label</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">top_predicted_label</span><span class="p">[</span><span class="mi">2</span><span class="p">]))</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">deprocess_image</span><span class="p">(</span><span class="n">img</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="../images/advdream/output_15_1.png" alt="png" /></p>
<p>As you can see, we accomplished our initial goal: to dream in a directed, adversarial way. There are many other approaches that could be attempted, and this remains a very popular avenue of research at the moment.</p>
<h2 id="appendix-deepdream_modpy">Appendix: deepdream_mod.py</h2>
<p>Here are the functions used in this post:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Modification of https://github.com/keras-team/keras/blob/master/examples/deep_dream.py</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="kn">from</span> <span class="nn">keras.applications</span> <span class="kn">import</span> <span class="n">inception_v3</span>
<span class="kn">from</span> <span class="nn">keras.preprocessing.image</span> <span class="kn">import</span> <span class="n">load_img</span><span class="p">,</span> <span class="n">img_to_array</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">scipy</span>
<span class="k">def</span> <span class="nf">preprocess_image</span><span class="p">(</span><span class="n">image_path</span><span class="p">):</span>
<span class="c"># Util function to open, resize and format pictures</span>
<span class="c"># into appropriate tensors.</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">load_img</span><span class="p">(</span><span class="n">image_path</span><span class="p">)</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">img_to_array</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">inception_v3</span><span class="o">.</span><span class="n">preprocess_input</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
<span class="k">return</span> <span class="n">img</span>
<span class="k">def</span> <span class="nf">deprocess_image</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="c"># Util function to convert a tensor into a valid image.</span>
<span class="n">x</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">copy</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">if</span> <span class="n">K</span><span class="o">.</span><span class="n">image_data_format</span><span class="p">()</span> <span class="o">==</span> <span class="s">'channels_first'</span><span class="p">:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="mi">3</span><span class="p">,</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">3</span><span class="p">]))</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">transpose</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="mi">3</span><span class="p">))</span>
<span class="n">x</span> <span class="o">/=</span> <span class="mf">2.</span>
<span class="n">x</span> <span class="o">+=</span> <span class="mf">0.5</span>
<span class="n">x</span> <span class="o">*=</span> <span class="mf">255.</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">255</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'uint8'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
<span class="k">def</span> <span class="nf">eval_loss_and_grads</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">clz</span><span class="p">,</span> <span class="n">loss_func</span><span class="p">):</span>
<span class="n">outs</span> <span class="o">=</span> <span class="n">loss_func</span><span class="p">([</span><span class="n">x</span><span class="p">,</span><span class="n">clz</span><span class="p">])</span>
<span class="n">loss_value</span> <span class="o">=</span> <span class="n">outs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">grad_values</span> <span class="o">=</span> <span class="n">outs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="k">return</span> <span class="n">loss_value</span><span class="p">,</span> <span class="n">grad_values</span>
<span class="k">def</span> <span class="nf">resize_img</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">size</span><span class="p">):</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">copy</span><span class="p">(</span><span class="n">img</span><span class="p">)</span>
<span class="k">if</span> <span class="n">K</span><span class="o">.</span><span class="n">image_data_format</span><span class="p">()</span> <span class="o">==</span> <span class="s">'channels_first'</span><span class="p">:</span>
<span class="n">factors</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>
<span class="nb">float</span><span class="p">(</span><span class="n">size</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">/</span> <span class="n">img</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span>
<span class="nb">float</span><span class="p">(</span><span class="n">size</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="n">img</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">factors</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span>
<span class="nb">float</span><span class="p">(</span><span class="n">size</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">/</span> <span class="n">img</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span>
<span class="nb">float</span><span class="p">(</span><span class="n">size</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">/</span> <span class="n">img</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span>
<span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">scipy</span><span class="o">.</span><span class="n">ndimage</span><span class="o">.</span><span class="n">zoom</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">factors</span><span class="p">,</span> <span class="n">order</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">gradient_descent</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">iterations</span><span class="p">,</span> <span class="n">step</span><span class="p">,</span> <span class="n">clz</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">max_loss</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">loss_func</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">iterations</span><span class="p">):</span>
<span class="n">loss_value</span><span class="p">,</span> <span class="n">grad_values</span> <span class="o">=</span> <span class="n">eval_loss_and_grads</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">clz</span><span class="p">,</span> <span class="n">loss_func</span><span class="p">)</span>
<span class="k">if</span> <span class="n">max_loss</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="n">loss_value</span> <span class="o">></span> <span class="n">max_loss</span><span class="p">:</span>
<span class="k">break</span>
<span class="k">print</span><span class="p">(</span><span class="s">'..Loss value at'</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="s">':'</span><span class="p">,</span> <span class="n">loss_value</span><span class="p">)</span>
<span class="n">x</span> <span class="o">-=</span> <span class="n">step</span> <span class="o">*</span> <span class="n">grad_values</span>
<span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>Everyone has heard the feats of Google’s “dreaming” neural network. Today, we’re going to define a special loss function so that we can dream adversarially– that is, we will dream in a way that will fool the InceptionV3 image classifier to classify an image of a dreamy cat as a coffeepot.Hogwild!? Implementing Async SGD in Python2017-11-18T00:00:00+00:002017-11-18T00:00:00+00:00http://srome.github.io//Async-SGD-in-Python-Implementing-Hogwild!<p>Hogwild! is an asynchronous stochastic gradient descent algorithm. The Hogwild! approach utilizes “lock-free” gradient updates. For a machine learning model, this means that the weights of the model are updated by multiple processes at the same time, with the possibility of overwriting each other. In this post, we will use the multiprocessing library to implement Hogwild! in Python for training a linear regression model.</p>
<p>In order to not bog this post down with excessive details, I’m going to assume that the reader is familiar with the concepts of (stochastic) gradient descent and what asynchronous programming looks like.</p>
<h2 id="hogwild-explained">Hogwild! Explained</h2>
<p>Stochastic gradient descent is an iterative algorithm. If one wanted to speed it up, they would have to move the iterative calculation to a faster processor (or say, a GPU). Hogwild! is an approach to parallelize SGD in order to scale SGD to quickly train on even larger data sets. The basic algorithm of Hogwild! is deceptively simple. The only pre-req is that all processors on the machine can read and update the weights of your algorithm (i.e. of your neural network or linear regression) at the same time. In short, the weights are stored in shared memory with atomic component-wise addition. The individual processors’ SGD update code is described below in adapted from the <a href="https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf">original paper</a>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>loop
sample an example (X_i, y_i) from your training set
evaluate the gradient descent step using (X_i, y_i)
perform the update step component-wise
end loop
</code></pre></div></div>
<p>So, we run this code on multiple processors updating the same model weights at the same time. You can see where the name Hogwild! comes from– the algorithm is like watching a bunch of farm animals bucking around at random, stepping on each other’s toes.</p>
<p>There are a few conditions for this algorithm to work, but for our purposes, one way to phrase a (usually) sufficient condition is that the gradient updates are sparse, meaning that there are only a few non-zero elements of the gradient estimate used in SGD. This is likely to occur when your data itself is sparse. A sparse update means that it’s unlikely for the different processors to step on each other’s toes, i.e. try to update the same weight component at the same time. Collisions do still occur, but this has been observed to act as a form of regularization in practice.</p>
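<p>To see why sparse data gives sparse updates, note that for squared error the per-example gradient is a scalar multiple of the input vector, so it is nonzero only where the input is nonzero. A quick check with toy vectors (not from the post):</p>

```python
import numpy as np

w = np.array([0.5, -0.2, 0.1, 0.7])   # model weights
x = np.array([0.0, 3.0, 0.0, 1.0])    # sparse input: two nonzero features
y = 1.0

# Gradient of (w.x - y)^2 with respect to w: a scalar times x
grad = 2 * (w.dot(x) - y) * x
print(np.nonzero(grad)[0])            # only the indices where x is nonzero
```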
<p>As an exercise, we will implement linear regression using Hogwild! for training. I have prepared a <a href="https://github.com/srome/sklearn-hogwild">git repo</a> with the full implementation adhering to sklearn’s API, and I will use pieces of that code here.</p>
<h2 id="linear-regression-refresher">Linear Regression Refresher</h2>
<p>To make sure we’re all on the same page, here’s the terminology I will use in the subsequent sections. Our linear regression model <script type="math/tex">f</script> will take <script type="math/tex">x\in\mathbb{R}^n</script> as input and output a real number <script type="math/tex">y</script> via</p>
<p>\begin{equation}
f(x) = w \cdot x,
\end{equation}
where <script type="math/tex">w\in\mathbb{R}^n</script> is our vector of weights for our model. We will use squared error as our loss function, which for a single example is written as
\begin{equation}
\ell(x,y) = ( f(x)-y )^2 = ( w\cdot x - y)^2.
\end{equation}</p>
<p>To train our model via SGD, we need to calculate the gradient update step so that we may update the weights via:
\begin{equation}
w_{t+1} = w_t - \lambda G_w (x,y),
\end{equation}
where <script type="math/tex">\lambda</script> is the learning rate and <script type="math/tex">G_w</script> is an estimate of <script type="math/tex">\nabla_w \ell</script> satisfying:
\begin{equation}
\mathbb{E}[G_w (x,y)] = \nabla_w \ell(x,y).
\end{equation}</p>
<p>In particular,</p>
<p>\begin{equation}
G_w (x,y):= 2\, (w\cdot x - y)\, x \in \mathbb{R}^n
\end{equation}</p>
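<p>As a sanity check (not in the original post), the analytic gradient of the squared-error loss can be verified against central finite differences:</p>

```python
import numpy as np

rng = np.random.RandomState(0)
w = rng.normal(size=4)
x = rng.normal(size=4)
y = 0.3

loss = lambda w_: (w_.dot(x) - y) ** 2
analytic = 2 * (w.dot(x) - y) * x   # analytic gradient of (w.x - y)^2

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros(len(w))
    e[i] = eps
    numeric[i] = (loss(w + e) - loss(w - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```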
<h2 id="generating-our-training-data">Generating our Training Data</h2>
<p>First, we quickly generate our training data. We will assume our data was generated by a function of the same form as our model, namely a dot product between a weight vector and our input vector <script type="math/tex">x</script>. This true weight vector I will call the “real <script type="math/tex">w</script>”.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">scipy.sparse</span>
<span class="n">n</span><span class="o">=</span><span class="mi">10</span> <span class="c"># number of features</span>
<span class="n">m</span><span class="o">=</span><span class="mi">20000</span> <span class="c"># number of training examples</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">sparse</span><span class="o">.</span><span class="n">random</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="n">n</span><span class="p">,</span> <span class="n">density</span><span class="o">=.</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">toarray</span><span class="p">()</span> <span class="c"># Guarantees sparse grad updates</span>
<span class="n">real_w</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span> <span class="c"># Define our true weight vector</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">X</span><span class="o">/</span><span class="n">X</span><span class="o">.</span><span class="nb">max</span><span class="p">()</span> <span class="c"># Normalizing for training</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">real_w</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="high-level-approach">High Level Approach</h3>
<p>The multiprocessing library allows you to spin up additional processes to run code in parallel. We will use the multiprocessing.Pool’s map function to calculate and apply the gradient update step asynchronously. The key component of the algorithm is that the weight vector is accessible to all processes at the same time and can be updated without any locking.</p>
<h3 id="lock-free-shared-memory-in-python">Lock-free Shared Memory in Python</h3>
<p>First, we need to define the weight vector <script type="math/tex">w</script> in shared memory so that it can be accessed without locks. We will use the “sharedctypes.Array” class from multiprocessing for this functionality, along with numpy’s frombuffer function to make it accessible as a numpy array.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">multiprocessing.sharedctypes</span> <span class="kn">import</span> <span class="n">Array</span>
<span class="kn">from</span> <span class="nn">ctypes</span> <span class="kn">import</span> <span class="n">c_double</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">coef_shared</span> <span class="o">=</span> <span class="n">Array</span><span class="p">(</span><span class="n">c_double</span><span class="p">,</span>
<span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span> <span class="o">*</span> <span class="mf">1.</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="o">.</span><span class="n">flat</span><span class="p">,</span>
<span class="n">lock</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="c"># Hogwild!</span>
<span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="n">coef_shared</span><span class="p">)</span>
<span class="n">w</span> <span class="o">=</span> <span class="n">w</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">n</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
</code></pre></div></div>
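<p>It is worth confirming that the numpy array really is a view over the shared buffer rather than a copy, so that writes through one are visible through the other. A small standalone check (toy size, names mirror the snippet above):</p>

```python
from multiprocessing.sharedctypes import Array
from ctypes import c_double
import numpy as np

n = 4
coef_shared = Array(c_double, [0.0] * n, lock=False)  # Hogwild!: no lock
w = np.frombuffer(coef_shared).reshape((n, 1))        # a view, not a copy

coef_shared[2] = 5.0   # write through the raw shared array...
print(w[2, 0])         # ...and the write is visible via the numpy view
```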
<p>For the final parallelization code, we will use multiprocessing.Pool to map our gradient update out to several workers. We will need to expose this vector to the workers for this to work.</p>
<h3 id="gradient-update">Gradient Update</h3>
<p>Our ultimate goal is to perform the gradient update in parallel. To do this, we need to define what this update is. In the <a href="https://github.com/srome/sklearn-hogwild">git repo</a>, I took a much more sophisticated approach to expose our weight vector <script type="math/tex">w</script> to multiple processes. To avoid over-complicating this explanation with Python technicalities, I’m going to just use a global variable to take advantage of how multiprocessing.Pool.map works. The map function copies over the entire namespace of the function to a new process, and as I need to expose a single <script type="math/tex">w</script> to all workers, this will be sufficient.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># The calculation has been adjusted to allow for mini-batches</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="o">.</span><span class="mo">001</span>
<span class="k">def</span> <span class="nf">mse_gradient_step</span><span class="p">(</span><span class="n">X_y_tuple</span><span class="p">):</span>
<span class="k">global</span> <span class="n">w</span> <span class="c"># Only for instructive purposes!</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">X_y_tuple</span> <span class="c"># Required for how multiprocessing.Pool.map works</span>
<span class="c"># Calculate the gradient</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">),</span><span class="mi">1</span><span class="p">))</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">w</span><span class="p">)</span>
<span class="n">grad</span> <span class="o">=</span> <span class="o">-</span><span class="mf">2.</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="n">X</span><span class="p">),</span><span class="n">err</span><span class="p">)</span><span class="o">/</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="c"># Update the nonzero weights one at a time</span>
<span class="k">for</span> <span class="n">index</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">grad</span><span class="p">)</span> <span class="o">></span> <span class="o">.</span><span class="mo">01</span><span class="p">)[</span><span class="mi">0</span><span class="p">]:</span>
<span class="n">coef_shared</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">-=</span> <span class="n">learning_rate</span><span class="o">*</span><span class="n">grad</span><span class="p">[</span><span class="n">index</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>
<h3 id="preparing-the-examples-for-multiprocessingpool">Preparing the examples for multiprocessing.Pool</h3>
<p>We are going to have to cut the training examples into tuples, one per example, to pass to our workers. We will also be reshaping them to adhere to the way the gradient step is written above. Code is included to allow you to use mini-batches via the batch_size variable.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">batch_size</span><span class="o">=</span><span class="mi">1</span>
<span class="n">examples</span><span class="o">=</span><span class="p">[</span><span class="bp">None</span><span class="p">]</span><span class="o">*</span><span class="nb">int</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">batch_size</span><span class="p">))</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">batch_size</span><span class="p">))):</span>
<span class="n">Xx</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">k</span><span class="o">*</span><span class="n">batch_size</span> <span class="p">:</span> <span class="p">(</span><span class="n">k</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="n">batch_size</span><span class="p">,:]</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
<span class="n">yy</span> <span class="o">=</span> <span class="n">y</span><span class="p">[</span><span class="n">k</span><span class="o">*</span><span class="n">batch_size</span> <span class="p">:</span> <span class="p">(</span><span class="n">k</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="n">batch_size</span><span class="p">]</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">batch_size</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span>
<span class="n">examples</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">Xx</span><span class="p">,</span> <span class="n">yy</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="the-asynchronous-bit">The Asynchronous Bit</h3>
<p>Parallel programming can be very difficult in low-level code, but the multiprocessing library abstracts the details away from us. This makes the final piece of code underwhelming as a big reveal. Anyway, after the definitions above, the final code for Hogwild! is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">multiprocessing</span> <span class="kn">import</span> <span class="n">Pool</span>
<span class="c"># Training with Hogwild!</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">Pool</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="n">p</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">mse_gradient_step</span><span class="p">,</span> <span class="n">examples</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Loss function on the training set:'</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">y</span><span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">w</span><span class="p">))))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Difference from the real weight vector:'</span><span class="p">,</span> <span class="nb">abs</span><span class="p">(</span><span class="n">real_w</span><span class="o">-</span><span class="n">w</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">())</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loss function on the training set: 0.0014203406038
Difference from the real weight vector: 0.0234173242317
</code></pre></div></div>
<h1 id="importance-of-asynchronous-methods">Importance of Asynchronous Methods</h1>
<p>As you probably know, deep learning is pretty popular right now. The neural networks employed to solve deep learning problems are trained via stochastic gradient descent. Deep learning implies large data sets (or, in the case of reinforcement learning, some type of interactive environment). As the scale increases, training neural networks becomes slower and more cumbersome. So, parallelizing the training of neural networks is a big deal.</p>
<p>There are other approaches to parallelizing SGD. Google has popularized <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf">Downpour SGD</a> among others. Downpour SGD keeps the weights on a central hub; several model instances, each training on a different piece of the data, send gradient updates to the hub and retrieve refreshed weights. This idea has been extended to reinforcement learning, where it has been seen to provide surprising improvements in performance and a decrease in training time. In particular, the <a href="https://arxiv.org/pdf/1602.01783.pdf">A3C</a> model uses asynchronous agents playing multiple game instances that publish gradient updates to a central hub, along with a few other tweaks. This has been seen to alleviate some of the time-correlation issues in reinforcement learning that cause divergence, which were previously tackled via experience replay.</p>
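<p>To make the idea concrete, here is a minimal, single-process simulation of the parameter-server pattern behind Downpour SGD. Everything here is illustrative (this is not code from the paper): a plain array stands in for the central hub, and each “worker” is just a data shard that fetches the weights, computes a gradient, and pushes an update back.</p>

```python
import numpy as np

# Toy parameter-server loop for linear regression with an MSE loss.
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 3))
real_w = np.array([1.0, -2.0, 0.5])
y = X @ real_w                               # noiseless linear data

hub_w = np.zeros(3)                          # central parameter store ("hub")
shards = np.array_split(np.arange(400), 4)   # one data shard per "worker"

for epoch in range(200):
    for shard in shards:                     # workers take turns here
        local_w = hub_w.copy()               # fetch the current weights
        Xs, ys = X[shard], y[shard]
        grad = 2 * Xs.T @ (Xs @ local_w - ys) / len(shard)
        hub_w -= 0.05 * grad                 # push the update back to the hub
```

<p>In the real system the workers run concurrently, so the weights a worker fetches may already be stale when its update arrives; the loop above only captures the data flow, not the asynchrony.</p>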
<p>The point is asynchronous methods are here to stay, and who knows what new advances may come from them in the next few years. At present, there are not many mathematical proofs pertaining to the accuracy increases seen for asynchronous methods. Indeed, many of the approaches are not theoretically justified, but their adoption has been driven by impressive practical results. For mathematicians, it will be interesting to see what new mathematics arise from trying to explain these methods formally.</p>Hogwild! is an asynchronous stochastic gradient descent algorithm. The Hogwild! approach utilizes “lock-free” gradient updates. For a machine learning model, this means that the weights of a model are updated by multiple processes at the same time with the possibility of overwriting each other. In this post, we will use the multiprocessing library to implement Hogwild! in Python for training a linear regression model.Covariate Shift, i.e. Why Prediction Quality Can Degrade In Production and How To Fix It2017-10-15T00:00:00+00:002017-10-15T00:00:00+00:00http://srome.github.io//Covariate-Shift,-i.e.-Why-Prediction-Quality-Can-Degrade-In-Production-and-How-To-Fix-It<p>In many classification problems, time is not a feature (literally). Case in point: healthcare. Say you want to predict if a person has a deadly disease. You probably have some historical data that you plan to use to train a model and then deploy it into production. Great– but what happens if the distribution of people you predict on changes by the time your model starts (mis)flagging people? This <em>covariate shift</em> can be a real issue. In healthcare, the Affordable Care Act completely reshaped the landscape of the insured population, so there definitely were models out there that were faced with a healthier population than they were trained on. The question is: Is there anything you can do during training to correct for covariate shift? The answer is yes.</p>
<p>In this post, we are going to discuss the <a href="http://proceedings.mlr.press/v38/sasaki15.pdf">Kullback-Leibler Importance Estimation Procedure</a> (KLIEP) for importance weighting training samples. This approach was implemented in the DensityRatioEstimator class from <a href="https://github.com/srome/pykliep">pykliep</a>. First, we will motivate KLIEP and identify how it weights training examples. Then we will outline the API for pykliep and walk through an example.</p>
<h2 id="the-setup">The Setup</h2>
<p>Assume you have a training set that is distributed according to
\begin{equation} x \sim p(x)\end{equation}
and a test set distributed via
\begin{equation} x \sim q(x). \end{equation}
When <script type="math/tex">p \neq q</script>, this is called <em>covariate shift</em>, and it violates a key assumption of machine learning: that the training set distribution is the same as the test set distribution. (In this case, <script type="math/tex">x=(X,y)</script> for notational convenience.) One approach to correct this phenomenon is called importance weighting. Essentially, you attach a sample weight to each training example in order to satisfy:
\begin{equation}
E_{x \sim w(x)p(x)} \text{loss}(x) = E_{x\sim q(x)} \text{loss}(x)
\end{equation}</p>
<p>Theoretically, the best choice is <script type="math/tex">w(x)=q(x)/p(x)</script> because
\begin{equation}
E_{x \sim w(x)p(x)} \text{loss}(x) = \int_{supp(w(x)p(x))} \text{loss}(x)w(x)p(x)\, dx \end{equation}
\begin{equation}
= \int_{supp(q(x))} \text{loss}(x)q(x)\, dx = E_{x\sim q(x)} \text{loss}(x),
\end{equation}
assuming their supports are equal. So, we can see that the expected loss on the test distribution <script type="math/tex">q</script> is the same as using the sample weights <script type="math/tex">w(x)</script> on the training set! (Notice, this is trivially the case when <script type="math/tex">p(x)=q(x)</script>.)</p>
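<p>A quick Monte Carlo sanity check of this identity, with toy densities chosen so that <script type="math/tex">w</script> has a closed form: take <script type="math/tex">p = N(0,1)</script>, <script type="math/tex">q = N(1,1)</script> and <script type="math/tex">\text{loss}(x) = x^2</script>, for which <script type="math/tex">w(x) = q(x)/p(x) = e^{x - 1/2}</script>.</p>

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200_000
x_p = rng.normal(0, 1, n)   # samples from the "training" distribution p
x_q = rng.normal(1, 1, n)   # samples from the "test" distribution q

w = np.exp(x_p - 0.5)       # closed-form density ratio q(x)/p(x)

weighted_loss = np.mean(w * x_p ** 2)   # E_{x ~ w(x)p(x)} loss(x)
test_loss = np.mean(x_q ** 2)           # E_{x ~ q(x)} loss(x); both estimate 2
```

<p>The weighted average over samples from <script type="math/tex">p</script> matches the plain average over samples from <script type="math/tex">q</script>, which is exactly the property that training with sample weights exploits.</p>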
<p>Don’t celebrate yet. It is not very easy to estimate <script type="math/tex">w(x)</script>, and many approaches are unstable. A simple approach is to train a classifier to predict if an example comes from the training set or the test set. Then you would use the likelihood score from a model
\begin{equation}\hat w(x)= \frac{p(y =1 |x)}{ p(y=0 |x) }\end{equation}
where <script type="math/tex">y=1</script> indicates membership in the test set. This tends to work poorly in practice as the ratio is very sensitive.</p>
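<p>A sketch of that classifier trick (assuming scikit-learn is available; all variable names here are mine): label training points <script type="math/tex">y=0</script>, label test points <script type="math/tex">y=1</script>, fit a probabilistic classifier on the pooled data, and use the odds as <script type="math/tex">\hat w</script>. Balanced class sizes keep the class prior out of the ratio.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_tr = rng.normal(0, .5, size=(500, 2))   # "training" distribution
X_te = rng.normal(1, .5, size=(500, 2))   # shifted "test" distribution

X_pool = np.vstack([X_tr, X_te])
y_pool = np.r_[np.zeros(500), np.ones(500)]   # 1 = came from the test set
clf = LogisticRegression().fit(X_pool, y_pool)

p1 = clf.predict_proba(X_tr)[:, 1]
w_hat = p1 / (1 - p1)   # odds p(y=1|x)/p(y=0|x), the density ratio estimate
```

<p>Training examples that look like test examples get large weights. The instability shows up when <script type="math/tex">p(y=0|x)</script> approaches zero: the odds blow up and a handful of examples dominate the loss.</p>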
<p>A more stable approach is the aforementioned KLIEP algorithm. This approach directly estimates the density ratio <script type="math/tex">q(x)/p(x)</script>. The original implementation is not very scalable, but since the first publication, there have been new approaches to increase its efficiency.</p>
<h2 id="kliep">KLIEP</h2>
<p>The KLIEP algorithm defines an estimate for <script type="math/tex">w</script> of the following form:</p>
<p>\begin{equation}
\hat w(x) = \sum_{x_i \in D_{te}} \alpha_i K(x,x_i,\sigma),
\end{equation}</p>
<p>where <script type="math/tex">D_{te}\subset X_{te}</script> and <script type="math/tex">\alpha_i,\sigma \in \mathbb{R}</script>. The original paper chose <script type="math/tex">K</script> to be a Gaussian kernel, i.e.:
\begin{equation}
K(x,x_i,\sigma) = e^{ - \frac{| x-x_i|^2}{2\sigma^2}}.
\end{equation}</p>
<p>The set <script type="math/tex">D_{te}</script> is a random sample of test set vectors. The parameter <script type="math/tex">\sigma</script> is called the “width” of the kernel. The authors then introduce the quantity</p>
<p>\begin{equation}
J = \frac{1}{|X_{te}|} \sum_{x\in X_{te}}\log \hat w (x),
\end{equation}
where <script type="math/tex">X_{te}</script> is the test set. They prove that the larger this value is, the closer <script type="math/tex">q(x)</script> and <script type="math/tex">\hat w(x) p(x)</script> are in Kullback-Leibler divergence. In other words, you want this value maximized to obtain the best possible sample weights for training. KLIEP uses gradient ascent to find the <script type="math/tex">\alpha_i</script> which maximize <script type="math/tex">J</script> given <script type="math/tex">D_{te}</script> and <script type="math/tex">\sigma</script>, subject to the constraints that
\begin{equation}
\alpha_i \geq 0 \quad\text{and}\quad\int \hat w(x) p(x) \, dx = 1,
\end{equation}
which means the quantity <script type="math/tex">\hat w(x) p(x)</script> is akin to a true probability density function (i.e. it integrates to 1).</p>
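<p>The pieces above fit in a few lines of numpy. The sketch below is my own simplified implementation (not the pykliep code): projected gradient ascent on <script type="math/tex">J</script>, where each step moves <script type="math/tex">\alpha</script> uphill, clips it at zero, and renormalizes so that <script type="math/tex">\hat w</script> averages to 1 over the training set.</p>

```python
import numpy as np

def gaussian_kernel(X, centers, sigma):
    # K(x, x_i, sigma) for every pair of rows in X and centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kliep(X_train, X_test, sigma=0.5, n_centers=100, lr=1e-4, n_iter=2000, seed=0):
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X_test), size=min(n_centers, len(X_test)), replace=False)
    centers = X_test[idx]                         # D_te, a sample of test points
    A = gaussian_kernel(X_test, centers, sigma)   # kernels at the test points
    b = gaussian_kernel(X_train, centers, sigma).mean(axis=0)
    alpha = np.ones(len(centers))
    alpha /= b @ alpha                            # start at a feasible point
    for _ in range(n_iter):
        alpha += lr * A.T @ (1.0 / (A @ alpha + 1e-12))  # ascent direction for J
        alpha = np.maximum(alpha, 0.0)            # constraint: alpha_i >= 0
        alpha /= b @ alpha                        # constraint: mean_train w_hat = 1
    return lambda X: gaussian_kernel(X, centers, sigma) @ alpha
```

<p>Calling <code>kliep(X_train, X_test)</code> returns a function; evaluating it on the training set yields the sample weights. Constant factors like <script type="math/tex">1/|X_{te}|</script> are folded into the learning rate.</p>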
<h3 id="model-selection-via-likelihood-cross-validation">Model Selection via Likelihood Cross Validation</h3>
<p>The DensityRatioEstimator class has two main parameters to tune: <em>num_params</em> and <em>sigma</em>, which correspond to the size of <script type="math/tex">D_{te}</script> and <script type="math/tex">\sigma</script> respectively. An interesting component of the approach is the use of likelihood cross validation (LCV). As opposed to grid search cross validation, this approach only splits the test set into folds. It turns out that this is the proper form of cross validation to estimate the theoretical quantity that indicates goodness of fit. Otherwise, it is very similar to grid search cross validation:</p>
<ol>
<li>Split the test set into k-folds.</li>
<li>For each test fold <script type="math/tex">k</script>, train each candidate model (i.e. model + parameter settings) on the full training set plus the test folds <script type="math/tex">i \neq k</script>. Score the resulting model on the <script type="math/tex">k</script>th test fold.</li>
<li>Choose the best parameters based on the averaged scores and train the model on the full training set and full test set.</li>
</ol>
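<p>The procedure can be sketched as follows; <code>fit_ratio</code> is a hypothetical stand-in for any density ratio estimator (a crude ratio of kernel density estimates, for illustration only). The point is the fold structure: only the test set is split, and each held-out fold is scored by the mean log <script type="math/tex">\hat w</script>, i.e. the <script type="math/tex">J</script> statistic.</p>

```python
import numpy as np

def fit_ratio(X_train, X_test, sigma):
    # stand-in estimator: ratio of two Gaussian kernel density estimates
    def w_hat(X):
        def kde(ref):
            d2 = ((X[:, None, :] - ref[None, :, :]) ** 2).sum(axis=-1)
            return np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1)
        return (kde(X_test) + 1e-12) / (kde(X_train) + 1e-12)
    return w_hat

def lcv_score(X_train, X_test, sigma, k=5, seed=0):
    idx = np.random.RandomState(seed).permutation(len(X_test))
    folds = np.array_split(idx, k)                  # 1. split the test set
    scores = []
    for i in range(k):
        held_out = X_test[folds[i]]
        held_in = X_test[np.concatenate([folds[j] for j in range(k) if j != i])]
        w_hat = fit_ratio(X_train, held_in, sigma)  # 2. fit on the other folds
        scores.append(np.log(w_hat(held_out)).mean())  # J on the held-out fold
    return float(np.mean(scores))                   # 3. average, pick the best
```

<p>Each candidate setting of <script type="math/tex">\sigma</script> (and, in pykliep, <em>num_params</em>) gets an averaged <script type="math/tex">J</script>, and the winner is refit on the full training and test sets.</p>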
<h2 id="densityratioestimator-usage">DensityRatioEstimator Usage</h2>
<p>Let’s walk through how to use the DensityRatioEstimator class, which implements KLIEP. It’s meant to follow the spirit of sklearn’s conventions, but there are key differences. This estimator takes as its input both the training and test set, which already breaks sklearn’s “fit” API: “fit” expects X and y to have the same length, and that is obviously not the case for a training and test set in general.</p>
<p>In theory, there’s nothing stopping us from using the procedure on <script type="math/tex">x=(X,y)</script>. For training sets with a time element, one can test if KLIEP improves your validation scores. In problems without time explicitly as a feature, the underlying distribution can still drift over time. You can still cut your data up by time (collection time? date added to the data?) and apply KLIEP. You might even have data to predict that is unlabeled. In that case, you can simply use <script type="math/tex">X_{test}</script> to develop sample weights (and then your approach edges toward semi-supervised learning, as you use both labeled and unlabeled examples).</p>
<p>Next, an example: We will define a simple learning problem where <script type="math/tex">x\in\mathbb{R}^3</script>. The training set is generated via
\begin{equation}
X \sim (N(0,.5), N(0,.5))
\end{equation}
and a covariate shift in <script type="math/tex">X</script> will occur and the test set will be distributed say:
\begin{equation}
X \sim (N(1,.5), N(1,.5)).
\end{equation}</p>
<p>The y data will be generated by
\begin{equation}
y= \sin(n\pi)+m + N(0,\epsilon)
\end{equation}
where <script type="math/tex">X=(m,n)\in \mathbb{R}^2</script>.</p>
<h3 id="training-and-test-definition">Training and Test Definition</h3>
<p>First, let’s visualize the covariate shift.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">eps</span> <span class="o">=</span> <span class="mf">1e-3</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="o">.</span><span class="mi">5</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">500</span><span class="p">,</span><span class="mi">2</span><span class="p">))</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">column_stack</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="o">.</span><span class="mi">5</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">300</span><span class="p">,)),</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="o">.</span><span class="mi">5</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">300</span><span class="p">,))))</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">X_train</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="p">)</span> <span class="o">+</span> <span class="n">X_train</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">eps</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">X_test</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="p">)</span> <span class="o">+</span> <span class="n">X_test</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">eps</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">f</span><span class="p">,</span><span class="n">axs</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span><span class="n">X_train</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">'b'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=.</span><span class="mi">2</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_test</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span><span class="n">X_test</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=.</span><span class="mi">2</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Distributions of X: train (blue) vs. test (red)'</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Distributions of y: train (blue) vs. test (red)'</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'b'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=.</span><span class="mi">2</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=.</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="../images/kliep/output_3_1.png" alt="png" /></p>
<h3 id="estimating-the-density-ratio-w">Estimating the Density Ratio w</h3>
<p>Now, we will train two random forests: with and without KLIEP sample weights.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pykliep</span> <span class="kn">import</span> <span class="n">DensityRatioEstimator</span>
<span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">RandomForestRegressor</span>
<span class="c"># Estimate the density ratio w</span>
<span class="n">kliep</span> <span class="o">=</span> <span class="n">DensityRatioEstimator</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">7</span><span class="p">)</span>
<span class="n">kliep</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">)</span> <span class="c"># keyword arguments are X_train and X_test</span>
<span class="n">weights</span> <span class="o">=</span> <span class="n">kliep</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
<span class="n">rf</span> <span class="o">=</span> <span class="n">RandomForestRegressor</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">rf2</span> <span class="o">=</span> <span class="n">RandomForestRegressor</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">f</span><span class="p">,</span><span class="n">axs</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
<span class="n">rf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="n">y_test</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="n">rf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'RFW MAE = </span><span class="si">%</span><span class="s">s'</span> <span class="o">%</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">y_test</span><span class="o">-</span><span class="n">rf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">))))</span>
<span class="n">rf2</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">sample_weight</span><span class="o">=</span><span class="n">weights</span><span class="p">)</span> <span class="c"># Train using the sample weights!</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="n">y_test</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span> <span class="n">np</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="n">rf2</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'RF + KLIEP MAE = </span><span class="si">%</span><span class="s">s'</span> <span class="o">%</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">abs</span><span class="p">((</span><span class="n">y_test</span><span class="o">-</span><span class="n">rf2</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)))))</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Log Predicted'</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Log Actual'</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="o">-.</span><span class="mi">8</span><span class="p">,</span><span class="o">.</span><span class="mi">6</span><span class="p">],</span> <span class="p">[</span><span class="o">-.</span><span class="mi">8</span><span class="p">,</span><span class="o">.</span><span class="mi">6</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="o">-.</span><span class="mi">8</span><span class="p">,</span><span class="o">.</span><span class="mi">6</span><span class="p">],</span> <span class="p">[</span><span class="o">-.</span><span class="mi">8</span><span class="p">,</span><span class="o">.</span><span class="mi">6</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="../images/kliep/output_5_2.png" alt="png" /></p>
<h2 id="validation-with-known-w">Validation With Known w</h2>
<p>If you had access to the true <script type="math/tex">w(x)</script>, you could just look at <script type="math/tex">w(x)</script> and <script type="math/tex">\hat w(x)</script> and compare. We are going to do just that.</p>
<p>If we use the last example above, the pdf for <script type="math/tex">X=(m,n)</script> is independent in each component and so
\begin{equation}
p(X) = p(m)p(n)
\end{equation}
for the training and test set. So then, <script type="math/tex">w(X)</script> is given by
\begin{equation}
w(X) = \frac{p_{te}(m) p_{te}(n)}{p_{tr}(m) p_{tr}(n)}.
\end{equation}
We can then plot how well <script type="math/tex">\hat w (X)</script> approximates <script type="math/tex">w(X)</script>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">128</span><span class="p">)</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="o">.</span><span class="mi">5</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span><span class="mi">2</span><span class="p">))</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="o">.</span><span class="mi">5</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">1500</span><span class="p">,</span><span class="mi">2</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">kliep</span> <span class="o">=</span> <span class="n">DensityRatioEstimator</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="n">kliep</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Take a look at the LCV results for (num_params, sigma)</span>
<span class="n">kliep</span><span class="o">.</span><span class="n">_j_scores</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[((0.2, 0.25), 2.3883652107662297),
((0.1, 0.25), 2.2569424774193245),
((0.2, 0.5), 1.8987499277478477),
((0.1, 0.5), 1.8136986917106765),
((0.2, 0.75), 1.4189331965313798),
((0.1, 0.75), 1.3891472114097529),
((0.2, 1), 1.0377502976502824),
((0.1, 1), 0.98391914587619367),
((0.2, 0.1), 0.94975924498825803),
((0.1, 0.1), 0.19270111220322969)]
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">scipy.stats</span>
<span class="c"># Define the true w(X)</span>
<span class="k">def</span> <span class="nf">w_true</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
<span class="n">num</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">stats</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">scale</span><span class="o">=.</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">])</span><span class="o">*</span><span class="n">scipy</span><span class="o">.</span><span class="n">stats</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">scale</span><span class="o">=.</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">denom</span> <span class="o">=</span><span class="n">scipy</span><span class="o">.</span><span class="n">stats</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">scale</span><span class="o">=.</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">])</span><span class="o">*</span><span class="n">scipy</span><span class="o">.</span><span class="n">stats</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">scale</span><span class="o">=.</span><span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span>
<span class="k">return</span> <span class="n">num</span><span class="o">/</span><span class="n">denom</span>
</code></pre></div></div>
<p>As <script type="math/tex">X</script> is two-dimensional, we’ll actually plot a parametrized line <script type="math/tex">X(t):= (m(t),n(t))</script> where <script type="math/tex">m(t)=n(t)=t</script>. In effect, we take a slice of <script type="math/tex">w(X)</script>, which allows us to plot a 2-D image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Xt</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">column_stack</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">100</span><span class="p">),</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">100</span><span class="p">)))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">100</span><span class="p">),</span><span class="n">w_true</span><span class="p">(</span><span class="n">Xt</span><span class="p">),</span><span class="s">'b'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">100</span><span class="p">),</span><span class="n">kliep</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">Xt</span><span class="p">),</span><span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'$$w(X)$$ (Blue) vs $$</span><span class="err">\</span><span class="s">hat w(X)$$ (Red)'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'t'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="../images/kliep/output_13_1.png" alt="png" /></p>In many classification problems, time is not a feature (literally). Case in point: healthcare. Say you want to predict if a person has a deadly disease. You probably have some historical data that you plan to use to train a model and then deploy it into production. Great– but what happens if the distribution of people you predict on changes by the time your model starts (mis)flagging people? This covariate shift can be a real issue. In healthcare, the Affordable Care Act completely reshaped the landscape of the insured population, so there definitely were models out there that were faced with a healthier population than they were trained on. The question is: Is there anything you can do during training to correct for covariate shift? The answer is yes.An Annotated Proof of Generative Adversarial Networks with Implementation Notes2017-08-27T00:00:00+00:002017-08-27T00:00:00+00:00http://srome.github.io//An-Annotated-Proof-of-Generative-Adversarial-Networks-with-Implementation-Notes<p>The preferred illustration of Generative Adversarial Networks (which seems indoctrinated at this point like Bob, Alice and Eve for cryptography) is the scenario of counterfeiting money. The Generator counterfeits money and the Discriminator is supposed to discriminate between real and fake dollars. As the Generator gets better, the Discriminator has to improve also. This implies a dual training scheme where each model tries to one-up the other (i.e. through additional learning). In this post, we will explore the proof from the original paper and demonstrate a typical implementation.</p>
<h2 id="the-perfect-generator-mathematically">The Perfect Generator Mathematically</h2>
<p>The illustration is only half the battle. In order to prove anything, we need to be able to formally (mathematically) write down the objective. This informs not only the proof strategy but gives one the end goal to “keep an eye out for” while studying the proof.</p>
<p>The goal is that the Generator generates examples that are indistinguishable from the real data. Mathematically, this is the concept of random variables being equal in distribution (or in law). Another way to say this is that their probability density functions (i.e. the probability measure induced by the random variable on its range) are equal: <script type="math/tex">p_G(x)=p_{data}(x)</script>. This is exactly the strategy of the proof: define an optimization problem where the optimal <script type="math/tex">G</script> satisfies <script type="math/tex">p_G(x)=p_{data}(x)</script>. If you knew that your solution <script type="math/tex">G</script> satisfies this relationship, then you have a reasonable expectation that you can represent the optimal <script type="math/tex">G</script> as a neural network via typical SGD training!</p>
<h2 id="the-optimization-problem">The Optimization Problem</h2>
<p>With the counterfeiting illustration in mind, the motivation for defining the optimization problem is guided by the following two terms. First, we require that the discriminator <script type="math/tex">D</script> recognizes examples from the <script type="math/tex">p_{data}(x)</script> distribution, hence:
\begin{equation}
E_{x \sim p_{data}(x)}\log(D(x))
\end{equation}
where <script type="math/tex">E</script> denotes the expectation. This term comes from the “positive class” of the log-loss function. Maximizing this term corresponds to <script type="math/tex">D</script> being able to precisely predict <script type="math/tex">D(x)=1</script> when <script type="math/tex">x\sim p_{data}(x)</script>.</p>
<p>The next term has to do with the Generator <script type="math/tex">G</script> tricking the discriminator. In particular, the term comes from the “negative class” of the log-loss function:
\begin{equation}
E_{z \sim p_{z}(z)}\log(1-D(G(z))).
\end{equation}
If this value is maximized (remember, the log of <script type="math/tex">% <![CDATA[
x<1 %]]></script> is negative), that means <script type="math/tex">D(G(z))\approx 0</script> and so <script type="math/tex">G</script> is NOT tricking <script type="math/tex">D</script>.</p>
<p>To combine these two concepts, the Discriminator’s goal is to maximize
\begin{equation}
E_{x \sim p_{data}(x)}\log(D(x))+E_{z \sim p_{z}(z)}\log(1-D(G(z)))
\end{equation}
given <script type="math/tex">G</script>, which means that <script type="math/tex">D</script> properly identifies real and fake data points.
The optimal Discriminator in the sense of the above equation for a given <script type="math/tex">G</script> will be denoted <script type="math/tex">D^*_G</script>. Define the value function</p>
<p>\begin{equation}
V(G,D):= E_{x \sim p_{data}(x)}\log(D(x))+E_{z \sim p_z(z)}\log(1-D(G(z))).
\end{equation}</p>
<p>Then we can write <script type="math/tex">D_G^* = \text{argmax}_D V(G,D)</script>. Now, <script type="math/tex">G</script>’s aim is the reverse– the optimal <script type="math/tex">G</script> minimizes the previous equation when <script type="math/tex">D=D^*_G</script>. In the paper, they prefer to write that the optimal <script type="math/tex">G</script> and <script type="math/tex">D</script> solve the minimax game for the value function</p>
<p>\begin{equation}
V(G,D)= E_{x \sim p_{data}(x)}\log(D(x))+E_{z \sim p_z(z)}\log(1-D(G(z))).
\end{equation}</p>
<p>Then, we can write that the optimal solution satisfies
\begin{equation}
G^* = \text{argmin}_G V(G,D_G^*).
\end{equation}</p>
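<p>As an entirely illustrative sketch (the functions below are toy stand-ins, not anything from the paper), the value function can be estimated by Monte Carlo: sample real points and latent points, then average the two log terms.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-dimensional stand-ins (assumptions for this demo):
# real data ~ N(1, 0.5); latent z ~ N(0, 1).
def G(z):
    return 0.5 * z + 1.0  # this G already matches the data distribution

def D(x):
    return 1.0 / (1.0 + np.exp(-(x - 1.0)))  # an arbitrary logistic guess at D

x_real = rng.normal(loc=1.0, scale=0.5, size=100_000)
z = rng.normal(size=100_000)

# Monte Carlo estimate of V(G, D) = E_x log D(x) + E_z log(1 - D(G(z)))
V = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z))))
```

Since this toy <script type="math/tex">G</script> already satisfies <script type="math/tex">p_G=p_{data}</script>, maximizing over <script type="math/tex">D</script> would give <script type="math/tex">-\log 4</script> (as the proof below shows); this particular <script type="math/tex">D</script> is suboptimal, so the estimate lands slightly below that value.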
<h2 id="the-annotated-proof">The Annotated Proof</h2>
<p>At this point, we must show that this optimization problem has a unique solution <script type="math/tex">G^*</script> and that this solution satisfies <script type="math/tex">p_G=p_{data}</script>.</p>
<p>One big idea from the GAN paper– which is different from other approaches– is that <script type="math/tex">G</script> need not be invertible. This is crucial because in practice <script type="math/tex">G</script> is NOT invertible. Many pieces of notes online miss this fact when they try to replicate the proof and incorrectly use the change of variables formula from calculus (which would depend on <script type="math/tex">G</script> being invertible). Rather, the whole proof relies on this equality:</p>
<p>\begin{equation}
E_{z \sim p_{z}(z)}\log(1-D(G(z))) = E_{x \sim p_{G}(x)}\log(1-D(x)) .
\end{equation}</p>
<p>This equality comes from the <a href="https://en.wikipedia.org/wiki/Radon%E2%80%93Nikodym_theorem">Radon-Nikodym Theorem</a> of measure theory. Colloquially, it’s sometimes called the <a href="https://en.wikipedia.org/wiki/Law_of_the_unconscious_statistician">law of the unconscious statistician</a>, which was probably given its name by mathematicians– if I had to guess. This equality shows up in the proof of Proposition 1 in the <a href="https://arxiv.org/pdf/1406.2661.pdf">original GAN paper</a> to little fanfare when they write:</p>
<p>\begin{equation}
\int_{x} p_{data}(x)\log D(x) \, \mathrm{d}x + \int_{z} p(z)\log ( 1- D(G(z))) \, \mathrm{d}z
\end{equation}
\begin{equation}
=\int_{x} p_{data}(x)\log D(x) + p_G(x) \log ( 1- D(x)) \, \mathrm{d}x
\end{equation}</p>
<p>I have seen lecture notes use the <a href="https://en.wikipedia.org/wiki/Probability_density_function#Dependent_variables_and_change_of_variables">change of variables formula</a> for this line! That is incorrect, as to do a change of variables, one must calculate <script type="math/tex">G^{-1}</script> which is not assumed to exist (and in practice for neural networks– does not exist!). I imagine this approach is so common in the machine learning / stats literature that it goes unnamed.</p>
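<p>The equality is easy to check numerically for a toy, non-invertible-in-spirit setup (everything below is an illustrative assumption): sampling <script type="math/tex">z\sim p_z</script> and pushing through <script type="math/tex">G</script> gives the same expectation as sampling from <script type="math/tex">p_G</script> directly.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def G(z):
    return z + 1.0  # with z ~ N(0,1), the pushforward density p_G is N(1,1)

def D(x):
    return 1.0 / (1.0 + np.exp(-x))  # an arbitrary fixed "discriminator"

z = rng.normal(size=500_000)              # z ~ p_z = N(0, 1)
x_g = rng.normal(loc=1.0, size=500_000)   # x ~ p_G = N(1, 1), sampled directly

lhs = np.mean(np.log(1.0 - D(G(z))))  # E_{z ~ p_z} log(1 - D(G(z)))
rhs = np.mean(np.log(1.0 - D(x_g)))   # E_{x ~ p_G} log(1 - D(x))
```

Up to Monte Carlo error, `lhs` and `rhs` agree, exactly as the law of the unconscious statistician promises, and at no point did we need <script type="math/tex">G^{-1}</script>.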
<h3 id="finding-the-best-discriminator">Finding the Best Discriminator</h3>
<p>Due to the last equation, we actually can write down the optimal <script type="math/tex">D</script> given <script type="math/tex">G</script>. Perhaps surprisingly, it comes from simple calculus. We will find a maximum of the integrand in the equation above and then choose that for <script type="math/tex">D</script>. Notice, the integrand can be written:
\begin{equation}
f(y)= a \log y + b \log(1-y).
\end{equation}
To find the critical points,
\begin{equation}
f^\prime(y) = 0 \Rightarrow \frac{a}{y} - \frac{b}{1-y} = 0 \Rightarrow y = \frac{a}{a+b}
\end{equation}
if <script type="math/tex">a+b \neq 0</script>. If you continue and do the second derivative test:
\begin{equation}
f^{\prime\prime}\big ( \frac{a}{a+b} \big) = - \frac{a}{(\frac{a}{a+b})^2} - \frac{b}{(1-\frac{a}{a+b})^2} < 0
\end{equation}
when <script type="math/tex">a,b \in (0,1)</script>, and so <script type="math/tex">\frac{a}{a+b}</script> is a maximum.</p>
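<p>A quick numerical sanity check of this pointwise maximization (the values of <script type="math/tex">a</script> and <script type="math/tex">b</script> are arbitrary assumptions standing in for <script type="math/tex">p_{data}(x)</script> and <script type="math/tex">p_G(x)</script> at a fixed <script type="math/tex">x</script>):</p>

```python
import numpy as np

# f(y) = a*log(y) + b*log(1-y), the integrand at one fixed x
a, b = 0.7, 0.4  # arbitrary positive stand-ins for p_data(x), p_G(x)

def f(y):
    return a * np.log(y) + b * np.log(1.0 - y)

y = np.linspace(1e-6, 1.0 - 1e-6, 1_000_001)
y_star = y[np.argmax(f(y))]  # grid maximizer

# Calculus says the maximum sits at a / (a + b)
print(y_star, a / (a + b))
```

The grid maximizer agrees with <script type="math/tex">a/(a+b)</script> to within the grid spacing.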
<p>So, we can then write that
\begin{equation}
V(G,D) = \int_{x} p_{data}(x)\log D(x) + p_G(x) \log ( 1- D(x)) \, \mathrm{d}x
\end{equation}
\begin{equation}
\leq \int_x \max_y {p_{data}(x)\log y + p_G(x) \log ( 1- y)}\, \mathrm{d}x.
\end{equation}
If you let <script type="math/tex">D(x) = \frac{p_{data}}{p_{data}+p_{G}}</script>, you then achieve this maximum. Since <script type="math/tex">f</script> has a unique maximizer on the interval of interest, the optimal <script type="math/tex">D</script> is unique as well, as no other choice of <script type="math/tex">D</script> will achieve the maximum.</p>
<p>Notice, that this optimal <script type="math/tex">D</script> is not practically calculable, but mathematically it is still very important. We do not know <script type="math/tex">p_{data}(x)</script> a priori and so we would never be able to use it directly during training. On the other hand, its existence enables us to prove that an optimal <script type="math/tex">G</script> exists, and during training we only need to approximate this <script type="math/tex">D</script>.</p>
<h3 id="finding-the-best-generator">Finding the Best Generator</h3>
<p>Of course, the goal of the GAN process is for <script type="math/tex">p_G = p_{data}</script>. What does this mean for the optimal <script type="math/tex">D</script>? Substituting into the equation for <script type="math/tex">D^*_G</script>,
\begin{equation}
D_G^* = \frac{p_{data}}{p_{data}+p_G} = \frac{1}{2}.
\end{equation}
This means that the Discriminator is completely confused, outputting <script type="math/tex">1/2</script> for examples from both <script type="math/tex">p_{G}</script> and <script type="math/tex">p_{data}</script>. Using this idea, the authors exploit this to prove that this <script type="math/tex">G</script> is the solution to the minimax game. The theorem is as follows:</p>
<p>“The global minimum of the virtual training criterion <script type="math/tex">C(G)=\max_D V(G,D)</script> is acheived if and only if <script type="math/tex">p_G = p_{data}</script>.”</p>
<p>The theorem in the paper is an if and only if statement, and so we prove both directions. First, we approach the backwards direction and show what value <script type="math/tex">C(G)</script> takes. Then, we approach the forward direction with the newfound knowledge from the backwards direction.
Assuming <script type="math/tex">p_G=p_{data}</script>, we can write</p>
<p>\begin{equation}
V(G, D_G^*) = \int_{x} p_{data}(x)\log \frac{1}{2} + p_G(x) \log \big ( 1- \frac{1}{2}\big) \, \mathrm{d}x
\end{equation}</p>
<p>and
\begin{equation}
V(G,D_G^*) = - \log 2 \int_{x}p_{G}(x) \,\mathrm{d}x - \log 2 \int_x p_{data}(x)\, \mathrm{d}x = - 2 \log 2 = -\log4.
\end{equation}
This value is a candidate for the global minimum (since it is what occurs when <script type="math/tex">p_G=p_{data}</script>). We will stop the backwards direction for now because we want to prove that this value is always the minimum (which will satisfy the “if” and “only if” pieces at the same time). So, drop the assumption that <script type="math/tex">p_G=p_{data}</script> and observe that for any <script type="math/tex">G</script>, we can plug in <script type="math/tex">D^*_{G}</script> into <script type="math/tex">C(G)=\max_D V(G,D)</script>:
\begin{equation}
C(G) = \int_{x} p_{data}(x)\log \big (\frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \big ) + p_G(x) \log\big ( \frac{p_{G}(x)}{p_{G}(x)+p_{data}(x)}\big ) \, \mathrm{d}x.
\end{equation}</p>
<p>The second integrand comes from the following trick:</p>
<p>\begin{equation}
1-D_G^*(x) = 1 - \frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} =\frac{p_G(x) + p_{data}(x)}{p_{G}(x)+p_{data}(x)} - \frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)}
\end{equation}
\begin{equation}
= \frac{p_{G}(x)}{p_{G}(x)+p_{data}(x)}.
\end{equation}</p>
<p>Knowing that <script type="math/tex">-\log 4</script> is the candidate for the global minimum, I want to stuff that value into the equation, so I add and subtract <script type="math/tex">\log2</script> from each integral, multiplied by the probability densities. This is a common mathematical proof technique which doesn’t change the equality because in essence I am adding 0 to the equation.</p>
<p>\begin{equation}
C(G) = \int_{x} (\log2 -\log2)p_{data}(x) + p_{data}(x)\log \big (\frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \big )
\end{equation}
\begin{equation}
\quad+(\log2 - \log2)p_G(x) + p_G(x) \log\big ( \frac{p_G(x)}{p_G(x)+p_{data}(x)}\big ) \, \mathrm{d}x.
\end{equation}</p>
<p>\begin{equation}
C(G) = - \log2\int_{x} p_{G}(x) + p_{data}(x)\, \mathrm{d}x
\end{equation}
\begin{equation}+\int_{x}p_{data}(x)\Big(\log2 + \log \big (\frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \big ) \big)
\end{equation}
\begin{equation}+ p_G(x)\Big (\log2 + \log\big ( \frac{p_{G}(x)}{p_{G}(x)+p_{data}(x)}\big ) \Big ) \, \mathrm{d}x.
\end{equation}</p>
<p>Due to the definition of probability densities, integrating <script type="math/tex">p_G</script> and <script type="math/tex">p_{data}</script> over their domain equals 1, i.e.:
\begin{equation}
-\log2\int_{x} p_{G}(x) + p_{data}(x)\, \mathrm{d}x = -\log2 ( 1 + 1) = -2\log2 = -\log4.
\end{equation}
Moreover, using the definition of the logarithm,
\begin{equation}
\log2 + \log \big (\frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \big ) = \log \big ( 2\frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \big )\end{equation}
\begin{equation} = \log \big (\frac{p_{data}(x)}{(p_{G}(x)+p_{data}(x))/2} \big ).
\end{equation}</p>
<p>So, substituting in these equalities we can write:
\begin{equation}
C(G) = - \log4 + \int_{x}p_{data}(x)\log \big (\frac{p_{data}(x)}{(p_{G}(x)+p_{data}(x))/2} \big )\,\mathrm{d}x
\end{equation}
\begin{equation}+ \int_x p_G(x)\log\big ( \frac{p_{G}(x)}{(p_{G}(x)+p_{data}(x))/2}\big ) \, \mathrm{d}x.
\end{equation}</p>
<p>Now, for those of you who have heard of the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a>, you will recognize that each integral is exactly that! In particular,
\begin{equation}
C(G) = - \log4 + KL\big (p_{data} \,\big\|\, \frac{p_{data}+p_{G}}{2} \big ) + KL\big(p_{G} \,\big\|\, \frac{p_{data}+p_{G}}{2}\big) .
\end{equation}</p>
<p>The Kullback-Leibler divergence is always non-negative, so we immediately see that <script type="math/tex">-\log 4</script> is the global minimum of <script type="math/tex">C(G)</script>. If we prove that only one <script type="math/tex">G</script> achieves this, then we are finished because <script type="math/tex">p_G=p_{data}</script> would be the unique point where <script type="math/tex">C(G)=-\log 4</script>. To do this, we have to notice a second identity:
\begin{equation}
C(G) = - \log4 + 2\cdot JSD \big (p_{data} \,\|\, p_{G}\big ) .
\end{equation}</p>
<p>The JSD term is known as the <a href="https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence">Jensen-Shannon divergence</a>. Of course, you would have to be initiated in the field to know this identity. This divergence is actually the square of the Jensen-Shannon distance metric, which gives us the property that <script type="math/tex">JSD \big (p_{data} \| p_{G}\big )</script> is only 0 when <script type="math/tex">p_{data}=p_{G}</script>.</p>
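<p>For intuition, the identity is easy to verify on discrete distributions (the distributions below are random stand-ins, purely for illustration):</p>

```python
import numpy as np

def kl(p, q):
    # Discrete Kullback-Leibler divergence (natural log); assumes matching supports
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
p_data = rng.random(10); p_data /= p_data.sum()
p_g = rng.random(10);    p_g /= p_g.sum()

m = (p_data + p_g) / 2.0                     # the mixture (p_data + p_G)/2
C = -np.log(4) + kl(p_data, m) + kl(p_g, m)  # C(G) in its KL form
# The two KL terms together are 2 * JSD(p_data || p_G) >= 0, so C >= -log 4,
# with equality exactly when p_g == p_data:
C_min = -np.log(4) + kl(p_data, p_data) + kl(p_data, p_data)  # kl(p, p) = 0
```

Here `C` sits strictly above <script type="math/tex">-\log 4</script> because the two random distributions differ, while `C_min` recovers <script type="math/tex">-\log 4</script> exactly.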
<h3 id="convergence">Convergence</h3>
<p>Now, the main thrust of the paper has been proven: namely, that <script type="math/tex">p_G=p_{data}</script> at the optimum of <script type="math/tex">\max_D V(G,D)</script>. There is an additional proof in the paper that states given enough training data and the right circumstances, the training process will converge to the optimal <script type="math/tex">G</script>. We will omit the details.</p>
<h1 id="implementation">Implementation</h1>
<p>Most implementations do not explicitly perform the minimax optimization suggested in the original paper. Instead, they use a more typical training regimen inspired by the paper. For clarity, we will use the concrete case of generating images. First, we define two models. The first is the Discriminator <script type="math/tex">D(x)</script> and the second is the Adversarial net <script type="math/tex">A(z)=D(G(z))</script>. Notice, the adversarial net is also a classifier. In our setting, we will define the positive label 1 to be a real image and 0 to be a fake image. So, training will consist of two steps for each minibatch:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Train the Discriminator D to discriminate between real images and generated images via a standard 0-1 classification loss function.
2. Freeze the weights of D and train the adversarial network A with generated images with their labels forced to be 1.
</code></pre></div></div>
<p>The second step is interesting because it’s very intuitive. We are not updating <script type="math/tex">D</script>’s weights, but we are updating <script type="math/tex">G</script> to make the generated images “look more like real images” (i.e. having the Adversarial net output 1 for fake images). If the Adversarial network outputs a 1, that means we have tricked the Discriminator.</p>
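<p>The minibatch construction for the two steps can be sketched with plain numpy (the arrays and names below are illustrative placeholders for the real data and model outputs; the commented-out <code>train_on_batch</code> calls mark where the actual Keras training would go):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, latent_dim = 32, 100

# Placeholder arrays standing in for a real minibatch and for G(z):
real_images = rng.random((batch_size, 28, 28, 1))
generated = rng.random((batch_size, 28, 28, 1))

# Step 1: Discriminator batch -- real images labeled 1, fakes labeled 0
d_x = np.concatenate([real_images, generated])
d_y = np.concatenate([np.ones(batch_size), np.zeros(batch_size)])
# dsc.train_on_batch(d_x, d_y)  # with D's weights trainable

# Step 2: Adversarial batch -- fresh noise, labels *forced to 1*
z = rng.normal(size=(batch_size, latent_dim))
a_y = np.ones(batch_size)
# adv.train_on_batch(z, a_y)  # with D's weights frozen inside A
```

One wrinkle: since the Discriminator defined below ends in a 2-unit softmax, in practice these integer labels would be one-hot encoded (e.g. with <code>keras.utils.to_categorical</code>) before training.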
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sklearn</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">fetch_mldata</span>
<span class="n">mnist</span> <span class="o">=</span> <span class="n">fetch_mldata</span><span class="p">(</span><span class="s">'MNIST original'</span><span class="p">)</span>
<span class="n">mnist</span><span class="p">[</span><span class="s">'data'</span><span class="p">]</span> <span class="o">=</span> <span class="n">mnist</span><span class="p">[</span><span class="s">'data'</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span> <span class="o">/</span> <span class="mf">255.</span> <span class="c"># Normalized!</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">keras</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Activation</span><span class="p">,</span> <span class="n">Conv2D</span><span class="p">,</span> <span class="n">BatchNormalization</span><span class="p">,</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Dropout</span><span class="p">,</span> <span class="n">UpSampling2D</span><span class="p">,</span> <span class="n">Reshape</span><span class="p">,</span> <span class="n">Flatten</span><span class="p">,</span> <span class="n">Conv2DTranspose</span>
<span class="kn">from</span> <span class="nn">keras.layers.advanced_activations</span> <span class="kn">import</span> <span class="n">LeakyReLU</span>
</code></pre></div></div>
<h2 id="network-architecture-highlights">Network Architecture Highlights</h2>
<p>For the network itself, I combined my <a href="https://medium.com/towards-data-science/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0">favorite</a> <a href="https://github.com/osh/KerasGAN/blob/master/MNIST_CNN_GAN.ipynb">implementations</a> and approaches. It’s sometimes a pain to pick hyperparameters for a network, so it’s always good to have a reference for your first implementation. The Discriminator is a standard Convolutional Neural Network. The Generator is in a sense the reverse of a CNN. We will use transposed convolutional layers to take a dense vector from the latent distribution and upscale/transform those vectors to our desired image shape.</p>
<h3 id="conv2dtranspose">Conv2dTranspose</h3>
<p>Rather than using Convolution2D and UpSampling2D as some implementations do, you can achieve the same results by using Conv2DTranspose in Keras. There’s a deep mathematical reason why they are equivalent, but I will leave that for possibly a future post. There’s a <a href="https://github.com/vdumoulin/conv_arithmetic">page</a> that graphically shows the difference between convolutions and transposed convolutions.</p>
<h3 id="batchnormalization">BatchNormalization</h3>
<p>BatchNormalization has become known as another source of regularization, but there is no rigorous proof of this. Like RMSprop, it just appears to work better. BatchNormalization normalizes the output of each layer before passing it through the next layer. This speeds up training and appears to lead to better results. There is no consensus (as of right now) on whether it should be used before or after the activation function, and so I use it before.</p>
<h2 id="generator">Generator</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">depth</span><span class="o">=</span><span class="mi">200</span>
<span class="n">dropout</span><span class="o">=.</span><span class="mi">3</span>
<span class="n">image_size</span> <span class="o">=</span> <span class="mi">28</span>
<span class="n">half_image_size</span> <span class="o">=</span> <span class="mi">7</span>
<span class="n">gan_layers</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">Dense</span><span class="p">(</span><span class="n">depth</span><span class="o">*</span><span class="n">half_image_size</span><span class="o">*</span><span class="n">half_image_size</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">100</span><span class="p">,),</span> <span class="n">kernel_initializer</span><span class="o">=</span><span class="s">'glorot_normal'</span><span class="p">),</span>
<span class="n">BatchNormalization</span><span class="p">(</span><span class="n">momentum</span><span class="o">=</span><span class="mf">0.9</span><span class="p">),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">),</span>
<span class="n">Reshape</span><span class="p">([</span><span class="n">half_image_size</span><span class="p">,</span><span class="n">half_image_size</span><span class="p">,</span><span class="n">depth</span><span class="p">]),</span>
<span class="n">Conv2DTranspose</span><span class="p">(</span><span class="n">depth</span><span class="o">//</span><span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">strides</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span> <span class="n">padding</span><span class="o">=</span><span class="s">'same'</span><span class="p">,</span> <span class="n">kernel_initializer</span><span class="o">=</span><span class="s">'glorot_uniform'</span><span class="p">),</span> <span class="c"># (14,14,100)</span>
<span class="n">BatchNormalization</span><span class="p">(</span><span class="n">momentum</span><span class="o">=</span><span class="mf">0.9</span><span class="p">),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">),</span>
<span class="n">Conv2DTranspose</span><span class="p">(</span><span class="n">depth</span><span class="o">//</span><span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">strides</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span> <span class="n">padding</span><span class="o">=</span><span class="s">'same'</span><span class="p">,</span> <span class="n">kernel_initializer</span><span class="o">=</span><span class="s">'glorot_uniform'</span><span class="p">),</span> <span class="c"># (28,28,50)</span>
<span class="n">BatchNormalization</span><span class="p">(</span><span class="n">momentum</span><span class="o">=</span><span class="mf">0.9</span><span class="p">),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">),</span>
<span class="n">Conv2DTranspose</span><span class="p">(</span><span class="n">depth</span><span class="o">//</span><span class="mi">8</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="s">'same'</span><span class="p">,</span> <span class="n">kernel_initializer</span><span class="o">=</span><span class="s">'glorot_uniform'</span><span class="p">),</span> <span class="c"># (28,28,25) </span>
<span class="n">BatchNormalization</span><span class="p">(</span><span class="n">momentum</span><span class="o">=</span><span class="mf">0.9</span><span class="p">),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'relu'</span><span class="p">),</span>
<span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">),</span>
<span class="n">Conv2DTranspose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="s">'same'</span><span class="p">),</span> <span class="c"># (28,28,1) </span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'sigmoid'</span><span class="p">)</span>
<span class="p">]</span>
<span class="n">gen</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">(</span><span class="n">layers</span><span class="o">=</span><span class="n">gan_layers</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="discriminator">Discriminator</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">leaky_alpha</span> <span class="o">=</span> <span class="o">.</span><span class="mi">2</span>
<span class="n">d_depth</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">discriminator_layers</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">Conv2D</span><span class="p">(</span><span class="n">d_depth</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">strides</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="n">image_size</span><span class="p">,</span> <span class="n">image_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">data_format</span><span class="o">=</span><span class="s">'channels_last'</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="s">'same'</span><span class="p">),</span>
<span class="n">LeakyReLU</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="n">leaky_alpha</span><span class="p">),</span>
<span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">),</span>
<span class="n">Conv2D</span><span class="p">(</span><span class="n">d_depth</span><span class="o">*</span><span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">strides</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span> <span class="n">padding</span><span class="o">=</span><span class="s">'same'</span><span class="p">),</span>
<span class="n">LeakyReLU</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="n">leaky_alpha</span><span class="p">),</span>
<span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">),</span>
<span class="n">Flatten</span><span class="p">(),</span>
<span class="n">Dense</span><span class="p">(</span><span class="mi">256</span><span class="p">),</span>
<span class="n">LeakyReLU</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="n">leaky_alpha</span><span class="p">),</span>
<span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">),</span>
<span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span>
<span class="n">Activation</span><span class="p">(</span><span class="s">'softmax'</span><span class="p">)</span>
<span class="p">]</span>
<span class="n">dsc</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">(</span><span class="n">layers</span><span class="o">=</span><span class="n">discriminator_layers</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">optimizers</span>
<span class="n">opt</span> <span class="o">=</span> <span class="n">optimizers</span><span class="o">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">lr</span><span class="o">=</span><span class="mf">2e-4</span><span class="p">)</span>
<span class="n">dopt</span> <span class="o">=</span> <span class="n">optimizers</span><span class="o">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">lr</span><span class="o">=</span><span class="mf">1e-4</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dsc</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="n">dopt</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="adversarial-network">Adversarial Network</h2>
<p>Remember to turn off training on the Discriminator’s weights!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="n">dsc</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
<span class="n">layer</span><span class="o">.</span><span class="n">trainable</span> <span class="o">=</span> <span class="bp">False</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="n">adv_model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">([</span><span class="n">gen</span><span class="p">,</span><span class="n">dsc</span><span class="p">])</span>
<span class="n">adv_model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="n">opt</span><span class="p">,</span> <span class="n">loss</span><span class="o">=</span><span class="s">'categorical_crossentropy'</span><span class="p">)</span>
<span class="n">adv_model</span><span class="o">.</span><span class="n">summary</span><span class="p">()</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
sequential_1 (Sequential) (None, 28, 28, 1) 1586351
_________________________________________________________________
sequential_2 (Sequential) (None, 2) 9707266
=================================================================
Total params: 11,293,617
Trainable params: 1,566,401
Non-trainable params: 9,727,216
_________________________________________________________________
</code></pre></div></div>
<p>Notice in the summary that the discriminator’s weights are counted as non-trainable.</p>
<h2 id="pretraining">Pretraining</h2>
<p>The jury is still out on whether pretraining the Discriminator is worthwhile. Many recent papers suggest it is not needed; however, that conclusion rests on subjective judgments of which Generators produce “more realistic” images. So, I’ll just follow what the original paper did.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># prior distribution</span>
<span class="n">generate_values</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="mi">100</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Initial Discriminator Training</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">initial_training_size</span> <span class="o">=</span> <span class="mi">10000</span>
<span class="n">real_images</span> <span class="o">=</span> <span class="n">mnist</span><span class="p">[</span><span class="s">'data'</span><span class="p">][</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">mnist</span><span class="p">[</span><span class="s">'data'</span><span class="p">]),</span> <span class="n">size</span><span class="o">=</span><span class="n">initial_training_size</span><span class="p">)]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">initial_training_size</span><span class="p">,</span><span class="n">image_size</span><span class="p">,</span> <span class="n">image_size</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">fake_images</span> <span class="o">=</span> <span class="n">gen</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">generate_values</span><span class="p">(</span><span class="n">initial_training_size</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">f</span><span class="p">,</span> <span class="n">axs</span> <span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">real_images</span><span class="p">[</span><span class="mi">1</span><span class="p">,:,:,</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">fake_images</span><span class="p">[</span><span class="mi">1</span><span class="p">,:,:,</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Real Image'</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Fake Image'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="../images/ganproof/output_19_1.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Compile initial training set</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">real_images</span><span class="p">,</span> <span class="n">fake_images</span><span class="p">])</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">2</span><span class="o">*</span><span class="n">initial_training_size</span><span class="p">,</span><span class="mi">2</span><span class="p">))</span>
<span class="n">y</span><span class="p">[:</span><span class="n">initial_training_size</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">y</span><span class="p">[</span><span class="n">initial_training_size</span><span class="p">:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">dsc</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="the-combined-training">The Combined Training</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">steps</span> <span class="o">=</span> <span class="mi">10000</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span><span class="mi">2</span><span class="p">))</span>
<span class="n">y</span><span class="p">[:</span><span class="n">mini_batch_size</span><span class="o">//</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span><span class="o">=</span><span class="mi">1</span>
<span class="n">y</span><span class="p">[</span><span class="n">mini_batch_size</span><span class="o">//</span><span class="mi">2</span><span class="p">:,</span> <span class="mi">0</span><span class="p">]</span><span class="o">=</span><span class="mi">1</span>
<span class="n">d_loss</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">steps</span><span class="p">)</span>
<span class="n">a_loss</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">steps</span><span class="p">)</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">steps</span><span class="p">):</span>
<span class="k">if</span> <span class="n">k</span> <span class="o">%</span> <span class="mi">50</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">k</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span>
<span class="n">mnist</span><span class="p">[</span><span class="s">'data'</span><span class="p">][</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">mnist</span><span class="p">[</span><span class="s">'data'</span><span class="p">]),</span> <span class="n">size</span><span class="o">=</span><span class="n">mini_batch_size</span><span class="o">//</span><span class="mi">2</span><span class="p">)]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="o">//</span><span class="mi">2</span><span class="p">,</span><span class="n">image_size</span><span class="p">,</span> <span class="n">image_size</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span>
<span class="n">gen</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">generate_values</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="o">//</span><span class="mi">2</span><span class="p">))</span>
<span class="p">])</span>
<span class="n">d_loss</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="n">dsc</span><span class="o">.</span><span class="n">train_on_batch</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
<span class="c"># Train GAN</span>
<span class="n">X_adv</span> <span class="o">=</span> <span class="n">generate_values</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">)</span>
<span class="n">a_loss</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="n">adv_model</span><span class="o">.</span><span class="n">train_on_batch</span><span class="p">(</span><span class="n">X_adv</span><span class="p">,</span> <span class="n">y_adv</span><span class="p">)</span>
</code></pre></div></div>
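<p>Note that the loop above uses <code>y_adv</code>, which is not defined in the snippet. A definition consistent with the one-hot label layout used for the discriminator (a sketch on my part, with an assumed batch size) would label every generated sample as “real”:</p>

```python
import numpy as np

# y_adv labels every generated sample as "real" (column 1), matching the
# one-hot label layout used for the discriminator, so the adversarial loss
# rewards the Generator for fooling the Discriminator.
mini_batch_size = 32  # assumed value; use the same size as the training loop
y_adv = np.zeros(shape=(mini_batch_size, 2))
y_adv[:, 1] = 1
```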
<h1 id="results">Results</h1>
<p>Full Disclosure: I may have run the code a few times to get a good-looking image…</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">f</span><span class="p">,</span> <span class="n">axs</span> <span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">real_images</span><span class="p">[</span><span class="mi">1</span><span class="p">,:,:,</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Real Image'</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Fake Image'</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">gen</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">generate_values</span><span class="p">(</span><span class="mi">1</span><span class="p">))[</span><span class="mi">0</span><span class="p">,:,:,</span><span class="mi">0</span><span class="p">])</span>
</code></pre></div></div>
<p><img src="../images/ganproof/output_24_1.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">f</span><span class="p">,</span> <span class="n">axs</span> <span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">8</span><span class="p">,</span> <span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'Generated Images'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">8</span><span class="p">):</span>
<span class="n">axs</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="n">k</span><span class="p">]</span><span class="o">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">gen</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">generate_values</span><span class="p">(</span><span class="mi">1</span><span class="p">))[</span><span class="mi">0</span><span class="p">,:,:,</span><span class="mi">0</span><span class="p">])</span>
</code></pre></div></div>
<p><img src="../images/ganproof/output_25_0.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">d_loss</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">a_loss</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'b'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Steps'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Value'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Loss'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'Discriminator'</span><span class="p">,</span><span class="s">'Adversarial'</span><span class="p">])</span>
</code></pre></div></div>
<p><img src="../images/ganproof/output_26_1.png" alt="png" /></p>The preferred illustration of Generative Adversarial Networks (which seems as entrenched at this point as Bob, Alice, and Eve are for cryptography) is the scenario of counterfeiting money. The Generator counterfeits money, and the Discriminator is supposed to discriminate between real and fake dollars. As the Generator gets better, the Discriminator has to improve as well. This implies a dual training scheme where each model tries to one-up the other (i.e., through additional learning). In this post, we will explore the proof from the original paper and demonstrate a typical implementation.A Tour of Gotchas When Implementing Deep Q Networks with Keras and OpenAi Gym2017-07-26T00:00:00+00:002017-07-26T00:00:00+00:00http://srome.github.io//A-Tour-Of-Gotchas-When-Implementing-Deep-Q-Networks-With-Keras-And-OpenAi-Gym<p>Starting with the Google DeepMind paper, there has been a lot of new attention around training models to play video games. You, the data scientist/engineer/enthusiast, may not work in reinforcement learning but are probably interested in teaching neural networks to play video games. Who isn’t? With that in mind, here’s a list of nuances that should jumpstart your own implementation.</p>
<p>The lessons below were gleaned from working on my <a href="http://www.github.com/srome/ExPyDQN">own implementation</a> of the <a href="http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html">Nature</a> paper. The lessons are aimed at people who work with data but may run into issues with some of the non-standard approaches used in the reinforcement learning community when compared with typical supervised learning use cases. I will address technical details of both the neural networks’ parameters and the libraries involved. This post assumes limited knowledge of the Nature paper, particularly with regard to the basic notation used around Q-learning. My implementation was written to be instructive by relying mostly on Keras and Gym. I avoided theano/tensorflow-specific tricks (e.g., theano’s <a href="https://github.com/Theano/theano/blob/52903f8267cff316fc669e207eac4e2ecae952a6/theano/gradient.py#L2002-L2021">disconnected gradient</a>) to keep the focus on the main components.</p>
<h1 id="learning-rate-and-loss-functions">Learning Rate and Loss Functions</h1>
<p>If you have looked at <a href="https://github.com/spragunr/deep_q_rl/blob/master/deep_q_rl/q_network.py">some</a> of the <a href="https://github.com/matthiasplappert/keras-rl/blob/master/rl/agents/dqn.py">implementations</a>, you’ll see there’s usually an option between summing the loss function over a minibatch or taking a mean. Most loss functions you hear about in machine learning start with the word “mean” or at least take a mean over a mini batch. Let’s talk about what a summed loss really means in the context of your learning rate.</p>
<p>A typical gradient descent update looks like this:</p>
<p>\begin{equation}
\theta_{t+1} \leftarrow \theta_{t} - \lambda \nabla(\ell_{\theta_t})
\end{equation}</p>
<p>where <script type="math/tex">\theta_{t}</script> is your learner’s weights at time <script type="math/tex">t</script>, <script type="math/tex">\lambda</script> is the learning rate, and <script type="math/tex">\ell_{\theta_t}</script> is your loss function, which depends on <script type="math/tex">\theta_t</script>. I will suppress the <script type="math/tex">\theta_t</script> dependence of the loss function moving forward. Let us define <script type="math/tex">\ell</script> as the loss function summed over the mini batch and <script type="math/tex">\hat{\ell}</script> as the loss function taking the mean over the mini batch. For a fixed mini batch of size <script type="math/tex">m</script>, notice:
\begin{equation}
\hat{\ell} = \frac{1}{m} \ell
\end{equation}
and so mathematically we have
\begin{equation}
\nabla(\hat{\ell}) = \frac{1}{m} \nabla(\ell).
\end{equation}</p>
<p>What does this tell us? If you train two models each with learning rate <script type="math/tex">\lambda</script>, these two variants will have very different behavior, as the magnitude of the updates will be off by a factor of <script type="math/tex">m</script>! In theory, with a small enough learning rate, you can take into account the size of the mini batch and recover the behavior of the mean version of the loss from the summed version. However, this would mean other coefficients in the loss, such as those for regularization, would need adjustment as well. Sticking with the mean version leads to standard coefficients which work well in many situations. Therefore, it’s atypical in other data science applications to use a summed loss function rather than a mean, but the option is frequently present in reinforcement learning. So, you should be aware that you may need to adjust the learning rate! With that said, my implementation uses the mean version and a learning rate of .00025.</p>
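<p>A quick numerical sketch (plain numpy, independent of any deep learning library) confirms the factor of <script type="math/tex">m</script> between the two gradients:</p>

```python
import numpy as np

# For a linear model with squared error, the gradient of the summed loss
# over a mini batch is exactly m times the gradient of the mean loss.
m = 8
rng = np.random.RandomState(0)
X, y, w = rng.randn(m, 3), rng.randn(m), rng.randn(3)

err = X.dot(w) - y
grad_sum = 2.0 * X.T.dot(err)         # gradient of sum((Xw - y)^2) w.r.t. w
grad_mean = (2.0 / m) * X.T.dot(err)  # gradient of mean((Xw - y)^2) w.r.t. w

assert np.allclose(grad_sum, m * grad_mean)
```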
<h1 id="stochasticity-frame-skipping-and-gym-defaults">Stochasticity, Frame Skipping, and Gym Defaults</h1>
<p>Many (good) blog posts on reinforcement learning show a model being trained using “(Game Name)-v0” and so you might decide to use the ROM as you’ve seen it done before. So far so good. Then you read the various papers and see a technique called “frame skipping”, where you stack <script type="math/tex">n</script> output screens from the emulator into a single <script type="math/tex">n</script>-by-length-by-width image and pass that to the model, and so you implement that thinking everything is going according to plan. It’s not. Depending on your version of Gym, you can run into trouble.</p>
<p>In older versions of Gym, the Atari environment <em>randomly</em> <a href="https://github.com/openai/gym/blob/bde0de609d3645e76728b3b8fc2f3bf210187b27/gym/envs/atari/atari_env.py#L69-L71">repeated your action for 2-4 steps</a> and returned only the resulting frame. In terms of your frame-skip implementation, this is what happens in the code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">frame_skip</span><span class="p">):</span>
<span class="n">obs</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">is_terminal</span><span class="p">,</span> <span class="n">info</span> <span class="o">=</span> <span class="n">env</span><span class="o">.</span><span class="n">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span> <span class="c"># 2-4 step occur in the emulator with the given action</span>
</code></pre></div></div>
<p>If you implement frame skipping with n=4, it’s possible that your learner is seeing every 8-16 (or more!) frames rather than every 4. You can imagine the impact on performance. Thankfully, this has since been made <a href="https://github.com/openai/gym/blob/master/gym/envs/atari/atari_env.py#L75-L80">optional</a> via new ROMs, which we will mention in a second. However, there is another setting to be aware of, and that’s repeat_action_probability. For “(Game Name)-v0” ROMs, this is on by default. This is the probability that the game will ignore a new action and repeat the previous action at each time step. To remove both frameskip and the repeat action probability, use the “(Game Name)NoFrameskip-v4” ROMs. A full understanding of these settings can be found <a href="https://github.com/openai/gym/blob/5cb12296274020db9bb6378ce54276b31e7002da/gym/envs/__init__.py#L298-L376">here</a>.</p>
<p>I would be remiss not to point out that Gym is not doing this just to ruin your DQN. There is a legitimate reason for doing this, but this setting can lead to <em>unending</em> frustration when your neural network is ramming itself into the boundary of the stage rather than hitting the ping pong pixel-square. The reason is to introduce stochasticity into the environment. Otherwise, the game will be deterministic and your network will simply memorize a series of steps like a dance. When using a NoFrameskip ROM, you have to introduce your own stochasticity to avoid said dance. The Nature paper (and many libraries) do this via the “null op max” setting. At the beginning of each episode (i.e., a round for a game of Pong), the agent will perform a series of <script type="math/tex">k</script> consecutive null operations (action=0 for the Atari emulator in Gym) where <script type="math/tex">k</script> is an integer sampled uniformly from <script type="math/tex">[0,\text{null op max}]</script>. It can be implemented by the following pseudo-code at the start of an episode:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">obs</span> <span class="o">=</span> <span class="n">env</span><span class="o">.</span><span class="n">reset</span><span class="p">()</span>
<span class="c"># Perform a null operation to make game stochastic</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">null_op_max</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)):</span> <span class="c"># upper bound is exclusive, so add 1</span>
<span class="n">obs</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">is_terminal</span><span class="p">,</span> <span class="n">info</span> <span class="o">=</span> <span class="n">env</span><span class="o">.</span><span class="n">step</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="c"># action 0 is the null op</span>
</code></pre></div></div>
<h1 id="gradient-clipping-error-clipping-reward-clipping">Gradient Clipping, Error Clipping, Reward Clipping</h1>
<p>There are several different types of clipping going on in the Nature paper, and each can easily be confused and implemented incorrectly. In fact, if you thought error clipping and gradient clipping were different, you’re already confused!</p>
<h2 id="what-is-gradient-clipping">What is Gradient Clipping?</h2>
<p>The Nature paper states it is helpful to “clip the error term”. The community seems to have rejected “error clipping” for the term “gradient clipping”. Without knowing the background, the term can be ambiguous. In both instances, there is actually no clipping of the loss function, the error, or the gradient. Really, they are choosing a loss function whose gradient does not grow with the size of the error past a certain region, thus limiting the size of the gradient update for large errors. In particular, if the absolute error is greater than 1, they switch the loss function to an absolute value-like loss. Why? Let’s look at the derivative!</p>
<p>This is the derivative of a mean squared error-like loss:</p>
<p>\begin{equation}
\frac{d}{dx} \frac{1}{2}(x-y)^2 = x-y
\end{equation}</p>
<p>Now compare that to the derivative of an absolute value-like function when <script type="math/tex">x-y>0</script>:</p>
<p>\begin{equation}
\frac{d}{dx} (x-y)= 1
\end{equation}</p>
<p>If we think of the error as <script type="math/tex">x-y</script>, we can see one gradient update would contain the factor <script type="math/tex">x-y</script> while the other does not. There isn’t really a good, catchy phrase to describe the above mathematical trick that is more representative than “gradient clipping”.</p>
<p>The standard way to accomplish this is to use the <a href="https://en.wikipedia.org/wiki/Huber_loss">Huber loss</a> function. The definition of this function is as follows:</p>
<p>\begin{equation}
f(x) = \frac{1}{2}x^2 \text{ if } |x| \leq \delta \text{, } \delta(|x| - \frac{1}{2}\delta) \text{ otherwise.}
\end{equation}</p>
<p>There is a common trick to implement this so that symbolic mathematics libraries like theano and tensorflow can take the derivative easier without the use of a switch statement. That trick is outlined/coded below and is commonly employed in most implementations.</p>
<p>The function you actually code is as follows: Let <script type="math/tex">q=\min(|x|,\delta)</script>. Then,
\begin{equation}
g(x) = \frac{q^2}{2} + \delta(|x| - q ).
\end{equation}
When <script type="math/tex">|x|\leq \delta</script>, plugging into the formula shows that we recover <script type="math/tex">g(x) = \frac{1}{2}x^2</script>. Otherwise when <script type="math/tex">|x|>\delta</script>,
\begin{equation}
g(x) = \frac{\delta^2}{2} + \delta(|x| - \delta) = \delta |x| - \frac{1}{2}\delta^2 = \delta(|x|-\frac{1}{2}\delta).
\end{equation}</p>
<p>So, <script type="math/tex">g=f</script>. This is coded below and plotted to show the function is continuous with a continuous derivative. The Nature paper uses <script type="math/tex">\delta=1</script>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">def</span> <span class="nf">huber_loss</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">clip_delta</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="n">error</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">quadratic_part</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">minimum</span><span class="p">(</span><span class="n">error</span><span class="p">,</span> <span class="n">clip_delta</span><span class="p">)</span>
    <span class="k">return</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="n">quadratic_part</span><span class="p">)</span> <span class="o">+</span> <span class="n">clip_delta</span> <span class="o">*</span> <span class="p">(</span><span class="n">error</span> <span class="o">-</span> <span class="n">quadratic_part</span><span class="p">)</span>
<span class="n">f</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">vectorize</span><span class="p">(</span><span class="n">huber_loss</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">100</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="../images/dqn_impl/output_8_1.png" alt="png" /></p>
<p>As you can see, the slope of the graph (the derivative) is “clipped” to never be bigger in magnitude than 1. An interesting point is that the second derivative is not continuous:</p>
<p>\begin{equation}
f^{\prime\prime}(x) = 1 \text{ if } |x| \leq \delta \text{, } 0 \text{ otherwise.}
\end{equation}</p>
<p>This could cause problems when using second-order methods for gradient descent, which is why some suggest the pseudo-Huber loss function, a smooth approximation to the Huber loss. However, the Huber loss is sufficient for our goals.</p>
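<p>For the curious, the pseudo-Huber loss can be written in one line. A minimal sketch is below (the function name and <code>delta</code> keyword are my own); near zero it behaves like <script type="math/tex">\frac{1}{2}x^2</script>, and for large <script type="math/tex">|x|</script> it grows roughly like <script type="math/tex">\delta |x|</script>, just like the Huber loss, but with a continuous second derivative everywhere:</p>

```python
import numpy as np

def pseudo_huber_loss(x, delta=1.0):
    # Smooth approximation to the Huber loss: quadratic near zero,
    # asymptotically linear, with a continuous second derivative
    return delta**2 * (np.sqrt(1.0 + (x / delta)**2) - 1.0)
```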
<h2 id="what-reward-do-i-clip">What Reward Do I Clip?</h2>
<p>This is actually a much subtler question when you introduce frame skipping to the equation. Do you take the last reward returned from the emulator? But what if there was a reward during the skipped frames? When you’re so far down in the weeds of neural network implementation, it’s surprising when the answer comes from the basic <a href="https://en.wikipedia.org/wiki/Q-learning">Q learning</a> framework. In Q learning, one uses the total reward since the last action, and this is what they did in their paper as well. This means even on frames you skipped you need to save the reward observed and aggregate them as the “reward” for a given state <script type="math/tex">(s_t,a_t,r_t,s_{t+1})</script>. This cumulative reward is what you clip. In my implementation, you can see this during the step function on the TrainingEnvironment class. The pseudo code is below:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">total_reward</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">frame_skip</span><span class="p">):</span>
    <span class="n">obs</span><span class="p">,</span> <span class="n">reward</span><span class="p">,</span> <span class="n">is_terminal</span><span class="p">,</span> <span class="n">info</span> <span class="o">=</span> <span class="n">env</span><span class="o">.</span><span class="n">step</span><span class="p">(</span><span class="n">action</span><span class="p">)</span> <span class="c"># 1 step in the emulator</span>
    <span class="n">total_reward</span> <span class="o">+=</span> <span class="n">reward</span>
</code></pre></div></div>
<p>This total reward is what is stored in the experience replay memory.</p>
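<p>The clipping itself is then a one-liner. Following the Nature paper, positive rewards are capped at 1 and negative rewards at -1 before storage (a minimal sketch; the function name is my own):</p>

```python
import numpy as np

def clip_reward(total_reward):
    # Clip the cumulative reward over the skipped frames to [-1, 1],
    # as in the Nature paper, before storing it in replay memory
    return float(np.clip(total_reward, -1.0, 1.0))
```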
<h1 id="memory-consideration">Memory Considerations</h1>
<p>The “experience replay memory” is a store where training examples <script type="math/tex">(s_t,a_t,r_t,s_{t+1})</script> are sampled from to break the time correlation of the data points, stopping the network’s learning from diverging. The original paper mentions a replay memory of a million examples <script type="math/tex">(s_t,a_t,r_t,s_{t+1})</script>. Both the <script type="math/tex">s_t</script>’s and the <script type="math/tex">s_{t+1}</script>’s are <script type="math/tex">n</script>-by-length-by-width images and so you can imagine the memory requirements can become very large. This is one of the cases where knowing more about programming and types while working in Python can be useful. By default, Gym returns images as numpy arrays with the datatype int8. If you do any processing of the images, it’s likely that your images will now be of type float32. So, it’s important when you store your images to make sure they’re in int8 for space considerations, and any transformations that are necessary for the neural network (like scaling) are done before training rather than before storing the states in the replay memory.</p>
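<p>To see why the datatype matters, here is a back-of-the-envelope calculation for a million states of 4 stacked 84-by-84 screens, each state stored twice (once as <script type="math/tex">s_t</script> and once as <script type="math/tex">s_{t+1}</script>):</p>

```python
import numpy as np

# Rough replay memory footprint for 1M transitions of 4 stacked 84x84 screens,
# with each state stored twice (current state and future state)
size, phi_length, h, w = 1_000_000, 4, 84, 84
n_pixels = 2 * size * phi_length * h * w

bytes_int8 = n_pixels * np.dtype(np.int8).itemsize        # ~52.6 GiB
bytes_float32 = n_pixels * np.dtype(np.float32).itemsize  # 4x larger, ~210 GiB
```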
<p>It is also helpful to pre-allocate this replay memory. In my implementation, I do this as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">image_size</span><span class="p">,</span> <span class="n">phi_length</span><span class="p">,</span> <span class="n">minibatch_size</span><span class="p">):</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">_memory_state</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">phi_length</span><span class="p">,</span> <span class="n">image_size</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">image_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int8</span><span class="p">)</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">_memory_future_state</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">phi_length</span><span class="p">,</span> <span class="n">image_size</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">image_size</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int8</span><span class="p">)</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">_rewards</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">)</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">_is_terminal</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="nb">bool</span><span class="p">)</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">_actions</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int8</span><span class="p">)</span>
</code></pre></div></div>
<p>Of course, phi_length=<script type="math/tex">n</script> from our previous discussion, the number of screens from the emulator stacked together to form a state.</p>
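<p>For concreteness, one way to assemble a state from the most recent emulator screens is a fixed-length deque (a sketch; the variable names are my own, and the zero arrays stand in for preprocessed screens):</p>

```python
from collections import deque
import numpy as np

phi_length, height, width = 4, 84, 84
recent_screens = deque(maxlen=phi_length)  # keeps only the last n screens

# Placeholder screens; in practice these come from the preprocessed emulator output
for _ in range(phi_length):
    recent_screens.append(np.zeros((height, width), dtype=np.int8))

state = np.stack(recent_screens)  # shape (4, 84, 84), int8, ready to store
```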
<h1 id="debugging-in-circles">Debugging In Circles</h1>
<p>With so many moving parts and parameters, there’s a lot that can go wrong in an implementation. My recommendation is to set your starting parameters as the same as the original paper to nail down one source of error. Most of the changes from the original NIPS paper to the Nature paper were to standardize learning parameters and performance across different Atari games. There are many quirks that arise from game to game, and most of the techniques in the Nature paper guard against this problem. For example, the consecutive max is used to deal with screens flickering causing objects to disappear under certain frame skip settings. So once you have set the parameters, here are a few things to check when your network is not learning well (or at all):</p>
<ul>
<li>Check whether your Q-value outputs jump in size between batch updates; this means your gradient updates are large. Take a look at your learning rate and investigate your gradient to find the problem.</li>
<li>Look at the states you are sending to the neural network. Does the frame skipping/consecutive max seem to be correct? Are your future states different from your current states? Do you see a logical progression of images? If not, you may have some memory reference issues, which can happen when you try to conserve memory in Python.</li>
<li>Verify that your fixed target network’s weights are actually fixed. In Python, it’s easy to make the fixed network point to the same location in memory, and then your fixed target network isn’t actually fixed!</li>
<li>If you find your implementation is not working on a certain game, test your code on a simpler game like Pong. Your implementation may be fine, but you don’t have enough of the cutting-edge advances to learn harder games! Explore Double DQN, Dueling Q Networks, and prioritized experience replay.</li>
</ul>
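<p>The third point above is worth demonstrating. In numpy terms, assigning a reference instead of copying the weights silently couples the two networks (a minimal sketch of the pitfall, not tied to any particular framework):</p>

```python
import numpy as np

online_weights = [np.zeros((4, 2)), np.zeros(2)]

# WRONG: this binds a second name to the same arrays, not a snapshot
target_weights = online_weights
online_weights[0][:] = 1.0
assert np.all(target_weights[0] == 1.0)  # the "fixed" target moved too!

# Correct: copy the arrays so the target network actually stays frozen
target_weights = [w.copy() for w in online_weights]
online_weights[0][:] = 2.0
assert np.all(target_weights[0] == 1.0)  # target kept the old weights
```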
<h1 id="conclusion">Conclusion</h1>
<p>Implementing the DeepMind paper is a rewarding first step into modern advancements in reinforcement learning. If you work mostly in a more traditional supervised learning setting, many of the common ideas and tricks may seem foreign at first. Deep Q Networks are just neural networks after all, and many of the techniques to stabilize learning can be applied to traditional uses as well. The most interesting point to me is the topic of regularization. If you noticed, we did not use Dropout, <script type="math/tex">L_2</script>, or <script type="math/tex">L_1</script> regularization, all the traditional approaches to stabilize training for neural networks. Instead, the same goal of regularization is accomplished with how the mini batches are selected (experience replay) and how future rewards are defined (using a fixed target network). That is one of the most exciting and novel theoretical approaches from the original Nature paper which has continued to be iterated upon and refined in subsequent papers. When you want to further improve your implementation, you should investigate the next iteration of the techniques: <a href="https://arxiv.org/abs/1511.05952">prioritized experience replay</a>, <a href="https://arxiv.org/abs/1509.06461">Double DQN</a>, and <a href="https://arxiv.org/abs/1511.06581">Dueling Q Networks</a>. The (mostly) current standard comes from a modification to allow asynchronous learning called <a href="https://arxiv.org/abs/1602.01783">A3C</a>.</p>Starting with the Google DeepMind paper, there has been a lot of new attention around training models to play video games. You, the data scientist/engineer/enthusiast, may not work in reinforcement learning but probably are interested in teaching neural networks to play video games. Who isn’t? 
With that in mind, here’s a list of nuances that should jumpstart your own implementation.Visualizing the Learning of a Neural Network Geometrically2017-06-18T00:00:00+00:002017-06-18T00:00:00+00:00http://srome.github.io//Visualizing-the-Learning-of-a-Neural-Network-Geometrically<p>How neural networks learn classification tasks is <a href="http://catb.org/jargon/html/D/deep-magic.html">“deep magic”</a> to many people, but in this post, we will demystify the training component of neural networks with geometric visualizations. After motivating the approach mathematically, we will dive into animating how a neural network learns using python and Keras.</p>
<p>This post was inspired by an <a href="http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/">article</a> by one of Google’s machine learning researchers. The article is very informative, but a full understanding requires a significant background in mathematics. My goal in this blog post is to explore the same topic in an approach more directed towards beginner students of machine learning.</p>
<h1 id="preliminaries">Preliminaries</h1>
<p>In this section, we’ll go over some preliminaries to piece together a geometric picture of the decision making process of a neural network. At a high level, the neural network “learns” a representation of the data where it can separate the classes with a decision boundary. You can visualize this boundary superimposed on the input space, where in most neural network applications you will see a highly nonlinear boundary.</p>
<p><img src="../images/nnlearn/db.png" alt="png" /></p>
<p>Notice, the boundary (mostly) separates the “red” class and the “blue” class. You can also visualize the boundary on the output of the hidden layers. In the case of binary classification, the decision boundary visualized in the second to last layer will be a hyperplane in a typical neural network.</p>
<p>Before going further, I want to cement what I mean by “visualized in the n-th layer”, as this is not a precise statement. What I mean is that I want to plot the image of the training data under the n-th layer. For binary classification, typically you would map the training data X through the neural network, extract the output of the activation function of the n-th layer, and plot that output, coloring the points based on their label. Ideally, the n-th layer has 2 or 3 hidden units; otherwise, the plot will not completely characterize the “output space”.</p>
<h3 id="topology">Topology</h3>
<p>The motivation to do such a thing comes from a branch of mathematics called Topology. What is Topology, you may be asking… Let me go through an example. First imagine a ball of dough. What can you make from the dough without “tearing” it or “gluing” the edges together? Well, you could make a pizza or a baguette, but you couldn’t make a donut. This is because to make a donut you either poke a hole in the middle of the dough, or you roll it out and “glue” the ends together. This notion is captured mathematically by a special kind of function called a “homeomorphism”, a central concept in Topology. Two “shapes” that can be “continuously deformed” into one another without tearing or gluing are “topologically equivalent”. There is some kind of geometric structure (i.e. topological structure) which is maintained during the process. An oversimplified (and not always correct) heuristic to get your bearings is that two “shapes” are topologically equivalent if they have the same number of holes. The obligatory joke is that a topologist does not know the difference between a donut and a coffee cup, because the two objects can be continuously deformed to one another (from the <a href="https://en.wikipedia.org/wiki/General_topology">wiki</a>):</p>
<p><img src="../images/nnlearn/Mug_and_Torus_morph.gif" alt="Mmmm Donut" /></p>
<p>Continuous functions induce topologies (i.e. geometric structure of some kind) on their output. With a typical activation function, a neural network is a continuous map and so it “induces” structure on its output. In lay terms, that means that the network won’t “tear” or “glue together” the input space. Each mini-batch update will alter the neural network slightly and the visualization of these changes over time will create a “continuous deformation” from the starting induced topology on the output to the final output. That means we will see the neural network bending and conforming the input space to try to separate the classes. Said another way, your input space is akin to the dough mentioned above, only with blue and red dots on it representing the classes. If the neural network can deform the dough continuously so that the classes are ultimately separable by a line (on the dough), then training was successful. So that’s how notions from Topology relate to training a neural network: the training process can be seen as generating a useful “continuous deformation” of the input space (under the image of the n-th layer).</p>
<h3 id="geometric-tranformations-of-hidden-layers">Geometric Transformations of Hidden Layers</h3>
<p>Hidden layers are made up of a matrix, a bias term, and an activation function. Each of these components can be understood by how they geometrically affect their input.</p>
<p>Matrices are linear maps. Matrices “act on” an object in only two ways geometrically: stretching and rotating. For a matrix of full rank (an invertible square matrix), the matrix performs these geometric actions in the dimensions it “lives” in. If the matrix is not of full rank, it will collapse one of the dimensions as well. For example, a two dimensional matrix can only rotate and stretch planar objects, like a piece of paper on a table. However, a three dimensional matrix could rotate a piece of paper in any direction in 3d-space. A 2-by-3 matrix will collapse a three dimensional object like a sphere into a two dimensional circle (while rotating and stretching).</p>
<p>The bias term represents a translation in space, meaning it moves the object.</p>
<p>The activation function is what makes a neural network nonlinear. This function is typically nonlinear and allows a hidden layer to “bend” its input.</p>
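<p>Putting these three components together, a single hidden layer is just rotate/stretch, translate, then bend. A small numpy illustration (the weights and input here are arbitrary):</p>

```python
import numpy as np

W = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # a 90-degree rotation of the plane
b = np.array([1.0, 0.0])      # a translation
x = np.array([2.0, 3.0])      # an input point

linear = W @ x + b            # rotate/stretch, then translate: [-2., 2.]
out = np.tanh(linear)         # the nonlinear "bend" from the activation
```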
<p>A hidden layer whose output is a dimension lower than its input allows the network to give the appearance of manipulations like “folding” the space over onto itself (more on that below!).</p>
<h3 id="making-decisions-geometrically">Making Decisions Geometrically</h3>
<p>A neural network makes a binary classification decision in a similar way to how a support vector machine does. It takes the input space and bends and twists it around until the classes are separated, and then it draws a line. Everything on one side of the line is classified as a 1 and 0 otherwise. This line is called the decision boundary.</p>
<p>The decision boundary is the visual representation of the threshold where the model will predict 0 or 1. To make this visualization possible, we set up our neural network to have a layer with 2 hidden units before a layer of 1 hidden unit and a sigmoid activation function. Then, we plot the points in 2-D (corresponding to the 2-hidden unit layer) which map to .5 in the final output.</p>
<p>Let’s write the equation. Define <script type="math/tex">x,y\in\mathbb{R}</script> to be the first and second components of the output of the 2-hidden unit layer, and let <script type="math/tex">a,b,c\in\mathbb{R}</script> be the weights and bias of the last hidden layer. Then, the set of points in the <script type="math/tex">(x,y)</script>-plane for the decision line is:
<script type="math/tex">\{ (x,y)\ | \ \text{sigmoid}(ax+by+c)=.5 \}.</script></p>
<p>The strategy is to choose values of <script type="math/tex">x</script> and then solve the equation for <script type="math/tex">y</script> with <script type="math/tex">x</script> fixed. By plotting this set, we can visualize the decision boundary, and therefore visualize the training process. Indeed, if the line is able to completely separate the input points into classes, then we have a perfect classifier.</p>
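<p>Since the sigmoid equals .5 exactly when its argument is 0, solving for <script type="math/tex">y</script> reduces to the line <script type="math/tex">y = -(ax+c)/b</script>, assuming <script type="math/tex">b \neq 0</script>. A quick sketch with stand-in weights:</p>

```python
import numpy as np

a, b, c = 1.5, -2.0, 0.25      # stand-in weights and bias for illustration
xs = np.linspace(-1, 1, 50)    # chosen x values
ys = -(a * xs + c) / b         # boundary points: a*x + b*y + c = 0

# Every boundary point maps to exactly 0.5 under the sigmoid
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
assert np.allclose(sigmoid(a * xs + b * ys + c), 0.5)
```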
<h1 id="classification-problem">Classification Problem</h1>
<p>In order to visualize the output, we define the following simple classification problem. We have two circles in a two dimensional plane and we color their points based on their radius. We will train a classifier to identify if the point should be “blue” or “red” based on the x-y position of the point.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">keras</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">n</span><span class="o">=</span><span class="mi">20000</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="n">n</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="o">*</span><span class="n">t</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="o">.</span><span class="mo">005</span><span class="p">,</span><span class="n">n</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="o">*</span><span class="n">t</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="o">.</span><span class="mo">005</span><span class="p">,</span><span class="n">n</span><span class="p">)</span>
<span class="n">label</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="n">tdf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'label'</span> <span class="p">:</span> <span class="n">label</span><span class="p">,</span> <span class="s">'x'</span> <span class="p">:</span> <span class="n">x</span><span class="p">,</span> <span class="s">'y'</span> <span class="p">:</span> <span class="n">y</span><span class="p">})</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="n">n</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="o">.</span><span class="mi">5</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="o">*</span><span class="n">t</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="o">.</span><span class="mo">005</span><span class="p">,</span><span class="n">n</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="o">.</span><span class="mi">5</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="o">*</span><span class="n">t</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="o">.</span><span class="mo">005</span><span class="p">,</span><span class="n">n</span><span class="p">)</span>
<span class="n">label</span> <span class="o">=</span> <span class="mf">0.</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">tdf</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'label'</span> <span class="p">:</span> <span class="n">label</span><span class="p">,</span> <span class="s">'x'</span> <span class="p">:</span> <span class="n">x</span><span class="p">,</span> <span class="s">'y'</span> <span class="p">:</span> <span class="n">y</span><span class="p">})])</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'x'</span><span class="p">],</span><span class="n">df</span><span class="p">[</span><span class="s">'y'</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'label'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="s">'b'</span> <span class="k">if</span> <span class="n">x</span> <span class="o">></span> <span class="o">.</span><span class="mi">5</span> <span class="k">else</span> <span class="s">'r'</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="../images/nnlearn/output_3_1.png" alt="png" /></p>
<h1 id="animation-code">Animation Code</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">get_decision_boundary</span><span class="p">(</span><span class="n">model</span><span class="p">):</span>
<span class="s">""" Function to return the x-y coordinates of the decision boundary given a model.
This assumes the second to last hidden layer is a 2 hidden unit layer with a bias term
and sigmoid activation on the last layer."""</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">get_weights</span><span class="p">()[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">get_weights</span><span class="p">()[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">get_weights</span><span class="p">()[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">decision_x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">100</span><span class="p">)</span>
<span class="n">decision_y</span> <span class="o">=</span> <span class="p">(</span><span class="n">scipy</span><span class="o">.</span><span class="n">special</span><span class="o">.</span><span class="n">logit</span><span class="p">(</span><span class="o">.</span><span class="mi">5</span><span class="p">)</span><span class="o">-</span><span class="n">c</span><span class="o">-</span><span class="n">a</span><span class="o">*</span><span class="n">decision_x</span><span class="p">)</span><span class="o">/</span><span class="n">b</span>
<span class="k">return</span> <span class="n">decision_x</span><span class="p">,</span> <span class="n">decision_y</span>
</code></pre></div></div>
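<p>The boundary formula above comes from solving sigmoid(a*x + b*y + c) = 0.5 for y, and since logit(0.5) = 0 the line is simply y = -(c + a*x)/b. As a quick sanity check (using made-up values for a, b and c in place of the trained final-layer weights), every point on that line should map to a predicted probability of exactly 0.5:</p>

```python
import numpy as np
from scipy.special import expit, logit  # expit is the logistic sigmoid

# Hypothetical final-layer parameters; in the post these come from
# model.layers[-1].get_weights().
a, b, c = 1.5, -2.0, 0.3

# Points (x, y) on the decision boundary satisfy sigmoid(a*x + b*y + c) = 0.5,
# i.e. a*x + b*y + c = logit(0.5) = 0.
decision_x = np.linspace(-1, 1, 100)
decision_y = (logit(0.5) - c - a * decision_x) / b

# Every boundary point should map to a predicted probability of 0.5.
probs = expit(a * decision_x + b * decision_y + c)
print(np.allclose(probs, 0.5))  # True
```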
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">matplotlib.animation</span> <span class="k">as</span> <span class="n">animation</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">HTML</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="k">def</span> <span class="nf">animate_model</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">n_frames</span><span class="o">=</span><span class="mi">100</span><span class="p">):</span>
<span class="s">""" Function to animate a model's first n_frames epochs of training. """</span>
<span class="c"># Define necessary lines to plot a grid-- this will represent the vanilla "input space".</span>
<span class="n">grids</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">column_stack</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">),</span> <span class="n">k</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span><span class="o">/</span><span class="mf">10.</span><span class="p">))</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span><span class="mi">11</span><span class="p">)]</span> <span class="o">+</span>\
<span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">column_stack</span><span class="p">((</span><span class="n">k</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span><span class="o">/</span><span class="mf">10.</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">)))</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="o">-</span><span class="mi">10</span><span class="p">,</span><span class="mi">11</span><span class="p">)</span> <span class="p">]</span>
<span class="c"># Define functions for the output of the 2-hidden unit layer. </span>
<span class="c"># We assume this is the second to last layer</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">inputs</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">output</span><span class="p">)</span>
<span class="n">decision_x</span><span class="p">,</span> <span class="n">decision_y</span> <span class="o">=</span> <span class="n">get_decision_line</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="c"># Plot the original space's deformation by the neural network and use it as the init()</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">()</span>
<span class="n">orig_vals</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">df</span><span class="p">[[</span><span class="s">'x'</span><span class="p">,</span><span class="s">'y'</span><span class="p">]]</span><span class="o">.</span><span class="n">values</span><span class="p">])</span>
<span class="n">line</span><span class="p">,</span> <span class="o">=</span> <span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">decision_x</span><span class="p">,</span><span class="n">decision_y</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s">'black'</span><span class="p">)</span>
<span class="n">lineb</span><span class="p">,</span> <span class="o">=</span> <span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">orig_vals</span><span class="p">[</span><span class="n">indb</span><span class="p">,</span><span class="mi">0</span><span class="p">],</span> <span class="n">orig_vals</span><span class="p">[</span><span class="n">indb</span><span class="p">,</span><span class="mi">1</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'.'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'b'</span><span class="p">)</span>
<span class="n">liner</span><span class="p">,</span> <span class="o">=</span> <span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">orig_vals</span><span class="p">[</span><span class="n">indr</span><span class="p">,</span><span class="mi">0</span><span class="p">],</span> <span class="n">orig_vals</span><span class="p">[</span><span class="n">indr</span><span class="p">,</span><span class="mi">1</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="s">'.'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">grid_lines</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">grid</span> <span class="ow">in</span> <span class="n">grids</span><span class="p">:</span>
<span class="n">vals</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">grid</span><span class="p">)</span>
<span class="n">l</span><span class="p">,</span> <span class="o">=</span> <span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">vals</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span><span class="n">vals</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">'grey'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=.</span><span class="mi">5</span><span class="p">)</span>
<span class="n">grid_lines</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>
<span class="n">all_lines</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">([</span><span class="n">line</span><span class="p">,</span> <span class="n">lineb</span><span class="p">,</span> <span class="n">liner</span><span class="p">,</span> <span class="o">*</span><span class="n">grid_lines</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">animate</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">df</span><span class="p">[[</span><span class="s">'x'</span><span class="p">,</span><span class="s">'y'</span><span class="p">]]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span> <span class="n">df</span><span class="p">[[</span><span class="s">'label'</span><span class="p">]]</span><span class="o">.</span><span class="n">values</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">line</span><span class="o">.</span><span class="n">set_data</span><span class="p">(</span><span class="o">*</span><span class="n">get_decision_line</span><span class="p">(</span><span class="n">model</span><span class="p">))</span>
<span class="n">vals</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">inputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">df</span><span class="p">[[</span><span class="s">'x'</span><span class="p">,</span><span class="s">'y'</span><span class="p">]]</span><span class="o">.</span><span class="n">values</span><span class="p">])</span>
<span class="n">lineb</span><span class="o">.</span><span class="n">set_data</span><span class="p">(</span><span class="n">vals</span><span class="p">[</span><span class="n">indb</span><span class="p">,</span><span class="mi">0</span><span class="p">],</span> <span class="n">vals</span><span class="p">[</span><span class="n">indb</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">liner</span><span class="o">.</span><span class="n">set_data</span><span class="p">(</span><span class="n">vals</span><span class="p">[</span><span class="n">indr</span><span class="p">,</span><span class="mi">0</span><span class="p">],</span> <span class="n">vals</span><span class="p">[</span><span class="n">indr</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">grid_lines</span><span class="p">)):</span>
<span class="n">ln</span> <span class="o">=</span> <span class="n">grid_lines</span><span class="p">[</span><span class="n">k</span><span class="p">]</span>
<span class="n">grid</span> <span class="o">=</span> <span class="n">grids</span><span class="p">[</span><span class="n">k</span><span class="p">]</span>
<span class="n">vals</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">inputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">grid</span><span class="p">)])</span>
<span class="n">ln</span><span class="o">.</span><span class="n">set_data</span><span class="p">(</span><span class="n">vals</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span><span class="n">vals</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span>
<span class="k">return</span> <span class="n">all_lines</span>
<span class="k">def</span> <span class="nf">init</span><span class="p">():</span>
<span class="n">line</span><span class="o">.</span><span class="n">set_ydata</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">ma</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">decision_x</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
<span class="n">lineb</span><span class="o">.</span><span class="n">set_data</span><span class="p">(</span><span class="n">orig_vals</span><span class="p">[</span><span class="n">indb</span><span class="p">,</span><span class="mi">0</span><span class="p">],</span><span class="n">orig_vals</span><span class="p">[</span><span class="n">indb</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">liner</span><span class="o">.</span><span class="n">set_data</span><span class="p">(</span><span class="n">orig_vals</span><span class="p">[</span><span class="n">indr</span><span class="p">,</span><span class="mi">0</span><span class="p">],</span><span class="n">orig_vals</span><span class="p">[</span><span class="n">indr</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">grid_lines</span><span class="p">)):</span>
<span class="n">ln</span> <span class="o">=</span> <span class="n">grid_lines</span><span class="p">[</span><span class="n">k</span><span class="p">]</span>
<span class="n">grid</span> <span class="o">=</span> <span class="n">grids</span><span class="p">[</span><span class="n">k</span><span class="p">]</span>
<span class="n">vals</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">inputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">grid</span><span class="p">)])</span>
<span class="n">ln</span><span class="o">.</span><span class="n">set_data</span><span class="p">(</span><span class="n">vals</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span><span class="n">vals</span><span class="p">[:,</span><span class="mi">1</span><span class="p">])</span>
<span class="k">return</span> <span class="n">all_lines</span>
<span class="k">return</span> <span class="n">animation</span><span class="o">.</span><span class="n">FuncAnimation</span><span class="p">(</span><span class="n">fig</span><span class="p">,</span> <span class="n">animate</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">n_frames</span><span class="p">),</span> <span class="n">init_func</span><span class="o">=</span><span class="n">init</span><span class="p">,</span>
<span class="n">interval</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">blit</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<h1 id="animating-models">Animating Models</h1>
<p>Let’s define a few models and see how they deform the 2-hidden unit layer’s output for classification. Of course, these visualizations depend on the initializations of each layer, so don’t be surprised if you don’t see the same ones when you try it at home.</p>
<h2 id="2d-geometric-tranformations">2d Geometric Transformations</h2>
<p>If we only allow the model to have 2d geometric transformations, we will find that the model is unable to learn any useful representation of the data for classification, but it sure does bend the space into a boomerang.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Activation</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'tanh'</span><span class="p">,</span> <span class="n">input_dim</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'tanh'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'sigmoid'</span><span class="p">))</span>
<span class="n">sgd</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">optimizers</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">lr</span><span class="o">=</span><span class="mf">0.001</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="n">sgd</span><span class="p">,</span>
<span class="n">loss</span><span class="o">=</span><span class="s">'mse'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
<span class="n">anim</span> <span class="o">=</span> <span class="n">animate_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">HTML</span><span class="p">(</span><span class="n">anim</span><span class="o">.</span><span class="n">to_html5_video</span><span class="p">())</span>
</code></pre></div></div>
<video width="432" height="288" controls="" loop="">
<source type="video/mp4" src="../images/nnlearn/2d.mp4" />
</video>
<p>As you can see, with only two dimensional transformations (bends, rotations and scaling), the neural network cannot fully separate the two classes.</p>
<h2 id="3d-linear-transformations">3d Linear Transformations</h2>
<p>Now, let’s enable the model to move the space in three dimensions:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'tanh'</span><span class="p">,</span> <span class="n">input_dim</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'tanh'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'sigmoid'</span><span class="p">))</span>
<span class="n">sgd</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">optimizers</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">lr</span><span class="o">=</span><span class="mf">0.005</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="n">sgd</span><span class="p">,</span>
<span class="n">loss</span><span class="o">=</span><span class="s">'mse'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
<span class="n">anim2</span> <span class="o">=</span> <span class="n">animate_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">HTML</span><span class="p">(</span><span class="n">anim2</span><span class="o">.</span><span class="n">to_html5_video</span><span class="p">())</span>
</code></pre></div></div>
<video width="432" height="288" controls="" loop="">
<source type="video/mp4" src="../images/nnlearn/3d.mp4" />
</video>
<p>It’s clear that the model can easily find a useful projection to 2d space that separates the classes.</p>
<h2 id="4d-geometric-transformations">4d Geometric Transformations</h2>
<p>When we begin to add dimensions, the projections to the 2d plane can become more exotic, as seen below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'tanh'</span><span class="p">,</span> <span class="n">input_dim</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'tanh'</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'sigmoid'</span><span class="p">))</span>
<span class="n">sgd</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">optimizers</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">lr</span><span class="o">=</span><span class="mf">0.005</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="n">sgd</span><span class="p">,</span>
<span class="n">loss</span><span class="o">=</span><span class="s">'mse'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
<span class="n">anim3</span> <span class="o">=</span> <span class="n">animate_model</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">HTML</span><span class="p">(</span><span class="n">anim3</span><span class="o">.</span><span class="n">to_html5_video</span><span class="p">())</span>
</code></pre></div></div>
<video width="432" height="288" controls="" loop="">
<source type="video/mp4" src="../images/nnlearn/4d.mp4" />
</video>
<h1 id="conclusion">Conclusion</h1>
<p>Geometric demonstrations in mathematics typically build intuition about more abstract topics. The training process of a neural network can be understood as a stochastic gradient descent to minimize the loss function on the training set, but at the same time, it can also be understood geometrically as demonstrated above. Both viewpoints are informative and useful for becoming a more well rounded machine learning practitioner. In this case, the geometric viewpoint makes the need for additional hidden units obvious, a fact which would be more difficult to deduce otherwise.</p>How neural networks learn classification tasks is “deep magic” to many people, but in this post, we will demystify the training component of neural networks with geometric visualizations. After motivating the approach mathematically, we will dive into animating how a neural network learns using python and Keras.Dealing with Trends. Combine a Random Walk with a Tree-Based Model to Predict Time Series Data2017-03-19T00:00:00+00:002017-03-19T00:00:00+00:00http://srome.github.io//Dealing-With-Trends-Combine-a-Random-Walk-with-a-Tree-Based-Model-to-Predict-Time-Series-Data<p>A standard assumption underlying a standard machine learning model is that the model will be used on the same population during training and testing (and production). This poses an interesting issue with time series data, as the underlying process could change over time which would cause the production population to look differently from the original training data. In this post, we will explore the concept of a data model, a theoretical tool from statistics, and in particular the idea of a random walk to handle this situation, improving your modeling toolkit.</p>
<p>If we are all honest with ourselves, our working knowledge of probability and statistics at some point (maybe now?) was not much better than Lloyd’s here:</p>
<p><img src="../images/rw_ts/prob1.jpg" alt="jpg" /></p>
<p>But that can change! We’re going to connect the data model concept to machine learning via a modified example from the Kaggle <a href="https://www.kaggle.com/c/rossmann-store-sales/">Rossmann</a> competition, which will guide our thinking and feature engineering.</p>
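<p>To make the data model idea concrete before diving in: a random walk models each observation as the previous observation plus random noise, so the natural one-step-ahead forecast is simply the last observed value. A minimal simulation (the noise scale and series length here are arbitrary, chosen purely for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A random walk: today's value equals yesterday's value plus random noise.
noise = rng.normal(loc=0.0, scale=1.0, size=365)
walk = np.cumsum(noise)

# Under this data model, the one-step-ahead forecast for day t+1
# is just the value observed on day t.
forecast = walk[:-1]
errors = walk[1:] - forecast  # the forecast errors are exactly the noise terms
print(errors.mean())  # close to 0 for mean-zero noise
```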
<p>The problem we will study is how to build a model to predict the turnover (sales) when the model has data available to it from the day before. This is a typical use case for a model living in production and is a different flavor from the competition itself. This also presents new challenges that aren’t always attached to a static data set: changing trends over time. For our purposes, we only need the data from a single store.</p>
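<p>Since the model sees data from the day before, the natural feature engineering step is to lag the series. A toy sketch with made-up sales numbers (the real values come from the Rossmann data) using pandas <code class="highlighter-rouge">shift</code>:</p>

```python
import pandas as pd

# Toy daily sales for one store (hypothetical values, for illustration only).
df = pd.DataFrame({'Sales': [5263, 6064, 8314, 13995, 4822]})

# A model predicting day t can use the observed sales from day t-1 as a feature.
# shift(1) moves each value down one row, leaving NaN on the first day.
df['PrevDaySales'] = df['Sales'].shift(1)

# df['PrevDaySales'] is now [NaN, 5263.0, 6064.0, 8314.0, 13995.0]
print(df)
```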
<h2 id="tree-based-models-and-trends-in-time">Tree-Based Models and Trends in Time</h2>
<p>Using tree-based models in the presence of trends over time can lead to terrible results, and the reason is how decision trees make predictions. Decision Trees make regression predictions by finding which “leaf” a data point belongs to and assigning it the average of the target variable over the training examples in that leaf. This means their predictions are bounded by the range of target values seen during training: they are unable to “extrapolate” to data they have not already seen, and data from a significantly different population (perhaps after some trend in time) will cause the model to make inaccurate predictions. Random Forests and Gradient Boosted Tree algorithms inherit this problem because their outputs are combinations of Decision Tree outputs. The techniques in this post will allow you to alleviate this issue.</p>
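<p>This failure mode is easy to demonstrate with an illustrative scikit-learn sketch (not from the Rossmann data): fit a decision tree to a perfectly linear trend, then ask it to predict beyond the training window.</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A perfectly linear upward trend: sales = 10 * day, for days 0..99.
days = np.arange(100).reshape(-1, 1)
sales = 10.0 * days.ravel()

tree = DecisionTreeRegressor(random_state=0).fit(days, sales)

# Inside the training window the fully grown tree is exact...
print(tree.predict([[50]]))   # [500.]
# ...but beyond it, predictions are stuck at the largest training target.
print(tree.predict([[200]]))  # [990.]
```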
<h1 id="a-glance-at-the-data">A Glance at the Data</h1>
<p>First, we will load the data and split out basic time-based features.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'train.csv'</span><span class="p">)</span>
<span class="n">d</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Store</th>
<th>DayOfWeek</th>
<th>Date</th>
<th>Sales</th>
<th>Customers</th>
<th>Open</th>
<th>Promo</th>
<th>StateHoliday</th>
<th>SchoolHoliday</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>5</td>
<td>2015-07-31</td>
<td>5263</td>
<td>555</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>5</td>
<td>2015-07-31</td>
<td>6064</td>
<td>625</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>5</td>
<td>2015-07-31</td>
<td>8314</td>
<td>821</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>3</th>
<td>4</td>
<td>5</td>
<td>2015-07-31</td>
<td>13995</td>
<td>1498</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>4</th>
<td>5</td>
<td>5</td>
<td>2015-07-31</td>
<td>4822</td>
<td>559</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">t</span><span class="o">=</span> <span class="n">d</span><span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="s">'Store'</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span><span class="p">]</span>
<span class="n">t</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'Store'</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">datetime</span>
<span class="k">def</span> <span class="nf">make_date</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">'-'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span><span class="nb">int</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]),</span><span class="nb">int</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">2</span><span class="p">]))</span>
<span class="n">t</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">make_date</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">t</span><span class="p">[</span><span class="s">'Day'</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">day</span><span class="p">)</span>
<span class="n">t</span><span class="p">[</span><span class="s">'Month'</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">month</span><span class="p">)</span>
<span class="n">t</span><span class="p">[</span><span class="s">'Year'</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">year</span><span class="p">)</span>
<span class="c"># To get the previous day's sales later</span>
<span class="n">t</span><span class="p">[</span><span class="s">'DayOfYear'</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">dayofyear</span><span class="p">)</span>
<span class="n">t</span><span class="p">[</span><span class="s">'DayOfYear+1'</span><span class="p">]</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">dayofyear</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">'Date'</span><span class="p">)</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">[</span><span class="n">t</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s">'Open'</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span> <span class="ow">and</span> <span class="n">x</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span> <span class="o">></span> <span class="mi">0</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)]</span> <span class="c"># Open stores with positive sales only</span>
<span class="n">t</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="training-vs-testing-population">Training Vs. Testing Population</h2>
<p>Our goal is to learn about models dealing with a changing underlying population, i.e. a trend. Let’s check whether our data is in line with the situation we want to study.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">t</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">.</span><span class="n">loc</span><span class="p">[:</span><span class="s">'2014-08-01'</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'line'</span><span class="p">)</span>
<span class="n">t</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'2014-08-01'</span><span class="p">:]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'line'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Training (Blue) vs. Testing (Red)'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="../images/rw_ts/output_5_1.png" alt="png" /></p>
<p>If you look at the data, there do not seem to be any non-periodic trends: the sales in 2015 are arguably the same as in 2014, and the same as in 2013. This is not what we want to study! Tree-based models do really well when the data they predict on looks like the data they were trained on, and as mentioned, this is a common assumption when doing machine learning. A major problem for any data science model is a changing population, and a trend over time will change the population and decimate performance. How do you deal with that situation? Well, we’re going to inject a trend and then explore that question!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">t</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span> <span class="o">+=</span> <span class="n">np</span><span class="o">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">cos</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">20</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">t</span><span class="p">))))</span><span class="o">*</span><span class="mi">10</span><span class="p">)</span>
<span class="n">t</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">.</span><span class="n">loc</span><span class="p">[:</span><span class="s">'2014-08-01'</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'line'</span><span class="p">)</span>
<span class="n">t</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="s">'2014-08-01'</span><span class="p">:]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'line'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Training (Blue) vs. Testing (Red)'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="../images/rw_ts/output_7_1.png" alt="png" /></p>
<p>Much better! Of course, in practice you wouldn’t want to look at your test set, and you would be unable to look at future production data. We are doing this to illustrate a point, not as a best practice. The presence of a trend should be deduced from the training set, which we will go over below. This trend is what will cause a standard tree-based model to perform poorly, leading us to explore a more advanced modeling technique.</p>
<h1 id="gaussian-random-walks">Gaussian Random Walks</h1>
<p>A <a href="https://en.wikipedia.org/wiki/Statistical_model">data model</a> is a statistical process that is assumed to generate the observed data; when chosen correctly, it can lead to better results from your machine learning models. A random walk is a data model describing processes whose current location depends only on their previous location plus a random “step”. That is to say,</p>
<script type="math/tex; mode=display">\text{position}_{t+1} = \text{position}_{t} + \text{random step}.</script>
<p>It is commonplace to hear the term “random walk” associated with stock returns. Due to a deep theoretical connection between random walks and Brownian motion, the random step is very often modeled by a draw from a normal distribution; the technical term for this is a <a href="https://en.wikipedia.org/wiki/Random_walk#Gaussian_random_walk">Gaussian random walk</a>. Formally, if our quantity of interest is denoted <script type="math/tex">y</script>, then a process <script type="math/tex">y_t</script> satisfying a Gaussian random walk has the distribution</p>
<script type="math/tex; mode=display">y_{t+1} \sim y_{t} + \mathcal{N}(0,\sigma^2).</script>
<p>Now, if the jumps of the random walk have a non-random component based on some factors defined by <script type="math/tex">x_t</script>, then we can choose a function <script type="math/tex">f</script> such that</p>
<script type="math/tex; mode=display">y_{t+1} \sim y_{t} + f(x_t) + \mathcal{N}(0,\sigma^2).</script>
<p>Finally, define <script type="math/tex">Y_t := y_{t+1}-y_t</script> to end up with the equation</p>
<script type="math/tex; mode=display">Y_t \sim f(x_t) + \mathcal{N}(0,\sigma^2).</script>
<p>If this is starting to look like the formula for linear regression, that’s because it is exactly that formula when <script type="math/tex">f</script> is a linear function.</p>
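<p>To make the data model concrete, here is a small simulation (a sketch with made-up drift and noise parameters, not the Rossmann data) of a Gaussian random walk with a deterministic component, showing that differencing recovers <script type="math/tex">f(x_t)</script> plus stationary noise:</p>

```python
import numpy as np

rng = np.random.RandomState(0)
n = 365
x = np.linspace(0, 4 * np.pi, n)

f = 5 * np.sin(x)                 # deterministic component f(x_t)
noise = rng.normal(0, 1, size=n)  # the N(0, sigma^2) part of each step
steps = f + noise
y = 100 + np.cumsum(steps)        # y_{t+1} = y_t + f(x_t) + noise

# Differencing: Y_t = y_{t+1} - y_t = f(x_t) + noise
Y = np.diff(y)
residuals = Y - f[1:]             # what remains after a perfect model learns f
print(residuals.mean(), residuals.std())  # roughly 0 and 1
```

<p>A model that learns <script type="math/tex">f</script> from the differenced series leaves behind residuals that are just the stationary noise.</p>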
<p>So, if we expect a process to follow a random walk-type progression with some deterministic component, this is a perfectly valid data model for generating the time series data. If we have selected the data model accurately, that gives our model a much stronger chance of success, and in fact we can validate the choice, just as in the time-independent case, by looking at the residuals.</p>
<h4 id="fun-fact">Fun Fact</h4>
<p>A (symmetric) random walk in one or two dimensions will return to its starting point infinitely many times, while a random walk in three dimensions returns only finitely many times with probability 1 (the probability of ever returning is roughly 34%). See this <a href="http://math.stackexchange.com/questions/536/proving-that-1-and-2-d-simple-symmetric-random-walks-return-to-the-origin-with">discussion</a> for the 1d and 2d case.</p>
<h2 id="time-series-residuals">Time Series Residuals</h2>
<p>In statistics, a time series that is “stationary” is nice to work with. Essentially, a stationary time series is one whose mean and variance are constant in time. A simple stationary series is
<script type="math/tex">s_t \sim N(0,\sigma^2),</script>
where at every time <script type="math/tex">t</script> you simply draw from the same normal distribution.</p>
<p>In contrast, the series
<script type="math/tex">s_t \sim N(t,\sigma^2)</script>
is not stationary, as its mean changes in time. However, we can make it stationary by defining the new series</p>
<script type="math/tex; mode=display">\hat s_t := s_t - t \sim N(0,\sigma^2).</script>
<p>For a problem without a time domain, it’s common wisdom that the residuals should be normal. For a time-dependent problem, the equivalent wisdom is that the residuals should be stationary. If our data model is selected correctly, then our model’s residuals will be stationary:</p>
<script type="math/tex; mode=display">Y_t - f(x_t) \sim \mathcal{N}(0,\sigma^2).</script>
<p>When thinking about the viability of assuming such a model, you should look at basic transformations of the training data and see if stationary residuals might be possible.</p>
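<p>A quick, informal way to check whether a series might be stationary (a sketch on synthetic data, not part of the original analysis) is to compare means over non-overlapping windows: for a stationary series they hover around a constant, while a trend makes them drift.</p>

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
s = pd.Series(rng.normal(0, 1, 500))            # stationary: constant mean and variance
trend = pd.Series(np.linspace(0, 10, 500)) + s  # the same noise plus a linear trend

def window_means(series, window=100):
    # Mean of each non-overlapping window; roughly flat for a stationary series
    return series.groupby(np.arange(len(series)) // window).mean()

print(window_means(s).round(2).tolist())      # hovers near 0
print(window_means(trend).round(2).tolist())  # climbs steadily
```

<p>A more formal option, if statsmodels is available, is an augmented Dickey–Fuller test (<code>statsmodels.tsa.stattools.adfuller</code>).</p>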
<h2 id="differencing">Differencing</h2>
<p>The act of subtracting the previous data point (<script type="math/tex">y_{t+1}-y_{t}</script>) is known as “differencing” in the statistical literature. The goal of any differencing scheme is to obtain a stationary series, which would mean your data model accurately represents the data. In many cases, a single difference is not enough and a trend of some kind remains.</p>
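<p>In pandas, this kind of differencing is one line; the sketch below uses a toy series (not this post’s dataframe, which builds the one-day lag explicitly via the <code>DayOfYear+1</code> merge):</p>

```python
import pandas as pd

sales = pd.Series([100, 103, 101, 106, 110],
                  index=pd.date_range('2014-08-01', periods=5))

# Y_t = y_t - y_{t-1}: the differenced target the model will learn
diffed = sales.diff().dropna()
print(diffed.tolist())  # [3.0, -2.0, 5.0, 4.0]

# Equivalent to subtracting an explicit one-day lag column
lagged = sales.shift(1)
assert diffed.equals((sales - lagged).dropna())
```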
<p>To show the data model may be viable, the residuals of <script type="math/tex">y_{t+1}-y_{t}</script> should appear approximately normal. At that point, we’d rely on our machine learning model to capture the rest of the variation by learning <script type="math/tex">f(x_t)</script>. Then, on our test set, we’d hope to see stationary residuals.</p>
<p>Seeing stationary residuals on the test set validates the entire approach and gives us additional confidence that our model will perform well in production. Not only would the model be performing well on our evaluation metric, but it appears to capture the statistical properties of the generating process of the data in question.</p>
<h1 id="data-exploration-to-test-data-model-viability">Data Exploration to Test Data Model Viability</h1>
<p>To have increased confidence in our final model through a statistical framework, we want to see that stationary residuals might ultimately be possible. That means the differenced variable should exhibit no trends over time that we believe our features would be unable to explain. Of course, this is more of an art than a science. For this exploration, we treat it like any other machine learning model: we exclude the test set data from our analysis to avoid cheating.</p>
<h2 id="practical-gotcha-defining-your-training-set">Practical Gotcha: Defining Your Training Set</h2>
<p>You have to be careful when defining your training set for a time-dependent problem. In our case, the data spans 1/2013 to 7/2015, so I chose the year of data from 8/2014 to 7/2015 as my test set. This way, I have a chance to test my model on every day of the year. Any data exploration should only be done on your training set, and if possible, on a part of the training set you don’t use for actual model training. This is not a hard and fast rule, but it’s an attempt to avoid overfitting the theory to the training set.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Combining current day's data with previous day's data</span>
<span class="n">dfs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">t</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'StateHoliday'</span><span class="p">,</span><span class="s">'Open'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">t</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">bad_words</span> <span class="o">=</span> <span class="nb">set</span><span class="p">([</span><span class="s">'Day'</span><span class="p">,</span><span class="s">'Month'</span><span class="p">,</span><span class="s">'Year'</span><span class="p">,</span><span class="s">'DayOfWeek'</span><span class="p">,</span><span class="s">'Date'</span><span class="p">])</span>
<span class="k">def</span> <span class="nf">select_df_cols</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">suffix</span><span class="p">):</span>
<span class="k">return</span> <span class="n">df</span><span class="p">[[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="k">if</span> <span class="p">(</span><span class="n">suffix</span> <span class="ow">in</span> <span class="n">x</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">np</span><span class="o">.</span><span class="nb">any</span><span class="p">([</span><span class="n">word</span> <span class="ow">in</span> <span class="n">x</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">bad_words</span><span class="p">])</span> <span class="p">)</span> <span class="ow">or</span> <span class="n">x</span><span class="o">==</span><span class="s">'Date'</span><span class="p">]]</span>
<span class="n">dfs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">t</span><span class="o">.</span><span class="n">copy</span><span class="p">())</span>
<span class="n">df</span><span class="o">=</span><span class="n">t</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">left_on</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Year'</span><span class="p">,</span><span class="s">'DayOfYear'</span><span class="p">],</span> <span class="n">right_on</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Year'</span><span class="p">,</span><span class="s">'DayOfYear+1'</span><span class="p">],</span> <span class="n">suffixes</span><span class="o">=</span><span class="p">[</span><span class="s">''</span><span class="p">,</span><span class="s">'_day-1'</span><span class="p">])</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">select_df_cols</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="s">'_day-1'</span><span class="p">)</span>
<span class="n">dfs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">dfs</span> <span class="o">=</span> <span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">'Date'</span><span class="p">)</span> <span class="k">for</span> <span class="n">df</span> <span class="ow">in</span> <span class="n">dfs</span><span class="p">]</span>
<span class="n">nt</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">(</span><span class="n">dfs</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c"># Making Training Set</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">nt</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:</span><span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">nt</span><span class="o">.</span><span class="n">index</span> <span class="o">==</span> <span class="s">'2014-08-01'</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="p">:]</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">nt</span><span class="o">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">nt</span><span class="o">.</span><span class="n">index</span> <span class="o">==</span> <span class="s">'2014-08-01'</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]:,</span> <span class="p">:]</span>
<span class="c"># Check the basic differencing residuals on the training set</span>
<span class="n">f</span><span class="p">,</span> <span class="n">axs</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
<span class="p">(</span><span class="n">X_train</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">-</span><span class="n">X_train</span><span class="p">[</span><span class="s">'Sales_day-1'</span><span class="p">])</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'line'</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Differenced Sales Residuals'</span><span class="p">)</span>
<span class="c"># Compare to samples from a normal distribution in time</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">500</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">X_train</span><span class="p">)))</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Stationary Distribution Example'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="../images/rw_ts/output_14_0.png" alt="png" /></p>
<p>From a quick inspection, you can see the residuals do look similar to draws from a normal distribution in time, except for a few points, which is variation we expect a model could capture.</p>
<h2 id="practical-gotcha-cleaning-time-series">Practical Gotcha: Cleaning Time Series</h2>
<p>You have to be careful when cleaning data for time-dependent models. Cleaning can make your Kaggle score go up (as you’ll see in the competition, most people do it), but it can also hurt the real-life viability of the model. In fact, it could cause your model to fail in the most important production scenarios.</p>
<p>In this <a href="http://nbviewer.jupyter.org/github/JohanManders/ROSSMANN-KAGGLE/blob/master/ROSSMANN%20STORE%20SALES%20COMPETITION%20KAGGLE.ipynb">notebook</a> for the competition, both outliers and aberrant trends are cleaned from the store data. The outliers are removed via a modified z-score based on the median absolute deviation, and the erroneous trends are scrubbed by inspection. The theoretical framework we have introduced takes both of these cases into account, removing the need for such cleaning.</p>
<p>As you might expect, many of the outliers turn out to be holidays. Do we want to remove those data points? Sometimes such cleaning can improve the performance of your model on a test set, but it could lead to overfitting your model to an idealized version of the data. This can be dangerous and should be done with caution, if at all.</p>
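<p>For reference, a modified z-score based on the median absolute deviation can be sketched as follows (the 0.6745 consistency constant and the conventional 3.5 cutoff follow Iglewicz and Hoaglin; the linked notebook’s exact variant may differ):</p>

```python
import numpy as np

def modified_z_scores(x):
    # Robust z-score: 0.6745 * (x - median) / MAD
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad

sales = np.array([5000.0, 5200.0, 4900.0, 5100.0, 15000.0])  # last point: a holiday spike
outliers = np.abs(modified_z_scores(sales)) > 3.5
print(outliers)  # only the spike is flagged
```

<p>Because it uses medians rather than means, a single extreme value (like a holiday) cannot inflate the scale estimate and hide itself.</p>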
<h1 id="training">Training</h1>
<p>There are a few other features we are going to exclude. The original data set had a “Customers” feature, the number of customers in the store on a given day. If our goal is to predict tomorrow’s sales from today, it is unreasonable to have this in the data set. We’ll also train a model on the raw sales, without the differencing data model, to compare the results.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span> <span class="o">-</span><span class="n">X_train</span><span class="p">[</span><span class="s">'Sales_day-1'</span><span class="p">]</span> <span class="c"># For the statistical data model approach</span>
<span class="n">to_drop</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Sales'</span><span class="p">,</span> <span class="s">'Year'</span><span class="p">,</span> <span class="s">u'DayOfYear+1'</span><span class="p">,</span> <span class="s">'Customers'</span><span class="p">]</span>
<span class="n">y2</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span> <span class="c"># No statistical data model</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">X_train</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">to_drop</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">X_test_p</span> <span class="o">=</span> <span class="n">X_test</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">to_drop</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span> <span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">GradientBoostingRegressor</span>
<span class="c"># Using our statistical data model</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
<span class="c"># No statistical data model</span>
<span class="n">model2</span> <span class="o">=</span> <span class="n">GradientBoostingRegressor</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span>
<span class="n">model2</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y2</span><span class="p">)</span>
</code></pre></div></div>
<h1 id="evaluation">Evaluation</h1>
<p>The Rossmann Kaggle competition used “root mean squared percentage error” (RMSPE) as its metric. In other words, it is the square root of the mean of the squared relative errors. We will calculate this metric as well as show plots of the model “with stats” and without. For reference, the winning model in that competition scored an RMSPE of about 0.1.</p>
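<p>The metric itself is short to write down; here is a sketch (Kaggle’s definition ignores days with zero actual sales, which we have already filtered out above):</p>

```python
import numpy as np

def rmspe(y_true, y_pred):
    # Root mean squared percentage error: sqrt(mean(((y - yhat) / y)^2))
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))

print(rmspe([100, 200, 400], [110, 180, 400]))  # about 0.0816
```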
<p>Notice, we have to add back the previous day’s sales to the prediction because we subtracted that off of our target variable.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X_test</span><span class="p">[</span><span class="s">'ML+Stats'</span><span class="p">]</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test_p</span><span class="p">)</span> <span class="o">+</span> <span class="n">X_test</span><span class="p">[</span><span class="s">'Sales_day-1'</span><span class="p">]</span> <span class="c"># Using our data model / theory</span>
<span class="n">X_test</span><span class="p">[</span><span class="s">'ML_Only'</span><span class="p">]</span> <span class="o">=</span> <span class="n">model2</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test_p</span><span class="p">)</span> <span class="c"># Vanilla ML model</span>
<span class="n">X_test</span><span class="p">[[</span><span class="s">'Sales'</span><span class="p">,</span><span class="s">'ML+Stats'</span><span class="p">,</span> <span class="s">'ML_Only'</span><span class="p">]]</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">"line"</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="../images/rw_ts/output_22_1.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f</span><span class="p">,</span> <span class="n">axs</span><span class="o">=</span><span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
<span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">-</span><span class="n">X_test</span><span class="p">[</span><span class="s">'ML+Stats'</span><span class="p">])</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">"line"</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'ML + Stats Residuals in Time'</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'ML + Stats Residuals Histogram'</span><span class="p">)</span>
<span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">-</span><span class="n">X_test</span><span class="p">[</span><span class="s">'ML+Stats'</span><span class="p">])</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">"hist"</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>
<span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">-</span><span class="n">X_test</span><span class="p">[</span><span class="s">'ML_Only'</span><span class="p">])</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">"line"</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>
<span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">-</span><span class="n">X_test</span><span class="p">[</span><span class="s">'ML_Only'</span><span class="p">])</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">"hist"</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'ML Only Residuals In Time'</span><span class="p">)</span>
<span class="n">axs</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'ML Only Residuals Histogram'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">tight_layout</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="../images/rw_ts/output_23_0.png" alt="png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">'ML + Stats RMSPE'</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(((</span><span class="n">X_test</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">-</span><span class="n">X_test</span><span class="p">[</span><span class="s">'ML+Stats'</span><span class="p">])</span><span class="o">/</span><span class="n">X_test</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">])</span><span class="o">**</span><span class="mi">2</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">'ML Only RMSPE'</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(((</span><span class="n">X_test</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">]</span><span class="o">-</span><span class="n">X_test</span><span class="p">[</span><span class="s">'ML_Only'</span><span class="p">])</span><span class="o">/</span><span class="n">X_test</span><span class="p">[</span><span class="s">'Sales'</span><span class="p">])</span><span class="o">**</span><span class="mi">2</span><span class="p">)))</span>
</code></pre></div></div>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>('ML + Stats RMSPE', 0.07455399875280469)
('ML Only RMSPE', 0.11567913104977838)
</code></pre></div></div>
<p>Although each model had trouble with the holidays, the model which uses our statistical framework did almost twice as well on the Kaggle metric of root mean squared percentage error. From the histograms of the residuals, we can also see that this model is less biased (in the statistical sense). Looking at the residuals in time, the ML + Stats residuals appear nearly stationary, whereas the ML Only residuals show a slight upward drift.</p>
<p>With more data and additional features, we would hope to be able to anticipate that problem. However, the holidays might be so important to year-end sales that you may even train a separate model just to predict the excess.</p>
<h1 id="extension-seasonal-differencing">Extension: Seasonal Differencing</h1>
<p>There are many ways to improve the approach and handle the holidays. Differencing is not just for adjacent time points; it can be applied at larger lags as well to adjust for seasonal trends. This would be a proper way to account for the spike around the holidays, but as we only have two holiday periods in the data set, I am unable to have holiday data points in both the training set and the testing set. However, the same theory and overarching approach apply and can be used with enough data.</p>
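<p>As a sketch of what seasonal differencing looks like in practice, here is a synthetic daily series (made up for illustration, not the Rossmann data) differenced at lag 1 versus lag 7. In pandas, the lag is just an argument to <em>diff</em>:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales series: an upward trend plus a weekly cycle
rng = pd.date_range("2015-01-01", periods=728, freq="D")
sales = pd.Series(
    100 + 0.05 * np.arange(len(rng)) + 20 * (rng.dayofweek < 5),
    index=rng,
)

# Lag-1 differencing (what this post uses) removes the trend
daily_diff = sales.diff(1)

# Seasonal differencing at lag 7 removes the weekly cycle; a lag of
# 364 (52 full weeks) would target yearly effects such as holidays
weekly_diff = sales.diff(7)

# The weekly-differenced series is nearly constant here because the
# toy pattern repeats exactly every 7 days (up to the linear trend)
print(weekly_diff.dropna().std() < daily_diff.dropna().std())
```

<p>The same downstream modeling approach applies: train on the differenced target and add the subtracted lag back to the predictions.</p>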
<h1 id="conclusion">Conclusion</h1>
<p>Another way to view this approach is <em>reframing</em> the problem we are studying to fit a key assumption in machine learning, namely that the data the model is trained and tested on come from the same population. The technique of differencing is one way to make sure this happens. There are other ways as well, but the point is that reframing a problem in this way can lead to huge performance gains, even though all you have really done is think about the problem differently and then follow the standard approach for such a problem.</p>A standard assumption underlying a standard machine learning model is that the model will be used on the same population during training and testing (and production). This poses an interesting issue with time series data, as the underlying process could change over time, which would cause the production population to look different from the original training data. In this post, we will explore the concept of a data model, a theoretical tool from statistics, and in particular the idea of a random walk to handle this situation, improving your modeling toolkit.(My Opinion of) Best Practices for a Data Scientist in Industry2017-03-05T00:00:00+00:002017-03-05T00:00:00+00:00http://srome.github.io//(My-Opinion-Of)-Best-Practices-for-a-Data-Scientist-in-Industry<p>Most things you need to know when you start as a data scientist can be found online these days. Besides containing my own opinions, this post also aims to detail some important points that are either not always spelled out in detail or are glossed over. I also include practical advice to avoid typical mistakes a new data scientist can make. Of course, it’s going to be tailored more towards Python users (as I am one too), but that doesn’t mean there’s nothing here for other users.</p>
<p>After reading a friend’s <a href="https://philadelphiaphysicist.wordpress.com/2016/11/11/transitioning-from-physics-in-academia-to-data-science-lessons-learned/">post</a> on transitioning to industry from academia, I decided I wanted to give my own take on starting up as a data scientist.</p>
<h2 id="general-advice">General Advice</h2>
<p>Some of these may seem like common sense… but we still don’t always do them.</p>
<h3 id="read-the-docs">Read the Docs</h3>
<p>Documentation is your best friend. You should read the documentation on any piece of your ML pipeline <strong>in full</strong> before you employ it. Make sure you understand any gotchas and special cases that are highlighted. I’ve been particularly bitten by H2O in the past…</p>
<h3 id="work-with-notebooks">Work With Notebooks</h3>
<p>No matter what language you use, there’s a concept of a notebook. Notebooks have become indispensable in my workflow for organizing my thoughts and analyses. I write this blog in a notebook. They allow you to have code, output, and text side by side in a natural way to explain what is going on in your work. For research and for basic prototypes, notebooks are essential. For Python, the standard is the <a href="http://jupyter.org/">jupyter notebook</a>. There’s a similar concept for Scala, and I’m sure for other languages. A notebook even allows you to export your work directly to a PDF document, it can be converted easily into a PowerPoint-ready presentation, you can include math equations with LaTeX, and you can even drop the code to a (say) Python .py file. It’s a great place to start any analysis.</p>
<h3 id="organize-your-projects-with-version-control">Organize Your Projects With Version Control</h3>
<p>This seems like an obvious statement– but many of us are lazy and lose track of our projects. Maybe one avenue of analysis you took happened on a separate branch that you thought you merged in but you didn’t. Maybe some exploratory analysis becomes important all of a sudden, and all the code is broken because you’ve updated a library. It seems crazy– but it happens. You have to enforce a strict organizational policy on yourself.</p>
<p>There are many patterns I’ve seen for organizing your projects. I think the most important thing is to DO IT. For too long, I let scripts and data float around my hard drive after I finished a prototype, and I had forgotten about it by the time business had decided to give the project legs. Then, I had to find everything and immediately remember the nuances of my undocumented work. Big mistake.</p>
<p>One tip I use is to keep all code/data for each project in a self-contained folder, which itself lives in a single folder (say, “research”). This was a great tip I saw on Twitter, actually. However, I want to extend the suggestion. If possible, you should store each individual project (sans data, probably) in a version controlled (git!) repo. My suggestion is a basic “models” repo with a folder for each project. Then, the readme for each project can contain any additional information needed to get up to speed with the analysis and the results. You should have your ETL script (or however you got the data), a notebook of the analysis, and information on the dataset and where its permanent home is. Another important thing is to document the libraries used and their versions. This can be an issue with early versions of libraries. For Python users, a simple approach is to include a pip requirements.txt file with the versions pinned hard. It’s very frustrating when you saved a model in the repo but can’t load it because you’ve updated versions in the interim. More on that below (see virtualenv).</p>
<p>Keeping the ETL script is more important than you’d think. It’s ideal if you have a data dictionary for the project, but for quick prototypes that you may do in your free time, you may not be able to do a full scale data dictionary. The ETL script (fully commented, of course) will really help.</p>
<h4 id="python-tip-use-virtualenv">Python Tip: Use virtualenv</h4>
<p>Different projects have different requirements. Virtualenv allows you to upgrade any library you want without breaking other projects, as each project has its own virtualenv set up with its own requirements installed. Many older models I built on an old version of Keras broke when I upgraded, but I forgot what version I was running when I built them! If I had kept track of that, I could have spun up a virtualenv (or loaded the old one, if I had saved it) and been instantly back in the same position, seeing the same results as I had.</p>
<h2 id="timebox-research">Timebox Research</h2>
<p>Research is a critical component as a data scientist, but the usual academic approach to research (which I have at least once been guilty of) should be timeboxed. That is, don’t start doing research for a project without also deciding on a reasonable amount of time you’ll spend on it– and spend no more!</p>
<p><img src="../images/ds_tips/picture.png" alt="png" /></p>
<p>And even though it seems like the most important component, only make humorous graphics for talks after the main meat of it is completed… less stressful that way.</p>
<h2 id="on-evaluating-your-model-correctly">On Evaluating Your Model Correctly</h2>
<p>A “golden rule” that will always point you in the right direction in data science is “test exactly the process you want to generalize in the same manner as it will be evaluated”. It sounds easy, but in practice there are some subtle nuances that can lead to poor results when missed. I break this statement down below.</p>
<p>I’m going to use “cross validation” in this section, but for large models you may not be able to do full cross validation. I understand this and in those cases you are probably far enough along that you don’t need my advice :).</p>
<h3 id="the-process-you-want-to-generalize-think-of-your-model-as-a-pipeline">The Process You Want To Generalize: Think Of Your Model as a Pipeline</h3>
<p>Your model is not just the ML algorithm you use to predict. It’s the entire pipeline including your data transformation, your scaling, your feature engineering (in many cases when automated), how you select hyperparameters, and your final model. When you identify these components together, it makes the following sections easier to understand conceptually why you do it.</p>
<h3 id="test-your-process-use-nested-cross-validation">Test Your Process: Use Nested Cross Validation</h3>
<p>I just said to think of your pipeline as your model. Why? It makes the question of what to cross validate easy! You cross validate your WHOLE pipeline– as that is what you want to generalize! No one disagrees that cross validation is an important tool in a data scientist’s pocket, but many people do not do it properly, and that can lead to disastrous results.</p>
<p>Now, that may sound well and good, but there’s a sneaky component to this. Many of us use cross validation to select parameters, and above I said to think of that piece as part of your pipeline. So yes, what I’m saying is, you should have 2 cross validations happening when you test your model (your pipeline!) with cross validation. This is called <strong>nested cross validation</strong>. The “outer” cross validation tests the generalization of your pipeline, where the “inner” cross validation is typically used to select hyperparameters or feature engineering steps.</p>
<p>What this is also subtly implying is that if you can’t automate a process in order to cross validate it, it’s likely (unless you are a seasoned expert) you shouldn’t do it. If you want to cut off features that fail to hit a threshold to cut down your data– you had better do it inside your pipeline so it can be cross validated as a part of your process.</p>
<p>The key is that whatever your pipeline is, that’s what you want to generalize. If you do anything outside of a cross validation framework, you could discover that your model doesn’t actually generalize to new data only once it’s in production, rather than catching the problem earlier.</p>
<p>If you plan to try different models, this advice would include adding the different algorithms (Random Forest, NN, Linear Regression, etc.) to this nested cross validation scheme under ideal circumstances. Remember: however you pick your final approach, that selection process should be cross validated.</p>
<h3 id="in-the-same-manner-as-it-will-be-evaluated-how-is-success-judged-by-business">In The Same Manner As It Will Be Evaluated: How is “Success” Judged By Business</h3>
<p>Cross validation is a great tool, but you need to be aware of your goals. Many tips say to shuffle your data during training (either before it starts, or during epochs, etc.) which is great, but you may need to do more than that.</p>
<p>The important question to ask is what you will be predicting on in the future and what the business goal is– what determines the success of your model. If you will be predicting on examples from the <strong>SAME</strong> population as your training set, and generalization of your CV metric is your goal, then a standard shuffle + CV is great for you. A proven model that answers the right question can achieve the business goal on the desired population. If you will be predicting on a <strong>NEW</strong> population, you should cross validate in the same manner the model will be evaluated.</p>
<p>Let’s go through an example. Say you have data for multiple states for something you’d like to predict. Your goal is to predict on new states not in your current data set. You will be evaluated on how well the model predicts the quantity on each individual state, with an average taken of the success. This is a situation where you do <strong>not</strong> want to simply cross validate on your training set. If you simply shuffled and cross validated, you would be answering the <strong>WRONG</strong> question– in production, your model won’t be evaluated by business on its performance on this type of data. You should use Labeled (or Grouped) cross validation. This labels the data based on the population each example is in, and then during cross validation, it ensures that a group will never be “split” into both the training and the test set. This way, for every fold, you will be training on some groups (say, states A, B, and C) and predicting on a new group. This is exactly what your goal is, and you’re getting an estimate of how well your process generalizes! For python users, try the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html">GroupKFold</a> class in sci-kit learn.</p>
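<p>A minimal sketch of this behavior (with made-up state labels on toy data) shows that GroupKFold never splits a group across the training and test sides of a fold:</p>

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data with a hypothetical "state" label attached to each row
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["A", "A", "B", "B", "C", "C"])

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    # No state ever appears on both sides of a fold
    print(sorted(train_groups), sorted(test_groups), train_groups & test_groups)
```

<p>Each fold trains on two whole states and tests on the held-out third, which matches the evaluation scenario described above.</p>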
<p>This is important because you as the data scientist should be confident that your model will satisfy the business goals you commit to, as you are the one who is being judged alongside your model, and it’s your career that is affected by its performance.</p>
<h4 id="read-the-docs-more">Read the docs more!!</h4>
<p>Take sci-kit learn’s <a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">GridSearchCV</a>. The “iid” indicator is defaulted to true– let’s see what this does:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
</code></pre></div></div>
<p>This is very subtle. What does it mean for your data to be IID? In our grouped cross validation example, I’d prefer to set this setting to “False”, as I’d want a low mean loss per fold AND a low standard deviation across folds to be sure my model generalizes on the question I’m asking of it.</p>
<h2 id="an-example-python-model-training-script">An Example Python Model Training Script</h2>
<p>Below I’ll give an example model in sci-kit learn. Now, I do not always advocate for all of these techniques, but I wanted to include as much as possible to identify the proper approach based on the previous points.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span><span class="p">,</span> <span class="n">GroupKFold</span><span class="p">,</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_selection</span> <span class="kn">import</span> <span class="n">SelectKBest</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">SGDClassifier</span>
<span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="n">data_labels</span> <span class="o">=</span> <span class="p">[</span><span class="o">...</span><span class="p">]</span> <span class="c"># Labels for your data set's groups</span>
<span class="c"># Pipeline definition</span>
<span class="n">steps</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="s">'pca'</span><span class="p">,</span> <span class="n">PCA</span><span class="p">()),</span> <span class="c"># Might as well...</span>
<span class="p">(</span><span class="s">'feature_selection'</span><span class="p">,</span> <span class="n">SelectKBest</span><span class="p">()),</span> <span class="c"># Let's be selective</span>
<span class="p">(</span><span class="s">'classifier'</span><span class="p">,</span> <span class="n">SGDClassifier</span><span class="p">())</span> <span class="c"># Start simple</span>
<span class="p">]</span>
<span class="n">pipeline</span> <span class="o">=</span> <span class="n">Pipeline</span><span class="p">(</span><span class="n">steps</span><span class="p">)</span>
<span class="n">params</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'pca__n_components'</span><span class="p">:</span> <span class="p">[</span><span class="o">.</span><span class="mi">5</span><span class="p">,</span><span class="o">.</span><span class="mi">7</span><span class="p">,</span><span class="o">.</span><span class="mi">9</span><span class="p">],</span>
<span class="s">'classifier__loss'</span> <span class="p">:</span> <span class="p">[</span><span class="s">'log'</span><span class="p">,</span><span class="s">'hinge'</span><span class="p">],</span>
<span class="s">'classifier__alpha'</span> <span class="p">:</span> <span class="p">[</span><span class="o">.</span><span class="mo">01</span><span class="p">,</span><span class="o">.</span><span class="mo">001</span><span class="p">,</span><span class="o">.</span><span class="mo">0001</span><span class="p">],</span>
<span class="s">'feature_selection__k'</span> <span class="p">:</span> <span class="p">[</span><span class="s">'all'</span><span class="p">,</span> <span class="mi">1000</span><span class="p">,</span> <span class="mi">500</span><span class="p">]</span>
<span class="p">}</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">pipeline</span><span class="p">,</span> <span class="n">param_grid</span><span class="o">=</span><span class="n">params</span><span class="p">)</span>
<span class="c"># Evaluating generalization to new overarching groups</span>
<span class="n">scores</span> <span class="o">=</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">groups</span><span class="o">=</span><span class="n">data_labels</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="n">GroupKFold</span><span class="p">())</span>
<span class="c"># Now fit your model on the data</span>
<span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span><span class="n">y</span><span class="p">)</span>
<span class="c"># Test on final held out test set...</span>
</code></pre></div></div>
<p>As of scikit-learn 0.18, you can’t pass the group labels through to the inner GridSearchCV, which is unfortunate, but at least the outer loop will respect the groupings.</p>
<h2 id="assorted-practical-advice-and-personal-preferences">Assorted Practical Advice and Personal Preferences</h2>
<h3 id="quantile-your-predictions">Quantile Your Predictions</h3>
<p>If you’re working on a problem in business, AUC is not always very helpful for communicating with business. If you can boil your problem down to a classification problem, it may make life easier. Business colleagues appreciate a yes or no answer from the model, and providing the raw likelihood output can be counterproductive. They want a decision– that’s what they hired you for! Work with them to determine their goals and explain the model in those terms. You will have to guide business in many cases in their understanding, and explaining the model’s impact on the business is always a good start. However, they still would like to understand different degrees of confidence in the predictions. A good approach is to bucket your predictions and explain the results with metrics evaluated for each bucket. Precision, recall and lift are more applicable and easier to speak about in many situations.</p>
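<p>As a sketch on synthetic data (quintile buckets are an arbitrary choice here), one way to bucket raw model scores and report precision per bucket:</p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A made-up binary classification problem
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Bucket the raw likelihoods into quintiles and report precision per bucket
df = pd.DataFrame({"score": scores, "label": y_te})
df["bucket"] = pd.qcut(df["score"], q=5, labels=False)
precision_by_bucket = df.groupby("bucket")["label"].mean()
print(precision_by_bucket)
```

<p>“Bucket 4 is 95% likely to convert, bucket 0 almost never does” is a far easier conversation with business than a ROC curve.</p>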
<h3 id="sparsity-is-your-friend">Sparsity is Your Friend</h3>
<p>A sparse data set is one in which your data has few nonzero entries. The libsvm file format is very helpful and a widely supported way to represent data sparsely. I’ve been able to reduce TBs to MBs by representing my data sets properly in a sparse format. This can really help when trying to run parallel random forests in, say, sci-kit learn, which notoriously copies your data set (last time I checked) for each tree.</p>
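<p>A small sketch of the memory difference on a made-up matrix that is ~99% zeros, plus a round trip through the libsvm format using sci-kit learn’s helpers:</p>

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# A mostly-zero matrix, stored densely and then in CSR sparse format
rng = np.random.RandomState(0)
dense = rng.rand(1000, 1000)
dense[dense < 0.99] = 0  # roughly 99% zeros
X = sparse.csr_matrix(dense)
y = rng.randint(0, 2, size=1000)

sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(dense.nbytes, sparse_bytes)  # the sparse footprint is far smaller

# Round-trip through the libsvm text format
path = os.path.join(tempfile.gettempdir(), "example.svm")
dump_svmlight_file(X, y, path)
X2, y2 = load_svmlight_file(path, n_features=1000, zero_based=True)
print(X2.shape == X.shape)
```

<p>Most sci-kit learn estimators accept the CSR matrix directly, so you rarely need to densify at all.</p>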
<h3 id="treat-algorithm-selection-as-a-hyperparameter">Treat Algorithm Selection as a Hyperparameter</h3>
<p>The algorithm you choose is no different from the number of trees you choose for your forest. If you are testing multiple candidate models, they should all be included in the inner CV of your pipeline– otherwise you’re not testing part of your process.</p>
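<p>One way to implement this in sci-kit learn (a sketch on synthetic data) is to make the estimator step of a Pipeline itself a grid parameter, so the choice of algorithm is selected by the same inner CV as every other hyperparameter:</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, random_state=0)

# The 'clf' step is swapped out by the grid, just like any hyperparameter
pipe = Pipeline([("clf", LogisticRegression(max_iter=1000))])
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0]},
    {"clf": [RandomForestClassifier(random_state=0)],
     "clf__n_estimators": [50, 100]},
]
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(type(search.best_estimator_.named_steps["clf"]).__name__)
```

<p>Wrapping this search in an outer cross validation then tests the whole selection process, per the nested CV advice above.</p>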
<h3 id="a-winning-combo-more-data--algorithms-that-do-feature-engineering">A Winning Combo: More Data + Algorithms That Do Feature Engineering</h3>
<p>I do not think I will be able to come up with the same features a random forest or a neural network can under ideal circumstances. In my mind, I should provide the model with as much data as possible and let the mathematics behind the algorithm power and direct learning. Of course, this approach does not necessarily work with simpler models and in those cases feature engineering might be necessary. It would also be inappropriate if you want an explainable model.</p>
<h3 id="learn-your-regularization">Learn Your Regularization</h3>
<p>Regularization prevents overfitting to the data. Almost all basic models in mature packages automatically apply some form of regularization for you, but if you’re a savvy user, you may want more say in your regularization. Do you want fewer features for your linear model? Use L1 regularization (Lasso). Do you want to ameliorate collinearity for a linear model? Use L2 regularization (Ridge). What about a little of both? Use both (ElasticNet). Are you working with convolutional neural networks? Dropout and gradient clipping. Did you apply dropout to a large sparse data set of dense layers? Probably a bad idea… How do you regularize random forests? More trees, fewer features per split, smaller bootstrapped samples, shallower trees. Every model will have its quirks, but it’s important to understand what regularization is and how it can benefit your specific use case.</p>
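<p>The L1 versus L2 distinction is easy to see on a small synthetic regression problem (made up here, with only 5 of 50 features informative) by inspecting the fitted coefficients:</p>

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Regression problem where only a few of the 50 features matter
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)                     # L1
ridge = Ridge(alpha=1.0).fit(X, y)                     # L2
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # L1 + L2

# Count of exactly-zero coefficients under each penalty
print((lasso.coef_ == 0).sum(), (ridge.coef_ == 0).sum())
```

<p>L1 drives the uninformative coefficients to exactly zero (automatic feature selection), while L2 only shrinks them, which is why Ridge is the better fit for taming collinearity rather than pruning features.</p>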
<h3 id="data-integrity-is-key-to-performance">Data Integrity is Key To Performance</h3>
<p>When your model is not performing, or is performing strangely, try to diagnose the problem through the data. Look for leakage or mistakes in the ETL. It’s also possible that certain features aren’t scaled properly if the algorithm is not scale invariant.</p>
<h3 id="keep-it-simple-stupid">Keep It Simple (Stupid)</h3>
<p>My old English teacher used to say this a lot, and it’s true in data science as well. A nice scale invariant model like a random forest typically does well in many instances. You don’t usually need to play with the default hyperparameter settings. With the right data (and the right amount!), most implementations will do very well with their default settings. Outside of Kaggle, you rarely need to use parameter search to have a decent model. If a model isn’t working, it’s sometimes better to try an alternative algorithm rather than diving into parameter searches. There’s no free lunch, after all. However, you should always try a baseline linear model before you break out too much machinery.</p>
<h3 id="understanding-runtimes">Understanding Runtimes</h3>
<p>It’s helpful to have a basic understanding of how models work so you can estimate how long one may run. Can a model be parallelized? How does it scale with samples? With more features? Random forests can become annoyingly slow with lots of features that take many values, as most CART implementations will check every split on every feature sampled for a tree. This can be sped up, for example, by using “Extremely Random Trees”. Stochastic gradient descent algorithms get better with more epochs and can take a long time to converge, but they sometimes converge to a decent model well before your random forest could finish training, giving you a quick feasibility check. This can be very helpful when you’re training models on your laptop. A good practice is to have a basic understanding of how the major algorithms scale with features and with samples.</p>