Covariate Shift, or Why Prediction Quality Can Degrade in Production and How to Fix It

In many classification problems, time is not a feature (literally). Case in point: healthcare. Say you want to predict whether a person has a deadly disease. You probably have some historical data that you plan to use to train a model and then deploy into production. Great, but what happens if the distribution of people you predict on changes by the time your model starts (mis)flagging people? This covariate shift can be a real issue. In healthcare, the Affordable Care Act completely reshaped the landscape of the insured population, so there were certainly models in production facing a healthier population than the one they were trained on. The question is: is there anything you can do during training to correct for covariate shift? The answer is yes.
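
One common correction, and a taste of the post's theme, is importance weighting: train a classifier to distinguish training rows from production rows, convert its probabilities into density-ratio estimates, and pass those as sample weights when fitting the real model. The sketch below uses synthetic data and arbitrary model choices; it illustrates the general idea rather than the post's exact recipe.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: labeled historical data, plus unlabeled production
# data whose covariates have drifted.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 5))
y_train = (X_train[:, 0] + rng.normal(0, 0.5, size=1000) > 0).astype(int)
X_prod = rng.normal(0.5, 1.0, size=(1000, 5))  # shifted population

# Step 1: a "domain" classifier distinguishes training rows (label 0)
# from production rows (label 1).
X_both = np.vstack([X_train, X_prod])
domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])
domain_clf = LogisticRegression().fit(X_both, domain)

# Step 2: its probabilities give density-ratio estimates on the training set,
# w(x) ~ p_prod(x) / p_train(x) = P(prod | x) / P(train | x).
p = np.clip(domain_clf.predict_proba(X_train)[:, 1], 1e-3, 1 - 1e-3)
weights = p / (1.0 - p)

# Step 3: fit the actual model with those importance weights, so training
# rows that look like production data count for more.
model = GradientBoostingClassifier().fit(X_train, y_train, sample_weight=weights)
```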

Read More

An Annotated Proof of Generative Adversarial Networks with Implementation Notes

The preferred illustration of Generative Adversarial Networks (which by now seems as entrenched as Alice, Bob, and Eve are in cryptography) is the scenario of counterfeiting money. The Generator counterfeits money, and the Discriminator is supposed to discriminate between real and fake dollars. As the Generator gets better, the Discriminator has to improve as well. This implies a dual training scheme in which each model tries to one-up the other through additional learning. In this post, we will explore the proof from the original paper and demonstrate a typical implementation.
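
To make that dual training scheme concrete before the proof, here is a minimal Keras sketch on toy 2-D data, following the classic two-model recipe (where trainability is captured at compile time, as in tf.keras 2.x). The layer sizes, optimizers, and the toy "real dollars" distribution are assumptions for illustration, not the paper's architecture.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

latent_dim, data_dim = 32, 2  # toy setup: 2-D "dollars"

# Discriminator: outputs P(sample is real).
discriminator = keras.Sequential([
    keras.Input(shape=(data_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Generator: maps latent noise to fake samples.
generator = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(data_dim),
])

# Combined model for the generator's step. Freezing the discriminator before
# compiling means this model's updates only touch generator weights.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

rng = np.random.default_rng(0)
real_data = rng.normal(3.0, 0.5, size=(1000, data_dim))  # the "real dollars"
batch = 64

for step in range(1000):
    # Discriminator step: real samples labeled 1, counterfeits labeled 0.
    noise = rng.normal(size=(batch, latent_dim)).astype("float32")
    fake = generator.predict(noise, verbose=0)
    real = real_data[rng.integers(0, len(real_data), size=batch)]
    discriminator.train_on_batch(real, np.ones((batch, 1)))
    discriminator.train_on_batch(fake, np.zeros((batch, 1)))

    # Generator step: push the discriminator's output on fakes toward 1.
    noise = rng.normal(size=(batch, latent_dim)).astype("float32")
    gan.train_on_batch(noise, np.ones((batch, 1)))
```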

Read More

A Tour of Gotchas When Implementing Deep Q Networks with Keras and OpenAI Gym

Starting with the Google DeepMind paper, there has been a lot of renewed attention around training models to play video games. You, the data scientist/engineer/enthusiast, may not work in reinforcement learning, but you are probably interested in teaching neural networks to play video games. Who isn't? With that in mind, here's a list of nuances that should jumpstart your own implementation.
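
As a preview of the flavor of those gotchas, here is a hedged sketch of the Q-learning update with a few classic pitfalls annotated in the comments. The CartPole-style dimensions and the train_on_batch helper are illustrative assumptions, not the post's code.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical setup for a Gym-style environment such as CartPole-v1:
# 4-dimensional observations, 2 discrete actions.
n_obs, n_actions, gamma = 4, 2, 0.99

def build_q_net():
    # One output per action, so a single forward pass scores every action.
    return keras.Sequential([
        keras.Input(shape=(n_obs,)),
        layers.Dense(24, activation="relu"),
        layers.Dense(n_actions, activation="linear"),  # gotcha: no softmax here
    ])

q_net = build_q_net()
q_net.compile(optimizer="adam", loss="mse")
target_net = build_q_net()            # gotcha: separate, slowly-updated target net
target_net.set_weights(q_net.get_weights())

def train_on_batch(states, actions, rewards, next_states, dones):
    # Bootstrap targets from the frozen target network, not the live one.
    next_q = target_net.predict(next_states, verbose=0).max(axis=1)
    targets = rewards + gamma * next_q * (1.0 - dones)  # gotcha: zero out terminal states
    # Gotcha: only update the Q-value of the action actually taken; leave the
    # other outputs at their current predictions so their error is zero.
    q_values = q_net.predict(states, verbose=0)
    q_values[np.arange(len(actions)), actions] = targets
    q_net.train_on_batch(states, q_values)
```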

Read More

Dealing with Trends: Combining a Random Walk with a Tree-Based Model to Predict Time Series Data

A standard assumption underlying most machine learning models is that the model will be applied to the same population during training, testing, and production. This poses an interesting issue with time series data, since the underlying process can change over time, causing the production population to look different from the original training data. In this post, we will explore the concept of a data model, a theoretical tool from statistics, and in particular the idea of a random walk, to handle this situation and improve your modeling toolkit.
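
As a rough sketch of one way to combine the two ideas (the post's actual recipe may differ): let the random walk carry the level of the series, and let a tree-based model predict the next increment from lagged increments, since trees cannot extrapolate levels outside their training range. The synthetic drifting series and the lag count below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic trending series standing in for real data: a random walk with drift.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.1, 1.0, size=500))

# Instead of predicting the level y[t], which drifts out of the training
# range, predict the step y[t] - y[t-1] from the previous n_lags steps.
diffs = np.diff(y)
n_lags = 5
X = np.column_stack([diffs[i:len(diffs) - n_lags + i] for i in range(n_lags)])
target = diffs[n_lags:]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, target)

# One-step forecast: the random-walk baseline (last observed value)
# plus the tree's predicted increment.
last_steps = diffs[-n_lags:].reshape(1, -1)
forecast = y[-1] + model.predict(last_steps)[0]
```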

Read More