Combining Thompson Sampling with Neural Networks for Classification Tasks

In the last post, we looked at an iterative online update to a logistic regression model. That derivation requires the derivative with respect to the model's weights, which is intractable for more complicated models. In practice, one wants to use neural networks for prediction tasks in ads or recommendation systems. The question becomes: what is the right technique for performing Thompson Sampling with a neural network to enable exploration?

Read More

Online Calculation of the Mean and Variance of Logistic Regression Weights for Thompson Sampling

For non-stationary distributions, it may be beneficial to learn (at least some) parameters of your models online. A common approach is to learn a logistic regression model for binary outcomes, a technique that appears as early as 2008 at Yahoo! News. There are several approaches to deriving the distribution of the model weights online, which additionally enables one to perform exploration via Thompson Sampling. Crucially, whatever method is used must estimate both the mean of the model parameters and their covariance.
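To make the idea concrete, here is a minimal sketch of online Bayesian logistic regression with Thompson Sampling, using a diagonal Gaussian approximation to the weight posterior in the spirit of Chapelle and Li's Laplace-style update. The class and variable names are illustrative, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

class OnlineLogisticTS:
    """Diagonal-Gaussian posterior over logistic regression weights.
    A sketch only: the real scheme may use a full covariance or an
    inner optimization loop per batch."""

    def __init__(self, dim, prior_precision=1.0):
        self.mean = np.zeros(dim)                       # posterior mean of weights
        self.precision = np.full(dim, prior_precision)  # diagonal posterior precision

    def sample_prob(self, x):
        # Thompson Sampling: act greedily w.r.t. a single posterior draw.
        w = rng.normal(self.mean, 1.0 / np.sqrt(self.precision))
        return 1.0 / (1.0 + np.exp(-x @ w))

    def update(self, x, y):
        # Laplace-style step: move the mean along the (precision-scaled)
        # gradient, then add the observed Fisher information to the
        # diagonal precision -- this is what tracks the covariance.
        p = 1.0 / (1.0 + np.exp(-x @ self.mean))
        self.mean -= (p - y) * x / self.precision
        self.precision += p * (1.0 - p) * x**2
```

Note that the precision (inverse variance) only grows here; handling non-stationarity typically requires decaying it so that old evidence is gradually forgotten.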

Read More

Predicting Watch Time like YouTube via Weighted Logistic Regression

In reading the YouTube paper, I came across one technical section that seemed very subtle, and it was not readily apparent why it was true: the watch-time estimation via weighted logistic regression. It is easy to gloss over this detail, but as it turns out, many people before me were curious about this section as well. The previous links have already explored one view of the why behind this formula, but I would like to formalize the process and explain it end to end.
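A small numerical demo of the effect in question (illustrative, not the paper's code): when positive examples are weighted by watch time and negatives are unweighted, an intercept-only weighted logistic regression has the closed-form odds sum(T_i) / N_neg, which approximates the expected watch time per impression when the click rate is small:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic impressions: ~2% are clicked, and clicked impressions
# carry an exponentially distributed watch time (parameters are mine).
n = 100_000
clicked = rng.random(n) < 0.02
watch_time = np.where(clicked, rng.exponential(scale=50.0, size=n), 0.0)

# Closed-form odds of the intercept-only weighted logistic regression:
# total positive weight over the number of (unweighted) negatives.
odds = watch_time[clicked].sum() / (~clicked).sum()

# True expected watch time per impression.
expected = watch_time.mean()
```

The two quantities differ only by a factor of n / (n - k), where k is the number of positives, so they nearly coincide at low click rates.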

Read More

Learning RecSys through Papers Vol III - Mixed Negative Sampling + Odds and Ends

In our previous posts, we walked through the sampled softmax and in-batch negatives as loss functions for training a retrieval model. In this post, we will explore the best of both worlds: so-called mixed negative sampling. This method will prove to be the most data-efficient and will outperform the previous two. The post contains a new implementation of all three methods, allowing them to be trained from (approximately) the same data loader and compared.
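As a rough sketch of the loss in question (names are mine, not from the post's implementation): each user's candidate set is the other in-batch items plus a pool of uniformly sampled extra items, and a softmax cross entropy treats the matching in-batch item as the positive. The logQ sampling correction is omitted here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_negative_sampling_loss(user_emb, item_emb, extra_item_emb):
    """Softmax loss over in-batch negatives plus uniformly sampled
    extras. user_emb, item_emb: (B, d) aligned positives;
    extra_item_emb: (M, d) sampled negatives. A sketch only."""
    # Candidate set: the B in-batch items plus M sampled items.
    candidates = np.concatenate([item_emb, extra_item_emb], axis=0)  # (B+M, d)
    logits = user_emb @ candidates.T                                 # (B, B+M)
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i is column i (its in-batch item).
    B = user_emb.shape[0]
    return -log_probs[np.arange(B), np.arange(B)].mean()

B, M, d = 4, 8, 16
loss = mixed_negative_sampling_loss(
    rng.normal(size=(B, d)), rng.normal(size=(B, d)), rng.normal(size=(M, d)))
```

Setting M = 0 recovers plain in-batch negatives, which is one reason this formulation is convenient to compare against the other two losses.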

Read More