A recommendation system is like a funnel: it starts from the largest pool of content available to a user and filters down until we have the right set of items to show. The paper includes a graphic that illustrates the process well:

Our goal is to find items among the (possibly) billions of candidates to present to the user in order to improve given business metrics. We do this by modeling the problem as predicting what the user will watch, with watched items as positive examples and unwatched items as negatives. In the next sections, we will detail the modeling decisions for various components and the data preparation.

For the uninitiated, recommendation systems are basically a big vector search if you squint at them. We represent users and items as vectors, and approximate lookup techniques let us scale the search to very large item sets. The first step in the pipeline is candidate generation, where a user vector is generated and used to look up candidate items via approximate nearest neighbor techniques (e.g., Hierarchical Navigable Small Worlds). In matrix factorization methods, each user has a vector that is learned during training. In neural recommenders, features like watch history are typically passed into neural nets to generate user embeddings.
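Conceptually, candidate generation is a maximum inner product search over item vectors. A brute-force NumPy sketch with toy random vectors (an ANN index such as HNSW replaces the exhaustive scan at billion-item scale):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 10_000, 64
item_vecs = rng.normal(size=(n_items, dim)).astype(np.float32)
user_vec = rng.normal(size=dim).astype(np.float32)

# Brute-force maximum inner product search; at scale, an approximate
# nearest neighbor index replaces this exact scan.
scores = item_vecs @ user_vec
top_k = np.argsort(scores)[::-1][:10]  # indices of the 10 best candidates
```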

In the YouTube Deep Nets paper, various user level features are fed into a feedforward network to generate a user vector which is then passed into a dense layer leading to a *big softmax*, where every item corresponds to an index in the softmax’s output.

For training, the dense layer feeding the softmax would become prohibitively large, so variations on a sampled softmax are typically used. Indeed, even the input embedding layers have similar issues, and more modern techniques use things like hash embeddings and other tricks to reduce the size of the embedding table (e.g., see TikTok describe its collisionless embedding table).
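The sampled softmax idea fits in a few lines: instead of scoring every item in the catalog, score only the positive plus a handful of sampled negatives. A minimal sketch with toy random vectors (uniform sampling here; real systems typically correct for the sampling distribution):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, k_neg = 64, 999

user_vec = torch.randn(1, dim)
# Instead of a dense layer over millions of items, score just the positive
# item plus k_neg sampled negatives (toy random item vectors here).
sampled_item_vecs = torch.randn(1 + k_neg, dim)
logits = user_vec @ sampled_item_vecs.T            # shape [1, 1000]
loss = F.cross_entropy(logits, torch.tensor([0]))  # index 0 = the positive
```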

In my implementation, I will use a *simplified* shared hash embedding to represent both the watched item IDs and the softmax output, combined with a sampled softmax. Sharing embeddings is a common technique in such systems and can improve generalization, as stated in the YouTube Deep Nets paper in Section 4.1:

Sharing embeddings is important for improving generalization, speeding up training and reducing memory requirements. The overwhelming majority of model parameters are in these high-cardinality embedding spaces - for example, one million IDs embedded in a 32 dimensional space have 7 times more parameters than fully connected layers 2048 units wide.

A user will be represented by a list of past watched item IDs. You will notice my implementation allows variable-length watch histories up to length `max_pad=30`. I do this by looping over the inputs, applying the embedding layer, and then averaging the embeddings. Treating items as a “bag of words” is a common approach in early recommender systems, especially before the advent of Transformers and efficient RNNs. The approach is still extremely effective when combined with side features, as in the YouTube Deep Nets paper and the Wide and Deep paper.
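The averaging step itself is tiny. A standalone sketch with a toy table size and toy IDs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(1000, 64)          # toy embedding table
history = torch.tensor([5, 42, 17])     # a variable-length watch history
# "Bag of items": embed each watched ID and average into one user vector
user_vec = embed(history).mean(dim=0)   # shape [64]
```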

My network architecture is closer to a two-tower architecture, which became more common in later papers such as this paper from YouTube in 2019. In that paper, they actually use in-batch items as their negatives but need to perform a correction to account for popular items showing up more often in the loss. I take this idea of the two-tower architecture with shared embeddings, remove the candidate tower, and simply pass the item embeddings through to dot with each input item. The main difference is that I do not use in-batch negatives but instead sample negatives at random per example. You can see the two-tower architecture from the paper below; imagine that we pass the video embeddings directly through the right-hand-side item tower to produce the softmax term:
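As an aside, the in-batch-negative setup with the popularity (logQ) correction mentioned above can be sketched as follows, with toy embeddings and assumed item sampling frequencies (not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dim = 4, 8
user_emb = torch.randn(batch, dim)
item_emb = torch.randn(batch, dim)              # one positive item per row
item_freq = torch.tensor([0.4, 0.3, 0.2, 0.1])  # assumed sampling probabilities

# Every other row's positive serves as a negative; subtracting log(freq)
# is the logQ correction so popular items aren't over-penalized.
logits = user_emb @ item_emb.T - torch.log(item_freq)
loss = F.cross_entropy(logits, torch.arange(batch))  # diagonal = positives
```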

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import hashlib
import pandas as pd
import numpy as np

EMBED_TABLE_SIZE = 1000
# Hash an arbitrary (salted) ID string into the small embedding table
hash_f = lambda x: int(hashlib.md5(x.encode('utf-8')).hexdigest(), 16) % EMBED_TABLE_SIZE
mps_device = torch.device("mps")
max_pad = 30
```

```
class DeepNetRecommender(nn.Module):
    def __init__(self, n_embed=3, table_size=EMBED_TABLE_SIZE, embed_dim=64):
        super(DeepNetRecommender, self).__init__()
        # Layers must be attributes of the module so PyTorch registers their parameters
        self.embed1 = nn.Embedding(table_size, embed_dim)
        self.embed2 = nn.Embedding(table_size, embed_dim)
        self.embed3 = nn.Embedding(table_size, embed_dim)
        # Seed so the salts are identical on every load
        np.random.seed(182)
        self.embeds = [(getattr(self, 'embed' + str(k)), np.random.choice(100000))
                       for k in range(1, 1 + n_embed)]  # (layer, salt)
        self.model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(),
                                   nn.Linear(64, 64), nn.ReLU(),
                                   nn.Linear(64, 64))

    def embed(self, x, max_pad):
        # Sum the embeddings produced by each salted hash of the item IDs
        o = None
        for embedder, salt in self.embeds:
            items = [hash_f(str(salt) + '_' + str(k)) for k in x][:max_pad]
            hashed_items = torch.IntTensor(items).to(mps_device)
            embedded = embedder(hashed_items)
            o = embedded if o is None else o + embedded
        return o

    def embed_stack(self, x):
        return torch.stack([self.embed(k, max_pad=len(k)) for k in x], 0)

    def _embed_transform(self, y, max_pad):
        # Embed a variable-length input and average down to a single [64] vector
        x = self.embed(y, max_pad)
        return torch.sum(x, 0) / x.shape[0]

    def embed_and_avg(self, data, max_pad):
        stacked = torch.stack([self._embed_transform(x, max_pad) for x in data], 0)
        return stacked.unsqueeze(1)

    def forward(self, x):
        inps = x['history']
        to_rank = x['to_rank']
        lhs = self.model(self.embed_and_avg(inps, max_pad))  # [B, 1, 64] user vectors
        rhs = self.embed_stack(to_rank)                      # [B, N, 64] item vectors
        output = torch.bmm(lhs, rhs.transpose(1, 2))         # [B, 1, N] dot products
        return output.squeeze(1)
```

As I mentioned before, we are using a sampled softmax approach. If your softmax output dimension is one-to-one with your item catalog, you can use functions built into PyTorch to sample indices. To create the sampled softmax, I reuse the simplified hash embedding by dotting the item vector with the user embedding; therefore, I have to explicitly pass in valid item IDs to ensure I am learning embeddings for the items I care about. In practice, it’s more common to keep the output item vectors independent of the inputs for these types of models, as that allows for easier sampling without knowing the identifiers of the items you are sampling.

```
model = DeepNetRecommender().to(mps_device)
with torch.no_grad():
    print(
        model({'history': [[1, 182, 1930]], 'to_rank': [[1, 8, 24, 1234, 53435, 1000000]]}).cpu().detach()
    )
```

```
tensor([[ 3.3182, -0.1267, -0.1815, -0.1327, -0.0200, -0.4532]])
```

There are two approaches to modeling the data presented in the paper. I will focus on the one I find works best in practice: “predicting the next/future watch.”

Here, you take all the information leading up to a watch as your inputs and predict the item that will be watched. Notice this approach only requires *positive* samples, as the softmax implies the other examples are the *negatives*. We will use the MovieLens dataset, sorted by time, and do a groupby on each user to get an ordered list per user. Then in our training loop, we will sample an index as our watched item and generate the appropriate history window (respecting our `max_pad` parameter). We will do all this in a custom `Dataset` class. There I will also sample item IDs and output them as the `label`, which is passed into the model during training to score each label ID. For the sampled softmax, we will take `1000` items in total to calculate the loss for each training instance.

```
d = pd.read_csv('data/ratings.csv')
```

```
n_movies = d.movieId.nunique()
all_movies = d.movieId.unique().tolist()
n_movies
```

```
83239
```

```
d['click'] = d['rating'] >= 3  # Ratings of 3 or higher count as a click/watch
dd = d[d['click'] > 0].groupby('userId', as_index=False).movieId.apply(list)  # assumes final list is in time order
dd.set_index('userId', inplace=True)
dd = dd[dd.movieId.apply(len) > max_pad + 2]  # minimum number of movies per user
```

```
from torch.utils.data import Dataset
class MovieLensDataset(Dataset):
    def __init__(self, user_level_data, random_negatives=999):
        self.d = user_level_data
        self.rn = random_negatives

    def __len__(self):
        return len(self.d)

    def __getitem__(self, idx):
        user_movies = self.d.iloc[idx]['movieId']
        label_idx = np.random.choice(range(1, len(user_movies)))
        label_movie_id = user_movies[label_idx]
        # Sample real movie IDs so we get reasonable embedding lookups, as it's not a true softmax
        to_rank = [label_movie_id] + np.random.choice(all_movies, self.rn).tolist()
        # Clamp the window start so we never slice with a negative index
        start_idx = max(label_idx - max_pad, 0)
        movies = user_movies[start_idx:label_idx]
        return movies, to_rank
```

This section is mostly boilerplate PyTorch code.

```
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('runs/experiment_1')
loss_fn = torch.nn.CrossEntropyLoss() # not equiv to Keras CCE https://discuss.pytorch.org/t/categorical-cross-entropy-loss-function-equivalent-in-pytorch/85165/11
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

dd['train'] = np.random.uniform(0, 1, len(dd)) < .9
dataset = MovieLensDataset(dd[dd['train']])
val_dataset = MovieLensDataset(dd[~dd['train']])

def my_collate(batch):
    # Keep histories and candidate lists as plain Python lists (variable lengths)
    data = [item[0] for item in batch]
    target = [item[1] for item in batch]
    return [data, target]

trainloader = torch.utils.data.DataLoader(dataset, batch_size=256, drop_last=True,
                                          shuffle=True, collate_fn=my_collate)
testloader = torch.utils.data.DataLoader(val_dataset, batch_size=256, drop_last=True,
                                         shuffle=True, collate_fn=my_collate)
```

```
def train_one_epoch(epoch_index, tb_writer):
    running_loss = 0.
    last_loss = 0.
    for i, data in enumerate(trainloader):
        inputs, labels = data
        optimizer.zero_grad()
        # Make predictions for this batch
        outputs = model({'history': inputs, 'to_rank': labels})
        # The positive item is always at index 0 of `to_rank`
        default_label = torch.zeros((outputs.shape[0]), dtype=torch.long).to(mps_device)
        loss = loss_fn(outputs, default_label)
        loss.backward()
        # Adjust learning weights
        optimizer.step()
        # Gather data and report
        running_loss += loss.item()
        if i % 100 == 99:
            last_loss = running_loss / 100  # loss per batch
            print('  batch {} loss: {}'.format(i + 1, last_loss))
            tb_x = epoch_index * len(trainloader) + i + 1
            tb_writer.add_scalar('Loss/train', last_loss, tb_x)
            running_loss = 0.
    return last_loss
```

```
fixed_validation_set = [vdata for vdata in testloader]
for epoch_number in range(1):
    model.train()
    avg_loss = train_one_epoch(epoch_number, writer)
    running_vloss = 0.0
    model.eval()
    with torch.no_grad():
        for i, vdata in enumerate(fixed_validation_set):
            vinputs, vlabels = vdata
            voutputs = model({'history': vinputs, 'to_rank': vlabels})
            default_label = torch.zeros((voutputs.shape[0]), dtype=torch.long).to(mps_device)
            vloss = loss_fn(voutputs, default_label)
            running_vloss += vloss
    avg_vloss = running_vloss / (i + 1)
    print('LOSS train {} valid {}'.format(avg_loss, avg_vloss))
    writer.add_scalars('Training vs. Validation Loss',
                       {'Training': avg_loss, 'Validation': avg_vloss},
                       epoch_number + 1)
    writer.flush()
    # Save the model with its hash salts in the filename so embeddings line up on reload
    torch.save(model, 'model_' + '_'.join([str(k[1]) for k in model.embeds]) + '.pt')
```

```
batch 100 loss: 6.049025082588196
batch 200 loss: 4.822506718635559
batch 300 loss: 4.121425771713257
batch 400 loss: 3.6819806337356566
batch 500 loss: 3.45309770822525
LOSS train 3.45309770822525 valid 3.265143871307373
```

```
movies = pd.read_csv('data/movies.csv')
movies.set_index('movieId',inplace=True)
```

```
def map_movies_list(x):
return [movies.loc[k,'title'] for k in x]
```

```
all_movies_scored = model({'history': [dd.iloc[0].movieId[:max_pad]], 'to_rank': [list(movies.index)]})
```

```
print('User Movies','\n', map_movies_list(dd.iloc[0].movieId[:max_pad]))
```

```
User Movies
['Toy Story (1995)', 'Braveheart (1995)', 'Casper (1995)', 'Star Wars: Episode IV - A New Hope (1977)', 'Forrest Gump (1994)', 'When a Man Loves a Woman (1994)', 'Pinocchio (1940)', 'Die Hard (1988)', 'Ghost and the Darkness, The (1996)', 'Shall We Dance (1937)', 'Star Wars: Episode V - The Empire Strikes Back (1980)', 'Aliens (1986)', 'Star Wars: Episode VI - Return of the Jedi (1983)', 'Alien (1979)', 'Indiana Jones and the Last Crusade (1989)', 'Star Trek IV: The Voyage Home (1986)', 'Sneakers (1992)', 'Shall We Dance? (Shall We Dansu?) (1996)', 'X-Files: Fight the Future, The (1998)', 'Out of Africa (1985)', 'Last Emperor, The (1987)', 'Saving Private Ryan (1998)', '101 Dalmatians (One Hundred and One Dalmatians) (1961)', 'Lord of the Rings, The (1978)', 'Elizabeth (1998)', 'Notting Hill (1999)', 'Sixth Sense, The (1999)', 'Christmas Story, A (1983)', "Boys Don't Cry (1999)", 'American Graffiti (1973)']
```

```
print('Predicted Top 20 Movies', '\n', movies.title.values[np.argsort(all_movies_scored.cpu().detach().numpy())[0][::-1]][:20])
```

```
Predicted Top 20 Movies
['Gladiator (2000)' 'Green Mile, The (1999)' 'American Pie (1999)'
'Fight Club (1999)' 'Memento (2000)' 'Being John Malkovich (1999)'
'Toy Story 2 (1999)'
'Lord of the Rings: The Fellowship of the Ring, The (2001)'
'Matrix, The (1999)' 'Princess Mononoke (Mononoke-hime) (1997)'
'X-Men (2000)' 'American Beauty (1999)' 'Shrek (2001)'
'Requiem for a Dream (2000)' 'Cast Away (2000)'
'Ghostbusters (a.k.a. Ghost Busters) (1984)'
'Who Framed Roger Rabbit? (1988)' 'Airplane! (1980)'
'Crouching Tiger, Hidden Dragon (Wo hu cang long) (2000)'
'Sixth Sense, The (1999)']
```

In our model, we only predict whether the user will watch an item. You may want to predict various outcomes instead: a click, a save, a buy, etc. One approach is modeling outcomes as prediction tasks, as demonstrated by Pinterest in their ItemSage paper. There, they extend the basic model by adding an input term for the outcome being predicted and embed it for each example. They then leverage both in-batch negatives and random negatives to train the model end-to-end. The in-batch negatives help the model learn to differentiate among popular items, while the random negatives help push down (i.e., rank lower) long-tail irrelevant items during inference.

The paper discusses implementing multiple tasks in Section 3.5:

We implement multi-task learning by combining positive labels from all the different tasks into the same training batch (Figure 4). This technique is effective even when the query entities come from different modalities, the only difference being that the query embedding needs to be inferred with a different pretrained model.
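A minimal sketch of this multi-task flavor, with a hypothetical task vocabulary and layer sizes (not Pinterest's actual architecture): embed the task/outcome and concatenate it to the query-side input so one model produces task-conditioned query embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
TASKS = {'click': 0, 'save': 1, 'buy': 2}  # hypothetical task vocabulary
task_embed = nn.Embedding(len(TASKS), 16)
user_tower = nn.Linear(64 + 16, 32)        # user features + task embedding

user_feats = torch.randn(2, 64)
task_ids = torch.tensor([TASKS['click'], TASKS['buy']])
# Concatenate the task embedding to the query-side input; the same
# tower then yields a different query vector per task.
query = user_tower(torch.cat([user_feats, task_embed(task_ids)], dim=1))
```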

Minmin Chen, the author of Top-K Off-Policy Correction for a REINFORCE Recommender System, said in her talk on the topic that the model from the paper was the single biggest launch in terms of metric improvements at YouTube that year. The recommender system is modeled as a reinforcement learning problem, and they leverage techniques from the off-policy learning literature to correct for the “behavior policy” (i.e., their production system) that generated, and biased, their data. Put another way, they use Inverse Propensity Scoring (IPS) in a clever way to correct for biases in their training data, allowing them to properly optimize their desired loss on their data distribution. In their approach, they learn the behavior policy as a separate task during training:

One interesting note from the above: concatenating a `label context` to the final user state from the RNN is another way to implement a flavor of the multi-task approach from the previous section.
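The core IPS idea can be sketched in a few lines, with toy probabilities (the paper's actual estimator adds a top-K correction and learns the behavior policy jointly):

```python
import torch

torch.manual_seed(0)
rewards = torch.tensor([1., 0., 1., 1., 0.])  # logged feedback per action
pi_probs = torch.rand(5).clamp(min=1e-3)      # target policy prob of logged action
beta_probs = torch.rand(5).clamp(min=1e-3)    # learned behavior-policy prob

# The importance weight pi/beta re-weights logged rewards so the average
# estimates the target policy's value; clipping keeps the variance in check.
weights = (pi_probs / beta_probs).clamp(max=10.0)
corrected_objective = (weights * rewards).mean()
```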

Another avenue for improvement is the candidate generation approach itself. Knowledge graphs embed relational information among items (called entities) and can give more contextual recommendations. For example, if you watched a movie with a particular actor, the movie node could be connected to an actor node, which is connected in turn to that actor's other movies. An important paper in this space is Knowledge Graph Convolutional Networks for Recommender Systems, which embeds the knowledge graph into a vector space, allowing the same dot-product retrieval techniques we discussed above.

In my last post, we discussed the linearized form of the quotient \(\frac{\bar{V}}{\bar{M}} \approx \frac{\bar{V}}{\mu_{M}} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M}.\) We want to show how to use this formula to derive a power calculation when using the delta method. This approach is useful for ratio metrics and many session-, page-, and other non-user-level metrics like the click-through rate.

To simulate the problem, we create users by sampling a user-level click-through rate (CTR) from a normal distribution with mean equal to a given population CTR and a set standard error. For users in the treatment group, we add a small effect to their user-level CTR. We also assign the number of sessions in which each user will appear in the experiment if selected. From there, we sample a set number of users, calculate the clicks and page views per user, and return that as a dataframe.

```
import numpy as np
import pandas as pd
from scipy.special import ndtr

POP_CTR = .65
N_USERS = 50000  # total pool of users
SESSIONS_PER_USER = 8
EFFECT = .01
STE = .003

treatment = np.random.choice([0, 1], p=[.5, .5], size=N_USERS)
user_rates = np.random.normal(POP_CTR + treatment * EFFECT, STE, N_USERS)
user_session_rates = np.random.normal(SESSIONS_PER_USER, 2, N_USERS)
user_rates[user_rates < 0] = 0
user_rates[user_rates > 1] = 1

def generate_exp_data(n_users):
    clicks = []
    users = []
    treatments = []
    users_in_exp = np.random.choice(N_USERS, n_users)  # users sampled into the experiment
    for j in range(n_users):
        k = users_in_exp[j]
        treatment_data = treatment[k]
        user_session_rate = int(np.maximum(1, user_session_rates[k]))
        user_sessions = np.random.randint(np.maximum(user_session_rate - 2, 1), user_session_rate + 2)
        click_data = np.random.binomial(1, user_rates[k], size=user_sessions).tolist()
        clicks.extend(click_data)
        treatments.extend([treatment_data] * user_sessions)
        users.extend([k] * user_sessions)
    d = pd.DataFrame({'click': clicks, 'user': users, 'treatment': treatments})
    # Linearize for the delta method
    d['clicks_user'] = d.groupby(['treatment']).click.transform('sum') / d.groupby(['treatment']).user.transform('nunique')
    d['sessions_user'] = d.groupby(['treatment']).click.transform('count') / d.groupby(['treatment']).user.transform('nunique')
    d['session'] = 1
    df = d.groupby(['user', 'treatment'], as_index=False).agg({
        'click': 'sum',
        'session': 'sum',
        'clicks_user': 'max',
        'sessions_user': 'max'
    })
    # Construct the linearized term for the delta method, which is at the user level!
    df['linear_ctr'] = (1 / df.sessions_user) * df.click - (df.clicks_user / df.sessions_user**2) * df.session
    return df

def calculate_p_value(df):
    # Delta method! Uses the linearized clicks / sessions term
    diff = df[df.treatment == 1].click.sum() / df[df.treatment == 1].session.sum() \
        - df[df.treatment == 0].click.sum() / df[df.treatment == 0].session.sum()
    var_a = df[df.treatment == 0].linear_ctr.var() / df[df.treatment == 0].user.nunique()
    var_b = df[df.treatment == 1].linear_ctr.var() / df[df.treatment == 1].user.nunique()
    ste = np.sqrt(var_b + var_a)
    p = ndtr(diff / ste)
    return diff, ste, 1 - p if p > .5 else p
```

To avoid too much complication, we will use Lehr’s rule (see the wiki), where the necessary sample size is

\[\begin{equation} n \approx 16 \frac{\sigma^2}{\delta^2}. \end{equation}\]It may not be obvious, but you can actually linearize your variable of interest, calculate its standard deviation, and plug it into the power formula! Let’s do a simulation to convince ourselves. We will calculate the required sample size using the formula for \(\alpha=.05\) and 80% power, and then we will simulate to estimate the achieved power.

```
np.random.seed(777)
d = generate_exp_data(1000)  # using a smaller sample than we'd eventually need
sigma = d.linear_ctr.std()  # standard deviation of the linearized term
n = 16 * (sigma**2 / EFFECT**2)
n  # required sample size per treatment group
```

```
5289.1898124023655
```

Recall, the power of the test is equivalent to the following probability:

\[\begin{equation}P(\text{null hypothesis rejected} \mid \text{alternative hypothesis is true}).\end{equation}\]Above, we constructed a simulation where the alternative hypothesis is true (\(\delta>0\)), so calculating how often we reject the null hypothesis at \(p < \text{significance level}/2 = .05/2\), with the sample size derived above, estimates the statistical power.

```
ps = []
significance_level = .05
for k in range(1000):
    ps.append(calculate_p_value(generate_exp_data(int(2 * n))))  # build the p-value distribution
```

```
# power for 2 tailed test, p(H_0 rejected | H_1 is true)
np.mean([p[2] < significance_level/2 for p in ps])
```

```
0.809
```

Voila!

To calculate the CUPED estimate, namely \(\theta\), the authors parametrized a function \(Y_{cv}(\theta)\) where \(\mathbb{E}(Y_{cv}(\theta))=\mathbb{E}(Y)\). Then, they chose the \(\theta\) minimizing the variance: \(\theta = \arg\min_{\hat\theta} var(Y_{cv}(\hat\theta))\). Once you know the trick, it’s not that bad to generalize to more variables.

The first thing to do is allow multiple covariates. With a single covariate we had \(\theta\) and the covariate \(X\). The analog in the multiple-covariate case is a collection of \(\theta_i\) for covariates \(X_i\). In this setting, \(Y_{cv}\) is defined via \(\begin{equation} Y_{cv} = \bar{Y} - \sum_{i} \theta_i (\bar{X_i} - \mathbb{E}(X_i)) \end{equation}\) So, using rules about the variance of linear combinations, we have \(\begin{equation} var ( Y_{cv}) = \frac{1}{n} \left ( var(Y) + \sum_i \theta_i^2 var(X_i) + 2 \sum_{1 \leq i< j\leq m} \theta_i \theta_j cov(X_i, X_j) - \sum_i 2\theta_i cov(Y,X_i) \right ). \end{equation}\)

In the previous equation, we used the identity \(var(\bar{Y}) = var(Y)/n\) where \(n\) is the size of the sample. (We are glossing over many steps and definitions to get to the meat and not distract from the main point. For more details, see the CUPED paper’s definitions.)
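As a sanity check of the expansion above (dropping the common \(1/n\) factor and using an arbitrary \(\theta\) with toy Gaussian covariates), a quick NumPy verification:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100_000, 3
Xs = rng.normal(size=(m, n))            # m covariates as rows
Y = rng.normal(size=n) + Xs.sum(axis=0)
theta = np.array([0.5, -0.2, 0.3])      # arbitrary theta, not the optimum

# Variance of the controlled quantity computed directly...
lhs = np.var(Y - theta @ Xs, ddof=1)

# ...and via the expansion: var(Y) + theta' Sigma theta - 2 theta' cov(Y, X),
# where theta' Sigma theta collects the theta_i^2 var(X_i) terms and the
# 2 * sum_{i<j} theta_i theta_j cov(X_i, X_j) cross terms.
C = np.cov(np.vstack([Y, Xs]), ddof=1)
rhs = C[0, 0] + theta @ C[1:, 1:] @ theta - 2 * theta @ C[1:, 0]
```

The two quantities agree up to floating-point error, since both are the sample variance of the same linear combination.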

Now, consider \(g(\boldsymbol \theta):=var(Y_{cv})\) where \(\boldsymbol\theta:= (\theta_1,\dots, \theta_m)\).

Now, to find the minimizing \(\boldsymbol \theta\) of this quadratic, we take the (multivariate) derivative with respect to \(\boldsymbol \theta\) and find the critical points, which in this case will be a minimum.

\[\begin{equation}\frac{\partial g}{\partial \theta_i} = \frac{1}{n}\left( 2\theta_i var(X_i) + 2 \sum_{j\neq i} \theta_j cov(X_i, X_j) - 2 cov(Y, X_i) \right) \end{equation}\]We can write this another way if we consider the gradient vector \(\nabla g = (\frac{\partial g}{\partial \theta_1},\dots, \frac{\partial g}{\partial \theta_m})^T.\)

If we set \(\nabla g = 0\), we can divide out the common factors of \(2\) and \(1/n\) from each partial derivative, and we have

\[\begin{equation} \Sigma \boldsymbol \theta - Z = 0, \end{equation}\]where \(\Sigma\) is the covariance matrix of the \(X_i\) and \(Z\) is the vector with \(Z_i = cov(Y,X_i)\). Therefore, the minimum is achieved when we set

\[\begin{equation} \boldsymbol \theta := \Sigma^{-1}Z. \end{equation}\]Here, we will create a \(Y\) from a linear combination of covariates \(X_i\), i.e., \(Y=\sum_i a_i X_i\). We will then solve for \(\boldsymbol\theta\) using the formula above and find that the coefficients \(a_i\) and \(\boldsymbol \theta\) are equal.

```
import numpy as np
import pandas as pd
np.random.seed(8080)
size = 50000
num_X = 10
X_means = np.random.uniform(0, 1, num_X)
Xs = [np.random.normal(X_means[k], .01, size) for k in range(num_X)]
coefficients = np.random.uniform(0,1,num_X)
Y = np.sum([a*b for a,b in zip(coefficients, Xs)],axis=0) \
+ np.random.normal(0,.001,size)
```

```
# recover coefficients/theta via the formula, note n must be large
# for theta to be a good approximation of the coefficients
big_cov = np.cov([Y] + Xs) # Calculating sigma and z together
sigma = big_cov[1:,1:]
z = big_cov[1:,0].reshape((num_X,1))
theta = np.dot(np.linalg.inv(sigma),z) # here's the formula!
```

```
# Due to the noise term added to Y, this difference is small but not exactly 0
np.mean(theta.reshape((num_X,)) - np.array(coefficients))
```

```
-9.495897072566845e-05
```

Wait, why are they equal? There was a little gap in the logic above: I never said why those two values should coincide, only showed that they do in this case. It is due to the Gauss-Markov theorem, which states that under a few assumptions on the data distribution, the ordinary least squares (OLS) estimator is unbiased and has the smallest variance among linear unbiased estimators.

Let’s play that back to really grok it. I came up with a linear estimate of \(Y\), the linear combination \(Y_{cv}\) (so a linear estimator), and solved explicitly for the coefficients (\(\boldsymbol\theta\)) that minimize the variance. Since \(Y\) *is* already a linear combination, the OLS estimator recovers the coefficients of \(Y\) exactly. Gauss-Markov says the OLS solution is the *unique* “best linear unbiased estimator,” so solving for a variance-minimizing linear combination (\(Y_{cv}\)) by another route still yields the same coefficients.
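Gauss-Markov in action: a minimal check on toy data (hypothetical sizes and noise level) that plain least squares recovers the same generating coefficients the variance-minimizing \(\theta\) found above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50_000, 4
X = rng.normal(size=(n, m))
coef = rng.uniform(0, 1, m)            # the "true" a_i
Y = X @ coef + rng.normal(0, .001, n)  # Y is (almost) a linear combination

# OLS via least squares recovers the generating coefficients,
# matching the theta = Sigma^{-1} Z solution up to noise.
ols_coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
```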

In the CUPED paper, the authors state that one covariate they found predictive was the first day the user entered the experiment. This is not a pre-experiment variable but it is independent of the treatment. We will craft a simulation that leverages this idea to create covariates to test for our formula above.

In our simulation, we will simulate a user’s query count during a test period under a control and a treatment. The treatment will make a user more likely to query. Depending on the day of the week on their first entrance into the experiment, a user will have a different additive component to their query propensity. This effect only impacts their visits on the first day.

```
# Using the simulation from a previous post
# with a few additions
import numpy as np
import pandas as pd
np.random.seed(1)

user_sample_mean = 8
user_standard_error = 3
users = 1000
# assign groups
treatment = np.random.choice([0, 1], users)
treatment_effect = 2
# identify query rate per user
user_query_means = np.random.normal(user_sample_mean, user_standard_error, users)

def run_session_experiment(user_means, users, user_assignment, treatment_effect):
    # create the query count per user
    queries_per_user = \
        treatment_effect * user_assignment[users] \
        + user_means[users] \
        + np.random.normal(0, 1, len(users))
    queries_per_user = queries_per_user.astype(int)
    queries_per_user[queries_per_user < 0] = 0
    return pd.DataFrame({'queries': queries_per_user, 'user': users,
                         'treatment': user_assignment[users]})
```

```
# Generate pre-experiment data for each user once, i.e. over some period
pre_data=run_session_experiment(user_query_means, range(users),
treatment, 0)
pre_data.columns = ['pre_' + k if k != 'user' else k for k in pre_data.columns]
pre_data = pre_data[['pre_queries','user']]
```

```
# Generate experiment data
day_impact = np.random.uniform(-3, 6, 7)
dfs = []
for k in range(14):
    # select the users for that day; each user has a 2/14 chance of appearing
    day_users = np.random.choice([0, 1], p=[12/14, 2/14], size=users)
    available_users = np.where(day_users == 1)[0]
    day_user_query_means = user_query_means + day_impact[k % 7]
    df = run_session_experiment(day_user_query_means, available_users, treatment,
                                treatment_effect)
    df['first_day'] = k % 7
    df['real_day'] = k
    dfs.append(df)
```

```
# We are doing a user-level analysis with the randomization unit
# equal to the user as well. This means groupby's should be on user!
df = pd.concat(dfs)

def get_first_day(x):
    # Get the "first_day" value corresponding to the user's actual first day
    t = np.argmin(x['real_day'].values)
    return x.iloc[t]['first_day']

dd = df.groupby(['user', 'treatment']).agg({'queries': 'sum'})
dd['first_day'] = df.groupby(['user', 'treatment']).apply(get_first_day)
dd.reset_index(inplace=True)  # pandas is really ugly sometimes
# combine data; notice each row is a unique user
data = dd.merge(pre_data, on='user')
```

```
# Calculate theta
covariates=pd.get_dummies(data['first_day'],drop_first=False)
covariates.columns = ['day_'+str(k) for k in covariates.columns]
covariates['pre_queries'] = data['pre_queries']
all_data = np.hstack((data[['queries']].values,covariates.values))
big_cov = np.cov(all_data.T) # 9x9 matrix
sigma = big_cov[1:,1:] # 8x8
z = big_cov[1:,0] # 8x1
theta = np.dot(np.linalg.inv(sigma),z)
```

```
# Construct the CUPED estimate
Y = data['queries'].values.astype(float)
covariates = covariates.astype(float)
Y_cv = Y.copy()
for k in range(covariates.shape[1]):
    Y_cv -= theta[k] * (covariates.values[:, k] - covariates.values[:, k].mean())
```

```
# These are standard errors of the mean, despite the "var" names
real_var, reduced_var = np.sqrt(Y.var() / len(Y)), \
    np.sqrt(Y_cv.var() / len(Y))
```

```
reduced_var / real_var  # ratio of standard errors, i.e., the reduction achieved
```

```
0.8368532147135743
```

```
# Let's try OLS!
from statsmodels.formula.api import ols
results= ols('queries ~ pre_queries + C(first_day) + treatment', data).fit()
```

```
# Some calculations for the final table
effect = Y[data.treatment == 1].mean() - Y[data.treatment == 0].mean()
# Note: a rough standard error; a more careful version would use
# per-group sizes and add variances before taking the square root
ste = Y[data.treatment == 1].std() / np.sqrt(len(Y)) + Y[data.treatment == 0].std() / np.sqrt(len(Y))
cuped_effect = Y_cv[data.treatment == 1].mean() - Y_cv[data.treatment == 0].mean()
cuped_ste = Y_cv[data.treatment == 1].std() / np.sqrt(len(Y)) + Y_cv[data.treatment == 0].std() / np.sqrt(len(Y))
```

```
pd.DataFrame(index=['t-test', 'CUPED', 'OLS'],data=
{
"effect size estimate": [effect, cuped_effect, results.params['treatment']],
"standard error": [ste, cuped_ste, results.bse['treatment']]
}
)
```

| | effect size estimate | standard error |
|---|---|---|
| t-test | 5.014989 | 1.048969 |
| CUPED | 4.458988 | 0.876270 |
| OLS | 4.486412 | 0.884752 |

If you have a non-user-level metric, like the click-through rate, the analysis from the first section is mostly unchanged, but when you define \(Y_{cv}\) you must account for those terms differently. Following Appendix B from the CUPED paper, let \(V_{i,+}\) be the sum of observations of statistic \(V\) for user \(i\), \(n\) the number of users, and \(\bar{V} = \frac{1}{n} \sum_i V_{i,+}\). Let \(M_{i,+}\) be the number of visits for user \(i\), and let \(X\) be another user-level metric we will use for the correction. Then we have the equation \(\begin{equation}Y_{cv} = \bar{Y} - \theta_1 \left( \frac{\sum_{i} V_{i,+}}{\sum_{i} M_{i,+}} - \mathbb{E}\left[\frac{\sum_{i} V_{i,+}}{\sum_{i} M_{i,+}}\right] \right ) - \theta_2 ( \bar{X} - \mathbb{E}[X]).\end{equation}\) In this case, we have a page-level metric \(V/M\) being used as a covariate for a user-level metric \(Y\). Is this realistic? Maybe, but at least it will serve to illustrate what to do here.

When you write out the formula for the variance, you will have a term of the form

\[\begin{equation} cov \left (\frac{\sum_{i} V_{i,+}}{\sum_{i} M_{i,+}} , \bar{X}\right ), \end{equation}\]where you need to apply the delta method to compute this covariance. We can write

\(\begin{equation}cov \left ( \frac{\sum_i V_{i,+}}{\sum_i M_{i,+}}, \bar{X} \right ) \approx cov\left (\frac{1}{\mu_{M}}\bar{V} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M} , \bar{X}\right).\end{equation}\) This term can be simplified via properties of covariance. In particular, \(\begin{equation} cov\left (\frac{1}{\mu_{M}}\bar{V} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M} , \bar{X}\right)= \frac{1}{\mu_M}cov(\bar{V},\bar{X}) - \frac{\mu_V}{\mu_{M}^2 }cov(\bar{M},\bar{X}). \end{equation}\) At this point, you are able to calculate all the necessary terms and can sub this value into the \(\partial g / \partial \theta_i\) equation.

Using the previous presentation of the delta method, we can actually make our lives easier by replacing the vector \(\bar{V}/\bar{M}\) with the delta estimate \(\frac{1}{\mu_{M}}\bar{V} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M}\) and then calculating the covariance, rather than applying any of the complicated delta formulae.

All that’s left is to convince you of this. First, I’ll take the formula for the delta method from my previous post and show it is equivalent to taking the variance of the vector \(\frac{1}{\mu_{M}}\bar{V} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M}\).

```
# Define our data
V = np.random.normal(user_sample_mean, user_standard_error, users)
M = np.random.normal(user_sample_mean*2, user_standard_error, users)
X = np.random.normal(0,1,users)
mu_V, mu_M = V.mean(), M.mean()
```

```
def _delta_method(clicks, views):
    # Following Kohavi et al., clicks and views are at the per-user level
    # Clicks and views are aggregated to the user level and lined up by user,
    # i.e., row 1 = user 1 for both X and Y
    K = len(clicks)
    X = clicks
    Y = views
    # sample means
    X_bar = X.mean()
    Y_bar = Y.mean()
    # sample variances
    X_var = X.var(ddof=1)
    Y_var = Y.var(ddof=1)
    cov = np.cov(X, Y, ddof=1)[0, 1]  # cov(X_bar, Y_bar) = 1/n * cov(X, Y)
    # based on Deng et al.
    return (1/K)*(1/Y_bar**2)*(X_var + Y_var*(X_bar/Y_bar)**2 - 2*(X_bar/Y_bar)*cov)
_delta_method(V, M)
```

```
4.452882421409211e-05
```

```
# Take the variance of the Taylor expansion of V/M
(1/users)*np.var((1/mu_M)*V-(mu_V/mu_M**2)*M, ddof=1)
```

```
4.452882421409211e-05
```

How do we use this insight to make our \(\sigma\) calculation easier? Simply replace any vector of the metric \(V/M\) with the linearized formula \(\frac{\bar{V}}{\mu_{M}} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M}\) and take the covariance as usual.

```
# Formula for the covariance from the previous section
delta_var = (1/users)*((1/mu_M)*np.cov(V,X)[0,1] - (mu_V/mu_M**2)*np.cov(M,X)[0,1])
delta_var
```

```
-5.78229317802757e-06
```

```
# Sub in the value for each vector before taking the covariance
pre_cov_est = (1/users)*np.cov((1/mu_M)*V-(mu_V/mu_M**2)*M, X)[0,1]
pre_cov_est
```

```
-5.782293178027562e-06
```

In our case, \(V\) and \(M\) are correlated, so the equivalence I empirically demonstrated is not a sleight of hand that only works in the trivial uncorrelated case.

```
np.cov(V,M)
```

```
array([[9.10327978, 0.10338676],
[0.10338676, 9.29610525]])
```

I have personally implemented the delta method formula multiple times before realizing I could linearize before calculating the covariance, so I do not think this calculation is *obvious*. Upon rereading the original delta method paper, I found the same formula *very briefly* mentioned by Deng et al. in Section 2.2, after equation (3), where they define \(W_i\). In that section, they do not explicitly mention using the \(W_i\) to calculate the variance, which is what I do above. However, it turns out this is a common approach in the field for calculating the delta method, which is why it is not belabored in the papers: it is available in several books, including Art Owen’s online book in equation (2.29). Thanks to Deng (of Deng et al.) for sharing the reference.

In this post, we will empirically demonstrate the following things:

- The variance from the delta method equals the OLS variance with cluster robust standard errors.
- The CUPED variance reduction can be achieved (and is equivalent to) using OLS (both one and two steps!)
- How to use the CUPED approach with multiple covariates via OLS.
- How to use CUPED when the randomization unit does not equal the unit of analysis (i.e., Appendix B of Deng et al.).

Implicitly, this post assumes you know that standard OLS is equivalent to a \(t\)-test when properly set up. If you are not sure of this, convince yourself by regressing \(Y\sim d\), where \(d\) is a binary indicator of being in the treatment and \(Y\) is our variable of interest.
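To convince yourself of that equivalence, here is a minimal sketch using only `numpy` (the effect size, noise level, and seed are made up for illustration): the pooled two-sample \(t\)-statistic and the classical OLS \(t\)-statistic on \(d\) coincide exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
d = rng.integers(0, 2, n)                 # binary treatment indicator
Y = 2.0 + 0.5 * d + rng.normal(0, 1, n)   # outcome with a made up effect of 0.5

# Pooled two-sample t-test (equal variances)
A, B = Y[d == 0], Y[d == 1]
nA, nB = len(A), len(B)
sp2 = ((nA - 1) * A.var(ddof=1) + (nB - 1) * B.var(ddof=1)) / (nA + nB - 2)
t_ttest = (B.mean() - A.mean()) / np.sqrt(sp2 * (1 / nA + 1 / nB))

# OLS of Y ~ 1 + d with classical (homoskedastic) standard errors
X = np.column_stack([np.ones(n), d])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta
sigma2 = resid @ resid / (n - 2)
se_d = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_ols = beta[1] / se_d

print(np.isclose(t_ttest, t_ols))  # True: the two t-statistics match
```

The OLS coefficient on \(d\) is exactly the difference in group means, and its classical standard error reduces to the pooled \(t\)-test denominator, which is why the two statistics agree.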

A basic \(t\)-test assumes your analysis unit (usually users) equals your randomization unit (also users, as treatments are sticky per user in most cases). The problem is that many of our metrics have mismatched units. A common example is the click through rate (CTR), commonly defined as

\[\begin{equation}\frac{clicks}{sessions},\end{equation}\]where sessions can be thought of as page views for this discussion. Right here, we have a mismatch in units as these are both at the session level and not the user level. We can fix this by writing each as a “per user” metric, by dividing top and bottom by the number of users:

\[\begin{equation}\frac{clicks}{sessions} = \frac{\frac{clicks}{N}}{\frac{sessions}{N}}=\frac{clicks\ per\ user}{sessions\ per\ user}.\end{equation}\]Simple enough: now our numerator and denominator have the same units. But when we try to calculate the variance of this term, using `np.var` will underestimate the variance. This underestimate leads to a higher false positive rate. In fact, if you perform simulated A/A tests, the p-values will not be uniformly distributed but will instead have a slight bump on the left hand side of your plot, near 0.

To avoid this and to regain a flat p-value distribution, the guidance in the A/B testing community is to use the delta method. The delta method uses a Taylor expansion (more or less) to find an approximation to the variance which is “close enough” when the number of users \(N\) becomes large. I’ll spare you the formula, but it’s in the paper linked.

However, if you come from a stats background, you may be thinking that you should be able to just use Ordinary Least Squares (OLS) with clustered standard errors. For a while, there was no official document linking these two techniques, the delta method from A/B testing and OLS with cluster robust standard errors (implemented via the sandwich estimator in the `statsmodels` package), until Deng et al. published a proof that they are equivalent. Interestingly, the idea of clustered standard errors was well understood in economics and other fields, but had not been connected to the A/B testing literature until 2021.

With that said, it’s one thing to understand the techniques individually and their connections at a high level and quite another to understand them in practice. I speak from experience because although I *knew* the theory behind the contents of this post, it took me much longer than I’d like to admit to write it down coherently with simulated examples. And that is the point of this post: to present empirical evidence of the aforementioned claims.

In each example, I will simulate the true variance (and know the effect size by definition) and then aim to recover the proper variance through various techniques. I will also comment on when things go awry to hopefully help others in the same position. Each example will illustrate how to use a \(t\)-test for the calculation, as that method is the most scalable.

As a refresher and to set notation, assume we have two groups \(A\) and \(B\) are associated with random variables \(X\) and \(Y\). We define their sample means as \(\bar{X}\) and \(\bar{Y}\) respectively, their variances \(\text{var}(X):=\sigma_X^2\) and \(\text{var}(Y):=\sigma_Y^2\), and their user counts as \(n_A\) and \(n_B\). The simplest formula for the \(t\)-score, which for large \(n\) is equivalent approximately to the \(z\)-score, is \(\begin{equation}\frac{\bar{X}-\bar{Y}}{ \sqrt{ \frac{\sigma_X^2}{n_A} + \frac{\sigma_Y^2}{n_B} }}.\end{equation}\)

Note, because the two groups are independent, \(X\) and \(Y\) are independent, and the denominator comes from the observation that \(\begin{equation}\text{var}(\bar{X}-\bar{Y}) = \text{var}(\bar{X}) + \text{var}(\bar{Y}) = \frac{\text{var}(X)}{n_A}+\frac{\text{var}(Y)}{n_B}.\end{equation}\)
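As a quick simulated sanity check of that identity (the means, variances, and group sizes are made up): the empirical variance of \(\bar{X}-\bar{Y}\) over many replications matches the formula.

```python
import numpy as np

rng = np.random.default_rng(2)
n_A, n_B, reps = 500, 400, 10_000
# Draw many replications of each group and take the per-replication sample means
xbars = rng.normal(1.0, 2.0, (reps, n_A)).mean(axis=1)
ybars = rng.normal(0.0, 3.0, (reps, n_B)).mean(axis=1)
empirical = (xbars - ybars).var()
formula = 2.0**2 / n_A + 3.0**2 / n_B  # var(X)/n_A + var(Y)/n_B
print(empirical, formula)  # the two agree closely
```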

```
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
```

```
np.random.seed(1)
user_sample_mean = .3
user_standard_error = .15
users = 50000
pre_n_views = 100000
n_views = 100000
# assign groups
treatment = np.random.choice([0,1], users)
treatment_effect = .25
# Identify individual user click percentage mean
user_click_means = np.random.normal(user_sample_mean, user_standard_error, users)

def run_session_experiment(user_click_means, n_views, user_assignment, treatment_effect):
    # determine the user for each session/view
    user_sessions = np.random.choice(users, n_views)
    # create a click rate per session
    click_percent_per_session = treatment_effect*user_assignment[user_sessions] + user_click_means[user_sessions] + np.random.normal(0, .01, n_views)
    # clip to [0, 1]
    click_percent_per_session[click_percent_per_session > 1] = 1
    click_percent_per_session[click_percent_per_session < 0] = 0
    clicks_observed = np.random.binomial(1, click_percent_per_session)
    return pd.DataFrame({'clicks': clicks_observed, 'user': user_sessions, 'treatment': user_assignment[user_sessions], 'views': np.ones_like(user_sessions)})
```

```
# Simulate for "true" standard error
diffs = []
for k in range(100):
    o = run_session_experiment(user_click_means, n_views, treatment, treatment_effect)
    gbd = o.groupby(['user','treatment'], as_index=False).agg({'clicks':'sum', 'views': 'sum'})
    A, B = gbd.treatment == 0, gbd.treatment == 1
    diffs.append((gbd.clicks[B].sum()/gbd.views[B].sum()) - (gbd.clicks[A].sum()/gbd.views[A].sum()))
print('effect', 'ste', np.array(diffs).mean(), np.array(diffs).std()) # standard error
```

```
effect ste 0.24817493916173397 0.003371032173547226
```

```
data = run_session_experiment(user_click_means, n_views, treatment, treatment_effect)
```

```
df = pd.DataFrame({'Y' : data.clicks, 'd' : data.treatment, 'g' : data.user})
o=ols('Y~d',df).fit(cov_type='cluster', cov_kwds={'groups': df['g'], 'use_correction':False})
o.params['d'],o.bse['d'] # see standard error for d
```

```
(0.24903284320201524, 0.003304827358515906)
```

```
def _delta_method(clicks, views):
    # Following Kohavi et al., clicks and views are at the per-user level
    # Clicks and views are aggregated to the user level and lined up by user,
    # i.e., row 1 = user 1 for both X and Y
    K = len(clicks)
    X = clicks
    Y = views
    # sample means
    X_bar = X.mean()
    Y_bar = Y.mean()
    # sample variances
    X_var = X.var(ddof=1)
    Y_var = Y.var(ddof=1)
    cov = np.cov(X, Y, ddof=1)[0, 1]  # cov(X_bar, Y_bar) = 1/n * cov(X, Y)
    # based on Deng et al.
    return (1/K)*(1/Y_bar**2)*(X_var + Y_var*(X_bar/Y_bar)**2 - 2*(X_bar/Y_bar)*cov)
```

```
# NOTE: the delta method is only a good approximation when the cluster sizes are roughly constant and the number of users is large
gbd = data.groupby(['user','treatment'], as_index=False).agg({'clicks':'sum', 'views': 'sum'})
A, B = gbd.treatment == 0, gbd.treatment == 1
np.sqrt(_delta_method(gbd.clicks[A], gbd.views[A]) + _delta_method(gbd.clicks[B], gbd.views[B]))
```

```
0.0033049040362566383
```

The delta method is an asymptotic result, so for small \(n\) the approximation may not hold. In fact, if you rerun this with \(n<1000\) users, the delta method is no longer a good approximation.
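To see the large-\(n\) behavior concretely, here is a sketch with a made up clicks/views generating process: the delta method variance of the ratio of means, computed from a single experiment, is compared against the variance observed across many simulated experiments.

```python
import numpy as np

rng = np.random.default_rng(5)

def delta_var_ratio(x, y):
    # delta method variance of mean(x)/mean(y), as in Deng et al.
    k = len(x)
    xb, yb = x.mean(), y.mean()
    c = np.cov(x, y, ddof=1)
    return (1 / k) * (1 / yb**2) * (
        c[0, 0] + c[1, 1] * (xb / yb) ** 2 - 2 * (xb / yb) * c[0, 1]
    )

n = 5_000
# Simulate many experiments to estimate the true variance of the ratio
ratios = []
for _ in range(1_000):
    views = rng.poisson(10, n) + 1.0               # made up views per user
    clicks = rng.binomial(views.astype(int), 0.3)  # made up clicks per user
    ratios.append(clicks.mean() / views.mean())
true_var = np.var(ratios)

# Delta method estimate from a single experiment
views = rng.poisson(10, n) + 1.0
clicks = rng.binomial(views.astype(int), 0.3)
delta_est = delta_var_ratio(clicks, views)
print(delta_est, true_var)  # close when n is large
```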

In particular, consider \(\hat Y_{cv}=Y-\theta (X-\mathbb{E}X)\) from the CUPED paper, where \(\theta = \text{cov}(X,Y)/\text{var}(X)\). Then, \(\theta\) and the proper variance reduction can be gained by constructing the OLS formula \(Y\sim X+d+c\) where \(X\) is your pre-experiment covariate, \(d\) is your treatment indicator, and \(c\) is a constant.

```
# Changing the counts for a better approximation
# Larger number of users negate the need for CUPED as each user only appears once or twice
users = 10000
#pre_n_views = 200000
n_views = 200000
```

```
# Simulate for "true" standard error; we know the exact value of the effect by design
diffs = []
for k in range(100):
    o = run_session_experiment(user_click_means, n_views, treatment, treatment_effect)
    gbd = o.groupby(['user','treatment'], as_index=False).agg({'clicks':'sum', 'views': 'sum'})
    A, B = gbd.treatment == 0, gbd.treatment == 1
    diffs.append(gbd.clicks[B].mean() - gbd.clicks[A].mean())
sim_ste = np.array(diffs).std() # standard error
```

```
# For a user metric, must do sum-- not mean-- as that considers session counts
# Generate pre-experiment data and experiment data, using same user click means and treatments, etc.
before_data = run_session_experiment(user_click_means, pre_n_views, treatment, 0)
after_data = run_session_experiment(user_click_means, n_views, treatment, treatment_effect)
# To apply the delta method or cuped, the calculations are at the user level, not sessions
bd_agg = before_data.groupby('user', as_index=False).agg({'clicks': 'sum', 'views' : 'sum'})
ad_agg = after_data.groupby('user', as_index=False).agg({'treatment': 'max', 'clicks': 'sum', 'views' : 'sum'})
#assert bd_agg.user.nunique() == ad_agg.user.nunique() # make sure all users sampled
bd_agg.columns = ['pre_' + col for col in bd_agg.columns]
d=ad_agg.join(bd_agg, on='user', how='left')
d['pre_clicks'] = d['pre_clicks'].fillna(d.pre_clicks.mean()) # users not seen in pre-period
cv = np.cov(d['clicks'], d['pre_clicks']) # clicks per user!
theta = cv[0,1]/cv[1,1]
y_hat = d.clicks - theta * (d.pre_clicks - d.pre_clicks.mean())
d['y_hat'] = y_hat
d['y_hat_div_n'] = y_hat
means = d.groupby('treatment').agg({'y_hat':'mean'})
variances = d.groupby('treatment').agg({'y_hat':'var', 'clicks':'count'})
effect=means.loc[1] - means.loc[0]
standard_error = np.sqrt(variances.loc[0, 'y_hat'] / variances.loc[0, 'clicks'] + variances.loc[1, 'y_hat'] / variances.loc[1, 'clicks'])
t_score = (means.loc[1] - means.loc[0]) / standard_error
```

```
# variance reduction
var_redux=np.corrcoef(d['clicks'], d['pre_clicks'])[0,1]**2
```

```
# Two step OLS, which indirectly estimates theta with OLS
s1 = ols('Y ~ X', pd.DataFrame({'Y': d.clicks, 'X': d.pre_clicks})).fit()
residuals = s1.resid # Y - theta*(X) - C, constant C handles the mean term
d['y_hat'] = residuals
# You can either do the t-test here or use OLS again
means = d.groupby('treatment').agg({'y_hat':'mean'})
variances = d.groupby('treatment').agg({'y_hat':'var', 'clicks':'count'})
effect2 = means.loc[1] - means.loc[0]
standard_error2 = np.sqrt(variances.loc[0, 'y_hat'] / variances.loc[0, 'clicks'] + variances.loc[1, 'y_hat'] / variances.loc[1, 'clicks'])
t_score2 = (means.loc[1] - means.loc[0]) / standard_error2
s2 = ols('resid ~ d', pd.DataFrame({'resid': d['y_hat'], 'd': d.treatment})).fit()
```

```
# Direct OLS w/ pre-term stuff, note a constant is added automatically with the formula API
s3 = ols('Y ~ d + X', pd.DataFrame({'Y': d.clicks, 'd': d.treatment, 'X': d.pre_clicks})).fit()
```

```
# Direct OLS, no pre-term
s4 = ols('Y ~ d ', pd.DataFrame({'Y': d.clicks, 'd': d.treatment, 'X': d.pre_clicks})).fit()
```

```
# t-test, no CUPED
means = d.groupby('treatment').agg({'clicks':'mean'})
variances = d.groupby('treatment').agg({'clicks':'var', 'y_hat':'count'})
effect_none=(means.loc[1] - means.loc[0]).iloc[0]
standard_error_none = np.sqrt(variances.loc[0, 'clicks'] / variances.loc[0, 'y_hat'] + variances.loc[1, 'clicks'] / variances.loc[1, 'y_hat'])
t_score_none = effect_none/standard_error_none
```

```
ground_truth_effect = (n_views/users)*treatment_effect
pd.DataFrame(
    {'method': ['Ground Truth', 'Vanilla t-test', 'Vanilla OLS, no CUPED', 'CUPED with theta formula -> t-test', 'OLS estimates residuals -> t-test', '2 Step OLS', 'OLS'],
     'estimated effect': [ground_truth_effect, effect_none, s4.params['d'], effect.iloc[0], effect2.iloc[0], s2.params['d'], s3.params['d']],
     'theta': [None, None, None, theta, s1.params['X'], s1.params['X'], s3.params['X']],
     't_score': [ground_truth_effect/sim_ste, t_score_none, s4.tvalues['d'], t_score.iloc[0], t_score2.iloc[0], s2.tvalues['d'], s3.tvalues['d']],
     'standard_error': [sim_ste, standard_error_none, s4.bse['d'], standard_error, standard_error2, s2.bse['d'], s3.bse['d']]}
)
```

| | method | estimated effect | theta | t_score | standard_error |
|---|---|---|---|---|---|
| 0 | Ground Truth | 5.000000 | NaN | 84.642781 | 0.059072 |
| 1 | Vanilla t-test | 4.915146 | NaN | 59.331897 | 0.082842 |
| 2 | Vanilla OLS, no CUPED | 4.915146 | NaN | 59.340633 | 0.082829 |
| 3 | CUPED with theta formula -> t-test | 4.927060 | 0.300307 | 60.330392 | 0.081668 |
| 4 | OLS estimates residuals -> t-test | 4.927060 | 0.300307 | 60.330392 | 0.081668 |
| 5 | 2 Step OLS | 4.927060 | 0.300307 | 60.339849 | 0.081655 |
| 6 | OLS | 4.927445 | 0.310026 | 60.340052 | 0.081661 |

```
# expected standard error due to the CUPED variance reduction
np.sqrt((1-var_redux)*standard_error_none**2) # close!
```

```
0.08202757811298232
```

This situation occurs for session or page level metrics, like the click through rate \(\left ( \frac{clicks}{sessions} \right )\), when the A/B split typically occurs on users (i.e., you hash the user ID, the treatment is sticky by user, etc.) yet the metric does not reference users at all. As seen in the first example, this is a common situation in which to use the delta method. However, it’s unclear at first glance how to do this with CUPED. In the paper by Deng et al., Appendix B sketches how to combine the delta method with CUPED.

However, it is complicated and could be confusing to implement. In this case, you don’t apply the delta method as in the typical \(t\)-test; rather, Deng et al. apply the delta method (i.e., Taylor’s theorem) to linearize the terms in the variance and covariance and then let \(n\to\infty\) to get an approximation. This is the same process used to derive the normal delta method, but this time they use it to get a formula in terms of several user-level metrics. So, at the end of the day, you don’t calculate the traditional delta method at all for the variances; rather, you use their new formula given in Appendix B, which I have recreated below.

```
def _extract_variables(d):
    class VariableHolder:
        '''
        Convenient container class to hold common calculations that I have to do several times
        off of data frames and return all those variables to the previous scope
        d must be aggregated to the iid level, i.e., user
        '''
        n = len(d)
        Y, N = d.clicks.values, d.views.values
        X, M = d.pre_clicks.values, d.pre_views.values
        sigma = np.cov([Y, N, X, M])  # 4x4 covariance matrix
        mu_Y, mu_N = Y.mean(), N.mean()
        mu_X, mu_M = X.mean(), M.mean()
        beta1 = np.array([1/mu_N, -mu_Y/mu_N**2, 0, 0]).T
        beta2 = np.array([0, 0, 1/mu_M, -mu_X/mu_M**2]).T
    return VariableHolder()

def calculate_theta(V):
    # assumes V, the VariableHolder instance, is passed in
    # formula from Deng et al.; the n's would cancel out
    theta = np.dot(V.beta1, np.matmul(V.sigma, V.beta2)) / np.dot(V.beta2, np.matmul(V.sigma, V.beta2))
    return theta
```

By properties of variance, \(var(\hat{Y}_{cv})= var(Y/N) + \theta^2 var(X/M)-2\theta cov(Y/N, X/M)\). Each term has to be calculated via the delta method and summed together… unless we can use OLS!

```
def var_y_hat(V, theta):
    # assumes V, the VariableHolder instance, is passed in
    # formula from Deng et al., using the delta method
    # to arrive at a new formula from Appendix B
    var_Y_div_N = np.dot(V.beta1, np.matmul(V.sigma, V.beta1))
    var_X_div_M = np.dot(V.beta2, np.matmul(V.sigma, V.beta2))
    cov = np.dot(V.beta1, np.matmul(V.sigma, V.beta2))
    # can also use the traditional delta method for the var_Y_div_N type terms
    return (var_Y_div_N + (theta**2)*var_X_div_M - 2*theta*cov)/V.n
```

```
def cuped_exp(user_click_means, user_assignment, treatment_effect, views=150000):
    # Generate pre-experiment data and experiment data, using the same user click means, treatments, etc.
    before_data = run_session_experiment(user_click_means, views, user_assignment, 0)
    after_data = run_session_experiment(user_click_means, views, user_assignment, treatment_effect)
    # To apply the delta method or CUPED, the calculations are at the user level, not sessions
    bd_agg = before_data.groupby('user', as_index=False).agg({'clicks': 'sum', 'views': 'sum'})
    ad_agg = after_data.groupby('user', as_index=False).agg({'treatment': 'max', 'clicks': 'sum', 'views': 'sum'})
    bd_agg.columns = ['pre_' + col for col in bd_agg.columns]
    d = ad_agg.join(bd_agg, on='user', how='left')
    d['pre_clicks'] = d['pre_clicks'].fillna(d.pre_clicks.mean())  # users not seen in pre-period
    d['pre_views'] = d['pre_views'].fillna(d.pre_views.mean())  # users not seen in pre-period
    V = _extract_variables(d)
    theta = calculate_theta(V)
    Y_hat = V.Y/V.N - theta*(V.X/V.M - (V.X/V.M).mean())  # Y_hat mean!
    # create assignment
    d['treatment'] = [user_assignment[k] for k in d.user]
    df = pd.DataFrame({"resids": Y_hat, "d": d.treatment.values})
    results = ols('resids ~ d', df).fit()
    # results.params['d'] is the effect; results.pvalues['d']/2 gives a single-tail p-value
    A = d.treatment == 0
    # should I do this or not? should it just be 1 variance
    Vt = _extract_variables(d[~A])
    Vc = _extract_variables(d[A])
    # Return difference of means, variance
    return Y_hat[~A].mean() - Y_hat[A].mean(), \
        var_y_hat(Vt, theta) + var_y_hat(Vc, theta), d, theta, results
```

```
# Keep the users fixed per experiment, to simulate sampling from a real user base
users = 10000
n_views = 200000 # each user needs enough examples for the covariances to be nonzero
np.random.seed(10)
user_click_means = np.random.normal(user_sample_mean, user_standard_error, users)
treatment = np.random.choice([0,1], users)
```

```
cupeds = []
thetas = []
for k in range(100):
    o = cuped_exp(user_click_means, treatment, treatment_effect, views=n_views)
    cupeds.append(o[0])
    thetas.append(o[-2])
```

```
# Ground truth values from simulation
cupeds=np.array(cupeds)
cm=cupeds.mean()
cste=cupeds.std()
t_score = cm/cste
```

```
# Additional run for traditional t-test using Y_hat and variance correction
o=cuped_exp(user_click_means, treatment, treatment_effect, views=n_views)
results=o[-1]
theta=o[-2]
delta_std = np.sqrt(o[1])
```

```
# OLS session level with pre-treatment variables
before_data = run_session_experiment(user_click_means, n_views, treatment, 0) # No treatment in before period
after_data = run_session_experiment(user_click_means, n_views, treatment, treatment_effect)
bd_agg = before_data.groupby('user', as_index=False).agg({'clicks': 'sum', 'views' : 'sum'})
bd_agg.columns = ['pre_'+k if k != 'user' else k for k in bd_agg.columns]
# Join raw session data and fill in missing values
ols_data = after_data.merge(bd_agg, on='user', how='left')
ols_data['pre_clicks'] = ols_data['pre_clicks'].fillna(before_data.clicks.mean()) # users not seen in pre-period
ols_data['pre_views'] = ols_data['pre_views'].fillna(before_data.views.mean()) # users not seen in pre-period
s7 = ols(
    'Y ~ X + d',
    {'Y': ols_data.clicks,
     'X': (ols_data.pre_clicks/ols_data.pre_views - (ols_data.pre_clicks/ols_data.pre_views).mean()),
     'd': ols_data.treatment}
).fit(cov_type='cluster', cov_kwds={'groups': ols_data['user'], 'use_correction': False})
```

```
# theta distribution and its standard deviation (not standard error)
np.mean(thetas),np.sqrt(np.var(thetas))
```

```
(0.679012644432057, 0.008879178301584759)
```

```
pd.DataFrame(
    {'method': ['Simulation (Ground truth)', 'CUPED w/ Special Theta -> t-test', 'OLS with cluster robust SEs'],
     'estimated effect': [cm, results.params['d'], s7.params['d']],
     'theta': [np.mean(thetas), theta, s7.params['X']],
     't_score': [t_score, results.tvalues['d'], s7.tvalues['d']],
     'standard_error': [cste, delta_std, s7.bse['d']]}
)
```

| | method | estimated effect | theta | t_score | standard_error |
|---|---|---|---|---|---|
| 0 | Simulation (Ground truth) | 0.249992 | 0.679013 | 101.278105 | 0.002468 |
| 1 | CUPED w/ Special Theta -> t-test | 0.256820 | 0.679159 | 96.635135 | 0.002712 |
| 2 | OLS with cluster robust SEs | 0.249392 | 0.676926 | 92.796036 | 0.002688 |

So, now we have seen the myriad ways that CUPED and OLS tie together, with the delta method sprinkled in for good measure. The remaining question is why use the delta method, or the CUPED calculation, at all if OLS (with cluster robust standard errors where appropriate) can do it for you. It’s a good question, and the answer lies in performance. The full OLS calculation is more computationally expensive and cannot be parallelized as easily, so scaling OLS to very large datasets could prove challenging. It’s a similar reason to why practitioners prefer the delta method over bootstrapping for a variance estimate: both would yield a correct variance approximation, but one takes much longer to compute.

I have also specifically avoided the questions of the validity of possibly non-IID data. This topic is covered at length in this paper from Deng et al. and may be a topic for a later post.

In the CUPED paper, Deng et al. state that “[t]he single control variate case can be easily generalized to include multiple variables.” Although this may be true, I don’t know if the solution is obvious if you are not well-versed in the required mathematical background, specifically if you would like a formula for the variance. However, it is very straightforward to see that you can add as many covariates as you’d like to the OLS formula and still have your variance estimated for you. Otherwise, you’d have to find the least squares solution to your multiple covariate formula and likely use bootstrapping for the variance. Maybe this is worth exploring in a later post.
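As a sketch of the multiple covariate case, with made up covariates and effect sizes, and plain `numpy` in place of `statsmodels` to stay self-contained: regress \(Y\) on the treatment plus two pre-experiment covariates, and the standard error on the treatment coefficient shrinks with no extra variance formula required.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
d = rng.integers(0, 2, n)   # treatment indicator
X1 = rng.normal(0, 1, n)    # hypothetical pre-experiment covariate 1
X2 = rng.normal(0, 1, n)    # hypothetical pre-experiment covariate 2
Y = 0.3 * d + 0.8 * X1 + 0.5 * X2 + rng.normal(0, 1, n)

def ols_with_se(design, y):
    # classical OLS coefficients and standard errors
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    return beta, se

ones = np.ones(n)
b_plain, se_plain = ols_with_se(np.column_stack([ones, d]), Y)
b_adj, se_adj = ols_with_se(np.column_stack([ones, X1, X2, d]), Y)
# the treatment coefficient is the last entry of each fit
print(se_plain[-1], se_adj[-1])  # the adjusted standard error is smaller
```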

I have to extend a thank you to the many people who discussed and simulated this problem alongside me to get to a coherent post: Kevin Foley, who introduced me to this connection and championed OLS and the formula API from `statsmodels`; Tianwen Chen, for discussing the various approaches and reviewing my simulations; and Chiara Capuano, for helping me with reference hunting, giving feedback, and proofreading.

Points are not the only defining factor in a game of football. Statistics did not predict the Eagles would win Super Bowl LII, and in fact, most projections undervalued the Eagles the entire year. There was something missing in the data, perhaps unmeasurable, that made the final difference. Now, even though fantasy football is solely based on points, point projection models are (understandably) not based on these unmeasurable quantities. What if Nick Foles is on a hot streak? What if Tom Brady has an injured hand and tries to catch a pass? What if the Patriots are still cheating but lose anyway? Sports fans, however, have taken these things into account for years through their gut feelings. Before you storm off citing “unscientific” methods, hear me out. Many aggregation sites look at the average of rankings given by experts. These rankings (as we will see below) are not ordered by total point projections. Sometimes they are defined by them, but not always.

If these gut feelings satisfy a few assumptions, the central limit theorem still implies that their average will converge to a true ranking (hand wave, hand wave). If we want to understand “hot streaks” or “momentum” quantitatively, we would have to model them explicitly and introduce a new source of bias. Instead, we average the rankings in the hope that those individual sources of bias wash out and we are left with something reasonable to base decisions on. The following analysis compares clustering raw player points (as suggested by the NYTimes article) with taking into account both points and expert rankings. We will use data from FantasyPros.com.

In order to not “double dip” in the data, I’m going to use two separate expert sources for my projections vs. my rankings in order to get multiple opinions on the ranking.

The first thing we’re going to do is explore the data and see what the relationship looks like between points and rankings. The data contains projections for the overall points in a season for each player from one group of experts and an average ranking *from a second group of experts*.

```
import pandas as pd
d=pd.read_csv('overall.csv')
df= pd.concat([pd.read_csv(x).iloc[1:] for x in ['qb.csv','wr.csv','rb.csv']])
df.head()
```

| | ATT | ATT.1 | CMP | FL | FPTS | INTS | Player | REC | TDS | TDS.1 | Team | YDS | YDS.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 587.0 | 58.9 | 376.1 | 2.5 | 321.7 | 9.2 | Aaron Rodgers | NaN | 33.8 | 2.1 | GB | 4,166.6 | 309.3 |
| 2 | 583.7 | 27.7 | 383.8 | 2.4 | 300.8 | 8.7 | Tom Brady | NaN | 32.5 | 0.9 | NE | 4,597.1 | 41.0 |
| 3 | 532.3 | 90.4 | 325.1 | 2.5 | 296.6 | 12.5 | Russell Wilson | NaN | 26.0 | 3.2 | SEA | 3,827.4 | 503.2 |
| 4 | 510.8 | 76.7 | 309.0 | 2.7 | 288.5 | 16.0 | Deshaun Watson | NaN | 27.4 | 3.5 | HOU | 3,781.0 | 442.1 |
| 5 | 512.9 | 120.0 | 299.5 | 2.2 | 287.8 | 14.5 | Cam Newton | NaN | 22.4 | 5.0 | CAR | 3,487.2 | 620.6 |

```
d.head()
```

| | Rank | WSID | Overall | Team | Pos | Bye | Best | Worst | Avg | Std Dev | ADP | vs. ADP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | NaN | Todd Gurley | LAR | RB1 | 12.0 | 1.0 | 7.0 | 1.9 | 1.3 | 1.0 | 0.0 |
| 1 | 2.0 | NaN | Le'Veon Bell | PIT | RB2 | 7.0 | 1.0 | 9.0 | 2.5 | 1.8 | 2.0 | 0.0 |
| 2 | 3.0 | NaN | David Johnson | ARI | RB3 | 9.0 | 1.0 | 11.0 | 3.3 | 2.0 | 3.0 | 0.0 |
| 3 | 4.0 | NaN | Antonio Brown | PIT | WR1 | 7.0 | 1.0 | 11.0 | 4.4 | 1.7 | 5.0 | 1.0 |
| 4 | 5.0 | NaN | Ezekiel Elliott | DAL | RB4 | 8.0 | 1.0 | 18.0 | 5.8 | 2.9 | 4.0 | -1.0 |

```
import matplotlib.pyplot as plt

dd = d.merge(df, left_on='Overall', right_on='Player')
dd = dd[~dd.Avg.isnull()] # remove null projections

def get_color(x):
    try:
        if 'WR' in x:
            return 'b'
        if 'RB' in x:
            return 'r'
        if 'TE' in x:
            return 'g'
        if 'QB' in x:
            return 'black'
        if 'K' in x:
            return 'orange'
        return 'y'
    except TypeError:  # missing positions are NaN floats
        return 'c'

plt.scatter(dd.Avg, dd.FPTS, c=dd.Pos.apply(get_color))
plt.ylabel('points')
plt.xlabel('rank')
```

```
Text(0.5,0,u'rank')
```

As you can see, the quarterbacks (in black) are clearly separate from the other positions in terms of how they’re scored vs. ranked. The most interesting positions to look at are the running backs (red) and wide receivers (blue). You can see there’s almost a linear relationship between the ranking and the number of points scored, which you would expect: a high scoring player should be ranked highly, and most ranking algorithms use projected points as a large factor. However, there is still a curve to the trend of rank vs. points, and that means the ranking algorithms had more factors than purely points. The question is: can we identify additional insight from this picture to improve our team?

We’re going exclude QBs for the analysis.

```
from sklearn.mixture import GaussianMixture
from matplotlib import cm
import numpy as np

# dt is the merged data from above with QBs excluded
n_cluster = 8
gm = GaussianMixture(n_components=n_cluster, random_state=7)
gm.fit(dt[['FPTS']])
cluster = gm.predict(dt[['FPTS']])
# cm.spectral was renamed cm.nipy_spectral in newer matplotlib
plt.scatter(dt.Avg, dt.FPTS, c=[cm.nipy_spectral(k/float(n_cluster)) for k in cluster])
plt.ylabel('points')
plt.xlabel('rank')
```

Look at the edges of the clusters. It’s a flat line because the approach ignored that many of those players have vastly different ranks. In the lower tiers, that won’t matter too much, but especially in the higher and mid tiers it’s an important dimension to think about.

Now, let’s do the analysis again but this time consider the ranking. We will calculate a new rank based off the point totals and the average expert ranking, and then we will cluster based on both the point projection and the original ranking.

To obtain a new ranking, I’m going to orthogonally project the data onto the line connecting the worst player to the best player:

```
rankbest = dt.loc[0][['Avg','FPTS']].values # the best ranked player
ranklast = dt.loc[288][['Avg','FPTS']].values # the worst ranked player
plt.plot([rankbest[0],ranklast[0]],[rankbest[1],ranklast[1]])
plt.scatter(dt.Avg, dt.FPTS)
plt.ylabel('points')
plt.xlabel('rank')
```

This is a somewhat ad-hoc choice, based on looking at the data. You could also use a less geometric approach.

The standard way to project a vector \(u\in\mathbb{R}^n\) onto \(v\in\mathbb{R}^n\) is as follows:

\[\begin{equation}\frac{u\cdot v}{v\cdot v} v := cv\end{equation}\]

You simply scale \(v\) by \(c\in\mathbb{R}\) to represent \(u\) projected onto \(v\). Now, \(c\) is the player’s new “rank”.
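As a tiny sanity check of the projection formula with a made up \(u\) and \(v\): the residual \(u - cv\) is orthogonal to \(v\), which is exactly the property that makes \(cv\) the orthogonal projection.

```python
import numpy as np

u = np.array([3.0, 1.0])  # made up "player" point
v = np.array([2.0, 2.0])  # made up best-to-worst direction
c = np.dot(u, v) / np.dot(v, v)
proj = c * v
print(c, np.dot(u - proj, v))  # prints 1.0 0.0; the residual is orthogonal to v
```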

```
# Project the data and calculate c
to_proj = rankbest-ranklast
vals = dt.loc[:288][['Avg','FPTS']].values - ranklast
c = np.dot(vals, to_proj)/ np.dot(to_proj, to_proj)
```

```
dt['original_clusters'] = cluster
```

The calculation on the two dimensional data is more straightforward.

```
gm.fit(dt[['Avg','FPTS']])
dt['gmm2d_cluster'] = gm.predict(dt[['Avg','FPTS']])
```

We have adjusted the colors to match between the two plots.

We all know who the best players are and they’re not a surprise. By clustering on points alone, you’ll always see the best players on top– but everyone knows their top draft picks. We’re trying to find a sound tie breaker for the later draft rounds. It’s clear from the plots that the standard GMM approach is missing some of the considerations from the experts (found in their ranking). Which rank-aware approach is better is an open question, but to my eyes, I like that the 1D GMM approach appears to keep lower ranked players in their own cluster.

Next, we will plot the ranking and tier against projected points as the original New York Times article does to show the different insights you would get from our two dimensional approach vs. the standard approach.


Now, when a person has a different rank vs. other players with the same points, you can see that it affects their tiering. In the bottom right plot for wide receivers, you can see a completely different tiering. To obtain a final ranking, you would sort via the tiers followed by the normalized rank.

One interesting point is that this approach may allow you to identify overvalued players because the rank, tier, and point projections aren’t deterministically linked. Look at the bottom right plot for wide receivers. You can see Corey Davis is included in the yellow group even though his rank is much higher than the other players. After a quick Google search, the Bleacher Report states that “…Corey Davis … no longer [qualifies as a sleeper] as [his] stocks have ascended. If anything, [he’s] in danger of getting overdrafted over the weekend.”. This insight is not clear from the original ranking/tiers.

Similarly, it also appears to draw different cutoffs as to who are “elite” players. This approach to ranking puts the top three wide receivers in a tier of their own where the original approach implies the first ~10 wide receivers are equivalent. This method would tell you to prioritize drafting the top 3 wide receivers over the others if you are able to.

As always, analytics for fantasy football are more of an art than science due to the limited data, but I hope this approach shows you an alternative way to use Gaussian Mixture Models to combine information about players into tiers.

The residual layer is actually quite simple: add the output of the activation function to the original input to the layer. As a formula, the \(k+1\)th layer is given by: \begin{equation} x_{k+1} = x_{k} + F(x_{k})\end{equation} where \(F\) is the function of the \(k\)th layer and its activation. For example, \(F\) might represent a convolutional layer with a ReLU activation. This simple formula is a special case of the formula: \begin{equation} x_{k+1} = x_{k} + h F(x_k),\end{equation} which is the formula for the Euler method for solving ordinary differential equations (ODEs) when \(h=1\). (Fun fact: the Euler method was the approach suggested by Katherine Johnson in the movie Hidden Figures).

Let’s take a quick detour on ODEs to appreciate what this connection means. Consider a simplified ODE from physics: we want to model the position \(x\) of a marble. Assume we can calculate its velocity \(x^\prime\) (the derivative of position) at any position \(x\). We know that the marble starts at rest \(x(0)=0\) and that its velocity at time \(t\) depends on its position through the formula: \begin{equation} x^\prime(t) = f(x) \end{equation} \begin{equation}x(0) = 0.\end{equation} The Euler method solves this problem by following the physical intuition: my position at a time very close to the present depends on my current velocity and position. For example, if you are travelling at a velocity of 5 meters per second, and you travel 1 second, your position changes by 5 meters. If we travel \(h\) seconds, we will have travelled \(5h\) meters. As a formula, we said: \begin{equation}x(t+h) = x(t) + h x^\prime(t),\end{equation} but since we know \(x^\prime(t) = f(x)\), we can rewrite this as \begin{equation} x(t+h) = x(t) + h f(x).\end{equation} If you squint at this formula for the Euler method, you can see it looks just like the formula for residual layers!
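To make the connection concrete, here is a minimal sketch (my own illustration, not from any of the papers) of the Euler method applied to the simple ODE \(x^\prime = -x\), whose exact solution is \(x_0 e^{-t}\):

```python
import numpy as np

# Euler's method for x'(t) = f(x), x(0) = x0: repeatedly step x <- x + h*f(x).
# Note the step has the same shape as the residual-layer update with F = f.
def euler(f, x0, h, n_steps):
    x = x0
    for _ in range(n_steps):
        x = x + h * f(x)
    return x

x0, h, n = 1.0, 0.001, 1000                  # integrate x' = -x out to t = 1
approx = euler(lambda x: -x, x0, h, n)
exact = x0 * np.exp(-1.0)                    # analytic solution at t = 1
```

Shrinking \(h\) (and taking correspondingly more steps) drives `approx` toward `exact`, which is exactly the sense in which a stack of residual layers traces out an ODE trajectory.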

This observation has meant three things for designing neural networks:

- New neural network layers can be created through different numerical approaches to solving ODEs, e.g. Lu et al. 2017 paper.
- The possibility of arbitrarily deep neural networks, e.g. Chang et al. 2017 paper.
- Training of a deep network can be improved by considering the so-called stability of the underlying ODE and its numerical discretization, e.g. Haber and Ruthotto 2017 study.

These are all very exciting ideas, but the latter two are particularly interesting– and intimately connected. The reason one can create arbitrarily deep networks with a finite memory footprint is by designing neural networks based on stable ODEs and numerical discretizations.

Rather than solely using the weights to define transformations (i.e. features), the dynamics of the underlying ODE are explicitly considered for determining features. Some implementations use multiple “ODE blocks” (my term, not standard) using different \(F\)’s, with each block responsible for 5-10+ layers of the final network. This means that the input will propagate according to the ODE for 5-10 time steps, and the trajectory helps to determine the learned features. The concept of stability comes into play here. Put simply, an ODE is stable if small perturbations of the initial position result in small perturbations of the final (asymptotic) position. However, there are stable ODEs that push all solutions to a single point, which is also not helpful for creating features.

For neural networks, the most important style of ODE to consider is the following: \begin{equation} x^\prime(t) = \sigma( K x(t) + b):=F(x)\end{equation} where \(\sigma\) is the activation function of a layer, \(K\) the matrix representing the weights of the layer and \(b\) the bias term. (Note, convolutional transformations can be represented by a matrix as well).

The stability of a (linear) system of ODEs is determined by eigenvalues of the Jacobian of \(F\). The important thing to know is that after some calculations one can show that eigenvalues of the Jacobian of \(F\) are determined by the weight matrix \(K\) in this case.

One way to understand the dynamics of an ODE are phase diagrams. Phase diagrams are plots of the direction of motion of a system at a given point indicated by an arrow. Then, some trajectories of the ODE are plotted to visualize what would happen given an initial condition. Here’s a plot from Haber and Ruthotto 2017 analyzing trajectories of the above \(ODE\) after 10 steps given different eigenvalues for a \(K\in\mathbb{R}^{2\times2}\):

From stability theory, the authors knew positive (real) eigenvalues for \(K\) would lead to a non-stable ODE which would not be ideal for learning features, and likewise, negative (real) eigenvalues would collapse all initial conditions to a single point. Zero-valued eigenvalues would lead to a dynamical system that doesn’t move at all! They suggest to use matrices whose real part is zero but have an imaginary component leading to the plot on the right. This can be done by designing the matrix to have these imaginary eigenvalues.

This theory has led to researchers proposing network designs that are inherently stable, either by construction via the weights (ex. using antisymmetric matrices) or choice of ODE. This is a very new approach for neural networks and it is exciting to see mathematical theory taking center stage. For a more in depth analysis, see Haber and Ruthotto 2017.

As we can see from the formula: \begin{equation} x_{k+1} = x_k +hF(x_k) \end{equation} the \(k+1\)th layer is defined by applying \(F\) (the weights and activation) and then adding it to \(x_k\). If you fix \(F\) for multiple layers, you can effectively have unlimited layers with only the memory footprint required to store the weights of \(F\).

Major libraries today (e.g. TensorFlow) rely on symbolic gradient calculations (and in some cases “automatic differentiation”) for their SGD updates. As these are calculated off of the computational graph, even such arbitrarily deep networks could create large graphs resulting in memory issues. One solution would be to train a model on larger hardware and then create the neural network evaluation function in a separate implementation which only loads one set of weights into memory per layer. Then, an arbitrarily deep network could fit on much smaller hardware. It is possible that a new implementation could be built to train arbitrarily deep nets on hardware with small memory footprints by combining the chain rule / backpropagation in a more componentwise fashion, but it would increase training time.

At a high level, our approach will combine arbitrary depth with stable propagation. We will reuse each layer multiple times to maximize our depth while keeping our memory footprint small. Stability will be guaranteed by the underlying ODE we will use.

For stability we will follow the approach first suggested in Haber and Ruthotto 2017. The idea is to use a Hamiltonian system that is designed to be stable. This equation will be used as the underlying ODE:

\begin{equation}
\frac{\partial}{\partial t}
\left( \begin{matrix} y \\ z \end{matrix} \right)(t) =
\sigma\left( \left( \begin{matrix} 0 & K(t) \\ -K(t)^T & 0 \end{matrix} \right)
\left( \begin{matrix} y \\ z \end{matrix} \right)(t)
+ b(t) \right); \quad
\left( \begin{matrix} y \\ z \end{matrix} \right)(0) =
\left( \begin{matrix} y_0 \\ 0 \end{matrix} \right).
\end{equation}

There is a lot here, so let’s do it step by step: first, \(K\) and \(b\) represent the weights and bias terms respectively for our convolutional layer. (Note: a convolution can be represented as a matrix, which we use here.) They could depend on the number of times \(t\) you pass your input \(y(0)=y_0\) through the layer. The \(z\) represents an auxiliary variable introduced to improve the stability of the system, but practically \(z\) will represent the additional filters we add through zero padding. In particular, we will take an image that has been transformed into a 4D tensor and double (say) the number of channels through zero padding. Then, our layer will define \(y\) as the non-zero channels and \(z\) as the zero channels. This is a practice known as “channel-wise partition” for \(y\) and \(z\), and the concept first appeared in Dinh et al. 2016.

The reason for adding this \(z\) is twofold. First, zero padding takes the place of using additional features in a convolutional layer. Instead of making a new Conv2D layer with double the filters, we’ll simply double the channels of the tensor through zero padding. Second, by introducing \(z\), we can make our dynamical system stable, because the Jacobian of this equation has eigenvalues whose real parts are zero. This is proven in Haber and Ruthotto 2017. The eigenvalues of the Jacobian depend on the eigenvalues of the matrix
\begin{equation}
\left( \begin{matrix} 0 & K(t) \\ -K(t)^T & 0 \end{matrix} \right),
\end{equation}
and by construction, this matrix is antisymmetric (\(A^T = -A\)) which implies any real eigenvalues of the matrix are \(0\). This yields that the forward propagation of this dynamical system will be stable, which will imply stability in training.
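This antisymmetry claim is easy to check numerically. A quick sketch with a random stand-in matrix \(K\) (my own illustration, not code from the post):

```python
import numpy as np

# Build the block matrix [[0, K], [-K^T, 0]]; it is antisymmetric by construction,
# so its eigenvalues are purely imaginary (real parts ~ 0).
rng = np.random.default_rng(0)
K = rng.standard_normal((3, 3))          # stand-in for the conv weights as a matrix
A = np.block([[np.zeros((3, 3)), K],
              [-K.T, np.zeros((3, 3))]])
eigs = np.linalg.eigvals(A)
```

Any choice of \(K\) works here: the real parts of `eigs` come out at machine precision of zero.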

By leveraging the same concept behind arbitrary depth we can reduce the memory footprint of the network. We do this by reusing a layer multiple times, relying on the propagation of the ODE to learn more complicated features. To clarify, we do this by running an image through the layer, and then passing the output back into the layer. So then, a single layer can be “unrolled” in this fashion to actually represent multiple layers. In our case, \(t\) will indicate the number of times we have passed the image through the layer. The key to this trick is to define \(K(t)\equiv K\) and \(b(t)\equiv b\) for all \(t\).

We implement this by extending the convolutional layer in Keras. The important piece to keep in mind is that \(K\) is implemented as a convolution and \(K^T\) is a transposed convolution using the same weights. Then, we implement the Euler forward method for both \(y\) and \(z\), and then pass them back into the convolutional layers until we meet the number of unrolls, stored in the “unroll_length” variable. The Euler forward step in our case looks like this: \begin{equation} y(t+h) = y(t) + h \sigma( K z(t) + b_1(t) ) \end{equation} \begin{equation} z(t+h) = z(t) + h \sigma( -K^T y(t) + b_2(t) ). \end{equation} It’s important to choose a small enough \(h\) for stability to hold– which you can accomplish through trial and error on a validation set. To train the network below, we set \(h=.15\).
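Before the Keras version, the two Euler forward updates can be sketched in plain NumPy, with a matrix standing in for the convolution (an illustration of the update rule, not the actual layer):

```python
import numpy as np

# Toy version of the unrolled Euler forward step for the Hamiltonian system.
# Kmat stands in for the convolution; both updates use the *old* (y, z),
# matching the Euler forward equations above.
def hamiltonian_unroll(y, z, Kmat, b1, b2, h=0.15, unroll_length=5, act=np.tanh):
    for _ in range(unroll_length):
        y_next = y + h * act(Kmat @ z + b1)
        z_next = z + h * act(-Kmat.T @ y + b2)
        y, z = y_next, z_next
    return y, z

rng = np.random.default_rng(0)
Kmat = rng.standard_normal((4, 4)) * 0.1
y0, z0 = rng.standard_normal(4), np.zeros(4)   # z starts as the zero padding
b1, b2 = np.zeros(4), np.zeros(4)
y5, z5 = hamiltonian_unroll(y0, z0, Kmat, b1, b2)
```

The same weights `Kmat`, `b1`, `b2` are reused on every pass, which is exactly the fixed-\(K(t)\) trick that keeps the memory footprint constant regardless of unroll length.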

Here we replicate functions of the final classes for illustrative purposes. The full class definitions can be found here.

```
class HamiltonianConv2D(Conv2D):
    def call(self, inputs):
        h = self.h
        # Channel-wise partition for y and z
        tmp, otmp = tf.split(inputs, num_or_size_splits=2, axis=-1)
        for k in range(self.unroll_length):
            y_f_tmp = K.conv2d(
                otmp,
                self.kernel,
                strides=self.strides,
                padding=self.padding,
                data_format=self.data_format,
                dilation_rate=self.dilation_rate)
            x_f_tmp = -K.conv2d_transpose(
                x=tmp,
                kernel=self.kernel,
                strides=self.strides,
                output_shape=K.shape(tmp),
                padding=self.padding,
                data_format=self.data_format)
            if self.use_bias:
                y_f_tmp = K.bias_add(
                    y_f_tmp,
                    self.bias,
                    data_format=self.data_format)
                x_f_tmp = K.bias_add(
                    x_f_tmp,
                    self.bias2,
                    data_format=self.data_format)
            # Euler Forward step
            tmp = tmp + h * self.activation(y_f_tmp)
            otmp = otmp + h * self.activation(x_f_tmp)
        out = K.concatenate([tmp, otmp], axis=-1)
        return out
```

We also have to implement the zero padding for adding channels to our tensors. This is done via the following layer:

```
class ChannelZeroPadding(Layer):
    def call(self, x):
        # Append (padding - 1) all-zero copies of x along the channel axis
        return K.concatenate([x] + [K.zeros_like(x) for k in range(self.padding - 1)], axis=-1)
```
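In NumPy terms, this layer (with a padding factor of 2) just appends a block of zero channels. A quick sketch of the effect on a small tensor:

```python
import numpy as np

# Double the channel count by appending all-zero channels along the last axis,
# which is what ChannelZeroPadding does with padding=2.
x = np.ones((1, 4, 4, 16))                        # (batch, height, width, channels)
padded = np.concatenate([x, np.zeros_like(x)], axis=-1)
```

The first 16 channels are the untouched input \(y\); the appended zero channels become the auxiliary \(z\).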

```
Using TensorFlow backend.
```

In the original ResNet paper implementation section, they select a batch size of 256 for 64000 mini batch updates. In my implementation, my computer could not handle that large of a batch, so I use a batch size of 256/8=32 and multiply the number of mini batch updates by 8 to ensure my training sees each data point the same number of times. I adopt a similar learning rate decay schedule to the paper, outlined in the experiments section: I will divide the learning rate by 10 at iterations 32000 and 48000.
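As a sketch, that decay schedule might be written as a small helper (the function name and base rate of 0.1 are my own assumptions, consistent with the .1-.001 range mentioned below):

```python
# Hypothetical helper (not from the post) implementing the stated step decay:
# divide the learning rate by 10 at iterations 32000 and 48000.
def learning_rate(iteration, base_lr=0.1):
    if iteration >= 48000:
        return base_lr / 100.0
    if iteration >= 32000:
        return base_lr / 10.0
    return base_lr
```

A function like this can be wired into training via a Keras `LearningRateScheduler`-style callback, though the notebook may do it differently.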

There are also some differences when training SOTA residual networks compared to (say) the current standard for reinforcement learning, which I found surprising. They are (in Keras lingo):

- SGD with momentum has become the standard optimizer rather than adaptive learning rate methods like RMSProp and Adam. Here’s a whole post about how surprising this is.
- The He Normal (he_normal) initialization is used rather than Xavier uniform (glorot_uniform).
- The learning rates (.1-.001) are an order of magnitude larger than in reinforcement learning (.000025 and smaller).
- Usually there is only a single fully connected dense layer after the convolutional layers.
- No dropout is utilized. Networks are trained only with l2 regularization.

I follow the standards in the image classification community for my training regimen, which includes a data generation pipeline that perturbs the input images. The notebook for the full training code and all hyperparameters in Keras can be found here.

The final network is composed of 5 ODE-inspired “units” made up of iterations of the layer defined above. The approach is similar to the architecture found in this image below from Chang et al. 2017:

You can see that the units are followed by (Average) Pooling and Zero Padding as we defined in the previous section. For some of our units, we skip the pooling layer to be able to get more channels without shrinking the image too much.

```
model.summary()
```

```
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 32, 32, 3) 0
_________________________________________________________________
conv1_pad (ZeroPadding2D) (None, 38, 38, 3) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 32, 32, 16) 2368
_________________________________________________________________
batch_normalization_1 (Batch (None, 32, 32, 16) 64
_________________________________________________________________
activation_1 (Activation) (None, 32, 32, 16) 0
_________________________________________________________________
hamiltonian_conv2d_1 (Hamilt (None, 32, 32, 16) 592
_________________________________________________________________
average_pooling2d_1 (Average (None, 16, 16, 16) 0
_________________________________________________________________
channel_zero_padding_1 (Chan (None, 16, 16, 64) 0
_________________________________________________________________
hamiltonian_conv2d_2 (Hamilt (None, 16, 16, 64) 9280
_________________________________________________________________
channel_zero_padding_2 (Chan (None, 16, 16, 128) 0
_________________________________________________________________
hamiltonian_conv2d_3 (Hamilt (None, 16, 16, 128) 36992
_________________________________________________________________
channel_zero_padding_3 (Chan (None, 16, 16, 256) 0
_________________________________________________________________
hamiltonian_conv2d_4 (Hamilt (None, 16, 16, 256) 147712
_________________________________________________________________
average_pooling2d_2 (Average (None, 8, 8, 256) 0
_________________________________________________________________
channel_zero_padding_4 (Chan (None, 8, 8, 512) 0
_________________________________________________________________
hamiltonian_conv2d_5 (Hamilt (None, 8, 8, 512) 590336
_________________________________________________________________
average_pooling2d_3 (Average (None, 4, 4, 512) 0
_________________________________________________________________
channel_zero_padding_5 (Chan (None, 4, 4, 1024) 0
_________________________________________________________________
hamiltonian_conv2d_6 (Hamilt (None, 4, 4, 1024) 2360320
_________________________________________________________________
average_pooling2d_4 (Average (None, 1, 1, 1024) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 1024) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 10250
=================================================================
Total params: 3,157,914
Trainable params: 3,157,882
Non-trainable params: 32
_________________________________________________________________
```

For comparison, the ResNet 50 implementation in Keras has 25,636,712 parameters and 168 layers.

For training, I will use the typical data augmentation techniques and they can be seen in the notebook.

The final accuracy on the test set is 90.37%.

For simplicity, I used the Euler forward method for discretizing the ODE. This is based on numerical quadrature:

\begin{equation} x^\prime(t) = f(x) \Rightarrow \int_t^{t+h} x^\prime(z)\,dz = \int_t^{t+h} f(x(z))\,dz\Rightarrow x(t+h)-x(t) = \int_t^{t+h} f(x(z))\,dz.\end{equation} In other words, \begin{equation} x(t+h) = x(t) + \int_{t}^{t+h}f(x(z))\,dz \approx x(t) + hf(x(t)).\end{equation} The \(\approx\) sign comes from a simple numerical quadrature (integration) rule. Other rules have been used to discretize ODEs, like the midpoint rule or the leapfrog rule. These can be better approximations to the integral, which lead to better approximations to the ODE. For Hamiltonian systems, the Verlet rule is used, which leads to the approach in Chang et al. 2017.
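To see why a better quadrature rule pays off, here is a rough comparison (my own sketch, not from the post) of Euler against the explicit midpoint rule on the test problem \(x^\prime = -x\):

```python
import numpy as np

# Integrate x' = -x from t=0 to t=1 with a generic one-step method.
def integrate(step, x0, h, n):
    x = x0
    for _ in range(n):
        x = step(x, h)
    return x

f = lambda x: -x
euler_step = lambda x, h: x + h * f(x)                      # first-order
midpoint_step = lambda x, h: x + h * f(x + 0.5 * h * f(x))  # second-order

exact = np.exp(-1.0)
err_euler = abs(integrate(euler_step, 1.0, 0.01, 100) - exact)
err_mid = abs(integrate(midpoint_step, 1.0, 0.01, 100) - exact)
```

With the same step size and step count, the midpoint rule's error is roughly two orders of magnitude smaller here, which is the kind of gain that motivates fancier discretizations for residual architectures.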

Deep Reinforcement Learning sounds intractable for a layperson. “Reinforcement learning” alone sounds scary (I’m reinforcing WHAT again? Random actions? Bah!), and then you add the word “deep” and most people simply drop off. This talk details what I’ve learned from replicating the original NIPS/Nature papers on Deep Reinforcement Learning for playing Atari games and walks through some Python implementations of the basic steps for reinforcement learning code. The talk uses Keras (TensorFlow backend) and tries (albeit, unsuccessfully) to hide advanced mathematical formulations of the problem. By the end of the talk, we will all have taken a modest step forward in learning deep reinforcement learning.

You can access the slides here.

Attention manifests differently in different contexts. In some NLP problems, Attention allows the model to put more emphasis on different words during prediction, e.g. from Yang et al.:

Interestingly, there is increasing research that calls this mechanism “Memory”, which some claim is a better title for such a versatile layer. Indeed, the Attention layer can allow a model to “look back” at previous examples that are relevant at prediction time, and the mechanism has been used in that way for so-called Memory Networks. A discussion of Memory Networks is outside the scope of this post, so for our purposes, we will discuss Attention named as such, and explore the properties suggested by this nomenclature.

The Attention Mechanism mathematically is deceptively simple for its far reaching applications. I am going to go over the style of Attention that is common in NLP applications. A good overview of different approaches can be found here. In this case, Attention can be broken down into a few key steps:

- MLP: A one layer MLP acting on the hidden state of the word
- Word-level Context: A vector is dotted with the output of the MLP
- Softmax: The resulting vector is passed through a softmax layer
- Combination: The attention vector from the softmax is combined with the input state that was passed into the MLP

Now we will describe this in equations. The general flavor of the notation is following the paper on Hierarchical Attention by Yang et al., but includes insight from the notation used in this paper by Li et al. as well.

Let \(h_{t}\) be the word representation (either an embedding or a (concatenation of) hidden state(s) of an RNN) of dimension \(K_w\) for the word at position \(t\) in some sentence padded to a length of \(M\). Then we can describe the steps via

\begin{equation} u_t = \tanh(W h_t + b)\end{equation} \begin{equation} \alpha_t = \text{softmax}(v^T u_t)\end{equation} \begin{equation} s = \sum_{t=1}^{M} \alpha_t h_t\end{equation}

If we define \(W\in \mathbb{R}^{K_w \times K_w}\) and \(v,b\in\mathbb{R}^{K_w}\), then \(u_t \in \mathbb{R}^{K_w}\), \(\alpha_t \in \mathbb{R}\) and \(s\in\mathbb{R}^{K_w}\). (Note: this is the multiplicative application of attention.) Even though there is a lot of notation, it is still just three equations. How can models improve so much when using attention?
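The three equations can be sketched directly in NumPy (with made-up dimensions and random weights, purely to make the shapes concrete):

```python
import numpy as np

# u_t = tanh(W h_t + b), alpha = softmax over t of v^T u_t, s = sum_t alpha_t h_t
rng = np.random.default_rng(0)
M, K_w = 5, 4                        # sentence length, word-vector dimension
H = rng.standard_normal((M, K_w))    # rows are the h_t
W = rng.standard_normal((K_w, K_w))
b = rng.standard_normal(K_w)
v = rng.standard_normal(K_w)

U = np.tanh(H @ W.T + b)             # u_t for every t
scores = U @ v                       # v^T u_t, one score per word
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over the M words
s = alpha @ H                        # attention-weighted combination of the h_t
```

The output `s` lives in the same \(K_w\)-dimensional space as the word vectors, and `alpha` sums to one, which is what makes the probabilistic reading later in the post work.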

To answer that question, we’re going to explain what each step actually does geometrically, and this will illuminate names like “word-level context vector”. I recommend reading my post on how neural networks bend and twist the input space, so this next bit will make a lot more sense.

First, remember that \(h_t\) is an embedding or representation of a word in a vector space. That’s a fancy way of saying that we pick some N-dimensional space, and we pick a point for every word (hopefully in a clever way). For \(N=2\), we would be drawing a bunch of dots on a piece of paper to represent our words. Every layer of the neural net moves these dots around (and transforms the paper itself!), and this is the key to understanding how attention works. The first piece

\begin{equation}u_t = tanh(Wh_t+b)\end{equation}

takes the word, rotates/scales it and then translates it. So, this layer rearranges the word representation in its current vector space. The tanh activation then twists and bends (but does not break!) the vector space into a manifold. Because \(W:\mathbb{R}^{K_w}\to\mathbb{R}^{K_w}\) (i.e., it does not change the dimension of \(h_t\)), there is no information lost in this step. If \(W\) projected \(h_t\) to a lower dimensional space, then we would lose information from our word embedding. Embedding \(h_t\) into a higher dimensional space would not give us any new information about \(h_t\), but it would add computational complexity, so no one does that.

The word-level context vector is then applied via a dot product: \begin{equation}v^T u_t\end{equation} As both \(v,u_t\in \mathbb{R}^{K_w}\), it is \(v\)’s job to learn information of this new vector space induced by the previous layer. You can think of \(v\) as combining the components of \(u_t\) according to their relevance to the problem at hand. Geometrically, you can imagine that \(v\) “emphasizes” the important dimensions of the vector space.

Let’s see an example. Say we are working on a classification task: is this word an animal or not. Imagine you have a word embedding defined by the following image:

```
%matplotlib inline
import matplotlib.pyplot as plt
vectors = {'dog': [.1, .5], 'cat': [.5, .48], 'snow': [.5, .02], 'rain': [.4, .05]}
for word, item in vectors.items():
    plt.annotate(word, xy=item)
    plt.scatter(item[0], item[1])
```

In this toy example, if this were all of our data, we could assume that the x-axis component of this word embedding is NOT useful for the task of classifying animals. So, a choice of \(v\) might be the vector \([0,1]\), which would pluck out the \(y\) dimension, as it is clearly important, and remove the \(x\). The resulting clustering would look like this:

```
# Here we apply v, which zeros out the first dimension
for word, item in vectors.items():
    plt.annotate(word, xy=(0, item[1]))
    plt.scatter(0, item[1])
```

We can clearly see that \(v\) has identified important information in the vector space representing the words. This is what the word-level context vector’s role is. It picks out the important parts of the vector space for our model to focus in on. This means that the previous layer will move the input vector space around so that \(v\) can easily pick out the important words! In fact, the softmax values of “dog” and “cat” will be larger in this case than “rain” and “snow”. That’s exactly what we want.

The next layer does not have a particularly geometric flavor to it. The application of the softmax is a monotonic transformation (i.e. the importances picked out by \(v^Tu_t\) stay the same). This allows the final representation of the input sentence to be a linear combination of \(h_t\), weighting the word representations by their importance. There is however another way to view the Attention output…

The output is easy to understand probabilistically. Because \(\alpha=[\alpha_t]_{t=1}^M\) sums to \(1\), this yields a probabilistic interpretation of Attention. Each weight \(\alpha_t\) indicates the probability that the word is important to the problem at hand. There are some versions of attention (“hard attention”) which actually sample a word according to this probability and return only that word from the attention layer. In our case, the resulting vector is the expectation of the important words. In particular, if our random variable \(X\) is defined to take on the value \(h_t\) with probability \(\alpha_t\), then the output of our Attention mechanism is actually \(\mathbb{E}[X]\).

To demonstrate this, we’re going to use data from a recent Kaggle competition on using NLP to detect toxic comments. The competition aims to label a text with 6 different (somewhat correlated) labels: “toxic”, “severe_toxic”, “obscene”, “threat”, “insult”, and “identity_hate”. We will expect the Attention layer to give more importance to negative words that could indicate one of those labels has occurred. Let’s do a quick data load with some boilerplate code…

```
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
```

```
train=pd.read_csv('train.csv')
train=train.sample(frac=1)
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
```

```
# Example of a data point
train.iloc[159552]
```

```
id 07ea42c192ead7d9
comment_text "\n\nafd\nHey, MONGO, why dont you put a proce...
toxic 0
severe_toxic 0
obscene 0
threat 0
insult 0
identity_hate 0
Name: 2928, dtype: object
```

More NLP boilerplate to tokenize and pad the sequences:

```
max_features = 20000
maxlen = 200
list_sentences = train['comment_text'].values
# Tokenize
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences)
# Pad
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
```

There are many versions of attention out there that implement a custom Keras layer and do the calculations with low-level calls to the Keras backend. In my implementation, I’d like to avoid this and instead use Keras layers to build up the Attention layer in an attempt to demystify what is going on.

For our implementation, we’re going to follow the formulas used above. The approach was inspired by this repository.

```
from keras.layers import Activation, Concatenate, Permute, SpatialDropout1D, RepeatVector, LSTM, Bidirectional, Multiply, Lambda, Dense, Dropout, Input, Flatten, Embedding
from keras.models import Model
import keras.backend as K

class Attention:
    def __call__(self, inp, combine=True, return_attention=True):
        # Expects inp to be of size (?, number of words, embedding dimension)
        repeat_size = int(inp.shape[-1])
        # Map through 1 Layer MLP
        x_a = Dense(repeat_size, kernel_initializer='glorot_uniform', activation="tanh", name="tanh_mlp")(inp)
        # Dot with word-level vector
        x_a = Dense(1, kernel_initializer='glorot_uniform', activation='linear', name="word-level_context")(x_a)
        x_a = Flatten()(x_a)  # x_a is of shape (?,200,1), we flatten it to be (?,200)
        att_out = Activation('softmax')(x_a)
        # Clever trick to do elementwise multiplication of alpha_t with the correct h_t:
        # RepeatVector will blow it out to be (?,120,200)
        # Then, Permute will swap it to (?,200,120) where each row (?,k,120) is a copy of a_t[k]
        # Then, Multiply performs elementwise multiplication to apply the same a_t to each
        # dimension of the respective word vector
        x_a2 = RepeatVector(repeat_size)(att_out)
        x_a2 = Permute([2, 1])(x_a2)
        out = Multiply()([inp, x_a2])
        if combine:
            # Now we sum over the resulting word representations
            out = Lambda(lambda x: K.sum(x, axis=1), name='expectation_over_words')(out)
        if return_attention:
            out = (out, att_out)
        return out
```

We are going to train a Bi-Directional LSTM to demonstrate the Attention class. The Bidirectional class in Keras returns a tensor with the same number of time steps as the input tensor, but with the forward and backward pass of the LSTM concatenated. Just to remind the reader, the standard dimensions for this use case in Keras are: (batch size, time steps, word embedding dimension).

```
lstm_shape = 60
embed_size = 128
# Define the model
inp = Input(shape=(maxlen,))
emb = Embedding(input_dim=max_features, input_length=maxlen, output_dim=embed_size)(inp)
x = SpatialDropout1D(0.35)(emb)
x = Bidirectional(LSTM(lstm_shape, return_sequences=True, dropout=0.15, recurrent_dropout=0.15))(x)
x, attention = Attention()(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
attention_model = Model(inputs=inp, outputs=attention)  # Model to print out the attention data
```

If we look at the summary, we can trace the dimensions through the model.

```
model.summary()
```

```
__________________________________________________________________________________________________
Layer (type)                     Output Shape         Param #     Connected to
==================================================================================================
input_31 (InputLayer)            (None, 200)          0
__________________________________________________________________________________________________
embedding_31 (Embedding)         (None, 200, 128)     2560000     input_31[0][0]
__________________________________________________________________________________________________
spatial_dropout1d_31 (SpatialDr  (None, 200, 128)     0           embedding_31[0][0]
__________________________________________________________________________________________________
bidirectional_37 (Bidirectional  (None, 200, 120)     90720       spatial_dropout1d_31[0][0]
__________________________________________________________________________________________________
tanh_mlp (Dense)                 (None, 200, 120)     14520       bidirectional_37[0][0]
__________________________________________________________________________________________________
word-level_context (Dense)       (None, 200, 1)       121         tanh_mlp[0][0]
__________________________________________________________________________________________________
flatten_30 (Flatten)             (None, 200)          0           word-level_context[0][0]
__________________________________________________________________________________________________
activation_11 (Activation)       (None, 200)          0           flatten_30[0][0]
__________________________________________________________________________________________________
repeat_vector_29 (RepeatVector)  (None, 120, 200)     0           activation_11[0][0]
__________________________________________________________________________________________________
permute_31 (Permute)             (None, 200, 120)     0           repeat_vector_29[0][0]
__________________________________________________________________________________________________
multiply_28 (Multiply)           (None, 200, 120)     0           bidirectional_37[0][0]
                                                                  permute_31[0][0]
__________________________________________________________________________________________________
expectation_over_words (Lambda)  (None, 120)          0           multiply_28[0][0]
__________________________________________________________________________________________________
dense_55 (Dense)                 (None, 6)             726         expectation_over_words[0][0]
==================================================================================================
Total params: 2,666,087
Trainable params: 2,666,087
Non-trainable params: 0
__________________________________________________________________________________________________
```

```
model.fit(X_t, y, validation_split=.2, epochs=3, verbose=1, batch_size=512)
```

```
Train on 127656 samples, validate on 31915 samples
Epoch 1/3
127656/127656 [==============================] - 192s 2ms/step - loss: 0.1686 - acc: 0.9613 - val_loss: 0.1388 - val_acc: 0.9639
Epoch 2/3
127656/127656 [==============================] - 183s 1ms/step - loss: 0.0939 - acc: 0.9708 - val_loss: 0.0526 - val_acc: 0.9814
Epoch 3/3
127656/127656 [==============================] - 183s 1ms/step - loss: 0.0492 - acc: 0.9820 - val_loss: 0.0480 - val_acc: 0.9828
<keras.callbacks.History at 0x7f5c60dbc690>
```

Because we exposed the Attention layer’s softmax output as a second model, we can identify which words the model found important when classifying each sentence. We’ll define a helper function that prints the words in the sentence paired with their corresponding Attention outputs (i.e., each word’s importance to the prediction).

```
def get_word_importances(text):
    lt = tokenizer.texts_to_sequences([text])
    x = pad_sequences(lt, maxlen=maxlen)
    p = model.predict(x)
    att = attention_model.predict(x)
    return p, [(reverse_token_map.get(word), importance)
               for word, importance in zip(x[0], att[0]) if word in reverse_token_map]
```

The data set from Kaggle is about classifying toxic comments. Instead of using examples from the data (which is filled with vulgarities…), I’ll use a relatively benign example to show Attention in action:

```
# This example is not offensive, according to the model.
get_word_importances('ice cream')
```

```
(array([[0.03289272, 0.00060764, 0.00365122, 0.00135027, 0.00818442,
0.00286318]], dtype=float32),
[('ice', 0.24796312), ('cream', 0.18920079)])
```

```
# This example ALMOST gets labeled toxic due to the word "worthless", but doesn't quite cross the threshold.
get_word_importances('ice cream is worthless')
```

```
(array([[0.4471274 , 0.00662996, 0.07287972, 0.00910207, 0.09902474,
0.02583557]], dtype=float32),
[('ice', 0.028298836),
('cream', 0.024065288),
('is', 0.041414138),
('worthless', 0.8417008)])
```

```
# This example is labeled toxic and it's because of the word "sucks".
get_word_importances('ice cream sucks')
```

```
(array([[0.9310674 , 0.04415609, 0.64935035, 0.02346741, 0.4773681 ,
0.07371927]], dtype=float32),
[('ice', 0.0032987772), ('cream', 0.003724447), ('sucks', 0.9874626)])
```

Attention mechanisms are an exciting technique and an active area of research, with new approaches being published all the time. Attention has been shown to help achieve state-of-the-art performance on many tasks, and it also has the obvious advantage of exposing to a practitioner which parts of the input drove the network’s decision. In a research setting, Attention proves very useful for model debugging by allowing one to inspect predictions and their reasoning. Moreover, having the option to indicate why a model made a prediction could be very important in fields like healthcare, where predictions may be vetted by an expert.

DeepDream made headlines a few years ago for the bizarre images its neural networks generated. As with many scientific advancements, a flashy name won out over the precise one: “dreaming” sounds more exotic than the boring truth, “purposely maximizing the L2 norm of different layers”. The approach was later extended to video by researchers, to even more unsettling effect. How to “dream” has been covered at length, so in this post I’m going to repurpose code from the good people at Keras and focus only on the adversarial side of dreaming. The modified code is included as an appendix below.

Adversarial images have been getting a lot of attention lately. These are images that fool state-of-the-art image classifiers but that a human would easily assign to the proper class. At present, such images generally require access to the classifier you aim to fool, but that could always change in the future. Accordingly, researchers are already exploring the different ways one can exploit an image classifier.

There have been several papers exploring these so-called attacks. The aptly named one-pixel attack changes a single pixel to trick a classifier:

Special glasses have been used to cause a classifier to misclassify celebrities:

One can quickly imagine how this can go awry when self-driving cars need to identify obstacles (or pedestrians). With these publications in mind, I would like to walk through another approach that combines “dreaming” to create adversarial images.

```
from keras.applications import inception_v3
from keras import backend as K
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Repurposed deepdream.py code from the people at Keras
from deepdream_mod import preprocess_image, deprocess_image, eval_loss_and_grads, resize_img, gradient_descent
K.set_learning_phase(0)
# Load the model and include the "top" classification layer
model = inception_v3.InceptionV3(weights='imagenet',
                                 include_top=True)
```

```
Using TensorFlow backend.
```

Today, we’re going to take this image of a lazy cat for our experiment. Let’s see what InceptionV3 thinks right now:

```
orig_img = preprocess_image('lazycat.jpg')
```

```
top_predicted_label = inception_v3.decode_predictions(model.predict(orig_img), top=1)[0][0]
plt.title('Predicted Label: %s (%f)' % (top_predicted_label[1], top_predicted_label[2]))
plt.imshow(deprocess_image(orig_img))
```

The most likely class according to InceptionV3 is an Egyptian_cat. That’s oddly specific (and wrong), but let’s go with it.

Next, we need to choose a fake target class for the adversarial component. In this case, let’s select “coffeepot”.

```
coffeepot = np.zeros_like(model.predict(orig_img))
coffeepot[0,505] = 1.
inception_v3.decode_predictions(coffeepot)
```

```
[[(u'n03063689', u'coffeepot', 1.0),
(u'n15075141', u'toilet_tissue', 0.0),
(u'n02317335', u'starfish', 0.0),
(u'n02391049', u'zebra', 0.0),
(u'n02389026', u'sorrel', 0.0)]]
```

Our loss will have a “dreaming” and “adversarial” component.

To achieve “dreaming”, we fix the network’s weights and perform gradient ascent on the input image itself to maximize the L2 norm of a chosen layer’s output. You can also select multiple layers and combine their norms into a single loss with coefficients, but in this case we will choose a single layer for simplicity.

The interesting part mathematically is as follows: define the input image as \(x_0\) and the output of the \(n\)-th layer as \(f_n(x)\). To dream, we perform the gradient ascent step:

\begin{equation} x_{i+1} = x_{i} + \gamma \frac{\partial}{\partial x} |f_{n}(x)|^2 \end{equation}

where \(\gamma\) is the learning rate and \(|\cdot|\) denotes the L2 norm. For our choice of \(f_n\), we are not limited to activation outputs; we could use the output before the activation, as well as special layers like Batch Normalization.
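As a toy sketch of this ascent step (a standalone illustration with made-up numbers, not the paper’s or Keras’ code), take a single linear “layer” \(f(x) = Ax\) and repeatedly step the input along \(\partial_x |Ax|^2 = 2A^{\top}Ax\); the layer’s output norm grows:

```python
import numpy as np

# Toy sketch of "dreaming": gradient ascent on the INPUT x (weights A fixed)
# to maximize the squared L2 norm of a linear layer output f(x) = A @ x.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # stand-in for a frozen layer's weights
x = rng.normal(size=3)        # stand-in for the input image
gamma = 0.01                  # learning rate

norm_before = np.sum((A @ x) ** 2)
for _ in range(100):
    grad = 2 * A.T @ (A @ x)  # d/dx |A x|^2
    x = x + gamma * grad      # ascent step: x_{i+1} = x_i + gamma * grad
print(np.sum((A @ x) ** 2) > norm_before)  # -> True
```

The real code below does exactly this, except the gradient comes from backpropagation through InceptionV3.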

For the “adversarial” piece, we will also add a component to the loss that will correspond to gradient descent on the predicted label, meaning each iteration we will modify the image to make the neural network think the image is the label we select.

One of the easiest ways to define this is to choose a label and minimize the loss between the true output for an image and the false label. Let us define the model’s output softmax layer as \(f(x)\) for an image \(x\) and let’s let \(o_{fake}\) be the fake label. Let’s assume we are using a categorical crossentropy loss \(\ell\), but any loss would do. Then, we would want to perform the gradient descent step

\begin{equation} x_{i+1} = x_i - \gamma \frac{\partial}{\partial x} \ell( f(x), o_{fake}). \end{equation}

This iteration scheme will adjust the original image to minimize the loss function between the output of the model and the fake label, hence “tricking” the model.
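For intuition, when \(o_{fake}\) is a one-hot vector, the categorical crossentropy reduces to the negative log-probability the model assigns to the chosen fake class:

\begin{equation} \ell( f(x), o_{fake}) = -\sum_k (o_{fake})_k \log f(x)_k = -\log f(x)_{k_{fake}}, \end{equation}

so gradient descent on this term pushes the model’s predicted probability for the fake class toward 1.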

Adding the previous two update rules gives:

\begin{equation} 2 x_{i+1} = 2 x_{i} + \gamma (\frac{\partial}{\partial x} |f_{n}(x)|^2 - \frac{\partial}{\partial x} \ell( f(x), o_{fake}) ) \end{equation}

After massaging the equation, we have the gradient *descent* rule

\begin{equation} x_{i+1} = x_{i} - \frac{\gamma }{2} \frac{\partial}{\partial x} \left (\ell( f(x), o_{fake}) - |f_{n}(x)|^2 \right ). \end{equation} So, our final loss function for adversarial dreaming to define in Keras is:

\begin{equation} \ell_{\text{adv dream}} = \ell( f(x), o_{fake}) - |f_{n}(x)|^2. \end{equation}

For the final loss, we will use a linear combination of a few randomly selected layer output functions for better dreaming effects, but those are a small technical detail. Let’s see the code:

```
from keras import losses
input_class = K.zeros(shape=(1,1000)) # Variable fake class
# Dreaming loss: choose a few layers at random and sum their normalized L2 norms
dream_loss = K.variable(0.)
np.random.seed(1)
np.random.choice(model.layers[:-1], size = 5) # leftover draw; kept only so the RNG state (and hence the layers chosen below) is unchanged
for x in np.random.choice(model.layers[:-1], size = 4):
    x_var = x.output
    dream_loss += np.random.uniform(0,2) * K.sum(K.square(x_var)) / K.prod(K.cast(K.shape(x_var), 'float32'))
# Adversarial loss
adversarial_loss = losses.categorical_crossentropy(input_class, model.output)
# Final loss
adversarial_dream_loss = adversarial_loss - dream_loss
```

```
# Compute the gradients of the loss w.r.t. the input image
dream = model.input # This is the input image
grads = K.gradients(adversarial_dream_loss, dream)[0] # the signs will achieve the desired effect
grads /= K.maximum(K.mean(K.abs(grads)), K.epsilon()) # Normalize for numerical stability
outputs = [adversarial_dream_loss, grads]
# Function to use during dreaming
fetch_loss_and_grads = K.function([dream, input_class], outputs)
```

The code to “dream” is straightforward. The image is shrunk to a smaller size and run through a selected number of gradient descent steps. Then, the resulting image is upscaled, with any detail the original image lost when resizing added back in. This is repeated for the selected number of “octaves”, where during each octave the image has a different size. The octave scale determines the size ratio between successive octaves.

This is really just a more complicated way to say that the image is passed through the neural net for gradient descent at multiple increasing sizes. The octave concept is a nice geometric rule for deciding what those sizes will be that has been found to work well. Sometimes, additional jitter is added to the image at each gradient descent step as well.
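As a quick sanity check of that geometry (assuming InceptionV3’s default 299x299 input; this is a standalone snippet, separate from the full loop), the octave sizes are just a geometric sequence ending at the full image size:

```python
# The octave shapes form a geometric sequence ending at the full image size.
octave_scale, num_octave = 1.2, 5
original_shape = (299, 299)  # InceptionV3's default input size
shapes = [tuple(int(d / octave_scale ** i) for d in original_shape)
          for i in range(num_octave)][::-1]
print(shapes)  # -> [(144, 144), (173, 173), (207, 207), (249, 249), (299, 299)]
```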

```
# Shamelessly adapted from the Keras deep dream example
step = 0.01         # Gradient descent step size
num_octave = 5      # Number of scales at which to run gradient descent
octave_scale = 1.2  # Size ratio between scales
iterations = 15     # Number of descent steps per scale
max_loss = 100.
img = np.copy(orig_img)
if K.image_data_format() == 'channels_first':
    original_shape = img.shape[2:]
else:
    original_shape = img.shape[1:3]
successive_shapes = [original_shape]
for i in range(1, num_octave):
    shape = tuple([int(dim / (octave_scale ** i)) for dim in original_shape])
    successive_shapes.append(shape)
successive_shapes = successive_shapes[::-1]
original_img = np.copy(img)
shrunk_original_img = resize_img(img, successive_shapes[0])
for shape in successive_shapes:
    print('Processing image shape', shape)
    img = resize_img(img, shape)
    img = gradient_descent(img,
                           iterations=iterations,
                           step=step,
                           clz=coffeepot,
                           max_loss=max_loss,
                           loss_func=fetch_loss_and_grads)
    # Upscale
    upscaled_shrunk_original_img = resize_img(shrunk_original_img, shape)
    same_size_original = resize_img(original_img, shape)
    # Add back in detail from the original image lost when resizing
    lost_detail = same_size_original - upscaled_shrunk_original_img
    img += lost_detail
    shrunk_original_img = resize_img(original_img, shape)
```

```
('Processing image shape', (144, 144))
('..Loss value at', 0, ':', array([ 13.78646088], dtype=float32))
...
('..Loss value at', 14, ':', array([ 5.75342369], dtype=float32))
('Processing image shape', (173, 173))
('..Loss value at', 0, ':', array([ 12.50799942], dtype=float32))
...
('..Loss value at', 14, ':', array([ 4.36679077], dtype=float32))
('Processing image shape', (207, 207))
('..Loss value at', 0, ':', array([ 12.14435482], dtype=float32))
...
('..Loss value at', 14, ':', array([-8.84395885], dtype=float32))
('Processing image shape', (249, 249))
('..Loss value at', 0, ':', array([ 6.44068241], dtype=float32))
...
('..Loss value at', 14, ':', array([-9.34126949], dtype=float32))
('Processing image shape', (299, 299))
('..Loss value at', 0, ':', array([ 1.45110321], dtype=float32))
...
('..Loss value at', 14, ':', array([-9.23139477], dtype=float32))
```

```
f, axs= plt.subplots(1,2, figsize=(8,10))
top_predicted_label = inception_v3.decode_predictions(model.predict(orig_img), top=1)[0][0]
axs[0].set_title('Predicted Label: %s (%f)' % (top_predicted_label[1], top_predicted_label[2]))
axs[0].imshow(deprocess_image(orig_img))
top_predicted_label = inception_v3.decode_predictions(model.predict(img), top=1)[0][0]
axs[1].set_title('Predicted Label: %s (%f)' % (top_predicted_label[1], top_predicted_label[2]))
axs[1].imshow(deprocess_image(img))
```

As you can see, we accomplished our initial goal: dreaming in a directed, adversarial way. There are many other approaches that could be attempted, and this is a very popular avenue of research at the moment.

Here are the functions used in this post:

```
# Modification of https://github.com/keras-team/keras/blob/master/examples/deep_dream.py
from keras import backend as K
from keras.applications import inception_v3
from keras.preprocessing.image import load_img, img_to_array
import numpy as np
import scipy.ndimage

def preprocess_image(image_path):
    # Util function to open, resize and format pictures
    # into appropriate tensors.
    img = load_img(image_path)
    img = img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = inception_v3.preprocess_input(img)
    return img

def deprocess_image(x):
    # Util function to convert a tensor into a valid image.
    x = np.copy(x)
    if K.image_data_format() == 'channels_first':
        x = x.reshape((3, x.shape[2], x.shape[3]))
        x = x.transpose((1, 2, 0))
    else:
        x = x.reshape((x.shape[1], x.shape[2], 3))
    x /= 2.
    x += 0.5
    x *= 255.
    x = np.clip(x, 0, 255).astype('uint8')
    return x

def eval_loss_and_grads(x, clz, loss_func):
    outs = loss_func([x, clz])
    loss_value = outs[0]
    grad_values = outs[1]
    return loss_value, grad_values

def resize_img(img, size):
    img = np.copy(img)
    if K.image_data_format() == 'channels_first':
        factors = (1, 1,
                   float(size[0]) / img.shape[2],
                   float(size[1]) / img.shape[3])
    else:
        factors = (1,
                   float(size[0]) / img.shape[1],
                   float(size[1]) / img.shape[2],
                   1)
    return scipy.ndimage.zoom(img, factors, order=1)

def gradient_descent(x, iterations, step, clz=None, max_loss=None, loss_func=None):
    for i in range(iterations):
        loss_value, grad_values = eval_loss_and_grads(x, clz, loss_func)
        if max_loss is not None and loss_value > max_loss:
            break
        print('..Loss value at', i, ':', loss_value)
        x -= step * grad_values
    return x
```

In order to not bog this post down with excessive details, I’m going to assume that the reader is familiar with the concepts of (stochastic) gradient descent and what asynchronous programming looks like.

Stochastic gradient descent is an iterative algorithm. If one wanted to speed it up, they would have to move the iterative calculation to a faster processor (or, say, a GPU). Hogwild! is an approach to parallelizing SGD in order to scale training to even larger data sets. The basic algorithm of Hogwild! is deceptively simple. The only prerequisite is that all processors on the machine can read and update the weights of your model (i.e., of your neural network or linear regression) at the same time. In short, the weights are stored in shared memory with atomic component-wise addition. Each individual processor’s SGD update loop, adapted from the original paper, is described below:

```
loop
    sample an example (X_i, y_i) from your training set
    evaluate the gradient descent step using (X_i, y_i)
    perform the update step component-wise
end loop
```

So, we run this code on multiple processors updating the same model weights at the same time. You can see where the name Hogwild! comes from– the algorithm is like watching a bunch of farm animals bucking around at random, stepping on each other’s toes.

There are a few conditions for this algorithm to work, but for our purposes, one way to phrase a (usually) sufficient condition is that the gradient updates are sparse, meaning only a few elements of the gradient estimate used in SGD are non-zero. This is likely to occur when your data itself is sparse. A sparse update makes it unlikely for different processors to step on each other’s toes, i.e., to try to update the same weight component at the same time. Collisions do still occur, but in practice they have been observed to act as a form of regularization.
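To make the sparsity point concrete, here is a tiny standalone example (hypothetical numbers): with a sparse input \(x\), the squared-error gradient \(2(w\cdot x - y)\,x\) is non-zero only where \(x\) is, so an update touches only those weight components:

```python
import numpy as np

# With sparse input x, the squared-error gradient 2*(w.x - y)*x is nonzero
# only where x is nonzero, so an SGD update touches only a few weights.
n = 10
w = np.ones(n)                 # current weights (arbitrary)
x = np.zeros(n)
x[[2, 7]] = 1.0                # a sparse training example
y = 1.0
grad = 2 * (w @ x - y) * x     # gradient of (w.x - y)^2 w.r.t. w
print(np.nonzero(grad)[0])     # -> [2 7]
```

Two processors working on two such examples will only collide when their non-zero coordinates overlap.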

As an exercise, we will implement linear regression using Hogwild! for training. I have prepared a git repo with the full implementation adhering to sklearn’s API, and I will use pieces of that code here.

To make sure we’re all on the same page, here’s the terminology I will use in the subsequent sections. Our linear regression model \(f\) will take \(x\in\mathbb{R}^n\) as input and output a real number \(y\) via

\begin{equation} f(x) = w \cdot x, \end{equation} where \(w\in\mathbb{R}^n\) is our vector of weights for our model. We will use squared error as our loss function, which for a single example is written as \begin{equation} \ell(x,y) = ( f(x)-y )^2 = ( w\cdot x - y)^2. \end{equation}

To train our model via SGD, we need to calculate the gradient update step so that we may update the weights via: \begin{equation} w_{t+1} = w_t - \lambda G_w (x,y), \end{equation} where \(\lambda\) is the learning rate and \(G_w\) is an estimate of \(\nabla_w \ell\) satisfying: \begin{equation} \mathbb{E}[G_w (x,y)] = \nabla_w \ell(x,y). \end{equation}

In particular,

\begin{equation} G_w (x,y):= 2\, (w\cdot x - y)\, x \in \mathbb{R}^n. \end{equation}
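As a standalone sanity check (not part of the repo), we can verify that \(2(w\cdot x - y)\,x\) really is the gradient of the squared-error loss by comparing it to a central finite-difference estimate:

```python
import numpy as np

# Verify the gradient of (w.x - y)^2 w.r.t. w against finite differences.
rng = np.random.default_rng(0)
n = 5
w = rng.normal(size=n)
x = rng.normal(size=n)
y = 0.3
loss = lambda v: (v @ x - y) ** 2
grad = 2 * (w @ x - y) * x            # analytic gradient

eps = 1e-6
eye = np.eye(n)
fd = np.array([(loss(w + eps * eye[i]) - loss(w - eps * eye[i])) / (2 * eps)
               for i in range(n)])
print(np.allclose(grad, fd, atol=1e-5))  # -> True
```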

First, we quickly generate our training data. We will assume our data was generated by a function of the same form as our model, namely a dot product between a weight vector and our input vector \(x\). This true weight vector I will call the “real \(w\)”.

```
import scipy.sparse
n=10 # number of features
m=20000 # number of training examples
X = scipy.sparse.random(m,n, density=.2).toarray() # Guarantees sparse grad updates
real_w = np.random.uniform(0,1,size=(n,1)) # Define our true weight vector
X = X/X.max() # Normalizing for training
y = np.dot(X,real_w)
```

The multiprocessing library allows you to spin up additional processes to run code in parallel. We will use multiprocessing.Pool’s map function to calculate and apply the gradient update step asynchronously. The key component of the algorithm is that the weight vector is accessible to all processes at the same time and can be accessed without any locking.

First, we need to define the weight vector \(w\) in shared memory so that it can be accessed without locks. We will use the sharedctypes.Array class from multiprocessing for this, and numpy’s frombuffer function to view it as a numpy array.

```
from multiprocessing.sharedctypes import Array
from ctypes import c_double
import numpy as np
coef_shared = Array(c_double,
                    (np.random.normal(size=(n,1)) * 1./np.sqrt(n)).flat,
                    lock=False) # Hogwild!
w = np.frombuffer(coef_shared)
w = w.reshape((n,1))
```

For the final parallelization code, we will use multiprocessing.Pool to map our gradient update out to several workers, which requires exposing the weight vector to those workers.

Our ultimate goal is to perform the gradient update in parallel. To do this, we need to define what this update is. In the git repo, I took a much more sophisticated approach to expose our weight vector \(w\) to multiple processes. To avoid over-complicating this explanation with Python technicalities, I’m going to just use a global variable to take advantage of how multiprocessing.Pool.map works. The map function copies over the entire namespace of the function to a new process, and as I need to expose a single \(w\) to all workers, this will be sufficient.

```
# The calculation has been adjusted to allow for mini-batches
learning_rate = .001
def mse_gradient_step(X_y_tuple):
    global w # Only for instructive purposes!
    X, y = X_y_tuple # Required for how multiprocessing.Pool.map works
    # Calculate the gradient
    err = y.reshape((len(y),1)) - np.dot(X, w)
    grad = -2.*np.dot(np.transpose(X), err) / X.shape[0]
    # Update the nonzero weights one at a time
    for index in np.where(abs(grad) > .01)[0]:
        coef_shared[index] -= learning_rate*grad[index,0]
```

We are going to have to cut the training examples into tuples, one per example, to pass to our workers. We will also be reshaping them to adhere to the way the gradient step is written above. Code is included to allow you to use mini-batches via the batch_size variable.

```
batch_size = 1
examples = [None]*int(X.shape[0]/float(batch_size))
for k in range(int(X.shape[0]/float(batch_size))):
    Xx = X[k*batch_size : (k+1)*batch_size, :].reshape((batch_size, X.shape[1]))
    yy = y[k*batch_size : (k+1)*batch_size].reshape((batch_size, 1))
    examples[k] = (Xx, yy)
```

Parallel programming can mean very difficult, low-level code, but the multiprocessing library has abstracted those details away from us. This makes the final piece of code underwhelming as a big reveal. Anyway, after the definitions above, the final code for Hogwild! is:

```
from multiprocessing import Pool
# Training with Hogwild!
p = Pool(5)
p.map(mse_gradient_step, examples)
print('Loss function on the training set:', np.mean(abs(y-np.dot(X,w))))
print('Difference from the real weight vector:', abs(real_w-w).sum())
```

```
Loss function on the training set: 0.0014203406038
Difference from the real weight vector: 0.0234173242317
```

As you probably know, deep learning is pretty popular right now. The neural networks employed to solve deep learning problems are trained via stochastic gradient descent. Deep learning implies large data sets (or, additionally some type of interactive environment in the case of reinforcement learning). As the scale increases, training neural networks becomes slower and more cumbersome. So, parallelizing the training of neural networks is a big deal.

There are other approaches to parallelizing SGD. Google has popularized Downpour SGD, among others. Downpour SGD basically keeps the weights on a central hub, where several model instances training on different pieces of the data send updates and retrieve refreshed weights. This idea has been extended to reinforcement learning and seen to provide surprising improvements in performance and a decrease in training time. In particular, the A3C model uses asynchronous agents playing multiple game instances that publish gradient updates to a central hub, along with a few other tweaks. This has been seen to alleviate some of the time-correlation issues in reinforcement learning that cause divergence, which were previously tackled via experience replay.

The point is that asynchronous methods are here to stay, and who knows what new advances may come from them in the next few years. At present, there are not many mathematical proofs pertaining to the accuracy gains seen for asynchronous methods. Indeed, many of the approaches are not theoretically justified, but their adoption has been driven by impressive practical results. For mathematicians, it will be interesting to see what new mathematics arise from trying to explain these methods formally.
