To calculate the CUPED estimate, the authors parametrized a function \(Y_{cv}(\theta)\) where \(\mathbb{E}(Y_{cv}(\theta))=\mathbb{E}(Y)\). Then, they chose \(\theta = \arg\min_{\hat\theta} var(Y_{cv}(\hat\theta))\), the value that minimizes the variance. Once you know the trick, it’s not that bad to generalize to more variables.
The first thing you have to do is allow multiple covariates. With a single covariate we had \(\theta\) and \(X\), the covariate. The analog in the multiple covariate case is a collection of \(\theta_i\), one for each covariate \(X_i\). In this setting, \(Y_{cv}\) is defined via \(\begin{equation} Y_{cv} = \bar{Y} - \sum_{i} \theta_i (\bar{X_i} - \mathbb{E}(X_i)) \end{equation}\) So, using rules about the variance of linear combinations, we have \(\begin{equation} var ( Y_{cv}) = \frac{1}{n} \left ( var(Y) + \sum_i \theta_i^2 var(X_i) + 2 \sum_{1 \leq i< j\leq m} \theta_i \theta_j cov(X_i, X_j) - \sum_i 2\theta_i cov(Y,X_i) \right ). \end{equation}\) Here, \(m\) is the number of covariates.
In the previous equation, we used the identity \(var(\bar{Y}) = var(Y)/n\) where \(n\) is the size of the sample. (We are glossing over many steps and definitions to get to the meat and not distract from the main point. For more details, see the CUPED paper’s definitions.)
Now, consider \(g(\boldsymbol \theta):=var(Y_{cv})\) where \(\boldsymbol\theta:= (\theta_1,\dots, \theta_m)\).
Now, to find the minimizing \(\boldsymbol \theta\) of this quadratic function, we need to take the (multivariate) derivative with respect to \(\boldsymbol \theta\) and find the critical points, which in this case will be a minimum.
\[\begin{equation}\frac{\partial g}{\partial \theta_i} = \frac{1}{n}\left( 2\theta_i var(X_i) + 2 \sum_{j\neq i} \theta_j cov(X_i, X_j) - 2 cov(Y, X_i) \right) \end{equation}\]We can write this another way if we consider the gradient vector \(\nabla g(\boldsymbol \theta) = (\frac{\partial g}{\partial \theta_1},\dots, \frac{\partial g}{\partial \theta_m})^T.\)
If we set \(\nabla g(\boldsymbol \theta) = 0\), we can drop the common factors of \(2\) and \(1/n\) from every term (by multiplying both sides by \(n/2\)) and we have
\[\begin{equation} \Sigma \boldsymbol \theta - Z = 0, \end{equation}\]where \(\Sigma\) is the covariance matrix of the \(X_i\) and \(Z\) is the vector with \(Z_i = cov(Y,X_i)\). Therefore, the minimum is achieved when we set
\[\begin{equation} \boldsymbol \theta := \Sigma^{-1}Z. \end{equation}\]Here, we will create a \(Y\) from a linear combination of covariates \(X_i\)’s, i.e., \(Y=\sum_i a_i X_i\). We will then solve for \(\theta\) using the formula above and we will find that the coefficients \(a_i\) and \(\boldsymbol \theta\) are equal.
import numpy as np
import pandas as pd
np.random.seed(8080)
size = 50000
num_X = 10
X_means = np.random.uniform(0, 1, num_X)
Xs = [np.random.normal(X_means[k], .01, size) for k in range(num_X)]
coefficients = np.random.uniform(0,1,num_X)
Y = np.sum([a*b for a,b in zip(coefficients, Xs)],axis=0) \
+ np.random.normal(0,.001,size)
# recover coefficients/theta via the formula, note n must be large
# for theta to be a good approximation of the coefficients
big_cov = np.cov([Y] + Xs) # Calculating sigma and z together
sigma = big_cov[1:,1:]
z = big_cov[1:,0].reshape((num_X,1))
theta = np.dot(np.linalg.inv(sigma),z) # here's the formula!
# This will be small but not exactly 0, due to the error term we added to Y
np.mean(theta.reshape((num_X,))-np.array(coefficients))
-9.495897072566845e-05
Wait, why are they equal? There was a little gap in the logic above: I did not mention why those two values should be equal in the first place, but I showed it happens in this case. It is actually due to the Gauss-Markov Theorem, which states that under a few assumptions on the data distribution, the ordinary least squares (OLS) estimator is unbiased and has the smallest variance among all linear unbiased estimators.
Let’s play that back to really grok it. I was coming up with a linear estimate for \(Y\), in terms of a linear combination \(Y_{cv}\) (so a linear estimator), and I was solving explicitly for coefficients (\(\theta\)) which minimized the variance. Since \(Y\) is already a linear combination, the OLS estimator will recover the coefficients of \(Y\). Gauss-Markov says that the OLS solution is the unique “best linear unbiased estimator”, and so solving for a variance minimizing linear combination (\(Y_{cv}\)) another way will still yield the same coefficients.
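As a quick sanity check (assuming the simulated Xs, Y, and theta from the code above are still in scope), fitting plain OLS should recover essentially the same coefficients as the covariance formula:
import statsmodels.api as sm
# Hypothetical cross-check: compare the OLS slopes to theta from the formula above
X_mat = np.column_stack(Xs)                         # shape (size, num_X)
ols_fit = sm.OLS(Y, sm.add_constant(X_mat)).fit()
np.max(np.abs(ols_fit.params[1:] - theta.ravel()))  # ~0, up to simulation noise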
In the CUPED paper, the authors state that one covariate they found predictive was the first day the user entered the experiment. This is not a pre-experiment variable but it is independent of the treatment. We will craft a simulation that leverages this idea to create covariates to test for our formula above.
In our simulation, we will simulate a user’s query count during a test period under a control and a treatment. The treatment will make a user more likely to query. Depending on the day of the week on their first entrance into the experiment, a user will have a different additive component to their query propensity. This effect only impacts their visits on the first day.
# Using the simulation from a previous post
# with a few additions
import numpy as np
import pandas as pd
np.random.seed(1)
user_sample_mean = 8
user_standard_error = 3
users = 1000
# assign groups
treatment = np.random.choice([0,1], users)
treatment_effect = 2
# Each user's mean query count
user_query_means = np.random.normal(user_sample_mean, user_standard_error, users)
def run_session_experiment(user_means, users, user_assignment, treatment_effect):
    # create each user's query count: treatment effect + user propensity + noise
    queries_per_user = \
        treatment_effect*user_assignment[users] \
        + user_means[users] \
        + np.random.normal(0, 1, len(users))
    queries_per_user = queries_per_user.astype(int)
    queries_per_user[queries_per_user < 0] = 0
    return pd.DataFrame({'queries': queries_per_user, 'user': users,
                         'treatment': user_assignment[users]})
# Generate pre-experiment data for each user once, i.e. over some period
pre_data = run_session_experiment(user_query_means, range(users), treatment, 0)
pre_data.columns = ['pre_' + k if k != 'user' else k for k in pre_data.columns]
pre_data = pre_data[['pre_queries','user']]
# Generate experiment data
day_impact = np.random.uniform(-3, 6, 7)
dfs = []
users_seen = set()
users_first_day = []
for k in range(14):
    # select the users for that day; each user has a 2/14 chance of appearing
    day_users = np.random.choice([0,1], p=[12/14, 2/14], size=users)
    available_users = np.where(day_users == 1)[0]
    day_user_query_means = user_query_means + day_impact[k % 7]
    df = run_session_experiment(day_user_query_means, available_users, treatment,
                                treatment_effect)
    df['first_day'] = k % 7
    df['real_day'] = k
    dfs.append(df)
# We are doing a user level analysis with randomization unit
# equal to user as well. This means groupby's should be on user!
df=pd.concat(dfs)
def get_first_day(x):
    # Get the "first_day" value corresponding to the user's actual first day
    t = np.argmin(x['real_day'].values)
    tmp = x.iloc[t]['first_day']
    return tmp
dd=df.groupby(['user','treatment']).agg({'queries': 'sum'})
dd['first_day']=df.groupby(['user','treatment']).apply(get_first_day)
dd.reset_index(inplace=True) # pandas is really ugly sometimes
# combine data, notice each row is a unique user
data=dd.merge(pre_data, on='user')
# Calculate theta
covariates=pd.get_dummies(data['first_day'],drop_first=False)
covariates.columns = ['day_'+str(k) for k in covariates.columns]
covariates['pre_queries'] = data['pre_queries']
all_data = np.hstack((data[['queries']].values,covariates.values))
big_cov = np.cov(all_data.T) # 9x9 matrix
sigma = big_cov[1:,1:] # 8x8
z = big_cov[1:,0] # 8x1
theta = np.dot(np.linalg.inv(sigma),z)
# Construct CUPED estimate
Y = data['queries'].values.astype(float)
covariates = covariates.astype(float)
Y_cv = Y.copy()
for k in range(covariates.shape[1]):
    Y_cv -= theta[k]*(covariates.values[:,k] - covariates.values[:,k].mean())
# standard errors of the mean, before and after the CUPED adjustment
real_se, reduced_se = np.sqrt(Y.var()/len(Y)), np.sqrt(Y_cv.var()/len(Y))
reduced_se/real_se # ratio of standard errors; < 1 means the variance was reduced
0.8368532147135743
# Let's try OLS!
from statsmodels.formula.api import ols
results= ols('queries ~ pre_queries + C(first_day) + treatment', data).fit()
# Some calculations for the final table
effect = Y[data.treatment==1].mean() - Y[data.treatment==0].mean()
ste = Y[data.treatment==1].std()/np.sqrt(len(Y)) + Y[data.treatment==0].std()/np.sqrt(len(Y))
cuped_effect = Y_cv[data.treatment==1].mean() - Y_cv[data.treatment==0].mean()
cuped_ste = Y_cv[data.treatment==1].std()/np.sqrt(len(Y)) + Y_cv[data.treatment==0].std()/np.sqrt(len(Y))
pd.DataFrame(index=['t-test', 'CUPED', 'OLS'],data=
{
"effect size estimate": [effect, cuped_effect, results.params['treatment']],
"standard error": [ste, cuped_ste, results.bse['treatment']]
}
)
| | effect size estimate | standard error |
| --- | --- | --- |
| t-test | 5.014989 | 1.048969 |
| CUPED | 4.458988 | 0.876270 |
| OLS | 4.486412 | 0.884752 |
If you have a non-user level metric, like the click through rate, the analysis from the first section is mostly unchanged, but when you define \(Y_{cv}\), you must account for those terms differently. Following Appendix B from the CUPED paper, let \(V_{i,+}\) equal the sum of observations of statistic \(V\) for user \(i\), \(n\) the number of users, and \(\bar{V} = \frac{1}{n} \sum_i V_{i,+}\). Let \(M_{i,+}\) be the number of visits for user \(i\). Let \(X\) be another user-level metric we will use for the correction. Then, we have the equation \(\begin{equation}Y_{cv} = \bar{Y} - \theta_1 \left( \frac{\sum_{i} V_{i,+}}{\sum_{i} M_{i,+}} - \mathbb{E}\left[\frac{\sum_{i} V_{i,+}}{\sum_{i} M_{i,+}}\right] \right ) - \theta_2 ( \bar{X} - \mathbb{E}X).\end{equation}\) In this case, we have a page-level metric \(V/M\) being used as a covariate for a user-level metric \(Y\). Is this realistic? Maybe, but at least it will serve to illustrate what to do here.
When you write out the formula for the variance, you will have a term of the form
\[\begin{equation} cov \left (\frac{\sum_{i} V_{i,+}}{\sum_{i} M_{i,+}} , \bar{X}\right ), \end{equation}\]where you need to apply the delta method to compute this covariance. We can write
\(\begin{equation}cov \left ( \frac{\sum_i V_{i,+}}{\sum_i M_{i,+}}, \bar{X} \right ) \approx cov\left (\frac{1}{\mu_{M}}\bar{V} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M} , \bar{X}\right).\end{equation}\) This term can be simplified via properties of covariance. In particular, \(\begin{equation} cov\left (\frac{1}{\mu_{M}}\bar{V} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M} , \bar{X}\right)= \frac{1}{\mu_M}cov(\bar{V},\bar{X}) - \frac{\mu_V}{\mu_{M}^2 }cov(\bar{M},\bar{X}). \end{equation}\) At this point, you are able to calculate all the necessary terms and can substitute this value into the \(\partial g / \partial \theta_i\) equation.
Using the previous presentation of the delta method, we can actually make our lives easier by replacing the ratio \(\bar{V}/\bar{M}\) with the delta estimate \(\frac{1}{\mu_{M}}\bar{V} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M}\) and then calculating the covariance as usual, rather than applying any of the complicated delta formulae.
All that’s left is to convince you of this. First, I’ll take the formula for the delta method from my previous post and show it is equivalent to taking the variance of the vector \(\frac{1}{\mu_{M}}\bar{V} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M}\).
# Define our data
V = np.random.normal(user_sample_mean, user_standard_error, users)
M = np.random.normal(user_sample_mean*2, user_standard_error, users)
X = np.random.normal(0,1,users)
mu_V, mu_M = V.mean(), M.mean()
def _delta_method(clicks, views):
    # Following Kohavi et al., clicks and views are at the per-user level
    # and lined up by user, i.e., row 1 = user 1 for both X and Y
    K = len(clicks)
    X = clicks
    Y = views
    # sample means
    X_bar = X.mean()
    Y_bar = Y.mean()
    # sample variances
    X_var = X.var(ddof=1)
    Y_var = Y.var(ddof=1)
    cov = np.cov(X, Y, ddof=1)[0,1] # cov(X_bar, Y_bar) = 1/n * cov(X,Y)
    # based on Deng et al.
    return (1/K)*(1/Y_bar**2)*(X_var + Y_var*(X_bar/Y_bar)**2 - 2*(X_bar/Y_bar)*cov)
_delta_method(V, M)
4.452882421409211e-05
# Take the variance of the taylor expansion of V/M
(1/users)*np.var((1/mu_M)*V-(mu_V/mu_M**2)*M, ddof=1)
4.452882421409211e-05
How do we use this insight to make our \(\Sigma\) calculation easier? Simply replace any vector of the metric \(V/M\) with the linearized formula \(\frac{\bar{V}}{\mu_{M}} - \frac{\mu_{V}}{\mu_{M}^2}\bar{M}\) and take the covariance as usual.
# Formula for the covariance from the previous section
delta_var = (1/users)*((1/mu_M)*np.cov(V,X)[0,1] - (mu_V/mu_M**2)*np.cov(M,X)[0,1])
delta_var
-5.78229317802757e-06
# Sub in the value for each vector before taking the covariance
pre_cov_est = (1/users)*np.cov((1/mu_M)*V-(mu_V/mu_M**2)*M, X)[0,1]
pre_cov_est
-5.782293178027562e-06
In our case, \(V\) and \(M\) are correlated, so the equivalence I empirically demonstrated is not a sleight of hand that only holds in a degenerate, uncorrelated case.
np.cov(V,M)
array([[9.10327978, 0.10338676],
[0.10338676, 9.29610525]])
I have personally implemented the delta method formula multiple times before realizing I could linearize before calculating the covariance, and so I do not think this calculation is obvious. Upon rereading the original delta method paper, the same formula is very briefly mentioned by Deng et al. in section 2.2 after equation (3) when they define \(W_i\). In that section, they do not explicitly mention using the \(W_i\) to calculate the variance, which is what I do above. However, it turns out that this is a common approach in the field to calculate the delta method, which is why it’s not belabored in the papers as it’s available in several books, including Art Owen’s online book in equation (2.29). Thanks to Deng (of Deng et al.) for sharing the reference.
In this post, we will empirically demonstrate the following things:
A basic \(t\)-test assumes your analysis unit (usually users) equals your randomization unit (also users, as treatments are sticky per user in most cases). The problem is that many of our metrics have mismatched units. A common metric is the click through rate (CTR), commonly defined as
\[\begin{equation}\frac{clicks}{sessions},\end{equation}\]where sessions can be thought of as page views for this discussion. Right here, we have a mismatch in units as these are both at the session level and not the user level. We can fix this by writing each as a “per user” metric, by dividing top and bottom by the number of users:
\[\begin{equation}\frac{clicks}{sessions} = \frac{\frac{clicks}{N}}{\frac{sessions}{N}}=\frac{clicks\ per\ user}{sessions\ per\ user}.\end{equation}\]Simple enough– now our numerator and denominator have the same units, but now when we try to calculate the variance of this term, using np.var
will underestimate the variance. This underestimate leads to a higher false positive rate. In fact, if you perform simulated A/A tests, the p-values will not be uniformly distributed but instead have a slight bump on the left hand side of your plot, near 0.
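To make the false positive claim concrete, here is a small simulation sketch (the parameters are arbitrary and separate from the simulations later in this post): sessions are clustered within users, each user keeps a single assignment, and a naive session-level \(t\)-test rejects far more than 5% of A/A comparisons.
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
pvals = []
for _ in range(500):
    user_rate = np.clip(rng.normal(0.3, 0.15, 2000), 0, 1)  # per-user click propensity
    group = rng.integers(0, 2, 2000)                         # A/A split at the user level
    sessions = rng.choice(2000, 20000)                       # each session samples a user
    clicks = rng.binomial(1, user_rate[sessions])
    pvals.append(stats.ttest_ind(clicks[group[sessions] == 0],
                                 clicks[group[sessions] == 1]).pvalue)
np.mean(np.array(pvals) < 0.05)  # well above the nominal 0.05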
To avoid this and to regain a flat p-value distribution, the guidance in the A/B testing community is to use the delta method. The delta method uses a Taylor expansion (more or less) to find an approximation to the variance which is “close enough” when the number of users \(N\) becomes large. I’ll spare you the formula, but it’s in the paper linked.
However, if you come from a stats background, you may be thinking that you should be able to just use Ordinary Least Squares (OLS) with clustered standard errors. For a while, there was no official document linking these two techniques, the delta method from A/B testing and OLS with cluster-robust standard errors (implemented via the sandwich estimator in the statsmodels
package), until Deng et al. published a proof that they are equivalent. Interestingly, the idea of clustered standard errors was well understood in economics and other fields, but it had not been connected to the A/B testing literature until 2021.
With that said, it’s one thing to understand the techniques individually and their connections at a high level and quite another thing to understand them in practice. I speak from experience: although I knew the theory behind the contents of this post, it took me much longer than I’d like to admit to write it down coherently with simulated examples. And that is the point of this post– to present empirical evidence of the aforementioned claims.
In each example, I will simulate the true variance (and know the effect size by definition) and then aim to recover the proper variance through various techniques. I will also comment on when things go awry to hopefully help others in the same position. Each example will illustrate how to use a \(t\)-test for the calculation, as that method is the most scalable.
As a refresher and to set notation, assume we have two groups \(A\) and \(B\) associated with random variables \(X\) and \(Y\). We define their sample means as \(\bar{X}\) and \(\bar{Y}\) respectively, their variances \(\text{var}(X):=\sigma_X^2\) and \(\text{var}(Y):=\sigma_Y^2\), and their user counts as \(n_A\) and \(n_B\). The simplest formula for the \(t\)-score, which for large \(n\) is approximately equivalent to the \(z\)-score, is \(\begin{equation}\frac{\bar{X}-\bar{Y}}{ \sqrt{ \frac{\sigma_X^2}{n_A} + \frac{\sigma_Y^2}{n_B} }}.\end{equation}\)
Note, because the two groups are independent, \(X\) and \(Y\) are independent, and the denominator comes from the observation that \(\begin{equation}\text{var}(\bar{X}-\bar{Y}) = \text{var}(\bar{X}) + \text{var}(\bar{Y}) = \frac{\text{var}(X)}{n_A}+\frac{\text{var}(Y)}{n_B}.\end{equation}\)
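Purely as a reference implementation of that score (the simulations below compute the pieces inline rather than calling a helper like this):
import numpy as np
def two_sample_score(x, y):
    # x, y: raw observations for groups A and B
    return (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1)/len(x) + y.var(ddof=1)/len(y))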
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
np.random.seed(1)
user_sample_mean = .3
user_standard_error = .15
users = 50000
pre_n_views = 100000
n_views = 100000
# assign groups
treatment = np.random.choice([0,1], users)
treatment_effect = .25
# Identify individual user click percentage mean
user_click_means = np.random.normal(user_sample_mean, user_standard_error, users)
def run_session_experiment(user_click_means, n_views, user_assignment, treatment_effect):
    # determine the user for each session/view
    user_sessions = np.random.choice(users, n_views)
    # create the click probability per session
    click_percent_per_session = treatment_effect*user_assignment[user_sessions] \
        + user_click_means[user_sessions] + np.random.normal(0, .01, n_views)
    # Remove <0 or >1
    click_percent_per_session[click_percent_per_session > 1] = 1
    click_percent_per_session[click_percent_per_session < 0] = 0
    clicks_observed = np.random.binomial(1, click_percent_per_session)
    return pd.DataFrame({'clicks': clicks_observed, 'user': user_sessions,
                         'treatment': user_assignment[user_sessions],
                         'views': np.ones_like(user_sessions)})
# Simulate for "true" standard error
diffs = []
for k in range(100):
    o = run_session_experiment(user_click_means, n_views, treatment, treatment_effect)
    gbd = o.groupby(['user','treatment'], as_index=False).agg({'clicks':'sum', 'views': 'sum'})
    A, B = gbd.treatment == 0, gbd.treatment == 1
    diffs.append((gbd.clicks[B].sum()/gbd.views[B].sum()) - (gbd.clicks[A].sum()/gbd.views[A].sum()))
print('effect','ste', np.array(diffs).mean(), np.array(diffs).std()) # std error
effect ste 0.24817493916173397 0.003371032173547226
data = run_session_experiment(user_click_means, n_views, treatment, treatment_effect)
df = pd.DataFrame({'Y' : data.clicks, 'd' : data.treatment, 'g' : data.user})
o=ols('Y~d',df).fit(cov_type='cluster', cov_kwds={'groups': df['g'], 'use_correction':False})
o.params['d'],o.bse['d'] # see standard error for d
(0.24903284320201524, 0.003304827358515906)
def _delta_method(clicks, views):
    # Following Kohavi et al., clicks and views are at the per-user level
    # and lined up by user, i.e., row 1 = user 1 for both X and Y
    K = len(clicks)
    X = clicks
    Y = views
    # sample means
    X_bar = X.mean()
    Y_bar = Y.mean()
    # sample variances
    X_var = X.var(ddof=1)
    Y_var = Y.var(ddof=1)
    cov = np.cov(X, Y, ddof=1)[0,1] # cov(X_bar, Y_bar) = 1/n * cov(X,Y)
    # based on Deng et al.
    return (1/K)*(1/Y_bar**2)*(X_var + Y_var*(X_bar/Y_bar)**2 - 2*(X_bar/Y_bar)*cov)
# NOTICE: the delta method is only a good approximation when the cluster sizes are roughly constant and the number of users is large
gbd=data.groupby(['user','treatment'], as_index=False).agg({'clicks':'sum', 'views': 'sum'})
A, B = gbd.treatment == 0, gbd.treatment ==1
np.sqrt(_delta_method(gbd.clicks[A], gbd.views.values[A])+_delta_method(gbd.clicks[B], gbd.views[B]))
0.0033049040362566383
The delta method only applies when \(n\) is large, and so for small \(n\), the approximation may not hold. In fact, if you rerun this when \(n<1000\), the delta method isn’t a good approximation.
In particular, consider \(\hat Y_{cv}=Y-\theta (X-\mathbb{E}X)\) from the CUPED paper, where \(\theta = \text{cov}(X,Y)/\text{var}(X)\). Then, \(\theta\) and the proper variance reduction can be gained by constructing the OLS formula \(Y\sim X+d+c\) where \(X\) is your pre-experiment covariate, \(d\) is your treatment indicator, and \(c\) is a constant.
# Changing the counts for a better approximation
# Larger number of users negate the need for CUPED as each user only appears once or twice
users = 10000
#pre_n_views = 200000
n_views = 200000
# Simulate for "true" standard error, we know the exact value of the effect by design
diffs = []
for k in range(100):
    o = run_session_experiment(user_click_means, n_views, treatment, treatment_effect)
    gbd = o.groupby(['user','treatment'], as_index=False).agg({'clicks':'sum', 'views': 'sum'})
    A, B = gbd.treatment == 0, gbd.treatment == 1
    diffs.append(gbd.clicks[B].mean() - gbd.clicks[A].mean())
sim_ste = np.array(diffs).std() # std error
# For a user metric, must do sum-- not mean-- as that considers session counts
# Generate pre-experiment data and experiment data, using same user click means and treatments, etc.
before_data = run_session_experiment(user_click_means, pre_n_views, treatment, 0)
after_data = run_session_experiment(user_click_means, n_views, treatment, treatment_effect)
# To apply the delta method or cuped, the calculations are at the user level, not sessions
bd_agg = before_data.groupby('user', as_index=False).agg({'clicks': 'sum', 'views' : 'sum'})
ad_agg = after_data.groupby('user', as_index=False).agg({'treatment': 'max', 'clicks': 'sum', 'views' : 'sum'})
#assert bd_agg.user.nunique() == ad_agg.user.nunique() # make sure all users sampled
bd_agg.columns = ['pre_' + col for col in bd_agg.columns]
d=ad_agg.join(bd_agg, on='user', how='left')
d['pre_clicks'] = d['pre_clicks'].fillna(d.pre_clicks.mean()) # users not seen in pre-period
cv = np.cov(d['clicks'], d['pre_clicks']) # clicks per user!
theta = cv[0,1]/cv[1,1]
y_hat = d.clicks - theta * (d.pre_clicks - d.pre_clicks.mean())
d['y_hat'] = y_hat
d['y_hat_div_n'] = y_hat
means = d.groupby('treatment').agg({'y_hat':'mean'})
variances = d.groupby('treatment').agg({'y_hat':'var', 'clicks':'count'})
effect=means.loc[1] - means.loc[0]
standard_error = np.sqrt(variances.loc[0, 'y_hat'] / variances.loc[0, 'clicks'] + variances.loc[1, 'y_hat'] / variances.loc[1, 'clicks'])
t_score = (means.loc[1] - means.loc[0]) / standard_error
# variance reduction
var_redux=np.corrcoef(d['clicks'], d['pre_clicks'])[0,1]**2
# Two step OLS, which indirectly estimates theta with OLS
s1 = ols('Y ~ X', pd.DataFrame({'Y': d.clicks, 'X': d.pre_clicks})).fit()
residuals = s1.resid # Y - theta*(X) - C, constant C handles the mean term
d['y_hat'] = residuals
# You can either do the t-test here or use OLS again
means = d.groupby('treatment').agg({'y_hat':'mean'})
variances = d.groupby('treatment').agg({'y_hat':'var', 'clicks':'count'})
effect2 = means.loc[1] - means.loc[0]
standard_error2 = np.sqrt(variances.loc[0, 'y_hat'] / variances.loc[0, 'clicks'] + variances.loc[1, 'y_hat'] / variances.loc[1, 'clicks'])
t_score2 = (means.loc[1] - means.loc[0]) / standard_error2
s2 = ols('resid ~ d', pd.DataFrame({'resid': d['y_hat'], 'd': d.treatment})).fit()
# Direct OLS w/ pre-term stuff, note a constant is added automatically with the formula API
s3 = ols('Y ~ d + X', pd.DataFrame({'Y': d.clicks, 'd': d.treatment, 'X': d.pre_clicks})).fit()
# Direct OLS, no pre-term
s4 = ols('Y ~ d ', pd.DataFrame({'Y': d.clicks, 'd': d.treatment, 'X': d.pre_clicks})).fit()
# t-test, no CUPED
means = d.groupby('treatment').agg({'clicks':'mean'})
variances = d.groupby('treatment').agg({'clicks':'var', 'y_hat':'count'})
effect_none=(means.loc[1] - means.loc[0]).iloc[0]
standard_error_none = np.sqrt(variances.loc[0, 'clicks'] / variances.loc[0, 'y_hat'] + variances.loc[1, 'clicks'] / variances.loc[1, 'y_hat'])
t_score_none = effect_none/standard_error_none
ground_truth_effect = (n_views/users)*treatment_effect
pd.DataFrame(
{ 'method' : ['Ground Truth','Vanilla t-test','Vanilla OLS, no CUPED', 'CUPED with theta formula -> t-test', 'OLS estimates residuals -> t-test', '2 Step OLS', 'OLS'],
'estimated effect': [ground_truth_effect, effect_none, s4.params['d'],effect.iloc[0], effect2.iloc[0], s2.params['d'], s3.params['d']],
'theta': [None, None, None, theta, s1.params['X'], s1.params['X'],s3.params['X']],
't_score' : [ground_truth_effect/sim_ste, t_score_none, s4.tvalues['d'],t_score.iloc[0], t_score2.iloc[0], s2.tvalues['d'],s3.tvalues['d']],
'standard_error': [sim_ste, standard_error_none, s4.bse['d'], standard_error, standard_error2, s2.bse['d'], s3.bse['d']]}
)
| | method | estimated effect | theta | t_score | standard_error |
| --- | --- | --- | --- | --- | --- |
| 0 | Ground Truth | 5.000000 | NaN | 84.642781 | 0.059072 |
| 1 | Vanilla t-test | 4.915146 | NaN | 59.331897 | 0.082842 |
| 2 | Vanilla OLS, no CUPED | 4.915146 | NaN | 59.340633 | 0.082829 |
| 3 | CUPED with theta formula -> t-test | 4.927060 | 0.300307 | 60.330392 | 0.081668 |
| 4 | OLS estimates residuals -> t-test | 4.927060 | 0.300307 | 60.330392 | 0.081668 |
| 5 | 2 Step OLS | 4.927060 | 0.300307 | 60.339849 | 0.081655 |
| 6 | OLS | 4.927445 | 0.310026 | 60.340052 | 0.081661 |
# expected std? due to cuped
np.sqrt((1-var_redux)*standard_error_none**2) # close !
0.08202757811298232
This situation occurs for session or page level metrics like the click through rate \(\left ( \frac{clicks}{sessions} \right )\) when typically the A/B split occurs on users (i.e., you hash the user ID, the treatment is sticky by user, etc.) yet the metric does not contain users at all. As seen in the first example, this is a common situation to use the delta method. However, it’s unclear at first glance how to do this with CUPED. In the paper by Deng et al., Appendix B sketches how to combine the delta method with CUPED.
However, it is complicated, and it could be confusing to implement. In this case, you don’t apply the delta method like in the typical \(t\)-test; rather, Deng et al. apply the delta method (i.e., Taylor’s theorem) to linearize the terms in the variance and covariance and then pass \(n\to\infty\) to get an approximation. This is the same process as deriving the normal delta method, but this time they use it to get a formula in terms of several user level metrics. So, at the end of the day, you don’t calculate the traditional delta method at all for the variances, but rather you use their new formula given in Appendix B, which I have recreated below.
def _extract_variables(d):
    class VariableHolder:
        '''
        Convenient container class to hold common calculations that I have to do several times
        off of data frames and return all those variables to the previous scope.
        d must be aggregated to the iid level, i.e., user
        '''
        n = len(d)
        Y, N = d.clicks.values, d.views.values
        X, M = d.pre_clicks.values, d.pre_views.values
        sigma = np.cov([Y, N, X, M])  # 4x4 covariance of (Y, N, X, M)
        mu_Y, mu_N = Y.mean(), N.mean()
        mu_X, mu_M = X.mean(), M.mean()
        beta1 = np.array([1/mu_N, -mu_Y/mu_N**2, 0, 0]).T
        beta2 = np.array([0, 0, 1/mu_M, -mu_X/mu_M**2]).T
    return VariableHolder()
def calculate_theta(V):
    # assumes the V, the VariableHolder class is passed in
    # formula from Deng et al.; the n's would cancel out
    theta = np.dot(V.beta1, np.matmul(V.sigma, V.beta2)) / np.dot(V.beta2, np.matmul(V.sigma, V.beta2))
    return theta
Expanding the variance of the linear combination, \(var(\hat{Y}_{cv})= var(Y/N) + \theta^2 var(X/M)-2\theta cov(Y/N, X/M)\). Each term has to be calculated via the delta method and summed together… unless we can use OLS!
def var_y_hat(V, theta):
    # assumes the V, the VariableHolder class is passed in
    # formula from Deng et al., using the delta method
    # to arrive at a new formula from Appendix B
    var_Y_div_N = np.dot(V.beta1, np.matmul(V.sigma, V.beta1.T))
    var_X_div_M = np.dot(V.beta2, np.matmul(V.sigma, V.beta2))
    cov = np.dot(V.beta1, np.matmul(V.sigma, V.beta2))
    # can also use traditional delta method for the var_Y_div_N type terms
    return (var_Y_div_N + (theta**2)*var_X_div_M - 2*theta*cov)/V.n
def cuped_exp(user_click_means, user_assignment, treatment_effect, views=150000):
    # Generate pre-experiment data and experiment data, using same user click means and treatments, etc.
    before_data = run_session_experiment(user_click_means, views, treatment, 0)
    after_data = run_session_experiment(user_click_means, views, treatment, treatment_effect)
    # To apply the delta method or cuped, the calculations are at the user level, not sessions
    bd_agg = before_data.groupby('user', as_index=False).agg({'clicks': 'sum', 'views': 'sum'})
    ad_agg = after_data.groupby('user', as_index=False).agg({'treatment': 'max', 'clicks': 'sum', 'views': 'sum'})
    bd_agg.columns = ['pre_' + col for col in bd_agg.columns]
    d = ad_agg.join(bd_agg, on='user', how='left')
    d['pre_clicks'] = d['pre_clicks'].fillna(d.pre_clicks.mean()) # users not seen in pre-period
    d['pre_views'] = d['pre_views'].fillna(d.pre_views.mean()) # users not seen in pre-period
    V = _extract_variables(d)
    theta = calculate_theta(V)
    Y_hat = V.Y/V.N - theta*(V.X/V.M - (V.X/V.M).mean()) # Y_hat mean!
    # create assignment
    d['treatment'] = [user_assignment[k] for k in d.user]
    df = pd.DataFrame({"resids": Y_hat, "d": d.treatment.values})
    results = ols('resids ~ d', df).fit()
    results.params['d'], results.pvalues['d']/2 # divide by 2 for single tail
    A = d.treatment == 0
    # should I do this or not? should it just be 1 variance
    Vt = _extract_variables(d[~A])
    Vc = _extract_variables(d[A])
    # Return difference of means, variance
    return Y_hat[~A].mean() - Y_hat[A].mean(), \
        var_y_hat(Vt, theta) + var_y_hat(Vc, theta), d, theta, results
# Keep the users fixed per experiment, to simulate sampling from a real user base
users = 10000
n_views = 200000 # each user needs enough observations for the covariances to be nonzero
np.random.seed(10)
user_click_means = np.random.normal(user_sample_mean, user_standard_error, users)
treatment = np.random.choice([0,1], users)
cupeds = []
thetas = []
for k in range(100):
    o = cuped_exp(user_click_means, treatment, treatment_effect, views=n_views)
    cupeds.append(o[0])
    thetas.append(o[-2])
# Ground truth values from simulation
cupeds=np.array(cupeds)
cm=cupeds.mean()
cste=cupeds.std()
t_score = cm/cste
# Additional run for traditional t-test using Y_hat and variance correction
o=cuped_exp(user_click_means, treatment, treatment_effect, views=n_views)
results=o[-1]
theta=o[-2]
delta_std = np.sqrt(o[1])
# OLS session level with pre-treatment variables
before_data = run_session_experiment(user_click_means, n_views, treatment, 0) # No treatment in before period
after_data = run_session_experiment(user_click_means, n_views, treatment, treatment_effect)
bd_agg = before_data.groupby('user', as_index=False).agg({'clicks': 'sum', 'views' : 'sum'})
bd_agg.columns = ['pre_'+k if k != 'user' else k for k in bd_agg.columns]
# Join raw session data and fill in missing values
ols_data = after_data.merge(bd_agg, on='user', how='left')
ols_data['pre_clicks'] = ols_data['pre_clicks'].fillna(before_data.clicks.mean()) # users not seen in pre-period
ols_data['pre_views'] = ols_data['pre_views'].fillna(before_data.views.mean()) # users not seen in pre-period
s7=ols('Y ~ X + d', {'Y': ols_data.clicks, 'X' : (ols_data.pre_clicks/ols_data.pre_views - (ols_data.pre_clicks/ols_data.pre_views).mean()) , 'd': ols_data.treatment}).fit(cov_type='cluster', cov_kwds={'groups': ols_data['user'], 'use_correction':False})
# theta distribution and its standard deviation (not standard error)
np.mean(thetas),np.sqrt(np.var(thetas))
(0.679012644432057, 0.008879178301584759)
pd.DataFrame(
{ 'method' : ['Simulation (Ground truth)', 'CUPED w/ Special Theta -> t-test', 'OLS with cluster robust SEs'],
'estimated effect': [cm,results.params['d'], s7.params['d']],
'theta': [np.mean(thetas), theta, s7.params['X']],
't_score' : [t_score, results.tvalues['d'],s7.tvalues['d']],
'standard_error': [cste, delta_std,s7.bse['d']]}
)
| | method | estimated effect | theta | t_score | standard_error |
| --- | --- | --- | --- | --- | --- |
| 0 | Simulation (Ground truth) | 0.249992 | 0.679013 | 101.278105 | 0.002468 |
| 1 | CUPED w/ Special Theta -> t-test | 0.256820 | 0.679159 | 96.635135 | 0.002712 |
| 2 | OLS with cluster robust SEs | 0.249392 | 0.676926 | 92.796036 | 0.002688 |
So, now we have seen the myriad ways that CUPED and OLS tie together, with the delta method sprinkled in for good measure. The remaining question is why use the delta method, or the CUPED calculation, at all if OLS (with cluster-robust standard errors where the situation calls for them) can do it for you. It’s a good question, and the answer lies in performance. The full OLS calculation is more computationally expensive and cannot be parallelized as easily, so scaling OLS to very large datasets can prove challenging. It’s a similar reason as to why practitioners prefer to use the delta method instead of bootstrapping for a variance estimate: both would yield a correct variance approximation, but one takes a much longer time to compute.
I have also specifically avoided the questions of the validity of possibly non-IID data. This topic is covered at length in this paper from Deng et al. and may be a topic for a later post.
In the CUPED paper, Deng et al. state “[t]he single control variate case can be easily generalized to include multiple variables.” Although this may be true, I don’t know if the solution is obvious if you are not already well-versed in the required mathematical background– specifically if you would like to have a formula for the variance. However, it is very straightforward to see that you can add as many covariates as you’d like to the OLS formula and still have your variance estimated for you. Otherwise, you’d have to find the least squares solution to your multiple covariate formula and likely use bootstrapping for the variance. Maybe this is worth exploring in a later post.
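For concreteness, here is a hedged sketch of what that looks like with the formula API; df and the pre_* column names are placeholders, not objects from the simulations above:
from statsmodels.formula.api import ols
# Hypothetical example: extra covariates are just extra terms on the right-hand side
fit = ols('Y ~ d + pre_clicks + pre_views + pre_visit_days', df).fit()
fit.params['d'], fit.bse['d']  # treatment effect estimate and its standard error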
I have to extend a thank you to many people who discussed and simulated this problem alongside me to get to a coherent post: Kevin Foley who introduced me to this connection and championed OLS and the formula API from statsmodels
, Tianwen Chen for discussing the various approaches and reviewing my simulations, and Chiara Capuano for helping me with reference hunting, giving feedback, and proofreading.
Points are not the only defining factor in a game of football. Statistics did not predict the Eagles would win Super Bowl LII, and in fact, most projections undercut the Eagles the entire year. There was something missing in the data– perhaps unmeasurable– that made the final difference. Now, even though fantasy football is solely based on points, point projection models are (understandably) not based on these unmeasurable quantities. What if Nick Foles is on a hot streak? What if Tom Brady has an injured hand and tries to catch a pass? What if the Patriots are still cheating but lose anyway? Sports fans, however, have taken these things into account for years through their gut feelings. Before you storm off citing “unscientific” methods, hear me out. Many aggregation sites look at the average of rankings given by experts. These rankings (as we will see below) are not ordered by total point projections. Sometimes, they are defined by them, but not always.
If these gut feelings satisfy a few assumptions, the central limit theorem still implies that their average will converge to a true ranking (hand wave, hand wave). If we want to understand “hot streaks” or “momentum” quantitatively, we would have to model them explicitly and introduce a new source of bias. Instead, we average the rankings in the hope that those individual sources of bias wash out and we are left with something reasonable to base decisions on. The following analysis compares clustering raw player points (as suggested by the NYTimes article) with taking into account points and expert rankings. We will use data from FantasyPros.com.
In order to not “double dip” in the data, I’m going to use two separate expert sources for my projections vs. my rankings in order to get multiple opinions on the ranking.
The first thing we’re going to do is explore the data and see what the relationship looks like between points and rankings. The data contains projections for the overall points in a season for each player from one group of experts and an average ranking from a second group of experts.
import pandas as pd
d=pd.read_csv('overall.csv')
df= pd.concat([pd.read_csv(x).iloc[1:] for x in ['qb.csv','wr.csv','rb.csv']])
df.head()
| | ATT | ATT.1 | CMP | FL | FPTS | INTS | Player | REC | TDS | TDS.1 | Team | YDS | YDS.1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 587.0 | 58.9 | 376.1 | 2.5 | 321.7 | 9.2 | Aaron Rodgers | NaN | 33.8 | 2.1 | GB | 4,166.6 | 309.3 |
| 2 | 583.7 | 27.7 | 383.8 | 2.4 | 300.8 | 8.7 | Tom Brady | NaN | 32.5 | 0.9 | NE | 4,597.1 | 41.0 |
| 3 | 532.3 | 90.4 | 325.1 | 2.5 | 296.6 | 12.5 | Russell Wilson | NaN | 26.0 | 3.2 | SEA | 3,827.4 | 503.2 |
| 4 | 510.8 | 76.7 | 309.0 | 2.7 | 288.5 | 16.0 | Deshaun Watson | NaN | 27.4 | 3.5 | HOU | 3,781.0 | 442.1 |
| 5 | 512.9 | 120.0 | 299.5 | 2.2 | 287.8 | 14.5 | Cam Newton | NaN | 22.4 | 5.0 | CAR | 3,487.2 | 620.6 |
d.head()
| | Rank | WSID | Overall | Team | Pos | Bye | Best | Worst | Avg | Std Dev | ADP | vs. ADP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | NaN | Todd Gurley | LAR | RB1 | 12.0 | 1.0 | 7.0 | 1.9 | 1.3 | 1.0 | 0.0 |
| 1 | 2.0 | NaN | Le'Veon Bell | PIT | RB2 | 7.0 | 1.0 | 9.0 | 2.5 | 1.8 | 2.0 | 0.0 |
| 2 | 3.0 | NaN | David Johnson | ARI | RB3 | 9.0 | 1.0 | 11.0 | 3.3 | 2.0 | 3.0 | 0.0 |
| 3 | 4.0 | NaN | Antonio Brown | PIT | WR1 | 7.0 | 1.0 | 11.0 | 4.4 | 1.7 | 5.0 | 1.0 |
| 4 | 5.0 | NaN | Ezekiel Elliott | DAL | RB4 | 8.0 | 1.0 | 18.0 | 5.8 | 2.9 | 4.0 | -1.0 |
dd=d.merge(df, left_on='Overall', right_on='Player')
dd = dd[~dd.Avg.isnull()] # remove null projections
import matplotlib.pyplot as plt  # needed for the scatter plots below

def get_color(x):
    try:
        if 'WR' in x:
            return 'b'
        if 'RB' in x:
            return 'r'
        if 'TE' in x:
            return 'g'
        if 'QB' in x:
            return 'black'
        if 'K' in x:
            return 'o'
        return 'y'
    except:
        return 'c'
plt.scatter(dd.Avg, dd.FPTS, c=dd.Pos.apply(get_color))
plt.ylabel('points')
plt.xlabel('rank')
Text(0.5,0,u'rank')
As you can see, the quarterbacks (in black) are clearly separate from the other positions in terms of how their projected points relate to their rank. The most interesting positions to look at are the running backs (red) and wide receivers (blue). You can see there’s almost a linear relationship between the ranking and the number of points scored, which you would expect. A high scoring player should be ranked highly, and most ranking algorithms use projected points as a large factor. However, there is still a curve to the trend of rank vs. points, and that means that the ranking algorithms had more factors than purely points. The question is: can we identify additional insight from this picture to improve our team?
We’re going to exclude QBs from the analysis.
from sklearn.mixture import GaussianMixture

# dt is assumed to be the merged data (dd) with the QBs excluded, per the text above
n_cluster = 8
gm = GaussianMixture(n_components=n_cluster, random_state=7)
gm.fit(dt[['FPTS']])
cluster = gm.predict(dt[['FPTS']])
from matplotlib import cm
import numpy as np
plt.scatter(dt.Avg, dt.FPTS, c=[cm.spectral(k/float(n_cluster)) for k in cluster])
plt.ylabel('points')
plt.xlabel('rank')
Look at the edges of the clusters. They form flat lines because the approach ignores that many of those players have vastly different ranks. In the lower tiers, that won’t matter too much, but in the higher and mid tiers especially it’s an important dimension to think about.
Now, let’s do the analysis again but this time consider the ranking. We will calculate a new rank based off the point totals and the average expert ranking, and then we will cluster based on both the point projection and the original ranking.
To obtain a new ranking, I’m going to orthogonally project the data onto the line connecting the worst player to the best player:
rankbest=dt.loc[0][['Avg','FPTS']].values # best person clearly
ranklast=dt.loc[288][['Avg','FPTS']].values # worst person
plt.plot([rankbest[0],ranklast[0]],[rankbest[1],ranklast[1]])
plt.scatter(dt.Avg, dt.FPTS)
plt.ylabel('points')
plt.xlabel('rank')
This is a somewhat ad hoc choice, made after looking at the data. You could also use a less geometric approach.
The standard way to project a vector \(u\in\mathbb{R}^n\) onto \(v\in\mathbb{R}^n\) is as follows:
\[\begin{equation}\frac{u\cdot v}{v\cdot v} v := cv\end{equation}\]
You simply scale \(v\) by \(c\in\mathbb{R}\) to represent \(u\) projected onto \(v\). Now, \(c\) is the player’s new “rank”.
# Project the data and calculate c
to_proj = rankbest-ranklast
vals = dt.loc[:288][['Avg','FPTS']].values - ranklast
c = np.dot(vals, to_proj)/ np.dot(to_proj, to_proj)
dt['original_clusters'] = cluster
The calculation on the two dimensional data is more straightforward.
gm.fit(dt[['Avg','FPTS']])
dt['gmm2d_cluster'] = gm.predict(dt[['Avg','FPTS']])
We have adjusted the colors to match between the two plots.
We all know who the best players are, and they’re not a surprise. By clustering by points, you’ll always see the best players on top– but everyone knows their top draft picks. We’re trying to find a sound tie breaker for the later draft rounds. It’s clear from the plots that the standard GMM approach is missing some of the considerations from the experts (found in their ranking). Which rank-aware approach is better is an open question, but to my eyes, I like that the 1D GMM approach appears to keep lower ranked players in their own cluster.
Next, we will plot the ranking and tier against projected points as the original New York Times article does to show the different insights you would get from our two dimensional approach vs. the standard approach.
Now, when a player has a different rank vs. other players with the same points, you can see that it affects their tiering. In the bottom right plot for wide receivers, you can see a completely different tiering. To obtain a final ranking, you would sort by tier and then by the normalized rank.
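As a hedged sketch of that final ordering (assuming the dt, c, and gmm2d_cluster objects from the code above line up row for row, and that a higher c means a better player):
# order tiers by their mean projected points, then break ties within a tier by c
dt['new_rank'] = c
tier_order = dt.groupby('gmm2d_cluster')['FPTS'].mean().rank(ascending=False)
dt['tier'] = dt['gmm2d_cluster'].map(tier_order)
final_ranking = dt.sort_values(['tier', 'new_rank'], ascending=[True, False])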
One interesting point is that this approach may allow you to identify overvalued players, because the rank, tier, and point projections aren’t deterministically linked. Look at the bottom right plot for wide receivers. You can see Corey Davis is included in the yellow group even though his rank is much higher than the other players. After a quick Google search, the Bleacher Report states that “…Corey Davis … no longer [qualifies as a sleeper] as [his] stocks have ascended. If anything, [he’s] in danger of getting overdrafted over the weekend.” This insight is not clear from the original ranking/tiers.
Similarly, it also appears to draw different cutoffs as to who are “elite” players. This approach to ranking puts the top three wide receivers in a tier of their own where the original approach implies the first ~10 wide receivers are equivalent. This method would tell you to prioritize drafting the top 3 wide receivers over the others if you are able to.
As always, analytics for fantasy football are more of an art than science due to the limited data, but I hope this approach shows you an alternative way to use Gaussian Mixture Models to combine information about players into tiers.
The residual layer is actually quite simple: add the output of the activation function to the original input to the layer. As a formula, the \(k+1\)th layer is given by: \begin{equation} x_{k+1} = x_{k} + F(x_{k})\end{equation} where \(F\) is the function of the \(k\)th layer and its activation. For example, \(F\) might represent a convolutional layer with a relu activation. This simple formula is a special case of the formula: \begin{equation} x_{k+1} = x_{k} + h F(x_k),\end{equation} which is the formula for the Euler method for solving ordinary differential equations (ODEs) when \(h=1\). (Fun fact: the Euler method was the approach suggested by Katherine Johnson in the movie Hidden Figures).
Let’s take a quick detour on ODEs to appreciate what this connection means. Consider a simplified ODE from physics: we want to model the position \(x\) of a marble. Assume we can calculate its velocity \(x^\prime\) (the derivative of position) at any position \(x\). We know that the marble starts at rest \(x(0)=0\) and that its velocity at time \(t\) depends on its position through the formula: \begin{equation} x^\prime(t) = f(x) \end{equation} \begin{equation}x(0) = 0.\end{equation} The Euler method solves this problem by following the physical intuition: my position at a time very close to the present depends on my current velocity and position. For example, if you are travelling at a velocity of 5 meters per second, and you travel 1 second, your position changes by 5 meters. If we travel \(h\) seconds, we will have travelled \(5h\) meters. As a formula, we said: \begin{equation}x(t+h) = x(t) + h x^\prime(t),\end{equation} but since we know \(x^\prime(t) = f(x)\), we can rewrite this as \begin{equation} x(t+h) = x(t) + h f(x).\end{equation} If you squint at this formula for the Euler method, you can see it looks just like the formula for residual layers!
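As a minimal, self-contained sketch of the Euler method (illustrative only, not tied to any network code):
def euler(f, x0, h, steps):
    # repeatedly apply x <- x + h*f(x); the same update shape as a residual layer with h = 1
    x = x0
    for _ in range(steps):
        x = x + h * f(x)
    return x
# example: x'(t) = -x with x(0) = 1; the exact solution at t = 1 is exp(-1), roughly 0.368
euler(lambda x: -x, 1.0, 0.01, 100)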
This observation has meant three things for designing neural networks:
These are all very exciting ideas, but the second two are particularly interesting– and intimately connected. The reason one can create arbitrarily deep networks with a finite memory footprint is by designing neural networks based on stable ODEs and numerical discretizations.
Rather than solely using the weights to define transformations (i.e. features), the dynamics of the underlying ODE are explicitly considered for determining features. Some implementations use multiple “ODE blocks” (my term, not standard) using different \(F\)’s, with each block responsible for 5-10+ layers of the final network. This means that the input will propagate according to the ODE for 5-10 time steps, and the trajectory helps to determine the learned features. The concept of stability comes into play here. Put simply, an ODE is stable if small perturbations of the initial position result in small perturbations of the final (asymptotic) position. However, there are stable ODEs that push all solutions to a single point, which is also not helpful for creating features.
For neural networks, the most important style of ODE to consider is the following: \begin{equation} x^\prime(t) = \sigma( K x(t) + b):=F(x)\end{equation} where \(\sigma\) is the activation function of a layer, \(K\) the matrix representing the weights of the layer and \(b\) the bias term. (Note, convolutional transformations can be represented by a matrix as well).
The stability of a (linear) system of ODEs is determined by the eigenvalues of the Jacobian of \(F\). The important thing to know is that, in this case, the Jacobian works out to \(\mathrm{diag}(\sigma^\prime(Kx+b))\,K\) (by the chain rule), so after some calculations one can show that its eigenvalues are determined by the weight matrix \(K\).
One way to understand the dynamics of an ODE is through phase diagrams. Phase diagrams are plots of the direction of motion of a system at a given point, indicated by an arrow. Then, some trajectories of the ODE are plotted to visualize what would happen given an initial condition. Here’s a plot from Haber and Ruthotto 2017 analyzing trajectories of the above ODE after 10 steps given different eigenvalues for a \(K\in\mathbb{R}^{2\times2}\):
From stability theory, the authors knew that eigenvalues of \(K\) with positive real part would lead to an unstable ODE, which would not be ideal for learning features, and likewise, eigenvalues with negative real part would collapse all initial conditions to a single point. Zero-valued eigenvalues would lead to a dynamical system that doesn’t move at all! They suggest using matrices whose eigenvalues have zero real part but a nonzero imaginary component, leading to the plot on the right. This can be done by designing the matrix to have these imaginary eigenvalues.
This theory has led researchers to propose network designs that are inherently stable, either by construction via the weights (e.g., using antisymmetric matrices) or by choice of ODE. This is a very new approach for neural networks and it is exciting to see mathematical theory taking center stage. For a more in depth analysis, see Haber and Ruthotto 2017.
As we can see from the formula: \begin{equation} x_{k+1} = x_k +hF(x_k) \end{equation} the \(k+1\)th layer is defined by applying \(F\) (the weights and activation) and then adding it to \(x_k\). If you fix \(F\) for multiple layers, you can effectively have unlimited layers with only the memory footprint required to store the weights of \(F\).
Major libraries today (e.g., TensorFlow) rely on symbolic gradient calculations (and in some cases “automatic differentiation”) for their SGD updates. As these are calculated off of the computational graph, even such arbitrarily deep networks could create large graphs resulting in memory issues. One solution would be to train a model on larger hardware and then create the neural network evaluation function in a separate implementation which only loads one set of weights into memory per layer. Then, an arbitrarily deep network could fit on much smaller hardware. It is possible that a new implementation could be built to train arbitrarily deep nets on hardware with small memory footprints by applying the chain rule / backpropagation in a more componentwise fashion, but it would increase training time.
At a high level, our approach will combine arbitrary depth with stable propagation. We will reuse each layer multiple times to maximize our depth while keeping our memory footprint small. Stability will be guaranteed by the underlying ODE we will use.
For stability we will follow the approach first suggested in Haber and Ruthotto 2017. The idea is to use a Hamiltonian system that is designed to be stable. This equation will be used as the underlying ODE:
\begin{equation}
\frac{\partial}{\partial t}
\begin{pmatrix} y \\ z \end{pmatrix}(t) =
\sigma\left( \begin{pmatrix} 0 & K(t) \\ -K(t)^T & 0 \end{pmatrix}
\begin{pmatrix} y \\ z \end{pmatrix}(t) + b(t) \right); \qquad
\begin{pmatrix} y \\ z \end{pmatrix}(0) =
\begin{pmatrix} y_0 \\ 0 \end{pmatrix}.
\end{equation}
There is a lot here, so let’s go through it step by step: first, \(K\) and \(b\) represent the weights and the bias term, respectively, for our convolutional layer. (Note: a convolution can be represented as a matrix, which we use here.) They could depend on the number of times \(t\) you pass your input \(y(0)=y_0\) through the layer. The \(z\) represents an auxiliary variable introduced to improve the stability of the system, but practically \(z\) will represent the additional filters we add through zero padding. In particular, we will take an image that has been transformed into a 4D tensor and double (say) the number of channels through zero padding. Then, our layer will define \(y\) as the non-zero channels and \(z\) as the zero channels. This is a practice known as “channel-wise partition” for \(y\) and \(z\), and the concept first appeared in Dinh et al. 2016.
The reason for adding this \(z\) is twofold. First, zero padding takes the place of using additional features in a convolutional layer. Instead of making a new Conv2D layer with double the filters, we’ll simply double the channels of the tensor through zero padding. Second, by introducing \(z\), we can make our dynamical system stable, because the Jacobian of this equation has no eigenvalues with positive real part. This is proven in Haber and Ruthotto 2017. The eigenvalues of the Jacobian are determined by the matrix
\begin{equation}
\begin{pmatrix} 0 & K(t) \\ -K(t)^T & 0 \end{pmatrix},
\end{equation}
and by construction, this matrix is antisymmetric (\(A^T = -A\)), which implies that any real eigenvalues of the matrix are \(0\). It follows that the forward propagation of this dynamical system is stable, which in turn implies stability in training.
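A quick numerical check of that claim (a standalone sketch, not part of the network code):
import numpy as np
W = np.random.randn(3, 3)
J = np.block([[np.zeros((3, 3)), W], [-W.T, np.zeros((3, 3))]])
np.allclose(J.T, -J)                       # True: the block matrix is antisymmetric
np.max(np.abs(np.linalg.eigvals(J).real))  # ~0: its eigenvalues are purely imaginary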
By leveraging the same concept behind arbitrary depth, we can reduce the memory footprint of the network. We do this by reusing a layer multiple times, relying on the propagation of the ODE to learn more complicated features. To clarify, we do this by running an image through the layer, and then passing the output back into the layer. So then, a single layer can be “unrolled” in this fashion to actually represent multiple layers. In our case, \(t\) will indicate the number of times we have passed the image through the layer. The key to this trick is to define \(K(t)\equiv K\) and \(b(t)\equiv b\) for all \(t\).
We implement this by extending the convolutional layer in Keras. The important piece to keep in mind is that \(K\) is implemented as a convolution and \(K^T\) is a transposed convolution using the same weights. Then, we implement the Euler forward method for both \(y\) and \(z\), and then pass them back into the convolutional layers until we meet the number of unrolls, stored in the “unroll_length” variable. The Euler forward step in our case looks like this: \begin{equation} y(t+h) = y(t) + h \sigma( K z(t) + b_1(t) ) \end{equation} \begin{equation} z(t+h) = z(t) + h \sigma( -K^T y(t) + b_2(t) ). \end{equation} It’s important to choose a small enough \(h\) for stability to hold– which you can accomplish through trial and error on a validation set. To train the network below, we set \(h=.15\).
Here we replicate functions of the final classes for illustrative purposes. The full class definitions can be found here.
class HamiltonianConv2D(Conv2D):
    def call(self, inputs):
        h = self.h
        tmp, otmp = tf.split(inputs, num_or_size_splits=2, axis=-1)  # Channel-wise partition for y and z
        for k in range(self.unroll_length):
            y_f_tmp = K.conv2d(
                otmp,
                self.kernel,
                strides=self.strides,
                padding=self.padding,
                data_format=self.data_format,
                dilation_rate=self.dilation_rate)
            x_f_tmp = -K.conv2d_transpose(
                x=tmp,
                kernel=self.kernel,
                strides=self.strides,
                output_shape=K.shape(tmp),
                padding=self.padding,
                data_format=self.data_format)
            if self.use_bias:
                y_f_tmp = K.bias_add(
                    y_f_tmp,
                    self.bias,
                    data_format=self.data_format)
                x_f_tmp = K.bias_add(
                    x_f_tmp,
                    self.bias2,
                    data_format=self.data_format)
            # Euler Forward step
            tmp = tmp + h*self.activation(y_f_tmp)
            otmp = otmp + h*self.activation(x_f_tmp)
        out = K.concatenate([tmp, otmp], axis=-1)
        return out
We also have to implement the zero padding for adding channels to our tensors. This is done via the following layer:
class ChannelZeroPadding(Layer):
    def call(self, x):
        return K.concatenate([x] + [K.zeros_like(x) for k in range(self.padding-1)], axis=-1)
In the original ResNet paper’s implementation section, they select a batch size of 256 for 64,000 mini-batch updates. In my implementation, my computer could not handle that large a batch, so I use a batch size of 256/8 = 32 and multiply the number of mini-batch updates by 8 to ensure my training sees each data point the same number of times. I adopt a learning rate decay schedule similar to the paper’s, outlined in the experiments section: I will divide the learning rate by 10 at iterations 32,000 and 48,000.
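My exact schedule lives in the linked notebook; as a rough sketch (the class and parameter names here are my own, not from the notebook), a step decay keyed off mini-batch updates can be written as a Keras callback:
from keras.callbacks import Callback
from keras import backend as K

class StepDecay(Callback):
    # divide the base learning rate by 10 each time the update count passes a drop point
    def __init__(self, base_lr=0.1, drops=(32000, 48000)):
        super(StepDecay, self).__init__()
        self.base_lr, self.drops, self.iters = base_lr, drops, 0
    def on_batch_begin(self, batch, logs=None):
        self.iters += 1
        lr = self.base_lr * (0.1 ** sum(self.iters >= d for d in self.drops))
        K.set_value(self.model.optimizer.lr, lr)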
There are also some differences when training SOTA residual networks compared to (say) the current standard for reinforcement learning, which I found surprising. They are (in Keras lingo):
I follow the standards in the image classification community for my training regimen, which includes a data generation pipeline that perturbs the input images. The notebook for the full training code and all hyperparameters in Keras can be found here.
The final network is comprised of 5 ODE-inspired “units” made up of iterations of the layer defined above. The approach is similar to the architecture found in this image below from Chang et al. 2017:
You can see that the units are followed by (Average) Pooling and Zero Padding as we defined in the previous section. For some of our units, we skip the pooling layer to be able to get more channels without shrinking the image too much.
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 32, 32, 3) 0
_________________________________________________________________
conv1_pad (ZeroPadding2D) (None, 38, 38, 3) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 32, 32, 16) 2368
_________________________________________________________________
batch_normalization_1 (Batch (None, 32, 32, 16) 64
_________________________________________________________________
activation_1 (Activation) (None, 32, 32, 16) 0
_________________________________________________________________
hamiltonian_conv2d_1 (Hamilt (None, 32, 32, 16) 592
_________________________________________________________________
average_pooling2d_1 (Average (None, 16, 16, 16) 0
_________________________________________________________________
channel_zero_padding_1 (Chan (None, 16, 16, 64) 0
_________________________________________________________________
hamiltonian_conv2d_2 (Hamilt (None, 16, 16, 64) 9280
_________________________________________________________________
channel_zero_padding_2 (Chan (None, 16, 16, 128) 0
_________________________________________________________________
hamiltonian_conv2d_3 (Hamilt (None, 16, 16, 128) 36992
_________________________________________________________________
channel_zero_padding_3 (Chan (None, 16, 16, 256) 0
_________________________________________________________________
hamiltonian_conv2d_4 (Hamilt (None, 16, 16, 256) 147712
_________________________________________________________________
average_pooling2d_2 (Average (None, 8, 8, 256) 0
_________________________________________________________________
channel_zero_padding_4 (Chan (None, 8, 8, 512) 0
_________________________________________________________________
hamiltonian_conv2d_5 (Hamilt (None, 8, 8, 512) 590336
_________________________________________________________________
average_pooling2d_3 (Average (None, 4, 4, 512) 0
_________________________________________________________________
channel_zero_padding_5 (Chan (None, 4, 4, 1024) 0
_________________________________________________________________
hamiltonian_conv2d_6 (Hamilt (None, 4, 4, 1024) 2360320
_________________________________________________________________
average_pooling2d_4 (Average (None, 1, 1, 1024) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 1024) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 10250
=================================================================
Total params: 3,157,914
Trainable params: 3,157,882
Non-trainable params: 32
_________________________________________________________________
For comparison, the ResNet 50 implementation in Keras has 25,636,712 parameters and 168 layers.
For training, I will use the typical data augmentation techniques and they can be seen in the notebook.
The final accuracy on the test set is 90.37%.
For simplicity, I used the Euler forward method for discretizing the ODE. This is based on numerical quadrature:
\begin{equation} x^\prime(t) = f(x) \Rightarrow \int_t^{t+h} x^\prime(z)\,dz = \int_t^{t+h} f(x(z))\,dz\Rightarrow x(t+h)-x(t) = \int_t^{t+h} f(x(z))\,dz.\end{equation} In other words, \begin{equation} x(t+h) = x(t) + \int_{t}^{t+h}f(x(z))\,dz \approx x(t) + hf(x(t)).\end{equation} The \(\approx\) sign comes from a simple numerical quadrature (integration) rule. Other rules, like the midpoint rule or the leapfrog rule, have been used to discretize ODEs; better approximations to the integral lead to better approximations to the ODE. For Hamiltonian systems, the Verlet rule is used, which leads to the approach in Chang et al. 2017.
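To make the discretization concrete, here is a toy check (not from the notebook) of the Euler forward step on \(x^\prime = -x\), whose exact solution is \(e^{-t}\):

import numpy as np

# Toy illustration of the Euler forward step x(t+h) = x(t) + h*f(x(t))
# applied to x'(t) = -x(t), whose exact solution is x(t) = exp(-t).
def euler_forward(f, x0, h, steps):
    x = x0
    for _ in range(steps):
        x = x + h * f(x)
    return x

h = 0.15  # the same step size used to train the network above
print(euler_forward(lambda x: -x, 1.0, h, steps=10))  # approximately 0.197
print(np.exp(-h * 10))                                # exact value, approximately 0.223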
Deep Reinforcement Learning sounds intractable for a layperson. “Reinforcement learning” alone sounds scary (I’m reinforcing WHAT again? Random actions? Bah!), and then you add the word “deep” and most people simply drop off. This talk details what I’ve learned from replicating the original NIPS/Nature papers on Deep Reinforcement Learning for playing Atari games and walks through some python implementations of the basic steps for reinforcement learning code. The talk uses Keras (TensorFlow backend) and tries (albeit, unsuccessfully) to hide advanced mathematical formulations of the problem. By the end of the talk, we will all have taken a modest step forward in learning deep reinforcement learning.
You can access the slides here.
Attention manifests differently in different contexts. In some NLP problems, Attention allows the model to put more emphasis on different words during prediction, e.g. from Yang et al.:
Interestingly, there is increasing research that calls this mechanism “Memory”, which some claim is a better title for such a versatile layer. Indeed, the Attention layer can allow a model to “look back” at previous examples that are relevant at prediction time, and the mechanism has been used in that way for so-called Memory Networks. A discussion of Memory Networks is outside the scope of this post, so for our purposes, we will discuss Attention named as such and explore the properties suggested by that name.
Mathematically, the Attention mechanism is deceptively simple given its far-reaching applications. I am going to go over the style of Attention that is common in NLP applications. A good overview of different approaches can be found here. In this case, Attention can be broken down into a few key steps:
Now we will describe this in equations. The general flavor of the notation follows the paper on Hierarchical Attention by Yang et al., but includes insight from the notation used in this paper by Li et al. as well.
Let \(h_{t}\) be the word representation (either an embedding or a (concatenation of) hidden state(s) of an RNN) of dimension \(K_w\) for the word at position \(t\) in some sentence padded to a length of \(M\). Then we can describe the steps via
\begin{equation} u_t = \tanh(W h_t + b)\end{equation} \begin{equation} \alpha_t = \text{softmax}(v^T u_t)\end{equation} \begin{equation} s = \sum_{t=1}^{M} \alpha_t h_t\end{equation}
If we define \(W\in \mathbb{R}^{K_w \times K_w}\) and \(v,b\in\mathbb{R}^{K_w}\), then \(u_t \in \mathbb{R}^{K_w}\), \(\alpha_t \in \mathbb{R}\) and \(s\in\mathbb{R}^{K_w}\). (Note: this is the multiplicative application of attention.) Even though there is a lot of notation, it is still just three equations. How can models improve so much when using attention?
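Before digging into the geometry, here is a small NumPy sketch (with made-up dimensions and random values) of the three equations, just to make the shapes concrete:

import numpy as np

# Toy sentence of M=4 words with word dimension K_w=3 (all values made up).
np.random.seed(0)
M, K_w = 4, 3
h = np.random.randn(M, K_w)          # word representations h_t, stacked as rows
W = np.random.randn(K_w, K_w)
b = np.random.randn(K_w)
v = np.random.randn(K_w)

u = np.tanh(h.dot(W.T) + b)                      # u_t = tanh(W h_t + b)
scores = u.dot(v)                                # v^T u_t for each t
alpha = np.exp(scores) / np.exp(scores).sum()    # softmax over the M words
s = (alpha[:, None] * h).sum(axis=0)             # s = sum_t alpha_t h_t
print(alpha, s)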
To answer that question, we’re going to explain what each step actually does geometrically, and this will illuminate names like “word-level context vector”. I recommend reading my post on how neural networks bend and twist the input space, so this next bit will make a lot more sense.
First, remember that \(h_t\) is an embedding or representation of a word in a vector space. That’s a fancy way of saying that we pick some N-dimensional space, and we pick a point for every word (hopefully in a clever way). For \(N=2\), we would be drawing a bunch of dots on a piece of paper to represent our words. Every layer of the neural net moves these dots around (and transforms the paper itself!), and this is the key to understanding how attention works. The first piece
\begin{equation}u_t = tanh(Wh_t+b)\end{equation}
takes the word, rotates/scales it and then translates it. So, this layer rearranges the word representation in its current vector space. The tanh activation then twists and bends (but does not break!) the vector space into a manifold. Because \(W:\mathbb{R}^{K_w}\to\mathbb{R}^{K_w}\) (i.e., it does not change the dimension of \(h_t\)), then there is no information lost in this step. If \(W\) projected \(h_t\) to a lower dimensional space, then we would lose information from our word embedding. Embedding \(h_t\) into a higher dimensional space would not give us any new information about \(h_t\), but it would add computational complexity, so no one does that.
The word-level context vector is then applied via a dot product: \begin{equation}v^T u_t\end{equation} As both \(v,u_t\in \mathbb{R}^{K_w}\), it is \(v\)’s job to learn information of this new vector space induced by the previous layer. You can think of \(v\) as combining the components of \(u_t\) according to their relevance to the problem at hand. Geometrically, you can imagine that \(v\) “emphasizes” the important dimensions of the vector space.
Let’s see an example. Say we are working on a classification task: is this word an animal or not. Imagine you have a word embedding defined by the following image:
%matplotlib inline
import matplotlib.pyplot as plt

vectors = {'dog': [.1, .5], 'cat': [.5, .48], 'snow': [.5, .02], 'rain': [.4, .05]}
for word, item in vectors.items():
    plt.annotate(word, xy=item)
    plt.scatter(item[0], item[1])
In this toy example (assuming this were all of our data), the x-axis component of the word embedding is NOT useful for the task of classifying animals. So, a choice of \(v\) might be the vector \([0,1]\), which would pluck out the clearly important \(y\) dimension and remove the \(x\). The resulting clustering would look like this:
# Here we apply v, which zeros out the first dimension
for word, item in vectors.items():
    plt.annotate(word, xy=(0, item[1]))
    plt.scatter(0, item[1])
We can clearly see that \(v\) has identified important information in the vector space representing the words. This is the word-level context vector’s role: it picks out the important parts of the vector space for our model to focus on. This means that the previous layer will move the input vector space around so that \(v\) can easily pick out the important words! In fact, the softmax values of “dog” and “cat” will be larger in this case than those of “rain” and “snow”. That’s exactly what we want.
The next layer does not have a particularly geometric flavor to it. The application of the softmax is a monotonic transformation (i.e. the ordering of the importances picked out by \(v^Tu_t\) is preserved). This allows the final representation of the input sentence to be a linear combination of the \(h_t\), weighting the word representations by their importance. There is, however, another way to view the Attention output…
The output is easy to understand probabilistically. Because \(\alpha=[\alpha_t]_{t=1}^M\) sums to \(1\), this yields a probabilistic interpretation of Attention. Each weight \(\alpha_t\) indicates the probability that the word is important to the problem at hand. There are some versions of attention (“hard attention”) which will actually sample a word according to this probability and return only that word from the attention layer. In our case, the resulting vector is the expectation of the important words. In particular, if our random variable \(X\) is defined to take on the value \(h_t\) with probability \(\alpha_t\), then the output of our Attention mechanism is actually \(\mathbb{E}[X]\).
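As a quick illustration (with made-up numbers), soft attention returns the expectation \(\mathbb{E}[X]\), while hard attention samples a single word vector according to \(\alpha\):

import numpy as np

np.random.seed(1)
alpha = np.array([0.1, 0.6, 0.2, 0.1])   # attention weights, summing to 1
h = np.random.randn(4, 3)                # word representations h_t

soft_output = (alpha[:, None] * h).sum(axis=0)    # E[X]: the usual attention output
hard_output = h[np.random.choice(4, p=alpha)]     # "hard" attention: one sampled h_t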
To demonstrate this, we’re going to use data from a recent Kaggle competition on using NLP to detect toxic comments. The competition aims to label a text with 6 different (somewhat correlated) labels: “toxic”, “severe_toxic”, “obscene”, “threat”, “insult”, and “identity_hate”. We will expect the Attention layer to give more importance to negative words that could indicate one of those labels has occurred. Let’s do a quick data load with some boilerplate code…
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
train=pd.read_csv('train.csv')
train=train.sample(frac=1)
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
# Example of a data point
train.iloc[159552]
id 07ea42c192ead7d9
comment_text "\n\nafd\nHey, MONGO, why dont you put a proce...
toxic 0
severe_toxic 0
obscene 0
threat 0
insult 0
identity_hate 0
Name: 2928, dtype: object
More NLP boilerplate to tokenize and pad the sequences:
max_features = 20000
maxlen = 200
list_sentences = train['comment_text'].values
# Tokenize
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences)
# Pad
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
There are many versions of attention out there that actually implement a custom Keras layer and do the calculations with low-level calls to the Keras backend. In my implementation, I’d like to avoid this and instead use Keras layers to build up the Attention layer in an attempt to demystify what is going on.
For our implementation, we’re going to follow the formulas used above. The approach was inspired by this repository.
from keras.layers import Activation, Concatenate, Permute, SpatialDropout1D, RepeatVector, LSTM, Bidirectional, Multiply, Lambda, Dense, Dropout, Input, Flatten, Embedding
from keras.models import Model
import keras.backend as K

class Attention:
    def __call__(self, inp, combine=True, return_attention=True):
        # Expects inp to be of size (?, number of words, embedding dimension)
        repeat_size = int(inp.shape[-1])
        # Map through 1 Layer MLP
        x_a = Dense(repeat_size, kernel_initializer='glorot_uniform', activation="tanh", name="tanh_mlp")(inp)
        # Dot with word-level vector
        x_a = Dense(1, kernel_initializer='glorot_uniform', activation='linear', name="word-level_context")(x_a)
        x_a = Flatten()(x_a)  # x_a is of shape (?,200,1), we flatten it to be (?,200)
        att_out = Activation('softmax')(x_a)
        # Clever trick to do elementwise multiplication of alpha_t with the correct h_t:
        # RepeatVector will blow it out to be (?,120, 200)
        # Then, Permute will swap it to (?,200,120) where each row (?,k,120) is a copy of a_t[k]
        # Then, Multiply performs elementwise multiplication to apply the same a_t to each
        # dimension of the respective word vector
        x_a2 = RepeatVector(repeat_size)(att_out)
        x_a2 = Permute([2, 1])(x_a2)
        out = Multiply()([inp, x_a2])
        if combine:
            # Now we sum over the resulting word representations
            out = Lambda(lambda x: K.sum(x, axis=1), name='expectation_over_words')(out)
        if return_attention:
            out = (out, att_out)
        return out
We are going to train a Bi-Directional LSTM to demonstrate the Attention class. The Bidirectional class in Keras returns a tensor with the same number of time steps as the input tensor, but with the forward and backward pass of the LSTM concatenated. Just to remind the reader, the standard dimensions for this use case in Keras are: (batch size, time steps, word embedding dimension).
lstm_shape = 60
embed_size = 128
# Define the model
inp = Input(shape=(maxlen,))
emb = Embedding(input_dim=max_features, input_length = maxlen, output_dim=embed_size)(inp)
x = SpatialDropout1D(0.35)(emb)
x = Bidirectional(LSTM(lstm_shape, return_sequences=True, dropout=0.15, recurrent_dropout=0.15))(x)
x, attention = Attention()(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
attention_model = Model(inputs=inp, outputs=attention) # Model to print out the attention data
If we look at the summary, we can trace the dimensions through the model.
model.summary()
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_31 (InputLayer) (None, 200) 0
__________________________________________________________________________________________________
embedding_31 (Embedding) (None, 200, 128) 2560000 input_31[0][0]
__________________________________________________________________________________________________
spatial_dropout1d_31 (SpatialDr (None, 200, 128) 0 embedding_31[0][0]
__________________________________________________________________________________________________
bidirectional_37 (Bidirectional (None, 200, 120) 90720 spatial_dropout1d_31[0][0]
__________________________________________________________________________________________________
tanh_mlp (Dense) (None, 200, 120) 14520 bidirectional_37[0][0]
__________________________________________________________________________________________________
word-level_context (Dense) (None, 200, 1) 121 tanh_mlp[0][0]
__________________________________________________________________________________________________
flatten_30 (Flatten) (None, 200) 0 word-level_context[0][0]
__________________________________________________________________________________________________
activation_11 (Activation) (None, 200) 0 flatten_30[0][0]
__________________________________________________________________________________________________
repeat_vector_29 (RepeatVector) (None, 120, 200) 0 activation_11[0][0]
__________________________________________________________________________________________________
permute_31 (Permute) (None, 200, 120) 0 repeat_vector_29[0][0]
__________________________________________________________________________________________________
multiply_28 (Multiply) (None, 200, 120) 0 bidirectional_37[0][0]
permute_31[0][0]
__________________________________________________________________________________________________
expectation_over_words (Lambda) (None, 120) 0 multiply_28[0][0]
__________________________________________________________________________________________________
dense_55 (Dense) (None, 6) 726 expectation_over_words[0][0]
==================================================================================================
Total params: 2,666,087
Trainable params: 2,666,087
Non-trainable params: 0
__________________________________________________________________________________________________
model.fit(X_t, y, validation_split=.2, epochs=3, verbose=1, batch_size=512)
Train on 127656 samples, validate on 31915 samples
Epoch 1/3
127656/127656 [==============================] - 192s 2ms/step - loss: 0.1686 - acc: 0.9613 - val_loss: 0.1388 - val_acc: 0.9639
Epoch 2/3
127656/127656 [==============================] - 183s 1ms/step - loss: 0.0939 - acc: 0.9708 - val_loss: 0.0526 - val_acc: 0.9814
Epoch 3/3
127656/127656 [==============================] - 183s 1ms/step - loss: 0.0492 - acc: 0.9820 - val_loss: 0.0480 - val_acc: 0.9828
<keras.callbacks.History at 0x7f5c60dbc690>
Because we stored the output of the softmax layer, we can identify what words the model identified as important for classifying each sentence. We’ll define a helper function that will print the words in the sentence and pair them with their corresponding Attention output (i.e., the word’s importance to the prediction).
# The helper assumes a reverse_token_map that maps token ids back to words, e.g.:
reverse_token_map = {v: k for k, v in tokenizer.word_index.items()}

def get_word_importances(text):
    lt = tokenizer.texts_to_sequences([text])
    x = pad_sequences(lt, maxlen=maxlen)
    p = model.predict(x)
    att = attention_model.predict(x)
    return p, [(reverse_token_map.get(word), importance) for word, importance in zip(x[0], att[0]) if word in reverse_token_map]
The data set from Kaggle is about classifying toxic comments. Instead of using examples from the data (which is filled with vulgarities…), I’ll use a relatively benign example to show Attention in action:
# This example is not offensive, according to the model.
get_word_importances('ice cream')
(array([[0.03289272, 0.00060764, 0.00365122, 0.00135027, 0.00818442,
0.00286318]], dtype=float32),
[('ice', 0.24796312), ('cream', 0.18920079)])
# This example is ALMOST labeled toxic, but isn't, due to the word "worthless".
get_word_importances('ice cream is worthless')
(array([[0.4471274 , 0.00662996, 0.07287972, 0.00910207, 0.09902474,
0.02583557]], dtype=float32),
[('ice', 0.028298836),
('cream', 0.024065288),
('is', 0.041414138),
('worthless', 0.8417008)])
# This example is labeled toxic and it's because of the word "sucks".
get_word_importances('ice cream sucks')
(array([[0.9310674 , 0.04415609, 0.64935035, 0.02346741, 0.4773681 ,
0.07371927]], dtype=float32),
[('ice', 0.0032987772), ('cream', 0.003724447), ('sucks', 0.9874626)])
Attention mechanisms are an exciting technique and an active area of research, with new approaches being published all the time. Attention has been shown to help achieve state of the art performance on many tasks, but it also has the obvious advantage of explaining important components of neural networks to a practitioner. In an academic setting, Attention proves very useful for model debugging by allowing one to inspect the predictions and their reasoning. Moreover, having the option available to indicate why a model made a prediction could be very important to fields like healthcare, where predictions may be vetted by an expert.
DeepDream made headlines a few years ago for its bizarre images generated by neural networks. As with many scientific advancements, a flashy name won out over the precise name. “Dreaming” sounds more exotic than the more boring truth: “purposely maximizing the L2 norm of different layers”. Anyway, the approach was later extended to video by researchers to more unsettling effects. How to “dream” has been covered at length, so for this post, I’m going to repurpose code from the good people at Keras and only focus on the adversarial side of dreaming. This modified code is included as an appendix below.
Adversarial images have been getting a lot of attention lately. Adversarial images are images that fool state of the art image classifiers but a human would easily classify as the proper class. At present, these images are only possible with access to the classifier you aim to fool, but that could always change in the future. Therefore, researchers are already exploring all the different ways one can exploit an image classifier.
There have been several papers exploring these so-called attacks. The aptly named One pixel attacks change a single pixel to trick a classifier:
Special glasses have been used to cause a classifier to misclassify celebrities:
One can quickly imagine how this can go awry when self-driving cars need to identify obstacles (or pedestrians). With these publications in mind, I would like to walk through another approach that combines “dreaming” to create adversarial images.
from keras.applications import inception_v3
from keras import backend as K
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Repurposed deepdream.py code from the people at Keras
from deepdream_mod import preprocess_image, deprocess_image, eval_loss_and_grads, resize_img, gradient_descent
K.set_learning_phase(0)
# Load the model and include the "top" classification layer
model = inception_v3.InceptionV3(weights='imagenet',
include_top=True)
Using TensorFlow backend.
Today, we’re going to take this image of a lazy cat for our experiment. Let’s see what InceptionV3 thinks right now:
orig_img = preprocess_image('lazycat.jpg')
top_predicted_label = inception_v3.decode_predictions(model.predict(orig_img), top=1)[0][0]
plt.title('Predicted Label: %s (%f)' % (top_predicted_label[1], top_predicted_label[2]))
plt.imshow(deprocess_image(orig_img))
The most likely class according to InceptionV3 is an Egyptian_cat. That’s oddly specific (and wrong), but let’s go with it.
Now we need to choose a fake target class for the adversarial component. In this case, let’s select a “coffeepot”.
coffeepot = np.zeros_like(model.predict(orig_img))
coffeepot[0,505] = 1.
inception_v3.decode_predictions(coffeepot)
[[(u'n03063689', u'coffeepot', 1.0),
(u'n15075141', u'toilet_tissue', 0.0),
(u'n02317335', u'starfish', 0.0),
(u'n02391049', u'zebra', 0.0),
(u'n02389026', u'sorrel', 0.0)]]
Our loss will have a “dreaming” and “adversarial” component.
To achieve “dreaming”, we fix the weights and perform gradient ascent on the input image itself to maximize the L2 norm of a chosen layer’s output of the network. You can also select multiple layers and create a loss to maximize with coefficients, but in this case we will choose a single layer for simplicity.
The interesting part mathematically is as follows: define the input image as \(x_0\) and the output of the \(n\)-th layer as \(f_n(x)\). To dream, we perform the gradient ascent step:
\begin{equation} x_{i+1} = x_{i} + \gamma \frac{\partial}{\partial x} |f_{n}(x)|^2 \end{equation}
where \(\gamma\) is the learning rate and the absolute value represents the L2 norm. For our choice of \(f_n\), we are not limited to activation functions; we could use the output before the activation, as well as special layers like Batch Normalization.
For the “adversarial” piece, we will also add a component to the loss that will correspond to gradient descent on the predicted label, meaning each iteration we will modify the image to make the neural network think the image is the label we select.
One of the easiest ways to define this is to choose a label and minimize the loss between the model’s output for an image and the false label. Let us define the model’s output softmax layer as \(f(x)\) for an image \(x\) and let’s let \(o_{fake}\) be the fake label. Let’s assume we are using a categorical crossentropy loss \(\ell\), but any loss would do. Then, we would want to perform the gradient descent step
\begin{equation} x_{i+1} = x_i - \gamma \frac{\partial}{\partial x} \ell( f(x), o_{fake}). \end{equation}
This iteration scheme will adjust the original image to minimize the loss function between the output of the model and the fake label, hence “tricking” the model.
If we add the previous two update rules together, we can see the final loss whose derivative we are taking:
\begin{equation} 2 x_{i+1} = 2 x_{i} + \gamma (\frac{\partial}{\partial x} |f_{n}(x)|^2 - \frac{\partial}{\partial x} \ell( f(x), o_{fake}) ) \end{equation}
After massaging the equation, we have the gradient descent rule
\begin{equation} x_{i+1} = x_{i} - \frac{\gamma }{2} \frac{\partial}{\partial x} \left (\ell( f(x), o_{fake}) - |f_{n}(x)|^2 \right ). \end{equation} So, our final loss function for adversarial dreaming to define in Keras is:
\begin{equation} \ell_{\text{adv dream}} = \ell( f(x), o_{fake}) - |f_{n}(x)|^2. \end{equation}
For the final loss, we will use a linear combination of a few randomly selected layer output functions for better dreaming effects, but those are a small technical detail. Let’s see the code:
from keras import losses

input_class = K.zeros(shape=(1,1000))  # Variable fake class
# dreaming loss, choose a few layers at random and sum their L2 norms
dream_loss = K.variable(0.)
np.random.seed(1)
np.random.choice(model.layers[:-1], size = 5)
for x in np.random.choice(model.layers[:-1], size = 4):
    x_var = x.output
    dream_loss += np.random.uniform(0,2) * K.sum(K.square(x_var)) / K.prod(K.cast(K.shape(x_var), 'float32'))
# adversarial loss
adversarial_loss = losses.categorical_crossentropy(input_class, model.output)
# final loss
adversarial_dream_loss = adversarial_loss - dream_loss
# Compute the gradients of the dream wrt the loss.
dream = model.input  # This is the input image
grads = K.gradients(adversarial_dream_loss, dream)[0]  # the signs will achieve the desired effect
grads /= K.maximum(K.mean(K.abs(grads)), K.epsilon())  # Normalize for numerical stability
outputs = [adversarial_dream_loss, grads]
# Function to use during dreaming
fetch_loss_and_grads = K.function([dream, input_class], outputs)
The code to “dream” is straightforward. The image is shrunk to a smaller size and run through a selected number of gradient descent steps. Then, the resulting image is upscaled, with any detail the original image lost during resizing added back in. This is repeated based on the number of “octaves” selected, where during each octave the image has a different size. The octave scale determines the size ratio between the different octaves.
This is really just a more complicated way to say that the image is passed through the neural net for gradient descent at multiple increasing sizes. The octave concept is a nice geometric approach for deciding what those sizes will be that has been found to work well. Sometimes, additional jitter is added to the image at each gradient descent step, too.
# Shamelessly taken from the Keras' example
step = 0.01          # Gradient ascent step size
num_octave = 5       # Number of scales at which to run gradient ascent
octave_scale = 1.2   # Size ratio between scales
iterations = 15      # Number of ascent steps per scale
max_loss = 100.

img = np.copy(orig_img)
if K.image_data_format() == 'channels_first':
    original_shape = img.shape[2:]
else:
    original_shape = img.shape[1:3]
successive_shapes = [original_shape]
for i in range(1, num_octave):
    shape = tuple([int(dim / (octave_scale ** i)) for dim in original_shape])
    successive_shapes.append(shape)
successive_shapes = successive_shapes[::-1]
original_img = np.copy(img)
shrunk_original_img = resize_img(img, successive_shapes[0])
for shape in successive_shapes:
    print('Processing image shape', shape)
    img = resize_img(img, shape)
    img = gradient_descent(img,
                           iterations=iterations,
                           step=step,
                           clz=coffeepot,
                           max_loss=max_loss,
                           loss_func=fetch_loss_and_grads)
    # Upscale
    upscaled_shrunk_original_img = resize_img(shrunk_original_img, shape)
    same_size_original = resize_img(original_img, shape)
    # Add back in detail from original image
    lost_detail = same_size_original - upscaled_shrunk_original_img
    img += lost_detail
    shrunk_original_img = resize_img(original_img, shape)
('Processing image shape', (144, 144))
('..Loss value at', 0, ':', array([ 13.78646088], dtype=float32))
...
('..Loss value at', 14, ':', array([ 5.75342369], dtype=float32))
('Processing image shape', (173, 173))
('..Loss value at', 0, ':', array([ 12.50799942], dtype=float32))
...
('..Loss value at', 14, ':', array([ 4.36679077], dtype=float32))
('Processing image shape', (207, 207))
('..Loss value at', 0, ':', array([ 12.14435482], dtype=float32))
...
('..Loss value at', 14, ':', array([-8.84395885], dtype=float32))
('Processing image shape', (249, 249))
('..Loss value at', 0, ':', array([ 6.44068241], dtype=float32))
...
('..Loss value at', 14, ':', array([-9.34126949], dtype=float32))
('Processing image shape', (299, 299))
('..Loss value at', 0, ':', array([ 1.45110321], dtype=float32))
...
('..Loss value at', 14, ':', array([-9.23139477], dtype=float32))
f, axs= plt.subplots(1,2, figsize=(8,10))
top_predicted_label = inception_v3.decode_predictions(model.predict(orig_img), top=1)[0][0]
axs[0].set_title('Predicted Label: %s (%f)' % (top_predicted_label[1], top_predicted_label[2]))
axs[0].imshow(deprocess_image(orig_img))
top_predicted_label = inception_v3.decode_predictions(model.predict(img), top=1)[0][0]
axs[1].set_title('Predicted Label: %s (%f)' % (top_predicted_label[1], top_predicted_label[2]))
axs[1].imshow(deprocess_image(img))
As you can see, we accomplished our initial goal to dream in a directed, adversarial way. There are many other approaches that can be attempted; this is a very active avenue of research at the moment.
Here are the functions used in this post:
# Modification of https://github.com/keras-team/keras/blob/master/examples/deep_dream.py
from keras import backend as K
from keras.applications import inception_v3
from keras.preprocessing.image import load_img, img_to_array
import numpy as np
import scipy

def preprocess_image(image_path):
    # Util function to open, resize and format pictures
    # into appropriate tensors.
    img = load_img(image_path)
    img = img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = inception_v3.preprocess_input(img)
    return img

def deprocess_image(x):
    # Util function to convert a tensor into a valid image.
    x = np.copy(x)
    if K.image_data_format() == 'channels_first':
        x = x.reshape((3, x.shape[2], x.shape[3]))
        x = x.transpose((1, 2, 0))
    else:
        x = x.reshape((x.shape[1], x.shape[2], 3))
    x /= 2.
    x += 0.5
    x *= 255.
    x = np.clip(x, 0, 255).astype('uint8')
    return x

def eval_loss_and_grads(x, clz, loss_func):
    outs = loss_func([x, clz])
    loss_value = outs[0]
    grad_values = outs[1]
    return loss_value, grad_values

def resize_img(img, size):
    img = np.copy(img)
    if K.image_data_format() == 'channels_first':
        factors = (1, 1,
                   float(size[0]) / img.shape[2],
                   float(size[1]) / img.shape[3])
    else:
        factors = (1,
                   float(size[0]) / img.shape[1],
                   float(size[1]) / img.shape[2],
                   1)
    return scipy.ndimage.zoom(img, factors, order=1)

def gradient_descent(x, iterations, step, clz=None, max_loss=None, loss_func=None):
    for i in range(iterations):
        loss_value, grad_values = eval_loss_and_grads(x, clz, loss_func)
        if max_loss is not None and loss_value > max_loss:
            break
        print('..Loss value at', i, ':', loss_value)
        x -= step * grad_values
    return x
In order to not bog this post down with excessive details, I’m going to assume that the reader is familiar with the concepts of (stochastic) gradient descent and what asynchronous programming looks like.
Stochastic gradient descent is an iterative algorithm. If one wanted to speed it up, they would have to move the iterative calculation to a faster processor (or say, a GPU). Hogwild! is an approach to parallelize SGD in order to scale it to quickly train on even larger data sets. The basic algorithm of Hogwild! is deceptively simple. The only prerequisite is that all processors on the machine can read and update the weights of your algorithm (i.e. of your neural network or linear regression) at the same time. In short, the weights are stored in shared memory with atomic component-wise addition. The individual processors’ SGD update, adapted from the original paper, is described below:
loop
    sample an example (X_i, y_i) from your training set
    evaluate the gradient descent step using (X_i, y_i)
    perform the update step component-wise
end loop
So, we run this code on multiple processors updating the same model weights at the same time. You can see where the name Hogwild! comes from– the algorithm is like watching a bunch of farm animals bucking around at random, stepping on each other’s toes.
There are a few conditions for this algorithm to work, but for our purposes, one way to phrase a (usually) sufficient condition is that the gradient updates are sparse, meaning that there are only a few non-zero elements of the gradient estimate used in SGD. This is likely to occur when your data itself is sparse. A sparse update means that it’s unlikely for the different processors to step on each other’s toes, i.e. to try to access the same weight component to update at the same time. Collisions do still occur, but this has been observed to act as a form of regularization in practice.
As an exercise, we will implement linear regression using Hogwild! for training. I have prepared a git repo with the full implementation adhering to sklearn’s API, and I will use pieces of that code here.
To make sure we’re all on the same page, here’s the terminology I will use in the subsequent sections. Our linear regression model \(f\) will take \(x\in\mathbb{R}^n\) as input and output a real number \(y\) via
\begin{equation} f(x) = w \cdot x, \end{equation} where \(w\in\mathbb{R}^n\) is our vector of weights for our model. We will use squared error as our loss function, which for a single example is written as \begin{equation} \ell(x,y) = ( f(x)-y )^2 = ( w\cdot x - y)^2. \end{equation}
To train our model via SGD, we need to calculate the gradient update step so that we may update the weights via: \begin{equation} w_{t+1} = w_t - \lambda G_w (x,y), \end{equation} where \(\lambda\) is the learning rate and \(G_w\) is an estimate of \(\nabla_w \ell\) satisfying: \begin{equation} \mathbb{E}[G_w (x,y)] = \nabla_w \ell(x,y). \end{equation}
In particular,
\begin{equation} G_w (x,y):= 2\, (w\cdot x - y)\, x \in \mathbb{R}^n. \end{equation}
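As a quick sanity check (with made-up numbers), a single SGD step on one example looks like this:

import numpy as np

# One SGD step on a single example, using the gradient formula above (made-up values).
w = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
y = 0.7
G = 2 * (w.dot(x) - y) * x   # G_w(x, y) = 2 (w.x - y) x
lam = 0.01                   # learning rate lambda
w = w - lam * G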
First, we quickly generate our training data. We will assume our data was generated by a function of the same form as our model, namely a dot product between a weight vector and our input vector \(x\). This true weight vector I will call the “real \(w\)”.
import numpy as np
import scipy.sparse

n = 10     # number of features
m = 20000  # number of training examples
X = scipy.sparse.random(m, n, density=.2).toarray()  # Guarantees sparse grad updates
real_w = np.random.uniform(0, 1, size=(n, 1))        # Define our true weight vector
X = X / X.max()                                       # Normalizing for training
y = np.dot(X, real_w)
The multiprocessing library allows you to spin up additional processes to run code in parallel. We will use multiprocessing.Pool’s map function to calculate and apply the gradient update step asynchronously. The key component of the algorithm is that the weight vector is accessible to all processes at the same time and can be accessed without any locking.
First, we need to define the weight vector \(w\) in shared memory that can be accessed without locks. We will have to use the “sharedctypes.Array” class from multiprocessing for this functionality. We will further use numpy’s frombuffer function to make it accessible as a numpy array.
from multiprocessing.sharedctypes import Array
from ctypes import c_double
import numpy as np
coef_shared = Array(c_double,
(np.random.normal(size=(n,1)) * 1./np.sqrt(n)).flat,
lock=False) # Hogwild!
w = np.frombuffer(coef_shared)
w = w.reshape((n,1))
For the final parallelization code, we will use multiprocessing.Pool to map our gradient update out to several workers. We will need to expose this vector to the workers for this to work.
Our ultimate goal is to perform the gradient update in parallel. To do this, we need to define what this update is. In the git repo, I took a much more sophisticated approach to expose our weight vector \(w\) to multiple processes. To avoid over-complicating this explanation with Python technicalities, I’m going to just use a global variable to take advantage of how multiprocessing.Pool.map works. The map function copies over the entire namespace of the function to a new process, and as I need to expose a single \(w\) to all workers, this will be sufficient.
# The calculation has been adjusted to allow for mini-batches
learning_rate = .001
def mse_gradient_step(X_y_tuple):
    global w  # Only for instructive purposes!
    X, y = X_y_tuple  # Required for how multiprocessing.Pool.map works
    # Calculate the gradient
    err = y.reshape((len(y), 1)) - np.dot(X, w)
    grad = -2. * np.dot(np.transpose(X), err) / X.shape[0]
    # Update the nonzero weights one at a time
    for index in np.where(abs(grad) > .01)[0]:
        coef_shared[index] -= learning_rate * grad[index, 0]
We are going to have to cut the training examples into tuples, one per example, to pass to our workers. We will also be reshaping them to adhere to the way the gradient step is written above. Code is included to allow you to use mini-batches via the batch_size variable.
batch_size = 1
examples = [None] * int(X.shape[0] / float(batch_size))
for k in range(int(X.shape[0] / float(batch_size))):
    Xx = X[k * batch_size : (k + 1) * batch_size, :].reshape((batch_size, X.shape[1]))
    yy = y[k * batch_size : (k + 1) * batch_size].reshape((batch_size, 1))
    examples[k] = (Xx, yy)
Parallel programming can involve very difficult, low-level code, but the multiprocessing library has abstracted the details away from us. This makes this final piece of code underwhelming as a big reveal. Anyway, after the definitions above, the final code for Hogwild! is:
from multiprocessing import Pool
# Training with Hogwild!
p = Pool(5)
p.map(mse_gradient_step, examples)
print('Loss function on the training set:', np.mean(abs(y-np.dot(X,w))))
print('Difference from the real weight vector:', abs(real_w-w).sum())
Loss function on the training set: 0.0014203406038
Difference from the real weight vector: 0.0234173242317
As you probably know, deep learning is pretty popular right now. The neural networks employed to solve deep learning problems are trained via stochastic gradient descent. Deep learning implies large data sets (or, additionally some type of interactive environment in the case of reinforcement learning). As the scale increases, training neural networks becomes slower and more cumbersome. So, parallelizing the training of neural networks is a big deal.
There are other approaches to parallelizing SGD. Google has popularized Downpour SGD among others. Downpour SGD basically keeps the weights on a central hub, where several model instances working on different pieces of the data send updates and retrieve refreshed weights. This idea has been extended to reinforcement learning and seen to provide surprising improvements in performance and a decrease in training time. In particular, the A3C model uses asynchronous agents playing multiple game instances who publish gradient updates to a central hub, along with a few other tweaks. This has been seen to alleviate some of the time correlation issues found in reinforcement learning that cause divergence, which was previously tackled via experience replay.
The point is asynchronous methods are here to stay, and who knows what new advances may come from them in the next few years. At present, there are not many mathematical proofs pertaining to the accuracy increases seen for asynchronous methods. Indeed, many of the approaches are not theoretically justified, but their adoption has been driven by impressive practical results. For mathematicians, it will be interesting to see what new mathematics arise from trying to explain these methods formally.
In this post, we are going to discuss the Kullback-Leibler Importance Estimation Procedure (KLIEP) for importance weighting training samples. This approach was implemented in the DensityRatioEstimator class from pykliep. First, we will motivate KLIEP and identify how it weights training examples. Then we will outline the API for pykliep and walk through an example.
Assume you have a training set that is distributed according to \begin{equation} x \sim p(x)\end{equation} and a test set distributed via \begin{equation} x \sim q(x). \end{equation} When \(p \neq q\), this is called covariate shift, and it violates a key assumption of machine learning: the training set distribution is the same as the test set distribution. (In this case, \(x=(X,y)\) for notational convenience.) One approach to correct this phenomenon is called importance weighting. Essentially, you attach a sample weight to each training example during training in order to satisfy: \begin{equation} E_{x \sim w(x)p(x)} \text{loss}(x) = E_{x\sim q(x)} \text{loss}(x) \end{equation}
Theoretically, the best choice is \(w(x)=q(x)/p(x)\) because \begin{equation} E_{x \sim w(x)p(x)} \text{loss}(x) = \int_{supp(w(x)p(x))} \text{loss}(x)w(x)p(x)\, dx \end{equation} \begin{equation} = \int_{supp(q(x))} \text{loss}(x)q(x)\, dx = E_{x\sim q(x)} \text{loss}(x), \end{equation} assuming their supports are equal. So, we can see that the expected loss on the test distribution \(q\) is the same as using the sample weights \(w(x)\) on the training set! (Notice, this is trivially the case when \(p(x)=q(x)\).)
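A quick Monte Carlo sanity check of this identity, with a made-up loss and Gaussian choices of \(p\) and \(q\):

import numpy as np

# Check E_{x ~ w(x)p(x)}[loss(x)] = E_{x ~ q(x)}[loss(x)] with p = N(0,1), q = N(1,1)
# and loss(x) = x**2 (chosen only for illustration); both estimates should be close to 2.
np.random.seed(0)
x_p = np.random.normal(0, 1, 1000000)  # samples from the "training" distribution p
x_q = np.random.normal(1, 1, 1000000)  # samples from the "test" distribution q
w = np.exp(x_p - 0.5)                  # w(x) = q(x)/p(x) for these two Gaussians
print(np.mean(w * x_p ** 2), np.mean(x_q ** 2))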
Don’t celebrate yet. It is not very easy to estimate \(w(x)\), and many approaches are unstable. A simple approach is to train a classifier to predict if an example comes from the training set or the test set. Then you would use the likelihood score from a model \begin{equation}\hat w(x)= \frac{p(y =1 |x)}{ p(y=0 |x) }\end{equation} where \(y=1\) indicates membership in the test set. This tends to work poorly in practice as the ratio is very sensitive.
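A minimal sketch of that classifier-based estimate (illustrative only, not part of pykliep; sklearn's LogisticRegression is just one possible choice of classifier):

import numpy as np
from sklearn.linear_model import LogisticRegression

def classifier_density_ratio(X_train, X_test):
    # Label training points y=0 and test points y=1, then take the likelihood ratio.
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = LogisticRegression().fit(X, y)
    proba = clf.predict_proba(X_train)
    return proba[:, 1] / proba[:, 0]  # p(y=1|x) / p(y=0|x) for each training point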
A more stable approach is the aforementioned KLIEP algorithm. This approach directly estimates the density ratio \(q(x)/p(x)\). The original implementation is not very scalable, but since the first publication, there have been new approaches to increase its efficiency.
The KLIEP algorithm defines an estimate for \(w\) of the following form:
\begin{equation} \hat w(x) = \sum_{x_i \in D_{te}} \alpha_i K(x,x_i,\sigma), \end{equation}
where \(D_{te}\subset X_{te}\) and \(\alpha_i,\sigma \in \mathbb{R}\). The original paper chose \(K\) to be a Gaussian kernel, i.e.: \begin{equation} K(x,x_i,\sigma) = e^{ - \frac{| x-x_i|^2}{2\sigma^2}}. \end{equation}
The set \(D_{te}\) is a random sample of test set vectors. The parameter \(\sigma\) is called the “width” of the kernel. The authors then introduce the quantity
\begin{equation} J = \frac{1}{|X_{te}|} \sum_{x\in X_{te}}\log \hat w (x), \end{equation} where \(X_{te}\) is the test set. They prove that the larger this value is, the closer \(q(x)\) and \(\hat w(x) p(x)\) are in Kullback-Leibler divergence. In other words, you want this value maximized to have the best possible sample weights for training. KLIEP uses gradient ascent to find the \(\alpha_i\) which maximize \(J\) given \(D_{te}\) and \(\sigma\), with the constraints that \begin{equation} \alpha_i \geq 0 \quad\text{and}\quad\int \hat w(x) p(x) \, dx = 1, \end{equation} which means the quantity \(\hat w(x) p(x)\) is akin to a true probability density function (i.e. it integrates to 1).
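To make the notation concrete, here is a small sketch (not the pykliep implementation) of the kernel, the estimate \(\hat w\), and the objective \(J\) for a given set of \(\alpha_i\):

import numpy as np

def gaussian_kernel(x, x_i, sigma):
    return np.exp(-np.sum((x - x_i) ** 2) / (2 * sigma ** 2))

def w_hat(x, D_te, alphas, sigma):
    # The KLIEP form of the density ratio estimate at a point x
    return sum(a * gaussian_kernel(x, x_i, sigma) for a, x_i in zip(alphas, D_te))

def J(X_te, D_te, alphas, sigma):
    # The objective KLIEP maximizes over the alphas (subject to the constraints above)
    return np.mean([np.log(w_hat(x, D_te, alphas, sigma)) for x in X_te])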
The DensityRatioEstimator class has two main parameters to tune: num_params and sigma, which correspond to the size of \(D_{te}\) and \(\sigma\) respectively. An interesting component of the approach is the use of likelihood cross validation (LCV). As opposed to grid search cross validation, this approach only splits the test set into folds. It turns out that this is the proper form of cross validation to estimate the theoretical quantity that indicates goodness of fit. Otherwise, it is very similar to grid search cross validation:
Let’s walk through how to use the DensityRatioEstimator class which implements KLIEP. It’s meant to follow the spirit of sklearn’s conventions. Overall, the API is similar to sklearn, but there are key differences. This estimator takes as its input both the training and test set, which already breaks sklearn’s “fit” API. One facet of “fit” is that X and y are the same length, and obviously that is not the case for your training and test set in general.
In theory, there’s nothing stopping us from using the procedure on \(x=(X,y)\). For training sets with a time element, one can test if KLIEP improves your validation scores. In problems without time explicitly as a feature, the underlying distribution can still drift over time. You can still cut your data up by time (collection time? date added to the data?) and apply KLIEP. You might even have data to predict that is unlabeled. In that case, you can simply use \(X_{test}\) to develop sample weights (and then your approach is edging towards a semi-supervised learning approach as you use labeled and unlabeled examples).
Next, an example: We will define a simple learning problem where \(x\in\mathbb{R}^2\). The training set is generated via \begin{equation} X \sim (N(0,.5), N(0,.5)) \end{equation} and a covariate shift in \(X\) will occur, so the test set will be distributed via \begin{equation} X \sim (N(1,.5), N(1,.5)). \end{equation}
The y data will be generated by \begin{equation} y= \sin(n\pi)+m + N(0,\epsilon) \end{equation} where \(X=(m,n)\in \mathbb{R}^2\).
First, let’s visualize the covariate shift.
import numpy as np

eps = 1e-3
np.random.seed(1)
X_train = np.random.normal(0, .5, size=(500, 2))
X_test = np.column_stack((np.random.normal(1, .5, size=(300,)), np.random.normal(1, .5, size=(300,))))
y_train = np.sin(X_train[:, 1] * np.pi) + X_train[:, 0] + np.random.normal(0, eps)
y_test = np.sin(X_test[:, 1] * np.pi) + X_test[:, 0] + np.random.normal(0, eps)
%matplotlib inline
import matplotlib.pyplot as plt
f,axs=plt.subplots(1,2, figsize=(10,5))
axs[0].scatter(X_train[:,0],X_train[:,1], color='b', alpha=.2)
axs[0].scatter(X_test[:,0],X_test[:,1], color='r', alpha=.2)
axs[0].set_title('Distributions of X: train (blue) vs. test (red)')
axs[1].set_title('Distributions of y: train (blue) vs. test (red)')
axs[1].hist(y_train, color='b', alpha=.2)
axs[1].hist(y_test, color='r', alpha=.2)
Now, we will train two random forests: with and without KLIEP sample weights.
from pykliep import DensityRatioEstimator
from sklearn.ensemble import RandomForestRegressor

# Estimate the density ratio w
kliep = DensityRatioEstimator(random_state=7)
kliep.fit(X_train, X_test)  # keyword arguments are X_train and X_test
weights = kliep.predict(X_train)

rf = RandomForestRegressor(n_estimators=50, random_state=2)
rf2 = RandomForestRegressor(n_estimators=50, random_state=2)
f,axs = plt.subplots(1,2, figsize=(10,5))
rf.fit(X_train, y_train)
axs[0].scatter(np.log10(y_test+1), np.log10(rf.predict(X_test)+1))
axs[0].set_title('RFW MAE = %s' % np.mean(np.abs(y_test-rf.predict(X_test))))
rf2.fit(X_train, y_train, sample_weight=weights) # Train using the sample weights!
axs[1].scatter(np.log10(y_test+1), np.log10(rf2.predict(X_test)+1))
axs[1].set_title('RF + KLIEP MAE = %s' % np.mean(np.abs((y_test-rf2.predict(X_test)))))
axs[0].set_ylabel('Log Predicted')
axs[1].set_xlabel('Log Actual')
axs[0].plot([-.8,.6], [-.8,.6], color='r')
axs[1].plot([-.8,.6], [-.8,.6], color='r')
If you had access to the true \(w(x)\), you could just look at \(w(x)\) and \(\hat w(x)\) and compare. We are going to do just that.
If we use the last example above, the pdf for \(X=(m,n)\) is independent in each component and so \begin{equation} p(X) = p(m)p(n) \end{equation} for the training and test set. So then, \(w(X)\) is given by \begin{equation} w(X) = \frac{p_{te}(m) p_{te}(n)}{p_{tr}(m) p_{tr}(n)}. \end{equation} We can then plot how well \(\hat w (X)\) approximates \(w(X)\).
np.random.seed(128)
X_train = np.random.normal(0,.5,size=(1000,2))
X_test = np.random.normal(1,.5,size=(1500,2))
kliep = DensityRatioEstimator(random_state=4)
kliep.fit(X_train, X_test)
# Take a look at the LCV results for (num_params, sigma)
kliep._j_scores
[((0.2, 0.25), 2.3883652107662297),
((0.1, 0.25), 2.2569424774193245),
((0.2, 0.5), 1.8987499277478477),
((0.1, 0.5), 1.8136986917106765),
((0.2, 0.75), 1.4189331965313798),
((0.1, 0.75), 1.3891472114097529),
((0.2, 1), 1.0377502976502824),
((0.1, 1), 0.98391914587619367),
((0.2, 0.1), 0.94975924498825803),
((0.1, 0.1), 0.19270111220322969)]
import scipy.stats

# Define the true w(X)
def w_true(X):
    num = scipy.stats.norm(loc=1, scale=.5).pdf(X[:, 0]) * scipy.stats.norm(loc=1, scale=.5).pdf(X[:, 1])
    denom = scipy.stats.norm(scale=.5).pdf(X[:, 0]) * scipy.stats.norm(scale=.5).pdf(X[:, 1])
    return num / denom
As \(X\) is two-dimensional, we’ll actually plot along a parametrized line \(X(t):= (m(t),n(t))\) where \(m(t)=n(t)=t\). This is like taking a slice of \(w(X)\) and will allow us to make a 2-D plot.
Xt = np.column_stack((np.linspace(0,1,100),np.linspace(0,1,100)))
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(np.linspace(0,1,100),w_true(Xt),'b')
plt.plot(np.linspace(0,1,100),kliep.predict(Xt),color='r')
plt.title(r'$w(X)$ (Blue) vs $\hat{w}(X)$ (Red)')
plt.xlabel('t')
The illustration is only half the battle. In order to prove anything, we need to be able to formally (mathematically) write down the objective. This informs not only the proof strategy but gives one the end goal to “keep an eye out for” while studying the proof.
The goal is for the Generator to generate examples that are indistinguishable from the real data. Mathematically, this is the concept of random variables being equal in distribution (or in law). Another way to say this is that their probability density functions (i.e. the probability measures induced by the random variables on their range) are equal: \(p_G(x)=p_{data}(x)\). This is exactly the strategy of the proof: define an optimization problem where the optimal \(G\) satisfies \(p_G(x)=p_{data}(x)\). If you knew that your solution \(G\) satisfies this relationship, then you have a reasonable expectation that you can represent the optimal \(G\) as a neural network via typical SGD training!
With the counterfeiting illustration in mind, the motivation for defining the optimization problem is guided by the following two terms. First, we require that the discriminator \(D\) recognizes examples from the \(p_{data}(x)\) distribution, hence: \begin{equation} E_{x \sim p_{data}(x)}\log(D(x)) \end{equation} where \(E\) denotes the expectation. This term comes from the “positive class” of the log-loss function. Maximizing this term corresponds to \(D\) being able to precisely predict \(D(x)=1\) when \(x\sim p_{data}(x)\).
The next term has to do with the Generator \(G\) tricking the discriminator. In particular, the term comes from the “negative class” of the log-loss function: \begin{equation} E_{z \sim p_{z}(z)}\log(1-D(G(z))). \end{equation} If this value is maximized (remember, the log of \(x<1\) is negative), that means \(D(G(z))\approx 0\) and so \(G\) is NOT tricking \(D\).
To combine these two concepts, the Discriminator’s goal is to maximize \begin{equation} E_{x \sim p_{data}(x)}\log(D(x))+E_{z \sim p_{z}(z)}\log(1-D(G(z))) \end{equation} given \(G\), which means that \(D\) properly identifies real and fake data points. The optimal Discriminator in the sense of the above equation for a given \(G\) will be denoted \(D^*_G\). Define the value function
\begin{equation} V(G,D):= E_{x \sim p_{data}(x)}\log(D(x))+E_{z \sim p_z(z)}\log(1-D(G(z))). \end{equation}
Then we can write \(D_G^* = \text{argmax}_D V(G,D)\). Now, \(G\)’s aim is the reverse: the optimal \(G\) minimizes the value function when \(D=D^*_G\). In the paper, they prefer to say that the optimal \(G\) and \(D\) solve the minimax game
\begin{equation} \min_G \max_D V(G,D) = \min_G \max_D \left ( E_{x \sim p_{data}(x)}\log(D(x))+E_{z \sim p_z(z)}\log(1-D(G(z))) \right ). \end{equation}
Then, we can write that the optimal solution satisfies \begin{equation} G^* = \text{argmin}_G V(G,D_G^*). \end{equation}
At this point, we must show that this optimization problem has a unique solution \(G^*\) and that this solution satisfies \(p_G=p_{data}\).
One big idea from the GAN paper– which is different from other approaches– is that \(G\) need not be invertible. This is crucial because in practice \(G\) is NOT invertible. Many sets of notes online miss this fact when they try to replicate the proof and incorrectly use the change of variables formula from calculus (which would depend on \(G\) being invertible). Rather, the whole proof relies on this equality:
\begin{equation} E_{z \sim p_{z}(z)}\log(1-D(G(z))) = E_{x \sim p_{G}(x)}\log(1-D(x)) . \end{equation}
This equality comes from the Radon-Nikodym Theorem of measure theory. Colloquially, it’s sometimes called the law of the unconscious statistician, which was probably given its name by mathematicians– if I had to guess. This equality shows up in the proof of Proposition 1 in the original GAN paper to little fanfare when they write:
\begin{equation} \int_{x} p_{data}(x)\log D(x) \, \mathrm{d}x + \int_{z} p(z)\log ( 1- D(G(z))) \, \mathrm{d}z \end{equation} \begin{equation} =\int_{x} p_{data}(x)\log D(x) + p_G(x) \log ( 1- D(x)) \, \mathrm{d}x \end{equation}
I have seen lecture notes use the change of variables formula for this line! That is incorrect, as to do a change of variables, one must calculate \(G^{-1}\) which is not assumed to exist (and in practice for neural networks– does not exist!). I imagine this approach is so common in the machine learning / stats literature that it goes unnamed.
Due to the last equation, we actually can write down the optimal \(D\) given \(G\). Perhaps surprisingly, it comes from simple calculus. We will find a maximum of the integrand in the equation above and then choose that for \(D\). Notice, the integrand can be written: \begin{equation} f(y)= a \log y + b \log(1-y). \end{equation} To find the critical points, \begin{equation} f^\prime(y) = 0 \Rightarrow \frac{a}{y} - \frac{b}{1-y} = 0 \Rightarrow y = \frac{a}{a+b} \end{equation} if \(a+b \neq 0\). If you continue and do the second derivative test: \begin{equation} f^{\prime\prime}\big ( \frac{a}{a+b} \big) = - \frac{a}{(\frac{a}{a+b})^2} - \frac{b}{(1-\frac{a}{a+b})^2} < 0 \end{equation} when \(a,b > 0\), and so \(\frac{a}{a+b}\) is a maximum.
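A quick numeric check of this maximizer, with made-up values of \(a\) and \(b\):

import numpy as np

# f(y) = a*log(y) + b*log(1-y) should be maximized at y = a / (a + b).
a, b = 0.3, 0.7
ys = np.linspace(0.001, 0.999, 100000)
f = a * np.log(ys) + b * np.log(1 - ys)
print(ys[np.argmax(f)], a / (a + b))  # both are approximately 0.3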
So, we can then write that \begin{equation} V(G,D) = \int_{x} p_{data}(x)\log D(x) + p_G(x) \log ( 1- D(x)) \, \mathrm{d}x \end{equation} \begin{equation} \leq \int_x \max_y \left \{ p_{data}(x)\log y + p_G(x) \log ( 1- y) \right \}\, \mathrm{d}x. \end{equation} If you let \(D(x) = \frac{p_{data}}{p_{data}+p_{G}}\), you achieve this maximum. Since \(f\) has a unique maximizer on the interval of interest, the optimal \(D\) is unique as well, as no other choice of \(D\) will achieve the maximum.
Notice, that this optimal \(D\) is not practically calculable, but mathematically it is still very important. We do not know \(p_{data}(x)\) a priori and so we would never be able to use it directly during training. On the other hand, its existence enables us to prove that an optimal \(G\) exists, and during training we only need to approximate this \(D\).
Of course, the goal of the GAN process is for \(p_G = p_{data}\). What does this mean for the optimal \(D\)? Substituting into the equation for \(D^*_G\), \begin{equation} D_G^* = \frac{p_{data}}{p_{data}+p_G} = \frac{1}{2}. \end{equation} This means that the Discriminator is completely confused, outputting \(1/2\) for examples from both \(p_{G}\) and \(p_{data}\). The authors exploit this fact to prove that this \(G\) is the solution to the minimax game. The theorem is as follows:
“The global minimum of the virtual training criterion \(C(G)=\max_D V(G,D)\) is achieved if and only if \(p_G = p_{data}\).”
The theorem in the paper is an if and only if statement, so we prove both directions. First, we tackle the backwards direction and compute the value \(C(G)\) takes. Then, we approach the forward direction using what we learn from the backwards direction. Assuming \(p_G=p_{data}\), we can write
\begin{equation} V(G, D_G^*) = \int_{x} p_{data}(x)\log \frac{1}{2} + p_G(x) \log \big ( 1- \frac{1}{2}\big) \, \mathrm{d}x \end{equation}
and \begin{equation} V(G,D_G^*) = - \log 2 \int_{x}p_{G}(x) \,\mathrm{d}x - \log 2 \int_x p_{data}(x)\, \mathrm{d}x = - 2 \log 2 = -\log4. \end{equation} This value is a candidate for the global minimum (since it is what occurs when \(p_G=p_{data}\)). We will pause the backwards direction for now because we want to prove that this value is always the minimum (which will satisfy the “if” and “only if” pieces at the same time). So, drop the assumption that \(p_G=p_{data}\) and observe that for any \(G\), we can plug \(D^*_{G}\) into \(C(G)=\max_D V(G,D)\): \begin{equation} C(G) = \int_{x} p_{data}(x)\log \big (\frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \big ) + p_G(x) \log\big ( \frac{p_{G}(x)}{p_{G}(x)+p_{data}(x)}\big ) \, \mathrm{d}x. \end{equation}
The second integrand comes from the following trick:
\begin{equation} 1-D_G^*(x) = 1 - \frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} =\frac{p_G(x) + p_{data}(x)}{p_{G}(x)+p_{data}(x)} - \frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \end{equation} \begin{equation} = \frac{p_{G}(x)}{p_{G}(x)+p_{data}(x)}. \end{equation}
Knowing that \(-\log 4\) is the candidate for the global minimum, I want to make that value appear in the equation, so I add and subtract \(\log2\), multiplied by the probability densities, inside each integral. This is a common proof technique, and it doesn't change the equality because, in essence, I am adding 0 to the equation.
\begin{equation} C(G) = \int_{x} (\log2 -\log2)p_{data}(x) + p_{data}(x)\log \big (\frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \big ) \end{equation} \begin{equation} \quad+(\log2 - \log2)p_G(x) + p_G(x) \log\big ( \frac{p_G(x)}{p_G(x)+p_{data}(x)}\big ) \, \mathrm{d}x. \end{equation}
\begin{equation} C(G) = - \log2\int_{x} p_{G}(x) + p_{data}(x)\, \mathrm{d}x \end{equation} \begin{equation}+\int_{x}p_{data}(x)\Big(\log2 + \log \big (\frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \big ) \Big) \end{equation} \begin{equation}+ p_G(x)\Big (\log2 + \log\big ( \frac{p_{G}(x)}{p_{G}(x)+p_{data}(x)}\big ) \Big ) \, \mathrm{d}x. \end{equation}
Due to the definition of probability densities, integrating \(p_G\) and \(p_{data}\) over their domain equals 1, i.e.: \begin{equation} -\log2\int_{x} p_{G}(x) + p_{data}(x)\, \mathrm{d}x = -\log2 ( 1 + 1) = -2\log2 = -\log4. \end{equation} Moreover, using the definition of the logarithm, \begin{equation} \log2 + \log \big (\frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \big ) = \log \big ( 2\frac{p_{data}(x)}{p_{G}(x)+p_{data}(x)} \big )\end{equation} \begin{equation} = \log \big (\frac{p_{data}(x)}{(p_{G}(x)+p_{data}(x))/2} \big ). \end{equation}
So, substituting in these equalities we can write: \begin{equation} C(G) = - \log4 + \int_{x}p_{data}(x)\log \big (\frac{p_{data}(x)}{(p_{G}(x)+p_{data}(x))/2} \big )\,\mathrm{d}x \end{equation} \begin{equation}+ \int_x p_G(x)\log\big ( \frac{p_{G}(x)}{(p_{G}(x)+p_{data}(x))/2}\big ) \, \mathrm{d}x. \end{equation}
Now, for those of you who have heard of the Kullback-Leibler divergence, you will recognize that each integral is exactly that! In particular, \begin{equation} C(G) = - \log4 + KL\big (p_{data} \,\big\|\, \frac{p_{data}+p_{G}}{2} \big ) + KL\big(p_{G} \,\big\|\, \frac{p_{data}+p_{G}}{2}\big) . \end{equation}
The Kullback-Leibler divergence is always non-negative, so we immediately see that \(-\log 4\) is the global minimum of \(C(G)\). If we prove that only one \(G\) achieves this, then we are finished because \(p_G=p_{data}\) would be the unique point where \(C(G)=-\log 4\). To do this, we have to notice a second identity: \begin{equation} C(G) = - \log4 + 2\cdot JSD \big (p_{data} \,\big\|\, p_{G}\big ) . \end{equation}
The JSD term is known as the Jensen-Shannon divergence. Of course, you would have to be initiated in the field to know this identity. This divergence is actually the square of the Jensen-Shannon distance metric, which gives us the property that \(JSD \big (p_{data} \| p_{G}\big )\) is only 0 when \(p_{data}=p_{G}\).
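As a sanity check on both identities, here is a toy example with discrete distributions I made up. It uses scipy's conventions: scipy.stats.entropy(p, q) computes \(KL(p\,\|\,q)\) in nats, and scipy.spatial.distance.jensenshannon returns the Jensen-Shannon distance, i.e. the square root of the divergence.
import numpy as np
from scipy.stats import entropy                   # entropy(p, q) = KL(p || q) in nats
from scipy.spatial.distance import jensenshannon  # returns the JS distance, sqrt of the divergence

p_data = np.array([0.1, 0.4, 0.3, 0.2])
p_G    = np.array([0.25, 0.25, 0.25, 0.25])
m = (p_data + p_G) / 2

C_via_KL  = -np.log(4) + entropy(p_data, m) + entropy(p_G, m)
C_via_JSD = -np.log(4) + 2 * jensenshannon(p_data, p_G) ** 2
print(C_via_KL, C_via_JSD)     # the two expressions agree
print(C_via_KL >= -np.log(4))  # True: C(G) can never drop below -log 4
Setting p_G equal to p_data makes both divergence terms vanish and recovers the \(-\log 4\) from the backwards direction.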
Now, the main thrust of the paper has been proven: namely, that the global minimum of \(C(G)=\max_D V(G,D)\) is attained exactly when \(p_G=p_{data}\). There is an additional proof in the paper stating that, under suitable assumptions, the training process will converge to the optimal \(G\). We will omit the details.
Most implementations do not explicitly perform the minimax optimization suggested in the original paper. Instead, they use a more typical training regimen inspired by the paper. For clarity, we will use the concrete case of generating images. First, we define two models. The first is the Discriminator \(D(x)\) and the second is the Adversarial net \(A(z)=D(G(z))\). Notice, the adversarial net is also a classifier. In our setting, we will define the positive label 1 to be a real image and 0 to be a fake image. So, training will consist of two steps for each minibatch:
1. Train the Discriminator D to discriminate between real images and generated images via a standard 0-1 classification loss function.
2. Freeze the weights of D and train the adversarial network A on generated images with their labels forced to be 1.
The second step is interesting because it’s very intuitive. We are not updating \(D\)’s weights, but we are updating \(G\) to make the generated images “look more like real images” (i.e. having the Adversarial net output 1 for fake images). If the Adversarial network outputs a 1, that means we have tricked the Discriminator.
import sklearn
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist['data'] = mnist['data'].astype(float) / 255. # Normalized!
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Conv2D, BatchNormalization, Input, Dropout, UpSampling2D, Reshape, Flatten, Conv2DTranspose
from keras.layers.advanced_activations import LeakyReLU
For the network itself, I combined my favorite implementations and approaches. It’s sometimes a pain to pick hyperparameters for a network, so it’s always good to have a reference for your first implementation. The Discriminator is a standard Convolutional Neural Network. The Generator is in a sense the reverse of a CNN. We will use transposed convolutional layers to take a dense vector from the latent distribution and upscale/transform those vectors to our desired image shape.
Rather than using Convolution2D and UpSampling2D as some implementations do, you can achieve the same results by using Conv2DTranspose in Keras. There's a deep mathematical reason why they are equivalent, but I will leave that for a possible future post. There's a page that graphically shows the difference between convolutions and transposed convolutions.
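As a quick illustration (only a shape check, with layer sizes I chose arbitrarily; it is not meant to show the two routes compute identical numbers), here is how the two options line up in the same Keras API used below:
from keras.models import Sequential
from keras.layers import Conv2D, Conv2DTranspose, UpSampling2D

# Route A: a single strided transposed convolution.
route_a = Sequential([
    Conv2DTranspose(64, 5, strides=(2, 2), padding='same', input_shape=(7, 7, 128))
])
# Route B: upsample first, then apply an ordinary convolution.
route_b = Sequential([
    UpSampling2D(size=(2, 2), input_shape=(7, 7, 128)),
    Conv2D(64, 5, padding='same')
])
print(route_a.output_shape)  # (None, 14, 14, 64)
print(route_b.output_shape)  # (None, 14, 14, 64)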
BatchNormalization has become known as another source of regularization, but there is no rigorous proof of this. Like RMSprop, it just appears to work better. BatchNormalization normalizes the output of each layer before passing it through the next layer. This speeds up training and appears to lead to better results. There is no consensus (as of right now) on whether it should be used before or after the activation function, and so I use it before.
depth=200
dropout=.3
image_size = 28
half_image_size = 7
gan_layers = [
Dense(depth*half_image_size*half_image_size, input_shape=(100,), kernel_initializer='glorot_normal'),
BatchNormalization(momentum=0.9),
Activation('relu'),
Dropout(dropout),
Reshape([half_image_size,half_image_size,depth]),
Conv2DTranspose(depth//2, 5, strides=(2,2), padding='same', kernel_initializer='glorot_uniform'), # (14,14,100)
BatchNormalization(momentum=0.9),
Activation('relu'),
Dropout(dropout),
Conv2DTranspose(depth//4, 3, strides=(2,2), padding='same', kernel_initializer='glorot_uniform'), # (28,28,50)
BatchNormalization(momentum=0.9),
Activation('relu'),
Dropout(dropout),
Conv2DTranspose(depth//8, 3, padding='same', kernel_initializer='glorot_uniform'), # (28,28,25)
BatchNormalization(momentum=0.9),
Activation('relu'),
Dropout(dropout),
Conv2DTranspose(1, 3, padding='same'), # (28,28,1)
Activation('sigmoid')
]
gen = Sequential(layers=gan_layers)
leaky_alpha = .2
d_depth = 256
discriminator_layers = [
Conv2D(d_depth, 5, strides=(2,2), input_shape=(image_size, image_size, 1), data_format='channels_last', padding='same'),
LeakyReLU(alpha=leaky_alpha),
Dropout(dropout),
Conv2D(d_depth*2, 5, strides=(2,2), padding='same'),
LeakyReLU(alpha=leaky_alpha),
Dropout(dropout),
Flatten(),
Dense(256),
LeakyReLU(alpha=leaky_alpha),
Dropout(dropout),
Dense(2),
Activation('softmax')
]
dsc = Sequential(layers=discriminator_layers)
from keras import optimizers
opt = optimizers.Adam(lr=2e-4)
dopt = optimizers.Adam(lr=1e-4)
dsc.compile(optimizer=dopt, loss='categorical_crossentropy')
Remember to turn off training on the Discriminator’s weights!
for layer in dsc.layers:
layer.trainable = False
from keras.models import Model
adv_model = Sequential([gen,dsc])
adv_model.compile(optimizer=opt, loss='categorical_crossentropy')
adv_model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
sequential_1 (Sequential) (None, 28, 28, 1) 1586351
_________________________________________________________________
sequential_2 (Sequential) (None, 2) 9707266
=================================================================
Total params: 11,293,617
Trainable params: 1,566,401
Non-trainable params: 9,727,216
_________________________________________________________________
Notice you can clearly see the discriminator’s weights are not trainable.
The jury is out on whether or not pretraining the Discriminator is worthwhile. Many recent papers suggest it is not needed. However, that conclusion is based on subjective judgments of which Generators produce "more realistic" images. So, I'll just follow what the original paper did.
# prior distribution
generate_values = lambda x : np.random.uniform(-1,1, size=(x,100))
# Initial Discriminator Training
import numpy as np
initial_training_size = 10000
real_images = mnist['data'][np.random.choice(len(mnist['data']), size=initial_training_size)].reshape(initial_training_size,image_size, image_size,1)
fake_images = gen.predict(generate_values(initial_training_size))
%matplotlib inline
import matplotlib.pyplot as plt
f, axs =plt.subplots(1,2, sharey=True)
axs[0].imshow(real_images[1,:,:,0])
axs[1].imshow(fake_images[1,:,:,0])
axs[0].set_title('Real Image')
axs[1].set_title('Fake Image')
# Compile initial training set
X = np.concatenate([real_images, fake_images])
y = np.zeros((2*initial_training_size,2))
y[:initial_training_size,1] = 1
y[initial_training_size:,0] = 1
dsc.fit(X,y, batch_size=32, epochs=1)
steps = 10000
mini_batch_size = 256  # assumed value; define the minibatch size here if it was not set earlier (must be even)
half_batch = mini_batch_size // 2

# Labels for the Discriminator step: first half real (column 1), second half fake (column 0)
y = np.zeros(shape=(mini_batch_size, 2))
y[:half_batch, 1] = 1
y[half_batch:, 0] = 1

# Labels for the Adversarial step: every generated image gets the "real" label forced to 1
y_adv = np.zeros(shape=(mini_batch_size, 2))
y_adv[:, 1] = 1

d_loss = np.zeros(steps)
a_loss = np.zeros(steps)
for k in range(steps):
    if k % 50 == 0:
        print(k)
    # Train the Discriminator on a half real / half generated batch
    X = np.concatenate([
        mnist['data'][np.random.choice(len(mnist['data']), size=half_batch)].reshape(half_batch, image_size, image_size, 1),
        gen.predict(generate_values(half_batch))
    ])
    d_loss[k] = dsc.train_on_batch(X, y)
    # Train the Adversarial net (updates G only, since D's weights are frozen)
    X_adv = generate_values(mini_batch_size)
    a_loss[k] = adv_model.train_on_batch(X_adv, y_adv)
Full Disclosure: I may have run the code a few times to get a good-looking image…
%matplotlib inline
import matplotlib.pyplot as plt
f, axs =plt.subplots(1,2, sharey=True)
axs[0].imshow(real_images[1,:,:,0])
axs[0].set_title('Real Image')
axs[1].set_title('Fake Image')
axs[1].imshow(gen.predict(generate_values(1))[0,:,:,0])
%matplotlib inline
import matplotlib.pyplot as plt
f, axs =plt.subplots(3,8, sharey=True, figsize=(10,4))
plt.suptitle('Generated Images')
for j in range(3):
for k in range(8):
axs[j][k].imshow(gen.predict(generate_values(1))[0,:,:,0])
plt.plot(d_loss, color='r')
plt.plot(a_loss, color='b')
plt.xlabel('Steps')
plt.ylabel('Value')
plt.title('Loss')
plt.legend(['Discriminator','Adversarial'])