Correlation is a very fundamental and viseral way of understanding how the stock market works and how strategies perform. Modern portfolio theory has made great progress in tying together stock data with portfolio selection. Today, we’re going to explore how the eigendecomposition of the returns covariance matrix could help you invest. This will be the first post in a series of posts to help you understand how advanced mathematics can fit into investing.

First, let’s get some definitions out of the way. We will denote the covariance matrix of the returns over some period as $\Sigma$. The price for an asset $i$ at a given time $t$ is $p_{i,t}$ and returns for each stock $i$ will be written as $r_{i,t}=(p_{i,t}-p_{i,t-1})/p_{i,t-1}$. These returns can be arranged into a matrix and used to find the sample covariance which we will do below.

Warning! The analysis herein is a presentation of a part of Markowitz Portfolio Theory. 
Notes you may come across for this theory assumes linear returns (not log!), 
and it will be confusing as some notes do casually use log-returns without so much as 
a mention of the extentions to correct for this.

We will further define the amount invested in stock $i$ relative to one’s total capital as $w_i$. If $w_{Apple}=.5$, this means we are investing 50% of our capital into Apple. This is also called going “long” in a stock. If a weight is negative, that is related to a “short sale”, where you borrow capital to buy the stock. Usually, you “long” (verb) stocks that will increase in value and “short” (verb) stocks that will decrease.

Portfolio Theory Fundamentals

If we have a set of stocks $S$ and our chosen weights represented as a vector $w = [w_i]_{i \in S}$, then the pair $(S,w)$ describes a portfolio allocation. This can be used, for instance, in a “buy and hold” strategy for investing. If we had access to the covariance matrix $\Sigma$, then the risk (variance!) of the investing strategy is given by the formula \begin{equation} \text{risk} = w^T \Sigma w, \end{equation}

This is an extremely important idea– we have a way to quantify our risk for a given portfolio! Let me add one interesting point mathematically. This formula lends itself to be used in all sorts of optimization problems (which we may head into later), but there’s a very simple one we can point out immediately. If we would like the minimum risk when the weight are of unit length in $L^2$, the weight are simply the eigenvector associated with the smallest eigenvalue and the risk is the smallest eigenvalue. We have just witnessed a powerful connection, and frankly, one of the most approachable applications of eigenvalues and eigenvectors.

Introducing Eigenportfolios

Let’s switch gears and talk about the covariance matrix. For those familiar with Principal Component Analysis, some of this next section will look familiar. In fact, since the covariance matrix is similar by definition to the correlation matrix (in the sense of a linear transformation similarity), the eigenvalues will be the same and the eigenvectors (which we are interested in) have a 1-1 correspondence between them, assuming none of the variances for any of the stocks are equal to 0.

The covariance matrix has some nice properties in the case of stock returns. First of all, it is a symmetric matrix, and so its eigenvalues are positive and its eigenvectors are orthogonal to each other. The typical interpretation of the eigenvalue decomposition of a covariance matrix is this:

The eigenvalues give the “variance” of each “factor” or eigenvector
The variance associated with each factor is “uncovariated” with the others (If we use the correlation matrix, this would be uncorrelated.)

I put the words above in quotes because this is meaningless until you connect it to where the matrix came from. Here, we know loosely that each eigenvalue corresponds to the risk of a portfolio and that the eigenvectors can represent an allocation of weights. So in this context, we can make the interpretations

Eigenvectors are the “eigenportfolios”, strategy weight allocations which are uncorrolated to other eigenportfolios
Eigenvalues are the “risk” of the given eigenportfolio

The Market Eigenvalue

As an aside, it has been empirically verified that the largest eigenvalue and associated eigenportfolio will correlate strongly with the market, assuming the data you built the matrix with captures the market. Usually, when using this technique in practice, you may want to remove that eigenportfolio so that your returns will be (hopefully) disconnected with the marked.

A Numerical Example

To show an example, first we need to construct a correlation matrix from closing stock price data.

pandas API

Now, we’re going to introduce how to use the pandas API to get stock data off of Yahoo. We’ll also use pandas to prepare the data for the analysis. This data should not necessarily be used for investment strategies because (but not limited to) Yahoo’s data suffers from survivers’ bias (stocks that don’t exist anymore are dropped) and their treatment of splits and dividends. However, to read basic data is very easy. Pandas uses the DataReader class combined with the ticker and a start and end date to grab data.

We are going to grab a stocks and calculate their returns.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
    import pandas as pd
    import datetime as dt
    from pandas.io.data import DataReader
    
    start, end = dt.datetime(2012, 1, 1), dt.datetime(2013, 12, 31)
    
    tickers = ['AAPL', 'YHOO','GOOG', 'MSFT','ALTR','WDC','KLAC'] 
    # List shamelessly taken from:
    # http://stackoverflow.com/questions/28174193/add-new-column-based-on-a-list-and-sort-date-by-newest
    
    prices = pd.DataFrame()
    
    for ticker in tickers:
        prices[ticker] = DataReader(ticker,'yahoo', start, end).loc[:,'Close'] #S&P 500
        
    prices.head()

	AAPL	YHOO	GOOG	MSFT	ALTR	WDC	KLAC
Date
2012-01-03	411.230000	16.290001	665.411118	26.770000	37.599998	30.980000	47.459999
2012-01-04	413.440010	15.780000	668.281154	27.400000	37.169998	31.299999	46.910000
2012-01-05	418.029995	15.640000	659.011109	27.680000	37.459999	32.759998	47.520000
2012-01-06	422.400002	15.520000	650.021102	28.110001	37.490002	33.490002	47.730000
2012-01-09	421.730000	15.460000	622.461047	27.740000	37.779999	33.750000	48.180000

1
2
    returns = prices.pct_change()
    returns.head()

	AAPL	YHOO	GOOG	MSFT	ALTR	WDC	KLAC
Date
2012-01-03	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2012-01-04	0.005374	-0.031308	0.004313	0.023534	-0.011436	0.010329	-0.011589
2012-01-05	0.011102	-0.008872	-0.013871	0.010219	0.007802	0.046645	0.013004
2012-01-06	0.010454	-0.007673	-0.013642	0.015535	0.000801	0.022283	0.004419
2012-01-09	-0.001586	-0.003866	-0.042399	-0.013163	0.007735	0.007763	0.009428

We remove the first row of NaN’s and we’re finished cleaning the data. Now, we’re going to take the finance equivalent of a training set and test set. They typically test their algorithms “in sample” and “out of sample”.

1
2
3
4
5
6
7
    returns = returns.iloc[1:, :] # Remove first row of NA's
    
    training_period = 30
    in_sample = returns.iloc[:(returns.shape[0]-training_period), :].copy()
    
    # Save the tickers
    tickers = returns.columns.copy()

The Eigenvalues of the Covariance Matrix

Let’s extract an eigenportfolio! To avoid the dreaded Market-correlated portfolio (we aren’t trying to build an index fund here!), let’s take a portfolio using the next eigenvector. We are going to use pandas combined with NumPy.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
    import numpy as np
    
    covariance_matrix = in_sample.cov()
    
    D, S = np.linalg.eigh(covariance_matrix)
    
    eigenportfolio_1 = S[:,-1] / np.sum(S[:,-1]) # Normalize to sum to 1
    
    eigenportfolio_2 = S[:,-2] / np.sum(S[:,-2]) # Normalize to sum to 1


    # Setup Portfolios
    eigenportfolio = pd.DataFrame(data= eigenportfolio_1, columns = ['Investment Weight'], index = tickers)
    eigenportfolio2 = pd.DataFrame(data= eigenportfolio_2, columns = ['Investment Weight'], index = tickers)
    
    # Plot
    %matplotlib inline
    import matplotlib.pyplot as plt
    f = plt.figure()
        
    ax = plt.subplot(121)
    eigenportfolio.plot(kind='bar', ax=ax, legend=False)
    plt.title("Max E.V. Eigenportfolio")
    ax = plt.subplot(122)
    eigenportfolio2.plot(kind='bar', ax=ax, legend=False)
    plt.title("2nd E.V. Eigenportfolio")

png

This is a typical chart for visualizing an allocation strategy. Each bar indicates the weight associated with a stock and the tickers are arranged in alphabetical order. These two portfolios are supposed to have uncorrolated returns, and possibly in large part, because of their different treatment of WDC. Of course, we could calculate the expected risk or returns for this portfolio, but it would be more fun to see it in action!

Visualizing the Results

At this point, we are ready to see how our modest strategy will perform on real data. However, the caveat is the short sale, or the negative weight we found for TSLA. Short selling would require more logic built into a simulation and it’s not very instructive, so instead we’re going to zero out that weight and renormalize.

In Sample vs. Out of Sample

In machine learning speak, we are really interested in our performance on the test set, but are happy to know a little about the performance on the training set. To do this, we’re going to plot our culmulative returns over time in the out of sample period.

Culmulative returns at time $t$ is given by $CR_t = \prod_{i=0}^t (1+r_i)-1$ where $r_i$ is the return of the period. This will allow us to visualize the performance of the algorithms better over time.

Again, to avoid issues around short sales, I am going to use the portfolio without them.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    def get_cumulative_returns_over_time(sample, weights):
        return (((1+sample).cumprod(axis=0))-1).dot(weights)
    
    in_sample_ind = np.arange(0, (returns.shape[0]-training_period+1))
    out_sample_ind = np.arange((returns.shape[0]-training_period+1), returns.shape[0])
    
    cumulative_returns = get_cumulative_returns_over_time(returns, eigenportfolio).values
        
    f = plt.figure(figsize=(10,4))
    
    ax = plt.subplot(121)
    ax.plot(cumulative_returns[in_sample_ind], 'black')
    ax.plot(out_sample_ind,cumulative_returns[out_sample_ind], 'r')
    plt.title("Eigenportfolio")
    
    ax = plt.subplot(122)
    plt.plot((((1+returns.loc[:,'AAPL']).cumprod(axis=0))-1))
    plt.title("AAPL")

png

As you can see, the black is our “in sample” performance and the red is our “out of sample”. If we compare this to Apple during the same time period, you’ll see that we, in fact, did much better and avoided some of the volatility of Apple. Of course, we’re using what would be the “Market Mode” if we had more stocks included, and so it’s also the riskiest eigenportfolio, as it related to the highest eigenvalue.

Extensions

We are really just scratching the surface as to what can be done. Once you begin thinking about portfolios as eigenvectors, then there are many many avenues to continue this direction of research. For example,

This seems very ad-hoc… aren’t there optimization problems to choose portfolios based on certain criteria? Yes! That’s one of the main questions of Markowitz Portfolio Theory!
How can I get out of a bad strategy before it’s too late? Is there a way to tell if my strategy is performing as expected? We can address this with some statistical modeling of your strategy.
Stability of eigenvectors: What if your second eigenportfolio suddenly switched places with your first and IT becomes the most correlated to the market? You selected it to avoid market correlation! This is an active area of research.
How do we deal with sampling error? There is a lot of (seemingly random) error that can arise in building your correlation matrix. This can be alleviated with Random Matrix Theory.

In the coming posts, I hope to expand upon this foundation and add more tools to your investing arsenal. And as always, be wary of trying this with real money!

The next post in the series can be found here.

Scott Rome

Eigen-vesting I. Linear Algebra Can Help You Choose Your Stock Portfolio

Table of Contents

Portfolio Theory Fundamentals

Introducing Eigenportfolios

The Market Eigenvalue

A Numerical Example

pandas API

The Eigenvalues of the Covariance Matrix

Visualizing the Results

In Sample vs. Out of Sample

Extensions

You May Also Enjoy

Extending the Gaussian Mixture Approach for Fantasy Football Tiering Math Practical

Leveraging Factorization Machines for Wide Sparse Data and Supervised Visualization Math Theory

Bayesian Hierarchical Modeling Applied to Fantasy Football Projections for Increased Insight and Confidence Math Practical

Making Fantasy Football Projections Via A Monte Carlo Simulation Math Practical