
Inverse covariance matrix estimation for the global minimum variance portfolio

Shatthik Barua

Master’s Thesis, Spring 2017


This master’s thesis is submitted under the master’s programme Modelling and Data Analysis, with programme optionFinance, Insurance and Risk, at the Department of Mathematics, University of Oslo. The scope of the thesis is 60 credits.

The front page depicts a section of the root system of the exceptional Lie group E8, projected into the plane. Lie groups were invented by the Norwegian mathematician Sophus Lie (1842–1899) to express symmetries in differential equations and today they play a central role in various parts of mathematics.


Abstract

The estimation of inverse covariance matrices plays a major role in portfolio optimization: for the global minimum variance portfolio in mean-variance analysis, the inverse covariance matrix is the only parameter used to determine the asset allocation. In this thesis I propose to use the graphical lasso methodology to estimate the inverse covariance matrix directly, and apply it to the global minimum variance portfolio. The results indicate that the graphical lasso yields lower out-of-sample portfolio variance than the traditional sample estimator.


Acknowledgement

First and foremost, I would like to thank my supervisors Ingrid Kristine Glad and Nils Lid Hjort for their patience and feedback. I would also like to thank my friends and family for their support and positive distractions during the thesis work.


Contents

1 Introduction
1.1 Outline
2 Portfolio optimization
2.0.1 Portfolio return
2.0.2 Shorting
2.1 Markowitz
2.2 Mean-variance portfolio analysis
2.3 Feasible region
3 Return distribution
3.1 Data
4 Covariance
4.1 The covariance matrix
4.2 Estimation error
4.2.1 Sample covariance matrix
4.3 Precision matrix
4.4 Estimation
4.4.1 Regression
4.4.2 Sparsity
4.4.3 Graphical lasso
5 Sparsity structure
5.0.1 Graph generation
5.0.2 Simulate from structure
6 Tuning parameter
6.0.1 Methods
6.0.2 Predicting power vs graph recovery
6.0.3 Best tuning parameter method
7 Experimental results
7.0.1 Simulated data
7.0.2 Real data
8 Concluding remarks
8.1 Further research
Appendix A
A.1 Portfolio return
A.2 Estimation
A.3 Regression
A.4 Sparsity
A.5 Graphical lasso
Appendix B R-code
B.1 Support functions
B.2 Predicting power vs graph recovery
B.3 Best tuning parameter method
B.4 Simulated data
B.5 Real data


Chapter 1

Introduction

The theoretical foundation for how an investor should invest in risky assets was laid by the economist Harry Markowitz in 1952. In Markowitz's theory the investor's portfolio can be seen as a function of two parameters: the vector of expected returns of the risky assets, µ, and the covariance matrix of the returns, Σ. For this reason it is often referred to as mean-variance portfolio analysis. The mean-variance framework serves as the foundation for many theories which are still relevant today, such as the Capital Asset Pricing Model (CAPM). In practice µ and Σ are unknown and have to be estimated from historical data or expectations about the future. This means that the precision of these estimates is uncertain and estimation errors can be made. Studies show that estimation errors lead to poor out-of-sample portfolio performance, and that portfolios are highly sensitive to estimates of means and covariances. One way to reduce the impact of estimation error is to focus only on the estimation of the covariance matrix and ignore the expected returns, since it has been shown that errors in estimates of the expected returns have a bigger impact on the portfolio than errors in the covariance matrix. When one focuses only on the estimation of Σ, the optimal portfolio becomes the global minimum variance portfolio, the portfolio with the smallest possible variance of all portfolios. The global minimum variance portfolio can be seen as a function of one parameter, Σ, or more precisely Σ^{-1}. Several methods have been proposed in the literature to estimate the global minimum variance portfolio, but most of them focus on the estimation of Σ instead of estimating Σ^{-1} directly. Σ^{-1}, also called the precision matrix, has a specific mathematical interpretation: its elements contain information about the partial correlations between variables.

One can think of a partial correlation as a measure of how correlated two assets are once the influence of a set of other assets has been accounted for, and if the normal distribution is assumed as the distribution of the asset returns, then a partial correlation of 0 implies conditional independence. Since conditional independence corresponds to a 0 in Σ^{-1}, it is natural to consider regularization methods for estimating Σ^{-1}. In this thesis I will explore how the graphical lasso method can be used to estimate Σ^{-1} and how well it performs for the estimation of the global minimum variance portfolio.


1.1 Outline

I will start by giving an introduction to portfolio optimization and mean-variance portfolio analysis in Chapter 2. The data used in this thesis and some of its statistical properties are presented in Chapter 3. Chapter 4 discusses covariance, covariance estimation and inverse covariance estimation, and presents the graphical lasso method. Chapter 5 goes through the methods used in the simulations, and Chapter 6 discusses tuning parameter selection for the graphical lasso. Analysis and results are presented in Chapter 7, and concluding remarks are given in Chapter 8.

The software used throughout this thesis is the statistical programming language R; all scripts used in the analysis are included in the Appendix.


Chapter 2

Portfolio optimization

How to invest financial resources is one of the classical problems of finance. Investors try to allocate their available resources to the assets that will maximize the value of the invested funds while at the same time limiting the risk of losses. An asset is an investment instrument that can be bought and sold. One concept most investors are aware of is the relationship between risk and return: there is a positive relationship between the risk and the expected return of a financial asset. In other words, when the risk of an asset increases, so does its expected return. This means that an investor taking on more risk expects to be compensated with a higher return. How to find the optimal trade-off between risk and return is described in the field of mathematical finance under portfolio theory. Portfolio theory was first developed by Harry Markowitz[1] in the 1950s and his work formed the foundation of modern finance. Before going further into portfolio theory I will define some of the basic concepts that will be used throughout this thesis.

2.0.1 Portfolio return

• Let I denote the investment, the monetary value the investor wants to invest.

• The assets that can be invested in are indexed i = 1, ..., P.

• Let S_i(t) denote the price of asset i at time t.

• Let I_i be the value invested in asset i, i ∈ {1, ..., P}, such that I = Σ_{i=1}^P I_i.

• Define w_i = I_i / I to be the proportion invested in asset i, so that Σ_{i=1}^P w_i = 1.

At time t = 0 we have the value I at our disposal; we split I and invest I_i for i = 1, ..., P in the different assets. This gives us the amount M_i of asset i such that I_i = S_i(0) M_i. The value of our portfolio at time t, P(t), can then be expressed as

P(t) = Σ_{i=1}^P S_i(t) M_i

We see that at t = 0 the value equals our initial investment I, P(0) = I. An important quantity is the return, the gain or loss of an investment. We define the one-period return of asset i as R_i(t),


R_i(t) = (S_i(t) − S_i(t−1)) / S_i(t−1)

and the portfolio return as R_p(t),

R_p(t) = (P(t) − P(t−1)) / P(t−1)

For t = 1 we can express the portfolio return R_p(1) as a weighted average of the asset returns (see Appendix):

R_p(1) = (P(1) − P(0)) / P(0) = Σ_{i=1}^P w_i R_i(1)
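The identity that the portfolio return equals the weighted average of the asset returns can be checked numerically. The following is a small illustrative sketch in Python with numpy (not the thesis's R code); all prices, weights and the investment amount are hypothetical:

```python
import numpy as np

# Hypothetical prices of P = 3 assets at t = 0 and t = 1.
S0 = np.array([100.0, 50.0, 200.0])
S1 = np.array([110.0, 45.0, 210.0])

I = 10_000.0                      # total investment
w = np.array([0.5, 0.3, 0.2])     # portfolio weights, sum to 1

M = I * w / S0                    # units held: I_i = w_i * I = S_i(0) * M_i
P0 = np.dot(S0, M)                # portfolio value at t = 0 (equals I)
P1 = np.dot(S1, M)                # portfolio value at t = 1

R = (S1 - S0) / S0                # one-period asset returns
Rp_direct = (P1 - P0) / P0        # portfolio return from portfolio values
Rp_weighted = np.dot(w, R)        # weighted average of asset returns

print(np.isclose(Rp_direct, Rp_weighted))  # True: the two expressions agree
```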

2.0.2 Shorting

Sometimes it is possible to sell an asset that you do not own by short selling the asset. Short selling (shorting) involves borrowing an asset from a lender and then selling the borrowed asset to a buyer; at a later time, you purchase the asset and return it to the lender. Say you borrow M_i units of asset i at time t = 0; we denote borrowing as −M_i. The asset can be sold for −M_i S_i(0) = −I_i. Notice that by shorting asset i we free up more capital I_i for investing in other assets, since we can re-invest the value we get from selling the borrowed asset at t = 0. At time t, when it is time to give back the borrowed assets, you have to buy −M_i units of asset i at price S_i(t), −M_i S_i(t) = −I_i(t). The profit made is −I_i(t) − (−I_i) = I_i − I_i(t). We see that the profit is positive when S_i(0) > S_i(t). So with shorting, assets can have negative allocations −I_i and negative portfolio weights −w_i. Shorting can be risky since there is no upper bound on S_i(t), which can result in unlimited loss, while when buying an asset the loss is bounded because the asset price cannot fall below zero.
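As a small numerical illustration of the short-selling mechanics (hypothetical numbers, not from the thesis):

```python
# Hypothetical short sale: borrow and sell M_i = 100 units at S_i(0) = 50,
# then buy them back at S_i(t) = 40 to return them to the lender.
M_i = 100
S_open, S_close = 50.0, 40.0

proceeds = M_i * S_open    # cash received at t = 0 from selling the borrowed units
cost = M_i * S_close       # cash paid at time t to buy the units back
profit = proceeds - cost   # positive exactly when S_i(0) > S_i(t)
print(profit)  # 1000.0
```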

2.1 Markowitz

In 1952 the economist Harry Markowitz laid the foundation for portfolio optimization in his paper Portfolio Selection[1]. The theory presented in the paper is often referred to as mean-variance portfolio analysis; it analyzes how wealth can be optimally invested in assets which differ in regard to their expected return and risk[2].

Markowitz showed that under certain conditions, the choice of a portfolio can be reduced to balancing the expected return of the portfolio and its variance[2], and that it is possible to reduce risk through diversification, such that the risk of the portfolio, measured as its variance, depends not only on the individual variances of the returns on the different assets, but also on the pairwise covariances of all assets. The practice of taking returns as well as risk into account when making investment decisions was well known before Markowitz, but he was the first to develop a rigorously formulated mathematical framework for portfolio optimization[2]. Harry Markowitz was awarded the Nobel Prize in Economic Sciences in 1990 for his work on portfolio selection.


2.2 Mean-variance portfolio analysis

In this chapter, I will present the key concepts, assumptions, and quantities of Markowitz's mean-variance model.

The main quantity of focus in the mean-variance model is the portfolio return over a single period,

R_p = Σ_{i=1}^P w_i R_i(1)

At time t = 0 we decide how to allocate our wealth between the P risky assets and then evaluate the return at time t = 1. By a risky asset I mean an asset that has a stochastic return R = (R_1, ..., R_P)^T over the period. I will drop the time notation from here on, R_i = R_i(1), since we only consider one period.

The proportion of wealth invested in each asset is represented by the P-dimensional vector w = (w_1, ..., w_P)^T, where w_i is the proportion invested in asset i. It is assumed that the investor invests all of the wealth at his disposal, Σ_{i=1}^P w_i = 1, and that short selling is allowed, so w_i can be negative. Since asset returns are stochastic the investor will not know the returns at time t = 1, but it is assumed that the investor knows the expectation, variance and covariance of the asset returns.

The expected value of the asset returns at the end of the period is given by

E[R] = µ = (µ_1, ..., µ_P)^T

and the variance and covariance of the asset returns is

Cov(R) = Σ

Here Σ is the covariance matrix with the asset variances on the diagonal, Σ_ii = Var(R_i), and the covariances between assets on the off-diagonal, Σ_ij = Cov(R_i, R_j).

From knowledge of µ and Σ we can derive the expected return and variance of the portfolio. The investor's expected portfolio return is

µ_p = E[R_p] = Σ_{i=1}^P w_i µ_i = w^T µ

and the portfolio variance is

(σ_p)^2 = Var(R_p) = w^T Σ w
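These two portfolio moments are straightforward to compute once µ and Σ are given. A minimal numpy sketch with a hypothetical two-asset example (not data from the thesis):

```python
import numpy as np

# Hypothetical inputs for P = 2 risky assets.
mu = np.array([0.08, 0.12])        # expected asset returns
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])   # covariance matrix of the returns
w = np.array([0.6, 0.4])           # portfolio weights, sum to 1

mu_p = w @ mu                      # expected portfolio return  w^T mu
var_p = w @ Sigma @ w              # portfolio variance         w^T Sigma w

print(mu_p, var_p)
```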

The problem the investor is faced with is how to allocate his wealth over the different assets (w_i), which results in different R_p, µ_p and (σ_p)^2. Markowitz[3] shows how to optimally select the weights w given the assumptions that the investor will:

• Maximize portfolio return, given a risk level

• Minimize portfolio risk, given a return level

Here the expected portfolio return µ_p is used to represent the portfolio return, and the portfolio variance (σ_p)^2 is used to represent portfolio risk. From the assumptions of the investor, the investor will only be interested in portfolios that Markowitz calls efficient. An efficient portfolio is a portfolio having simultaneously the smallest possible variance for its given level of expected return and the largest possible expected return for its given level of risk[3].

2.3 Feasible region

All the portfolios an investor can construct are defined by the possible combinations of weights w, which we will call the weight space. The weight space of a P-asset portfolio is

W_P = { w : Σ_{i=1}^P w_i = 1 }

By plotting the weight space we can see that for P = 2, W_P is a line defined by (α, 1−α) for α ∈ R.

Figure 2.1: weight space P = 2

And more generally for P ≥ 3, the space W_P is a (P−1)-dimensional plane in R^P passing through the P standard unit basis vectors[3].


Figure 2.2: weight space P = 3

The set of all pairs (σ_p(w), µ_p(w)) for all possible combinations of w is called the feasible region

F_P = { (σ_p(w), µ_p(w)) : w ∈ W_P }

Here σ_p(w) denotes the standard deviation of the portfolio given weights w, σ_p(w) = √((σ_p)^2(w)) = √(w^T Σ w).

For P = 2 the weight space maps into F_P as the right branch of a hyperbola, and when P ≥ 3 the feasible region is an unbounded region whose boundary is the right branch of a hyperbola. It can be shown that the points on the boundary of the feasible region correspond to unique portfolio weights w, while there exist infinitely many portfolio weights for any point inside the feasible region[4].

Figure 2.3: Feasible region, P = 2 and P = 3


A hyperbola is a curve in the xy-plane defined by the equation

(x − h)^2 / a^2 − (y − k)^2 / b^2 = 1

The hyperbola's center is at (h, k), the vertices are (h ± a, k), and the asymptotes are

y = ±(b/a)(x − h) + k

Figure 2.4: A hyperbola^1

Since the boundary of the feasible region is a hyperbola, we see that there exists a point with the smallest σ_p(w) of all the points in F_P. This point corresponds to the portfolio which has the smallest risk of all portfolios. From the geometry of F_P it can be seen that this (σ_p(w), µ_p(w)) point is the one furthest to the left, which is the vertex (h + a, k) of the hyperbola that defines the boundary. The portfolio with the smallest risk is called the global minimum variance portfolio (GMVP).

One way the global minimum variance portfolio can be found is to solve the optimization problem

w_gmv = argmin_{w ∈ W_P} w^T Σ w   subject to   w^T e = 1

Here e = (1, ..., 1)^T. This problem has the solution

w_gmv = Σ^{-1} e / (e^T Σ^{-1} e)

which means that the smallest variance a portfolio can have is

Var(R_p(w_gmv)) = w_gmv^T Σ w_gmv = 1 / (e^T Σ^{-1} e)

and that the expected return is

E(R_p(w_gmv)) = w_gmv^T µ = (e^T Σ^{-1} µ) / (e^T Σ^{-1} e)

^1 Image taken from http://www.openalgebra.com/2013/05/hyperbolas.html


The point in F_P which corresponds to the vertex of the boundary and has the smallest variance is (√(1/(e^T Σ^{-1} e)), (e^T Σ^{-1} µ)/(e^T Σ^{-1} e)). For the derivation of the solution I refer the reader to [3].
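The closed-form GMVP solution can be verified numerically. A small numpy sketch with a hypothetical 3×3 covariance matrix (not data from the thesis):

```python
import numpy as np

# Hypothetical positive-definite covariance matrix for P = 3 assets.
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
e = np.ones(3)

Sigma_inv = np.linalg.inv(Sigma)
w_gmv = Sigma_inv @ e / (e @ Sigma_inv @ e)   # w_gmv = Sigma^{-1} e / (e^T Sigma^{-1} e)

var_gmv = w_gmv @ Sigma @ w_gmv               # equals 1 / (e^T Sigma^{-1} e)
print(w_gmv.sum())                            # weights sum to 1
print(np.isclose(var_gmv, 1 / (e @ Sigma_inv @ e)))  # True
```

Drawing random feasible weight vectors (summing to one) and comparing their variances against `var_gmv` is an easy sanity check that no other portfolio has smaller variance.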

Figure 2.5: Point with smallest SD, Feasible region P = 2,P = 3

When P ≥ 3 we can see that for a selected target level of µ there are infinitely many points in F_P that have mean µ but varying standard deviation. This also means that there are infinitely many w that achieve the mean return µ. But from our assumptions about the investor, that he minimizes portfolio risk for a given return level, the preferred portfolio will be the one with the lowest σ_p(w). Geometrically this is the portfolio corresponding to the intersection of the boundary of F_P and the line that represents all the portfolios with the same mean.

Figure 2.6: Points with same µ, P = 3

We can find this portfolio by solving


w* = argmin_{w ∈ W_P} w^T Σ w   subject to   w^T µ = µ_0, w^T e = 1

for a target mean µ_0. This problem has the solution

w* = (B Σ^{-1} e − A Σ^{-1} µ + µ_0 (C Σ^{-1} µ − A Σ^{-1} e)) / D

where

A = µ^T Σ^{-1} e,  B = µ^T Σ^{-1} µ,  C = e^T Σ^{-1} e,  D = BC − A^2

For the derivation of the solution see [3]. The minimum variance of a portfolio with mean µ_0 is

(σ_p(w*))^2 = (w*)^T Σ w* = (µ_0^2 C − 2 µ_0 A + B) / D

From this we can find the minimum variance for a portfolio with any level of expected return µ_p, and with some algebraic manipulation[3] we can express the relation between the mean and the variance as

(σ_p)^2 / (1/C) − (µ_p − A/C)^2 / (D/C^2) = 1

which is a hyperbola in the (σ_p, µ_p) plane, with center (0, A/C), vertices (±1/√C, A/C) and asymptotes µ_p = ±√(D/C) σ_p + A/C. We see that the left vertex of the hyperbola is the same as the global minimum variance portfolio point, (1/√C, A/C) = (√(1/(e^T Σ^{-1} e)), (e^T Σ^{-1} µ)/(e^T Σ^{-1} e)).
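The frontier solution for a target mean and its variance formula can be checked numerically. A numpy sketch with hypothetical inputs (the constants A, B, C, D follow the definitions in this section):

```python
import numpy as np

mu = np.array([0.05, 0.10, 0.15])            # hypothetical expected returns
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])       # hypothetical covariance matrix
e = np.ones(3)
Si = np.linalg.inv(Sigma)

A = mu @ Si @ e
B = mu @ Si @ mu
C = e @ Si @ e
D = B * C - A**2

mu_target = 0.12
# w* = (B Sigma^{-1}e - A Sigma^{-1}mu + mu_target (C Sigma^{-1}mu - A Sigma^{-1}e)) / D
w_star = (B * Si @ e - A * Si @ mu + mu_target * (C * Si @ mu - A * Si @ e)) / D

print(np.isclose(w_star @ e, 1.0))           # True: budget constraint holds
print(np.isclose(w_star @ mu, mu_target))    # True: target mean is attained
# variance agrees with the frontier formula (mu_target^2 C - 2 mu_target A + B) / D
print(np.isclose(w_star @ Sigma @ w_star,
                 (mu_target**2 * C - 2 * mu_target * A + B) / D))  # True
```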

(17)

Figure 2.7: Boundary of feasible region,P = 3

From the assumptions of the investor, he will only invest in efficient portfolios: portfolios having simultaneously the smallest possible variance for their given level of expected return and the largest possible expected return for their given level of risk. We showed how to find portfolios having the smallest possible variance for a given level of expected return, but these portfolios will not necessarily be efficient. To see this we can select a targeted expected return lower than the global minimum variance portfolio's expected return, i.e. below (e^T Σ^{-1} µ)/(e^T Σ^{-1} e).

Figure 2.8: A portfolio on the boundary of the hyperbola is not necessarily efficient

Here it is possible to select portfolios with higher expected return and the same standard deviation by moving vertically up in the plane, which means that the portfolio is not efficient. From this we see that if we want an efficient portfolio the targeted expected return has to be at least (e^T Σ^{-1} µ)/(e^T Σ^{-1} e), which means that every boundary point of F_P on or above the global minimum variance point corresponds to an efficient portfolio. The set of all points that are efficient portfolios is called the efficient frontier, which can be defined as

EF_P = { (√((µ_p^2 C − 2 µ_p A + B)/D), µ_p) : µ_p ≥ A/C }

The weight space of the efficient portfolios is

WE_P = { w = (B Σ^{-1} e − A Σ^{-1} µ + µ_0 (C Σ^{-1} µ − A Σ^{-1} e))/D : µ_0 ≥ A/C }

Figure 2.9: The efficient frontier

From the efficient frontier we get the theoretical foundation for the risk-reward tradeoff: the only way an investor can get a higher expected return is to take on more risk.


Chapter 3

Return distribution

The main quantities of interest in the mean-variance model are the portfolio return and its first two moments, µ_p and (σ_p)^2. As R_p is a weighted average of the individual asset returns, R_p = Σ_{i=1}^P w_i R_i, the statistical properties of R_p derive from R_i, where R_i is defined as R_i = (S_i(1) − S_i(0))/S_i(0). This is often called the arithmetic return. Although the arithmetic return is intuitive and easy to understand, there are several drawbacks that make arithmetic returns difficult to work with.

1. The future price of the asset, S_i(1) = S_i(0)(1 + R_i(1)), given by the return R_i, can be negative if R_i < −1.

2. The return is not symmetric: S_i(0)(1 + R_i(1))(1 − R_i(1)) ≠ S_i(0).

3. Returns are not additive over multiple periods: Σ_{t=1}^T R_i(t) ≠ R_i(T), where R_i(T) = −1 + Π_{t=1}^T (R_i(t) + 1). This means that if we assume that the one-period returns R_i(1) are i.i.d. with the normal distribution, then the return over longer periods, R_i(T), will not have a normal distribution.

Another return which is often used instead for financial assets is the geometric return

r_i = log(S_i(1)/S_i(0))

which gives the future price as S_i(1) = S_i(0) exp(r_i), and where the relation between the returns is r_i = log(1 + R_i).

The geometric return corrects all the drawbacks mentioned above, and for this reason it is standard in financial analysis to use geometric returns. It is often assumed that geometric returns have constant expectation and variance, that they are serially independent, and that they are normally distributed[5]. For example, the famous Black & Scholes model[6], which is used to price options, assumes that the price of a stock follows a geometric Brownian motion, S_i(1) = S_i(0) e^{µ_i + σ_i B(1)}, which results in normally distributed returns. Here B(t) is a Brownian motion.

Even though geometric returns are easier to work with and are the standard in financial analysis, they have one major drawback when it comes to portfolio theory: the geometric portfolio return cannot be written as a weighted sum of the geometric asset returns, r_p ≠ Σ_{i=1}^P w_i r_i[5]. But when the asset returns are relatively small, which is the case when the period length is small, the geometric and arithmetic returns will be similar[5], log(1 + R_i) ≈ R_i. From this similarity the portfolio return can approximately be written as a weighted sum, r_p ≈ Σ_{i=1}^P w_i r_i[7].
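The additivity of geometric returns over periods and the small-return approximation log(1 + R_i) ≈ R_i can be illustrated with a short numpy sketch (hypothetical daily prices):

```python
import numpy as np

# Hypothetical daily prices of one asset over four days.
S = np.array([100.0, 101.0, 99.5, 100.5])

R = S[1:] / S[:-1] - 1          # arithmetic one-period returns
r = np.log(S[1:] / S[:-1])      # geometric (log) one-period returns

# Log returns add over periods; arithmetic returns do not.
print(np.isclose(r.sum(), np.log(S[-1] / S[0])))   # True
print(np.isclose(R.sum(), S[-1] / S[0] - 1))       # False in general

# For small returns, r = log(1 + R) is close to R.
print(np.max(np.abs(r - R)) < 1e-3)                # True for these ~1% moves
```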

Figure 3.1: Arithmetic returns plotted against geometric returns; when |R_i| ≤ 0.10 the difference is small

Normality of returns is often assumed in financial modeling, but it is widely known that asset returns have a distribution that is more peaked and has fatter tails than the normal distribution[5]. Despite being recognized as overly simplistic, the assumption of normality has appeal because it leads to closed-form solutions for many financial problems, and it is easy to implement: one need only specify a mean and variance for each asset and a covariance for each pair of assets. And it models returns somewhat realistically while being very simple.

Assumption: We will in this thesis assume that the geometric returns follow a multivariate Gaussian distribution, r ∼ N(µ, Σ).

3.1 Data

The assets I will consider in this thesis are stocks from the Norwegian stock exchange traded on Oslo Børs; at the time this thesis is written there are about 190 stocks traded on the exchange. The historical data collected on these stocks spans from late 2010 to the present (May 2017), with observations collected on a daily basis; the stock prices collected are the end-of-day closing prices^1.

^1 All prices were extracted from http://www.netfonds.no/, see Appendix for script.


Figure 3.2: Plot of all stock prices and 5 most traded stocks on Oslo Børs in the log-scale

If we look at the sample quantiles of the arithmetic returns, 98% of the observed returns lie between −0.09 and 0.11, which means that the difference between the arithmetic and geometric returns is small and that we can approximate the arithmetic returns with geometric returns.

Table 3.1: Sample quantiles of arithmetic returns

0%: -0.8787879   1%: -0.09001943   2%: -0.06839772   3%: -0.05797101   4%: -0.05063291
5%: -0.04545455   6%: -0.04142156   7%: -0.03816794   8%: -0.03517588   9%: -0.03278689
10%: -0.03070175   11%: -0.02879581   12%: -0.02702703   13%: -0.02545455   14%: -0.024
15%: -0.02272727   16%: -0.02142857   17%: -0.02024291   18%: -0.01917808   19%: -0.01818182
20%: -0.01724138   21%: -0.0163254   22%: -0.01538462   23%: -0.01454545   24%: -0.01376147
25%: -0.01298701   26%: -0.01229942   27%: -0.01156069   28%: -0.01086957   29%: -0.01018638
30%: -0.00958206   31%: -0.008955224   32%: -0.008379888   33%: -0.0078125   34%: -0.007220217
35%: -0.006622517   36%: -0.006042296   37%: -0.005454545   38%: -0.004926108   39%: -0.004484305
40%: -0.004065041   41%: -0.003613948   42%: -0.003115265   43%: -0.002570694   44%: -0.001917582
45%: -0.000636902   46%: 0   47%: 0   48%: 0   49%: 0
50%: 0   51%: 0   52%: 0   53%: 0   54%: 0
55%: 0   56%: 0   57%: 0.000928186   58%: 0.002118644   59%: 0.002717391
60%: 0.003271538   61%: 0.003763441   62%: 0.004222658   63%: 0.004651163   64%: 0.005128205
65%: 0.005759585   66%: 0.006369427   67%: 0.007003379   68%: 0.007653061   69%: 0.008234218
70%: 0.008849558   71%: 0.00952381   72%: 0.01016518   73%: 0.01086957   74%: 0.01162791
75%: 0.01242236   76%: 0.01315789   77%: 0.01396648   78%: 0.01480836   79%: 0.01572327
80%: 0.01671309   81%: 0.01769912   82%: 0.01873935   83%: 0.01980198   84%: 0.02097902
85%: 0.02230483   86%: 0.0237069   87%: 0.02527076   88%: 0.02702703   89%: 0.02887139
90%: 0.03097345   91%: 0.03333333   92%: 0.03609023   93%: 0.03936903   94%: 0.04322767
95%: 0.04794521   96%: 0.05436573   97%: 0.06309148   98%: 0.07734807   99%: 0.1065574
100%: 4

We also see that the distribution of the geometric returns agrees with existing findings: the return distribution of the assets is more peaked and has fatter tails than the normal distribution.


Figure 3.3: Histogram and qq-plot of the most and least traded stock on Oslo Børs, we see that the returns have heavier tails


Chapter 4

Covariance

4.1 The covariance matrix

The covariance matrix contains information about the pairwise dependence among the components of a random vector X, where X = (X_1, ..., X_P)^T is a P-dimensional random vector with mean vector µ = E(X) = (µ_1, ..., µ_P)^T.

The covariance matrix Σ is defined as a P × P matrix

Cov(X) = Σ = E((X − µ)(X − µ)^T)

where Cov(X_i, X_j) = Σ_ij = E((X_i − µ_i)(X_j − µ_j)). The covariance matrix Σ is a symmetric, positive-semidefinite matrix. In the mean-variance framework it is also assumed that Σ is positive definite, since we require invertibility. For some basic properties of the covariance matrix I refer to [8].

4.2 Estimation error

Markowitz's framework assumes known µ and Σ, but in practice µ and Σ are unknown and have to be estimated from historical data or expectations about the future. This means that the precision of these estimates is uncertain and estimation errors can be made.

A survey by DeMiguel et al. (2007)[9] found that none of the various mean-variance portfolios with different mean and covariance estimates in the survey was able to significantly or consistently outperform the simple equally weighted 1/N portfolio. The authors attributed the poor performance of the portfolios relative to the equally weighted portfolio to estimation error. It has been shown that mean-variance portfolios are highly sensitive to estimates of means and covariances, and that small changes to these inputs result in extremely different portfolios[10]; these problems have caused mean-variance optimization not to be used directly in practice. The sensitivity has been credited to the optimization framework's ability to magnify the effects of estimation error, and for this reason Michaud[11] has referred to mean-variance portfolio optimization as error maximization. It is also known that it is more difficult to estimate means than covariances of asset returns, that errors in estimates of the means have a bigger impact on portfolio weights than errors in the covariance matrix[12], and that the covariance matrix can be estimated much more precisely than the expected returns[13]. When the sample mean is used, the estimation error may be so large that not much is lost by ignoring the mean altogether[14]. For this reason, in order to decrease estimation error, we can avoid the estimation of expected returns and instead assume that all stocks have equal expected returns.

This means that portfolios differ only with respect to risk, so the only efficient portfolio is the global minimum variance portfolio. The global minimum variance portfolio is appealing since the portfolio weights are determined only by the covariance matrix, or more accurately its inverse, and not the means.

Assumption: We will in this thesis assume that expected returns are all equal to µ = 0, r ∼ N(0, Σ).

The covariance matrix is intuitively useful for understanding the statistical relations in the data. However, computationally its inverse, often referred to as the precision matrix Σ^{-1}, is more relevant in the mean-variance optimization framework. As stated, the precision matrix is actually the only parameter needed to compute the optimal weights of the global minimum variance portfolio; recall that the optimal solution of the global minimum variance problem is given by w_gmv = Σ^{-1} e / (e^T Σ^{-1} e). Since the real inverse is unknown it has to be estimated from data, so Σ^{-1} must be replaced with an estimate Σ̂^{-1}, which gives the estimated weights ŵ_gmv = Σ̂^{-1} e / (e^T Σ̂^{-1} e).

The traditional estimator of ŵ_gmv is constructed by using the inverse of the sample covariance matrix, Σ̂^{-1} = S^{-1}. The traditional estimator is not a bad choice if the number of assets P is significantly smaller than the number of observations N. But when the number of assets P is of the same order as the sample size N, P/N ∼ 1, or larger, P/N > 1, the estimate tends to be far from the population covariance matrix and is therefore not an appropriate estimator of Σ[15]. Several alternative estimators have been proposed in the literature to overcome the deficiencies of the sample matrix: some authors have proposed imposing structure on the covariance matrix with factor models[16], others have suggested shrinking the sample matrix toward a target matrix[17], and some have used random matrix theory to correct the eigenvalues of the sample matrix[15]; see [18][19] for an overview of different methods. Surprisingly, most of the research focuses on estimating Σ and not Σ^{-1}; I found only two articles[19][20] addressing the inverse matrix when estimating the GMVP. So in this thesis I will look at estimation methods for Σ^{-1} and how they affect the global minimum variance portfolio.

The main focus of this thesis is to find a way to estimate the global minimum variance portfolio which gives better performance than the equally weighted portfolio and the portfolio constructed from the sample covariance matrix. The performance measure used is the out-of-sample variance of the portfolio, which is the portfolio variance in the next period.
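As an illustration of this performance measure, the following numpy sketch (hypothetical simulated data, not the thesis's R scripts) estimates GMVP weights on an in-sample window and evaluates the realized portfolio variance on a held-out window, next to the equally weighted benchmark:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate hypothetical returns for P = 5 assets: N in-sample and N out-of-sample days.
P, N = 5, 250
D = np.diag([1.0, 1.2, 1.5, 2.0, 3.0])
Sigma_true = 1e-4 * (0.3 * np.ones((P, P)) + D)   # assumed true covariance
r_in = rng.multivariate_normal(np.zeros(P), Sigma_true, size=N)
r_out = rng.multivariate_normal(np.zeros(P), Sigma_true, size=N)

def gmv_weights(Sigma_hat):
    """GMVP weights w = Sigma^{-1} e / (e^T Sigma^{-1} e) for an estimated covariance."""
    e = np.ones(Sigma_hat.shape[0])
    x = np.linalg.solve(Sigma_hat, e)             # Sigma_hat^{-1} e without explicit inverse
    return x / (e @ x)

w_hat = gmv_weights(np.cov(r_in, rowvar=False))   # weights from the in-sample window
oos_var = np.var(r_out @ w_hat)                   # out-of-sample portfolio variance

w_eq = np.full(P, 1 / P)                          # equally weighted 1/N benchmark
print(oos_var, np.var(r_out @ w_eq))
```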


4.2.1 Sample covariance matrix

The classical estimator of the covariance matrix Σ is the sample covariance matrix S. Given N observations of a P-dimensional random vector X = (X_1, ..., X_P)^T, we define the sample covariance as

S = 1/(N−1) Σ_{t=1}^N (X_t − X̄)(X_t − X̄)^T

where X̄ = (1/N) Σ_{t=1}^N X_t. S has a number of advantages: it is simple to construct, unbiased, and intuitively appealing[18]. But there are also several disadvantages. The obvious problem is that when the number of observations N is less than the number of variables P, S is not invertible. But even when N is close to P, P/N ∼ 1, the sample covariance S has a significant amount of sampling error, and its inverse is a poor estimator of Σ^{-1}[18]. Under the normality assumption, the expected value of the inverse is E(S^{-1}) = N/(N−P−2) Σ^{-1}. While S is unbiased for Σ, S^{-1} is highly biased for Σ^{-1} if P is close to N[18]. We see that if P = N/2 − 2, we have E(S^{-1}) = 2Σ^{-1}.

4.3 Precision matrix

The precision matrix Θ = Σ^{-1} has a specific mathematical interpretation: the elements in Θ contain information about the partial correlations between variables, that is, the covariance between a pair i and j conditioned on all other variables[21]. One can think of this as a measure of how correlated two variables are once the influence of a set of other variables has been accounted for. An important result concerns the case where X_i and X_j are normally distributed, in which case a partial correlation of 0 implies conditional independence[21].

The precision matrix is closely related to Gaussian graphical models, a framework for representing the structure of the conditional dependencies between Gaussian random variables in a complex system. Graphical models provide an easily understood way of describing and visualizing the relationships between multiple random variables. A Gaussian graphical model is an undirected graph G = (V, E) where the vertices (V) represent the Gaussian variables and the edges (E) represent the dependencies between the variables. Estimating a Gaussian graphical model is equivalent to estimating a precision matrix[8].
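Concretely, the partial correlation between variables i and j given the rest is −Θ_ij / √(Θ_ii Θ_jj), so a zero off-diagonal entry in Θ gives a zero partial correlation. A numpy sketch with a hypothetical precision matrix:

```python
import numpy as np

# Hypothetical precision matrix for three variables; the zero off-diagonal
# entry means conditional independence (under normality).
Theta = np.array([[ 2.0, -0.8,  0.0],
                  [-0.8,  2.5, -0.5],
                  [ 0.0, -0.5,  1.5]])

d = np.sqrt(np.diag(Theta))
partial_corr = -Theta / np.outer(d, d)   # rho_{ij|rest} = -Theta_ij / sqrt(Theta_ii Theta_jj)
np.fill_diagonal(partial_corr, 1.0)

print(partial_corr[0, 2])                # zero: variables 1 and 3 are conditionally independent
print(partial_corr[0, 1])                # nonzero: variables 1 and 2 are conditionally dependent
```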

4.4 Estimation

Suppose we are given i.i.d. observations X_1, X_2, ..., X_N, with X_i ∼ N(µ, Σ), from a P-dimensional multivariate normal distribution, and we would like to estimate Θ. One way to do this is through maximum likelihood estimation.

The log-likelihood of the data can be written as¹

$$l(\Theta) = \log(|\Theta|) - \operatorname{trace}(\Theta S) \qquad (1)$$

¹When estimating with log-likelihood methods, S will be denoted as $S = \frac{1}{N}\sum_{t=1}^{N}(X_t - \bar{X})(X_t - \bar{X})^T$.

Here we have ignored constants and partially maximized with respect to µ; see the Appendix for more detail. |Θ| denotes the determinant of Θ.

The Θ that maximizes l(Θ) is S^{-1}, and from theory this maximum likelihood estimate converges to the true precision matrix as the sample size N tends to infinity. But a problem arises when the number of variables P is comparable to or larger than the sample size N, in which case the MLE may be unstable or may not exist.
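A quick numerical check of the claim that S^{-1} maximizes l(Θ): since l is strictly concave over positive definite matrices, any other positive definite candidate scores strictly lower. A sketch (the perturbation is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 200, 5

X = rng.standard_normal((N, P))
S = np.cov(X, rowvar=False)

def loglik(theta, S):
    # l(Theta) = log|Theta| - trace(Theta S), constants dropped
    _, logdet = np.linalg.slogdet(theta)
    return logdet - np.trace(theta @ S)

theta_mle = np.linalg.inv(S)             # the maximizer of l
perturbed = theta_mle + 0.1 * np.eye(P)  # another positive definite candidate
print(loglik(theta_mle, S) > loglik(perturbed, S))  # True
```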

4.4.1 Regression

The entries of the precision matrix also have a regression-based interpretation [8]. To see this relationship between linear regression and the inverse covariance matrix, it is useful to partition the random variables X into two groups, X = (X_i, X_{−i}) where X_{−i} = (X_1, ..., X_{i−1}, X_{i+1}, ..., X_P), and express Σ as a block matrix:

$$\Sigma = \begin{pmatrix} \Sigma_{ii} & \Sigma_{i,-i} \\ \Sigma_{-i,i} & \Sigma_{-i,-i} \end{pmatrix}$$

Some useful identities about the entries of Θ are obtained by considering the linear least-squares regression of X_i on X_{−i}, X_i = β_i X_{−i} + ε_i:

$$\beta_i = \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}, \qquad \Theta_{ii} = (\Sigma_{ii} - \beta_i\Sigma_{-i,i})^{-1}, \qquad \Theta_{i,-i} = -\Theta_{ii}\beta_i$$

From these identities we can express the inverse covariance matrix as

$$\Theta = \begin{pmatrix} (\Sigma_{ii} - \beta_i\Sigma_{-i,i})^{-1} & -(\Sigma_{ii} - \beta_i\Sigma_{-i,i})^{-1}\beta_i \\ -\beta_i^T(\Sigma_{ii} - \beta_i\Sigma_{-i,i})^{-1} & \Sigma_{-i,-i}^{-1} + \beta_i^T(\Sigma_{ii} - \beta_i\Sigma_{-i,i})^{-1}\beta_i \end{pmatrix} = \begin{pmatrix} \Theta_{ii} & -\Theta_{ii}\beta_i \\ -\beta_i^T\Theta_{ii} & \Sigma_{-i,-i}^{-1} + \beta_i^T\Theta_{ii}\beta_i \end{pmatrix}$$

For more detailed derivation the reader is referred to the Appendix.

By changing i, the partition of X changes, which means that we can express each row or column of Θ through the regression coefficients β_i and the diagonal entry Θ_ii. From this we see that the sparsity of Θ is determined by the regression coefficients, which means that one can develop methods to impose sparsity in the precision matrix from techniques developed for linear regression.
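These block-inverse identities are easy to verify numerically. A sketch with a randomly generated positive definite Σ (the construction of Σ is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
P = 4

# Random positive definite covariance and its precision matrix
A = rng.standard_normal((P, P))
Sigma = A @ A.T + P * np.eye(P)
Theta = np.linalg.inv(Sigma)

i = 0
rest = np.arange(1, P)  # the indices "-i"

# beta_i = Sigma_{i,-i} Sigma_{-i,-i}^{-1}
beta = Sigma[i, rest] @ np.linalg.inv(Sigma[np.ix_(rest, rest)])

# Theta_ii = (Sigma_ii - beta_i Sigma_{-i,i})^{-1}
theta_ii = 1.0 / (Sigma[i, i] - beta @ Sigma[rest, i])
print(np.isclose(theta_ii, Theta[i, i]))              # True

# Theta_{i,-i} = -Theta_ii * beta_i
print(np.allclose(-theta_ii * beta, Theta[i, rest]))  # True
```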

4.4.2 Sparsity

A frequent task in statistics is to estimate the covariance matrix of a set of normally distributed random variables from given samples, but in the high-dimensional regime this is often a challenge, and it is therefore important to have robust alternatives to the standard sample covariance matrix estimator. One such alternative is sparse precision matrix estimation. The idea of a sparse precision matrix in covariance estimation first appeared in the work of Dempster in 1972 [22], and numerous methods for sparse precision matrix estimation have been developed since [23].

The motivation for imposing sparsity in the precision matrix can come either from a priori information about the variables or from estimation issues. If Θ_ij = 0, there is conditional independence between variables j and i, i.e. the covariance between i and j is 0 once all other variables are taken into account. For some types of variables the notion of conditional independence is natural: variables belonging to a given class are likely to be related, while variables from different classes are more likely to be independent. When the number of variables P is comparable to or larger than the sample size N, the sample estimate may be unstable or may not exist. To overcome the problems that occur in the high-dimensional setting, regularization is often used. Regularization means imposing a penalty on the estimated parameters to control their magnitude. When the L1 penalty is used, it will also promote sparsity.

Suppose first that we know the structure of the graph, i.e. which edges are missing and hence which entries of Θ are zero. Then we could maximize (1) under the constraint that those parameters in Θ are zero. We will see that we can use the regression identities introduced earlier to solve for Θ and its inverse W = Θ^{-1}, one row and column at a time.

To constrain the log-likelihood (1), we add Lagrange constants for all missing edges:

$$\log(|\Theta|) - \operatorname{trace}(\Theta S) - \sum_{(j,k)\notin E}\gamma_{jk}\Theta_{jk} \qquad (2)$$

The gradient equation for maximizing (2) can be written as [21]

W −S−Γ = 0

Γ is a matrix of Lagrange parameters with nonzero values for all pairs with absent edges. From the gradient equation we get W_{−i,i} − S_{−i,i} − Γ_{−i,i} = 0. We also have the relationship W_{−i,i} = W_{−i,−i}β_i^T, such that

$$W_{-i,-i}\beta_i^T - S_{-i,i} - \Gamma_{-i,i} = 0 \qquad (3)$$

Now we can solve (3) by simple subset regression. Suppose there are p − q nonzero elements in Γ_{−i,i}; these correspond to the absent edges at node i. These p − q elements carry no information, so we can reduce β_i to β_i^* by removing the p − q elements corresponding to the nonzero elements of Γ_{−i,i}, yielding the reduced q × q system of equations

$$W_{-i,-i}^*\,\beta_i^{*T} - S_{-i,i}^* = 0,$$

which we can solve for β^*, giving $\hat{\beta}_i^{*T} = W_{-i,-i}^{*-1} S_{-i,i}^*$. β̂_i^* is then padded back with p − q zeros to give β̂_i. This leads to a simple iterative procedure, where we update W for each i via W_{−i,i} = W_{−i,−i}β̂_i^T. It can be summarized as:


1. Initialize W = S.

2. Until convergence, repeat for i = 1, 2, ..., P:

(a) Partition W into part 1, (−i): all but the ith row and column, and part 2, i: the ith row and column.

(b) Set $\hat{\beta}_i^* = W_{-i,-i}^{*-1} S_{-i,i}^*$, where the p − q elements corresponding to the absent edges are removed. Pad β̂_i^* with p − q zeros to get β̂_i.

(c) Update W_{−i,i} = W_{−i,−i}β̂_i^T.

3. For each i in the final cycle, set

$$\hat{\Theta}_{ii} = \frac{1}{W_{ii} - \hat{\beta}_i W_{-i,i}}, \qquad \hat{\Theta}_{-i,i} = -\hat{\beta}_i^T\,\hat{\Theta}_{ii}$$

For more details about this procedure, see the Appendix and [21].

4.4.3 Graphical lasso

In most cases we do not know which edges in the graph are zero, so the structure of the graph has to be inferred. Meinshausen and Buhlmann [25] took a simple approach to the problem: rather than trying to fully estimate Θ, they only identified its nonzero entries. To do this, they fitted a lasso regression using each variable as the response and the others as predictors. Θ_ij is estimated to be zero if the coefficient for variable j in the lasso regression of variable i on the variables −i, lasso(X_{−i}, X_i), is zero and/or if the coefficient for variable i in the regression lasso(X_{−j}, X_j) is zero. Banerjee et al. [26] and Yuan and Lin [27] provided further improvements on the initial work of Meinshausen and Buhlmann [25]. In both works, the problem of estimating Θ is treated as penalized maximum likelihood: instead of considering P different regression problems, the two articles focus on the likelihood of the Gaussian variables and penalize the entries of the concentration matrix with an L1 penalty.

$$\log(|\Theta|) - \operatorname{trace}(S\Theta) - \lambda\|\Theta\|_1 \qquad (4)$$

Here λ is a penalty parameter which controls the amount of L1 penalty imposed, and ‖Θ‖_1 is the sum of the absolute values of the elements of Θ. Friedman et al. [24] used this previous work as a launching point to propose a new algorithm that finds the Θ maximizing (4), which they called the graphical lasso. The graphical lasso (glasso) algorithm is very similar to the algorithm in the previous section, but replaces step 2b with a lasso problem. The reason for this is that the gradient equation for maximizing (4) can be written as W − S − λ Sign(Θ) = 0 [21], which is equivalent to the gradient of a lasso problem; see the Appendix for the derivation. The glasso algorithm is:

1. Initialize W = S + λI.

2. Until convergence, repeat for i = 1, 2, ..., P:

(a) Partition W into part 1, (−i): all but the ith row and column, and part 2, i: the ith row and column.

(b) Solve the lasso problem

$$\min_{\beta_i}\left\{\tfrac{1}{2}\left\|W_{-i,-i}^{1/2}\beta_i - W_{-i,-i}^{-1/2}S_{-i,i}\right\|_2^2 + \lambda\|\beta_i\|_1\right\}$$

(c) Update W_{−i,i} = W_{−i,−i}β̂_i.

3. For each i in the final cycle, set

$$\hat{\Theta}_{ii} = \frac{1}{W_{ii} - \hat{\beta}_i^T W_{-i,i}}, \qquad \hat{\Theta}_{-i,i} = -\hat{\beta}_i\,\hat{\Theta}_{ii}$$

The solution glasso gives is a positive definite Θ for all λ > 0, even if S is singular [28].

If W_{−i,−i} = S_{−i,−i} in step 2b for every cycle, then the solutions β would equal the lasso estimates for the ith variable on the others, as in the Meinshausen and Buhlmann approach [25]. In general W_{−i,−i} ≠ S_{−i,−i}, and therefore P separate lasso regressions as in the Meinshausen and Buhlmann [25] approach do not yield the maximum likelihood estimator.
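In practice one rarely codes the glasso loop by hand; scikit-learn, for instance, ships an implementation. A hedged sketch (the true tridiagonal precision matrix and the choice alpha = 0.05 are mine; sklearn's `alpha` plays the role of λ):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(3)
P = 5

# True sparse precision: tridiagonal, so non-neighbouring variables
# are conditionally independent
theta_true = 2.0 * np.eye(P)
for k in range(P - 1):
    theta_true[k, k + 1] = theta_true[k + 1, k] = -0.8
sigma_true = np.linalg.inv(theta_true)

X = rng.multivariate_normal(np.zeros(P), sigma_true, size=500)

model = GraphicalLasso(alpha=0.05).fit(X)  # alpha plays the role of lambda
theta_hat = model.precision_               # estimated Theta, positive definite
print(np.round(theta_hat, 2))
```

Increasing `alpha` drives more off-diagonal entries of the estimate exactly to zero, recovering a sparser graph.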


Chapter 5

Sparsity structure

When we impose sparsity on the precision matrix, it is either because we think that some conditional independence exists in the data, or to handle estimation problems.

In our case it will be a mixture of both: since the data we are considering are stock returns, which can often be high dimensional, regularization can be used to fix problems such as the invertibility of Σ. But it is also natural to think that there exist groups of stocks that are more similar than others, and that there is conditional independence between pairs of dissimilar stocks. One common way to group stocks together is by industry affiliation, and most market agents assume that companies within industries are more tightly correlated than a random sample of companies [29].

One way to describe the dependence structure of stocks is to characterize it with a community network structure. A graph is said to have community structure if there are groups of nodes such that there is a higher density of edges within groups than between them.

Figure 5.1: Example of a community structure graph

The principle in this kind of structure is that pairs of nodes are more likely to be connected if they are both members of the same group (community), and less likely to be connected if they are not in the same group. This implies that the graph will be dense for nodes in the same group while being sparse between nodes in different groups. Many real-world networks have community structures, such as social networks and biological networks, and it is intuitive to think that stocks also exhibit this kind of structure, such that companies operating in the same industry are more likely to be connected than companies that are not in the same industry. Research on the community structure of financial markets suggests that stock markets exhibit a community structure, and that the communities correspond to the industrial classification of the companies to some extent [30][31].

In this thesis we will assume that the data originates from a community structure, and all simulations will be drawn from such a structure. In this chapter I present how the data for the simulations was generated.

5.0.1 Graph Generation

In the simulations in this thesis I will use a simple method for generating data that has an underlying community structure. We first generate the graph which represents the structure of the data and then generate a Θ from this structure, which can be used to simulate data. I present here the method used for generating the graph structure.

Given the inputs:

• P, number of nodes in the graph

• K, number of groups

• c, a vector of length P giving each node's group classification.

• Γ, a K × K symmetric matrix where Γ_ii is the probability of an edge between two nodes in group i, and Γ_ij is the probability of an edge between a node in group i and a node in group j.

we can generate an adjacency matrix. An adjacency matrix A is a P × P matrix which can be used to represent the structure of a graph. It is defined as

$$A_{ij} = \begin{cases} 1 & \text{if } i, j \text{ are connected} \\ 0 & \text{otherwise} \end{cases}$$

For undirected graphs A is symmetric.

Given a Γ we can randomly generate A, where A_ij is 1 with probability Γ_kl when node i is in group k and node j is in group l. In this thesis we will use the same intra- and inter-group probabilities for all groups; this means Γ_ii = p for all i ∈ 1, ..., K and Γ_ij = q for all i ≠ j, and it will also be assumed that p > q. This random graph structure is also known as the stochastic block model with assortative communities [32].
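The stochastic block model sampling described above can be sketched as follows (function name and symmetrization details are my own):

```python
import numpy as np

def sbm_adjacency(c, p, q, rng):
    """Adjacency matrix of a stochastic block model.
    c: length-P array of group labels; p: intra-group edge
    probability; q: inter-group edge probability."""
    P = len(c)
    same = np.equal.outer(c, c)        # True where nodes share a group
    prob = np.where(same, p, q)
    draws = rng.random((P, P)) < prob
    A = np.triu(draws, k=1)            # sample each pair once, zero diagonal
    return (A | A.T).astype(int)       # symmetrize: undirected graph

rng = np.random.default_rng(4)
P = 50
K = max(int(np.log(P)), 1)             # K = log(P) groups, as in the text
c = rng.integers(0, K, size=P)
A = sbm_adjacency(c, p=1.0, q=0.1, rng=rng)
print(A.shape)            # (50, 50)
print((A == A.T).all())   # True: undirected graph
```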


Figure 5.2: Generated community graphs with different inter and intra-group prob- abilities

In all the simulations we will use p = 1 and q = 0.1 for the intra- and inter-group probabilities, and the number of groups will be K = log(P).

Figure 5.3: Generated community graphs when p = 1,q = 0.1, P = 50 and K = log(P)


5.0.2 Simulate from structure

Given an adjacency matrix A, we want to simulate an N × P data observation matrix O = (O_1, ..., O_N)^T, where each observation O_i ∼ N(0, Σ) and Σ^{-1} = Θ has the same zero entries as A. To my knowledge there is no standard method to generate a precision matrix with a specified sparsity structure, so I propose my own method.

The challenge is to get a positive definite matrix with the same zero entries as A. Since the adjacency matrix is not necessarily positive definite, we cannot simply use Θ ◦ A as a new sparse precision matrix Θ_A, where Θ is an arbitrary dense precision matrix and ◦ denotes the Hadamard product.

For two matrices A, B of the same dimension N × P, the Hadamard product A ◦ B is the matrix defined by the element-wise product of A and B: (A ◦ B)_ij = A_ij B_ij. Notice that if A were positive definite, then Θ_A would also be positive definite by the Schur product theorem [8]. So if we can find a matrix L that has the same structure as A and is positive definite, we get our desired matrix Θ_L. Such an L is not hard to find: one way is to add to the diagonal of A until it becomes positive definite. If we add a constant c to the diagonal of A, L = A + cI, then we shift the eigenvalues of A upwards by c:

$$L = A + cI = Q\Lambda Q^T + cQQ^T = Q(\Lambda + cI)Q^T$$

where Q is an orthogonal matrix of the eigenvectors of A and Λ is a diagonal matrix of the eigenvalues of A. A + cI has the same eigenvectors as A, but the eigenvalues become λ_i + c for i = 1, ..., P, where λ_i are the eigenvalues of A. Since we have only changed the diagonal of A, the off-diagonal entries remain the same. This means that if A is not positive definite, we can find a positive definite matrix L with the same off-diagonal elements as A by adding a constant c to the diagonal such that the eigenvalues of L = A + cI are positive. If we choose c = |λ_min| + ε with ε > 0, we achieve positive definiteness.
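A small demonstration of the diagonal-shift trick on a 3-node adjacency matrix (the example matrix and ε are mine):

```python
import numpy as np

# An adjacency matrix is symmetric but usually not positive definite
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
lam_min = np.linalg.eigvalsh(A).min()
print(lam_min < 0)  # True

# Shift the diagonal: L = A + (|lambda_min| + eps) I
eps = 1e-6
L = A + (abs(lam_min) + eps) * np.eye(3)
print(np.all(np.linalg.eigvalsh(L) > 0))  # True: L is positive definite
# The off-diagonal entries of L are unchanged, so L has A's sparsity pattern
```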

With the method described above for finding a matrix with the same sparsity pattern as an adjacency matrix A, and a procedure for generating a dense precision matrix, we can define a procedure for randomly generating data from an underlying sparse structure.

Input: A

1. Define L = A + (|λ_min| + ε)I.
2. Generate Σ and calculate Σ^{-1} = Θ.
3. Define Θ_L = Θ ◦ L.
4. Calculate Σ_L = Θ_L^{-1}.
5. Generate O, O ∼ N(0, Σ_L).
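The steps above can be sketched as a single function (names, ε, and the dense Θ used as a stand-in for step 2 are my own):

```python
import numpy as np

def simulate_from_structure(A, Theta, n, rng, eps=1e-6):
    """Force the sparsity pattern of adjacency matrix A onto a dense
    precision matrix Theta, then sample n observations from N(0, Sigma_L)."""
    P = A.shape[0]
    lam_min = np.linalg.eigvalsh(A).min()
    L = A + (abs(lam_min) + eps) * np.eye(P)  # step 1: positive definite shift
    Theta_L = Theta * L                        # step 3: Hadamard product
    Sigma_L = np.linalg.inv(Theta_L)           # step 4
    return rng.multivariate_normal(np.zeros(P), Sigma_L, size=n)  # step 5

rng = np.random.default_rng(6)
P = 4
# Path graph: only neighbouring variables are conditionally dependent
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])

# Step 2: a simple stand-in for the dense precision matrix
B = rng.standard_normal((P, P))
Theta = np.linalg.inv(B @ B.T + P * np.eye(P))

O = simulate_from_structure(A, Theta, n=200, rng=rng)
print(O.shape)  # (200, 4)
```

By the Schur product theorem, Θ ◦ L is positive definite whenever both factors are, so Σ_L exists and is a valid covariance matrix.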

For the generation of the dense covariance matrix Σ the following method will be used:

Generate a P × K matrix W from N(0,1), where K ≤ P, and a 1 × P vector d from N(0,1). Define V = W W^T + D, where D is a diagonal matrix with d as the diagonal. Then normalize to a correlation matrix C = NVN, where N is a diagonal matrix with diagonal elements 1/√(V_ii) for i = 1, ..., P. Finally, rescale C to a covariance matrix with the desired variance range, Σ = RCR, where R is a diagonal matrix with entries sampled from the range (0, √σ), where σ is the desired maximum variance.

One thing worth noticing is that the choice of K controls the distribution range of the off-diagonal elements of C and Σ: the lower the K, the wider the range will be. This means that this generation method is suitable when you want some high correlations in the data. In all the simulations I will use K = 0.1P.
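The dense covariance generator can be sketched as follows; note that I take |d| for the diagonal, an assumption of mine to guarantee V is positive definite, since a raw N(0,1) draw can make it indefinite:

```python
import numpy as np

def dense_covariance(P, K, max_var, rng):
    """V = W W^T + D, normalized to a correlation matrix C = N V N,
    then rescaled: Sigma = R C R with R_ii drawn from (0, sqrt(max_var))."""
    W = rng.standard_normal((P, K))
    d = np.abs(rng.standard_normal(P))      # |N(0,1)|: my assumption, keeps V PD
    V = W @ W.T + np.diag(d)
    N = np.diag(1.0 / np.sqrt(np.diag(V)))
    C = N @ V @ N                            # correlation matrix, diag(C) = 1
    R = np.diag(rng.uniform(0, np.sqrt(max_var), size=P))
    return R @ C @ R

rng = np.random.default_rng(7)
Sigma = dense_covariance(P=100, K=10, max_var=(0.1 / 2.5) ** 2, rng=rng)
print(np.diag(Sigma).max() <= (0.1 / 2.5) ** 2)  # True: variances are bounded
```

Since diag(C) = 1, the variances are exactly R_ii², which by construction stay below the requested maximum.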

Figure 5.4: Distribution of the correlations for different K; in this plot P = 100

In many cases, such as in this thesis, one wants to control the maximum variance of Σ_L. Since the interest is in stock returns, it is preferable that the simulated data reflects the nature of the returns; hence I would like to set a likely range within which the simulated returns fall. From the return distribution chapter we saw that most of the returns fall in the range (−0.1, 0.1). Controlling this range is achieved by setting a maximum value for the variance in the covariance matrix. I will use some results from George P. H. Styan [33] to determine what to set as the maximum value for the variance in the covariance matrix Σ_L.

When A and B are positive definite, we have

$$\lambda_{\min}(A)B_{\min} \le \lambda_i(A \circ B) \le \lambda_{\max}(A)B_{\max}$$

for i = 1, ..., P, where λ_min and λ_max denote the smallest and largest eigenvalues, and B_min and B_max the smallest and largest values of the diagonal of B. We also have, when A is symmetric, λ_min(A) ≤ A_ii ≤ λ_max(A) for i = 1, ..., P. From this we get

$$\frac{1}{\lambda_{\min}(A)B_{\min}} \ge \frac{1}{\lambda_i(A \circ B)}, \qquad \frac{\lambda_{\max}(A^{-1})}{B_{\min}} \ge \lambda_i\big((A \circ B)^{-1}\big),$$

$$\max_i\big((A \circ B)^{-1}_{ii}\big) \le \lambda_{\max}\big((A \circ B)^{-1}\big) \le \frac{\lambda_{\max}(A^{-1})}{B_{\min}}.$$

If we control the maximum eigenvalue of the dense Σ in step 2, we can control the maximum variance in Σ_L, since max_i((Σ_L)_{ii}) = max_i((Θ ◦ L)^{-1}_{ii}) ≤ λ_max(Σ)/L_min. If we set our desired maximum value of (Σ_L)_{ii} as η, we can achieve this by setting the maximum eigenvalue of Σ to ηL_min.

When we assume that our observations come from a multivariate Gaussian distribution, each univariate variable has a univariate Gaussian distribution, and the value range of variable i is controlled by the variance (Σ_L)_{ii}. For a univariate Gaussian variable, the 68–95–99.7 rule [34] states that values lie within a 2σ range with probability approximately 95% and within a 3σ range with probability approximately 99.7%, where σ is the standard deviation. For the stock returns we want most of the returns to fall in the range (−0.1, 0.1), which we can achieve by setting the maximum variance η in the range ((0.1/3)², (0.1/2)²). In the simulations I will use η = (0.1/2.5)².
