A comparison of the local Gaussian correlation and the local dependence function


Christopher James Brokstad June 2022

Master’s thesis in Actuarial Science, Department of Mathematics,

University of Bergen


Abstract

Correlation measures the relation between two or more variables. This thesis considers one method of measuring correlation, the local Gaussian correlation, and one method of measuring dependence, the local dependence function. The goal was to build a bridge between these two measures. The hypothesis is that, for a bivariate normal density, both methods will locally approximate the density’s correlation coefficient. The local dependence function is a measure of dependence, so it has to be transformed to give a function of the correlation. However, no clear connection between the two methods was found. Instead of the local dependence function, the precision matrix was utilised. The precision matrix provides an opportunity to find the correlation coefficient locally for the bivariate normal density. Thus, a bridge between the local Gaussian correlation and the correlation estimate from the precision matrix can be built.

To obtain the correlation estimate from the precision matrix, the matrix has to be inverted to give the covariance matrix. However, while a connection was made, some details remain unclear. The local Gaussian correlation is defined over the whole range of the density, while the estimated correlation from the precision matrix is undefined in some areas for certain densities. The function of the estimated correlation from the precision matrix is discussed, to explain why some areas are undefined and why the estimate takes the form it does for different densities. The two methods also produced differing correlation estimates in some areas. This thesis uses the same set of test case densities as the introductory paper for the local dependence function (Jones, 1996) and its preliminary paper (Doksum et al., 1994). As the local Gaussian correlation is an empirical method of measuring correlation, it needs data. For this thesis there is no observed data, so simulated data are used in the analyses instead. The correlation estimate from the precision matrix, on the other hand, requires the densities to be known. To further explore the precision matrix’s correlation estimate, two additional approaches are explored, the Box-Cox transformation and Gaussian kernel estimates, but they require further work. In conclusion, while a bridge is constructed between the local Gaussian correlation and the precision matrix’s correlation estimate, more work is needed to establish a clearer connection.


Acknowledgments

I would like to thank my supervisor, Professor Hans J. Skaug, for his patience, for interesting discussions and for helping to guide me throughout this Master’s thesis.

I would like to thank senior consultant Kristine Lysnes for answering my questions related and unrelated to the master’s project. I would also like to thank my friends and family for their support during the writing of this thesis.


List of Figures

1 Simulated datapoints from the function Y = sin(3x) + ε

2 The Pear density, simulated observations, precision matrix’s correlation estimate and the local Gaussian correlation map

3 The twisted Pear density, simulated observations, precision matrix’s correlation estimate and the local Gaussian correlation map

4 The Cauchy density, simulated observations, precision matrix’s correlation estimate and the local Gaussian correlation map

5 The transformed normal density, simulated observations, precision matrix’s correlation estimate and the local Gaussian correlation map

6 Variance estimates for the Pear density

7 Variance estimates for the twisted Pear density

8 Variance estimates for the Cauchy density

9 Variance estimates for the transformed normal density

10 Outlier variance estimates for the Pear density

11 The Pear density’s negative σ̂ values

12 The correlation estimates from the local dependence function for the Cauchy distribution

13 Correlation estimates using the precision matrix for the Box-Cox transformed Cauchy density for 2 λ values

14 Box-Cox transformed Cauchy density for 5 different λ values

15 Correlation estimates using the precision matrix for the Box-Cox transformed Cauchy density for 5 different λ values

16 Four examples of kernel functions

17 Bivariate Gaussian kernel estimates for the Pear, twisted Pear, Cauchy and transformed normal density

18 Bivariate Gaussian kernel estimates of the transformed normal density for 3 different h values

19 The correlation estimates from the precision matrix for the bivariate Gaussian kernel estimates of the densities

20 Histograms of ρ̂(x, y) estimates for 5 different h levels for bivariate Gaussian kernel estimates of the Cauchy density

21 Contours of ρ̂(x, y) for the bivariate Gaussian kernel estimates of the Cauchy density at 5 different h values


22 Undefined areas for ρ̂(x, y) for the bivariate Gaussian kernel estimates of the Cauchy density for 2 different h values

23 Local Gaussian correlation maps of a bivariate normal density for 4 different sample sizes


Table of Abbreviations

AISB - Asymptotic Integrated Square Bias

AIV - Asymptotic Integrated Variance

AMISE - Asymptotic Mean Integrated Square Error

ARMA - Auto-regressive Moving Average

i.i.d - Independent and Identically Distributed

IQR - InterQuartile Range

MISE - Mean Integrated Square Error

MSE - Mean Square Error


1 Introduction

Correlation is a measure of the relation between two or more variables and is a commonly studied subject, as it can be important to understand how variables in a sample relate to each other. The local Gaussian correlation is a measure of correlation that works locally, on areas of the data. It does this by approximating the data in each area with a Gaussian density, and the correlation estimates come from these approximations. My introduction to the local Gaussian correlation was through my bachelor thesis (Brokstad, 2020).

This bachelor thesis provided a general overview of the local Gaussian correlation, and used it to investigate the relation between COVID-19 data and the Oslo Børs stock index. The local Gaussian correlation has primarily been used for investigating the stock market. An example of this is the paper by Støve et al. (2014), where the local Gaussian correlation is used to investigate different financial crises. Another example is Nguyen et al. (2020), which looks at correlation before and after economic crises. Some work has also been conducted in other areas of statistics, such as the paper by Jordanger and Tjøstheim (2021), where the local Gaussian correlation is used to inspect the upper and lower tails of spectral densities. Similarly, Berentsen et al. (2014) used the local Gaussian correlation to examine copula models and their characteristics.

The other method used is the local dependence function, introduced in Jones (1996). The local dependence function is a measure of dependence, so it has to be transformed to give an estimate of the correlation. Its areas of use are less specific than those of the local Gaussian correlation. Gupta et al. (2010) proved that the density is totally positive of order two when the local dependence function is greater than or equal to 0 everywhere. Jones and Koch (2003) introduce dependence maps, with the goal of making the local dependence function easier to interpret.

The goal of this thesis is to try to build a bridge between the local Gaussian correlation and the local dependence function. This bridge can be found if, for a bivariate normal density, both locally approximate the density’s correlation coefficient. As no clear connection was found between the local Gaussian correlation and the local dependence function, the precision matrix was utilised instead of the local dependence function.

The precision matrix is the inverse of the covariance matrix. To get the correlation coefficient, the precision matrix has to be inverted to give the covariance matrix, from which the correlation coefficient can be obtained. As the estimated correlation obtained from the precision matrix locally equals the correlation coefficient for the bivariate normal density, a bridge between it and the local Gaussian correlation can be built. However, some details remain unclear. One of the problems is that the local Gaussian correlation is defined over the entire area of the density, while the estimated correlation from the precision matrix has areas that are undefined. In addition, there are certain areas of the different densities where the local Gaussian correlation and the estimated correlation from the precision matrix give differing estimates. The reasons why they may differ and why there are undefined areas are discussed further in this thesis. For the implementation of the local Gaussian correlation, the package lg (Otneim, 2019) for the programming language R is used. For finding the precision matrix, the package TMB (Kristensen et al., 2016) is used. The Box-Cox transformation and the bivariate Gaussian kernel estimate are also implemented in relation to the precision matrix. However, they are only used briefly, so they are not studied as extensively as the regular precision matrix. Finally, the overall results are discussed, together with the problems that remain with the precision matrix’s estimated correlation.


2 Correlation

Correlation describes how two variables vary together. The most widely used form of correlation is Pearson’s ρ, introduced by Pearson (1896), and its formal definition is

$$
\rho = \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y}, \tag{1}
$$

where E(XY) is the expected value of X times Y, E(X) is the expected value of X and E(Y) is the expected value of Y. σ_X and σ_Y are the standard deviations of the X and Y variables, respectively. The value of ρ is within the range [−1, 1]. Positive ρ values indicate a positive correlation between the two variables, meaning that as X increases in value, Y tends to increase in value as well. For a negative correlation, as one of the variables increases in value the other tends to decrease. The strength of the association is described by how close ρ is to 1 or −1. A ρ value of 0.9 indicates a stronger correlation than a ρ value of 0.2. A value of ρ = 0 indicates that there is no linear correlation between the variables.

Although ρ has been given in its population form, it can also be approximated empirically by replacing the expected values and variances with their empirical forms, the sample means and sample variances. For a given dataset with the observations (X₁, Y₁), ..., (X_n, Y_n), the means of the X and Y observations are given by X̄ and Ȳ, and the sample standard deviations by s_X and s_Y. The empirical version of Pearson’s ρ is

$$
r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}. \tag{2}
$$

For both the population variant and the empirical variant, a single value of the correlation is found for all of the data.


Figure 1: 1 000 simulated datapoints from the function Y = sin(3x) + ε, where ε is independent and identically distributed noise.

While Pearson’s ρ gives a good overview of the relation between the X and Y variables, it has its weak points. For certain datasets we are more interested in the correlation in specific areas than in the correlation coefficient of the whole dataset. There are at least two major disadvantages of Pearson’s ρ. The first is that ρ is sensitive to outliers. For example, for a hypothetical dataset of (1,3), (4,6), (5,2), (13,9), the correlation is r = 0.80. If (13,9) is removed from the dataset, then r = 0.038 instead. The proposed dataset is small, so r is more susceptible to such changes than it would be for a dataset of size n = 1000, for example. But even for the larger dataset, r would still be sensitive to outliers. The reason is that both the empirical formula (2) and the population formula (1) are built on means. Outlying values have a bigger impact on the mean than data within the normal range.
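The outlier example can be reproduced with a few lines of Python. This is a minimal sketch that implements equation (2) directly; the function name `pearson_r` is chosen here for illustration and is not part of the thesis.

```python
import math


def pearson_r(points):
    """Empirical Pearson correlation, equation (2), for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


data = [(1, 3), (4, 6), (5, 2), (13, 9)]
r_all = pearson_r(data)                    # all four points
r_without_outlier = pearson_r(data[:-1])   # (13, 9) removed

print(f"{r_all:.3f} {r_without_outlier:.3f}")  # prints "0.802 0.038"
```

Removing the single point (13, 9) moves r from about 0.80 to about 0.04, which is the sensitivity described in the text.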


The second problem is that Pearson’s ρ is set up to measure linear correlation between the two variables. For nonlinear relations it will therefore not describe the association between the two variables well. An example of this issue, discussed in Tjøstheim et al. (2022), is Y = X² + ε. Pearson’s ρ breaks down for this example because it is not able to capture the nonlinear association between the two variables: Y and X are negatively correlated for negative values of X, whereas for positive values of X there is a positive correlation between the two variables.

Similarly, another example of nonlinearity on which ρ will underperform is Y = sin(3x) + ε, as seen in figure 1, where ε is independent and identically distributed (i.i.d) noise.
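The Y = X² case can be illustrated numerically. The sketch below makes two simplifying assumptions that are not in the thesis: X is taken on a symmetric grid over [−2, 2], and the noise term ε is left out so that the result is deterministic.

```python
import math


def pearson_r(xs, ys):
    """Empirical Pearson correlation, equation (2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Symmetric grid on [-2, 2]; noise is omitted (an assumption of this sketch).
xs = [i / 100 for i in range(-200, 201)]
ys = [x ** 2 for x in xs]

r_all = pearson_r(xs, ys)
r_negative = pearson_r([x for x in xs if x < 0],
                       [y for x, y in zip(xs, ys) if x < 0])
r_positive = pearson_r([x for x in xs if x > 0],
                       [y for x, y in zip(xs, ys) if x > 0])

print(f"all: {r_all:.3f}, x < 0: {r_negative:.3f}, x > 0: {r_positive:.3f}")
```

Over the whole grid the global coefficient is essentially zero, while each half on its own shows a strong negative or positive correlation. This is the kind of local structure a single global ρ cannot express.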

There are methods to reduce the susceptibility of ρ to outliers. An easy solution is to remove outliers from the dataset; a quick and dirty version of this is to only use data points within a given percentile range. Other methods alter how the correlation is calculated to make it less susceptible to this problem. These include more complex alternatives such as Spearman’s and Kendall’s methods, which are also outlined in Tjøstheim et al. (2022). Spearman’s rank correlation applies the correlation to the ranks of the observations. Kendall’s correlation coefficient τ looks at the difference between the number of concordant pairs and the number of discordant pairs. A pair of observations i < j is concordant if X_i > X_j and Y_i > Y_j, or X_i < X_j and Y_i < Y_j. It is discordant if the two orderings disagree, that is, if X_i > X_j and Y_i < Y_j, or X_i < X_j and Y_i > Y_j. However, both of these variations still have problems describing nonlinear relations that are not monotone, such as the two examples above.
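The concordant and discordant pair counts can be sketched as follows. This is an illustrative implementation of the simplest form of Kendall’s τ (pairs with ties are counted as neither concordant nor discordant, and the denominator is the total number of pairs); the function name `kendall_tau` is an assumption of this sketch.

```python
from itertools import combinations


def kendall_tau(points):
    """Kendall's tau as (concordant - discordant) / number of pairs."""
    concordant = 0
    discordant = 0
    for (x_i, y_i), (x_j, y_j) in combinations(points, 2):
        sign = (x_i - x_j) * (y_i - y_j)
        if sign > 0:        # both coordinates are ordered the same way
            concordant += 1
        elif sign < 0:      # the two orderings disagree
            discordant += 1
    n_pairs = len(points) * (len(points) - 1) // 2
    return (concordant - discordant) / n_pairs


data = [(1, 3), (4, 6), (5, 2), (13, 9)]
tau = kendall_tau(data)
print(round(tau, 3))  # 4 concordant and 2 discordant pairs out of 6, so 0.333
```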

So far, this thesis has mainly presented potential weaknesses of ρ. Yet it is still the most commonly used method of measuring correlation. As previously stated, ρ gives an overview of the relation in the whole dataset, unlike the local correlations discussed later. In addition, ρ is very simple to compute, as its components, the means and standard deviations, are simple to find. There are also other benefits of ρ, such as being able to describe the relation between X_t and X_{t+h} for linear time series models such as autoregressive moving average (ARMA) models (Tjøstheim et al., 2022).

Another group of models for which ρ is important is the multivariate normal density and similar densities from the same family, as Pearson’s ρ is one of the parameters used to describe the density and appears in the covariance matrix. ρ is also useful in linear regression models. For a model of the form Y = α + βX + ε, where ε is zero-mean i.i.d noise, β can be given as (Tjøstheim et al., 2022):

$$
\beta = \rho\,\frac{\sigma_Y}{\sigma_X}. \tag{3}
$$

For the linear regression model, if σ_X, σ_Y and ρ are unknown, we can instead use their empirical counterparts s_X, s_Y and r to get a similar result.
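As a quick check of equation (3) with empirical quantities, the sketch below compares the slope r · s_Y / s_X with the ordinary least squares slope on the small hypothetical dataset used earlier. The variable names are chosen for this illustration only.

```python
import math

data = [(1, 3), (4, 6), (5, 2), (13, 9)]
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
sxx = sum((x - mean_x) ** 2 for x, _ in data)
syy = sum((y - mean_y) ** 2 for _, y in data)

r = sxy / math.sqrt(sxx * syy)      # empirical Pearson correlation, equation (2)
s_x = math.sqrt(sxx / (n - 1))      # sample standard deviation of X
s_y = math.sqrt(syy / (n - 1))      # sample standard deviation of Y

beta_from_r = r * s_y / s_x         # empirical version of equation (3)
beta_least_squares = sxy / sxx      # ordinary least squares slope
alpha = mean_y - beta_least_squares * mean_x

print(beta_from_r, beta_least_squares, alpha)
```

The two slopes agree up to floating point rounding, because the factors of n − 1 cancel in r · s_Y / s_X.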


3 Local Gaussian Correlation

The local Gaussian correlation is a method of measuring correlation by locally approximating the density of a dataset with a Gaussian density. For a given point z = (x, y), the approximating density has the running variables v = (v₁, v₂), the local means µ₁(z) and µ₂(z), the local variances σ₁²(z) and σ₂²(z), and the local correlation ρ(z). For the point (x, y), the approximating density is

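$$
\begin{aligned}
&\psi(v, \mu_1(z), \mu_2(z), \sigma_1^2(z), \sigma_2^2(z), \rho(z)) = \frac{1}{2\pi\sigma_1(z)\sigma_2(z)\sqrt{1-\rho^2(z)}} \\
&\quad\times \exp\left\{-\frac{1}{2}\,\frac{1}{1-\rho^2(z)}\left[\frac{(v_1-\mu_1(z))^2}{\sigma_1^2(z)} - 2\rho(z)\frac{(v_1-\mu_1(z))(v_2-\mu_2(z))}{\sigma_1(z)\sigma_2(z)} + \frac{(v_2-\mu_2(z))^2}{\sigma_2^2(z)}\right]\right\}.
\end{aligned}
\tag{4}
$$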

The most important parameter of the approximation is ρ(z), which is the local Gaussian correlation. The dataset itself does not have to have a Gaussian density, as the method only approximates the density in a local area of the dataset with a Gaussian density. If the dataset is actually Gaussian distributed with density f, the local Gaussian density will coincide with f at every point. However, when the data in a given area are approximated, there can be multiple Gaussian densities that the approximation could take (Tjøstheim et al., 2013). To find the best fit for the approximation, a penalty function is used. The penalty function utilised for the local Gaussian correlation is a locally weighted Kullback-Leibler distance between f and ψ. The local Gaussian correlation is built on the work of Hjort and Jones (1996), and many of the functions and equations for its penalty function are derived from that paper. Hjort and Jones (1996) present how to find a local dependence measurement using local likelihood. However, unlike for the local Gaussian correlation, they do not specify which family the approximating density f̂ belongs to, while for the local Gaussian correlation f̂ takes the form of ψ.


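The penalty function is given as

$$
q = \int K_h(v - z)\left[\psi(v, \theta(z)) - \log\psi\{v, \theta(z)\}\, f(v)\right] dv. \tag{5}
$$

In the penalty function, K_h is a product kernel,

$$
K_h(v - z) = (h_1 h_2)^{-1} K\!\left(h_1^{-1}(v_1 - x)\right) K\!\left(h_2^{-1}(v_2 - y)\right), \tag{6}
$$

with bandwidth h = (h₁, h₂) (Tjøstheim et al., 2013), and

$$
\theta(z) = (\mu_1(z), \mu_2(z), \sigma_1^2(z), \sigma_2^2(z), \rho(z)). \tag{7}
$$

The θ(z) that minimizes the penalty function then satisfies

$$
\int K_h(v - z)\,\frac{\partial}{\partial\theta_j}\log(\psi(v, \theta(z)))\left[f(v) - \psi(v, \theta(z))\right] dv = 0, \qquad j = 1, \ldots, 5. \tag{8}
$$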

Let Z_i = (X_i, Y_i), i = 1, ..., n, be i.i.d observations from the density f. To get an estimate of θ(z) for a fixed z, we maximize the local log-likelihood (Tjøstheim et al., 2013)

$$
L(Z_1, \ldots, Z_n, \theta_h(z)) = n^{-1}\sum_i K_h(Z_i - z)\log\psi(Z_i, \theta_h(z)) - \int K_h(v - z)\,\psi(v, \theta_h(z))\,dv, \tag{9}
$$
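To make the local log-likelihood concrete, the following sketch evaluates it for a fixed parameter vector θ = (µ₁, µ₂, σ₁², σ₂², ρ). It is not the implementation used in the lg package. It assumes a standard normal kernel K, in which case the integral term has a closed form: the convolution of the Gaussian kernel with the Gaussian ψ is again a bivariate normal density, with the squared bandwidths added to the variances. All function names are hypothetical.

```python
import math


def standard_normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def product_kernel(v, z, h):
    """Product kernel K_h(v - z), assuming a standard normal K."""
    return (standard_normal_pdf((v[0] - z[0]) / h[0])
            * standard_normal_pdf((v[1] - z[1]) / h[1]) / (h[0] * h[1]))


def log_psi(v, theta):
    """Logarithm of the bivariate normal density psi."""
    mu1, mu2, var1, var2, rho = theta
    u1 = (v[0] - mu1) / math.sqrt(var1)
    u2 = (v[1] - mu2) / math.sqrt(var2)
    quad = (u1 * u1 - 2.0 * rho * u1 * u2 + u2 * u2) / (1.0 - rho * rho)
    log_norm = math.log(2.0 * math.pi) + 0.5 * math.log(var1 * var2 * (1.0 - rho * rho))
    return -0.5 * quad - log_norm


def kernel_psi_integral(z, h, theta):
    """Integral of K_h(v - z) * psi(v, theta) dv for a Gaussian kernel.

    The covariance between the components is unchanged by the smoothing,
    only the variances grow by the squared bandwidths.
    """
    mu1, mu2, var1, var2, rho = theta
    cov = rho * math.sqrt(var1 * var2)
    wide_var1 = var1 + h[0] ** 2
    wide_var2 = var2 + h[1] ** 2
    wide_rho = cov / math.sqrt(wide_var1 * wide_var2)
    return math.exp(log_psi(z, (mu1, mu2, wide_var1, wide_var2, wide_rho)))


def local_log_likelihood(data, z, h, theta):
    """Kernel-weighted data term minus the integral term, at the point z."""
    data_term = sum(product_kernel(p, z, h) * log_psi(p, theta) for p in data) / len(data)
    return data_term - kernel_psi_integral(z, h, theta)


# Hypothetical data lying roughly along the diagonal.
data = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (-0.5, -0.4), (0.5, 0.6)]
z, h = (0.0, 0.0), (1.0, 1.0)
ll_positive = local_log_likelihood(data, z, h, (0.0, 0.0, 1.0, 1.0, 0.8))
ll_negative = local_log_likelihood(data, z, h, (0.0, 0.0, 1.0, 1.0, -0.8))
print(ll_positive > ll_negative)  # a positive local correlation fits these points better
```

In practice θ would be found by numerically maximizing this function over all five parameters at each point z; that optimization step is left out here.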

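where K_h is the product kernel from equation (6). From here, one can obtain the derivative ∂L/∂θ_j (Tjøstheim et al., 2013),

$$
\begin{aligned}
\frac{\partial L}{\partial\theta_j} &= n^{-1}\sum_i K_h(Z_i - z)\,\frac{\partial}{\partial\theta_j}\log\{\psi(Z_i, \theta_h(z))\} - \int K_h(v - z)\,\frac{\partial}{\partial\theta_j}\log\{\psi(v, \theta_h(z))\}\,\psi(v, \theta_h(z))\,dv \\
&\to \int K_h(v - z)\,\frac{\partial}{\partial\theta_j}\log\{\psi(v, \theta_h(z))\}\left[f(v) - \psi(v, \theta_h(z))\right] dv,
\end{aligned}
\tag{10}
$$

where the limit as n → ∞ follows from the law of large numbers under the assumption that

$$
E\left[K_h(Z_i - z)\log\psi(Z_i, \theta_h(z))\right] < \infty. \tag{11}
$$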

Setting ∂L/∂θ_j = 0 then gives the local maximum likelihood estimates θ̂_h(z) (Tjøstheim et al., 2013). From the minimization of the penalty function, a Taylor expansion around z gives

$$
\int K_h(v - z)\,\alpha(v)\,dv = \alpha(z) + \frac{1}{2}\sum_{i=1}^{2}\sum_{j=1}^{2}\sigma^2_{K,ij}\,\frac{\partial^2\alpha(z)}{\partial z_i\,\partial z_j}\,h_i h_j + o(h^T h). \tag{12}
$$

In the equation, σ²_{K,ij} = ∫ c₁^i c₂^j K(c₁, c₂) dc₁ dc₂, where c = A_i⁻¹(z − a_i), A_i = Σ_i^{1/2}, Σ is the covariance matrix and a_i = µ_i; for further explanation see Tjøstheim et al. (2013). The important part is

$$
\alpha(v) = \frac{\partial}{\partial\theta_j}\log\{\psi(v, \theta_h(z))\}\left[f(v) - \psi(v, \theta_h(z))\right], \tag{13}
$$

for θ^T = [θ₁, ..., θ₅], which means that as h → 0, f(z) − ψ(z, θ_h(z)) → 0 as well. If f₁(z) = f₂(z) within a neighbourhood, then for a certain bandwidth h₀ it is possible to get θ(f₁, z) = θ(f₂, z) (Tjøstheim et al., 2013). For creating the subsequent figures, the R package lg (Otneim, 2019) is used, as it finds the local Gaussian correlation estimates for datasets. In addition, as there is no real data, the data are simulated using the Markov chain Monte Carlo method.


4 Local Dependence Function

Another method for measuring local dependence is described in the paper "The local dependence function" (Jones, 1996). The local dependence function takes the form

$$
\gamma(x, y) = \frac{\partial^2}{\partial x\,\partial y}\log f(x, y). \tag{14}
$$

As the local dependence function is the mixed second derivative of the log density, it is not restricted to values between −1 and 1, unlike Pearson’s ρ or the local Gaussian correlation. Jones (1996) also presents multiple properties of γ(x, y). One of those properties is that γ(x, y) is finite everywhere. Another is that γ = 0 everywhere if and only if X and Y are independent. Another important property is that γ is constant for a bivariate normal density (Jones, 1996), where it should have the form

$$
\gamma(x, y) = \frac{\rho}{1 - \rho^2}. \tag{15}
$$

Another property is that as the correlation between x and y becomes stronger, γ(x, y) grows without bound towards ±∞. This can be seen in equation (15): as ρ for the bivariate normal density goes towards ±1, the denominator goes towards 0.


5 Experiments

As the local dependence function is a function of ρ for bivariate normal densities, ρ can instead be described as a function of γ by solving the equation:

$$
\begin{aligned}
\gamma(x, y) &= \frac{\rho}{1 - \rho^2}, \\
\gamma(x, y)(1 - \rho^2) &= \rho, \\
\rho^2\gamma(x, y) + \rho - \gamma(x, y) &= 0.
\end{aligned}
\tag{16}
$$

From the inverse transformation we get the two possible values for ρ,

$$
\rho_1 = \frac{-1 + \sqrt{1 + 4\gamma(x, y)^2}}{2\gamma(x, y)}, \qquad
\rho_2 = \frac{-1 - \sqrt{1 + 4\gamma(x, y)^2}}{2\gamma(x, y)}. \tag{17}
$$

Experimenting with true ρ values within the range [−1, 1] and the corresponding γ(x, y) values shows that ρ₁ returns the true ρ values. This is also discussed in the introductory paper for the local dependence function (Jones, 1996). ρ₂, on the other hand, returns values outside the range [−1, 1] in which ρ is defined. However, while Pearson’s ρ can be equal to 0, the expression for ρ₁ cannot be evaluated there. This can be seen from equations (15) and (17): if ρ = 0 then γ in equation (15) is also equal to 0, and ρ₁ in equation (17) has γ in the denominator, so we would divide by 0. Although ρ₁ is undefined at γ = 0, it does approach 0 as γ goes towards 0 from both the positive and the negative side. This is shown by using L’Hôpital’s rule with

$$
f(\gamma) = -1 + \sqrt{1 + 4\gamma^2}, \quad g(\gamma) = 2\gamma, \quad f(0) = 0, \quad g(0) = 0, \quad
f'(\gamma) = \frac{4\gamma}{\sqrt{1 + 4\gamma^2}}, \quad g'(\gamma) = 2.
$$

Then the resulting limit is

$$
\lim_{\gamma \to 0}\frac{f'(\gamma)}{g'(\gamma)} = \lim_{\gamma \to 0}\frac{1}{2}\,\frac{4\gamma}{\sqrt{1 + 4\gamma^2}} = 0. \tag{18}
$$
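The experiment with ρ₁ and ρ₂ can be repeated with a short script. This sketch assumes the unit-variance form γ = ρ/(1 − ρ²) and uses the limit value 0 at γ = 0; the function names are chosen for illustration.

```python
import math


def gamma_from_rho(rho):
    """Local dependence of a bivariate normal density with unit variances."""
    return rho / (1.0 - rho * rho)


def rho1_from_gamma(gamma):
    """Root of rho^2 * gamma + rho - gamma = 0 that lies in [-1, 1]."""
    if gamma == 0.0:
        return 0.0  # limiting value obtained with L'Hopital's rule
    return (-1.0 + math.sqrt(1.0 + 4.0 * gamma * gamma)) / (2.0 * gamma)


def rho2_from_gamma(gamma):
    """The other root, which falls outside [-1, 1]; undefined at gamma = 0."""
    return (-1.0 - math.sqrt(1.0 + 4.0 * gamma * gamma)) / (2.0 * gamma)


for rho in (-0.9, -0.5, 0.2, 0.75):
    gamma = gamma_from_rho(rho)
    print(rho, rho1_from_gamma(gamma), rho2_from_gamma(gamma))
```

Since the product of the two roots of the quadratic is −1, ρ₂ = −1/ρ₁, which is why ρ₂ always lands outside the admissible range when |ρ₁| < 1.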


Jones (1996) states that "it is constant if f is the bivariate normal density, and then takes the value ρ/(1−ρ²) where ρ is the Pearson correlation coefficient;". This statement does not hold in general. Taking the mixed partial derivative of the logarithm of a bivariate normal density f, the γ function takes the form

$$
\gamma(x, y) = \frac{\partial^2}{\partial x\,\partial y}\log f(x, y) = \frac{\rho}{1 - \rho^2} \times \frac{1}{\sigma_x\sigma_y}. \tag{19}
$$

The reason that γ takes the form in equation (19) instead of the form given in equation (15) is that the bivariate normal density is defined as

$$
f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}}
\exp\left\{-\frac{1}{2}\,\frac{1}{1 - \rho^2}\left[\frac{(x - \mu_x)^2}{\sigma_x^2} - 2\rho\frac{(x - \mu_x)(y - \mu_y)}{\sigma_x\sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2}\right]\right\}. \tag{20}
$$

In the definition (20), the only part that contains both an x and a y variable is the cross term in the exponent,

$$
-2\rho\,\frac{(x - \mu_x)(y - \mu_y)}{\sigma_x\sigma_y}. \tag{21}
$$

Thus, when the local dependence function is found for the density, the denominator σ_xσ_y is not differentiated away. To exemplify this further, equations (19) and (17) can be used to check whether ρ₁ = ρ. For a bivariate normal density with σ_x = σ_y = 1, γ takes the form given in equation (15) and all possible ρ₁ values are equal to ρ. If instead σ_xσ_y ≠ 1, which is generally the case when one or both of the σ differ from 1, then ρ₁ ≠ ρ. This conclusion is further supported by Jones (1998), which states that it is for a standard bivariate normal density that γ takes the form given in equation (15). Using the definition of the local dependence function for the Pear density, which is a transformed normal density, the γ function is

$$
\gamma(x, y) = \frac{\rho}{1 - \rho^2} \times \frac{3x^2}{\sigma_x\sigma_y}. \tag{22}
$$
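Both expressions for γ can be checked numerically by approximating the mixed partial derivative of the log density with central differences. The sketch below is an illustration only: the bivariate normal parameters are hypothetical, the Pear parameters are the ones stated for figure 2, and the step size is an arbitrary choice.

```python
import math


def log_bivariate_normal(u, v, mu_u, mu_v, sd_u, sd_v, rho):
    a = (u - mu_u) / sd_u
    b = (v - mu_v) / sd_v
    quad = (a * a - 2.0 * rho * a * b + b * b) / (1.0 - rho * rho)
    return -0.5 * quad - math.log(2.0 * math.pi * sd_u * sd_v * math.sqrt(1.0 - rho * rho))


def mixed_partial(log_f, x, y, step=1e-3):
    """Central difference approximation of d^2 log f / (dx dy)."""
    return (log_f(x + step, y + step) - log_f(x + step, y - step)
            - log_f(x - step, y + step) + log_f(x - step, y - step)) / (4.0 * step * step)


# Hypothetical bivariate normal with non-unit standard deviations.
rho, sd_x, sd_y = 0.5, 2.0, 3.0
gamma_normal = mixed_partial(
    lambda x, y: log_bivariate_normal(x, y, 0.0, 0.0, sd_x, sd_y, rho), 0.7, -1.1)
print(gamma_normal, rho / ((1.0 - rho ** 2) * sd_x * sd_y))  # agree
print(rho / (1.0 - rho ** 2))                                # unit-variance form differs


# Pear density: (x, y) = (U^(1/3), V), so the density gains the Jacobian 3x^2.
def log_pear(x, y):
    return math.log(3.0 * x * x) + log_bivariate_normal(x ** 3, y, 10.0, 1.55, 10.0, 0.775, 0.75)


x0, y0 = 2.0, 3.0
gamma_pear = mixed_partial(log_pear, x0, y0)
expected_pear = 0.75 / (1.0 - 0.75 ** 2) * 3.0 * x0 ** 2 / (10.0 * 0.775)
print(gamma_pear, expected_pear)
```

For the bivariate normal the log density is quadratic, so the central difference is exact up to rounding; for the Pear density a small discretization error remains.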

The Pear density was included in Doksum et al. (1994) and can be seen in figure 2a. For the transformed normal density seen in figure 5a, γ takes the form

$$
\gamma(x, y) = \frac{\rho}{1 - \rho^2} \times \frac{y}{\sigma_x\sigma_y}. \tag{23}
$$

As shown, there are some problems with trying to build a connection between the local dependence function and the local Gaussian correlation. Therefore, my supervisor Professor Skaug proposed the use of the precision matrix instead. The precision matrix for the bivariate normal density can be defined as







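$$
\begin{bmatrix}
\dfrac{1}{(1 - \rho^2)\sigma_x^2} & \dfrac{-\rho}{(1 - \rho^2)\sigma_x\sigma_y} \\[2ex]
\dfrac{-\rho}{(1 - \rho^2)\sigma_x\sigma_y} & \dfrac{1}{(1 - \rho^2)\sigma_y^2}
\end{bmatrix}
=
\begin{bmatrix}
-\dfrac{\partial^2}{\partial x^2}\log f(x, y) & -\dfrac{\partial^2}{\partial x\,\partial y}\log f(x, y) \\[2ex]
-\dfrac{\partial^2}{\partial x\,\partial y}\log f(x, y) & -\dfrac{\partial^2}{\partial y^2}\log f(x, y)
\end{bmatrix}.
\tag{24}
$$

The precision matrix’s inverse is the covariance matrix, which for a bivariate normal density is given as

$$
\begin{bmatrix}
\sigma_x^2 & \rho\sigma_x\sigma_y \\
\rho\sigma_x\sigma_y & \sigma_y^2
\end{bmatrix}.
\tag{25}
$$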

As the covariance matrix contains ρ, it can be solved for ρ. This is done by taking either element (1,2) or element (2,1) of matrix (25), both of which equal ρσ_xσ_y, and dividing it by the square roots of elements (1,1) and (2,2), which equal σ_x² and σ_y². The only quantity left is then ρ. This approach is generalized to other densities, so that an estimate of ρ can be obtained at each point (x, y) by inverting the matrix of negative second derivatives of log f on the right-hand side of equation (24). The estimate obtained by going through the steps outlined above is referred to as

$$
\hat{\rho}(x, y). \tag{26}
$$

For the bivariate normal density this estimate is always ρ̂(x, y) = ρ. Thus a connection between it and the local Gaussian correlation can be found, as they both locally approximate the correlation coefficient for the bivariate normal density. As with the local Gaussian correlation, ρ̂(x, y) was found in R, this time by using the TMB package (Kristensen et al., 2016), which allows one to compile C++ files in R. The TMB package is useful because it can return the values of the second derivatives of functions. It can also return the values of the first derivatives if needed. One thing to note is that the package does not give the explicit form of the derivatives, so finding out how the second derivatives actually look has to be done by hand.
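The steps from log density to ρ̂(x, y) can be sketched without TMB. The example below is not the thesis implementation: it replaces TMB’s automatic differentiation with central finite differences, and the density parameters and function names are hypothetical. It only illustrates that, for a bivariate normal density, the procedure returns ρ at any point.

```python
import math


def log_bivariate_normal(x, y, mu_x, mu_y, sd_x, sd_y, rho):
    a = (x - mu_x) / sd_x
    b = (y - mu_y) / sd_y
    quad = (a * a - 2.0 * rho * a * b + b * b) / (1.0 - rho * rho)
    return -0.5 * quad - math.log(2.0 * math.pi * sd_x * sd_y * math.sqrt(1.0 - rho * rho))


def precision_matrix(log_f, x, y, step=1e-3):
    """Negative Hessian of log_f at (x, y), by central differences.

    Returns the entries (p_xx, p_xy, p_yy) of the symmetric 2 x 2 matrix.
    """
    f0 = log_f(x, y)
    d_xx = (log_f(x + step, y) - 2.0 * f0 + log_f(x - step, y)) / step ** 2
    d_yy = (log_f(x, y + step) - 2.0 * f0 + log_f(x, y - step)) / step ** 2
    d_xy = (log_f(x + step, y + step) - log_f(x + step, y - step)
            - log_f(x - step, y + step) + log_f(x - step, y - step)) / (4.0 * step ** 2)
    return -d_xx, -d_xy, -d_yy


def rho_hat(log_f, x, y):
    """Invert the precision matrix and read the correlation off the covariance matrix."""
    p_xx, p_xy, p_yy = precision_matrix(log_f, x, y)
    det = p_xx * p_yy - p_xy ** 2
    var_x = p_yy / det      # element (1,1) of the inverse
    var_y = p_xx / det      # element (2,2) of the inverse
    cov_xy = -p_xy / det    # element (1,2) of the inverse
    return cov_xy / math.sqrt(var_x * var_y)


def log_f(x, y):
    # Hypothetical parameters chosen for the illustration.
    return log_bivariate_normal(x, y, 0.0, 0.0, 2.0, 3.0, 0.75)


print(rho_hat(log_f, 0.3, -1.2), rho_hat(log_f, 5.0, 4.0))  # both close to 0.75
```

This sketch assumes both variance estimates are positive, which always holds for the bivariate normal density; for other densities the square root can fail.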

Figure 2: The Pear density N(x, y, 10, 1.55, 10², 0.775², 0.75), a transformed normal density obtained from (U, V) with x = U^(1/3), y = V, and simulated observations from it. a) contour of the density, b) 10 000 simulated observations from the density, c) ρ̂(x, y) estimates from the precision matrix and d) local Gaussian correlation map for the simulated data.

Figure 3: The twisted Pear density f(x, y) = f(x)f(y|x), where f(x) = N(1.2, (1/3)²), f(y|x) = N(µ(x), σ²(x)), µ(x) = (x/10) exp(5 − (x/2)) and σ²(x) = [(1 + 0.5x)/3]², and simulated observations from it. a) contour of the density, b) 10 000 simulated observations from the density, c) estimated correlation from the precision matrix and d) local Gaussian correlation map.

Figure 4: The Cauchy density and simulated observations from it. a) contour of the density, b) 100 000 simulated observations from the density, c) estimated correlation from the precision matrix and d) local Gaussian correlation map.

Figure 5: The transformed bivariate normal density N(x, y, 4, 2, 5², 2², −0.27), where x = U + 1, y = √V − 2, and simulated observations from it. a) contour of the density, b) 100 000 simulated observations from the density, c) ρ̂(x, y) values from the precision matrix and d) local Gaussian correlation map.


6 Variance

The ρ̂(x, y) estimates from equation (26) for the different densities are displayed in figures 2c to 5c. With the exception of the transformed bivariate normal density, the figures (2c to 4c) have areas that are white. These white areas occur where ρ̂(x, y) is imaginary. The estimate can be imaginary because it is computed from the variance estimates σ̂_x² and σ̂_y² by taking their square roots, so where these estimates are negative, the resulting ρ̂(x, y) is imaginary. So while estimating the variance is not the primary focus of the thesis, there is some value in analysing the estimates. The variance estimates for the different densities are shown in figures 6, 7, 8 and 9. They are evaluated on grids of sequentially increasing x and y values rather than at the sample points. As the grids depend on the areas they cover, the estimated variances likewise depend on those areas, and the estimates therefore have to be taken with a bit of skepticism. Only the estimates for the transformed normal density, in figure 9, have all of their values within the range of the histogram. This is consistent with figure 5c, the only one that has no white areas. The other figures, 6, 7 and 8, exclude 5% or less of the variance estimates. The only exception is figure 6a, which is missing 68 309 estimates out of 1 000 000. There is some information to glean from the negative estimates. For example, subfigures 8a and 8b contain a fair number of negative estimates, which at a surface level seems to correspond to the large white area in figure 4c. Another point to note is that, for all the figures, most of the estimates fall within the range [−1, 1]. While most of the excluded values are just outside this range, there are also extreme estimates in at least the thousands. This is expanded upon in the next section.
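How a negative variance estimate leads to an undefined ρ̂(x, y) can be seen from a 2 × 2 inversion alone. The sketch below uses two hypothetical precision matrices, not values from the thesis, and returns `None` where the square root would be taken of a negative number.

```python
import math


def correlation_from_precision(p_xx, p_xy, p_yy):
    """Correlation read off the inverse of a symmetric 2 x 2 precision matrix.

    Returns None when a variance estimate is not positive, which corresponds
    to the white (undefined) areas in the correlation maps.
    """
    det = p_xx * p_yy - p_xy ** 2
    if det == 0.0:
        return None  # the matrix cannot be inverted
    var_x = p_yy / det
    var_y = p_xx / det
    cov_xy = -p_xy / det
    if var_x <= 0.0 or var_y <= 0.0:
        return None
    return cov_xy / math.sqrt(var_x * var_y)


# Positive definite matrix: both variances are 2/3, the covariance is 1/3.
defined = correlation_from_precision(2.0, -1.0, 2.0)
# Indefinite matrix: the determinant is -3, so both variance estimates are -1/3.
undefined = correlation_from_precision(1.0, 2.0, 1.0)
print(defined, undefined)  # 0.5 None
```

When the diagonal entries of the precision matrix are positive, the variance estimates are negative exactly when the determinant is negative, that is, when the log density is not locally concave.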

Figure 6: Variance estimates for the Pear density. a) the variance estimate σ̂_x² of X and b) the variance estimate σ̂_y² of Y.

Figure 7: Variance estimates for the twisted Pear density. a) the variance estimate of X and b) the variance estimate of Y.

Figure 8: Variance estimates for the Cauchy density. a) the variance estimate σ̂_x² of X and b) the variance estimate σ̂_y² of Y.

Figure 9: Variance estimates for the transformed bivariate normal density. a) the variance estimate of X and b) the variance estimate of Y.


7 Results and Discussion

One of the main results from figures 2, 3, 4 and 5 is that both the local Gaussian correlation and the ρ̂(x, y) estimate from equation (26) indicate whether the correlation is positive or negative in the given areas, with figures 3c and 3d seeming to be the most similar. The other main result is that ρ̂(x, y), where it is defined, is within the range [−1, 1]. Compared to the local Gaussian correlation, the estimate ρ̂(x, y) is a function of the known density, so there is a smoother transition from one area to another. For the local Gaussian correlation the areas are less interconnected, so an area of negative correlation can occur inside a wider area that is positively correlated. The actual function for ρ̂(x, y) can be complex because it uses second derivatives, and some of the chosen densities, particularly the twisted Pear, do not simplify when differentiated. With the local Gaussian correlation, it is harder to grasp why certain areas return the correlation estimates they do, as the package lg does the estimation for you. Another thing to note is that the transformed normal density and the Cauchy density have more simulated observations than the twisted Pear and the Pear density. This was done because, as one can see in figures 4b and 5b, the simulated observations extend outside the plotted range of the density. As the correlation estimates look at the same range as the densities, extra observations were simulated to make sure that approximately the same number of observations fall within the restricted ranges.

7.1 Pear

The first density, the Pear, is a transformed normal density from Doksum et al. (1994). It is shown in figure 2c, where $x, y$ are transformed from $(U, V)$. The transformation is $x = U^{1/3}$, $y = V$, and the density is given as

$$f(x, y) = \frac{3x^2}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(x^3-\mu_x)^2}{\sigma_x^2} - \frac{2\rho(x^3-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right]\right). \tag{27}$$

For this density, $\mu_x = \sigma_x = 10$, $\mu_y = 1.55$, $\sigma_y = 0.775$ and $\rho = 0.75$. Generally, the correlation estimates in figure 2d approximate the real $\rho$ well, while the estimates in figure 2c are less accurate. For $\hat{\rho}(x, y)$, there seems to be no correlation in the area between the two peaks of the density seen in figure 2a. In addition, while not observable, there is a straight line at $x = 0$ where the estimate is undefined, because the second derivative with respect to $x$ includes the term $2/x^2$. There is also a clear undefined area in figure 2c, approximately for $x$ in $(1, 3)$ and $y$ in $(2, 4)$. The strongest correlation estimates and the strongest variance estimates occur along the edge of this undefined region. These variance estimates, displayed in figure 10, reach the positive and negative thousands. The negative values of $\hat{\sigma}_x^2$ and $\hat{\sigma}_y^2$ arise because they are given by

$$\hat{\sigma}_x^2 = \frac{1}{0.775^2(1-0.75^2)} \Big/ k,$$

$$\hat{\sigma}_y^2 = \left[\frac{1}{2(1-0.75^2)}\left(\frac{3x^4-12x}{10} - \frac{9x(y-1.55)}{7.75}\right) + \frac{2}{x^2}\right] \Big/ k,$$

where k is given as

$$k = \left[\frac{1}{2(1-0.75^2)}\left(\frac{3x^4-12x}{10} - \frac{9x(y-1.55)}{7.75}\right) + \frac{2}{x^2}\right] \frac{1}{0.775^2(1-0.75^2)} - \left(\frac{4.5x^2}{15.5(1-0.75^2)}\right)^2.$$

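Since the numerator of $\hat{\sigma}_x^2$ is a positive constant, its sign is determined by the denominator $k$, so the negative variance estimates occur where $k \le 0$. Multiplying $k$ by $x^2$ and rescaling, the inequality becomes

$$39.75x^6 - 232.258x^3y + 120x^3 + 350 \le 0, \tag{28}$$

which explains the undefined area in figure 2c.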
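The sign argument above can be checked numerically. The following Python snippet is a minimal sketch, not the code used in the thesis: it evaluates the precision-matrix entries for the Pear density with the parameter values stated above, inverts the 2×2 matrix and reports the variance estimates. The function names are hypothetical.

```python
# Sketch: precision-matrix entries for the Pear density (equation (27))
# with mu_x = sigma_x = 10, mu_y = 1.55, sigma_y = 0.775, rho = 0.75.
# Function names are illustrative assumptions, not taken from the thesis.

RHO = 0.75
SIGMA_Y = 0.775


def pear_precision(x, y):
    """Return (A, B, C) with A = -d2/dx2 log f, B = -d2/dy2 log f,
    C = d2/dxdy log f. Requires x != 0 because of the 2/x^2 term."""
    one_minus_rho2 = 1.0 - RHO ** 2
    a = (1.0 / (2.0 * one_minus_rho2)) * (
        (3.0 * x ** 4 - 12.0 * x) / 10.0 - 9.0 * x * (y - 1.55) / 7.75
    ) + 2.0 / x ** 2
    b = 1.0 / (SIGMA_Y ** 2 * one_minus_rho2)
    c = 4.5 * x ** 2 / (15.5 * one_minus_rho2)
    return a, b, c


def pear_variances(x, y):
    """Invert the precision matrix [[A, -C], [-C, B]].
    Returns (k, var_x, var_y), where k is its determinant."""
    a, b, c = pear_precision(x, y)
    k = a * b - c ** 2
    return k, b / k, a / k


if __name__ == "__main__":
    # One point outside and one point inside the undefined region.
    for point in [(0.5, 1.0), (2.0, 3.0)]:
        k, var_x, var_y = pear_variances(*point)
        print(point, "k =", round(k, 3), "var_x =", round(var_x, 3),
              "var_y =", round(var_y, 3))
```

At a point such as (2, 3) the determinant k is negative, so the estimate of the variance of X is negative and its square root, and hence the correlation estimate, is undefined.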


(a) $\hat{\sigma}^2_x$ (b) $\hat{\sigma}^2_y$

Figure 10: Outlier variance estimates for the Pear density. Red lines are negative values, while black lines are positive values. a) estimates for X and b) estimates for Y.


Figure 11: Negative values corresponding to the inequality for equation (28).


7.2 Twisted Pear

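The twisted Pear density is from the introductory paper for the local dependence function (Jones, 1996), and was originally used in Doksum et al. (1994).

The twisted Pear density is given as

$$\begin{aligned}
f(x) &= \frac{1}{\sigma_x\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu_x}{\sigma_x}\right)^2\right), \\
f(y \mid x) &= \frac{1}{\sqrt{2\pi}\,\sigma(x)} \exp\left(-\frac{1}{2}\left(\frac{y-\mu(x)}{\sigma(x)}\right)^2\right), \\
f(x, y) &= f(y \mid x) \times f(x),
\end{aligned} \tag{29}$$

where the functions $\mu(x)$ and $\sigma(x)$ are

$$\mu(x) = \frac{x}{10}\exp\left(5-\frac{x}{2}\right), \qquad \sigma(x) = \frac{1 + 0.5x}{3}, \tag{30}$$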

and $\sigma_x = 1/3$ and $\mu_x = 1.2$. The correlation coefficient $\rho$ is not given for this density. Of the given densities, this is where $\hat{\rho}(x, y)$ and the local Gaussian correlation agree most closely. Both figures 3c and 3d show a very strong positive correlation along the left tail. Towards the rightmost end of the figures the correlation decreases, and in the case of $\hat{\rho}(x, y)$ it becomes negative. So there seems to be a nonlinear dependence between the variables $x$ and $y$.
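Because the twisted Pear density is defined through a marginal and a conditional normal density, it is straightforward to evaluate and to simulate from. The Python sketch below illustrates this under the parameter values stated above. It is not the simulation code used for the figures, and all function names are hypothetical.

```python
import math
import random

MU_X = 1.2
SIGMA_X = 1.0 / 3.0


def mu(x):
    """Conditional mean of Y given X = x, equation (30)."""
    return x / 10.0 * math.exp(5.0 - x / 2.0)


def sigma(x):
    """Conditional standard deviation of Y given X = x, equation (30)."""
    return (1.0 + 0.5 * x) / 3.0


def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) evaluated at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def twisted_pear_density(x, y):
    """f(x, y) = f(y | x) * f(x), equation (29). Assumes sigma(x) > 0."""
    return normal_pdf(y, mu(x), sigma(x)) * normal_pdf(x, MU_X, SIGMA_X)


def simulate_twisted_pear(n, seed=0):
    """Draw n pairs: first X from its marginal, then Y given X."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.gauss(MU_X, SIGMA_X)
        y = rng.gauss(mu(x), sigma(x))
        sample.append((x, y))
    return sample


if __name__ == "__main__":
    draws = simulate_twisted_pear(5)
    for x, y in draws:
        print(round(x, 3), round(y, 3), round(twisted_pear_density(x, y), 4))
```

Since $\mu(x)$ is not linear in $x$ and $\sigma(x)$ grows with $x$, the simulated cloud bends and widens, which is consistent with the nonlinear dependence described above.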


7.3 Cauchy

The Cauchy density used is the bivariate form,

$$f(x, y) = \frac{1}{\pi(1 + x^2 + y^2)^{3/2}}. \tag{31}$$

This is the same density as the one in the introductory paper for the local dependence function (Jones, 1996). The most prominent feature of Cauchy densities is that they do not have a defined variance. Both $\sigma_x^2$ and $\sigma_y^2$ are therefore undefined, and the observations can spread over the entire range $(-\infty, \infty)$. Of the given densities, the Cauchy density is the only one for which $\partial^2/\partial x^2$ and $\partial^2/\partial y^2$ mirror each other. This can be seen by taking the second derivatives

$$\begin{aligned}
\frac{\partial^2}{\partial x^2} \log f(x, y) &= \frac{3(x^2 - y^2 - 1)}{(x^2 + y^2 + 1)^2}, \\
\frac{\partial^2}{\partial y^2} \log f(x, y) &= \frac{3(y^2 - x^2 - 1)}{(x^2 + y^2 + 1)^2}, \\
\frac{\partial^2}{\partial x \partial y} \log f(x, y) &= \frac{6xy}{(x^2 + y^2 + 1)^2}.
\end{aligned} \tag{32}$$

The only difference is whether the numerator contains $3(x^2 - y^2 - 1)$ or $3(y^2 - x^2 - 1)$. This results in figures 8a and 8b mirroring each other. We can show this by finding the covariance matrix using the derivatives. The covariance matrix then takes the form

$$\frac{x^2 + y^2 + 1}{3(x^2 + y^2 - 1)} \times \begin{bmatrix} y^2 - x^2 - 1 & -2xy \\ -2xy & x^2 - y^2 - 1 \end{bmatrix}. \tag{33}$$

As we can see, the entries (1,1) and (2,2) of equation (33) also mirror each other, the difference being whether the minus sign is in front of $x^2$ or in front of $y^2$. We can take this further and see why the defined area for $\hat{\rho}(x, y)$ is circular for the Cauchy density. $\hat{\rho}(x, y)$ is then given as

$$\hat{\rho}(x, y) = \frac{\dfrac{-2xy(x^2 + y^2 + 1)}{3(x^2 + y^2 - 1)}}{\sqrt{\dfrac{(y^2 - x^2 - 1)(x^2 + y^2 + 1)}{3(x^2 + y^2 - 1)}} \times \sqrt{\dfrac{(x^2 - y^2 - 1)(x^2 + y^2 + 1)}{3(x^2 + y^2 - 1)}}}. \tag{34}$$


Figure 12: The correlation estimates from the local dependence function for the Cauchy distribution.

From equation (34), the restrictions end up being

$$x^2 + y^2 \neq 1, \qquad x^4 \le (y^2 - 1)^2, \qquad x^4 + 1 \ge 2x^2 + y^4. \tag{35}$$

This results in the circular form. The other thing to note about $\hat{\rho}(x, y)$ is that when both variables have the same sign, the estimates in the defined area of figure 4c are positive, while when the signs differ the estimated correlation is negative. This seems to line up with figure 4d, although it has some areas that break this trend. Interestingly, figure 4d more closely resembles the local dependence function shown in figure 12, as the estimated correlations for the areas are more similar. It should be noted that the $\rho$ estimates in figure 12 are calculated from the $\gamma$ function, using the method described in equation (17) of the experiments chapter.
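The circular defined area and the sign pattern can be reproduced with a few lines of code. The Python sketch below assumes the second derivatives from equation (32), builds the precision matrix as their negatives, inverts it and returns the correlation, or None where a variance estimate is not positive. The function names are hypothetical and the snippet is only an illustration of the calculation.

```python
import math


def cauchy_log_hessian(x, y):
    """Second derivatives of log f for the bivariate Cauchy density, equation (32).
    The normalising constant of the density does not affect them."""
    d = x * x + y * y + 1.0
    dxx = 3.0 * (x * x - y * y - 1.0) / d ** 2
    dyy = 3.0 * (y * y - x * x - 1.0) / d ** 2
    dxy = 6.0 * x * y / d ** 2
    return dxx, dyy, dxy


def rho_from_precision(p11, p22, p12):
    """Correlation from the inverse of [[p11, p12], [p12, p22]].
    Returns None when the inverse or a square root is undefined."""
    det = p11 * p22 - p12 ** 2
    if det == 0.0:
        return None
    var_x, var_y, cov = p22 / det, p11 / det, -p12 / det
    if var_x <= 0.0 or var_y <= 0.0:
        return None
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


def rho_hat_cauchy(x, y):
    """Correlation estimate from the precision matrix (minus the log-Hessian)."""
    dxx, dyy, dxy = cauchy_log_hessian(x, y)
    return rho_from_precision(-dxx, -dyy, -dxy)


if __name__ == "__main__":
    for point in [(0.3, 0.2), (0.3, -0.2), (2.0, 0.5)]:
        print(point, rho_hat_cauchy(*point))
```

Inside the unit circle the estimate is defined and has the sign of $xy$. Outside the circle one of the two variance estimates is negative, so the function returns None.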


7.4 Transformed Normal Density

For the final density, both the local Gaussian correlation and $\hat{\rho}(x, y)$ indicate a negative correlation on the positive side of the y-axis and a positive correlation along the negative side of $y$. This is more clearly shown in figure 5c than in figure 5d, as the transition is clearer for the estimate $\hat{\rho}(x, y)$. The local Gaussian correlation, on the other hand, has a few areas that contradict the overall trend. In addition, only the top and bottom areas show weak correlations, while the middle area has correlation estimates close to 0. The areas contradicting the overall trend could be due to undersampling or, alternatively, to the weak negative correlation not being captured. The overall trend seems to be that the further away from the y-axis, the more strongly correlated the variables are, in a way pushing the observations towards the axis. This may seem a bit counterintuitive when looking at the density displayed in figure 5a, but the highest probability areas have a maximum probability of 0.03.


8 Box-Cox Transformation

The $\hat{\rho}(x, y)$ from equation (26) ends up with a small defined area. A possible way to mitigate this is to transform the density using the Box-Cox transformation, a power transformation that makes data look more normally distributed. It was introduced by Box and Cox in their 1964 paper (Box and Cox, 1964). The transformation is defined as

$$f(x, y)^{(\lambda)} = \begin{cases} \dfrac{f(x, y)^{\lambda} - 1}{\lambda} & (\lambda \neq 0), \\[2ex] \log(f(x, y)) & (\lambda = 0), \end{cases} \tag{36}$$

and for the two parameter transformation (Box and Cox, 1964)

$$f(x, y)^{(\lambda)} = \begin{cases} \dfrac{(f(x, y) + \lambda_2)^{\lambda_1} - 1}{\lambda_1} & (\lambda_1 \neq 0), \\[2ex] \log(f(x, y) + \lambda_2) & (\lambda_1 = 0). \end{cases} \tag{37}$$

Both transformations come with restrictions. For equation (36) the restriction is that $f(x, y) > 0$, since otherwise the logarithm at $\lambda = 0$ is not real valued. For equation (37) the restriction is similar, namely $f(x, y) > -\lambda_2$. Of these two transformations, the one-parameter version is used. Using the Box-Cox transformation, the precision matrix takes the form

$$\begin{bmatrix} -\dfrac{\partial^2}{\partial x^2} f(x, y)^{(\lambda)} & -\dfrac{\partial^2}{\partial x \partial y} f(x, y)^{(\lambda)} \\[2ex] -\dfrac{\partial^2}{\partial x \partial y} f(x, y)^{(\lambda)} & -\dfrac{\partial^2}{\partial y^2} f(x, y)^{(\lambda)} \end{bmatrix}. \tag{38}$$

In addition, as $\lambda \to 0$, $f(x, y)^{(\lambda)}$ tends to $\log(f(x, y))$. This is shown in figure 15a, which is almost identical to figure 4c. Another thing to note is that the $\hat{\rho}(x, y)$ estimates for the Box-Cox transformed density are no longer bound to the range $[-1, 1]$. For the density, as $\lambda$ increases the probability decreases and becomes more homogeneous, as seen in figure 14.

The estimate $\hat{\rho}(x, y)$ behaves similarly, as the estimated values decrease as $\lambda$ increases, as figure 13 shows. The exception to this is $\lambda = 0.001$.

As one can see in figure 15, the strongest correlation is estimated outside of the circle, in the diagonal areas. It seems that as $\lambda$ decreases, the diagonal areas thin down and increase in value, while as $\lambda$ increases, the circle shrinks and the diagonal areas increase. Additionally, although hard to see, the edges of the diagonal areas are where the strongest estimated correlations occur. This is further exemplified in figure 15b.

(a) λ = 0.1 (b) λ = 10

Figure 13: $\hat{\rho}(x, y)$ values for the Box-Cox transformed Cauchy density for 2 λ values. a) λ = 0.1 and b) λ = 10.
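The limit $\lambda \to 0$ can also be checked numerically. The sketch below is an illustration rather than the thesis implementation. It uses the chain rule identity, derived here and not stated in the text, that for the one-parameter transformation $\partial^2 f^{(\lambda)} / \partial x \partial y = f^{\lambda}\left(\partial^2 \log f / \partial x \partial y + \lambda\, \partial_x \log f\, \partial_y \log f\right)$, and likewise for the other second derivatives. The positive factor $f^{\lambda}$ cancels in the correlation, so it is left out. Function names are hypothetical.

```python
import math


def cauchy_log_derivatives(x, y):
    """Gradient and second derivatives of log f for the bivariate Cauchy density."""
    d = x * x + y * y + 1.0
    gx, gy = -3.0 * x / d, -3.0 * y / d
    hxx = 3.0 * (x * x - y * y - 1.0) / d ** 2
    hyy = 3.0 * (y * y - x * x - 1.0) / d ** 2
    hxy = 6.0 * x * y / d ** 2
    return gx, gy, hxx, hyy, hxy


def rho_hat_box_cox(x, y, lam):
    """Correlation estimate from the precision matrix of equation (38).
    The common positive factor f**lam is omitted because it cancels.
    Returns None where a variance estimate is not positive."""
    gx, gy, hxx, hyy, hxy = cauchy_log_derivatives(x, y)
    p11 = -(hxx + lam * gx * gx)
    p22 = -(hyy + lam * gy * gy)
    p12 = -(hxy + lam * gx * gy)
    det = p11 * p22 - p12 ** 2
    if det == 0.0:
        return None
    var_x, var_y, cov = p22 / det, p11 / det, -p12 / det
    if var_x <= 0.0 or var_y <= 0.0:
        return None
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


if __name__ == "__main__":
    for lam in [0.0, 0.001, 0.1, 1.0]:
        print(lam, rho_hat_box_cox(0.3, 0.2, lam))
```

For $\lambda = 0$ the function reduces to the estimate based on $\log f$, and for a small value such as $\lambda = 0.001$ the result is nearly the same, in line with the similarity between figures 15a and 4c.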

(a) λ = 0.001 (b) λ = 0.1 (c) λ = 1 (d) λ = 5 (e) λ = 10

Figure 14: Box-Cox transformed Cauchy density for 5 different λ values. a) λ = 0.001, b) λ = 0.1, c) λ = 1, d) λ = 5 and e) λ = 10.

(a) λ = 0.001 (b) λ = 0.1 (c) λ = 1 (d) λ = 5 (e) λ = 10

Figure 15: Correlation estimates using the precision matrix for the Box-Cox transformed Cauchy density for 5 different λ values. a) λ = 0.001, b) λ = 0.1, c) λ = 1, d) λ = 5 and e) λ = 10.

9 Kernel Smoother

Kernels are a statistical tool with multiple uses. The application this paper focuses on is kernel smoothers. As the name implies, kernel smoothers are used to transform or smooth the data within specified areas using kernels. There is a multitude of kernel functions that determine how the data is transformed. The chosen examples are the uniform kernel, the triangle kernel, the Gaussian kernel and the Epanechnikov kernel, which are shown in figure 16. The uniform kernel function is (Ivanka, 2012)

$$K(x) = \frac{1}{2} I_{[-1,1]}(x), \tag{39}$$

where $I$ is an indicator function of the form

$$I_{[-1,1]}(x) = \begin{cases} 1 & \text{if } x \in [-1, 1], \\ 0 & \text{otherwise.} \end{cases} \tag{40}$$

The triangle kernel function has the form

$$K(x) = (1 - |x|) I_{[-1,1]}(x). \tag{41}$$

The Gaussian kernel function is given as

$$K(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right), \tag{42}$$

and finally the Epanechnikov kernel takes the form

$$K(x) = \frac{3}{4}(1 - x^2) I_{[-1,1]}(x). \tag{43}$$

As previously mentioned these examples are just a few of many more kernel functions. They all serve the purpose of transforming data within given areas.
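The four kernel functions in equations (39) to (43) are simple to write down in code. The Python sketch below defines them and checks numerically, with a midpoint rule, that each integrates to one. The function names and the integration helper are illustrative assumptions.

```python
import math


def indicator(x):
    """Indicator function of the interval [-1, 1], equation (40)."""
    return 1.0 if -1.0 <= x <= 1.0 else 0.0


def uniform_kernel(x):
    return 0.5 * indicator(x)


def triangle_kernel(x):
    return (1.0 - abs(x)) * indicator(x)


def gaussian_kernel(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def epanechnikov_kernel(x):
    return 0.75 * (1.0 - x * x) * indicator(x)


def integrate(func, lower=-8.0, upper=8.0, steps=16000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


if __name__ == "__main__":
    kernels = [uniform_kernel, triangle_kernel, gaussian_kernel, epanechnikov_kernel]
    for kernel in kernels:
        print(kernel.__name__, round(integrate(kernel), 4))
```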

For the kernel smoothers, the bandwidth $h$ is included. $h$ is a parameter that controls how smooth the transformed data is. For example, the Gaussian kernel smoother with the parameter $h$ is

$$K\left(\frac{x - X_i}{h}\right) = \exp\left(-\frac{(x - X_i)^2}{2h^2}\right), \tag{44}$$


where $X_i$ is the $i$th observation in the data. The evaluation point $x$ and the bandwidth $h$ are values that we can set to smooth out our observations. The optimal kernel smoother is chosen using the Mean Integrated Square Error (MISE) and the Asymptotic Mean Integrated Square Error (AMISE).

AMISE and MISE are extensions of the Mean Square Error (MSE) and are accuracy measurements that integrate over the area of the data instead of summing over it. The MSE is given as (Ivanka, 2012)

$$MSE[\hat{f}(x, h)] = \sum_{i=1}^{n} (\hat{f}(x, h) - f(x))^2, \tag{45}$$

where $f$ is the given density for the observed data and $\hat{f}$ is the sum of the kernel estimates for the data. The kernel density estimate is credited to Rosenblatt and Parzen, with Rosenblatt's paper appearing in 1956 (Murray, 1956), and takes the form

$$\hat{f}(x, h) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right), \tag{46}$$

where $X_i$ is the $i$th observation of the dataset $(X_1, \ldots, X_n)$, and $K\left(\frac{x - X_i}{h}\right)$ are the kernels for the different given areas. Finally, MISE is given as

$$MISE[\hat{f}(x, h)] = \int MSE[\hat{f}(x, h)]\,dx, \tag{47}$$

and AMISE is given as

$$AMISE = \frac{1}{nh} \int K(x)^2\,dx + \frac{1}{4} \left(\int x^2 K(x)\,dx\right)^2 h^4 \int (f''(x))^2\,dx. \tag{48}$$

In AMISE, the first part is the Asymptotic Integrated Variance (AIV) and the second part is the Asymptotic Integrated Square Bias (AISB), so that $AMISE = AIV + AISB$ (Ivanka, 2012). MISE can likewise be written as

$$MISE(\hat{f}(x, h)) = AMISE(\hat{f}(x, h)) + o\left\{\frac{1}{nh} + h^4\right\}. \tag{49}$$
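To make the kernel density estimate of equation (46) and the role of the bandwidth concrete, the Python sketch below computes a Gaussian kernel estimate and the integrated squared error of a single simulated sample. Two assumptions are made purely for illustration: the true density is taken to be standard normal, and the integrated squared error is used as the per-sample quantity that MISE averages over samples in the standard definition. All function names are hypothetical.

```python
import math
import random


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, data, h):
    """Kernel density estimate at x with bandwidth h, equation (46)."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)


def integrate(func, lower=-6.0, upper=6.0, steps=2400):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


def integrated_squared_error(data, h):
    """Integral of (f_hat - f)^2, with f assumed to be the standard normal density."""
    return integrate(lambda x: (kde(x, data, h) - gaussian_kernel(x)) ** 2)


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
    for h in [0.02, 0.4, 2.0]:
        print(h, round(integrated_squared_error(sample, h), 4))
```

A very small bandwidth gives a spiky estimate with a large variance contribution, while a very large bandwidth oversmooths and increases the bias contribution, which is the trade-off between AIV and AISB.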


To minimize the MISE and AMISE, one tries to find the optimal bandwidth. The optimal bandwidth can be found by setting the derivative of AMISE with respect to $h$ equal to zero. This equation is set up as

$$\frac{\partial}{\partial h} AMISE(\hat{f}(x, h)) = -\frac{1}{nh^2} \int K(x)^2\,dx + \left(\int x^2 K(x)\,dx\right)^2 h^3 \int (f''(x))^2\,dx = 0. \tag{50}$$

Furthermore, the optimal bandwidth $h$ takes the form

$$h_{AMISE} = \frac{\left(\int K(x)^2\,dx\right)^{1/5}}{\left(\int x^2 K(x)\,dx\right)^{2/5} \left(\int f''(x)^2\,dx\right)^{1/5}}\, n^{-1/5}. \tag{51}$$
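As a worked illustration of equation (51), the Python sketch below evaluates the three integrals numerically for a Gaussian kernel and, as an assumption made only for this example, a normal density $f$ with standard deviation sd. In that case the formula reduces to the well-known constant $(4/3)^{1/5} \approx 1.06$ times $\sigma n^{-1/5}$. The helper names are hypothetical.

```python
import math


def integrate(func, lower, upper, steps=20000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width


def gaussian_kernel(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_second_derivative(x, sd=1.0):
    """Second derivative of the N(0, sd^2) density (the assumed true density)."""
    z = x / sd
    return (z * z - 1.0) * math.exp(-0.5 * z * z) / (sd ** 3 * math.sqrt(2.0 * math.pi))


def amise_bandwidth(n, sd=1.0):
    """Optimal bandwidth from equation (51) for a Gaussian kernel and normal f."""
    roughness_k = integrate(lambda x: gaussian_kernel(x) ** 2, -10.0, 10.0)
    second_moment_k = integrate(lambda x: x * x * gaussian_kernel(x), -10.0, 10.0)
    roughness_f2 = integrate(
        lambda x: normal_second_derivative(x, sd) ** 2, -10.0 * sd, 10.0 * sd
    )
    return roughness_k ** 0.2 / (second_moment_k ** 0.4 * roughness_f2 ** 0.2) * n ** (-0.2)


if __name__ == "__main__":
    n = 100
    print("h_AMISE:", round(amise_bandwidth(n), 4))
    print("constant:", round(amise_bandwidth(n) * n ** 0.2, 4))
```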

Luckily, there are simpler ways to approximate the optimal $h$ value, specifically for the Gaussian kernel. For the Gaussian kernel there are rule of thumb estimates such as Scott's rule of thumb (Scott, 2015) and Silverman's rule of thumb (Silverman, 1986). The two rules are similarly defined, the main difference being the constant they use. Silverman's rule of thumb bandwidth is

$$h = 0.9 \min\{\hat{\sigma}, IQR/1.34\}\, n^{-1/5}, \tag{52}$$

and Scott's rule of thumb is

$$h = \hat{\sigma}\, n^{-1/(d+4)}. \tag{53}$$

For both of these estimates, $\hat{\sigma}$ is the empirical standard deviation, $n$ is the number of observations in the dataset, IQR is the interquartile range and $d$ is the number of dimensions of the dataset. These two rules of thumb can also be scaled up to work for multivariate kernels. For their implementation in R, both functions are built in: Silverman is bw.nrd0 and Scott is bw.nrd.
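For readers who want to see the two rules written out, the Python sketch below implements equations (52) and (53) directly with the standard library. It is not a port of the R functions: the function names are hypothetical, and the interquartile range is computed with the "inclusive" quantile method of the statistics module, which is an assumption about the quantile definition.

```python
import statistics


def silverman_bandwidth(data):
    """Silverman's rule of thumb, equation (52)."""
    n = len(data)
    sd = statistics.stdev(data)  # empirical standard deviation
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return 0.9 * min(sd, iqr / 1.34) * n ** (-1.0 / 5.0)


def scott_bandwidth(data, d=1):
    """Scott's rule of thumb as written in equation (53), for d dimensions."""
    n = len(data)
    return statistics.stdev(data) * n ** (-1.0 / (d + 4))


if __name__ == "__main__":
    values = list(range(1, 11))
    print("Silverman:", round(silverman_bandwidth(values), 4))
    print("Scott, d = 1:", round(scott_bandwidth(values, d=1), 4))
    print("Scott, d = 2:", round(scott_bandwidth(values, d=2), 4))
```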

So far the kernel functions have been implemented for univariate data. However, they can also be extended to multivariate data. In this paper the only multivariate kernel function used is the bivariate Gaussian kernel.

(a) Uniform kernel function. (b) Triangle kernel function. (c) Gaussian kernel function. (d) Epanechnikov kernel function.

Figure 16: Four examples of kernel functions. a) Uniform kernel function, b) Triangle kernel function, c) Gaussian kernel function, d) Epanechnikov kernel function.

The bivariate Gaussian kernel is given as

$$\begin{aligned}
K\left(\frac{x}{h}, \frac{y}{h}\right) &= \frac{1}{\sqrt{2\pi}\,h} \exp\left(-\frac{1}{2}\frac{x^2}{h^2}\right) \times \frac{1}{\sqrt{2\pi}\,h} \exp\left(-\frac{1}{2}\frac{y^2}{h^2}\right), \\
\hat{f}(x, y, h) &= \frac{1}{n} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}, \frac{y - Y_i}{h}\right),
\end{aligned} \tag{54}$$

where x and y are chosen values, and X_i and Y_i are the observations from the dataset with i={1, ..., n}.


10 Kernel Estimates

Similar to what was done with the Box-Cox transformation, one can use the bivariate Gaussian kernel estimated density $\hat{f}(x, y, h)$ in the precision matrix instead of $f(x, y)$. The precision matrix then takes the form

$$\begin{bmatrix} -\dfrac{\partial^2}{\partial x^2} \hat{f}(x, y, h) & -\dfrac{\partial^2}{\partial x \partial y} \hat{f}(x, y, h) \\[2ex] -\dfrac{\partial^2}{\partial x \partial y} \hat{f}(x, y, h) & -\dfrac{\partial^2}{\partial y^2} \hat{f}(x, y, h) \end{bmatrix}. \tag{55}$$

The same steps are then taken as before, and the resulting correlation estimate is denoted

$$\hat{\rho}_K(x, y). \tag{56}$$

The practical reason for doing this is that for real datasets the densities that produced them may be unknown, so the densities need to be estimated.
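A minimal Python sketch of this kernel-based estimate is given below. It assumes a product Gaussian kernel with a common bandwidth h and, following the log second derivatives used for the known densities, that the derivatives are taken of the logarithm of the kernel estimate. Both points are assumptions of this illustration, the function names are hypothetical, and the code is not the implementation behind the figures.

```python
import math


def kde_log_hessian(x, y, data, h):
    """Second derivatives of log f_hat at (x, y) for a product Gaussian kernel.
    Normalising constants cancel in log-derivatives, so they are omitted.
    Returns None if all kernel weights underflow to zero."""
    s = sx = sy = sxx = syy = sxy = 0.0
    for xi, yi in data:
        u, v = (x - xi) / h, (y - yi) / h
        w = math.exp(-0.5 * (u * u + v * v))
        s += w
        sx += -w * u / h                   # d w / dx
        sy += -w * v / h                   # d w / dy
        sxx += w * (u * u - 1.0) / h ** 2  # d2 w / dx2
        syy += w * (v * v - 1.0) / h ** 2  # d2 w / dy2
        sxy += w * u * v / h ** 2          # d2 w / dx dy
    if s == 0.0:
        return None
    dxx = sxx / s - (sx / s) ** 2
    dyy = syy / s - (sy / s) ** 2
    dxy = sxy / s - (sx / s) * (sy / s)
    return dxx, dyy, dxy


def rho_hat_kernel(x, y, data, h):
    """Correlation from the inverse of the precision matrix (minus the log-Hessian).
    Returns None where the estimate is undefined."""
    hessian = kde_log_hessian(x, y, data, h)
    if hessian is None:
        return None
    p11, p22, p12 = -hessian[0], -hessian[1], -hessian[2]
    det = p11 * p22 - p12 ** 2
    if det == 0.0:
        return None
    var_x, var_y, cov = p22 / det, p11 / det, -p12 / det
    if var_x <= 0.0 or var_y <= 0.0:
        return None
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


if __name__ == "__main__":
    points = [(-1.0, -1.0), (1.0, 1.0)]
    for h in [1.0, 2.0, 4.0]:
        print(h, rho_hat_kernel(0.0, 0.0, points, h))
```

With only two observations on the diagonal, the estimate at the origin is undefined for h = 1 and becomes defined for larger bandwidths, which mirrors in a small way how undefined areas depend on the choice of h.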

10.1 Contours

To give an overview of the different fits, subfigures 17a and 17d of figure 17 approximate the form of the real densities closely. Specifically, subfigure 17a is almost identical to subfigure 2a. The probabilities of the Gaussian kernel estimate of the Cauchy density are too low compared to those of the real Cauchy density. This also occurs for the fit of the twisted Pear, where the probability in the center of the density is too low. To exemplify how a change in bandwidth changes the bivariate Gaussian kernel estimated density $\hat{f}$, the transformed normal density is used.

Figure 18 shows that as the bandwidth decreases towards 0, the bands become less smooth, while for higher bandwidths the smoothness increases.

In addition, the probabilities increase for lower bandwidths and decrease for higher bandwidths.

(a) h = 0.119 (b) h = 0.08 (c) h = 0.093 (d) h = 0.28

Figure 17: Bivariate Gaussian kernel estimates using the rule of thumb bandwidths. a) Pear density for h = 0.119, b) twisted Pear for h = 0.08, c) Cauchy for h = 0.093 and d) transformed normal density for h = 0.28.

(a) h = 0.28 (b) h = 0.5 (c) h = 1

Figure 18: Bivariate Gaussian kernel estimate of the transformed normal density for 3 different h values. a) h = 0.28, b) h = 0.5 and c) h = 1.


10.2 Double Derivatives

While the actual $\hat{f}$ approximations might look similar, the estimated correlation maps tell a different story. Even at a cursory glance, the subfigures of figure 19 are vastly different from the $\hat{\rho}(x, y)$ estimates in figures 2c, 3c, 4c and 5c. Of these, only figure 19a bears a resemblance to the actual $\hat{\rho}(x, y)$ from figure 2c. As with the Box-Cox transformation, the $\hat{\rho}_K(x, y)$ values from equation (56) are not limited to the range $[-1, 1]$. But, just as with the contours, as the bandwidth increases the range of values that $\hat{\rho}_K$ takes shrinks back to within $[-1, 1]$, as seen in figure 20. Based on figure 20, $h$ values of 1 and above result in $\hat{\rho}_K(x, y)$ lying within the $[-1, 1]$ range. Generally, from figure 19, the stronger correlation estimates seem to occur within localised areas. For figure 19a this occurs in the small fang-like area at $(1, 4)$. For figures 19c and 19d there are specks of high correlation.

To observe the impact of the smoothing parameter $h$, the bivariate Gaussian kernel estimated Cauchy density was chosen. The results are shown in figures 20 and 21. As $h$ increases, the range of $\hat{\rho}_K(x, y)$ decreases.

As seen in figure 21e, the $\hat{\rho}_K(x, y)$ values are so minuscule that the estimates are essentially equal to 0. On the other hand, for $h = 0.001$ in figure 21a there are only specks of correlation within a larger area of undefined values. As $h$ increases the undefined areas shrink, and the difference between figures 21a and 21d is palpable. Both figures 21c and 21d seem to show trends similar to figure 4c: when $x$ and $y$ are positive, the estimated correlation is also positive, while when $x$ and $y$ have differing signs, the estimated correlation is negative. What figure 21 shows is that $\hat{\rho}_K(x, y)$ has to be implemented with care for the bivariate Gaussian kernel estimates.

Low $h$ values return a picture that can be too chaotic to read, while high values of $h$ give a picture too homogeneous to discern. In addition, $h$ values that are suboptimal for the fit may give better correlation estimates.

It is harder to understand why there are undefined areas for the estimated $\hat{\rho}_K(x, y)$ than for $\hat{\rho}(x, y)$ from equation (26), as the function for $\hat{\rho}_K(x, y)$ is more opaque. The log double derivatives of the bivariate Gaussian
