Graph Gaussian Process Classifier with Anchor Graph and Label Propagation

(1)

Norwegian University of Science and Technology Department of Mathematical Sciences

TMA4900 Industrial Mathematics, Master’s Thesis

Graph Gaussian Process Classifier with Anchor Graph and Label Propagation

Author:

Oscar Ovanger, 478067

(2)

(3)

Abstract

Gaussian processes is an important method of machine learning as it lets us put a prior on the shape of a function and it inherits nice properties from the normal distribution. It has been used both as a regression and classification model, and performs well compared to other methods across a range of applications. The classifier is usually presented as a supervised model, meaning it requires labeled data to do inference. In this work we extend the existing Gaussian process classifier to handle labeled and unlabeled data simultaneously. We present a model called transductive graph Gaussian process classifier and provide background material for each step of the model. We compare this extended model to the standard model and others on datasets from sklearn, the MNIST-dataset and a dataset from the Aaknes mountain site. We find that the proposed model give an accuracy of 0.997 and 0.595 on the sklearn-datasets respectively, outperforming the standard model and other compared models. On the MNIST-dataset, the classifier achieved an accuracy of 0.67 with optimal model paramters with large variance. On the Aaknes-dataset the model was outperformed by the random forest classifier. We find that the method is sensitive to choices of model parameters and requires regularly distributed data to perform well. Under well-specified model parameters and regular distributed data, the proposed model performs well with few labeled data.

(4)

(5)

Abstract

Gaussiske prosesser er en viktig metode for maskinlæring da den lar oss sette en prioritet p˚a for- men til en funksjon, og den arver fine egenskaper fra normalfordelingen. Den har blitt brukt b˚ade som en regresjons- og klassifiseringsmodell, og fungerer godt sammenlignet med andre metoder i en rekke applikasjoner. Klassifisereren presenteres vanligvis som en overv˚aket modell, noe som betyr at det krever merkede data for ˚a gjøre slutning. I dette arbeidet utvider vi den eksisterende Gaussiske prosessklassifisereren til ˚a h˚andtere merkede og umerkede data samtidig. Vi presenterer en modell kalt transduktiv graf Gaussisk prosessklassifikator og gir bakgrunnsmateriale for hvert trinn i modellen. Vi sammenligner denne utvidede modellen med standardmodellen og andre p˚a datasett fra sklearn, MNIST-datasettet og et datasett fra Aaknes-fjellet. Vi finner at den foresl˚atte modellen gir en nøyaktighet p˚a henholdsvis 0,997 og 0,595 p˚a sklearn-datasettene, bedre enn standardmodellen og andre sammenlignede modeller. P˚a MNIST-datasettet oppn˚adde klassifisereren en nøyaktighet p˚a 0,67 med optimale modellparametere med stor avvik. P˚a Aaknes-datasettet ble modellen overg˚att av den tilfeldige skogklassifisereren. Vi finner ut at metoden er sensitiv for valg av modellparametere og krever regelmessig distribuert data for ˚a prestere godt. Under veldefinerte modellparametere og regelmessig distribuert data, fungerer den foresl˚atte modellen godt med f˚a merkede data.

(6)

(7)

Introduction

In 1990 there were about 2.6 million users of the internet, most of them from the United States. In 2016, that number had grown to 3.408 billion users worldwide. The information sharing and data availability increased immensly following this revolution. Social media websites such as Facebook and Youtube have a business-model that requires large data acquisition from its users in order to target commercials from ads. Search engines such as Google benefit from learning their user base in order to provide useful information and filter out inefficacious information. Machine learning have become a popular method to sort through the sea of data and learn useful patterns. In fact, companies like Google, Amazon, Facebook, Netflix etc. now have specialized sections that work exclusively on developing machine learning models and methods, competing for talents with Academia. DeepMind Technologies is at the forefront of machine learning and AI and is currently owned by Google.

Classification is one of the categories of machine learning. Classification means categorizing data into specified classes. This could mean classifying e-mail as important or spam, classifying hand- written characters as english letters, classifying images containing human faces etc. The machine learning here involves a predictive model, meaning that it can recognize patterns and predict categories based on previous examples. Therefore, the model has to first be provided with examples where the class is known, in order to predict class labels of unknown examples. If we for example have a database of patient symptoms and diseases we can build a machine learning method that predicts the disease of a future patient. Usually, the more data we have available, the better pre- dictions the machine learning approach will give, as it has been able to better recognize the pattern that cause the outcome. For example, the spam-mail folder on a computer improves as more data is provided. One can experience recieving spam-mail from the same sender after marking the mail as spam a few times. Once the machine learner recognizes familiar senders and writing patterns of spam mail, these mails stop occuring in the inbox folder.

There are many popular classification methods in the machine learning literature, utilizing different strategies to search for patterns. We will here illustrate a few key elements of this kind of classification using a simple, yet popular and effective method called decision trees. This method simply tries to find decision boundaries that separate data of different categories. If we stick to the patient symptoms-disease example we could for example discover that patients with above average blood sugar levels and high BMI have an increased occurence of type 2 diabetes, meaning that the decision tree model could give the following structure:

(11)

BMI

> 30

Blood sugar

> 220 mg/dL

No diabetes

’No’

Diabetes

’Yes’

’No’

Blood sugar

> 180 mg/dL

No diabetes

’No’

Diabetes

’Yes’

In some cases, a simple decision tree like this suffies to give accurate classifications. Further, extensions of decision trees like these, such as random forest classifiers have proved to do well in comparison to other sofisticated classifiers (see S´aez et al. (2016)). However, we are not only interested in classifying with high accuracy, but also to understand the underlying distribution of the data. For this, decision trees are not appropriate, as they do not model the data, but rather describe the boundaries between categories. Classifiers that model the data distribution are called generative classifiers. Instead of finding the boundary between diabetes and non-diabetes patients, a generative classifier would model how configurations of different parameters (e.g. BMI and blood sugar) would affect the disease outcome, such as in Figure 1.1:

Figure 1.1: Contour plot showing probability of diabetes occurence with BMI and blood sugar as coordinates.

Here we get contour lines with probabilities of certain patient outcomes. The reason why these models are called generative is that we can generate new examples from the data distribution we have modelled. If we use the multivariate normal distribution to model the data, we get a Gaussian process (GP) classifier (see Rasmussen et al. (2004)). Multivariate normal distributions have nice properties that make them favourable in many applications, including scalability, closed form updating, interpretable covariance structure etc. For this reason, Gaussian process classification (GPC) has become an important method for classification problems. However, for GPs to work well, it is crucial that a large labeled dataset is provided.

As more and more data have become available in the age of the internet, most of this data is in raw form and not processed for machine learning tasks. For this reason, developments in machine learning have focused more on extracting information from unlabeled/raw data. Semi-supervised learning (SSL) is one such approach, giving the means to learn targets from a mixture of unlabeled and labeled data (see van Engelen and Hoos (2020)). Combining the useful properties of GPs while being able to learn from unlabeled data is desirable as it models the dataset as a statistical

(12)

In this work we provide a novel semi-supervised extension of GPC. This method consists of 3 main steps: 1. Build a graph that links the entire dataset. 2. Spread information between nodes in the graph. 3. Build a GPC that models the data distribution and predicts classes of new data. In Figure 1.2 we present the pipeline for the method.

(a) A semi-supervised dataset.

Black datapoints are unlabeled.

(b) A graph connecting datapoints with edges.

(c) Information spread in the graph.

Figure 1.2: An illustration of the pipeline of the extended GPC method.

Here we can observe how including edges between datapoints in a graph allows us to effectively spread information. Once we have done this, we can use GPC to fit a distribution to the data.

In other words, by drawing connections between datapoints, we can fill in the missing labels of our dataset such that the GPC have a bigger dataset to draw from when modelling the data distribution. There are many ways of building graphs and spreading information over graphs and the structure of this work will follow the successive steps of this model.

We refer to the machine learning method we present in this work as a transductive graph gaussian process classifier(TGGPC). We will compare the accuracy of this approach to other similar models on some simple datasets from sklearn (see Pedregosa et al. (2011)), the famous MNIST-dataset and a dataset from the Aaknes mountain site. This thesis is structured as follows:

In Chapter 2 we introduce machine learning. Here we outline the supervised learning (SL), unsupervised learning (UL) and semi-supervised learning (SSL) methods and provide illustrative examples to motivate concepts. In Chapter 3 we outline the first step of our model. Here, we build an unsupervised graph called an anchor graph, which will define relations between points in a dataset. We start Chapter 3 by introducing graph theory, familiarizing the reader with the concepts of graphs. Next, we will introduce the specific graph construction called anchor graph that we will use in our model. In Chapter 4 we introduce the second step of our model, which is spreading information in the graph we established in Chapter 3. Specifically, we want to spread information about the targets in our dataset. This is done by a method called label propagation(LP), a transductive method. In order to present LP, we start Chapter 4 with making the reader familiar with transductive methods. Next, we introduce 3 different variations which will give separate information from the graph. Lastly, we tie together the different methods, showing advantages and disadvantages of each of them. In Chapter 5 we introduce the final step of our model, which is GPC. This step allows us to build a model which can produce and predict targets based on the information provided in Chapter 4. We start the section with introducing GP. Then, we establish GPC for the binary- and multi-class-case. In Chapter 6 we introduce TGGPC, tying together each of the 3 steps from Chapter 3-5. We will give a binary and a multi-class variant of the model and provide algorithms for each. In Chapter 7 we test the TGGPC-model for binary targets in the datasets: ’moons’ and ’cirlces’ from sklearn and multi-class targets in the MNIST- and the Aaknes-dataset. We provide accuracy tables with comparissons to other models. In Chapter 8 we discuss when and why the TGGPC-model does well or poorly on certain datasets, and provide potential extensions to the model in future work. In Chapter 9 we conclude this work.

(13)

Chapter 2

Machine Learning

Machine Learning (ML) is the task of learning structures in data. It is an algorithmic procedure which takes data as input and infer some specific pattern as output. In ML a dataset is a set of data pointsx_i, i= 1, ..., n containing some features/covariates. For example, a data point could be:

xi=

Height W eight Age Sex Eyecolor N ationality

(173cm 65kg 29y male blue N orway ) (2.1)

I.e. an instance of the dataset, where the features are [Height, W eight, Age, Sex, Eyecolor, N ationality].

Here, the dataset could have been:

X=

Height W eight Age Sex Eyecolor N ationality













165cm 60kg 21y f emale blue Sweden

181cm 75kg 43y male green N etherlands

151cm 51kg 61y f emale brown J apan

173cm 65kg 29y male blue N orway

154cm 45kg 13y male brown U SA

(2.2)

Meaning that the data point in equation (2.1) refers to the 4th instance ofX. We categorize ML into 3 types; UL, SL and SSL. The distinguishing factor of UL and SL is the inclusion of a targety.

The target is an additional part of the data set. SL is supervised because we specify what pattern we want to learn, namely the transformation from X toy, while UL is unsupervised because there is no specific feature we are interested in. Instead, UL searches for any distinguishing pattern in the data. For example, if the target of the dataset X is ’shoe size’, we would have an additional column of data:

y=

Shoesize











 37 45 34 41 40

, (2.3)

from which we want to learn the pattern. The pattern could be Height ↑ Shoesize, F emale ↓ Shoesize,M ale↑Shoesizeetc.

The set up for the different types of ML is:

• UL:X only

(14)

• SSL:X,yandXU

In ML we distinguish between training data and test data. Training data is the part of the data where we train the ML-model to recognize the pattern. Test data is the part of the data where we introduce new data from which we test the validity of the model we have inferred. In this work we will denote the test data byX∗ andy∗ for the test dataset and test target respectively.

2.1 Unsupervised Learning

UL or cluster learning are algorithms which are provided with a dataset X and outputs some clusters{cj}^k_j=1. Often, the number of clusters kmust be specified as an input to the algorithm.

The clusters contain groups of datapoints that are ”similar”. The specific algorithm decides what makes some data points similar and is often based on distance metrics in the feature space. E.g.

a UL-algorithm could group the dataset in equation (2.2) into ”small” and ”large” people, since

”small” people tend to share low height, low weight, low age, and female sex, while ”large” people tend to share large height, large weight, middle/high age and male sex. We could perhaps also discover that some nationalities have the majority of examples in one of the clusters. UL is often used when one wants to organize or learn structures in a dataset, without having a specific pattern in mind, or lacking training data with labels, making it difficult to specify cluster properties.

In this work we will use a UL method called k-means clustering. The k-means clustering algorithm groupsndata points of observations inkclusters. It is an iterative method containing two steps:

• Update cluster adherence

• Update cluster center

These steps are repeated until the cluster centers no longer update. Figure 2.1 shows the k-means algorithm.

Figure 2.1: Illustration of the k-means procedure. We start with some initial cluster centers displayed as red and blue crosses. In each step of the sequential procedure we update the cluster adherence for the datapoints and cluster centers, until convergence (see Hudson (2018)).

In modern ML, k-means clustering is used for a range of applications. In Kansal et al. (2018) they use clustering methods to segment customerbase of retail shops, Wang et al. (2009) uses

(15)

clustering methods to segment the financial market and Vaishali (2014) uses clustering methods to do anomaly detection on credit card frauds. In this work we will use k-means clustering to find a specified number of cluster centers that will be used in the anchor graph method. These cluster centers will become anchor points, which will reduce the computational complexity of deriving relations in the graph. More on this in Chapter 3. In Section 3.2 we provide an algorithm for the k-means clustering method written in pseudocode.

2.2 Supervised Learning

SL are algorithms that are provided with a datasetX and a targety. For instance we are provided with a set of person features and the respective shoe sizes (as earlier), a set of house features and the respective listing prices, or a set of animal pictures and the respective animal labels. The task is to recognize patterns inX that infer the target y, i.e. we want to find the functionf :x→y, such that for any new observation of a data pointxwe can infer the targety. The targetycan be continuous(as in the listing price example) or discrete/categorical(as in the animal label example).

In this work, we limit scope to categorical targets.

With categorical targetsy, instead of finding the function f : x →y we want to find the probability distribution associated with each label/category of the target. We assume that the label distribution ofyis conditioned on the dataX. Therefore, the task becomes findingp(y|X). There are many ways of approaching this probability distribution. One approach is using Bayes rule:

p(y|X) = p(X,y)

p(X) . (2.4)

Many approaches of ML is based on maximum likelihood estimation (MLE) which involves maximizing the likelihood functionp(y|X). Since the likelihood function is proportional to the numer- ator in Equation (2.4), the MLE-approach often involves maximizing the joint probabilityp(X,y).

The separating factor of different classifiers boils down to how the joint probability distribution is modelled. Examples of classifiers that model the joint distribution is: naive bayes (NB) classifier, linear discriminant analysis (LDA) and GP. These classifiers are often calledgenerative classifiers.

Other classifiers directly model the conditional distribution p(y|X), for example decision tress, random forests etc. These are calleddiscriminative classifiers. Figure 2.2 illustrates the difference between generative and discriminative classifiers.

(a) Discriminative classifier. (b) Generative classifier.

Figure 2.2: Difference between generative and discriminative classifiers. Red and blue datapoints signifies different class labels.

Notice that discriminative classifiers models the boundary between red and blue datapoints, while generative classifiers model the distribution of red and blue datapoints.

We can construct a very simple generative binary classifier (yi ∈ {0,1} for all i = 1, .., n) by for example setting a priorip(yi = 0) =p(yi = 1) = 0.5 for all i= 1, ..., nand lettingp(xi|yi= 0) = N(µ , σ ) andp(x|y = 1) = N(µ , σ ) for alli= 1, .., n. This gives a probability distribution for

(16)

Hence, we can predict the target distribution for a test examplex_∗: p(y∗= 1|x∗) = p(y∗= 1)p(x∗|y∗= 1)

[p(y_∗= 0)p(x_∗|y_∗= 0) +p(y_∗= 1)p(x_∗|y_∗= 1)].

GPC is a generative model that models the jointp(y, X) by introducing a latent functionf, which we will expand on in Chapter 5.

SL have been applied in a vast number of fields, e.g. bioinformatics, handwriting recognition, spam detection, speaking recognition etc. In Jyothi and Bhargavi (2009), NB was used to classify aggricultural land soils. In this work we will expand the SL-model GPC to learn from labeled and unlabeled data.

2.3 Semi-Supervised Learning

SSL are algorithms which are provided with a part of the data that is labeled and another part that is unlabeled:

• Part 1: LabeledX andy

• Part 2: UnlabeledXU without any target.

As with SL-methods, we want to learn the label distribution for the target. However, we want to includeX_U in the training. Thus, we are looking forp(y|X, XU). The underlying assumption is that X and X_U are samples from the same distribution, i.e. X, X_U ⊆ D, where D is some distribution that generates the data.

The Expectation Maximization (EM)-algorithm is an example of an SSL-method. The general approach is to maximize the complete-data log-likelihood for some model parameters, which is the joint distribution of the labeled and unlabeled data p(y, X, XU,yU). Since yU is unobserved/unlabeled we instead maximize the expected complete-data log-likelihood. This is done in iterative steps, called the expectation (E)-step and maximization (M)-step, where model parameters are updated:

• E-step: Compute class adherence probabilities for all unlabeled examplesyU based on current model parameters.

• M-step: Maximize the expected complete-data log-likelihood/update model parameters.

After a number of iterations, the algorithm will converge (Wu (1983)), and we find the parameters which maximizesp(y, X, X_U,y_U). In Figure 2.3 is an illustration of the EM-procedure. Here, the parameters of the data-generating distributions, for each class, is refined by using the predicted class probabilities of unlabeled data.

(17)

(a) Dataset before start of EM-algorithm.

Red and blue nodes represent labeled/observed data, transparent nodes represent unlabeled/unobserved data.

(b) Dataset at E-step of the EM- algorithm. Red and blue nodes represent labeled/observed data, transparent nodes represent unlabeled/unobserved data. Crosses represent the cluster means of the respective classes.

(c) Dataset at M-step of the EM- algorithm. Red and blue nodes represent labeled/observed data, lightly colored nodes represent the updated probability of class adherence.

(d) Dataset at second E-step of the EM- algorithm. Red and blue nodes represent labeled/observed data, lightly colored nodes represent the updated probability of class adherence. Crosses represent the updated cluster means of the respective classes.

Figure 2.3: Illustration of the EM-procedure.

SSL-methods are often useful when we have access to a large dataset with few classified examples.

When we introduce the additionalXU, the decision boundaries between categories can sometimes be improved. In the work by He and Jiang (2020), the EM-algorithm was performed on a SSL real world flood mapping dataset. In the work by Blum and Mitchell (1998), they used SSL as a means to classify webpages, e.g. ”educational”, ”forum”, ”streaming” etc. The authors show that by addingX_U, they could improve accuracy of the classifier. In this work we will use SSL to improve upon the existing GPC.

(18)

Chapter 3

Anchor graph

Anchor graphs were introduced by Liu et al. (2010) as an efficient way of producing sparse graphs over large datasets. The goal of this work is to extend GPC to an SSL-application. In order to do so, we build an anchor graph to transduce/infer labels of data points which share labels. For this reason, we start by introducing anchor graphs before we move on to label propagation in Chapter 4 and GPs in Chapter 5. Before introducing the anchor graph method we make the reader familiar with graphs by introducing some concepts in graph theory in Section 3.1. Then, we move on to introduce the specific construction method of anchor graphs in Section 3.2.

3.1 Graph Theory

A graph is a mathematical object consisting of vertices and edges. The edges define the similarity/relation between the vertices in the graph. Graphs can have directed or undirected edges.

In this work, we will only focus on undirected graphs. Figure 3.1 illustrates a simple undirected graph, with 6 vertices: v={a, b, c, d, e, f}and 6 edgese={(a, b),(a, c),(a, d),(a, f),(b, c),(b, e)}.

a

b c

d

e f

Figure 3.1: A simple undirected graph of 6 vertices. The edge structure indicate conditional dependencies

A way to capture the graph information effectively is to introduce an adjacency matrix which contains 1 if there is an edge between 2 vertices and 0 otherwise. For the graph in Figure 3.1, the

(19)

adjacency matrix becomes:

A=

a b c d e f













0 1 1 1 0 1 a

1 0 1 0 1 0 b

1 1 0 0 0 0 c

1 0 0 0 0 0 d

0 1 0 0 0 0 e

1 0 0 0 0 0 f

(3.1)

Some properties of the adjacency matrix are:

• zero on the diagonal

• symmetric

• non-negative

Functions on graphs are vectorsf ∈Rⁿ that assign values to vertices: f : V →R, e.g. f(v) = [f(a) = 0, f(b) =√

2, f(c) = 1, f(d) = √

5, f(e) =√

5, f(f) = 1]^T. These functions can be vector norms in some coordinate system, labels etc. The functions on the vertices a−f are based on euclidean distances with a as the origin of the coordinate system. Matrix products on graph functions works as operators. For example we can apply the identity matrixIf =f or matrix of ones1f = [Pn

i=1f(i), ...,Pn

i=1f(i)] to graph functions. The adjacency matrix is also an operator, which has the following property:

g=Af g(i) =X

i∼j

f(j) f^TAf =X

e_ij

f(i)f(j),

(3.2)

whereeij is the edge between verticesvi and vj andi∼j denotes thatvi andvj are neighbors.

The quadratic form applied to the adjacency matrix is a scalar and sums up all the products on functions applied to the vertices that are connected. With the above graph function on our example in Figure 3.1 we can apply the adjacency matrix A:

Af =







0 1 1 1 0 1

1 0 1 0 1 0

1 1 0 0 0 0

1 0 0 0 0 0

0 1 0 0 0 0

1 0 0 0 0 0













√0 2

√1

√5 5 1







=





 5.65 3.24√ 2

√0 2 0







f^TAf = 0 √

2 1 √

5 √

5 1







0 1 1 1 0 1

1 0 1 0 1 0

1 1 0 0 0 0

1 0 0 0 0 0

0 1 0 0 0 0

1 0 0 0 0 0













√0 2

√1

√5 5 1







= 9.16.

(3.3)

Weighted adjacency matrices are also valid graph representations, wherePn

j=1Aij = 1 andAij ≥ 0∀i, j= 1, ..., n. With vertices in Figure 3.1 as points in 2-dimensional euclidean space with vertex aas origin, we would have the following representation of the vertices:

v=





0 0

1 −1

1 0

−1 −2





. (3.4)

(20)

We can now apply an L2-norm to every edge in the adjacency matrix and row-normalize to get the weighted adjacency matrix:

W =





 0

√2 (√

5+√ 2+2)

1 (√

5+√ 2+2)

√5 (√

5+√

2+2) 0 ¹

(√ 5+√

√ 2+2) 2 (√

(2)+2) 0 ¹

(√

2+2) 0 ¹

(√

(2)+2) 0

1 2

1

2 0 0 0 0

1 0 0 0 0 0

0 1 0 0 0 0

1 0 0 0 0 0







(3.5)

Another important representation of the graph is thegraph Laplacian, which is defined asL=D− A, whereDis the degree matrix. The degree matrix is a diagonal matrix defined asDii=P

jAij. The graph Laplacian matrix of Figure 3.1 is:

L=D−A=







4 0 0 0 0 0

0 3 0 0 0 0

0 0 2 0 0 0

0 0 0 1 0 0

0 0 0 0 1 0

0 0 0 0 0 1







−







0 1 1 1 0 1

1 0 1 0 1 0

1 1 0 0 0 0

1 0 0 0 0 0

0 1 0 0 0 0

1 0 0 0 0 0







=







4 −1 −1 −1 0 −1

−1 3 −1 0 −1 0

−1 −1 2 0 0 0

−1 0 0 1 0 0

0 −1 0 0 1 0

−1 0 0 0 0 1





 (3.6) . The properties of the graph Laplacian are:

• symmetric

• semi-positive definite(SPD)

• singular

• row and column sum are zero

. The SPD-property can be derived by looking at the graph Laplacian operator:

Lf =





 Pn

j=1A_1j −A₁₂ · · · −A_1n

−A12 Pn

j=1A_2j · · · −A2n

... . .. · · · ...

−A_1n −A_2n · · · Pn j=1A_nj











 f(1) f(2) ... f(n)







=





 Pn

j=1A_1jf(1)−Pn

j=1A_1jf(j) Pn

j=1A_2jf(2)−Pn

j=1A_2jf(j) ...

Pn

j=1A_njf(n)−Pn

j=1A_2jf(j)







=





 Pn

j=1A_1j(f(1)−f(j)) Pn

j=1A_2j(f(2)−f(j)) ...

Pn

j=1A_nj(f(n)−f(j))





 (3.7)

(21)

f^TLf = f(1) f(2) · · · f(n)





 Pn

j=1A1j(f(1)−f(j)) Pn

j=1A2j(f(2)−f(j)) ...

Pn

j=1Anj(f(n)−f(j))







=

n

X

i=1 n

X

j=1

Aijf(i)(f(i)−f(j))

=f(1)





n

X

j=1

A_1jf(1)−A₁₂f(2)−...−A_1nf(n)



+f(2)



−A₂₁f(1) +

n

X

j=1

A_2jf(2)−...−A_2nf(n)



+

...+f(n)



−An1−...+

n

X

j=1

Anjf(n)





=A12 f(1)²−2f(1)f(2) +f(2)²

+A13 f(1)²−2f(1)f(3)−f(3)² +A₂₃ f(2)²−2f(2)f(3) +f(3)²

+...+A_n−1,n f(n−1)²−2f(n−1)f(n) +f(n)²

=1 2

n

X

i=1 n

X

j=1

A_ij(f(i)−f(j))²≥0.

(3.8) In this derivation we have used Aii = 0∀i = 1, ..., n and Aij = Aji ∀i, j = 1, ..., n. Graph Laplacians are often useful in machine learning tasks where we need to optimize a quadratic objective function, such as: ¹₂x^TAx−b^Tx+c, because SPD matrices ensures convexity, such that we can find global optimas.

3.2 Anchor graph construction

With some concepts of graphs in mind, we introduce anchor graph construction, which is an effective way of producing sparse graphs over a dataset X. This graph will later be used to transduce targets in the label propagation method described in Chapter 4.

Anchor graph construction can be divided into several steps:

1. Buildmanchor points.

2. Findsnearest anchor points of each data pointx_i. 3. Find regression weightsz_i of each data point.

4. Build regression matrixZ.

5. Build weighted adjacency matrixW. We will next go through each step.

1. The first step of building the anchor matrix consists of building m anchor points. These can be m randomly sampled points from the input space R^d, where X ∈ R^(n×d), or built in more sophisticated ways. In this work we will use k-means centroids as anchor points. K-means is an unsupervised clustering method which outputs k cluster centers. The k-means method is described in Algorithm 1. The algorithm iteratively updates cluster adherence for each data point xi and cluster centroidsµ_i until the centroid’s location converges.

2. The second step of the anchor graph construction is to find thesnearest anchor points for each x_i. This can easily be done by taking some normk · k between each data point and cluster center as in Algorithm 2:

(22)

Algorithm 1K-means algorithm

1: procedurekmeans(k,{x₁, ...,x_n})

2: {µ₁, ...,µ_k}=RandomSample({x₁, ...,x_n}, k)

3: while not convergeddo

4: w={w1, ..., wk} . {w1, w2, ..., wk}are sets containing data points within each cluster.

5: fori= 1 :ndo

6: j= argmin_skxi−µsk

7: wj=wj∪ {xi}

8: fori= 1 :kdo

9: µi=_|w¹

i|

P

xs∈wixs 10: return{µ1, ...,µk}

Algorithm 2s nearest anchor points

1: proceduresneighbors({x₁, ...,x_n},{µ₁, ...,µ_k})

2: fori= 1 :ndo

3: forj= 1 :kdo

4: dist[j] =kxi−µjk

5: anchorneighbor[i] = argpartition(dist,s) .argpartition returns the argument of the s-smallest values of input-array in order.

6: returnanchorneighbor

₆₂

Figure 3.2: Anchor graph regression weights

(23)

4. The fourth step involves building a regression matrixZ:

Zi,=z^T_i ∀i∈{1,...,n}. (3.10)

This matrix has dimensionality (n×m). In Algorithm 3 we describe building the regression matrix:

Algorithm 3Regression matrix

1: procedureRegressionWeights(X, U, s)

2: i∈{1,...,n}=sneighbors(X, U) .Store s nearest anchor points

3: fori= 1 :ndo

4: zi= argmin_zkxi−Uzk²/2

5: Zi,=z^T_i

5. In the last step, we build the weighted adjacency matrix. Here, we describe the relation between data points inX. The proposed adjacency matrix by Liu et al. (2010) is:

W =ZΛ⁻¹Z^T, (3.11)

where W ∈ R^n×n and Λ ∈ R^m×m and diagonal matrix Λkk = Pn

i=1Zik. Here, W becomes a weighted adjacency matrix with non-negative entries: Wij =Pm

k=1

Z_ikZ_jk Pn

h=1Z_hk. Since many of the entriesZij are zero, the resulting matrix W is somewhat sparse depending on the choice ofmand s. One important distinction between a regular adjacency matrix andW is that the diagonal has non-zero entries.

Figure 3.3 illustrates the anchor graph construction. We start with a 2-dimensional (2D) dataset containing 3 overlapping structures; 2 triangles and a circle. The structures are discretized with a total of 168 points and contain some gaussian noise. The dataset is plotted in Figure 3.3a. In Figure 3.3b the cluster centers of the k-means algorithm with m = 50 is plotted, and in Figure 3.3c the resulting weighted adjacency matrixW with s= 2 is visualized. Here all edges between the datapoints are plotted as gray lines.

(a) Plot of 2D noisy dataset containing 2 triangles and 1 circle.

(b) Plot of the 2D noisy dataset anchor points.

(c) Plot of anchorgraph over 2D noisy dataset.

Figure 3.3: Anchor graph process visualized on a 2D noisy dataset.

In Figure 3.4 the k-NN graph with k = 9 is visualized. As we see, the number of edges is approximately the same as for the anchor graph withm= 50, s= 2 and the graph structure looks quite similar.

(24)

Figure 3.4: k-NN graph visualized on a 2D noisy dataset.

In the edge case wherem=n, the cluster centers of k-means are just the data points themselves, meaningU =X and thus U becomes thes-NN ofxi. Then zi are the weights such that the norm betweenxi and its s-NN is minimal, thusZ becomes a weighted s-NN graph. Since Z is not symmetric, the productZΛ⁻¹Z^T is required to make a normal and symmetric representation of the weighteds-NN graph. For this reason, we get similar results with anchor graphs and k-NN graphs.

The reason we prefer anchor graph in this work is the computational complexity, which lets the algorithm scale. The time complexity of k-NN graph construction is O(knlogn)(see Eppstein et al. (1997)), since we need to compute the distance between every data point. This becomes infeasible if we have a set of 1e⁶ images for example. On the other hand, graph construction for the anchor graph only needs to measure distances between data- and anchor points. The total time complexity of anchor graph construction isO(m²n), consisting of k-means anchor construction:

O(mn), designing the regression matrixZ: O(smn), and designing weighted adjacency matrixW: O(nm²). If nownms, the time complexity for large datasets can be considerably reduced.

(25)

Chapter 4

Label propagation

Label propagation is a popular method to infer labels over graphs, see e.g. Zhou et al. (2004),Zhu and Ghahramani (2003) and Kahle et al. (2019). In Chapter 3 we introduced anchor graphs. In this Chapter we will introduce label propagation that will be used for transducing labels in the graph. This will ultimately go into a GPC, which we will talk about in the following Chapter. We start this Chapter by introducing the concept of transduction in Section 4.1. Then, we introduce three different variants of label propagation in Section 4.2, 4.3 and 4.4 respectively. Finally, we tie together the variants in Section 4.5.

4.1 Transduction

Label propagation is a transductive method to infer labels. Before we introduce the method, we distinguish between transduction and induction.

Traditional SL-tasks have two phases; training and test phase (Hastie et al. (2001)). In the training phase, the model is fitted to the training data{X,y}. Once the hyperparameters of the model has been fitted, the model is applied to som test data {X∗} where the labelsy∗ are inferred by the modelf(X∗|X,y,θ). This sort of inference is calledinduction. Consequently, the key idea of induction is to reason from observed specific examples, to build a general model and subsequently apply it to some test examples.

Transduction, on the other hand, does not include the intermediate step of building a general model. Instead, transductive methods reason from observed specific examples to test examples.

This is a simpler form of inference, of which the pros are efficiency and reduction of unecessary generalization and the cons are the specificity and lack of reusability of the model. Figure 4.1 displays the difference between transduction and induction. In Figure 4.1a the training phase of the transductive model is displayed. Here, data points with observed and unobserved targets are included. In Figure 4.1b we see the inference phase of the transductive model, where unobserved targets have been infered based on similarity to the observed targets. In Figure 4.1c the training phase of the inductive model is displayed, where only observed targets are included in order to make a general model. In Figure 4.1d we see the inference phase of the inductive model, where a general model have been fitted in order to infer labels of unobserved examples.

(26)

a b c

d

e f

g

h i

j k

l

(a) Training phase, transduction.

a b c

d

e f

g

h i

j k

l

(b) Inference phase, transduction.

a b

g

h

(c) Training phase, induction.

c

d

e f

i

j k

l

(d) Inference phase, induction.

Figure 4.1: Display of the difference between Inductive and Transductive learning. Red and blue nodes refers to labeled datapoints. Green datapoints refers to unlabeled datapoints.

In this work we want a transductive method to efficiently infer ”soft” labels that can later be used to build a general model.

4.2 Objective function view

Label propagation is a transductive graph method. Meaning the labels are inferred from information contained in the graph representation of the data. All transductive graph methods can be generalized to optimizing an objective functionQwith 2 components (see van Engelen and Hoos (2020)). The first component penalizes transducted labels that differ from the labeled data and the second component penalizes connected nodes with different labels. A general form of the objective function is as follows:

Q(F,Fˆ) =λ

n

X

i=1

l1(Fi,Fˆi) +

n

X

i=1 n

X

j=1

Wijl2( ˆFi,Fˆj). (4.1)

Here,F = [F1, ..., Fn] and ˆF = [ ˆF1, ...,Fˆn] are label distributions associated with the data points whereF,Fˆ ∈R^(n×c) withc being the number of label categories. F are observed labels, meaning Fi[j] = 1 ifyi=jandFi[k] = 0∀k6=j. ˆFare inferred label distributions where 0≤Fi[j]≤1∀_j∈{1,c}

and Pc

j=1Fi[j] = 1∀_i∈{1,n}. W is the adjacency matrix, l1 and l2 are loss functions and λ is a weight parameter for the first component of the objective function. A simple choice of loss function is the L2-norm, which yields the objective function:

Q( ˆF) =λ

n

X

i=1

kFi−Fˆik²₂+

n

X

i=1 n

X

j=1

WijkFˆi−Fˆjk²₂. (4.2)

From equation (4.2) we notice that the quadratic form of the graph Laplacian resembles the second component of equation (3.7), whereW is a weighted adjacency matrix, thus the expression in matrix notation becomes:

Q( ˆF) =λk(F−Fˆ)^Tk²_2,1+kFˆ^TLFˆk²_2,1. (4.3) where the graph Laplacian isL=D−W as defined earlier in this work andk · k2,1is theL2,1norm

(27)

defined as: kAk_2,1= Pn

j=1

Pn

i=1ka_i,jk²^1/2

. This objective function has analytical solution:

Fˆ= (1−α)(I+αL)⁻¹F, (4.4)

where α = _1+λ¹ . Equation (4.4) is the label propagation solution. Because the computational complexity of inverting a n×n-matrix is O(n³), the analytical solution might be infeasible for large datasets. For this reason, many papers prefer an iterative algorithm:

Fˆt+1=αLFˆt+ (1−α)F, (4.5)

which converges to equation (4.4).

There exists alternatives to this algorithm proposed by Zhou et al. (2004), which solves a variation on equation (4.1). The analytical solution to one of these variations is:

Fˆ= (1−α)(I+αL)˜ ⁻¹F, (4.6)

where ˜L=D^−1/2W D^−1/2 is the normalized graph laplacian. This algorithm can also be solved iteratively and is often calledlabel spreading. Another proposed alternative by Zhou et al. (2004) is:

Fˆ = (1−α)(D−αW)⁻¹F. (4.7)

4.3 Random walk view

A random walk is a path of successive random steps in some sample space, for instance the space of integersZ. A simple example of a random walk could be moving with equal probability +1 and−1 on the integer line. This is a one-dimensional example, since we can only move in a one-dimensional spaceZ. We could also have multi-dimensional random walks, e.g. on a two-dimensional gridZ². Some 1D and 2D random walks are depicted in Figure 4.2. In these displays, we denote the current state of the process at a time stept as a random variabley_tfor a 1D random walk or x_t, y_t for a 2D random walk. Here,x_t andy_tcan be any random variable over time, not to be confused with labels or data points in previous sections of this work. We can denote the probability of moving as transition probabilities, in this case: p(y_t =y_t−1+ 1) = p(y_t = y_t−1−1) = 0.5 ∀_t∈_N for the 1D-example andp(yt=y_t−1+ 1) =p(xt=x_t−1+ 1) =p(yt=y_t−1−1) =p(xt=x_t−1−1) = 0.5

∀_t∈N for the 2D-example.

(a) 3 realizations of a one-dimensional random walks, with equal probability of moving +1 and−1, starting aty0= 0.

(b) 3 realizations of a two-dimensional random walks, with equal probability of moving +1 and−1 in each direction, starting at (x0, y0) = (0,0).

Figure 4.2: 1D and 2D random walks.

A Markov chain (MC) is a random walk where the probability of moving to each state is dependent

(28)

can define a discrete-time MC (DMC) by an initial state and a transition matrix. The initial state defines the probability of begining the process in each state. For example, if we have a state space {1, ..., c}with a 1D DMC, the initial state isπ0= [p(y0= 1), p(y0= 2), ..., p(y0=c−1), p(y0=c)]^T, andPc

k=1p(y0=k) = 1. The transition matrix defines the probability of moving to any state in the state space given any current position in the state space, e.g.:

T=







... ... · · · ... ...

p(yt+1=c−1|y_t= 1) p(yt+1=c−1|y_t= 2) · · · p(yt+1=c−1|y_t=c−1) p(yt+1=c−1|y_t=c)

p(yt+1=c|yt= 1) p(yt+1=c|yt= 2) · · · p(yt+1=c|yt=c−1) p(yt+1=c|yt=c)







, (4.8)

where the column sums are 1.

Given an initial state vector and transition matrix we can find the probability of any state at any time step:

p(y_t+1 =i) =

c

X

j=1

p(y_t+1=i|y_t=j)p(y_t=j) =

c

X

j=1

T_ijp(y_t=j). (4.9) This formula is recursive, sincep(y_t = j) is defined similarly. We can therefore summarize the probability vectorπ_t = [p(y_t = 1), p(y_t= 2), ..., p(y_t=c−1), p(y_t=c)]^T for all states at timet as:

π_t=Tπ_t−1=T(Tπ_t−2) =...=T^tπ₀. (4.10) We can find the limiting behaviour of the DMC by here lettingt→ ∞. This is called the stationary distribution of the DMC:

π= lim

t→∞T^tπ₀. (4.11)

We can think of the stationary distribution of the DMC as the eigenvector of the transition matrix corresponding to the eigenvalue λ = 1, since applying the transition matrix to the stationary probability vector should give us back the stationary probability vector:

Tπ=π

(T−I)π= 0. (4.12)

In this application we are interested in finding the limiting distribution of a random walk over a graph. This means that we no longer have a single time-dependent random variable y. Instead we have multiple random variables y₁, ..., y_n which represents the nodes of the graph. Each of these random variables have an initial probability distribution and thus we can define a initial probability matrix for the graph:

Π₀=







p(y1(t= 0) = 1) p(y1(t= 0) = 2) · · · p(y1(t= 0) =c−1) p(y1(t= 0) =c) p(y2(t= 0) = 1) p(y2(t= 0) = 2) · · · p(y2(t= 0) =c−1) p(y2(t= 0) =c)

... ... · · · ... ...

p(yn−1(t= 0) = 1) p(yn−1(t= 0) = 2) · · · p(yn−1(t= 0) =c−1) p(yn−1(t= 0) =c) p(yn(t= 0) = 1) p(yn(t= 0) = 2) · · · p(yn(t= 0) =c−1) p(yn(t= 0) =c)







, (4.13)

which is anR^n×c-matrix.

Graph random walk is the link between the anchor graph and the SSL-extension of GPC we will present later in this work. It allows us to propagate labels in the graph structure we presented in Section 3.2, which will later go into the GPC we present in Chapter 5. A graph is represented by an adjacency matrix(see Section 3.1)A which tells us which nodes are connected. Where a DMC assumes that the next step of our random walk is only dependent on the current state, a graph random walk assumes that the next state of each node is only dependent on the current state of its neighbors. Thus, the state of any node at timetbecomes:

p(yk(t) =i) = X

j∈Ne(yk)

p(yk(t)|yj(t−1))p(yj(t−1) =i), (4.14) where Ne(yk) denotes the neighbors of nodeyk, or more specifically the non-zero columns ofAi. The transition probabilitiesp(yk(t)|yj(t−1)) are defined for all neighbors in the graph, and they

(29)

are independent of the specific state of the node. We can represent the transition probabilities as an R^n×n-matrix, where only the neighbor nodes have non-zero entries. This means that the same entries that are non-zero of the adjacency matrix will be non-zero of the transition matrix.

However, in the adjacency matrix the column sums are greater than one, meaning the rows are not valid probability distributions, like we required in Equation (4.8). Instead, we can think of the transition matrix as a weighted adjacency matrix (see Section 3.1), where the columns sum up to 1. One way of weighting the adjacency matrix is by the transformation:

T =D⁻¹A, (4.15)

whereD is the degree-matrix with entriesD_ii =P

jA_ij, is equivalent to l1 row normalization of the adjacency matrix.

Given a transition matrix and an initial probability matrix for the graph, one can find the stationary distribution in a similar fashion to a DMC (Berger (1993)):

Π = lim

t→∞T^tΠ0. (4.16)

In this work, we want to do label transduction. This means that we are looking for the stationary distribution of the targets [y₁, ..., y_n], where a proportion of the targets are test examples. Assuming we have some adjacency matrixA over the entire data set X, we can do graph transduction by following equation (4.16). Here, the initial probability matrix will be the initial label distribution.

Since some of the labels are observed training labels, these naturally have probability 1 on the observed class. If we organize the labels in the following fashion [y1, ..., yl, yl+1, ..., yn], where the l-first targets are training labels and then−(l+ 1)-last targets are the test labels, then we have the following:

Π0(i, j) =







(1 ifyi=j

0 otherwise i∈ {1, ..., l}

p(yi =j) i∈[l+ 1, n]

. (4.17)

wherep(yi=j) can be set arbitrarly as initial guesses, as long asPc

j=1p(yi =j) = 1∀_i∈[l+1,n]. In the article by Zhu and Ghahramani (2003) on random walks in graphs, the authors show that the initial label distribution for the test data is inconsequential and thusp(yi =j) can be anything.

The transition matrix T can be found similarly as in equation (4.15), or any other way which results in a weighted adjacency matrix. The construction of the weighted adjacency matrix will decide the stationary label distribution. For label transduction, we want the observed targets to be absorbing states. This means that the probability of leaving these states are zero. In Figure 4.3 we have depicted a graph with absorbing states.

a

c b

d e

f

Figure 4.3: Graph with 2 absorbing states

Following equation (4.15) and the example in Figure 4.3 we get the following transition matrix:

T =





1 0 0 0 0 0

0 1 0 0 0 0

0 0 0 1 0 0

1/3 0 1/3 0 1/3 0





. (4.18)

(30)

We observe that the matrix element from nodeetob is 1/3, while the matrix element frombtoe is 0. This indicates thatbis an absorbing state. We can observe the same for nodea.

This matrix can be divided into block matrices:

T =

T_ll T_lu Tul

Tuu

=







1 0

0 1

0 0 0 0







0 0

1/3 0 0 1/3

0 0













0 1 0 0

1/3 0 1/3 0

0 1/3 0 1/3

0 0 1 0













. (4.19)

Here, T_ll, T_lu, T_ul, T_uu denotes transition probabilities between groups of labeled and unlabeled nodes. T_ll is the transition probabilites between labeled examples and since these are absorbing states, both have zero probability of transitioning, thus this becomes an identity matrix. T_lu is the transition probabilities from labeled to unlabeled examples which is a zero matrix, since we cannot transition to unlabeled nodes from labeled nodes. T_ul is the transition probabilities from unlabeled to labeled nodes, as we see, nodes d and e can transition to labeled nodes since these are neighbors of nodes a and b. Finally,Tuu are transition probabilities between unlabeled nodes and here we can transition wherever there are connected unlabeled nodes in the graph.

With this block structure we get the following expression for the label distribution of the graph at time statet:

Πt= Π^l_t

Π^u_t

=

T_ll T_lu T_ul T_uu

Π^l_t−1 Π^u_t−1

=

I 0 T_ul T_uu

Π^l_t−1 Π^u_t−1

=

I 0 T_ul T_uu

t Π^l₀ Π^u₀

, (4.20)

where Π^l_t,Π^u_t is respectively the labeled and unlabeled probability distribution at time statet. We want the stationary distribution, which means we have to find lim_t→∞T^t. This becomes (Zhu and Ghahramani (2003)):

t→∞lim T^t= lim

t→∞

I^t 0 (Pt

i=0T_uuⁱ )Tul T_uu^t

. (4.21)

Here we can view the sumPt

i=0T_uuⁱ ) as a geometric series. A geomtric series of a matrix Acon- verges to (I+A)⁻¹if|A|<1. SinceTis a normalized matrix,|Tuu|<1. Therefore lim_t→∞T_uu^t =0.

Thus, as we lett→ ∞the stationary label distribution becomes:

Π = Π^l

Π^u

=

I 0

(I−Tuu)⁻¹Tul Tuu

Π^l₀ Π^u₀

=

Π^l₀ (I−Tuu)⁻¹TulΠ^l₀

. (4.22)

Notice in this expression that the stationary distribution is independent of the initial distribution of the unlabeled nodes Π^u₀. In Zhu and Ghahramani (2003), it is therefore set to a0-vector(although this is technically not a distribution).

4.4 Message passing view

The objective function view of label propagation gives an easy formulation of the algorithm (van Engelen and Hoos (2020)). However, it lacks motivation and statistical rigor. In this subsection we view the algorithm from a more statistically sound perspective through a message passing algorithm called junction tree algorithm.

The junction tree algorithm is a message passing algorithm that generalizes belief propagation to all graphical models. It consists of forming a junction tree and applying belief propagation to the resulting structure. We begin by briefly introducing junction trees and subsequently presenting belief propagation (see Kahle et al. (2019)).

Junction trees are undirected graphical models (UGMs) that are converted to tree structured graphs. This means that each pair of nodes are connected by exactly one path in the UGM.

Junction trees can be formed from directed acyclic graphs (DAGs) by converting them to UGMs

Graph Gaussian Process Classifier with Anchor Graph and Label Propagation

Norwegian University of Science and Technology Department of Mathematical Sciences

TMA4900 Industrial Mathematics, Master’s Thesis