Finding new links in ACT data provided by Mnemonic using Graph Neural Networks

(1)

Finding new links in ACT data provided by Mnemonic using Graph

Neural Networks

Martin K. J. Heggem

Thesis submitted for the degree of

Master in Programming and system architecture 60 credits

Department of Informatics

(2)

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

Autumn 2021

(3)

(4)

Finding new links in ACT data provided by Mnemonic using

Graph Neural Networks

Martin K. J. Heggem

(5)

Finding new links in ACT data provided by Mnemonic using Graph Neural Networks

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

(6)

Abstract

The ACT project is a collaboration lead by Mnemonic with the goal of developing a platform which can predict cyber attacks before they are carried out. The platform describes a graph consisting of objects and facts, where the objects are nodes and the facts are edges. The ACT graph consists in large part of facts with the type ’mention’. These facts are the edges between the object type ’report’ and any other object type.

The reports and mentions are generated automatically from available information found on the internet. The overwhelming amount and generic nature of the mentions make them less useful than one might think. To manually sift through all the generated mentions in not prac- tically achievable, so an automatic approach could perhaps make better use of the information these mentions represent.

We will use graph neural networks (GNN), more specifically the SEAL framework, to attempt to propose new unknown facts in the graph, and find some useful relationships between so far unconnec- ted objects. Our question is whether the aforementioned "mention"

facts can contribute to this goal. As the mentions constitute such a large portion of the graph they might represent some important struc-

(7)

tural information that could be used by a neural network trained on the graphs structure, or they could simply be noise without any structural significance to the graph. Including the mentions in the training process could lead to better results than if we were to apply GNN to the graph without them, or it could lead to the network not learning in the way we want. The research question which we consider in this thesis is

"Will the addition of ’mention’ facts lead to better performance of SEAL when it comes to predicting new facts in the ACT graph?" Better performance in this case means whether the neural network model trained on the graph achieves higher precision and recall when predicting the existence of an edge in the graph.

To go about answering this question we will first remove the mentions from the equation entirely to see how well the SEAL model will perform without them. Then we will attempt to reintroduce the mentions in a variety of ways to see which approach, if any, has the best impact on the final performance of SEAL in regards to proposing new graph edges. Because of the way the ACT graph works and the API we have available to us, extracting the graph so as to be able to train a GNN on it is non trivial. We have the ability to use the query language Gremlin to query the graph from a given node, but the search takes a long time and there is a timeout preventing larger queries. The result of this is that a fairly large part of this thesis will first revolve around extracting the data we need from the graph and to structure and process it into what is the required input for the SEAL framework.

Afterwords we will begin to focus on the best way to introduce the men-

(8)

tion facts.

This work as led us to find that since the ACT graph is so large and varied, it is not currently feasible to train a neural network to predict new links with a consistently high performance across the many different parts of the graph, and that the inclusion or exclusion of mention facts has similarly inconsistent effects.

(9)

List of Figures

2.1 Model of a basic neural network. The dark circles represent neurons while the light gray circles represent the input vector. Image from Stephen Marsland - Machine Learning. An Algorithmic Perspective 2nd ed. p. 58 . . . 19 2.2 Model of a deep neural network. . . 22 2.3 A local receptive field. The input neuron is represented

in a grid corresponding to reach pixels location in the image. We can see how the neurons in the hidden layer only receive information of a small region of the image. Image from Nielsen, Michael A. - Neural Networks and Deep Learning. Chapter 6. . . 24 2.4 Illustration of feature maps. The hidden layer consists

of three feature maps which use three different sets of shared weights and biases. Image from Nielsen, Michael A. - Neural Networks and Deep Learning. Chapter 6. . . 25

(12)

2.5 Illustration of pooling layers. We can see that each feature map is summarised into a smaller next layer. Image from Nielsen, Michael A. - Neural Networks and Deep

Learning. Chapter 6. . . 26

3.1 The full ACT data model not including «mention» facts and isolated object types. The object type «report» (not shown) is connected to nearly every other object type, as well as some others also not shown here, by «mention» facts. The whole structure is very large and not feasible to display in a figure on a single page. Image from Bromander et al. Modeling Cyber Threat Intelligence, Page 4. . . 33

4.1 Node labeling trick in action. In the two cases we are looking at the links < v1, v2 >and< v1, v3 >respectively. Image from Muhan Zhang et al. Revisiting Graph Neural Networks for Link Prediction, Page 4. . . 43

4.2 The structure of DGCNN. Image from Muhan Zhang et al. An End-to-End Deep Learning Architecture for Graph Classification, Page 4. . . 48

5.1 ACT data structure during selection process . . . 60

5.2 Test cases . . . 66

6.1 Accuracies for the countries data set. . . 76

6.2 Precisions for the countries data set. . . 76

(13)

6.3 Recalls for the countries data set. . . 76

6.4 F1 scores for the countries data set. . . 76

6.5 Accuracies for the fqdn data set. . . 77

6.6 Precisions for the fqdn data set. . . 77

6.7 Recalls for the fqdn data set. . . 78

6.8 F1 scores for the fqdn data set. . . 78

6.9 Accuracies for the ipv4 data set. . . 79

6.10 Precisions for the ipv4 data set. . . 79

6.11 Recalls for the ipv4 data set. . . 79

6.12 F1 scores for the ipv4 data set. . . 79

6.13 Accuracies for the organizations data set. . . 80

6.14 Precisions for the organizations data set. . . 80

6.15 Recalls for the organizations data set. . . 81

6.16 F1 scores for the organizations data set. . . 81

6.17 Accuracies for the threat actors data set. . . 82

6.18 Precisions for the threat actors data set. . . 82

6.19 Recalls for the threat actors data set. . . 82

6.20 F1 scores for the threat actors data set. . . 82

6.21 Accuracies for the tools data set. . . 83

6.22 Precisions for the tools data set. . . 83

6.23 Recalls for the tools data set. . . 84

6.24 F1 scores for the tools data set. . . 84

(14)

List of Tables

3.1 Response codes listed in API . . . 37 3.2 Encountered response codes . . . 37 4.1 By adding self-loops to an adjacency matrix the diagonal

of the matrix is set to ones. . . 49

(15)

Preface

A lot has gone into the development of this thesis. I have had no previous experience with a text longer than a few pages, so writing something of this magnitude definitely posed a large challenge. It would have been hard enough on its own to stay motivated and interested for the three semesters I have spent working on the thesis, so it is safe to say that Covid-19 definitely did not help the process in any way shape or form. As luck would have it I persevered and managed to pull through in the end.

I originally worked on a different project for a short while at the beginning of my masters degree, however I soon realised that the project was not a good fit for me, and I am glad that I discovered I could write a thesis in collaboration with Mnemonic, and that I was directed to precisely this thesis.

I would like to thank Fabio Massimo Zennaro for the guidance through the project, I would have had no idea where to even begin without it.

Siri Bromander had also been an invaluable help in shaping exactly what the goal of the project should be and in helping us understand the details of ACT.

(16)

Part I

Introduction

(17)

Chapter 1 Overview

Cyber crime is an ever growing threat, and the demand for information on how to combat it is growing along side it. The goal of the ACT project is to provide a united platform to fight cybercrime. To achieve this goal the platform is developed as open source and it should be able to include data from many different sources. The data should be shared across the industry both public and private, and the platform should have the capability to predict when, how, by who, and why an attack happened.[15]

The ACT-graph is the graph in which all this threat intelligence data is represented. The graph contains many nodes and edges of different types, such as ThreatActor, Country, Person, et cetera. The nodes simply indicate the existence of an object, whereas the edges are the actual information containing the relations between the objects. While a lot of the information in the graph can be effectively used to draw conclusions, uncover information and connections between ob-

(18)

jects, a large part of the information in the graph is automatically generated, and thus harder for an analyst to make effective use of.

Motivation Our work is motivated by the ever growing threat of cybercrime, but also by further studying graph neural networks applied to link prediction tasks.

New threats, incidents and vulnerabilities keep happening and to successfully combat the threat the intelligence needs to be accurate and up to date. The ACT graph attempts to achieve precisely this, and our research will hopefully result in another tool to find links between threat objects. Neural networks is an ever growing field and new applications are constantly researched, graph neural networks are a breed of neural networks that are tailor made to work on graphs, and success has been found in using them for classification tasks. A field of study on graph neural networks that is less developed however is using the networks to predict links between nodes in a graph. We hope to utilize this technique to enhance the available data in the ACT graph and prove that link prediction with graph neural networks is feasible in real world scenarios.

The graph neural networks function in some ways similarly to convolutional neural networks, which are used on images. If you took an image and created a node for each pixel and an edge between each neighbouring pixel you would end up with a graph. The challenge is to perform machine learning when the nodes are not constrained to a pixel grid. The structural information in the graph is important to

(19)

train an effective model, and because of this we might be able to make use of nodes and edges in the ACT graph that are not easy to make use of by a human analyst. Another question is whether the mention facts, which are facts generated automatically from available information found on the internet, provide useful structural information and whether their inclusion improves the performance of the network.

We will do tests both without any mention facts in the graph and with the mentions introduced in different ways. As the mentions are not interesting to look at in themselves, but are merely used as sup- porting information for other intelligence, it might be possible to train the neural network on a selection of node types we wish to learn information about and only reserve the mentions in the background, to function only as structural information in the graph.

Structure We will begin part one with a general introduction to machine learning and neural networks, moving further into deep learning and convolutional networks then finally explain the functioning of graph neural networks and contrast them with fully connected neural networks, and convolutional neural networks. To conclude the background chapter there will be a brief introduction to threat intelligence.

An overview of the purpose, structure, data, and API of the ACT graph will then follow, before finally we will conclude the introductory part of the thesis with an overview of existing solutions to link prediction, and specifically the solution we will be making use of, SEAL.

In part two we will go through the execution of the project, the

(20)

choices made during implementation and testing, and challenges faced in the planning and execution of the project, before finally presenting the results achieved by the neural network.

Finally, in part three, we will finish with the conclusion.

(21)

Chapter 2 Background

2.1 Introduction to machine learning

As Marsland[13] puts it, machine learning is modifying or adapting a computers actions to make the computer perform a task more accurately. The task can range from predicting future input based on previous input, to controlling the movement of a fluid, or to approximate real life physics. The accuracy of the computers actions is defined as how much the chosen action resembles the correct one. If a computer is given a sequence of numbers and tasked with predicting the next numbers by learning the pattern they appear in, the accuracy would represent how many of the next numbers the computer successfully predicts, or how close the predictions were to the correct numbers. The goal of machine learning is for the computer to learn the most accurate approximation of the ground truth as possible.

Machine learning has become increasingly accurate in the last dec-

(22)

ade to the point where the created approximations can in some cases be nearly indistinguishable from reality at a much lower computational cost than traditional approaches.

To achieve these results the computer is fed large sets of data and, depending on the approach, uses this data to learn something about how the data is put together.

There are several approaches to training the computer. The main categories are as follows:

• Supervised learning, where the computer is given a large set of data along with the correct actions the computer should take. The computer then uses this to learn what to do when presented with data not contained in this set.

• Unsupervised learning, where the computer is given data and attempts to find patterns and similarities in the data to categorise it.

• Reinforcement learning, where the computer is given feedback on whether the taken action is correct or not, and learns to take more correct actions.

• Evolutionary learning, modeled after evolution. The computer learns through generations where the best solutions are passed along to the next generation.

In some ways neural networks are modeled at a high level after our brains, and the building blocks of the brain, neurons, has inspired the

(23)

development of machine learning. Neural networks is a subclass of machine learning algorithms where the training data is fed to a network constituted by artificial neurons. Each neuron is often very simple, containing a function called an activation function which, depending of the input, will output a value. More about neural nets in the next section.

2.2 Neural networks and deep learning

As mentioned in the last section neural networks are a type of machine learning that is modeled after neurons in a brain. They have proved so effective at learning that when people talk about machine learning nowadays they, most likely mean neural networks. The basic principles of a neural network is an input vector which is fed into a set of individual neurons with weights for each input value, and a bias that determines how much the neuron influences the next layer. The first kind of artificial neuron, the perceptron, was developed in the 50s and 60s by Warren McCulloch and Walter Pitts. [17] The perceptron took multiple binary inputs and produced a single binary output. In mathematical terms, if the sumP

jw_jx_j of the inputsxand weightsw were greater or equal to the threshold, the perceptron would output 1, otherwise it would output 0.[17]

Output =







1, if P

jw_jx_j ≥threshold 0, otherwise

(2.1)

(24)

If you were to string multiple perceptrons together into layers where the output from a perceptron was fed into several of the perceptrons in the next layer you would end up with a rudimentary neural network.

The problem with such a network is that there is no way to properly

Figure 2.1: Model of a basic neural network. The dark circles represent neurons while the light gray circles represent the input vector. Image from Stephen Marsland - Machine Learning. An Algorithmic Perspect- ive 2nd ed. p. 58

achieve a subtle change in output in reaction to a subtle change in input. Since the perceptrons work only with binary values any small change in the input could only cause a perceptron to completely flip causing a large change in the network. Because of this other neuron models are used today which allow for granular changes in the output of each neuron. The neuron model mainly used today is the sigmoid neuron which can output any real number between 0 and 1.[17] This is achieved with thelogistic functiondefined as follows:

f(x) = 1

1 +e^−x (2.2)

(25)

This allows for much better training of the network. Now we can make small changes in weights to nudge the network subtly in the directions we want. Modern neurons contain different types of activation functions, but the most important point is that we have moved away from the step function which effectively makes the neuron a perceptron, to more smoothed out, non linear functions like the logistic function. After adjusting the weights of all the neurons the network has «learned» to produce a correct solution for this specific problem.

Hopefully it will also produce correct solutions for similar problems that it has not been trained on. This is called generalization.

To train a neural network, specifically using supervised learning, we use an algorithm called stochastic gradient descent. We will not go too far into detail here, but an overview of the most essential parts is useful to understand. The whole idea behind training neural networks is to adjust weights and biases in the neurons until the desired result is achieved. How do we know how to adjust these parameters? This is where gradient descent comes in. Gradient descent simply means to move towards the lowest point of a gradient or slope. In machine learning terms this means to move towards the optimal configuration of the weights and biases. To achieve this we need a gradient to move down. The gradient is computed by the backpropagation algorithm.

Backpropagation is aptly named as it moves backwards through the network, starting from the final layer, and computes theerrorfor each layer. The error, very simply put, is a term referring to how far off the weight and biases are from being the «optimal» configuration, or put

(26)

differently, the bottom of the gradient; the desired outcome. The error is calculated by acost function. There are several types of cost function, one of which is thequadraticcost function:

C(w, b)≡ 1 2n

X

x

||y(x)−a||² (2.3)

w represents all the weights in the network, and b represents all the biases. n is the total number of training inputs; the number of cases in your training set. a is the output vector of the network dependant on the inputx, and the weights and biaseswandb.

The backpropagation algorithm moves back through the network computing the error in each layer as it goes, to calculate the error for a specific layer another formula is required:

δ_j^l ≡ ∂C

∂z_j^l (2.4)

The equation gives the error δ of neuron j in layer l. _∂z^∂Cl j

is the par- tial derivative of the cost function with respect to the weighted input z for the specified neuron. The gradient is then calculated and fed to the stochastic gradient descent algorithm which in turn nudges the weights and biases in the correct direction to achieve the desired result.

The general idea of a neural network is simple, but the type of neural nets that has had the largest impact in the last decade is less so. Deep learning is the generalized name for neural networks that have more than one layer of neurons between the input and the output layers. These deep neural networks have proved extremely capable

(27)

of learning and generalizing in many applications. Image upscaling [4], advanced physics simulations [19] and the rapidly growing field of deepfakes [16] are just some examples of what is possible with deep neural networks.

Figure 2.2: Model of a deep neural network.

There are many ways to influence the structure of the network and change how it learns and behaves, the activation functions in the different layers, the amount of neurons in each layer, and the types of layers you string together are some ways. There are also several types of deep neural networks using different combinations of these, and more, features, tailor made for different applications. For example recurrent neural networks where a neurons earlier activation can influence how it behaves in the future. Convolutional networks however, meant for image classification, are the most relevant to our work. There are of

(28)

course also graph neural networks which is especially designed to work with graphs, but we will come back to those in the next section.

Convolutional networks are a type of deep neural networks mainly used in image classification.[17] The basic idea behind convolutional networks is to create a network structure that takes advantage of the spatial structure of the input data. Imagine an image that is 100x100 pixels. In order for this image to be the input of a neural network each pixel would need its own input neuron. In the classical fully connected neural networks we have talked about so far each input neuron would be connected to each neuron in the next layer. See figure 2.2. When the input neurons are fully connected in this manner all the spatial information in the picture is lost. Pixels that are next to each other will seem no different to pixels on completely opposite sides of the image to the neurons in the second layer. Is is not difficult to imagine why this is not optimal when classifying images. The solution to this is to use so called local receptive fields, which is a way of connecting the input nodes to the next layer of neurons which conserved the spatial information. Each neuron in the hidden layer will only be connected to a small localized field of input neurons, which in turn represents a small part of the original image.

These local receptive fields will overlap with the fields connected to the other hidden neurons. Each of the input neurons connected to a single hidden neuron will have different weights attached to them, but each of the hidden neurons will have the same weights for their respective input fields. They will also have the same bias. In this way

(29)

Figure 2.3: A local receptive field. The input neuron is represented in a grid corresponding to reach pixels location in the image. We can see how the neurons in the hidden layer only receive information of a small region of the image. Image from Nielsen, Michael A. - Neural Networks and Deep Learning. Chapter 6.

the hidden neurons learn to spot the same feature in the image, across the whole image. Each set of neurons in the hidden layer which shares weights and biases in this way are called a feature map. To learn more features you have more feature maps. Intuitively it would seem that adding more feature maps would make the network larger and more complicated than a traditional fully connected network, this is not the case. Another benefit of convolutional networks is precisely the re- duced number of parameters needed. In a conventional fully connected network, where all the neurons in one layer are connected to all the neurons in the next, each neuron will have weights for each input neuron, and a bias. For a 100x100 pixel image this is 10000 neurons and in turn 10000 weights for each hidden neuron. This really adds up with many neurons. A convolutional network however only needs weights for each local receptive field times the amount of feature maps

(30)

Figure 2.4: Illustration of feature maps. The hidden layer consists of three feature maps which use three different sets of shared weights and biases. Image from Nielsen, Michael A. - Neural Networks and Deep Learning. Chapter 6.

you are using. This is a drastic reduction in parameters which could greatly improve performance.[17] Even if, at a glance, it looks like the convolutional layer would contain more neurons.

Immediately after a convolutional layer there is usually a pooling layer[17] Pooling layers condense the outputs from the previous layer into fewer neurons. A field of neurons from a feature map is summarised in a neuron in the pooling layer. The pooling is similar to the local receptive fields, but there is no overlapping between the different fields.

As an example a field of 2x2 neurons could be condensed into a single neuron in the pooling layer.

There are several ways to achieve this pooling, one way is max pooling which simply takes the maximum activation value from the input field. To conceptualise why this works we can imagine an image of a square. To recognise a square there has to be corners. The feature

«corner» has been learned by the previous convolutional layer. It does

(31)

Figure 2.5: Illustration of pooling layers. We can see that each feature map is summarised into a smaller next layer. Image from Nielsen, Michael A. - Neural Networks and Deep Learning. Chapter 6.

not matter exactly which neurons find the corners, only that there are corners in the image. This information is preserved, as the neurons that are most confident that they have found a corner will have higher activation values than the other neurons in their area. The pooling layer basically abstracts away the neurons that are less confident about having found the feature they were looking for, and is thus left with a summary of the previous layers most conclusive findings.

To sum it up, convolutional neural networks use local receptive fields to preserve and utilise spatial information. After the convolutional layers there is pooling layers which condenses the network into fewer neurons, saving resources. There can be several of these pairs of convolutional and pooling layers, and there can be regular fully connected layers mixed in as well, usually at the end.

(32)

2.3 Graph neural networks (GNN)

Graph neural networks are, like convolutional networks, a type of deep neural network with a specific use case, namely graphs. Graph neural networks were first proposed in [20] in 2009. The working assump- tion of this paper was that nodes in the graph represents objects and the edges represent their relationships, something which is indubitably true for the graph we will study, the ACT graph. To define an object you take the features of the object and the relationship it has to any connected objects. The state X_n of a node is, according to Scarselli et al.

[20], the information contained in the nodes neighbourhood calculated by:

xn=fw(ln, lco[n], xne[n], lne[n]) (2.5) fw is here a local output function expressing a nodes dependence on its neighbourhood. l_n is the label of the node n, l_co[n] is the labels of the nodes edges, x_ne[n] is the states of the neighbouring nodes, and l_ne[n] is their labels. After calculating the node states. The next step is to use them to produce theoutputon:

o_n=g_w(x_n, l_n) (2.6) g_wis the local output function. The states and outputs are then stacked onto global versions containing the states and outputs of all the nodes in the graph. Gradient descent is then used to update the weights and minimise the loss function.[25] In later years several improvements

(33)

to the original proposition has been made, using convolution is one of them.

Graphs, not unlike images, have structural information which we would like to preserve through the training process. Perhaps it is not surprising then that some GNN-approaches are heavily based on ideas from convolutional networks. The concepts of having locally connected pixels be grouped together, and shared weights and biases both translate nicely when dealing with graphs.[25]

Shared weights and biases is fairly straight forward to apply to graphs, as we want the neural network to learn features in the graph in virtually the same way it would do in an image. Local receptive fields however is not as simple. An image is neatly ordered in a grid of pixels where each pixel is connected to its neighbours, a graph however is not a grid, any node in the graph can be connected to any other, so it is not as easy to choose local fields.

Convolutional Graph Neural Networks are roughly divided into two approaches: Spectral and Spatial. The Spatial approach does something akin to the traditional convolutional layers, but the approach to solving the problem of which nodes belong in a receptive field together is done in different ways. The PATCHY-SAN [18] approach extracts a neighbourhood of exactly k nodes around each node. This neighbourhood is then normalized into a linear order of nodes which is used as the local receptive field. LGCN [6] has a similar approach where it performs max pooling and uses thekbest neighbour nodes as the local receptive field.

(34)

Most spectral approaches however learn filters that work on the graph structure, which then can not directly be applied to graphs with a different structure than the one they were learned on.[25] DGCNN [23] propagates node information to the neighbourhood of the nodes so that each node is labled with, or «learns», its structural role in the neighbourhood. The nodes are then sorted based on structural role and passed on to normal convolutional layers. More on DGCNN in section 4.3.

2.4 Cyber threat intelligence

Now on to something completely different. This part will be short and only contain the basics needed to understand the motivation behind the ACT graph and our research as we do not approach the problem from a cyber threat intelligence standpoint, but rather from a machine learning one.

Cyber threat intelligence is gathering, evaluating, and analysing cyber threat information. These steps are performed in a continuous cycle where new information is constantly gathered and analysed in the context of new, and old, information. This process requires a large amount of expertise and resources to keep up with emerging threats and situations. Analysts have to be able to process large amounts of data to discover similarities and differences while also looking out for deception. The goal is to gain an understanding of the intents and cap- abilities of threat actors so as to hopefully produce accurate intelligence

(35)

and be able to prevent, or be more ready for, future attacks.[8]

Any task where there are large amounts of data to be processed and analyzed will benefit from some amount of automation, as long as it can be done in a way that does not diminish the accuracy and relevance of the results. The ACT graph, which will be covered in the next chapter, aims to be a semi-automated cyber threat intelligence solution.[15] But even still analysts have to do a lot of work to extract meaning from the data. We hope that by using graph neural networks in combination with the ACT graph we can take another step in the direction of automating threat intelligence by providing potential links between objects that might not be feasible to uncover by conventional means.

(36)

Chapter 3 The ACT platform

3.1 Background

ACT is a platform for digital threat intelligence. The platform was developed to combat the growing threat of cybercriminals and aims to be able to predict attribution, timing, characteristics, and victims of cybercrime. The project is developed by Mnemonic in cooperation with The University of Oslo, The Norwegian University of Science and Tech- nology (NTNU) at Gjøvik, The Norwegian National Security Authority (NSM), Nordic Financial CERT, and KraftCERT.[15]

The ACT data model was designed to enable automation and analysis of the already available threat intelligence. By collecting all available intelligence from multiple heterogeneous sources and collecting them in one place it is possible to better analyse the data, and to remove the need for repetitive tasks on the part of the analyst. To achieve good automation and analysis results the data quality must be equally

(37)

good. The quality of the content itself can not be enforced, but the format of the data can be. For data from multiple sources to coexist there is a need for a consistent and strict data model. This way one always knows where a certain data type exists within a data set, and the same type of data is always found in that same location. Pictured below in figure 3.1 is the complete ACT data model, with some exclusions covered in section 3.3 further down.

The model consists of objects and facts. The objects are the nodes and the facts are the relationships between them, in graph terms the objects are the vertices and the facts are the edges.

3.2 Objects

The objects are the nodes in the graph and represent entities like user- names, techniques, tools, countries organizations, and many others.

The objects do not have any properties themselves, outside of their object type and name, the information in the graph is mainly represented as facts, the relations between objects. Objects are immutable, and in cases where an object is not explicitly known, but it is known that the object must exist, placeholder objects are used to avoid the need for more fact types than necessary. Say we have three object types: A, B, and C, and two fact types A → B and B → C. If a piece of information is received that indicates A → C we know there must be aB and the information can be represented as A → (placeholderB) → C thus removing the need for superfluous fact types. The placeholder objects

(38)

Figure 3.1: The full ACT data model not including «mention» facts and isolated object types. The object type «report» (not shown) is connected to nearly every other object type, as well as some others also not shown here, by «mention» facts. The whole structure is very large and not feasible to display in a figure on a single page. Image from Bromander et al. Modeling Cyber Threat Intelligence, Page 4.

(39)

can then later be «upgraded» to an actual object through a new fact[3].

3.3 Facts

Facts are where the bulk of the information exists. Facts are the relations between the different objects. A person, for instance, is a «mem- berOf» an organization, or a username is «observedIn» an event et cetera. Facts are immutable and cannot be deleted to ensure non repu- diation and preserve the history if the graph. If a fact turns out to be false, or for some reason needs to be removed it is possible to add a second fact that retracts the first. Because of this limitation it is possible to trace a continuous timeline of facts in the graph. This can be used to trace the development of known intelligence, and to see what was known at a specific point in time. One can imagine potentially using the time to tag the facts as recent or older, and to be able to prioritize which facts are used when training the model. This could perhaps lead to the predicted facts being more relevant, however we did not explore this avenue in this project.

Most of the facts in the graph consists of a certain type calledmen- tions. The mentions are the relations between the «report» object type and most other object types. A report can for instance «mention» a username, or it can «mention» an uri, or a person, or an event et cetera. These mentions are gathered using natural language processing (NLP) to extract facts from openly available intelligence sources such as articles on the internet. In figure 3.1 we can see just how many

(40)

different object types there are, and thus the amount of different relations the mentions can represent. There are even some isolated object types, like «vulnerability» or «certificate», that are onlyconnected via the mention fact type.

The mentions represent an enormous amount of data, but by themselves they are rather generic and do not reveal a great deal about the connected objects to an analyst. The mention simply lets us know that the object was mentioned in a report, and to manually check every mention to see how significant of a connection it represents is not a feasible task. This is where we hope to create a potential solution.

As previously mentioned, the aim of this project is to use graph neural networks (GNN) to create new facts between objects, hopefully aided by the existing mentions. The hope is that mentions might just contain some useful structural data that is not immediately apparent to the eyes of a human analyst, but that a graph neural network might just be able to take advantage of. As the mentions are not very useful by themselves, our aim is for the result of our work to be less generic facts that are actually useful for threat prediction and analysis.

3.4 API

The graph engine of SEAL is implemented with Apache TinkerPop, and the graph can be queried using the graph query language Grem- lin. Apache TinkerPop is an open source graph computing framework, and Gremlin is Apache TinkerPop’s traversal language. Gremlin is a

(41)

functional, data-flow language, and its purpose is to allow for complex traversals of a graph. Each traversal query is comprised of several atomic steps that can be nested. Together the steps allow for any traversal of the graph imaginable.

The API allows for sending requests to the graph, which return .json files containing the returned objects, facts or both. The API is mostly tailored around extracting data about a certain object, fact or a neighbourhood in the graph. There are also ways to add objects and facts of course, which is not relevant to our work. There are also a set of requests made for traversing the graph from a starting point. What does not exist however is a way to easily extract the whole, or large sections of the graph.

All in all it is clear that the API is designed to provide quick and informative results to an analysts queries. If you want to know which facts are connected to which objects, or which other objects are in an objects immediate neighbourhood, you will be able to find out quickly and easily. This is great for ACTs intended purpose which is to remove some of the overhead from security threat analysis and make the analysts lives easier. It is not however very conducive to performing large scale analysis of tens of thousands of objects and facts. This issue will be explored further in chapter 5.

(42)

3.5 Data description

The ACT API returns a .json file containing all the objects and or facts relevant to the gremlin query. The .json files have a nested structure, and each file contains six top level fields.

1 {

2 "responseCode": 200,

3 "limit": 0,

4 "count": 425,

5 "messages": null,

6 "data": [],

7 "size": 425

8 }

TheresponseCodetells us weather the search was successful. The API lists the following possible codes:

401 User could not be authenticated.

403 User is not allowed to perform this operation.

408 Execution of this operation timed out.

412 Any parameter has an invalid format.

Table 3.1: Response codes listed in API

In addition to this, while testing the following codes were also encountered:

404 Object does not exist.

200 Query returned normally.

500 Unspecified error occurred.

Table 3.2: Encountered response codes

Most queries will either return with code 200 or 408, but for some reason, after working on the project for several months, the 403 error

(43)

started showing up for queries where it had not before. Luckily there is no shortage of objects in the ACT graph, so we simply queried starting from a different object in the case of this happening.

The most important field here is thedata field. This field contains all of the graph information in a nested structure. Below is the listing of a single object and a fact that connects to it.

1 {

2 "id": "eeec60b9-1551-49ad-9673-233b28582734",

3 "type": {

4 "id": "be9cdd49-d484-41d0-86d9-5d8d5a891058",

5 "name": "threatActor"

6 },

7 "value": "APT1",

8 "statistics": null

9 },

10 {

11 "id": "4f8de344-d066-4ba8-acd6-25f1a81b3313",

12 "type": {

13 "id": "d6aa510c-4064-4745-96f0-89eb8f304872",

14 "name": "attributedTo"

15 },

16 "value": "",

17 "inReferenceTo": null,

18 "organization": {

19 "id": "00000000-0000-0000-0000-000000000001",

20 "name": "N/A"

21 },

22 "addedBy": { "id": "00000000-0000-0000-0000-000000000001",

23 "name": "N/A"

24 },

25 "origin": {

26 "id": "0274d4fc-f4ea-4aa3-bbb2-c92c08236200",

27 "name": "mitre-attack"

28 },

29 "trust": 0.8,

30 "confidence": 1.0,

31 "accessMode": "Public",

32 "timestamp": "2021-02-04T16:51:53.032Z",

33 "lastSeenTimestamp": "2021-03-21T17:06:42.666Z",

34 "sourceObject": {

35 "id": "b01b5429-db0e-4a58-9c5c-37b0fa2d4dbd",

36 "type": {

37 "id": "c66e5b27-2d68-497d-ab7b-378b3d827b5d",

38 "name": "incident"

39 },

40 "value": "[placeholder[

(44)

ec365290b0b9ce07140a4ef2add12132edfe70e86ad69b9948d77a59266d2e2d]]"

41 },

42 "destinationObject":{

43 "id": "eeec60b9-1551-49ad-9673-233b28582734",

44 "type": {

45 "id": "be9cdd49-d484-41d0-86d9-5d8d5a891058",

46 "name": "threatActor"

47 },

48 "value": "APT1"

49 },

50 "bidirectionalBinding": false,

51 "flags": [],

52 "certainty": 0.8

53 }

As can be seen above objects are first listed, then the facts that con- nect to it. In addition to this all the facts have a source object and a destination object whose entries are identical to the top level object entries. This means that every top level object entry also has up to several object entries nested within the fact entries, leading to a lot of superfluous information. One of the first steps in processing the data was to filter out some of this redundancy and be left with an object list and a fact list where the fact list only contains the necessary type and value of the source and destination objects. In addition to this there is a lot of secondary information in the .json file that was disregarded when processing, most of which is not that interesting from a graph structure perspective, but it could perhaps be used in labeling in future research.

The .json response also contains a lot of null values, sometimes in the destinationObject field in the facts. This caused issues when at- tempting to create an adjacency matrix from the data, so the null values were also filtered out.

(45)

Chapter 4 State of the art solutions

4.1 Link prediction

Link prediction in graphs is not a new idea, and it is a task which has had many applications for a long time at this point. Suggesting friends in social networks[1], recommendations in online stores[11], and looking for interactions between proteins[2] are fields where link prediction can be very useful. The well established methods for link prediction are mainly heuristic methods based on similarities between nodes. These approaches can work very well if the network they are being applied to happen to conform to the assumptions that the different heuristics make. However a heuristic looking for common neighbours will not find much success when used on a network where common neighbours are not important. It would be ideal if we could learn what heuristic works best for the network in question. This is where GNNs come in.

Link prediction using GNN however is still a field in its infancy and

(46)

it is not as widely studied and understood as classification of nodes and graphs. [24] Still, using GNNs has proven to be superior in predicting links across various types of graphs, if a little slower, as opposed to the much more established node similarity-based methods. [9] Following will be a brief summary of the different solutions to link prediction using neural networks.

VGAEVariational Graph Autoencoder[10] first uses a GCN (Graph Convolutional Network) on the whole graph to compute representations for each node. The representations of the two nodes you are focus- ing on are then used in order to predict the presence of a link between them. A problem with VGAE is that it does not have the ability to distinguish between isomorphic nodes. If two nodesaandbhave identical structural roles in the graph, say for instance two nodes on the opposite sides of a symmetrical graph, VGAE will compute the same representations for the two nodes. Now imagine that we are looking for a link between the node pairs < a, c > and < b, c >. In this instance the two links will be predicted with the same likelihood, as cwill have the same representation in both cases, and a and b are indistinguishable to the VGAE. In reality c could be the neighbour of a while being on the complete opposite side of the graph from b. In not as many words this simply means that VGAE computes the node representations in- dependently, without taking into consideration the nodes relations to each other.[24]

WLNMWeisfeiler-Lehman Neural Machine[22] first starts with local subgraph extraction around a target link. The k nearest neighbours

(47)

are extracted, first starting with the neighbours 1 hop away, then pro- gressively increasing the amount of hops untilk nodes are extracted.k is here a user defined variable. If after increasing the number of hops the number of nodes is greater than k, surplus nodes are discarded.

Conversely if there are not enough nodes to achieve a neighbourhood of sizek, dummy nodes will be created to fill the remaining slots, thus achieving a unified size of all the subgraphs. After the enclosing subgraph is extracted PALETTE-WL vertex ordering is used to create an adjacency matrix. The algorithm assigns «colors» to the nodes based on their structural roles in the subgraph. To start with all the nodes are assigned colors based on their distance from the target link in the subgraph. The colors for each vertex x are then refined using a hash function.

h(x) = c(x) + 1

dP

z⁰∈VK log(P(c(z⁰)))e × X

z∈Γ(x)

log(P(c(z))) (4.1)

Where c(x) is the color of x as an integer, V_K is the set of vertices in the subgraph,P is the set of all primes andP(n)is then^thprime number, and Γ(x) is the 1 hop neighbours of x. After refining the colors the nodes in the subgraph are sorted ascending order and represented as an upper triangular adjacency matrix. The matrix is then input to a fully connectedneural network.[22] The network is trained using positive and negative links sampled from the matrix and is otherwise similar to what has been covered in section 1.2.

SEAL learning from Subgraphs, Embeddings and Attributes[21] is

(48)

as of this writing the best performing GNN link prediction solution on most types of graphs.[24][9] A deep-dive into SEAL will take place in a following section, but some of SEAL’s key differences and advantages over the other approaches are as follows. SEAL uses alabeling trickto solve the problem with isomorphic nodes that is present in VGAE, in this way SEAL can distinguish between links that contain isomorphic nodes, but are not isomorphic links themselves, and any actual isomorphic links. This is done by adding labels to the two nodes we are

Figure 4.1: Node labeling trick in action. In the two cases we are looking at the links < v₁, v₂ > and < v₁, v₃ > respectively. Image from Muhan Zhang et al. Revisiting Graph Neural Networks for Link Pre- diction, Page 4.

looking for a link in between.[24] If we go back to the previous example of a symmetric graph where the node pairs < a, c > and < b, c > contain the isomorphic nodes aand b, and aand care neighbours, we can see that in one of the cases the nodes a andc will be labeled. a is now the neighbour of the labeled node c, while in the other case with b and c being labeled, b is not the neighbour of a labeled node. a and b are no longer isomorphic and < a, c > and < b, c > will no longer yield the same results. SEAL also uses a graph neural network instead of

(49)

the fully connected network that is employed by WLNM. This is an advantage for two main reasons. Regular neural networks only allow fixed input sizes while graph neural networks are much more flexible, and graph neural networks are simply better at learning structural information in graphs. Because of these advantages SEAL can extract subgraphs of varying sizes around its target links, and do not have to bother with trimming or adding nodes to fill a number k. SEAL will also generally better learn features of the graph.

4.2 node2vec

node2vec is the approach used as the default embeddings in SEAL.

node2vec sets out to solve a similar problem to the one GNN based link prediction algorithms attempt to solve, namely that the previously existing approaches were inflexible and not able to as effectively deal with the diversity of different networks and connectivity patterns.

node2vec takes a more flexible approach to the nodes’ neighbourhoods, using random walks to explore them, and thus can model all kinds of equivalences in networks.[5]

There are two extremes when it comes to traversing a neighbourhood in a graph. Breadth-First Sampling(BFS) is an approach where you first look at all the n-hop neighbours of a node before moving to nodes n + 1-hops away. Depth-First Sampling (DFS) is the opposite where you only look at one node for hop-nbefore moving on to hop-n+1, then hop-n+ 2et cetera further and further away from the center node.

(50)

The two extremes search through a graph in very different ways, and will therefore be best suited to expressing different kinds of similarities between nodes. Homophilyis the notion that nodes with high intercon- nectivity, and that belong to similar clusters or «communities» of nodes, should be embedded similarly. Structural equivalence however is the notion that nodes with similar structural nodes should be embedded close together instead, no matter how far apart they are in the actual graph. These two approaches work well on different networks, but they are not exclusive, and will often both be useful for expressing similarities in different nodes, even within the same network.[5] Intuitively it might seem that BFS would be best suited to express Homophily, and DFS structural equivalence, however in fact the opposite is true.

BFS lead to embeddings that correspond to structural equivalence [5], a reason for this might be that BFS leads to an understanding of the immediate neighbourhoods of all the nodes, and thus more accurately classifies their structural roles. Vice versa DFS will better express Ho- mophily, likely because the search branches out more from the center node and gets a broader view of the different communities of nodes.

node2vec aims to strike a balance between these approaches and achieve the best of both worlds by using a weighted random walk which can explore with both DFS and BFS. The nodesc_i in the random walk are generated with the following

P(ci =x|ci−1 =v) =





 π_vx

Z , if(v, x)∈E 0, otherwise

(4.2)

(51)

Where E is the set of all edges in the graph, π_vx is the unnormalized transition probabilitybetween the nodesvandx, or the chance of walk- ing fromv tox. Z is the normalizing constant.[5]

Two parameterspandqare used to decide whether the walk should explore more BFS-like or DFS-like. The unnormalized transition probability π_vx is set to α_pq(t, x)×w_vx wherew_vx is the static edge weights in the graph. This value is set to 1 in the case of an unweighted graph, like SEAL.α_pq(t, x)interacts with the parameterspandqas follows

α_pq(t, x) =









 1

p, ifd_tx= 0 1, ifdtx= 1

1

q, ifdtx= 2

(4.3)

t is theprevious node, the one the walk just came from before reaching nodev. d_txis the distance between nodestandx. In essence this means that the distance between the previous and the next node in the walk decide which of the two parameters are taken into account. Setting p to a high value makes the walk less likely to revisit a node for the next two steps, conversely setting q to a high value will make the random walk more likely to stay close to node t. Setting q low will make the walk more likely to stray further fromt. We can see how setting these parameters can make the random walk more BSF-like or DFS-like by making the random walk stay in the local neighbourhood or explore further away.

After sampling the subgraphs node2vec learns feature representa-

(52)

tions which can later be clustered to learn overall features in a network. node2vec treats learning these feature representations of the nodes in the graph as a maximum likelihood optimization problem[5]

The feature learning itself is modified from the Skip-gram[14] architecture which was originally designed for natural language processing.

The Skip-gram architecture functions by scanning the words of a doc- ument and attempts to embed every word so as to maximise the likelihood of the existing neighbouring words. If there is a high likelihood of a given neighbourhood then that neighbourhood can essentially be predicted with a corresponding confidence. To apply this approach to graphs you need to redefine what a neighbourhood is, in a similar fash- ion to when modifying convolutional networks for graphs, the way to do this is the previously explained random walk. The objective function for maximizing the log probability of observing a neighbourhood N_s(u) of the nodeu, dependant on the feature representationf is

maxf

X

u∈V

logP r(Ns(u)|f(u)) (4.4)

which is simplified to max

f

X

u∈V

h−logZ_u+ X

ni∈N_s(u)

f(n_i)·f(u)i

(4.5)

whereV is the set of all vertices, and

Zu =X

v∈V

exp(f(u)·f(v)) (4.6)

(53)

The equation 4.5 is then optimised using gradient descent, briefly covered in section 2.2, on the features.[5] In the end we are left with features that maximise the likelihood of the existing neighbourhoods, meaning that we can predict what kind of neighbourhoods each node belongs to.

4.3 DGCNN

DGCNN is the default neural network used by SEAL, it consists of three main stages:

• Graph convolution layers

• SortPooling layer

• Traditional convolutional and dense layers

Figure 4.2: The structure of DGCNN. Image from Muhan Zhang et al.

An End-to-End Deep Learning Architecture for Graph Classification, Page 4.

The graph convolution layers is fed an adjacency matrixA and a node information matrixX and defines a consistent ordering. TheSortPool- ing layer takes this consistent ordering and sorts the vertices. The tra-

(54)







0 1 0 0 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1 1 1 0 1 0







→







1 1 0 0 1 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 1 1 0 1 1







Table 4.1: By adding self-loops to an adjacency matrix the diagonal of the matrix is set to ones.

ditional layers can then perform normal convolutional machine learning on this ordered tensor.[23]

Graph convolution layers are as previously covered in section 2.3 a way to translate between the unordered organic world of graphs into the structured and consistent world of traditional convolutional neural networks. DGCNN does this with graph convolution layers with the form:

Z =f( ˜D⁻¹AXW˜ ) (4.7) A is the adjacency matrix and A˜ = A+I whereI are self-loops added to the matrix. D˜ is thediagonal degree matrixofA˜whereD˜_ii=P

jA˜_ij. X is the node information matrix, where X ∈ IR^n×c. W is a matrix W ∈ IR^c×c⁰ of trainable parameters for the graph convolution. f is a nonlinear activation function. This produces an output activation matrix Z ∈ IR^n×c⁰. Essentially this formula first performs a feature trans- formation on the node information matrix X, mapping the cchannels toc⁰ channels in the next layer by multiplying it with filter weightsW, which are shared for all vertices. A˜is then multiplied with the result, thus propagating node information to the neighbouring nodes, and it-

(55)

self. Intuitively this works because when multiplying two matricesx, y we calculate the dot product of the rows in matrix x with the columns of matrix y. In our case this means that we multiply a single nodes’

node information with the node information of all its neighbours. The rows of the resulting matrix i are then normalized by multiplication withD˜_ii⁻¹. Lastly a nonlinear activation function is applied.[23]

By using more than one graph convolutional layer DGCNN can extract multi-scale features, meaning that features are learned in different «resolutions» and combined in the end to create a more robust model. When using multiple layers we get:

Z^t+1 =f( ˜D⁻¹AZ˜ ^tW^t) (4.8) WhereZ⁰ =X like in the previous equation.Z^t∈IR^n×c^t however, is the output of the t^th layer. W^t ∈ IR^c^t^×c^t+1 maps thec_tchannels of the input Z^t to ct+1 channels in the output. ct is the number of output channels of layert.[23]

The outputs of all the layersZ^1:h := [Z¹, ..., Z^h], wherehis the number of layers, are then concatenated together by a final layer. This final layer creates a concatenated output given by:

Z^1:h ∈IR^n×^P^h¹^c^t (4.9) After this process we have a situation where each row in Z^1:h en- codes the multi-scale structural and node information of a vertex. These feature descriptors can be thought of similarly to the colors in WLNM,

(56)

covered in section 4.1, in fact we will continue to use the term colors when talking about the outputs of the convolutional layers. See figure 4.2 for an illustration. Each column in Z^1:h contains a feature channel similarly to the feature maps found in conventional convolutional layers.

The SortPooling layer is the next step in the process. This layer sorts the nodes according to their structural roles in the graph, before passing the sorted nodes on to the traditional convolutional and fully connected layers. luckily we have already arranged the nodes in a consistent order, the «colors». The SortPooling layer first sortsZ^1:hby rows according to the colors present inZ^h, as the colors found in the last layers output are the most recent. When sorting, thelastfeature channel is considered the most significant, the second to last channel is only used if there is a tie in the sorting, et cetera. It is similar to sorting numbers, where the last feature channel can be considered the most significant digit.

After sorting the output size must be unified in order to be able to pass it on to the traditional convolutional layers. A user specified integerk is chosen as the output size. The unification is done by either deleting the lastn−k rows or addingk−nrows set to zero. The output then becomes:

Z^sp ∈IR^k×^P^h¹^c^t (4.10) The convolutional layers are one dimensional, meaning that we have to reshape the currently two dimensional tensorz^spinto ak(Ph

1c_t)×

(57)

1vector.

The first 1-D convolutional layer has a filter size and step ofPh 1c_t. The filter size is the size of the local receptive field, whereas the step is the amount the local receptive field glides across the data for each step.

When setting the filter size and the step to the same value we get no overlap between the local receptive fields. In practice this means that each field is applied to the feature descriptor of a certain vertex.

After this first layer there are several max pooling layers and more 1-D convolutional layers. Finally there is a fully connected layer and a softmax layer. A softmax layer simply outputs decimal probabilities that add up to one. Stochastic gradient descent is used to train the network together with backpropagation. An interesting feature of the SortPooling layer is that it has the ability to pass loss gradients back to the previous layers, thus making it possible to train the graph convolution layers as well.

The goal of DGCNN is to predict structure features for unseen graphs.

These heuristics is what SEAL is after. When we can get heuristics for unseen graphs we can also use them to predict the likelihood of an unseen link existing between nodes.[21]

4.4 SEAL

SEAL, or learning from Subgraphs, Embeddings and Attributes, is a framework implementing link prediction using GNN. Not unlike WLNM, SEAL first extracts subgraphs around the links which are being tar-

(58)

geted, then a node information matrix is constructed, and in the end a GNN is used to learn some heuristics, specifically graph structure features.

To predict links between nodes we need some heuristic which tells us how high the likelihood of a link existing is. Traditional approaches to link prediction uses preset heuristics to make predictions. The glar- ing issue with this approach is that graphs are different. A heuristic which produces ground breaking results in one graph might produce unusable results in another. The idea then, behind using GNNs to predict links, is as mentioned in section 4.1 to have a neural network learn what kind of heuristic fits the current graph best. How much information do we need to learn the heuristics however? There are several types of heuristics, first order heuristics, like common neighbours, work well in some graphs, and you only need information about a nodes immediate 1-hop neighbours to utilize it. Second order heuristics like Adamic-Adar, which counts the neighbours of the target nodes common neighbours, need information about the 2-hop neighbours. Then there are high order heuristics like Katz and rooted PageRank which require information about the whole graph. The higher order heuristics often show better performance [12], so if possible we want our GNN to learn them. At first glance however it seems like you can only learn first and second order heuristics using local subgraph extraction, but as it turns out you can approximate even high order heuristics with only a few

Finding new links in ACT data provided by Mnemonic using Graph Neural Networks

Finding new links in ACT data provided by Mnemonic using Graph

Neural Networks

Martin K. J. Heggem

Thesis submitted for the degree of

Master in Programming and system architecture 60 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

Finding new links in ACT data provided by Mnemonic using

Graph Neural Networks

Martin K. J. Heggem

Abstract

Contents

I Introduction 11

II The project 57

III Conclusion 85

List of Figures

List of Tables

Preface

Part I

Introduction

Chapter 1 Overview

Chapter 2 Background

2.1 Introduction to machine learning

2.2 Neural networks and deep learning

2.3 Graph neural networks (GNN)

2.4 Cyber threat intelligence

Chapter 3

The ACT platform

3.1 Background

3.2 Objects

3.3 Facts

3.4 API

3.5 Data description

Chapter 4

State of the art solutions

4.1 Link prediction

4.2 node2vec

4.3 DGCNN

4.4 SEAL