
Passive Fingerprinting of Known Operating Systems using Deep Learning Techniques

Martin Veshovda Løland

Thesis submitted for the degree of Master in Informatics: Network and System Administration (30 credits)

Department of Informatics


Abstract

Traditional fingerprinting is the act of collecting data extracted from a remote host’s TCP/IP headers. The information gathered can then be used to derive the remote Operating System (OS) by comparing it to a set of signatures. A signature consists of several values which are often unique to a given operating system. Traditionally, the OS would be set to the first match found in the database. As more and more operating systems are created, signatures become less accurate, and machine learning techniques have been introduced to deal with the flaws of classical fingerprinting. In this work, we focus on implementing and tuning a deep learning algorithm to perform passive fingerprinting on a dataset of over 500 variations of operating systems. Our main contribution lies in investigating the effectiveness of a feature never before tested in fingerprinting, namely the TCP congestion control algorithm.

Acknowledgement

I would like to thank my supervisor Desta Haileselassie Hagos for frequently meeting me and leading me in the right direction. I want to thank Anis Yazidi for critical reviews, and Paal Engelstad for great ideas and for showing interest in my work. Lastly, I want to thank my fiancée for calming me down in stressful moments, encouraging me, and always believing in me.

Contents

List of Figures
List of Tables
List of Acronyms

1 Introduction
1.1 Motivation
1.2 Problem Statement
1.2.1 Problem basis
1.2.2 The Problem
1.2.3 Research Question
1.3 Thesis Structure

2 Background
2.1 Active and Passive OS Fingerprinting
2.1.1 Active fingerprinting
2.1.2 Passive fingerprinting
2.1.3 Areas of application
2.2 Protocols
2.2.1 TCP/IP
2.2.2 ICMP
2.2.3 HTTP
2.3 Fingerprinting tools
2.3.1 Nmap
2.3.2 Xprobe2
2.3.3 SinFP
2.3.4 p0f
2.3.5 NetworkMiner
2.3.6 Ettercap
2.4 Machine learning
2.4.1 K-Nearest Neighbors
2.4.2 Decision Tree
2.4.3 Random Forest
2.4.4 Support Vector Machines
2.4.5 Deep Learning
2.5 TCP Congestion Control
2.5.1 Passive detection of TCP Congestion Control Mechanisms
2.6 Python
2.6.1 Keras and Tensorflow
2.6.2 SciKit Learn
2.7 Other
2.7.1 iPerf3
2.7.2 TCPdump

3 Related Work
3.1 Machine Learning in Passive Fingerprinting
3.2 Feature selection
3.3 Multi-packet fingerprinting

4 Approach
4.1 Introduction
4.2 Data Collection
4.2.1 Data Generation
4.2.2 Available Dataset
4.2.3 Final Datasource
4.3 Implementation
4.3.1 Training and Tuning the Deep Learning model
4.3.2 Tuning the other classifiers

5 Results
5.1 Comparing the algorithms
5.2 Evaluating the TCP congestion control algorithm
5.3 Summary

6 Discussion and Future Work
6.1 Introduction
6.2 The Dataset
6.3 The Features
6.4 The Classifier
6.5 TCP Congestion Control Mechanism
6.6 Future Work

7 Conclusion

Appendices
A Scripts
A.1 Data Preparation
A.1.1 Full label reduction
A.1.2 Splitting Data - Train/Test
A.1.3 Distribute labels
A.2.4 Plot Confusion Matrix
A.2.5 Support Vector Machine
A.2.6 K-Nearest Neighbors
A.2.7 Random Forest


List of Figures

1 The layers of the OSI-model.
2 The fields and their positions in the TCP header.
3 Example of an HTTP User-Agent string.
4 Example of Classification using the KNN-algorithm.
5 Simplified example of Classification using a Decision Tree.
6 Simplified example of Classification using a Support Vector Machine.
7 The convolution and pooling of a CNN.
8 A Long Short Term Memory cell.
9 Configuration of the interfaces file in Ubuntu.
10 Setup of Virtual machines for data generation.
11 Accuracies of all configurations throughout the epochs.
12 Accuracies of top ten configurations throughout the epochs.
13 Time and accuracies of the top ten configurations.
14 Confusion matrix for the DL classifier.
15 Confusion matrix for the SVM classifier.
16 Confusion matrix for the KNN classifier.
17 Confusion matrix for the RF classifier.


List of Tables

1 Example of multiple OS with the same fingerprint.
2 Advantages and disadvantages of the KNN algorithm.
3 Advantages and disadvantages of the Decision Tree algorithm.
4 Advantages and disadvantages of the Random Forest algorithm.
5 Advantages and disadvantages of the Support Vector Machine algorithm.
6 Commands used by different operating systems to find the TCP congestion control algorithm currently in use.
7 Random examples from the fingerprint database.
8 Statistics of the operating systems with the highest market share. The percentages of the versions are the distribution within the OS-family.
9 Accuracies of the algorithms for both datasets, with the TCP congestion control as a feature.
10 Accuracies of the algorithms with and without using the TCP Congestion Control mechanism as a feature.
11 Classification report for the DL classifier with and without the TCP Congestion Control feature.
12 Classification report for the SVM classifier with and without the TCP Congestion Control feature.
13 Classification report for the KNN classifier with and without the TCP Congestion Control feature.
14 Classification report for the RF classifier with and without the TCP Congestion Control feature.


List of Acronyms

CNN Convolutional Neural Network
DL Deep Learning
HTTP HyperText Transfer Protocol
ICMP Internet Control Message Protocol
IDS Intrusion Detection System
IP Internet Protocol
KNN K-Nearest Neighbor
LSTM Long Short Term Memory
ML Machine Learning
MLP Multi Layered Perceptron
OS Operating System
OSes Operating Systems
RF Random Forest
RNN Recurrent Neural Network
SMTP Simple Mail Transfer Protocol


Chapter 1

1 Introduction

1.1 Motivation

All networks are faced with numerous threats such as attacks from external sources, compromised internal nodes, and unauthorized end points [1]. Knowledge of a network’s hosts’ Operating Systems (OSes) is an important piece of information for multiple activities associated with computer security. Malicious users may tailor exploits to attack the network and increase their chance of success by avoiding an Intrusion Detection System (IDS). On the other hand, an IDS can use fingerprints to determine the OS of the hosts it is protecting and develop more precise detections.

This can lead to fewer irrelevant alerts by discarding attacks meant for OSes not present in the network [2]. System administrators might benefit from OS detection by finding changes in the network, such as the replacement of a machine, or by detecting unauthorized hosts, as some OSes contain severe vulnerabilities and count as a security breach [3].

Fingerprinting is the process of deducing a remote host’s OS by looking at specific values in the host’s transmitted network packets. Fingerprinting is divided into two categories: active and passive. Active fingerprinting consists of sending defined probes to the remote host and interpreting the responses. Active fingerprinting is not always an option, especially for attackers who want to stay as anonymous as possible, or for an IDS that is not permitted to send packets on the network it is protecting. In these cases, passive fingerprinting is a better solution, as it can identify OSes by solely listening to the network traffic. Numerous tools have been developed for this task, but most of them have not maintained their fingerprint databases, which leaves many widely used OSes today, such as Windows 10, undetected.

1.2 Problem Statement

1.2.1 Problem basis

Traditional fingerprinting tools require predefined databases of signatures to infer an OS. These signatures are often created manually through tedious work. This has led to numerous databases becoming severely outdated (e.g., Ettercap’s was last updated on the 26th of October 2012), making the tools flawed as they have no information on OSes developed after their last update.

A lot of research has been put into Machine Learning (ML) as an application for passive fingerprinting. With the introduction of ML to passive fingerprinting, the question of which features are the most beneficial for distinguishing OSes has been widely debated.

1.2.2 The Problem

With the continuous development of OSes combined with rarely updated fingerprinting databases, traditional fingerprinting tools have become highly inaccurate. New versions of an OS are often built upon their predecessors, leading their Transmission Control Protocol (TCP)/Internet Protocol (IP) header values to resemble their ancestors’. In other words, different OS versions, and sometimes even different OS families, tend to have identical fingerprints. When packets with these ambiguous values are examined by traditional fingerprinting tools, they are often discarded or matched to the first hit in the database, occasionally leading to erroneous classification.

[Table 1: Example of multiple OSes with the same fingerprint. Columns: OS, Packet size, Window size, TTL.]

ML techniques have shown great results in the field of passive fingerprinting. Within the field, one of the most discussed subjects is feature engineering. A lot of research has shown that many variations and combinations of features give acceptable results. Choosing the correct features can have an impact on both the runtime and the performance measures of the algorithm.

1.2.3 Research Question

In [4], [5] an approach for passively detecting the TCP congestion control algorithm was suggested. With this method in mind, we want to investigate a Deep Learning (DL) approach utilizing the TCP congestion control algorithm as one of the features for passive fingerprinting, and compare its performance to other ML techniques such as the Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Random Forest (RF).

1.3 Thesis Structure

The report is divided into the following chapters:

• Introduction: The motivation behind the thesis, the problem statement, and the structure.

• Background: A brief introduction to the tools, methods and implementations used in this thesis work.

• Related Work: A summary of other related research works.

• Approach: A structured overview of implementation strategies and code that contribute to solving the problem statement.

• Results: A discussion of the expectations for the work and an assessment of the achievements.

• Discussion: A brief review of the implementation, approach, challenges and future work.

• Conclusion: A summary of the work that has been done and contributions made to this thesis work.


Chapter 2

2 Background

This chapter contains a basic introduction to fingerprinting, as well as tools and methods used to perform OS detection. In addition, a short summary of the work related to passively inferring the TCP congestion control algorithm will be presented, together with technologies used to implement a DL model.

2.1 Active and Passive OS Fingerprinting

The objective of fingerprinting is to infer the OS of unknown remote hosts. There are mainly two types of fingerprinting methods: active and passive.

2.1.1 Active fingerprinting

Active fingerprinting is the process of sending TCP or Internet Control Message Protocol (ICMP) probes to a remote host, followed by examining the returned replies. Different OSes react with different responses to packets explicitly crafted to force unique return values by editing certain bits in the packet header [6]. Often, multiple crafted packets with varying active flags in the header are sent. The responses are then compared to known OS signatures. Nmap [7] is one of the most used and well-known active fingerprinting tools.


2.1.2 Passive fingerprinting

Unlike active fingerprinting, passive fingerprinting doesn’t take part in the communication and detects OSes by sniffing and analyzing a network traffic stream.

2.1.3 Areas of application

Active fingerprinting tools have the ability to request information that is highly useful for determining the OS of a remote host, often leading to high accuracy. However, the process might disrupt a network’s traffic when issuing requests, and the probes can be blocked with firewall rules or an IDS, which greatly reduces its performance.

Passive fingerprinting tools require access to an active traffic stream and can suffer from the fact that a system might never send any packets that are beneficial for OS detection. Despite these disadvantages, the benefit of passive fingerprinting is that it can’t be blocked by firewalls or an IDS.


2.2 Protocols

A protocol is a set of rules used for communication. When fingerprinting, there are multiple protocols available from which it is possible to infer the remote host’s OS.

2.2.1 TCP/IP

TCP [8] runs on top of the IP protocol in the transport layer of the OSI-model (see Figure 1). TCP/IP fingerprinting is the most established form of fingerprinting and is made possible by the fact that different OSes have their own unique ways of setting the bits in the TCP header. Which of the bits in the TCP header are the most distinguishing factors is a widely discussed subject: [1] points toward the Maximum Segment Size (MSS) and Window Scale (WSO) options, while [9] uses more options such as MSS, WSO, the Selective Acknowledgements option (SOK), the NOP option, and the Timestamp option (TS). Though not directly in the TCP header, the IP Time to Live (TTL) field seems to be the most agreed upon feature for distinguishing an OS.
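As an illustration, these features can be read directly from captured SYN packets. The following is a minimal sketch using Scapy; it is an illustration of the idea rather than one of the thesis scripts, and it assumes Scapy is installed and the capture runs with sufficient privileges.

# Passively capture TCP SYN packets and print the header fields commonly
# used as fingerprinting features (illustrative sketch, not a thesis script).
from scapy.all import IP, TCP, sniff

def print_features(pkt):
    if pkt.haslayer(IP) and pkt.haslayer(TCP) and "S" in pkt[TCP].flags:
        print({
            "ttl": pkt[IP].ttl,              # IP Time to Live
            "packet_size": len(pkt),         # size of the SYN packet
            "window_size": pkt[TCP].window,  # advertised TCP window
            "options": pkt[TCP].options,     # MSS, WSO, SACK, NOP, TS, ...
        })

# Listen only; no probes are ever sent (passive fingerprinting).
sniff(filter="tcp[tcpflags] & tcp-syn != 0", prn=print_features, count=10)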


Figure 1: The layers of the OSI-model.

Figure 2: The fields and their positions in the TCP header.


2.2.2 ICMP

The ICMP protocol is widely used in active fingerprinting. Tools such as Nmap [7] send numerous malformed ICMP packets and use the remote system’s responses to establish the OS. Because this protocol is not widely used for passive fingerprinting, further details will not be discussed.

2.2.3 HTTP

The HyperText Transfer Protocol (HTTP) runs on the application layer of the OSI-model. The protocol contains a field called the User-Agent, which consists of a string often specifying information about the system. The User-Agent string is used by web servers to adapt their content to the connecting endpoint, improving the experience of the website. The User-Agent string is created by an application (often the web browser) and is therefore not directly linked to the OS. A challenge with HTTP fingerprinting comes from the fact that the protocol works on the application layer, meaning the field can easily be edited, omitted, or hidden by encryption, as most traffic today is encrypted.

Figure 3: Example of an HTTP User-Agent string.


2.3 Fingerprinting tools

2.3.1 Nmap

Nmap [7] is a powerful tool used for network reconnaissance. It consists of numerous features and gained support for TCP/IP fingerprinting in October 1998. To find a remote host’s OS, Nmap sends up to 16 TCP, User Datagram Protocol (UDP) and ICMP probes, all specially crafted to trigger responses containing the information necessary to infer a host’s OS. By analyzing the TCP/IP values of the returned packets, Nmap generates a fingerprint. If there is a match between the generated fingerprint and Nmap’s database, the detection was successful.

2.3.2 Xprobe2

Xprobe [10] is written and maintained by Fyodor Yarochkin and Ofir Arkin. Xprobe sends small UDP datagrams to known closed UDP ports, which results in an ICMP error message. If an "ICMP Port Unreachable" is received, the returned packet is analyzed. Xprobe sends out a maximum of four packets, then uses a decision tree to decide the remote OS based on the examined replies. Due to the low number of tests required, Xprobe is very fast and harder to detect. In addition, Xprobe doesn’t use malformed packets, so its probes may have a higher chance of passing through a firewall.

2.3.3 SinFP

SinFP [11] was created by Patrice "GomoR" Auffret; the newest version, SinFP3, was released in 2012. It is written in Perl, and its goal is to tackle the weaknesses of Nmap fingerprinting. SinFP uses three TCP packets, all targeting the same open TCP port. SinFP uses an algorithm to find the best match for a fingerprint, which allows the fingerprint database to hold only a single fingerprint per OS. SinFP can be used for both active and passive fingerprinting and keeps a separate database for each.

2.3.4 p0f

Michal Zalewski created the first version of p0f in 2000, and completely revamped the tool in 2012 for its third and final version [12]. p0f is a well-established fingerprinting tool which works solely with passive traffic. The tool can be executed in the foreground or run as a daemon. p0f uses features collected from the TCP/IP stack, such as the ordering of the TCP options, the relationship between maximum segment size and window size, the progression of TCP timestamps, and other implementation quirks such as non-zero values in "must be zero" fields.

The HTTP and Simple Mail Transfer Protocol (SMTP) protocols are also supported in the latest version. Rather than using descriptive fields such as the User-Agent, p0f analyzes the packets based on the ordering and syntax of the HTTP headers or SMTP commands.

2.3.5 NetworkMiner

NetworkMiner [13] is an open source Network Forensic Analysis Tool (NFAT) written in C#. NetworkMiner can sniff the packet streams of a network or parse PCAP files offline. NetworkMiner is host centric, which means it groups information by hosts rather than presenting a list of packets.

When fingerprinting, NetworkMiner uses the signature databases of the tools p0f and Ettercap to match any fingerprint generated by a TCP SYN packet and the corresponding SYNACK packet. In addition, NetworkMiner can perform OS detection based on DHCP packets and the Satori fingerprint database.

2.3.6 Ettercap

Ettercap [14] is an open-source tool used for fingerprinting and man-in-the-middle attacks. It is one of the tools that is still being maintained. The OS detection module relies on packets requesting a connection, i.e. SYN and SYNACK packets. From these packets a series of features are analyzed to infer the remote OS: Window Size, Maximum Segment Size, Time To Live, Window Scale Option, Selective ACK option, NOP option, Don’t Fragment field, and the Timestamp option.

Today the Ettercap database contains over 1700 unique fingerprints. Since Ettercap uses two types of packets, its database contains multiple fingerprints for some OSes, as they have different implementations of the TCP header for SYN and SYNACK packets. SYNACK packets are often influenced by SYN packets and can therefore be less reliable. To deal with this issue, Ettercap marks fingerprints generated from SYNACK packets as temporary and waits for a SYN packet to confirm the assumption.

2.4 Machine learning

When listening to traffic, there are a lot of packets whose generated fingerprints don’t match any fingerprint in a given database. When this occurs, the OS is often placed in a group called "unknown" or "other". ML methods have shown promising results where rule-based OS detection methods fail [15] [9]. By making predictions based on previously generated fingerprints, ML algorithms can make well-educated guesses about which OS an unknown fingerprint belongs to. Below, five different ML algorithms are briefly explained.

2.4.1 K-Nearest Neighbors

KNN [16] is a method for classifying objects based on the K closest training examples in the feature space. It is one of the simplest classification techniques when there is little or no prior knowledge about the data distribution. Figure 4 shows an example classifying the unknown star as a green square, based on its 5 nearest neighbors.

| Advantages | Disadvantages |
| --- | --- |
| Resilient when faced with noisy data | The value of the parameter K must be determined |
| Effective with large training datasets | Faces a series of problems with high dimensional data |
| Doesn’t require a training phase | Not clear which distance metric will produce the best result |
| Learns complex models easily | High computation cost |

Table 2: Advantages and disadvantages of the KNN algorithm.


Figure 4: Example of Classification using the KNN-algorithm.
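To make the voting idea concrete, the sketch below implements the classification rule from Figure 4 directly in NumPy. It is a toy illustration only; the actual experiments use SciKit Learn's KNN (see Section 4.3.2).

import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    # Classify x by majority vote among its k nearest training points.
    distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every sample
    nearest = np.argsort(distances)[:k]              # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # the majority label wins

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]], dtype=float)
y = np.array(["square", "square", "square", "circle", "circle", "circle"])
print(knn_predict(X, y, np.array([2.0, 2.0])))       # -> "square" (3 of 5 neighbors)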

2.4.2 Decision Tree

A decision tree [17] is a form of a supervised ML algorithm. A decision tree can be seen as a flowchart, where each internal node describes an attribute test, each edge specifies the outcome of the test and the leaf nodes holds a class label. When creating a decision tree there are four areas which needs to be considered: Which features to use, conditions for splitting (defining the tests), knowing when to stop, and pruning. Pruning is a process used to reduce overfitting.


| Advantages | Disadvantages |
| --- | --- |
| Decision trees are simple structures which are easy to interpret and visualize | Inexperienced users tend to create overly complex trees, causing overfitting |
| Can analyze numerical and categorical data | Small variations in the data can result in a completely different path, causing the tree to become unstable |
| The data preparation requires low human effort | Inexperienced users can create biased trees from unbalanced datasets |
| Performance is not affected by non-linear relationships between features | |

Table 3: Advantages and disadvantages of the Decision Tree algorithm.

Figure 5: Simplified example of Classification using a Decision Tree.


2.4.3 Random Forest

The random forest [18] algorithm uses a number of decision trees. More trees can lead to more robust predictions and higher accuracy. To classify a new object based on a given set of features, each tree gives a prediction. The prediction given by the most trees is then chosen as the final prediction.

| Advantages | Disadvantages |
| --- | --- |
| Random forest handles missing values and maintains accuracy for missing data | It can be hard to understand and control what the model does |
| With a high number of trees, the model will not overfit | It can overfit if the data is extremely noisy |
| Handles large datasets with many dimensions | |

Table 4: Advantages and disadvantages of the Random Forest algorithm.


2.4.4 Support Vector Machines

SVMs [19] are suited for extreme cases. An SVM takes the data points which are closest to the opposing class and creates a vector with minimal margin to the points in either of the classes. This leads to vectors (support vectors) on each side of a hyperplane, which becomes the best line to differentiate the classes. When classes are not linearly separable, the data needs to be transformed before the SVM algorithm is applied.

| Advantages | Disadvantages |
| --- | --- |
| Works well in high dimensional spaces | SVMs perform poorly when the number of features is greater than the number of samples |
| Highly customizable | SVMs do not provide probability estimates |

Table 5: Advantages and disadvantages of the Support Vector Machine algorithm.

Figure 6: Simplified example of Classification using a Support Vector Machine.


2.4.5 Deep Learning

DL [20] refers to neural networks with one or more hidden layers. Within DL, architectures such as the Convolutional Neural Network (CNN) [21], the Recurrent Neural Network (RNN) [21], and the Multi Layered Perceptron (MLP) [22] are used for different purposes. They all share a layered structure starting with the input layer, followed by the hidden layers, and ending with the output layer. For our implementation we ended up using the MLP architecture. Below, the different architectures are briefly explained.

Convolutional Neural Networks: CNNs are currently the state of the art within image recognition. There are two key operations in a CNN, namely convolution and pooling. Convolution is the process of creating feature maps of the original data. Pooling is the act of selecting a region, finding a value in that region, often the maximum value, and using that as the value of the whole region. These processes happen in the hidden layers, which are then connected to a fully connected layer followed by the output.

Figure 7: The convolution and pooling of a CNN.


Recurrent Neural Networks: RNNs are often used for analyzing and predicting sequential data. In an RNN, a cell takes input data, along with the previous output of the same cell, to create new output. This means that previous data has an impact on the result produced for the current data. To deal with questions such as how to weight the different parts of the input data, how to handle the relationship between the new data and the recurring state, and what should be done in the later cells, Long Short Term Memory (LSTM) cells were introduced. An example of an LSTM cell can be seen in Figure 8; it works by having various functions determine what to forget from previous data, what to add from the new data, what to output to new cells, and what to pass on.

Figure 8: A Long Short Term Memory cell.


Multilayered Perceptron: An MLP is a form of artificial neural network. An MLP contains at least an input layer, one or more hidden layers, and an output layer. Each layer consists of nodes. For the input layer it is common to have at least one node per feature in the feature vector. The other nodes are called neurons and utilize a non-linear activation function. In an MLP, each node in one layer is connected to every node in the next layer (a fully connected network), with a weight on each connection. These weights are adjusted based on the error of the output compared to the expected result; this is called backpropagation. In classification problems, the last layer usually contains one node per class to give the final output, mapping the input features and the weights of the edges to the prediction.


2.5 TCP Congestion Control

TCP’s congestion control [23] is a feature which ensures that the sender doesn’t send more packets than the network communication channel can pass on. Congestion avoidance starts when either a packet loss occurs or there is a delay in the response time. TCP uses the concept of additive increase, multiplicative decrease, as well as a mechanism called "slow start". Slow start increases the TCP congestion window by one per acknowledgement received until a loss, a delay, the receiver’s advertised window, or the slow start threshold (ssthresh) is reached. When this occurs, the congestion avoidance algorithm is activated. There are multiple well-established algorithms, but at the time of writing, CUBIC is the most used default algorithm.
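The window dynamics described above can be illustrated with a small toy simulation; the numbers below are arbitrary, and real variants such as CUBIC grow the window differently.

# Toy simulation of slow start followed by additive increase /
# multiplicative decrease (all values are illustrative).
cwnd = 1        # congestion window, in segments
ssthresh = 16   # slow start threshold

for rtt in range(1, 13):
    if rtt == 8:                 # pretend a loss is detected this round trip
        ssthresh = cwnd // 2
        cwnd = ssthresh          # multiplicative decrease
    elif cwnd < ssthresh:
        cwnd *= 2                # slow start: roughly doubles once per RTT
    else:
        cwnd += 1                # congestion avoidance: additive increase
    print(f"RTT {rtt:2d}: cwnd = {cwnd}")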

2.5.1 Passive detection of TCP Congestion Control Mechanisms

In [4], Hagos et al. present a robust and scalable machine-learning based model to infer the TCP congestion window and the underlying TCP variant from data passively collected at an intermediate node in the network. The prediction model showed good results in multiple scenario settings: emulated, realistic and combined. By using the knowledge gained through the emulated setting, they show that their model performs well in a realistic setting. The accuracies of predicting the TCP variant in these settings were 93.51%, 95%, and 91.66%. The idea is developed further in [5], where Hagos et al. implement an LSTM-based RNN model to passively predict TCP connection characteristics. In this paper they show that their LSTM architecture performs even better than their previous ML prediction model, with a 97.22% accuracy in the emulated setting, 96.66% in the realistic setting, and 94.44% in the combined scenario.

2.6 Python

Python is a high-level, object-oriented programming language. The scripts used throughout this thesis work have all been written in Python. Its syntax is simple, with a focus on readability, which reduces the cost of code maintenance. Python can be built and executed on all major OSes and is free to distribute, which means it is accessible to everyone. With its large ecosystem of frameworks, libraries, modules, packages and tools, it encourages code reuse and modularity. Python is extremely versatile and can be applied in numerous fields, such as web development, ML, data analysis, visualization, and scripting, just to name a few.

2.6.1 Keras and Tensorflow

Keras is a high-level framework for DL, which simplifies the process of creating a DL model. It works as an interface that wraps multiple frameworks, such as Theano, Tensorflow and CNTK. Keras allows a DL implementation to be written in fewer lines of code with an added layer of abstraction. Keras with Tensorflow have been used to create, train and tune our DL model.

2.6.2 SciKit Learn

SciKit Learn is a Python package containing numerous predefined ML classes as well as tools to process data. SciKit Learn makes it easy to preprocess data (e.g. scaling, normalization, transformation) before feeding it into an algorithm for training. SciKit Learn also contains tools to analyze the performance of an ML model, such as confusion matrices and performance reports. These tools help give insight into what is happening during training.

2.7 Other

In this section, other tools and technologies that have been used are listed and briefly explained.

2.7.1 iPerf3

iPerf3 [24] is an open-source network bandwidth measurement tool with support for tuning parameters related to timing, buffers and protocols. It is lightweight and easy to deploy as both a client and a server for network throughput testing. iPerf3 was used to generate network traffic for our self-created dataset.

2.7.2 TCPdump

TCPdump [25] is an open-source command-line tool which comes with most Linux systems, but is also available for other OSes. TCPdump requires libpcap [26], a system-independent packet capture library. TCPdump is used to capture and analyze network packets. In this work, TCPdump was used to capture network packets passively at a router node during our dataset generation.

Chapter 3

3 Related Work

In this chapter, work related to passive fingerprinting is discussed.

3.1 Machine Learning in Passive Fingerprinting

There has been a lot of research in the field of passive fingerprinting applying ML to find patterns and predict OSes through well-educated guesses.

In 2003, Lippmann et al. investigated the use of ML to detect remote OSes by utilizing Ettercap’s signature database [9]. A test database was created for the study by collecting SYN packets on a mail and an SSH server over a two-day period. The baseline experiment consisted of using the KNN algorithm to predict all the 75 classes found in the Ettercap database. The experiment yielded poor results, rejecting as much as 84% of the packets, while 44% of the accepted packets were erroneously classified. The performance was believed to be caused by two main issues: substitution errors due to multiple OSes with the same fingerprint, and a high rejection rate caused by numerous unique fingerprints derived from the same OS. After combining the OSes most often confused with each other, eliminating all the classes where no exemplars were identified correctly, and removing outliers, the error percentage was reduced to 9.8% with no rejected packets. Other ML algorithms were tested as well, such as Decision Tree, MLP, and SVM, with results similar to the KNN.

In 2004, Robert Beverly developed a naive Bayesian classifier performing maximum likelihood inference to detect a remote host’s OS [15]. The work was based on OS signatures from the tool p0f. p0f’s exact-match policy was unable to identify as much as 5% of the hosts in the test data. In the solution proposed by Beverly, a probabilistic likelihood for each OS was calculated per packet, and the maximum likelihood chosen as the prediction. With this method, a best guess is always available whether or not there is an exact match.

3.2 Feature selection

To optimize ML it is important to find the most distinguishing features. Aksoy, Louis and Gunes used a genetic algorithm (GA) to create a subset of features for four different ML algorithms, namely Decision Tree, Random Forest, OneR and ZeroR [27]. By reducing the number of features, the classification process was sped up while the noise was reduced, leading to higher accuracies. Their data was created using six classes within three classification families. Linux: Xubuntu 14.04 and Raspberry OS; Windows: Windows 7 and Windows 8; and Mac OS X: El Capitan and Lion. Their classification process was tiered, first detecting the OS genre, then the specific OS within the genre. The classification was performed on the protocols IP, TCP and UDP, where TCP gave the best results and UDP the worst. After testing their setup, the GA showed improvements in approximately half of the tests when taking only accuracy into consideration.


3.3 Multi-packet fingerprinting

Some research has shown great performance when fingerprinting sessions or flows rather than single packets. A flow can contain multiple sessions and is often identified by features called flow keys. The basic flow keys are the IP source address, the IP destination address, the source port, the destination port, and the protocol used in the communication.

In 2017, Anderson used an argmax classifier to identify OSes from single sessions as well as multi-session traces [1]. By collecting data over time, defensive mechanisms such as editing one’s output become less successful, as a user usually doesn’t have the ability to change all the unique network protocols being sent from their end point. The classifier was tested with TCP/IP, TLS and HTTP, where each protocol had its own subset of features. For the single-session model only the HTTP protocol showed promising results, but for the multi-session model most of the protocols had good results, except for TCP/IP. The best result came from combining all methods in the multi-session model, with an accuracy of 97.5%. The model also produced high accuracy when faced with obfuscated traffic, as long as only one of the protocols was obfuscated.

By combining several methods of OS fingerprinting, Lastovicka et al. achieved high performance using the IPFIX flow protocol [28]. Together with traditional OS fingerprinting, the HTTP User-Agent string and specific domains were utilized. The specific-domains method uses the fact that most OSes have their own domains responsible for e.g. updates. When a system connects to any of these domains, there is a high probability that the system is running some form of the domain owner’s OS. For the TCP/IP method only three features were selected: the initial SYN packet size, the window size, and the Time to Live. These values were chosen as they depend on the OS kernel and can therefore be directly linked to OS detection. When the methods were combined and applied to a network trace, the OS determined by the most methods was set to be the predicted OS.

To determine the efficiency of the OS identification, performance measures such as accuracy, precision, recall and f-score were calculated for each method. Overall, the HTTP User-Agent proved to be the best method, with a precision over 98%. This shows that it works well in any environment if the required field is present and not encrypted. The accuracies of the other methods are all over 80%, but the recall and f-score measurements show that the rate of false positives and false negatives is quite high for all of them.


Chapter 4

4 Approach

This chapter explains the work done and the experiments that have been conducted. Most of the scripts referenced in the text are presented in the appendix. Because some of the scripts differ only by very minor changes, only one version of each similar script is appended. Throughout this chapter, some phrases are used interchangeably, such as "labels" and "classes" in the context of our ML and DL approaches, and "TCP variant" with "TCP congestion control (mechanism, algorithm, method)".

4.1 Introduction

To implement a DL model, train and test it, and compare its results to other classifiers, several steps are necessary. Firstly, it is required to have some data to work with. To evaluate the effect of using the TCP congestion control algorithm, it is important that the data contains this information. Another important requirement is that the data contains a ground truth, i.e. each packet’s originating operating system. The ground truth is mandatory to be able to establish whether the model has predicted the correct outcome or not. When the data collection is completed, the DL model needs to be tuned. Changing various hyperparameters can have a huge impact on performance and is an important step in the evaluation. To be able to evaluate the TCP congestion control feature, a second model will be trained and tuned to compare the performance of the two independent models. When the DL model is satisfactorily tuned, the other classifiers need to be implemented and tuned for comparison.

4.2 Data Collection

Two options for data gathering were explored: generating the data ourselves or using an already existing dataset.

4.2.1 Data Generation

For the first option, a group of virtual machines (VMs) was set up as seen in Figure 10. The OSes implemented as VMs can be seen in Table 8, with the exception of the iOS systems, as they are not freely available as VMs. All VMs were created and run on a personal notebook, which allowed only one active VM at a time. An Ubuntu 18.04 (Bionic Beaver) node worked as a router by implementing two network adapters, one for internal traffic and one communicating with the internet. The /etc/network/interfaces file was edited as seen in Figure 9, while the line net.ipv4.ip_forward=1 was uncommented in the /etc/sysctl.conf file to enable traffic to be forwarded through the node.

Figure 9: Configuration of the interfaces file in Ubuntu.


Figure 10: Setup of Virtual machines for data generation.

To transmit data, the tool iPerf3 [24] was used. iPerf3 was run as a client on the local VMs, and a Google Cloud f1-micro VM was set up to function as an iPerf3 server. TCPdump [25] was used to capture the packet flow from the internal VM to the external node in the Google Cloud over a period of 2 minutes per OS, storing the captured packets in pcap files. Lastly, the default TCP congestion control mechanism was found by using the commands seen in Table 6.


| OS-Family | Command |
| --- | --- |
| Linux, Android, Mac | sysctl net.ipv4.tcp_congestion_control |
| Windows | netsh interface tcp show global |
| Solaris | ipadm show-prop -p cong_enabled,cong_default tcp |
| Unix systems | sysctl -a \| grep net.inet.tcp.cc.algorithm |

Table 6: Commands used by different operating systems to find the TCP congestion control algorithm currently in use.

When printing the TCP variant currently in use, a known bug in the Windows 7 and 8 systems occurred, stating that the variant is "None". The "None" value means that the system is using the built-in TCP variant, which is New Reno for Windows 7 and CTCP for Windows 8.


4.2.2 Available Dataset

The second option that was explored was the use of already collected data. Through related work, the dataset from [28] was found freely available on GitHub. The dataset contains data collected over a period of one week in May 2017, with TCP/IP signatures of 51 unique OSes, totalling 529 variations when considering major and minor versions. Each entry consists of three features, namely the SYN packet size, the window size and the time to live, which are all dependent on the OS kernel, together with the OS-family, the major version, the minor version, and a statistic called confidence. The confidence is the percentage of occurrences of a given fingerprint that belong to a specific OS. Table 7 below shows consecutive example entries from the dataset.

| ID | Packet size | Window size | TTL | OS | Major version | Minor version | Confidence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 88 | 60 | 12330 | 64 | Android | 5 | 0 | 100% |
| 89 | 60 | 12400 | 64 | Android | 4 | 3 | 73.35% |
| 89 | 60 | 12400 | 64 | Linux | N/A | N/A | 22.54% |
| 89 | 60 | 12400 | 64 | iOS | 7 | 1 | 1.60% |
| 89 | 60 | 12400 | 64 | Android | N/A | N/A | 1.33% |
| 89 | 60 | 12400 | 64 | Android | 3 | 1 | 1.12% |
| 89 | 60 | 12400 | 64 | Windows | 5 | 1 | 0.06% |
| 90 | 60 | 12510 | 64 | Android | 6 | 0 | 100.0% |
| 91 | 60 | 12600 | 64 | Linux | N/A | N/A | 99.07% |
| 91 | 60 | 12600 | 64 | Android | 7 | 0 | 0.93% |
| 92 | 60 | 13000 | 64 | Android | 4 | 3 | 100.0% |

Table 7: Random examples from the fingerprint database.


4.2.3 Final Datasource

After considering the benefits and disadvantages of the predefined and self-generated datasets, the choice was made to use the predefined set. This conclusion was based on the facts that it was less time consuming, contained more OSes, and was based on realistic data. The self-generated data consisted of traffic from only one service per OS, which leads to only one fingerprint per OS; this is not the case in most realistic settings. To deal with missing TCP variant values in the dataset, the default values for the operating systems in Table 8 were appended, while the values for the lesser known OSes were set to "Unknown".

The TCP variant values were appended with the script seen in appendix A.1.1. In addition, the three columns "OS", "Major version" and "Minor version" were merged into one label column, and each fingerprint entry was multiplied by its confidence value times ten to create a natural bias, as the distribution of OSes is not equal across all variations.
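A minimal pandas sketch of this preparation step is shown below; the file and column names are assumptions based on Table 7, and the actual script is the one in appendix A.1.1.

import pandas as pd

df = pd.read_csv("fingerprints.csv")  # hypothetical file name

# Merge the three label columns into a single class label, e.g. "Android 8 0".
df["label"] = (df["OS"].astype(str) + " " + df["Major version"].astype(str)
               + " " + df["Minor version"].astype(str)).str.strip()

# Duplicate every entry confidence * 10 times to create the natural bias.
df["copies"] = (df["Confidence"] * 10).round().astype(int).clip(lower=1)
df = df.loc[df.index.repeat(df["copies"])].drop(columns="copies").reset_index(drop=True)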

In addition, a second database file was created where all OSes were put into larger buckets spanning multiple OS versions. The label reduction led to seven classes for the reduced set, namely: Windows, Android, iOS, Mac OS, Linux, Unix, and Other. With more classes to identify, it becomes harder to distinguish all the labels. The main goal has been to identify as many classes as possible, with the label reduction as a second option in case the performance with the higher number of classes was not satisfactory.


| OS-Family | Versions |
| --- | --- |
| Android (36.5%) | 8.0 Oreo (20%), 8.1 Oreo (17.58%), 6.0 Marshmallow (16.96%), 7.0 Nougat (11.96%), 5.1 Lollipop (11.05%), 7.1 Nougat (8.57%) |
| Windows (35.99%) | Windows 10 (54.78%), Windows 7 (33.89%), Windows 8.1 (6.55%), Windows 8.0 (2.17%), Windows XP (1.97%), Windows Vista (0.56%) |
| iOS (13.99%) | iOS 12.1 (71.48%), iOS 11.4 (7.32%), iOS 12.0 (4.54%), iOS 10.3 (3.86%), iOS 9.3 (2.8%) |
| Mac OS X (6.37%) | OS X Mojave 10.14 (39.46%), OS X High Sierra 10.13 (25.79%), OS X Sierra 10.12 (12.61%), OS X El Capitan 10.11 (10.5%), OS X Yosemite 10.10 (6.74%), OS X Mavericks 10.9 (2.15%) |
| Unknown (4.78%) | |
| Linux (0.79%) | Ubuntu 16.04, Ubuntu 18.04, Ubuntu 18.10, Fedora 29, Debian 9, CentOS 7.6, openSUSE 42.3 |
| Unix-like | Solaris 11.4, FreeBSD 11.2 |

Table 8: Statistics of the operating systems with the highest market share. The percentages of the versions are the distribution within the OS-family.


4.3 Implementation

4.3.1 Training and Tuning the Deep Learning model

In the tuning process, finding the settings and hyperparameters that give the highest accuracy is the main goal. Other measures such as precision, recall and loss are important, but are not the focus of the first tuning steps. The DL algorithm was tuned by training multiple models with varying hyperparameters.

Preprocessing the data - Before the training could be done, the input data was preprocessed and prepared for the algorithm. Features were scaled to a common range across the entire feature space, making their magnitudes even, while the labels were turned into numerical values that the algorithm could interpret.

First, the data was randomized using a seed. The seed allows the randomization to be reproduced by inputting the same seed or random state. When training the model, it was important to randomize the data to make sure the model didn’t become biased by the overpopulation of a specific value contained only in a small part of the dataset, which would cause the model to overfit on that value.

# Shuffle the whole dataframe reproducibly (fixed seed) and reset the row index.
df = df.sample(frac=1, random_state=7).reset_index(drop=True)

The features were scaled with multiple scalers from the SciKit Learn preprocessing module. The scalers tested were the StandardScaler, the MinMaxScaler, and the RobustScaler. The standard scaler causes each feature to have a mean of zero and a variance of one, the min-max scaler places the features in a range between 0 and 1, and the robust scaler centers and scales the features based on percentiles.
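A short sketch of the comparison (the feature matrix X below holds made-up fingerprint values purely for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[60, 12330, 64],        # packet size, window size, TTL
              [60, 65535, 128],
              [44, 8192, 255]], dtype=float)

for scaler in (StandardScaler(),      # zero mean, unit variance per feature
               MinMaxScaler(),        # rescales each feature into [0, 1]
               RobustScaler()):       # centers/scales on median and percentiles
    print(type(scaler).__name__, scaler.fit_transform(X).round(2), sep="\n")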


After scaling the feature values, the labels were converted to numerical categorical values in the form of a One-Hot-Encoding matrix. This was done using SciKit Learn’s LabelEncoder together with Keras’ data utilities. With a One-Hot-Encoding matrix, the numerical values are presented as vectors containing a single 1 and multiple zeros, rather than as decimal, binary or hexadecimal numbers, removing any unwanted ordering. Because the class vectors are not directly comparable, cases where higher or lower numbers would be considered better or worse, causing unwanted bias, are avoided.
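A sketch of the encoding step (the label list is illustrative):

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

labels = ["Windows", "Android", "iOS", "Android"]
y_int = LabelEncoder().fit_transform(labels)  # integer class indices, e.g. [1, 0, 2, 0]
y_onehot = to_categorical(y_int)              # one row per sample, a single 1 per row
print(y_onehot)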

First Tuning of Hyperparameters - Due to the high number of different hyperparameters, the tuning was performed in two steps. For the first tuning step, the main goal was to find which combination of activation function, optimizer and loss function gave the best results on the prepared data. The hyperparameters that were tested can be seen in the list below:

Activation functions:
1. ReLU
2. Sigmoid
3. Tanh
4. Softmax
5. ELU
6. SELU

Optimizers:
1. SGD
2. Adam

In addition to the hyperparameters listed above, different numbers of hidden layers and of nodes per layer were varied as well, with the following values: 0, 1 and 2 hidden layers, and 32, 64 and 128 nodes in each layer. Combining these choices, a total of 324 models were trained. These tests were performed on the reduced-label data file. In this phase the batch size was kept at a low 25, while the number of epochs was set to 50. An additional 324 models were trained on the same dataset with the TCP variant feature removed. The results with and without the TCP variant were the same regarding which hyperparameters performed best, with a slight drop in accuracy for the dataset containing the TCP variant.

Two Keras callbacks were utilized in the first tuning process: the TensorBoard and ModelCheckpoint modules. TensorBoard allowed easy visualization of the performance at each epoch of the training, while ModelCheckpoint was used to save the model at the epoch where its accuracy was at its best.
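A sketch of how the two callbacks can be wired up (the log directory and checkpoint path are hypothetical names):

from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

callbacks = [
    TensorBoard(log_dir="logs/tanh-nadam-128x1"),   # per-epoch metrics for visualization
    ModelCheckpoint("best_model.h5",                # keep only the best epoch
                    monitor="val_accuracy", save_best_only=True),
]
# model.fit(X_train, y_train, validation_split=0.1,
#           epochs=50, batch_size=25, callbacks=callbacks)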

Figure 11 shows the performance of all 324 trained models. The x-axis shows the epoch steps, while the y-axis shows the accuracy in decimal form, for both Figure 11 and Figure 12. Figure 12 shows the performance of the top ten configurations, while Figure 13 shows those configurations together with their training times. From the first tuning it was concluded that the activators ReLU, SELU, ELU and Softmax and the optimizer SGD do not perform well with our data, and they were therefore dropped for the second tuning. It was also concluded that 128 nodes and 1 hidden layer perform best, together with the activation functions tanh and sigmoid, the loss functions mse and categorical_crossentropy, and the optimizers adam and nadam.


Figure 11: Accuracies of all configurations throughout the epochs.

Figure 12: Accuracies of top ten configurations throughout the epochs.

Figure 13: Time and accuracies of the top ten configurations.


Second Tuning of Hyperparameters - For the second tuning step, the goal was to establish whether more nodes and layers would enhance the performance of the best performing activators, optimizers and loss functions found during the first tuning. The number of epochs was increased to 150 and the batch size to 500. During the second tuning, 120 models were trained per dataset, with and without label reduction. The second tuning phase showed that models with more than 2 hidden layers often led to overfitting, making the accuracy drop extremely low. Increasing the number of nodes per hidden layer beyond 128 did not increase the accuracy, but the training time increased from a two-minute average for 128 nodes, to a four-minute average for 256 nodes, and to a 14-minute average for 512 nodes with one hidden layer.

The best performing configuration came to be: 128 nodes with 1 hidden layer, using the tanh activation function, the Nadam optimizer and the categorical crossentropy loss function.
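In Keras, this configuration corresponds to a model along the lines of the sketch below; the input dimension (four features) and the softmax output over the seven reduced classes are assumptions that follow from the dataset described earlier, not details stated explicitly in the text.

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

model = Sequential([
    Dense(128, activation="tanh", input_shape=(4,)),  # one hidden layer, 128 nodes
    Dense(7, activation="softmax"),                   # one output node per class
])
model.compile(optimizer="nadam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()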

For the dataset with no label reduction, an accuracy of 84% was reached. Since this dataset contained over 500 classes, the results seemed extremely good. After investigating the dataset, it became clear that the data was extremely skewed, with some classes containing over half of all the entries in the database. To fix the issue, two options were explored: keeping a bias, but with reduced differences, and removing all bias by creating a dataset with equally distributed fingerprints for each class.

Generalizing the Data - The data was altered to reduce the maximum number of copies per class entry (see appendix A.1.1, lines 220-230) and then remove fingerprints that occur fewer times than a third of a percent of the whole dataset (see appendix A.1.3, lines 21 and 24). The alteration left 33 classes for the non-reduced label set, while the already reduced set kept its seven classes.


For the first option of keeping a reduced bias in the dataset, the less frequent classes were copied until they reached a presence of at least half of the most frequent class. For the second option, a fully general dataset was created by distributing the classes based on a predetermined number of entries per class: 2000, 5000 and 10 000 entries. After the three levels were tested by training on each of the datasets, the dataset containing 10 000 entries per class gave the best results. When pitting the two options against each other, the difference in performance was so low, less than a percentage point for each classifier, that the most general option was chosen, with equally distributed classes.

Final Data Preparation - To be able to train and test several ML algorithms on the exact same data as our DL algorithm, the data was split into training and testing sets outside each script. This meant that the data files only had to be loaded into each script for its specific task, rather than loading a single data file and then splitting it into training and testing data. This was done mainly to remove human error, for example in the order of operations: if one script randomized the file before splitting while another randomized the data after splitting, the datasets would not be an exact match, introducing possible error sources. The splitting script can be seen in appendix A.1.2.
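A minimal sketch of such a shared split (file names and the test fraction are assumptions; the actual script is in appendix A.1.2):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("fingerprints_prepared.csv")   # hypothetical file name
train, test = train_test_split(df, test_size=0.2, random_state=7,
                               stratify=df["label"])
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)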

When all the necessary data files were properly configured, eight final models were trained for the following setups:

• Full label reduction with equally distributed classes with the TCP variant

• Full label reduction with equally distributed classes without the TCP variant

• Full label reduction with biased classes with the TCP variant

• Full label reduction with biased classes without the TCP variant


4.3.2 Tuning the other classifiers

All the other classifiers were trained and tested on the same datasets as prepared for the DL approach, except for the SVM, whose training process is highly time consuming, which led to downsizing of its datasets.

Support Vector Machine - The SVM was tuned through a process called grid search, which finds the parameters that give the best result. This was done by specifying a set of values for the parameters "C", "Gamma" and "Degree", where the Linear kernel uses only the "C" parameter, the RBF kernel uses the "C" and "Gamma" parameters, and the Poly kernel uses all three parameters. When the best parameters were found, the model with these parameters was validated on the test set. The full script can be seen in appendix A.2.5.
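A sketch of such a grid search is shown below; the value grids and the synthetic data are illustrative stand-ins, and the full script is in appendix A.2.5.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the fingerprint features and labels.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1], "degree": [2, 3, 4]},
]
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)                 # exhaustively tries every combination
print(search.best_params_, search.score(X_test, y_test))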

K-Nearest Neighbors - The KNN was tuned by testing various values of K from 5 to 100. For each model the accuracy was stored, and the model with the highest accuracy was chosen for comparison with the other ML and DL models. The full script can be seen in appendix A.2.6.
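A sketch of the sweep, reusing the synthetic split from the SVM sketch above (the full script is in appendix A.2.6):

from sklearn.neighbors import KNeighborsClassifier

best_k, best_acc = None, 0.0
for k in range(5, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_test, y_test)   # accuracy on the held-out test set
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)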

Random Forest - For the RF algorithm no hyperparameter tuning was performed. To make the experiment reproducible, a seed was used when the SciKit Learn RandomForestClassifier object was instantiated. The full script can be seen in appendix A.2.7.


Chapter 5

5 Results

This section contains reports on the performance of the DL algorithm with and without the TCP variant as a feature, as well as comparisons to the SVM, KNN and RF classifiers.

5.1 Comparing the algorithms

This section contains detailed reports on the performance of the different algorithms. The results presented here were all obtained on datasets using the TCP congestion control algorithm as a feature.

All tests were performed on two datasets: one containing 33 classes, and another where the number of labels had been reduced to the six major OS-families (Linux, Unix, Android, iOS, Mac OS, Windows) plus a seventh class called "Other" for OSes not suited to any of the other groups. Table 9 below contains the accuracies of all our trained classifiers on the two datasets.

| Classifier | 33 Classes | 7 Classes |
| --- | --- | --- |
| DL | 84.26% | 91.91% |
| SVM | 82.39% | 91.71% |
| KNN | 87.00% | 92.69% |
| RF | 87.06% | 92.73% |

Table 9: Accuracies of the algorithms for both datasets, with the TCP congestion control as a feature.

When inspecting the results further for the dataset with 33 classes, all four classifiers perform almost identically when it comes to precision and recall, with an overall good performance in the 75-95% range for both measures, except within the different versions of Android 4, 5 and 6. The confusion matrices make it clear that these classes are often mistaken for each other, while the rest of the classes are well distinguishable.

Looking into the results for the dataset with grouped labels, it is clear that the problem of misclassifying the Android versions is gone. On this data, all classes have a high precision and recall except for the iOS class, with a precision of approximately 75% for all the classifiers. All classifiers also have a drop of 10% in the recall value of the "Other" class, meaning some of its samples are falsely assigned to other classes. The confusion matrices in Figures 14, 15, 16 and 17 show that there is a slight confusion between the "Other" class and the "iOS" class.
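The per-class numbers and the confusion matrices reported here can be produced with SciKit Learn's metrics tools, along the lines of the sketch below (the label lists are illustrative; the plotting script is in appendix A.2.4):

from sklearn.metrics import classification_report, confusion_matrix

y_true = ["Android", "iOS", "Other", "iOS", "Windows"]   # ground-truth labels
y_pred = ["Android", "iOS", "iOS", "iOS", "Windows"]     # classifier output
print(classification_report(y_true, y_pred, zero_division=0))
print(confusion_matrix(y_true, y_pred))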


Figure 14: Confusion matrix for the DL classifier.

Figure 15: Confusion matrix for the SVM classifier.


5.2 Evaluating the TCP congestion control algorithm

In this section, the performance when using the TCP congestion control algorithm as a feature is compared to the results without the feature. For our two datasets containing 33 and 7 classes, the impact on the accuracy is quite noticeable. Inspecting our seven-class data, the performances of all the classifiers are still quite similar. There are two major differences: a clear drop in precision for the Mac OS class, as well as a low recall value for the iOS class. A classification report for each classifier can be seen in Tables 11, 12, 13 and 14 below.

| Classifier | With TCP Congestion Control feature | Without TCP Congestion Control feature |
| --- | --- | --- |
| DL | 91.91% | 82.16% |
| SVM | 91.71% | 81.96% |
| KNN | 92.69% | 83.95% |
| RF | 92.73% | 84.07% |

Table 10: Accuracies of the algorithms with and without using the TCP Congestion Control mechanism as a feature.


| OS | Precision (with) | Recall (with) | F1-score (with) | Precision (without) | Recall (without) | F1-score (without) |
| --- | --- | --- | --- | --- | --- | --- |
| Android | 0.96 | 0.97 | 0.97 | 0.75 | 0.92 | 0.83 |
| Linux | 0.89 | 0.92 | 0.91 | 0.90 | 0.82 | 0.86 |
| Mac OS | 0.96 | 0.92 | 0.94 | 0.62 | 0.81 | 0.71 |
| Other | 0.93 | 0.81 | 0.87 | 1.00 | 0.74 | 0.85 |
| Unix | 1.00 | 1.00 | 1.00 | 0.94 | 0.99 | 0.97 |
| Windows | 0.96 | 0.92 | 0.94 | 0.97 | 0.91 | 0.94 |
| iOS | 0.76 | 0.89 | 0.82 | 0.67 | 0.57 | 0.61 |
| Average | 0.92 | 0.92 | 0.92 | 0.84 | 0.82 | 0.82 |

Table 11: Classification report for the DL classifier with and without the TCP Congestion Control feature.

| OS | Precision (with) | Recall (with) | F1-score (with) | Precision (without) | Recall (without) | F1-score (without) |
| --- | --- | --- | --- | --- | --- | --- |
| Android | 0.96 | 0.99 | 0.97 | 0.74 | 0.88 | 0.80 |
| Linux | 0.86 | 0.95 | 0.90 | 0.85 | 0.85 | 0.85 |
| Mac OS | 0.98 | 0.89 | 0.93 | 0.65 | 0.77 | 0.71 |
| Other | 0.93 | 0.81 | 0.87 | 0.91 | 0.81 | 0.86 |
| Unix | 1.00 | 1.00 | 1.00 | 0.91 | 0.99 | 0.95 |
| Windows | 0.99 | 0.89 | 0.94 | 0.97 | 0.88 | 0.92 |
| iOS | 0.75 | 0.89 | 0.81 | 0.73 | 0.55 | 0.63 |
| Average | 0.92 | 0.92 | 0.92 | 0.83 | 0.82 | 0.82 |

Table 12: Classification report for the SVM classifier with and without the TCP Congestion Control feature.


             KNN with feature                 KNN without feature
OS           Precision  Recall  F1-score     Precision  Recall  F1-score
Android      0.99       0.98    0.99         0.87       0.91    0.89
Linux        0.93       0.94    0.93         0.91       0.90    0.91
Mac OS       0.97       0.92    0.94         0.58       0.88    0.70
Other        0.90       0.83    0.86         0.92       0.81    0.86
Unix         1.00       1.00    1.00         0.94       0.99    0.97
Windows      0.99       0.91    0.95         0.98       0.91    0.94
iOS          0.76       0.91    0.83         0.79       0.47    0.59
Average      0.93       0.93    0.93         0.86       0.84    0.84

Table 13: Classification report for the KNN classifier with and without the TCP Congestion Control feature.

             RF with feature                  RF without feature
OS           Precision  Recall  F1-score     Precision  Recall  F1-score
Android      0.99       0.98    0.99         0.87       0.91    0.89
Linux        0.92       0.95    0.93         0.91       0.90    0.91
Mac OS       0.97       0.92    0.94         0.61       0.83    0.70
Other        0.93       0.81    0.87         0.92       0.81    0.86
Unix         1.00       1.00    1.00         0.94       0.99    0.97
Windows      0.97       0.92    0.94         0.98       0.91    0.94
iOS          0.75       0.91    0.82         0.72       0.53    0.61
Average      0.93       0.93    0.93         0.85       0.84    0.84

Table 14: Classification report for the RF classifier with and without the TCP Congestion Control feature.


5.3 Summary

Our results show that our DL model performs just as well as other well-established ML techniques.

A closer look at the precision and recall of all the methods shows that they make almost identical predictions. As for the TCP congestion control mechanism, all the classifiers we tested show a noticeable increase in accuracy, precision, and recall when the feature is applied to the dataset.


Chapter 6

6 Discussion and Future Work

6.1 Introduction

Throughout this thesis work, a series of choices has been made. The more important of these are discussed in this chapter.

6.2 The Dataset

One of the main concerns with passive OS fingerprinting is finding or creating a dataset with the information needed for training and testing. Most freely available datasets do not contain any information about the OS of the client. There can be multiple reasons for this, but the main one is the protection of personal information: a lot of work is put into making such data completely anonymous. This caused us to go back and forth on whether to create our own dataset. We felt that real-world data would give a better indication of how the solution would work in general, and therefore chose to use the data collected for the paper [28]. That dataset contained a ground truth and was fairly new, which is important when working with OSes that are updated frequently. The data generated by our own setup always produced the same fingerprint for a given OS, which is not always the case in the real world.


There are some flaws regarding the use of the dataset presented in the paper that need to be addressed. Firstly, the data might not be representative of the general demographic, as it was collected from a university subnet. Secondly, there is no measure of how often each fingerprint occurred, only the percentage of a given fingerprint that belongs to a specific OS.

To get the best OS prediction for a given network, some bias can be beneficial. The distribution of OSes is not uniform, which means that a new model should be trained with a bias based on the network on which one wants to perform OS detection. That being said, a general model would perform better across a multitude of different networks.
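
If one wanted to bias a model toward a particular network, per-class training weights are one way to do it. The following is a hedged sketch; the target distribution below is entirely hypothetical and simply illustrates the mechanics.

# Hedged sketch: deriving per-class training weights from a hypothetical
# prior over OS families on the target network. All numbers are made up.
import numpy as np

# Hypothetical expected OS shares on the network to be monitored,
# keyed by class index (0=Android, 1=Linux, ..., 6=iOS).
target_share = {0: 0.05, 1: 0.30, 2: 0.10, 3: 0.05, 4: 0.05, 5: 0.40, 6: 0.05}

# Normalise so that the average weight is 1, then weight each class
# by its expected share on the target network.
mean_share = np.mean(list(target_share.values()))
class_weight = {k: v / mean_share for k, v in target_share.items()}

print(class_weight)
# -> {0: 0.35, 1: 2.1, 2: 0.7, 3: 0.35, 4: 0.35, 5: 2.8, 6: 0.35}

The resulting dictionary can then be supplied to Keras training, e.g. model.fit(X_train, y_train, class_weight=class_weight).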

6.3 The Features

There has been a lot of research done in the field of feature engineering for passive fingerprinting.

In this work we have focused on four features: SYN packet size, window size, Time to Live, and the TCP congestion control algorithm. These features all depend on the OS kernel, which makes them good candidates for OS detection. Most tools, on the other hand, use more features, creating more detailed fingerprints. The benefit of detailed fingerprints is that they are more likely to contain some feature that separates any one OS from another. There are also disadvantages to more specific fingerprints: they often lead to longer computation times for the ML algorithms trying to predict the OS. In [27], a genetic algorithm is used to create specific feature subsets for each ML algorithm tested, which in most cases led to improved accuracy as well as shorter computation times. They found that many of the commonly used features introduce noise and redundancy, meaning that smaller feature spaces can provide better performance. With this in mind, the choice of the predefined dataset, with its minimalistic approach to features, was strengthened.
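
To make the feature set concrete, the three header-derived features can be read directly from a client's first SYN packet. The following is a minimal sketch with scapy; the capture filename is hypothetical, and the congestion control variant cannot be read from a single packet, so it is not extracted here.

# Hedged sketch: extracting the three header-derived features from each
# client SYN with scapy.
from scapy.all import rdpcap, IP, TCP

for pkt in rdpcap("capture.pcap"):
    if IP in pkt and TCP in pkt and pkt[TCP].flags == "S":  # pure SYN only
        features = {
            "syn_size": len(pkt),         # total SYN packet size in bytes
            "window": pkt[TCP].window,    # advertised TCP window size
            "ttl": pkt[IP].ttl,           # IP Time to Live
        }
        print(features)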


6.4 The Classifier

We chose Keras to implement the DL model since, at the time of this writing, it is a state-of-the-art DL development framework and very easy to use. It is also widely used and well renowned. To evaluate the DL classifier, we compared its performance against three widely known ML algorithms: SVM, KNN, and RF.
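
For illustration, a Keras model for the seven-class problem can be set up along the following lines. The layer sizes and training settings below are placeholders, not the tuned configuration from Chapter 4.

# Hedged sketch: a small fully connected Keras classifier for the
# seven-class problem, with 4 input features and 7 output classes.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation="relu", input_shape=(4,)),  # 4 input features
    Dense(64, activation="relu"),
    Dense(7, activation="softmax"),                  # 7 OS classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training would then look like, for example:
# model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)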

6.5 TCP Congestion Control Mechanism

The basis of our investigation of the TCP congestion control mechanism lies in the research done on passively identifying this algorithm. As far as we know, we are the first to use the TCP congestion control mechanism as a feature for passive OS detection.

For this project we chose to use the default TCP congestion avoidance method, since we used a predefined dataset. We believe that most users do not change their congestion control mechanism, even though it is possible. There were some issues finding the default TCP congestion control mechanism of Windows 7 and 8: the command used to look up the congestion control method reports the value "none". Further research online suggests that Windows 7 and older by default use "New Reno", while Windows 8 uses Microsoft's own "CTCP" algorithm.
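
For reference, on Linux the configured algorithm is exposed through procfs and sysctl, while on Windows it must be inferred from "netsh interface tcp show global", whose "Add-On Congestion Control Provider" field reads "none" unless an add-on algorithm has been explicitly enabled, which is consistent with the "none" value described above. A minimal sketch of the Linux-side lookup:

# Hedged sketch: looking up the running TCP congestion control algorithm
# on Linux. The procfs path below is standard on Linux kernels; Windows
# has no direct equivalent readable this way.
def linux_congestion_control():
    with open("/proc/sys/net/ipv4/tcp_congestion_control") as f:
        return f.read().strip()  # e.g. "cubic" on most modern distributions

print(linux_congestion_control())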


6.6 Future Work

For future work there are several experiments we would find interesting to perform. First of all, we would like to train our DL model on a completely realistic dataset, rather than transforming a realistic fingerprint database into a dataset. This would give a true performance measure on a naturally biased dataset.

In addition, we would want to set up a capture mechanism that can collect the TCP/IP header information, the TCP congestion control mechanism, and the OS in a safe and non-intrusive way.

By collecting the actual TCP congestion control mechanism rather than assuming the default, we would expect to see different results.
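
A rough sketch of the capture side with scapy is shown below; collecting the ground-truth OS and the actual congestion control algorithm would still require cooperation from the endpoint, which this sketch does not cover.

# Hedged sketch: live capture of the same header features used in this
# thesis. Requires root privileges.
from scapy.all import sniff, IP, TCP

def handle(pkt):
    print(pkt[IP].src, len(pkt), pkt[TCP].window, pkt[IP].ttl)

# BPF filter matching packets whose TCP flags are exactly SYN.
sniff(filter="tcp[tcpflags] == tcp-syn", prn=handle, store=False)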


Chapter 7

7 Conclusion

In this thesis, four different algorithms (DL, SVM, KNN, and RF) have had their performance compared on a passive fingerprinting classification problem. In addition, a never before used feature has been evaluated: the TCP congestion control algorithm. The problem used a minimal feature space of only three or four features: SYN packet size, window size, Time to Live, and the TCP congestion control algorithm. The tests were performed on two datasets, one containing 33 classes and a second with reduced labels, containing seven classes. During our tests all classifiers performed well on the seven-class data, while the performance on the dataset containing all classes did not reach the same high scores. As for the evaluation of the TCP congestion control as a feature, all classifiers had their performance increased by 5-10% when the feature was included.

We believe our DL approach achieves acceptable results and performs on an equal level with some of the most well-established ML algorithms. We also believe the TCP variant could become a useful asset when passively classifying OSes.


Appendices

A Scripts

A.1 Data Preparation

A.1.1 Full label reduction

ORIGINAL_DB_PATH = "C:/Users/Martin/Desktop/Data_preparation/FP_DB.csv"

# An overview of the TCP variants can be seen in the appendix
TCP_variant = {"Unknown": "0",
               "CUBIC": "1",
               "CTCP": "2",
               "New Reno": "3",
               "Reno": "4",
               "Tahoe": "5",
               "Vegas": "6",
               "Hybla": "7",
               "BIC": "8",
               "Agile-SD": "9",
               "Westwood+": "10",
               "BBR": "11",
               "C2TCP": "12",
               "Elastic-TCP": "13",
               "Proportional Rate Reduction": "14"}

def prepare_data(filepath):
    """
    Function preparing data for the Deep Learning algorithm.

    Arguments:
    - filepath : Path to the file containing the data.

    Returns:
    - Writes the prepared data to a new file.
    """
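    # --- The remainder of this listing is truncated in the source document.
    # The body below is a hedged reconstruction of the described behaviour
    # (mapping TCP variant names to their numeric codes and writing the
    # result to a new file); the column layout is an assumption.
    import csv
    out_path = filepath.replace(".csv", "_prepared.csv")
    with open(filepath, newline="") as infile, \
            open(out_path, "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        for row in reader:
            # Assumed: the TCP variant name is in the last column; unseen
            # values fall back to the "Unknown" code.
            row[-1] = TCP_variant.get(row[-1], TCP_variant["Unknown"])
            writer.writerow(row)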
