
Quantifying Workload Delays for Consolidated Cloud Environments


Academic year: 2022



Emil Richardsen Nedregård

Thesis submitted for the degree of

Master in Network and System Administration 30 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO


© 2019 Emil Richardsen Nedregård

Quantifying Workload Delays for Consolidated Cloud Environments http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo


Abstract

Virtual Machine (VM) consolidation in the cloud has received huge research interest in recent years. Usually, the consolidation strategies are variants of the bin packing problem, which aims to minimize the number of deployed physical machines (PMs). When there are too few resources to handle the workloads, a delay occurs. Approaches to the bin packing problem result in more highly utilized clusters of PMs, where the expected load and delay are assumed to add up arithmetically. This thesis shows that when consolidating VMs based on the bin packing principle, the loads and delays do not always add up arithmetically, leading to larger delays than expected.


Contents

1 Introduction
1.1 Problem Statement
2 Background
2.1 Cloud Computing
2.1.1 Cloud Service Model
2.2 Virtualization
2.2.1 Types of Virtualization in Linux
2.3 Service Level Agreement
2.4 Consolidation
2.4.1 Steal Time
2.5 Bin Packing Problem
2.6 The Concept of Workload Delay
2.7 CPU
2.7.1 Cache
2.7.2 Multi-Core
2.7.3 Multi-Socket
2.7.4 Hyper-Threading and Simultaneous MultiThreading
2.7.5 Context Switch
2.7.6 CPU Scheduling
2.8 Kernel Timer
2.8.1 Proc
2.8.2 Tick
2.8.3 /proc/stat
2.9 NUMA
2.10 Related work
2.10.1 Consolidation Strategies in the Cloud
2.10.2 Applications of Queuing Theory in Cloud Computing
2.10.3 Steal time
2.10.4 Delay
3 Approach
3.1 Environment
3.1.1 DDOSLab2
3.1.2 Research2
3.1.3 Public Cloud Vendors
3.2 Visualization
3.2.1 Flame Graph
3.2.2 Rstudio - ggplot2
3.3 T-test
3.4 External Tools
3.4.1 Lstopo
3.4.2 Stress
3.4.3 Numactl
3.4.4 Isolcpu
3.4.5 Taskset
3.4.6 Perf
3.4.7 Bash Shell
3.4.8 C and C++
3.4.9 Python3
3.5 KVM
3.5.1 Kernel Samepage Merging
3.6 CPU Heavy Benchmarks
3.6.1 AssemblyPercent
3.7 Memory and Cache Benchmarks
3.7.1 CPP-loop
3.8 Experimental Approach
4 Results
4.1 Hyperthreading and Simultaneous Multithreading
4.1.1 Quantify Execution Time for Simultaneous Multithreading
4.2 Steal Time
4.2.1 Steal time in KVM
4.2.2 Steal Time in the Public Cloud
4.3 Delay toward CPU intensive tasks
4.3.1 Execution time when consolidating two VMs
4.3.2 Execution time for a process when consolidating all CPUs
4.3.3 Execution time in the Public Cloud
4.4 Delay toward Memory and Cache intensive tasks
4.4.1 Execution time when consolidating two VMs to the same CPU
4.4.2 Execution time when consolidating all CPUs
4.4.3 Consolidating different numbers of CPUs
4.4.4 Public Cloud
5 Discussion
5.1 Hyperthreading and Simultaneous Multithreading
5.2 Steal Time
5.3 Delay toward the CPU
5.4 Delay toward Memory and Cache
5.5 Public Cloud
5.6 Problem Statement
5.7 Future Work
6 Conclusion
Appendices
A Experiments
A.1 Bare Metal
A.1.1 Hyperthreading and Simultaneous Multithreading Performance
A.1.2 test_all_cpus.sh
A.1.3 loop_time.sh
A.1.4 hyperthreading_formatter.py
A.1.5 hyper_graph.R
A.1.6 all_boxplot.R
A.1.7 two_boxplot.R
A.2 Steal Time
A.2.1 proc_compare.py
A.2.2 stealcpu.sh
A.3 Delay toward the CPU
A.3.1 Shared CPU
A.4 All CPUs Consolidated
A.4.1 CPU Performance in the Public Cloud
A.4.2 Memory performance with same CPU
A.4.3 devide2.py
A.4.4 Memory performance all CPUs
A.4.5 Memory performance in the Public Cloud
B Code
B.1 AssemblyPercent
B.1.1 C
B.1.2 Assembly
B.2 Memory Performance in C++
B.2.1 C++


List of Figures

2.1 IaaS, SaaS and PaaS Comparison. Adopted from [10]
2.2 With virtualization, the three OSs running on the PMs to the left are migrated and can continue to run on the same PM.
2.3 Server Consolidation
2.4 Cache Architecture
2.5 NUMA architecture
2.6 How workload requests get delayed. Adopted from [6]
3.1 DDOSLab2 cache sizes, information about one of the server's two NUMA nodes.
3.2 Research2 cache size, information about one of the server's eight NUMA nodes.
3.3 Flame Graph of a script printing the Fibonacci numbers
3.4 Example of a detailed output measured by perf stat, data are taken from running the STEAM benchmark with different block sizes.
3.5 Flame Graph of the AssemblyPercent script.
3.6 Flame Graph of the CPP-loop script.
4.1 Execution of processes on the Research2 server where Simultaneous Multithreading (SMT) is enabled.
4.2 Performance in time when running Simultaneous Multithreading on a CPU heavy task. Data adopted from figure 4.1.
4.3 Comparison of utilizing the max numbers of physical CPUs and logical CPUs.
4.4 Sample of 100 Experiments, with mean and confidence interval of 95%.
4.5 Performance of a task in a consolidated environment where all CPUs are utilized, one task in a VM and on bare metal.
4.6 Execution time for the AssemblyPercent script.
4.7 Performance of the memory intensive scripts on Bare Metal, a VM and two VMs consolidated to the same CPUs.
4.8 Execution time in seconds for the different block sizes when consolidating VMs to all available CPUs.
4.9 Performance of the benchmark when utilizing different numbers of CPUs. This figure illustrates how the execution time rises over time when utilizing different numbers of CPUs on the DDOSLab2 host. The execution time is measured from a VM pinned to the host's CPU 0, on NUMA node 0.
4.10 Performance over time for the different vendors running the scripts for memory performance


List of Tables

2.1 Content of the /proc/stat file, which contains the number of ticks spent in different modes.
3.1 Server specifications.
3.2 Instance type used by different cloud vendors.
4.1 Ticks used by DDOSLab2 executing the AssemblyPercent script with all the required resources.
4.2 Ticks used by DDOSLab2 when executing the AssemblyPercent script on two VMs sharing CPU.
4.3 Ticks used by Research2 executing the AssemblyPercent script with all the required resources.
4.4 Ticks used by Research2 when executing the AssemblyPercent script on two VMs sharing CPU.
4.5 How the kernel is given resources from the hypervisor both in DDOSLab2 and the public cloud.
4.6 Standard deviation when executing AssemblyPercent on different vendors over a long period.
4.7 Standard deviation for all the different vendors and block sizes.
5.1 Comparing the mean of VM and consolidated VMs in percent for the different block sizes.
5.2 Comparing perf data from running the benchmark on a VM with no other traffic on the host, and perf data from a VM in a cluster with eight VMs where all guests are pinned to all CPUs and executing the benchmark.


Special Terms and Acronyms

CPU - Alternately referred to as a processor, central processor, or microprocessor, the CPU is the central processing unit of the computer.

Core - A part of a CPU that receives instructions and performs calculations, or actions, based on those instructions.

Core and CPU - Both terms refer to a core throughout this thesis.

HT - Hyperthreading

SMT - Simultaneous Multithreading

VM - Virtual Machine

PM - Physical Machine

OS - Operating System

SLA - Service Level Agreement

IaaS - Infrastructure as a Service

PaaS - Platform as a Service

SaaS - Software as a Service


Acknowledgments

I would like to thank my supervisors Hårek Haugerud and Vangelis Tasoulas for their excellent guidance and support during this process, which has been challenging yet very interesting.


Chapter 1

Introduction

Cloud computing is growing at a tremendous pace, and more and more companies are centralizing their resources to ensure uninterrupted power, better security, and availability. The increased use of virtual machines (VMs) over the years enables vendors to offer a variety of scalable virtualized services at a low cost through a pay-as-you-go approach.

As the internet took off in the mid-1990s, data centers became common, and today the energy consumption challenges posed by data centers are considerable. In 2017 about 3% of the global electricity production was consumed to operate data centers worldwide, nearly 40% more than the consumption of the entire United Kingdom. This consumption is expected to double every four years [1]. The energy consumed by cooling data center equipment can be over 40% of the total energy consumption [2]. For cloud vendors this is turning into a problem, because energy cost is a significant expense. Keeping the energy consumption low is, therefore, a significant driver, and an ideal goal is to align the number of physical machines (PMs) with the current demand.

Since the introduction of virtualization, vendors have been able to utilize their physical servers much better by running several VMs on a single PM instead of only one operating system (OS) per PM. Virtualization technologies with VM live migration enable VMs to be moved around in a cluster with almost no downtime. Being able to run multiple VMs on a single PM made it possible to move several under-utilized machines to the same physical host. This approach is called Server Consolidation. Vendors are, however, still facing the problem of under- and over-utilized servers.

A great deal of research has gone into finding the optimal solution for consolidating VMs on PMs, see for example [3, 4]. The majority of the work considers the problem as a resource allocation problem and typically a bin packing problem [5]. The need for consolidation differs from case to case.

Some vendors might pack their VMs with the aim of high utilization. In some periods the demand for resources may then go above the capacity, and if that is the case the tasks are going to get delayed.


A paper published by researchers at Oslo Metropolitan University (OsloMet) looked at the concept of workload delay as a quality-of-service metric for consolidated cloud environments [6]. The paper concluded that the concept of delay needs to be accounted for when consolidating.

However, research has yet to quantify workload delay for consolidated cloud environments. Most studies focus on minimizing the number of PMs while meeting the service level agreement (SLA), but not on how VMs might compete fiercely with each other to acquire shared resources such as CPU, cache and memory bus. The performance degradation on over-utilized servers has not been a common research topic over the years. This thesis focuses on how a delay more significant than expected may occur in a real consolidated cloud environment. When VMs in consolidated environments share resources, it is commonly expected that the execution times of the VMs sharing resources add up in a linear manner. This thesis examines to what extent this is true and quantifies workload delays for consolidated cloud environments.

1.1 Problem Statement

At some point, two processes on a bare metal server need to share resources. It is easy to expect a delay of 50% in execution for the processes whenever they share, but in some cases the execution time may exceed the expectations. The sharing of resources on a bare metal server leads to the first problem statement of this thesis:

When does sharing of resources on bare metal servers lead to an increase in expected delay?

This thesis also aims to quantify workload delay in a consolidated environment. To get the best possible understanding of a workload delay, the underlying hardware and components in the kernel need to be researched. As VMs usually share the same resources at some point when consolidated, it is essential to quantify which metrics lead to a workload delay and to what extent, which leads to the second problem statement:

When do VMs sharing resources cause an increase in an expected delay?

An IaaS cloud and a consolidated environment consist of VMs utilizing the same hardware on a PM. How sharing the same host leads to a delay needs to be investigated. A public IaaS cloud makes more money when more VMs are utilizing the same hardware; however, the impact of many machines utilizing the same PM is yet to be researched. This leads to the third problem statement:

How do we quantify the workload delay for VMs running on the same physical hardware both in a local environment and the public cloud?


Chapter 2

Background

This chapter goes through the technologies and terminology needed to understand the problem domain. A good understanding of the problem domain and the current situation is essential to see where a delay might occur. As PMs are getting more and more complex, several designs have been implemented for the sake of speed and scale. Some of them might also cause a significant delay for the user if not utilized as intended. This chapter addresses most of these essential metrics.

2.1 Cloud Computing

Cloud computing is a centralized, shared pool of configurable computer system resources that can be rapidly provisioned with minimal effort. Over the years, vendors like Amazon AWS have, with their pay-as-you-go approach, been able to phase out the need for in-house data centers for small- to medium-sized companies. The paper [7] takes a look at cloud computing services in scientific computing, where High-Performance Computing (HPC), High-Throughput Computing (HTC) and Many-Task Computing (MTC) are introduced. A key point in this article is that by using cloud computing through services like Amazon AWS's EC2, researchers can get resources when needed and only pay for what they have used.

2.1.1 Cloud Service Model

There are three different Cloud Service Models that are widely used across IT, and they are essential to understand as most big cloud vendors provide these services. The models are Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS):

IaaS - In IaaS, a provider sells access to its underlying infrastructure, on which the customers can run their applications. The customers do not control the underlying infrastructure but are usually given access to the machines through a web browser, where they can start a new VM with the resources they want in a pay-as-you-go manner. They usually choose the OS and have to do all the configuration themselves, for example networking and server clustering. From a customer's point of view, the IaaS service model is known to be resilient and always available, to offer fast hosting of an application, and to provide as many resources as needed [8].

Figure 2.1: IaaS, SaaS and PaaS Comparison. Adopted from [10]

PaaS - In PaaS, the goal is also to deploy the software onto the provider’s infrastructure, but generally by using programming languages supported by the vendor. In this approach, the customers do not define the network, OS, servers or storage but only control the application.

SaaS - In this approach, the vendors are hosting the application for the user, and it is ordinarily accessible from a web browser or specific software. The provider has the benefit of steady, on-going revenue from their customers. In exchange, the customer gets the benefit of continuously maintained software [9].

See figure 2.1 for a comparison of the three Cloud Service Models, where the managed areas are illustrated.

2.2 Virtualization

Virtualization is the act of creating a virtual version of an operating system (OS). See figure 2.2 for a basic virtualization example, where three OSs are virtualized and run independently on one physical machine instead of three.

Figure 2.2: With virtualization, the three OSs running on the PMs to the left are migrated and can continue to run on the same PM.

With the help of a hypervisor [11], several operating systems can run on the same PM. With this technology, system administrators can utilize more of the resources on one PM. It is also possible to take a snapshot of a VM's current state for the sake of a later restore. This snapshot might also be used as a backup, for example before performing a risky operation. VM live migration is also made possible, where VMs can switch PM with almost no downtime [12].

2.2.1 Types of Virtualization in Linux

Full Virtualization

The hardware completely supports virtualization, and there is no need to modify any specific software or hardware for the VMs to work properly. KVM and VMWare ESXi are examples of software that depends entirely on the underlying hardware and only works where virtualization is supported.

Para-Virtualization

In contrast to Full Virtualization, each VM is presented with an abstraction of the hardware that is similar but not identical to the underlying physical hardware [13]. It requires modifications to the guest OS running on the VM, making the guest aware that it is executing on a VM, which allows for near-native performance.


2.3 Service Level Agreement

A service level agreement (SLA) is a contract between a service provider and an end user that defines the level of service expected from the provider. The purpose of an SLA is to define what the customer is going to receive; it does not define how the specific service is delivered or provided. An SLA should contain the following metrics:

Description - A description of the service.

Reliability - The percentage of time the service is going to be available.

Responsiveness - The punctuality of service in response to requests and service dates.

Problems - A list of who to contact and the escalation procedure when a problem occurs.

Monitoring - Who is going to monitor performance.

Not meeting service obligations - Routines if the SLA is not met, for example whether the customer should be able to terminate the relationship.

Escape Clauses - Circumstances under which the SLA does not apply, for example if an ISP's equipment gets destroyed in a natural disaster, fire, etc.

The level of the services should be specific and measurable in each area to ensure a quality of service (QoS).

2.4 Consolidation

As mentioned in the introduction, Server Consolidation is the process of packing VMs onto PMs, which allows vendors to utilize their PMs more. By minimizing the number of PMs, there may be a reduction in cost, carbon footprint, and power consumption. See figure 2.3 for the impact consolidation might have in a cluster of PMs. In the figure, the VMs executing on servers S2 and S3 are, after consolidation, running independently on the hosts S0 and S1. The same number of VMs is now executing on less hardware. With live migration, VMs may also be moved around in a cluster based on utilization. Vendors migrate servers all around the cluster, and there are many examples of research that optimizes the migration of VMs by monitoring customer demand to find the most suitable destination [14].

2.4.1 Steal Time

As the number of virtual CPUs on the guests can exceed the number of physical CPUs of the host, the hypervisor cannot in some cases schedule all virtual CPUs on the available physical CPUs and must temporarily suspend some of them. The time during which a virtual CPU is ready to execute but is suspended is called Steal Time. It enables the VM to compute the execution time of an application accurately. This metric is available in scenarios in which the host and the guest OS are tightly coupled. Vendors like AWS do not show this metric for the OS running on most instances in the cloud [15]; this applies to most cloud vendors. Where the Steal Time is high, the user of the computer is going to notice a degradation in performance.

Figure 2.3: Server Consolidation

By measuring the steal time, it might be possible to quantify the delay that different workloads on a VM experience due to other VMs utilizing the same CPU.

2.5 Bin Packing Problem

The Infrastructure as a Service (IaaS) model can be abstracted as a bin-packing problem where the VMs correspond to items and the PMs to bins. The items must be packed into bins drawn from an unlimited supply, where the intent is to use as few bins as possible. Reducing the number of bins is essential for green computing. The bin-packing problem is known to be NP-hard, so there is no known algorithm that solves it optimally in polynomial time. There is much research on the topic, for example [16], which proposes a consolidation strategy called Ant Colony System based VM Consolidation (ACS-VMC) to reduce the number of PMs in a cluster for green computing. Different workloads were tested, and the final result showed that the solution reduced energy consumption, SLA violations and the number of migrations. However, no research is considered the ultimate solution to the problem.

Many consolidation algorithms are based on this approach, and more are mentioned in the related work section. Research into making sure the bins actually provide all the resources required has, however, not been a hot research topic in recent times.
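As a rough illustration of the bin-packing abstraction, the sketch below places VMs on PMs with the well-known first-fit decreasing heuristic. It is not the ACS-VMC strategy from [16]; the function name, the single CPU-demand dimension and the demand values are assumptions made for this example only.

```python
def first_fit_decreasing(vm_demands, pm_capacity):
    """Pack VM demands (items) onto PMs (bins) of equal capacity.

    Returns a list of PMs, each a list of the VM demands placed on it.
    This is a heuristic: the result is not guaranteed to be optimal.
    """
    pms = []    # VM demands placed on each PM
    free = []   # remaining capacity of each PM
    for demand in sorted(vm_demands, reverse=True):
        if demand > pm_capacity:
            raise ValueError("VM demand exceeds the capacity of a PM")
        for i, remaining in enumerate(free):
            if demand <= remaining:
                pms[i].append(demand)
                free[i] -= demand
                break
        else:
            # No deployed PM has room, so a new PM is deployed.
            pms.append([demand])
            free.append(pm_capacity - demand)
    return pms


# Hypothetical CPU demands, in percent of one PM.
placement = first_fit_decreasing([50, 70, 30, 20, 60, 40, 10], pm_capacity=100)
print(len(placement), placement)
```

A packing like this only checks that the nominal demands fit within each PM. It assumes that loads add up arithmetically, which is exactly the assumption this thesis examines.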


2.6 The Concept of Workload Delay

A workload gets delayed when it takes longer than expected to complete. A process or VM sharing resources gets delayed when it is not able to utilize the desired resources. When consolidating, it is easy to expect the need for resources to add up arithmetically. In some cases, however, a process might require more resources when consolidated, resulting in a delay more significant than expected. When researching the concept of workload delay, the goal is to quantify which resources might lead to an increase in expected delay when consolidated. By taking workload delay into account, consolidation algorithms might also improve.
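The arithmetic expectation can be made concrete with a small sketch. It assumes identical workloads sharing one CPU, so that the expected execution time is the solo time multiplied by the number of sharers; the function names and the timing values are hypothetical and are not measurements from this thesis.

```python
def expected_time(solo_time, sharing):
    """Execution time if the load adds up arithmetically on one shared CPU."""
    return solo_time * sharing


def excess_delay_percent(solo_time, sharing, observed_time):
    """Delay beyond the arithmetic expectation, in percent of that expectation."""
    expected = expected_time(solo_time, sharing)
    return (observed_time - expected) * 100.0 / expected


# Hypothetical numbers: a task takes 10 s alone and 23 s when two VMs share a CPU.
print(expected_time(10.0, 2))               # 20.0
print(excess_delay_percent(10.0, 2, 23.0))  # 15.0
```

A result of zero would mean the delay adds up arithmetically; a positive value is the part of the delay that a plain bin-packing estimate would miss.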

2.7 CPU

It is essential to know about the metrics in the CPU, as the CPU might be the single cause of a delay. If a delay occurs and conclusions are to be drawn, all metrics involved in the execution need to be researched. The increasing demand for computationally intensive applications results in higher demand for computing resources, and researchers are continually developing new ways to increase the performance of the computer. When looking at the delay, information about the underlying infrastructure is therefore desired.

2.7.1 Cache

The cache is a software or hardware component whose job is to store data so that future requests for the data can be served faster. In the early days of computing, CPUs were getting a lot faster than memory, which resulted in the CPU not being fully utilized as it was waiting for data from memory; for this reason the cache was introduced. The goal of the cache is to ensure that the CPU has the next bit of data it will need already loaded into the cache by the time it goes looking for it; this is known as a cache hit. A cache miss means that the CPU has to find the data elsewhere. Usually, a system consists of an L1, L2, and L3 cache, where L1 is the fastest and L3 the slowest; even so, fetching data from any cache level is faster than fetching it from memory. If there is a cache miss in L1 the CPU checks L2, and so on. One might argue for larger caches, but the L1, L2 and L3 caches must not be too big, as size has an impact on speed and cache memory is also expensive.

See [17] for a cache comparison between Intel Sandy Bridge and AMD Bulldozer, which concludes that the AMD Bulldozer technology is much more complicated than Intel Sandy Bridge, resulting in a vast amount of latencies. For example, the study shows that the accumulated L3 cache bandwidth of a Bulldozer server with eight cores is close to the L3 bandwidth of a single Sandy Bridge CPU. The paper goes through the technology in detail and points out design decisions made on the CPU that result in lower or better bandwidth. See figure 2.4 for a basic cache architecture, where the infrastructure of a core is presented. The figure also shows the different levels the CPU has to go through before fetching data. First the CPU looks in the L1 cache; if a miss occurs it looks in L2, L3 and so on. Eventually, if the RAM is full, the information is stored on the disk, which is even slower compared to RAM.

Figure 2.4: Cache Architecture

Cache is important when quantifying memory performance. As VMs in a consolidated environment might share the CPU, the L1 and L2 caches are usually shared by the two VMs. Commonly, the L3 cache is shared among several CPUs by default. When running a heavy memory benchmark, a delay more significant than expected might arise when VMs share some parts of the cache.
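One way to reason about why shared caches can cause more than an arithmetic delay is the standard average memory access time model, in which each miss adds the cost of the next level. The sketch below is such a model and not a measurement; the latencies (in CPU cycles), the miss rates and the function name are all hypothetical values chosen for illustration.

```python
def average_access_time(levels, memory_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_latency, miss_rate) tuples, ordered from L1 outwards.
    memory_latency: cost of going to main memory after the last cache misses.
    """
    total = memory_latency
    # Work from the last cache level back towards L1.
    for hit_latency, miss_rate in reversed(levels):
        total = hit_latency + miss_rate * total
    return total


# Hypothetical hierarchy: (hit latency in cycles, miss rate) for L1, L2, L3.
alone = average_access_time([(4, 0.10), (12, 0.40), (40, 0.20)], 200)
# Assumption: a second VM evicts data from the shared L3, raising its miss rate.
shared = average_access_time([(4, 0.10), (12, 0.40), (40, 0.50)], 200)
print(alone, shared)
```

In this model a higher miss rate in a shared level raises the cost of every access that reaches it, even though neither VM asked for more CPU time than before.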


2.7.2 Multi-Core

A computing component with two or more independent processing units is called a multi-core processor. For example, a dual-core setup consists of two CPUs located in the same socket. Since they are in the same socket, the connection between them is much faster compared to having them in separate sockets (multi-socket). Even though a user can run multiple processes on a single CPU, with multiple CPUs in a single system tasks can be given CPU time faster.

2.7.3 Multi-Socket

A server normally wants a high number of active CPUs. Multi-socket makes it possible to add several CPUs to the same system. The sockets are interconnected so all the CPUs can communicate with each other. Usually, sockets have locally connected RAM; this design is called NUMA. For performance-related problems when having multiple sockets with NUMA, see section 2.9. Also, since cores on the same socket share the L3 cache, communication between them is much faster compared to inter-socket communication [18].

2.7.4 Hyper-Threading and Simultaneous MultiThreading

Hyper-Threading (HT) by Intel and Simultaneous MultiThreading (SMT) by AMD are based on the same principles. The technology is used to improve parallelization, which means doing multiple tasks at once. A processor with this enabled consists of two logical processors per physical core, each of which has its own architectural state. Each logical processor can individually be halted, interrupted, or directed to execute a specific thread independently from the other logical processor sharing the same physical core. Since each physical core is seen by the kernel as two processors, these are called siblings. Sibling CPUs usually have a small individual L1 cache and share a small part of the L1 cache, while they share the L2 cache and usually share the L3 cache with the CPUs on the same socket.

As this technology shares resources when heavily utilized, it might be a root cause of delay. This topic is researched further in the results chapter.
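On Linux, the sibling relationship can be inspected through sysfs. The following is a minimal sketch that reads the `thread_siblings_list` topology file for one logical CPU; the helper names are hypothetical, and the function returns `None` where the file is not available.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,32" or "0-3,8" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_siblings(cpu):
    """Return the logical CPUs sharing a physical core with `cpu`, or None."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    try:
        return parse_cpu_list(path.read_text())
    except OSError:
        return None  # not Linux, or topology information is not exposed


print(parse_cpu_list("0,32"))
print(smt_siblings(0))
```

Knowing which logical CPUs are siblings makes it possible to decide whether two workloads are pinned to the same physical core or to separate ones.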

2.7.5 Context Switch

A Context Switch is the process of storing the state of a process or thread so it can be restored and resume execution from the same point later. This allows several processes to be executed on the same CPU and is an essential feature of a multitasking OS. A Context Switch also occurs as a result of an interrupt. It has a cost, as storing the state of the current process requires a small amount of time, and much of the design of an OS goes into optimizing its use. The Context Switch is essential because it is the critical component for parallelization. When two processes are fighting for the same resources, the CPU context switches between the processes so fast that they seem to be executing at the same time, while both are in fact delayed.

2.7.6 CPU Scheduling

CPU scheduling is a process which allows one task to utilize the CPU while some other task is on hold. The goal of CPU scheduling is to make the system fair, fast and efficient. When the CPU becomes idle, the OS must select one of the processes in the ready queue to be executed. When two VMs are asking for the same resources, it is the scheduler's job to decide which job to execute next. Usually, in this case, the scheduler would context switch between the processes so that over time the CPU utilization for each of the two VMs is 50%. How the underlying scheduler behaves in a consolidated environment is useful to know when researching delay.
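The 50% expectation can be illustrated with a toy round-robin simulation. This is a simplified sketch and not how the Linux scheduler is implemented: it assumes equal time slices and ignores the cost of each context switch. The function and task names are hypothetical.

```python
from collections import deque


def round_robin(work, time_slice=1):
    """Simulate round-robin scheduling of tasks on one CPU.

    work: dict mapping a task name to the CPU time units it needs.
    Returns a dict mapping each task name to its completion time.
    Context-switch cost is ignored, which is a simplifying assumption.
    """
    queue = deque(work.items())
    clock = 0
    finished = {}
    while queue:
        name, remaining = queue.popleft()
        run = min(time_slice, remaining)
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))  # back to the ready queue
        else:
            finished[name] = clock
    return finished


print(round_robin({"vm1": 10}))             # {'vm1': 10}
print(round_robin({"vm1": 10, "vm2": 10}))  # {'vm1': 19, 'vm2': 20}
```

In this idealized model two equal tasks take about twice as long as one task alone, which is the arithmetic expectation. Any real cost per context switch or any cache interference comes on top of that.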

Dispatcher

The dispatcher is the module that gives control of the CPU to a process selected by the short-term scheduler, which involves switching context, switching to user mode and jumping to the proper location in the user program to restart the program from where it was left. The time it takes to stop one process and start another is known as the Dispatch Latency.

Types of CPU scheduling

CPU scheduling decisions take place under the following circumstances:

1. A process switches from the running to the waiting state.

2. A process switches from the running to the ready state.

3. A process switches from the waiting to the ready state.

4. A process terminates.

With Non-Preemptive Scheduling, once the CPU has been allocated to a process, the process keeps the CPU until it releases it by either terminating or switching to the waiting state. In Preemptive Scheduling, the tasks are typically assigned priorities, as it may be necessary to run a job with a higher priority before another task even though that task is running. The running task is therefore interrupted and resumed later when the higher-priority task finishes.


2.8 Kernel Timer

As the environment used in this thesis is Linux, it is useful to know how the kernel timer works. When looking at the performance of a VM, it is essential to see where the VM or hypervisor believes it spends its time compared to where it really spends its time. The kernel timer might give some answers when a task gets delayed, because processes need a certain amount of time on the CPU to finish executing, and in a consolidated environment where available resources may vary, a CPU could at some point be busy when the process needs it.

2.8.1 Proc

The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures. It is mounted at /proc in Ubuntu and most standard Linux distributions. Most of the files in proc are read-only, but some files are writable to allow kernel variables to be changed.

2.8.2 Tick

A tick is an arbitrary unit for measuring internal system time. How many milliseconds a tick represents varies and is up to the kernel. For example, Microsoft Windows has 10,000 ticks in a millisecond. Ubuntu Linux typically has 100 ticks in a second, so one tick equals 10 milliseconds. CPU time is measured in clock ticks per CPU, so a server running Linux with 100 ticks per second and 64 cores has in total 6400 clock ticks per second. /proc/stat in Linux is a file that holds information about how each CPU has spent its time, measured in ticks. Steal Time is reported in ticks, which are therefore essential to understand in order to see how much of the requested CPU time the machine actually gets.
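The tick arithmetic above can be checked with a few lines of Python. The sketch queries the tick rate through `os.sysconf("SC_CLK_TCK")` where available; the helper names and the fallback value of 100 are assumptions for illustration.

```python
import os


def ticks_per_second():
    """Clock ticks per second as reported by the system."""
    try:
        return os.sysconf("SC_CLK_TCK")
    except (AttributeError, ValueError, OSError):
        return 100  # assumed fallback, matching the typical value named above


def ticks_to_seconds(ticks, hz):
    """Convert a tick count to seconds for a given tick rate."""
    return ticks / hz


def tick_capacity(cpus, hz):
    """Total number of ticks all CPUs can account for in one second."""
    return cpus * hz


print(ticks_per_second())
print(ticks_to_seconds(1, 100))  # 0.01
print(tick_capacity(64, 100))    # 6400
```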

2.8.3 /proc/stat

Information about kernel activity is stored in the /proc/stat file for each CPU, and the numbers in this file are accumulated since the system first booted.

For an example of the content see table 2.1. The file contains information about how many ticks each CPU spent performing different types of work.

Each of the columns has its meaning.

1. user - Time spent executing processes in user mode.

2. nice - Time spent executing tasks with a nice value (low scheduling priority).

3. system - Time the CPU spent in kernel mode.

4. idle - Number of ticks the CPU has spent idle.

5. iowait - Time spent waiting for I/O to complete.

6. irq - Time spent servicing interrupts.

7. softirq - Time spent servicing softirqs.

8. steal - Number of ticks spent in other operating systems when running in a virtualized environment.

9. guest - Time spent running a virtual CPU for a guest OS under the control of the Linux kernel.

10. guest_nice - Time spent running a niced guest (a virtual CPU for a guest OS under the control of the Linux kernel).

The metric ctxt is also reported in this file. The ctxt line gives the total number of context switches across all CPUs in the system. By analyzing the data presented in this file, information about how a CPU behaves in a consolidated environment can be obtained.

cpu  12742 667 6033 13527227 9645 0 74 0 0 0
cpu0 2792 3 1518 3382869 2240 0 7 0 0 0
cpu1 3260 104 1413 3380868 2577 0 25 0 0 0
cpu2 3386 404 1619 3381061 2636 0 35 0 0 0
cpu3 3303 154 1481 3382427 2191 0 5 0 0 0

Table 2.1: Content of the /proc/stat file, which contains the number of ticks spent in different modes.
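As an illustration of how these columns can be used, the following minimal sketch parses the cpu lines of /proc/stat and computes the share of ticks reported as steal between two samples. The function names parse_cpu_lines and steal_share are hypothetical helpers and not the scripts used in this thesis. The sketch relies on the fact that guest and guest_nice are already included in user and nice, so they are left out of the total.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_lines(text):
    """Return {cpu_name: {field: ticks}} for every 'cpu' line in /proc/stat text."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("cpu"):
            continue  # skip ctxt, btime, processes and other non-CPU lines
        values = [int(v) for v in parts[1:1 + len(FIELDS)]]
        stats[parts[0]] = dict(zip(FIELDS, values))
    return stats


def steal_share(before, after):
    """Fraction of the elapsed ticks of one CPU that were reported as steal."""
    # guest and guest_nice are already counted in user and nice, so skip them.
    counted = FIELDS[:8]
    total = sum(after[f] - before[f] for f in counted)
    return (after["steal"] - before["steal"]) / total if total else 0.0
```

Reading /proc/stat twice, for example one second apart, and passing the two cpu0 entries to steal_share gives the share of that second in which the hypervisor ran something else than the VM.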

2.9 NUMA

Non-uniform memory access (NUMA) is a memory design used in multiprocessing. As the CPU is considerably faster than main memory, it is essential to access the memory as fast as possible. With NUMA, a CPU can access its local part of the memory much faster than the non-local part. The design is widely used on multi-socket servers.

When the demand for memory is bigger than the local memory address space, memory on an external NUMA node is used instead. Requesting data from a non-local NUMA node introduces a delay.

NUMA systems are known to be more suitable for scaling according to [19], where the paper explains how simple it is to build a resource-heavy server by combining several cost-friendly sockets, compared to having a single socket that is huge and expensive. See figure 2.5 for the underlying architecture, which shows a VM with two virtual CPUs (vCPUs), each pinned to a CPU on a socket located on a separate NUMA node. A node consists of a socket with a locally connected amount of memory. By default, each of the CPUs uses memory from its local NUMA node, but node 0 might also use memory externally from node 1. The figure also shows the underlying infrastructure of a multi NUMA node system. A real system also contains memory and I/O controllers, but these are left out of figure 2.5 for the sake of simplicity.

Figure 2.5: NUMA architecture

NUMA is vital because each socket has a locally connected NUMA node, and using memory from an external NUMA node results in a delay. In some cases a process executes faster when using external memory; the scheduler decides this. The size of the locally connected memory is essential to take into account, because even though a system has much memory, a process might get delayed when a CPU utilizes significant parts of it.

2.10 Related work

There are several approaches to consolidating VMs on PMs in the cloud, as bin packing is known to be an NP-hard problem. The concept of waiting time and delay is also a central part of standard queuing theory. Different consolidation strategies and representative examples of queuing theory are therefore mentioned in this section. As this thesis aims to research the steal time of a VM in a consolidated environment, there is also a section related to this, as well as a section on related work on delay.

2.10.1 Consolidation Strategies in the Cloud

In [20], a burstiness-aware server consolidation approach based on queuing theory is introduced. As many PMs are highly consolidated, the VMs in some cases need to be migrated to other idle PMs. Migration is a costly operation and can potentially cause performance degradation for the end user. The paper proposes a novel VM consolidation mechanism with resource reservation which takes burstiness as well as energy consumption into consideration. The paper [21] presents a live migration algorithm which analyses the resource need and then adapts to the VMs' previous need for resources. The researchers compare their results to other algorithms and show that their solution is up to 10% more efficient than the algorithms it is compared with.

Furthermore, the paper [22] takes consolidation one step further, as its algorithm consolidates VMs based on which other guests they communicate with frequently. Research has shown that 70% of the total data center communication traffic comes from interactive communication between VMs. Inappropriate consolidation may leave VMs continually talking back and forth with each other through the data center network, which may significantly degrade performance. The solution can effectively reduce energy consumption and communication bottlenecks while meeting SLA constraints.

As more and more companies are moving their infrastructure to the cloud, the paper [23] presents an approach for finding the most cost-effective solution when the workload is known. Algorithms are developed to find the best cost, taking several variables into account. In [24], a cost-benefit evaluation of using cloud computing to extend the capacity of clusters is presented. With cloud computing and IaaS, companies with local infrastructure can extend their cluster with a vendor's resources without moving all the infrastructure to the cloud. As some users might want to increase their resources for a short amount of time, this approach removes the need to expand the local infrastructure, which is a costly operation. The research showed an excellent ratio of slowdown improvement to the money spent on using cloud resources.

As there are many consolidation strategies in the cloud, each strategy has its own purpose. This shows in [25], where the solution aims to consolidate in order to save energy costs in a cloud computing environment. Three different approaches to achieve these energy savings are discussed in the article: workload prediction, VM placement and workload consolidation, and resource overcommitment. Current challenges with these consolidation strategies are also introduced and discussed.

Although there are many consolidation strategies in the cloud, most of the papers presented do not take the concept of delay into account. In many cases, the consolidation algorithm does not consider that at some point the VMs need to share resources, and that the time they share might lead to an increase in the expected delay. When a server has 64 logical CPUs available, this does not generally mean that the performance of a single CPU is the same when all of them are utilized. The delay in this example arises because hyperthreading is enabled on the PM. The effects of such technologies are not yet as widely researched as consolidation strategies.

2.10.2 Applications of Queuing Theory in Cloud Computing

As services need resources depending on the time of day due to user activity, it is crucial to look at research related to this. For the vendors, increasing revenue means minimizing the cost of the services (electricity, infrastructure providers) and maximizing the service charge to customers, while at the same time guaranteeing the quality of service for each request.

Some articles research how to predict the required cloud capacity in the presence of time-varying customer demand for services. In [26], different approaches are used to predict the arrival times of client requests for VMs, and several models for managing the number of VMs required to satisfy the demand are proposed and compared.

The researchers in [27] introduce a double-quality-guaranteed renting scheme for service providers, as customers with different demands want the waiting time for each service request to be within a low range to satisfy quality-of-service requirements. The main feature of this scheme is to combine short-term with long-term renting. The solution can reduce resource waste and adapt to the dynamic demand for computing capacity. The article takes many factors into account in the optimization, including market demand, SLA, and the rental cost of servers.

When scaling a system, problems arise as different services scale over time. Vilaplana et al. [28] introduce a model that scales and can easily be applied to most public and commercial areas. The article concludes that, to deliver good QoS in terms of response time, one has to determine where the system has a bottleneck and then improve the corresponding parameters.

2.10.3 Steal time

Researchers from the Polytechnic University of Catalonia have published a platform-agnostic steal-time measurement for guest operating systems [15]. As the guest OSs of most of the big cloud vendors do not output the steal time for the VMs, the solution presented in this paper is a novel and platform-agnostic approach to calculating steal time within the virtualized environment, without the cooperation of the host OS. The presented algorithm is shown to compute steal time effectively in a guest Linux OS running on a hardware-assisted VM.

Schad [29] takes a look at the performance degradation in the IaaS clouds of some prominent vendors. He compared the performance of the same instance types and experienced a substantial performance variance between "equal" instances. He pointed out that the vendors do not inform the clients about the placement of the VMs in a large cluster, so the user has no idea how many VMs are running on the same hardware, and the end user has little control when running intensive tasks.

2.10.4 Delay

Researchers from Oslo Metropolitan University (OsloMet) have looked into the concept of workload delay in a consolidated environment [6]. The paper goes into how a workload gets delayed, and they base their conclusions on data from a real consolidated environment. Their results show that a simple bin packing based on delay metrics can deliver more predictable performance than a simple bin packing based on average utilization. See figure 2.6, adapted from the paper, for how a workload request gets delayed in a cluster where VMs have different demands over time. When the demand is above the maximum utilization, a delay starts to arise. For example, in some parts of the day the demand for resources is much higher, which may lead to a bigger delay. The environment therefore needs more time to handle all the requests from the different VMs in the consolidated environment. As previously mentioned in this thesis, the paper also states that issues might occur when the concept of delay is not taken into account when consolidating VMs on PMs.
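The idea that a delay arises once the demand exceeds the maximum utilization can be illustrated with a small sketch. It is a simplified model assumed here for illustration only, not the delay metric defined in [6]: work that does not fit within a time step is carried over to the next step. The demand values are hypothetical.

```python
def backlog_over_time(demand, capacity):
    """Return the unserved work after each time step.

    Assumed model: work that does not fit in a step is carried over,
    backlog[t] = max(0, backlog[t-1] + demand[t] - capacity).
    """
    backlog = []
    carried = 0.0
    for d in demand:
        carried = max(0.0, carried + d - capacity)
        backlog.append(carried)
    return backlog


# Hypothetical demand (percent of one PM) for two VMs over five time steps.
vm_a = [40, 70, 80, 30, 20]
vm_b = [10, 50, 50, 30, 20]
combined = [a + b for a, b in zip(vm_a, vm_b)]

# The average combined demand is below the capacity of 100,
# yet work is delayed in the steps where the demand peaks.
print(sum(combined) / len(combined))
print(backlog_over_time(combined, capacity=100))
```

In this example a placement based on average utilization would accept both VMs on the same PM, while the backlog shows that requests still have to wait during the peak.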

In [30], the authors conclude that the overall performance of consolidated VMs is unpredictable. According to the researchers, VMs that share the same CPU and main memory work surprisingly well, and the delay when stress testing these resources is not that significant. The main issue identified in the paper is I/O, which is tested and shown to be unpredictable in a consolidated environment. However, the experiments in that paper are executed in the public cloud and not in a local environment. The researchers base their conclusions on data fetched mostly from Amazon AWS instances on which they have executed different benchmarks. The conclusions are thus based on the performance of a VM running on a specific hypervisor at a specific point in time. The survey does not mention that this result may vary over time as the hypervisors hosting the VMs and the underlying hardware improve.


Figure 2.6: How workload requests get delayed. Adapted from [6]


Chapter 3

Approach

This chapter presents the approach together with the different tools and technologies that are used consistently throughout this thesis, as well as technical information about the environment used for testing. Different servers are used for experimenting, together with different tools that keep the experiments as isolated, and thereby as valid, as possible.

3.1 Environment

For local testing, the two servers DDOSLab2 and Research2 have been used. Their hardware specifications are listed in table 3.1.

Server                                        DDOSLab2 - Dell PowerEdge R610  Research2 - Dell PowerEdge R815
Processor                                     Intel® Xeon® 5600               AMD Opteron(tm) Processor 6366 HE
Sockets                                       2                               6
Cores                                         8                               32
L1 Cache                                      32KB L1i, 32KB L1d              16KB L1d, 64KB L1i
L2 Cache                                      256KB                           2048KB
L3 Cache                                      8192KB                          6144KB
Chipset                                       Intel® 5520                     AMD
DIMMs                                         12 DDR3                         16 DDR3
RAM                                           12 * 2GiB DIMM DDR3 1066 MHz    16 * 16GiB DIMM DDR3 1600 MT/s
Drive Bays                                    6 x 2"                          6 x 2
Hard Drive Types                              SAS                             SAS
Hyperthreading / Simultaneous Multithreading  No                              Hyperthreading

Table 3.1: Server specifications.

3.1.1 DDOSLab2

The DDOSLab2 server is running Ubuntu Server 18.04 with KVM installed. All the VMs hosted on the server also use Ubuntu Server 18.04. See figure 3.1 for information about the cache sizes and NUMA nodes of the server. The picture only shows one of the two NUMA nodes for the sake of simplicity, as the other NUMA node looks precisely the same. The cores do not share cache, as hyper-threading is turned off. CPUs 1, 2 and 3 are isolated, so no user process can execute on these three CPUs except when a program is started with taskset. The different tools used for the environment are described in section 3.4.

Figure 3.1: DDOSLab2 cache sizes, showing one of the server's two NUMA nodes.

3.1.2 Research2

Research2 uses the AMD Bulldozer microarchitecture; see figure 3.2 for the lstopo output. Eight NUMA nodes exist on the host; however, the lstopo output only shows one, as the other seven are identical. The server is running Ubuntu 18.04 Server with KVM installed, and the VMs created run the same OS as the host. In contrast to the DDOSLab2 server, each CPU shares the L2 cache and some parts of the L1 cache with another CPU due to simultaneous multithreading (SMT).

3.1.3 Public Cloud Vendors

For experiments in the cloud, some vendors are chosen. The VMs chosen for experimenting all have one vCPU; however, the size of memory and disk may vary. The vendors used and the specifications are listed in table 3.2. Even though the specifications are the same, some instances might be very heavily consolidated, meaning that many VMs are running together on the same PM. For all vendors, it is not possible to determine the exact location of the VM other than its geographical location, as they have different availability zones.

Figure 3.2: Research2 cache sizes, showing one of the server's eight NUMA nodes.

Vendor         Type            vCPUs   Memory (GiB)
Amazon AWS     t2.micro        1       1
Azure          Standard B1ls   1       0.5
Google         n1-standard-1   1       3.75
DigitalOcean   Starter         1       1

Table 3.2: Instance types used at the different cloud vendors.

3.2 Visualization

There are different tools used to make graphs and other visualizations. It’s important to clarify their primary function, and how to correctly interpret the image.

3.2.1 Flame Graph

A current problem in the IT industry is understanding how software consumes resources, notably CPU. What is consuming how much, and how does this differ from the last update? Software profilers like perf [31] are used to answer these questions. However, the results are often not visualized, so even when presented with all the data, it is hard to get something concrete out of it. The Flame Graph [32] is a visualization tool for profilers like perf that makes comprehension much faster and reduces the time needed for root cause analysis.

Flame Graphs are in .svg format, so the user can click on the different bars to see how much CPU the various tasks are consuming. The user is also able to mouse over a bar for information and to search in the graph. See figure 3.3 for an example of a flame graph for a script printing the Fibonacci numbers. The Flame Graph has the following characteristics:

Figure 3.3: Flame Graph of a script printing the Fibonacci numbers

• The stack trace is represented as a column of boxes, with each box representing a function.

• The y-axis shows the stack depth, ordered from the root at the bottom to the leaf at the top. The top box shows the function that was on-CPU when the stack trace was collected, and everything beneath that is its ancestry. The function beneath a function is its parent.

• The x-axis does not show the passage of time as many would expect, but instead indicates utilization, so the whole bar is equal to 100% CPU utilization.

• The width of each box shows the frequency at which that function was present in the stack.

• The color of the box has no significant meaning, but the randomness helps the eye differentiate boxes.

• The item at the top is the one running on the CPU; the items below it are the functions that called it.

The Flame Graph is used in this thesis to illustrate where the CPUs spend their time when running different experiments, and also to troubleshoot delay by making sure the program utilizes the expected resources.
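How the width of a box follows from the collected samples can be sketched as below. The sketch assumes input in the folded stack format used by the Flame Graph scripts, with one line per unique stack, the frames separated by semicolons from root to leaf, and a sample count at the end. The function name frame_widths and the sample data are hypothetical. As a simplification, the sketch aggregates per function name, whereas a real flame graph draws one box per function and ancestry.

```python
from collections import Counter


def frame_widths(folded_lines):
    """Map each function to the fraction of samples in which it is on the stack."""
    total = 0
    present = Counter()
    for line in folded_lines:
        # "main;fib;fib 75" -> stack "main;fib;fib", count "75"
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        # Count a function once per sample, even if it recurses.
        for func in set(stack.split(";")):
            present[func] += n
    return {func: n / total for func, n in present.items()}


if __name__ == "__main__":
    # Hypothetical samples from a script printing Fibonacci numbers.
    samples = ["main;fib 75", "main;print 25"]
    for func, width in sorted(frame_widths(samples).items()):
        print(f"{func}: {width:.0%} of the x-axis")
```

In the hypothetical samples, main is present in every stack and therefore spans the whole width, while fib covers three quarters of it.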

3.2.2 Rstudio - ggplot2

Almost all the diagrams presented in this thesis are made using the ggplot2 package in RStudio. ggplot2 is a popular graphics package for creating elegant and sophisticated plots, and it enables the user to plot univariate and multivariate numerical and categorical data in a straightforward manner.


3.3 T-test

As we want to quantify workload delay over time, different experiments for various environments are going to be executed. A t-test, which is a statistical hypothesis test, will be used to determine to what extent two sample means are the same. The two-sample location test, which is going to be used in this thesis, tests the null hypothesis that the means of the two populations are equal.
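A minimal sketch of the test statistic for a two-sample location test is shown below. It assumes Welch's variant, which does not require equal variances in the two samples; the thesis does not state which variant is used, so this choice, the function name welch_t and the sample values are assumptions for illustration. The p-value is not computed here, as the Python standard library has no t-distribution; a statistics package such as R would normally be used for that step.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's two-sample t-test.

    t  = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)
    df = Welch-Satterthwaite approximation of the degrees of freedom.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    se_a = variance(sample_a) / n_a  # squared standard error of mean_a
    se_b = variance(sample_b) / n_b  # squared standard error of mean_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical execution times in seconds from two environments.
    bare_metal = [41.9, 42.1, 42.0, 42.2, 41.8]
    consolidated = [43.0, 44.5, 42.8, 45.1, 43.9]
    t, df = welch_t(bare_metal, consolidated)
    print(f"t = {t:.2f}, df = {df:.1f}")
```

A large absolute value of t relative to the degrees of freedom speaks against the null hypothesis that the two environments have the same mean execution time.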

3.4 External Tools

Several tools are used throughout this thesis to make sure the experiments are executed as isolated as possible, in order to get the most valid results. For example, some tools make a process use a specific CPU, and monitoring that specific CPU is then of interest when measuring performance.

3.4.1 Lstopo

Lstopo is used for making a .png image of the underlying system. Lstopo gives information about the different cache sizes and also about which CPUs are part of the same NUMA node. With Lstopo, it is easy to determine if technologies like hyper-threading or SMT are enabled. See figures 3.1 and 3.2 for examples, where only the image of one NUMA node is included for the sake of simplicity. A complete lstopo output also gives information about the network interfaces and the disks connected.

3.4.2 Stress

Stress is a simple workload generator for Linux systems. It imposes a configurable amount of CPU stress on the system and is used to generate work on a computer very quickly. It is written in C and is, in essence, a loop that forks worker processes and waits for them to either complete or exit with an error. Stress will mainly be used for creating heavy loads, in order to later measure how many resources the task is given in a consolidated environment.

3.4.3 Numactl

Numactl is a tool for controlling the NUMA policy for processes or shared memory. The tool can also show the NUMA layout of the system and is, like taskset, able to set CPU affinities, but it can also decide which NUMA node a given process is going to execute on. Together with stress, it allows the user to execute a program on a chosen core and use the memory of an external NUMA node. With the tool, you can be sure the code is using memory where wanted. Sometimes the scheduler might use an external NUMA node when the local one is full, but sometimes the scheduler uses external memory for the sake of speed.

3.4.4 Isolcpu

In Linux, it is possible to isolate CPUs by removing them from the kernel load balancer. The isolation is easily configured in the kernel, and by isolating the CPUs, no user processes are scheduled on the chosen processors by default. The only processes that run there are those placed explicitly with taskset, or a VM that has been allowed to run on that specific core. When configured, no unwanted tasks get to execute on the chosen cores. Three CPUs on the DDOSLab2 server are isolated to make sure no other user process gets CPU time on them.

3.4.5 Taskset

Taskset is used to retrieve or set the CPU affinity of a running process given its PID, or to launch a new command on a given CPU. The scheduler will honor the given CPU affinity, and the process will not run on any other CPU. When isolating CPUs, taskset is the way to make sure a chosen process gets to execute on an isolated CPU.
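The same affinity mechanism is also available from Python on Linux through os.sched_getaffinity and os.sched_setaffinity. The sketch below is only an illustration of the concept and is not how the experiments in this thesis are started; the helper name pin_to_cpu is hypothetical.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Restrict a process (pid 0 means the caller) to a single CPU.

    This has the same effect as starting the program with `taskset -c <cpu>`.
    Returns the affinity set that is in effect afterwards.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    print("CPUs this process may run on:", sorted(allowed))
    target = min(allowed)
    print("after pinning:", sorted(pin_to_cpu(target)))
    # Restore the original affinity so later work is not restricted.
    os.sched_setaffinity(0, allowed)
```

The sketch pins the process to a CPU from its current affinity set, which always exists. An isolated CPU can be requested in the same way by passing its number explicitly.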

3.4.6 Perf

Perf is a performance analysis tool capable of statistical profiling of the entire system, in both user and kernel mode. The tool supports hardware performance counters, tracepoints, software performance counters, and dynamic probes. Perf is known to be the most commonly used performance counter profiling tool in Linux. With perf, anyone can easily see information about, for example, cache loads and cache misses. Perf enables the user to measure the performance of a single process, a CPU, or the entire system. See figure 3.4 for an example of the very detailed output of perf stat. Cache loads and misses, context switches, and page faults are some of the essential data that can be used for performance monitoring and validation.

Perf is vital because its output shows which resources a program uses when executing. When the output is compared with other outputs, patterns are revealed, and different process behavior in different environments might lead to a conclusion. For example, if a process gets delayed due to the cache when consolidated, perf can be used to verify this by comparing the cache misses and hits. Only the most crucial parts of the perf data are going to be used in this thesis, like icache and dcache loads and misses. If mentioned later in the thesis, they are discussed in the discussion section.


    20396.785527  task-clock (msec)          # 1.000 CPUs utilized
              33  context-switches           # 0.002 K/sec
               5  cpu-migrations             # 0.000 K/sec
          94,188  page-faults                # 0.005 M/sec
  48,601,408,006  cycles                     # 2.383 GHz (22.11%)
  22,800,010,807  stalled-cycles-frontend    # 46.91% frontend cycles idle (22.18%)
   4,518,568,163  stalled-cycles-backend     # 9.30% backend cycles idle (22.25%)
  64,362,004,504  instructions               # 1.32 insn per cycle
                                             # 0.35 stalled cycles per insn (27.77%)
   4,698,265,484  branches                   # 230.343 M/sec (27.92%)
       2,333,302  branch-misses              # 0.05% of all branches (27.82%)
  29,871,509,864  L1-dcache-loads            # 1464.520 M/sec (27.90%)
   1,439,534,641  L1-dcache-load-misses      # 4.82% of all L1-dcache hits (27.80%)
      58,310,817  LLC-loads                  # 2.859 M/sec (22.28%)
      33,721,990  LLC-load-misses            # 3.87% of all LL-cache hits (5.53%)
   1,683,602,248  L1-icache-loads            # 82.543 M/sec (11.14%)
       9,748,719  L1-icache-load-misses      (16.65%)
  43,720,850,145  dTLB-loads                 # 2143.517 M/sec (22.25%)
      17,855,144  dTLB-load-misses           # 0.04% of all dTLB cache hits (22.22%)
  64,711,904,777  iTLB-loads                 # 3172.652 M/sec (22.18%)
         684,502  iTLB-load-misses           # 0.00% of all iTLB cache hits (22.14%)
     877,244,933  L1-dcache-prefetches       # 43.009 M/sec (22.09%)
     920,731,939  L1-dcache-prefetch-misses  # 45.141 M/sec (22.06%)

    20.398666746  seconds time elapsed

Figure 3.4: Example of a detailed output measured by perf stat. The data are taken from running the STREAM benchmark with different block sizes.
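The derived values that perf prints after the # sign can be recomputed from the raw counters. The short sketch below does this for the numbers in figure 3.4; the helper name parse_count is illustrative.

```python
def parse_count(text):
    """Turn a perf counter such as '48,601,408,006' into an integer."""
    return int(text.replace(",", ""))


# Raw counters copied from figure 3.4.
cycles = parse_count("48,601,408,006")
instructions = parse_count("64,362,004,504")
stalled_frontend = parse_count("22,800,010,807")
l1d_loads = parse_count("29,871,509,864")
l1d_misses = parse_count("1,439,534,641")

ipc = instructions / cycles                # instructions per cycle
frontend_idle = stalled_frontend / cycles  # share of cycles stalled in the frontend
l1d_miss_rate = l1d_misses / l1d_loads     # share of L1 data cache loads that miss

print(f"insn per cycle:        {ipc:.2f}")
print(f"frontend cycles idle:  {frontend_idle:.2%}")
print(f"L1-dcache miss rate:   {l1d_miss_rate:.2%}")
```

Comparing such ratios between a bare-metal run and a consolidated run is one way to check whether a delay coincides with more cache misses. Counters that perf has scaled because of multiplexing, shown by the percentages in parentheses, are estimates and should be compared with some care.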


3.4.7 Bash Shell

Bash is a Unix shell and command language used in Linux, and most of the tests in this thesis are written in bash. Using bash allows us to execute code efficiently and to connect to a remote VM inside a script, together with other Unix commands. With the command language, many tests can be conducted in relatively few lines of code, and the results can be stored in a text file for later statistical use.

3.4.8 C and C++

The C programming language provides the user with low-level access to memory, maps efficiently to machine instructions, and requires minimal runtime support. To make sure the tests utilize the intended resources, programs in C have been developed for testing purposes. By using C when testing, for example, memory, the user may choose how much memory is used by the developed program and also modify the block sizes. When looking into the concept of delay, it is essential to make sure the tests behave as wanted; for this reason, C is preferred over high-level programming languages like, for example, Python.

The benchmarks created in this thesis are written in C and C++; the benchmarks themselves are presented later in this chapter. Using C and C++ for the benchmarks makes it possible to be more certain about what is going on inside the CPU or memory.

3.4.9 Python3

When programming in bash, the programs dump their data to text files. Python is used for converting the resulting data to a CSV file with the desired format for later use in R. As the tests are created in bash and executed on different hosts, there might in some cases be many files containing performance data that have to be compiled together. Python3 is used to compile these text files of statistical data. Scripts for comparing lines have been created, and in general all data in this thesis goes through some modification using Python to get the desired result.

3.5 KVM

Kernel-based Virtual Machine, or KVM [33] as it is most commonly referred to, is a virtualization solution for Linux running on x86 hardware with hardware virtualization extensions, and it enables the kernel to run as a type 1 hypervisor. KVM enables the user to run multiple virtual machines with unmodified Linux or Windows images. Each of the virtual machines has private virtualized hardware: a network card, disk, graphics adapter, etc.


Virsh, which works as a command line interface tool for administrating the VMs running on the host, enables the user to set CPU affinities and, for experimenting purposes, NUMA affinities. It works similarly to taskset and numactl (section 3.4.3), which are used when experimenting on bare metal.

For this thesis, KVM is used for administration of the VMs on both hosts. All the VMs and hosts mentioned in this project run the same OS, Ubuntu 18.04, to make sure the physical and virtual environments are the same when testing and experimenting.

3.5.1 Kernel Samepage Merging

Kernel Samepage Merging (KSM) is a memory-saving de-duplication feature in KVM [34]. The principle of the technology is to re-use identical pages, so that in practice different VMs can use the same memory. KSM might result in less memory being used for big programs with similar data and features. It works by detecting duplicate pages, merging them into a single page, and mapping that page into both original locations. If the data is modified, the kernel detects this and separates the pages again. KSM was introduced due to the high memory utilization when running multiple machines on the same PM. With KSM, users were able to run many more copies of the same OS on a relatively small amount of RAM on a host.

As benchmarks will be executed in consolidated environments, it is essential to know about KSM, because executing several identical memory benchmarks on the same host might lead to the benchmarks using the same pages of memory, which would make the results unrealistic.

3.6 CPU Heavy Benchmarks

Usually, standard benchmarks are used for testing performance. As this thesis for the most part monitors delay in seconds, scripts for that purpose have been developed instead. The scripts for testing CPU performance, both on bare metal and in a consolidated environment, have been created using C and assembly. This is to make sure the script only utilizes the CPU and not any other components of the machine.

3.6.1 AssemblyPercent

AssemblyPercent is the script used as the CPU benchmark. Compared to other benchmarking tools, which measure performance in MB/s, AssemblyPercent measures performance in seconds. The script is developed in C and assembly and is listed in appendix B.1. C and assembly were chosen in order to understand what is going on in the CPU and to ensure the integrity of the test, especially at the end, where conclusions about the CPU are going to be made. AssemblyPercent takes about 42 seconds on the DDOSLab2 server; the execution time varies with the hardware. Figure 3.5 shows the flame graph of the script. The graph shows that the script spends 100% of its CPU time in the function loop, which runs the assembly code from the appendix.

Figure 3.5: Flame Graph of the AssemblyPercent script.

3.7 Memory and Cache Benchmarks

Memory and cache benchmarks have been developed to measure performance as execution time. The presented benchmarks write blocks of different sizes to memory to measure the performance of the memory together with the cache. The size of the blocks is based on the cache sizes of the host on which the benchmarks are executed.

3.7.1 CPP-loop

The CPP-loop script in appendix B.2.1 accesses the memory NTIMES times with a block size of STREAM_ARRAY_SIZE. Two scripts have been developed, with the only difference being in line 29, where a different number is used to make sure the scripts do not use the same memory or cache when running together, due to KSM as mentioned in section 3.5.1. With these two scripts as templates, several executable programs have been made with different block sizes and numbers of iterations (STREAM_ARRAY_SIZE and NTIMES), in order to look at the impact of the cache. The flame graph of the script is presented in figure 3.6 and shows that the script utilizes 100% of the CPU when executing, even though it is a memory-heavy script. The memory performance is not presented in a flame graph; however, judging from the graph, the script behaves as expected.


Figure 3.6: Flame Graph of the CPP-loop script.

3.8 Experimental Approach

Before experimenting, it has to be ensured that the environment is as isolated as possible. There are always many user processes executing on a server, so isolated CPUs, taskset, and other tools need to be used to make the results as accurate as possible. Most of the external tools presented are used to make sure that processes and experiments are isolated and that the benchmarks execute where wanted.

The benchmarks presented in earlier sections will be executed on the different hosts in various environments with the help of bash scripts, and the results are saved in text files. As all the hosts used for experimenting run Ubuntu Linux, the same test can be executed on all the different hosts in the different environments. The scripts dump the results to a text file; Python then converts the gathered text files to the desired CSV format for statistical purposes.

The different environments are bare metal for comparison purposes, a single VM running on a server with all the resources available, and different variants of a consolidated environment where, for example, two VMs share the same resources, or where many VMs execute on separate CPUs but together utilize the whole host.

For experiments in the public cloud, no access to the underlying hypervisor is given, and the conclusions will be based on the standard deviation (SD). The SD is used because some vendors might have better hardware than others and than our local environment, so the best result in an experiment does not mean that one vendor is better than another, whereas the SD gives information about consistency. The same testing approach as in the local environment is used for the public cloud. The experiments will look at performance over time; therefore, long-running bash scripts are going to be developed.
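The reasoning behind using the SD can be sketched as follows. The vendor names and execution times are hypothetical, and the coefficient of variation (SD divided by the mean) is an addition made here for illustration, as it makes vendors with different baseline speeds easier to compare; it is not a metric defined in this thesis.

```python
from statistics import mean, stdev


def consistency(times):
    """Summarize repeated execution times (seconds) of the same benchmark."""
    m = mean(times)
    sd = stdev(times)  # sample standard deviation
    return {"mean": m, "sd": sd, "cv": sd / m}


if __name__ == "__main__":
    # Hypothetical measurements: vendor_x is faster on average,
    # but vendor_y delivers more consistent performance.
    runs = {
        "vendor_x": [30.0, 38.0, 31.0, 45.0, 36.0],
        "vendor_y": [42.0, 42.5, 41.8, 42.2, 42.0],
    }
    for vendor, times in runs.items():
        s = consistency(times)
        print(f"{vendor}: mean={s['mean']:.1f}s sd={s['sd']:.2f}s cv={s['cv']:.1%}")
```

A low SD over a long-running experiment indicates that the VM gets roughly the same amount of CPU every time, regardless of how fast the underlying hardware is.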


Chapter 4

Results

This chapter goes in depth on the subjects addressed in the previous sections and shows how the different metrics and technologies might affect the performance of a machine in a consolidated environment. Several experiments for testing on different environments are presented in this chapter.

4.1 Hyperthreading and Simultaneous Multithreading

With this technology enabled on most modern servers, its performance is the main topic of this section. Research2, with its AMD CPUs and Simultaneous Multithreading (SMT), is used for the experiments.

4.1.1 Quantify Execution Time for Simultaneous Multithreading

The goal of this experiment is to measure the performance of Research2 when heavily utilized. The server has 64 logical CPUs with Simultaneous Multithreading (SMT) enabled, meaning the OS sees each physical CPU as two available logical CPUs. The CPU-heavy benchmark is executed in the environment, and the execution time is measured. The process is not assigned to a specific CPU; the scheduler chooses where it runs.

Figure 4.1 visualizes the performance of the server. The y-axis shows the execution time in seconds, and the x-axis gives the number of processes running on X CPUs. The red dots display the mean of the experiments executing on the given number of CPUs. The blue line shows a regression line for the performance. Where X equals 32, the mean execution time for 32 experiments is around 40 seconds, whereas where X equals 40, the mean execution time for 40 experiments is 55 seconds. Figure 4.2 shows a visualization based on the same numbers as the previous graph. A boxplot is presented for each number of CPUs utilized to better illustrate the distribution.
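The per-count means and a regression line of this kind can be computed as in the sketch below. A straight-line ordinary least-squares fit is assumed here, since the text does not state which regression was used for the blue line, and the function names are hypothetical.

```python
from collections import defaultdict


def mean_per_count(samples):
    """Group (process_count, seconds) pairs and return {count: mean seconds}."""
    groups = defaultdict(list)
    for count, seconds in samples:
        groups[count].append(seconds)
    return {c: sum(v) / len(v) for c, v in groups.items()}


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Made-up measurements: (number of processes, execution time in seconds).
    samples = [(1, 10.0), (1, 12.0), (2, 13.0), (3, 15.0)]
    means = mean_per_count(samples)
    print(means)
    print(fit_line(sorted(means.items())))
```

Fitting on the per-count means corresponds to fitting through the red dots; fitting on all raw samples instead would weight heavily populated counts more strongly.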


Figure 4.1: Execution of processes on the Research2 server where Simultaneous Multithreading (SMT) is enabled.

The blue line again represents the regression line. In both figures, the left side of the red dotted vertical line shows the processes executing on available physical CPUs, and the right side shows the performance of the server when it starts to execute two processes on the same physical CPU, commonly referred to as core siblings.

By default, the scheduler does not start executing on sibling cores unless it has no available non-sibling cores left to use. The OS treats Research2's 32 physical cores as 64 logical CPUs. This behavior is confirmed by the fact that there is almost no delay when the number of tasks is less than or equal to the number of physical CPUs; this was also monitored while conducting the experiments.
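On Linux, the sibling relationship between logical CPUs can be inspected through sysfs. The sketch below parses the CPU-list format used by the `thread_siblings_list` file; the function names are hypothetical, and the example string `0,32` is only an illustration, not a statement about how Research2 numbers its CPUs.

```python
def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,32' or '0-3,8' into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:                      # a range such as '0-3'
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:                                # a single CPU number
            cpus.add(int(part))
    return sorted(cpus)


def siblings_of(cpu):
    """Return the logical CPUs sharing a physical core with the given CPU."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as handle:
        return parse_cpu_list(handle.read())


if __name__ == "__main__":
    print(parse_cpu_list("0,32"))
```

On a host with SMT enabled, `siblings_of(0)` would be expected to return two logical CPUs; with SMT disabled, it would return only CPU 0.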

See figure 4.3 for a comparison of the execution time of the process when running on the maximum number of physical and logical CPUs. The bottom boxplot represents the execution time when utilizing the maximum number of physical CPUs. The top boxplot illustrates utilizing all 64 available logical CPUs. The overall results are discussed more thoroughly in the discussion chapter.

The following experiments use an environment where HT and SMT are not present, to make sure the presented delay is not due to bad performance in, for example, hyperthreading.


Figure 4.2: Performance in time when running Simultaneous Multithreading on a CPU-heavy task. Data adapted from figure 4.1.

Figure 4.3: Comparison of utilizing the maximum number of physical CPUs and logical CPUs.
