Utilization of instrumentation data to improve distributed multimedia processing

(1)

UNIVERSITY OF OSLO Department of Informatics

Utilization of instrumentation data to improve distributed

multimedia processing

Master’s Thesis

Ståle B. Kristoffersen

(2)

(3)

List of Figures

2.1 Different kind of graphs . . . 10

2.1.1 Directed graph . . . 10

2.1.2 Undirected graph . . . 10

2.2 Example of a graph that is partitioned . . . 11

2.3 Speedup for various parts . . . 13

2.4 Code blocks . . . 14

2.5 CPU Hot-spots, illustrations is from [1] . . . 17

2.5.1 With different workloads . . . 17

2.5.2 With a single core application . . . 17

3.1 Overview of nodes in the P2G system. . . 31

3.2 Dependency graphs . . . 36

3.2.1 Intermediate implicit static dependency graph . . . 36

3.2.2 Final implicit static dependency graph . . . 36

3.3 Directed acyclic dynamically created dependency graph . . 40

4.1 Overview of the communication between nodes . . . 44

5.1 FSM for CPU mapping . . . 55

6.1 Timing . . . 67

6.1.1 Timing in a tight loop, using gettimeofday . . . 67

6.1.2 Timing in a tight loop, using clock_gettime . . . 67

6.1.3 Timing in a tight loop, using rdtsc . . . 67 6.1.4 Timing in a tight loop, using rdtsc with out serializing 67 v

(8)

6.2 Overview of theK-means clustering algorithm . . . 71

6.3 Benchmark ofK-means clustering algorithm . . . 72

6.4 Overview of the MJPEG encoding process . . . 73

6.5 Workload execution time . . . 75

6.5.1 Motion JPEG . . . 75

6.5.2 K-means . . . 75

(9)

Abstract

Instrumentation is ubiquitous in computer software today though its use in parallel processing frameworks is not widespread. In this thesis, we have developed an instrumentation framework, which we have integrated with the P2G framework. The instrumentation framework feeds a high and low level scheduler with detailed instrumentation data, while inducing a minimal of overhead. Our instrumentation framework also collects a wealth of information about the machine it runs on, including capabilities, enabling P2G to support specialized hardware. To demonstrate the feasibility of our framework, we have run a series of tests that shows promising results, both for the schedulers and developers seeking to locate a performance bottleneck, even though P2G, at the time, was not able to use this data for enhancing the decision making process

vii

(10)

(11)

Acknowledgments

Special thanks to my supervisors, Paul Beskow, Carsten Griwodz and Pål Halvorsen, for valuable help and guidance through the entire writing of this thesis. I would also like to express my gratitude to the rest of the P2G Team for creating an awesome environment to work in.

The work on this master thesis has been done at Simula Research laboratory at IT-Fornebu. Thanks the guys and girls at the lab, for keeping the moral up, and for providing a fun place to work.

I would also like to thank my roommates; Espen Haviken, for buying food for me when I could not afford it, Jonas Karstensen for saving the planet and Roman Florinski for being the household maid.

Finally, I would like to thank my family for the endless support they have given me my entire life.

Oslo, May 16th 2011 Ståle Kristoffersen

ix

(12)

(13)

Chapter 1 Introduction

Instrumentation is ubiquitous in computer software today; the operating system keeps statistics on each process currently executing, the download tab in your browser shows how fast a download goes and some games show you the current frame rate achieved. Such instrumentation data can be used in a number of ways. For example, schedulers use instrumentation data provided by the operating system to weight their choice when selecting the next task to run [2], i.e., how much CPU time a task demands, is used as feedback to the scheduling algorithm. By carefully selecting which tasks to run, the schedulers can attain less overhead or fairerscheduling, all depending on what the scheduler is trying to accomplish. Another example where instrumentation data is used is for billing purposes. Some Internet Service Providers (ISP) bill their customers based on network usage. Even though instrumentation is important, with so many possibilities, a general solution for all instrumentation has yet to surface, and may be impossible. We are therefore required to write custom solutions for most problems.

1

(14)

1.1 Background and motivation

Ever since the inception of programmable computers, both hardware and software have become increasingly complex. This increase in complexity is the result of the never ending quest for better performance. Already in 1965 it was noted that the number of transistors on one integrated circuit had doubled every two year from the invention of the integrated circuit in 1958, and it was predicted that this trend would continue. This prediction is what is now known as Moore’s law [3]. It has been proven to be remarkably accurate, to the degree that is has become a self-fulfilling prophecy, because it is used as a target for product development [4].

To keep up with Moore’s law, engineers have in the last years had to increase the number of CPU cores, because adding logic to one core got diminishing returns in terms of performance. While the increase in CPU cores in one machine does offers an increase in theoretical performance, it does not automatically provides a program written for only one CPU, more speed. So, as the hardware becomes more complex, the software has to follow suit. This leaves us with very complicated hardware and software designs, that contain numerous different parts that interact with each other in various undeterministic ways.

On the hardware front, the latest attempts to increase performance of single machines include the introduction of heterogeneous machines, where a specialized co-processor can perform some tasks at a very high speed. An example of this is seen in the current trend in decoding H.264 [5]

and other video codecs by offloading as much of the processing as possible to the Graphics Processing Unit (GPU) [6–8]. The GPU can perform many of the steps required to decode a video stream at a rate much higher than the general purpose CPU the GPU is connected to. Another popular heterogeneous architecture is the Cell Broadband Engine [9], first used in the Sony PlayStation 3. It has a general-purpose Power Architecture CPU

(15)

1.1. BACKGROUND AND MOTIVATION 3 and several RISC co-processors with 128-bit SIMD support [10].

On the software front, we are beginning to look into using more than a single machine to do the processing by distributing tasks between machines in a cluster. There are many frameworks written to utilizing a computer cluster for parallel processing, like MapReduce [11], Dryad [12]

and Nornir [13]. They all need a scheduler to decide where to run each workload, but the use of instrumentation data for scheduling in a cluster of computers is still a fairly new field of study. In Hadoop [14], an open- source implementation of MapReduce, the task scheduler assumes that all tasks progress linearly. While that holds true on an homogeneous cluster, it can severely impede performance in an heterogeneous cluster.

However, in [15], Zaharia et al design a new scheduling algorithm, Longest Approximate Time to End (LATE) that tries to work around the shortcoming of the default scheduler, by using instrumentation data to estimate time left for each running task, and using that as a feedback into the speculative task scheduler.

A less simple system for distributed computing, Condor [16], uses instrumentation data to decide whether or not a machine is idle. Condor does not otherwise make use of the instrumentation data for scheduling but it has another feature that is interesting; it has the ability for jobs to have certain requirements for the system it should be executed on, like operating system and hardware. This maps nicely to heterogeneous systems.

In an attempt to simplify writing and running multimedia processing on an heterogeneous computer cluster, a new framework, with a new way for expressing parallelism, is under development. It is called P2G, and as with several other frameworks, the P2G runtime consist of a central node, and several worker nodes. The central node partitions the workload, and then delegate parts to the worker nodes. Because of the way the low level scheduler on each machine can combine tasks, and the way different

(16)

implementation for a task, compiled for different architectures, can be scheduled by the central high level scheduler in P2G, we believe that it can greatly benefit from instrumentation data. We discuss P2G more in depth in chapter 3.

While the use of instrumentation data in distributed systems is not wide spread, schedulers have used instrumentation data, often called accounting data, to make decisions since the move from non-multitasking operating systems to preemptive scheduling [17]. Today, a variety of scheduling algorithms are used, but most are based on the same idea of using a multilevel feedback queue [18]. This approach has worked well on a single machine, since all of the accounting is done inside a context switch and keeps the overhead low.

1.2 Problem statement

Instrumentation is used successfully on single machine systems to not only help the scheduler, but also for many other tasks, like aiding a developer track down a performance bottleneck. We therefore want to research and develop an instrumentation framework for P2G, so that we can provide both the developers writing P2G programs and the P2G scheduler with useful and valid data. We investigate what kind of data we can provide and discuss what possible use the data might have. We look at how we should report missed deadlines, and investigate how to detect and report the different capabilities, load and status of a machine.

To help P2G avoid scheduling task onto machines that are overloaded or dying, we also look at ways to detect failing machines before they fail.

Our framework must not introduce much overhead, because that would render it less useful in a high performance setting.

(17)

1.3. RESEARCH METHOD 5

1.3 Research method

We chose to use what in [19] is called evolutionary prototyping, because we are unsure about the exact requirements of the final instrumentation framework. We researched, designed and implemented a working monitoring and instrumentation framework prototype. We then integrated the prototype with the P2G framework and focused on minimizing the overhead of collecting the data and the process of making data available to both the low level and high level schedulers and to the developers.

1.4 Contributions

During the master studies, we have published a demo poster to EuroSys 2011 [20], and submitted a paper to the 2011 International Conference on Parallel Processing (ICPP-2011) [21], which is currently pending review.

We have seen that instrumentation is important for a scheduler to make the correct decisions, and the goal of this thesis has been to design and implement an efficient framework with a low overhead for data gathering.

With this goal in mind, we have explored different ways to obtain timing information (Section 2.4), and created a proof-of-concept prototype (Chapter 5) proving the chosen method is viable. The instrumentation framework we have made is capable of collecting detailed system status, make precise measurements and reporting it back to to the master node, without adding too much overhead. We have also gained valuable insight into what data a scheduler could use.

(18)

1.5 Outline

Since the focus in this thesis is on providing information to a scheduler in a parallel processing framework, i.e., P2G, quite a lot of background material is needed. Both schedulers and parallel processing frameworks are quite complex, so in chapter 2, we give some background on data gathering, time sources, graphs and more. Chapter 3 introduces the P2G framework and explains how it differs from other parallel processing frameworks. It also explains the details of how P2G works.

In chapter 4, we explain the reasoning behind the design choices we have made with respect to our instrumentation daemon, and go into detail on how we set up the timers and why. Chapter 5 contains the specifics of the implemented instrumentation framework and shows how well it fits into the P2G framework.

In chapter 6 we evaluate our implementation. We go trough each data point gathered and discuss if the scheduler could make use of it in chapter 7.

Finally, we give conclusions and directions for future work in chapter 8.

(19)

Chapter 2 Background

In this chapter, we first introduce high performance clusters in section 2.1, and discuss the two methods used to split a workload to many machines.

We then go through the basics of graphs in section 2.2. In section 2.3, we explain how different kinds of instrumentation work. Later in this chapter, we go into detail on how timing information can be acquired on current Linux systems, and finally, we discuss related work.

2.1 High performance clusters and types of par- allelism

The usage of high performance clusters (HPC) is the only solution when a lot of processing power is needed. HPC is in use for many different purposes when applications are computationally bound, such as predicting the weather, and rendering high quality 3D graphics. In a HPC, several machines, each called a node, are connected by a network.

The nodes work together to complete a workload faster than any single node could be capable of.

7

(20)

The first step towards parallel execution, is to decompose the workload in to self contained parts. Such workload partitioning requires a certain degree of parallelism inherent to the problem, or, at least, a possibility of expressing computationally intensive parts of the problem as a paralleliz- able algorithm. There are two main approaches to split a workload so that each node can process the work in parallel.

The first approach, called data decomposition, or domain decomposition, obtain its parallelism from splitting the input data and have each node work on a different part. When execution with all parts of the input data has finished, the workload is done. Exploiting this kind of parallelity is relatively easy; a program must be written that can work on a subset of the data, and then a generic framework can split the input data and distribute the parts. The framework can then start the execution on each node as fast as each node get its own data. Even with a heterogeneous computer cluster it is fairly easy for the framework to balance the work, so that no node is idling, waiting for data. Since the nodes do not share any data or state with each other, the workload completes with the same result, regardless of scheduling order and without the possibility of a deadlock.

The second approach, task decomposition, or functional decomposition, describes an approach where the data processing is split into several different computational steps, which, in turn, could be assigned to different nodes. It is potentially harder to exploit task parallelisation like this, but with a heterogeneous cluster, some tasks may be better suited for certain types of nodes, and the gain by using task parallelisation correctly can be substantial.

When a workload has been decomposed, it can be scheduled onto different nodes to leverage their combined processing power.

(21)

2.2. SCHEDULING 9

2.2 Scheduling

The job of the scheduler in a HPC is to balance workloads across the nodes in order to achieve the highest throughput. Graphs and graph partitioning is used by the scheduler to decide where to schedule different jobs, since workloads can be represented as a graph, with vertices being nodes, and edges being communication demand between nodes. How much CPU time a given task demands, or how much data it have to communicate to other nodes is often not known in advance. We want to provide the scheduler with actual measured data. The scheduler can then use that data to weight the scheduling graph. We have a small introduction into graph theory and graph partitioning in the following sections.

2.2.1 Introduction to graphs

A graph consists of an ordered pairG = (V,E), whereVis a set of vertices and E is a set of edges. The edges are simply pairs of vertices, so every edge is connected to two vertices. We have two kinds of graphs, one where the vertices consist of unordered pairs, called an undirected graph, and one where the pairs are ordered, called a directed graph. The difference is illustrated in figure 2.2.1.

In an undirected graph, there is no direction associated with a vertex. This is illustrated in figure 2.1.2. In an undirected graph, each vertex of an edge must be different, i.e., a vertex can not loop back on the same edge it originates from, so an edge like e₉ in figure 2.1.1 is not possible. In a directed graph, as shown in figure 2.1.1, you can see that the edges have a direction. An undirected graph can be changed into a directed graph by change each edge into two edges in opposing directions, like edgee₄and e₈in figure 2.1.1.

(22)

v₁

v₃

v₂

v₆ v₅

v₄ e₁

e₃

e₄

e₂

e₅

e₉ e₈

e₆ e₇

2.1.1: Directed graph

v₁

v₃

v₂

v₆ v₅

v₄ e₁

e₃

e₄ e₂

e₅

e₆ e₇

2.1.2: Undirected graph

Figure 2.1: Different kind of graphs

A weighted graph is a normal graph, where weight is assigned to each edge, vertex or both. To split the graph into several different partitions is called partitioning the graph, which we discuss further in the following section.

2.2.2 Graph partitioning and the need for instrumentation

Graph partitioning is used to split a graph into two or more partitions.

Various criteria can be used for how the graph should be split. In our case, a high level scheduler might have a graph of a workload, weights assigned to each of the vertices, obtained through instrumentation. If different nodes communicate, the edges connecting the vertices can be weighted

(23)

2.2. SCHEDULING 11

V1 100

V2 100

V3 100

V4 50

V5 50 10

100

10

100 20 100

Figure 2.2: Example of a graph that is partitioned

by using instrumentation data for how much data is transmitted on each edge. From that graph, the scheduler can then partition the graph into K-partitions, whereKis the number of nodes available in the HPC.

Depending on what the scheduler wants to achieve, it can partition the graph with a given criteria, for example, it can try to make each partition contain about the same weight of vertices. In figure 2.2, we can see a graph that has been partitioned into two parts, one red and one blue, both containing an equal weight of the vertics. The partition also has minimized the weight of the edges that leave or enter each partition. The same would a scheduler do to keep task that transfer a lot of data to each other on the same node in order to avoid transferring it over the network.

We do not go into how the different graph partitioning algorithms work, but interested readers may want to read more in [22].

(24)

2.3 Instrumentation

So far we have mentioned instrumentation, but not gone into any detail.

We now discuss different kind of instrumentation and what they can be used for. Instrumentation can be of help for understanding the dynamic behavior of an application, both for the programmer, who wishes to improve his application, and for a scheduler that tries to find an optimal execution plan. Other uses include monitoring performance and adapt the code path taken, based on load, and admission control, that only allows new tasks to be scheduled if there is enough free capacity.

There are several different types of instrumentation; Profilers record a variety of metricses for a program, tracers record which code path are followed in a program, and timers record how long a section of code take to complete. We now explore these in detail.

2.3.1 Profiling

There are several types of program profilers; some profile memory usage, and other metricses, but we focus on those that profile by counting how many times each basic block of code is executed, called a flat profiler. An example of a profiler that can output a flat profile is GNU Gprof [23].

Profilers are a helpful tool for the programmer. It can help to determine where the optimization efforts should be focused. A common adage of software development is that 80% of the time is spent in 20% of the code [24]. Profiling can then identify where optimization is most useful.

It is of little use to optimize code that is executed rarely, but of much value to speed up code that is executed often. figure 2.3, even a minor improvement in code that takes up most of the processing time is better than a big improvement in code that constitutes only a small portion of the processing time. Assume a program consisting of two separate parts,

(25)

2.3. INSTRUMENTATION 13 A and B. When A is optimized so it runs 10 times faster, the whole program still takes more time to complete than if B is optimized so it runs twice as fast.

Two separate code parts

A B

Original process Speed up by 10x

by 2x Speed up

A

B

Figure 2.3: Speedup for various parts

Just in time (JIT) compiling uses profiling to estimate execution times for optimization. If a code block is executed frequently, the gain obtained by compiling that block into native, efficient code can be larger than the time lost by actually compiling it [25].

An accurate and straightforward way of doing profiling, is to insert code at the start of every code block. The code inserted at each place will have a counter assigned to it, which it increases each time it is executed. This can introduce significant overhead, because the inserted code is also executed on each branching of the original code. Alternatives have been developed, which can eliminate the recording at many of the branches, by careful analysis of the original code. If a code path that is profiled has a branch, and each arm of the branch ends up in the same position, only one of the branches needs to be recorded, since the number of times the code has gone down the other branch can be calculated by the total number of times the execution has gone into the code before the branch, minus the times it has taken the branch with the profiling code. This reduces the instrumentation overhead, while still obtaining a full profile for every block of code [25].

Take figure 2.4; It has five code blocks, connected in a simple layout. If this is the entire program, we could get away with three counters, one in

(26)

A B

C

D

E

Figure 2.4: Code blocks

B, C and D, and still know how many times each block of code has been executed.

Another instrumentation approach is interrupt-based: a program’s execution is interrupted at periodic intervals and the program counter is recorded. Over time, the numbers average out and should represent an accurate view of where the program is spending time. We discuss how a machine can get a periodical interrupt in section 2.4.

2.3.2 Tracing

Program tracing counts how many times each block is executed, and in what sequence they are called. GNU Plot, as discussed earlier, also support to output a call graph, showing how each code path was reached, and how many times it was reached that way. This is useful in code path analysis, which is used to audit code for possible errors. For example, the Linux kernel has a built in tracing framework, Linux Trace Toolkit next generation (LTTng) [26]. LTTng is designed to be used to debug problems that show up rarely, so it needs to be present in production code without being enabled. It works by inserting special probes in the code before

(27)

2.3. INSTRUMENTATION 15 compilation. Each probe has a very low overhead in the normal case when it is not enabled. Once a situation arises that needs to be monitored, the probes can be enabled at run time and tracing information can be collected.

2.3.3 Timing

There are two ways to time a program. We can either time the execution of the whole program under one, or we can time specific code parts separately. The first is possible under Linux by invoking the time [27]

command. The output oftimegives us three different numbers. The first number is the time the program took to execute, it would be equal to the time a user would have to wait for it to complete. The second number is the amount of CPU time the program used, it could amount to more than the first number. The last number, is the CPU time spent in the kernel on behalf of the program.

Thetimecommand is only useful if we are interested in the execution time of the entire program. If the program runs indefinitely it is apparent that timedoes not work. We then have to resort to inserting timing code into the program around those pieces of code we are interested in. This would usually be a function or an inner loop to measure just how much time is spent in that particular location. The timing of one or more small, specific pieces of code is called amicrobenchmark, and usually tells us how fast that piece of code runs.

Another way to obtain timing information during the execution of a program, is to callgetrusage[28]. It returns various statistics on either the calling process, all of its children, or the calling thread.

(28)

2.3.4 Early failure warning

Given a computational cluster of a certain size, we are almost guaranteed that there are malfunctioning components within the cluster at any given time. An investigation performed for the Internet Archive [29] on failures of hardware in their cluster showed failure rate as high as 2% for hard disk drives (HDD), and 2.54% [30] for motherboards, CPU, memory, etc.

combined. Other studies have shown failure rates for HDDs as high as 4–8% annually [31, 32].

When a machine fails, its behavior can be undefined, it is therefore of interest to try to take machines out of service before they break down.

There are several approaches to predict failures in different hardware [33].

We here discuss two common strategies, the S.M.A.R.T. data provided by HDDs, and the CPU temperatures provided by temperature sensors inside CPUs.

S.M.A.R.T. data

Self-Monitoring, Analysis, and Reporting Technology (SMART) is a system for monitoring the health of HDDs built into most new HDDs. It monitors several key parameters of the HDD, and tries to predict failures.

Google observed over 100,000 HDDs and analyzed the data returned by those HDDs that did fail and by those that did not fail. They found some correlation between the SMART data returned and failures. A problem with SMART data is that the values returned are not standardized.

Core temperature

New CPUs contain not one, but several temperature sensors. Those additional temperature sensors are added because we want to know how

(29)

2.3. INSTRUMENTATION 17

2.5.1: With different workloads 2.5.2: With a single core application

Figure 2.5: CPU Hot-spots, illustrations is from [1]

warm the hottest spot on the CPU is. For example, if the workload stresses the floating point unit (FPU), that is where the CPU is the warmest. As we can see from figure 2.5, the hot-spot of the die varies a lot depending on the workload. Each temperature sensor is checked and the warmest temperature sensor for each core is selected and that is the temperature that is returned when queried in the legacy way [1].

Many modern CPUs monitor the temperature sensors themselves, and throttle back the clock speed if a certain thermal threshold is exceeded.

Some even go into a complete shutdown if the temperature increases enough.

(30)

2.4 Time sources

Modern computers have implemented several methods of acquiring timing information from the system. All current timer sources have one or more problems, they are either slow or not very accurate, and in some cases even both. Certain time sources, like the time stamp counter on many AMD processors, do not even guarantee that two successive calls return monotonically increasing timestamps [34], the last call might return a time stamp that is earlier than the time stamp returned by the second call.

A programmer must be aware of such idiosyncrasies to choose a reliable and sufficiently accurate timing strategy. We now describe some of the methods, and their strengths and weaknesses.

2.4.1 Hardware timers

Hardware timers require dedicated circuitry to operate. They are usually based on a crystal oscillator, and either contain a register that can be read by the CPU or output an interrupt periodically. Different hardware timers have been implemented with various goals in mind. Hardware timers are used as a backend for all software timers. In this section we introduce the most common hardware timers available on modern computers. All of the following timers are in use today.

Intel 8253

When the IBM PC was introduced in 1981, it contained an Intel 8253 Programmable Interval Timer (PIT). While all modern IBM compatible computers contain an Intel 8253, it is no longer on a separate chip on the motherboard, since it has been integrated into the south bridge chipset.

It has three channels, each implemented as a 16-bit counter that counts

(31)

2.4. TIME SOURCES 19 down to zero. Each channel can be in one of six modes, and depending on the mode configured, it can be used for different things. Channel 0 is connected to IRQ 0, and channel 2 is connected to the PC-speaker. Channel 1 is not always present, and if present, it is not very useful because it was used to refresh the Dynamic Random Access Memory (DRAM) on early machines featuring the chip, and that functionality is not needed any more.

Access to the PIT is through four fixed I/O Ports. While relatively simple to use, it does take 3 microseconds to read it [35]. It is possible to use it as an aperiodic timer, but since it is slow to program a new timeout value, it is only used as a way to generate a periodic clock interrupt on systems without other suitable timers. The PIT is also used for tone generation on the PC-speaker.

Real Time Clock

A real time clock (RTC) was added to the IBM PC in 1984, as a way to keep track of the clock even when the system was not connected to the mains. It has a small battery to keep the clock running when the rest of the system is without power. As with the Intel 8253, the RTC is now integrated into the south bridge chipset. It can also be used to generate periodic timers, to free the PIT for aperiodic tasks. However, this is not what it was designed for and it has proven to be unreliable [36] and slow [37].

The interrupt generated by the RTC is on IRQ 8, which is of lower priority than every IRQ with a lower value. Linux only read the value at boot and after resuming from a low power state [38].

(32)

Time Stamp Counter

Every Intel x86 CPU since the Pentium has had a 64-bit register called TSC. The TSC register is incremented on every tick, and is initialized to 0 when the CPU is reset. When introduced in 1993 it was very well suited for achieving fast and accurate timing due to the fact that the tick-rate was known because it was tied to the CPU clock rate. It is low overhead since all that was needed to read it was to run one instruction, RDTSC (opcode 0F 31). When executed the result would be returned in the two 32-bit registers EDX and EAX [39]. The TSC has no way to generate an interrupt, so it can only be polled for time.

Since the introduction of multi-core CPUs and CPUs with different levels of power saving modes, some obstacles have been introduced that make it less desirable to use it as a time source.

On some multi-core CPUs, not all cores tick at the same rate. This means that over time the values diverge. This can create problems for programs that use the TSC to measure the elapsed time. If the program reads the TSC while running on one of the cores, and is then later scheduled onto another core, where it again reads the TSC, it can appear that the time has gone backwards. This is especially true for AMD CPUs [34], where AMD has released a windows driver that tries to avoid the problem by synchronizing the TSC by periodically adjusting them so they are in sync [40]. This mitigates the problem, but allows for the possibility of having the TSC value go backwards for successive reads.

When some power saving modes are activated the tick rate of the CPU might be affected, skewing the results if it was calibrated when the tick rate was different. So unless great care is taken, this is a potential source of error. On new Intel CPUs the TSC is incremented at a constant speed regardless of the current clock rate.

(33)

2.4. TIME SOURCES 21 The Intel Pentium Pro introduced out-of-order execution, that removes the guarantee that instructions are executed in order. This means the CPU is free to reorder the instructions, which means you no longer know what you are timing. To work around this the CPU must be told to finish all previous instructions before continuing. It is accomplished by executing a serializing instruction before RDTSC. This slows down the execution, but it is still very fast [35].

Another problem encountered with the TSC is that not all x86 clones have implemented the RDTSC instruction, and even if the CPU supports it, the OS can disable it. This comes in addition to the problems described earlier, and makes the TSC unsuited as a general method of timing.

Since the rate at which the TSC increases is unknown, it can not be used as a source to calculate elapsed time, without first calibrating it by using a known timer. To calibrate it, we read the TSC at a known interval, and the tick rate can then be calculated. This only give us an approximation, that can not be more accurate than the clock used as an reference.

To summarize; The TSC, used correctly on the correct hardware, can work as expected. However, its use as a general method of timing is discouraged [41].

Local APIC Timer

The advanced programmable interrupt controller (APIC) was introduced to solve the problem for how to serve interrupts efficient on multiproces- sor systems. Each CPU has its own local APIC (LAPIC). It contains a 32-bit counter and counter input register. The speed of the counter is not known, but is usually the same as the processor’s front side bus (FSB) [42].

Several implementations of the APIC timers are buggy [43] [37]. It also suffers from the same problem the TSC has, in that the speed the counter

(34)

is increasing is not known. To use it, we must follow the same calibration technique as the TSC.

PM Timer

The PM timer is also known as the ACPI timer. According to the ACPI specifications [44], it is required to be present on any ACPI compatible systems. It can either have a 24 bit or a 32 bit counter, and the counter is increased at a frequency three times that of the PIT. Reading from the PM timer is also fast, only 0.7 microseconds [35]. It can be configured to raise an interrupt when the high bit changes. However, the counter value cannot be set, and when reached its maximum value it overflows and start at zero again, so it is not very flexible. The PM timer keeps running even in some of the power saving modes where some of the other timers stop or slow down, and can therefore be more reliable than other timers.

High Precision Event Timer

Intel and Microsoft developed the high precision event timer (HPET), and it was presented in Intel ICH8 chipset. It was introduced because the other available timers had flaws which made them undesirable to use. The HPET counters run at a minimum of 10 MHz, and each chip has several 32 and 64 bit counters. Each counter can have several registers associated with it, and when the value in the counter matches the value stored in one of its registers an interrupt is raised.

Access is almost as fast as the PM Timer, at 0.9 microseconds [35]. One problem with the HPET timer is that the registers for a timer only trigger when the counter is an exact match, and since there might be a large enough delay from reading the current value, to the new register value has been calculated and written back, that the counter might have passed

(35)

2.4. TIME SOURCES 23 that number. This results in the timer not triggering before the counter has rolled over and started again, which could be a long time.

2.4.2 Software timers under Linux

We focus on software timers in Linux because that is the operating system we are using for our implementation. Other operating systems have different but similar interfaces to the timers, so much of the information here may be applied to other operating systems.

When it comes to timers, the Linux kernel has two main tasks it must accomplish. The first is to keep track of the current time and make it available through various APIs discussed later, and the other is to have a framework that allows both the kernel and user space applications to sleep for a given interval [45]. We focus on the first part, since that is what we use in our instrumentation framework later.

Software timers are the way the OS exposes the hardware timers to the applications. Some of the hardware timers can be interfaced directly from software, but even then they usually require special privileges. The TSC is a notable exception and is usually readable from user space software running as an unprivileged user.

Linux used to be based on ticks, where it scheduled a periodic timer to trigger at a given interval. Typical values have been 100, 250, 512, 1000 and 1024 times per second for different kernels. One such interrupt is called a tick. On each new tick the value of the tick-counter is increased, and since the time since the last tick was a fixed value, the kernel could increase the system time value by the same amount. When the timer interrupt occurs the current value of TSC is also saved for later use in calculation of inter- tick time. If the kernel had 100 ticks per second the system time increased at 10ms intervals.

(36)

When ticks are used and gettimeofday is called, the kernel reads the TSC again, calculated the time from the last tick and add that to the time saved as system time at the last tick. This method is error prone, because there are several race conditions and the possibilities of missed ticks made the time drift. These problems are also exacerbated when running under virtualization [42].

On newer Linux kernels, the entire timing subsystem is rewritten. The new system use something they call clocksource abstraction. In the new system, the time is calculated from scratched and returned when it is asked for and not updated during a tick. This means that kernels using the clocksource abstraction can run tick-less, i.e., they do not need to have a periodic interrupt configured.

Tick-less kernels have several advantages. They are immune to problems with lost ticks, since there are no ticks to be lost. This is a huge gain when running under virtualization, removing the need for complex logic in the hypervisor to compensate for lost ticks. Tickless kernels also eliminate a lot of unnecessary waking up, where all that is done is to update a few timers.

time

The time system call returns the number of seconds since the Epoch (00:00:00 UTC, January 1, 1970) [46]. Its resolution is in seconds and it is useless for all but the most coarse instrumentation.

gettimeofday

Gettimeofday returns a data structure containing the number of seconds since the Epoch, just like the time system call, but it also contains the

(37)

2.4. TIME SOURCES 25 number of microseconds in the current second [47]. A problem when using the gettimeofday system call for measuring time is that it is affected by changes to the system clock. If the clock is adjusted, the time returned from a call before the adjustment and a time returned after the adjustment cannot be compared. Even with that problem, gettimeofday is widely used in a lot of software.

clock_gettime

The clock_gettime system call takes a clock-id as a parameter, that way the program can choose between multiple clocks. Recent versions of Linux have a clock-id called CLOCK_MONOTONIC_RAW which gives access to raw hardware-based time that is unaffected by changes to the system clock like NTP adjustments [48]. Because it is guaranteed to increase monotonically, linearly, and is unaffected by adjustments to the system clock, CLOCK_MONOTONIC_RAW is the preferred clock-id for timing uses. Unfortunately it can not be used to tell the current time, or use in conjunction with timeout values used like in select [49].

2.4.3 Distributed time

In a machine cluster it is desirable to have a common clock on all machines.

Since that is not usually possible we are left with trying to synchronize the clocks. This is achieved by using NTP to synchronize clocks over the local network and even over the Internet. It can compensate for variable latency to the time server, and if the time server is on the local network, accuracies down to tens of microseconds can be achieved [50]. To increase accuracy, parts of the NTP clock phase-locked loop is inside the Linux kernel, running in kernel space.

(38)

Another way to distribute the clock is to use a GPS connected to each machine. Embedded in the GPS signal is a very accurate clock stream, which enables the machines to be synchronized with an maximum error of around 10 microseconds [51].

Since the local clocks on all machines drift, and in what direction and by how much is different on every machine, a daemon is usually run to continuously adjust the time to keep it in sync. Even with such a daemon running it is not advisable to rely in the distributed time being synchronized any better than to the same second.

2.5 Parallel processing frameworks

To implement instrumentation in a parallel processing framework is not a new idea. In the Nornir [13] run-time system for parallel programs, Vrba et al. implemented something they called accounting, where they could record various performance related values, like CPU time used by each process, number of context-switches, number of loop iterations while waiting to acquire a spinlock, and more. They found that this added around 0.72 microseconds in overhead, most of it from two system calls for obtaining data on per-process CPU time. However, they did not use this for any scheduling decisions, and it was mainly used to give data to the programmer about where bottlenecks are.

In Dryad [12], Isard et al. have implemented amanager. Themanagercan detect if some parts of the job is finishing slower than comparable parts. It can then spawn a duplicate job to make sure one slow computer does not slow the whole job down. This behavior is similar to MapReduce’sbackup task.

(39)

2.6. SUMMARY 27

2.6 Summary

In this chapter we have introduced HPC and how scheduling work. We have also explained how they can use a weighted graph to schedule more efficiently. We then made the argument that we can use instrumentation data as an input to the weighting of nodes in the graph. We gave a thorough introduction to various implementation of timers, both in hardware and in software. Finally we looked at other parallel processing frameworks that use instrumentation.

In the next chapter we explain the P2G framework, and why it was created. We then explain how it can benefit from instrumentation data.

(40)

(41)

Chapter 3 P2G

In the previous chapter we introduced a lot of background information, we now show how it fits in with P2G, a framework for distributed real-time processing of multimedia data. We start by explaining the motivation to build such a framework. Then we show an example workload implemented in P2G, and explain how it would be executed.

Last, we go into details about the inner workings of P2G.

3.1 Background and motivation

In recent years, it has become evident that the future development in performance of general-purpose processors will come from concurrent processing [52]. For years, the improvements in execution speed of single- threaded applications were chiefly due to ever increasing clock speeds, cache sizes and the efficiency of instruction level optimization. Around 2002, increases in clock speed stopped. You could buy a 3.06 GHz Intel Pentium 4 processor in late 2002 [53], and Intel had planned to release a 4 GHz Pentium 4 but later abandoned that plan due to transistor leakage 29

(42)

and power density [54]. Now, 9 years later, Intel’s fastest CPU in term of clock speed has still not reached 4 GHz, instead Intel and other CPU manufactures have moved to increase the number of cores in a CPU. The trend is such that even mobile phones are equipped with multi core CPUs.

We are now at a place in time when parallel processing has taken over as the main strategy for speed improvements. While multi-core CPUs offer more theoretical speed, a sequential program has to be re-written to use more than one core to exploit the possible concurrency.

Transitioning from a single to multi-threaded application is complicated and often requires domain specific knowledge of the hardware it runs on, to take full advantage of the computational capacity available. To ease this work, several frameworks have been introduced, such as Microsoft’s Dryad [12] and Google’s MapReduce [11].

As discussed in section 2.1, there are two main axis of expressing parallelism, and different frameworks usually only use one of the methods.

For example, MapReduce uses a data parallel model, where each machine runs the same task on different parts of the data. P2G tries to improve the situation by supporting to express both data and task parallelism in a new way that is well suited for multimedia processing.

3.2 Overview

P2G is split into one master node, and an arbitrary number of execution nodes, as shown in figure 3.1. The master node contains the High Level Scheduler (HLS), Instrumentation manager and the communication manager. The HLS dispatches work to the execution nodes and the instrumentation manager gathers instrumentation data. Using the instrumentation data, the HLS optimizes where, and how, work is dispatched.

(43)

3.2. OVERVIEW 31

Figure 3.1: Overview of nodes in the P2G system.

Execution nodes can join the cluster at any time, and the HLS dynamically distributes the workload onto all available execution nodes. Each execution node contains a Low Level Scheduler (LLS), Instrumentation Dae- mon, Storage Manager and a Communication Manager. Instrumentation data that the instrumentation daemon collects is provided to the HLS via the network and to the local LLS. The LLS can, using instrumentation data to guide it, combine multiple tasks into a single execution unit, a batch, in order to reduce overhead. This is explained more in detail in section 3.2.10

A vital concept in P2G is the virtual fields, which are used to store data between each step in the processing. In reality, the virtual field is represented as a memory area on each machine that uses that virtual field, limited to the ages and indexes that the machine work on. Even when a virtual field is represented in memory, it is not guaranteed to be continuous. When more than one machine share the same virtual field, the store manager has to transport data from the machine that writes to the virtual field, to all the machines that read from the same part of the virtual field.

(44)

3.2.1 Kernel language

The kernel language developed for P2G defines how a programmer interacts with P2G, and expresses the parallelity of the program. A P2G Program consists of an arbitrary number of field declarations, and code for an arbitrary number of kernels. The field declaration defines which fields the kernels works on, and the kernel code does the work on the data in the fields.

The kernel interacts with the P2G fields through fetch and store commands. Depending on how the fetch and store commands are written, one kernel might be executed once per age, or for each index for each age.

3.2.2 Kernel code

The kernel code is the P2G code the programmer writes for their program.

Each kernel consists of sequential pieces of code, that work on data stored in the virtual fields. The goal is to have each kernel express as much of an decomposition as possible, so the scheduler can combine them as it sees fit, based on instrumentation data for how long a kernel takes to execute. The code for describing a kernel is currently implemented in a C-like language that expose many of P2G’s central concepts. In the next section we see an example workload written in kernel code.

3.2.3 Code example

Here is an example workload consisting of four kernels. When run the printkernel writes out {10, 11, 12, 13, 14}, {20, 21, 22, 23, 24} for the first age, and then {25, 27, 29, 31, 33}, {50, 54, 58, 62, 66} for the second age.

(45)

3.2. OVERVIEW 33 Since there is no termination condition in this workload, it continues to run and print increasing values indefinitely.

All of the following listings are usually contained in one P2G program.

field declaration

Listing 3.1: Field declaration

int32[] m_data age;

int32[] p_data age;

The field declaration defines two global fields, m_data and p_data.

Depending on how the fields are declared, the fields can be write once constants, or multi aged, multi dimensional arrays. Fields are discussed further in section 3.2.5.

init

Listing 3.2: Kernel code for init

init:

local int32[] values;

%{

for(int i = 0; i < 5; ++i) {

put( values, i+10, i);

}

%}

store m_data(0) = values;

(46)

The init kernel initializes an array with five values from 10 to 14, and stores the values inm_datafor age 0. It has no fetch statements, and it can therefore be scheduled at start, since no fetch statements means it does not have any data dependencies that has to be met in order for it to run.

mul2

Listing 3.3: Kernel code for mul2

mul2:

age a;

index x;

local int32 value;

fetch value = m_data(a)[x];

%{

value *= 2;

%}

store p_data(a)[x] = value;

Themul2kernel fetches a single value fromm_databecause of index- variable x, and multiplies it by 2. It then saves the value in p_data in the same location, and same age. Because of its fetch statement, it has a data dependency that is not met when P2G first starts up, and it must wait forinitto run first and fillm_datafirst, with the position given byxand a.

plus5

Listing 3.4: Kernel code for plus5 plus5:

(47)

3.2. OVERVIEW 35

age a;

index x;

local int32 value;

fetch value = p_data(a)[x];

%{

value += 5;

%}

store m_data(a+1)[x] = value;

Theplus5kernel reads a single value fromp_data, that themul2kernel has put there, and adds 5 to it, it then stores that value back inm_data in the same location, but next age.

print

Listing 3.5: Kernel code for print

print:

age a;

local int32[] p, m;

fetch p = p_data(a);

fetch m = m_data(a);

%{

for(int i = 0;i < extent(p,0); ++i) {

cout << "p: " << get(p, i);

cout << "m: " << get(m, i);

}

%}

The print kernel prints out all p and m values for the current age.

After the initkernel has run, half of its data dependencies are met,

(48)

but that is not enough. It has to wait until the mul2kernel has run in order for all its data dependencies to be met. The reason the print kernel fetches all the values for one age at the same time is that it lacks thexvariable in the fetch statements.

3.2.1: Intermediate implicit static dependency graph

3.2.2: Final implicit static dependency graph

Figure 3.2: Dependency graphs

From the kernel code, P2G can build an implicit static dependency graph, see figure 3.2.1. This is because thestoreandfetchstatements return fields and kernels gives the relationship between kernel definitions, were edges are represented bystoreandfetchstatements on fields. Since the fields are virtual, the HLS can merge edges connecting two kernels through a field, producing the final implicit static dependency graph seen in figure 3.2.2.

3.2.4 Example walk through

When we start the execution of this program in P2G the only kernel that can run is init, since the data dependencies for the rest of the kernels are not met yet for any age. After the init-kernel starts to run, the data dependencies for some instances of mul2 gets met and they can be scheduled to run.

(49)

3.2. OVERVIEW 37 For each instance ofmul2that is run, a dependency for an instance ofplus5 is met, and that instance ofplus5can be scheduled. Since print needs an age to be complete before it can run, all instances of bothmul2and plus5 for a given age must be completed before print for that same age can be run.

Plus5needs the output from mul2, but writes to the input of mul2for the next age. Plus5and mul2therefore form a loop, which is easily identified in figure 3.2.2.

3.2.5 Field

In P2G, kernels fetch data from fields, which they perform operations on.

Each field can be looked at as a global multi dimensional array, where age is one dimension, and an arbitrary number of dimensions is available for use with indexes. Fields are write-once, which means that if you have a kernel that applies a filter to all pixels in an array, it can not just write back the processed data to the same index. In the code example we had two fields, m_data and p_data. After the initkernel wrote to m_data(0), it can not be written again for the same age. Since multimedia algorithms often write back to the same data structure it reads from P2G need to support that. The way P2G lets a kernel do that is by introducing theAgeconcept, which is illustrated in figure 3.3.

3.2.6 Age

Age is introduced as a way to get around the write-once semantic of fields.

To write to the same index in a field you need to increase the age of that field. For example, when the plus5 kernel has added 5 to the value it fetches for the first time it can not write it back tom_datafor the same age,

(50)

because theinitkernel has already written to it. Inplus5, that is solved by writing to the next age which for the first invocation would be 1. Ages makes it possible to create loops between kernels like the one created by mul2andplus5, see figure 3.2.2.

3.2.7 Kernel definition

A kernel definition consists of local variable declarations, fetch and store statements, and the kernel code. The kernel code can embed native code, and that code can be as complex as necessary.

3.2.8 Kernel instance

An instance of a kernel definition is for examplemul2with x =2 anda = 30. Each kernel definition can lead to many kernel instances. The number of instances being executed depends on thefetch, andstorestatements in the kernel code. In our code example themul2kernel had a fetch statement that fetched one item from one age. This makes it possible to have 5 kernel instances of mul2 per age. However, the LLS is free to combine several kernel instances into one batch job, to limit the overhead of processing each fetch statement and the overhead of timing each kernel as seen in figure 3.2.2.

To provide feedback to the LLS about the overhead for each kernel instance we need to insert instrumentation code before and after each execution of a kernel instance.

(51)

3.2. OVERVIEW 39

3.2.9 Fetch and store

A fetch command can appear before the code section of a kernel. In listing 3.1, first the age and index variables are defined, and then a local 32bit variable is declared, called value. Then next, right before the code part, we have the fetch statement. It fetches only a single value from the current age a, and indexx. Since them_data arrayis of length 5, P2G, when executing, spawns up to 5 instances of the mul2 kernel per age. In section 3.2.10, we can see how P2G, using feedback from the instrumentation daemon, can decrease the number of separate executions of the kernels in order to reduce overhead.

A store command is very much like a fetch command. It has the same syntax and slice the same way as a fetch command. The only difference is that, because of the write-once semantic, store commands are always used to store to another global field, the next age, or both.

3.2.10 Dependency graph

As we already saw earlier in figure 3.2.2, P2G has a cyclic dependency graph of the workload. At run time P2G dynamically expands the dependency graph into a directed acyclic graph (DAG), as seen in figure 3.3. The move from a cyclic to acyclic graph is because of P2Gs write once semantic, where each field can have an age, so a cycle becomes a sequence of ages.

To partition the work, the HLS may use graph partitioning to distribute the load fairly onto the resources available, as seen in figure 3.2.2. With our framework we can provide information about how large the load of each kernel is, so that the HLS can use that as input to its algorithms.

With the directed acyclic dependency graph (DC-DAG) the LLS can

(52)

Figure 3.3: Directed acyclic dynamically created dependency graph combine several instances of a kernel to reduce the overhead in the framework. From figure 3.3, we can see an example of how the LLS might schedule data and tasks. In age 2 the LLS has reduced data parallelity by combining the fetch statement inmul2 to cover an entire age. The reason the LLS might do this is if the work done in the kernel is fast in contrast to the fetch statement, it then tries to reduce the overhead by fetching more data at once.

In age 3 the LLS has removed the task parallelity, and combinedmul2and plus5, but kept data parallelity. This is done ifplus5andmul2exchange a lot of data without doing much work, in order to reduce the overhead of network traffic. Finally, in age 4 all parallelity is removed and all data and tasks are combined.

We see that both the LLS and HLS can use data from our instrumentation framework to optimize how to execute the kernels.

3.2.11 Compiler

To transform the P2G kernel code into usable machine code the P2G framework consists of a special P2G compiler. It transforms P2G kernel code into code for different architectures. To leverage all the hard work that has been put into various compilers for various architectures the P2G

(53)

3.3. SUMMARY 41 compiler transforms the kernel code into valid C or C++ code and then execute the highly optimized native compiler for each architecture, i.e., gcc for normal C-code, Nvidia’s compiler for CUDA-code and any other specialized compiler for any exotic hardware.

3.2.12 Runtime

The P2G runtime consists of a daemon that you run on every machine you want to participate in the computation, and a server that controls them, see figure 3.1. The server must have all the compiled code available, so it can transmit the code to the execution nodes and tell them to run it. The runtime is responsible for dynamically load the distributed binaries and execute them safely.

The P2G runtime contains the instrumentation code, and the timing probes are inserted right before and after the execution of one kernel instance, or a batch of kernel instances.

3.3 Summary

In this chapter we have discussed the reasons behind making P2G, and shown how P2G works. We have gone into depth on how the HLS and LLS use a dynamic graph to make scheduling choices. It should be clear that in order for the LLS to be able to know when to combine single kernel instances into a batch, it needs feedback from our instrumentation framework. The HLS is also in need of feedback, so it can weight each edge and vertices in its scheduling graph, with accuracy.

In the next chapter we look at how the instrumentation framework should be design so it can provide the data P2G needs, while still being easy to

(54)

integrate and having a low overhead.

(55)

Chapter 4 Design

In this chapter, we discuss the design of our instrumentation framework for P2G, and the reasons for choosing this design. Our goal is to make a flexible framework that is modular and not coupled with P2G more than needed, while still providing valuable information to both the LLS and HLS in P2G.

We start this chapter by giving an introduction to how we want our framework to fit in with P2G. We then move on to discuss each requirement for our design. We summarize the design in section 4.7.

4.1 Introduction

The intent of this instrumentation framework is to provide instrumentation data, on a distributed platform, to a master that controls the other machines, but also to the local machine.

Our goal is to provide data to the local LLS and to the HLS. Worker nodes do not need to know anything about each other, and communication is 43

(56)

Capabilities Timing Computer status Computer health Alarms

P2G Event library Execution node

LLS

Capabilities Timing Computer status Computer health Alarms

P2G Event library Execution node

P2G Event library

Master node

HLS LLS

Figure 4.1: Overview of the communication between nodes

therefore only between the master node and the worker nodes, as seen in figure 4.1. Since P2G already has a communication module, we use it for all communication.

4.2 Requirements

An absolute requirement for anything that is used in a high performance computing setting is that it should be efficient, i.e., the overhead of using the instrumentation framework should be as low as possible. The main purpose is to have the scheduler make smart decisions to speed up the processing, and if the instrumentation framework consumes all the gain from applying the information it provides, there is not much point in having it. The framework should also be as non-intrusive as possible, to make the integration with P2G easier. It should also be easy to switch on/off instrumentation code, to be able to debug performance issues.

Since the goal is to use the framework in the LLS to combine kernel instances into batches based on execution time, we need to measure the

(57)

4.3. DATA GATHERING 45 execution time for each kernel instance.

We also want to use the data provided from our instrumentation framework in the HLS to combine kernels that exchange a lot of data, on one machine, so we can avoid saturating the network. For this to work, we also have to monitor the network traffic.

Another goal, is to provide data in a human readable format, to help developers locating bottlenecks in their code, so we also need a way to display the measured data.

4.3 Data gathering

We need to gather a lot of different information, and we try to keep each part separated into self contained units so they can easily be exchanged, extended or removed. Our focus is on making the solution as generic as possible, so we can adapt to different needs fast, as we do not know what kind of data our framework needs to provide in the future.

4.3.1 Capabilities

The scheduler is interested in the capabilities of the machines P2G is running on, because some of the kernels might require a CUDA capable GPU, a specific CPU-type or other specific piece of hardware. As the number of machines increases, the job to manually configure each node, becomes less viable. It is therefore vital for the scalability of P2G that the framework can detect capabilities automatically, without any human interaction. As discussed in chapter 2, certain heterogeneous systems can perform tasks at a much higher speed than a general purpose CPU.

To leverage that, the program must be written and compiled for that

Utilization of instrumentation data to improve distributed multimedia processing

UNIVERSITY OF OSLO Department of Informatics