Implementing Less than Best Effort with Deadlines

One LBE to rule them all and to the deadline bind them

Lars Erik M. Storbukås

Thesis submitted for the degree of Master in Programming and Networks

60 credits

Department of Informatics

Faculty of Mathematics and Natural Sciences, University of Oslo

Spring 2018


University of Oslo, Simula Research Laboratory
Printed: Reprosentralen, University of Oslo


Implementing Less than Best Effort with Deadlines

One LBE to rule them all and to the deadline bind them

Thesis submitted for the degree of Master in Programming and Networks
February 15th, 2018

University of Oslo

Department of Informatics

Faculty of Mathematics and Natural Sciences


Applications which do not carry strong capacity or latency constraints, such as backups, could use a Less than Best Effort (LBE) transport protocol as an alternative to best-effort TCP. This could minimize their potential to disrupt network flows with a higher significance. Existing Less than Best Effort Congestion Controls, such as LEDBAT, enforce LBE behaviour without any timeliness requirements regarding completion time.

This thesis introduces a framework API in the Linux Kernel that provides the ability to impose timeliness requirements and LBE behaviour on an arbitrary CC.

The framework measures the degree of network congestion based on the most commonly used congestion metrics (loss, delay, and explicit congestion notification). It introduces functionality which allows network traffic to adjust its relative share of network capacity on a per-flow level. This functionality should be used to control the level of LBEness based on the congestion level in the network, as well as the time remaining until the contracted completion time.

The experiments executed show the effectiveness of the DA-LBE mechanisms implemented in the Linux Kernel. The transmission rate can be artificially inflated or deflated on an arbitrary CC in order to adjust the relative share of network capacity of a given network flow.

The implementation of the DA-LBE kernel framework has provided the foundation for the DA-LBE user-space framework, giving the ability to impose LBE behaviour on an arbitrary CC on a per-socket basis. The Deadline Aware part of the DA-LBE framework must be implemented in user-space and use the statistical structures provided by the DA-LBE kernel framework to make decisions about which DA-LBE kernel mechanism should be enabled at a given time.



Before you is the dissertation Implementing Less than Best Effort with Deadlines, the basis of which is the framework proposal composed by David A. Hayes, David Ros, Andreas Petlund, and Iffat Ahmed, titled A Framework for Less than Best Effort Congestion Control with Soft Deadlines. This thesis has been written as a part of the master's program, Programming and Networks, at the University of Oslo, in collaboration with Simula Research Laboratory.

My implementation of the framework proposal is a contribution to a New Evolutive API and Transport-Layer (NEAT) Architecture for the Internet. NEAT is funded by the European Union’s Horizon 2020 research and innovation programme.

Linux Kernel development has proven to be a challenging and time-consuming undertaking. Conducting a systematic analysis of the Linux Kernel network stack has allowed me to translate the framework proposal into Kernel code. The implementation in the Linux Kernel provides a framework which enables less than best effort service with deadlines for an arbitrary congestion control. I would like to thank my supervisors for their guidance and support during this process.

A special expression of gratitude goes to my wife, Karoline, for all the support she has given me while I worked on this thesis. A lot of that work occurred at times inconvenient to my family. My daughter, Nine-Marie, who is two at the time of writing, has also needed to show patience when her dad has been working on the Linux Kernel and on this thesis.



ACK Acknowledgement

API Application Programming Interface

BE Best Effort

BRTT Base Round-Trip Time

CC Congestion Control

CPU Central Processing Unit

CWND Congestion Window

CWR Congestion Window Reduced

DA-LBE Deadline Aware Less than Best Effort

ECE ECN-Echo

ECN Explicit Congestion Notification

EWMA Exponentially Weighted Moving Average

GSO Generic Segmentation Offload

IP Internet Protocol

LBE Less than Best Effort

NEAT New Evolutive API and Transport-Layer

NIC Network Interface Controller

RTO Retransmission Timeout

RTT Round-Trip Time

SKB The Socket Buffer

SSTHRESH Slow-Start Threshold

TCP Transmission Control Protocol

TSO TCP Segmentation Offload

UDP User Datagram Protocol


1 Introduction 7

1.1 Problem statement . . . 8

1.2 Motivation . . . 9

1.3 Research questions . . . 10

1.4 Organization . . . 11

2 Background 13

2.1 Transmission Control Protocol (TCP) . . . 13

2.1.1 Congestion Control (CC) . . . 16

2.1.1.1 Slow start . . . 16

2.1.1.2 Congestion Avoidance . . . 17

2.1.1.3 Fast Retransmit and Fast Recovery . . . 18

2.1.1.4 Different flavours of TCP Congestion Control . . . 19

2.1.1.5 TCP CUBIC . . . 19

2.1.1.6 TCP Vegas . . . 19

2.1.2 Internet Protocol (IP) header . . . 20

2.1.3 Transmission Control Protocol (TCP) header . . . 21

2.1.4 Explicit Congestion Notification (ECN) . . . 22

2.1.4.1 ECN in TCP Header . . . 23

2.1.4.2 ECN in IP Header . . . 23

2.1.5 TCP Cumulative Acknowledgements . . . 23

2.1.5.1 TCP Segmentation Offload (TSO) . . . 23

2.1.6 Queueing delay . . . 23

2.2 Less than Best Effort (LBE) . . . 24

2.2.1 LEDBAT as an example of LBE CC . . . 25

2.2.2 FLOWER as an example of LBE CC . . . 25

2.3 Deadline Aware Less than Best Effort (DA-LBE) . . . 25

2.3.1 Becoming Deadline-Aware . . . 26

2.3.2 Imposing LBE behaviour on arbitrary CCs . . . 27

2.3.3 The NEAT system . . . 29

2.3.3.1 Role of DA-LBE in NEAT . . . 30


2.4 Linux Kernel architecture . . . 30

2.4.1 Compiling a custom kernel . . . 31

2.4.2 Communicating with the kernel . . . 32

2.4.3 The TCP stack in Linux . . . 33

2.4.3.1 TCP Output Engine . . . 34

2.4.3.2 TCP Input Engine . . . 34

2.4.3.3 The Socket Buffer (SKB) . . . 35

3 Implementation in Linux 37

3.1 Applying theory to practice . . . 37

3.2 Linux Kernel architecture . . . 38

3.3 Network congestion statistics . . . 40

3.3.1 Congestion Window (snd_cwnd) . . . 41

3.3.2 Slow- and Fast-Retransmissions . . . 42

3.3.3 Slow-Start Threshold (ssthresh) . . . 42

3.3.4 Average congestion indication time interval . . . 43

3.4 Location of mathematical calculations . . . 43

3.4.1 Fixed-point integer arithmetic . . . 44

3.4.2 Probability . . . 44

3.4.3 Random number generator . . . 45

3.4.4 Probability calculation . . . 45

3.5 Rate control mechanisms . . . 47

3.5.1 Phantom ECN . . . 47

3.5.1.1 Congestion abatement . . . 48

3.5.2 Ignoring ECN signals . . . 50

3.5.3 Ignoring loss . . . 51

3.5.4 Phantom delay . . . 53

3.6 Using the DA-LBE kernel framework from user-space . . . 54

3.6.1 Socket options . . . 56

3.6.1.1 Invoking socket options . . . 57

3.6.2 User-space example . . . 58

4 Experimental setup 61

4.1 Hardware . . . 61

4.1.1 Configuring two network interfaces . . . 62

4.1.2 Enabling ECN support . . . 63

4.1.3 Disabling TSO and GSO . . . 64

4.1.4 Tools . . . 64

4.2 Network emulation . . . 65

4.2.1 Rate control . . . 65

4.2.2 Network delay . . . 65


4.2.3 Explicit Congestion Notification (ECN) . . . 65

4.2.4 Configuring Congestion Control (CC) algorithm . . . 66

5 Experimental evaluation 67

5.1 Experimental objective . . . 67

5.2 Experimental configuration . . . 67

5.3 Experimental results . . . 69

5.3.1 Comparison sample . . . 69

5.3.2 NetEm ECN . . . 72

5.3.3 Phantom ECN . . . 73

5.3.3.1 Increasing the probability . . . 76

5.3.4 Ignoring ECN . . . 77

5.4 Conclusion . . . 79

6 Conclusion 81

6.1 Evaluation . . . 81

6.2 Future work . . . 82

Appendices 83

Appendix A Linux Kernel code excerpts 83

A.1 struct tcp_sock . . . 83

A.2 struct sock_common . . . 84

A.3 struct sock . . . 84

A.4 load and store for lockless contexts . . . 85

A.5 DA-LBE mode . . . 86

A.6 DA-LBE defines . . . 86

A.7 struct da_lbe_info . . . 86

A.8 tcp_ecn_queue_cwr() . . . 87

A.9 tcp_packets_in_ack() . . . 87

A.10 cwnd_proportion_change() . . . 88

A.11 dalbe_ewma() . . . 88

A.12 da_lbe_mode_enabled() . . . 89

A.13 tcp_ecn_rcv_ecn_echo() . . . 89

A.14 tcp_init_cwnd_reduction() . . . 91

A.15 tcp_clean_rtx_queue() . . . 92

A.16 tcp_snd_una_update() . . . 93

A.17 tcp_enter_recovery() . . . 94

A.18 tcp_fastretrans_alert() . . . 95

A.19 __tcp_retransmit_skb() . . . 96

A.20 do_tcp_setsockopt() . . . 96


A.21 do_tcp_getsockopt() . . . 98

A.22 da_lbe_get_info() . . . 99

Appendix B Experimental results 101

B.1 Comparison sample . . . 101

B.2 NetEm ECN . . . 108

B.3 Phantom ECN . . . 115

B.4 Ignoring ECN . . . 128

Appendix C Miscellaneous 133

C.1 Compile the Linux Kernel . . . 133

C.1.1 Minimal Requirements . . . 134

C.1.2 Online resources . . . 135

C.2 User-space example . . . 135

C.3 Linux Kernel Map . . . 138


2.1 Network flow TCP/IP Model Layers . . . 15

2.2 Slow start . . . 17

2.3 IP Header . . . 20

2.4 Type of Service (TOS field) . . . 20

2.5 TCP Header . . . 21

2.6 LBE Example . . . 24

2.7 NEAT Architecture . . . 29

2.8 Linux Kernel Map . . . 31

2.9 Linux TCP Output Layout . . . 34

2.10 Socket Buffer . . . 35

3.1 Bit-shifting . . . 44

3.2 Setting probability of 1% . . . 45

3.3 Setting probability of 57.5% . . . 45

3.4 Linux Kernel communication with socket options . . . 55

4.1 Hardware setup . . . 62

5.1 Experiment configuration - Transfer . . . 68

5.2 Comparison sample - TCP CUBIC . . . 70

5.3 Comparison sample - TCP Vegas . . . 71

5.4 NetEm ECN 2.5% - TCP CUBIC . . . 72

5.5 NetEm ECN 2.5% - TCP Vegas . . . 73

5.6 Phantom ECN 2.5% - TCP CUBIC . . . 74

5.7 Phantom ECN 2.5% - TCP Vegas . . . 75

5.8 Phantom ECN 5% - TCP CUBIC . . . 76

5.9 Phantom ECN 10% - TCP CUBIC . . . 77

5.10 NetEm ECN - TCP CUBIC . . . 78

B.1 Comparison sample - TCP CUBIC - Throughput . . . 102

B.2 Comparison sample - TCP CUBIC - Congestion window . . . 103

B.3 Comparison sample - TCP CUBIC - Round-trip time . . . 104


B.4 Comparison sample - TCP Vegas - Throughput . . . 105

B.5 Comparison sample - TCP Vegas - Congestion window . . . 106

B.6 Comparison sample - TCP Vegas - Round-trip time . . . 107

B.7 NetEm ECN - TCP CUBIC - 2.5% ECN - Throughput . . . 109

B.8 NetEm ECN - TCP CUBIC - 2.5% ECN - Congestion window . . 110

B.9 NetEm ECN - TCP CUBIC - 2.5% ECN - Round-trip time . . . . 111

B.10 NetEm ECN - TCP Vegas - 2.5% ECN - Throughput . . . 112

B.11 NetEm ECN - TCP Vegas - 2.5% ECN - Congestion window . . . 113

B.12 NetEm ECN - TCP Vegas - 2.5% ECN - Round-trip time . . . 114

B.13 Phantom ECN - TCP CUBIC - 2.5% ECN - Throughput . . . 116

B.14 Phantom ECN - TCP CUBIC - 2.5% ECN - Congestion window . . 117

B.15 Phantom ECN - TCP CUBIC - 2.5% ECN - Round-trip time . . . 118

B.16 Phantom ECN - TCP Vegas - 2.5% ECN - Throughput . . . 119

B.17 Phantom ECN - TCP Vegas - 2.5% ECN - Congestion window . . 120

B.18 Phantom ECN - TCP Vegas - 2.5% ECN - Round-trip time . . . . 121

B.19 Phantom ECN - TCP CUBIC - 5% ECN - Throughput . . . 122

B.20 Phantom ECN - TCP CUBIC - 5% ECN - Congestion window . . 123

B.21 Phantom ECN - TCP CUBIC - 5% ECN - Round-trip time . . . . 124

B.22 Phantom ECN - TCP CUBIC - 10% ECN - Throughput . . . 125

B.23 Phantom ECN - TCP CUBIC - 10% ECN - Congestion window . . 126

B.24 Phantom ECN - TCP CUBIC - 10% ECN - Round-trip time . . . . 127

B.25 Ignoring ECN - TCP CUBIC - 2.5% NetEm ECN and 10% DA-LBE ignore ECN - Throughput . . . 129

B.26 Ignoring ECN - TCP CUBIC - 2.5% NetEm ECN and 10% DA-LBE ignore ECN - Congestion window . . . 130

B.27 Ignoring ECN - TCP CUBIC - 2.5% NetEm ECN and 10% DA-LBE ignore ECN - Round-trip time . . . 131


2.1 OSI Model and TCP/IP Model . . . 14

2.2 TCP/IP Model examples . . . 14

2.3 TCP/IP Model layers explained . . . 15

2.4 TCP congestion implementation flavours . . . 19

2.5 TCP Header specification . . . 22

2.6 Different ways to communicate with kernel-space . . . 32

3.1 Files affected by the DA-LBE Linux Kernel framework . . . 38

3.2 DA-LBE mechanisms implemented in net/ipv4/tcp.c . . . 39

3.3 DA-LBE mechanisms implemented in net/ipv4/tcp_input.c . . . 39

3.4 DA-LBE mechanisms implemented in net/ipv4/tcp_output.c . . . 39

3.5 da_lbe_info struct overview . . . 41

3.6 Socket options using setsockopt . . . 56

4.1 Machine overview . . . 62

4.2 Overview of possible values for tcp_ecn . . . 64

4.3 Tools overview . . . 64

5.1 Shared configuration across all experiments . . . 68

5.2 Ignoring ECN configuration . . . 78

C.1 Minimal requirements to compile the Kernel . . . 134



3.1 da_lbe_info struct implementation . . . 40

3.2 __tcp_snd_una_update() located in net/ipv4/tcp_output.c . . . . 42

3.3 cwnd_proportion_change() located in net/ipv4/tcp_input.c . . . . 42

3.4 Bit-shifting in C . . . 44

3.5 Generating a random unsigned number . . . 45

3.6 Check if probability event should occur . . . 45

3.7 Modified probability check - cumulative acknowledgement . . . . 46

3.8 DA-LBE mode . . . 47

3.9 DA-LBE-INFO in tcp.h . . . 48

3.10 tcp_ecn_rcv_ecn_echo() located in net/ipv4/tcp_input.c . . . 49

3.11 tcp_fastretrans_alert() located in net/ipv4/tcp_input.c . . . 49

3.12 tcp_ecn_rcv_ecn_echo() located in net/ipv4/tcp_input.c . . . 51

3.13 Code excerpt A.17 outlining tcp_enter_recovery() located in net/ipv4/tcp_input.c . . . 52

3.14 tcp_clean_rtx_queue() located in net/ipv4/tcp_input.c . . . 53

3.15 getsockopt() and setsockopt() function declaration . . . 57

3.16 Using getsockopt() to fetch DA-LBE statistics . . . 57

3.17 Using setsockopt() to configure DA-LBE mechanisms . . . 57

3.18 User-space example . . . 58

4.1 Installing bridge-utils . . . 63

4.2 Create bridge and assign interfaces . . . 63

4.3 Configuration of network interfaces . . . 63

A.1 struct tcp_sock located in include/linux/tcp.h . . . 83

A.2 struct sock_common located in include/net/sock.h . . . 84

A.3 struct sock located in include/net/sock.h . . . 84

A.4 sk_da_lbe_mode_load() and sk_da_lbe_mode_store() located in include/net/sock.h . . . 85



A.5 DA-LBE mode enum located in include/uapi/linux/tcp.h . . . 86

A.6 DA-LBE socket options located in include/uapi/linux/tcp.h . . . 86

A.7 struct da_lbe_info located in include/uapi/linux/tcp.h . . . 86

A.8 tcp_ecn_queue_cwr() located in net/ipv4/tcp_input.c . . . 87

A.9 tcp_packets_in_ack() located in net/ipv4/tcp_input.c . . . 88

A.10 cwnd_proportion_change() located in net/ipv4/tcp_input.c . . . 88

A.11 dalbe_ewma() located in net/ipv4/tcp_input.c . . . 88

A.12 da_lbe_mode_enabled() located in net/ipv4/tcp_input.c . . . 89

A.13 tcp_ecn_rcv_ecn_echo() located in net/ipv4/tcp_input.c . . . 89

A.14 tcp_init_cwnd_reduction() located in net/ipv4/tcp_input.c . . . . 91

A.15 tcp_clean_rtx_queue() located in net/ipv4/tcp_input.c . . . 92

A.16 tcp_snd_una_update() located in net/ipv4/tcp_input.c . . . 93

A.17 Code excerpt outlining tcp_enter_recovery() located in net/ipv4/tcp_input.c . . . 94

A.18 tcp_fastretrans_alert() located in net/ipv4/tcp_input.c . . . 95

A.19 __tcp_snd_una_update() located in net/ipv4/tcp_output.c . . . . 96

A.20 do_tcp_setsockopt() located in net/ipv4/tcp.c . . . 96

A.21 do_tcp_getsockopt() located in net/ipv4/tcp.c . . . 98

A.22 da_lbe_get_info() located in net/ipv4/tcp.c . . . 99

C.1 Example TCP client . . . 135


1 Introduction

The Internet is much like a road-system which spans across the entire world. The road-system has highways which can carry a lot of traffic as well as smaller roads which can carry less traffic. Roads have interconnections, traffic queues, and border patrol, and similarly the Internet has routers, network queues, and firewalls.

Some roads have a high speed limit and some have a lower one; similarly, the Internet has high-bandwidth data links which can carry a lot of data as well as lower-bandwidth data links which can carry less data.

While driving, most people are eager to get to their destination as fast as possible, driving as quickly as the speed limit allows. Other people have plenty of time, simply appreciating the journey and not thinking too much about when they reach their destination, provided they do so within a reasonable amount of time. Continuing the comparison between the Internet and a road system, some traffic is less important than other traffic (e.g., freight transport has a lower priority than human transport). Clearly, similarities can be drawn between network traffic and road traffic.

In keeping with the road analogy, this thesis will look in more detail at how traffic of differing urgency can coexist while providing a mechanism for the yielding traffic to complete within a given soft deadline.

The foundation for this thesis builds on the framework proposal composed by David A. Hayes, David Ros, Andreas Petlund, and Iffat Ahmed. The framework proposal provides a solid foundation for the framework implemented in this thesis.

The effectiveness of the approach in the framework proposal is validated by numerical and emulation experiments. The purpose of this thesis is to analyze the framework proposal and assess mechanisms which can provide Less than Best Effort (LBE) behaviour on an arbitrary Congestion Control (CC) in Linux.



1.1 Problem statement

Networks use congestion avoidance and congestion control as techniques to limit network congestion and avoid collapse. A CC constantly probes for bandwidth in order to know when to increase or decrease the transmission rate. Each individual CC defines how it reacts to network congestion, and which network congestion indications it reacts to. An LBE service reacts to network congestion earlier than standard TCP would, which results in a lower impact on non-LBE network traffic when bandwidth is shared by multiple competing entities.

Commonly, network traffic is carried by a Best Effort (BE) service in order to maximize its transmission rate. Network traffic carried by an LBE service yields to network traffic using a BE service. LBE behaviour is commonly called a scavenger service, because it utilizes unused network bandwidth. Providing an LBE service entails only using bandwidth which is unused by other consumers on the network, particularly non-LBE traffic. In order for LBE to use the unused bandwidth, it should yield to non-LBE traffic, adapting to network changes and traffic load. When an LBE service observes a decrease in network congestion, it will increase its transmission rate.

Similarly, an LBE service will decrease its transmission rate when it observes an increase in congestion indications, in order to allow more urgent traffic to maximize its transmission rate, resulting in a more efficient utilization of network bandwidth.

LBE does not consider any notion of timeliness requirements when yielding to traffic of differing urgency. LBE services could be starved by BE flows, causing the LBE services to yield indefinitely. This can be remedied by adding the notion of time to LBE, coined Deadline Aware Less than Best Effort Services (DA-LBE).

Adding a notion of time to an LBE service will allow it to adjust how early it reacts to network congestion. A DA-LBE service will be able to conform to standard TCP BE behaviour if it needs to in order to complete within the soft deadline.

The purpose of this thesis is to implement Deadline Aware Less than Best Effort support in the Linux Kernel. This thesis will research how DA-LBE behaviour can be imposed on an arbitrary end-to-end CC. An end-to-end CC has its algorithms running on the end-hosts, as opposed to the alternative of having some sort of priority scheduling in routers. This thesis will research the possibility of adjusting the behaviour of a CC by adjusting the transmission rate, without having to modify the source code of the CC or drop packets to trigger congestion avoidance. The DA-LBE behaviour should provide timeliness constraints on LBE mechanisms in order to complete within a soft deadline (i.e., "I want it to be finished by tomorrow at 3PM"). Ideally, the LBE mechanisms should be able to be controlled on a per-socket level, as opposed to a system-wide configuration.

The Deadline Aware Less than Best Effort (DA-LBE) framework implemented in Linux should provide the following functionality (excerpt from [DA-LBE]):

• Keep disruption of concurrent BE interactive services to a minimum.

• Add a timeliness constraint to the transport, i.e., the transfer should be finished by a soft deadline to fit in with other network activities and to ensure the timely correctness of replicated data.

To achieve the functionality described above, the framework should be able to dynamically adjust its aggressiveness in competing with BE traffic, from that of a scavenger-type service up to that of a BE-type service. The DA-LBE framework is required to provide information which shows the current state of the network with regard to network congestion and bandwidth utilization. In order to support modification of the transmission rate of an arbitrary CC, mechanisms which change the perception of network congestion indications are required. The DA-LBE framework must provide mechanisms which are applicable to commonly used network congestion indicators such as loss, queueing delay, and explicit congestion notifications.
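To make these requirements concrete, the per-socket state the framework needs can be sketched as a C structure. This is a hypothetical illustration only; the field names are assumptions, and the actual da_lbe_info structure is listed in Appendix A.7.

```c
#include <stdint.h>

/* Hypothetical sketch of per-socket DA-LBE state: congestion statistics
 * reported to user-space, plus knobs that inflate or deflate the perceived
 * congestion price. Field names are illustrative, not the real struct. */
struct dalbe_state_sketch {
    /* Network-state information exposed to user-space. */
    uint32_t cong_events;           /* loss / ECN indications observed */
    uint32_t avg_cong_interval_us;  /* average time between indications */

    /* Mechanisms for changing the perception of congestion. */
    uint32_t phantom_ecn_prob;      /* probability of a phantom ECN event */
    uint32_t ignore_ecn_prob;       /* probability of ignoring a real ECN */
    uint32_t ignore_loss_prob;      /* probability of ignoring a loss event */
    uint32_t phantom_delay_pct;     /* RTT scaling; 100 = unchanged */
};
```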

The DA-LBE framework will require Linux Kernel support in order for existing CCs to be adapted to this use. This thesis will investigate how a solution to the problem statement described above can be implemented and organized in Linux.

1.2 Motivation

The DA-LBE framework is a contribution to a New Evolutive API and Transport-Layer (NEAT) Architecture for the Internet. NEAT is funded by the European Union's Horizon 2020 research and innovation programme.

Applications which do not carry strong capacity or latency constraints, such as backups, could use a Less than Best Effort (LBE) transport protocol as an alternative to best-effort TCP. This could minimize their potential to disrupt network flows with a higher significance. Existing LBE CCs, such as LEDBAT, enforce LBE behaviour without any timeliness requirements regarding completion time.

The key motivational points of this thesis are to:

• Impose LBE behaviour with a notion of time in order to support soft deadlines by dynamically adjusting the aggressiveness when competing with BE network traffic. Soft deadlines can prevent network starvation [2] and latecomer unfairness [8] as is observed in an LBE implementation such as LEDBAT.

• Support a wide range of CCs, as opposed to attempting to develop a one-size-fits-all CC. This will allow application developers to enable deadline aware LBE behaviour on new or existing applications with minimal configuration requirements.

The motivation for implementing a Deadline Aware Less than Best Effort (DA-LBE) framework is to be able to impose LBE behaviour on an arbitrary CC with a notion of time. DA-LBE could be used to balance non-intrusiveness against timeliness constraints. A key use for an LBE transport service is data backup applications, limiting the impact of a backup service on network flows of higher significance (e.g., streaming services). Backup applications generally have lower throughput and delay requirements compared to traffic such as video conferencing, streaming services, etc.

Applications which do not have any specific throughput or delay requirements will have the ability to take advantage of periods when the network has low bandwidth utilization. The DA-LBE framework could be used by such an application in order to increase network utilization. No deadline-aware LBE CC methods have previously been proposed in the literature [NEAT]. An LBE service which includes a notion of time will provide the ability to request that an arbitrary network transmission perform Less than Best Effort (LBE) (see section 2.2). DA-LBE solves the LBE starvation found in LEDBAT [2] by implementing deadline awareness.

1.3 Research questions

This thesis will discuss the following research questions.

• Is it possible to apply mechanisms in the Linux Kernel which can reduce the transmission rate on an arbitrary CC without causing loss?

• Can traffic of differing urgency coexist while providing a mechanism for the yielding traffic to complete within a given soft deadline?

• Can the former items be achieved on a per-socket basis in Linux?


1.4 Organization

This thesis is organised by first presenting the Background chapter, which provides information useful for understanding the DA-LBE framework and how it is implemented in the Linux Kernel. Next, the Implementation in Linux chapter presents the implementation of the DA-LBE kernel framework in Linux and explores how it has been structured and organized. This is followed by the Experimental setup chapter, which describes the technical configuration of the experimental environment, where experiments and measurements have been performed, and the purpose of the different experiments that have been carried out in this thesis.

The experimental results are presented in the Experimental evaluation chapter, which evaluates the outcome of the experiments that were carried out. The thesis is concluded in the Conclusion chapter, where the effectiveness of the DA-LBE framework implementation and future work are discussed.


2 Background

This chapter describes the background information necessary to understand the DA-LBE framework and how it is implemented in the Kernel. It gives a foundation for understanding the networking mechanisms which the DA-LBE framework relies on in order to operate. It is assumed that readers are familiar with basic networking concepts, but for completeness the concepts which are necessary to understand how the DA-LBE framework is designed are presented in this chapter.

2.1 Transmission Control Protocol (TCP)

Transmission Control Protocol (TCP) [RFC0793], alongside Internet Protocol (IP) [RFC0791], are the most commonly used protocols on the Internet. TCP/IP [RFC1180] is a set of protocols which specify communications between computers on the Internet. TCP/IP determines how packets are to be transported to their destination. IP is responsible for the logistics of packets (where to go, and how to get there). TCP is responsible for providing a reliable transmission. TCP retransmits packets which do not arrive at their destination, and performs error-checking on the packets. A well-known counterpart to TCP is the User Datagram Protocol (UDP) [RFC0768].

TCP/IP is divided into layers. The OSI Reference Model is a generic, protocol-independent standard, which TCP/IP is based on. The OSI model has 7 layers, whereas the TCP/IP model has 4 layers. We're going to explore the implementation of TCP/IP in Linux and how the different layers of the TCP/IP model are implemented.


OSI Model               TCP/IP Model
7. Application Layer    Application Layer
6. Presentation Layer   Application Layer
5. Session Layer        Application Layer
4. Transport Layer      Transport Layer
3. Network Layer        Network Layer
2. Data Link Layer      Link Layer
1. Physical Layer       Link Layer

Table 2.1: Comparing the OSI Model to the TCP/IP Model

TCP/IP Model Layer    Examples
Application Layer     FTP, SMTP, DNS, RIP, SNMP, NFS, NIS+, telnet
Transport Layer       TCP, UDP
Network Layer         IP, ARP, IGMP, ICMP
Link Layer            PPP, Ethernet, IEEE 802.2, Frame Relay, ATM

Table 2.2: TCP/IP Protocol Examples in the different layers of the TCP/IP Protocol Architecture Model

TCP/IP provides reliability, in-order delivery, error checking, flow control, and CC to the application layer. The TCP/IP Protocol Architecture Model is displayed in table 2.1, and a list of protocol examples is provided in table 2.2. The three lower layers of the TCP/IP model provide a service to the layers above. A short description of each layer’s role is provided below.


Application Layer:   Provides the ability for applications to access Internet-related services of the other layers.
Transport Layer:     Provides host-to-host data transportation for the Application Layer, with session and datagram communication services.
Network Layer:       Provides the ability to send data over the Internet, with addressing, packaging, and routing functions.
Link Layer:          Responsible for placing and receiving TCP/IP packets on the network medium.

Table 2.3: TCP/IP Protocol Architecture Model Layers explained

A network device need not implement all four layers to operate, but it must implement all layers below the layer it wishes to use. Each layer encapsulates data from the layer above, together with some supplemental data specific to the layer, called a header (overhead). The header is used for control signals at the specific layer of the TCP/IP model. Some examples of headers are given below.

Network flow between TCP/IP layers, computers, and routers is displayed in figure 2.1. TCP is connection oriented; this means that in order for two hosts to communicate, a connection between the two must be established. TCP does this with a three-way handshake. The three-way handshake is used to configure network metrics such as the Maximum Segment Size (MSS) between the hosts. Each host announces its MSS limit, and the lower of the two values is used.

Figure 2.1: Network flow TCP/IP Model Layers


TCP keeps track of delivered packets by using acknowledgements (ACKs). Each packet has an (incrementally increasing) sequence number indicating the part of the byte stream it belongs to. The sequence number corresponds to the number of bytes that have been sent. ACKs are used by the receiver to state that data has been received. When the sender receives an ACK with a certain sequence number, it is interpreted as meaning that all bytes up to that sequence number have been received. This mechanism is called cumulative acknowledgement (see section 2.1.5); for example, if two segments covering bytes 1-1460 and 1461-2920 are sent, a single ACK for byte 2921 acknowledges both. Another TCP mechanism is retransmitting unacknowledged packets, triggered by a retransmission timeout (RTO). An RTO is simply a timer which keeps track of the last time an acknowledgement was received. When the RTO timer exceeds a certain value, a retransmission is triggered. The retransmission re-sends the unacknowledged bytes, starting from the last acknowledged sequence number.

TCP provides error-checking by adding an error checksum to the header, which can be evaluated by the receiver in order to detect corrupted packets. Corrupted packets could, for instance, be the result of noise or interference on the link layer.

TCP avoids network congestion by reducing the throughput of the sender. TCP uses a mechanism called flow control, which is used to prevent the sender from sending more than the receiver can handle. To limit the amount of data sent by TCP, the receiver advertises a Receive Window (rwnd). The rwnd is the available space in the receive buffer. Another important mechanism used by TCP is CC, which will be discussed next.

2.1.1 Congestion Control (CC)

Congestion Control (CC) is a mechanism in TCP to adjust the transmission rate in order to avoid congestive disruption as a result of oversubscription.

Commonly, a CC algorithm avoids this by reducing the throughput (e.g., the rate of packets). CC is triggered by network congestion indications (e.g., dropped packets). Some commonly used CC algorithms are: CUBIC [TCP Cubic], New Reno [RFC2582], and Vegas [TCP Vegas]. How the different CC algorithms react to different network congestion indications varies. CC is applied independently on each connection. Generally, the CC algorithm is configured as a system-wide setting.

2.1.1.1 Slow start

Slow Start [RFC2001] is an algorithm which evaluates the rate at which acknowledgements are received in order to decide the flow rate. Slow Start introduces a new window, the congestion window. The congestion window indicates how many unacknowledged packets can be outstanding before having to wait for an acknowledgement to send more. Upon a new connection, the congestion window is set to one segment (the MSS agreed upon in the three-way handshake).

Each acknowledgement received increases the congestion window by one segment. The limit of unacknowledged packets is the lowest value of the congestion window and the advertised window. The congestion window provides flow control on the sender side, whereas the advertised window provides flow control on the receiving side.

Figure 2.2: Slow start (initiator and receiver exchange; the congestion window doubles each round trip: cwnd = 1, 2, 4, 8)
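The exponential growth during slow start can be illustrated with a small, self-contained sketch; this is not kernel code, and the variable values are illustrative only. Each acknowledged segment increases cwnd by one, which doubles the window every round trip.

```c
#include <stdio.h>

/* Illustrative slow-start growth: every segment in the current window is
 * acknowledged, each ACK adds one segment to cwnd, so the window doubles
 * per round trip until it reaches ssthresh. */
int main(void)
{
    unsigned int cwnd = 1;       /* congestion window, in segments */
    unsigned int ssthresh = 64;  /* slow-start threshold, in segments */

    for (int rtt = 0; cwnd < ssthresh; rtt++) {
        printf("RTT %d: cwnd = %u segments\n", rtt, cwnd);
        unsigned int acked = cwnd;  /* one ACK per in-flight segment */
        cwnd += acked;              /* +1 segment per ACK => doubling */
    }
    return 0;
}
```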

2.1.1.2 Congestion Avoidance

Congestion Avoidance [RFC2001] is a mechanism for dealing with lost packets. There are two indications of packet loss: receiving duplicate ACKs and a retransmission timeout occurring. When TCP receives such a congestion indication, it must slow down its transmission rate. Congestion Avoidance and Slow Start serve different purposes, but the two are commonly implemented together.

Slow Start and Congestion Avoidance use two variables in order to control the transmission rate: the congestion window (cwnd) and the slow start threshold (ssthresh).

Together, Slow Start and Congestion Avoidance operate as follows (list used from [RFC2001]; a simplified C sketch of this logic follows the list):

1. Initialization for a given connection sets cwnd to one segment and ssthresh to 65535 bytes.

2. The TCP output routine never sends more than the minimum of cwnd and the receiver’s advertised window.


3. When congestion occurs (indicated by a time out or the reception of duplicate ACKs), one-half of the current window size (the minimum of cwnd and the receiver’s advertised window, but at least two segments) is saved in ssthresh. Additionally, if the congestion is indicated by a timeout, cwnd is set to one segment (i.e., slow start).

4. When new data is acknowledged by the other end, cwnd is increased, but the way it increases depends on whether TCP is performing slow start or congestion avoidance.
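The per-ACK window update described in the list above can be condensed into a short sketch. The window is counted in whole segments for readability; this is illustrative code, not the Linux implementation, and the function names are assumptions.

```c
/* Simplified per-ACK window update combining slow start and congestion
 * avoidance, following the steps from [RFC2001]. */
struct cc_state {
    unsigned int cwnd;      /* congestion window (segments) */
    unsigned int ssthresh;  /* slow-start threshold (segments) */
    unsigned int acked;     /* ACKs counted within the current window */
};

/* Called once for every ACK that acknowledges new data. */
static void on_ack(struct cc_state *s)
{
    if (s->cwnd < s->ssthresh) {
        s->cwnd++;                     /* slow start: +1 segment per ACK */
    } else if (++s->acked >= s->cwnd) {
        s->cwnd++;                     /* congestion avoidance: ~+1 per RTT */
        s->acked = 0;
    }
}

/* Called when congestion is detected (step 3 in the list above). */
static void on_congestion(struct cc_state *s, int is_timeout)
{
    s->ssthresh = s->cwnd / 2;         /* save half the current window */
    if (s->ssthresh < 2)
        s->ssthresh = 2;
    if (is_timeout)
        s->cwnd = 1;                   /* timeout: fall back to slow start */
}
```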

2.1.1.3 Fast Retransmit and Fast Recovery

Fast Retransmit and Fast Recovery [RFC2001] are mechanisms for dealing with lost segments without waiting for a retransmission timer to expire. This behaviour is triggered upon the reception of three or more duplicate ACKs. As a result, Congestion Avoidance mode is entered instead of Slow Start mode. The reason for not going into slow start mode is that receiving duplicate ACKs indicates that the receiver has received a segment with a higher sequence number. This means that data is still flowing between the two ends, and TCP should not abruptly go into slow start mode.

The fast retransmit and fast recovery algorithms are usually implemented together as follows (list used from [RFC2001]; a simplified C sketch follows the list):

1. When the third duplicate ACK in a row is received, set ssthresh to one-half the current congestion window, cwnd, but no less than two segments. Retransmit the missing segment. Set cwnd to ssthresh plus 3 times the segment size. This inflates the congestion window by the number of segments that have left the network and which the other end has cached (3).

2. Each time another duplicate ACK arrives, increment cwnd by the segment size. This inflates the congestion window for the additional segment that has left the network. Transmit a packet if allowed by the new value of cwnd.

3. When the next ACK arrives that acknowledges new data, set cwnd to ssthresh (the value set in step 1). This ACK should be the acknowledgment of the retransmission from step 1, one round-trip time after the retransmission. Additionally, this ACK should acknowledge all the intermediate segments sent between the lost packet and the receipt of the ACK.
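Continuing in the same style as the previous sketch, the three steps can be expressed as follows; the function names and the placeholder retransmission call are illustrative assumptions, not kernel code.

```c
/* Simplified fast retransmit / fast recovery following [RFC2001].
 * Window and threshold are counted in whole segments. */
static void on_duplicate_ack(unsigned int *cwnd, unsigned int *ssthresh,
                             unsigned int dupacks)
{
    if (dupacks == 3) {
        /* Step 1: halve the window (at least 2 segments), retransmit the
         * missing segment, then inflate by the three segments known to
         * have left the network. */
        *ssthresh = *cwnd / 2;
        if (*ssthresh < 2)
            *ssthresh = 2;
        /* retransmit_missing_segment();  placeholder for the actual send */
        *cwnd = *ssthresh + 3;
    } else if (dupacks > 3) {
        (*cwnd)++;      /* Step 2: another segment has left the network */
    }
}

static void on_recovery_ack(unsigned int *cwnd, unsigned int ssthresh)
{
    *cwnd = ssthresh;   /* Step 3: deflate the window back to ssthresh */
}
```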


2.1.1.4 Different flavours of TCP Congestion Control

An overview of some of the most common flavours of TCP CCs is shown in table 2.4.

Name
Cubic [TCP Cubic]
Vegas [TCP Vegas]
LEDBAT [RFC6817]
FLOWER [FLOWER]
Tahoe [TCP Tahoe]
New Reno [RFC2582]
Westwood [TCP Westwood]
CDG [TCP CDG]

Table 2.4: TCP congestion implementation flavours

2.1.1.5 TCP CUBIC

TCP CUBIC [TCP Cubic] is a loss-based CC algorithm which is optimized for high-bandwidth networks with high latency. The congestion window is calculated using a cubic function of the time since the last congestion event, with the size of the congestion window prior to that event used as the plateau (threshold) of the curve. In congestion avoidance, CUBIC probes for more bandwidth slowly at first, allowing the network to stabilize, and then probes more rapidly for bandwidth.

TCP CUBIC does not rely on the rate at which ACKs are received in order to grow the congestion window; the window size is instead determined by the time elapsed since the last congestion event. Since the growth of the congestion window is independent of RTT, CUBIC allows for greater fairness between competing network flows.
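For reference, the window growth function from the CUBIC paper [TCP Cubic] can be written as below, where W_max is the window size just before the last congestion event, C is a scaling constant, β is the multiplicative window reduction applied at that event, and t is the time elapsed since it (this is the common formulation; later CUBIC variants parameterize β slightly differently):

```latex
W(t) = C\,(t - K)^3 + W_{\max},
\qquad
K = \sqrt[3]{\frac{W_{\max}\,\beta}{C}}
```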

2.1.1.6 TCP Vegas

TCP Vegas [TCP Vegas] is a delay-based CC algorithm which emphasizes queueing delay rather than packet loss when adjusting its transmission rate. Vegas uses the queueing delay to detect congestion in the network. Increasing values of the Round-Trip Time (RTT) indicate that the network is starting to experience congestion, which Vegas is able to detect early. The algorithm in Vegas depends on the calculation of a Base RTT value when adjusting its transmission rate. If the Base RTT value is calculated incorrectly, it will result in either lower or higher utilization of the bandwidth.
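As a reference for the discussion of phantom delay in section 2.3.2, the core of the Vegas algorithm can be summarized as follows (standard formulation from [TCP Vegas]; α and β here are the Vegas tuning parameters, not the CUBIC constants above):

```latex
\mathit{Expected} = \frac{\mathit{cwnd}}{\mathit{BaseRTT}}, \qquad
\mathit{Actual} = \frac{\mathit{cwnd}}{\mathit{RTT}}, \qquad
\mathit{Diff} = (\mathit{Expected} - \mathit{Actual}) \cdot \mathit{BaseRTT}
```

Per round trip, Vegas increases cwnd if Diff < α, decreases it if Diff > β, and otherwise leaves it unchanged, so an inflated RTT measurement directly lowers Actual and pushes Vegas towards reducing its rate.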

2.1.2 Internet Protocol (IP) header

The Internet Protocol (IP) [RFC0791] header contains the Type of Service (TOS) field, which carries ECN information. See figures 2.3 and 2.4. When the Congestion Experienced (CE) bit is 0, no congestion has been experienced; when it is 1, congestion has been experienced. When the ECN-Capable Transport (ECT) bit is 0, the transport is not ECN-capable; when it is 1, the transport is ECN-capable.

Figure 2.3: IP Header (20 bytes): Version, IHL, Type of Service (TOS), Total Length, Identification, Flags, Fragment Offset, Time To Live, Protocol, Header Checksum, Source IP Address, Destination IP Address, Options, Data

Figure 2.4: Type of Service (TOS) field (1 byte): DiffServ Code Points (DSCP), ECT, CE
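As a small user-space illustration of how the two ECN bits at the bottom of the TOS byte are interpreted (the codepoint values follow [RFC3168]; this is not kernel code):

```c
#include <stdio.h>
#include <stdint.h>

/* The ECN field is the two least significant bits of the IP TOS byte
 * (RFC 3168): 00 = Not-ECT, 01 = ECT(1), 10 = ECT(0), 11 = CE. */
static const char *ecn_codepoint(uint8_t tos)
{
    switch (tos & 0x03) {
    case 0x00: return "Not-ECT (non-ECN-capable transport)";
    case 0x01: return "ECT(1) (ECN-capable transport)";
    case 0x02: return "ECT(0) (ECN-capable transport)";
    default:   return "CE (congestion experienced)";
    }
}

int main(void)
{
    uint8_t tos = 0x03;  /* example: DSCP 0 with CE set by a router */
    printf("TOS 0x%02x -> %s\n", tos, ecn_codepoint(tos));
    return 0;
}
```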


2.1.3 Transmission Control Protocol (TCP) header

The TCP header [RFC0793] contains fields relevant to the DA-LBE framework, especially the flags which state whether or not a segment carries an ECN signal.

Every segment sent by TCP contains an initial portion called the TCP header. The TCP header format is depicted in figure 2.5, and the header fields are described in table 2.5.

Figure 2.5: TCP Header (20 bytes plus options): Source Port, Destination Port, Sequence Number, Acknowledgement Number, Data Offset, Reserved, flags (CWR, ECE, URG, ACK, PSH, RST, SYN, FIN), Window Size, Checksum, Urgent Pointer, Options (optional, variable length 0-40 bytes)


Source port: Port number of the process/service sending the packet (2 bytes)

Destination port: Port number of the process/service receiving the packet (2 bytes)

Sequence number: Sequence number of the first byte of data in this segment. Sequence numbers are accumulated and marked by the sending host (4 bytes)

Acknowledgement number: Acknowledges the number of bytes received correctly (in order) by the receiver. The number is set to the next byte the receiver expects and is only valid if the ACK flag is set (4 bytes)

Data offset: The byte number in the segment which contains the actual data sent. Varies with header size (4 bits)

Reserved field: Not assigned a role (3 bits)

Flags: A bit field used for signalling between hosts (9 bits)

Window: The maximum number of bytes the receiver is willing to receive (2 bytes)

Urgent pointer: An offset from the sequence number indicating data to be considered urgent for the receiving application. Only valid if the URG flag is set (2 bytes)

Table 2.5: TCP Header specification (used from [12])

2.1.4 Explicit Congestion Notification (ECN)

Explicit Congestion Notification (ECN) [RFC3168] is an extension to TCP/IP which allows explicit end-to-end notification of network congestion without dropping packets. An ECN event will trigger the congestion window to be reduced, thus reducing the transmission rate. ECN is used to notify the sender that the transmission rate should be reduced in order to prevent packet loss.


2.1.4.1 ECN in TCP Header

ECN marked in the TCP header's ECE field (see figure 2.5) is used for signalling congestion between TCP endpoints. The CWR bit is the sender's response to the receiver, indicating that an ECE-marked packet has been received and the congestion window has been reduced.

2.1.4.2 ECN in IP Header

ECN marked in the IP header’s TOS-field (see figure 2.4) is used for signalling congestion between routers and connection endpoints. IP packets marked with the CE field will trigger marking of the TCP header’s ECE field when unwrapping the IP header.

2.1.5 TCP Cumulative Acknowledgements

TCP Cumulative Acknowledgement is a mechanism which enables a single acknowledgement to indirectly acknowledge multiple preceding packets at the same time. The receipt of an ACK implies that the preceding packets have been received correctly. The TCP Cumulative Acknowledgement is used in the TCP Sliding Window mechanism.

2.1.5.1 TCP Segmentation Offload (TSO)

Cumulative Acknowledgements are often a result of TCP Segmentation Offload (TSO), which collects multiple packets together before sending them through the Network Interface Controller (NIC). The purpose of TSO is to offload segmentation work from the CPU to the NIC. TSO is primarily used in scenarios where the processing overhead of the network stack becomes notable, such as on high-speed network interfaces.

The TSO mechanism is not unique to TCP, and is called Generic Segmentation Offload (GSO) when not applied specifically to TCP.

2.1.6 Queueing delay

Queueing delay is the sum of the delays encountered on the network between the source and the destination. Packets which arrive at a router along the path from source to destination have to be processed and transmitted. Queueing delay fluctuates based on the current level of congestion in the network. The fewer packets residing in queues on network routers, the lower the queueing delay will be. When network traffic increases, more packets will be enqueued, leading to a higher queueing delay, which indicates increased network congestion. Delay-based CCs can use this to react to network congestion early and reduce their transmission rate.

2.2 Less than Best Effort (LBE)

Commonly, network traffic is carried by a Best Effort (BE) service in order to maximize its transmission rate. Network traffic carried by an LBE service yields to network traffic using a BE service. LBE behaviour is commonly called a scavenger service, because it utilizes unused network bandwidth. Providing an LBE service entails only using bandwidth which is unused by other consumers on the network, particularly non-LBE traffic. LBE traffic yields to non-LBE traffic, adapting to network changes and traffic load on the network. When an LBE service observes a decrease in network congestion, it will increase its transmission rate. Similarly, an LBE service will decrease its transmission rate when it observes an increase in congestion indications, in order to allow more urgent traffic to maximize its transmission rate, resulting in a more efficient utilization of network bandwidth. Figure 2.6 shows how LBE traffic yields to other network flows.

Figure 2.6:LBE Example (used with permission from [NEAT])

There exist different transport protocols which implement LBE behaviour, such as LEDBAT and FLOWER, with LEDBAT being the most well-known deployed implementation. LEDBAT and FLOWER are explained in more detail below.


2.2.1 LEDBAT as an example of LBE CC

Low Extra Delay Background Transport (LEDBAT) [RFC6817] is the most common LBE transport service. It is standardized by the Internet Engineering Task Force (IETF) and is a delay-based CC protocol. It works by setting a predefined maximum value for the queueing delay, thereby reducing how much it affects higher priority traffic. Like any LBE-based service, it is designed to utilize the unused bandwidth in a network. LEDBAT is configured with two (main) parameters: target and gain. One challenge when configuring these parameters is that they are highly dependent on network conditions and that they are statically configured, meaning that they are unable to adapt to network changes.

The two main issues with LEDBAT are aggressiveness and latecomer unfairness.

LEDBAT can, in some configurations, become more aggressive than TCP, thus undermining its design, which is to be a Lower Than Best Effort (LBE) service.

Latecomer unfairness is a problem that results in starvation of ongoing flows when the flows start at different times, where the sensed minimum one-way delay (OWD) is different on the two streams (worst case is that newcomers starve ongoing flows).

2.2.2 FLOWER as an example of LBE CC

The Fuzzy LOWer-than-Best-EffoRt Transport Protocol (FLOWER) [FLOWER] provides a more transparent LBE service to the network and overcomes the challenges with LEDBAT. Fuzzy logic allows varying degrees of truth, rather than only the boolean values true and false. The key difference between LEDBAT and FLOWER lies in the controller used to adjust the congestion window. FLOWER adjusts its target value based on an analysis of the minimum queueing delay observed, which is performed each round-trip time (RTT). This analysis decides whether or not the target queueing delay must be adjusted in order to adapt to network changes (increasing or decreasing its sending rate).

2.3 Deadline Aware Less than Best Effort (DA-LBE)

LBE does not consider any notion of timeliness requirements when yielding to traffic of differing urgency. LBE services could be starved by BE flows, causing the LBE services to yield indefinitely. This can be remedied by adding the notion of time to LBE, termed Deadline Aware Less than Best Effort Services (DA-LBE). Adding a notion of time to an LBE service will allow it to adjust how early it reacts to network congestion, taking into consideration the amount of time until the soft deadline. A DA-LBE service will be able to conform to standard TCP BE behaviour if it needs to in order to complete within the soft deadline.

A DA-LBE traffic flow should (list used from [DA-LBE]):

• Be no more aggressive than BE traffic

• React appropriately to network congestion

• Take advantage of available network capacity when there is no congestion

• Attempt to finish transmitting its data by the deadline

DA-LBE will initially have maximum LBEness and will decrease its LBEness (increase its aggressiveness) as the deadline for the transmission approaches.

2.3.1 Becoming Deadline-Aware

The deadline-aware part of DA-LBE is provided by being able to dynamically adjust aggressiveness based on the notion of time. Being able to adjust aggressiveness gives DA-LBE the ability to trade aggressiveness in order to meet a certain deadline. Deflation of the congestion price could be necessary for particular flows in order to meet the deadline. Increasing the transmission rate could be beneficial in the case of Vegas needing to compete more fairly with network streams running CUBIC. This trade-off can be controlled using the DA-LBE framework, based on configuration of (list used from [DA-LBE]):

• the size of the data to transfer

• the soft completion time for the transfer

These parameters are set through a user-space API. By adjusting the perceived congestion price based on the soft completion time, the transmission rate can be dynamically controlled.


2.3.2 Imposing LBE behaviour on arbitrary CCs

Different CCs use different values to indicate congestion, such as loss and delay.

These CC-specific prices will have different impacts on the transmission rate (e.g., Vegas reacts faster to congestion indications). Different CCs may provide different trade-offs. Different congestion prices must be mapped to a common price measure. This price measure could then be adjusted, regardless of CC, giving the same amount of relative change in congestion price. This adjustment is achieved by inflating or deflating the congestion price. Having a common price measure of congestion, and having the ability to modify this value, will make it possible to dynamically adjust the level of LBEness imposed on the CC in use.

The inflation of the congestion price allows the flow to achieve a lower relative share of capacity, whereas deflation of the congestion price allows the flow to achieve a higher relative share of capacity. DA-LBE will generally need to inflate the congestion price in order to reduce the relative share of capacity.

Inflating the congestion price for a loss-based CC could be achieved by dropping packets (packet loss); however, this triggers a retransmission. A better approach is to be able to reduce the transmission rate when receiving an ECN signal. An ECN signal results in reduction of the congestion window without dropping packets.

However, ECN signals require another party (such as a router) to notify the sender that it should reduce its transmission rate. DA-LBE should be able to artificially generate an ECN event.

This mechanism is termed a phantom ECN event. Phantom ECN signals are the easiest way to inflate the congestion price without the need for retransmissions, and they provide the same congestion window reduction as a real ECN signal would.

However, the phantom ECN does not send the CWR response to the receiver.

A disadvantage of the phantom ECN mechanism is that it prevents taking advantage of short periods of decreased congestion in the network. The phantom ECN mechanism does not by itself detect congestion diminishment very well. In order for phantom ECNs to halt when congestion diminishes, so that short periods of increased available capacity can be used, an additional mechanism must be introduced: an average congestion indication time interval that accompanies the phantom ECN mechanism. This time interval measures the average time between the reception of congestion indications. When the time interval drops below a certain user-defined limit, generation of phantom ECNs will halt, and thus the available capacity will be able to be utilized.
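The average time between congestion indications is typically tracked with an Exponentially Weighted Moving Average; the thesis implements this in dalbe_ewma() (Appendix A.11). The sketch below only illustrates the idea, and the fixed-point weight is an assumed value, not the one used in the kernel framework.

```c
#include <stdint.h>

/* Illustrative EWMA of the interval between congestion indications, using
 * fixed-point arithmetic as kernel code would (no floating point).
 * WEIGHT_SHIFT = 3 gives a smoothing factor of 1/8 (assumed value). */
#define WEIGHT_SHIFT 3

static uint32_t cong_interval_ewma(uint32_t avg_us, uint32_t sample_us)
{
    if (avg_us == 0)
        return sample_us;  /* first sample seeds the average */
    /* avg = avg - avg/8 + sample/8 */
    return avg_us - (avg_us >> WEIGHT_SHIFT) + (sample_us >> WEIGHT_SHIFT);
}
```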

Phantom ECNs are supported by all CCs, regardless of price measure. Since delay-based CCs react faster to network congestion, DA-LBE adds support for artificially adjusting the queueing delay, termed phantom delay. This is achieved by adjusting the measured RTT. Increasing the value of the RTT gives the CC algorithm in use the impression that congestion in the network has increased, leading to a reduction of the transmission rate. Similarly, reducing the value of the RTT leads to an increased transmission rate.

Some delay-based algorithms have a slightly different measure of congestion price (measurement of delay). TCP Vegas is one CC algorithm which has a more fine-grained measurement of delay (congestion price). In order to support TCP Vegas, which uses the difference between the measured RTT and the base RTT, there must be support for modification of both of these. When the congestion price is inflated or deflated, it is accompanied by a control variable (set from user-space) stating whether or not the CC is delay-based and whether or not it operates with the base RTT.

The proposed DA-LBE framework was originally designed to change the alpha and beta parameters of Vegas. Changing the alpha and beta parameters of Vegas requires modifying the source code of Vegas, because the variables where alpha and beta are configured are not exposed outside of the implementation of Vegas.

This means that in order to be able to modify the alpha and beta parameters, the implementation of Vegas would need to be rewritten to support accessing and modifying the alpha and beta parameters outside of the Vegas CC kernel module.

Though this could in theory be done, it is something that should be avoided in order to be able to get the DA-LBE framework code accepted into the Linux Kernel. In addition to the need for being able to directly modify Vegas, it also means that all other delay-based CC algorithms would need to be altered in order for DA-LBE to impose congestion price adjustment on that algorithm. This is not ideal. Instead, the approach chosen in the DA-LBE framework is altering the input that delay-based protocols evaluate in order to decide the queueing delay (network congestion), namely, the measured RTT. Modifying the perceived RTT gives a more fine-grained control compared to phantom ECN.

Delay-based CC algorithms detect network congestion earlier than loss-based CC algorithms. This is because delay-based algorithms can effectively achieve a constant window size, thus avoiding the oscillation inherent in loss-based algorithms (sawtooth pattern). In a network where the only protocol in use is delay- based, loss can be avoided (loss is inefficient). In a network where the bandwidth is shared between both delay-based and loss-based algorithms, the loss-based usually gets a greater share of the bandwidth, as delay-based algorithms are usually less aggressive.

As delay-based CCs adjust their transmission rate based on the perceived queueing delay, the transmission rate can be controlled by inflating or deflating the price in the network. The congestion price is based on the RTT, and that is what is adjusted in order to achieve the desired result. The amount of deflation or inflation is set by a variable which defines the percentage of adjustment. A value of 100 percent keeps the existing congestion price. A value lower than 100 percent will deflate the congestion price, and a value higher than 100 percent will inflate the congestion price.
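Expressed as code, the phantom delay adjustment is a simple scaling of the measured RTT by the configured percentage. This is a hedged sketch: the function and variable names are illustrative assumptions, not the actual kernel implementation.

```c
#include <stdint.h>

/* Illustrative phantom-delay scaling: factor_pct is a percentage where
 * 100 leaves the measured RTT (the congestion price) unchanged, values
 * above 100 inflate it (more LBE) and values below 100 deflate it
 * (more aggressive). */
static uint32_t phantom_delay(uint32_t measured_rtt_us, uint32_t factor_pct)
{
    return (uint32_t)(((uint64_t)measured_rtt_us * factor_pct) / 100);
}

/* Example: a 10 ms measured RTT with a 150% factor is perceived as 15 ms,
 * making a delay-based CC such as Vegas back off earlier. */
```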

2.3.3 The NEAT system

The NEAT system is a New, Evolutive API and Transport-Layer (NEAT) Architecture for the Internet. NEAT is funded by the European Union's Horizon 2020 research and innovation programme. NEAT wants to change how applications are built in regard to the network stack. Today, the most predominant network protocols in use are TCP and UDP. Instead of forcing a programmer to decide upon which protocol the application should use, NEAT wants the user to specify its needs in regard to the network (e.g., guaranteed delivery, low overhead). Based on the requirements configured by the application programmer, the NEAT system makes an educated choice about which underlying protocol to use, giving the application the ability to be unaware of what underlying protocol is being used, as all communication goes through the NEAT API.

Figure 2.7:NEAT Architecture (used with permission from [NEAT])

The DA-LBE framework is one of many contributions to the NEAT-system.


2.3.3.1 Role of DA-LBE in NEAT

A part of the NEAT system is the ability to reduce or adjust the transmission rate of a network flow running the TCP network protocol. The ability to adjust throughput is important to support Less than Best Effort services for the application. The DA-LBE functionality will allow the NEAT system to adjust the transmission rate of an arbitrary CC algorithm in order to impose LBE behaviour. The goal is for the NEAT API to provide LBE mechanisms which can be configured system-wide or on a per-socket basis.

Through the NEAT API, the application programmer will be able to specify a soft deadline, so that the level of LBEness can be adjusted dynamically based on the estimated completion time and the soft deadline.

2.4 Linux Kernel architecture

The Linux Kernel architecture is presented in figure 2.8. The Linux Kernel is organized into different sections based on which hardware component they are associated with. This thesis will make modifications to the networking stack in the Linux Kernel, more specifically the socket access and protocols part. This includes the TCP Input Engine and TCP Output Engine.


Figure 2.8: Linux Kernel Map [3]; the parts which DA-LBE touches are marked with a red square.

2.4.1 Compiling a custom kernel

In order to compile a custom kernel, a set of build packages is required; see section C.1. In Linux Kernel development, a misconfiguration could in the worst-case scenario cause a system failure, requiring a reinstallation of the system.

Common errors, such as overflows and memory leaks, are also more likely to occur than in user-space application development. With this in mind, it is wise to take some precautions when modifying the Linux Kernel, such as keeping regular backups and running the OS as a virtual machine in case of system failure.

The official Linux Kernel repository is publicly available on GitHub [Linux].

David Miller’s net-next tree [net-next] has been used as a starting point in the implementation of the DA-LBE framework. Net-next is a fork of the Linux kernel which contains a wider range of networking components in the source code (e.g., more TCP CC algorithms).


2.4.2 Communicating with the kernel

One challenge when developing kernel features that need to talk to, or get instructions from, user-space is how to achieve communication between kernel-space and user-space. An operating system usually segregates virtual memory into kernel-space and user-space. Kernel-space is strictly reserved for running the kernel, device drivers, and kernel extensions, and in most operating systems kernel memory is never swapped out to disk. User-space is the memory area where all user-mode applications work, and this memory can be swapped out when necessary.

A user application cannot access kernel space directly, and similarly kernel code cannot access the user-space without checking whether the page is present in memory or swapped out. Even though these cannot access each other directly, user-space and kernel space can communicate with each other in a variety of ways.

Table 2.6 summarizes the main approaches for communicating with the kernel from user-space in Linux:

System calls: A request by a running task to the Kernel to provide some sort of service on its behalf.

Netlink sockets: A special IPC mechanism used for transferring information between kernel and user-space processes.

procfs: A special file system in the Linux Kernel (/proc); it is a virtual (in-memory) file system.

sysfs: A virtual file system provided by the Kernel. Sysfs exports information about devices and drivers from the kernel device model to user-space, and is also used for configuration.

relayfs: A virtual file system implemented by the Kernel; it must be explicitly mounted by user-space to be available.

debugfs: An in-kernel file system for debugging information that does not belong in procfs or sysfs.

Socket options: A method used to set or get socket-layer or protocol options.

Table 2.6: Different ways to communicate with kernel-space

In the case of the DA-LBE framework, user-space should be able to fetch socket statistics relevant to the DA-LBE framework and to use the provided API for tuning network performance. When researching the different options for communication between kernel-space and user-space, a similar use-case was observed in the TCP_INFO socket option. TCP_INFO is also a structure for fetching real-time statistics about a socket, but it is read-only, meaning that there is no way of passing data from user-space to the kernel using the TCP_INFO socket option. Therefore, the DA-LBE socket options were designed to also accept updates of the DA-LBE parameters from user-space. Socket options are synchronous and applicable per socket. This is a real advantage for the DA-LBE framework, as the application programmer doesn't necessarily want the entire system/network to be LBE, but most likely just one or two network flows, such as an application performing a large backup. Procfs, for instance, would have required an additional file for every socket on the system, potentially creating a lot of overhead.

The user-space application can fetch the per-socket connection statistics relevant for the DA-LBE framework by calling getsockopt() with the DA-LBE info socket option. The DA-LBE statistics can then be analysed in order to decide what action must be taken to ensure LBE behaviour.
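A hedged sketch of such a call from user-space might look as follows. TCP_DA_LBE_INFO is an assumed name for the DA-LBE info socket option, and struct da_lbe_info refers to the statistics structure exported by the modified kernel headers (include/uapi/linux/tcp.h), so this example only compiles against those headers:

#include <linux/tcp.h>          /* modified uapi header defining struct da_lbe_info */
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Sketch only: fetch the per-socket DA-LBE statistics. The option name
 * TCP_DA_LBE_INFO is assumed here; the real name is whatever the modified
 * kernel headers define.
 */
static int fetch_da_lbe_info(int sockfd, struct da_lbe_info *info)
{
        socklen_t len = sizeof(*info);

        memset(info, 0, sizeof(*info));
        if (getsockopt(sockfd, IPPROTO_TCP, TCP_DA_LBE_INFO, info, &len) < 0) {
                perror("getsockopt(TCP_DA_LBE_INFO)");
                return -1;
        }
        return 0;
}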

For the initial implementation, the functionality has been added directly to the kernel instead of being built as kernel modules. DA-LBE needs to be able to work with any CC module, so placing it in the main TCP kernel code means that the functionality does not have to be duplicated in each CC module. Eventually, the goal is to get the code accepted into the mainline Linux Kernel; however, this may require some adaptations.

2.4.3 The TCP stack in Linux

The Linux TCP implementation is split across several files. This section presents how packets are stored, and the flow when sending and receiving packets.

The main parts of Linux TCP are located in the following files: tcp.c, tcp_input.c, tcp_output.c, and tcp_cong.c. The various CC algorithms are located in files such as tcp_cubic.c, tcp_reno.c, and tcp_vegas.c. The entry point is the tcp.c file, where the TCP connection procedures are defined. Incoming and outgoing packets are handled in tcp_input.c and tcp_output.c. The tcp_cong.c file implements pluggable TCP CCs as well as the default CC algorithm. Figure 2.9 gives an impression of how the communication between the different files works.


Figure 2.9: Linux TCP Output Layout [4]

2.4.3.1 TCP Output Engine

The TCP Output Engine handles all outgoing packets, and most of its logic is located in the tcp_output.c file. It is responsible for setting the flags and header fields that signal different events to the other endpoint. The output engine keeps track of the output queue, which is a doubly linked list of sk_buff structures; it picks from the head of the queue when an ACK is received or more window space becomes available. All output paths lead to tcp_transmit_skb(), which hands the SKB over to the network layer. Retransmissions, triggered by fast retransmit or an RTO, are handled by tcp_retransmit_skb(), which in turn passes the SKB to tcp_transmit_skb(). When an ACK is received, TCP can unlink and free the acknowledged SKBs in sk_write_queue using kfree_skb().

2.4.3.2 TCP Input Engine

The TCP Input Engine handles all incoming packets, analyses the header information, and looks for flags indicating that a specific action is required. It is responsible for storing and processing incoming packets for each socket before the data is delivered to user-space in the same order it was sent by the TCP sender.

2.4.3.3 The Socket Buffer (SKB)

The socket buffer (SKB) is a crucial part of the Linux networking stack. Every packet sent or received is represented by this data structure, the sk_buff structure, which is used by all network-related queues and buffers in Linux.

The sk_buff is a very large structure, as it contains a range of different variables required for storing a packet. The structure is designed as a doubly linked list, which is ideal for keeping track of the beginning and end of a buffer. Each instance of the structure contains a pointer to the next and previous element in the buffer it belongs to.

Figure 2.10: Socket Buffer [9]

An instance of the sk_buff structure is often moved between different kinds of buffers. The process of moving an element from the beginning/end of one buffer to the beginning/end of another buffer must be efficient. When the sk_buff structure is allocated, the header and data parts are split up, so when moving an sk_buff element, only the header part is copied/moved. The TCP socket receive buffer is an example of a list where sk_buff is used.
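The idea can be illustrated with a heavily abridged sketch (the real struct sk_buff in include/linux/skbuff.h contains far more fields): the list linkage is separate from the packet data, so moving a packet between queues only relinks the element and never copies the payload.

/*
 * Heavily abridged, illustrative sketch of the concept behind struct
 * sk_buff; see include/linux/skbuff.h for the real definition.
 */
struct sk_buff_sketch {
        struct sk_buff_sketch *next;    /* doubly linked list pointers used   */
        struct sk_buff_sketch *prev;    /* by every packet queue in the stack */

        unsigned char *head;            /* start of the allocated data buffer */
        unsigned char *data;            /* start of the valid packet data     */
        unsigned char *tail;            /* end of the valid packet data       */
        unsigned char *end;             /* end of the allocated data buffer   */
};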


Implementation in Linux

This chapter presents the implementation of the DA-LBE kernel framework in Linux and explores how it has been structured and organized. The mechanisms that have been implemented in order to provide LBE behaviour in the Linux Kernel are discussed in the following sections.

3.1 Applying theory to practice

In order to put the theoretical framework proposed by David A. Hayes, David Ros, Andreas Petlund, and Iffat Ahmed [DA-LBE] into practice, the implemented framework needs to be able to adjust the transmission rate based on network conditions. A statistical structure is required to provide additional indications about the level of network congestion on a per-socket basis. The statistical structure gives user-space the ability to infer which DA-LBE mechanism is appropriate to execute in order to behave in an LBE manner.

Mechanisms that support an arbitrary CC algorithm (both loss-based and delay-based CCs) must be implemented. In the case of DA-LBE, the focus is on Cubic and Vegas, as they are commonly used on the Internet. In order to achieve per-flow rate limiting, an approach based on socket options has been chosen, which adds structures to the Linux Kernel networking stack. The mechanisms implemented in the DA-LBE framework should be able to adjust the transmission rate without having to drop packets.

In order to decrease the footprint of the DA-LBE framework, the calculations (logic) have been encapsulated into a control block. The control block checks whether or not DA-LBE is enabled on the current socket. If DA-LBE is disabled, the logic within the control block is never executed, thus avoiding unnecessary processing cycles. This means that no DA-LBE calculations are executed on regular network flows unless explicitly requested, even though the CPU overhead of the DA-LBE framework is negligible.
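A minimal sketch of this guard is shown below; the field name da_lbe_enabled and the helper name are illustrative rather than the exact identifiers used in the implementation, and the field is assumed to be one of the DA-LBE additions to tcp_sock.

#include <linux/tcp.h>

/*
 * Illustrative sketch of the DA-LBE control-block guard: unless DA-LBE has
 * been enabled on this socket, none of the DA-LBE accounting runs, so
 * regular flows pay essentially no cost.
 */
static inline void da_lbe_control_block(struct tcp_sock *tp)
{
        if (!tp->da_lbe_enabled)
                return;         /* regular flow: skip all DA-LBE work */

        /* ... DA-LBE statistics gathering and price adjustment go here ... */
}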

When implementing the DA-LBE structures in the Linux Kernel, it must be kept in mind that a network transmission could run indefinitely, so accumulating network statistics need to handle a potential wrap-around of variables. Therefore, the DA-LBE variables have been kept large to prevent this from occurring frequently. The DA-LBE framework checks for wrap-around in the calculations made in the kernel, but places the burden on the user-space part of the application when it comes to the statistical structure. DA-LBE uses unsigned variables in the statistical structure, avoiding the problem of getting unexpected negative values as the result of a wrap-around. With defined upper and lower bounds on the variables, the application programmer is able to handle a wrap-around, should it occur.
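For example, with unsigned counters the difference between two successive readings stays correct across a single wrap-around, which is part of what makes it practical to leave this handling to user-space. A small sketch of the idea:

#include <stdint.h>

/*
 * Sketch: unsigned (modular) arithmetic keeps the delta between two
 * successive counter samples correct even if the counter wrapped around
 * exactly once between the two reads.
 */
static uint32_t counter_delta(uint32_t prev, uint32_t now)
{
        return now - prev;      /* well defined modulo 2^32 in C */
}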

3.2 Linux Kernel architecture

The DA-LBE Linux Kernel framework has been implemented in the Linux networking stack, more specifically in the TCP stack. The statistical structure and the mechanisms that make up the DA-LBE framework are spread across different TCP-specific variables and files in the Linux Kernel networking stack. Table 3.1 presents the location of the DA-LBE mechanisms.

net/ipv4/tcp.c: Socket options handling and get_da_lbe_info()

net/ipv4/tcp_input.c: Phantom ECN, Phantom Delay, Ignore ECN, Ignore CWND and Average congestion interval

net/ipv4/tcp_output.c: Fast retransmits and Slow retransmits

include/uapi/linux/tcp.h: Socket options defined and the da_lbe_info struct

include/linux/tcp.h: DA-LBE declaration in tcp_sock

include/net/sock.h: DA-LBE mode

Table 3.1: Files affected by the DA-LBE Linux Kernel framework
