Locality-Aware Big Data Workﬂow Orchestration Using Software Containers

(1)

Locality-Aware Big Data Workflow Orchestration Using

Software Containers

Andrei-Alin Corodescu

Thesis submitted for the degree of Master in Informatics: Programming and

Systems Architecture

(Distributed Systems and Networks) 60 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

(2)

(3)

Locality-Aware Big Data Workflow Orchestration Using Software Containers

Andrei-Alin Corodescu

(4)

Locality-Aware Big Data Workflow Orchestration Using Software Containers http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

(5)

Abstract

As part of extracting value from data, a variety of heterogeneous data sources, tools, processes need to be integrated, forming workflows. Existing solutions help with formalizing the processes of defining, deploying and executing workflows, and offer means through which custom processing steps can be integrated into workflows. More recently, software container technologies such as Docker and Kubernetes have been leveraged to allow for any kind of step to be integrated, regardless of the technology/ programming language, as long as it can be containerized.

With the advancement of the edge computing paradigm, processing of data happens on heterogeneous, geographically distributed infrastructure, which makes it necessary to have software solutions that are capable of scheduling the processing of data in a way to reduce data transfers over long distances (data locality). The existing big data workflow orchestration solutions are limited in their ability to handle data locality, are inefficient to process small, frequent events, specific to edge environments, due to overhead of instantiating a new container for each unit of data to be processed. Furthermore, while the processing logic can be injected using containers, the integration with different data sources is dependent on support from the solution, either built in or through the extension of the solution’s code.

The thesis proposes a novel architecture, accompanied by a fully functional implementation, for container-centric big data workflow orchestration systems addressing the identified limitations of the existing solutions. The proposal takes into account any available data locality information by default, leverages long-lived containers to execute workflow steps and handles the interaction with different data sources through containers.

The proposal is evaluated through a comparison with a similar, existing solution, Argo Workflows and through a series of experiments meant to highlight under which conditions does the proposal provide significant benefits.

(6)

Acknowledgements

I would like to thank everyone who stood by my side (physically or virtually) and supported my efforts towards the realization of this thesis.

Tremendous gratitude goes towards Dr. Dumitru Roman, my main supervisor, for the excellent guidance and feedback, especially in the form of asking the right questions to guide my thesis in the right direction.

Through my supervisors, I have been exposed to a group of awesome, talented people (Dr. Ahmet Soylu (co-supervisor), Nikolay Nikolov, Prof. Mihhail Matskin, Dr. Amir Payberah, Akif Quddus Khan) from different research institutions (University of Oslo, Oslo Metropolitan University, SINTEF Digital and KTH Royal Institute of Technology), who had the patience of assisting me throughout the entire process of my thesis, from the times when my ideas were all over the place to the final deliverable. Their guidance was invaluable and often helped me move past obstacles and significantly increase the quality of the thesis.

Finally, I would like to extend a special thank you to my family and friends who supported me during this work, especially towards my partner, Diana Rosu for all the encouragement and support.

(7)

List of Figures

2.1 Cloud deployment with ubiquitous computing . . . 19

2.2 Edge resources complementing the cloud . . . 19

2.3 Single step . . . 21

2.4 Workflow as sequence of steps with communication mediums . . 22

2.5 Example smart grid data workflow . . . 24

2.6 Separation of concerns figure . . . 25

2.7 Container vs VM based virtualization . . . 26

4.1 High level overview of the design components . . . 39

4.2 Data locality multi-objective optimization . . . 40

4.3 Comparison between library based and container based extension model . . . 43

4.4 Detailed view of a compute step . . . 44

4.5 Component ownership . . . 45

5.1 Aspects of communication . . . 48

5.2 Example workflow visualization . . . 54

5.3 Orchestrator overview . . . 56

5.4 Routing priority for the greedy and load spreading algorithms . . 59

5.5 Sidecar deployment pattern. . . 60

5.6 Volume setup . . . 63

5.7 Overview of the data transfer solution . . . 64

5.8 End to end request flow . . . 67

6.1 Testbed topology . . . 71

6.2 Jaeger UI . . . 72

6.3 Supporting components for experiments . . . 73

6.4 Proposed solution vs Argo . . . 75

6.5 Time distribution for same workflow, but different steps: echo (left) and byte shuffle (right) . . . 77

6.6 Benefit of data locality . . . 78

(11)

List of Tables

5.1 Comparison between communication options . . . 49

(12)

Listings

5.1 Protobuf syntax . . . 50

5.2 The specification of a workflow . . . 52

5.3 The specification of a workflow . . . 53

5.4 Example workflow . . . 54

5.5 Orchestrator Service . . . 55

5.6 Data localization type . . . 57

5.7 Sidecar/ Agent Service . . . 60

5.8 Interface implemented by the data adapter . . . 61

5.9 Interface implemented by the data master . . . 65

5.10 Interface implemented by the compute steps . . . 66

(13)

Part I

Preamble

(14)

Chapter 1

Introduction

1.1 Context

In recent years, harnessing large data sets from various sources has become a pillar of rapid innovation for many domains such as marketing, finance, agriculture and healthcare [1]. The innate ability of data to describe real world phenomena lead to the field of Big Data gaining significant attention from both the academic and industry communities. The large amount of attention leads to a fast development of multiple solutions meant to simplify and make Big Data accessible either by reducing costs or creating simple, yet powerful abstractions. The Big Data domain itself evolved rapidly and new challenges arose at all levels of the technological stack: from the complex business logic to be executed to the infrastructure required to process the ever increasing volume, velocity and variety of the data.

Cloud and Edge computing: Traditionally, cloud services providers have been the standard solution for working with Big Data. However, cloud services are inherently centralized in a small number of locations (datacenters) around the world and with the advent of the Internet of Things, significant amounts of data are generated at the edge of the network. Transferring data over large distances to the cloud can, in such cases, become prohibitively expensive and incurs a latency cost that might make low-latency scenarios unfeasible. To address these issues, the paradigm of edge computing [2] aims to complement cloud computing by leveraging hardware resources situated closer to the edge of the network to offload processing, reducing transfer cost and satisfying the latency requirements.

Big data workflows: Working with Big Data is a complex process, involving collaboration between a wide range of specializations (distributed systems, data science, business domain expertise, etc.) [3] [4]. Handling the complexity naturally comes with an increased cost and the value extracted from the data must offset this cost. Big data workflows formalize and automate the processes data goes through to produce value. Platforms helping with the orchestration of Big Data workflows are responsible for:

• Exposing simple, powerful abstractions that support the definitions of

(15)

expressive workflows.

• Efficiently leveraging the underlying hardware resources to coordinate the execution of the workflows.

The recent focus on Big Data lead to the creation of a large number of alternatives for addressing different challenges of Big Data workflows. Big data workflows usually integrate a wide variety of data sets and leverage different programming languages or technologies to process data. Different parts of the workflow might be easier to develop using certain technologies (e.g., a specific software library available only for Python), while other parts might be leveraging other technologies (e.g., if a piece of business logic already exists as a Bash script, it would be desirable to integrate it as part of the same workflow). Combining data from multiple sources and leveraging different technologies to handle data is a key element of Big Data, as it allows reasoning over a more complete picture of the studied phenomenon compared to what a single dataset can offer. A desirable feature of a Big Data workflow system is to be capable of orchestrating workflows in a technology-agnostic fashion, both in terms of data integrations and processing logic.

An important aspect of orchestrating Big Data workflows is the process of deploying and managing the software components involved in the workflow. As the computation happens in a distributed system, which is inherently more complex, it is desirable to have systems in place that simplify the deployment and management of these components.

Due to the sheer complexity of Big Data workflows, it is desirable to create dedicated components for a better separation of concerns (e.g., a component can handle the interaction with the data sources while another component can host the processing logic). This separation allows workflow creation to be a collaborative effort, where knowledge about the entire system is not needed.

Having small dedicated components also encourages re-usability across workflows and simplifies the development lifecycle of each component.

Containers: Software container solutions such as Docker [5], allow software to be packaged together with all dependencies and to be run in isolation from other applications on the same machine. While they are mainly used to simplify the deployment and management of stateless web services, more recently, one can see a wider adoption in the field of Big Data as well [6].

Software container and orchestration tools can greatly simplify the deployment of Big Data solutions by either wrapping components into containers or using them as the vehicle for developers to inject specialised code into platforms (e.g.

processing logic in a Big Data workflow system).

Many existing Big Data processing solutions use the concept of containers, but with implementations tailored to a particular technology or use case (e.g., the code has to be able to run in a Java Virtual Machine,), to isolate and execute logic on the distributed infrastructure. Software containerisation provides an alternative way of addressing this challenge while being a standardized, technology- agnostic and widely accepted method.

Existing tooling and architectural patterns developed around software containers have been successfully applied to Big Data workflows to address their specific

(16)

challenges as well (interoperability between different programming languages or technologies, separation of concerns and deployment simplification).

Container-centric big data workflow systems:Existing solutions capable of orchestrating big data workflows (such as Argo [7], Airflow [8]) leverage software containers to allow their users to create and execute workflows using processing steps developed in any technology or programming language, as long as the logic can be containerized.

1.2 Motivation

Latency and bandwidth utilization are two key measurements for many Big Data solutions and can decide the feasibility of a particular solution.

While it is beneficial to leverage software containers to better separate concerns in a Big Data workflow system, higher level abstractions come with a performance penalty and thus it becomes more relevant to ensure the system is performing as efficiently as possible.

The core motivation of the thesis is to integrate ideas and concepts used in other solutions or contexts to improve performance and reduce costs of container-centric Big Data workflow systems, without sacrificing on simplicity and usability. Existing solutions are mostly designed for cloud-only workloads, which makes them unsuitable or inefficient for workloads spanning both the cloud and edge. The proposed improvements are intended to make containerized Big Data workflow systems more efficient on cloud and edge resources, thus enabling use cases with stricter latency requirements and reducing the running costs. Simple to use, yet performant Big Data workflow systems make the extraction of value from data more accessible.

Data Locality:With the processing of data happening on a system which is geographically distributed across edge and cloud resources, a crucial performance aspect is the reduction of delays and costs introduced by the need to transfer large amounts of data over the network. To this end, workflow orchestration solutions should attempt to schedule the processing as close to the data as possible (data locality). Data locality has successfully been leveraged in other Big Data solutions such as Hadoop [9] to improve performance and reduce bandwidth usage, but less so in the container-centric Big Data workflow orchestration systems. Two of the main motivations for the edge computing paradigm are increased performance and reduced cost, making data locality a central aspect of any system targeting edge and cloud deployments.

Long-lived containers: At the edge of the network, data is usually processed in small, frequent units, which are not handled efficiently by the existing frameworks, due to their reliance on instantiating a separate container for each unit. The thesis explores how to re-use the same container to process multiple units of data (long-lived containers) to reduce the overall execution time.

Data source integration: While existing works leverage the software container capabilities to encapsulate the units of processing logic composing the workflow, the logic used to interact with different data sources does not benefit from the same flexibility, as it needs to be integrated in the code of the orchestration solution. The current work proposes an approach of extending the flexibility provided by having an isolated component (container) to handle the interaction with data sources.

(17)

Implications of containerization:Separating concerns in different software units (containers) raises two major challenges which are explored by the work in this thesis.

• Communication contracts between the components need to capture the information that needs to be exchanged to effectively orchestrate the workflow.

• The communication needs to happen on relatively slow mediums (such as networks), where the communication protocol choice can dictate the performance characteristics.

1.3 Research questions

Based on the presented motivations, the work of the thesis revolves around the following hypothesis:

Hypothesis Data locality, efficient communication protocols and long- lived containers can be used to orchestrate Big Data workflows efficiently (in terms of execution time and bandwidth utilization) on cloud and edge resources, using loosely coupled, single-responsibility components.

The thesis aims to answer the following research questions:

Research question (RQ₁): How can data locality be implicitly integrated in container-centric Big Data workflow orchestration?

Research question (RQ2): How can software containers be used to facilitate the interaction with different data management systems as part of Big Data workflows?

Research question (RQ3): How can long-lived containers encapsulating processing logic be used to improve the performance of Big Data workflows?

1.4 Research design

The research method employed in this paper consists of four steps:

1. The first step for the research of this thesis consists of a review of existing solutions and scientific literature. The target areas involve both broader fields such as Big Data in general, software containers and edge computing and other works touching upon aspects of orchestrating Big Data workflows using software containers. Motivations, implementation aspects and general trends in the field of Big Data workflow orchestration are the focus of the review.

2. Based on the literature and landscape review, an advancement of the state of the art is proposed by highlighting challenges that have not been fully resolved in the current ecosystem. The proposed approach attempts to align itself with the fundamental aspects of the existing work, such that it can be generally applied and re-used.

(18)

3. The improvements to the state of the art are exemplified by proposing a novel architecture, together with a fully functional implementation of the architecture. Both the architecture and implementation exhibit a large degree of modularity, which allows individual ideas and components to be re-used in other solutions.

4. The approach is evaluated by comparing the performance with an existing solution, followed by detailed experiments to observe the characteristics of the proposal under relevant conditions. The second set of experiments is designed to highlight both benefits and drawbacks of the proposed approach, to provide an objective view over cases when the approach is beneficial and when it is not.The experiments have quantitative measurements of success.

The research conducted as part of this thesis falls into the category of technological research as categorised in the technical report [10]. The report presents the technological research as an iterative process with the goal of producing better artifacts, starting from initial hypotheses or predictions.

1.5 Thesis outline

The remainder of the thesis is structured as follows:

Chapter 2: Current technological landscape This chapter expands upon the context and motivation of the introduction chapter by presenting the challenges of big data processing and how the cloud and edge paradigms complement each other to address these challenges.

Furthermore, the chapter dives deeper into the big data workflow orchestration solutions and introduces a taxonomy used to identify high level components of such solutions.

The last section of the chapter introduces software containers and related technologies, together with how these technologies are being integrated into the field of big data processing. The motivations and the benefits provided by software containers are also highlighted.

Chapter 3: Problem analysis This chapter starts by presenting the focus areas of the thesis and their relevance in the context of big data workflow orchestration.

A set of related works and other existing solutions are evaluated from the perspective of the focus areas and based on the results of the analysis, potential improvements to the state of the art are proposed.

A framing for the thesis work is then created by formalizing a set of goals around the proposed improvements.

Chapter 4: Design This chapter describes a novel architecture for Big Data workflow orchestration systems, which incorporates the ideas proposed in the previous chapter.

The architecture is described in terms of high level components, the roles they fulfill and the communication between them.

(19)

For key aspects, design alternatives are discussed and analyzed from the perspective of the focus areas of the thesis.

Chapter 5: Implementation This chapter presents a detailed description of the implementation of the architecture proposed in the previous chapter.

The chapter dives deeper into the features of different technologies (Kubernetes, Docker, ASP.NET, gRPC, etc.) and how these were leveraged together to create a fully functional implementation.

When applicable, implementation alternatives are analysed similarly to the architectural decisions in the previous chapter, from the perspective of the focus areas of the thesis.

Chapter 6: Experiments and results This chapter describes experiments conducted to evaluate the proposed implementation, together with the testing environment they are executed in. In addition, the technologies used to gather data about the execution of the proposed solution and the technologies used to further analyze the data and draw conclusions are also discussed.

The results of the experiments are analysed and discussed and a set of conclusions about the proposed solution is highlighted.

Chapter 7: Conclusion and future work The last chapter concludes the work of the thesis by summarizing the contributions to the state of the art.

The initial goals are revisited to analyze the degree to which they have been achieved.

The last section discusses potential directions for future expansion of the thesis work.

(20)

Part II

Background

(21)

Chapter 2

Current technological landscape

This chapter gives a brief introduction into the domains touched upon by the thesis. The information is not meant to be exhaustive, but rather to provide the reader with enough context to understand the background and the motivations for the proposed improvements to the state of the art.

2.1 Big data

The term ”big data” does not have a formal definition, but both the academic and industry communities often characterise big data using a number of ”Vs”

[1]:

1. Volume: refers to the raw size of the data with values ranging from hundreds of gigabytes to petabytes of data.

2. Velocity: in parallel with the high volume of data, the high rate at which it is being generated is also a defining characteristic for big data. Systems need to be designed to scale up to meet the throughput requirements.

3. Variety: big data solutions are often characterised by the ability to reason over a wide variety of data. The system needs to be able to adapt to different formats, varying characteristics and reconcile the differences to leverage all available data.

These characteristics refer to the technical aspects of handling big data, and building solutions that can address these challenges is complex and costly.

Devising the algorithms and tools that help extract value from the data further amplify the complexity and cost of building comprehensive big data solutions.

Consequently, another V, thevaluethe solution generates, which needs to offset the high cost, is often included as a central characteristic of big data.

Big data has quickly gained momentum as it allows the exploitation of data to uncover hidden trends that give useful insights to drive business decisions and support research. It has been successfully leveraged in a large number of industries including marketing, social network analysis, healthcare and finance.

(22)

The general applicability of big data patterns stems from the fact that many real-world phenomena can be better understood and predicted given a sufficient amount of data.

Data is considered an asset today and it can be bought and sold on data marketplaces ([11]). These data marketplaces encourage leveraging wide varieties of data which can be used to enhance the value of the end results. The recent study [12] reveals that the variety and the velocity of the big data play a more important role compared to the volume when measuring innovation performance.

2.2 Cloud technologies

The previous section highlights the high complexity and cost of managing big data operations. The characteristics of big data translate into challenges at all levels of the technology stack:

1. At the infrastructure level, significant raw network, storage, memory and compute resources are required to handle big data. Often these resources are provided by multiple machines, organized in a distributed system.

2. At the platform level, software frameworks that can effectively leverage the available resources and handle the ever changing needs of big data operations need to be continuously developed.

3. At an application level, algorithms running on the previously mentioned platforms need to be devised and combined to extract value from the data.

Applications can also facilitate the interaction with a big data solution (e.g., visualisation tools used by business executives to analyse the results produced by a big data solution).

The paradigm of cloud computing paved the way for accessible, affordable, efficient big data processing through scalability, elasticity, pay for what you use model. The cloud providers offer a wide range of offerings that can address and are tailored for the big data domain:

• At the lowest level, Infrastructure-as-a-Service (IaaS) offerings provide the raw resources needed for big data processing (e.g virtual machines, networking). Using the infrastructure as a service offerings greatly reduces the complexity of setting up a big data solution.

• Cloud providers also offer managed Platforms-as-a-Service (PaaS) to make big data more accessible (e.g., data lakes such as ADLS [13] to store the data and make it available to processing platforms, and even fully managed clusters running a particular big data platform - e.g., Amazon EMR [14]

offering managed Hadoop/ Spark clusters)

• An example of Software-as-a-Service (SaaS) offering is Microsoft PowerBI [15], a data visualisation application with built in integration with popular big data platforms. Such tools simplify the process of extracting value from raw data.

(23)

Figure 2.1: Cloud deployment with ubiquitous computing

Figure 2.2: Edge resources complementing the cloud

Cloud deployments are best suited for cases where the producer of data is largely centralised: for example, a web server capturing and processing the network traffic for further analysis (clickstream data) [16].

However, with the advancement of ubiquitous computing, tremendous amounts of data is produced by devices (sensors, personal devices, vehicles, etc.) at the edge of the network. With the number of such devices increasing at a high pace, the traditional model for making use of centralised cloud resources becomes unfeasible due to the high costs of transferring data from the edge of the network to the cloud and tighter latency requirements (Figure 2.1).

2.3 Edge computing

To address the limitations of cloud computing, a new paradigm is currently being explored: complementing cloud computing resources by performing computations on resources physically located closer to the edge (Figure 2.2).

(24)

The literature classifies edge computing paradigms in multiple categories [2], such as fog, cloudlet, multi-access edge [17]. For the remainder of the thesis, the term of edge computing will be used to refer to any resource outside of traditional cloud setups.

Although the edge computing paradigm addresses some fundamental limitations of cloud computing, it also faces a different set of challenges [18]:

• Hardware resources on edge devices exhibit different characteristics and capabilities (processor architectures, processor speed, amount of memory, etc.) Software running on such devices has to be designed taking into account the resource constraints more than a cloud-only counterpart. At the same time, edge deployments offer limited or no elasticity (it is not possible to provision more resources in a short amount of time).

• Edge resources are geographically distributed and observed latencies can differ significantly depending on the distance between the communicating parties.

• Geographical distribution also raises logistical challenges, as these devices can be spread over wide areas and sometimes even in hard to reach locations, making provisioning and maintenance a significant challenge.

• The nature of edge resources also makes them prone to failures, both at the device level (hardware failures) or supporting infrastructure (network failures). Software solutions targeting edge deployments need to tolerate failures gracefully and, if possible, be able to operate offline for extended periods of time.

• Security and privacy are two complex domains where edge computing plays a significant role:

– On then one hand, being able to process data closer to the source can make it easier to adhere to certain jurisdiction (e.g., European privacy laws restricting the flow of personal data internationally [19]) and ensure better security and privacy (by using trusted devices, anonymizing data before sending it to the cloud, etc.).

– On the other hand, large scale edge deployments are inherently harder to secure mainly due to their geographical spread and the risk of having devices compromised through physical interference is much higher than a cloud only setup.

From use-case to use-case, the challenges might differ significantly (e.g., security and privacy might not be a top priority for a system only processing publicly available data). Edge computing has been successfully applied to power domains such as smart cities, agriculture, healthcare, industrial processes and others. In most domains, the edge computing resources are used both the meet strict latency requirements [20] for user interaction and also to support the analysis of the collected data through big data solutions [21], [22].

While edge computing has many differences from cloud computing, the two are meant to complement one another and many systems need to operate on both edge and cloud resources. For the remainder of the thesis, the term ”computing continuum” [23] is used collectively to refer to all the available resources for a system, from the edge of the network to the cloud.

(25)

Figure 2.3: Single step

2.4 Big data workflows

Before producing value, raw data has to be taken through a series of steps (e.g., cleaning, aggregation, transformation), depending on the use case. While it is possible for each of these steps to be manually triggered and independently controlled in an ad-hoc manner, workflow orchestration tools facilitate the automation of the execution and sequencing of the steps, allowing for reliable, reproducible execution of complex processes.

2.4.1 Taxonomy

For the purpose of the thesis, a simplified taxonomy is presented to refer to specific concepts in workflow orchestration. However, the proposed taxonomy captures the essence of big data workflow orchestration and the presented concepts are generic enough to be mapped to particular implementations of existing solutions or other works.

Steps

The atomic unit upon which workflows are defined aresteps (Fig 2.3). Steps encapsulate units of logic that can be applied to data (e.g., applying a black and white filter to an image). In a simplified generalization, steps receive data as input from a data source, processes it and then push the outputs to a data sink.

(26)

Figure 2.4: Workflow as sequence of steps with communication mediums Workflows

Individual steps can be linked together to form a workflow. A linear sequence of steps is a workflow, with the semantics that steps are sequentially executed in the order they appear in the workflow definition (Figure 2.4).

The distributed nature of big data processing warrants the existence of a communication medium through which information can be exchanged between the components of the system. A distinction is made between two type of communication mediums:

Data communication medium

The data communication medium serves as the channel through which the data being processed is flowing through the system. As such, both the inputs and the outputs of a step connect to such channels. Any two subsequent steps within the workflow need to exchange data (the output from the previous step becomes the input to the next one) through the same medium. However, it is possible for a workflow to use different data communication mediums at different steps of the workflow (e.g., the input to the first step could come from a web service, and all the subsequent steps could use a distributed file system to exchange data and store results).

Control flow communication medium

To be able to execute the workflows, control messages (e.g., triggering a step, notifying when a step has finished) have to be exchanged within the system, through a control flow communication medium. Examples of such communication mediums include point to point network communication between components, or using a message queue [24] to decouple to producer and consumer of the messages.

(27)

2.4.2 Example workflow

To aid the understanding of big data workflows that can span resources over edge and cloud deployments, a fictional example inspired by real world trends is used:

The energy landscape is constantly changing to address the environmental challenges around using fossil fuels to produce electricity. The wider adoption of renewable energy sources, electric cars, and small scale household energy producers (e.g., solar panels), mandated the creation of hardware and software solutions to assist in the management of energy consumption and production (smart grid). Big data analytics and edge computing are two of the key technologies used to enable and optimize energy efficiency in smart grids [25], [26].

Smart appliances and electricity meters can emit data about consumption or production levels of electricity. This data can be analyzed and used to optimise electricity usage at peak times, thus keeping the costs down for the consumer and, at the same time, improve electrical grid stability. For example, the charging rate of an electric car can be adjusted automatically depending on factors such as variable electricity prices.

The large amounts of generated data, combined with the inherent geographical spread of the resources and the low latency requirements of this use case make it suitable for edge computing.

An example workflow, described using the presented taxonomy, is composed of the following steps:

1. Data from appliances and smart meters may come in different formats and can contain multiple bits of information. The first step is toconvert the data into a common format.

2. Data from multiple points (e.g., households) can beaggregatedtogether.

3. Based on the aggregated data capturing the current consumption and production of electricity, theprice of electricity can be adjusted.

4. Once calculated, the new price can bebroadcastedto the smart meters and appliances to take appropriate actions (e.g., stop the charging of an electric vehicle if the price is too high).

5. The aggregated data can be sent to the cloud for offline analysis to power long term predictions and optimizations.

The first steps in the workflow can be executed closer to the data source, on edge devices to support faster decisions. The data from different edge devices is captured in the cloud and the workflow can continue with further steps in the cloud for offline analysis (Figure 2.5).

2.4.3 Workflow definition expressiveness

Big data workflows can be specified in more complex configurations than the sequence of steps introduced previously. A widely used model stems from graph theory, where a workflow is modeled as a directed acyclic graph (DAG), where individual steps are modeled as nodes and edges between nodes model the dependencies between steps.

(28)

Figure 2.5: Example smart grid data workflow

2.4.4 The need for better abstraction

Creating big data workflows is a complex process involving knowledge from many different domains (hardware provisioning, cluster management, data handling, different kinds processing steps, definition of workflows according to business needs, and orchestration).

Delegating the different responsibilities to independent components allows easier development of both the orchestration frameworks and the workflows running on top of them, which in turn reduces the costs and makes big data workflows a more accessible technology.

Isolated components can abstract away implementation details and expose a limited set of interfaces which are easier to understand and integrate with by other components.

For example, a step applying a black and white filter on top of color image only needs information about where to find the input image and return the information about where the output image was stored. A data handling component is then responsible with making the image available to the processing logic and the orchestration components are responsible with invoking this step at the correct time, with the correct input.

Separating concerns in individual components builds the basis for domain specific languages describing workflows to be simplified and limit the exposure of the underlying implementation details. Through a clean separation of concerns, individuals responsible for one area do not need to have intimate knowledge of the other aspects of the workflow. For example, an individual with deep knowledge about the business needs a workflow addresses, does not need to have knowledge of how the black and white filtering logic is implemented, but can still incorporate it as a step in the workflow definition.

Design time concerns

Figure 2.6 captures three potential actors involved in designing workflows, each contributing to a different core element:

• In the example figure, the components denoted ”Data D1” and ”Data D2” represent components capturing the logic used to interact with data stores and expose data to be processed as part of the workflow (e.g., the component ”Data D1” could be used to retrieve images from a web service, while ”Data D2” could denote a cloud storage solution).

(29)

Figure 2.6: Separation of concerns figure

• In a similar fashion, the ”Compute” components capture the logic that needs to be applied to the data. For example, the step C1 could apply a black and white filter to an image, while the step C2 could compress the image.

• A third actor can define workflows that leverages the data and processing logic defined by the previous two elements. In this example, an image would be transformed to black and white and then compressed as part of the defined workflow.

Execution time concerns

Once defined, a workflow needs to be executed to produce value.

A fundamental pre-requisite is having the infrastructure to host and execute the logic of the workflow. In most cases, the infrastructure is a distributed system, with the nodes potentially geographically spread across many locations.

On top of the infrastructure, instances of the data and compute steps components are leveraged by the orchestration framework to execute workflows.

The orchestration framework is responsible for:

• Accepting and semantically parsing a workflow definition.

• Ensure the workflow execution happens in accordance with the definition (e.g., correct sequencing of steps, correct data being passed as input, etc.)

• In some cases, the orchestration framework also manages the instantiation and lifecycle of the data components and compute steps on the available infrastructure.

At first, the orchestration framework entity seems to go against the principle of separation of concerns, as it fulfills multiple roles. In practice, most of the existing implementations delegate the fulfillment of these roles to separate

(30)

Figure 2.7: Container vs VM based virtualization

components, maintaining the separation of concerns. However, the orchestration components are usually distributed together, as a single solution.

2.5 Software containers in the big data world

2.5.1 Software container characteristics

Software containers are standardised, isolated units of software that package everything needed to run an application [27].

Software containers provide a lightweight, faster virtualization alternative to hypervisor virtualization [28]. Multiple containers can share the kernel of the underlying operating system, whereas each virtual machine has to run a full operating system on top of a hypervisor (Figure 2.7). An important aspect to note is that containers are not meant to replace VMs. On the contrary, in practice, for most applications, the container runtime is hosted within a virtual machine running on cloud infrastructure.

Software containers exhibit a series of characteristics that make them applicable to a wide range of domains:

• Containers ensure the packaged software runs in complete isolation from other application on the same operating system. By packaging all dependencies in the container itself, issues of conflicting dependencies and complex prerequisite configurations are completely avoided.

• The low overhead introduced by containers allows large number of containers to be run efficiently on a single node, and it also makes them a good fit for resource constrained environments.

• Software that can run in a container is not limited to a particular technology or programming language, which allows solutions leveraging software containers to easily orchestrate cooperation of components developed using different technologies.

(31)

• Containers are a widely adopted standard for software packaging, which translates into two major benefits.

– Containerized software is easier to share and re-use across different solutions.

– Containers can be used to move the execution of logic on to the nodes of a distributed system (”function shipping”). Developers writing a piece of software can use a container image to distribute it and thus, be completely disconnected from the deployment aspects.

Frameworks can then leverage the container images to instantiate the containers where and when the packaged software is needed.

Software containers are extensively leveraged in cloud environments [29], containerized applications being in some cases referred to as cloud native applications [30]. A number of works [31]–[33] also identify software containers as a feasible technology for resource constrained, edge environments.

2.5.2 Container orchestration

Containers allow defining and running independent pieces of software serving different roles in a distributed application. However, today’s applications are relying on a large number of inter-connected components, heterogeneous infrastructure and the ability to safely roll out updates to these components.

Container orchestration solutions, such as Kubernetes [34], simplify the deployment and management of highly distributed systems by creating abstractions for the underlying infrastructure and facilitating the interaction between the components of an application.

Kubernetes is based on a declarative method of defining how the components should be deployed and managed through YAML files. The declarative approach makes it easy to capture the entire process of setting up complex applications as code, which greatly simplifies the deployment and management processes.

In addition, Kubernetes provides native support for many cross-cutting concerns in highly distributed application: secret management, service discovery, networking, application telemetry. It has an extensible model which allows for applications to natively integrate with it to provide additional functionalities or modify the default behavior.

Software containers and the associated orchestration tools are currently seeing a lot of adoption in a variety of domains, allowing the ecosystem of existing tools and knowledge to flourish.

Kubernetes is a complex offering with many features touching upon many areas of distributed systems. The implementation chapter will briefly describe the main functionalities leveraged for the development of the current thesis.

2.5.3 Usages for big data

In the context of big data, containers have been used to simplify the deployment and management of entire big data solutions [35] or of individual components [36]. Leveraging containers in big data solutions can also lead to performance improvements when compared to hypervisor-based virtualisation alternatives [37].

(32)

At a high level, a distinction can be made between two strategies of using containers in big data solutions:

1. Leverage software containers to deploy and manage the components of a big data processing platform. This approach simplifies the deployment process but it does not influence the runtime aspects of the platform.

2. Software containers are used as an integral part of the architecture and are used as the mechanism through which custom behavior can be injected into the platform (e.g data processing logic).

The paper [38] dives deeper into the distinction between the different types and recognizes four different integration strategies, depending also on the means through which the integration between the platform and the logic encapsulated in the container is achieved.

2.6 Inspiration from microservices architecture

Software containers gained a lot of traction in the field of microservices as they greatly simplify the management of highly distributed systems, where functionality is spread across a large number of small, single-purpose components.

As a consequence, the ecosystem around software containers has evolved to primarily address challenges common to these architectures. A large number of these solutions can be reused in other types of applications as well.

Employing a microservice-oriented architecture for big data workflow orchestration system allows the re-utilisation of the existing tooling and knowledge around microservices.

Such architecture can provide a number of benefits [39] which align with the needs of big data workflow systems:

• The communication between different components is governed by technology agnostic contracts and protocols (e.g., REST APIs over HTTP).

• Each component can be implemented independently of the others, allowing independent software development lifecycles and even the use of different technologies or programming languages.

• A modular architecture result in the creation of small, single purpose components, which encourages separation of concerns and re-usability.

On the flip side, there are also some drawbacks of using microservices:

• As the logic is now spread over a larger number of isolated components, the communication between these components can become a significant factor in the performance of the solution, especially for slow communication mediums (e.g., network calls).

• The larger number of components results in an overall more complex system, which is considerably harder to understand and manage.

(33)

Chapter 3

Problem Analysis

The previous chapter provided a short overview of the technical domains the work of the thesis is part of. The current chapter highlights the precise sub domains the thesis contributes to.

3.1 Focus areas

The previous chapter introduced big data workflows at high level, without going into details about the challenges encountered at runtime during workflow orchestration.

Execution time and bandwidth usage are two indicators that are often measured in big data systems and determine the feasibility of a solution. The high velocity of the data, combined with the large volume, mandates the processing to happen in an efficient, cost-effective manner to produce value that outweighs the costs.

The work of this thesis targets the reduction of execution time and bandwidth usage for big data workflows through the following techniques:

3.1.1 Data locality

Data locality refers to the process of moving computation closer to the data, which is typically cheaper and faster than moving data closer to the computation.

The nature of working with big data mandates the resources (network, memory, CPU, disk) of multiple machines to be pooled together in adistributed system. A desirable characteristic of distributed systems is to hide the complexities of the distributed nature of the resources behind a single interface such that the entire system appears as a single entity (e.g. cloud storage systems such as Amazon S3), which makes it more challenging to leverage individual hosts backing the distributed systems.

A fundamental invariant of the current computer architecture is that a CPU can only work with data present in the memory of the same machine.

Consequently, data movement across machines becomes an integral part of any big data system.

Besides traditional communication protocols that rely on the operating system network stack (e.g. TCP/IP based protocols), for use cases where latency is a critical factor, more efficient protocols have been evolved. RDMA (Remote

(34)

Direct Memory Access) [40] is such a protocol, and allows the transfer of data stored in the memory of one machine to another without involving the CPU or the operating system kernel, through specialized network cards.

As the quantity of data is significant, the network traffic and the associated latency of transferring data between machines can influence the overall cost and performance of a big data solution.

As a result, the ability to distribute the work in a manner that reduces data transfers (data locality) has been studied and integrated in a number of works:

Even for solutions targeted at centralized deployment (such as cloud deployments), data locality has proven to be effective in reducing the cost and execution times [41], [42]. For example, Apache Spark [43] can leverage the information provided by the Hadoop File System (HDFS) and knowledge about outputs of previous work to minimize the data transfer.

One of the core motivations of the edge computing paradigm is reducing the amount of data transferred from the edge of the network to the cloud and supporting lower latency scenarios, making data locality a primary concern for any edge computing solution.

Data locality is only one aspect that can be taken into account when scheduling work. Other aspects such as load distribution and heterogeneity of the available resources on different nodes need to be balanced together with data locality to effectively perform the work [44]. Works such as [45] and [46] propose advanced scheduling strategies to balance the reduction of data transfer with load distribution.

3.1.2 Inter-component communication optimization

Separation of concerns and delegating responsibilities to different components has numerous benefits, some of which are highlighted in the previous chapter.

At the same time, the communication between components may introduce additional performance and efficiency overhead through the need of message serialization and transfer through potentially slow mediums. For example, what is performed as simple method invocation in a monolithic solution can be turned into a REST API call for a solution where components are separated.

The choice of communication protocols directly impacts both the performance and bandwidth utilization as different protocols provide different guarantees related to data transmission (e.g., TCP ensures ordered, lossless transmission but requires multiple roundtrips to establish connections and exchange data, while UDP is faster, but also unreliable). Different protocols introduce additional overhead by injecting required data (e.g. HTTP headers). Techniques such as compression, and binary serialization help reduce the size of the payload. The work [47] explores the possibility of using RDMA-backed memory channels to support fast efficient inter-container communication.

Apart from the bandwidth utilization and speed of a particular protocol, the contract defining the communication between two entities (message format, content and semantics) plays a significant role in facilitating the integration between components. Defining and enforcing a contract for the communication between the components allows for the implementations of the components to be decoupled from one another.

(35)

3.1.3 Lifecycle management of containers

Even if software containers present themselves as a lightweight virtualization alternative solution to traditional hypervisor-based virtualization, there is still a cost associated with starting up and shutting down containers as needed.

One aspect treated by the study on the usability of containers in workflow system [38] is the impact of lifecycle management on workflow execution time, reaching the conclusion that the fastest option is re-using containers to process multiple units of data.

For many use cases, the execution time of the work delegated to a particular container is high, thus making overhead of instantiating containers negligible.

However, in edge computing environments especially, the available resources are limited and the amount of data processed at a given point in time on a single host is reduced.

With data being constantly processed in small parts (or even streamed), there is need for this processing to happen as quickly as possible to achieve desirable throughput. In such cases, the overhead of setting up and tearing down containers can quickly add up and become a significant bottleneck for the performance of the solution.

3.1.4 Integration with data management solutions

Being able to reason over and process heterogeneous datasets together, in a unified manner is one of the pillars of big data. These data sets can be stored using different technologies and the interaction with these technology requires complex logic in itself.

As such, it is desirable for big data workflow systems to facilitate easy integration with different data management solutions (databases, file systems, cloud storage, web services, etc.)

3.2 Existing solutions

As part of the common workflow language (CWL) project [48], a list of existing orchestration solutions has been compiled. At the time of the writing, the list contains 295 entries. For the purpose of this thesis, only orchestration solutions capable of leveraging software containers to encapsulate processing logic were considered. The set of studied solutions, while not exhaustive, aims to capture the most relevant examples. The evaluation relies on the information provided by the available documentation for each solution.

3.2.1 Evaluation criteria

The existing solutions have been evaluated from the perspective of the focus areas of the thesis, namely:

1. Their ability to incorporate data locality in the orchestration process 2. The ease of integration with any data management solution

3. Container lifecycle management.

(36)

3.2.2 Considered solutions

This section presents a summary of the considered solutions and highlights how the evaluation criteria mentioned in the previous section is treated.

• Snakemake [49] is a workflow orchestration tool that supports wrapping individual steps in containers and different data solutions can be integrated into workflows by extending the Snakemake codebase. However, there is no support for controlling where the computation happens (data locality).

• Kubeflow [50] is a workflow orchestration tools oriented at machine learning related workflows. The only storage supported is Minio (a cloud native, open source alternative to the S3 cloud storage). It offers no support for data locality.

• Makeflow [51] is a workflow orchestration tool able to orchestrate workflows on a wide variety of infrastructure, but it does not have any built in support for different data management systems or data locality features.

• Pachyderm [52] is another machine learning workflow orchestration solution, but the only supported storage system is the pachyderm file system, a distributed file system built to offer additional functionalities to Pachyderm.

• Pegasus [53] is a workflow orchestration solution that supports containerized steps, and leverages the location of the processed files to schedule the steps on the same host. However, the data management is limited to file systems.

• Airflow [8] is one of the most popular data workflow orchestrators, and supports the execution of the workflows on a Kubernetes cluster. It is also possible to leverage Kubernetes pod affinity feature to control where instances of steps are created. However, the pod affinity needs to be set when the workflow is defined, making unable to capture dynamically changing requirements. Integration with different data management solutions is possible by extending the Airflow code with providers, thus limiting it to Python implementations only.

• Argo workflows [7] is a workflow orchestration solution natively built on Kubernetes and supports data locality through the full set of mechanisms that allow the control of how pods are scheduled on Kubernetes nodes (node selectors, affinity and topology spread). Similar to the Airflow solution, different data management solutions can be integrated, but require changes and integration with Argo code libraries.

All the considered solutions leverage short lived containers as part of the orchestration - a container is created to execute work and destroyed as soon as the processing completes as these solutions target mostly batch processing scenarios.

In terms of data locality specification, Argo offers the most expressive feature, as it leverages the full feature set offered by Kubernetes. However, by default, the limitation of having to specify data locality at workflow (introduced with the analysis of Airflow) definition time applies to Argo as well.

Argo offers a mechanism through which particular outputs of processing steps can be used to modify the parameters (for data locality in this case)

(37)

of subsequent steps in the workflow, allowing for dynamic of data locality configurations at runtime. Such an approach would require additional logic to be injected in the processing step however.

Pegasus, although limited in terms of data locality features (the processing either happens on the host storing the data or outside it - there is no support for complex topologies taking into account geographical distance between different hosts), does handle data locality implictly, without the need to modify the workflow definition. In contrast, for both Argo and Airflow, while offering more expressive data locality features, the workflow definition has to capture these details, breaking the separation of concerns.

3.3 Thesis contributions to the state of the art

In the light of the limitations highlighted in the previous section, to advance the state of the art, the thesis proposes a novel architecture, accompanied by a proof-of-concept implementation, focused on addressing the previously mentioned limitations.

The proposed improvements have all been studied in isolation or different contexts, and the thesis aims to transpose those findings and integrate them in container-centric big data workflow orchestration systems.

The findings of this thesis can benefit big data workflow designers (when using the artifacts of the thesis to define and orchestrate workflows) and also workflow orchestration system developers (by integrating the ideas and findings of the current research into other projects).

3.3.1 Data locality

In terms of data locality, the proposed solution allows the orchestration components to take into account data locality, quantified using a flexible model that accounts for the physical distance between hosts spread across the computing continuum.

Information on data locality is implicitly exchanged between data management components and orchestration components, thus limiting the handling of data locality to only two components intuitively involved in the process. As data locality is automatically handled, the workflow definition does not need to contain these details, leading to a better separation of concerns.

3.3.2 Container lifecycle management

The proposed solution is better suited for processing small, frequent units of data by leveraging long-lived containers - container whose lifecycle is not tied to a particular data unit, but they are instead re-used to process multiple units.

3.3.3 Data management solution integration

The proposed solution extends the ideas behind isolating processing steps in separate containers to the data management aspect of big data workflows. As such, the logic needed to interact with data management systems is encapsulated in containers, providing the same benefits as for processing logic (technology agnostic solution, isolation, lightweight, etc.)

(38)

3.3.4 Communication overhead

As part of the current work, different protocols for communication between components are analyzed from the perspective of their performance characteristics and formal contract support. The communication between components is enabled by the gRPC [54] framework, an efficient, contract-based remote procedure call framework.

3.4 Alignment with current technological directions

In the technical field, it is sometimes necessary to trade simplicity and generality to meet tight performance requirements. As such, an important thing to note is that the contributions this thesis brings to the state of the art are made in a context that aligns with the trends of big data workflow orchestration systems identified in Section 2.4.4, keeping the trade-offs at a minimum.

This alignment facilitates the reusability of individual ideas and components of the proposed solution in other existing and future works.

3.5 Goals

A set of goals is guiding the realization of the thesis:

1. Conduct a literature and existing solutions review to identify areas of improvement for container-centric big data workflow systems.

2. Propose a novel architecture and implementation that addresses the identified unsolved challenges.

(a) Achieve performance improvements and bandwidth usage reduction for big data workflows by automatically handling data locality at a platform level.

(b) Leverage long-lived containers to reduce the execution time of big data workflows.

(c) Allow the integration of different data management solutions by allowing logic to be encapsulated in software containers.

(d) Propose a simple distributed data management solution that exposes host-level data locality information.

(e) Instrument the solution with telemetry that allows thorough analysis of execution time and behavior.

3. Compare the performance proposed architecture with a relevant existing solution.

4. Conduct experiments to understand under when is the proposed approach beneficial over alternatives.

The second goal and its associated sub goals apply to the software solution developed as part of the thesis work, while the other goals apply to the content the thesis presents surrounding the software solution.

(39)

3.6 Evaluation of the proposed contributions

As the main focus of the thesis is performance and bandwidth utilization, the proposed contributions will be evaluated by analyzing the observations gathered from a series of relevant experiments. The observations translate into quantitative measures of success (e.g., time spent, amount of data transferred, etc.). The experiments are run in a manner that attempts to capture real world conditions of a geographically distributed system.

The contributions are analyzed by comparing the performance and bandwidth usage of the proposed solution with a relevant existing solution.

Furthermore, to better explain the behavior and the improvements observed in the initial comparison, a series of experiments meant to isolate and highlight the behavior of individual aspects (data locality, long-lived containers, etc.) of the proposed solution are also conducted and have the results analyzed.

Chapter 6 presents the details of the conducted experiments, together with a discussion of the observed results.

(40)

Part III

Proposed solution

(41)

Chapter 4

Design

4.1 Overview

This chapter presents, from a high level perspective, the proposed architecture for container-centric workflow orchestration frameworks. The novelty of the architecture is given by the ability to incorporate knowledge about the physical localization of the data into the work scheduling process. The proposed architecture aligns itself with the directions set by existing solutions and other related work with the aim of having individual ideas or aspects being reusable outside of the presented proposal.

The chapter presents the architecture in terms of abstract components, their roles and the communications between them. An example implementation of the proposed architecture is presented in the next chapter.

4.2 Component breakdown

Section 2.4.4 introduces a possible way to separate different concerns in a big data workflow system. The design for the proposed workflow system architecture reflects the separation by identifying three major areas of concern that, together, cover the runtime aspects of big data workflows:

1. Control layer, responsible for the execution of workflows in accordance with their definitions (e.g. correct step sequencing, correct data being processed). The main component of the control layer is theOrchestrator.

2. The data layer collectively refers to all components involved in data handling (e.g., storage and retrieval of data, moving data between hosts to make it available to compute steps that require it). The architecture makes a distinction between thedata store, which refers to the technology used to store data (e.g., Distributed File System, Cloud Storage) and data adapter, which serves as an interface between the data store and other components in the workflow system.

3. The compute layerrefers to the processing logic contained in the steps used in the workflow. The compute layer is composed of multipleCompute

Locality-Aware Big Data Workﬂow Orchestration Using Software Containers

Locality-Aware Big Data Workflow Orchestration Using

Software Containers

Andrei-Alin Corodescu

Thesis submitted for the degree of Master in Informatics: Programming and

Systems Architecture

(Distributed Systems and Networks) 60 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

Locality-Aware Big Data Workflow Orchestration Using Software Containers

Andrei-Alin Corodescu

Abstract

Acknowledgements

Contents

I Preamble 9

II Background 16

III Proposed solution 36

IV Experiments and Results 69

V Conclusions and future work 81

List of Figures

List of Tables

Listings

Part I

Preamble

Chapter 1

Introduction

1.1 Context

1.2 Motivation

1.3 Research questions

1.4 Research design

1.5 Thesis outline

Part II

Background

Chapter 2

Current technological landscape

2.1 Big data

2.2 Cloud technologies

2.3 Edge computing

2.4 Big data workflows

2.4.1 Taxonomy

2.4.2 Example workflow

2.4.3 Workflow definition expressiveness

2.4.4 The need for better abstraction

2.5 Software containers in the big data world

2.5.1 Software container characteristics

2.5.2 Container orchestration

2.5.3 Usages for big data

2.6 Inspiration from microservices architecture

Chapter 3

Problem Analysis

3.1 Focus areas

3.1.1 Data locality

3.1.2 Inter-component communication optimization

3.1.3 Lifecycle management of containers

3.1.4 Integration with data management solutions

3.2 Existing solutions

3.2.1 Evaluation criteria

3.2.2 Considered solutions

3.3 Thesis contributions to the state of the art

3.3.1 Data locality

3.3.2 Container lifecycle management

3.3.3 Data management solution integration

3.3.4 Communication overhead

3.4 Alignment with current technological directions

3.5 Goals

3.6 Evaluation of the proposed contributions

Part III

Proposed solution

Chapter 4

Design

4.1 Overview

4.2 Component breakdown