
2.2  Related work

2.2.2  Architectures for Big Data processing

There are several architectural patterns that are commonly used for Big Data processing applications. Each of these patterns targets specific use cases and has different benefits and drawbacks. In this section we review the most prominent approaches and compare them to ours.

Batch processing

Figure 2.3 A batch processing architecture

Batch processing works well for applications where data is first acquired and subsequently processed. The processing pipeline is depicted in Figure 2.3. The data is collected from one or more data sources and then put into a data store. The processing can be triggered at any time and operates on the whole set of collected data or on a smaller batch of it. It can potentially happen in an iterative way. Intermediate results are written back into the store, but the original data is never changed. Final results are sent to a serving layer which produces a result view for consumers to query.

A typical programming pattern for batch processing is MapReduce (Dean & Ghemawat, 2008), which makes batch processing very scalable. It can handle arbitrary data volumes and can be scaled out by adding more resources, typically compute nodes. The most popular frameworks for batch processing are Hadoop and Spark. Tools such as HBase, Impala, or Spark SQL can be used to implement the serving layer and to provide interactive queries on the result view.
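The following minimal Python sketch illustrates the MapReduce programming model with the classic word-count example. It is a conceptual illustration only: a framework such as Hadoop would run the map and reduce tasks in parallel across compute nodes and perform the shuffle step transparently, whereas here both phases run locally.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit an intermediate (key, value) pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Group intermediate values by key (in a real framework this
    step is performed transparently between map and reduce tasks)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate the values of each key into the final result."""
    return {word: sum(counts) for word, counts in groups.items()}

if __name__ == "__main__":
    batch = ["lambda kappa batch", "batch stream batch"]
    print(reduce_phase(shuffle(map_phase(batch))))
    # {'lambda': 1, 'kappa': 1, 'batch': 3, 'stream': 1}
```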

Stream processing

Figure 2.4 A real-time stream processing architecture

One drawback of batch processing is that it can take a long time (a couple of hours or even longer) if the input data set is very large. Applications that have to provide information in near real-time need a faster approach. In a stream processing system, as depicted in Figure 2.4, incoming data is handled immediately. A stream-oriented processing pipeline is event-driven and has a very low latency: results are typically produced in less than a second. In order to achieve this, the result view in the serving layer is updated incrementally.

Immediately processing each and every single event can introduce overhead. Some stream systems therefore implement a micro-batching approach where events are collected into very small batches that can still be processed in near real-time.
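The following Python sketch illustrates the micro-batching idea under simplified assumptions (a single-threaded consumer, no fault tolerance): events are buffered until either a size limit or a time limit is reached, and each emitted batch can then be handled like a very small batch job. Real systems such as Spark Streaming implement this with distributed, fault-tolerant receivers.

```python
import time

def micro_batches(events, max_size=100, max_wait=0.5):
    """Group a stream of events into small batches. A batch is emitted
    when max_size events have accumulated or when max_wait seconds have
    passed since the last batch, whichever comes first."""
    batch, deadline = [], time.monotonic() + max_wait
    for event in events:
        batch.append(event)
        if len(batch) >= max_size or time.monotonic() >= deadline:
            yield batch
            batch, deadline = [], time.monotonic() + max_wait
    if batch:
        # Flush the remaining events when the stream ends
        yield batch

if __name__ == "__main__":
    for batch in micro_batches(range(7), max_size=3):
        print(batch)  # [0, 1, 2], [3, 4, 5], [6]
```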

Typical frameworks for stream processing are Spark Streaming, Storm, or Samza. Data storage solutions that support incremental updates and interactive queries are Cassandra and Elasticsearch.

While stream processing can provide results in a short time, it is not very resilient to changes in the code of the processing algorithm. Such changes become necessary if there is a bug in the code or if requirements have changed and additional values need to be computed from the input data set.

Batch processing allows the result view to be recomputed by processing the whole data set again. In the stream processing approach, on the other hand, there is no store for input data and hence recomputing is not possible. A bug in the processing code can corrupt the incremental result view without a way to make it consistent again.

Lambda architecture

In order to combine the benefits of both approaches, namely the fault-tolerance of batch processing and the speed of stream processing, Nathan Marz has created the Lambda architecture for Big Data processing (Marz & Warren, 2015). In this architecture, input data is sent to a batch layer and a so-called speed layer at the same time (see Figure 2.5). Both layers implement the same processing logic. The batch layer is used to process large amounts of data and to regularly reprocess the whole input data store. The speed layer contains a stream processing system and compensates for the high latency of batch processing. It only calculates results for data that has arrived since the last run of the batch layer.

In the serving layer, the batch results as well as the incremental streaming results are combined into a view that provides a good balance between being up-to-date and being correct (i.e. tolerant to errors). Streaming results are discarded as soon as new batch results have been computed.
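The following minimal Python sketch makes this merging mechanism concrete. All names (batch_view, speed_view, query) are hypothetical; in practice the views would live in the serving layer's data store, e.g. HBase or Cassandra.

```python
# Hypothetical result views of the two layers
batch_view = {"tile_1": 120, "tile_2": 45}  # last complete batch run
speed_view = {"tile_2": 3, "tile_3": 7}     # incremental updates since then

def query(key):
    """Combine both layers: the speed layer compensates for the
    batch layer's latency until the next batch run replaces it."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

def on_new_batch_result(new_batch_view):
    """Adopt the new batch result and discard the streaming
    results that it now covers."""
    global batch_view
    batch_view = new_batch_view
    speed_view.clear()

print(query("tile_2"))  # 48 = 45 (batch) + 3 (speed)
```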

Kappa architecture

Figure 2.6 The Kappa architecture

Modern stream processing systems are as capable as batch processing systems in terms of functionality and expressive power. If they were not, it would not be possible to implement a Lambda architecture, where both branches have the same processing logic. Because of this, people have started questioning the usefulness of batch processing in the Lambda architecture. Maintaining two branches with the same logic in two different programming styles can become very complex. The only advantage of batch processing over stream processing is its resilience to programming errors or changing requirements, which is based on the fact that the original input data is permanently kept in a store and recomputing is therefore always possible.

In an attempt to simplify the Lambda architecture, Kreps (2014) has created the Kappa architecture (see Figure 2.6). This architecture is very similar to a typical stream processing pipeline.

However, Kreps recommends keeping a log of incoming data (or events), which can be replayed and therefore be used to recompute the result view if necessary. A typical framework that allows for collecting input data in a log for a certain amount of time (the retention period) is Kafka. In order to cover most cases, it is recommended to configure a long retention period of at least several weeks.
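With Kafka, for example, the retention period can be set broker-wide via the log.retention.hours property, whose default corresponds to seven days. The concrete value below (five weeks) is only an illustration:

```
# Kafka broker configuration (server.properties):
# keep incoming events for five weeks (840 hours)
# so that the log can be replayed later
log.retention.hours=840
```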

Since Kafka allows for processing the log from any point and by multiple readers at the same time, recomputing the result view can be done without interrupting near real-time processing.
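The following Python sketch illustrates this principle with a deliberately simplified, in-memory stand-in for the log. It is not Kafka's actual API, but it shows how a second reader can rebuild the result view from offset zero while a live reader keeps consuming new events.

```python
class ReplayableLog:
    """In-memory stand-in for a log store such as Kafka: events are
    appended once and can be read from any offset by multiple readers."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset=0):
        yield from self.events[offset:]

log = ReplayableLog()
for e in ["a", "b", "c"]:
    log.append(e)

# The live pipeline continues consuming new events from this offset on
live_offset = len(log.events)

# After a bug fix, a second reader replays the whole log and rebuilds
# the result view without interrupting the live pipeline. With Kafka
# this corresponds to a second consumer (group) reading the same topic
# from the earliest retained offset.
rebuilt_view = [event.upper() for event in log.read_from(0)]
print(rebuilt_view)  # ['A', 'B', 'C']
```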

Summary

In this section we have described four of the most common architectures for Big Data processing. The one that is most comparable to ours is batch processing. Although velocity, i.e. the speed at which data is acquired and has to be processed, often plays an important role for geospatial applications, data acquisition and processing are typically separate events with defined start and end points (see Kitchin & McArdle, 2016). For example, in our urban planning use case described in Section 1.8.1, mobile mapping data is first collected on a hard disk and then uploaded to the Cloud for processing. A similar pattern can be found in our second use case (see Section 1.8.2) or in applications dealing with satellite imagery.

In contrast to information collected by a social media platform, for example, the acquisition of geospatial data is usually not a continuous process but one that is inherently eventually finished, i.e. as soon as all required data has been collected. The stream processing approach expects data to be collected continuously and aims at processing it at the same time with a short latency. In order for this to work and to meet near real-time constraints of less than a second, the individual processing steps need to be very fast. Geospatial algorithms and the models they operate on are, however, known to be very complex and are expected to become even more so in the future (Yang et al., 2011). Near real-time processing is possible for certain use cases such as the evaluation of readings in a sensor network, but it is not reasonable for every geospatial application.

The architectures presented in this section are typically used to implement very specific use cases and processing pipelines. Our architecture, on the other hand, allows for creating more flexible workflows that can be used for a wider range of purposes. In fact, as we will show later, we can incorporate batch or stream processing into our workflows and thus create pipelines at a much higher level.