
4. Save/persist

First output: Temperatures in California with complete city information.

1. Read from partial result

2. Filter out records where the city is not in California

3. Store to California_temps.csv
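A minimal sketch of this pipeline using Spark's RDD API in Scala. The record layout (city, state, date, temperature), field names, and input path are illustrative assumptions, not taken from the original pipeline:

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class Sample(city: String, state: String, date: String, temperature: Double)

val sc = new SparkContext(new SparkConf().setAppName("california-temps"))

// 1. Read the partial result. Assumed input: CSV lines of
//    "city,state,date,temperature" (layout is an assumption).
val samples = sc.textFile("hdfs:///weather/samples.csv")
  .map(_.split(","))
  .map(f => Sample(f(0), f(1), f(2), f(3).toDouble))
  .cache() // reused by the second pipeline below

// 2. Keep only records from California.
val californiaTemps = samples.filter(_.state == "California")

// 3. Store as CSV lines (Spark writes a directory of part files).
californiaTemps
  .map(s => s"${s.city},${s.state},${s.date},${s.temperature}")
  .saveAsTextFile("California_temps.csv")
```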

Second output: Number of days where the temperature was below 0, per city per year.

1. Filter the records, keeping those where the temperature is below 0

2. Aggregate by city and date with a constant as output

This results in a list of days per city where at least one sample shows a temperature below 0.

3. Aggregate by city and year using the counter function

This results in the number of days where temperatures were below 0 in a given city per year.

4. Save to ColdNights.csv

The file name reflects a possibly incorrect assumption that the below-0 samples occur at night.

The join outputs are cached (marked by green circles) so that they do not have to be re-computed between the two pipelines, as sketched below.
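Continuing the Scala sketch above, the second pipeline could look as follows, reusing the cached `samples` RDD. Dates are assumed to be formatted "YYYY-MM-DD":

```scala
// 1. Keep only samples with a temperature below 0.
val freezing = samples.filter(_.temperature < 0)

// 2. Aggregate by (city, date) with a constant as output: one entry
//    per city and day that had at least one sub-zero sample.
val coldDays = freezing
  .map(s => ((s.city, s.date), 1))
  .reduceByKey((a, _) => a)

// 3. Aggregate by (city, year), counting the cold days. Assumes the
//    year is the first four characters of the date string.
val coldDaysPerYear = coldDays
  .map { case ((city, date), _) => ((city, date.take(4)), 1) }
  .reduceByKey(_ + _)

// 4. Save the result.
coldDaysPerYear
  .map { case ((city, year), n) => s"$city,$year,$n" }
  .saveAsTextFile("ColdNights.csv")
```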

2.3.3 Spark SQL

Spark SQL is a different interface to Spark: it uses an SQL-like syntax to generate the same processing graphs and the same operations that the programmatic interfaces offer. This interface can also be accessed via Java Database Connectivity/Open Database Connectivity (JDBC/ODBC) interfaces and can thus be handled by Business Intelligence (BI) applications in the same manner as any other database.
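As an illustrative sketch using the Spark 2.x SparkSession API (the dataset, path, and column names are assumptions), the California filter from the pipeline above could be expressed through Spark SQL:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("weather-sql").getOrCreate()

// Load the samples and expose them as a table for SQL queries.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///weather/samples.csv")
df.createOrReplaceTempView("samples")

// The query compiles to the same processing graph the RDD version produces.
val california = spark.sql(
  "SELECT city, date, temperature FROM samples WHERE state = 'California'")
california.show()
```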

There are other projects that offer the same interfaces and connectivity as Spark SQL. Among these is Hive, a platform developed by Facebook Inc. for making big data processing easier for the programmer than Mapreduce.

2.4 Apache Hadoop

Apache Hadoop was originally developed as an implementation of Google’s GFS and Mapreduce. Hadoop 2.0, however, is a big data processing framework that allows the user to write any application through the YARN API.


2.4.1 GFS/HDFS

GFS was designed with the following assumptions [GGL03]:

– The system consists of many unreliable (consumer grade) nodes.

– The system stores primarily large files.

– The main workload consists of either large streaming reads or small random reads.

– Files are rarely modified, but may often be appended.

– Files may be appended to by multiple clients.

– High continuous bandwidth is more important than low latency.

It is not a traditional file system per se, because it does not implement a traditional file system interface. Instead it implements a file system protocol that is accessed through the applications that use it. This interface supports several untraditional commands like “snapshot” and “append”. Snapshot copies a file or directory tree at low cost by reusing data blocks. Append allows several clients to write records to the same file at the same time and guarantees that the data is not corrupted by the operation.
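In HDFS, the append operation is exposed through the FileSystem API. A minimal sketch, assuming an HDFS deployment with append enabled and an illustrative path:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// Append a record to an existing file without rewriting its blocks.
val out = fs.append(new Path("/logs/events.log"))
out.write("city=Oslo temp=-3\n".getBytes("UTF-8"))
out.close()
```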

GFS runs in user-space on standard Linux machines, which means that GFS is an application guest just like any standard database server. It uses a single-master design where one node acts as the GFS master and holds an index of all the files on GFS. The single-master design simplifies the architecture, but a hot spare node is kept ready to take over should the primary node fail. The master node index keeps track of several chunkservers.

Files are stored in GFS in chunks. A chunk is a part of a file, and in GFS chunks are 64 MB each. Each chunk is stored on at least two chunkservers, optionally more. This ensures that if a chunkserver goes down, the data is still available in GFS. In the event of a chunkserver going down, the master will detect this and ensure that the affected chunks are re-replicated and distributed properly.
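As a hedged sketch, the corresponding HDFS parameters can be set through the client configuration; the property names shown are those of Hadoop 2.x:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val conf = new Configuration()
// Store each block on three chunkservers (DataNodes).
conf.set("dfs.replication", "3")
// 64 MB blocks, matching the GFS chunk size described above.
conf.set("dfs.blocksize", "67108864")
val fs = FileSystem.get(conf)
```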

When a client wants to read a file it queries the master and the master replies with chunk information. The client then uses the chunk information to query the chunkservers directly for the chunks. The client would typically choose the closest chunkserver to load the chunk from. If the client is an application that runs on the same nodes as the GFS cluster it can distribute the workload across the nodes and process the chunks locally in order to minimise network traffic.
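A minimal read sketch against the HDFS implementation of this protocol (the path is illustrative); the master lookup and chunkserver selection happen behind the open() call:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// open() asks the master for chunk locations; the returned stream then
// fetches each chunk directly from a nearby chunkserver.
val in = fs.open(new Path("/weather/samples.csv"))
val buf = new Array[Byte](64 * 1024)
var n = in.read(buf)
while (n > 0) {
  // process buf(0 until n) here
  n = in.read(buf)
}
in.close()
```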


Figure 2.4: Overview of different participants in a HDFS system. Client 1 is doing a read of a file. Client 2 is writing a file chunk, replicated to Node 0.

Write operations and replication are handled internally by GFS. The client only asks for a chunkserver to write to, and that chunkserver handles replication onwards.

This means that bandwidth usage is distributed: instead of having the client replicate the written chunk, the chunkserver redistributes it to the second copy, the second copy distributes to the third, and so on.

HDFS is the implementation of GFS within Hadoop. It has slightly changed terminology (the GFS master is called the NameNode and chunkservers are called DataNodes), but still uses the GFS paper as a design document. All the concepts described about GFS hold for HDFS as well.

2.4.2 YARN

YARN is a workload distribution model created for Hadoop 2.0. While Hadoop 1.0 was a Mapreduce platform, Hadoop 2.0 is a general compute resource platform.

YARN is the model for managing and distributing the workloads across a hadoop cluster [VMD+13].

There are three roles in a YARN cluster. First is the Resource Manager, which administers the cluster and functions as the resource arbiter. Second is the Application Master, which is in charge of coordinating the application. The last role is the Node Manager, which is responsible for managing a node in the cluster and allocating resources for workloads.
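A minimal sketch of a client conversing with the Resource Manager through the YARN client API; the actual Application Master container spec and submission are elided:

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarn = YarnClient.createYarnClient()
yarn.init(new YarnConfiguration())
yarn.start()

// Ask the Resource Manager for a new application id. A real client would
// then fill the submission context with the Application Master's container
// spec and call yarn.submitApplication(...).
val app = yarn.createApplication()
println("Got application id: " + app.getNewApplicationResponse.getApplicationId)

yarn.stop()
```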


YARN can be used to host Mapreduce, Apache Spark, and other distributed applications. The primary reason to use Hadoop is to run HDFS with YARN-based data processing applications on top, so this is assumed to be the only use case in the following.