Efficient and scalable storage of big weather data

(1)

Efficient and scalable storage of big weather data

An experimental set up and evaluation

Bendik Ibenholt

Master’s Thesis Spring 2017

(2)

(3)

Efficient and scalable storage of big weather data

Bendik Ibenholt 11th May 2017

(4)

(5)

Abstract

Accurate weather forecasts, and observations play an important role and has applications in an extensive variety of fields. Predicting the next rainfall for a farmer or accurate measurements of wind conditions for an airfield or wind farm holds a lot of value. It makes operation a lot smoother and more cost-efficient, and even plays an important part in planning new projects.

Remote weather stations, meteorological radar satellites and personal weather stations generate vast volumes of observational weather data.

While this allows for more accurate weather forecasts, it also presents some challenges when storing the data. Conventional RDBMS cannot provide the required availability over terabytes of weather data with real-time requirements for several users. Also when the volume of data and the number of users keep on increasing, scaling up RDBMS are often expensive.

This thesis examines how the hadoop-stack can be used to create a database for observational weather data. Hbase was used as a means to store data on HDFS while maintaining relatively short response time for queries.

In order to present the data and test different applications a RESTful API using Flask was made. Two different data models, one very compact, and one not, are presented, tested and evaluated along with different pipelines to the data.

The results of this thesis show that Hadoop along with Hbase provides a scalable and flexible way of both storing and processing weather data. We also show that although Hbase by itself has somewhat limited functionality compared to RDBMS, the selection of tools that interface with Hbase goes along way to fill this gap. We also demonstrate that the system provides the required response time to serve as a database for interactive applications.

(6)

(7)

Acknowledgements

First I would like to thank my supervisors professors Frank Eliassen and Hans Arno Jacobsen for all the patience, good advice and feedback, and pushing me to finish. Also the Energy informatics group at the Technical University of Munich for having me as an exchange student for an ex- ceptionally enjoyable semester, in particular Christoph Goebel that guided while working on the implementation.

I would also like to thank my family for all the support and encourage- ment. A big thank you also goes to the other students I have gotten to know over the years, in particular the guys at Assembler for some thoroughly fun and unforgettable years.

(8)

(9)

List of Figures

2.1 an overview of the HDFS architecture [11] . . . 9

2.2 Diagram of the Hbase datamodel . . . 11

2.3 Quadtree structure over a map[3] . . . 17

4.1 Overview of services and pipelines in the design . . . 25

4.2 schema for thestationstable . . . 27

4.3 First try at a schema for themeasurementstable . . . 27

4.4 Comparison of table design impact on storage space . . . 28

4.5 The second attempt at a schema for themeasurementstable 29 4.6 The RowKey for themeasurementstable . . . 30

4.7 A screenshot of the historic weather map application with markers showing the location of a weather measurement . . 34

4.8 When the markers are clicked a selection of measurements are displayed to the user . . . 34

5.1 An overview of where each service runs on the cluster . . . . 36

5.2 Comparison of the two schemas using the /timeseries URI . 39 5.3 Comparison of the two schemas over the /recent URI . . . . 41

5.4 Response times from the /stations URI . . . 42

5.5 Scaling /timeseries URI . . . 45

5.6 Scaling /recent URI . . . 46

5.7 Comparison of the two schemas with and without caching over the /recent URI . . . 47

5.8 Response time from database does not ruin the user experi- ence when zoomed in a bit . . . 50

5.9 When looking at a large area, the map becomes very crowded 50 5.10 Response time from the database becomes very slow when looking at a large area . . . 51

5.11 Computation times with Hive and Impala . . . 54

(12)

(13)

List of Tables

2.1 Example operations on resources using REST . . . 16

5.1 Parameters sent to the API for testing /timeseries . . . 38

5.2 response from /timeseries (ms) . . . 38

5.3 bytes returned from /timeseries for each query . . . 39

5.4 Parameters sent to the /recent API . . . 40

5.5 response time (ms) from /recent with blockcache . . . 40

5.6 Parameters sent to the API for testing /stations . . . 42

5.7 Response time from /stations . . . 42

B.1 Software versions and settings . . . 69

C.1 results from /timeseries with blockcache structured table . . 71

C.2 results from /timeseries with cache unstructured table . . . 72

C.3 results from /recent with caching structured table . . . 72

C.4 results from /recent with caching, unstructured table . . . . 72

C.5 results from /stations with blockcaching . . . 72

(14)

(15)

Chapter 1 Introduction

Most people living today have access to an array of weather services. Ser- vices that provide accurate forecasts and live measurements of everything from wind speeds to snow depth from all over the world. An important driver behind this development is the wealth of data, and data sources like radar satellites and weather stations, and the ability to make sense of these data through computer processing. Usually meteorological data is stored in relational databases and processed by expensive supercomputers, but de- velopments in big data technology might present a cheaper and scalable alternative which is what we will explore in this thesis.

1.1 Motivation

1.1.1 Why is weather data important

Agriculture, aviation and shipping are areas where weather obviously play a crucial role. Accurate weather forecasts and live measurements allow for smooth operations, and enable actions appropriate to current and future weather conditions. An emerging field where weather conditions play a crit- ical role is renewable power integration in the grid.

If the government targets for penetration of renewable energy are to be met, several challenges arise. The dominating sources of renewable energy today are hydro, solar and wind energy. The output of solar and wind power systems are mostly governed by the current weather conditions. In most current cases, the penetration of highly intermittent and variable renewable energy is quite low. So the variability and intermittentness is not a big problem. You just use the renewable energy whenever the wind blows or the sun shines. Otherwise you cover your demand with other sources of energy. But when the share of wind and solar energy increases it becomes more challenging to balance supply and demand.

When utilizing thermal and hydro power to balance supply and demand of energy, the start-up time of dispatchable energy sources has to be taken into account. A thermal power plant needs some time to warm its boiler,

(16)

varying from under an hour up to several hours depending on the initial temperature, before producing electricity[10]. A hydro electric power plant has a shorter start-up time for example Dinorwig power station in Wales can ramp up from zero to maximum production in 16 seconds. The implic- ations of this start-up time is that system operators have to be able to predict the amount of power that is going to be produced from intermittent sources so that generators can be started in time to produce energy when it’s needed.

Since the output of most renewable energy producers are so dependent on weather conditions, weather data plays an important part in predicting the power output from these energy sources. Either through accurate, high resolution weather forecasts, machine learning methods, or a combination of the two. Both weather forecasting and machine learning benefit from access to large datasets and lots of computational power which can churn through the data efficiently.

1.1.2 Challenges

Weather stations, radar, satellites. weather balloons and even personal weather stations report measurements to services like Weather Under- ground. According to their website, Weather Underground collects data from over 200 000 weather stations[22]. National and international meteorological centers like NCDC (National Climatic Data Center) has historic data dating back to 1900 from over 30 000 weather stations all over the world.

In Huang et al.[15] coarse temporal resolution of wind measurements and difficulty with using complex databases are some issues that are raised in relation to forecasting models for wind power. In ramp rate forecasting in particular, high frequency measurements can improve forecasts, and forecasts in general benefit from a high spatial resolution in meteorological measurements, meaning more stations. More stations and more frequent measurements might present itself as challenging for RDBMS(relational database management system) to manage, if data volume and velocity becomes very large.

In order to leverage these data for data analysis the data can be down- loaded and used in, for instance, statistical forecasting for wind energy.

When datasets to be analyzed get very large this is a bit impractical, and a better solution might be to move the computation to where the data is located, instead of moving the data to where the computation takes place[16].

(17)

1.2 Problem statement

Weather data is usually managed by RDBMS, a mature technology that can handle complex queries by using several indexes to access data. The downside to RDBMS is that they can be expensive to scale up often requiring spe- cialized hardware and because they have a fixed schema are not considered very flexible[16]. Therefore we want to explore the possibility of using a distributed non-relational storage system to store weather data. In this thesis we have decided to use the Hadoop stack because it is one of the most mature big data technologies, and it has a rich ecosystem of related projects and tools that can be used to store and gain insights and value from data.

Our research question for this work is

• How can the Hadoop stack be leveraged to efficiently store and process big weather data

from this question we derive the following subquestions:

• How does a solution implemented using the Hadoop stack perform

• What are the benefits and disadvantages of using Hadoop to store and process weather data

1.3 Evaluation metrics and premises

In order to evaluate if our solution addresses the challenges stated, we propose some requirements that our solution should fulfill. The main issues with RDBMS is scalability and flexibility, so our solution should provide a flexible and scalable means of storing and processing data, and retain fast query times. We will evaluate these features along with the functionality of the different tools and frameworks we make use of by specifying some standard test queries to measure performance in terms of query speed and processing time.

Ideally the solution in this thesis should provide the similar functionality as other sources for weather data that are based on RDBMS, so we look to them for features that we think are relevant. Some sources for weather data like yr.no have Web API’s that can be used by external applications. This is a feature that enables a great variety of uses for our storage system, and should be a part of our solution. We will implement a Web app that uses our systems API to demonstrate functionality. We will also test Hadoops ability to process large datasets by computing some simple aggregate values.

To summarize:

• Evaluate the scalability, flexibility of the system, performance and functionality of the tools used.

(18)

• We will specify some test queries and run them through our API to measure performance of our database

• In order to demonstrate the functionality of our solution:

– We will implement an app that will consume our API.

– We will use Hive and Impala to compute some aggregate values, like average temperatures

1.4 Approach

In this thesis we will explore the possibilities of using a distributed datastore for weatherdata. We will set up a database on a cluster of virtual machines using Hadoop Hbase. This will be our distributed scalable solution for storing large volumes of weather data. The database will be populated with pub- lic data from NOAA (National Oceanographic And Atmospheric Adminis- tration) ISD (Integrated Surface Database). The Hbase Key-value store will use quadtree based hashvalues in order to linearize geographical points and different ways of organizing and indexing the data will be evaluated. We will test three different pipelines to the data, Impala, Hive and Thrift, which will be exposed to the web through a REST API. The REST API is what we will use to test querytimes for our solution.

We will develop a web app that will consume our REST API in order to demonstrate functionality and check the systems ability to serve an interactive application. The app will visualize the data, by displaying weather stations on a map and historic weather data from the stations from a time that the users select.

1.5 Research method

In this thesis we will state some requirements our solution should fulfill, and then attempt to implement a solution that satisfies the requirements.

We then test and evaluate our solution in order asses whether or not it is actually able to do so. This is in accordance with the technological research method in Technology Research Explained [20]

1.6 Contribution

The outcome of this thesis is an implementation and evaluation of a distributed Key-Value store for weather data including testing of different pipelines to the data, and data models. As part of this work we develop a fully distributed cluster implementation of a storage system using the Hadoop stack, a REST-API and an app to consume the API.

(19)

1.7 Organization

The following chapters are organized as follows:

• Chapter 2 will lay down the technical background for the thesis.

Distributed systems in general are presented, before we go in depth on the specific frameworks and tools that we use in this thesis.

• In chapter 3 we will look at some closely related research on distributed non-relational databases for spatio-temporal data, comparative studies on big datastores in general, and a paper which uses both Ha- doop and Hbase to store and process weather data.

• Next we will present the details of our design decisions and implementation in chapter 4.

• Chapter 5 presents how we test and evaluate how our implementation tackled the challenges previously presented and the results of these tests.

• In Chapter 6 the limitations and value of the work is discussed

• Finally we conclude in chapter 7 by summarizing our work, discuss the limitations of the thesis and provide reflections on possible future work.

(20)

(21)

Chapter 2 Technical Background

In this chapter, the technical background for the different tools and concepts that are used in this thesis are presented. First the different parts of the Hadoop-stack that are part of the proposed solution are presented, including HDFS it self. Then an index for geographical points that will be used the row key for Hbase is introduced. Then lastly, the source for weather data that is used to test the set up is presented.

2.1 Distributed systems

We define a distributed system as one in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages.

The above definition of distributed systems [6] encompasses a very large group of modern computer systems that share hardware and software resources like disks, files and objects. Distributed systems come in many forms: sensory networks, computing clusters and distributed information systems like databases all fit in the above definition. The range off applications is wide and distributed systems can be found in many areas ranging from media streaming to scientific applications and web-search. The motivation for using distributed systems in these and other applications is mainly resource sharing. Resource sharing allows user to collaborate on shared files, access remote information and has economical benefits because it allows sharing of expensive hardware like printers and supercomputers.

2.2 Hadoop

Hadoop is an open source project that provides the means to distributively store and process large volumes of data on commodity hardware. Hadoop can run on a single server, or on several thousand machines, like the re- portedly largest Hadoop cluster at Yahoo. Which, according to[23] consists of 40 000 nodes containing 600 petabytes of data.

(22)

The Apache Hadoop project was founded in 2006[18]. Since then, a thriving ecosystem of related projects like databases and processing frameworks have emerged that all run on top of Hadoop.

2.2.1 Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is the distributed filesystem that all the different frameworks, libraries and services in the Hadoop ecosystem runs on top of. It provides a highly fault tolerant way of storing data across several computers, and has a UNIX-like interface that allows users to read, write and delete files and directories[18].

HDFS separates metadata and application data on different nodes in a cluster. The nodes in turn, typically run on different servers. The node on a cluster containing the metadata is called a Namenode, and the nodes with the application data are called Datanodes. HDFS has a master\slave configuration, where the Namenode acts as masters and the Datanodes as slaves.

This means that the clients access to the file system, the namespace tree, reads, writes and deletes, is controlled by the Namenode. While the Namenode has all the metadata of the filesystem, the content of a file in the HDFS is divided into blocks and distributed among the Datanodes. The Datanodes, acting as slaves, are responsible for the actual creation, deletion and replication of datablocks at the Namenodes request. The Datanodes are also responsible for executing reads and writes, without sending data through the Namenode. This is what allows the I/O of the HDFS to scale along with processing power and storage capacity when new nodes are added to the cluster.

2.2.2 Replication

HDFS is designed with the assumption that hardware failure is the norm rather than the exception. In order to detect faulty hardware or connection loss for one or several Datanodes, the Datanodes regularly send a heartbeat message to the Namenode. The heartbeat messages ensures the Namen- ode that the network and the relevant Datanodes are functioning properly.

To maintain data reliability in the face of hardware or network failure, data can be replicated to several Datanodes. HDFS hasrack awarenesswhich means it will replicate data to nodes that are on at least two different racks.

Thus maintaining data integrity even if an entire rack fails

Replication of datablocks across many datanodes has several advantages besides providing data durability. It increases data transfer bandwidth, because data can be retrieved from several servers simultaneously. Bandwidth utilization is also improved by HDFS by selecting a rack to read from that is physically closer to, or maybe even on the same rack as, the reader.

(23)

Figure 2.1: an overview of the HDFS architecture [11]

2.2.3 MapReduce

Hadoop comes with an implementation of the MapReduce[7] programming model called Hadoop MapReduce. Hadoop MapReduce is a software framework that equips programmers with a tool to write programs that can efficiently process the large data volumes that the HDFS is designed for. The MapReduce jobs run in parallel, usually on the HDFS Datanodes and so the processing power available in a MapReduce job scales up along with the storage capacity if another Datanode is added.

A MapReduce Job is split into two phases, a mapphase and a reduce phase. The map phase process the input data and creates a set of intermediate data for the reduce phase to process. The output of a mapper is a a set of key value pairs, where one key may map to zero or several values.

The reduce phase takes the key value pairs from the mapper as input. It then reduces all the values that share a key, usually to one value per key, but not necessarily. The output of the reducer, which is another set of key values, is then written to disk.

2.3 NoSQL and non-relational databases

NoSQL is a term that describes a group of databases that came about as a response to a need for more scalable data storage systems in the face of enorm- ous amounts of user generated content and logging of user-actions[13]. The term ’NoSQL’ refers to the fact that non relational databases usually offer some other interface than SQL as the primary or native way of querying

(24)

data. However for some NoSQL databases it is possible to query using SQL through an added layer on top of the primary interface. Because of this the term NoSQL is a little imprecise, and it has led to another interpretation of the term as NoSQL ’Not only SQL’.

There are other more defining features of NoSQL databases that differ- entiate them from relational databases, one of them is that RDBMS usually guarantee ACID(Atomicity,Consistency, Isolation, Durability) transactions.

ACID is a set of properties that assures transactions in a database will fully complete or not do anything at all(atomicity), conform to the rules of the database and brings the database to a new valid state(consistency), should lead to the same result as if they were processed serially(isolation) and that once a transaction has been committed it will last even in the face of power loss system errors or crashes. In non-relational databases some of these properties are usually sacrificed in order to achieve better performance or better fit to a specific use. For example many nosql databases are distributed, which limits what a database can offer in terms of consistency, availability, and partition-tolerance according to the CAP theorem.

The CAP theorem states that in distributed systems, consistency availability and partition tolerance cannot be achieved simultaneously if the system is partitioned across a network. Distributed systems have to be partition tolerant, which means that they will continue to function in the face of network failure, an important property because there is no such thing as a truly reliable network. This leaves system designers with a choice between consistency and availability. In the context of distributed systems, consistency means that all reads should receive the most recent writes, while availability guarantees that all requests should receive a response. The reason they can not both be completely guaranteed in a partition tolerant system if the network has failed, is that one partition has no way of knowing if there has been a more recent write on any of the other partitions. When the partition receives a request it is can then either respond with possibly outdated data, which would break the consistency property, or wait, possibly for ever, un- til the system knows it can respond with the most recent write which would break the availability property. In practice the choice is not selecting one of the properties, but rather prioritizing one and relaxing the other.

2.4 Hbase

Hbase is a distributed non-relational database that runs on top of the HDFS.

It is modeled after Googles BigTable [4]. Like Googles BigTable it is a multidimensional sorted map, where data is indexed by a rowkey, column and a timestamp. The data is stored as an uninterpreted string, so it lacks some of the features that a RDBMS has like typed columns.

(25)

2.4.1 The Hbase datamodel: Tables, RowKeys, column- families, columns, versions and cells

Similar to many other data storage systems, Hbase uses tables consisting of rows and columns to organize data. One table consists of a collection of rows, column families and columns - or column qualifiers - that make up a grid of cells, where the cells contain the data that is stored. A row in Hbase is identified by its rowkey, and consist of at least one column. The rows are lexicographicaly ordered by their rowkey, and partitioned into regions that are distributed on the Hbase Regionservers. This property means that rows that are lexicographical close, are stored on few or the same region. Which in turn means that if a lookup in the table is in a narrow range of rows it will span fewer regions, meaning less communication between the servers, and more efficient reads.

The rows in Hbase maps to a set of column families. The data in one column family is stored physically together on disk, so like the rowkey, they they can have a positive impact on I/O performance, provided scans over a table are often done for only a subset of the column families. Column families are collections of columns which are often calledcolumn qualifiers in Hbase terms, and have to specified in advance of writing to a table. The columns in one column family share properties like bloomfilters, time to live, number of versions of each cell, in-memory caching attributes, and compression type. This is important to take into account during schema design so that columns that have similar access and storage characteristics are grouped the same column family. Two other important points about column families are that the column family name is difficult to alter after creation, and that the column family name is repeated for each row so in order to save storage overhead it can be a good idea to keep them as short as possible.

Figure 2.2: Diagram of the Hbase datamodel

(26)

A column family maps to a list of column qualifiers, which unlike column families can be created on the fly by a client. For instance in Figure 2.2 we see the column family CF2 as a collection of the column families CQ, CQ1 and CQ2. The purpose of column qualifiers is to organize data so that it is easier to locate the correct data in the schema. Like column families, the name of the column qualifier is stored in the database, so it is common practice to use abbreviations in column qualifier names in order to save storage overhead. One column qualifier along with a row key defines a cell, which in turn may contain many versions of the data stored in it - adding another dimension to organize the data.

2.4.2 Hmaster and Region Servers

The Hbase architecture is organized in two services that have a master/slave relationship similar to the HDFS [25]. The master is called HMaster and usually runs on the Hadoop Master node, while the slaves are called Region Servers and runs on the Datanodes. HMaster is responsible for load balan- cing, table modification, creation and deletion, but like in the HDFS, reads and writes of data actually goes directly to the Region Servers which contain one or more regions of the data. The HMaster assigns regions to the Region Servers, and thus can balance the load between the Region Servers by moving regions between them in order to get an even distribution of disk usage.

Also, if a Region server crashes or is removed from the cluster, Hmaster re- assigns the regions on that server to the remaining Region Servers.

While the Hmaster is responsible for assigning regions to the Region Servers, it itself does not contain the list of which region is on which Region Server. This information is stored in a table calledhbase:metaon the Hadoop Zookeeper, a service that is responsible for coordination, and sharing states between the servers. This is what allows clients to read and write directly from the Region Servers without involving Hmaster at all, and by caching the meta they can avoid hitting the meta table every time the want to operate on the table.

2.4.3 Queries

There are two ways for clients to request data from Hbase, get [1], and scan [2]. Get and scan are configured in a very similar manner, the only difference is that while get returns data from a single row, scan returns data from an interval of rows designated by a start- and stop row.

Both get and scan can be configured to return a subset of columns, column families, and a subset of versions of columns. In addition, a filter can be applied to the results of a get or a scan before it is transmitted to the client. A filter can filter out cells or entire rows based on the existence of a column, the value of a cell, rowKeys or columnfamilies, but in the last two cases it is considered to be better practice and more efficient to set what columnfamilies are of interest, or use start- and stop rows in the scan

(27)

method where it is possible because filters don’t actually reduce disk I/O they just reduce the amount of data transferred to the client, meaning less bandwidth usage and transferring some the processing to the servers.

2.4.4 Thrift and REST servers

Hbase has APIs for clients written in other programming languages than Java, these include a Representational State Transfer (REST) server and Thrift gateways. REST will be described in further detail in a later section, but put shortly it allows users to interact with Hbase through HTTP by using the URL itself as an API. A request to a REST server will return data in XML, JSON or protobuf formats, where the data schema is a part of the data returned.

Thrift is a programming library for communication between programming languages. It supports some fundamental datatypes: signed integers, doubles, strings, bytes and booleans, some containers: lists, sets, maps, and objects in the form of structs similar to the ones found in the C programming language[19]. Unlike REST, Hbase Thrift server does not send a schema for the data, instead it transfers the row data directly using the Thrift binary protocol, along with metadata about the length and type of a field. Since Thrift relieves the client of parsing XML or JSON and transfers less data, it is the faster option when choosing between REST and Thrift.

The Thrift API offers support for retrieving several versions of a cell based on their timestamp, however it is somehow limited compared to the native Java API. Thrift only supports retrieving versions whose timestamps are more recent than a given timestamp, or exactly matches a list of timestamps.

2.5 Cloudera

Cloudera is a platform for managing Hadoop HDFS and a selection of related software for storing and analyzing data. Cloudera comes with a web- interface for deploying and monitoring a cluster, Hadoop and the related services running on it. The services used in this thesis that are managed through the Cloudera Manager are Hadoop, Hbase, Zookeeper, Spark, Hive and Impala.

2.6 Impala

Impala is a database engine that enables users to write SQL-like queries for files on the HDFS and Hbase[5] and consists of three components, the Impala Daemon, Statestore and Catalogue Service. It also has a Web GUI in addition to an API that allows users to write queries in the web browser.

Following is an example of a Impala query that retrieves stations with a row key beginning with 0012032 between 200 and 600 meters above sea level.

Impala Daemons run on each of the HDFS Data Nodes, and every one of them can accept queries and distribute them to the other Daemons. Each

(28)

1 SELECT name , elevation , id

2 FROM stations

3 WHERE rowkey LIKE '0012032% '

4 AND and elevation BETWEEN 200 and 600

Daemon can read and write on the Data Node they are running on. When a Daemon receives a query it becomes the coordinator for that query. The coordinator make a distributed query plan based on where the data is located on the cluster, the work is then distributed to the other Daemons which return their partial results to the coordinator, which in turn aggregates the results and returns them to the client.

The Statestore monitors the health of the Impala Daemons in the cluster, if one fails it notifies the other daemons so that requests don’t get forwarded to the faulty one.

The Catalog Service looks up and caches metadata like table names and column names from Hive or locations of datablocks on Data Nodes on HDFS and broadcasts the information to the Daemons.

Impala does not support querying Hbase tables directly so when using Impala to query Hbase, the Hbase tables have to be mapped through Hive first. Tables have to be mapped manually, meaning the user have write a small statement in the HDL (Hive Definition Language) and define a data type for each column intended to by queried by Impala. Like when executing scans on Hbase through the native API, the most efficient queries are the ones that subset on a - preferably narrow - range of RowKeys, an important consideration, and limitation, to take into account when writing Impala queries. Another note, which is important for our use of Impala, is that Impala is unable to query for a subset of versions of a cell, unlike the native Hbase queries get and scan, thus removing the ability to use that as an index for timestamped data.

2.7 Hive

Hive is a data warehouse system made for Hadoop. Like Impala it provides users with SQL-like query language for data in the HDFS so that users who already know SQL, but are not proficient in other programming languages, can analyze data stored there. However unlike Impala, Hive runs MapRe- duce jobs to execute queries. MapReduce jobs have a lot of overhead can take several hours to complete, which is fine when running large batch operations that are expected to take a long time unless the user wants to explore the data in a more interactive fashion.

Excluding interfaces, Hive consists of four components, Driver, Com- piler, Execution Engine and Metastore [21]. The Driver is the component

(29)

that receives and manages the lifecycle of queries. It invokes the compiler and submits the MapReduce jobs to the Execution Engines. The Compiler parses the query string it recieves from the Driver and fetches schema information from the metastore and verifies table names and typechecks the columns. The output from the compiler is an execution plan which is then executed by the Execution Engine The Metastore contains metadata about tables stored in Hive like columns and column types, and where in the HDFS the data is stored. The metastore is shared with some of the other Hadoop services, including Imapala, and is where Impala gets the table mapping from.

2.8 Spark

A MapReduce job requires a lot of reads and writes to disk, which works per- fectly well for many applications. However for iterative computation, like most machine learning algorithms, or for interactive data analytics, where data is reused several times, there can be a great advantage in keeping the data in memory. Spark [24] is a distributed computational framework that provides this feature and has become quite popular. Instead of operating directly on disk, Spark loads data from disk into resilient distributed datasets (RDDs) which can be cached in memory across several computers. A RDD can be created from a file, for example from HDFS or any other file system, by paralelizing a datastructure like an array in the spark program, or modifying an existing RDD. A RDD can be cached in memory as long as there is room for it. If there is not enough room for all partitions of the dataset in memory, the partitions will be recomputed when they are used.

In order to interface with Hbase Spark instantiates an object of the HadoopRDD which allows Spark to work with data stored in HDFS, like files or Hbase tables. A HadoopRDD uses the map reduce framework to read, write and map the RDD to the underlying data.

2.9 Leaflet

Leaflet is a JavaScript library for making map-applications that can run in a browser. It’s is quite straightforward to use and is very handy for visualizing geographical data, by adding layers of information onto the map. The following example loads a map in the browser and displays a simple marker in it. In the example OpenStreetMap is used as a tile layer and functions as the map, but any image could serve as a ground layer in an application. Open- StreetMap is an open source mapping project that we will also use when implementing the app in this thesis.

1 var map = L.map('map '). setView ([59.9 , 10.65] , 13);

2

(30)

Resource Identifier HTTP method Effect

stations/ POST Create a new station

which will be added to the list of stations

stations/125 GET Return a station from

the system with ID 125 stations/ GET Return a list of stations

Table 2.1: Example operations on resources using REST

3 L. tileLayer ('http ://{s}. tile. openstreetmap .org /{z}/{x }/{y}. png ', {

4 attribution : 'Map data © <a href =" http ://

openstreetmap .org"> OpenStreetMap </a>

contributors '

5 }). addTo (map);

6

7 L. marker ([59.9 , 10.65]) . addTo (map). bindPopup ('Marker ')

2.10 Happybase

Happybase is a Python library for interacting with Hbase through the Hbase Thrift server. Using the Thrift service requires the programmer to write boilerplate code for importing Thrift libraries, handling connections and protocols and so forth. Happybase hides this by presenting a simpler API while still offering the same functionality as using Thrift directly.

2.11 Flask and RESTful services

Flask is a Python framework for web development. It contains a Web server and can be used to develop RESTful web services. A RESTful web service is one that follows the REST architectural style [8] for designing web applications. It allows clients to use URLs to issue CRUD (Create/Read/Up- date/Delete) operations over HTTP by using the HTTP verbs (GET, POST and DELETE). Because it is based on standard HTTP, REST is platform and language independent, making it ideal for web applications that might interface with many different types of clients. In table 2.1 is an example of how a RESTful service might present operations on resources on the server Resources are identified by a Uniform Resource Identifier (URI), which might be used to retrieve or delete a single element or a collection, or create new elements depending on which HTTP method is used.

(31)

2.12 Geohash and Geospatial indexes for data- bases

Conventional RDBMS store data in a table and create an additional index on the data for spatial queries. One of the Indexes that are commonly used are quadtrees. Although there are good alternative ways to index spatial data, like R-trees or K-d trees, we have chosen to use quadtree in this thesis because it is simple to implement. R-trees, which could have been a good alternative have been shown to outperform quadtrees[12], but the improved retrieval time is in exchange for complexity when inserting, deleting, and maintaining the index.

A quadtree is a hierarchical data structure where each node has exactly four children, and works by recursively dividing space into four quadrants.

It is a datastructure that can be used in geographical applications such as ours by dividing a map in the same manner as shown in figure 2.3. The resolution of the decomposition of the map can be chosen up front, or it can adapr to the input data[9] so that areas with many datapoints gets more segmented.

Figure 2.3: Quadtree structure over a map[3]

Geohashes are a way to represent geographical coordinates as a string.

Geohashes are quite popular in indexes for geospatial databases and are used for instance by GeoMesa[12]. Geohashes work similarly to quadtrees by recursively dividing the map. The first step is dividing it into two separate squares, where things in the left half get geohashes starting with 1, and things in the right half get geohashes starting with 0. These two halves are divided again in top and bottom halves, where points in the top half are given 1 as the second binary in their geohash and points in the bottom

(32)

half are given 0. This system can represent a geographical location as a binary number, which is represented as a string of characters by Geohash.

Geohashes organize data in a hierarchical structure like a quadtree, which means that points are geographically close to each other usually share a common prefix.

2.13 Data-sources

In this thesis we use data from NOAA(National Oceanic and Atmospheric Administration) ISD (Integrated Surface Database) to populate the database. The ISD Contains data from more than 35,000 stations across the entire world, but due to storage constraints only data from 2014 is used. The data in the ISD is stored by year, each year contains a number of stations that made measurements that year, and each station has a number of hourly measurements for that year. The measurements for one hour are stored as one string, a record, following a specific format, where each segment of the string is designated for certain information. Each record contains hourly values for a range of measurements like air temperature, wind speed and direction. It also has information on the quality of those measurements and how the measurements are obtained. in addition to meta information like station identifiers, geophysical location. The dataset for 2014 contains 120 000 000 records, and when compressed it takes up about 5 GB of diskspace.

(33)

Chapter 3 Related work

In this chapter we will look at an approach to store and process weather data and present studies on non-relational databases and spatio-temporal data that is relevant for our thesis.

In the following section we will look at a another study that uses Hbase and Hadoop for storing and processing weather data, before we review non-relational databases MD-Hbase GeoMesa, Hgrid for geospatial and timeseries data and look at how they index their data.

3.1 Hadoop for Geoscientific data analysis

In this thesis we try to tackle the use case of storing weather data in a non- relational database. While there has been a number of studies on other uses for nosql databases and the benfit from employing them in a usecase, there is little research to be found on no-SQL databases specifically for storing weather data, but we did find one.

In ”Enabling Big Geoscience Data Analytics with a Cloud-Based, MapReduce-Enabled and Service-Oriented Workflow Framework” the autors implement a scientific workflow framework which allows users to build workflows for processing geoscientific data, using a GUI to visually connect different services. The services run by the workflow are MapRe- duce jobs which process geoscientific data stored in Hbase. The data in question is extracted from NetCDF files, a fileformat for grid- or array oriented data, which is decomposed and loaded into hbase. The resulting pro- totype achieved faster computation times with parallelized map-reduce jobs compared to a serial computation when computing monthly averages. The scalability of the solution was also tested by running the computation several times on an increasing number of nodes. The results show that the computation time decreased significantly, demonstrating that the system scaled efficiently, and that the paralelized map-reduce computation was faster than the linear one.

This study is smiliar to ours in that Hbase and HDFS is used for storing and running computations on weather data. However in our solution we will evaluate som of the other tools and frameworks available in the Hadoop-

(34)

stack, and make a REST-api that will provide access to the database. we will also use observational data from weather stations and not computed values from NetCDF files.

3.2 Spatial indexes for non-relational databases

Weather data is inherently spatio-temporal. Which means that a weather measurement always has a time and a place, so this is a natural way of organizing weather data. While traditional relational databases can rely on secondary indexes for time and place for storing and querying spatio- temporal data this is not an option, at least not an efficient one, for non- relational databases like Hbase. Although indexes for spatio-temporal data is not the main focus of this thesis, the RowKey as an index is a very important aspect of a non-relational database both in terms of scalability and query efficiency. So in this section we will review some of the research on spatial, and spatio-temporal indexes for non-relational databases.

Approaches for storing geospatial data in non-relational databases include MD-Hbase: a multidimensional index-structure for Hbase, and Hgrid: a data model which seeks to improve on the work in MD-Hbase. In A. Fox et al.[12] an index for geospatial data is presented that also includes a temporal component in the index. This index is used in GeoMesa, a database for spatio-temoral data that can run on top of several big-data frameworks.

3.2.1 MD-Hbase

MD-Hbase is a multi-dimensional datainfrastructure for location aware services proposed by Nishimura et al. [17]. In the paper they recognise that relational databases, while providing the means for sophisticated queries through secondary indexes, can be overwhelmed by location aware services that require a high level of scalability to store and analyze data and provide concurrent reads and writes. Further they they establish that non- relational databases on the other hand provide a high level of scalability and availability, but lack the possibility of multiple attribute access without full table scans. In order to bridge the gap between the two types of database they propose a multi-dimensional index for Hbase created by a z- ordered linearization technique which is used to implement quad-, and KD- trees: two common structures for multi-dimensional data, like geographical points. The downside of this linearization technique is that a scan might incur a lot of false positives because points that are geographically close might not always be close in the z-ordering. In order to cope with this they try to prune out irrelevant subspaces at a high level in the tree, which they consider to be an operation cheap enough to not cause severe performance issues.

(35)

3.2.2 Hgrid

Dan Han and Eleni Stroulia propose a different way of indexing geographical points in Hbase with Hgrid[14]. Like MD-Hbase they use z-ordering to implement a quad-tree index for the data and also try to improve on the index by combining it with a regular grid model. The regular grid model uses the row key and column qualifier as an index similar to how latitude and longitude works in the geographical coordinate system. Several points located within a cell defined by the regular grid model is stored in a stack created by having several versions of each cell. The drawback of the regular grid model is that, while it maintains locality of the data better than a z-ordered index, it scales poorly because Hbase does not support efficient subsetting of versions of a cell, and the some rows might contain a lot of points, which increases retrieval by more than double of the amount of data points. As a compromise between the z-ordered index and the regular grid model a two-level index is proposed. Large geographical areas are split using a quad-tree index and the points in each area in the quad-tree is further indexed using the regular grid model.

3.2.3 GeoMesa

GeoMesa is a spatio-temporal database that can run on top of several distributed storage systems like Hbase and Accumulo. The index they use in the database is presented in the paper Spatia-temporal Indexing in Non-relational Distributed Databases[12]. Like MD-Hbase and Hgrid Z- ordering is used to create a quad-tree structure index for the data points.

The z-ordering is achieved by Geohashing which converts geographical coordinates to a binary string. A temporal component is added to the index by interleaving the Geohash string with a string representation of a date and time. The Column family and qualifier key is used as an identifier for specific data elements and higher resolution in the temporal and spatial dimension, similar to how Hgrid leverages columns for fine grained resolution, but still using a quad-tree index.

(36)

(37)

Chapter 4 Design and implementation

4.1 Requirements

In the introduction chapter we discussed features our solution should have.

One important feature was that we want the system to be scalable in order to accommodate for a large, and growing number of weather stations and measurements. We also looked at how other publicly accessible weather databases work and present their data. From this we concluded that our system should offer an API that can be used to by external apps to for example visualize the data.

• Our storage system should be scalable

• Our storage system should offer a cross platform API with reasonable response times.

• Our storage system should enable queries that are typical for applications that use weather data

• Our storage system should provide the means scalable processing of the data in the database

4.2 Design decisions

Decisions on design of our storage solution are taken with the requirements stated above in mind. There are several factors that impact the functionality offered by our solution and its performance.

One of them is the choice and configuration of the tools and frameworks we use. There are plenty Nosql databases and possible pipelines to the data stored in them and choosing the right one could have a significant impact of how well our final solution performs.

Another factor that is important is the schema and row-key design.

Since a key-value database typically only has one index for a table - the key, special care has to be taken in order to design a key that allows efficient data retrieval on this index. Also some ways of structuring the tables in key-value

(38)

databases are more scalable than others, in Hbase this typically is a choice between tall-narrow and a flat-wide schema.

In the Hbase ecosystem there are several pipelines available to access the data stored in the database. They are not mutually exclusive, so several can be used in the same set up and be used for different usecases. The different pipelines have different intended uses and selecting the right one is can be important in achieving good performance and providing the desired functionality.

4.3 Hadoop stack and data pipline

We want to use tools that we know are inherently scalable as this is one of the key requirements of the system that we are designing HDFS offers a scalable distributed system that can run on commodity hardware. It scales both in regards to storage capacity and computational power and has a rich selection of related projects that can be useful when designing and implementing a new solution. Therefore it is be the base component of the storage system architecture in this thesis.

On top of the file system we run a distributed Hbase database in which we will store our weather data. Hbase is the natural choice because of its close affinity with Hadoop and it’s reletad projects like Hive, Impala and Spark. Hbase has also been shown to be scalable in comparison to RDBMS.

In this thesis all programming is done in python. Python was chosen because it is a popular scripting language for data scientists, which we believe are the people who typically will use Hbase to analyze large datasets.

Although Java is native to Hadoop and Hbase, and being a compiled language, is faster than python, we chose python because we believe it allows for more rapid development, and because we believe it’s more clean and simple to develop all the services in one language.

Communication with Hbase in any other language than Java goes through its Thrift gateway or REST server. Allthough REST is considered easier to set up and use, Thrift is the faster option because it does not send a schema along with it’s data, thus using less bandwith. To simplify inter- facing with thrift we use the Python library Happybase, which hides some of the boilerplate code that has to be written in order to read and write data through the Thrift gateway.

We want to compare a different pipeline to the data other than Thrift, and a popular way to query data on Hbase is Impala. Although Impala uses Thrift to communicate it also provide the means to query Hbase with SQL- like queries, so that users who are not proficient in Java or another programming language also have a way of interacting with Hbase. Impala is

(39)

Figure 4.1: Overview of services and pipelines in the design

shipped as a part of Cloudera, a management tool for Hadoop distributions that simplifies setting up the Hadoop stack and has an interface for managing and monitoring the cluster. In order for Impala queries to work on our Hbase schema, the schema has to be mapped through Hive, which shares its metastore with Impala.

The Hive mapping has to be written manually as a statement in HiveQL.

Following is how a table for weather stations stored in Hbase is mapped using HiveQL, showing how each attribute that is desired in the external Hive table has to be mapped to a column in Hbase. Hive can also be used as a pipeline, but because it runs MapReduce jobs, which can take a long time just to start up, it is not suited for applications that require fast responses for interaction.

(40)

1 CREATE EXTERNAL TABLE stations (

2 rowkey string ,

3 name string ,

4 WBAN string ,

5 id string ,

6 elevation int,

7 to_time int,

8 from_time int,

9 lat float,

10 lon float)

11 STORED BY 'org. apache . hadoop .hive. hbase . HBaseStorageHandler '

12 WITH SERDEPROPERTIES ( " hbase . columns . mapping " = ":key ,name:name ,name:WBAN ,name:id , coordinates :\

13 elevation , to_time :ts , from_time :ts , coordinates :\

14 lat , coordinates :lon" )

15 TBLPROPERTIES ("hbase .table .name" = " stations ");

4.4 Schema and Key

For our solution we experimented with several ways to organize and index the database. In this section we will discuss the design process and the strength and weaknesses of the different designs

4.4.1 Tables

Our design uses two tables, one for stations and their metadata, and one for measurements and their metadata. This is a logical segmentation that is also used by RDBMS that store weather measurements. And it will allow us to use the station table to retrieve a list of stations that we then can use to look up measurements from stations in an area in the measurements table.

In the stations table we store metadata from each station. Each station entry has the geographical coordinates of the station and its elevation.

Weather stations in the ISD have names and different identifiers depending on what agency or department the station belongs to, all of them are stored in the stations table. In addition we store information on when the station began operating and when it stopped operating, if it has stopped. An example is shown in figure 4.2, we leave out most of the values but it should be possible to catch the general idea from the figure.

In should be noted that column families and column qualifiers should have their names shortened to single letters to save storage overhead. This might seem like a trivial detail, but it is important to remember that the column qualifier and family is stored for each row in the table. In tables with many rows and columns this can have a significant impact, especially in tables where each cell contain very little data the column qualifier can can

(41)

take up more storage space than the data it self if long column qualifiers and families are used.

Figure 4.2: schema for thestationstable

In the measurement table, the records from each station is stored. Each record is stored as a row in the table containing the fields that are man- datory for measurements in the ISD. The data we store is wind speed and direction, cloud ceiling height, visibility distance, air temperature and air pressure. We also store metadata on each measurement. The metadata is information on the quality of the measurement, and how the measurement was made. All records have a timestamp for when they were generated, and coordinates for where they were made. A simple example of the table is shown in figure 4.3

Figure 4.3: First try at a schema for themeasurementstable

We tried several ways of organizing the measurements table, and they differ in the way they segment the weather measurements. The first proposal separates the data so that each type measurement will have it’s own column in the table. This design makes the data easy to work with in the SQL-like Impala interface because it mimics how data would be organized in a RDBMS with separate columns for wind strength and wind direction, making the mapping of data through Hive a simple task. Due to the flexibility of Hbase allowing column qualifiers to be added on the fly by clients, more types of measurements from a stations later would not be a problem either. However with all the columns containing relatively little data, the storage overhead was quite large and only part of the dataset was able to fit on the disks, as can be seen in figure 4.4 in comparison with our next try at a schema. This approach also goes against the recommendations found in the literature for designing Hbase schema of keeping the number of column qualifiers and column families small[13].

(42)

Figure 4.4: Comparison of table design impact on storage space In order to improve on the storage overhead the next proposed schema stores an entire record (a collection of measurements from one station at a give time) as one string, where different segments of the string are designated for certain types of data like the data in ISD is stored. This very minimal schema design has the benefit of having very little storage overhead, but requires some parsing of the data that requires knowledge of how data in the ISD should be read. The parsing of the data can be done server- side before returning the data to the client through the REST-API so that the user would not have to be familiar with how ISD data should be parsed.

However when working with Impala or Hive this is not the case and the user has to be familiar with the ISD format, and using any of these pipelines to compute aggregate values directly is not really possible. Another upside to storing entire records in one column is that it frees the columns as an index so that columns can be used to index other things than just measurement types, for instance each row could represent a station containing several records indexed by a timestamp as a column qualifier. In table 4.5 the second and quite sparse schema can be seen.

As can be seen in figure 4.4 the storage overhead was reduced significantly. The reason for this big difference has to do with how the data is stored. As previously mentioned there is some storage overhead associated with each cell because the column qualifier and column family is stored for all of them. When the cells contain very little information the column qualifier and family can take up as much, or even more space than the values stored in the cells. In figure 4.3 this can be seen in the first three columns where both the contents of the cells and the column id’s are three characters long.

(43)

When all the data is stored in the same column family it is also stored together on the disk. So depending on the query, this table design can have some impact on query speed. With all the data stored in the same column family, an entire row is read with a single disk read, which is faster if what you want is all the data in the row, or data spanning several column families, which would otherwise require several disk reads depending on the number of column families. On the other hand if the the table is separated in several column families and a query requests data from only a single column family less data would be read from the disk, which might reduce query time.

Figure 4.5: The second attempt at a schema for themeasurementstable

4.4.2 RowKey

In a Key/Value database like Hbase, the design of the RowKey is a very important factor. Serving as the only index without having to do full table scans it will have considerable impact on how efficient our queries will be and what type of queries we will be able to write while maintaining a relatively short response time. Queries in Hbase are done via gets, which return exactly one row given a RowKey, or Scans, which return a range of rows given a start and stop row. Queries that do not use rowkeys to reduce the search area, have to preform a full table scan which take a very long time.

Ideally the row key design should exploit the lexicograpcal ordering of row keys to enable scans on a narrow band of rows to retrieve the data the users want. In the Google BigTable paper[4] this is demonstrated in the example tableWebtablethat indexes webpages. In the table, pages in the same do- main are grouped together into contiguous rows by reversing the host name components of the URL.

In order to figure out what the Rowkey should look like we should take into consideration what type of queries we are likely to write. We look at other services that allow users to explore historic weather measurements to get an idea of what type of queries our solution will typically receive. Pre- viously we looked at the NOAA ISD, YR and and Weather Underground.

These services offer historic weather data where the users specify one or more stations and then give a specific time, or a time interval for the data the users want. NOAA ISD also have an interactive map where users can select an area and get data from weather stations in the selected region.

(44)

From looking at these services we decided on some typical queries that we let guide the design of the Row Key:

• Get measurements from a given an area and a time or time interval.

• Get measurements from a single given station and a time or a time interval.

• Get metatdata from stations in a given area

For the first and last type of geospatial queries to be efficient, stations and measurements that are geographically close should also be indexed closely together so that they can be retrieved in one scan, or at least as few scans as possible. For our stations table it is pretty straightforward, for the RowKey design we use a hash based on a quadtree in order to lexico- graphically sort geographical points in z-order. The hashing encodes each stations geographical coordinates as a string of integers 12 characters long.

The quad tree is hierarchical, which means that close points will often share a common prefix. This means that for queries for an area we can calculate a shortest common prefix for stations in that area and use hbase scan for stations with that prefix. This method will return some false positives, but it is simple and quick to remove them from the set of returned stations.

For the design of the rowkey for the measurements table the hashvalues from the stations are used as a prefix and we append the timestamp for the record to it, the two parts of the row key are separated by a ’-’ as shown in figure 4.6 With this RowKey a query for records in the measurements will require a specific station hash, but can be executed to retrieve records from a given time range, or a specific time if the user knows the exact timestamp of the record. This design works fine for queries that request records or from one specific station, and it is able to quite efficiently answer the queries previously stated.

Figure 4.6: The RowKey for themeasurementstable

In order to improve the row key design in respect to retrieval time for queries for several stations we could use a row-key design similar to how Geo-mesa with some simplifications. Instead of having each row contain a single record, each row will contain all records from a single station with the station hash as a row-key, and each records timestamp as a column qualifier.

The downside to this design is that it excludes us from using Impala to query

(45)

the data. The table can still be mapped to a Hive table by making a Key,value map of the row, but unfortunately Impala only supports querying this type of structure in Parquet files, not in Hbase tables.

The difference between the two designs of Row-key and Table layout can also be looked at as a choice between what is known as a tall-narrow and flat-wide approach to table design in Hbase. In our designs the row-key consisting of station hash and timestamp combined would be considered as a tall-narrow approach with only one, or quite few columns stored in each row, while the rowkey consisting of just the station hash and individual records indexed by timestamps as column qualifiers would be considered a flat and wide approach with a lot of information stored in each row.

In Lars George [13] the recommended approach to table design is tall- narrow. The two main arguments for the tall-narrow approach is that scans on rowkeys yields the best query performance because it is the only Key- Value field that allows skipping rows in addition to store files. The other point he makes in favor of tall-narrow tables is that some rows might grow beyond region size. In our case for instance a station that makes frequent measurements for long time might exceed the region size, which would dis- allow that region from splitting.

Our usecase is a special one in that the area queries requires a separate scan for each station in a given area. This is what we could have mitigated with the flat-wide design which would have enabled us to query several stations with one scan. Another, third option could a middle ground which would interleave the timestamp and the station hash of a measurement.

This is what the Row-key in GeoMesa looks like, the Database mentioned in chapter 3. In GeoMesa the exact interleaving of geohash and timestamp can be changed and optimized according to the different aspects of the data like how many geohashes vs timestamps there are, in order to archive a good performance. For instance the RowKey could be the year of the record, followed by part of the geohash, followed by a timestamp, which again would be followed by the rest of the timestamp.

For datasets spanning several years this provides data locality per year, and distributes the data more evenly among the region servers. We will not do any do any further changes to the key we have as we believe that what we have is good enough for our purposes, but it is an interesting prospect for further improvement of the response time of the database.

4.5 RESTful API

Previously we mentioned the other weather services like Weather Under- ground, NOAA ISD and WSKlima that provides web services and Web apps that allow users to explore historical weather data. Making a web API, and use that API to make a simple Web app is a good way to test the functionality of our data storage solution and get a feel of how well suited it is for interactive applications. Because RESTfull services are platform independent and are so widely used, it was a natural choice for the API.

Efficient and scalable storage of big weather data

Efficient and scalable storage of big weather data

An experimental set up and evaluation

Bendik Ibenholt

Master’s Thesis Spring 2017

Efficient and scalable storage of big weather data

Abstract

Acknowledgements

Contents

List of Figures

List of Tables

Chapter 1

Introduction

1.1 Motivation

1.2 Problem statement

1.3 Evaluation metrics and premises

1.4 Approach

1.5 Research method

1.6 Contribution

1.7 Organization

Chapter 2

Technical Background

2.1 Distributed systems

2.2 Hadoop

2.3 NoSQL and non-relational databases

2.4 Hbase

2.5 Cloudera

2.6 Impala

2.7 Hive

2.8 Spark

2.9 Leaflet

2.10 Happybase

2.11 Flask and RESTful services

2.12 Geohash and Geospatial indexes for data- bases

2.13 Data-sources

Chapter 3

Related work

3.1 Hadoop for Geoscientific data analysis

3.2 Spatial indexes for non-relational databases

Chapter 4

Design and implementation

4.1 Requirements

4.2 Design decisions

4.3 Hadoop stack and data pipline

4.4 Schema and Key

4.5 RESTful API