Representing and Storing Semantic Data in a Multi-Model Database


Simen Dyve Samuelsen

Thesis submitted for the degree of

Master of science in Programming and Networks 60 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO


© 2018 Simen Dyve Samuelsen


http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo


Abstract

The emergence of NoSQL multi-model databases, which natively support scalable and unified storage and querying of various data models (graph, document, key-value, relational, etc.), opens new opportunities for efficient representation and storage of data. Whereas traditional systems rely on multiple databases and/or on databases that are not optimized for the data being stored, multi-model databases allow for more flexibility and are built around the concepts of database distribution and availability. The multi-model structure also allows a single database to do what several databases are nowadays combined to do in a polyglot structure.

Semantic data with its graph-oriented structure is one type of data structure that could benefit from the use of multi-model databases, both for representing and storing the data. RDF is a popular model for semantic data, but RDF management systems are facing challenges when it comes to scalability and generality and the scalability challenge is particularly urgent.

Working with RDF graphs, which are typically highly connected and distributed, results in querying large volumes of data, thus making the scalability issue more pressing. Earlier approaches to improving storage for RDF data have used relational databases, but even though these are optimized for data handling, they are not very flexible, and semantic data does not necessarily fit within the rigid, pre-defined schema of a relational database. NoSQL databases allow for better flexibility and do not enforce any pre-defined schema on the stored data, thus better supporting the variety of data within the semantic data domain.

This thesis explores and defines different approaches to representing and storing RDF data within a multi-model NoSQL database. It identifies various aspects of mapping the RDF data structure to a multi-model data structure and discusses their advantages and disadvantages. In addition, the thesis describes an approach to representing the semantic spacetime data model introduced by Mark Burgess, comparing how two different semantic models (RDF and spacetime) can be represented in the same multi-model database. Furthermore, the thesis proposes a prototype implementation of the two representation and storage approaches in ArangoDB, a popular multi-model database.


Acknowledgements

I would like to thank everyone who helped and contributed to this thesis. I am very grateful for all the time invested.

First, I thank my supervisors, Dumitru Roman (UiO and SINTEF) and Nikolay Nikolov (SINTEF), for all their guidance, motivation, technical and conceptual discussions, and contributions to the development and writing process. By including me in the project and the group, they have been essential to the thesis and the end result. In addition, thanks to Dumitru for inviting me to give presentations and hands-on sessions on NoSQL and multi-model databases at the University of Oslo, challenging me to present and discuss features of the technology.

I also want to thank everyone else in the Smart Data group at SINTEF for their contributions and discussions of the implementation, and not least for including me in the discussion of how the connection between ArangoDB and DataGraft could be implemented.

Second, I would especially like to thank my family for supporting me through this process and for being so patient with me.


List of Figures

1.1 Iteration process
2.1 RDF graph
2.2 ArangoDB NoSQL benchmark results 14/02/18
3.1 DataGraft dashboard
3.2 DataGraft assets
3.3 Graph mapping in Grafterizer
3.4 DataGraft connection administration
3.5 DataGraft ArangoDB database administration
3.6 Localscript components
3.7 String hash function, to hash URIs
3.8 Web service components
3.9 RDF flattened representation in JSON
3.10 The different options to handle the results when using Grafterizer
4.1 Example spacetime data
4.2 Example spacetime data represented in ArangoDB
4.3 The association menu types
4.4 The STtypes as referred to from the association menu
4.5 AQL query retrieving all connected nodes from the node n
5.1 Survey question 1
5.2 Survey question 2
5.9 Survey question 9
6.1 Polyglot persistence


List of Tables

2.1 Spacetime STtypes
5.1 Benchmark results
5.2 Benchmark results RDF


Chapter 1

Introduction

1.1 Context

The adoption of the linked data paradigm and the RDF format¹ has grown significantly over the past decade: the Linked Open Data (LOD) cloud² initiative reports close to 1200 datasets (up from just 32 in 2008), and the current total size of the Data Web is estimated at almost 3000 distinct datasets and around 150 billion triples³. The linked data paradigm promotes the publishing of semantically enriched data on the Web through the use of self-describing data/relations and interlinking based on associating globally unique identifiers of data. Every entity or thing in RDF is represented by a Uniform Resource Identifier (URI) that can be dereferenced, which allows integration of data in a cross-domain graph.

Even though RDF data is gaining wider acceptance, there are still challenges with the practical use of RDF: “RDF data management systems are facing two challenges: namely, systems’ scalability and generality. The challenge of scalability is particularly urgent” [15]. Working with RDF graphs, which are typically highly connected and distributed, results in matching and querying large volumes of data, thus making the scalability issue more pressing.

The article [9] describes new consumer trends that triggered the need for new ways to store large amounts of data. The process is a self-reinforcing loop: more users generate more data, more data leads to better algorithms, and better algorithms make for a better user experience, which in turn attracts more users. Larger amounts of data have pushed the development of solutions for handling big data, such as Hadoop⁴. To further improve the handling of collected data that exists in a variety of formats, NoSQL database systems have emerged. Unlike relational database systems, NoSQL database systems do not enforce attributes, types or structure on the stored data, which provides the flexibility needed.

¹ https://www.w3.org/RDF/

² http://lod-cloud.net/

³ http://stats.lod2.eu/

1.2 Motivation

With increasing amounts of data being stored, it becomes even more important to store data as descriptively as possible. When semantics such as RDF are used to describe and store data, the data itself carries its value definition and data type. The RDF description eases later use of the data for analytical purposes, as the data does not need to be transformed or standardized before use.

The same growth in data volume and variety of data formats has increased the popularity of NoSQL database management systems and their adoption in the market. The availability of NoSQL database management systems as open source solutions is another factor in their increased popularity, as users are not locked into expensive license plans.

As stated in section 1.1, traditional RDF data management systems face challenges when it comes to scalability and availability of data. Earlier approaches to building better storage solutions for RDF data often use either relational databases or single-model NoSQL databases. Both approaches have benefits and disadvantages; the main disadvantage is that neither type of storage engine has very good support for both large amounts of data and graph connections. Within NoSQL database management systems, new solutions have emerged. These are referred to as multi-model database management systems, which combine and handle several storage structures, enabling storage of both large amounts of data and graph connections.

⁴ http://hadoop.apache.org

1.3 Research questions

The scope of this thesis is to investigate the possibility of using recently emerged NoSQL multi-model databases to store and represent semantic data. To gain insight into storing and representing semantic data within a multi-model NoSQL database management system, the thesis uses two semantic data models as the basis for the data that is represented.

The purpose of this thesis is to develop an approach to storing semantic data within a multi-model NoSQL database management system, in order to achieve a storage solution that better handles the scalability challenges that traditional RDF stores face today.

Based on the above described purpose of this thesis, the following research questions are explored:

• How can the different semantic data models be represented within a multi-model NoSQL database management system?

• How well do multi-model NoSQL database management systems perform compared to single-model database management systems?

• How well does the suggested solution, stored within a multi-model NoSQL database management system, perform compared to a traditional RDF database management system?

• Can the two different semantic data models use the same approach of representation or are they better handled with different approaches?

1.4 Research design

Based on [13], we divide research into two categories: basic and applied research. Basic research focuses on gaining insights and knowledge from what already exists. Applied research, on the other hand, tries to solve existing problems. Technology research mainly falls into the applied category, where technology is used in new combinations to improve on, or create better alternatives to, what already exists. The main motivation in technology research is to improve on or cover a present need. The first step is to define the requirements for a new and better solution. Development then starts based on those requirements, and a proof of concept (POC) is often built to demonstrate that the end goal is reachable. Developing new solutions requires several iterations (shown in figure 1.1) of development, testing and adjustments based on the tests.

Figure 1.1: Iteration process

1.5 Thesis outline

The thesis is structured into six chapters:

Chapter 1 gives an introduction to the thesis. The chapter describes the context and motivation, and defines the research questions and research design used for this thesis.

Chapter 2 contains the background for the thesis by discussing the two semantic modeling methods and introducing the concept of multi-model databases. The chapter also gives an introduction to the multi-model database ArangoDB, including a justification of why we chose to use ArangoDB in our research setting.

Chapter 3 discusses the implementation of the first modeling method. The chapter discusses the approach to representing RDF data in ArangoDB and provides some alternatives for how RDF data can be represented there. The RDF conversion was implemented using the tools on the DataGraft platform. The chapter gives an introduction to the DataGraft platform, describes how the conversion was implemented, and provides an evaluation of the usage of ArangoDB as a multi-model database for our approach.

Chapter 4 introduces the implementation of the second modeling method. In this chapter the approach to represent spacetime data in ArangoDB is discussed. It moves on to describe the implementation of the model, and how the transformation of spacetime data is done. The chapter ends with an evaluation of the representation and how well ArangoDB allows us to represent this modeling approach.

Chapter 5 presents the evaluations of the implementations presented in chapters 3 and 4. It describes the test environment and the basis for the evaluation before presenting the results.

Chapter 6 concludes the thesis, including a discussion of the outlook for multi-model databases, whose advantage is being versatile and not restricted to one type of data representation. The chapter ends with a discussion of future work based on the modeling approaches used in the thesis and on the thesis itself.


Chapter 2

Background

2.1 Semantic data

Semantic data is a way of modeling data so that it is meaningful without human intervention. The semantic data model usually organizes data as the relationship between two objects. Semantic data provides an accurate description of the data, and based on the relationships and hierarchical organization of data, it will be possible to extract and derive knowledge.

In a software context, the semantic data model can be compared to spoken language, in which words in a given order make for one meaning while another sequence results in a different meaning. Based on this convention, the semantic data model makes it possible for different vendors and users of the data to know with certainty what the data explains and references. Growing interest in semantic data came from the US Air Force Integrated Computer-Aided Manufacturing program¹, which used informational models that handle the organization and semantics of environmental information. Furthermore, vendors like Google and Bing promote the use of semantics on the web to give more accurate search results and better applications that can understand the information within the data².

2.1.1 Semantic web

Semantic web is a term for semantic data that extends information online and enables computers to take advantage of that information; it is often expressed through RDF, the “Resource Description Framework”. The adoption of the linked data paradigm and the RDF format³ has grown significantly over the past decade: the Linked Open Data (LOD) cloud⁴ initiative reports close to 1200 datasets (up from just 32 in 2008), and the current total size of the Data Web is estimated at almost 3000 distinct datasets and around 150 billion triples⁵. The linked data paradigm promotes the publishing of semantically enriched data on the Web by using self-describing data/relations and interlinking based on associating the globally unique identifiers of the data. Every entity or thing in RDF is represented by a Uniform Resource Identifier (URI) that can be dereferenced, which allows integration of data in a cross-domain graph. By using URIs we guarantee that the entity we are talking about is what it is stated to be. As each URI also belongs to its own domain, we can build definitions and knowledge graphs within our own domain and expose them as a publicly available definition set for others to use. There are already several open RDF libraries publicly available, of which DBpedia⁶ is one of the largest. URIs can be combined, so different libraries can interact with each other, and resources in one library can reference another library based on how RDF is defined.

¹ http://www.semagix.com/what-is-semantic-data.htm

² https://www.technologyreview.com/s/424259/google-microsoft-and-yahoo-team-up-to-advance-semantic-web/

RDF is made up of triples generating a graph in the following format:

Figure 2.1: RDF graph

³ https://www.w3.org/RDF/

⁴ http://lod-cloud.net/

⁵ http://stats.lod2.eu/

⁶ http://wiki.dbpedia.org

Figure 2.1 illustrates the three-part format (the triple) described in section 2.1. In RDF we work with triples that describe the relation between data; each triple is formed by a subject, a predicate and an object. The predicate defines the relation between the subject and the object, and the subject is assigned the value of the object. In the example in the figure above, the user is assigned an age of 0 through the “foaf:age” predicate.
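The subject, predicate and object structure can be made concrete with a small Python sketch. It is only an illustration: the `Triple` type, the `objects_of` helper, the `http://example.org/` URIs and the second (name) triple are hypothetical and not taken from the thesis; only the FOAF namespace and the `foaf:age` example come from the vocabulary referred to above.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One RDF statement: the subject is related to the object by the predicate."""
    subject: str
    predicate: str
    obj: str


FOAF = "http://xmlns.com/foaf/0.1/"          # FOAF vocabulary namespace
USER = "http://example.org/user/1"           # hypothetical subject URI

triples = [
    Triple(USER, FOAF + "age", "0"),             # the example from figure 2.1
    Triple(USER, FOAF + "name", "Example User"),  # hypothetical extra statement
]


def objects_of(data: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object attached to a subject through a given predicate."""
    return [t.obj for t in data if t.subject == subject and t.predicate == predicate]


ages = objects_of(triples, USER, FOAF + "age")
print(ages)
```

Because every statement has the same three-part shape, a set of triples can be read as a directed graph in which the predicates label the edges.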

2.1.2 Semantic spacetime

Semantic spacetime [3] is a concept introduced by Mark Burgess that aims to be a better approach to representing and modeling data suited for analysis and knowledge extraction. Burgess states that “In modern machine learning, pattern recognition replaces real-time semantic reasoning. The mapping from input to output is learned with fixed semantics by training outcomes deliberately. This is an expensive and static approach which depends heavily on the availability of a very particular kind of prior training data to make inferences in a single step. Conventional semantic network approaches, on the other hand, base multi-step reasoning on modal logics and handcrafted ontologies, which are ad hoc, expensive to construct, and fragile to inconsistency.” [3]

The spacetime concept is built around events, and tries to describe and make sense of events as they happen. These can be real-life events or system events (hardware/software), for which the concept was originally developed. The aim is to connect events within a graph that describes each event and its relation to other events, in order to look for connections that are normally not discovered.

Representing these connections makes it observable when events are caused by other events. While the initial impression may be that one event is the direct cause of another, this structure can reveal other connections that better fit what actually happened, and disclose situations that would normally go unnoticed. Once the events are recorded and stored, knowledge is retrieved by building what are referred to as stories. For a given search criterion (for example a word or a set of relation types), the connected events and their relations are transformed into stories describing the chain of events and why they happened. As the model can be used on more than just events, the connected items are named concepts.

In essence, the format is two concepts connected by a relation type. This way of modeling data aims to replicate the way humans learn and make new meaning of knowledge. Something learned today may have a distinct meaning, but in 10 years a completely new meaning may emerge due to new knowledge and additional associations. The spacetime model is built as a triple consisting of two concepts and a relation type connecting them. Each concept can be a longer sentence or just a single word, and words are often connected by a relation type to the longer sentences. There are four different relation types, each with a description of the forward and backward (reciprocal) meaning of the type, and a context note that is stored for retrieval. The relation types, with their forward and reciprocal meanings, are listed in table 2.1.

ST TYPE   FORWARD                RECIPROCAL                      SPACETIME STRUCTURE
1         is close to            is close to                     PROXIMITY
          approximates           is equivalent to                “near”
          is connected to        is connected to                 Symmetrizer
          is adjacent to         is adjacent to
          is correlated with     is correlated with
2         depends on             enables                         GRADIENT/DIRECTION
          is caused by           causes                          “follows”
          follows                precedes
3         contains               is a part of / occupies         AGGREGATE / MEMBERSHIP
          surrounds              inside                          “contains”
          generalizes            is an aspect of / exemplifies
4         has name or value      is the value of property        DISTINGUISHABILITY
          characterizes          is a property of                “expresses”
          represents/expresses   is represented/expressed by     Asymmetrizer
          promises

Table 2.1: The four irreducible association types are characterized by their spacetime coincidence or adjacency. Note that, in promise theory, even relationships like ‘is correlated with’ are directed relationships: one may not assume that the assessment of a mutual property is mutually assessed. Similarly the expression of a property is a cooperative relationship.
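To make the forward and reciprocal readings of a relation type concrete, the following Python sketch models an association between two concepts. It is a minimal illustration under stated assumptions: the `Association` class, the lookup dictionaries and the example concepts are hypothetical, and only a handful of the relation names from table 2.1 are included.

```python
from dataclasses import dataclass

# (ST type, forward reading, reciprocal reading): a subset of table 2.1
ROWS = [
    (1, "is close to", "is close to"),
    (2, "is caused by", "causes"),
    (2, "follows", "precedes"),
    (3, "contains", "is a part of"),
    (4, "characterizes", "is a property of"),
]

RECIPROCAL = {}   # relation name -> name of the opposite reading
ST_TYPE = {}      # relation name -> ST type number (1..4)
for st_type, forward, reciprocal in ROWS:
    RECIPROCAL[forward] = reciprocal
    RECIPROCAL[reciprocal] = forward
    ST_TYPE[forward] = st_type
    ST_TYPE[reciprocal] = st_type


@dataclass(frozen=True)
class Association:
    """Two concepts connected by a relation type, plus a free-text context note."""
    source: str
    relation: str
    target: str
    context: str = ""

    @property
    def st_type(self) -> int:
        return ST_TYPE[self.relation]

    def reciprocal(self) -> "Association":
        """The same association read in the opposite direction."""
        return Association(self.target, RECIPROCAL[self.relation],
                           self.source, self.context)


# Hypothetical example concepts
link = Association("service outage", "is caused by", "disk failure",
                   context="operations log")
back = link.reciprocal()
print(back.source, back.relation, back.target)
```

Storing only one direction and deriving the other keeps the two readings consistent; whether both directions are materialized in the database is a separate design decision that this sketch does not address.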

2.2 Multi-model databases

Multi-model databases have existed in different forms for a long time [2]. In fact, one of the first multi-model database systems, the Garlic system, was presented in 1997 and based its implementation on wrappers that encapsulated the stored data sources. It modeled the data as objects, and the generated ID identified an interface and a repository for the object, describing how it should be handled. The system used SQL as its query language and extended its functionality to facilitate use of the database.

In earlier approaches, a multi-model database was built in layers that handled the different storage types and functionality, with a physical storage layer at the bottom. The definition of a multi-model database was therefore a “Multi-model database management system engine for database having complex data models” [2]. With the emergence of NoSQL databases, the term “multi-model database” has received a new definition. Nowadays, multi-model databases support multiple connected storage models such as document, key/value and graph.

These NoSQL databases implement and handle all three storage models to enable the benefits of each of these database types within one single database instance. The benefits of using a multi-model database include the scalability and query performance of document and key/value databases, as well as the flexibility and easy extensibility of graph databases.

As mentioned in [2], new databases have emerged that support the multi-model data structure, including both open source and commercial solutions such as ArangoDB⁷, OrientDB⁸, MarkLogic⁹, Amazon DynamoDB¹⁰ and Microsoft Azure Cosmos DB¹¹. Given the scope of this thesis, the choice of databases was restricted to open source solutions.

Current NoSQL multi-model databases have emerged from the need to use the more diverse data that has become available. Sensors and the concepts of IoT (Internet of Things) are used to collect even more data than before, and users often obtain huge amounts of data of poor quality. As stated in [1], “Increasingly, people want to use messier and messier data in complex ways”. With the amount, complexity and quality of data all changing, there is a need for more flexibility in storage options within a database. Traditional NoSQL databases solve some of the challenges of handling unstructured and distributed data, as they are built specifically to address issues of distribution and scalability. Also, being able to distribute a query to multiple nodes of a cluster, and even to let each node process the entire query locally before returning the result, decreases the amount of processing power needed on a single node.

⁷ http://www.arangodb.com

⁸ https://www.orientdb.com

⁹ https://www.marklogic.com

¹⁰ http://aws.amazon.com/dynamodb

¹¹ http://azure.microsoft.com/services/cosmos-db

The challenge with most NoSQL databases is that they support only one data model: document, key-value or graph. Hence, relations between data are not handled particularly well (in key-value and document stores), or the database does not perform very well when querying large amounts of homogeneous data stored on a node (in graph stores).

The new multi-model NoSQL databases that have emerged attempt to combine the benefits of the different NoSQL storage methods. Allowing a combination of a flat representation of data (i.e., as key-value pairs or documents) and inter-node/document associations (i.e., producing graphs) opens new opportunities for building efficient data storage solutions. The flexibility of the multi-model design allows for handling diverse data and different data models: anything from graph-represented data models such as RDF and knowledge graphs, to product information represented as documents, to postal codes linked to city names in a key-value representation. Systems that today use several databases to handle different data models would be able to combine everything into one type of database. As an example, e-commerce platforms tend to use a polyglot system design, using everything from relational databases to graph databases to handle transactions, product information, user recommendations and associated products. With the flexibility of a multi-model database, these types of storage structures could be combined into one database handling everything. This could potentially reduce production cost, because there are fewer databases to administrate, fewer query languages to handle and less data that needs to be merged after retrieval.

2.2.1 Overview of multi-model databases

As mentioned in section 2.2, there are several multi-model databases to choose from, but because the scope of the thesis is restricted to open source databases, two databases were considered: ArangoDB and OrientDB. They are both mentioned in [2] and are the most frequently referred to when searching for multi-model databases. ArangoDB builds its storage around collections and uses these to support storage and representation of all three NoSQL storage models. ArangoDB implements its own query language, AQL, to handle querying of all data models. OrientDB also supports the three NoSQL data models and enables the same multi-model opportunities as ArangoDB. Its query language is based on traditional SQL, extended with functionality to handle the different data models.

Figure 2.2: ArangoDB NoSQL benchmark results 14/02/18¹²

Since the aforementioned multi-model database solutions support roughly the same features, we chose between them by looking at performance on common tasks and at how the databases compare to the most commonly used graph database, Neo4j, and to other NoSQL solutions. We therefore used an already available benchmark that implements such a comparison¹³. The initial results of the NoSQL benchmark are presented in figure 2.2 above; they show that ArangoDB currently provides the best relative performance of the multi-model storage solutions. Furthermore, ArangoDB is easy to deploy on ad-hoc infrastructure, since it is a fully certified package for the DC/OS cluster management system. It also supports other infrastructure configurations, as ArangoDB provides deployable recipes for public cloud-based infrastructure-as-a-service platforms like Amazon AWS and Microsoft Azure, as well as Docker images for all of the necessary database components for Docker-based deployment. The query language provided by ArangoDB (AQL) to access all storage models is easy to understand and use, as it uses structures that are typical in imperative programming (“FOR - FILTER - RETURN”). Therefore, based on the performance benchmarks, ease of use and non-functional support for different deployments, ArangoDB was chosen for the implementation part of this thesis.

¹² https://www.arangodb.com/wp-content/uploads/2018/02/UPDATE-Performance-Benchmark-2018-Overview-Table.jpg

¹³ https://www.arangodb.com/2018/02/nosql-performance-benchmark-2018-mongodb-postgresql-orientdb-neo4j-arangodb/

2.2.2 ArangoDB

ArangoDB is built around collections, which are the storage mechanism used to accommodate the three NoSQL storage models. A document in ArangoDB is in essence a JSON object with a set of keys and values, which enables both document storage and key-value storage. Storing the JSON structure for each data entry could increase the amount of storage needed; to handle this, ArangoDB stores only unique JSON structures, so that documents with equal keys reference the same structure, which is stored once. As in other databases, the key attribute of each document can either be set explicitly or be auto-generated. Document keys are used to define relations between documents, which makes it possible to define a graph over the dataset inside the database. ArangoDB offers two types of collections for storing data: normal document collections and special edge collections. Edge collections are where the relations between documents in normal collections are defined and stored. An entry in an edge collection contains two extra keys that define the relation: a from key and a to key (the _from and _to attributes). These keys hold the document ID rather than only the document key. The document ID is built from the name of the collection where the document is stored and the document key, and the pair of IDs describes the relation between the two documents. The from and to keys describe the direction of the relation; to describe a bi-directional relation between two documents, two entries are added to the edge collection, one for each direction.
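The document and edge structure described above can be sketched with plain Python dictionaries standing in for JSON documents. This is not driver code and does not contact a database; the collection name `persons`, the attribute `label` and the helper functions are hypothetical. Only the `_key`, `_id`, `_from` and `_to` attribute names and the `collection/key` form of the document ID follow ArangoDB's conventions.

```python
from typing import Any, Dict, List


def document_id(collection: str, key: str) -> str:
    """A document ID is the collection name and the document key joined by '/'."""
    return f"{collection}/{key}"


def make_edge(from_id: str, to_id: str, **attributes: Any) -> Dict[str, Any]:
    """An edge document: an ordinary document plus the _from and _to attributes."""
    return {"_from": from_id, "_to": to_id, **attributes}


def bidirectional_edges(a_id: str, b_id: str, **attributes: Any) -> List[Dict[str, Any]]:
    """A bi-directional relation is stored as two edge entries, one per direction."""
    return [make_edge(a_id, b_id, **attributes), make_edge(b_id, a_id, **attributes)]


# Documents in a normal (hypothetical) collection called "persons"
persons = {
    "alice": {"_key": "alice", "name": "Alice"},
    "bob": {"_key": "bob", "name": "Bob"},
}

alice_id = document_id("persons", "alice")
bob_id = document_id("persons", "bob")

# Entries for a (hypothetical) edge collection
knows = bidirectional_edges(alice_id, bob_id, label="knows")
print(knows)
```

Since an edge refers to full document IDs, a single edge collection can connect documents that live in different document collections.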

ArangoDB offers two different storage engines: memory-mapped files (mmfiles) and RocksDB¹⁴. The two storage engines serve different needs when it comes to storage and retrieval. mmfiles is the original and default storage engine used by ArangoDB. The storage engine must be specified when the database is set up and cannot be changed later. The mmfiles engine is particularly suitable when the datasets fit within main memory. While keeping the data in main memory allows for faster retrieval times, the drawback appears when the system restarts: the database needs to reload the data into memory and rebuild the indexes, since these are also stored in main memory. To handle large datasets that do not fit within main memory, the RocksDB engine has been implemented, since it is optimized for datasets within the big data domain. The RocksDB storage engine keeps a hot set in main memory and persists indexes to disk to avoid rebuilding them on a reboot of the system.

Query languages for multi-model databases and other NoSQL databases fall mainly into two groups: they either apply some extension of the SQL language or define their own language. ArangoDB implements its own declarative query language, AQL, to support querying of all storage types used within the database, i.e., queries over all three models can be combined in one query. The basis of a query is the keywords “FOR - FILTER - RETURN”: the query loops over the data supplied to the “FOR” statement, applies some filter to it and returns the data that fulfills the filter. Graph queries use a similar syntax, but add some arguments to the “FOR” statement. These arguments define the direction of the traversal, the number of steps (depth) to traverse, a starting point for the traversal, and the graph within which the traversal should be done. In addition, ArangoDB has a set of predefined graph functions (queries) that can be called directly, covering everything from shortest path to page ranking. If some type of query or traversal is performed often, users can define their own functions and store them in the database.

The database is a certified DC/OS application supporting all necessary distribution and replication [7][5] needs of a distributed database imple-

14https://www.arangodb.com/why-arangodb/comparing-rocksdb-mmfiles-storage- engines/

(30)

menting RAFT[10] to handle automatic failover when a node goes down in a master / slave cluster setup. Out of the box the database delivers a REST[11] API that can be used to interact with the database directly, the REST functionality is also an enabler for the built-in JavaScript framework Foxxjs¹⁵. Foxxjs makes it possible to write JavaScript applications that can be used through the REST API to communicate and interact with the database, one example of usage is when it comes to type checking. As NoSQL does not apply any form of input validation the developer can use the Foxxjs framework to make an application to take input and validate the formatting before inserting into the database.

15https://www.arangodb.com/why-arangodb/foxx/


Chapter 3

Modeling RDF in ArangoDB

This chapter starts by describing the different ways RDF can be represented within a multi-model NoSQL database and how this can be implemented in the multi-model database ArangoDB and its data model. First, the thesis discusses three different approaches to representing RDF data within a multi-model database: a direct approach, a direct approach using edge values, and a flattened approach. Each of the three representations has benefits and drawbacks when it comes to how RDF can be represented in a multi-model database compared to a traditional triple store database.

Second, after defining the three possible representation methods of RDF in a multi-model database, the chapter gives an introduction to the DataGraft¹ platform, and describes how this platform is extended to enable storage of RDF data within a multi-model database with the help of the Grafterizer² tool.

Third, after establishing the different ways to represent RDF data within a multi-model database and the connection to the DataGraft platform, the third subsection describes the two implementations that have been developed to transform the RDF output from DataGraft into a representation that can be used in ArangoDB.

Chapter 5 includes a general evaluation of ArangoDB as a multi-model database and an evaluation of the proposed implementation and representation of RDF within a multi-model database.

1https://datagraft.io/

2https://github.com/dapaas/grafterizer


3.1 Representing the RDF data model in the ArangoDB data model

As multi-model databases provide flexibility (as described in section 2.2), mapping RDF values into ArangoDB collections can be done in various ways. Given that flexibility, three mapping strategies have been developed: 1) direct representation with respect to the RDF data model, where each node in the RDF mapping corresponds to a node in the node collection. This strategy is the most expressive, but performs poorly when the number of stored entities is large and/or when there are many connections between entities; 2) direct representation storing the predicate data in edge documents that connect the subject and object, an approach that handles larger datasets better than the direct mapping, but is still verbose; and 3) RDF flattening, which uses a set of heuristics for mapping RDF nodes and literals to the multi-model structure in an approach that allows storage of RDF in the most natural way with respect to graph/multi-model databases.

3.1.1 Direct representation

The most obvious approach to storing RDF within a multi-model database is to use a direct representation that maps each node in an RDF triple to a node in the multi-model database, connected through entries in the edge collections. As any RDF triple is built up of a subject, predicate and object, the direct representation contains one node for each of these values, and two edges connecting these nodes. Within each node, we define an attribute ”rdf” to store the fully qualified name (URI), which allows the semantics of the RDF node to be retained. This approach offers the most expressive representation of RDF when it comes to querying and matching data within the generated graph and could in fact also be applied in any graph database. However, multi-model databases (and also graph databases) are not optimized to work with very large numbers of small objects (containing one attribute/value apart from the key), which makes this approach the least satisfactory of the three in terms of performance.
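A minimal sketch of the direct representation, under stated assumptions, could look as follows. The function name, the `nodes` collection name and the counter-based keys are hypothetical. The sketch also assumes that subject and object nodes are shared between triples while a separate predicate node is created per triple, so that each subject-predicate-object path stays unambiguous; the thesis does not specify this detail. The `_key`, `_from` and `_to` attributes follow ArangoDB's document and edge conventions.

```javascript
// Direct representation sketch: one node per triple component, two edges per triple.
function directRepresentation(triples, collection = 'nodes') {
  const nodes = [];
  const edges = [];
  const keyByValue = new Map();
  let nextKey = 1;

  // Always creates a new node document holding the value in the "rdf" attribute.
  const createNode = (value) => {
    const key = String(nextKey++);
    nodes.push({ _key: key, rdf: value });
    return key;
  };
  // Reuses an existing node for a value seen before (used for subjects and objects).
  const sharedNode = (value) => {
    if (!keyByValue.has(value)) keyByValue.set(value, createNode(value));
    return keyByValue.get(value);
  };

  for (const { subject, predicate, object } of triples) {
    const s = sharedNode(subject);
    const p = createNode(predicate); // assumption: one predicate node per triple
    const o = sharedNode(object);
    edges.push({ _from: `${collection}/${s}`, _to: `${collection}/${p}` });
    edges.push({ _from: `${collection}/${p}`, _to: `${collection}/${o}` });
  }
  return { nodes, edges };
}

const direct = directRepresentation([
  { subject: 'http://example.org/alice', predicate: 'http://xmlns.com/foaf/0.1/knows', object: 'http://example.org/bob' },
]);
console.log(direct.nodes.length, direct.edges.length);
```

Every node carries a single attribute besides its key, which is exactly the "many small objects" pattern identified above as the weakness of this approach.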

3.1.2 Direct representation with edge values

The direct approach with edge values is similar to the direct approach, but instead of mapping each value of the RDF triple into one node, the predicate value is stored directly on the edge connecting the nodes. The result is then two nodes (the subject and object) connected by one edge (containing the predicate value). The expressiveness of the direct approach is kept, while the data size is reduced. Although the drawback for large datasets described above is reduced, it is still present in this approach.
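The same idea can be sketched with the predicate moved onto the edge. As before, the function name, the `nodes` collection name, the counter-based keys and the use of an `rdf` attribute on the edge are illustrative assumptions rather than the thesis implementation.

```javascript
// Edge-value sketch: subject and object become nodes, the predicate lives on the edge.
function edgeValueRepresentation(triples, collection = 'nodes') {
  const nodes = [];
  const edges = [];
  const keyByValue = new Map();

  // Returns the key of the node for a value, creating the node on first use.
  const nodeKey = (value) => {
    if (!keyByValue.has(value)) {
      const key = String(keyByValue.size + 1);
      keyByValue.set(value, key);
      nodes.push({ _key: key, rdf: value });
    }
    return keyByValue.get(value);
  };

  for (const { subject, predicate, object } of triples) {
    edges.push({
      _from: `${collection}/${nodeKey(subject)}`,
      _to: `${collection}/${nodeKey(object)}`,
      rdf: predicate, // predicate URI stored as an edge attribute
    });
  }
  return { nodes, edges };
}

const compact = edgeValueRepresentation([
  { subject: 'http://example.org/alice', predicate: 'http://xmlns.com/foaf/0.1/knows', object: 'http://example.org/bob' },
]);
console.log(compact.nodes.length, compact.edges.length);
```

Compared with the direct representation, each triple now costs at most two node documents and one edge document instead of three nodes and two edges.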

3.1.3 RDF flattening - a document representation of RDF

When using normal graph databases, the challenge is to balance between having too large or too small objects when representing data. Using small objects results in extremely large graphs and a large number of traversals when querying data. Using too large objects increases query time when matching values, because of the need to go through all values within each node. Document databases, on the other hand, handle large objects very well, and store entries in attribute-value pairs allowing for high-performance querying. Hence, this approach to storing and handling RDF data within a multi-model database is based on taking advantage of the data model of multi-model databases to store properties as object attributes within a document following a set of rules:

• URI nodes are mapped to JSON objects, which serve as nodes in the representation.

• URIs, which in RDF uniquely identify nodes, are used to generate unique numeric keys for the JSON objects (numeric keys enable more efficient storage and lookup). This is done using a standard hash function, and the keys themselves are stored in a special attribute called _key.

• Edges between nodes are generated based on the links between URI nodes in the RDF mapping template.

• Exception: rdf:type mappings – in RDF, these are used to specify type mappings for RDF entities. Types in RDF are URI nodes, which point to the semantic classes in an ontology or vocabulary, similarly to classes in object-oriented programming. The classes specified are instead stored in a ’type’ attribute, which is an array of types.

• RDF literals are mapped to JSON attributes for the URI node objects.

• Exception: rdfs:label – in RDF, these mappings are used to attach textual labels to entities. In the multi-model mapping, the values of these mappings are stored in a ’label’ attribute and used to display labels in the graph interface when exploring the graph.

• Prefixes and fully qualified RDF URIs are also stored in the resulting JSON object. The specified prefixes in the mapping are additionally kept in separate JSON objects in the node collection to avoid overlaps with other prefixes, and to enable namespace-based lookups (based on the RDF namespaces defined in the mapping).
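The flattening rules listed above can be sketched as a single pass over a list of triples. This is a minimal illustration under several assumptions: the input shape (`object.kind` being `'uri'` or `'literal'`), the caller-supplied `keyFor` function, the `nodes` collection name, the use of the full predicate URI as attribute name, and the `rdf` attribute on edges are all hypothetical. Prefix handling is omitted.

```javascript
const RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type';
const RDFS_LABEL = 'http://www.w3.org/2000/01/rdf-schema#label';

// Flattening sketch: URI nodes become documents, literals become attributes,
// rdf:type and rdfs:label are handled as the two exceptions, URI links become edges.
function flattenTriples(triples, keyFor, collection = 'nodes') {
  const docs = new Map(); // URI -> document
  const edges = [];

  const docFor = (uri) => {
    if (!docs.has(uri)) docs.set(uri, { _key: keyFor(uri), rdf: uri });
    return docs.get(uri);
  };

  for (const { subject, predicate, object } of triples) {
    const doc = docFor(subject);
    if (predicate === RDF_TYPE) {
      // Exception: types are collected in an array instead of becoming nodes.
      (doc.type = doc.type || []).push(object.value);
    } else if (predicate === RDFS_LABEL) {
      // Exception: labels are stored in a dedicated attribute.
      doc.label = object.value;
    } else if (object.kind === 'literal') {
      doc[predicate] = object.value;
    } else {
      const target = docFor(object.value);
      edges.push({
        _from: `${collection}/${doc._key}`,
        _to: `${collection}/${target._key}`,
        rdf: predicate,
      });
    }
  }
  return { nodes: [...docs.values()], edges };
}

// Example with a simple counter standing in for the hash-based key.
let counter = 0;
const flattened = flattenTriples(
  [
    { subject: 'http://example.org/alice', predicate: RDF_TYPE, object: { kind: 'uri', value: 'http://xmlns.com/foaf/0.1/Person' } },
    { subject: 'http://example.org/alice', predicate: RDFS_LABEL, object: { kind: 'literal', value: 'Alice' } },
    { subject: 'http://example.org/alice', predicate: 'http://xmlns.com/foaf/0.1/knows', object: { kind: 'uri', value: 'http://example.org/bob' } },
  ],
  () => String(++counter)
);
console.log(JSON.stringify(flattened, null, 2));
```

In this example three triples collapse into two documents and one edge, which shows where the storage savings of the flattened representation come from.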

3.2 Implementation in the DataGraft platform

As part of this thesis, a tool that can represent RDF within a multi-model database has been implemented in the DataGraft platform. The implementation is developed as an extension to the DataGraft platform using the output from Grafterizer to define the mapping between RDF and the flattened representation of RDF proposed in this thesis.

3.2.1 Overview of the DataGraft platform

The DataGraft³ platform was developed in the EU project proDataMarket⁴ as a platform for storing, cleaning and transforming tabular data into RDF. The platform is built for data sharing and management of various assets. The option for sharing is available for all asset types the platform provides, e.g., the original file, the transformation, or the stored result from transforming the data into RDF. These can be set as public assets, making them available and explorable through the dashboard of DataGraft (see figure 3.1).

The platform provides four different asset types (figure 3.2): file page, SPARQL endpoint, SPARQL query, and transformation. The file page is the asset type for storing files and accessing them through the platform; it can be created by uploading a new file or copying an existing file. In addition, a user can choose to use the Grafterizer tool to clean, rearrange, combine or split the data within the file before storing it. The transformation asset is a stored transformation defined in Grafterizer, enabling reuse of transformations used on tabular data files. Grafterizer provides an interactive user interface for cleaning tabular data and transforming it into RDF triples. It supports data transformations from raw tabular data to knowledge graphs in RDF using a schema mapping (a graph template, as shown in figure 3.3). The finished transformation definition can either be run on the uploaded file or downloaded as a JAR⁵ file to run it locally for larger datasets. The SPARQL endpoint asset is where the result of a transformation can be accessed if the user chooses to store it in the connected triple store. This endpoint can be accessed either through the portal directly or programmatically through the API, and SPARQL queries can be stored as assets for reuse and sharing in the portal. By default, the DataGraft portal integrates a connection to the RDF triple store GraphDB⁶ provided by Ontotext⁷. This is a traditional triple store providing a SPARQL endpoint that is exposed through the DataGraft portal as described above. The user can choose to either use an instance deployed and hosted by Ontotext or host their own database and connect it to the platform. The platform also integrates its own API and connection options to connect to and use the assets stored within DataGraft.

Figure 3.1: DataGraft dashboard

3https://datagraft.io

4https://prodatamarket.eu

3.2.2 Extension to the DataGraft platform

The Grafterizer tool on the DataGraft platform was used to implement the proposed RDF representation. This was done to extend the DataGraft platform with the option to use a multi-model database for data storage instead of just a traditional triple store. In addition, as described earlier in this chapter, as the amount of data grows within the big data domain, there is an increased need for new ways to store RDF and semantic data.

Figure 3.2: DataGraft assets

5https://docs.oracle.com/javase/tutorial/deployment/jar/basicsindex.html

6http://graphdb.ontotext.com

7https://ontotext.com

Grafterizer already implements methods to transform tabular data into RDF, and thereby also provides a good starting point for mapping RDF into a structure for multi-model databases. When the user defines the RDF structure (see figure 3.3), Grafterizer generates a JSON object describing this structure. The JSON object contains a set of attributes with different values, and these values may be objects or arrays of values or objects. In this way, the root JSON object encapsulates the entire graph structure, where the value of any given attribute can be seen as a graph-connected value.

For the implementation of a transformation script that transforms RDF into a representation usable in multi-model databases, the following three outputs from Grafterizer are used: 1) the JSON object representing the RDF graph structure as defined by the user; 2) a JSON object containing a mapping between prefixes and the fully qualified URIs used to represent RDF; and 3) the tabular data as a CSV file. In addition, the option has been added for the user to set up a connection to an ArangoDB database instance using the administration panel illustrated in figure 3.4. The panel in figure 3.5 gives the user access to administrating the connected database: adding keywords, descriptions and a license (used for publicly available assets), creating or deleting data collections inside the database, and uploading new data to the existing collections. To handle the public / private option inside DataGraft for the data collections, DataGraft needs access to user administration in ArangoDB. When the user adds the ArangoDB database instance, the credentials should allow user administration; by adding a standard public user with read-only rights to the database, access can be defined per database or even per data collection within a database.

Figure 3.3: Graph mapping in Grafterizer

3.2.3 Implementation details

The extension to the DataGraft platform was developed in NodeJS and JavaScript, both as a script that can be run locally and as a REST web application. The implementations are open source and available for download on GitHub⁸, ⁹. For the purpose of the first prototype of the implementation, two experimental instances of the ArangoDB database have been deployed with two different configurations. The first configuration uses a three-node in-memory cluster. It has been used for initial experimentation with lower-volume data, which has been sharded over the three instances. The second deployment uses a single node with the RocksDB engine. It has lower memory requirements, which is important considering the data volumes in the big data domain. Both instances of the database were deployed using the Docker-based deployment option of ArangoDB.

Figure 3.4: DataGraft connection administration

Figure 3.5: DataGraft ArangoDB database administration

Figure 3.6 shows the components and data flow of the local script. To use the local script version, the user needs to access the DataGraft platform and open the Grafterizer tool. In the Grafterizer tool, the user defines the RDF graph mapping and the connections between the CSV data and the graph representation. When ready, the user downloads the prepared CSV and the JSON file describing the transformation and vocabularies used in the mapping, and saves them to disk. The user then has to manually invoke the local transform script, providing it with the two input files (the JSON and CSV files). When the script finishes processing the data, it outputs two JSON files, which can then be imported into ArangoDB: 1) the JSON file containing the node values, and 2) the JSON file containing the edges that connect the nodes. These two files are stored to disk, and the user has to manually import them into the database.

Figure 3.6: The components and data flow used in the implementation of a local script

8https://github.com/datagraft/Datagraft-RDF-to-Arango-DB/tree/master/

9https://github.com/datagraft/Datagraft-RDF-to-Arango-DB/tree/REST/

The initial implementation used the direct approach to represent the RDF data within ArangoDB, but when we ran our first transformation test, the resulting output was up to 10 times the size of our initial CSV file. This is largely because the JSON format is considerably more verbose than CSV, repeating the attribute names for all values within each JSON object. These initial results led to rethinking how the representation could be done, and how an implementation could benefit from the multi-model data structure. The result was the development of the proposed flattened representation of RDF data, with the defined criteria (described in sub-section 3.1.3) for mapping RDF into a multi-model data structure.

For the implementation the following packages and versions were used:

• NodeJS¹⁰ version 9.11.1 as the base framework for the JavaScript implementation handling input, transformation and output of the data.

• Papaparse¹¹ as a library to handle and convert CSV data into JSON data usable in JavaScript.

• minimist¹² to handle input arguments to the script.

10https://nodejs.org/en/

11https://www.papaparse.com

12https://github.com/substack/minimist

The developed script takes the three elements from Grafterizer as input (see figure 3.6). First, it stores the graph definition that is used to map the tabular data into the RDF flattened representation, then it stores the defined mapping of prefixes and URIs, before starting to read the CSV file. The CSV file is read as a stream and processed asynchronously to speed up the processing and prevent out-of-memory errors when working with large files. The JSON graph representation has attributes specifying the columns that should be used to get the correct value from the corresponding CSV file. Hence, before the processing and mapping can start, the first line of the CSV is mapped to a header object with each header as an attribute and the column offset as the corresponding value. After the initial mapping of columns, the script runs the mapping function for each line of the CSV file. Papaparse parses each line of CSV and creates an array with all values from that line.
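The header mapping described above can be sketched in a few lines. The function names `buildHeaderIndex` and `valueFor` are hypothetical, and the rows are assumed to be arrays of already parsed values, as a CSV parser would provide them.

```javascript
// Maps each header name to its column offset, e.g. { name: 0, city: 1 }.
function buildHeaderIndex(headerRow) {
  const index = Object.create(null); // no inherited properties, so any header name is safe
  headerRow.forEach((name, offset) => {
    index[name] = offset;
  });
  return index;
}

// Looks up the value of a named column in a parsed CSV row.
function valueFor(column, headerIndex, row) {
  const offset = headerIndex[column];
  if (offset === undefined) {
    throw new Error(`Unknown column: ${column}`);
  }
  return row[offset];
}

const headerIndex = buildHeaderIndex(['name', 'city', 'country']);
console.log(valueFor('city', headerIndex, ['Simen', 'Oslo', 'Norway']));
```

Building the index once means every subsequent row needs only a constant-time lookup per mapped column, which matters when the CSV is processed as a stream.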

The mapping function runs recursively through the JSON graph definition and builds new JSON objects for the output. Each new JSON object represents a URI node and contains the values of its literal nodes as attributes, where each attribute name is the predicate of the corresponding literal node. When the script encounters a column attribute, it looks the column name up in the mapped headings to fetch the correct value from the CSV line.

In the case that an encountered value is not a literal node but a URI node, a new JSON object is created for the encountered URI node, as well as a JSON object for the edge definition connecting the two URI nodes. If a prefix name is present, the URI is fetched based on the prefix and added to an RDF value within the JSON object. In this way, we ensure that the fully qualified RDF URI remains available in the document created for the multi-model database. The RDF URI is also used to set the key for the document. As ArangoDB restricts which special characters are allowed in keys, a simple hashing algorithm is applied (see figure 3.7) to hash the URI into a numeric value. This hash is added as a function to the string type, so it can be called by just appending ‘.hash()’ to the string.

Figure 3.7: String hash function, to hash URIs

When the mapping function ends, it appends the created JSON objects to the corresponding files. All URI node objects are written to one file, and all edge values connecting them are written to a second file. Figure 2.1 shows the corresponding flattened representation of the RDF structure shown in figure 3.9.
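The hashing of URIs into numeric keys mentioned above could be sketched as follows. The exact algorithm from figure 3.7 is not reproduced in this text, so the multiply-by-31 string hash below is a stand-in chosen for illustration, and `hashUri` is a hypothetical name. It is written as a standalone function rather than as an extension of the string type.

```javascript
// Stand-in string hash (assumption, not necessarily the algorithm from figure 3.7):
// folds the character codes into an unsigned 32-bit number and returns it as a
// string, since ArangoDB document keys are strings.
function hashUri(uri) {
  let hash = 0;
  for (let i = 0; i < uri.length; i++) {
    // Math.imul keeps the multiplication within 32-bit integer arithmetic.
    hash = (Math.imul(hash, 31) + uri.charCodeAt(i)) | 0;
  }
  return String(hash >>> 0); // >>> 0 converts to a non-negative value
}

console.log(hashUri('http://example.org/resource/1'));
```

A 32-bit hash of this kind is not collision-free, so two different URIs may in principle map to the same key; a production implementation would need to detect or avoid such collisions.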

The ArangoDB mapping has also been implemented as a web service that can be accessed from other components to produce ArangoDB nodes and edges and import them into an ArangoDB instance. Figure 3.8 shows the components and the data flow of the implemented web service. The user starts the process by accessing DataGraft and connecting the user account to an ArangoDB instance for user administration, data insertion and querying. The user then uses the Grafterizer tool to clean and prepare a CSV file to be mapped into an RDF representation of the data (see figure 3.3). Grafterizer outputs two files: 1) the prepared CSV data and 2) a JSON file with the transformation and vocabulary definitions from Grafterizer. The web service consumes these files and maps the data accordingly before outputting two JSON files, one with node values and one with edge values. Then the data is imported into ArangoDB.

Figure 3.8: The components and data flow used in the implementation of a web service

The web application is based on the same transformation script and needs the same set of input files, but in addition it has some extra parameters that need to be included when posting to the web application. The following inputs are needed for the web application:

• CSV – the CSV file used in the transformation from tabular data into the RDF flattened representation.

• mapping – a JSON graph mapping structure used to define how the graph should be built.

• vocabulary – a JSON mapping between prefixes and fully qualified URIs.

Figure 3.9: RDF flattened representation in JSON

The parameters described above are the same as in the local script, but the transformation is run on the server and returns two JSON files with the resulting representation of the transformation into a multi-model data structure. As this is a web service, it also implements the option to insert the transformed data directly into ArangoDB using the REST API provided. To use the REST API, some extra parameters must be given to the application:

• REST – a Boolean value true/false. If true, the application uses the REST API to insert data and the rest of the parameters below are required. If false or not present, the application will return two files.

• endpoint – The URL for the ArangoDB instance being used.

• db – The name of the database within the ArangoDB instance that the data should be inserted into.

• name – The name of the collection that the data should be inserted into. This name is used for both the normal collection and the edge collection, the latter using a suffix of “_edge” after the name.

• authToken – The authorization token obtained from ArangoDB which needs to be included in the REST calls to the database to get access.
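How the web application might interpret these parameters can be sketched as below. The function name `resolveInsertTarget` and the shape of the returned object are hypothetical, and the edge collection suffix is assumed to be `_edge`. The sketch only validates and normalizes the parameters; it does not contact a database.

```javascript
// Decides between returning files and inserting via REST, based on the posted parameters.
function resolveInsertTarget(params) {
  // Form fields often arrive as strings, so both true and 'true' are accepted (assumption).
  const useRest = params.REST === true || params.REST === 'true';
  if (!useRest) {
    return { mode: 'files' };
  }
  for (const field of ['endpoint', 'db', 'name', 'authToken']) {
    if (!params[field]) {
      throw new Error(`Missing required parameter: ${field}`);
    }
  }
  return {
    mode: 'insert',
    endpoint: params.endpoint,
    db: params.db,
    nodeCollection: params.name,
    edgeCollection: `${params.name}_edge`, // assumed suffix for the edge collection
    authToken: params.authToken,
  };
}

console.log(resolveInsertTarget({
  REST: 'true',
  endpoint: 'http://localhost:8529',
  db: 'datagraft',
  name: 'companies',
  authToken: 'example-token',
}));
```

Failing early on missing parameters keeps a partially specified request from starting a transformation whose result could not be inserted anyway.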

Figure 3.10: The different options to handle the results when using Grafterizer

In addition to the packages and frameworks used for the local script, the web application uses the following packages:

• express¹³ version 4.13.4 or newer – A package to handle web requests and serve the web application on top of NodeJS.

• cors¹⁴ version 2.7.1 or newer – A package to enable CORS (Cross-Origin Resource Sharing) for remote invocation and usage of the web application.

• body-parser¹⁵ version 1.15.1 or newer – A middleware to process JSON input.

• multer¹⁶ version 1.3.0 or newer – A package to handle multipart/form-data, used to handle file uploads.

• request¹⁷ version 2.69.0 or newer – A package to handle requests, used for the communication with the REST API of ArangoDB.

• readline¹⁸ version 1.3.0 or newer – used to stream and process the uploaded CSV file line by line.

13https://www.npmjs.com/package/express

14https://www.npmjs.com/package/cors

15https://www.npmjs.com/package/body-parser

16https://www.npmjs.com/package/multer

17https://www.npmjs.com/package/request

18https://www.npmjs.com/package/readline

The required input fields are used to handle the communication with the ArangoDB instance. When inserting data into the database, the service runs the transformation and, based on the endpoint, database name and collection name, tries to connect to the appropriate database collection. If the endpoint or database does not exist, an error is thrown indicating a bad gateway. If the endpoint and database exist, the application tries to fetch the collections with the input name and, if they do not exist, tries to create them. In the event that the application cannot create the two collections, a 500 error (server error) is thrown, stating that the collections could not be created. When the application has verified that the collections exist or can be created, the transformation function is called, starting the process of transforming the input data into the multi-model representation. Depending on the given parameters, an insert request to the database is either done in chunks or per transformed line of CSV.

Using the REST API removes the need to manually upload the results to the database, but it is inefficient: the files have to be sent over the Internet, and inserting one item at a time as larger files are processed puts a massive load on the database, which has to handle a large number of requests. This could, in turn, trigger blocking of the insert requests made by the web application. There are different ways to handle this issue, and one of them is processing the data in bulk. By defining a number of lines to process, or a number of objects that should be included in each insert request, the number of insert requests can be dramatically reduced. Another option is to use the built-in Foxxjs framework of ArangoDB and let the processing be handled in the database. This would prevent any denial of insert requests since everything is done internally rather than externally, but large datasets are again an issue. Unless the database instance has the needed processing power (e.g., it is deployed in a cluster that load balances and distributes the processing between the nodes), large files could block the database for other users while the transformation is running.
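The bulk option described above amounts to grouping transformed documents into batches before sending them. A minimal sketch, with the hypothetical helper name `chunk` and an arbitrary batch size, could look like this:

```javascript
// Splits a list of documents into batches of at most `size` items,
// so that each batch can be sent in a single insert request.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('Batch size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 1000 transformed documents sent 250 at a time need 4 requests instead of 1000.
const documents = Array.from({ length: 1000 }, (_, i) => ({ _key: String(i) }));
console.log(chunk(documents, 250).length);
```

The batch size is a trade-off: larger batches mean fewer requests but bigger request bodies, so a suitable value depends on document size and on the limits of the database deployment.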

DataGraft already provides a combination of ways to transform tabular data into RDF by allowing the user to do a direct transformation online for smaller data sets, and this is where the web application that was built would fit. Limiting the size of the dataset that can be inserted using the web application would decrease the number of issues related to insert requests.

The online processing provided by Grafterizer allows for both insertion into a database and download of the result as a file; again, this is made possible in the web application developed for transforming the data. By specifying the option “REST” in the web application call, the application either inserts or returns the resulting transformation.


Processing of larger datasets is not allowed online by Grafterizer, because it is a publicly available online tool. Instead, the user can download an executable JAR, based on the set of steps defined on a sample dataset, to run locally on larger files. Handling large datasets with a local script can be done in several ways; one alternative is to make a service that looks for files in a folder, processes them and outputs them to a second folder for import into the database. As a test for handling large files, we implemented a similar workflow in our test cluster. The original files we received were TSVs¹⁹, and therefore we needed an extra step converting TSVs into CSVs.

As a step to speed up the file processing, each file was split into smaller subsets of the original file. The service handling input and transformation then starts new processes to convert the input CSV files, scaling the number of processing units up and down based on the processing need. Finally, the transformed files are imported into the database.
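The extra TSV-to-CSV step mentioned above can be illustrated with a small line-based converter. This is a sketch under the assumption that the TSV fields contain no escaped tabs or embedded line breaks; the function name `tsvLineToCsv` is hypothetical and the actual conversion used in the test cluster is not described in the text.

```javascript
// Converts one tab-separated line into a comma-separated line,
// quoting fields that contain commas, quotes or line breaks.
function tsvLineToCsv(line) {
  return line
    .split('\t')
    .map((field) => (/[",\r\n]/.test(field) ? `"${field.replace(/"/g, '""')}"` : field))
    .join(',');
}

console.log(tsvLineToCsv('Oslo\tNorway\tcapital, largest city'));
```

Because the conversion works line by line, it can be applied to a file stream or to the smaller file subsets described above without loading a whole file into memory.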

19https://en.wikipedia.org/wiki/Tab-separated_values


Chapter 4

Modeling spacetime in ArangoDB

This chapter focuses on the implementation and modeling of spacetime data within the ArangoDB data model. Divided into two sections, it presents and discusses the approaches to using ArangoDB to store spacetime-modeled data.

In the first section, the focus is on comparing the spacetime model and the ArangoDB data model. This part investigates and describes the similarities of the two models and discusses how we can transform the spacetime model to fit the data model used in ArangoDB, i.e., how the ability to combine the three major NoSQL representations (document store, graph store, and key/value store) allows us to model spacetime in the ArangoDB data model.

In the second section, we present a reference implementation. We describe how the outputs for generating spacetime data are transformed to the ArangoDB data model. This part is implemented as a web service to accommodate the need for ad-hoc input and availability, since its usage is geared towards diagnosis systems and reporting. The key concept of spacetime modeling is to be able to investigate and discover new knowledge within connections between concepts, and to make discoveries that we normally would not see. The web service implements methods both for insertions, mapping data to the ArangoDB data format, and for retrievals, building stories for entries so that we can investigate and discover connections.


4.1 Representing the spacetime data model in the ArangoDB data model

After implementing a representation of RDF data in the multi-model database ArangoDB, we were interested in using the multi-model store for other semantic models in addition to RDF. This led us to the spacetime model introduced by Mark Burgess. These two semantic models are not that different in structure, but they do differ when it comes to the number of possible connections between nodes and duplicates. Furthermore, semantics within spacetime, in addition to describing relations, aim to describe the context and cause between relations. One example used by Burgess in [4] describes an event from the classic game Cluedo, where the players try to piece together why and how a certain event happened. The example uses the following event: “Professor Plum murders Miss Scarlet with a breadknife in the library, because she refused to marry him”. This is shown as a graph in figure 4.1. By breaking the statement into different parts that in themselves describe or reference an object or event, it is possible to apply the spacetime model between the concepts, whereby each node is connected through an edge describing the relation type (shown in table 2.1) and the context between them.

The format used to describe the relation and context between nodes is as follows: ”(from concept, STtype (see table 2.1), association, to concept, inverse association, context-note)”. As an example, this is how it looks when applied to the Cluedo example: ”(murder by breadknife, 2, may be caused by, Miss Scarlet refuses to marry Professor Plum, can cause, Cluedo)”. The spacetime model is a natural graph model, and as such it is natural to represent it in ArangoDB.
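To make the tuple format concrete, the following sketch parses one such entry into a structured object. The function name `parseAssociation` and the attribute names are hypothetical, and the sketch assumes that none of the six fields contains a comma.

```javascript
// Parses "(from concept, STtype, association, to concept, inverse association, context-note)".
function parseAssociation(text) {
  const inner = text.trim().replace(/^\(/, '').replace(/\)$/, '');
  const parts = inner.split(',').map((part) => part.trim());
  if (parts.length !== 6) {
    throw new Error(`Expected 6 fields, got ${parts.length}`);
  }
  const [from, stTypeText, association, to, inverseAssociation, context] = parts;
  const stType = Number(stTypeText);
  if (!Number.isInteger(stType)) {
    throw new Error(`STtype must be an integer, got "${stTypeText}"`);
  }
  return { from, stType, association, to, inverseAssociation, context };
}

console.log(parseAssociation(
  '(murder by breadknife, 2, may be caused by, Miss Scarlet refuses to marry Professor Plum, can cause, Cluedo)'
));
```

The parsed object keeps the forward and inverse associations side by side, which is the information needed later to decide how the relation is stored in the database.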

Considering the structure of the existing data formats and how this can be represented in a multi-model database, the three types of representation defined in section 3.1 become applicable for the spacetime data model. Each part of the input can be broken down into four components or nodes that connect to each other. These four components are: 1) the first concept; 2) the second concept; 3) the forward association from the first to the second concept, containing the STtype, the association and the context note; and 4) the backward association, containing the STtype, the inverse association and the context note.

Figure 4.1: Example spacetime data - Professor Plum murders Miss Scarlet with a breadknife in the library, because she refused to marry him

4.1.1 Direct representation

A direct representation of this model, using four nodes for each relation between concepts, is expressive but not as storage-efficient as for RDF. This way of representing the spacetime model does not make sense either, as components 3 and 4 are descriptions of the relation between concepts. Each concept should itself be a node, as this allows for fast retrieval when looking for a specific concept and its relations.

4.1.2 Flattened representation

A flattened representation can be implemented in two ways. In the first, each concept is modeled as a document, where the ”from” concept contains these attributes: from concept, association, STtype and context note. The ”to” concept would then contain the following attributes: to concept, inverse association, STtype and context note. These two documents would then be connected with an edge. The other way is to model each entry as a document, resulting in a document containing both concepts. Neither of these representations makes sense, and both are hard to use when looking for relations and reasons between concepts: either the connection between nodes is meaningless, or everything is a document and we would have duplications of concepts, making it hard to differentiate between incoming and outgoing associations.

4.1.3 Direct representation with edge values

The best way to represent the spacetime model and optimize for storage and retrieval is to use the direct representation with edge values. This allows us to keep the four components and the expressiveness of the data model. By storing each concept as a node and the associations on the edges, it is easy to construct and retrieve knowledge from the associations between concepts. There is no need for duplications of concepts, making it easy to investigate all outgoing and incoming associations of a given concept. Having the concept as the only thing stored in a node enables even better performance when doing lookups, because it allows adding an index on the concept attribute in the document collection.

By applying this direct approach with edge values, the Cluedo example shown in figure 4.1 would become as shown in figure 4.2.

Figure 4.2: Professor Plum murders Miss Scarlet with a breadknife in the library, because she refused to marry him – ArangoDB representation
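As a rough sketch of this representation, one spacetime entry can be turned into two concept documents and two edge documents, one for the forward and one for the backward association. The function name, the `concepts` collection name, the attribute names and the caller-supplied `keyFor` function are hypothetical; the reference implementation may structure its documents differently.

```javascript
// Maps one spacetime entry to two concept nodes and two association edges.
function spacetimeToDocuments(entry, keyFor, collection = 'concepts') {
  const fromKey = keyFor(entry.from);
  const toKey = keyFor(entry.to);
  return {
    nodes: [
      { _key: fromKey, concept: entry.from },
      { _key: toKey, concept: entry.to },
    ],
    edges: [
      {
        // Forward association: from the first concept to the second.
        _from: `${collection}/${fromKey}`,
        _to: `${collection}/${toKey}`,
        STtype: entry.stType,
        association: entry.association,
        context: entry.context,
      },
      {
        // Backward association: from the second concept to the first.
        _from: `${collection}/${toKey}`,
        _to: `${collection}/${fromKey}`,
        STtype: entry.stType,
        association: entry.inverseAssociation,
        context: entry.context,
      },
    ],
  };
}

// Example with a simple lookup table standing in for a real key function.
const keys = new Map();
const keyFor = (concept) => {
  if (!keys.has(concept)) keys.set(concept, String(keys.size + 1));
  return keys.get(concept);
};
console.log(JSON.stringify(spacetimeToDocuments({
  from: 'murder by breadknife',
  stType: 2,
  association: 'may be caused by',
  to: 'Miss Scarlet refuses to marry Professor Plum',
  inverseAssociation: 'can cause',
  context: 'Cluedo',
}, keyFor), null, 2));
```

Since the key function returns the same key for the same concept, repeated concepts across entries map onto one node, which is what avoids the duplication discussed for the flattened representation.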
