
Predictive Data Transformation Suggestions Using Machine

Learning

Generating Transformation Suggestions in Grafterizer Using the Random Forest Algorithm

Saliha Sajid

Thesis submitted for the degree of

Master in Informatics: Programming and Networks 60 credits

Department of Informatics

Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO


© 2019 Saliha Sajid

Predictive Data Transformation Suggestions Using Machine Learning

Generating Transformation Suggestions in Grafterizer Using the Random Forest Algorithm

http://www.duo.uio.no/


Abstract

Data preprocessing is a crucial step in data analysis. To prepare data for analysis, various actions may need to be performed on a given dataset, including formatting, modification, extraction and enrichment.

Data scientists and data analysts need to prepare their data before feeding it to Machine Learning (ML) algorithms to find useful insights.

The quality of the data used to train ML algorithms is one of the challenges faced by data scientists. Having a tool that facilitates data preparation intelligently can significantly reduce the time spent on creating a rich dataset for analysis.

Since there is potentially a large variety of data transformations that could be applied to a tabular dataset, it is typically more convenient for a user to use a system that recommends a subset of the transformations that are most relevant for the given dataset, taking into account possible user preferences. That subset of data transformations can be based on data about user interactions, user preferences, and the type of data in the dataset.

This thesis explores approaches for providing transformation suggestions which are of users' interest in the context of tabular data preprocessing. The proposed approach is realized in an application prototype which uses ML algorithms to provide the user with the recommendations best suited for her needs. It takes into account feedback from the user regarding the usability of recommendations, in addition to data collected from actions users have performed in the past, as deciding factors for predicting transformation suggestions. The thesis includes discussions of common transformation suggestions provided by state-of-the-art Extract-Transform-Load (ETL) tools and an evaluation of the reference application to infer transformations based on user interactions.

The application prototype was validated and tested in terms of effectiveness, efficiency and satisfaction to determine the usefulness of intelligently generated data transformation recommendations among potential users. The validation consisted of a usability study involving a practical test and a survey to evaluate the prototype and to identify limitations of the proposed approach. Based on this evaluation, we argue that a recommender system for data transformations is useful for making data preprocessing more efficient.


Contents

1 Introduction 1

1.1 Overall Context . . . 1

1.2 Thesis Motivation . . . 2

1.3 Research Questions . . . 3

1.4 Thesis Contributions . . . 4

1.5 Research Design . . . 5

1.6 Thesis Outline . . . 6

2 Background and Related Work 9

2.1 Recommender Systems . . . 9

2.1.1 Data Collection in Recommender Systems . . . 10

2.1.2 Recommendation Filtering Methods . . . 12

2.2 The Cold-Start Problem . . . 15

2.3 Overspecialization . . . 16

2.4 Overview of Relevant Related Approaches . . . 16

2.4.1 Tableau . . . 16

2.4.2 Talend . . . 17

2.4.3 Ideas in Microsoft Excel . . . 17

2.4.4 HoloClean . . . 17

2.4.5 Trifacta . . . 18

2.4.6 Other tools . . . 18

3 Problem Analysis and Proposed Approach 21

3.1 Tabular Data Transformations . . . 23

3.2 Machine Learning Classifier . . . 25

3.2.1 Decision Trees . . . 26

3.2.2 Random Forest . . . 27

3.2.3 Dataset Features . . . 29

3.3 Requirements and Success Criteria . . . 31

4 Architecture and Implementation 35

4.1 Architecture . . . 35

4.2 Implementation . . . 38

4.2.1 Tabular Data Transformation Functions . . . 40


4.2.3 Regeneration of Test Data . . . 46

5 Evaluation 49

5.1 Validation of Requirements . . . 50

5.2 Usability Testing . . . 51

5.2.1 Practical Test Setup . . . 51

5.2.2 Usability Survey . . . 52

5.2.3 Results . . . 54

5.3 Discussion . . . 55

6 Conclusion 58

6.1 Thesis Summary . . . 59

6.2 Evaluation of Performed Work in Accordance with the Requirements . . . 59

6.3 Future Work . . . 61

Bibliography 62

Appendix A Evaluation Survey 65


List of Figures

2.1 An illustration of collaborative filtering. . . 13

2.2 An illustration of content-based filtering. . . 14

3.1 A typical data analysis life cycle. . . 22

3.2 An example of decision tree. . . 27

3.3 An illustration of Random Forest classifier. . . 28

3.4 An illustration of the suggested approach. . . 32

4.1 Architecture of prototype for tabular data transformation suggestions. . . 36

4.2 Generation of multiple data transformation suggestions. . 37

4.3 Update training data. . . 38

4.4 Screenshot of application prototype. . . 39

4.5 Procedure for filtering irrelevant transformations. . . 44

5.1 A five-level Likert scale. . . 53


Listings

4.1 An example of feature object . . . 42

4.2 An example check for irrelevant transformation. . . 45

4.3 An example check for replacing null transformation. . . 45

4.4 A transformation key/value pair . . . 46

4.5 Data to be added to training dataset . . . 47


List of Tables

3.1 Table-based transformations . . . 23

3.2 Format-based transformations . . . 24

3.3 String transformations . . . 25

3.4 Features of given dataset . . . 30

3.5 Features of selected data item . . . 31

4.1 Implemented data transformations . . . 41

5.1 Validation of requirements . . . 51

5.2 SUS survey scores per question and total SUS score . . . . 55


Acknowledgements

I would like to thank everyone who provided me guidance for writing my thesis. First, I would like to thank my main supervisor Dumitru Roman and co-supervisors Bjørn Marius von Zernichow and Ahmet Soylu for making this thesis possible by steering me in the right direction, answering my questions and providing support throughout the process.

I truly appreciate everyone at SINTEF Digital for helping me with brainstorming and providing me access to data for the purpose of this thesis.

A special thanks to everyone who helped me in evaluating the performed work by participating in the usability study and providing valuable feedback.

Last but not least, a great appreciation goes to my husband, Abdul Moeed Mohammad, for his patience and continued support throughout my work on this thesis.


Chapter 1

Introduction

1.1 Overall Context

With increasing amounts of data being generated every day, organizations rely on good data quality to make sure that data analysis is done accurately and without any predispositions. Although it is ideally preferred to perform analysis on data which is clean and does not contain any irrelevant values, in real life that kind of data is seldom available. To perform analysis on data, data scientists need to preprocess it to ensure that it is in the right format and conforms to a certain set of rules. A series of interviews with data professionals revealed that a majority of their time is spent on time-consuming data transformation tasks in the data preprocessing phase [17].

Data preprocessing is an important step in the data analysis and data mining process. It involves the preparation and transformation of raw data to make it ready for analysis. In the real world, data comes in different formats; it is often incomplete, noisy, inconsistent, and likely to contain errors. Data preprocessing is a way of resolving such issues to improve the ease and efficiency of the analysis process. It includes actions such as cleaning, transforming and narrowing down data to the most important variables.

A great deal of research has been done on improving data quality before analysis, and diverse strategies and tools have been proposed [6, 8, 13–15, 23, 30], both commercial and non-commercial. Applications with highly interactive user interfaces let users choose from a number of transformation operations, helping them transform their data with ease. These applications range from providing minimal assistance to the user to recommending the transformations most relevant to a dataset.

Despite considerable research, there are still some areas which can potentially be explored, including intelligent approaches to data preprocessing using Machine Learning (ML) and the applicability of solutions for intelligent, automated data cleaning and transformation.

This thesis focuses on the prediction of relevant tabular data transformation suggestions using historical data gathered through user interactions.

1.2 Thesis Motivation

Raw data comes in different formats and types and is highly susceptible to noise and inconsistency. It may require rectifying inconsistencies and converting to the right format before it can be used to extract meaningful insights.

Data preprocessing is one of the most time-consuming steps in a data analysis process [17]. Poor data quality consequently affects the analysis performed on the data, and data transformation, as part of preprocessing, plays a critical role in ensuring data quality before analysis [10]. It is a domain-specific problem that focuses on the statistical properties, semantics and structure of data, and usually domain experts have the knowledge required to apply the right transformations. Several commercial tools and frameworks exist for data preprocessing, offering a large number of data cleaning and transformation actions. In addition to the commercial tools, the most common frameworks and languages for data analysis such as Pandas1, scikit-learn2, and R3 also come with several useful methods for preprocessing tasks. However, it can be challenging for data scientists to choose from a large number of transformations and view the changes made to the dataset at the same time. This time- and cost-consuming process could be made more efficient by applying an intelligent approach to the data transformation problem and automatically recommending relevant data cleaning and transformation actions to the user in an interactive interface.

Recommender systems [24], a subclass of information filtering systems, predict items that best meet the preferences of users and are widely used to recommend content such as movies, products and services. An interactive recommender system that suggests data transformations by analyzing the user's preferences can reduce the time and cost of data preprocessing. A lot of research has been done on the applications of ML to facilitate better decision making. After ML models are trained on historical data, they are able to adapt independently when exposed to new data. According to Gartner's Market Guide for Data Preparation Tools4, data preparation is becoming a key feature in data analysis. The availability of ML capabilities is now considered a requirement in data preparation tools, and there is a need for ready-to-use ML functions which ease the process of data preparation both for experts and non-experts.

1https://pandas.pydata.org

2https://scikit-learn.org

3https://www.r-project.org

4https://www.gartner.com/en/documents/3906957

When doing data preprocessing, there is often a limited number of transformations which can be performed on a particular type of data; for instance, a column containing dates often needs to be transformed into the right format, and unnecessary special characters need to be removed from some text. The use of ML in data transformation is thus a relevant choice, since ML models have the ability to learn from previous actions to produce reliable decisions. When a user repeatedly performs the same transformation on a particular type of data, an ML algorithm can predict that transformation the next time the user wants to transform a similar type of data.

Classification algorithms in ML, which classify a set of data features into different categories based on their characteristics, are useful when we want to recommend the most appropriate action to the user from a set of available actions. Random Forest [3], an ensemble learning method for classification, constructs a number of decision trees and provides output by aggregating the predictions of the ensemble.
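As a minimal sketch of this voting behavior, a Random Forest can be trained on labelled examples and queried for a prediction using scikit-learn. The features and transformation labels below are hypothetical illustrations, not the feature set used in the thesis prototype:

```python
# Hypothetical sketch: predicting a transformation label from two simple
# column features (a data-type code and a null ratio) with Random Forest.
from sklearn.ensemble import RandomForestClassifier

# Each row: [data_type_code, null_ratio]; labels are transformation names.
X = [[0, 0.0], [0, 0.3], [1, 0.0], [1, 0.5], [0, 0.4], [1, 0.6]]
y = ["trim", "replace_null", "format_date", "replace_null",
     "replace_null", "replace_null"]

# An ensemble of 100 decision trees; each tree votes and the majority wins.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Query the ensemble for an unseen feature vector.
print(clf.predict([[1, 0.7]])[0])
```

Each tree is trained on a bootstrap sample of the labelled interactions, and the aggregated vote determines which transformation is suggested.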

Currently there is no systematic investigation of how ML, in particular Random Forest, could be used to generate relevant data transformation suggestions. The ability of the Random Forest algorithm to formulate prediction rules from the characteristics of training data motivates the work in this thesis to investigate its benefits for the problem at hand.

This thesis explores the usefulness of data transformation suggestions in the context of Grafterizer [29] – a tabular data transformation and Linked Data generation tool, developed as part of the DataGraft platform [25, 26]5. Grafterizer currently supports data cleaning and transformation in addition to linked data generation, but does not yet offer generation of data transformation suggestions based on user interactions. Grafterizer thus served as a motivation for implementing the proposed approach and evaluating it in the context of a concrete relevant system.

1.3 Research Questions

The scope of the thesis is to analyze common data transformations, create user interaction-based data transformation recommendations using the Random Forest algorithm, and implement them in a concrete system (Grafterizer). The research questions addressed within the scope of this thesis are:

5https://datagraft.io

• How can relevant data transformation suggestions be provided based on user interactions?

• How can the Random Forest algorithm be used to suggest data transformations?

• How useful are data transformation suggestions generated using machine learning?

1.4 Thesis Contributions

This thesis contributes to exploring the usefulness of data transformation suggestions based on user interactions, and proposes an approach for the generation of relevant data transformation suggestions using the Random Forest algorithm.

The implementation of the application prototype reuses Grafterizer's graphical user interface (GUI) proposed in [31]. The application prototype provides the user with the most relevant data transformation suggestions and lets the user transform data by choosing one of the suggested transformations. It also provides a history of the transformations previously applied to the data and some basic statistics about the dataset.

Summary of Thesis Contributions

The contributions of this thesis include:

• Analysis and discussion of tabular data transformations and evaluation of relevant existing tools and frameworks for tabular data transformations.

• Analysis and discussion on the use of ML for generating tabular data transformations based on user interactions.

• An approach for user interaction based generation of tabular data transformation suggestions using the ML Random Forest algorithm.

• Implementation of the approach as part of the Grafterizer framework for data preparation.

• Evaluation of the implemented work in terms of a usability study and discussion of the results.


1.5 Research Design

The research process can be performed in two different forms:

Basic research is research for the purpose of obtaining new knowledge.

Applied research is research seeking solutions to practical problems [28].

Technology research, as opposed to the classical research used more often in basic research, is used in applied research; it aims at finding a solution to a problem by creating a new artefact or improving an existing one.

The main steps of technology research include [28]:

1. Problem analysis: This step revolves around the analysis of potential problems and the creation of a list of requirements for the new artefact. The researcher then formulates hypotheses based on the analysis and makes predictions about the outcomes if the hypotheses are correct.

2. Innovation: The researcher creates an artefact aimed at satisfying the list of requirements from step 1 and supporting the hypotheses.

3. Evaluation: Evaluation of the developed artefact against the predictions, to check whether the predictions hold and support the hypotheses.

Basic and applied research can be further divided into three types:

1. Qualitative: Involving the use of qualitative data such as interviews and observations, typically collected in the participant's setting, to get a detailed picture of actual user experiences.

2. Quantitative: Involving objective measurements and statistical analysis of data in a systematic way to analyze the relationships among variables [7].

3. Mixed: Involving both qualitative and quantitative methods.

This thesis represents two types of research methods:

• Applied research

• Qualitative research


The research design for this thesis started with specifying research questions, which influence the steps taken to conduct the research.

After defining the research questions, a literature review is done to learn about past studies in the research area and their conclusions. In addition, the literature review integrates and summarizes information from existing solutions to define the magnitude of the problem and to justify the topic for further research. To conduct applied research and realize a set of requirements needed to answer the research questions, a prototype application is created. The prototype serves as a means to gain insight into what potential users think of the suggested solution.

In addition, qualitative research methods are used for the evaluation of the performed work and for checking the validity of the research questions. As part of the qualitative research, participants evaluate the created artefact and respond to a usability survey afterwards. The survey is used to test the usefulness of the application prototype and to gain information about the reasoning and motivation behind the responses and the standpoint of the participants. The results gained from the analysis of user responses can help understand the usefulness of the suggested solution and open up further areas of research.

1.6 Thesis Outline

This section briefly describes the thesis structure and provides an overview of the contents of each chapter.

• Chapter 1 - Introduction introduces the overall context for the thesis and the motivation for research. It also provides the research questions and the research methodology used in this thesis.

• Chapter 2 - Background and Related Work describes the background information and concepts related to recommender systems. In addition, this chapter gives an overview of the related technologies and frameworks in the context of tabular data cleaning and transformation, in accordance with the scope of the thesis.

• Chapter 3 - Problem Analysis and Proposed Approach describes the common tabular data transformations and how these are used in this thesis. It also explains the ML classifier and the proposed approach, along with the requirements and success criteria.

• Chapter 4 - Architecture and Implementation describes the architecture of the application prototype and its different components. In addition, this chapter provides a walk-through of the implementation.


• Chapter 5 - Evaluation evaluates the application prototype in light of a usability survey and discusses the results.

• Chapter 6 - Conclusion summarizes the thesis by discussing the performed work in relation to the requirements and identifies directions for the future work.


Chapter 2

Background and Related Work

This chapter briefly describes the relevant background concepts. The first section gives a brief introduction to recommender systems and their types, and how they fit in the context of this thesis. The rest of the chapter reviews related work in the area of tabular data cleaning and transformation.

2.1 Recommender Systems

Many computer applications and websites in use today provide the user with personalized content. Most of these systems make use of a recommender system. A recommender system, a subclass of information filtering systems, aims at providing relevant recommendations to users for items or products that might interest them [20]. The recommendations created are personalized and often based on the analysis of previous interactions between users and items, because historical preferences and activities provide a good indication of the choices a user will make in the future [2]. Some examples of recommender systems include:

1. Suggesting TV shows and movies to users based on predictions from users' interests and past behavior.

2. Suggesting user items to buy on an online retailer website based on historical purchases.

The design of recommender systems is a complex process, with certain aspects that need to be considered. Below are some of these aspects as described in the recommender canvas [5]:

Goals define the motivation behind creating a recommender system and the milestones one wants to achieve.


Domain characteristics: The design of a recommender system depends on the application domain and the data collected about users' preferences.

Functional design considerations reflect the functional relationship between the user and the recommender algorithm. One needs to consider the algorithm to be used for generating recommendations and the most suitable way of receiving feedback from the user.

Technique selection. Different filtering methods for recommender systems serve different purposes. Technique selection means choosing the filtering algorithms to use when designing a recommender system, based on the goals and domain characteristics.

Interface design refers to the selection of a design which presents the recommendations in the most appropriate way.

Evaluation refers to the evaluation metrics that reflect the recommender system's goals; optimization should maximize the output while minimizing the cost.

While creating a recommender system for relevant data transformations, these design aspects were taken into consideration to provide the user with a tool that optimizes data preprocessing. In the following sections we describe in detail how feedback is collected from the user and which techniques are used to create the recommender system.

2.1.1 Data Collection in Recommender Systems

For providing relevant suggestions, a recommender system needs to learn about user preferences. Data collection in a recommender system includes gathering the data that is essential to create a profile of users' preferences. It may include all kinds of data about users' interactions, interests and behavior. Recommender systems thus depend on feedback provided by users, which can be either explicit or implicit, and use this information to infer users' interests. Feedback is provided either explicitly by each user using a quantifiable method, or implicitly, where the system infers the users' feedback in a way that does not directly reflect the interest of the users but can be associated with their actions.

Explicit Feedback

Explicit feedback refers to when a user explicitly provides feedback for an item in the system [2]. A recommender system implementing explicit feedback takes into consideration the input from the user about how much they liked or disliked an item, leading to improvement in its prediction model. Explicit feedback can be in the form of a score or rating; an example is rating purchased items on an e-commerce website. This kind of feedback is of high quality, accurate, and serves as an absolute measurement reference. However, at the same time explicit feedback is difficult to acquire [16], since not all users provide input for every item used. This lack of information leads to poor data quality and an inability of the recommender system to provide suggestions efficiently.

Implicit Feedback

Even though explicit feedback is a more convenient way of getting input from the user, it is not always easy to acquire. Implicit feedback, however, is acquired by the system itself by deriving information from the user's actions, rather than explicitly prompting the user to provide feedback [2]. It is not obvious and can be based on user interactions such as clicks or views. Types of implicit feedback include purchase history on an online retailer website and search patterns in a web browser. The following are some of the characteristics of implicit feedback [16]:

• It is easier to acquire

• It can be gained in abundance

• The numerical value of implicit feedback can be mapped to a degree of preference. For instance, if a user selects an item more often than another, this is interpreted as the user's preferred choice.

• Unlike explicit feedback, both good and bad input from the user is considered useful. The lack of feedback for a particular item is not necessarily interpreted as disadvantageous.

• It has low accuracy

• It has relative measurement reference
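The mapping from interaction frequency to degree of preference noted above can be sketched as a simple counter. The transformation names below are hypothetical, chosen only for illustration:

```python
# Hypothetical sketch: deriving implicit preference weights from how
# often a user selects each transformation in the interface.
from collections import Counter

interactions = ["trim", "replace_null", "trim", "format_date", "trim"]
counts = Counter(interactions)
total = sum(counts.values())

# Normalised selection frequency as a rough degree of preference.
preferences = {t: n / total for t, n in counts.items()}
print(preferences["trim"])  # 0.6 — the most frequently selected action
```

Such relative weights are never absolute judgments, which reflects the low accuracy and relative measurement reference of implicit feedback.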

This thesis makes use of implicit feedback to recommend relevant data transformation suggestions to the user. The feedback is gained by acquiring data about the user interactions made in the application. These interactions help understand the preferences of the user with respect to the selected data object and its characteristics.


2.1.2 Recommendation Filtering Methods

The basic models for recommender systems work with two kinds of data which are described below [2]:

1. The past interactions between users and items, such as ratings or buying behavior.

2. The attribute information about the users and items such as textual profiles or relevant keywords.

Most recommender systems generally use either of the two approaches mentioned above. Systems that use the former approach are referred to as collaborative filtering systems, whereas systems that use the latter are referred to as content-based filtering systems1.

Collaborative Filtering

Collaborative filtering systems recommend items based on the opinions of other people [27]. These systems work by collecting and analyzing a large amount of information to create profiles of users' behaviors and preferences, and to compute similarity measures between users and/or items. Recommendations are then provided by predicting what the current user will like based on his/her similarity to other users' preference profiles.

In recommender systems implementing the collaborative filtering method, a user can get recommendations for items which have been received positively by similar users, even though he/she has not provided any feedback for those items. An illustration of the collaborative filtering technique is shown in Figure 2.1.

1https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada



Figure 2.1: An illustration of collaborative filtering.

In the illustration above, item A and item B are initially liked by User 1 and User 2 respectively. If the users' preference profiles are similar, the items liked by User 1 will be recommended to User 2, and vice versa.
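The similarity-then-recommend mechanism can be sketched in a few lines. The users, items and like-vectors here are made up for illustration and do not come from the thesis:

```python
# Hypothetical sketch of collaborative filtering: recommend to a user
# the items liked by the most similar other user.
from math import sqrt

# 1 = liked, 0 = no feedback; columns are items A, B, C, D (made-up data).
likes = {
    "user1": [1, 0, 1, 0],
    "user2": [1, 1, 1, 0],
    "user3": [0, 0, 0, 1],
}

def cosine(u, v):
    """Cosine similarity between two binary preference vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(user):
    # Find the most similar other user...
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: cosine(likes[user], likes[u]))
    # ...and suggest the items that neighbour liked but this user has not.
    return [i for i, (mine, theirs) in
            enumerate(zip(likes[user], likes[nearest])) if theirs and not mine]

print(recommend("user1"))  # [1] — user2 is most similar and liked item B
```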

Content-based Filtering

Content-based filtering provides recommendations similar to what a user has liked in the past, by finding correlations between his/her preferences and the attributes of the available items [19]. A profile is created for each item to describe its inherent characteristics. The profiles help recommender systems associate user actions with matching items.

Content-based recommender systems make use of two types of data:

• The attributes of the items available in the system. For instance, in case of tabular data the most common attributes would be the domain of data, the size, data dimensions, number of observations and number of columns.

• A profile based on the preferences of a user. The profile develops and becomes more precise over time and includes either implicit or explicit feedback from the user.

An illustration of a content-based recommender system is shown in Figure 2.2. Such systems provide recommendations based on the current user's preferences. In the illustration below, when the user likes item A, another item (item B) similar to item A is recommended to the user.


Figure 2.2: An illustration of content-based filtering.

The scope of this thesis includes creating a data transformation recommendation system for a single user. Therefore, we choose to implement a content-based recommender system because of its ability to recommend items which are similar to the ones the user has liked in the past. When there are a number of transformations to choose from, there is a need for filtering and efficiently choosing the most relevant transformations. By recommending transformations to users based on the characteristics of their dataset, with their chosen transformations as implicit feedback, it is possible to reduce the time spent on data cleaning and preprocessing, in turn letting users extract meaningful insights from their data.

To create relevant data transformation suggestions for this thesis, we collect the attributes of the selected data along with the metadata of the entire dataset, and label each set with a transformation. These labeled sets are then used as training data to create a classification model for the current user. The generated classification model can be used to predict transformations for data for which the user's behavior is unknown.
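The construction of one labeled training example can be sketched as follows. The feature names (`column_type`, `null_ratio`, `num_rows`, `num_columns`) and the helper function are hypothetical placeholders, not the prototype's actual schema:

```python
# Hypothetical sketch: each training row pairs the features of the
# selected data (plus dataset metadata) with the transformation the
# user applied, forming one labeled example for the classifier.
training_data = []

def record_interaction(selected_features, dataset_metadata, transformation):
    """Label one feature set with the user's chosen transformation."""
    row = {**selected_features, **dataset_metadata, "label": transformation}
    training_data.append(row)

record_interaction(
    {"column_type": "string", "null_ratio": 0.2},   # selected-data features
    {"num_rows": 1000, "num_columns": 12},          # dataset metadata
    "replace_null",                                 # the label
)
print(training_data[0]["label"])  # replace_null
```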


Strengths and Weaknesses of Chosen Approach

Content-based recommender systems have the advantage of being able to make recommendations for new items, even when sufficient data about the item is unavailable [19]. This is because items with similar attributes enable the system to provide the most appropriate recommendation even when there is a lack of historical data for the new item.

Recommender systems, in general, have some weaknesses as well. Some of those are described in detail in Sections 2.2 and 2.3, in addition to the ways those weaknesses are addressed in the context of this thesis.

2.2 The Cold-Start Problem

The quality of a recommender system depends on its historical dataset, and a large dataset can help predict recommendations more precisely. This leads us to one of the major problems in recommender systems: the cold-start problem [19]. It concerns the issue that the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information2. There are generally two kinds of cold-start problems:

New user: Refers to a new user being registered into the system without having provided any preferences yet, making it difficult for the recommender system to provide personalized recommendations.

New item: Refers to the addition of a new item to the system which does not yet have a history of being used by a certain group of users.

Content-based recommender systems make predictions by using data about the descriptive attributes of an item. The cold-start problem in such systems arises when creating personalized recommendations for users with little or no data available about their preferences. It is difficult for recommender systems to provide recommendations based on scarce historical data about user interactions and the popularity of a certain item.

In the context of this thesis, the cold-start problem is addressed by creating some data about user preferences beforehand and providing it to the ML algorithm as training data. As the user performs data preprocessing and makes use of more transformations, new data about user preferences is appended to the existing training data.
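The cold-start mitigation above can be sketched as follows. This is an illustrative sketch, not the prototype's actual code; all field names (`data_type`, `has_nulls`, `transformation`) are hypothetical stand-ins for the real features.

```python
# Sketch of the cold-start mitigation: the system starts with a small
# hand-made set of labeled examples and appends a new example every time
# the user applies a transformation. Field names are illustrative.

seed_training_data = [
    {"data_type": "string", "has_nulls": True,  "transformation": "fill"},
    {"data_type": "string", "has_nulls": False, "transformation": "count_by_group"},
    {"data_type": "number", "has_nulls": False, "transformation": "normalize"},
]

def record_interaction(training_data, features, applied_transformation):
    """Append the features of the selected data together with the
    transformation the user actually applied, growing the training set."""
    example = dict(features)
    example["transformation"] = applied_transformation
    training_data.append(example)
    return training_data

# The user selects a date column and applies a date-formatting transformation:
record_interaction(seed_training_data,
                   {"data_type": "date", "has_nulls": False},
                   "format_date")
```

Over time the appended interactions dominate the hand-made seed, so the model gradually reflects the actual user's preferences.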

2https://en.wikipedia.org/wiki/Cold_start_(computing)


2.3 Overspecialization

Unlike collaborative filtering, content-based filtering systems can recommend items which have not been given feedback on by any user.

A drawback of content-based filtering is that it suffers from overspecialization by recommending items that are similar to those the user has seen so far [19]. In addition to relevance, it is often preferred to have a certain amount of novelty, diversity and serendipity in the recommendations to maximize the potential of a recommendation system [2]. Novelty of a recommendation refers to how different a recommendation is from what the user has previously seen. Diversity refers to how different the recommendations are compared to each other, and serendipity implies that the user gets an unexpected recommendation that they otherwise would not have found.

In the application prototype developed for this thesis we introduce diversity by dividing the available transformations into groups of table-based, format-based and string-based transformations and recommending the most relevant transformation from each group, giving the user the option to choose from a diverse set of actions.
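The grouping mechanism can be sketched as below. The group contents and the trivial ranking function are illustrative assumptions; in the prototype the ranking would come from the trained classifiers.

```python
# Illustrative sketch (not the actual prototype code) of the diversity
# mechanism: one recommendation is drawn from each transformation group,
# so the user always sees a mix of table-, format- and string-based actions.

TRANSFORMATION_GROUPS = {
    "table":  ["drop_row", "rename_column", "delete_column"],
    "format": ["normalize", "format_date", "to_uppercase"],
    "string": ["split_column", "replace", "fill"],
}

def diverse_suggestions(rank_fn):
    """rank_fn scores a transformation for the current selection;
    the highest-scoring one from each group is suggested."""
    return {group: max(options, key=rank_fn)
            for group, options in TRANSFORMATION_GROUPS.items()}

# A toy ranking function standing in for the trained classifiers:
suggestions = diverse_suggestions(len)  # prefer the longest name, for demo
```

Because every group contributes exactly one suggestion, the result is diverse by construction regardless of how the per-group ranking is computed.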

For the purpose of serendipity, the transformation suggestions could be randomly generated from time to time to catch the user's interest.

It would be possible to have serendipity and novelty in the system if a large number of transformations were available, but for the scope of this thesis, our prototype is limited to only a few.

2.4 Overview of Relevant Related Approaches

Significant research has been done on data transformation, and there are several commercial tools and libraries which can be used to transform data both programmatically and with the help of Graphical User Interfaces (GUIs). In this section we will take a look at some of the available tools.

2.4.1 Tableau

Tableau3 is an interactive data analysis and visualisation platform that lets users view data in an understandable format and helps generate customized dashboards.

Tableau's data preparation and preprocessing tool Tableau Prep4 works by selecting data in one of its many supported formats and applying transformations to it, for example, filter, rename and join

3https://www.tableau.com

4https://www.tableau.com/products/prep


to clean and shape data before analysis. In addition to identifying problems in the data, it analyzes the given data to recommend transformations that may be of interest to the user and which could be applied automatically to make data preprocessing quicker. While the platform has a steep learning curve, it provides visual data profiling and a graphical list of the steps taken to transform data in the form of a flowchart.

2.4.2 Talend

Talend Data Preparation5 is a data preparation tool that comes with a user-friendly interface to transform data before analysis. It lets users filter, modify and enrich data by intelligently providing transformations based on the data selected. However, it is not clear whether Talend uses ML to provide customized transformation suggestions to the users.

2.4.3 Ideas in Microsoft Excel

Ideas6 for the spreadsheet application Microsoft Excel7 is a feature which helps the user understand data through visual summaries in the form of charts and statistical patterns. It comes with an interactive interface and provides suggestions tailored to the task being performed, but has some limitations. It works best with clean tabular data with correctly defined headers. Although it provides interactive analysis and a statistical summary of data, it does not provide transformation recommendations for data preprocessing.

2.4.4 HoloClean

HoloClean8 is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised ML system, HoloClean makes use of data integrity constraints, quantitative statistics, value correlations and external reference data to build a probabilistic model for the data cleaning task. HoloClean allows data experts to effectively communicate their domain knowledge, saving time and enabling accurate analysis, predictions, and insights from noisy, incomplete, and erroneous data.

While HoloClean is a holistic data cleaning framework that provides information about incorrect data values and conflicting tuples, it suffers from the lack of a convenient user interface.

5https://www.talend.com/products/data-preparation

6https://support.office.com/en-us/article/ideas-in-excel-3223aab8-f543%2D4fda-85ed-76bb0295ffc4?ui=en-US&rs=en-US&ad=US

7https://products.office.com/en/excel

8http://www.holoclean.io


2.4.5 Trifacta

The data transformation tool most closely related to the research performed in this thesis is Trifacta Wrangler9. Trifacta Wrangler is a powerful tool which comes with many data manipulation functionalities, including restructuring, cleaning, enrichment and distillation of data. A predictive model computes a ranked set of suggestions in the form of suggestion cards based on the user's selection and historical data, in an attempt to interpret the data transformation intent of the selection [13]. Before applying a certain transformation, Trifacta lets users preview these suggestions to see which suits best, in addition to providing the user the ability to modify a particular suggestion. While Trifacta provides transformations through predictive interactions, this thesis bases itself on creating a diverse set of transformation suggestions for the user based on historical data for Grafterizer.

2.4.6 Other tools

In addition to the tools and frameworks mentioned above, other commercial and non-commercial tools were also reviewed. We provide a short description of each of these in the following.

OpenRefine10 is an open-source tool for cleaning and transforming data and extending it with web services and external data. It lets users apply basic and advanced transformations to large datasets, including normalization of numerical data and filtering of text.

NADEEF [8] is a generalized data cleaning system which relies on rules to clean data. It allows the users to specify different data quality rules, including functional dependencies, for the given dataset. These rules are then used to find any violations in the given data and repair them, while at the same time interacting with domain experts through the data quality dashboard to achieve higher-quality results.

KATARA [6] is a data cleaning system powered by a knowledge base and human help. It can be used to clean various datasets by providing a table as input, and a knowledge base to interpret table semantics.

It also identifies incorrect data and the possible repairs for it. It then utilizes human help to disambiguate the table semantics and annotate data.

Despite substantial research in data preprocessing and the use of ML in that context, the above-mentioned tools come with limitations.

Though some of the tools come with interactive user interfaces, this thesis bases itself on providing a diverse set of transformations to the user based on historical data. HoloClean uses ML to identify and correct anomalies in the given dataset but does not come with an interactive

9https://www.trifacta.com

10http://openrefine.org


user interface to visualize the transformations performed on the dataset.

OpenRefine, NADEEF, and KATARA are data cleaning tools which rely on human input to clean data for analysis. Though Trifacta comes with a user-friendly interface providing data transformation suggestions based on user interactions, it is unclear whether it uses the Random Forest algorithm to generate them.

The analysis of related approaches for data transformation above paves the way to defining the scope of the thesis. The application prototype should provide data transformation suggestions based on user interactions using the Random Forest algorithm and have a graphical user interface that lets the user see the transformed dataset right away, in addition to the actions performed on the input dataset.


Chapter 3

Problem Analysis and Proposed Approach

Data analysis is the key action for finding useful insights from data.

Raw data can be acquired from a number of sources ranging from users' online activity to sensor data. Its analysis paves the way for better decision making and improvement of services. However, data is not always available in the format best suited for analysis. For discovering hidden patterns in data using data mining methods and, eventually, analysis, raw data needs to be made ready. It needs to be cleaned, integrated with other datasets, reduced by getting rid of redundant attributes, and transformed [11]. Data transformation is the process of changing the format or structure of data from one form to another. It may be done for reasons such as:

• Extracting meaningful knowledge

• Enrichment of dataset

• Fixing incorrect data to make it ready for analysis.

Data scientists and data analysts need to prepare their data before running machine learning algorithms on it to uncover useful insights.

Having a tool that automates data preparation can significantly reduce the time consumed in creating a rich dataset for analysis. Data transformation is a part of data preprocessing in the analysis life cycle, which typically spans from the collection of raw data to communicating the meaningful insights derived from it. Figure 3.1 below shows an illustration of the steps involved in data analysis.


[Figure: a cycle of five steps — Data Collection, Data Profiling, Data Cleaning and Transformation, Data Analysis, Data Visualisation.]

Figure 3.1: A typical data analysis life cycle.

Data preprocessing, which includes data cleaning and transformation, involves a wide range of activities that can be performed on a dataset to improve quality and facilitate analysis. Those activities may include the following:

• Aggregation, where aggregate functions like average and standard deviation are applied to the dataset.

• Normalization, where values in data are rescaled to a certain range.

• Conversion and standardization of formats, where different variables are converted into a common format.

• Generalization of data, where data is abstracted from a low conceptual level to a higher conceptual level.
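As a worked example of one of the activities above, min-max normalization rescales a column of numeric values to the range [0, 1]. This is a generic sketch, not code from the prototype.

```python
# Min-max normalization: rescale values so the smallest becomes 0.0
# and the largest becomes 1.0.

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:                 # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

result = min_max_normalize([10, 20, 30, 40])   # [0.0, 0.333..., 0.666..., 1.0]
```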

The goal of this thesis is to focus on some of the problems that exist in a given dataset and create an application that suggests transformations based on the type of data and historical information from user interactions. In the section below we describe the data transformations implemented in this thesis and their respective categories.


3.1 Tabular Data Transformations

For including diversity, as explained in Section 2.3, we divide the transformations into three groups. These transformation categories may not necessarily have disjoint goals with respect to data preprocessing but are created to provide the user with more options. The three groups of transformations implemented are:

• Table-based transformations

• Format-based transformations

• String-based transformations

The transformations chosen for the thesis are a small subset of the potentially large number of transformations that can be applied to a dataset. This subset has been taken from multiple sources [22, 29], including Trifacta, while taking into account the ones potentially needed to transform the test dataset. It contains basic and intuitive transformations that should not require any help from an expert. These transformations can be applied in a small number of steps, making it possible for the user to see the result when they are performed on a dataset.

Table-based transformations include transformations which take the dataset as input and transform it at the structural level. These transformations do not individually impact the data in rows and columns, but they change the formation of the dataset. They include some feature extraction methods, such as removing excess rows and columns which may be unnecessary for the analysis. In addition, they contain transformations for correcting the column names of the dataset. See Table 3.1 for a complete list of table-based transformations.

Scope  | Name                  | Description
Row    | Convert row to header | Convert the selected row to header
Row    | Drop row              | Drop the selected row
Row    | Keep row              | Keep the selected row
Column | Delete column         | Delete the selected column
Column | Keep column           | Keep the selected column
Column | Rename column         | Rename the selected column

Table 3.1: Table-based transformations
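The semantics of a few of these table-based transformations can be sketched with pandas. The prototype itself runs in the browser, so this is only an illustration of what the operations do, with made-up example data.

```python
import pandas as pd

# A raw table whose first row actually holds the column names.
df = pd.DataFrame([["name", "age"], ["Ada", 36], ["Alan", 41]])

# Convert row to header: promote the selected (first) row to column names.
header = df.iloc[0]
df = pd.DataFrame(df.iloc[1:].values, columns=header)

# Drop row: remove the selected row by position (here, the second row).
df = df.drop(df.index[1])

# Rename column: rename the selected column.
df = df.rename(columns={"age": "age_years"})
```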


Format-based transformations refer to transformations that change the format of the existing data object. These transformations change the format of each individual cell in the selected data object without having much impact on the structure of the entire dataset. One transformation that adds an additional column to the dataset is Count data by group, which adds a column containing the number of occurrences of the corresponding values in the source column. See Table 3.2 for a complete list of format-based transformations.

Scope  | Name                | Description
Column | Normalize           | Normalize numeric values
Column | Count data by group | Generate a new column with summed count of unique values of selected column
Column | Format date         | Convert date to the given format
Column | To uppercase        | Convert data to uppercase
Column | To lowercase        | Convert data to lowercase
Cell   | Round to nearest    | Round the selected number to nearest integer

Table 3.2: Format-based transformations
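Two of the format-based transformations can be illustrated with pandas; the data is made up and the code is a sketch of the operations' semantics, not the prototype's implementation.

```python
import pandas as pd

df = pd.DataFrame({"city": ["oslo", "bergen", "oslo"]})

# To uppercase: convert data in the selected column to uppercase.
df["city"] = df["city"].str.upper()

# Count data by group: add a column with the count of each unique value
# in the selected column.
df["city_count"] = df.groupby("city")["city"].transform("count")
```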

String-based transformations refer to transformations applied to strings or text in the selected data object. This group of transformations can be used on cells, columns and rows and includes methods such as filtering and modifications applied to the individual strings in the dataset. See Table 3.3 for a complete list of string-based transformations.


Scope  | Name                      | Description
Row    | Set row to null           | Set the whole selected row to null
Column | Set column to null        | Set the whole selected column to null
Column | Fill                      | Fill nulls in the column with the given input
Column | Split column              | Split column using custom separator
Cell   | Set cell to null          | Set the value of selected cell to null
Cell   | Extract                   | Extract values matching the selected text into a new column
Cell   | Replace                   | Replace values matching the selected text with the given input in the corresponding column
Cell   | Remove special characters | Remove special characters from the corresponding column

Table 3.3: String transformations
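Two of the string-based transformations can likewise be sketched with pandas on made-up data; the regular expression used for "remove special characters" is an assumption about what counts as special.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing!"]})

# Split column using a custom separator (here, the first space).
df[["first", "last"]] = df["name"].str.split(" ", n=1, expand=True)

# Remove special characters: keep only letters, digits and spaces.
df["last"] = df["last"].str.replace(r"[^0-9A-Za-z ]", "", regex=True)
```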

How these transformations are implemented is described in further detail in the next chapter.

3.2 Machine Learning Classifier

Machine Learning is a subset of Artificial Intelligence that enables systems to automatically learn and improve from experience without being explicitly programmed. Machine learning algorithms differ in their approach, but generally they can be divided into the following categories:

Supervised learning is the process of predicting the outcome of new input data by learning a function that maps example data to associated target responses.

Unsupervised Learning refers to drawing inferences from data without reference to known outcomes.

Semi-supervised learning falls between supervised and unsupervised learning and refers to training a model on data where some of the training examples have corresponding outcomes but others don't.

Reinforcement learning works by taking suitable actions to maximize the ultimate reward in a particular situation.


For recommending relevant data transformation suggestions, we train our machine learning model on historical data about user interactions, and using that model we predict a transformation for unseen data. For training the model we have a set of observations and their outcomes, which makes a supervised learning model the most suitable choice for our problem [12].

Supervised learning can be further classified into the following types of problems, depending on the type of desired output:

Classification: When the desired output is discrete, i.e., the output can be classified into classes.

Regression: When the desired output is continuous, i.e., numeric.

In the context of our thesis, since the outcome to be predicted is one of a limited number of discrete transformations, we refer to it as a classification problem [12].

Classification [1], as mentioned above, is the process of predicting a class for a new set of observations. Classes are often referred to as targets, categories or labels. Classification in machine learning is the process of identifying the most appropriate class for a new set of observations based on training data containing observations for which the classes or labels are known. In the context of tabular data, an observation is a row of the table. The observations for which the class is to be predicted are normally referred to as test data. For example, sorting emails into “spam” and “not spam” based on their characteristics is a common classification problem.

The characteristics or variables that an observation is comprised of are called features. A classifier uses training data to learn how given input features relate to a class. In the above example, emails marked as “spam” and “not spam” are used as training data. Training the classifier accurately can help identify a new email as “spam” or “not spam”. In the sections below we describe the Random Forest algorithm, a tree-based classifier chosen to create the data transformation suggestions for this thesis. We also give an overview of the decision tree, a decision support tool and the building block of the Random Forest algorithm.

3.2.1 Decision Trees

A decision tree [18] is a tree-like model used to represent decisions, with nodes representing a test on an attribute based on which the tree splits into branches, and edges representing the outcomes of the test.

The leaves of a decision tree represent the actual output or class label.

Decision trees effectively visualize options and trace the possible outcomes of each course of action on those options.


Decision trees are used to classify the observations by narrowing the decisions down from the root to a leaf node, with the leaf node serving as the category or label to a particular observation. This decision making process is recursive and is repeated for every subtree.

Consider the following example: the decision tree in Figure 3.2 defines Fill null values with given input as the most relevant transformation or class for a column with string data type having null values. However, if the column doesn't contain any null values, the user may want to get the number of unique entries in the selected column by choosing Count data by group.

[Figure: a decision tree with root node Column branching on data type (String, Integer, Boolean); on the String branch, the test "Includes null values" leads to "Fill null values with given input" if true and "Count data by group" if false.]

Figure 3.2: An example of a decision tree.

Decision trees learn the relationships between dataset features and formulate the decision-making process as a hierarchical structure. The advantage of using decision trees is that they are easy to interpret and able to assign a specific value to each outcome while requiring minimal data preparation from the user.
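The scenario of Figure 3.2 can be sketched with scikit-learn. The two-feature encoding (is the column a string column, does it contain nulls) and the labels are illustrative assumptions, not the prototype's real feature set.

```python
from sklearn.tree import DecisionTreeClassifier

# Each training example encodes a selected column as
# [is_string_column, has_null_values] and the transformation applied.
X = [
    [1, 1],   # string column with nulls
    [1, 0],   # string column without nulls
    [1, 1],
    [1, 0],
]
y = ["fill_nulls", "count_by_group", "fill_nulls", "count_by_group"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# For a string column containing nulls, the tree suggests filling them.
prediction = tree.predict([[1, 1]])[0]
```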

3.2.2 Random Forest

Random forest is a supervised learning algorithm which can be used for classification. It is an ensemble of decision tree predictors. A random forest classifier creates a set of n decision trees, where n is the number of estimators set when the algorithm is run, such that each tree is grown from a random sample drawn with the same distribution for all trees in the forest [3]. The results of the decision trees are then aggregated into one final decision that decides the class of the test object. An illustration of the Random Forest classifier is shown in Figure 3.3. In this example we can observe that n = 3 decision trees have been created for the given test instance and each decision tree has predicted a specific class. The final output selected by the Random Forest will be Class A, since it is in the majority, being the predicted output of 2 out of 3 trees.

[Figure: a test instance is passed to Tree-1, Tree-2 and Tree-3, which predict Class A, Class B and Class A respectively; majority voting selects Class A as the final class.]

Figure 3.3: An illustration of Random Forest classifier.

For creating data transformation suggestions in this thesis, random forest is used to predict the most relevant transformation for an unseen observation. We use a random forest instead of a single decision tree because a single decision tree may be prone to overfitting. Overfitting is a consequence of a machine learning model being trained on data so well that it fails to correctly predict the classes of unseen observations. An ensemble of several decision trees, in contrast to one, limits overfitting and error due to bias by adding randomness to the model [3]. It does so by searching for the best feature among a random subset of features. The accuracy of a random forest depends on the precision of the trees in the forest and their correlation.
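A minimal sketch of the ensemble with scikit-learn follows; the feature vectors and transformation labels are illustrative, not the prototype's real data. `n_estimators` trees are grown on random samples of the training data, and the suggested class is the majority vote over all trees.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy training data: [is_numeric_column, has_null_values] -> transformation.
X = [[0, 1], [0, 0], [1, 1], [1, 0]] * 5         # repeated to give the
y = ["fill", "count", "normalize", "round"] * 5  # bootstrap samples coverage

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# The forest aggregates the votes of its 10 trees into one suggestion.
suggestion = forest.predict([[0, 1]])[0]
```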

Machine Learning Versus Rule-Based System

As shown in Figure 3.2, a decision tree is a branching structure where each node describes whether to follow the left or right branch. Some


machine learning algorithms, including decision trees, may sometimes be perceived as rule-based, leading to the question whether rule-based systems can be used instead of machine learning algorithms. Rule-based systems contain rules encoded into the system in the form of if-then-else statements, which can be easy to state and understand. Such systems are deterministic, and not having the right rule may produce incorrect results. Although simple at the start, when containing only a few rules, these systems may get complex with the addition of more constraints, making it difficult to identify and remove errors. A rule-based system can be called efficient only if it contains all the combinations of rules and their target responses, specified accurately.

The efficiency of a machine learning algorithm, which is a decision tree in this context, however, depends on the amount and quality of training data representing all possible variations along with the target responses. The lack of good training data may lead to the algorithm performing poorly, especially in the case of rare inputs. Decision trees, unlike rule-based systems, are probabilistic and use statistical models. For a machine learning model, it is also important to select features that best represent the dataset to create a predictive model.

The resulting model is created by deriving the rules from the given historical data instead of describing them manually. The decision tree in our case is trained on historical outcomes in the form of dataset features and the appropriate transformation suggestion in order to determine future outcomes.

Decision trees and rule-based systems complement each other. Rules that describe the relationships between the inputs and target responses can be derived from a decision tree. A system consisting of a complex set of rules may be difficult to interpret as a decision tree.

However, a dataset consisting of a rich set of features may be difficult to map as a rule-based system.
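This complementarity can be seen directly in scikit-learn, which can dump a fitted decision tree as human-readable if/else rules via `export_text`. The two-feature encoding here is hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a tiny tree on illustrative data, then extract its rules as text.
X = [[1, 1], [1, 0]]
y = ["fill_nulls", "count_by_group"]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

rules = export_text(tree, feature_names=["is_string", "has_nulls"])
print(rules)   # an indented if/else listing over "has_nulls"
```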

3.2.3 Dataset Features

In machine learning, a feature is a measurable property or characteristic of the data being analyzed. Feature engineering is the process of identifying features which play an important role in the creation of a prediction model for a dataset. Feature engineering is important in machine learning as it has a direct impact on how good the final predictions will be [9]. A raw dataset cannot itself contribute to the learning phase, but quality features of the dataset provide value in creating better machine learning models.

Different datasets include different kinds of features and the quality of the features in the dataset plays a role in the insights one can get from it. For generating data transformation suggestions, we use two


groups of features of the given dataset to predict the most relevant transformation:

1. Features of the given dataset, i.e. metadata
2. Features of the data selected by the user

The former group of features, in the context of tabular data, contains the attributes of the dataset in general, including the number of variables and total observations. The latter group contains the properties of the data selected by the user at a certain instant. That would include the selected row, column or cell of the dataset and the properties of the data it contains. Tables 3.4 and 3.5 below show a detailed description of the dataset features required for generating relevant data transformations and the type of data they can be extracted from.

Feature                              | Scope
Number of attributes                 | Entire dataset
Number of observations               | Entire dataset
Percentage of missing values         | Entire dataset
Percentage of categorical attributes | Entire dataset
Percentage of numeric attributes     | Entire dataset
Percentage of boolean attributes     | Entire dataset
Percentage of date attributes        | Entire dataset
Percentage of null attributes        | Entire dataset

Table 3.4: Features of given dataset
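A few of the dataset-level features in Table 3.4 could be computed with pandas as sketched below. The prototype computes them in the browser, so this is only an illustration; the function name is ours.

```python
import pandas as pd

def dataset_features(df: pd.DataFrame) -> dict:
    """Compute a subset of the dataset-level features of Table 3.4."""
    n_cells = df.shape[0] * df.shape[1]
    return {
        "n_attributes": df.shape[1],
        "n_observations": df.shape[0],
        "pct_missing": 100 * df.isna().to_numpy().sum() / n_cells,
        "pct_numeric": 100 * df.select_dtypes("number").shape[1] / df.shape[1],
    }

df = pd.DataFrame({"name": ["Ada", None], "age": [36, 41]})
feats = dataset_features(df)
```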


Feature               | Description                                                                  | Scope
Data type             | Number, string, boolean, date                                                | Column, cell
Data selected         | Row, column or cell selected                                                 | Entire dataset
No. of numeric values | Number of numeric values in the selected data. Zero or more.                 | Row
No. of boolean values | Number of values with boolean data type in the selected data. Zero or more.  | Row
No. of string values  | Number of values with string data type in the selected data. Zero or more.   | Row
No. of date values    | Number of values with date data type in the selected data. Zero or more.     | Row
No. of categories     | Number of unique categories in the selected data. One or more.               | Column
No. of nulls          | Number of null values in the selected data. Zero or more.                    | Row, column
Special characters    | Check if any special characters exist in the selected data.                  | Cell

Table 3.5: Features of selected data item
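For a selected column, some of the selection-level features in Table 3.5 could be derived as sketched below; the feature names and the two-way type encoding are illustrative simplifications.

```python
import pandas as pd

def column_features(col: pd.Series) -> dict:
    """Compute a subset of the selection-level features of Table 3.5
    for a selected column."""
    return {
        "data_type": "number" if pd.api.types.is_numeric_dtype(col) else "string",
        "n_nulls": int(col.isna().sum()),
        "n_categories": int(col.nunique(dropna=True)),
    }

feats = column_features(pd.Series(["oslo", "bergen", None, "oslo"]))
```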

3.3 Requirements and Success Criteria

The approach we take for suggesting relevant data transformations to the user is to create a recommender system where we send the features of the selected data object as input to the Random Forest classifier.

The classifier then generates the most appropriate transformation by creating a set of decision trees and choosing the output that has the most votes. The suggested approach is illustrated in Figure 3.4:

[Figure: the recommender system takes dataset features and corresponding transformations as input to a Random Forest classifier, which outputs tabular data transformation suggestions.]

Figure 3.4: An illustration of the suggested approach.

As part of the research in this thesis, we create a prototype of the recommender system following the approach shown above. The functional and non-functional requirements for the data transformation recommender system are hereby stated in a way that makes it possible to evaluate the prototype after development.

Functional Requirements

R1. Provide the user with the possibility to transform data.

R2. Suggest transformations based on user interactions and historical data using the Random Forest algorithm.

R3. Provide an interactive interface for transformation suggestions.

R4. The user should be able to see the transformations applied to the dataset.

R5. Provide statistical information about the dataset.

Non-Functional Requirements

R6. The user should be able to transform data effectively.

R7. Provide an efficient way to transform data.


R8. Provide a simple and easy-to-use data transformation solution.


Chapter 4

Architecture and Implementation

In the previous chapter, the most common transformations implemented for this thesis were defined along with the dataset features required to generate the transformations intelligently. This chapter focuses on the architecture and implementation of the approach as per the requirements for creating data transformation suggestions.

Further in this chapter, we describe the methods and components for creating and validating data transformations. The core application for displaying data and creating smooth and user-friendly interactions has been extended from previous work and will be discussed in this section.

4.1 Architecture

The prototype for creating data transformation suggestions has been created independently of the Grafterizer platform. The main goal of this prototype is to generate data transformation suggestions as a result of user interactions to support ease of data preprocessing. The architecture of the application can be seen in Figure 4.1. The core application for displaying data in a user-friendly interface has been extended from previous work [31]. The application is implemented as part of the front-end architecture, so the data transformations are performed in the front end. The application takes tabular data in CSV format as input and lets the user select a row, column or cell. Relevant transformations are then generated based on the user interaction and the properties of the dataset.


[Figure: architecture diagram — the user interacts with the data transformation suggestion GUI, which sends the features of the selected data and the dataset metadata to the recommender system; the Random Forest component, fed with training data, returns relevant transformations that are shown as suggestions; after the user modifies and applies a transformation, the features of the selected data and the applied transformation are written back to the training data. The numbered steps (1-11) are described below.]
Figure 4.1: Architecture of prototype for tabular data transformation suggestions.

The different steps involved in the generation of tabular data transformation suggestions as shown in the architecture above are described in detail below:

1. Input dataset: As the first step of data preprocessing, the user is prompted to input a tabular dataset in CSV format into the system. The imported dataset is made visible to the user on the graphical user interface in the form of a table.

2. User selection: The user can then select either a row, a column or a cell to transform the underlying data.

3. Features of selected data and metadata of dataset: As the user makes a selection of data to be transformed, the features of the entire dataset along with the features of the selected data are fetched and saved as test data to be sent to the recommender system.

4. Training data: The training data contains several entries, each a combination of features and the expected transformation. These entries are used as examples for training the machine learning model to predict the transformation for the test data.

5. Input data: Both the test and training data are sent as input to the Random Forest algorithm.


6. Random Forest: The prediction model of the Random Forest algorithm learns the relationships between the features of the selected data and the transformations we want to predict. The algorithm creates a tree structure to make the most accurate predictions possible. As mentioned in Section 3.1, to include diversity in the generated transformations, we use the same test data to build predictive models with three different training sets, obtaining more than one transformation. This is shown in Figure 4.2.

[Figure: the same test data is passed to RF_Classifier with training_data_1, training_data_2 and training_data_3, producing a table-based, an aggregate and a string-based transformation respectively.]
Figure 4.2: Generation of multiple data transformation suggestions.
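This fan-out over three training sets can be sketched with scikit-learn; the features, labels and the `suggest` helper are all illustrative, not the prototype's actual code.

```python
from sklearn.ensemble import RandomForestClassifier

def suggest(training_sets, test_instance):
    """Score the same test instance with one Random Forest model per
    transformation group, each trained on its own training set."""
    suggestions = []
    for X, y in training_sets:
        model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
        suggestions.append(model.predict([test_instance])[0])
    return suggestions

# Toy training sets for the three groups (features -> transformation label).
table_set  = ([[1, 0]] * 5 + [[0, 1]] * 5, ["drop_row"] * 5 + ["rename_column"] * 5)
format_set = ([[1, 0]] * 5 + [[0, 1]] * 5, ["normalize"] * 5 + ["format_date"] * 5)
string_set = ([[1, 0]] * 5 + [[0, 1]] * 5, ["fill"] * 5 + ["split_column"] * 5)

# One suggestion per group for the same selection.
result = suggest([table_set, format_set, string_set], [1, 0])
```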

7. Transformation suggestions: At most three transformations are generated as a result of applying the Random Forest algorithm to the test data from the user selection.

8. Relevant transformations: The resulting transformations are then sent to the user interface.

9. Data transformation suggestions: The data transformation suggestions are shown to the user.

10. Modify and apply transformation: The user selects one of the transformations and applies it to the dataset. The dataset is then updated with the applied changes.

11. Update training data with the features of data selected and applied transformation: After the user has selected and applied a transformation to the dataset, the features of the selected data object that led to the generation of that particular transformation, along with the target variable, are added to the corresponding training dataset. For example, if the applied transformation belongs to the set of table-based transformations, the new data instance is added to the training data for table-based transformations. Figure 4.3 below shows the procedure for
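The update described in this step can be sketched as appending the features of the selection, plus the applied transformation, to the matching per-family training set. The function and column names below are illustrative assumptions:

```python
import pandas as pd

# Per-family training tables (illustrative schema).
columns = ["dtype", "n_missing", "n_unique", "transformation"]
training_sets = {
    "table-based":  pd.DataFrame(columns=columns),
    "aggregate":    pd.DataFrame(columns=columns),
    "string-based": pd.DataFrame(columns=columns),
}

def record_applied_transformation(family, features, transformation):
    """Append one new training instance to the family's training data."""
    row = {**features, "transformation": transformation}
    training_sets[family] = pd.concat(
        [training_sets[family], pd.DataFrame([row])], ignore_index=True
    )

# Example: the user applied a table-based transformation to a text column.
record_applied_transformation(
    "table-based",
    {"dtype": "object", "n_missing": 0, "n_unique": 12},
    "delete-column",
)
```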
