
Background and Related Work

2.4 Overview of Relevant Related Approaches

Significant research has been done on data transformation, and several commercial tools and libraries can be used to transform data both programmatically and through graphical user interfaces (GUIs). In this section we take a look at some of the available tools.

2.4.1 Tableau

Tableau3 is an interactive data analysis and visualisation platform that lets users view data in an understandable format and helps them generate customized dashboards.

Tableau’s data preparation and preprocessing tool, Tableau Prep4, works by selecting data in one of its many supported formats and applying transformations to it (for example, filter, rename and join) to clean and shape the data before analysis. In addition to identifying problems in the data, it analyzes the given data to recommend transformations that may be of interest to the user and that could be applied automatically to make data preprocessing quicker. While the platform itself has a steep learning curve, it also provides visual data profiling and a graphical list of the steps taken to transform data in the form of a flowchart.

3https://www.tableau.com

4https://www.tableau.com/products/prep

2.4.2 Talend

Talend Data Preparation5 is a data preparation tool that comes with a user-friendly interface for transforming data before analysis. It lets users filter, modify and enrich data by intelligently suggesting transformations based on the data selected. However, it is not clear whether Talend uses ML to provide customized transformation suggestions to users.

2.4.3 Ideas in Microsoft Excel

Ideas6 is a feature of the spreadsheet application Microsoft Excel7 that helps the user understand data through visual summaries in the form of charts and statistical patterns. It comes with an interactive interface and provides suggestions tailored to the task being performed, but it has some limitations. It works best with clean tabular data with correctly defined headers. Although it provides interactive analysis and statistical summaries of data, it does not provide transformation recommendations for data preprocessing.

2.4.4 HoloClean

HoloClean8 is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised ML system, HoloClean makes use of data integrity constraints, quantitative statistics, value correlations and external reference data to build a probabilistic model for the data cleaning task. HoloClean allows data experts to save time and effectively communicate their domain knowledge to enable accurate analysis, predictions, and insights from noisy, incomplete, and erroneous data.

While HoloClean is a holistic data cleaning framework that provides information about incorrect data values and conflicting tuples, it suffers from the lack of a convenient user interface.

5https://www.talend.com/products/data-preparation

6https://support.office.com/en-us/article/ideas-in-excel-3223aab8-f543%2D4fda-85ed-76bb0295ffc4?ui=en-US&rs=en-US&ad=US

7https://products.office.com/en/excel

8http://www.holoclean.io

2.4.5 Trifacta

The data transformation tool most closely related to the research performed in this thesis is Trifacta Wrangler9. Trifacta Wrangler is a powerful tool that offers many data manipulation functionalities, including restructuring, cleaning, enrichment and distillation of data. A predictive model computes a ranked set of suggestions, presented as suggestion cards, based on the user’s selection and historical data in an attempt to interpret the data transformation intent behind the selection [13]. Before applying a transformation, Trifacta lets users inspect these suggestions to see which suits best and modify a particular suggestion. While Trifacta provides transformations through predictive interactions, this thesis focuses on providing the user with a diverse set of transformations based on historical data for Grafterizer.
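To make the idea of ranking suggestions from a selection and historical data concrete, the following is a minimal sketch. It is not Trifacta's predictive model: it substitutes a simple frequency count for the model, and the function name `rank_suggestions`, the selection types and the transformation names are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def rank_suggestions(history, selection_type, top_k=3):
    """Rank transformations by how often they followed the given selection type.

    `history` is a list of (selection_type, transformation) pairs, a
    hypothetical log of past user actions. A plain frequency count stands in
    for a real predictive model.
    """
    counts = Counter(
        transformation
        for past_selection, transformation in history
        if past_selection == selection_type
    )
    # most_common returns (item, count) pairs sorted by descending count.
    return [name for name, _ in counts.most_common(top_k)]


# Hypothetical interaction log: what users did after selecting something.
history = [
    ("column", "rename"),
    ("column", "drop"),
    ("column", "rename"),
    ("text", "split"),
    ("column", "rename"),
    ("column", "drop"),
    ("column", "fill-missing"),
]

suggestions = rank_suggestions(history, "column")
print(suggestions)
```

A learned model would replace the frequency count with a scoring function over features of the selection and the data, but the interface (selection in, ranked list of transformations out) stays the same.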

2.4.6 Other tools

In addition to the tools and frameworks mentioned above, other commercial and non-commercial tools were also reviewed. We provide a short description of each of these below.

OpenRefine10 is an open-source tool for cleaning and transforming data and extending it with web services and external data. It lets users apply basic and advanced transformations on large datasets, including normalization of numerical data and filtering of text.

NADEEF [8] is a generalized data cleaning system that relies on rules to clean data. It allows users to specify different data quality rules, including functional dependencies, for the given dataset. These rules are then used to find violations in the data and repair them, while at the same time interacting with domain experts through a data quality dashboard to achieve higher quality results.
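As an illustration of what checking a functional dependency rule involves, the sketch below finds groups of rows that agree on the left-hand side of a dependency but disagree on the right-hand side. It is a generic example and does not use NADEEF's rule language or API; the function `find_fd_violations` and the sample `zip` and `city` columns are assumptions made for this example.

```python
from collections import defaultdict


def find_fd_violations(rows, lhs, rhs):
    """Return row indices that violate the functional dependency lhs -> rhs.

    `rows` is a list of dicts, `lhs` a list of column names and `rhs` a single
    column name. The result maps each violating lhs value tuple to the indices
    of the rows that share it.
    """
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[column] for column in lhs)
        groups[key].append(index)

    violations = {}
    for key, indices in groups.items():
        # The dependency holds only if all rows in a group share one rhs value.
        rhs_values = {rows[i][rhs] for i in indices}
        if len(rhs_values) > 1:
            violations[key] = indices
    return violations


# Hypothetical sample data: a postal code should determine the city.
rows = [
    {"zip": "0150", "city": "Oslo"},
    {"zip": "0150", "city": "Oslo"},
    {"zip": "5003", "city": "Bergen"},
    {"zip": "5003", "city": "Bergn"},
]

violations = find_fd_violations(rows, lhs=["zip"], rhs="city")
print(violations)
```

Detection only flags the conflicting rows. Deciding which value is correct is the repair step, which is where a rule-based system would involve domain experts.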

KATARA [6] is a data cleaning system powered by a knowledge base and human help. It can be used to clean various datasets by taking a table as input, together with a knowledge base to interpret the table semantics. It also identifies incorrect data and possible repairs for it. It then relies on human input to disambiguate the table semantics and annotate data.

Despite substantial research on data preprocessing and the use of ML in that context, the above-mentioned tools come with limitations.

Although some of the tools come with interactive user interfaces, this thesis focuses on providing the user with a diverse set of transformations based on historical data. HoloClean uses ML to identify and correct anomalies in the given dataset but does not come with an interactive user interface to visualize the transformations performed on the dataset.

9https://www.trifacta.com

10http://openrefine.org

OpenRefine, NADEEF, and KATARA are data cleaning tools that rely on human input to clean data for analysis. Although Trifacta comes with a user-friendly interface that provides data transformation suggestions based on user interactions, it is unclear whether it uses the Random Forest algorithm to generate them.

The above analysis of related approaches to data transformation paves the way for defining the scope of this thesis. The application prototype should provide data transformation suggestions based on user interactions using the Random Forest algorithm, and it should have a graphical user interface that lets the user immediately see the transformed dataset as well as the actions performed on the input dataset.

Chapter 3

Problem Analysis and Proposed