Architecture and Implementation

4.2.2 Generation of Transformation Suggestions

This section describes how relevant data transformation suggestions are generated, including the collection of data from user interactions and the use of the Random Forest algorithm.

Data Collection from User Interactions

The tabular dataset supplied by the user is displayed in the graphical user interface, where the user can select data by clicking on it. The user can select a cell, a single row or a single column. Selecting one of these data objects triggers the collection of the characteristics of the selected data listed in Section 3.2.3, and eventually the creation of a set of suggestions. The collected characteristics of the data object are assigned to a JSON object, shown in Listing 4.1.

    1  {
    2    DataSelected: 'column',
    3    DataType: 'null',
    4    NoOfBinaryValues: 0,
    5    NoOfCategories: 0,
    6    NoOfDates: 0,
    7    NoOfNulls: 0,
    8    NoOfNumericValues: 0,
    9    NoOfStringValues: 0,
    10   hasSpecialCharacters: false,

Listing 4.1: An example of feature object

Listing 4.1 above shows an example set of features collected from one selection. It includes features of both the selected data object and the entire dataset. The feature object consists of key/value pairs, each separated by a comma.

The keys denote the names of the features, while the values denote their respective values. Note that the key/value pairs in lines 2-10 represent the features of the selected data object, while those in lines 11-17 represent the features of the entire dataset. The possible values and data types of these key/value pairs are described below.

The feature DataSelected, which refers to the type of the selected data object, can take one of the following values in text format:

• row

• column

• cell

The feature DataType, which is the data type of the selected data object, can take one of the following values:

• String: Alphanumeric combinations and text.

• Number: Positive and negative integers, and floating point numbers.

• Boolean: True or False.

• Date: A date.

• null: Refers to an empty data object.

The feature hasSpecialCharacters is either true or false, depending on the presence of special characters in the selected data object. It applies only to a cell data object whose DataType is String, and its default value is false.

The remaining features in a feature object hold numeric values. These values represent the statistical properties of the selected data object and of the dataset it belongs to. They may vary across different datasets and across the data objects selected.
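The collection step described above can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the helper names detectType and collectFeatures, the type-detection rules, the definition of NoOfCategories as the number of distinct non-null values, and the special-character pattern are all assumptions made for the example. Only the key names are taken from Listing 4.1, and the dataset-level features are left out.

```javascript
// Hypothetical helper: map a raw cell value to one of the DataType names
// used in the text. The detection rules below are assumptions.
function detectType(value) {
  if (value === null || value === undefined || value === '') return 'null';
  const text = String(value).trim();
  if (typeof value === 'boolean' || /^(true|false)$/i.test(text)) return 'Boolean';
  if (typeof value === 'number' || (text !== '' && !isNaN(Number(text)))) return 'Number';
  if (/^\d{4}-\d{2}-\d{2}$/.test(text)) return 'Date'; // ISO dates only, for brevity
  return 'String';
}

// Hypothetical helper: build the selection-level part of a feature object.
function collectFeatures(values, dataSelected) {
  const counts = { String: 0, Number: 0, Boolean: 0, Date: 0, null: 0 };
  const distinct = new Set();
  for (const v of values) {
    const type = detectType(v);
    counts[type] += 1;
    if (type !== 'null') distinct.add(String(v));
  }

  // Assumption: the DataType of a selection is its most frequent non-null type.
  let dataType = 'null';
  let best = 0;
  for (const type of ['String', 'Number', 'Boolean', 'Date']) {
    if (counts[type] > best) {
      best = counts[type];
      dataType = type;
    }
  }

  // Per the text, hasSpecialCharacters applies only to String cells.
  const hasSpecialCharacters =
    dataSelected === 'cell' &&
    dataType === 'String' &&
    /[^A-Za-z0-9\s]/.test(String(values[0]));

  return {
    DataSelected: dataSelected,
    DataType: dataType,
    NoOfBinaryValues: counts.Boolean,
    NoOfCategories: distinct.size, // assumption: distinct non-null values
    NoOfDates: counts.Date,
    NoOfNulls: counts.null,
    NoOfNumericValues: counts.Number,
    NoOfStringValues: counts.String,
    hasSpecialCharacters: hasSpecialCharacters
  };
}
```

Calling collectFeatures with the values of a clicked column and the string 'column' would yield an object with the same keys as lines 2-10 of Listing 4.1.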

Random Forest Classifier

For the generation of transformation suggestions, we use random-forest-classifier¹², an open-source JavaScript library under the MIT license that implements a Random Forest classifier. The classifier is configured through the following parameters.

Number of estimators: Refers to the number of trees. In the Random Forest algorithm, each tree predicts a class, and the final output of the prediction model is the class that receives the most votes. Although a larger number of trees generally leads to a better prediction model, this is not always the case: beyond a certain point, adding trees mainly increases the computational cost of the model without a significant performance gain [21]. Due to limitations of the host system, we use a constant number of trees in this thesis.

Input data: Refers to a labelled training dataset consisting of an array of feature objects and their corresponding target transformations.

Features: Refers to the names of the features to be considered by the algorithm when creating a predictive model. In this thesis, we use all the features collected through user interaction and therefore leave this parameter empty or null.

Target variable: Refers to the name of the variable to be predicted, which in our case is the data transformation. Since this is a classification problem, we want to predict the categories that user interactions fall into. These categories are discrete, which makes it appropriate to represent the target variable as text instead of numbers.
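The voting step behind the number of estimators can be illustrated with a short sketch. It deliberately does not use the API of the random-forest-classifier library. Each tree is modelled as a plain function from a feature object to a label, and the three hand-written trees stand in for trees that would normally be learned from the labelled input data. The function name predictByMajorityVote is hypothetical, and the labels deleteCol and keepCol are borrowed from Listing 4.3.

```javascript
// Return the label that receives the most votes across all trees.
// On a tie, the label that was seen first wins (an arbitrary choice).
function predictByMajorityVote(trees, features) {
  const votes = new Map();
  for (const tree of trees) {
    const label = tree(features);
    votes.set(label, (votes.get(label) || 0) + 1);
  }
  let winner = null;
  let maxVotes = 0;
  for (const [label, count] of votes) {
    if (count > maxVotes) {
      maxVotes = count;
      winner = label;
    }
  }
  return winner;
}

// Hand-written stand-ins for trained trees (illustrative only).
const trees = [
  f => (f.NoOfNulls > 0 ? 'deleteCol' : 'keepCol'),
  f => (f.DataType === 'null' ? 'deleteCol' : 'keepCol'),
  f => (f.NoOfNumericValues > 0 ? 'keepCol' : 'deleteCol')
];

// Two of the three trees vote for 'keepCol' on this feature object.
const suggestion = predictByMajorityVote(trees, {
  DataType: 'Number',
  NoOfNulls: 1,
  NoOfNumericValues: 4
});
```

Because the target variable is a text label, the vote is a simple count per label and no numeric encoding of the classes is needed.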

Checks for Irrelevant Transformations and Null Values

When the application is initialized, we use training data created beforehand to generate transformations. To induce diversity, the training data, like the transformations, is divided into three categories as mentioned in Section 3.1. These three groups do not necessarily include data transformations for all three data objects, i.e. row, column and cell.

The machine learning classifier returns NULL for a category if the user selection has not been mapped to an appropriate transformation. For instance, if the user selects a cell, no table-based transformation is generated, as this category does not include any transformations for a cell. In that case, at most two transformations are generated.

As transformations are generated, we apply two checks to ensure a good user experience and to make sure that no irrelevant transformations are suggested to the user. Figure 4.5 below demonstrates the implemented procedure, from the generation of data transformations to finally making recommendations to the user in the graphical user interface.

Figure 4.5: Procedure for filtering irrelevant transformations.

Check for irrelevant transformations: Initially, due to a lack of sufficient training data, the Random Forest classifier may not generate appropriate transformations. A check for irrelevant transformations makes sure that no transformations that are incorrect with respect to the selected data object and its data type are suggested to the user. Listing 4.2 below shows an example check created for filtering out irrelevant transformations.

Listing 4.2: An example check for irrelevant transformation.

In the example code, we check whether the selected data object is a row. If that condition holds, we use a conditional statement to find out whether the transformation generated by the machine learning classifier, here defined as allowedTransformations, is one of the transformations implemented for rows. If it is, we move on. Otherwise, we replace the generated transformation with null. The variable i in the code denotes the iterator for the corresponding array.
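A check of this kind might look like the sketch below. It is an illustration under stated assumptions, not a reproduction of Listing 4.2: the lookup table implementedTransformations and the function filterIrrelevant are hypothetical names, the row and cell transformation names are placeholders, and only deleteCol and keepCol appear in the source. The sketch filters by data object only, whereas the check described above also takes the data type into account.

```javascript
// Hypothetical table of transformations implemented per data object.
// Only 'deleteCol' and 'keepCol' come from the source; the rest are placeholders.
const implementedTransformations = {
  row: ['deleteRow', 'keepRow'],
  column: ['deleteCol', 'keepCol'],
  cell: ['replaceValue']
};

// Replace every predicted transformation that is not implemented for the
// selected data object with null. The input array is left unmodified.
function filterIrrelevant(predictions, dataSelected) {
  const allowed = implementedTransformations[dataSelected] || [];
  const result = predictions.slice();
  for (let i = 0; i < result.length; i++) {
    if (result[i] !== null && !allowed.includes(result[i])) {
      result[i] = null;
    }
  }
  return result;
}
```

For a selected row, a column transformation predicted by mistake would be replaced with null, while a row transformation would pass through unchanged.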

Check for NULLs returned: If all the transformations returned by the Random Forest classifier are null after passing through the filter for irrelevant transformations, we generate transformations randomly based on the data type and the type of data object selected. This check ensures that at least one recommendation is returned to the user when the prediction algorithm is unable to generate a relevant transformation and returns none. Listing 4.3 below shows an example check for replacing null transformations.

    transformationsArray = ['deleteCol', 'keepCol', '

Listing 4.3: An example check for replacing null transformation.

In the code above, we demonstrate the generation of a non-null transformation for a selected column. Here, we check whether the transformation is null. If this condition holds, we check the type of the selected data, which is a column in this case. After that, we generate a transformation based on the data type. The transformation is either chosen randomly or determined by a logical statement. The variable i in the code denotes the iterator for the corresponding array.
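The fallback can be sketched along the following lines. Again, this is a hedged illustration and not the thesis code: the function replaceNulls, the table columnFallbacks and its contents are assumptions, and only the column case is handled. The random source is passed in as a parameter so that the behaviour can be checked deterministically.

```javascript
// Hypothetical fallback candidates for a selected column, keyed by data type.
// A single candidate models a choice made by a logical statement rather than at random.
const columnFallbacks = {
  String: ['keepCol', 'deleteCol'],
  Number: ['keepCol', 'deleteCol'],
  null: ['deleteCol'] // assumption: an empty column is suggested for deletion
};

// If every prediction is null, return one fallback suggestion for a column.
// Otherwise the predictions are returned unchanged.
function replaceNulls(predictions, dataSelected, dataType, random = Math.random) {
  if (!predictions.every(p => p === null)) return predictions;
  if (dataSelected !== 'column') return predictions; // rows and cells omitted in this sketch
  const candidates = columnFallbacks[dataType] || ['keepCol', 'deleteCol'];
  const pick = candidates[Math.floor(random() * candidates.length)];
  return [pick];
}
```

With this structure the user always receives at least one suggestion for a column selection, even when both the classifier and the relevance filter leave nothing behind.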

In document Predictive Data Transformation Suggestions Using Machine Learning: Generating Transformation Suggestions in Grafterizer Using the Random Forest Algorithm (pages 57-61)