

3.2 Machine Learning Classifier

Machine Learning is a subset of Artificial Intelligence that trains systems to automatically learn and improve from experience without being explicitly programmed. Machine learning algorithms differ in their approach, but they can generally be divided into the following categories:

Supervised learning is the process of predicting the outcome of new input data by learning a function that maps example data to associated target responses.

Unsupervised Learning refers to drawing inferences from data without reference to known outcomes.

Semi-supervised learning falls between supervised and unsupervised learning and refers to training a model on data where some of the training examples have corresponding outcomes but others don’t.

Reinforcement learning works by taking suitable actions to maximize the cumulative reward in a particular situation.

For recommending relevant data transformation suggestions, we train our machine learning model on historical data about user interactions and use that model to predict a transformation for unseen data. For training the model we have a set of observations and their known outcomes, which makes supervised learning the most suitable machine learning approach for our problem [12].

Supervised learning can be further classified into the following types of problems, depending on the type of desired output:

Classification: When the desired output is discrete, i.e., the output can be divided into classes.

Regression: When the desired output is continuous, i.e., numeric.

In the context of our thesis, since the outcome to be predicted is a limited set of discrete transformations, we treat it as a classification problem [12].

Classification [1], as mentioned above, is the process of predicting a class for a new set of observations. Classes are often referred to as targets, categories or labels. In machine learning, classification means identifying the most appropriate class for a new observation based on training data containing observations whose classes or labels are known. In the context of tabular data, an observation is a row of the table. The observation for which the class is to be predicted is normally referred to as test data. For example, sorting emails into “spam” and “not spam” based on their characteristics is a common classification problem.

The characteristics or variables that an observation comprises are called features. A classifier uses training data to learn how given input features relate to a class. In the above example, emails marked as “spam” and “not spam” are used as training data. An accurately trained classifier can then identify a new email as “spam” or “not spam”. In the sections below we describe the Random Forest algorithm, a tree-based classifier chosen to create data transformation suggestions for this thesis. We also give an overview of decision trees, a decision support tool and the building block of the Random Forest algorithm.
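For illustration, the following minimal sketch shows how such a classifier could be trained and applied with scikit-learn; the emails, labels and the choice of a naive Bayes classifier are purely illustrative and are not the method used in this thesis.

```python
# Minimal supervised classification sketch: learn "spam" vs "not spam"
# from labeled training emails, then predict the class of unseen test data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data: observations (emails) with known labels. Purely illustrative.
emails = [
    "win a free prize now",
    "meeting agenda for tomorrow",
    "claim your free reward",
    "project status update attached",
]
labels = ["spam", "not spam", "spam", "not spam"]

# Turn raw text into numeric features (word counts).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(emails)

# Train the classifier on the labeled observations.
clf = MultinomialNB()
clf.fit(X_train, labels)

# Predict the class of a new, unseen email (the "test data").
X_test = vectorizer.transform(["free prize waiting for you"])
print(clf.predict(X_test))  # e.g. ['spam']
```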

3.2.1 Decision Trees

A decision tree [18] is a tree-like model used to represent decisions. Each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf represents the actual output or class label.

Decision trees effectively visualize the available options and the possible outcomes of each course of action.

Decision trees classify observations by narrowing the decisions down from the root to a leaf node, with the leaf node serving as the category or label of a particular observation. This decision-making process is recursive and is repeated for every subtree.

Consider the following example: the decision tree in Figure 3.2 defines Fill null values with given input as the most relevant transformation, or class, for a column with string data type containing null values. However, if the column doesn’t contain any null values, the user may want to get the number of unique entries in the selected column by choosing COUNT grouped by selected column.

Figure 3.2: An example of a decision tree.
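The decision logic of Figure 3.2 could, for illustration, be learned by a scikit-learn decision tree; the feature encoding and training examples in the sketch below are hypothetical.

```python
# Sketch of the decision logic from Figure 3.2 as a learned decision tree.
# Feature encoding is hypothetical: [is_string_column, has_null_values].
from sklearn.tree import DecisionTreeClassifier

X = [
    [1, 1],  # string column with null values
    [1, 0],  # string column without null values
    [1, 1],
    [1, 0],
]
y = [
    "Fill null values with given input",
    "COUNT grouped by selected column",
    "Fill null values with given input",
    "COUNT grouped by selected column",
]

tree = DecisionTreeClassifier().fit(X, y)

# A new string column that contains null values:
print(tree.predict([[1, 1]]))  # ['Fill null values with given input']
```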

Decision trees learn the relationships between dataset features and formulate the decision-making process as a hierarchical structure. The advantage of using decision trees is that they are easy to interpret and are able to assign a specific value to each outcome while requiring minimal data preparation from the user.

3.2.2 Random Forest

Random forest is a supervised learning algorithm which can be used for classification. It is an ensemble of decision tree predictors. A random forest classifier creates a set of n decision trees, where n is the number of estimators set when the algorithm is run, such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [3]. The results of the decision trees are then aggregated into one final decision that determines the class of the test object. An illustration of the Random Forest classifier is shown in Figure 3.3. In this example, n = 3 decision trees have been created for the given test instance and each decision tree has predicted a specific class. The final output selected by the Random Forest is Class A, since it is the majority prediction, produced by 2 out of 3 trees.

Figure 3.3: An illustration of the Random Forest classifier. A test instance is passed to three trees (Tree-1, Tree-2, Tree-3), whose predictions (Class A, Class B, Class A) are combined by majority voting into the final class.
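The aggregation step can be sketched in a few lines of Python; the tree predictions below are taken from the example in Figure 3.3.

```python
# Majority voting over the per-tree predictions from Figure 3.3 (n = 3 trees).
from collections import Counter

tree_predictions = ["Class A", "Class B", "Class A"]

votes = Counter(tree_predictions)
final_class, n_votes = votes.most_common(1)[0]
print(final_class, n_votes)  # Class A 2
```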

For creating data transformation suggestions in this thesis, random forest is used to predict the most relevant transformation for an unseen observation. We use a random forest instead of a single decision tree because a single decision tree may be prone to overfitting. Overfitting occurs when a machine learning model fits its training data so closely that it fails to predict the classes of unseen observations correctly. An ensemble of several decision trees, in contrast to a single tree, limits overfitting and error due to bias by adding randomness to the model [3]. It does so by searching for the best feature among a random subset of features at each split. The accuracy of a random forest depends on the strength of the individual trees in the forest and the correlation between them.
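This effect can be illustrated with a small scikit-learn experiment; the synthetic dataset and the parameters below are illustrative and are not the data or settings used in the thesis.

```python
# Sketch: a single decision tree versus a random forest on noisy synthetic
# data. The forest's randomness typically limits overfitting, so its test
# accuracy is usually higher than that of the single, fully grown tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20,
                           flip_y=0.2,  # label noise to encourage overfitting
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100,     # number of trees
                                max_features="sqrt",  # random feature subset per split
                                random_state=0).fit(X_train, y_train)

print("single tree  :", tree.score(X_test, y_test))
print("random forest:", forest.score(X_test, y_test))
```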

Machine Learning Versus Rule-Based Systems

As shown in Figure 3.2, a decision tree is a branching structure where each node determines whether to follow the left or right branch. Some machine learning algorithms, including decision trees, may be perceived as rule-based, raising the question of whether rule-based systems can be used instead of machine learning algorithms. Rule-based systems contain rules encoded into the system in the form of if-then-else statements, which are easy to state and understand. Such systems are deterministic, and the absence of the right rule may lead to incorrect results. Although simple at the start, when containing only a few rules, these systems may become complex with the addition of more constraints, making it difficult to identify and remove errors. A rule-based system can only be called efficient if it specifies all combinations of rules and their target responses accurately.
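A minimal sketch of such a rule-based alternative, with hypothetical rules and names, could look as follows; each new data type or condition needs its own hand-written branch.

```python
# Sketch of a rule-based transformation suggester: rules are encoded
# manually as if-then-else statements. All rules here are hypothetical.
def suggest_transformation(column):
    if column["dtype"] == "string" and column["null_count"] > 0:
        return "Fill null values with given input"
    elif column["dtype"] == "string":
        return "COUNT grouped by selected column"
    elif column["dtype"] == "number" and column["null_count"] > 0:
        return "Fill null values with column mean"
    # ... every further data type and condition needs its own branch,
    # and any input not covered by a rule falls through with no answer.
    return None

print(suggest_transformation({"dtype": "string", "null_count": 3}))
```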

The efficiency of a machine learning algorithm, a decision tree in this context, depends instead on the amount and quality of training data representing all possible variations along with their target responses. A lack of good training data may lead to the algorithm performing poorly, especially in the case of rare inputs. Decision trees, unlike rule-based systems, are probabilistic and use statistical models. For a machine learning model, it is also important to select features that best represent the dataset in order to create a predictive model.

The resulting model is created by deriving the rules from the given historical data instead of describing them manually. The decision tree in our case is trained on historical outcomes, in the form of dataset features and the appropriate transformation suggestion, to determine future outcomes.

Decision trees and rule-based systems complement each other. Rules that describe the relationships between the inputs and target responses can be derived from a decision tree. A system consisting of a complex set of rules may be difficult to interpret as a decision tree. Conversely, a dataset consisting of a rich set of features may be difficult to map to a rule-based system.

3.2.3 Dataset Features

In machine learning, a feature is a measurable property or characteristic of the data being analyzed. Feature engineering is the process of identifying the features which play an important role in the creation of a prediction model for a dataset. Feature engineering is important in machine learning as it has a direct impact on how good the final predictions will be [9]. A raw dataset cannot itself contribute to the learning phase, but quality features of the dataset provide value in creating better machine learning models.

Different datasets include different kinds of features, and the quality of the features in a dataset plays a role in the insights one can get from it. For generating data transformation suggestions, we use two groups of features of the given dataset to predict the most relevant transformation:

1. Features of the given dataset, i.e., metadata
2. Features of the data selected by the user

The former group of features, in the context of tabular data, contains the attributes of the dataset in general, including the number of variables and the total number of observations. The latter group contains the properties of the data selected by the user at a certain instant: the selected row, column or cell of the dataset and the properties of the data it contains. Tables 3.4 and 3.5 below give a detailed description of the dataset features required for generating relevant data transformations and the type of data from which each can be extracted.

Feature                                Scope
Number of attributes                   Entire dataset
Number of observations                 Entire dataset
Percentage of missing values           Entire dataset
Percentage of categorical attributes   Entire dataset
Percentage of numeric attributes       Entire dataset
Percentage of boolean attributes       Entire dataset
Percentage of date attributes          Entire dataset
Percentage of null attributes          Entire dataset

Table 3.4: Features of the given dataset

Feature                 Description                                                       Scope
Data type               Number, string, boolean, date                                     Column, cell
Data selected           Row, column or cell selected                                      Entire dataset
No. of numeric values   Number of numeric values in the selected data. Zero or more.     Row
No. of boolean values   Number of values with boolean data type in the selected data.    Row
                        Zero or more.
No. of string values    Number of values with string data type in the selected data.     Row
                        Zero or more.
No. of date values      Number of values with date data type in the selected data.       Row
                        Zero or more.
No. of categories       Number of unique categories in the selected data. One or more.   Column
No. of nulls            Number of null values in the selected data. Zero or more.        Row, column
Special characters      Check if any special characters exist in the selected data.      Cell

Table 3.5: Features of selected data item
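As a sketch of how these two feature groups could be extracted with pandas: the function and feature names below are our own, chosen to mirror Tables 3.4 and 3.5, and only a subset of the listed features is shown.

```python
# Sketch of extracting the two feature groups from a pandas DataFrame.
# Names are illustrative; only some features from Tables 3.4 and 3.5 appear.
import pandas as pd

def dataset_features(df: pd.DataFrame) -> dict:
    """Group 1: metadata describing the entire dataset (cf. Table 3.4)."""
    n_cells = df.shape[0] * df.shape[1]
    return {
        "n_attributes": df.shape[1],
        "n_observations": df.shape[0],
        "pct_missing": df.isna().sum().sum() / n_cells * 100,
        "pct_numeric": df.select_dtypes("number").shape[1] / df.shape[1] * 100,
        "pct_boolean": df.select_dtypes("bool").shape[1] / df.shape[1] * 100,
    }

def column_features(df: pd.DataFrame, column: str) -> dict:
    """Group 2: properties of the data selected by the user (cf. Table 3.5)."""
    col = df[column]
    return {
        "dtype": str(col.dtype),        # pandas reports strings as 'object'
        "n_categories": col.nunique(),  # number of unique categories
        "n_nulls": int(col.isna().sum()),
    }

df = pd.DataFrame({"name": ["a", "b", None], "age": [1, 2, 3]})
print(dataset_features(df))
print(column_features(df, "name"))
```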