
sentence embeddings in an unsupervised manner and have proven to be very useful for transfer learning. However, these large models are very computationally intensive.

Word2vec, GloVe and FastText word embeddings are context-independent, which means that they output one vector for each word. This is a consequence of not considering the order of words. On the other hand, ELMo and BERT can generate different embeddings for the same word, capturing its context, i.e. the surrounding words and its position in the sentence.

2.6. Evaluation methodologies

When dealing with a machine learning problem, it is vital to evaluate and test the performance of the implemented model. While training the model is an important step, how the model performs on unseen data is a crucial aspect that should be considered.

It is necessary to know whether the model actually works and whether its predictions are trustworthy. The model might merely memorise the data it was trained on and, consequently, be unable to predict correctly on new data. Therefore, it is essential to measure the performance of the model. This section describes techniques for evaluating a machine learning model and also presents various metrics that are often used when evaluating hate speech detection systems. All of the presented metrics are common when dealing with both classification problems and anomaly detection problems.

2.6.1. Techniques

The first step in the evaluation of a machine learning model is to split the data into three categories: a training set, a validation set and a test set. The training set is used to build the predictive model: the weights of the model are adjusted to fit this data and improve its accuracy.

The validation set is not part of the training data and is used to assess the performance of the model and avoid overfitting to the data in the training set. With a validation set, it is possible to select the best-performing model after fine-tuning and to validate that the model is improving. The test set consists of unseen data and is used to assess the performance of the finalised model.

The accuracy of this evaluation can vary based on how the data was split into categories, and unwanted bias can become part of the model. Cross-validation is a resampling method used to avoid this. In K-fold cross-validation, the dataset is partitioned into K folds, and each fold is used as a test set in turn. In the first iteration, the first fold is used to test the model and the remaining folds are used to train it. This process is repeated until each of the K folds has been used to test the model. The results from all K folds are then combined and regarded as one validation set (Büttcher et al., 2016).
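As an illustration, a minimal sketch of K-fold cross-validation with scikit-learn is shown below; the random data, the logistic regression model and the choice of K = 5 are placeholder assumptions, not taken from this thesis.

```python
# Sketch: 5-fold cross-validation on placeholder data (not the thesis datasets).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((100, 10))            # hypothetical feature matrix
y = rng.integers(0, 2, size=100)     # hypothetical binary labels

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])          # train on the other K-1 folds
    preds = model.predict(X[test_idx])             # evaluate on the held-out fold
    scores.append(accuracy_score(y[test_idx], preds))

print(f"Accuracy per fold: {np.round(scores, 3)}")
print(f"Combined (mean) accuracy: {np.mean(scores):.3f}")
```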


2.6.2. Metrics

Evaluating the quality of the results using performance metrics provides a quantitative measure that makes it easier to determine how well a model performs.

Evaluation metrics are used to make informed decisions based on quantitative measures to increase performance and discover better approaches. Comparing the effectiveness of one method against another in the same situation is useful when deciding which method best achieves its intended purpose and meets the user’s needs (Büttcher et al., 2016). In this section, various evaluation metrics are presented. All the presented metrics require having ground-truth labels available. In this case, where the datasets are adapted from imbalanced classification problems, the ground-truth labels can be used to measure the performance of the anomaly detection system.

Precision and recall

Precision and recall are two of the most widely used evaluation metrics. Usually, the metrics are used for measuring the effectiveness of set-based retrieval, and they can also be used to evaluate anomaly detection models for which labelled data samples are available (Aggarwal, 2017, p. 27-28). Assume that an anomaly detection model outputs an anomaly score. By setting a threshold, the score can be converted into a binary label for each data sample.

For any given threshold t on the anomaly score, the declared set of anomalies is denoted by S(t). Furthermore, let G denote the true set consisting of all anomalies (based on the ground-truth labels). Then, for any given threshold t, the precision is defined as the fraction of declared anomalies that are in fact outliers:

\[
\text{Precision}(t) = \frac{|S(t) \cap G|}{|S(t)|} \tag{2.4}
\]

On the other hand, recall is the fraction of ground-truth outliers that have been retrieved:

\[
\text{Recall}(t) = \frac{|S(t) \cap G|}{|G|} \tag{2.5}
\]
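As a concrete illustration of Equations 2.4 and 2.5, the sketch below computes precision and recall for a fixed threshold t directly from anomaly scores and ground-truth labels; the scores, labels and threshold are hypothetical toy values.

```python
# Precision and recall at a fixed threshold t on the anomaly score.
import numpy as np

scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8])   # hypothetical anomaly scores
labels = np.array([1, 0, 0, 0, 1])             # ground truth: 1 = anomaly
t = 0.5

S_t = scores >= t                   # declared anomalies S(t)
G = labels == 1                     # true anomalies G

precision = (S_t & G).sum() / S_t.sum()   # |S(t) ∩ G| / |S(t)|
recall = (S_t & G).sum() / G.sum()        # |S(t) ∩ G| / |G|
print(f"Precision(t)={precision:.3f}, Recall(t)={recall:.3f}")   # 0.667 and 1.000 here
```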

Recall is used to evaluate the effectiveness of tasks where the user wants to find all relevant documents. This evaluation metric measures how meticulously the search results meet the user’s information need. By varying the parameter t, it is possible to plot a curve based on the precision and recall values. This is called a precision-recall curve (PRC), and it shows the relationship between precision and recall for fixed recall levels from 0% to 100%. For each recall point, the curve plots the maximum precision achieved at that recall level or higher. The curves are normally used to compare the retrieval quality of distinct retrieval algorithms because they allow evaluation of both the fraction of relevant documents found and the quality of the results themselves. A precision-recall curve can also be used to clarify where a retrieval algorithm is most suitable. For instance, if an algorithm has higher precision at lower recall levels, it will be more suitable for Web search. On the other hand, if the algorithm has higher precision at higher recall levels, it may be more suitable for use cases such as legal applications (Baeza-Yates and Ribeiro-Neto, 2011).

The baseline of the PRC is determined by the ratio of positives (P) to negatives (N), given by y = P / (P + N). Hence, for a balanced dataset, the baseline is 0.5.
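As a sketch of how such a curve can be obtained in practice, the snippet below sweeps the threshold with scikit-learn's precision_recall_curve; the scores and labels are hypothetical, and the baseline is the positive ratio described above.

```python
# Sketch: tracing a precision-recall curve by varying the threshold.
import numpy as np
from sklearn.metrics import precision_recall_curve

labels = np.array([1, 0, 0, 0, 1, 1, 0, 0])                      # ground truth
scores = np.array([0.9, 0.2, 0.6, 0.4, 0.8, 0.55, 0.3, 0.1])     # anomaly scores

precision, recall, thresholds = precision_recall_curve(labels, scores)
baseline = labels.sum() / len(labels)    # y = P / (P + N)

for p, r in zip(precision, recall):
    print(f"recall={r:.2f}  precision={p:.2f}")
print(f"PRC baseline: {baseline:.3f}")   # 0.375 for this imbalanced toy example
```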

F-measure

Several of the previous works on hate speech detection use another measure to evaluate their models. The F-measure, also known as the F1-score, is an accuracy measure that combines precision and recall into a single score. It is defined as follows:

\[
F = \frac{2}{\frac{1}{R} + \frac{1}{P}} = \frac{2RP}{R + P} \tag{2.6}
\]

Here, R represents recall and P represents precision. F-measure is the harmonic mean between these two metrics, and it brings a balance between recall and precision. Its use cases include serving as a measure for search, query classification and document classification performance, as well as evaluating named entity recognition or word segmentation.

An F1-score varies between 0 and 1, where 1 equals perfect precision and recall and is the optimal value. If needed, the formula can be weighted to focus more on either recall or precision.
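A small worked example of Equation 2.6; the precision and recall values are made up for illustration.

```python
# F1-score as the harmonic mean of precision (P) and recall (R).
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values: P = 0.75 and R = 0.60 give F1 ≈ 0.667.
print(f"{f1_score(0.75, 0.60):.3f}")
```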

Area under the ROC curve

A receiver operating characteristic (ROC) curve is a probability curve that is closely related to the precision-recall curve (Aggarwal, 2017, p. 28). It is used as a performance measure, and it indicates how well a model is capable of distinguishing between classes. The ROC curve provides a geometric characterisation of classifier effectiveness, plotted with the true positive rate (TPR), or recall (Equation 2.5), on the y-axis against the false positive rate (FPR) on the x-axis. The FPR gives the proportion of falsely predicted positives out of the ground-truth negatives. For a dataset D with ground-truth set of anomalies G and threshold t, the false positive rate is given by Equation 2.7.

\[
\text{FPR}(t) = \frac{|S(t) - G|}{|D - G|} \tag{2.7}
\]

The degree of separability between classes is measured by the area under the curve (AUC). The higher the AUC, the better the model is at predicting the correct label (Büttcher et al., 2016). A model with a good measure of separability will have an AUC close to 1. Likewise, a poor model will have an AUC near 0, while an AUC near 0.5 means the model cannot separate the classes and performs no better than random guessing.
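A minimal sketch of computing the ROC curve and its AUC with scikit-learn; the labels and scores are hypothetical toy values.

```python
# Sketch: ROC curve (TPR vs. FPR over all thresholds) and the area under it.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

labels = np.array([1, 0, 0, 0, 1, 1, 0, 0])                      # ground truth
scores = np.array([0.9, 0.2, 0.6, 0.4, 0.8, 0.55, 0.3, 0.1])     # anomaly scores

fpr, tpr, thresholds = roc_curve(labels, scores)   # FPR (Eq. 2.7) and TPR (Eq. 2.5) per threshold
auc = roc_auc_score(labels, scores)
print(f"AUC = {auc:.3f}")   # close to 1 = good separability, 0.5 = random guessing
```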


2.6.3. Inter-annotator agreement metrics

When creating a linguistic data collection, it is common to have multiple people annotate the same data and then compare the annotations. There might be several reasons why this is desired, for example, to validate the annotation guidelines or to identify difficulties within the annotation procedure. The evaluation, which is often a comparison, can, for instance, be a qualitative examination of the annotations or a quantitative examination based on the calculation of agreement metrics. Either way, the annotation variation between annotators must be examined to ensure the quality and reliability of the dataset.

According to Artstein (2017), an annotation process is reliable if the annotations yield consistent results. This section presents the most common inter-annotator agreement metrics, which are used in Chapter 4. The metrics are intended to provide a quantitative measure of the magnitude of agreement between observers.

Observed agreement

The easiest way to measure the level of agreement between annotators is to use the observed agreement, or raw agreement. This measure equals the percentage agreement between the annotators. Hence, it is calculated by counting the number of items for which the annotators provide identical labels and dividing this by the total number of annotated items. The metric is easy to understand and calculate, and according to Bayerl and Paul (2011), it is the most common way of reporting agreement. The drawback of this approach is that it does not account for the possibility that the agreement occurred by chance (Artstein, 2017, p. 299).

Kappa and alpha

Coefficients in the kappa and alpha family are intended to calculate the amount of agreement that was attained above the level expected by chance (Artstein, 2017, p. 300).

Hence, they correct the observed agreement for the level of agreement that would be expected by chance in a given scenario.

Cohen’s kappa (Cohen, 1960) measures agreement between two annotators while considering the possibility of agreement by chance. Fleiss’ kappa (Fleiss, 1971) has many similarities to Cohen’s kappa, but it allows for more than two annotators. When there are more than two annotators, the agreement is calculated pairwise. However, according to Artstein (2017), the coefficients are not compatible because they differ in their conceptions of agreement by chance.

Let A_o be the actual (observed) agreement and A_e be the expected agreement. Then, both Cohen’s and Fleiss’ kappa (κ) can be calculated using the simplified formula presented in Equation 2.8:

\[
\kappa = \frac{A_o - A_e}{1 - A_e} \tag{2.8}
\]
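A minimal sketch of Equation 2.8 for two annotators, using Cohen's estimate of the chance agreement A_e from each annotator's label distribution; the annotations are hypothetical toy labels.

```python
# Cohen's kappa from observed (A_o) and expected (A_e) agreement, toy example.
import numpy as np

ann1 = np.array(["hate", "ok", "ok", "hate", "ok", "ok"])    # hypothetical annotator 1
ann2 = np.array(["hate", "ok", "hate", "hate", "ok", "ok"])  # hypothetical annotator 2

A_o = np.mean(ann1 == ann2)    # observed (raw) agreement

# Chance agreement: probability that both annotators pick the same label independently.
labels = np.union1d(ann1, ann2)
A_e = sum(np.mean(ann1 == label) * np.mean(ann2 == label) for label in labels)

kappa = (A_o - A_e) / (1 - A_e)
print(f"A_o={A_o:.3f}, A_e={A_e:.3f}, kappa={kappa:.3f}")   # 0.833, 0.500, 0.667 here
```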

Viera and Garrett (2005) provided an overview of how to interpret the kappa score. This is presented in Table 2.1.

Table 2.1.: The interpretation of the kappa coefficient

    Kappa        Agreement
    < 0          Less than chance agreement
    0.01−0.20    Slight agreement
    0.21−0.40    Fair agreement
    0.41−0.60    Moderate agreement
    0.61−0.80    Substantial agreement
    0.81−0.99    Almost perfect agreement

Unlike Cohen’s and Fleiss’ kappa, Krippendorff’s alpha can assess agreement among a variable number of annotators and also accepts non-annotated examples (Bobicev and Sokolova, 2017). Krippendorff’s α is similar to Fleiss’ κ, but is expressed in terms of disagreement rather than agreement. α is calculated from the simplified formula in Equation 2.9, where D_o = 1 − A_o and D_e = 1 − A_e.

\[
\alpha = 1 - \frac{D_o}{D_e} \tag{2.9}
\]

α does not treat all disagreements equally; it uses a distance function to assign a specific level of disagreement to each pair of labels. The observed disagreement is then calculated by counting the number of disagreeing pairs, rather than the agreeing pairs. Furthermore, each disagreement is scaled by the appropriate distance given by the distance function (Artstein, 2017).
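Given values for the observed and expected disagreement, α follows directly from Equation 2.9; the numbers in the sketch below are hypothetical, and a full implementation with a distance function is left out.

```python
# Krippendorff's alpha from observed (D_o) and expected (D_e) disagreement.
def krippendorff_alpha(d_o: float, d_e: float) -> float:
    return 1.0 - d_o / d_e

# Hypothetical disagreement values: D_o = 0.10 and D_e = 0.45 give alpha ≈ 0.778.
print(f"{krippendorff_alpha(0.10, 0.45):.3f}")
```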
