6. Experiments and Results
6.2. Experimental setup
6.2.1. Datasets
To perform the experiments with semi-supervised detection of hate speech and to check the feasibility of the approach, two large datasets were used: one in English and one in Norwegian. For both datasets, the text preprocessing steps described in Section 5.1 were applied.
English (Jigsaw) dataset
The English dataset used in the experiments is the Jigsaw dataset from the Toxic Comment Classification Challenge, which is available on Kaggle.1 The dataset consists of a number of comments and their respective sets of labels. There are six categories in the dataset: toxic, severe toxic, obscene, threat, insult and identity hate. Two random samples from the dataset are shown in Table 6.1.
Table 6.1.: Two random samples from the Jigsaw dataset showing how the data is represented.
Each row also contains an id which is not included here.
comment text                                 toxic   severe toxic   obscene   threat   insult   identity hate
FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!       1          0            1         0        1           0
You, sir, are my hero. Any chance you
remember what page that’s on?                  0          0            0         0        0           0
As can be seen from Table 6.1, a comment may have several labels, exactly one label or no labels. A comment is neutral if it does not have any labels, i.e. none of its labels are set to one. The dataset consists of a training set and a test set, where some of the labels in the test set are set to -1. After all comments with such labels were removed, the dataset consisted of 223 549 comments in total, where 159 571 are in the training set and 63 978 are in the test set. Hence, the test set contains 28.6% of all comments. Table 6.2 shows the number of comments labelled in each category.
1 https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
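The cleaning step described above can be sketched as follows. The dict-based row layout is an assumption for illustration; the label names follow the six categories of the dataset.

```python
# Sketch of the test-set cleaning step: rows whose labels are set to -1 were
# never scored in the challenge and are removed before evaluation.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def drop_unscored(rows):
    """Keep only rows where no label is -1."""
    return [row for row in rows if all(row[label] != -1 for label in LABELS)]

test_rows = [
    {"id": "a1", "toxic": 1, "severe_toxic": 0, "obscene": 1,
     "threat": 0, "insult": 1, "identity_hate": 0},
    {"id": "a2", "toxic": -1, "severe_toxic": -1, "obscene": -1,
     "threat": -1, "insult": -1, "identity_hate": -1},  # unscored: removed
]
kept = drop_unscored(test_rows)
```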
Table 6.2.: The number of comments in each category in the English dataset.
Label            Train set   Test set
Toxic               15 294      6 090
Severe toxic         1 595        367
Obscene              8 449      3 691
Threat                 478        211
Insult               7 877      3 427
Identity hate        1 405        712
Since many comments are labelled with more than one label, the total number of hateful comments is less than the sum of the counts presented above. Table 6.3 shows the distribution between hateful and neutral comments in both the training and test sets. The number of hateful comments is calculated by counting all comments that contain at least one hateful label; the number of neutral comments is the count of all comments without any hateful label.
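This counting rule can be sketched as follows; the label names mirror the six categories, and the sample rows are illustrative only.

```python
# Sketch of the hateful/neutral counting: a comment is hateful if at least one
# of its six labels is 1, neutral otherwise, so a multi-labelled comment is
# still counted only once.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def is_hateful(row):
    return any(row[label] == 1 for label in LABELS)

rows = [
    {"toxic": 1, "severe_toxic": 0, "obscene": 1, "threat": 0, "insult": 1, "identity_hate": 0},
    {"toxic": 0, "severe_toxic": 0, "obscene": 0, "threat": 0, "insult": 0, "identity_hate": 0},
    {"toxic": 1, "severe_toxic": 0, "obscene": 0, "threat": 0, "insult": 1, "identity_hate": 0},
]
n_hateful = sum(is_hateful(row) for row in rows)
n_neutral = len(rows) - n_hateful
```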
Table 6.3.: The number of comments categorised as neutral and hateful in the English dataset.
Dataset    Neutral   Hateful     Total
Train      143 346    16 225   159 571
Test        57 735     6 243    63 978
Based on the numbers of neutral and hateful comments presented in Table 6.3, both the training and test sets contain approximately 90% neutral comments.
Since the test set contains approximately 30% of all the data, one third of it is used for validation instead of testing. This results in approximately 70% of the data being used for training, 10% for validation and 20% for testing.
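The resulting split fractions can be verified directly from the counts in Table 6.3:

```python
# Verifying the approximate 70/10/20 split from the counts in Table 6.3.
total = 223_549
train = 159_571
test_full = 63_978              # the original test set (28.6% of all data)

validation = test_full // 3     # one third of the test set
test = test_full - validation

train_frac = train / total      # ~0.71
val_frac = validation / total   # ~0.10
test_frac = test / total        # ~0.19
```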
Norwegian dataset
After the entire dataset was annotated and all comments labelled with X were removed, the dataset was complete. Every row in the dataset is of the form ‘id, label, text’. The dataset contains the number of comments presented in Table 6.4.
Table 6.4.: The number of comments and percentage of total in the annotated Norwegian dataset.
Category                  Number of comments   Percentage of total
1 - Neutral                           34 083                 82.8%
2 - Provocative                        4 734                 11.5%
3 - Offensive                          1 563                  3.8%
4 - Moderately hateful                   509                  1.2%
5 - Hateful                              250                  0.6%
Total                                 41 139                  100%
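The percentage column can be re-derived from the raw category counts:

```python
# Re-deriving the percentages in Table 6.4 from the raw category counts.
counts = {
    "1 - Neutral": 34_083,
    "2 - Provocative": 4_734,
    "3 - Offensive": 1_563,
    "4 - Moderately hateful": 509,
    "5 - Hateful": 250,
}
total = sum(counts.values())  # 41 139
percent = {name: round(100 * n / total, 1) for name, n in counts.items()}
```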
The distribution of comments in the five distinct categories is also displayed in Figure 6.1.
Figure 6.1.: The distribution of comments in each category in the Norwegian dataset
To fit the problem statement of this thesis, it was necessary to separate only between hateful/anomalous comments and everything else. The inter-annotator agreement calculations in Section 4.2.3 indicated that even the expert annotators struggled to agree on the annotation of comments in categories 4 and 5. Based on this and the definition in Section 2.1, both the moderately hateful and hateful comments were included in the anomaly class, whereas categories 1 to 3 formed the normal class. However, since many hate speech detection methods struggle to separate offensive from hateful utterances,
it was decided to compare the performance of the model with and without the inclusion of the offensive class as anomalies. This involves investigating two cases: (1) classes 1, 2 and 3 are normal samples and classes 4 and 5 are anomalous; (2) classes 1 and 2 are normal samples and classes 3, 4 and 5 are anomalous. Both experimental tests from Section 6.1 were conducted using these two cases. Table 6.5 displays the number of comments and percentage of the total when only separating between normal and anomalous samples in the two cases explained above.
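The two binary labellings can be sketched as follows, using the category counts from Table 6.4; the helper name is hypothetical.

```python
# Sketch of the two binary labellings: case 1 treats categories 4 and 5 as
# anomalous, case 2 treats categories 3, 4 and 5 as anomalous.
def is_anomaly(category, case):
    return category >= (4 if case == 1 else 3)

counts = {1: 34_083, 2: 4_734, 3: 1_563, 4: 509, 5: 250}  # from Table 6.4
case1_anomalies = sum(n for c, n in counts.items() if is_anomaly(c, case=1))
case2_anomalies = sum(n for c, n in counts.items() if is_anomaly(c, case=2))
```

These sums reproduce the anomaly counts shown in Table 6.5.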
Table 6.5.: Preprocessed combined annotated Norwegian dataset with and without the inclusion of the offensive class as anomalies.
             Case 1: 4+5 are anomalies     Case 2: 3+4+5 are anomalies
Category     #comments      % of total     #comments      % of total
Normal          40 880           98.2%        38 817           94.4%
Anomalies          759            1.8%         2 322            5.6%
Total           41 139            100%        41 139            100%
At the beginning of every experiment, the dataset is separated into training, validation and test sets using stratified splits. In other words, the data are divided so that each set has approximately the same distribution over the different classes.
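A stratified split can be sketched as follows; this is a minimal illustrative helper under assumed 70/10/20 fractions, not the thesis implementation.

```python
import random

# Minimal stratified-split sketch: the samples of each class are shuffled and
# divided with the same fractions, so every split keeps approximately the
# original class distribution.
def stratified_split(samples, labels, fractions=(0.7, 0.1, 0.2), seed=0):
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    splits = [[] for _ in fractions]
    for items in by_class.values():
        rng.shuffle(items)
        start = 0
        for i, frac in enumerate(fractions):
            # the last split takes the remainder so nothing is lost to rounding
            end = len(items) if i == len(fractions) - 1 else start + round(frac * len(items))
            splits[i].extend(items[start:end])
            start = end
    return splits  # train, validation, test

data = list(range(100))
labels = [0] * 90 + [1] * 10  # 90% normal, 10% anomalous
train, validation, test = stratified_split(data, labels)
```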