
3. Related Work

3.5. Classification methods

Existing work within the field of hate speech detection can be divided into two categories: classic methods and deep learning methods. Classical methods were long dominant, but more recently neural networks and deep learning methods have tended to outperform them, as stated by Schmidt and Wiegand (2017). This section presents the current state of the art in classification methods, covering both classic methods and deep learning.

Classic methods include support vector machines (SVM), naïve Bayes (NB), logistic regression (LR), gradient boosted decision trees (GBT), random forest (RF) and classic NLP feature engineering. Although deep learning is the most popular research direction today, some classical methods remain competitive, and they are often used as baselines for deep learning methods.

As stated by Schmidt and Wiegand (2017), LR and SVM have been the most popular choices among the classic methods for hate speech. Waseem and Hovy (2016) used LR in combination with extra-linguistic features and character n-grams to detect hate speech, while Gaydhani et al. (2018) proposed an approach combining n-grams and TF-IDF with NB, LR and SVM, where LR outperformed the state-of-the-art methods of the time. However, these results have proven difficult to reproduce, and it turned out that 74% of the test data were either already in the training data or duplicates (Isaksen, 2019). Davidson et al. (2017) first used LR to reduce the data dimensionality before testing a variety of models such as NB, decision trees, RF and linear SVMs. They found that LR and linear SVMs performed significantly better than the other models. Even though the best-performing model achieved a high F1-score, it still misclassified almost 40% of the hate speech instances in the dataset.

Lee et al. (2018) used the Founta et al. (2018) dataset to create the first baseline model using the most frequently studied classic machine learning and deep learning methods. They used BoW, TF-IDF and n-grams as features and experimented with NB, LR, SVM, RF and GBT. For the neural network-based models, they used CNN and RNN, and for their variant models, a pre-trained GloVe word embedding model. The models performed more or less equally, with F1-scores ranging from 0.73 to 0.805, where the RNN performed best.
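The character n-gram and TF-IDF features used in these classic pipelines can be sketched as follows (a minimal illustration with simplified tokenisation and idf weighting, not the cited authors' code):

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Slide a window of n characters over the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tf_idf(docs):
    """Weight each character trigram by term frequency times
    inverse document frequency, per document."""
    tokenised = [char_ngrams(d.lower()) for d in docs]
    df = Counter()  # number of documents each trigram occurs in
    for toks in tokenised:
        df.update(set(toks))
    weights = []
    for toks in tokenised:
        tf = Counter(toks)
        weights.append({t: (c / len(toks)) * math.log(len(docs) / df[t])
                        for t, c in tf.items()})
    return weights
```

The resulting weight vectors would then be fed to a classifier such as LR or a linear SVM; real systems typically add word n-grams, smoothing and normalisation on top of this.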

It is quite common to combine different machine learning classifiers. Burnap and Williams (2015) used Bayesian LR, GBT, SVM and an ensemble method to detect cyber hate speech. MacAvaney et al. (2019) proposed a multi-view SVM approach (combining multiple SVMs) that achieved near state-of-the-art performance while being simpler and producing more easily interpretable decisions than neural methods. Malmasi and Zampieri (2017) and Robinson et al. (2018) looked into using SVM in combination with typical NLP features and achieved acceptable results, while Nobata et al. (2016) and Sharma et al. (2018) used supervised models combining various surface and linguistic features.
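At its simplest, combining classifiers of the kind mentioned above amounts to a majority vote over their predicted labels (an illustrative sketch, not any cited author's implementation):

```python
from collections import Counter

def majority_vote(labels):
    """Return the label predicted by the largest number of classifiers;
    ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]
```

More elaborate ensembles, such as the multi-view SVM, instead combine decision scores or intermediate representations rather than final labels.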

As previously mentioned, supervised machine learning methods are often used as a baseline when testing deep learning methods. Fagni et al. (2019), Z. Zhang et al. (2018) and Badjatiya et al. (2017) used SVM, NB and LR as baselines in their deep learning approaches. Even though classical methods are now mostly outperformed by deep learning methods, some recent papers find SVM and LR models outperforming current deep learning architectures. Biesek (2019) tested SVM with TF-IDF vectors, a bidirectional gated recurrent unit (GRU) and a contextual string embeddings model, and found SVM outperforming the other methods. The likely reason is that the dataset was small and unbalanced, which can make more complicated models overfit. Classical methods are often preferred for their simplicity and when less data is available.

In the most recent papers, deep learning with convolutional neural networks (CNN), recurrent neural networks (RNN), Long Short-Term Memory (LSTM) and combinations of methods has been the preferred approach. Zampieri et al. (2019), Badjatiya et al. (2017), Z. Zhang et al. (2018) and Fagni et al. (2019) all showed that deep learning methods outperform the more classical machine learning methods. Gambäck and Sikdar (2017) proposed using CNNs to classify hate speech by training four different CNN models on character n-grams, word vectors, and combinations of those. Park and Fung (2017) performed one-step and two-step classification of abusive language: first detecting whether a tweet was hateful or not, and afterwards using a different classifier to detect whether it was racist or sexist. They, like Gambäck and Sikdar (2017), used a HybridCNN model, a CNN variant that uses both words and characters for classification. For the one-step method, their proposed HybridCNN performed best, and for the two-step approach, combining two LR classifiers performed as well as the one-step HybridCNN.

This was surprising considering that LR used fewer features than the HybridCNN. In addition, using HybridCNN for the first step and LR for the second step worked better than using HybridCNN alone. Z. Zhang et al. (2018) introduced a new method based on a deep neural network combining CNN and GRU, where word embeddings are first fed into the CNN, which produces input vectors for the GRU. They evaluated it against several baselines and state-of-the-art methods on a large collection of publicly available datasets, outperforming the baseline methods on six out of seven datasets.
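The CNN-to-GRU structure described by Z. Zhang et al. (2018) can be sketched roughly as below; all layer sizes and hyperparameters here are illustrative assumptions, not the authors' settings:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_cnn_gru(vocab_size=20000, embed_dim=100, n_classes=2):
    """Word embeddings -> 1D convolution -> pooling -> GRU -> classifier."""
    return models.Sequential([
        layers.Embedding(vocab_size, embed_dim),   # word embeddings
        layers.Conv1D(100, 4, activation="relu"),  # extract n-gram-like patterns
        layers.MaxPooling1D(4),                    # down-sample the feature maps
        layers.GRU(100),                           # learn order dependencies
        layers.Dense(n_classes, activation="softmax"),
    ])

# Two dummy token-id sequences of length 100 give two class distributions.
scores = build_cnn_gru()(np.zeros((2, 100), dtype="int32"))
```

The convolution plays the role of the n-gram feature extractor, while the GRU consumes the pooled feature maps in sequence order, matching the intuition the authors give for combining the two.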

Mehdad and Tetreault (2016) experimented with RNNs and outperformed previous state-of-the-art methods. Pavlopoulos et al. (2017) used an RNN in combination with GRUs to further improve on previous state-of-the-art approaches, which used LR or other machine learning methods with features such as character or word n-grams. They also beat a standard CNN using word embeddings. Badjatiya et al. (2017) experimented with a supervised learning model based on deep learning architectures, testing multiple classifiers such as LR, RF, SVMs, fastText, CNNs and LSTMs. The best result was achieved using the LSTM model, assisted by gradient boosted decision trees and features extracted from character n-grams. Their methods outperformed previous state-of-the-art methods such as character and word n-gram methods. However, other researchers have failed to reproduce their best-performing experiments, stating that their cross-evaluation method had a bug which increased the final F1-score in each iteration


(Fortuna et al., 2019). However, Gröndahl et al. (2018) showed that competitive results were still achieved without these implementation issues. De Gibert et al. (2018) tried different machine learning techniques such as SVM, CNN and LSTM. The LSTM-based classifier obtained better results, but the SVM model still achieved decent results. Founta et al. (2019) experimented with different types of RNN architectures: GRUs, LSTMs and bidirectional RNNs. They found that the simple GRUs performed as well as more complex units.

The aforementioned methods use RNNs and CNNs separately, but as Z. Zhang et al. (2018) state, in theory, combining them should prove more powerful than relying solely on either. In hate speech detection, a CNN is useful for extracting word or character combinations (word embeddings), while an RNN learns word or character dependencies (order information). They hypothesise that a combined structure can be more effective, as it may capture co-occurring word n-grams as useful patterns for classification.

Pitsilis et al. (2018) proposed a detection scheme that is an ensemble of RNN classifiers and that also incorporates various user features. Their solution achieved higher classification quality than the state-of-the-art algorithms of the time, including that of Badjatiya et al. (2017). Furthermore, Fagni et al. (2019) looked into six different machine learning classification strategies, three classic and three deep learning. They compared SVM, NB and LR to CNN, GRU and an ensemble, and concluded that the best classification results were obtained through deep learning techniques, and the ensemble in particular. H. Liu et al. (2019) proposed a fuzzy multi-step classification method that was compared to SVM, CNN and LSTM, beating the then state of the art. Meyer and Gambäck (2019) proposed an optimised architecture for detecting hate speech by combining CNN and LSTM networks, utilising both character n-grams and word embeddings to produce the final classification. Using the Waseem and Hovy (2016) dataset, they outperformed all previous state-of-the-art approaches with an F1-score of 0.792.

Schmidt and Wiegand (2017) stated that no comparative studies exist that would allow a judgement on the most effective learning method. However, several studies do compare the performance of different classification methods.

Burnap and Williams (2015) carried out a comparative study which concluded that an ensemble method seemed most promising. Still, since this method may only work well for that exact dataset and feature set, it does not prove that it is the ideal approach for every hate speech problem.

As hate speech detection is experiencing an increase in popularity, SemEval 2019, a yearly international workshop on semantic evaluation, had two tasks regarding hate speech detection. Zampieri et al. (2019) presented the results and main findings for the task of identifying and categorising offensive language in social media, which comprised three sub-tasks. One hundred and four teams submitted results for the first sub-task, where the goal was to discriminate between offensive and non-offensive posts. The most popular models involved deep learning approaches, and within these, there was a variation of

3.6. Hate speech detection for non-English languages