Detecting threats of violence in online discussions
Aksel Wester
Master’s Thesis Spring 2016
May 16, 2016
Abstract
This thesis presents research and experiments on the task of threat detection in online discussions, using natural language processing in a machine learning approach. We use an existing data set of manually annotated YouTube comments, which we enrich with linguistic information.
We create classifiers for the detection of threats of violence, by extracting linguistic features from the data set, as well as from external sources.
The data set we use contains 28,000 manually annotated sentences from 10,000 YouTube comments. The work performed on the data set for this thesis includes restructuring it so that the data has both a comment level and a sentence level. We enrich the data set, the YouTube threat corpus, with lexical information and with morphosyntactic information obtained through PoS-tagging and dependency parsing, and convert it to the CoNLL format.
We use the YouTube threat corpus to extract lexical and morphosyntactic features, which we use when constructing classifiers for the detection of threats of violence. We also create features using WordNet synsets and Brown clusters. Through our exhaustive experiments, we test the addition of these features to various classification models. We also test and evaluate multiple classification frameworks for this task.
Through our empirical testing, we conclude that lexical features make the best classifiers for the task of threat detection. We also conclude that using these lexical representations of tokens in bag-of and n-gram feature templates results in better classification models than using the dependency triple feature template. We also show that morphosyntactic and semantic features are not able to outperform feature sets with lexical representations of tokens, either in bag-of representations or as backoff from n-grams or dependency triples.
Acknowledgments
This thesis is the result of 18 months of research, experiments and writing.
There have been many long hours and late nights, and I could not have completed this project without the help and support of multiple people.
First, I would like to thank Erik Velldal and Lilja Øvrelid, my two excellent supervisors. I could not have written this thesis without you, and I am very grateful for our cooperation over the last one and a half years.
You helped shape this work into a proper research project that I am very proud of. You have always been sincere and generous in your praise, and honest and direct in your feedback. Because of this, you have made me a better writer. I feel truly lucky to have had the best supervisors I could have wished for.
I also want to acknowledge Hugo Lewi Hammer, for your work with threat detection that inspired our research, and for compiling the data set we used as a basis for the YouTube threat corpus that we use in this thesis. And I want to thank the anonymous peer reviewers at the WASSA workshop for your invaluable feedback on the article summarizing this thesis.
My friends and my family have also been wonderful throughout my work on this project. You have been curious and genuinely interested in my work, and I really appreciate it. In particular I have to thank Kjerstin Wester and Hans Erik Sørensen for your feedback on this thesis. And Johanne Håøy Horn and Charlotte Kjøge Wilhelmsen for taking time out of your own studies to read this thesis. Thank you very much.
Lastly I have to thank Julie Formo. Your patience and support through this project has been extraordinary, you are unselfish and kind, and I could not have done this without you. Thank you.
Contents
1 Introduction
2 Background
2.1 Review of related literature
2.1.1 Bullying and threats of violence in YouTube comments
2.1.2 Detecting threats in Dutch tweets
2.1.3 Detecting hate speech
2.2 Discussion
3 The YouTube threat corpus
3.1 The data set
3.2 Preprocessing
3.2.1 Sentence splitting
3.2.2 Normalization and tokenization
3.2.3 Lemmatization
3.2.4 PoS-tagging
3.2.5 Dependency parsing
3.2.6 The CoNLL-format
3.2.7 Preprocessing example
4 Experimental setup
4.1 Features
4.1.1 Lexical and morphosyntactic features
4.1.2 Semantic features
4.2 Feature templates
4.2.1 Bag-of features
4.2.2 n-grams
4.2.3 Dependency triples
4.2.4 Backoff
4.3 Data split
4.4 Classifiers
4.5 Evaluation
4.6 Tuning
4.7 Performing an experiment
4.7.1 Feature extraction
4.7.2 10-fold split and feature reduction
4.7.3 Tuning and evaluation
5 Experiments and results
5.1 Baseline system
5.2 Lexical features
5.2.1 n-grams
5.2.2 n-gram backoff
5.2.3 Dependency triples
5.2.4 Dependency backoff
5.3 Morphosyntactic features
5.3.1 Bag-of features
5.3.2 n-gram backoff
5.3.3 Dependency backoff
5.4 Semantic features
5.4.1 Bag-of features
5.4.2 n-gram backoff
5.4.3 Dependency backoff
5.5 Constructing a final model
5.5.1 Lemma n-gram backoff
5.5.2 Dependency bigram backoff
5.5.3 Brown trigram backoff
5.5.4 Development results
5.6 Error analysis
5.7 Held-out results
6 Conclusion
List of Figures
3.1 Examples of comments from the YouTube threat corpus. Sentences containing a threat are denoted with a 1, while a 0 denotes a non-threat.
3.2 Examples of comments incorrectly split by spaCy. An x denotes an instance where spaCy separated one annotated threat into two or more sentences.
3.3 The dependency tree of the example sentence used to demonstrate preprocessing. The period is removed to conserve space.
4.1 The example sentence that we will use when describing our feature types.
4.2 The dependency tree derived from the example sentence.
4.3 The dependency tree parsed from an example sentence from the data set, with assigned uPoS-tags, PTB-tags, synset labels, parent synset labels and Brown 100 cluster labels.
4.4 F-scores of the bag-of-words feature set, dependent on different C-values for the MaxEnt classifier tested during tuning.
4.5 The print-out of the Evaluation object for the bag-of-words feature set, using the MaxEnt classifier, after tuning.
5.1 Examples of false positives; sentences that were annotated as non-threats, but classified as threats by the lemma n-grams SVM model. The false positive in each sentence is in bold.
5.2 Examples of false negatives; sentences that were annotated as threats, but classified as non-threats by the lemma n-grams SVM model. The false negative in each sentence is in bold.
List of Tables
3.1 The number of comments, sentences and users in the YouTube threat corpus.
3.2 Number of threats of violence posted by users.
3.3 The number of comments that were split identically using the manual and the automatic sentence splitting. The table also shows the number of comments that were split into more sentences by the automatic sentence splitter (More), fewer sentences by the automatic splitter (Fewer), and the number of comments split into the same number of sentences by both methods, but split at different places in the comment (Differently).
3.4 The information derived during preprocessing, formatted using our extended CoNLL-format. We have removed columns 6, 9 and 10 in this table, since they only contain underscores.
4.1 Example of the sentence from Figure 4.1 in the CoNLL format. Columns F, I and J are not shown, since they only contain underscores.
4.2 Coverage for WordNet synsets in the development set. Type is the number of unique words in the development set, and Token is the number of tokens. Coverage is the percentage of Types or Tokens that have synsets in WordNet.
4.3 Coverage for WordNet synsets with parents (hypernyms or causes) in the development set. Coverage is the percentage of Types or Tokens that have synsets in WordNet.
4.4 Coverage for WordNet synsets with grandparents (hypernyms of hypernyms, or causes of causes) in the development set. Coverage is the percentage of Types or Tokens that have synsets in WordNet.
4.5 Coverage for Brown clusters in the development set. Type is the number of unique words in the development set, and Token is the number of tokens. Coverage is the percentage of Types or Tokens that are represented in the Brown cluster data set. The coverage is not dependent on the number of clusters in a clustering, since the same corpus is used for every clustering.
4.6 The last sentence from Figure 3.1 shown with uPoS tags. (The last word is parsed incorrectly).
4.7 The number of comments and sentences in the development and held-out test sets after our split of the data set.
4.8 The number of comments and sentences in the development subsamples used for 10-fold cross-validation.
4.9 True and false positives and negatives, in a confusion matrix.
4.10 The number and percentage of sentences in the threat and non-threat classes in the entire YouTube threat corpus.
5.1 F-score for the baseline experiments, with feature sets consisting of bag-of-words, bag-of-lemmas, bigrams and trigrams, using the Maximum Entropy (MaxEnt), linear Support Vector Machine (SVM), and Random Forest (RF) classifiers.
5.2 Tuning time for each of the feature sets and classifiers from Table 5.1. Tuning time is in minutes and seconds for MaxEnt and SVM, and weeks for Random Forest (RF).
5.3 F-scores of the different combinations of bag-of-words and bag-of-lemmas. The best result is in bold.
5.4 F-scores of the bag-of-words (BoW) feature set, with different combinations of the other feature sets tested in Table 5.1, namely bag-of-lemmas (BoL) and n-grams of word forms. The best overall result is in bold.
5.5 Bag-of-words, bag-of-lemmas and different combinations of n-grams with lemma backoff. Each backoff combination is the one that achieved the best result.
5.6 Bag-of-words, bag-of-lemmas, dependency triples and different combinations of n-grams.
5.7 Bag-of-words, bag-of-lemmas, dependency triples, with and without lemma backoff, and different combinations of n-grams. The results are the best variants of dependency backoff for each of the feature sets.
5.8 Results of bag-of-words, regular bag-of, and lexicalized bag-of (Bag-ofLex) with the morphosyntactic feature types. The best F-score for each feature type is in bold.
5.9 F-scores of the bag-of-words (BoW) and bag of features (BoF) feature sets, with different combinations of n-grams.
5.10 F-scores for the feature sets with bag-of-words and bag-of lexicalized features (Bag-ofLex), with different combinations of n-grams.
5.11 F-scores of the feature sets with bag-of-words (BoW), bag-of-lemmas (BoL), and bag-of with morphosyntactic features, with different combinations of n-grams.
5.12 F-scores of the feature sets with bag-of-words (BoW), bag-of-lemmas (BoL), and bag-of with lexicalized morphosyntactic features, with different combinations of n-grams.
5.13 Bag-of-words and different combinations of n-grams with morphosyntactic feature backoff. Each backoff combination is the one that achieved the best result.
5.14 Bag-of-words, bag-of with morphosyntactic features and different combinations of n-grams with morphosyntactic feature backoff. Each backoff combination is the one that achieved the best result.
5.15 Bag-of-words, bag-of with lexicalized morphosyntactic features and different combinations of n-grams with morphosyntactic feature backoff. Each backoff combination is the one that achieved the best result.
5.16 Bag-of-words, bag-of-lemmas, bag-of with morphosyntactic features and different combinations of n-grams with morphosyntactic feature backoff. Each backoff combination is the one that achieved the best result.
5.17 Bag-of-words, bag-of-lemmas, bag-of with lexicalized morphosyntactic features and different combinations of n-grams with morphosyntactic feature backoff. Each backoff combination is the one that achieved the best result.
5.18 Bag-of-words, bag-of-lemmas and different combinations of n-grams with morphosyntactic feature backoff. Each backoff combination is the one that achieved the best result.
5.19 Bag-of-words, bag-of-lemmas, the different n-gram combinations, and dependency triples with dependent backoff to the different morphosyntactic feature types.
5.20 Bag-of-words, bag-of-lemmas, the different n-gram combinations, and dependency triples with head backoff to the different morphosyntactic feature types.
5.21 Bag-of-words, bag-of-lemmas, the different n-gram combinations, and dependency triples with full backoff to the different morphosyntactic feature types.
5.22 Bag-of-words, bag-of-lemmas, dependency tuples and the different n-gram combinations.
5.23 Bag-of-words, bag-of-lemmas, the different n-gram combinations and dependency tuples with backoff to the morphosyntactic features. The results are the best variants of dependency backoff for each of the feature sets.
5.24 Results of bag-of-synset variants with bag-of-words and bag-of-lemmas. Synset p is the synset parent, and synset gp is the synset grandparent. The best F-score for each synset variant is in bold.
5.25 Results of bag-of-clusters variants with bag-of-words and bag-of-lemmas. The number after the capital B is the number of clusters used in each clustering that the cluster labels are taken from. The best F-score for each Brown cluster variant is in bold.
5.26 Bag-of-words, bag-of-lemmas, bag-of-synsets and the different n-gram combinations. The best F-score for each synset variant is in bold.
5.27 Backoff is the max of the different combinations of backoff
5.28 Bag-of-words, bag-of-lemmas, bag-of-synsets and the different combinations of n-gram backoff. Each line presents the result of the best backoff variant for that backoff type, and the best F-score of each of the synset variants is in bold.
5.29 Bag-of-words, bag-of-lemmas, bag-of Brown cluster labels and the different combinations of n-gram backoff. Each line presents the result of the best backoff variant for that backoff type, and the best F-score of each of the Brown cluster variants is in bold.
5.30 Bag-of-words, bag-of-lemmas and the different combinations of synset n-gram backoff. Each line presents the result of the best backoff variant for that backoff type.
5.31 Bag-of-words, bag-of-lemmas and the different combinations of n-gram backoff with Brown cluster labels. Each line presents the result of the best backoff variant for that backoff type.
5.32 Bag-of-words, bag-of-lemmas, the different n-gram combinations, and dependency triples with dependent backoff to the different synset variants.
5.33 Bag-of-words, bag-of-lemmas, the different n-gram combinations, and dependency triples with head backoff to the different synset variants.
5.34 Bag-of-words, bag-of-lemmas, the different n-gram combinations, and dependency triples with full backoff to the different synset variants.
5.35 Bag-of-words, bag-of-lemmas, the different n-gram combinations, and dependency triples with dependent backoff to the different Brown cluster label variants.
5.36 Bag-of-words, bag-of-lemmas, the different n-gram combinations, and dependency triples with head backoff to the different Brown cluster label variants.
5.37 Bag-of-words, bag-of-lemmas, the different n-gram combinations, and dependency triples with full backoff to the different Brown cluster label variants.
5.38 Bag-of-words, bag-of-lemmas, word form bigrams and trigrams, and all combinations of lemma bigram backoff.
5.39 Bag-of-words, bag-of-lemmas, word form trigrams, and the different lemma bigram backoff variants. Each feature set includes only one bigram variant.
5.40 Bag-of-words, bag-of-lemmas, one bigram variant and one trigram backoff variant, without regular word trigrams.
5.41 Bag-of-words, bag-of-lemmas, one bigram variant and one trigram backoff variant, with regular word trigrams.
5.42 Bag-of-words, bag-of-lemmas, regular word trigrams and <word, lemma, lemma> trigrams, and all combinations of lemma bigrams and bigrams with dependency backoff.
5.43 Bag-of-words, bag-of-lemmas, lemma bigrams and the different trigram combinations using Brown 1000 cluster labels as backoff.
5.44 Bag-of-words, bag-of-lemmas, lemma bigrams and the different trigram combinations using Brown 3200 cluster labels as backoff.
5.45 Precision, recall and F-score for the classifications of the feature sets using the MaxEnt classifier.
5.46 Precision, recall and F-score for the classifications of the feature sets using the SVM classifier.
5.47 The mean number of features in each subsample of the development set, after count-based feature reduction.
5.48 Precision, recall and F-score on the comment level, for the baseline, lexical n-grams, and lemma n-grams.
5.49 Baseline system. The true and false positives and negatives of the bag-of-words feature set using the MaxEnt classifier. The numbers are the sum of each of the 10 classifications, when performing 10-fold cross-validation on the development set.
5.50 Lexical n-grams. The true and false positives and negatives of the best model before the final model construction, the lexical n-grams feature set, using the SVM classifier. The numbers are the sum of each of the 10 classifications, when performing 10-fold cross-validation on the development set.
5.51 Lemma n-grams. The true and false positives and negatives of our best model, the lemma n-grams feature set using the SVM classifier. The numbers are the sum of each of the 10 classifications, when performing 10-fold cross-validation on the development set.
5.52 Precision, recall and F-score on the held-out test set for the baseline system, bag-of-words; the best performing feature set before the final feature set experiments, lexical n-grams; and the best performing feature set after the final feature set experiments, lemma n-grams.
Chapter 1
Introduction
Threats of violence are a common occurrence in online discussions. Threats disproportionately affect women and minorities, often to the point of effectively excluding them from discussions online. Social networks operate on such a large scale that it is an insurmountable task for moderators to manually read all posts. Methods for automatically detecting threats could therefore be very helpful, to both moderators and users of social networks.
Law enforcement and intelligence agencies could also benefit from the automatic detection of threats of violence. The threat of terrorism has grown in recent years, and many terrorists communicate and express themselves on social media. Automatic detection of threats of violence against groups or places could help in counter-terrorism work, both in terms of preventive measures, such as outreach to people who express extremist views, and in the detection of actual terrorism plots.
This thesis describes our experiments into the task of detecting threats of violence in online discussions. We use natural language processing and a machine learning approach to the task, by creating linguistically informed features that we utilize to construct a classifier for detecting threats of violence.
For our classification task, we use an existing data set consisting of 28,000 sentences from 10,000 YouTube comments. Each sentence is manually annotated according to whether it contains a threat of violence or not. Through preprocessing, we enrich the data set with lexical and morphosyntactic information about the text, in the form of lemmas and PoS-tags for each of the tokens; and information about the structure of sentences, in the form of dependency graphs corresponding to the syntactic relationships of the tokens in the sentences.
As features in our classifier, we use both lexical and morphosyntactic information from our data set. We also use semantic information from external sources to create features. We derive semantic features from WordNet, which is a large lexical database where semantically similar concepts are grouped together in synsets, which we can use to extract the higher-level semantic concepts a token represents. We also derive semantic features from Brown clusters, which are clusters of the words in a large corpus of news text, where words are clustered according to their semantic similarity.
We conduct exhaustive testing of the different feature types, and we experiment with different classification frameworks. We perform experiments in order to evaluate the effects of individual feature types, and ways of using those feature types to construct features of varying complexity. We experiment with bag-of features, n-gram features and dependency triples. We also experiment with backoff from lexical representations of tokens, where we instead represent tokens as other types of features.
A summary of the results we present in this thesis will also be presented at, and published in the proceedings of, the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, at the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies in San Diego (Wester, Øvrelid, Velldal, & Hammer, 2016).
One point to note is that this thesis concerns the detection of written threats of violence in online discussions. While a threat of violence may be followed by an actual act of violence, this is not necessarily the case. Furthermore, the aim of this thesis is not to determine the veracity of threats of violence, since that is outside the scope of both our methods and our data set. Our experiments merely aim to detect threats of violence, which, in a real-world application of our methods, would only serve as the first step in dealing with the threat.
In Chapter 2 we will review the literature on the research that has been done within the field, using text mining or natural language processing to detect threats of violence, or similar concepts, in online discussions. We will focus on tasks similar to the one presented in this thesis, while we will examine more general natural language processing techniques in Chapter 3 and Chapter 4. In Chapter 3 we will also examine the data set we use for our experiments, and the ways we enrich the data set, converting the data from a set of plain text, annotated comments, to a corpus with rich lexical and morphosyntactic information.
In Chapter 4 we describe the features we use in our classification models. We also examine how we perform and evaluate experiments, and how we have implemented this experimental setup in our code.
We describe the classification frameworks we experiment with, how we evaluate the performance of individual classifiers, and how we tune those classifiers to ensure that we achieve the best possible results from each feature set.
Chapter 5 describes all our experiments and their results, where we will evaluate the effects of using the different types of information in our data set, as well as features derived from outside sources, for the task of threat detection. We will also construct a final model for detecting threats of violence, which we will evaluate in comparison to a baseline model.
Chapter 2
Background
2.1 Review of related literature
In this chapter we will review the literature concerning the detection of threats of violence online, or research aiming to detect similar concepts.
We will examine more general language technology techniques in Chapter 4. There has only been a modest amount of research published concerning the task of threat detection. However, most of the literature we found was published relatively recently, and it appears that threat detection is an area that is garnering an increasing amount of attention.
Some of the literature describes methods of detecting other concepts than threats of violence, like cyberbullying or hate speech. These articles are still relevant to our research, and some of them also use data sets from sources similar to ours. The first set of articles describes research similar to the research in this thesis, in that the articles also use data sets of YouTube comments. The second set of articles describes the detection of threats of violence, using data sets consisting of collections of Dutch tweets. The last article concerns the detection of hate speech, which, in some ways, is similar to the detection of threats of violence.
2.1.1 Bullying and threats of violence in YouTube comments
Hammer (2014) describes a method of using machine learning to detect threats of violence, from a data set of YouTube comments. The data set compiled by Hammer (2014) is the same that we will use as the basis for our data set, although it has been changed slightly since the publication of his initial study. The method described in the article uses logistic LASSO regression analysis on bigrams (skip-grams) of important words to classify sentences as threats of violence or not.
The data set used in Hammer (2014) serves as the basis for the YouTube threat corpus, the data set we use for the experiments in this thesis, which we describe in Chapter 3. However, there are some differences between the data set described in the article and the version of it that we work with. In the article's version, sentences are treated as separate and independent, not as belonging to a comment or a user. Some annotations also differ from the YouTube threat corpus, and the article's data set is slightly smaller, containing 24,840 sentences, compared to the 28,643 sentences in our data set.
The method described in the article uses a set of important words that are correlated with threats of violence. The features are bigrams of two of these important words observed in the same sentence. The article does not describe exactly how these important words were selected, stating only that words were chosen that were significantly correlated with the response (violent/non-violent sentence). However, it appears likely that the words were arrived at using LASSO regression.
The features used in the research were skip-grams, i.e. bigrams defined for non-contiguous words in a sentence. An example of this from the sentence “We love to kill Muslims” is the skip-gram ’we-kill’. The words are not necessarily contiguous, but the same combination can nonetheless be seen in many threats of violence. A weight function was used to decrease the relevance of a feature the farther apart the two important words were. The reason for using these feature templates seems to be to achieve rough approximations of predicate structures in the sentences, without performing syntactic parsing of the data set.
The logistic LASSO regression analysis used in the research has the side effect of performing an implicit feature selection while estimating the model. The regression analysis does this by giving weights to the features, and by giving non-zero weights to only the features most correlated with the different classes. The model estimation resulted in between 400 and 1,000 non-zero features for the different feature sets.
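The implicit feature selection can be illustrated with the soft-thresholding step that coordinate-descent and proximal LASSO solvers typically apply to each weight. The sketch below is only an illustration: the article does not state which solver was used, and the feature names, weights and penalty value are hypothetical.

```python
def soft_threshold(weight: float, penalty: float) -> float:
    """Shrink a weight toward zero; weights within the penalty become exactly 0."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


# Hypothetical unpenalized weights for four skip-gram features.
raw_weights = {"we-kill": 1.8, "i-shoot": 0.9, "the-video": 0.05, "you-think": -0.1}
penalty = 0.25  # hypothetical L1 regularization strength

# Apply the L1 shrinkage to every weight.
sparse_weights = {name: soft_threshold(w, penalty) for name, w in raw_weights.items()}

# Only features with a non-zero weight remain in the model.
selected = [name for name, w in sparse_weights.items() if w != 0.0]
print(selected)
```

Weakly informative features are set to exactly zero, which is why the estimated model ends up with a limited number of non-zero features.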
The method tested 7 different feature sets that were combinations of weight functions and skip-grams. The one that yielded the best results, ACW1, is described here.
The feature set is comprised of skip-grams consisting of all combinations of two important words, as well as a weight function for those features. The weight function is rather straightforward, as described below:
\[ w_1(d) = \frac{1}{d + 1} \]
where $d$ is the number of words between the two important words in the skip-gram. The weight function does not take word order into account.
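As a minimal sketch of how such weighted skip-gram features could be extracted, the code below pairs every two important words in a sentence and weights each pair by 1/(d + 1), where d is the number of intervening words. The set of important words is hypothetical, and keeping the maximum weight when a pair occurs more than once is our own assumption; the article does not specify either detail.

```python
from itertools import combinations


def w1(d: int) -> float:
    """Weight for a skip-gram whose two words have d words between them."""
    return 1.0 / (d + 1)


def weighted_skipgrams(tokens, important_words):
    """Map each pair of important words in the sentence to its weight."""
    hits = [(i, tok) for i, tok in enumerate(tokens) if tok in important_words]
    features = {}
    for (i, first), (j, second) in combinations(hits, 2):
        d = j - i - 1  # number of words between the two important words
        key = f"{first}-{second}"
        # Assumption: if a pair occurs several times, keep its largest weight.
        features[key] = max(features.get(key, 0.0), w1(d))
    return features


tokens = "we love to kill muslims".split()
important = {"we", "kill", "muslims"}  # hypothetical list of important words
features = weighted_skipgrams(tokens, important)
print(features)
```

For the example sentence, the skip-gram ’we-kill’ has two intervening words and therefore receives the weight 1/3, while the adjacent pair receives the full weight 1.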
The article does not report accuracy, precision or recall directly, but we have calculated them from the reported numbers of true and false positives and negatives. Accuracy is 0.9466, which is slightly above 0.9371, the majority-class baseline accuracy for the data set. Precision is 0.5696 and recall is 0.9028, which results in an F-score of 0.6985.
The article concludes that the method used is quite accurate, and more computationally efficient than regular bigrams. It also speculates about whether parsing could yield better results. The author suggests that it may not, due to the poor grammar and sometimes unintelligible sentence structure found in the threats in the data set.
Dinakar, Reichart, and Lieberman (2011) also use machine learning on a data set of YouTube comments. The goal of the research is to detect what is described as “cyberbullying”, which has some resemblance to our goal of detecting threats of violence. The research describes a method of detecting cyberbullying by targeting combinations of profane or negative words, and words related to several predetermined sensitive topics.
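The relationship between these measures can be checked with a few lines of code. The sketch below defines precision, recall and the balanced F-score in the standard way and recomputes the F-score from the precision and recall quoted above. The confusion-matrix counts in the second half are hypothetical and serve only to show the definitions; they are not the counts from the article.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted threats that are actual threats."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual threats that were predicted as threats."""
    return tp / (tp + fn)


def f_score(p: float, r: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Precision and recall as calculated from the article's reported counts.
f1 = f_score(0.5696, 0.9028)
print(round(f1, 4))

# Hypothetical counts, used only to illustrate the definitions.
tp, fp, fn, tn = 8, 2, 2, 88
toy_precision = precision(tp, fp)
toy_recall = recall(tp, fn)
toy_accuracy = (tp + tn) / (tp + fp + fn + tn)
```

Because the F-score is a harmonic mean, it is pulled toward the lower of the two values, which is why a recall above 0.90 combined with a precision below 0.57 still yields an F-score just under 0.70.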
Their data set consists of over 50,000 YouTube comments taken from videos about controversial topics, such as sexuality, race, culture and intelligence. The comments were grouped by video topic, and then 12 % of comments were manually annotated to check that they were placed in the right category.
The first stage of the detection method was the same across all categories. It consisted of using a lexicon of negative words and a list of profane words, as well as part-of-speech tags from the training data that were correlated with bullying. The second stage was category-specific, and used commonly observed uni- and bigrams from each category as features.
The experiments reported accuracies from 0.63 to 0.80, but did not report precision or recall.
The article does not go into detail about key aspects of how the experiments were set up, which makes the research difficult to replicate. The ad hoc feature engineering described in the article does not lend itself to reuse in our work, since it is largely undocumented, and does not seem particularly novel. The article does, however, point out some complications that could also apply to our research, namely detecting euphemisms and metaphors. Another aspect that could be useful to our research is the use of the Ortony Lexicon, which is a taxonomy of words of emotion.
The Ortony Lexicon, or some other taxonomy of semantic concepts, like WordNet, could help in deriving semantic meaning from words.
2.1.2 Detecting threats in Dutch tweets
The next set of articles we examine, describe methods of detecting threats of violence in user-generated text. The first article describes a method of usingn-grams in two different ways to detect threats of violence in Dutch tweets (Oostdijk & van Halteren, 2013a).
The first method uses manually constructed recognition patterns, in the form ofn-grams. Then-gram variants used in the methods are regular uni- , bi- and trigrams, as well as skip bi- and trigrams. The articles do not go into detail about the methods used to construct these patterns, stating that the researchers relied on their linguistic intuition as speakers of Dutch. The article does state that the patterns usually included a verb, and also that they included negativen-grams to cancel out idioms using violent words in non-violent settings. An example in English would for instance be the phraseto kill time.
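The idea of negative n-grams cancelling out idiomatic uses can be sketched as follows. The patterns are hypothetical English examples of our own, not the Dutch patterns from the article, and the scoring rule (matched threat patterns minus matched negative patterns) is an assumption, since the article does not describe how matches were combined. Skip-grams are left out for brevity.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical recognition patterns.
THREAT_PATTERNS = {("kill",), ("will", "shoot")}
# Hypothetical negative patterns for idioms with violent words.
NEGATIVE_PATTERNS = {("kill", "time")}


def is_threat(text: str) -> bool:
    """Flag a text when threat patterns outnumber negative patterns."""
    tokens = text.lower().split()
    observed = set()
    for n in (1, 2, 3):  # uni-, bi- and trigrams
        observed.update(ngrams(tokens, n))
    score = len(observed & THREAT_PATTERNS) - len(observed & NEGATIVE_PATTERNS)
    return score > 0


print(is_threat("i will kill you"))
print(is_threat("just trying to kill time"))
```

In the second sentence, the unigram pattern for the violent verb matches, but the negative bigram matches as well, so the sentence is not flagged.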
The second method described in Oostdijk and van Halteren (2013a) uses machine learning to identify n-grams indicative of threats. The system they used, called the Linguistic Profiling system, was built to do author identification, based on over- and underuse of n-grams. The system does not seem particularly well suited for the task in this experiment, as it typically expects a large volume of text, whereas the tweets in the data set are necessarily 140 characters or shorter, with an average of 10 words, according to the article.
The data set used for the experiment consisted of tweets from two different sources. The first was a collection of threatening tweets collected by a website over a period of about two years. All these tweets were presumed to be threats, but were also manually checked against the researchers’ definition of a threat, and tweets not matching the definition were removed. The remaining set contained about 5,000 tweets, where 90 % of the tweets were used for development and the remaining 10 % were used for testing. In addition, a set of 2.3 million random tweets was collected and used for development, and a set of 1 million random tweets was used for testing. These data sets were not annotated in any way prior to being used in the experiments.
The two methods described above were tested on both data sets. After testing on the larger data set, the researchers would manually annotate all tweets that the experiment found to be positive, to check whether they were actual threats of violence, and to then calculate the precision. The recall when classifying the larger data set could not be calculated, as the total number of threats in the sample of random tweets was not known.
Similarly, since the smaller of the two data sets only contained threats, there was no point in calculating precision; however, recall was calculated.
After classification the precisions of the manual pattern construction and the technique using machine learning were 0.12 and 0.07, respectively, and the recall was 0.85 and 0.90. However, these numbers should not be used to calculate an F-score, as they were not from the same data set (Oostdijk & van Halteren, 2013a).
In a follow-up article, the researchers try to improve the results of the system. The main objective is to improve precision, since that was the main problem in Oostdijk and van Halteren (2013a). The same data set is used and, therefore, the experiment is subject to the same limitations regarding the testing and reporting of results. The addition to the system is a shallow parsing step, added after the initial steps from the previous system (Oostdijk & van Halteren, 2013b).
A tweet is first chunked, rather crudely, by simply splitting at certain predetermined punctuation marks and conjunctions.
Then comes a filtering step where certain tweets are placed in a category of unlikely threats, because the chunks that matched threats of violence were, for instance, hashtags or usernames. Finally, the shallow parsing step is performed.
Instead of using an off-the-shelf parser, the researchers handcrafted a set of rules that had to be fulfilled for the tweet to be marked as a threat.
This method required first clustering different verbs that could be used to express the same action. For instance, the article cites different ways of saying to shoot. It is not explained why the approach of using handcrafted rules was taken, nor how the clustering was performed or what resources were used in this step. Finally, the parser checks whether the sentences fit the different rules for the verbs, and classifies the tweets accordingly. After adding this parsing step, the experiment gets an improved precision of 0.39, but at the expense of a drop in recall to 0.59.
Other articles have also discussed detecting threats in Dutch tweets.
The aim of the third article differs from the other research in that its focus is primarily on detecting shifts in behavior on social networks that could possibly relate to protests and demonstrations, and that would be of interest to law enforcement agencies (Bouma, Rajadell, Worm, Versloot, & Wesemeijer, 2012). The article describes a system for detecting what they refer to as abnormal behavior on social networks, where language processing is only a small component of the setup, and not explained very thoroughly. A follow-up article does, however, go into greater detail about the linguistic threat detection aspect of the system (Spitters, Eendebak, Worm, & Bouma, 2014).
The first stage of detecting threatening tweets was to filter out any tweets not containing words from a list of threat triggers, i.e. words that were associated with threats. This list was compiled from their training data by computing a correlation score, and then manually edited by removing unwanted features. These unwanted features were words that, on their own, were not indicative of threats.
The article describes two alternate second stages. The first is a method of classification based on context cues, selected using a correlation coefficient to compute the score of each context word in relation to a trigger.
This gave reinforcing and weakening context cues for the triggers. The second method used classification based on patterns. The Needleman-Wunsch algorithm, an algorithm from bioinformatics mostly used for aligning protein and nucleotide sequences, was used to construct patterns from the threats in the data set. The patterns were applied to the test data, and matches were given a higher score the longer the matching pattern was.
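To make the alignment step more concrete, the following is a minimal sketch of Needleman-Wunsch global alignment scoring applied to token sequences instead of protein or nucleotide sequences. The scoring parameters (match, mismatch and gap) and the function name are our own assumptions for illustration. The sketch only computes the alignment score, and does not reproduce how Spitters et al. (2014) derived patterns from the alignments.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    The scoring values are illustrative assumptions, not the values
    used in the article discussed above.
    """
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j elements of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]


# Hypothetical token sequences: three matching tokens and one gap.
first = ["i", "will", "find", "you"]
second = ["i", "find", "you"]
score = needleman_wunsch_score(first, second)
```

With the assumed scores, the two example sequences align with three matches and one gap, so the resulting score is 2.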
Bouma et al. (2012) and Spitters et al. (2014) use the same type of data sets, with a small sample of annotated threats from one source, and a large unannotated set from another source. This makes the results of these articles difficult to compare, since they are based in part on guessing and assumptions about the unseen parts of the data sets. The best results were a precision of 37 %, which, with an unknown recall, is not very informative.
2.1.3 Detecting hate speech
Warner and Hirschberg (2012) present a method for detecting unwanted or illegal comments in user-generated text from the internet. The article describes a method of detecting hate speech by using a machine learning approach with template-based features. Hate speech differs from threats of violence in that it is harder to define, and whereas single threats rarely span multiple sentences, this could certainly be the case with hate speech. This might make hate speech harder to detect than threats of violence. The first step of the method is to only look at certain categories of hate speech, since the article assumes that all the different types of hate speech use unique expressions and words.
The data set used in the research came from two sources. The first consisted of posts from Yahoo news groups, and the second of web pages collected by the American Jewish Congress that had been identified as offensive. The data set was manually annotated, and then the hate speech was assigned to a category, such as antisemitic, anti-woman, anti-Asian, etc. Further research then focused on the antisemitic category.
The task was approached as a word-sense disambiguation task, since the same words can be used in both hateful and non-hateful contexts.
The features used in the classification were combinations of uni-, bi- and trigrams, part-of-speech-tags and Brown clusters. The best results of the classifications were obtained using only unigrams as features, with a precision of 0.67 and a recall of 0.60. The other feature sets resulted in much lower precision and recall. The authors suggest that deeper parsing could reveal significant phrase patterns.
2.2 Discussion
Although there have been quite a few articles related to the topic of detecting threats of violence, abuse, hate speech, or other similar unwanted or illegal speech on the internet, not much of this work has been particularly linguistically informed. Many of the articles point to this as an area worth exploring. Our research will investigate a more linguistically informed approach to the task, as detailed in Chapter 4. This means that we will not be able to use many of the methods from the related literature. However, there are still lessons to be learned from the literature.
Hammer (2014) uses the same data set as we will use in our research, but there are several things that we will do differently in our own experiments. The first is to reduce the risk of overfitting caused by the selection of training and test data. The training data in Hammer (2014) consisted of 80 % of the sentences containing threats of violence, and 80 % of the sentences without, chosen randomly. This means that the training and test data could contain sentences from the same comment. This could lead to somewhat inflated results, as one might imagine that the different sentences in a comment could contain some of the same phrases, which in turn would make the sentences easier to classify. This is unlikely to be very problematic, however, since most comments only contain one threat, as seen in Table 3.1.
Another risk factor closely related to overfitting could be that a particular user has comments both in the test and development set, and that the training and test data are extracted from all the different videos, possibly causing types of threats very specific to a video to be artificially easy to detect.
A way to combat these risks will be to use a stricter separation of training and test data, for instance by training only on the comments from seven of the videos, and testing on the comments from the remaining video. Another approach could be to separate training and test data at the comment level, to avoid the first problem, and to also only use the comments of some commenters for testing, and the comments from the rest of the commenters for development. The topics and types of threats seem to be very similar throughout the data set, so some of these risks might not constitute a problem.
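The commenter-level separation described above can be sketched as follows. This is a minimal illustration under our own assumptions: the record layout (dictionaries with a commenter key), the function name split_by_group and the test fraction are hypothetical, and not the set-up used in the experiments. Since every comment has a single commenter, splitting by commenter also keeps all sentences of a comment on the same side of the split.

```python
import random
from collections import defaultdict


def split_by_group(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that no group occurs in both training and test data."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    # Shuffle the group IDs, not the individual records.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, round(len(group_ids) * test_fraction))
    test = [r for g in group_ids[:n_test] for r in groups[g]]
    train = [r for g in group_ids[n_test:] for r in groups[g]]
    return train, test


# Hypothetical records: one dictionary per annotated sentence.
records = [
    {"commenter": 1, "comment": 10, "threat": 0},
    {"commenter": 1, "comment": 10, "threat": 1},
    {"commenter": 2, "comment": 11, "threat": 0},
    {"commenter": 3, "comment": 12, "threat": 0},
    {"commenter": 3, "comment": 13, "threat": 1},
]
train, test = split_by_group(records, "commenter", test_fraction=0.34)
```

The same function could be used for the video-level split by grouping on a video identifier instead, assuming such a field is available for each record.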
There are also some aspects of the research on Twitter data that could be worth looking at (Oostdijk & van Halteren, 2013b, 2013a; Bouma et al., 2012; Spitters et al., 2014). The first is their use of a formal definition. They use a definition of a threat of violence based in part on Black’s Law Dictionary and the Canadian Criminal Code, which could be interesting to look at for our own research:
A threat is a declaration of an intention to cause death or bodily harm to a person or persons, to damage or destroy their personal property, or to kill or injure an animal that is the property of a person.
A formal definition has the upside of making annotation less subjective, and the task we are doing more clear. However, since annotation has already been performed on the data set we will use, by Hammer, the cost of reannotating all the 28,000 sentences in the data set would outweigh the gain.
The articles also bring up the problem of detecting reported speech, for instance when citing other people. Another challenge is speech that is not intended to be taken literally, like sarcasm and irony. Both of these challenges could prove difficult to solve.
A common aspect of much of the related literature is the use of in-house tools, and a combination of machine learning and ad hoc, handcrafted rules. This achieved some good results, but it had the effect of making it quite hard to replicate the research and test set-ups, especially when these handcrafted rules were only described broadly, or not at all. We have not been able to find any of the data sets used in the related literature, except in Hammer (2014), as they do not seem to be published. This also contributes to the difficulty of replicating the research. We intend to use only, or at least mostly, off-the-shelf tools and software, and we will rely on machine learning in favor of handcrafted rules. We will also publish the YouTube threat corpus. This will make our findings easier to replicate, and easier to adapt to other, similar problems.
Chapter 3
The YouTube threat corpus
In the following chapter, we will present the data set we will be using in our experiments, the YouTube threat corpus. We will describe its origin and its content. We will also describe our preprocessing of the data set, and the format in which we save the corpus for further use.
3.1 The data set
Our data set is comprised of user-written comments from eight different YouTube videos. It was compiled by Hugo Lewi Hammer at Oslo and Akershus University College of Applied Sciences (Hammer, 2014). A comment consists of a set of sentences, each of them manually annotated to be either a threat of violence or not. The definition of a threat of violence includes statements in support of threats of violence. The data set furthermore records the username of the user that posted the comment.
The eight videos that the comments were posted to cover religious and political topics like halal slaughter, immigration, Anders Behring Breivik, Jihad, etc. (Hammer, 2014). Though covering slightly different topics, the videos all contain the same type of discussions, namely of xenophobia and racism.
Table 3.1 presents an overview of the number of comments and sentences in the YouTube threat corpus. The data set consists of 9,845 comments, comprised of 28,643 sentences. In total there are 402,673 tokens in the sentences in the data set. There are 1,285 comments containing threats, and 1,384 sentences containing threats. There are 992 users responsible for posting the 1,285 comments containing threats of violence.
           Comments   Sentences   Users posting
Total      9,845      28,643      5,483
Threats    1,285      1,384       992
Table 3.1: The number of comments, sentences and users in the YouTube threat corpus.
Posted        No. of users
25 threats    1
16 threats    1
12 threats    1
8 threats     3
7 threats     3
6 threats     4
5 threats     5
4 threats     15
3 threats     16
2 threats     87
1 threat      856
0 threats     4,491
Table 3.2: Number of threats of violence posted by users.
It is clear that most of the 992 users only posted one threat of violence. In fact, only 49 of the 992 users posting threats (4.9 %) posted more than two threats, as detailed in Table 3.2. We see that a clear majority of the threats (856 of the 1,285 threats counted in Table 3.2) were posted by users who only posted a single threat.
There is no distinction between threats of violence and support of threats of violence in the annotations of the data set. There is also no formal definition of a threat of violence used to annotate the comments, but the implicit definition used seems to be quite broad, for instance annotating calls for deportation as threats of violence. Inter-annotator agreement was reported by Hammer (2014) to be 98 %, as calculated on 120 sentences, doubly annotated for evaluation. The 120 sentences consisted of 100 randomly selected sentences annotated as non-threats by the primary annotator, and 20 randomly selected sentences labeled as threats. The reported inter-annotator agreement is simple overlap, and not a chance-corrected measure like kappa.
A chance-corrected measure for inter-annotator agreement would have been better, especially since there are only two classes, and because the classes are unbalanced. It would also have been better if the sample size used to calculate inter-annotator agreement had been larger, since 20 threats would not be enough to calculate statistical significance of the agreement.
However, since there was no formal definition used in the annotation, it would not have been possible for us to calculate a chance-corrected inter-annotator agreement ourselves, since we cannot use the same definition of a threat of violence for annotation.
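To illustrate the difference between simple overlap and a chance-corrected measure, the following sketch computes both observed agreement and Cohen's kappa from a two-by-two table of annotation decisions. The counts in the example are hypothetical: Hammer (2014) only reports the overall agreement, so we assume, purely for illustration, that the annotators disagreed on two of the 120 sentences, one in each direction. The function name is also our own.

```python
def agreement_and_kappa(both_no, a_no_b_yes, a_yes_b_no, both_yes):
    """Return (observed agreement, Cohen's kappa) for two annotators, A and B.

    The four arguments are the cells of the 2x2 table of decisions,
    where "yes" means that the sentence was labeled as a threat.
    """
    total = both_no + a_no_b_yes + a_yes_b_no + both_yes
    observed = (both_no + both_yes) / total

    # Proportion of "yes" decisions for each annotator.
    a_yes = (a_yes_b_no + both_yes) / total
    b_yes = (a_no_b_yes + both_yes) / total

    # Agreement expected by chance, given the annotators' label proportions.
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts: annotator A labels 100 non-threats and 20 threats,
# and annotator B disagrees on one sentence in each class.
observed, kappa = agreement_and_kappa(
    both_no=99, a_no_b_yes=1, a_yes_b_no=1, both_yes=19
)
```

Under these assumed counts the observed agreement is 118 of 120 (about 98 %), while kappa is about 0.94, since roughly 72 % agreement would be expected by chance alone with such unbalanced classes.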
The number of sentences and the number of threats differ slightly between the data set we use as the basis for the YouTube threat corpus and the version used in Hammer (2014). The two versions also differ in that the data set used in Hammer (2014) only consisted of sentences, and that there was no information about what comments sentences belonged to. Some sentences were therefore reannotated by Hammer, after seeing them in the context of their comments.
Figure 3.1 contains examples of some comments containing threats of violence taken from the data set. The first line is the anonymized username, and the subsequent lines are the sentences of the comment. An empty line indicates the end of a comment. The sentences are annotated with a number indicating whether they contain a threat of violence (1), or not (0).
From the examples in Figure 3.1, it appears that most of the threats are not directed at specific persons, but rather at groups of people. This also appears to be the case when reviewing more sentences from the YouTube threat corpus.
Comment #74, Commenter #44
1 and i will kill every fucking muslim and arab!
Comment #142, Commenter #88
0 Need a solution?
1 Drop one good ol’ nuke on that black toilet in Mecca.
Comment #198, Commenter #138
1 Funny, We will conquer you all in just few years, U will be my slave and your women will be my Sex Toy in Bed.
Comment #46, Commenter #42
0 As long as i’m alive you’ll have no victory.
1 I’ll kill all you cunts
Figure 3.1: Examples of comments from the YouTube threat corpus.
Sentences containing a threat are denoted with a 1, while a 0 denotes a non-threat.
3.2 Preprocessing
In this section we will describe the steps of our preprocessing of the YouTube threat corpus. We will describe the toolkits we use, and the information we derive from the text and use to enrich the YouTube threat corpus. One of the overriding goals of our project was to use state-of-the-art, freely available toolkits and packages, to make our experiments replicable. After considering multiple toolkits for Natural Language Processing, we decided to use the spaCy toolkit¹ in our preprocessing.
The spaCy toolkit, through its Python API, can perform sentence splitting, tokenization, PoS-tagging and dependency parsing, among other things.
We will discuss each of these tasks separately in this section.
¹ https://spacy.io
Since we will be developing our code in Python for both the preprocessing and the experiments, we could have opted to store the data set in-memory after preprocessing, and to access it directly in our experiment code. The preprocessing is, however, quite time consuming, so we instead opt to perform the preprocessing and the experiments as two separate steps, where we write the results of the preprocessing to file after completion, and read it at the start of the experiments. For this, we have to select a file format for the data set, which we will describe in Section 3.2.6. In selecting this approach, we also make it possible to distribute the YouTube threat corpus as already preprocessed, as well as in its raw text form.
Even though we decided to write the preprocessed data set to a file after preprocessing, we also wrote code to feed a data set straight through to the classification module. This, in principle, allows us to easily test the final classification system with text from other sources than the YouTube threat corpus.
3.2.1 Sentence splitting
We start our description of the preprocessing by examining the sentence splitting found in the data set. During the annotation process described in Hammer (2014), each comment was manually separated into sentences, and the sentences were annotated as either containing threats of violence or not. Since we use the data set from Hammer (2014) as the basis for the YouTube threat corpus, this means that our data set already has been split into sentences manually, with this exact splitting tied to the threat annotation.
For the sake of replicability, we wanted to evaluate what effect this manual sentence splitting had on the data set, compared to an automated sentence splitting done by spaCy, since in a realistic setting, where our system is applied to raw text, we must rely on automatically assigned sentence boundaries. spaCy performs sentence splitting during its dependency parsing, for which spaCy uses its own dependency parser (Honnibal & Johnson, 2015). According to the spaCy documentation²:
Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining the sentence boundaries.
This could pose a problem for our data set, since many of the comments have little or no punctuation, and capitalization ranges from none, to entire comments in all caps, as shown by one example in Figure 3.2.
We use spaCy to perform a new sentence splitting of the entire YouTube threat corpus. For each comment, we combine all of its constituent sentences into a single string. This string consists of the sentences, in order, separated by a single space. We then use spaCy to perform a dependency parsing, and derive one or more sentences from each comment. We then compare the sentence splitting in each comment to the manual split done in Hammer (2014).
² https://spacy.io/docs#annotation-sentence-boundary
Automatic, compared to manual   Comments   Percentage
Identical                       7,149      72.6 %
Differently                     194        2.0 %
More                            1,465      14.9 %
Fewer                           1,036      10.5 %
Table 3.3: The number of comments that were split identically using the manual and the automatic sentence splitting. The table also shows the number of comments that were split into more sentences by the automatic sentence splitter (More), fewer sentences by the automatic splitter (Fewer), and the number of comments split into the same number of sentences by both methods, but split at different places in the comment (Differently).
Of the 9,844 comments, the manual splitting was identical to the split performed with the spaCy splitter in 7,149 cases, as we can see in Table 3.3.
Of the remaining 2,695 comments, 1,465 comments were separated into more sentences by the spaCy sentence splitter than the manual splitting, and 1,036 comments were split into fewer sentences by spaCy than in the manual splitting. The last 194 comments were split into the same number of sentences by the two methods, but the split points were not in the same place using the two different methods.
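The four categories in Table 3.3 can be expressed as a small comparison function. The following is a sketch of how such a comparison could be made, assuming that each comment is available as two lists of sentence strings, one from the manual splitting and one from the automatic splitter. The function name and the example comments are hypothetical, and the sketch does not call spaCy itself.

```python
from collections import Counter


def compare_splits(manual, automatic):
    """Categorize the automatic split of one comment relative to the manual split."""
    manual = [sentence.strip() for sentence in manual]
    automatic = [sentence.strip() for sentence in automatic]
    if manual == automatic:
        return "Identical"
    if len(automatic) > len(manual):
        return "More"
    if len(automatic) < len(manual):
        return "Fewer"
    # Same number of sentences, but with boundaries in different places.
    return "Differently"


# Hypothetical comments, as (manual split, automatic split) pairs.
comments = [
    (["Need a solution?", "Here is one."], ["Need a solution?", "Here is one."]),
    (["this is one sentence"], ["this is", "one sentence"]),
    (["First part.", "Second part."], ["First part. Second part."]),
    (["a b c.", "d e."], ["a b", "c. d e."]),
]
counts = Counter(compare_splits(manual, automatic) for manual, automatic in comments)
```

Tallying the categories over all comments in this way yields counts of the kind reported in Table 3.3.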
Despite the large number of mismatches between the two splitting methods, we were hesitant to use spaCy, since that would mean reannotating a large number of comments. We would have to reannotate at least the 1,465 comments where spaCy produced more sentences than the manual splitting, since it could not be automatically inferred which sentences were threats if a sentence containing a threat had been split in two.
Comment #2680, Commenter #1606
x ALL MUSLIMS SHOULD WAGE A HOLY JIHAD AGAINST THE WEST. WHERE EVER YOU
x ARE SLAY THESE EUROPEAN SCUMS DOWN TO BITS AND SHOW NO MERCY. WIDESPREAD DESTRUCTION & MASSACRE AWAITS EUROPE. YOUR MEN, WOMEN & CHILDREN WILL BE BUTCHERED LIKE HALAL MEAT. YOU REAP WHAT YOU SOW AND THIS TIME MUSLIMS WILL CONQUER.
Figure 3.2: Examples of comments incorrectly split by spaCy. An x denotes an instance where spaCy separated one annotated threat into two or more sentences.
After examining the output of the spaCy splitting, we found several examples where the sentence split was at the wrong place in the comment.
In particular, spaCy seemed to have a problem with sentences in all caps, as can be seen in Figure 3.2. The sentences proposed by spaCy were both too short and too long, as in the last sentence in the comment, which contains at least three whole sentences. Sentence boundaries were also placed in seemingly strange places, such as in the first sentence, which ends with “WHERE EVER YOU”. We therefore decided to forgo spaCy’s sentence splitter, and instead rely only on the manual splitting that had already been performed.
3.2.2 Normalization and tokenization
The next step of the preprocessing is to normalize the sentences. This is not necessarily a crucial step to perform, but we saw in Section 3.2.1 that the capitalization in our noisy data set could make the dependency parsing engine less accurate. We therefore decide to lowercase words written in all caps before we perform the subsequent steps in the preprocessing.
We assume that removing all caps from our data set will not remove relevant information. However, it is possible that all caps capitalization is an indicator of threats of violence, since it can be used as a way to express anger. Nonetheless, we think that the trade-off from the normalization will be worth it.
Another reason for normalizing the sentences in our data set, besides making it easier for our dependency parser to correctly parse, is that it reduces the number of variations a single word can occur in. As we will discuss in Chapter 4, features created from test data need to be identical to features created during training for the two to match, and normalization increases the chance of this happening.
We decide on the following rules for our normalization: If a word is all caps, we lowercase it, except if it is the first word in the sentence, in which case it is capitalized. If words are capitalized, or in any other way mixed cased, we do not change their capitalization, no matter where in a sentence the word occurs.
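The normalization rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming that words are separated by single spaces and that a word counts as all caps when all of its letters are uppercase, as tested by str.isupper. The function name is our own, and details such as the treatment of single-letter words like “I” are assumptions that the rules above do not settle.

```python
def normalize_sentence(sentence):
    """Lowercase all-caps words, capitalizing an all-caps sentence-initial word."""
    normalized = []
    for position, word in enumerate(sentence.split(" ")):
        # str.isupper is True when all cased characters are uppercase,
        # so attached punctuation (as in "CAPS.") does not matter.
        if word.isupper():
            word = word.capitalize() if position == 0 else word.lower()
        # Capitalized or mixed-case words are left unchanged.
        normalized.append(word)
    return " ".join(normalized)


example = normalize_sentence("THIS IS a Test of iPhone CAPS.")
```

For the example string, the all-caps words are lowercased, the sentence-initial word is capitalized, and “Test” and “iPhone” keep their original casing, giving “This is a Test of iPhone caps.”.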
After normalization we tokenize the sentences, and for this step we rely on the tokenizer in spaCy. The tokenization standard used by the spaCy tokenizer is based on the standard for the OntoNotes 5 corpus (Weischedel et al., 2013), which is a large corpus of various genres of text.
The OntoNotes standard for tokenization is based on the standard used for the Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1993), but the two differ in some ways. OntoNotes, for instance, differs from Penn Treebank on how it deals with hyphenated words. OntoNotes splits words on hyphens, while the Penn Treebank standard does not.
3.2.3 Lemmatization
For lemmatization we use the WordNet lemmatizer (Fellbaum, 1998), which is built into spaCy. The built-in version of the WordNet lemmatizer differs from the original in the way it handles, for instance, pronouns.
While WordNet creates the lemma “hi” from the pronoun “his”, the built-in version instead substitutes this for the lemma -PRON-. This is a better approach than the way the original WordNet lemmatizer processes pronouns, but instead of either of these options, we add an exception in our code for pronouns, and let the lemmas of pronouns simply be their word forms.
3.2.4 PoS-tagging
Lastly in this step, we perform Part-of-Speech-tagging (PoS-tagging) on the data set. The built-in PoS-tagger in spaCy uses the OntoNotes 5 version of the Penn Treebank tagset (PTB) (Santorini, 1990). The tagset consists of 36 tags, and the tags specify both grammatical number and tense. In addition to assigning each token a PTB PoS-tag, spaCy also has an option to convert the PTB-tag to the simpler Google Universal Part-of-Speech tagset. The Universal Part-of-Speech tagset (uPoS) consists of just 17 tags, which do not take grammatical tense or number into account (Petrov, Das, & McDonald, 2011). We will include both of these tagsets in our data set, so that we are able to experiment with the different degrees of granularity the two tagsets offer.
3.2.5 Dependency parsing
The last step of the preprocessing that pertains to enriching the data set is performing dependency parsing. Dependency parsing aims at assigning syntactic structure to a sentence, expressed as a dependency representation. A dependency representation is a structure that connects words directly to other words in a tree structure. This differs from other types of syntactic representations, like phrase structure representations, that make use of non-terminal categories, so-called phrases, in their tree structures. Dependency trees are simpler structures than other types of syntactic representations, and are therefore easier to parse.
In a dependency graph, every word is a node, and there are no non-word nodes. Every node has one, and only one parent, except the root, which does not have a parent. This forms a tree structure of the sentence.
In this tree, a parent and its child form a dependency relation. In this dependency relation the parent is often referred to as the head, and the child is called the dependent, or modifier. Each edge between a head and a dependent is labeled according to the type of relation between the two words.
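The requirements on a dependency tree can be checked mechanically when the tree is stored as a list of head indices, one per token. The following sketch assumes the convention that tokens are numbered from 1 and that the root has head 0. The function name is our own, and the example heads are those of the eleven tokens of the example sentence in Table 3.4.

```python
def is_well_formed_tree(heads):
    """Check that a list of head indices describes a single rooted tree.

    heads[i] is the 1-based index of the head of token i + 1, and 0 marks
    the root. Storing one head per token already guarantees that no token
    has more than one parent.
    """
    n = len(heads)
    if sum(1 for head in heads if head == 0) != 1:
        return False  # there must be exactly one root
    if any(head < 0 or head > n for head in heads):
        return False  # every head must be the root marker or a token
    for start in range(1, n + 1):
        seen = set()
        node = start
        # Follow the chain of heads; it must reach the root without a cycle.
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


# Heads of the tokens "As long as i 'm alive you 'll have no victory".
example_heads = [2, 9, 5, 5, 2, 5, 9, 9, 0, 11, 9]
valid = is_well_formed_tree(example_heads)
```

The example satisfies the requirements: token 9 is the only root, and every other token reaches it by following its heads.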
We use the spaCy dependency parser (Honnibal & Johnson, 2015) to extract these dependency trees from our sentences. The spaCy dependency parser is a transition-based dependency parser extended by non-monotonic transitions (Honnibal & Johnson, 2015), and it is both fast and state-of-the-art (Choi, Tetreault, & Stent, 2015). The parser uses the ClearNLP dependency tagset (Choi & Palmer, 2012) to annotate the edges of the dependency graph. ClearNLP is a variant of the Stanford dependency tagset (de Marneffe & Manning, 2008), with some additional tags inspired by the CoNLL dependency approach (Johansson, 2008), and some other additional tags.
3.2.6 The CoNLL-format
Since we decided to store our data set in a file, instead of just accessing it from memory, we have to decide on a format to store it in. We decide to use the CoNLL-X-format (Buchholz & Marsi, 2006), which is the format used for the shared task of the tenth Conference on Computational Natural Language Learning (CoNLL). We also decide to make some modifications to this format, to suit our needs.
The CoNLL-X-format (or simply CoNLL-format) is a simple column format (see Table 3.4). Every sentence is contained in the same file, and the sentences are separated by a single blank line. Each sentence is represented as n lines, one for each of the n tokens in the sentence. Each token line is divided into 10 columns by tab characters, and each column contains a representation of the token. As we will not be using all the columns, the following is a list of the ones we will be using in our data set:
1. ID: The index of the token in the sentence, counting from 1.
2. FORM: The word form of the token.
3. LEMMA: The lemma of the token.
4. CPOSTAG: The coarse-grained PoS-tag of the token, in our case the uPoS-tag.
5. POSTAG: The fine-grained PoS-tag of the token, in our case the PTB-tag.
7. HEAD: The index of the head (parent) of this token in the dependency tree.
8. DEPREL: The label of the dependency relation between this token and its head.
The columns 6, 9 and 10 are not included in our data set, so those columns contain an underscore for every token to ensure that our data set follows the CoNLL specification. In addition to the seven columns described above, we add three more at the end for our own purposes:
11. COMMENTERID: A numeric ID given to the commenter by us, based on their username.
12. COMMENTID: A numeric ID given to the comment by us.
13. THREAT: Either a 1 or a 0, if the sentence contains a threat or not, respectively.
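The extended format can be sketched as follows. The sketch writes one token as a tab-separated line with 13 columns and reads a sentence block back into dictionaries. The column names for positions 6, 9 and 10 (FEATS, PHEAD and PDEPREL) are those of the CoNLL-X specification. The function names and the dictionary representation of a token are our own assumptions, and not necessarily how the corpus files were produced.

```python
COLUMNS = [
    "ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG", "FEATS", "HEAD", "DEPREL",
    "PHEAD", "PDEPREL", "COMMENTERID", "COMMENTID", "THREAT",
]


def format_token(token):
    """Format one token as a tab-separated line; unused columns become '_'."""
    return "\t".join(str(token.get(name, "_")) for name in COLUMNS)


def parse_sentence(block):
    """Parse the lines of one sentence into a list of column dictionaries."""
    return [dict(zip(COLUMNS, line.split("\t")))
            for line in block.splitlines() if line]


# The first token of the example sentence in Table 3.4.
token = {
    "ID": 1, "FORM": "As", "LEMMA": "as", "CPOSTAG": "ADV", "POSTAG": "RB",
    "HEAD": 2, "DEPREL": "advmod",
    "COMMENTERID": 42, "COMMENTID": 46, "THREAT": 0,
}
line = format_token(token)
parsed = parse_sentence(line)
```

Note that all values are read back as strings, so numeric columns such as ID, HEAD and THREAT would need to be converted with int before use.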
3.2.7 Preprocessing example
We will now go through the preprocessing of one sentence from the YouTube threat corpus. We will use the sentence “As long as i’m alive you’ll have no victory.”. It is the first sentence of the last comment in Figure 3.1, and the sentence is not a threat. The sentence has, of course, been separated out from its comment during annotation, so the first step is to normalize it.
Since there are no words in all caps in the sentence, there is no change after normalization. The next step is tokenization, which results in the following list of tokens:
As long as i ’m alive you ’ll have no victory .
We see that the tokenizer has separated the sentence into 12 tokens.
It has separated i and ’m, and you and ’ll. The tokenizer has also separated the period from victory. Next is the lemmatization. When performing the lemmatization, we use the WordNet lemmatizer, along with our modifications, and get the following lemmas (on the bottom row):
As   long   as   i   ’m   alive   you   ’ll    have   no   victory   .
as   long   as   i   be   alive   you   will   have   no   victory   .
We see that the lemmas are mostly the same as the word forms. The differences are the lack of capitalization for as, that ’m has been lemmatized to be, and that ’ll has been lemmatized to will. Because of our modification to the lemmatizer, the pronouns “i” and “you” have their respective word forms as lemmas, instead of the lemma “-PRON-”. The last step before the dependency parsing is to assign part-of-speech-tags to the tokens. In the following tables, we will remove the period from the sentence, to conserve space. We use the spaCy PoS-tagger to assign PTB-tags (the third row), which will be mapped to uPoS-tags (the fourth row):
As long as i ’m alive you ’ll have no victory
as long as i be alive you will have no victory
RB RB IN PRP VBP JJ PRP MD VB DT NN
ADV ADV ADP PRON VERB ADJ PRON VERB VERB DET NOUN
We see an example of the difference in granularity between the two tagsets in the sentence above. The verbs in the sentence have all received the tag VERB in the uPoS-tagset, while they have different tags in the PTB-tagset (VBP, MD and VB), expressing distinctions relating to tense and modality.
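The loss of granularity can be made concrete with a small sketch of the tag mapping. The dictionary below covers only the PTB-tags that occur in the example sentence, with the correspondences read off the table above. It is not the complete mapping, and the names `PTB_TO_UPOS` and `to_upos` are assumptions made for illustration.

```python
# Partial PTB -> uPoS mapping, restricted to the tags in the example sentence.
# The dictionary and function names are illustrative assumptions.
PTB_TO_UPOS = {
    "RB": "ADV", "IN": "ADP", "PRP": "PRON", "JJ": "ADJ",
    "VBP": "VERB", "MD": "VERB", "VB": "VERB",
    "DT": "DET", "NN": "NOUN", ".": "PUNCT",
}


def to_upos(ptb_tags):
    """Map a sequence of PTB-tags to the coarser uPoS-tags."""
    return [PTB_TO_UPOS[tag] for tag in ptb_tags]


# PTB-tags of the example sentence, with the period omitted.
ptb = ["RB", "RB", "IN", "PRP", "VBP", "JJ", "PRP", "MD", "VB", "DT", "NN"]
upos = to_upos(ptb)

# Distinct PTB-tags that collapse onto the single uPoS-tag VERB.
verb_tags = sorted({tag for tag in ptb if PTB_TO_UPOS[tag] == "VERB"})

print(upos)
print(verb_tags)
```

Three distinct PTB-tags map to the single tag VERB, so the mapping cannot be inverted. This is why both tags are stored for each token.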
The final type of information we derive from the sentence is created during the dependency parsing step. With the spaCy dependency parser, we extract a dependency tree from the sentence, shown in Figure 3.3. We see that the entire sentence is represented in a single tree, and that the requirements of dependency graphs mentioned above are satisfied. We also see that each edge is labeled with the type of syntactic relation it represents, e.g. advmod for adverbial modifiers, nsubj for nominal subjects, and aux for auxiliary verbs.
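That the tree is well formed can also be checked mechanically. The sketch below uses the head indices listed in Table 3.4 and tests three commonly stated conditions: every head is either 0 or a token of the sentence, exactly one token has head 0, and every token reaches the root without passing through a cycle. We assume these correspond to the requirements referred to above. The function name is hypothetical, and the single-head condition holds by construction because each token has one entry in the dictionary.

```python
# Head index of each token in the example sentence (from Table 3.4);
# 0 denotes the artificial root. Names are illustrative assumptions.
HEADS = {1: 2, 2: 9, 3: 5, 4: 5, 5: 2, 6: 5,
         7: 9, 8: 9, 9: 0, 10: 11, 11: 9, 12: 9}


def is_well_formed_tree(heads):
    """Check that a token -> head mapping describes a single rooted tree."""
    tokens = set(heads)
    # Every head must be the artificial root (0) or a token in the sentence.
    if any(head != 0 and head not in tokens for head in heads.values()):
        return False
    # Exactly one token may attach to the artificial root.
    if sum(1 for head in heads.values() if head == 0) != 1:
        return False
    # Following head links from any token must end at 0 without a cycle.
    for token in tokens:
        seen = set()
        current = token
        while current != 0:
            if current in seen:
                return False
            seen.add(current)
            current = heads[current]
    return True


print(is_well_formed_tree(HEADS))
```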
Figure 3.3: The dependency tree of the example sentence used to demonstrate preprocessing. The period is removed to conserve space.

The final step of the preprocessing is to write all the information derived from the sentence to a file. We use the CoNLL format and write the information as in Table 3.4, with one tab character between each of the columns in the actual file. As described above, columns 7 and 8 represent the dependency tree, with column 7 containing the index of the head of each token, and column 8 containing the edge label. Since token 9, the word “have”, is the root, the index of its head is set to 0, which is not a token in the sentence, but is only used to signify the root.

1   2        3        4      5    7   8       11  12  13
1   As       as       ADV    RB   2   advmod  42  46  0
2   long     long     ADV    RB   9   advmod  42  46  0
3   as       as       ADP    IN   5   mark    42  46  0
4   i        i        PRON   PRP  5   nsubj   42  46  0
5   ’m       be       VERB   VBP  2   advcl   42  46  0
6   alive    alive    ADJ    JJ   5   acomp   42  46  0
7   you      you      PRON   PRP  9   nsubj   42  46  0
8   ’ll      will     VERB   MD   9   aux     42  46  0
9   have     have     VERB   VB   0   ROOT    42  46  0
10  no       no       DET    DT   11  det     42  46  0
11  victory  victory  NOUN   NN   9   dobj    42  46  0
12  .        .        PUNCT  .    9   punct   42  46  0

Table 3.4: The information derived during preprocessing, formatted using our extended CoNLL format. The first row gives the column numbers. We have omitted columns 6, 9 and 10 from this table, since they only contain underscores.
All the tokens have the same values in columns 11, 12 and 13. This is because these columns hold the commenter-ID, the comment-ID, and the threat annotation, respectively, which are, of course, the same for every token in a sentence. Since this sentence does not contain a threat of violence, column 13 contains only zeroes.
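Reading the format back can be sketched as follows. The example parses three of the lines from Table 3.4, with the underscore columns restored, and checks that columns 11, 12 and 13 agree across the tokens. The names `Token`, `parse_row` and `sentence_metadata` are assumptions made for illustration, and we assume that the ID and threat columns always hold integers.

```python
from collections import namedtuple

# Field names are illustrative assumptions; columns 6, 9 and 10 are skipped.
Token = namedtuple(
    "Token",
    "index form lemma upos ptb head deprel commenter_id comment_id threat",
)


def parse_row(line):
    """Parse one 13-column, tab-separated line into a Token."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 13:
        raise ValueError("expected 13 columns, got %d" % len(cols))
    return Token(
        index=int(cols[0]), form=cols[1], lemma=cols[2],
        upos=cols[3], ptb=cols[4],
        head=int(cols[6]), deprel=cols[7],
        commenter_id=int(cols[10]), comment_id=int(cols[11]),
        threat=int(cols[12]),
    )


def sentence_metadata(tokens):
    """Return the (commenter, comment, threat) triple shared by all tokens."""
    values = {(t.commenter_id, t.comment_id, t.threat) for t in tokens}
    if len(values) != 1:
        raise ValueError("columns 11-13 differ within one sentence")
    return values.pop()


LINES = [
    "9\thave\thave\tVERB\tVB\t_\t0\tROOT\t_\t_\t42\t46\t0",
    "10\tno\tno\tDET\tDT\t_\t11\tdet\t_\t_\t42\t46\t0",
    "11\tvictory\tvictory\tNOUN\tNN\t_\t9\tdobj\t_\t_\t42\t46\t0",
]
tokens = [parse_row(line) for line in LINES]
root = next(t for t in tokens if t.head == 0)  # the token whose head is 0
meta = sentence_metadata(tokens)
print(root.form, meta)
```

Because the three added columns are repeated on every token line, the sentence-level annotation can be recovered from any single token of the sentence.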