Tuning word embeddings for neural dependency parsing of Norwegian

(1)

Tuning word embeddings for neural dependency parsing of Norwegian

Henrik Hillestad Løvold

Thesis submitted for the degree of

Master in Informatics: Technical and Scientific Applications

(Language Technology group) 60 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

(2)

(3)

Tuning word embeddings for neural dependency parsing of

Norwegian

Henrik Hillestad Løvold

(4)

Tuning word embeddings for neural dependency parsing of Norwegian http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

(5)

Abstract

In the recent years, data driven dependency parsers using neural networks have been developed and show promising results in parsing accuracy. An- other area that has seen an increase in interest among researchers is using distributional semantics in the form of word embeddings as input features in dependency parsers, replacing traditional feature engineering. No previous research has been done on systematically tuning of word embeddings for neural network parsing of Norwegian.

We present the results of tuning word embeddings for the task of neural network dependency parsing of Norwegian. We also present the results of comparing the performance of word embeddings in dependency parsers, to their performance in intrinsic evaluation metrics.

(6)

(7)

Acknowledgements

First of all, I want to express my gratitude for my supervisors Lilja Øvrelid and Erik Velldal, for their guidance, and the motivation and help offered during my work on this project.

I want to thank the researchers and developers of the tools used in this project, and for making them publicly available. Without these tools, this project would have been impossible.

I also want to thank my fellow students at the University of Oslo, in par- ticular Kjetil B. Kristoffersen, Camilla E. Stenberg, Hans P. T. Kragset and A. Helene Holter, for giving me the motivation to keep on going, and for pulling me through in my darkest moments.

Finally I would like to thank my parents and my family for believing in me, and for the unconditional support offered to me.

To my dear daughter Ella.

(8)

(9)

List of Figures

2.1 Example feed-forward neural network with two hidden

layers. . . 8

2.2 Example dependency graph . . . 13

2.3 Example of a hierarchical clustering dendrogram . . . 14

2.4 Example RNN with one layer. . . 18

4.1 Tuning workflow . . . 39

5.1 Mean formula . . . 46

5.2 Sample standard deviation formula . . . 46

5.3 Quantification results . . . 47

5.4 LAS relative to window size . . . 51

5.5 LAS relative to negative samples . . . 53

(14)

(15)

List of Tables

2.1 Nivre’s algorithm operations . . . 13

2.2 Source ratio for NDT . . . 21

2.3 Categories of the Google Analogies dataset . . . 24

2.4 Categories of the Norwegian Analogies dataset . . . 25

3.1 Number of tokens for the Norwegian Newspaper Corpus and NoWaC . . . 29

3.2 Source ratio for NDT . . . 29

3.3 Explanation of the CoNLL-X data fields column-wise . . . . 30

3.4 Example sentence in the CoNLL-X format . . . 31

3.5 Tuning options for word2vec and wang2vec. . . 33

3.6 Tuning options for Dyer’s parser, from the help screen . . . . 34

3.7 Tuning options for UDPipe. . . 36

4.1 Standard embedding tunings for Dyer’s parser and UDPipe 40 4.2 Results without word embeddings . . . 41

4.3 Results of parsing with NAK based word embeddings . . . 42

4.4 Results of parsing with NAK+NoWAC based word embeddings . . . 42

4.5 Results of parsing with swapped word embeddings . . . 43

4.6 Dataset experiments results recap . . . 44

5.1 Comparing skip-gram to structured skip-gram . . . 48

5.2 Parsing results with 50, 100 and 200 dimensions . . . 49

5.3 Parsing results with varying window size . . . 50

5.4 Results of frequency cutoff experiments . . . 52

5.5 Results of experiments with negative samples . . . 54

5.6 LAS and UAS with hierarchical softmax . . . 54

5.7 LAS with varying values for NCE . . . 55

5.8 LAS with 5 and 25 negative examples with and without NCE 56 5.9 Results of all embedding tuning experiments . . . 58

5.10 Final embedding settings. . . 59

5.11 LAS and UAS in Dyer’s parser and UDPipe on the parsing of the test dataset . . . 59

(16)

6.1 Recognised analogies and LAS for skip-gram and structured skip-gram . . . 62 6.2 Intrinsic evaluation details . . . 64 6.3 Five closest neighbours of "katt" with SSG and SG . . . 65 6.4 Five closest neighbours of "beregne" with SSG and SG . . . . 65 6.5 Five closest neighbours of ambiguous words using skip-gram 66 6.6 Five closest neighbours of ambiguous words using skip-gram 66 6.7 Deprel+attachment f-score of SSG and SG . . . 67

(17)

Chapter 1 Introduction

1.1 The project

The aim of this project is to explore how different word embedding models and tuning of embedding hyperparameters used as input representations affect the performance of neural network-based dependency parsers, with the goal of improving dependency parsing of Norwegian.

Dependency parsing is a field within the realm of language technology.

A dependency parser analyses text with the goal of being able to recognise the syntactic relations between words within a sentence in a given language. Modern dependency parsers use statistics and probability to achieve this goal. In the recent years, extensive research has been carried out to apply machine learning algorithms known as artificial neural net- worksto statistic dependency parsers.

Another area that has seen a rise in interest recently is distributional semantics in the form of word embeddings. Word embeddings are multidi- mensional vector representations of words in a vocabulary. Whereas traditional distributed word representations use occurrence counting of words to generate vector representations, word embedding software utilises machine learning techniques to allow for much more densely populated word vectors of lower dimensionality. This makes the representations well suited for specifying the input layer to artificial neural networks.

A huge advantage of using distributional semantics in the form of word embeddings as features in dependency parsing is its ability to replace manual feature engineering. Traditionally, features used in dependency parsing had to be carefully selected by linguists. The modern word embedding models show promise to be a competitive alternative to these features.

(18)

Many tools for generating word embeddings exist, and exploring them all is out of reach of this project. In this project we will train models using the neural network-based word2vec (Mikolov, Sutskever, Chen, Corrado,

& Dean, 2013) andwang2vec(Ling, Dyer, Black, & Trancoso, 2015).

In order to train word embeddings, a corpus of text is required. In this project we will carry out experiments with the Norwegian Newspaper Corpus and a concatenation of this and the Norwegian Web as Corpus (NoWAC) in order to find an optimal setup for dependency parsing of Norwegian.

Recently, researchers at the University of Oslo and the Norwegian Na- tional Library have developed atreebank, a resource which is instrumental in training and evaluating dependency parsing, for Norwegian (Solberg, Skjærholt, Øvrelid, Hagen, & Johannessen, 2014). This is the first of its kind for Norwegian, and will form the basis of the parsing performed in this project.

The word embedding software we use offers a wide range of hyperparameters to be tuned by the user. To reach our goal of optimising word embeddings to be used as features in dependency parsing, we will ex- periment with various language models, datasets and parameters in the training of word embeddings. We show that the choice of configuration can have a large impact on downstream parser performance.

Recently, a dataset for evaluation of word embeddings in the task of recognising analogies has been developed for Norwegian (Stadsnes, forthcoming). Using this dataset, we show that the choice of embedding model largely affects the performance of the embeddings in terms of recognising semantic and syntactic analogies. We also show that there is no one-to-one correlation between models performing well in downstream dependency parsing, and performance in the task of recognising analogies.

1.2 Outline

Chapter 2 provides an introduction to the theoretical background and an overview of previous research within the field of dependency parsing and word embeddings.

Chapter 3 describes the data and tools used in this project, and the preparatory work that had to be carried out to make them usable in our experiments.

The experiments we carried out to find parameters for optimal word em-

(19)

bedding models to be used in dependency parsing of Norwegian are described in chapter 4 and 5.

As mentioned, we also performed experiments to evaluate our embeddings in traditional, intrinsic evaluation tools. These experiments are described in chapter 6.

Finally, chapter 7 contains our conclusion, our contributions, as well as a description of possible future work based on the outcome of this project.

(20)

(21)

Chapter 2 Background

This project aims to explore how different word embedding models and embedding hyperparameters affect the performance of statistical dependency parsers for Norwegian. The experiments will be performed on the Norwegian treebank (Solberg et al., 2014). Dependency parsing has recently seen an increase in interest among computational linguists. Another area that has seen an increase in interest by scientists recently is word embeddings. Word embeddings are vector representations for words. Com- pared to traditional vector space models, word embeddings offer more dense, dimensionality reduced vectors.

As labelled data is a scarce resource, various means of using unlabelled data have been proposed. This project aims to perform experiments to fine-tune embeddings for Norwegian. Embeddings will be trained using the Norwegian Newspaper Corpus¹ and the Norwegian Web Corpus (NoWaC) corpora (Guevara, 2010). Two modifications to the traditional word embedding models proposed by Mikolov, Chen, Corrado, and Dean (2013) have been made by Ling et al. (2015). In this project we will look at how these embeddings perform when trained on Norwegian language.

This chapter describes the theoretical background of this project. The practical details regarding the tools and data are described in Chapter 3.

The experiments carried out to optimise word embeddings and syntactical parsing of norwegian language are described in Chapter 4.

2.1 Distributional semantics

Before looking at dependency parsing, we need to understand the concept of distributional semantics. The basic component in distributional semantics is vector space models. Vector space models are numeric methods used to represent features of words as vectors. Each dimension of

1https://www.nb.no/sprakbanken/show?serial=sbr-4&lang=nb

(22)

a word’s feature vector represents the weight of a given feature for this word. In semantic vector space models, the vector features are generally based on co-occurrence (Jurafsky & Martin, 2000). The numeric vectors can be used to calculate the geometric distance between words in our vector space, typically by simple metrics such as Euclidean distance or cosine similarity. This information about word distribution can be used to represent the syntactic similarity between words.

The core hypothesis in distributional semantics, thedistributional hypothesis, states that words with similar meaning often occur in close proximity to the same words; the context is similar. Utilising the distributional hypothesis, we can use the distributional properties of natural languages to estimate the semantic meaning of words (Sahlgren, 2008).

Consider the following sentences:

Too muchSpooshmakes you gain weight.

Spooshtastes like cocoa and nuts.

A bar ofSpooshwill melt in the sun.

AlthoughSpooshis a made up word, it is obvious to the human mind that Spooshis some kind of chocolate bar with nuts. But how about a computer that has no prior knowledge of the concept of "chocolate"? By looking at the context words, the computer might find that the word Spooshoccurs within the vicinity of many words that also occur in the vicinity of the word chocolate. By applying the distributional hypothesis, the computer can assume that these two words have a semantic similarity, only based on the surrounding words.

2.1.1 Machine learning and Artificial Neural Networks

Machine learning is a set of methods used to make machines capable of making decisions based on experience. Machine learning has long tradi- tions, and many algorithms, such as support vector machines, decision trees and Bayesian classifiers, exist. One of them is calledartificial neural networks.

Artificial neural networks consist of a set of nodes, dubbedneurons, containing activation functions, which are distributed over a varying amount of hidden layers. Between the layers are weighted edges. At the end of the network is an output layer of one or more neurons, which output represents the decision made by the network.

(23)

During training, the artificial neural network is given examples. Each example consists of an input, and the desired output given that input. When the network makes a prediction, an error function calculates how far away from the desired output this prediction was. The error is then fed back- wards through the network, adjusting weights accordingly. This step is called backpropagation. Training is finished when the error converges, or after a given number of iterations.

The Artificial neural network is a popular machine learning algorithm in many fields. As computer hardware has improved drastically over the last couple of decades, artificial neural networks, which both requires and scales well with huge amounts of data, have become more viable in many applications. An important part of the success of artificial neural networks within the field of natural language processing, is its ability to replace manual feature engineering.

Representation learning is a computer’s ability to both learn how to use the features, and to actually learn how to learn. Artificial neural networks with multiple hidden layers have a high level of abstraction, making them suited for this task. Another important feature of artificial neural networks in the field of natural language processing, is the low dimensionality of each word’s vector representation. The low dimensionality of the vectors enables us to concatenate the word vectors into a high dimensional model, which is used as input directly to the parser. These low dimensional vectors called word embeddings, will be described further in chapter 2.1.3.

Many artificial neural network models, such as shallow feed-forward networks, recurrent neural netowrks (including long short-term memory networks), convolutional neural networks exist, and are used for different tasks. In this project we will use tools that utilise neural networks both in the training of word embeddings, and in the dependency parsers Dyer’s parser, which will be presented in 2.3.6, and UDPipe, which will be presented in 2.3.7.

2.1.2 Language models

Language models are statistical probability distributions over the likelihood of sequences of words. The probability of a sequencewis given by P(w1,w2, ...,wn). Language models are useful in many applications of natural language processing, such as machine translation, word suggestions in word processing, etc. In search technology, language models are used to return the most relevant documents given a query string. In statistical parsing, we are often interested in the likelihood of a wordwoccurring in a given contextc, P(w|c).

(24)

Input #1 Input #2 Input #3 Input #4

Output Hidden

layer 1

Hidden layer 2 Input

layer

Output layer

Figure 2.1: Example feed-forward neural network with two hidden layers.

The N-gram model

The most commonly used language model is known as then-gram model.

N-grams are sequences ofnwords in a text or document. In n-gram models, the probability of a wordwin a given by the precedingn−_{1 words:}

P(wi|w_i₋₍_n₋₁),w_i₋₍_n₋₂), ...,w(i−1)

In this model, we assume that the probability of an occurrence of a word only depends on the n−1 preceding words. Although this is a naive assumption, N-gram models are often used when modelling languages.

The main reason is that training N-gram models requires little complexity, and the trade-off in terms of precision compared to higher complexity language models is low.

A problem often encountered when working with n-gram models is called thedata sparsity issue. Regardless of how big the training data is, it is likely that certain sequences only occur once, or that a sequence we attempt to look up never occurs at all in the training data. This leads to low, or even zero probabilities (unless a technique known assmoothingis applied). In order to avoid the data sparsity issue, other language models have been proposed.

The neural network language model (NNLM) was proposed by Bengio, Ducharme, Vincent, and Jauvin (2003). The aim of this model is to avoid the data sparsity issue by training an artificial neural network on the a text corpus, and create a probability function based on the distributed representation for words. "Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made

(25)

of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence" (Bengio et al., 2003).

The Neural Network Language Model

A problem with traditional language models like the n-gram model, is that sequences of words which are not seen in the training corpus either yield a zero probability, or a low probability, in cases where smoothing is applied.

For instance, in a case where the sentence"The dog eats food"occurs in the training corpus, but "The cat eats food"does not occur, the probability of the latter sentence will be low. The words "dog" and "cat" are semantically close to each other, and it would make sense that the probability of the sentence "The cat eats food"should be close to that of "The dog eats food".

Several attempts at solving this problem, including variants of the traditional models and entirely new language models, have been proposed.

One of the newer language models, and the one we will focus on in this project, is called theNeural network language model.

The idea behind the NNLM approach is by Bengio et al. (2003) sum- marised as follows:

1. Associate with each word in the vocabulary a distributed word feature vector (a real valued vector inRm),

2. Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence, and

3. Learn simultaneously the word feature vectors and the parameters of that probability function.

A byproduct of the neural network language model is the vectors representing each word in the vector space. These vectors have proven to be useful for many tasks in natural language processing, including dependency parsing, and are often referred to asword embeddings.

2.1.3 Word embeddings

Traditionally, vector estimation has largely been based on frequency counting. This is a simple, low-cost process, but it has some drawbacks.

First of all, it requires some sort of smoothing to be applied to the vectors, to avoid zero probabilities when used in statistical language models.

Secondly, the problem of data sparsity (as described in chapter 2.1.2) may also cause problems with low or zero probabilities. As a byproduct of the research on neural network language models, continuous vector space models known asword embedding, have been proposed.

(26)

Word embedding is a technique that is used to estimate the proximity between words. One of the most popular family of methods is known as word2vec(Mikolov, Chen, et al., 2013). Word2vec implements the methods skip-gramand CBOW (continuous bag of words). The aim of word embedding is to reduce the dimensionality of the feature vectors, to make it easier to calculate the semantic proximity between words. Bansal, Gimpel, and Livescu (2014) compared different types of embeddings to Brown clusters as features for dependency parsing, and found that all embeddings yield significant parsing gains. In this project, we will use word2vec (Mikolov, Chen, et al., 2013) and wang2vec (Ling et al., 2015) embeddings as features in dependency parsing.

The skip-gram model takes a word wt as input, and outputs the context window. In opposite, the CBOW method takes the context as input, and predicts wt. Although the methods are similar and produce similar embeddings, they behave quite differently. There is usually one choice which is better than the other for a given task (Jurafsky & Martin, 2000).

In addition to the different models, word2vec also offers a wide range of hyperparameters, which values can be specified by the user. As the aim of this project is to explore how different models and parameters affect parsing by utilising word embeddings, we will focus on tuning these parameters to find a setup that works well for Norwegian.

2.1.4 Count-based vs. prediction-based models

Research on comparing traditional count-based language models to the newer prediction-based language models has been done by Baroni, Dinu, and Kruszewski (2014). Count-based models were trained using the DIS- SECT Toolkit², and prediction-based models were trained using word2vec (Mikolov, Chen, et al., 2013). Baroni et al. (2014) found that their word2vec models beat the count-based models in every single task, and encouraged researchers to use prediction-based models rather than count-based models. When we decided to use word embeddings, this encouragement was taken into account.

2.1.5 Evaluation of word embeddings

Traditionally, word embeddings have been evaluated using computationally low-cost datasets which measure how well the embeddings recognise predefined analogies and word similarities. This is known as intrinsic

2DISSECT Toolkit: http://clic.cimec.unitn.it/composes/toolkit/

(27)

evaluation, and will be further described in chapter 2.5.

Although intrinsic evaluation is computationally cheap, recent discussion about whether the results of intrinsic evaluation actually measures the vi- ability of the embeddings in the task they are trained for, has emerged in the NLP community. In this project we will use our embeddings only as a tool for improving dependency parsing of Norwegian language, and evaluation will be performed by passing the embeddings as parameters to our dependency parsers, and then measuring the accuracy of the results. This is known as extrinsic evaluation of word embeddings.

After our extrinsic evaluation of word embeddings in the downstream task of dependency parsing, we will take a closer look at how a sample of our models perform in intrinsic tests for comparison. These experiments are presented in chapter 6.1.1.

2.2 Dependency parsing

The goal of dependency parsing is to determine the syntactic relationship between words in a sentence. In contrast to constituency parsers, where the aim is to decide the lexical phrase structure of a sentence, dependency parsers find relationships between words, and has no concept of constituents. The reason that dependency structure grammars lately have gained popularity compared to phrase structures, is that they are more efficient to learn and parse while still encoding much of the predicate-argument information needed in applications (McDonald, Pereira, Ribarov, & Hajiˇc, 2005). Modern dependency parsers are trained using large corpora of annotated text, so-called treebanks. There are primarily two approaches to data-driven dependency parsing – graph- based and transition-based parsing (McDonald et al., 2005; Nivre, Hall,

& Nilsson, 2006).

2.2.1 Graph-based dependency parsing

Graph-based parsers use graph algorithms to find the most likely dependency structure of a sentence. One such parser is the MSTParser.

The aim of the MSTParser is to "(...) formalize weighted dependency parsing as searching for maximum spanning trees (MSTs) in directed graphs" (McDonald et al., 2005). The algorithm takes an input sentence x = x1,x2, ...,xn, and assumes a proper feature representation, as well as a weight vector w. A graph is constructed, where each word acts as a vertex, and an additional root vertex is created. Directed edges connect all vertices to each other. This process yields a list of all the possible trees

(28)

for the input sequence. After creating the trees, a maximum spanning tree algorithm is used to find the most likely parse tree for the input sequence.

2.2.2 Transition-based dependency parsing

The second approach to dependency parsing is called transition-based dependency parsing. In opposite to graph-based parsing, which finds the globally most likely dependency graph, transition-based dependency parsers are state machines, and find the most likely transition between given states. Nivre et al. (2006) introduce MaltParser, a multilingual transition-based parser. MaltParser makes greedy decisions based on probabilities in each step of the parsing. It can thus be looked at as an opposite to MST’s approach, which looks at the graph as a whole and maximises probability for the entire graph. "While a traditional parser- generator constructs a parser given a grammar, a data-driven parser- generator constructs a parser given a treebank" (Nivre et al., 2006). The reason that decision based parsers such as MaltParser have become popular, is its speed. In linear time complexity, MaltParser can produce parses very close to the state of the art dependency parsers in terms of precision and accuracy.

The MaltParser framework supports any deterministic parsing algorithm compatible with its architecture. One of them is Nivre’s algorithm, which is a linear time algorithm limited to projective graphs. It has two modes:

arch-standard and arch-eager (Nivre et al., 2006). Nivre’s arc-eager algorithm uses three data structures; the Stack, which starts with the ROOT symbol, theBuffer, which starts with the input sequence, and a set of dependency arcs, which starts off empty. In each transition, it has four possible operations (table 2.1), and its finishing condition is reached when the buffer is exhausted.

Many modern dependency parsers are based on the datastructure presented by Nivre et al. (2006). Two examples are the LSTM-Parser (Dyer, Ballesteros, Ling, Matthews, & Smith, 2015) and UDPipe (Straka, Hajic, &

Straková, 2016), which we will use in this project. These parsers are further described in chapter 2.3.6 and 2.3.7.

2.2.3 Learning

By analysing an annotated corpus with various properties associated with each word (POS-tag, dependency relation, etc.), the parser is able to learn relationships between words, which is required to later parse untagged text. Various machine learning algorithms for classification

(29)

Shift Shifts the next object from the buffer onto the stack.

Left arc Creates a left arc between the next element in the buffer and the top element on the stack, and adds the arc to the set of dependencies. Pops the dependent off the stack.

Right arc Shifts the first element in the buffer onto the stack, and creates a right arc dependency between the two elements on the top of the stack. The new arc is added to the set of dependencies.

Reduce Pops the top element off the stack.

Table 2.1: Nivre’s algorithm operations

Den viser at han har trengt å komme i modus .

It shows that he has needed to get into competition-mode .

ROOT

SUBJ SBU

SUBJ DOBJ

INFV DOBJ INFV ADV

PUTFYLL IP

Figure 2.2: Dependency graph of the example sentence from the development set.

can be used to learn relationships from a dependency treebank along with a corresponding set of features. MaltParser is compatible with support vector machines (with various kernel functions), or a linear classifier to learn relationships³. Features are extracted from the annotated treebank, and usually include information about each word’s parent, siblings (words of similar depth depending on the same parent), and for labelled parsing, the dependency relation.

2.3 Parsing with distributional semantics

Recent studies have shown significant parsing gains using continuous word representations and word clusters derived from embeddings as features in dependency parsing. Parsing improvements by using Brown clusters based on discrete representations was shown by Koo, Carreras Pérez, and Collins (2008).

3Different algorithms described at: http://www.maltparser.org/userguide.html#

learner

(30)

a b c d e

distance

Figure 2.3: Example of a hierarchical clustering dendrogram

2.3.1 Clustering

Clustering is an unsupervised machine learning technique used to find groups of related objects,clusters, in a set. Unlike classification, supervised machine learning techniques that require a gold standard training set with predefined classes, clustering is used when we have no prior knowledge of the objects to be classified. Clustering is not a specific algorithm, but the task to be solved. Various clustering algorithms and proximity functions exist, and the choice of algorithm is largely based on the data to be classified.

Clustering algorithms can be divided into two major categories; flat clustering and hierarchical clustering. Flat clustering is the basic approach where the number of clusters, K, is specified prior to the clustering process. The results areK clusters of related objects. Hierarchical clustering generates a hierarchical structure of clusters, where the clusters become more fine grained for each level down in the hierarchy. At a given layer of hierarchical clustering, the clusters are similar to the clusters of flat clustering at theK in that point of the hierarchy. Figure 2.3 shows an example of the layers in a hierarchical clustering dendrogram. Notice how the clusters become less fine-grained (i.e. the clusters tolerate a higher distance between their members) for each layer from the bottom and upwards. Each layer makes upKclusters.

(31)

Both flat clustering and hierarchical clustering have proven to be useful in the creation of clusters used as features in dependency parsing. In this project, we will take a different approach to using distributional semantics as features to statistical dependency parsing. In recent studies by Dyer et al. (2015) and D. Chen and Manning (2014), using pre-trained word embeddings directly as features has proven to be successful in the training of parsers. We will train our embeddings using the models proposed by Mikolov, Chen, et al. (2013) and Ling et al. (2015), and use these as features in the parser.

2.3.2 Word embeddings as features in dependency parsing

As mentioned in chapter 2.1.3, the CBOW model learns vectors to predict a word from a context windoww, whereas the skip-gram model does the inverse; it takes a word and predicts its context window w. Bansal et al.

(2014) found that the size of w affects the embeddings substantially. For high wvalues, words seemed to group with other words that shared the same topic (e.g. king and queen). For low w, words grouped with words that shared the same POS tag (e.g. king and car). The study found that training the CBOW and skip-gram models on dependency context (parent words in dependency graphs from the training data) instead of linear context in text, made the embeddings better fit to be features in dependency parsing.

Bansal et al. (2014) conducted parsing experiments with the MSTParser using continuous representations with two different approaches, and compared them to Brown clusters (as described by Koo et al. (2008)). The first approach, called bucket features, consisted of creating "one indicator feature per dimension of each embedding vector, where the feature consists of the dimension index d and a bucketed version of the embedding value in that dimension". The second approach is calledcluster bit string features.

Agglomerative hierarchical clustering was performed to take into account all dimensions of the embeddings simultaneously. Ward’s minimum vari- ance algorithm was used as cluster distance metric, and euclidean distance was used to measure distance between vectors.

The experiments conducted by Bansal et al. (2014) showed that bucket features generally do not improve accuracy in parsing, but cluster bit string features showed statistically significant improvements. Furthermore, the Brown clusters did better at attaching proper nouns, prepositions, and conjunctions, while CBOW did better on plural common nouns and ad- verbs.

As focus recently has shifted from using clusters based on word embed-

(32)

dings, to using word embeddings directly as features in dependency parsing. Two parsers, Dyer’s parser(LSTM-parser) (Dyer et al., 2015), further described in 2.3.6, and UDPipe (Straka et al., 2016), further described in 2.3.7, will be used in this project.

2.3.3 Word embeddings and word order

As the embeddings created by CBOW and skip-gram do not take word order into consideration, they have a weakness when used for order- sensitive tasks such as POS-tagging and syntactic parsing. Two modifications to the traditional CBOW and skip-gram models have been made by Ling et al. (2015), with the goal of making the embeddings more suited for these tasks, while maintaining the simplicity and low computational cost of the original models. In their study, they present two new models, theStructured Skip-gram Modeland theContinuous Window Model.

Whereas the traditional skip-gram model uses a single output matrix to predict every contextual word given the center word, the structured skip- gram model proposed by Ling et al. (2015) defines a set of output prediction matrices. Each output prediction matrix is dedicated to predicting the word in its position relative to the center word. When making a prediction, the appropriate matrix is selected to predict a word given its position relative to the centre word.

Similarly, the traditional CBOW model uses a single prediction matrix, which is fed with the concatenation of the embeddings of the context words. The Continuous Window Model proposed by Ling et al. (2015) defines a different output prediction matrix which allows the words to be treated differently depending on where they occur in relation to the target word.

Compared to the traditional skip-gram and CBOW models, the modifications proposed by Ling et al. (2015) do not increase the computational cost of calculating probabilities. Due to the increasing number of parameters of the prediction matrix, the models are more prone to issues with data sparsity than the traditional models when trained on small datasets, but generally this is not a big problem, as these models are typically trained on datasets in the order of 100 million words.

Ling et al. (2015) conducted experiments with the structured skip-gram and continuous window models for both POS-tagging and dependency parsing. In the dependency parsing experiments, the models were trained on the English Penn Treebank, using the neural network parser introduced by D. Chen and Manning (2014). The embeddings were appended to the

(33)

feature vectors directly, and no bucket features or clusters were used. The results of the experiments showed improvement over the traditional skip- gram and CBOW models proposed by Mikolov, Chen, et al. (2013).

2.3.4 A neural network approach to dependency parsing

D. Chen and Manning (2014) addressed that current dependency parsers classify based on millions of sparse indicator features which generalise poorly, and restricts parsing speed. To solve this problem, they constructed a parser based on a neural network classifier. In contrast to traditional parsers, this parser does not need to compute conjunction features and look them up in a huge feature table. In each step, the neural network parser extracts the corresponding word, POS-tag and embedding from its configuration. It then uses matrix operations to pick the transition with the highest probability. The parser was tested and compared to MSTParser (McDonald et al., 2005)) and MaltParser (Nivre et al., 2006), and the results showed that it outperformed both in terms of labelled and unlabelled accuracy score and speed, when trained and tested on the Penn treebank.

2.3.5 Recurrent neural networks

A problem that arises with the use of feed-forward neural networks in dependency parsers, such as the one proposed by D. Chen and Manning (2014), described in 2.3.4 is the constant number of inputs and outputs.

Whereas traditional neural networks, as described in chapter 2.1.1, use a fixed-length input, a set of hidden nodes, and a fixed-length output, the recurrent neural network is fed a sequence of inputs, and the output is a sequence corresponding to the length of the input.

Unlike traditional feed-forward neural networks, the recurrent neural network typically has only a single hidden layer. The advantage of recurrent neural networks, and the reason why they are able to handle varying length input, is that the hidden layer of each input is fed forward through a set of weights, and used as an additional input to the hidden layer of the next input. The hidden layer of recurrent neural networks can then be viewed as a sequence, where the output depends on the previous layer.

This complicates the calculation of error, and thus making backpropagation a more complex task than in standard feed-forward networks. An algorithm known as backpropagation through time calculates the error in each time step, and adjusts both the weights for the hidden layer and the between-time steps layer.

(34)

ht

h₀

y1

v₁

h₂ y2

v₂

h₃ y3

v₃

h₄ y4

v₄

h₅ y5

v₅

Figure 2.4: Example RNN with one layer. Eachvi ∈ V denotes the input vectors, each y_i ∈ Y denotes the output in each time step. Horizontal arrows denote multiplication of the weightsh_i →h_i+1.

Traditional recurrent neural networks suffer from a problem known as the vanishing gradient problem. Many solutions to the vanishing gradient problem have been proposed, such as Elman and Jordan networks, echo state networks and neural history compressors. The current state-of-the- art recurrent neural network is known as the long short-term memory network. Recently, a dependency parser based on long short-term memory networks has been proposed by Dyer et al. (2015). This parser will be further described in chapter 2.3.6.

2.3.6 Dependency parsing using stack long short-term memory

Dyer et al. (2015) presented a new way of utilising neural networks in transition-based dependency parsing. Where traditional neural network models often suffer from the vanishing gradient problem, Dyer’s parser trains a variant of recurrent neural nets (RNNs) called a long short-term memory neural network, first presented by Hochreiter and Schmidhuber (1997).

Backpropagation through traditional neural networks is done by comput- ing gradients by the chain rule, and when the backpropagation reaches the

"front" nodes, the gradient becomes small making training of these nodes very slow. In recurrent neural networks, where each time step creates a new layer in the network, this might lead to a significant loss of information. This problem is known as thevanishing gradient problem.

In order to solve this problem, Hochreiter and Schmidhuber (1997) introduced the long short-term memory neural network, placing read, write and keep gates, which are opened and closed depending on the input, on each

(35)

node in the network to control their behaviour. These gates allow nodes to remember information for a longer time than the traditional nodes, which are always read and written back to during backpropagation.

Dyer’s parser, like Nivre et al. (2006), uses a stack based architecture.

The difference between them is how probabilities are calculated. Whereas Nivre et al. (2006) uses an SVM classifier to predict transitions, Dyer’s parser constructs the representation used to predict transitions incremen- tally for each time step, using three stacks of long short-term memory units. Two stacks are similar to the stacks in MaltParser; the stack and the buffer. The third stack stores the previous actions taken by the parser.

2.3.7 A trainable NLP pipeline

Researchers at the University of Prague, Czech Republic, have developed a tool dubbed UDPipe, intended to streamline the processing of natural languages, from tokenisation to dependency parsing (Straka et al., 2016).

UDPipe was built to processUniversal Dependency⁴ treebanks without the need for any external data, but it works well for all treebanks in the CoNLL-U format.

This project focuses on statistical dependency parsing, and thus we are most interested in this part of the UDPipe pipeline. The parser used in UDPipe is the Parsito Neural Network parser, presented by Straka, Ha- jic, Straková, and Hajic jr (2015). Parsito is a transition based dependency parser, and utilises theright-arc,left-arcandshiftoperations, as well as a stack datastructure, similar to MaltParser (Nivre et al., 2006) and other projective arc-standard systems. Attardi (2006) presented a parser with additional stack operations to this system making arcs between non- adjacent subtrees possible, allowing the parser to build non-projective graphs. Gómez-Rodriguez, Sartorio, and Satta (2014) presented another parser where a restriction the operations presented in Attardi (2006) were allowed, making non-projectivity to a certain degree possible, while re- taining theO(n²)computational time of arc-standard parsers. Parsito uses the same set of operations as the Gómez-Rodriguez et al. (2014) parser.

2.3.8 Word embeddings in UDPipe and Dyer’s parser

Both UDPipe (Straka et al., 2016) and Dyer’s parser (Dyer et al., 2015) use internally trained word embeddings as features in the dependency parsing. Additionally, both parsers accept pre-trained word embeddings, such as word2vec (Mikolov, Chen, et al., 2013) and wang2vec (Ling et al., 2015)

4Universal Dependencies introduction: http://universaldependencies.org/

introduction.html

(36)

embeddings.

In Dyer’s parser, words are represented by the concatenation of three vectors; a learnt representation for each word type, a fixed representation from a neural network language model (see chapter 2.1.2), and learnt representation of the POS tag of the token. By default, Dyer’s parser uses the provided word embeddings directly as features for each word occuring both in the embeddings and the training data (Dyer et al., 2015).

UDPipe’s internal structure is a neural network based on the neural network parser presented by D. Chen and Manning (2014). The input to the network consists of nodes representing words in the tree being built (Straka et al., 2015). The nodes in the network are represented by embeddings of its form, POS tag and arc-label (if applicable). The embeddings are initialised with random values and trained on the treebank data to- gether with the network. It is possible to provide the parser with pre- trained form embeddings. When pre-trained word embeddings are provided, these could either be used directly as form embeddings, or be used as a basis for the training of the internal form embeddings. Along with this option, UDPipe offers a wide range of hyperparameters which can be tuned for better result⁵. Exploring these parameters is out of the reach of this project.

In this project we will focus on the task of improving dependency parsing of Norwegian by training embedding models, which in turn will be used as input to the dependency parsers.

2.4 Dependency parsing of Norwegian

An annotated text corpus, called atreebank, is required in order to train and test a dependency parser. During training, the parser learns relationships based on the annotation, and in order to evaluate the results of a parse, we need an annotated gold set to compare our results to. In this project, we will use the recently published Norwegian Dependency Treebank⁶ (Solberg et al., 2014), which is incorporated in the Språkbanken resource catalogue.

2.4.1 The Norwegian Dependency Treebank

A treebank, is needed to train a data driven dependency parser. Until recently, no treebank for Norwegian language was available. As depen-

5UDPipe user’s manual: https://ufal.mff.cuni.cz/udpipe/users-manual

6Språkbanken dependency treebank: http://www.nb.no/sprakbanken/show?

serial=oai%3Anb.no%3Asbr-10&lang=nb

(37)

Source Percentage of NDT sources

Newspaper articles 82%

Government reports 7%

Parliament transcriptions 6%

Blog posts 5%

Table 2.2: Source ratio for NDT⁸

dency parsing has gotten a lot of attention, and many dependency parsers have become more available in the recent years, work was carried out by Solberg et al. (2014) to construct a treebank for Norwegian.

When building a treebank, reference guidelines for the annotation process have to be made. The Norwegian Dependency Treebank has four funda- mental principles⁷;

• Linguistic adequacy: The annotation should be as linguistically adequate as possible.

• Consistency: It had to be possible for annotators to implement the analyses consistently.

• Quick annotation: The annotators should be able to annotate quickly, in order to cover a sizable amount of text.

• Easy retrieval: It should be easy to retrieve specific constructions after annotation.

As we can see from table 2.2, the Norwegian Dependency Treebank consists mainly of newspaper text, but includes government reports, parliament transcriptions and blog posts. This has to be taken into consideration when the treebank is used for dependency parsers trained using word embeddings – word embeddings trained using similar texts are more likely to recognise words and contexts in this kind of material, and will be more useful in the training of dependency parsers.

Looking at our example sentence in figure 2.5, we see that the finite verb (INFV) "viser" and the auxillary verb "har" were selected as heads. Further we see that the subjunctive "han" is dependent on the verb, and that the infinitive marker "å" is the head of the infinite verb. We also see that the

7List cites Solberg et al. (2014).

8Information presented in Table 2 by Solberg et al. (2014).

(38)

Den viser at han har trengt å komme i modus . It shows that he has needed to get into competition-mode .

INFV

SUBJ

SBU SUBJ DOBJ

INFV DOBJ INFV ADV PUTFYLL

IP

preposition "i" is the head of the object of the preposition "modus".

Further work on optimising a POS tag set targeted at the task of dependency parsing for the Norwegian Dependency Treebank was proposed by Hohle, Øvrelid, and Velldal (2017). This task was linguistically motivated, and more fine-grained POS-tags containing more specific linguistic information about each word were introduced. On one hand, this is good as it adds further information about, and distinction between, every single word in the corpus. On the other hand, this leads to more scarce information regarding each POS, as fewer examples for each word class were available in the dataset. Hohle et al. (2017) showed an overall increase in labelled accuracy in dependency parsing using a wide range of parsers to evaluate the tagset.

Hohle et al. (2017) also propose a data split consisting of a training set, a development set and a test set, with an 80-10-10 distribution. In this project we will use the data split proposed by Hohle et al. (2017), but not the optimised tag set. As little research has been done using the optimised tag set, we found it more feasible to use the standard tag set, in order to be able to compare our results of existing and future studies using the Nor- wegian Dependency Treebank.

The practical details regarding the usage of the Norwegian Dependency Treebank in this project are described in chapter 3.2.1.

(39)

2.5 Intrinsic versus extrinsic evaluation of word embeddings

This project focuses on the use of word embeddings as a tool to improve dependency parsing of Norwegian language. This is a downstream, extrinsic task. Intrinsic evaluation, on the other hand, is simply a way of evaluating language models in general. Intrinsic evaluation of word embeddings is a vaguely defined task, as word embeddings are typically used as tools to increase the precision in certain tasks, such as POS tagging and dependency parsing. Typically, when it comes to evaluation of word embeddings, a task would be considered intrinsic if it evaluates word embeddings in an isolated context (i.e. not as part of a larger task).

Intrinsic evaluation is often performed in order to evaluate the quality of word embeddings, because it is quicker and less time consuming than ex- haustively testing multiple embedding models in the actual task they are trained for. Furthermore, intrinsic evaluation is instrumental in tasks such as semantic analysis, where the relation between words in a word embedding model is the primary data.

2.5.1 Intrinsic evaluation datasets

A number of tools and datasets for intrinsic testing of word embeddings, such as SimLex-999 (Hill, Reichart, & Korhonen, 2016) and Google Analo- gies (Mikolov, Chen, et al., 2013), exist for English. The Google Analo- gies set contains word quadruples, and the intention is to evaluate how well the embedding system performs at predicting the last element in each quadruple. The quadruples can be read as "v1 is to v2 asv3 is to v4"; for instance, one quadruple could be v = [Oslo,Norway,Stockholm,Sweden]. This is typically done by v₂−v₁+v₃ = v₄, as is presented by Mikolov, Yih, and Zweig (2013).

The Google Analogies dataset is split into two major categories; semantic and syntactic analogies. Each major category is further divided into subcategories. Table 2.3 shows all subcategories, as well as a simple example from each category⁹

2.5.2 Evaluation of Norwegian analogies

As Norwegian is a much smaller language in terms of number of native speakers and users (approximately 5 million), similar tools for Norwegian are scarce resources. Creation of tools aimed at intrinsic testing of word

9Table 2.3 is a citation of Mikolov, Chen, et al. (2013).

(40)

Category v₁ v₂ v₃ v₄ Semantic

Common capital city Athens Greece Oslo Norway All capital cities Astana Kazakhstan Harare Zimbabwe

Currency Angola kwanza Iran rial

City-in-state Chicago Illinois Stockton California Man-Woman brother sister grandson granddaughter Syntactic

Adjective-to-adverb apparent apparently rapid rapidly Opposite possibly impossibly ethical unethical

Comparative great greater tough tougher

Superlative easy easiest lucky luckiest

Present Participle think thinking read reading Nationality adjective Switzerland Swiss Cambodia Cambodian

Past tense walking walked swimming swam

Plural nouns mouse mice dollar dollars

Plural verbs work works speak speaks

Table 2.3: Categories of the Google Analogies dataset. The first five categories contain semantic questions, and the last nine contain syntactic questions.

embeddings is currently being carried out in the MSc thesis of Stadsnes (forthcoming) at the Language Technology Group, University of Oslo. At this point, an analogy dataset has been proposed.

The dataset proposed by Stadsnes (forthcoming) is similar to the Google Analogies set proposed by Mikolov, Chen, et al. (2013), but as the Google Analogies dataset contains a certain amount of content specific to English, such as thecity-in-statecategory, the task of creating a similar tool for Nor- wegian is more complex than simply translating every single word. The categories in this dataset are presented in table 2.4.

This dataset will be used for intrinsic testing of word embeddings in this project, and is further described in chapter 3.3.5.

Although our project aims to optimise embeddings used as parameters for dependency parsers, we found it interesting to run the embeddings through the intrinsic analogy test proposed by Stadsnes (forthcoming).

The goal of the intrinsic testing presented in chapter 6 is to see how different embedding models perform, and to identify whether or not there is a correlation between embeddings performing well as features for dependency parsers and in the task of identifying analogies.

(41)

Category v₁ v₂ v₃ v₄ Semantic

Common capital city Athen Hellas Bagdad Irak All capital cities Abuja Nigeria Accra Ghana

Currency Algerie dinar Angola kwanza

City-in-county Hønefoss Buskerud Stord Hordaland

Man-Woman gutt jente bror søster

Syntactic

Adjective-to-adverb munter muntert hel helt Opposite akseptabelt uakseptabelt vitende uvitende

Comparative dårlig dårligere stor større

Superlative dårlig dårligst stor størst

Nationality adjective Albania albansk Argentina argentinsk

Past tense danser danset avtar avtok

Plural nouns banan bananer fugl fugler

Plural verbs avta avtar beskrive beskriver

Table 2.4: Categories of the Norwegian Analogies dataset proposed by Stadsnes (forthcoming)

(42)

(43)

Chapter 3 Data and tools

In order to carry out the experiments presented in chapter 4, 5 and 6, extensive work had to be done to prepare and preprocess the tools and data used in this project. In this chapter, we will discuss the preparation of the tools and data, as well as the practical parts (parameters, technical details, etc.) of the tools in use. The theoretical background regarding the methods underlying these tools is described in Chapter 2.

3.1 Unlabelled text

Unlabelled text is raw text without any annotation or additional information. It is typically gathered from various sources into corpora, which can be used for a wide variety of purposes. In this project, we use unlabelled text to train word embeddings. By looking at a large amount of text, word embedding tools such as word2vec (Mikolov, Chen, et al., 2013) are able to learn which words are similar to each other. This is done by looking at each individual word and its context, minimising the distance in the vector space to words occurring in similar contexts, whilst maximising the distance to all the other words in the vocabulary. In this project we used the Norwegian Newspaper Corpus and NoWAC corpora. These corpora will be described in this chapter.

3.1.1 The Norwegian Newspaper Corpus

The Norwegian Newspaper Corpus (NAK) is a compilation of text gathered continuously from Norwegian newspaper articles. The corpus is available as a free download at the Norwegian National Library (Nasjon- albiblioteket) website. With over 1 billion words for Norwegian Bokmål and 60 million words for Norwegian Nynorsk, it is the largest corpus of its kind. Tokenised texts are available for newspaper articles dated between 1998 and 2011. Newspaper articles from 2012 through 2014 are available as XML documents. In this project we have extracted the raw text from

(44)

these XML files using our own Python script, and prepared the text for the training of embeddings using wang2vec (Ling et al., 2015), a reimplemen- tation of word2vec (Mikolov, Chen, et al., 2013) with extensions.

The concatenation of the pre-tokenised 1998-2011 files and the 2012-2014 files contains 1,291,982,180 tokens, and has a vocabulary of 8,671,622 unique words.

3.1.2 NoWAC - Norwegian Web as Corpus

The Norwegian Web Corpus (NoWAC) (Guevara, 2010) is a web corpus for Norwegian Bokmål, developed at the Department of Linguistic and Scandinavian studies at the University of Oslo. The corpus was created by crawling top level.nointernet domains in the period between November 2009 and January 2010.

NoWAC is the Norwegian contribution to the WaC Corpora Collection, a multilingual collection of web corpora developed using the Corpus Fac- tory method (Kilgarriff, Reddy, Pomikálek, & Avinesh, 2010). The method used to gather the data used in NoWAC is based on the work of Ferraresi, Zanchetta, Baroni, and Bernardini (2008); first a group of seedURLs containing information in the target language is collected by querying popular search engines. The acquired documents are then used to start a crawling of sub-domains and hyperlinks, limited to the.nodomain (Guevara, 2010).

Although Norway has two official national written standards,Bokmåland Nynorsk, the web contains a lot of unrecognised words, such as words in- fluenced by dialect and slang words. Due to the complex linguistic sit- uation in Norway, steps had to be taken to identify the language of the websites crawled. The target of NoWAC was to build a corpus in the Bok- mål language, mixed with some "spoken language", making up a corpus of combined written and spoken language.

NoWAC is distributed as a single file, consisting of sentences one word per line and with start of sentence and end of sentence tags. Tokenisation of NoWAC was carried out by Nico Kuijlen, MSc student at the language technology group, to prepare this data for wang2vec.

3.1.3 Merging the corpora

After preprocessing the corpora, they were merged into a single corpus.

As the Norwegian Dependency Treebank mostly consists of text from newspaper sources, as we can see from table 3.2, we did not conduct experiments using only NoWAC for the training of word embeddings.

(45)

Dataset Tokens Unique tokens

NAK 1,527,414,380 8,671,622

NoWAC concat. 2,217,487,419 12,156,743

Table 3.1: Number of tokens for the Norwegian Newspaper Corpus and NoWaC

Source Percentage of NDT sources

Newspaper articles 82%

Government reports 7%

Parliament transcriptions 6%

Blog posts 5%

Table 3.2: Source ratio for NDT¹

NoWAC was concatenated to the Norwegian Newspaper Corpus increase the size of the corpus, possibly providing some extra detail. The resulting token counts of the Norwegian Newspaper Corpus and the concatenation are presented in Table 3.1.

3.2 Labelled text

In order to train a data driven parser for a language, we need a treebank.

A treebank is an annotated dataset. It typically contains information such as form, lemma, POS-tag, morphological and/or syntactical features for each word, as well as its relation to other words within the sentence. The parser uses this information to learn how a language is constructed, which is essential in the task of dependency parsing. In this project we use the Norwegian Dependency Treebank (Solberg et al., 2014). This treebank will be described in chapter 3.2.1.

3.2.1 Norwegian Dependency Treebank

The Norwegian Dependency Treebank (NDT), described in chapter 2.4.1, was developed at the Norwegian National Library in collaboration with linguists and computational linguists from the University of Oslo. It is a syntactic treebank for Bokmål and Nynorsk with manual syntactic and morphological annotation (Solberg et al., 2014). The treebank consists mainly of newspaper text, but includes government reports, parliament transcriptions and blog posts (see table 3.2).

The Norwegian dependency treebank uses the CoNLL-X format to en-

(46)

Tag Explanation

ID Word index.

FORM Word form or punctuation.

LEMMA Word lemma or stem

CPOS Coarse grained part-of-speech tag POS Fine grained part-of-speech tag

FEATS Morphological and/or syntactical features from a language specific extension.

HEAD Index of the word’s parent. 0 if the parent is root.

DEPREL Dependency relation to the HEAD.

PHEAD Projective head of current token.

PDEPREL Dependency relation to the PHEAD.

Table 3.3: Explanation of the CoNLL-X data fields column-wise² code dependency graphs, and can therefore be used without any modifications in modern dependency parsers, such as Dyer’s parser and UDPipe.

In the CoNLL-X format, each line represents a word, and sentences are separated by empty lines. Word features are fixed order and tab-separated.

In order to use the treebank to train a parser, it has to be split it into a training set, a development set, and a testing set. A split for the Norwe- gian Dependency Treebank was proposed out by Hohle et al. (2017), in the context of their work on optimising a POS-tag set for Norwegian. The split divides the treebank into a training set containing 15696 sentences, a development set containing 2410 sentences, and a test set containing 1939 sentences. In this project we aim to use this split for training, development and testing, since it makes it possible to compare our results to previous work.

Table 3.3 shows an example sentence taken from the training set. In our treebank, the DEPS and MISC fields are always empty, and have been omitted from this example for readability. The dependency graph representing this sentence is shown in Figure 3.1.

The Norwegian Dependency Treebank is distributed as a set of folders, each containing CoNLL-X formatted files sorted by source. The datasets proposed by Hohle et al. (2017) is an 80-10-10 split of the Norwegian De-

1Information presented in Table 2 by Solberg et al. (2014).

2CoNLL-U format: http://universaldependencies.org/format.html

(47)

ID Form Lemma CPOS POS Feats Head Dep

1 Den den pron pron mask(...) 2 SUBJ

2 viser vise verb verb pres 0 FINV

3 at at sbu sbu _ 5 SBU

4 han han pron pron mask(...) 5 SUBJ

5 har ha verb verb pres 2 DOBJ

6 trengt trenge verb verb perf-part 5 INFV

7 å å inf-merke inf-merke _ 6 DOBJ

8 komme komme verb verb inf 7 INFV

9 i i prep prep _ 8 ADV

10 modus modus subst subst appell(...) 9 PUTFYLL

11 . $. clb clb <punkt> 2 IP

Table 3.4: Example sentence from the development set in CoNLL-X format, last two fields omitted.

Den viser at han har trengt å komme i modus .

It shows that he has needed to get into competition-mode .

ROOT

SUBJ

SBU SUBJ DOBJ

INFV DOBJ INFV ADV PUTFYLL

IP

pendency Treebank, balancing the data in terms of genre.

In order to make the treebank usable with our word embeddings, similar normalisation had to be done to the treebank, as had been done to the corpora used to train embedding models. This normalisation is described in chapter 3.2.2.

3.2.2 Normalisation of the corpora and treebank

It is common practice when working with word embeddings to replace every numeric digit by a placeholder. To retain information about the number of digits in a number, every single digit, and not every number, was replaced NUM. For instance, the number 5 would be replaced by NUM, whereas 22 would be replaced by NUMNUM. In order to make the parsers recognise the embeddings associated these placeholders, the same normalisation was done to the dependency treebank.

Tuning word embeddings for neural dependency parsing of Norwegian

Tuning word embeddings for neural dependency parsing of Norwegian

Henrik Hillestad Løvold

Thesis submitted for the degree of

Master in Informatics: Technical and Scientific Applications

(Language Technology group) 60 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

Tuning word embeddings for neural dependency parsing of

Norwegian

Henrik Hillestad Løvold

Abstract

Acknowledgements

Contents

List of Figures

List of Tables

Chapter 1

Introduction

1.1 The project

1.2 Outline

Chapter 2 Background

2.1 Distributional semantics

2.1.1 Machine learning and Artificial Neural Networks

2.1.2 Language models

2.1.3 Word embeddings

2.1.4 Count-based vs. prediction-based models

2.1.5 Evaluation of word embeddings

2.2 Dependency parsing

2.2.1 Graph-based dependency parsing

2.2.2 Transition-based dependency parsing

2.2.3 Learning

2.3 Parsing with distributional semantics

2.3.1 Clustering

2.3.2 Word embeddings as features in dependency parsing

2.3.3 Word embeddings and word order

2.3.4 A neural network approach to dependency parsing

2.3.5 Recurrent neural networks

2.3.6 Dependency parsing using stack long short-term memory

2.3.7 A trainable NLP pipeline

2.3.8 Word embeddings in UDPipe and Dyer’s parser

2.4 Dependency parsing of Norwegian

2.4.1 The Norwegian Dependency Treebank

2.5 Intrinsic versus extrinsic evaluation of word embeddings

2.5.1 Intrinsic evaluation datasets

2.5.2 Evaluation of Norwegian analogies

Chapter 3

Data and tools

3.1 Unlabelled text

3.1.1 The Norwegian Newspaper Corpus

3.1.2 NoWAC - Norwegian Web as Corpus

3.1.3 Merging the corpora

3.2 Labelled text

3.2.1 Norwegian Dependency Treebank

3.2.2 Normalisation of the corpora and treebank