Named Entity Recognition for Norwegian


Experiments on the NorNE corpus

Tobias Aasmoe

Thesis submitted for the degree of Master in Language Technology

60 credits

Department of Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO


Named Entity Recognition for Norwegian http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo


Abstract

In this thesis, we report the first NER results on the Norwegian Named Entities corpus (NorNE). Through our experiments, we investigate the impact of unsupervised pretraining, the use of different label sets and encoding schemes, and also the effect of combining the Bokmål and Nynorsk partitions of NorNE. We also perform a short hyperparameter tuning and an ablation study, showing the impact of certain components in the NER architecture.


Acknowledgments

I would like to thank my supervisors, Erik Velldal and Lilja Øvrelid at the Department of Informatics.


List of Figures

2.1 A three-layered feed-forward neural network. It receives four inputs, has a layer of five units, and the output layer consists of one single unit.
2.2 Comparison of the sigmoid, tanh and ReLU activation functions
2.3 CNN architecture for text classification, using convolutional and pooling layers, followed by a fully connected layer (Kim, 2014).
3.1 The MUC-7 format. U.K is labelled as a location, 17,5 percent as percent, and the past year as a date.
3.2 The CoNLL-2003 format. U.N. is labeled as an organization, Ekeus as a person, and Baghdad as a location.
3.3 The OBT format. Both Raytheon and Hughes are labeled as organizations, and USA as a location.
3.4 Example of how multi-word expressions such as "Cornell University" are treated in the Nomen Nescio, CoNLL-2003, and NorNE corpora.
5.1 An example showing a NCRF++ configuration file
5.2 NCRF++ example architecture (Yang and Zhang, 2018)
6.1 Aggregated confusion matrices (5.7) for NorNE-full, NorNE-7 and NorNE-6.
6.2 Box plot of the two sets of F1 scores.
7.1 Example sentence in Nynorsk (italics) and Bokmål, with a corresponding word-by-word English translation (third line). A proper English translation would be: On a daily basis Sydnes works a lot outside of Norway’s national borders.
8.1 Binary gazetteer usage with NCRF++
8.2 Categorical gazetteer usage with NCRF. Categories PER, LOC and ORG are omitted.
8.3 Capitalization pattern usage with NCRF++
10.1 Aggregated confusion matrices (5.7) for the results on Bokmål and Nynorsk held-out sets, using the joint NER model.

List of Tables

2.1 Comparison of the IO, IOB1, IOB2, IOBE and IOBES encoding schemes
2.2 Performance metrics for the Norwegian NER systems for their particular evaluation sets, as reported in Nøklestad (2009)
2.3 F1 scores among the NER systems mentioned in 2.5.7, when evaluated on the CoNLL-2003 and CoNLL-2012 data sets.
3.1 Distribution of tokens and NEs across the Nomen Nescio subcorpora
3.2 Distribution of NEs in the whole Nomen Nescio corpus
3.3 Proportions of NEs across the Nomen Nescio subcorpora
3.4 Total number of sentences for all data splits among the Bokmål and Nynorsk partitions of NorNE.
3.5 Class frequencies, proportions and average entity lengths in the Bokmål (nob) section of NorNE.
3.6 Class frequencies, proportions and average entity lengths in the Nynorsk (nno) section of NorNE.
3.7 The different label sets used among the data sets. We exclude the subcategories for temporal and monetary expressions of MUC-7 and the subcategories for geopolitical entities of NorNE.
4.1 Precision and recall formulas for SemEval Task 9.1
6.1 Baseline model configuration
6.2 Evaluation scores for the baseline system
6.3 Evaluation scores with the selected pretraining options (dimensionality and fastText architecture), showing exact labeled (EL) F1 scores and training times for NER.
6.4 How the sequence "Da Norge ble spurt av FN om å sende helikopterpersonell til konflikten i Kongo" is annotated using NorNE-full, NorNE-7 and NorNE-6.
6.5 Exact labeled (EL) F1 scores for NorNE-full, NorNE-7 and NorNE-6, when evaluated on the NorNE Bokmål development set
6.6 Performance comparison of models trained on NorNE-full using IOB2, IOBE, IOBS and IOBES encodings, when evaluated on the NorNE development set
6.7 Impact of seed usage over two batches of model instantiations, showing exact labeled F1 scores for each experiment, as well as the mean, standard deviation and variance for each of the two experiment sets.
7.1 Results for the Nynorsk and Bokmål development sets, training on either of the partitions or their combination.
7.2 Results for the Nynorsk and Bokmål development sets, training on either (a) NOB, (b) NNO or (c) NNO + NOB, with the different embedding models described in 7.3.
7.3 Results for the Nynorsk and Bokmål development sets, training on either (a) NOB, (b) NNO or (c) NNO + NOB, with the different embedding models trained on The Norwegian Newspaper Corpora and NoWaC, as reported in 7.4
7.4 Frequencies of tokens and types for the different variants of NNC+NoWaC
7.5 Results for the Nynorsk and Bokmål development sets, training on either (a) NOB, (b) NNO or (c) NOB + NNO, with the new collection of embedding models trained on The Norwegian Newspaper Corpora and NoWaC, as reported in Section 7.6
8.1 Category distribution of the gazetteer entries across the different categories extracted from INESS and the Nomen Nescio corpus, showing total counts, the number of single-token entries and multi-token entries.
8.2 The frequency of the extracted gazetteers in the NorNE training sets
8.3 Evaluation scores on the Bokmål development set, using different combinations of custom features together with the 600-dimensional pre-trained word embeddings
9.1 Overview of hyperparameters and the options or value ranges to investigate, with default values marked with an asterisk.
9.2 Results from using a range of combinations of LSTM layers and units. The icon denotes the baseline setting. 300-dim
9.3 Results from using a range of combinations of CNN layers and units.
9.4 Results on the NorNE development sets using the different combinations of inference and character-level components.
10.1 Exact labeled F1 scores for the development and held-out data sets.
10.2 Exact unlabeled F1 scores for the development and held-out data sets.
10.3 Partial labeled F1 scores for the development and held-out data sets.
10.4 Partial unlabeled F1 scores for the development and held-out data sets.

Chapter 1

Introduction

Within natural language processing and information extraction, Named Entity Recognition (NER) is the task of detecting and classifying proper names in textual data. Common entity types are locations, organizations, persons, events, and in some instances subcategories of these. For instance, given example (1.1), an NER system should be able to recognize Kirsten Danielsen as a person, Røde Kors as an organization, and Jugoslavia as a location:

(1.1) Kirsten Danielsen (47) er koordinator for Røde Kors’ sosiale arbeid blant flyktninger i det tidligere Jugoslavia.
      Kirsten Danielsen (47) is the coordinator for the Red Cross’ social work among refugees in the former Yugoslavia.

Proper names tend to be ambiguous, as they may refer to several different entities in the world. New York could either be the name of a location or the title of a song. Similarly, Thelma could be recognized as the name of a person or the title of a movie. Resolving these kinds of ambiguities between named entities is a challenge that a proper NER system should tackle.

Even though NER has been under intensive research over the last decades, little work has been done on Norwegian. In this thesis we report the first NER results on the Norwegian Named Entities (NorNE) corpus, using a neural sequence labeling architecture. NorNE is the first publicly available NER corpus for Norwegian, with named entity annotations for both of the written standards, Bokmål and Nynorsk. Through our experiments we will investigate the effect of a number of techniques, such as:

• Unsupervised pretraining, and the impact of different word embedding dimensionalities and resources.


• Using different combinations of label sets and encoding schemes with the NorNE corpus.

• Combining Bokmål and Nynorsk resources to train a joint NER pipeline for Norwegian.

• Employing additional features, such as gazetteers and capitalization patterns.

• Architecture optimization, exploring different settings for the word and character-level components of our sequence labeling model.

1.1 Overview

What follows is an overview of the different chapters and their content.

In Chapter 2 we will discuss the general task of NER, covering both traditional and neural system design, giving a brief tour through the basic theory of neural networks and unsupervised pretraining. Furthermore, we will consider some of the current state-of-the-art NER systems for English, and also previous research on Norwegian.

In Chapter 3 we will give a brief introduction to popular NER data sets. We will also discuss Norwegian resources in more detail, providing descriptive statistics.

In Chapter 4 we will consider some of the existing NER evaluation schemes. Here, we will also highlight some of the special considerations for NER, and possible error scenarios we might encounter when evaluating NER systems.

In Chapter 5 we will give an overview of our chosen experimental setup for training our own NER system for Norwegian, covering the chosen sequence labeling framework, the high-performance computational environment, different pretraining resources, and finally our evaluation method.

In Chapter 6 we will report results from a baseline NER system, as well as results from experiments using a collection of word embeddings. In this chapter we will also systematically compare different label sets and encoding schemes. At the end of the chapter, we will compare systems using fixed or random seeds, highlighting the reproducibility of our experiments.

In Chapter 7 we will examine the impact of combining the Bokmål and Nynorsk partitions of the NorNE corpus, comparing the results of joint and mono-standard NER models. We will also report results using a new set of word embeddings which we have trained specifically for this task, using both Bokmål and Nynorsk for pretraining.

In Chapter 8 we will investigate the effect of using custom features together with our raw word and character-level input.

In Chapter 9 we will conduct a simple hyperparameter search, and also perform an ablation study, where we remove certain architecture components.

Finally, in Chapter 10 we will report results from evaluating our best-performing model on the held-out sets for both Norwegian standards, comparing these with the development set results.

Chapter 11 contains an overview of the key findings of this thesis, and also provides some future research suggestions for Norwegian NER.


Chapter 2

Designing NER systems

Since the early 1990s, natural language processing has been through several methodological paradigms, all with favored techniques for approaching named entity recognition. In this chapter we will consider both traditional and modern approaches to this task, discussing important systems and innovations along the way. First, we will begin with a brief overview of the history of the task.

2.1 A short history of NER

NER has historically been a central task within the area of information retrieval (IR), and has often been paired with other NLP tasks such as co-reference resolution. Locating named entities in text data has also been of high interest in the digital humanities, as exemplified in Erdmann et al. (2019).

The first systematic attempts at extracting proper names from text were done in the early 1990s, with notable examples being Coates-Stephens (1992) and Thielen (1995). The term named entity was first used as a part of the Message Understanding Conferences (MUC), more specifically during the MUC-6 evaluation campaign (Grishman and Sundheim, 1995).

At that time, the task was referred to as "Named Entity Recognition and Classification" (NERC), but in later years the term "Named Entity Recognition" (NER) has been adopted to mean the same thing. Besides MUC, NER has been the main focus for numerous conferences and shared tasks internationally, with notable examples being the Conferences on Natural Language Learning (CoNLL) and the Automatic Content Extraction (ACE) research program.

As a formal side note, without delving too far into philosophy of language and its terminology, we can trace the concept of "naming" entities back to Saul Kripke and his theory of rigid designators (Kripke, 1972). A term is considered a rigid designator if it refers to the same object in all possible world states, and such terms include proper names and also natural kind terms like biological species and various materials and substances.

Conversely, terms referring to different objects depending on the world state are called flaccid designators. For example, according to Kripke’s theory, (a) "Erna Solberg" is a rigid designator, while (b) "the Norwegian Prime Minister" is a flaccid designator. The term (a) rigidly refers to the same object (i.e. the person who in 2019 is the Norwegian Prime Minister) regardless of the world state, while the term (b) could have referred to "Jens Stoltenberg", another person, or no one at all, depending on the world state¹.

However, for practical purposes, our exact definitions of named entities are usually not that strict, and there is a general consensus in the NER research community about the inclusion of entities with non-rigid reference or without any reference to named entities at all (Nadeau and Sekine, 2007). For instance, some annotated resources include entity categories for monetary expressions, dates, and also words derived from names.

Over the years, a wide range of annotated resources has become available across different domains and languages. Data sources often include text from news articles, but they can also be drawn from other places on the web, such as blog entries or social media services. In the biomedical domain (Smith et al., 2008), NER is also used to detect entity types such as genes, chemicals and diseases. There is a varying level of entity granularity, i.e. label set sizes, among existing NER resources, as we will observe in Chapter 3.

Early NER systems were often rule-based, and while such methods often prove to be very precise, they might suffer from poor recall scores. Since then, supervised machine learning techniques have become the dominant methodology, requiring increasing amounts of annotated resources. Traditional supervised NER systems were often designed using either Hidden Markov Models (HMM), Decision Trees, Maximum Entropy Models (ME), Support Vector Machines (SVM), or Conditional Random Fields (CRF).

Norwegian NER received some attention in the 2000s during the Nomen Nescio project (Johannessen et al., 2005), but since then Norwegian has been an under-researched language for this specific task. We will discuss previous work on Norwegian NER in Section 2.4 and Norwegian resources in Sections 3.3 and 3.4. For a more detailed rendition of the history of traditional NER, see Nadeau and Sekine (2007).

¹ A more detailed account of possible worlds and related terminology is given in Menzel (2017).

2.2 Label encoding schemes

Common to both traditional and modern NER systems is the way named entity data sets are annotated. Named entity labels are encoded using a labeling scheme, in order to give our classifiers positional information about named entity tokens. Earlier resources made use of XML-like formats, such as those we will describe in Section 3.1 and 3.3, but in later years the tab-separated CoNLL-formats have more or less been established as the conventional way of structuring NER resources. In such data sets, named entities are labeled in accordance with some positional encoding, which we will describe in this section.

The IOB labeling scheme, first introduced by Ramshaw and Marcus (1995), is widely used when labeling named entities. Words are marked according to their position relative to an entity: inside (I-<label>) or outside (O) of any given entity, or at the beginning (B-<label>) of an entity.

There are also slight variations of this scheme. Using the IO encoding, we exclude the B-<label> marker altogether. In IOB1, the B-<label> is only used when a named entity token is followed by a token belonging to the same entity. In IOB2, the B-<label> is used at the beginning of every named entity, regardless of its span. With the more nuanced IOBE, IOBS and IOBES schemes, we include two additional markers: E-<label> for the end token of an entity, and S-<label> for single, unit-length entities. A comparison of these is shown in Table 2.1, with a sample sequence from the NorNE corpus: "Islands president Olafur Ragnar Grimsson".

Token      IO         IOB1       IOB2       IOBE       IOBS       IOBES
Islands    I-GPE_ORG  I-GPE_ORG  B-GPE_ORG  B-GPE_ORG  S-GPE_ORG  S-GPE_ORG
president  O          O          O          O          O          O
Olafur     I-PER      B-PER      B-PER      B-PER      B-PER      B-PER
Ragnar     I-PER      I-PER      I-PER      I-PER      I-PER      I-PER
Grimsson   I-PER      I-PER      I-PER      E-PER      I-PER      E-PER
.          O          O          O          O          O          O

Table 2.1: Comparison of the IO, IOB1, IOB2, IOBE, IOBS and IOBES encoding schemes

Systems trained on IOBES-encoded resources have earlier been shown to give a slight performance increase compared to IOB2 (Ratinov and Roth, 2009). This has also been confirmed later in Reimers and Gurevych (2017) and Yang et al. (2018).
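To make the relation between the schemes concrete, the following is a minimal Python sketch of how IOB2 tags could be converted to IOBES. The function name `iob2_to_iobes` is our own illustrative choice and is not taken from any of the cited systems; the input is assumed to be a well-formed IOB2 sequence.

```python
def iob2_to_iobes(tags):
    """Convert a well-formed IOB2 tag sequence to IOBES (illustrative helper)."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            iobes.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            iobes.append(("I-" if continues else "E-") + label)
    return iobes


tokens = ["Islands", "president", "Olafur", "Ragnar", "Grimsson", "."]
iob2 = ["B-GPE_ORG", "O", "B-PER", "I-PER", "I-PER", "O"]
for token, tag in zip(tokens, iob2_to_iobes(iob2)):
    print(token, tag)
```

Applied to the IOB2 column of Table 2.1, the sketch reproduces the IOBES column: the single-token entity receives an S- marker and the last token of the person name receives an E- marker.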

2.3 Traditional approaches to NER

As mentioned earlier, NER is framed as a word-by-word sequence labeling task, and traditional NER systems were often statistical models or rule-based systems. The CRF is still a strong baseline for sequence tagging, and is also an important component in modern neural architectures, which we will discuss in Section 2.5.7.

In this section we will take a closer look at the central steps of traditional NER, and the common features utilized by traditional systems, focusing on the system by Ratinov and Roth (2009). Finally, we will also discuss historical NER systems for Norwegian text.

2.3.1 Traditional NER features

For traditional NER systems, it is common to extract a feature set encoding morphosyntactic information, extracted from both a target word and its surrounding context. Examples of common features include POS tags, syntactic chunk tags, prefixes, suffixes, capitalization patterns, as well as word shapes (Jurafsky and Martin, 2009). Word shapes refer to some abstract representation of how a word is structured, encoding the basic capitalization patterns a word form can have; for example, Olafur will have the shape Xxxxxx, and president will have xxxxxxxxx.
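As an illustration, a word shape feature of this kind could be computed with a few lines of Python. This is only a sketch: the function name `word_shape` is hypothetical, and mapping digits to `d` while leaving other characters unchanged is our own assumption, as the exact conventions vary between systems.

```python
def word_shape(word):
    """Map a word form to an abstract shape string (illustrative sketch)."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")   # uppercase letter
        elif ch.islower():
            shape.append("x")   # lowercase letter
        elif ch.isdigit():
            shape.append("d")   # digit (assumed convention)
        else:
            shape.append(ch)    # punctuation and other symbols are kept as-is
    return "".join(shape)


for word in ["Olafur", "president", "U.N."]:
    print(word, word_shape(word))
```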

The features mentioned above are all local in the sense that they are extracted within a certain context window of a given word. It is also possible to extract information outside this window, or from knowledge sources external to the document itself. Ratinov and Roth (2009) discuss the use of several non-local features and external sources, including gazetteers, unlabeled data, context aggregation features and extended prediction histories.

Gazetteers are large knowledge bases containing known named entities, for example lists of person names or locations. Whether or not a word is contained in an available gazetteer can be a useful feature, especially when our training resources are limited.
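In its simplest form, a gazetteer feature is a binary indicator per token. The sketch below assumes a tiny, hypothetical set of person names; real gazetteers are far larger and typically also handle multi-token entries, which this example leaves out.

```python
def gazetteer_features(tokens, gazetteer):
    """Return 1 for each token found in the gazetteer, and 0 otherwise."""
    return [1 if token in gazetteer else 0 for token in tokens]


# Hypothetical toy gazetteer of person names, for illustration only.
person_names = {"Kirsten", "Olafur", "Ragnar"}
tokens = ["Islands", "president", "Olafur", "Ragnar", "Grimsson"]
print(gazetteer_features(tokens, person_names))
```

Note that Grimsson is not matched here, which illustrates why gazetteer features are usually combined with other features instead of being used on their own.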

It is also possible to use word class models based on unlabeled text data, an unsupervised technique resembling vector space models for distributional similarity (Brown et al., 1992). Unlabeled data from various sources are clustered hierarchically in order to generate a set of class-based features, such that words sharing the same contexts are assigned to the same clusters.

Another option is context aggregation features, which aggregate the contexts tokens are found in, leading to identical labeling of tokens found in similar contexts (Chieu and Ng, 2003). These features can be defined manually, or they can be extracted automatically. In addition to using the words within a specified context window as features, we can aggregate the context windows of all occurrences of a token in case it appears several times in the text.


One last option is to use an extended prediction history, where we keep track of the labels that have been used for a word earlier. For each token we store the labels that have been assigned to it earlier in the training process, and their relative frequencies.
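The bookkeeping behind such a prediction history can be sketched as follows. The class name `PredictionHistory` and its methods are hypothetical and are not taken from Ratinov and Roth (2009); the labels in the usage example are made up for illustration.

```python
from collections import Counter, defaultdict


class PredictionHistory:
    """Track which labels have been assigned to each token so far (sketch)."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def update(self, token, label):
        """Record that `token` was assigned `label`."""
        self._counts[token][label] += 1

    def relative_frequencies(self, token):
        """Return the relative frequency of each label seen for `token`."""
        counts = self._counts.get(token)
        if not counts:
            return {}
        total = sum(counts.values())
        return {label: n / total for label, n in counts.items()}


history = PredictionHistory()
for label in ["PER", "PER", "PER", "PROD"]:  # made-up earlier predictions
    history.update("Thelma", label)
print(history.relative_frequencies("Thelma"))
```

The resulting frequencies can then be added to the feature vector of every later occurrence of the same token.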

In a traditional system, such as the one described in Ratinov and Roth (2009), all of these local and non-local features are combined into a single feature vector, which is then passed to the learning model (in this case, an averaged perceptron).

2.4 NER systems for Norwegian

We will now examine earlier Norwegian NER systems and relevant research. Other than the Nomen Nescio project, which we will discuss in Section 3.3, little work has been done on NER for Norwegian, and not all of the earlier systems have been constructed to perform both named entity recognition and classification.

2.4.1 Nomen Nescio systems

The Nomen Nescio project resulted in the three following NER systems for Norwegian (Johannessen et al., 2005), referred to as Norwegian CG, ME and MBL. Unfortunately, none of these are publicly available for the time being.

Norwegian CG was a rule-based system using constraint grammar, as described in Jónsdottir (2003). This system was also embedded in the Oslo-Bergen tagger (Johannessen et al., 2012), but it is no longer in use. The tagger was able to recognize whether a given word was a named entity or not, but it could not perform full named entity classification of its input.

Norwegian ME (Haaland, 2008) was a system using a maximum-entropy model. The classifier used features such as suffixes, POS tags and capitalization patterns of names and their neighboring words. Additionally, the following set of gazetteers was used: person names from The National Statistics Agency, location names from The Norwegian Language Council, and a list of work titles compiled by Haaland’s research group. These resources contained 13 213 names in total. The system was designed to handle only the classification part of NER, relying on already NE-chunked text data as input.

Norwegian MBL (Nøklestad, 2009) made use of memory-based learning, using the same features as in Norwegian ME, in addition to lemmas and inflected word forms, and also gazetteers provided by the Nomen Nescio network. Similar to the previous system, Norwegian MBL relied on NE-chunked text input.

These systems were all evaluated in Johannessen et al. (2005), and the results are shown in Table 2.2. CG was tested on an unspecified corpus of 100 000 words. For the evaluation of ME, 10-fold cross-validation was used, training and testing the model on ten different partitions of the corpus. MBL was tested similarly, with leave-one-out cross-validation. Nøklestad (2009) also experimented with a maximum-entropy model, ending up with results similar to those in Haaland (2008).

System  Recall (%)  Precision (%)  F-score (%)
CG      96.5        38.4           54.9
ME      76.0        76.0           76.0
MBL     83.0        83.0           83.0

Table 2.2: Performance metrics for the Norwegian NER systems for their particular evaluation sets, as reported in Nøklestad (2009).
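As a sanity check of the figures in Table 2.2, the F-score column can be recomputed from the precision and recall columns. The sketch below assumes that the reported F-score is the balanced F1 measure, i.e. the harmonic mean of precision and recall; the helper name `f_score` is our own.

```python
def f_score(precision, recall):
    """Balanced F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) in percent, copied from Table 2.2.
systems = {"CG": (38.4, 96.5), "ME": (76.0, 76.0), "MBL": (83.0, 83.0)}
for name, (precision, recall) in systems.items():
    print(name, round(f_score(precision, recall), 1))
```

Under this assumption the recomputed values agree with the table, and the CG row shows how a large gap between recall and precision pulls the harmonic mean down towards the lower of the two.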

2.4.2 Later research

Later on, some research has been done on Named Entity Chunking (NEC) with support vector machines (SVM), using the Nomen Nescio corpus for training (Johansen, 2015). NEC models only recognize named entity segments in text, and do not classify which categories these belong to. The inspiration behind Johansen’s approach was to create a chunker separate from the classifier. This system could then be used together with ME or MBL to form a full-fledged NER system, but there are no reports of such experiments. As with the previous Norwegian NER systems, the NEC model is not publicly available.

In parallel with the work on this master's thesis, Johansen has also conducted newer research on NER, training models on a manually annotated edition of NDT (Johansen, 2019), as we will discuss further in Section 3.5. Keep in mind that this data set is not the same as NorNE, as it uses only four entity categories: locations (LOC), organizations (ORG), persons (PER) and miscellaneous entities (MISC). In his paper, he reports results on Bokmål, Nynorsk and the two combined (referred to as Helnorsk), achieving evaluation scores surpassing the older Nomen Nescio systems in Table 2.2. Models trained on Helnorsk also obtain better results than separate models trained for each of the two Norwegian standards. Johansen used a hybrid sequence labeling architecture similar to Chiu and Nichols (2016) (which we will consider later in this chapter), combining word and character-level features, as well as gazetteers. In the later chapters we will discuss Johansen’s results along with our own findings.


2.5 Modern approaches to NER

In the previous section we saw how traditional NER systems were designed, using mostly hand-crafted features and external knowledge resources, often paired with linear machine learning models. However, there are a few drawbacks to these: the development of language-specific resources is often a tedious and time-consuming process, extracted feature spaces are often high-dimensional and sparse, and linear models are problematic in situations where a classification problem is non-linear.

In this section, we will consider the recent approaches to NER using Artificial Neural Networks (ANNs), where little feature engineering is done, and with few language-specific resources other than a labeled training set. It is far beyond the scope of this thesis to give a complete account of the theory behind ANNs, so we will only consider brief descriptions, with an emphasis on neural architectures. We will start with the fundamental design of basic feed-forward networks. Then, we will examine some special network variants: convolutional neural networks (CNN) and recurrent neural networks (RNN). Lastly, we will look at the current state of NER and the recent trends surrounding the design of sequence labeling systems.

2.5.1 Feed-forward neural networks

Neural networks are learning models containing a structured collection of neurons, where the major inspiration is drawn from the McCulloch-Pitts neuron (McCulloch and Pitts, 1943). One can think of a neuron as a computational unit with scalar inputs and outputs, and for every input element the neuron has a corresponding weight. A unit multiplies all the input elements by their weights and sums them. Then, a non-linear activation function is applied to the result (Goldberg, 2017).

In a feed-forward architecture a collection of neurons is connected to each other, resulting in a network where the computation flow starts from an input layer and ends with an output layer, as shown in Figure 2.1. Between these layers there is a set of hidden non-linear layers, applying a specific activation function on their inputs. None of the unit connections are cyclic, meaning that no output is passed back to the earlier layers.

With a mathematical notation style similar to Jurafsky and Martin (2009) and Goldberg (2017), we can also define the network from Figure 2.1, as shown in Equation 2.1.

Figure 2.1: A three-layered feed-forward neural network. It receives four inputs, has a hidden layer of five units, and the output layer consists of one single unit.²

² http://www.texample.net/tikz/examples/neural-network/

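\[
\begin{aligned}
NN_{MLP}(x) &= \hat{y} \\
a^{[0]} &= x \\
z^{[1]} &= W^{[1]} a^{[0]} + b^{[1]} \\
a^{[1]} &= g^{[1]}(z^{[1]}) \\
z^{[2]} &= W^{[2]} a^{[1]} + b^{[2]} \\
a^{[2]} &= g^{[2]}(z^{[2]}) \\
\hat{y} &= a^{[2]}
\end{aligned}
\tag{2.1}
\]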

For a layer $i$, we refer to $W^{[i]}$ as its weight matrix, and $b^{[i]}$ as the bias term. While $z^{[i]}$ is the combination of weights and biases $W^{[i]}a^{[i-1]} + b^{[i]}$, $a^{[i]}$ is the layer output. $g^{[i]}()$ is the activation function applied on any given intermediate output $z^{[i]}$.

In this notation style we treat the input vector $x$ as the 0-th layer, and therefore we hold that $a^{[0]} = x$. $a^{[1]}$ is the output of the hidden layer, while $a^{[2]}$ is the output of the last layer, which is returned as the final prediction $\hat{y}$.
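The forward computation of such a network, with four inputs, five hidden units and a single output unit, can be sketched in plain Python. Everything here is illustrative: the weights are random, the helper names `affine` and `forward` are our own, and the choice of tanh for the hidden layer and a logistic function for the output unit is an assumption made only to obtain a runnable example.

```python
import math
import random


def affine(W, a, b):
    """Compute z = W a + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def forward(x, W1, b1, W2, b2):
    """Forward pass of a network with one hidden layer."""
    a0 = x                                             # a[0] = x
    z1 = affine(W1, a0, b1)                            # z[1] = W[1] a[0] + b[1]
    a1 = [math.tanh(z) for z in z1]                    # a[1] = g[1](z[1])
    z2 = affine(W2, a1, b2)                            # z[2] = W[2] a[1] + b[2]
    a2 = [1.0 / (1.0 + math.exp(-z)) for z in z2]      # a[2] = g[2](z[2])
    return a2                                          # y_hat = a[2]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
W1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # 5 x 4
b1 = [0.0] * 5
W2 = [[rng.uniform(-1, 1) for _ in range(5)]]                    # 1 x 5
b2 = [0.0]

y_hat = forward([1.0, 0.5, -0.5, 2.0], W1, b1, W2, b2)
print(y_hat)
```

In a trained network the weights and biases would of course be learned from data, which this sketch does not cover.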

There are several non-linear activation functions to choose from. One example is the sigmoid function:

\[
\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{2.2}
\]

Here, the output is bounded between 0 and 1. Another similar function is the hyperbolic tangent (tanh):

\[
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2.3}
\]

with an output bounded between -1 and 1. ReLU has become a popular activation function later on:

\[
\mathrm{ReLU}(x) = \max(0, x) \tag{2.4}
\]

Figure 2.2 shows a comparison of the mentioned activation functions. It should also be noted that (2.2 - 2.4) are only applicable for the hidden layers, and we normally want to use another function for our output layer.

For a classification task, the softmax function is a popular option. Let us assume we have a vector $x$ with a dimensionality of $d$, then

\[
\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{d} e^{x_j}}, \quad 1 \leq i \leq d \tag{2.5}
\]

In other words, softmax returns a vector encoding a probability distribution over $d$ different output classes, where each component lies between 0 and 1 and all components sum to 1.
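The functions in (2.2), (2.4) and (2.5) are straightforward to write down in Python, with tanh already available as `math.tanh`. The sketch below is only meant to make the definitions concrete. Subtracting the maximum inside `softmax` is a common numerical-stability trick that we add here as our own choice; it does not change the result.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def softmax(xs):
    """Map a list of d scores to a probability distribution over d classes."""
    m = max(xs)  # shift by the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0), math.tanh(0.0), relu(-3.0), relu(2.0))
print(softmax([1.0, 2.0, 3.0]))
```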

Figure 2.2: Comparison of the sigmoid, tanh and ReLU activation functions, shown in three panels for inputs between −4 and 4: (a) sigmoid, (b) tanh, (c) ReLU.

2.5.2 Convolutional neural networks

A convolutional neural network (CNN) (LeCun et al., 1998a) is a neural architecture designed to detect local and informative clues within a larger structure (e.g. a sentence or document), and combine these in order to classify the input. CNNs have been widely used for image processing, but are now also known to be successful for various NLP tasks.

The architecture of such networks is similar to the feed-forward variant described in Section 2.5.1, but with some notable modifications.

More specifically, for NLP tasks, a convolutional layer applies a filter: a function which is applied to each word window in an input sentence, generating vectors that capture the central properties of the words located there. Following this is the so-called pooling layer, which unites all the resulting vectors from the previous layer into one single vector. For this, there exist several techniques, such as max and average pooling. This vector is then passed to a set of fully connected layers, as in Figure 2.1, where we can make the final classification of the input.
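The convolution and pooling steps can be sketched in plain Python as follows. The word vectors and filter weights are made-up toy values, the helper names `convolve` and `max_pool` are hypothetical, and the use of tanh as the non-linearity is our own assumption; the final fully connected layers are left out.

```python
import math


def convolve(embeddings, filters, biases, window=3):
    """Apply every filter to each window of `window` consecutive word vectors."""
    outputs = []
    for start in range(len(embeddings) - window + 1):
        # Concatenate the word vectors inside the current window.
        concat = [v for vec in embeddings[start:start + window] for v in vec]
        outputs.append([
            math.tanh(sum(w * v for w, v in zip(f, concat)) + b)
            for f, b in zip(filters, biases)
        ])
    return outputs


def max_pool(window_vectors):
    """Keep the maximum value of each filter across all windows."""
    return [max(column) for column in zip(*window_vectors)]


# Toy input: five words with 2-dimensional embeddings (made-up values).
sentence = [[0.1, 0.3], [0.7, -0.2], [0.0, 0.5], [-0.4, 0.9], [0.2, 0.2]]
# Two filters, each spanning a window of 3 words x 2 dimensions = 6 weights.
filters = [[0.5, -0.1, 0.3, 0.2, -0.6, 0.4], [-0.3, 0.8, 0.1, -0.5, 0.2, 0.7]]
biases = [0.0, 0.1]

windows = convolve(sentence, filters, biases)
pooled = max_pool(windows)
print(len(windows), pooled)
```

With five words and a window of three there are three window positions, and max pooling reduces the three resulting vectors to a single vector with one value per filter, regardless of the sentence length.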

Figure 2.3: CNN architecture for text classification, using convolutional and pooling layers, followed by a fully connected layer (Kim, 2014).

2.5.3 Recurrent neural networks

In convolutional architectures, the order of the input is not preserved beyond the range of the sliding windows, and such architectures may therefore have problems with modeling long-distance relationships and dependencies between words in larger sequences. Recurrent neural networks (RNNs) offer an alternative, where we aim to preserve the structural properties of the inputs (Goldberg, 2017).

An RNN contains internal loops which allow us to pass information between a series of states, each corresponding to a specific input element. In each state a recurrence formula is applied on the previous state vector and the current input vector, returning a new state vector. We can formulate an RNN as a function, again with a notation style similar to Goldberg (2017):

$$
\begin{aligned}
RNN(x_{1:n}; s_0) &= y_{1:n} \\
y_i &= O(s_i) \\
s_i &= R(s_{i-1}, x_i)
\end{aligned}
\tag{2.6}
$$

We refer to $x_{1:n}$ as the input sequence. For a step $i$ in the input, $s_i$ is the state vector. $R$ is the recurrence formula, taking the previous state vector $s_{i-1}$ and an input vector $x_i$ as inputs. $O$ is a function mapping a state vector $s_i$ to an output vector $y_i$. $s_0$ is the initial state vector, which is often omitted or defined as a zero vector (Goldberg, 2017). As opposed to the feed-forward network, where each layer has its separate weights and biases, an RNN has a shared set of parameters (often denoted $\theta$) for all states in the network. The exact details of the functions $R$ and $O$ will be discussed below.

Simple-RNN

In the simplest RNN instantiation, also known as the Elman Network or the Simple-RNN (Elman, 1990), we define the state and output vectors as:

$$
\begin{aligned}
s_i &= R(s_{i-1}, x_i) = g(s_{i-1} W^s + x_i W^x + b) \\
y_i &= O(s_i) = s_i
\end{aligned}
\tag{2.7}
$$

In other words, both the previous state $s_{i-1}$ and the current input $x_i$ are linearly transformed. The sum of these transformations, together with the bias $b$, is then fed to a non-linear activation function $g$.
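The recurrence in 2.7 can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `simple_rnn`, the choice of tanh for the activation function and the randomly drawn weights are assumptions made for the example.

```python
import numpy as np

def simple_rnn(xs, W_s, W_x, b, s0=None):
    """Run a Simple-RNN over xs (n x d_in) and return the outputs y_1..y_n."""
    state = np.zeros(W_s.shape[0]) if s0 is None else s0
    outputs = []
    for x in xs:
        # s_i = g(s_{i-1} W^s + x_i W^x + b), here with g = tanh
        state = np.tanh(state @ W_s + x @ W_x + b)
        outputs.append(state)  # y_i = O(s_i) = s_i
    return np.stack(outputs)

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 3))   # 6 input vectors of dimension 3
W_s = rng.normal(size=(4, 4))  # state-to-state weights
W_x = rng.normal(size=(3, 4))  # input-to-state weights
b = np.zeros(4)
ys = simple_rnn(xs, W_s, W_x, b)
print(ys.shape)
```

Note that the same `W_s`, `W_x` and `b` are reused at every step, which reflects the shared parameter set of the RNN.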

Long Short-Term Memory architecture (LSTM)

Training a Simple-RNN can be challenging due to the vanishing gradient problem (Pascanu et al., 2013), where the error signals from the end of a sequence decrease very quickly during backpropagation. There may also be cases where the values explode during the repeated multiplication across the sequence steps. Consequently, long-range dependencies can still be difficult to capture, or the network may fail to learn anything at all (Goldberg, 2017). Gated architectures such as LSTM networks (Hochreiter and Schmidhuber, 1997) are designed to control the recurrent function R, with the intention of avoiding these vanishing or exploding values. While a Simple-RNN multiplies by the same weight matrix W over and over for each step of the input, the LSTM network avoids this using differentiable gating mechanisms, which determine which parts of the input we write to memory and which parts we disregard. An LSTM is defined below in 2.8 (Goldberg, 2017).

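$$
\begin{aligned}
s_j &= R(s_{j-1}, x_j) = [c_j; h_j] \\
c_j &= f \odot c_{j-1} + i \odot z \\
h_j &= o \odot \tanh(c_j) \\
i &= \sigma(x_j W^{xi} + h_{j-1} W^{hi}) \\
f &= \sigma(x_j W^{xf} + h_{j-1} W^{hf}) \\
o &= \sigma(x_j W^{xo} + h_{j-1} W^{ho}) \\
z &= \tanh(x_j W^{xz} + h_{j-1} W^{hz}) \\
y_j &= O(s_j) = h_j
\end{aligned}
\tag{2.8}
$$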

More precisely, each state $s_j$ is made up of two vectors, a memory component $c_j$ and a hidden state component $h_j$, where the objective of $c_j$ is to preserve the memory and gradients over time. To accomplish this, we have three gates: the input gate $i$, the forget gate $f$, and the output gate $o$. For each of these gates, a linear combination of our input $x_j$ and our previous hidden state $h_{j-1}$ is computed, and the result is passed through the sigmoid activation function (1.3). $z$ is our update candidate, a linear combination of $x_j$ and $h_{j-1}$ fed to the tanh activation function (1.4). For each step, our memory component $c_j$ is updated: the forget gate determines how much of our previous memory we should preserve ($f \odot c_{j-1}$), while the input gate determines how much of the proposed update should remain ($i \odot z$). In the end, we compute $h_j$ by passing the memory component vector $c_j$ through the tanh function, and by letting the output gate $o$ control this non-linearity (with the Hadamard product $\odot$) (Goldberg, 2017).
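A single LSTM step as defined in 2.8 can be sketched as follows. As in 2.8, no bias terms are used. The function name `lstm_step`, the dictionary of weight matrices and all dimensions are assumptions made for the sake of illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, c_prev, h_prev, W):
    """One step of the recurrence in 2.8; returns the new (c_j, h_j)."""
    i = sigmoid(x @ W["xi"] + h_prev @ W["hi"])  # input gate
    f = sigmoid(x @ W["xf"] + h_prev @ W["hf"])  # forget gate
    o = sigmoid(x @ W["xo"] + h_prev @ W["ho"])  # output gate
    z = np.tanh(x @ W["xz"] + h_prev @ W["hz"])  # update candidate
    c = f * c_prev + i * z                       # element-wise (Hadamard) products
    h = o * np.tanh(c)
    return c, h

d_in, d_h = 3, 4
rng = np.random.default_rng(0)
W = {name: rng.normal(size=(d_in if name[0] == "x" else d_h, d_h))
     for name in ["xi", "hi", "xf", "hf", "xo", "ho", "xz", "hz"]}

c, h = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # a sequence of 5 input vectors
    c, h = lstm_step(x, c, h, W)
print(h.shape)
```

Since the memory `c` is only scaled by the forget gate and shifted by the gated candidate, it is not repeatedly multiplied by a weight matrix between steps, which is the property the gating mechanism is meant to provide.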

2.5.4 Training neural networks

Neural networks are trained using gradient-based optimization methods, and one of the central techniques is the backpropagation algorithm (LeCun et al., 1998b; Rumelhart et al., 1988). Stochastic gradient descent (SGD) (Bottou, 2012; LeCun et al., 1998c) is one such optimization algorithm, where we adjust a model's parameters to minimize the total loss, using a specified learning rate (i.e. how large a step we take along the negative gradient when updating the parameters).

In recent years, adaptive optimization algorithms have emerged, such as the unpublished RMSProp (Tieleman and Hinton, 2012), and Adam (Kingma and Ba, 2014).
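To make the role of the learning rate concrete, the sketch below runs plain SGD on a toy linear regression problem, updating the parameters after each training example. The toy data, the squared-error loss and the function name `sgd_linear_regression` are assumptions chosen to keep the example small. In a neural network the gradient would be obtained through backpropagation rather than written out by hand.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, epochs=200, seed=0):
    """Fit weights w so that X @ w approximates y, one example at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for idx in rng.permutation(len(X)):  # visit examples in random order
            # Gradient of the loss 0.5 * (x . w - y)^2 for a single example.
            grad = (X[idx] @ w - y[idx]) * X[idx]
            # The learning rate scales the step along the negative gradient.
            w -= lr * grad
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = X @ np.array([2.0, -1.0])  # targets generated from known weights
w = sgd_linear_regression(X, y)
print(np.round(w, 3))
```

Adaptive methods such as RMSProp and Adam differ from this sketch in that they rescale the step for each parameter based on statistics of earlier gradients.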


2.5.5 Feature representations & unsupervised pretraining

As discussed in Section 2.3, when training a sequence labeling system, we seek to extract feature vectors encoding information like POS-tags, abstract word shapes, and so on. Normally, a one-hot encoding of such categorical data could result in a feature vector containing unique dimensions for every possible feature, which results in a sparse and high-dimensional feature space. This could come with a high computational cost, and in some NLP tasks, such as syntactic parsing, the feature extraction process takes even more time than the parsing itself (Chen and Manning, 2014).

In recent years, the use of dense feature vectors has become central to neural NLP. By this, we mean that core features such as words, POS tags and dependency tags are represented as vectors embedded into a dense, n-dimensional space, which can be passed to the neural architectures described previously in this chapter. While traditional one-hot encoded vectors could have thousands (or even millions) of feature dimensions, these vectors often have just a few hundred, thereby avoiding the "curse of dimensionality". In addition to the neural architectures described earlier, we could then choose to employ a word embedding layer, which will serve as a look-up table for our dense word representations. When processing our input sequences, we can then look up the input tokens in the embedding layer, and pass their representations to the following layers in our neural network (Goldberg, 2017).

A word embedding layer is often randomly initialized, and thereafter optimized during training in the same way as the other network layers.

However, we have already seen the use of unsupervised methods, like the word class models in Ratinov and Roth (2009), where large amounts of unlabeled data are used to supplement an NER system. This is something we can do with neural networks as well, ensuring that we have meaningful word representations before even training a network for a given task. Drawing from distributional semantics (Harris, 1954), through unsupervised pretraining we seek to find representations of the words in a raw, unlabeled corpus, based on their contexts. Words with similar meanings will tend to have similar vectors, since they often occur in the same types of contexts. Preferably, the representations learned from an unlabeled resource will resemble those of our labeled data, which benefits the models in terms of generalization power (Goldberg, 2016).

Dense feature vectors can also be combined in different ways depending on the task at hand. For instance, in the case of sentence classification we could average all the word vectors of an input sentence, and then feed the resulting vector to a feed-forward network.
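The look-up and averaging operations described above can be sketched as follows. The tiny vocabulary, the `<unk>` entry for unknown tokens, the dimensionality of 50 and the function names are all assumptions made for this example.

```python
import numpy as np

# Hypothetical vocabulary; index 0 is reserved for unknown tokens.
vocab = {"<unk>": 0, "oslo": 1, "er": 2, "en": 3, "by": 4}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 50))  # randomly initialized look-up table

def embed(tokens):
    """Map each token to its row in the embedding matrix."""
    ids = [vocab.get(token.lower(), vocab["<unk>"]) for token in tokens]
    return embeddings[ids]

def sentence_vector(tokens):
    """Average the word vectors of a sentence into one fixed-size vector."""
    return embed(tokens).mean(axis=0)

print(embed(["Oslo", "er", "en", "by"]).shape)
print(sentence_vector(["Oslo", "er", "en", "by"]).shape)
```

With pretrained embeddings, the random initialization of `embeddings` would simply be replaced by the pretrained matrix, while the look-up itself stays the same.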

The utility of dense vectors has been known for a while (Bengio et al., 2003), and there are several ways of generating them, with techniques such as Word2Vec, which we will cover in the next section.


2.5.6 Word embedding frameworks

While we have discussed the use of dense word embeddings, we have not really looked at how these representations are obtained. What follows is a short review of some approaches to unsupervised pretraining, including the algorithms Word2Vec and fastText.

Word2Vec & fastText

Initially, dense word embeddings were a positive side effect of training neural language models. Word2Vec (Mikolov et al., 2013) is a fairly simple method for acquiring such vectors, where finding these word representations (or vectors) is the main objective – not the language modeling itself.

Under the hood, Word2Vec is basically a shallow neural network, designed to reconstruct input sequences given a specified context window. Given a corpus, this network will be initialized with a V × N weight matrix, where V is the corpus vocabulary size, and N is the desired dimensionality of the vectors (e.g. 100). Then, for each unique word in the vocabulary, we will have a corresponding row in the weight matrix, which will be optimized during training. This weight matrix is also the end product of the algorithm. For reconstructing input sequences, Mikolov et al. propose two different architectures: CBOW, which attempts to predict a target word given a set of context words, and Skip-Gram, which inversely attempts to predict a whole context given a target word.
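The difference between the two architectures is perhaps easiest to see in the training examples they are given. The sketch below only generates these examples from a token sequence and leaves out the network itself. The function names and the example sentence are assumptions made for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (target, context words) for every position in the sequence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    # CBOW: predict the target word from its context words.
    return [(context, target)
            for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-Gram: predict each context word from the target word.
    return [(target, word)
            for target, context in context_windows(tokens, window)
            for word in context]

tokens = ["jeg", "bor", "i", "oslo"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

CBOW thus produces one example per token position, with the whole context as input, whereas Skip-Gram produces one example per (target, context word) pair.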

Expanding on the Word2Vec algorithm, fastText (Bojanowski et al., 2017) also takes into account sub-word information, ensuring that morphologically similar words will be assigned similar vectors. Furthermore, this makes fastText better suited to finding good representations for the less frequent tokens in a corpus, and also lets us infer vectors of unseen words based on their character n-grams. Both the CBOW and Skip-Gram architectures are available in the fastText framework.
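The sub-word units can be illustrated with a small function that extracts character n-grams from a word, after wrapping it in boundary markers as described by Bojanowski et al. (2017). The function name `char_ngrams` and the Norwegian example words are our own, and the sketch does not reproduce the fastText implementation itself.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of a word, including boundary markers."""
    marked = f"<{word}>"  # "<" and ">" mark the start and end of the word
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

print(char_ngrams("where", 3, 3))

# Morphologically related words share many of their n-grams.
shared = set(char_ngrams("kommune")) & set(char_ngrams("kommunen"))
print(sorted(shared))
```

Since a word vector is composed from the vectors of its n-grams, an unseen word such as an inflected form can still receive a representation, as long as some of its n-grams were observed during training.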

Other approaches

In addition to Word2Vec, there are other widely used frameworks for acquiring dense word representations:

• GloVe (Pennington et al., 2014) mixes a count-based approach – similar to traditional vector space models – and a predictive model, in order to find a "global" vector representation.

• In Word2Vec and fastText, ambiguous words like "bank" will receive the same vector regardless of context, no matter whether they refer to a financial institution or a river bank. ELMo (Peters et al., 2018) offers a way to find contextual word representations, ensuring that such ambiguous words are assigned vectors based on their surrounding context.

• Attention mechanisms (Vaswani et al., 2017) have also become popular tools for training language models, and large transformer-based models such as BERT (Devlin et al., 2018) have grown in popularity in recent years.

2.5.7 Modern NER architectures and techniques

In recent years, LSTM-centered architectures have been the favored approach for sequence labeling, either with an LSTM network as a standalone architecture, or as the core component coupled with other feature extractors. In this section we will review some of the widely used approaches for neural sequence labeling, taking a closer look at CRF inference layers and character-level embeddings. We will also consider a few important systems from the recent NER literature.

Applying conditional random fields in neural architectures

So far, we have only considered softmax (1.6) as the classification function of our neural networks. As a sequence labeling task, NER requires structured output predictions, and the downside of the softmax function is that it does not guarantee output sequences that comply with our chosen encoding scheme.

With the IOB2 label encoding scheme described in Section 2.2, every named entity starts with a B-label token, and subsequent tokens inside the entity are always labeled I-label, so we would expect a properly trained NER system to return valid sequences as well. Conditional random fields (CRF) (Lafferty et al., 2001) have been used with success earlier in the history of sequence labeling, but in recent years they have also proved to be a very useful component in neural sequence labeling architectures, serving as the inference layer where we would normally employ a softmax activation function. For tasks such as NER, where we have dependencies among the encoding tags, CRF layers have been reported to consistently outperform softmax (Reimers and Gurevych, 2017). We could also choose to view such architectures as CRF models with neural feature extractors. We will look at concrete examples later in this section.
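The kind of constraint that a token-wise softmax cannot enforce can be made explicit with a small validity check for IOB2 sequences. This is only a sketch of the constraint itself, not of a CRF layer, and the function name `is_valid_iob2` is an assumption.

```python
def is_valid_iob2(labels):
    """Check that every I-label continues an entity of the same type."""
    previous = "O"
    for label in labels:
        if label.startswith("I-"):
            entity_type = label[2:]
            # I-X may only follow B-X or I-X of the same entity type.
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                return False
        previous = label
    return True

print(is_valid_iob2(["B-PER", "I-PER", "O", "B-LOC"]))
print(is_valid_iob2(["O", "I-PER"]))
print(is_valid_iob2(["B-PER", "I-ORG"]))
```

A softmax layer that labels each token independently can produce any of these sequences. A CRF layer scores label transitions as well, and can therefore learn to assign low scores to transitions such as O followed by I-PER.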


Leveraging character-level input

A plain neural architecture, such as an LSTM, is able to take dense word vectors as input, and in that sense, is able to benefit from the distributional (semantic) information from a word embedding. However, if we process an input sequence on the character-level, we can also extract morphological information from our input.

In 2.5.6 we saw how fastText is able to do this, and we can implement similar components directly in a neural classification model as well. By training a character embedding, which serves as a look-up table similar to a word embedding layer, we can pass character-level input to a feature extractor, such as a CNN or an LSTM. Then, the feature extractor's output can be concatenated with our word-level representation, forming a single feature vector.
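A character-level CNN feature extractor and the concatenation step might be sketched as follows. All names, the alphabet, the dimensions and the use of index 0 for both padding and unknown characters are assumptions made for this example. In a real system the character embeddings and filters would be trained together with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)
alphabet = "abcdefghijklmnopqrstuvwxyzæøå"
# Index 0 is used for both padding and unknown characters in this sketch.
char_to_id = {char: i + 1 for i, char in enumerate(alphabet)}
char_emb = rng.normal(size=(len(alphabet) + 1, 8))  # character look-up table
char_filters = rng.normal(size=(3 * 8, 16))         # 16 filters over 3 characters

def char_features(word, window=3):
    """Character CNN: convolve over character windows, then max-pool."""
    ids = [char_to_id.get(char, 0) for char in word.lower()]
    ids = [0] * (window - 1) + ids + [0] * (window - 1)  # pad short words
    vectors = char_emb[ids]
    windows = np.stack([vectors[i:i + window].reshape(-1)
                        for i in range(len(ids) - window + 1)])
    return np.tanh(windows @ char_filters).max(axis=0)

def token_representation(word, word_vector):
    """Concatenate the word-level and character-level representations."""
    return np.concatenate([word_vector, char_features(word)])

word_vector = np.zeros(100)  # stand-in for a pretrained word embedding
print(token_representation("Oslo", word_vector).shape)
```

The resulting vector has 100 + 16 dimensions, and would be the input for one token to the following layer, e.g. an LSTM.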

Neural NER systems

What follows is a short overview of a few central NER systems, covering their network architectures and approaches. These architectures are often optimized for English, even though some report results on other languages as well. Their performances on the CoNLL-2003 and CoNLL-2012 data sets are shown in Table 2.3 in terms of F1 scores.

• Collobert et al. (2011) were among the first to introduce neural networks for NER, presenting a simple feed-forward network trained on the CoNLL-2003 data set. Unlabeled data from the entire English Wikipedia and the Reuters RCV1 data set were also used to pretrain word embeddings. This is an important system historically, as it was the first NER system to avoid using any hand-crafted features.

• Chiu and Nichols (2016) present a hybrid neural architecture using a bidirectional LSTM and a character-level CNN. They report results from experiments with pre-trained word embeddings, such as the publicly available GloVe and Google embeddings. They also use additional word-level features, encoding both capitalization patterns and gazetteer memberships. The resulting system was at the time regarded as the state of the art on English text (Chiu and Nichols, 2016), and is still considered a strong system. It is worth mentioning that this system used both the training and development sets for training in the final held-out evaluation.

• Lample et al. (2016) present two neural architectures for NER. The first is a bidirectional LSTM with a CRF inference layer. The second is a model that uses an algorithm similar to Shift-Reduce for transition-based parsing, with the states being represented by so-called stack LSTMs. Similar to Chiu and Nichols (2016), their system also makes use of character-level features which are concatenated with the vectors from word embeddings. Unlabeled data from the English Gigaword 4 data set were used to pretrain additional word embeddings (Lample et al., 2016).

• Akbik et al. (2018) present a contextual language modeling method, similar to ELMo (see 2.5.6), which is trained purely on character sequences, achieving state-of-the-art results on both the CoNLL-2003 and CoNLL-2012 data sets.

• Clark et al. (2018) introduce a technique called cross-view training, where a combination of supervised and unsupervised learning is used to find better word representations.

| System | CoNLL 2003 (F1) | CoNLL 2012 (F1) |
|---|---|---|
| Collobert et al. | 89.59 | |
| Chiu & Nichols | 91.62 | 86.28 |
| Lample et al. | 90.94 | |
| Clark et al. | 92.61 | 88.81 |
| Akbik et al. | 93.09 | 89.71 |

Table 2.3: F1 scores among the NER systems mentioned in 2.5.7, when evaluated on the CoNLL-2003 and CoNLL-2012 data sets.

Open NER frameworks

Reported NER systems are often made available as open source, but are rarely prepared to be used off the shelf. Here, we will consider a few neural sequence labeling frameworks, which enable users to instantiate and train their own systems, including the ones discussed in 2.5.7.

• NeuroNER (Dernoncourt et al., 2017) is an open sequence labeling program, allowing users to annotate their own data through a web-based interface, in addition to configuring and training a neural NER system. With a configuration file, users can define the neural components and hyperparameters they wish to use.

• NCRF++ (Yang and Zhang, 2018) is very similar to NeuroNER, and lets users define more or less the same architectures. In addition to the components and hyperparameters you can configure in NeuroNER, NCRF++ also lets the user encode custom features for their data, like capitalization patterns or gazetteer memberships.

• Flair (Akbik et al., 2018) is another open framework, which allows users to train and optimize sequence labeling systems. However, compared to NeuroNER and NCRF++, you cannot train and use new embeddings.


Chapter 3

NER data sets

As mentioned earlier, there exist annotated resources for NER across a variety of languages and domains, and in this chapter we will consider four widely used NER data sets. The CoNLL-2003 shared task data and the MUC-7 data set are two historically important resources for English. At the moment, the NorNE corpus and the Norwegian Nomen Nescio corpus are the only resources available for Norwegian. In addition, another data set based on the exact same source as NorNE was created at the same time (Johansen, 2019). Both the English and Norwegian data sets are of the same domain, where most of the data are gathered from news sites, communication platforms and other types of web content.

We will give the Norwegian resources the most attention, with more in-depth descriptions of the corpora and their partitions. In the end we will compare all the different sets, and discuss the most striking differences and resemblances.

3.1 The MUC-7 data set

The MUC-7 data set (Chinchor, 1998b) is based on the North American News Text Corpora, and was created for the seventh Message Understanding Conference, part of a series of conferences on information extraction supported by the defense agency in the USA (Sundheim and Grishman, 1995). The data set is copyrighted and only available through the Linguistic Data Consortium.

This corpus is annotated with three entity labels, along with their respective subcategories: enamex (person, organization, location), timex (date, time) and numex (money, percent). Named entities are enclosed in XML-like tags marking their span and entity category, where the subcategory is given as an attribute of the tag. Figure 3.1 shows an example from this data set.

The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television broadcaster said its subscriber base grew
<NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during
<TIMEX TYPE="DATE">the past year</TIMEX>.

Figure 3.1: The MUC-7 format. U.K. is labeled as a location, 17.5 percent as a percentage, and the past year as a date.
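As a small illustration of how such inline annotations can be read, the sketch below extracts the entities from the sentence in Figure 3.1 with a regular expression. It only handles tags of exactly the form shown in the figure, so it should be seen as a simplification rather than a parser for the actual MUC-7 distribution. The names `TAG` and `extract_entities` are our own.

```python
import re

# Matches e.g. <ENAMEX TYPE="LOCATION">U.K.</ENAMEX>; \1 requires that the
# closing tag has the same name as the opening tag.
TAG = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')

def extract_entities(text):
    """Return (entity text, label, subcategory) for each annotated span."""
    return [(m.group(3), m.group(1), m.group(2)) for m in TAG.finditer(text)]

sample = ('The <ENAMEX TYPE="LOCATION">U.K.</ENAMEX> satellite television '
          'broadcaster said its subscriber base grew '
          '<NUMEX TYPE="PERCENT">17.5 percent</NUMEX> during '
          '<TIMEX TYPE="DATE">the past year</TIMEX>.')

for entity in extract_entities(sample):
    print(entity)
```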

3.2 The CoNLL-2003 shared task data

The Conference on Natural Language Learning 2003 Shared Task data (Sang and Meulder, 2003) is based on the Reuters 1996 news corpus, and is available in both English and German. The main theme of this shared task was language-independent NER.

The four following entity labels are used: PER (person), ORG (organization), LOC (location) and MISC (miscellaneous). The CoNLL-2003 data is annotated using the IOB labeling scheme, where labels are prefixed to indicate the position of a token relative to an entity: at the beginning (B-label), on the inside (I-label), or outside (O) of any entity (see 2.2). In this format each line contains a word, its POS tag, syntactic chunk tag, and IOB tag. Sentence boundaries are indicated with an empty line. This format has become a standard way of annotating data sets. Figure 3.2 displays an example from this data set.

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O

Figure 3.2: The CoNLL-2003 format. U.N. is labeled as an organization, Ekeus as a person, and Baghdad as a location.
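Reading this column format is straightforward, as the following sketch shows. It assumes exactly four whitespace-separated columns per token line and an empty line between sentences, as described above. The function name `read_conll` is an assumption, and the sample is the sentence from Figure 3.2.

```python
def read_conll(lines):
    """Group token lines into sentences; an empty line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, label = line.split()
        current.append((word, pos, chunk, label))
    if current:  # the last sentence may not be followed by an empty line
        sentences.append(current)
    return sentences

sample = """U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""

sentences = read_conll(sample.splitlines())
print([(word, label) for word, _, _, label in sentences[0] if label != "O"])
```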

3.3 The Nomen Nescio corpus

Until the publication of NorNE in 2019, the only Norwegian NER resource was the Nomen Nescio corpus. This was created during the Nomen Nescio project (Johannessen et al., 2005), a research project dealing with NER for the Scandinavian languages. The data is based on articles from several newspapers and magazines, in addition to some works of fiction, all in Norwegian Bokmål. Unfortunately, this corpus is not publicly available at the moment.

The following six entity labels are used: &pe* (person), &or* (organization), &st* (location), &he* (event), &ve* (work of art), and &an* (miscellaneous).

This resource is in the OBT format, inherited from the Oslo-Bergen tagger, a multi-purpose tagger for Norwegian (Johannessen et al., 2012). Each word w is listed as "<w>", with the following indented line containing its morphosyntactic tags and entity label. Sentence boundaries are indicated with <<< after the last indented line. Figure 3.3 displays an example from this data set.

"<industrigigantene>"
    "industrigigant" subst mask appell be fl @<p-utfyll
"<Raytheon>"
    "Raytheon" subst prop &or* @app
"<og>"
    "og" konj @kon
"<Hughes>"
    "Hughes" subst prop &or* @app
"<i>"
    "i" prep @adv
"<USA>"
    "USA" fork subst prop &st* @<p-utfyll <sted> <org>

Figure 3.3: The OBT format. Both Raytheon and Hughes are labeled as organizations, and USA as a location.
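Entity labels can be pulled out of this format with two regular expressions, one for the word lines and one for the labels. The sketch below is based solely on the excerpt in Figure 3.3 and makes no claim to cover the full OBT format. The names `WORD_LINE`, `ENTITY` and `read_obt` are assumptions.

```python
import re

WORD_LINE = re.compile(r'^"<(.*)>"$')  # a word line such as "<Raytheon>"
ENTITY = re.compile(r'&(\w+)\*')       # an entity label such as &or*

def read_obt(lines):
    """Return (word, entity label or None) for each word in the input."""
    tokens = []
    for line in lines:
        stripped = line.strip()
        word = WORD_LINE.match(stripped)
        if word:
            tokens.append([word.group(1), None])
        elif tokens and stripped:
            # A tag line belongs to the most recent word line.
            label = ENTITY.search(stripped)
            if label:
                tokens[-1][1] = label.group(1)
    return [tuple(token) for token in tokens]

sample = '''"<industrigigantene>"
    "industrigigant" subst mask appell be fl @<p-utfyll
"<Raytheon>"
    "Raytheon" subst prop &or* @app
"<og>"
    "og" konj @kon
"<Hughes>"
    "Hughes" subst prop &or* @app
"<i>"
    "i" prep @adv
"<USA>"
    "USA" fork subst prop &st* @<p-utfyll <sted> <org>
'''

print(read_obt(sample.splitlines()))
```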

The corpus is divided into three subcorpora: newspapers, fiction and weekly magazines, with a total of 226 984 tokens, of which 7590 are named entities. Tables 3.1, 3.2 and 3.3, from Nøklestad (2009), illustrate the distributions and proportions of named entities across the corpus.

| | Tokens | Token % | NEs | NE % |
|---|---|---|---|---|
| Newspapers | 105,493 | 46.5 | 4545 | 59.9 |
| Magazines | 62,870 | 27.7 | 1926 | 25.4 |
| Fiction | 58,621 | 25.8 | 1119 | 14.7 |
| Total | 226,984 | 100.0 | 7590 | 100.0 |

Table 3.1: Distribution of tokens and NEs across the Nomen Nescio subcorpora

From Table 3.1, we can observe that the newspaper subcorpus not only contains the most tokens, but also holds almost 60 percent of all entities. As seen in Table 3.2, person is the most frequent entity type, constituting approximately half of all the entities in the corpus. On the other hand, event is the least frequent type, constituting 0.5 percent of all entities. Table 3.3 illustrates how these categories are distributed among the subcorpora.

| | Person | Location | Org. | Other | Work | Event | Total |
|---|---|---|---|---|---|---|---|
| Tokens | 3588 | 1949 | 1295 | 267 | 137 | 39 | 7275 |
| Proportion | 49.3% | 26.8% | 17.8% | 3.7% | 1.9% | 0.5% | 100% |

Table 3.2: Distribution of NEs in the whole Nomen Nescio corpus

| | Person | Location | Org. | Other | Work | Event |
|---|---|---|---|---|---|---|
| Newspapers | 40.6% | 25.5% | 29.1% | 2.0% | 2.0% | 0.8% |
| Magazines | 51.0% | 28.9% | 9.9% | 7.6% | 2.5% | 0.2% |
| Fiction | 76.8% | 17.8% | 2.4% | 2.3% | 0.6% | 0.09% |

Table 3.3: Proportions of NEs across the Nomen Nescio subcorpora

3.4 Norwegian Named Entities (NorNE)

The Norwegian Named Entities (NorNE) corpus is a publicly available data set for Norwegian¹, with named entity annotations on top of the Norwegian Dependency Treebank (Solberg et al., 2014). This resource is built as a joint effort between the Schibsted Media Group, the National Library of Norway, and the Language Technology Group at the University of Oslo. The data is gathered from a variety of sources on the web, including news sites and blogs.

NorNE follows the CoNLL format, similar to CoNLL-2003, and is annotated with the following entity types: persons (PER), organizations (ORG), locations (LOC), geo-political entities (GPE), products (PROD), events (EVT), and entities derived from names (DRV). Geo-political entities have two subcategories: locations (GPE_LOC) and organizations (GPE_ORG). Initially, annotations of miscellaneous entities (MISC) were included, but these were removed from later versions of the corpus. As with CoNLL-2003, NorNE is annotated using the IOB2 scheme.

The corpus is divided into two parts, one for each of the official written standards of the Norwegian language: Bokmål (nob) and Nynorsk (nno). The distributions of entity labels in these two parts are shown in Tables 3.5 and 3.6. The nob partition is slightly larger than nno, with 14 369 and 13 912 entities respectively. However, the distribution of labels is fairly similar between the two parts.

¹ https://github.com/ltgoslo/norne

| Partition | # Sentences | # Tokens | # Named Entities |
|---|---|---|---|
| Bokmål (nob) | 16 309 | 301 897 | 14 369 |
| Nynorsk (nno) | 14 878 | 292 315 | 13 912 |

Table 3.4: Total number of sentences, tokens and named entities across all data splits in the Bokmål and Nynorsk partitions of NorNE.

| Type | Train | Dev | Test | Total | Proportion (%) | Avg. length |
|---|---|---|---|---|---|---|
| PER | 4033 | 607 | 560 | 5200 | 36.18 | 1.54 |
| ORG | 2828 | 400 | 283 | 3511 | 24.43 | 1.32 |
| GPE_LOC | 2132 | 258 | 257 | 2647 | 18.42 | 1.09 |
| PROD | 671 | 162 | 71 | 904 | 6.29 | 1.87 |
| LOC | 613 | 109 | 103 | 825 | 5.74 | 1.37 |
| GPE_ORG | 388 | 55 | 50 | 493 | 3.43 | 1.04 |
| DRV | 519 | 76 | 48 | 644 | 4.48 | 1.24 |
| EVT | 131 | 9 | 5 | 145 | 1.00 | 1.49 |

Table 3.5: Class frequencies, proportions and average entity lengths in the Bokmål (nob) section of NorNE.

| Type | Train | Dev | Test | Total | Proportion (%) | Avg. length |
|---|---|---|---|---|---|---|
| PER | 4250 | 481 | 397 | 5128 | 36.86 | 1.59 |
| ORG | 2752 | 284 | 236 | 3272 | 23.51 | 1.43 |
| GPE_LOC | 2086 | 195 | 171 | 2452 | 17.62 | 1.06 |
| PROD | 728 | 86 | 60 | 874 | 6.28 | 2.15 |
| LOC | 893 | 85 | 82 | 1060 | 7.61 | 1.23 |
| GPE_ORG | 367 | 66 | 11 | 444 | 3.19 | 1.13 |
| DRV | 445 | 50 | 30 | 525 | 3.77 | 1.13 |
| EVT | 141 | 7 | 9 | 157 | 1.12 | 1.60 |

Table 3.6: Class frequencies, proportions and average entity lengths in the Nynorsk (nno) section of NorNE.


3.5 Other named entity annotations of NDT

As mentioned in 2.4, another NE-annotated variant of NDT is also available through the recent work on Norwegian NER in Johansen (2019). Here, the same resources as NorNE are tagged using the following categories: person (PER), organization (ORG), location (LOC), and miscellaneous (MISC), i.e., the same label set as in CoNLL-2003 (3.2).² As opposed to NorNE, the IOBES label encoding scheme was used (see Section 2.2).
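Since the two annotations of NDT differ in their encoding schemes, it can be useful to convert between them. The sketch below converts a valid IOB2 sequence to IOBES, where single-token entities receive the prefix S and the last token of a longer entity receives the prefix E. The function name `iob2_to_iobes` is an assumption, and the input is assumed to be a valid IOB2 sequence.

```python
def iob2_to_iobes(labels):
    """Convert a valid IOB2 label sequence to the IOBES scheme."""
    converted = []
    for i, label in enumerate(labels):
        if label == "O":
            converted.append(label)
            continue
        prefix, entity_type = label.split("-", 1)
        next_label = labels[i + 1] if i + 1 < len(labels) else "O"
        # The entity continues only if the next label is I- with the same type.
        continues = next_label == f"I-{entity_type}"
        if prefix == "B":
            converted.append(f"B-{entity_type}" if continues else f"S-{entity_type}")
        else:
            converted.append(f"I-{entity_type}" if continues else f"E-{entity_type}")
    return converted

print(iob2_to_iobes(["B-PER", "I-PER", "O", "B-GPE_LOC"]))
```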

3.6 Summary & Comparison

Table 3.7 provides a comparison of the entity categories in the data sets discussed so far. We can observe that the location, person and organization categories are used across all the data sets. NorNE has the largest set of categories, while CoNLL-2003 has the narrowest.

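| Entity | CoNLL-2003 | MUC-7 | NorNE | Nomen Nescio |
|---|---|---|---|---|
| Location | x | x | x | x |
| Person | x | x | x | x |
| Organization | x | x | x | x |
| Miscellaneous | x | | | x |
| Work | | | | x |
| Product | | | x | |
| Event | | | x | x |
| Temporal | | x | | |
| Monetary | | x | | |
| Geopolitical | | | x | |
| Derived | | | x | |

Table 3.7: The different label sets used among the data sets. We exclude the subcategories for temporal and monetary expressions of MUC-7 and the subcategories for geopolitical entities of NorNE.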

It is clear that there is no universal annotation scheme for natural language resources, and some entity types may be labeled differently among the different resources. For instance, as the CoNLL-2003 set does not include a type for events, such entities are probably labeled as miscellaneous, and the same applies to other non-covered entity types for any given data set. If we disregard the non-named entities from MUC-7, for temporal and monetary expressions, NorNE covers nearly every entity type except work, found in the Nomen Nescio data set. As discussed above, entities labeled as work

² This corpus can be accessed here: https://github.com/ljos/navnkjenner/tree/master/data

