
did SomeCompany steal the code for a $T$ application? | ArbitraryOperatingSystem | 0
stoked on the new $T$ game! | ArbitraryGameDeveloper | 1
...

4.2.3. SentiHood

Saeidi et al. (2016) released a dataset containing conversations about urban neighborhoods from online forums. The annotation style closely matches the desired input data from real-world applications, as illustrated in Table 4.1. The labels are restricted to binary sentiment, and the data is consequently not of interest for this thesis. However, this type of data may be relevant for more fine-grained aspect applications.

Sentence | Labels
Hampstead area, more expensive but a better quality of living than in Tufnell Park | (Hampstead, price, Negative) (Hampstead, live, Positive)

Table 4.1.: SentiHood annotation scheme

4.3. Dataset Inspection and Analysis

By inspecting the data to be used for further modeling, a better understanding of the models’ faults and feats may be obtained. For the CR datasets, four were deemed necessary to include (as determined throughout Section 4.1) before proceeding with the analysis; these four datasets had to be manually processed and translated into a unified format beforehand. For Entity-Level Sentiment, all listed datasets were included to allow a more in-depth analysis, as sentiment analysis datasets require minimal overhead to include (in contrast to the CR data). This analysis was also motivated by the suggestions of Sukthanker et al. (2018) for more exhaustive evaluations of SA datasets before continuing with the application of CR.

4.3.1. Unification of Coreference Data

For a better overview of the most promising datasets (OntoNotes, GUM, LitBank and PreCo), an analysis of their content is presented. First, the four datasets all follow different annotation schemes, illustrated in Table 4.3, which have to be converted into a unified format. The previous lack of cross-domain evaluation might be due to this unification being a necessary first step, which is also supported by the claims on non-CoNLL corpora by Moosavi (2020). The recency and in-depth examination of coreference found in Moosavi (2020) indicates that there are, in fact, no cross-corpora evaluations of this extent in any other literature.


Source | Format | File size (MB) | Removal actions
Pradhan et al. (2012) | .parse, .prop, .sense, .coref, .names, .lemma | 803/55/100 (958 total) | -
Pradhan et al. (2012) | .conll | 188/24/25 (237 total) | Merging of multiple files into .conll
Lee et al. (2018) | .jsonline | 45.4/5.9/5.6 (56.9 total) | POS-tags, lemmas, word sense
This work | .coreflite | 12.3/1.6/1.5 (15.4 total) | Speaker info, constituents, entity metadata

Table 4.2.: OntoNotes 5.0 dataset processing steps. File size is separated into train/dev/test and total size.

Dataset | File format | Coreference format
Minimized OntoNotes | jsonline | [M_s^token, M_e^token]
GUM | conll | chain of {entity type + iterated index}
LitBank | conll | chain of {iterated index}
PreCo | jsonline | [sentence index, M_s^subtoken, M_e^subtoken]

Table 4.3.: The four used datasets for CR and their file format as well as coreference annotation format. M_s/M_e denotes a mention with its start and end indices. "subtoken" denotes a sentence-level (local) token, whereas "token" is a global token.

While the datasets have similarities (e.g. the CoNLL format), minor intricacies cause incompatibility when parsing. To combat this, a new, simple, unified format is defined, based on the minimization process of the OntoNotes dataset (Pradhan et al., 2012) by Lee et al. (2018)3. The format is coined “CorefLite” (a lightweight coreference format). Table 4.2 illustrates the transition of the OntoNotes dataset, and its file size, from its original format to CorefLite. The same process for the remaining datasets can be found in Table 4.4. Note that the reduced file size is not the goal of the format; the goal is rather to reduce all four datasets to the same format, with the exact same input fields (tokens and clusters). The extent of the reduction is instead an indication of how much extraneous data is contained within these datasets. The CorefLite structure is shown below, where the specific format of the coreference clusters is described in Background, p. 25.

{
  doc_key:  # document identifier
  tokens:   # list of all tokens
  clusters: # coreference clusters
},
{ ... }

3 https://github.com/kentonl/e2e-coref/blob/master/minimize.py
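To make the unification step concrete, the sketch below parses a CoNLL-style coreference column of the kind used by GUM and LitBank into cluster spans and emits a CorefLite record containing only doc_key, tokens and clusters. It is a minimal sketch under stated assumptions: the file name, example tokens and helper names are hypothetical, and the actual conversion scripts used in this work may differ.

import json
from collections import defaultdict

def conll_column_to_clusters(coref_cells):
    """Turn a per-token CoNLL coreference column, e.g.
    ['(person_1', '-', 'person_1)', '(3)', ...], into a list of clusters,
    where each cluster is a list of [start, end] token spans (inclusive)."""
    open_mentions = defaultdict(list)   # chain id -> stack of open start indices
    clusters = defaultdict(list)        # chain id -> list of [start, end] spans
    for i, cell in enumerate(coref_cells):
        if cell in ("-", "_"):
            continue
        for part in cell.split("|"):
            if part.startswith("(") and part.endswith(")"):
                clusters[part[1:-1]].append([i, i])      # single-token mention
            elif part.startswith("("):
                open_mentions[part[1:]].append(i)        # mention opens here
            elif part.endswith(")"):
                start = open_mentions[part[:-1]].pop()   # mention closes here
                clusters[part[:-1]].append([start, i])
    return [sorted(spans) for spans in clusters.values()]

def to_coreflite(doc_key, tokens, clusters):
    # Only the three CorefLite fields are kept; everything else is dropped.
    return {"doc_key": doc_key, "tokens": tokens, "clusters": clusters}

# Hypothetical usage: one CorefLite record per line (jsonline-style output).
tokens = ["John", "said", "he", "was", "tired"]
coref_column = ["(1)", "-", "(1)", "-", "-"]
record = to_coreflite("example_0", tokens, conll_column_to_clusters(coref_column))
with open("example.coreflite", "w") as f:
    f.write(json.dumps(record) + "\n")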



Dataset | Initial size (MB) | CorefLite size (MB)
GUM | 1.7 | 1.3
LitBank | 12.6 | 2.0
PreCo | 154.2 | 146.5

Table 4.4.: The remaining datasets and their parsed file size

Dataset | N | L_avg | L_min | L_max | C_total | C_avg/doc
OntoNotes (dev) | 343 | 475.52 | 33 | 2314 | 4545 | 13.25
GUM | 130 | 872.11 | 165 | 1866 | 4401 | 33.85
LitBank | 100 | 2105.32 | 1999 | 3419 | 4975 | 49.75
PreCo (dev) | 500 | 332.38 | 50 | 966 | 31793 | 63.59

Table 4.5.: Overview of coreference dataset features, sampled on subsets with similar sizes. N denotes the total number of documents, L denotes the document length, C denotes coreference links.

Dataset | N | L_avg | L_min | L_max | C_total | C_avg/doc
SemEval 2014 – Task 4 | 7694 | 93.68 | 8 | 470 | 2207 | 0.29
SemEval 2017 – Task 4 | 2872 | 104.13 | 26 | 144 | 1060 | 0.37
SemEval 2017 – Task 5 | 1156 | 56.54 | 25 | 112 | 48 | 0.04
ACL-14 | 6940 | 89.17 | 10 | 161 | 2214 | 0.32
SentiHood | 4333 | 77.48 | 7 | 564 | 658 | 0.15

Table 4.6.: Entity-level sentiment dataset features. N denotes the total number of documents, L denotes the document length, C denotes coreference links generated with NeuralCoref.

4.3.2. Coreference Dataset Analysis

The analysis process, which depends on the creation of the unified format, involves a fairly simple categorization of the data in each dataset, to be used as a baseline for future datasets.

In Table 4.5 an overview of the four chosen datasets can be found. For OntoNotes and PreCo, smaller subsets of the data were used, with file sizes similar to those of the smaller datasets GUM and LitBank. Observe the values of C_total and C_avg/doc, as the same statistics will be calculated for the entity-level sentiment data in the next section. Further, the PreCo dataset has a much higher occurrence of coreference links combined with a lower average document length. Handling this type of data might be troublesome for models trained strictly on the data in OntoNotes. A visualization of the relationship between coreference links and document length can be found in Appendix C. Further analysis will occur throughout the Preliminary Experiments chapter on CR (Chapter 7).
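A minimal sketch of how such statistics can be computed from CorefLite records is shown below. The file name is hypothetical, and counting one "coreference link" per cluster is an assumption rather than the exact definition used for Table 4.5.

import json

# Read one CorefLite record per line and compute Table 4.5-style statistics.
# "corpus.coreflite" is a hypothetical file name; C is counted as the number
# of clusters per document, which may differ from the definition used above.
docs = [json.loads(line) for line in open("corpus.coreflite")]

lengths = [len(d["tokens"]) for d in docs]
links = [len(d["clusters"]) for d in docs]

n_docs = len(docs)
print("N        :", n_docs)
print("L_avg    :", sum(lengths) / n_docs)
print("L_min    :", min(lengths))
print("L_max    :", max(lengths))
print("C_total  :", sum(links))
print("C_avg/doc:", sum(links) / n_docs)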


4.3.3. Restrictions of Entity-Level Sentiment Data

The document length is crucial for resolving coreferences. Thus, the document length of each dataset is compared, together with the number of detected coreference links.

Detailed plots from this analysis can be found in Appendix B. Presented in Table 4.6 is an overview of the analysis. For the datasets SemEval 2014 – Task 4, SemEval 2017 – Task 4 and ACL-14, there is approximately one coreference cluster in every three documents, using the NeuralCoref CR model. Further, the sentiment polarity distribution was examined, illustrated in Figure 4.1. None of the available datasets are balanced (i.e. having an approximately even distribution between classes), which might result in issues when evaluating the news domain. There are currently no available datasets that show sentiment distributions for full news texts, so there is currently no way of verifying whether this data is representative across domains. Based on this analysis, the datasets ACL-14 and SemEval 2014 – Task 4 provide the most evenly distributed sentiment, as well as a fair amount of coreference links.
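A sketch of how per-document coreference counts of the kind reported in Table 4.6 can be produced with NeuralCoref is given below. The spaCy model name and the two example documents are assumptions for illustration, not the exact setup used for this analysis.

import spacy
import neuralcoref  # https://github.com/huggingface/neuralcoref

# Attach NeuralCoref to a spaCy pipeline (model name is an assumption).
nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

# `documents` stands in for the sentences of an entity-level sentiment dataset.
documents = [
    "stoked on the new game! the developers really outdid themselves.",
    "did they steal the code for this application?",
]

cluster_counts = []
for text in documents:
    doc = nlp(text)
    clusters = doc._.coref_clusters or []  # empty when no clusters are found
    cluster_counts.append(len(clusters))

print("C_total  :", sum(cluster_counts))
print("C_avg/doc:", sum(cluster_counts) / len(documents))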