Lexical Patterns in the Climate Discourse

A collocational network analysis of a subject-specific corpus

Sarah Gundhus Bøe

ENG4191 – English Language and Linguistics, 60 ECTS

A Master’s Thesis Presented to the Department of Literature, Area Studies and European Languages, Faculty of Humanities

Autumn 2021

Abstract

This thesis aims to test the ‘recent’ concept of collocation networks in corpus-assisted discourse analysis, in the hope that it will reveal a new level of discourse and lexical structures not found in traditional analysis of keywords and collocations. The three levels of analysis are applied to a subject-specific ‘homemade’ corpus, aided by the software #LancsBox v.5.x. They are used to compare the levels of information gained from each process and to discover how the processes can complement each other in a discourse analysis. The corpus consists of articles concerning global climate change in the international English-language newspaper the Guardian, which has pledged accurate and transparent reporting on global climate change. Because language, knowledge and behaviour are inseparable, the linguistic results provided insight into the Guardian’s climate discourse community, ranging from scientific and global communication to the discourse community performing social acts such as imperatives. All three processes of lexical analysis are effective tools in discourse analysis, together or separately, depending on which levels of discourse are of interest.

Acknowledgements

To my supervisor, Hilde Hasselgård, for a steady and patient guiding hand,

to Signe Oksefjeld Ebeling for introducing me to the wonderful world of corpus linguistics, to my former lecturer, Agnieszka Pysz, who gave me confidence to pursue English linguistics, to my parents, Heidi and Øystein, for always believing in me and always giving support when needed,

to Halstein, for encouraging me to follow my dreams and to all my friends for being there and cheering me on,

Thank you!

List of Tables

Table 2.3.1 Example of position-dependent collocates

Table 2.3.2 Example of position-free collocates

Table 2.3.3 Example of different collocation spans

Table 2.3.4 Example of association measures favouring exclusivity

Table 2.3.5 Example of association measures favouring frequency

Table 3.2.1 Collocation table of power in the CCC

Table 4.1.1 Keywords identified as describing the global climate change

Table 4.1.2 Keywords which connotations may be associated with potential causes

Table 4.1.3 Keywords which may relate to a potential solution

Table 4.1.4 Keywords specifically referring to Australia

Table 4.2.1 The collocates of climate, change, emergency, and crisis

Table 4.2.2 The collocates of global, warming, and impacts

Table 4.2.3 The collocates of scientists and report

Table 4.2.4 The collocates of pollution, emissions, carbon, and greenhouse

Table 4.2.5 The collocates of gas, coal, fossil, fuel, and industry

Table 4.2.6 The collocates of the keywords planet, action, and extinction

Table 4.2.7 The collocates of the keywords Paris, agreement, policy, energy and renewable

Table 4.2.8 The collocates of the keywords reduce, reduction, target, targets, 2030, 2050 and zero

List of Illustrations

Figure 3.2.1 Collocation visualisation of power

Figure 3.2.2 Collocation network graph of power and coal-fired

Figure 3.2.3 GraphColl screen capture of a collocation network

Figure 3.2.4 An illustration of how to read the ‘homemade’ collocation network graphs

Figure 3.2.5 The reconstruction of a collocation network

Figure 4.1.1 Top 50 keywords in the CCC compared to BNC2014 Baby+

Figure 4.1.2 Illustration of frequency of terms for the climate situation

Figure 4.3.1 Collocation network of climate change impacts

Figure 4.3.2 Collocation network of emergency services

Figure 4.3.3 Collocation network of weather patterns

Figure 4.3.4 Collocation network of species extinction and loss

Figure 4.3.5 The Great Barrier Reef collocation

Figure 4.3.6 Collocation network of rising temperatures and sea levels

Figure 4.3.7 Collocation network of pre-industrial levels

Figure 4.3.8 Collocational networks of types of pollution

Figure 4.3.9 Collocation network of the grassroots movement

Figure 4.3.10 Collocation network of Paris and Kyoto

Figure 4.3.11 Collocation network of carryover credits

Figure 4.3.12 Collocation network of transition

Figure 4.3.13 Collocation of energy sources

List of Abbreviations

CL: Corpus Linguistics

KWIC: Keyword in context

MI: Mutual Information

MI²: Mutual Information Squared

MI³: Mutual Information Cubed

LL: Log-Likelihood

MS: Minimum Sensitivity

SMP: Simple Maths Parameter

The CCC: The Climate Change Corpus

The BNC: The British National Corpus

KA: Keyword analysis

CA: Collocation analysis

NA: Network analysis

Chapter 1: Introduction

We are not the lords, we are the Lord's creatures, the trustees of this planet, charged today with preserving life itself—preserving life with all its mystery and all its wonder. May we all be equal to that task.

––– Margaret Thatcher (1990)

In August 2021, the United Nations reported that the unprecedentedly rapid climate changes we are experiencing today are beyond any doubt caused by human activities. Their headline statement read: ‘It is unequivocal that human influence has warmed the atmosphere, ocean and land. Widespread and rapid changes in the atmosphere, ocean, cryosphere and biosphere have occurred’ (IPCC 2021). The outlook for these changes and their consequences for humanity looks increasingly dire. Even though the scientific consensus points to pollution caused by human activities and calls for immediate action, this message is not easily communicated amid the diffusion of information and misinformation in the Digital Age. In this age, concepts such as news propaganda, conspiracy theories and fake news have found fertile ground on social media and other globally extensive digital platforms (Giusti and Piras 2021). A so-called ‘post-truth’ era has created circumstances in which objective facts shape public opinion less than they previously did (Carlson 2018). In this environment, the function of traditional news media, to convey happenings in the world objectively and accurately, is greatly challenged. While the norm of conventional journalism is to be professional and objective, information is now a commodity and a source of power, and journalists are not above social manipulation (ibid. 2018).

In this journalistic environment, the Guardian pledged to report on global climate change accurately, to be transparent in its process, to refuse funding from fossil fuel companies (The Guardian 2019) and to portray global climate change with appropriate language (Carrington 2019); see section 1.1.2.

Stubbs (2001, 6) wrote that ‘[s]ubstantial arguments have been put forward that language, social action and knowledge are inseparable’, and thus the field of linguistics is concerned with how language shapes human worldview, perceptions, and beliefs, how we communicate with each other, what world is hidden behind our vocabulary and what message lies behind our choices of words. Languages are ever changing to adapt to the changing world around us, and consequently change lexical meaning to accommodate new concepts, ideas, events, and inventions. How, then, does the media talk about the changing climate in a society flooded by mistrust, misinformation and a seeming lack of relevance of objective facts to its readers? Actual language use, otherwise known as descriptive grammar, is often a concern of the linguistic subfield corpus linguistics. As the Guardian is known for its credible climate change reporting, the aim of this corpus-driven study is to find out what the Guardian is writing as a collective journalistic voice: what patterns and tendencies can be discovered and applied to understand the Guardian’s climate change discourse, how it evaluates the global climate change context, and what its attitudes to it are.

Corpus linguistics has mainly been a computer-aided quantitative process for charting actual language use through large amounts of data. Several corpus techniques have been popularised, such as calculating word association and the idea of lexical collocation. The field is now more than half a century old and has revealed numerous ways to assess language use quantitatively. In recent decades, quite a few articles, especially from English collocation studies, have tried to map the field as it stands today and its future avenues (chapter 2). One article even includes the question in its title, ‘What is or should be next…’ (Gries 2013). Brezina et al. (2015) make a suggestion by adding a new collocation criterion to the many existing ones (section 2.2.1). They chose to call this criterion connectivity, by which they mean extending the non-linear idea of association from two lexical items to a larger collocation network. It is a natural extension of Firth’s philosophy ‘You shall know a word by the company it keeps’ (Firth 1957b), as it brings an even larger context into lexical association research (section 2.3).

Brezina and McEnery, together with Weill-Tessier (2020), have also developed a new corpus tool for easy calculation and extraction of these suggested association networks (see section 3.2.1). This new collocational idea is a hitherto uncharted concept and remains to be tried and tested to find its most effective use.

The aim is therefore to create a subject-specific corpus about this pressing matter in the global discourse, climate change, and to discover what the collocation networks of subject-specific keywords of climate change in the British newspaper the Guardian can reveal about the content of the climate discourse in 2019. The intention is also to consider whether the collocations and collocation networks, beyond the content of the articles from the Guardian, could reveal the sender’s attitude to and evaluation of climate change content in the discourse. Since this method has not been explored to the extent that a specific procedure for collocation networks has been established, I will attempt to test whether collocation networks can be used as an effective tool to analyse the overarching discourse and as a means of doing a more in-depth discourse analysis through these lexical patterns, and thereby, perhaps, contribute to the future of the field of corpus-assisted discourse analysis.

1.1 Preliminaries

1.1.1 Climate Change

The climate is, by definition, ever changing. What has come to be known as climate change in the modern era, however, refers to the variations in the climate caused by a rising global temperature influenced by human activities since at least the Industrial Revolution. The global temperature is measured to have increased by 1°C since pre-industrial times. If the human activities that drive the rise in temperature are not slowed, it is predicted that the global temperature may increase by up to 3 or 4°C. Such a rise in global temperature will lead to the extinction of various plant and animal species, as well as a rise in sea levels that will render currently populated areas uninhabitable (Mann and Selin 2021).

Global warming received little serious public attention until the late 1980s and early 1990s, when James Hansen addressed the US Government with a scientific report on the link between rising global temperatures and human activities (Shabecoff 1988). The UN established the Intergovernmental Panel on Climate Change (IPCC) in recognition that global warming may pose a threat to humanity (IPCC n/d), and Margaret Thatcher gave her famous climate speech (from which the introductory quote was extracted) to the United Nations committee, asking for action and for recognition of responsibility (Thatcher 1990). In 1997, the Kyoto Protocol was initiated with the goal of reducing CO2 emissions by 5%, but its support was meagre and unstable (United Nations Climate Change n/d-b). Little happened in the public eye until 2006 and the release of The Economics of Climate Change: The Stern Review, in which Nicholas Stern concluded that delayed action would be much more financially costly than changing now to solutions that would stabilise the climate (Stern 2007). 2006 also saw the release of former US Vice President Al Gore’s documentary An Inconvenient Truth. Al Gore, together with the IPCC, on the occasion of the release of its fourth assessment report, won the Nobel Peace Prize (Nobel Peace Prize 2007). The fourth IPCC report stated that it was very likely that global warming was manmade (IPCC 2007, 5). In 2013, the fifth report was published, and the certainty had increased to extremely likely (IPCC 2014, 47). In December 2015, the Paris Agreement was established by the United Nations Framework Convention on Climate Change (the UNFCCC or UN Climate Change) and adopted by 196 countries, and it is still in force. The goal of the Paris Agreement is to limit global heating to under 2°C, preferably to 1.5°C (United Nations Climate Change n/d-a).

Prior to the global Covid-19 pandemic, the media in 2019 was flooded with reports of natural disasters such as the cyclones that hit South East Africa (Johnson 2019) and India and Bangladesh (n/a 2019), typhoons in China (Green 2019) and Japan (McCurry 2019; Blair 2019), hurricanes in the US (Mann and Dessler 2019), and raging wildfires in California (Cagle 2019) and in New South Wales and Queensland in Australia (Morton 2019). It was also the year with the hottest summer in Europe in recorded history, with temperatures 3–4°C higher than normal (Harvey 2020). The year saw a surge in climate awareness around the world, an awareness that manifested itself in massive global demonstrations (Taylor, Watts, and Bartlett 2019).

1.1.2 The Guardian and Climate Change

The Guardian (then the Manchester Guardian) gained international recognition under the editorship of Charles Prestwich Scott in the late 19th and early 20th centuries. His ideal was to promote unbiased journalism based on fact:

‘Comment is free, but facts are sacred. “Propaganda”, so called, by this means is hateful. The voice of opponents no less than that of friends has a right to be heard. Comment also is justly subject to a self-imposed restraint. It is well to be frank; it is even better to be fair. This is an ideal. […] We can but try, ask pardon for shortcomings, and there leave the matter’ (Scott 1921)

He wanted the Guardian to be politically independent. He claimed that ‘[o]ne of the virtues, perhaps almost the chief virtue, of a newspaper is its independence. Whatever its position or character, at least it should have a soul of its own’ (ibid. 1921), but he was prone to favour liberal politics.

This is still reflected in the Guardian’s readership today, which is described as liberal without allegiance to a specific political party (n/a 2008) and consists mostly of Labour or Liberal Democrat voters (n/a 2009). An Observer article also reported that the Guardian had been found to be the most trusted newspaper in the UK (Sweney 2020), a finding confirmed by the yearly report of the UK’s communications regulator. This report also records that users find the Guardian to be the most trustworthy, impartial, in-depth, and accurate newspaper, and the print newspaper of the highest quality. The online newspaper is also considered the most trustworthy and in-depth of the British newspapers (Jigsaw Research 2021).

As mentioned in the introduction, in 2019 the Guardian pledged to commit itself to journalistic coverage of global climate change. The first point of the pledge states that they ‘will continue [their] longstanding record of powerful environmental reporting, which is known around the world for its quality and independence’ (The Guardian 2019). This ‘powerful environmental reporting’ is confirmed by McAllister et al.’s (2021) research on journalistic coverage of global climate change, in which the Guardian/Observer had the biggest sample size, with 520 unique articles about climate change from 2005–2019, and scored a 95% accuracy rate for its scientific journalism.

The Guardian continues its pledge with five further promises:

2. We will report on how environmental collapse is already affecting people around the world, including during natural disasters and extreme weather events.

3. We will use language that recognises the severity of the crisis we’re in.

4. The Guardian will achieve net zero emissions by 2030.

5. We will be transparent with our progress.

6. We will no longer accept advertising from fossil fuel extractive companies

(The Guardian 2019)

Point number three is especially pertinent here. Damian Carrington, the Guardian’s environment editor, notes that the newspaper updated its style guide to use new terms in place of the older climate change and global warming, portraying the issue more accurately through stronger language. He quotes the editor-in-chief, Katharine Viner, who said, ‘We want to ensure that we are being scientifically precise’ and that the issue ‘[…] scientists are talking about is a catastrophe for humanity’ (Carrington 2019). It is the language describing this situation that has been used as material for this research.

Chapter 2: Theoretical Background

2.1 Corpus Linguistics

Corpus Linguistics (CL) is distinct in the field of linguistics. It is not necessarily concerned with the study of any certain aspect of language itself, but is rather a means of studying language (McEnery & Hardie, 2019, 1), or a series of methods, if you will. It involves using a machine-readable collection, a database, of either spoken or written texts to study the patterns of (usually) natural language. There are multiple approaches to the construction and methods of CL, as well as different opinions of what CL actually is. While some view CL as a means of research, others view it as a theory in itself. In this thesis we will mostly be concerned with the Neo-Firthian understanding of CL, which is within the framework of language suggested by J.R. Firth, who stated that ‘the meaning of linguistic forms at the grammatical and lexical levels should be determined with reference to the system of the language and identified by linguistic context’ (1957) and that ‘each word when used in a new context is a new word’ (ibid.), a linguistic philosophy with great focus on lexis in context. Firth’s framework was further elaborated by Michael Halliday and John Sinclair (Carter, 2004). Halliday carved the path of systemic functional grammar (e.g., Halliday, 1994), while Sinclair aimed to move linguistic analysis away from speculation to analysis of quantitative data (Sinclair, 2004). The two most prominent sub-fields of Neo-Firthian research are collocation and discourse (McEnery and Hardie 2012, 122).

The benefit of CL is the opportunity to research language with large amounts of data, from which one can make descriptive deductions based on frequency. The research is often lexical, due to the ease of search in the corpus tools. However, if the corpus is annotated and lemmatised, one can also more easily study patterns of grammar, and for spoken corpora, phonetics and the like.

When collecting smaller corpora, one needs to be thorough in the collection and construction to achieve balance and representativeness of the language in question. The smallest corpora are usually specialised corpora which represent a more specific genre, discourse community, mode or text type. This, and lexical study, will be the focus in this research.

2.2 Keywords

A prominent data mining analysis made more readily available with CL is the extraction of keywords. Keywords are ‘words that play a role in identifying important elements of the text’ (Bondi 2010), or as Stubbs (2010, 23) puts it: ‘Keywords are the tips of icebergs: pointers to complex lexical objects which represent the shared beliefs and values of a culture’. They are often considered the markers of the ‘aboutness’ of the text. They may reveal something about the writer and their identity and stance, as well as the discourse community in which the text appears (Bondi 2010).

Michael Stubbs (2010) suggests three senses of keywords: (1) cultural, (2) textual, and (3) lexico-grammatical, all with different extraction methods. Sense 1 focuses on words perceived as pertinent to a culture and is not based in CL. Sense 3 is centred around everyday phrases that shape sociolinguistic acts, i.e., a pragmatic analysis. Here, the concern will be the second sense, which is statistical. The method suggested by Scott (1999) extracts frequent and statistically significant words by contrasting two different corpora. Scott developed the corpus tool WordSmith (Scott 2020b), in which keyword analysis is a central feature. More specifically, one derives keywords via statistical frequency and/or comparison between the corpus of interest and a reference corpus. One can contrast two corpora of approximately the same size which represent different language varieties, or one can compare a specialized corpus to a general one. A word is key if it meets the researcher’s minimum-frequency cut-off, or if it appears more frequently than expected compared to the reference corpus, that is, if the frequency difference is statistically significant (Baker 2004). Scott (2020a) notes three types of keywords which will frequently occur: (1) proper nouns, (2) words that humans will recognise as key, i.e., the words that reveal the ‘aboutness’, and (3) high-frequency words.
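
The statistical comparison described above can be illustrated with a minimal sketch. It assumes log-likelihood (LL) as the keyness statistic, a minimum frequency of 2 and the conventional critical value of 3.84 (p < 0.05, one degree of freedom); these settings, the function names and the two toy corpora are illustrative assumptions, not the procedure of any particular corpus tool.

```python
import math
from collections import Counter


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood statistic for one word observed in two corpora."""
    total = size_target + size_ref
    combined = freq_target + freq_ref
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


def keywords(target_tokens, ref_tokens, min_freq=2, critical=3.84):
    """Return (word, LL) pairs overused in the target corpus, strongest first."""
    target, ref = Counter(target_tokens), Counter(ref_tokens)
    n_target, n_ref = len(target_tokens), len(ref_tokens)
    result = []
    for word, freq in target.items():
        if freq < min_freq:
            continue  # below the researcher's frequency cut-off
        if freq / n_target <= ref[word] / n_ref:
            continue  # not relatively more frequent in the target corpus
        ll = log_likelihood(freq, n_target, ref[word], n_ref)
        if ll >= critical:
            result.append((word, ll))
    return sorted(result, key=lambda pair: pair[1], reverse=True)


# Invented toy corpora, for illustration only.
target = "climate change is a crisis and climate action is needed now".split() * 5
reference = "the cat sat on the mat and the dog is here now".split() * 5
top = keywords(target, reference)
```

In this toy comparison, words shared at similar rates by both corpora (such as is and and) fall below the threshold, while words absent from the reference corpus are returned as key.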

Despite the initial advantage of producing keyword lists automatically through corpus tools using machine-calculated frequency and statistics, semantics cannot easily be deduced without human linguistic reasoning and intuition. A word or lexical item typically carries more than one ‘sense’.

Baker (2004) exemplifies cases where a keyword may be key because of unusually high frequency but, if polysemous, is key in just one of the senses the word holds. Simple keyword lists may therefore be unrepresentative and hide word-forms or lemmas in which one of these senses is key. These ‘senses’ can be analysed in concordance lines and sorted into key categories in the corpus, which again can reveal something about the ‘aboutness’ of the discourse. However, this is done manually, and the analysis is subjective (Baker 2004). A feature of keyword analysis is that relatively low-frequency words can also be key. This is due to the juxtaposition of the two corpora. A word may occur a small number of times in one corpus, while it is absent in the other corpus. Although high-frequency words stand out more easily, these low-frequency keywords can reveal themselves to be interesting (Baker 2004). It is necessary to mention that juxtaposing two corpora in a keyword analysis will only reveal lexical differences, not lexical similarities. This may cause an overemphasis of divergent trends, while other lexical tendencies and patterns may go unnoticed. Another potential problem for keyword extraction is uneven distribution. A word may be key due to overuse in one or two texts. It is a good idea to always check the textual distribution of the keywords (Baker 2004).
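
A distribution check of this kind can be sketched in a few lines. The example below is only an illustration: it assumes the corpus is available as a list of tokenised texts, and the function name and toy texts are invented.

```python
def text_range(keyword, texts):
    """Return how many texts contain the keyword and its frequency in each text."""
    per_text = [tokens.count(keyword) for tokens in texts]
    texts_with_keyword = sum(1 for count in per_text if count > 0)
    return texts_with_keyword, per_text


# Invented toy texts: 'reef' is frequent overall but concentrated in one text.
texts = [
    "the reef and the reef and the reef".split(),
    "climate policy in the news".split(),
    "the climate and the reef".split(),
]
spread, counts = text_range("reef", texts)
```

A keyword whose occurrences come almost entirely from a single text, as with reef in the first toy text, would call for caution before being treated as typical of the whole corpus.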

Keywords are rarely analysed in isolation, but rather within their lexical and syntactic environment. This can be done through what in CL is called key word in context, or KWIC, which offers observation of the keyword in its immediate context in the form of concordance lines. Keywords may also be considered together with frequently recurring sequences of words. These are sometimes called keyword-clusters but are more popularly called lexical bundles (Biber, Conrad, and Leech 2005). Biber, Conrad, and Leech (2005) argue that these need to be recognised because meaning in language often appears in chunks, rather than in isolated words. In the Neo-Firthian tradition, words in context are one of the main elements of study. It builds on the idiom principle, as opposed to the open-choice principle, which views language as segments with open slots which can be filled with virtually any word. The idiom principle, proposed by John Sinclair (1991), argues that these linguistic choices are not random, and the slots cannot be filled with any word. There are restraints in the open-choice model. Sinclair calls these ‘semi-preconstructed phrases.’ These phrases are not idioms, but rather idiomatic. That means phrases and words chosen in language production contain mentally stored lexical units whose meaning relates to the whole expression. If the words in the sequences are immediately consecutive, they are called n-grams (if the unit is longer than two words) and/or collocations (see section 2.2).
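
The two ways of viewing a keyword in context mentioned above, concordance lines and recurring consecutive sequences, can be sketched as follows. This is a simplified illustration on an invented token list; the function names and the context width are assumptions and do not reproduce the behaviour of any specific concordancer.

```python
from collections import Counter


def kwic(tokens, node, width=3):
    """Return (left context, node, right context) for every hit of the node."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def ngrams(tokens, n):
    """Count every sequence of n immediately consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented example sentence.
tokens = "take action on climate change now".split()
concordance = kwic(tokens, "climate", width=2)
bigrams = ngrams(tokens, 2)
```

Sorting the n-gram counts by frequency over a full corpus would give the recurring sequences, while the concordance lines allow each hit to be read in its immediate context.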

2.3 Collocation

The most famous and most frequently quoted statement from collocational studies is John Firth’s ‘You shall know a word by the company it keeps’ (1957b). From these words the idea of collocation derives. The concept is built on the idea that a word’s meaning should not be considered in isolation but should rather be extracted from its environment, viz. its neighbouring words. Halliday (1994, 333) emphasizes that collocation is not dependent on any semantic relationship, but on ‘a particular association between the items in question’ which is based on a co-occurrence tendency. The probability of a word occurring is increased ‘given the presence of a certain other word’ (Halliday and Matthiessen 2004, 38), or simply, ‘“collocation” is frequent co-occurrence’ (Stubbs 2001, 29). One of the characteristics of collocates, and of the idea of association and expectancy, is that their frequency of co-occurrence may result in imbuement¹. The words’ meanings will colour, and take on aspects of, each other (Baker 2016). The constituents in the collocation relationship are individual word-forms or lemmas. They are called ‘node’ and ‘collocate’. Sinclair et al. (2004) point out that ‘essential there is no difference in status between node and collocate’. The node is the word under examination, while the collocate is any word that frequently co-occurs with the node. In other words, which word takes the role of the node or the collocate depends on the focus of the study (Stubbs 2001, 29). There are other lexical relationships based on frequent co-occurrence that do not couple individual words, but rather link a node to a category or a set of words (section 2.3.3).

1 Lexical imbuement: The ability of a word ‘to take on aspects of meaning of another word’ (Baker 2016)

Though this idea appears to be simple, a unified definition of collocates and their criteria is, after more than 50 years of study, still under discussion (Gries 2013). It seems to be up to the researchers to define the relationship between the node and the collocate and determine the preliminaries for their individual collocation research. Because the definitions of collocation vary, the interpretation of collocation can include other types of multi-word units. These are mainly idioms or n-grams (the so-called lexical bundles, clusters, or chunks). The term ‘collocation’ has been used for all these lexical units, and the different definitions may or may not overlap depending on the view of the researcher (Xiao 2015). An important difference between collocation and other multi-word units is that a collocation does not necessarily need to be a sequence. A collocate appears within a predefined span, not exclusively adjacently or in any fixed order. Note that this does not prevent the definition of collocation from overlapping with other lexical units.

Brezina et al. (2015) stress that collocation research ‘[has] yet to be systematically evaluated and fully implemented in the tools that corpus linguists use’, but they have suggested seven criteria to consider when defining a collocation. The first three of these criteria, (i) distance between node and collocate, (ii) frequency, and (iii) exclusivity, are generally acknowledged. The next three, (iv) directionality, (v) dispersion, and (vi) type-token distribution, have been proposed by Gries (2013), and the last one, (vii) connectivity, is added by Brezina et al. (2015).

2.3.1 Collocation Criteria

a) Distance

The distance of collocations is usually defined with the term span, or collocation window. Span is defined by the number of tokens between the node and the collocate. With a span of two, the collocate is placed one or two tokens away from the node in either direction of the running text. To be more specific about the distance of the collocate, one can use the term span position to describe exactly where the collocate is placed; if the span position is -1, the collocate immediately precedes the node (Sinclair, Jones, and Daley 2004, 34).
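
How a span and its span positions translate into counts can be shown with a minimal sketch. It assumes a simple list of tokens and counts every word found within the window around each occurrence of the node; the function names and the example sentence are invented for illustration.

```python
from collections import Counter


def collocates_by_position(tokens, node, span=4):
    """Count (collocate, span position) pairs within +/- span of the node."""
    positions = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            # Skip the node itself and positions outside the text.
            if offset != 0 and 0 <= j < len(tokens):
                positions[(tokens[j], offset)] += 1
    return positions


def collocates_in_window(positions):
    """Collapse span positions into one count per collocate for the whole window."""
    totals = Counter()
    for (word, _offset), count in positions.items():
        totals[word] += count
    return totals


# Invented example sentence.
tokens = "the worst impacts of climate change on climate policy".split()
by_position = collocates_by_position(tokens, "climate", span=2)
in_window = collocates_in_window(by_position)
```

Here change is counted once at span position +1 and once at -2, so the per-position counts separate what the window-level count of 2 merges.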

Collocates that are significant in certain positions relative to the node are called position-dependent; those that occur anywhere within the selected span are called position-free (Sinclair, Jones, and Daley 2004, 35). See tables 2.3.1 and 2.3.2 for examples of position-dependent and position-free collocates. Adjacent collocations, also called bi-grams (an n-gram with a sequence of two lexical items), can undergo fossilisation and either become fixed or result in the creation of idioms (Saeed 2016, 57).

Table 2.3.1 Example of position-dependent collocates in position +1 of change as collocate of climate in the CCC.

Left | Node | Right
[…] take action on | climate | change 20 years […]
[…] the minimum on | climate | change. But it […]
[…] Donald Trump about | climate | change at the […]
[…] the worst impacts of | climate | change. The energy […]
[…] neutral by 2022. “The | climate | change agenda may […]

Table 2.3.2 Example of position-free collocates in the span +/-4 of policy as collocate of climate in the CCC.

Left | Node | Right
[…] won’t change its | climate | change policy as […]
[…] of shifting on | climate | policy in 2020, […]
[…] Asked whether the | climate | policy failure was […]
[…] and policy institute | Climate | Analytics found there […]
[…] joke of a | climate | change policy: they […]

The use of different spans will affect the outcome to a certain extent. Larger spans usually bring in more high-frequency function words (see the example in table 2.3.3). According to McEnery and Hardie (2012), most corpus linguists working with English follow Stubbs’ guideline (2001) of a span of +/- 4. However, there is no agreement on which collocation window size is the most effective. Other common spans range from +/-2 to +/-5, or even larger. It is also argued that collocation should not be controlled by a fixed length of span (McEnery and Hardie 2012).

Table 2.3.3 Example of different collocation spans. Content words marked for comparison (here with an asterisk).

Top 10 coll. of industry with span of +/- 2 in the CCC | Top 10 coll. of industry with span of +/- 5 in the CCC
the | the
and | and
of | to
to | of
fossil* | in
fuel* | that
that | a
associations* | fossil*
coal* | fuel*
in | said*

b) Frequency

In corpus linguistics generally, the most basic and most advantageous aspect is the ability to detect trends through frequency lists (Gries 2009). This pertains to collocations as well. Function or grammatical items are usually the most frequently occurring items in collocation. It is suggested that they are perhaps not very important for the study of lexical imbuement trends, as function words carry little to no meaning. Halliday (1966) argues that grammatical items are most likely to be collocationally neutral. Hunston (2002), on the other hand, includes grammatical items in her definition of collocation and says that ‘“small words” […] are crucial to textual meaning’ (ibid., 2008). Frequency of use is a measure of the collocation’s typicality (Brezina, McEnery, and Wattam 2015). If the collocation does not occur with a relatively high frequency or ‘more frequently than expected by chance’ (Gries 2013), there are probably limited consequences for meaningful semantic transfer. The task for the researcher is therefore to determine the minimum number of occurrences per collocation, which depends on the size of the corpora in use. However, as described in section 2.2.2, frequency is not the only factor that determines the strength of the collocation.

c) Exclusivity

Exclusivity concerns the number of times two words ‘solely or predominantly [appear] in each other’s company’ (Gablasova, Brezina, and McEnery 2017), and the likelihood of the items appearing in co-occurrence rather than in isolation or in other collocations, i.e., the strength of the attraction. If a collocation is exclusive, its predictability should be high; one should expect to see item b in close proximity to item a. One should also predict a higher imbuement effect in exclusive collocations. In the CCC, change appears 1758 times, and collocates with climate 1533 times. This means that, in this corpus, change appears 87% of the time as a collocate of climate. This gives reason to believe that change, in this context, collocates rather exclusively with climate, and that one should expect to see the one in the neighbouring environment of the other.
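The proportion quoted above can be checked with a few lines of arithmetic. The two counts are the ones given in the text; the variable names are only illustrative.

```python
change_total = 1758         # occurrences of "change" in the CCC (from the text)
change_with_climate = 1533  # occurrences of "change" as a collocate of "climate" (from the text)

share = change_with_climate / change_total
print(f"{share:.1%}")  # 87.2%
```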


d) Directionality

Directionality refers to the direction of attraction in the collocation. Traditionally, measurements of collocation have been bidirectional or symmetric, but ‘the strength of the attraction between [the] two words are rarely symmetrical’ (Brezina, McEnery, and Wattam 2015). Compare the example in Gries (2013), (of → in spite) and (in spite → of): it is quite clear that these options are not symmetrical in collocation strength. Thus, the strength of influence may be stronger in one direction than in the other (Gablasova, Brezina, and McEnery 2017).
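To make the asymmetry concrete, the sketch below computes a directional measure of the Delta P type (Gries 2013) from a 2x2 contingency table, where Delta P is the probability of one word given the other minus its probability in the absence of the other. The function name delta_p and all four counts are invented for illustration; they are not taken from Gries (2013) or from the CCC.

```python
def delta_p(a, b, c, d):
    """Directional association from a 2x2 table.

    a: w1 and w2 together, b: w1 without w2,
    c: w2 without w1,      d: neither word.
    Returns (w1 -> w2, w2 -> w1).
    """
    w1_to_w2 = a / (a + b) - c / (c + d)  # P(w2 | w1) - P(w2 | not w1)
    w2_to_w1 = a / (a + c) - b / (b + d)  # P(w1 | w2) - P(w1 | not w2)
    return w1_to_w2, w2_to_w1

# Hypothetical counts for w1 = "in spite" and w2 = "of".
forward, backward = delta_p(a=90, b=10, c=9910, d=89990)
print(round(forward, 3), round(backward, 3))
```

With these invented counts, in spite predicts of far better than of predicts in spite, which is the kind of asymmetry a symmetric measure cannot show.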

e) Dispersion

Dispersion is the even, or uneven, distribution of the collocation across the texts in the corpus. If the collocation only appears in a few of the texts, it is perhaps not very regular or significant for the overall discourse (Gablasova, Brezina, and McEnery 2017). The distribution of high-frequency words, similar to raw frequency, may uncover general tendencies of the corpus, but if some of the high-frequency words only appear in one or a few texts, the probability of these words being significant for the corpus is on the lower end. An unevenly dispersed collocation, i.e., a collocation that appears in a low number of texts in the corpus, is generally considered more specialized (Gries 2008).
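One very simple way to express dispersion is the share of texts in which an item occurs at least once. The sketch below assumes this 'range' reading, which is only one of several dispersion measures; the function name text_range and the per-text counts are hypothetical.

```python
def text_range(counts_per_text):
    """Share of texts in which the item occurs at least once."""
    hits = sum(1 for count in counts_per_text if count > 0)
    return hits / len(counts_per_text)

# Hypothetical per-text frequencies: both items occur 12 times in total.
even = [2, 3, 2, 3, 2]     # spread over all five texts
uneven = [0, 12, 0, 0, 0]  # concentrated in a single text

print(text_range(even), text_range(uneven))
```

Both items have the same raw frequency, but only the first is present throughout the hypothetical corpus.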

f) Type-Token Distribution

Type-token distribution, or ratio, concerns the lexical diversity of a corpus, i.e., whether the corpus comprises a wide range of vocabulary or recycles a limited set of words (Brezina 2018b, 57). A larger number of word forms (types) in proportion to the running words (tokens) points to a more lexically diverse text (ibid.).
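The ratio described above can be sketched as the number of distinct word forms divided by the number of running words. The function name and both toy strings are hypothetical, and the sketch ignores the known sensitivity of this ratio to text length.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by running words (tokens)."""
    return len(set(tokens)) / len(tokens)

varied = "ice caps melt while sea levels rise".split()          # 7 tokens, 7 types
repetitive = "the climate the climate the climate".split()      # 6 tokens, 2 types

print(type_token_ratio(varied), round(type_token_ratio(repetitive), 2))
```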

g) Connectivity

The last criterion of collocations suggested by Brezina et al. (2015) is connectivity. They argue that collocations should not be viewed in isolation but should be analysed in the larger context they appear in, called collocation networks. This, again, can be viewed as an extension of the general collocation concept. Connectivity and collocation networks are detailed in section 2.4.

This list from Brezina et al. (2015) is an example of a few of the most dominant features of what can help us define Firth’s ‘company’, but as the discord of definition prevails, collocation studies are not unproblematic. The variance and continuous evolution of this field enable exploration and give the opportunity to form a study in accordance with the researcher’s own wishes and purposes. Although this freedom to vary the definition of collocation can give subjective results, it can also contribute to the advancement of a systematization of what defines a collocation (if this is the ideal). However, such creativity calls for substantial replicability.

While lacking a clear and universal definition, collocations are argued to be ‘a fundamental organizing principle of language in use’ (Stubbs 2001, 60).

2.3.2 Association Measures

Computer-aided collocational studies make use of a range of statistical tools to measure the criteria above². Just as opinions differ on what defines a collocation, so do the tools used to measure collocations. The most basic statistical tools are measures of different types of frequency, such as raw frequency, percentages, and normalised frequency, as well as type-token distribution. These types of statistics are called descriptive statistics (McEnery and Hardie 2012, 49–51). The statistical tools that are most relevant for collocation practices are called association measures. Their objective is to measure the strength of association between the node and the collocate. More advanced association measures generally compare the observed and expected frequency in various forms (Brezina, McEnery, and Wattam 2015). Extensive overviews of existing association measures can be found in Evert (2004), Weichmann (2008) and Pecina (2009).

2 There are non-statistical methods of studying collocation, where the study is based on intuition and scanning of concordance lines. This technique is called collocation-via-concordance, but usually some statistical testing is applied to collocation analysis.


The association measures described in this section are those that can be found in the corpus tool GraphColl (see section 3.2.1). Most measures available through corpus tools base their association on frequency and exclusivity (Brezina 2018b, 69–71). Exclusivity is determined by the effect size, i.e., the size of the ratio between observed and expected frequency (Hoffmann et al. 2008, 150), which measures the strength of the relationship. It is usually measured with the Mutual Information score (MI) (Church and Hanks 1990), which effectively filters out seemingly unimportant co-occurrences. MI favours rare exclusive collocations, and therefore has a low-frequency bias. This can be adjusted by setting a higher minimum frequency of collocations. To avoid the low-frequency bias, the squared version of MI, MI² (Evert 2004), has been recommended. This favours exclusive collocations but not rare ones.

Another option is the cubed version, MI³ (Evert 2004; Daille 1995), which finds frequent in addition to exclusive collocations (Hoffmann et al. 2008, 156). MI³ gives weight to exclusivity, but also to co-occurrence frequency, generating lists that can include both rare and frequent collocations, including function words (ibid.). Smadja et al. (1996) argue that the Dice coefficient is a better tool for measuring exclusivity: it captures frequent exclusive collocates and the exclusivity of frequent collocates, showing neither a low-frequency nor a high-frequency bias, that is to say that its function is the opposite of MI³ (Hoffmann et al. 2008, 156).

These association measures are compared in table 2.3.4.
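The behaviour of the exclusivity-oriented measures can be sketched with their textbook forms: MI is the base-2 logarithm of observed over expected frequency, MI² and MI³ raise the observed frequency to the second and third power, and Dice is twice the co-occurrence frequency divided by the sum of the two word frequencies. The expected frequency here is the simple independence estimate; corpus tools may additionally correct it for window size, so the values need not match GraphColl’s output. All counts are invented.

```python
import math

def expected(f_node, f_coll, n):
    """Expected co-occurrence frequency if node and collocate were independent."""
    return f_node * f_coll / n

def mi(o, f_node, f_coll, n, power=1):
    """MI (power=1), MI2 (power=2) or MI3 (power=3): log2 of O**power / E."""
    return math.log2(o ** power / expected(f_node, f_coll, n))

def dice(o, f_node, f_coll):
    """Dice coefficient: 2 * O / (f(node) + f(collocate))."""
    return 2 * o / (f_node + f_coll)

N = 100_000  # hypothetical corpus size
rare = dict(o=4, f_node=100, f_coll=4)          # hypothetical rare, exclusive pair
frequent = dict(o=50, f_node=100, f_coll=500)   # hypothetical frequent pair

for name, pair in (("rare", rare), ("frequent", frequent)):
    scores = [mi(n=N, power=p, **pair) for p in (1, 2, 3)]
    print(name, [round(s, 2) for s in scores], round(dice(**pair), 3))
```

With these invented counts, MI ranks the rare pair above the frequent one, whereas MI², MI³ and Dice rank the frequent pair higher, which mirrors the low-frequency bias described above.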

Other association measures favour significance and frequency. Log-likelihood measures the significance of the collocate, i.e., whether the observed frequency is higher than the expected frequency and whether the results are unlikely to be due to chance (ibid., 150–153; Dunning 1993). T-score measures frequency of rare collocates and the significance of frequent collocates (Church et al. 1991), while Z-score considers the effect size of rare collocates and the significance of frequent ones (Hoffmann et al. 2008). Minimum Sensitivity (MS) calculates how often a word appears alone in the corpus compared to how often it appears as part of a collocation, i.e., its dependence (Pedersen 1998), and will therefore not likely include high-frequency grammatical words.

Effects of these association measures can be viewed in table 2.3.5.
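The measures in this group can be sketched in the same way, again in their textbook forms and without any window-size correction: T-score divides the difference between observed and expected frequency by the square root of the observed frequency, Z-score divides it by the square root of the expected frequency, log-likelihood sums over the four cells of the contingency table, and MS takes the smaller of the two conditional proportions. The function names and counts are hypothetical.

```python
import math

def expected(f_node, f_coll, n):
    return f_node * f_coll / n

def t_score(o, f_node, f_coll, n):
    return (o - expected(f_node, f_coll, n)) / math.sqrt(o)

def z_score(o, f_node, f_coll, n):
    e = expected(f_node, f_coll, n)
    return (o - e) / math.sqrt(e)

def minimum_sensitivity(o, f_node, f_coll):
    return min(o / f_node, o / f_coll)

def log_likelihood(o, f_node, f_coll, n):
    """2 * sum of O_ij * ln(O_ij / E_ij) over the 2x2 contingency table."""
    observed = [[o, f_node - o], [f_coll - o, n - f_node - f_coll + o]]
    rows = [f_node, n - f_node]
    cols = [f_coll, n - f_coll]
    total = 0.0
    for i in range(2):
        for j in range(2):
            if observed[i][j] > 0:  # empty cells contribute nothing
                e_ij = rows[i] * cols[j] / n
                total += observed[i][j] * math.log(observed[i][j] / e_ij)
    return 2 * total

N = 100_000  # hypothetical corpus size
rare = dict(o=4, f_node=100, f_coll=4)          # hypothetical rare, exclusive pair
frequent = dict(o=50, f_node=100, f_coll=500)   # hypothetical frequent pair

for name, pair in (("rare", rare), ("frequent", frequent)):
    print(name,
          round(log_likelihood(n=N, **pair), 1),
          round(t_score(n=N, **pair), 2),
          round(z_score(n=N, **pair), 2),
          round(minimum_sensitivity(**pair), 2))
```

On these invented counts, all four measures place the frequent pair above the rare one, in contrast to plain MI.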


Table 2.3.4 Example of association measures favouring exclusivity. Here, showing the top 5 collocates of weather in the CCC calculated by MI, MI², MI³, and Dice coefficient. (GraphColl’s default thresholds and min. freq. of 3. Recurring words marked for comparison).

Raw frequencies   MI            MI²           MI³           Dice coefficient
the               forecasters   extreme       extreme       extreme
extreme           extreme       events        events        events
and               events        forecasters   and           patterns
events            patterns      patterns      forecasters   heatwaves
in                forecasts     forecasts     patterns      storms

Table 2.3.5 Example of association measures favouring frequency. Here, showing the top 5 collocates of weather in the CCC calculated by LL, T-score, Z-score and MS. (GraphColl’s default thresholds and min. freq. of 3. Recurring words marked).

Raw freq.   Log-likelihood   T-score   Z-score       MS
the         extreme          extreme   extreme       extreme
extreme     events           the       events        events
and         and              and       forecasters   heatwaves
events      the              events    patterns      conditions
in          in               in        forecasts     patterns


In addition to these association measures, which base their association on either frequency or exclusivity, Gries (2013) suggests Delta P, which accounts for directionality, and Brezina (2014) suggests Cohen’s d, which measures dispersion. There have been several attempts to compare the different association measures, especially MI and T-score (see Hunston (2002), Church et al. (1991) and Stubbs (1995)). As there is no universally agreed-upon definition of what constitutes a collocation, there can be no clear conclusion as to which association measure produces the best results. Again, the choice of association measure is up to the researcher and their preferred definition of collocation. One should keep in mind that this choice of statistical tool has a major impact on the results, and therefore needs to be carefully considered (McEnery and Hardie 2012).

2.3.3 Abstractions of Collocation

In addition to collocation, I will also discuss other word co-occurrence concepts, such as abstractions of collocation in this section and networks in section 2.4. To avoid confusion, I will therefore use the term simple or basic collocation to address the original two-word association when it is discussed alongside other word association theories. Some of these theories are abstractions of collocation with a shared tendency of frequent word co-occurrence. Abstractions from collocation presented by Stubbs (2001) are (1) colligation: a pairing of lexis and grammar, where the node frequently co-occurs with a particular grammatical category, (2) semantic preference, where the node is associated with lexical items of a specific semantic category, and (3) discourse prosody, which is a description of the sense of the environment the node is found in. An example of discourse prosody is a node that is usually found in an environment that denotes something happy or sad. Both semantic preference and discourse prosody can be illustrations of lexical imbuement (Baker 2016).

As mentioned above, collocation may also overlap with other word units, or lexical units. These often consist of sequences of two or more items, while collocations, for the most part, are just two words. Although these sequences, like collocation, are included in lexical studies, they find their place somewhere between lexis and syntax, such as in the study of phraseology. If a collocation consists of two adjacent lexical items, also called a bigram, the collocation can be part of a larger consecutive multi-word unit or recurring phrase. A further abstraction of collocation and colligation is collostruction, first proposed by Gries and Stefanowitsch (2004). They suggest three ways of observing a node in frequently recurring grammatical constructions, which are (1) the node co-occurring with one grammatical construction, (2) the node co-occurring with two similar constructions, or (3) the node in a dependent slot co-occurring with a collocate in another dependent slot in a construction (Gries 2019). Some collocations, colligations and collostructions can be viewed as semi-fixed phrases that are idiomatic in nature, but rarely idioms (Stubbs 2001, 59). The reason for this (misleading) term is their typical, probabilistic components, which are also highly variable (ibid.).

The abstractions of collocation are important to mention because of Brezina et al.’s (2015) criterion of connectivity. The production of collocation networks creates units larger than the original collocation. The probability of results showing multi-word units is higher in networks, which can give results that overlap with the abstractions described above. These are still based on the idea of association measured through frequent co-occurrence.

2.4 Collocation Networks

Firth (1957a, 7) said: ‘[…] the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously’, which has inspired the extensive literature on collocations and lexical units. The sentiment is reflected in the development of a considerable number of word co-occurrence propositions, and in the difficulty of isolating both the word and the multi-word unit and their definitions. Accordingly, it has been proposed that collocations, too, do not appear separate from their context, but form collocation clusters or networks. The theoretical foundation of collocation networks is credited to Phillips (1985) in his search for a non-linear ‘aboutness’ in textual macrostructure, and his wish to divorce the ‘aboutness’ from the constraints of generative grammar. While his general idea is said to be the catalyst for this type of lexical analysis, his methods are evaluated as questionable (Brezina, McEnery, and Wattam 2015; Grabe 1987).

Contrary to the original collocation studies, collocation networks are not widely explored. While the variation in the definitions of collocation is due to a multitude of studies, the difficulties in defining collocation networks may be due to the lack of studies. Of the few studies found, Phillips (1985) argues for a paradigmatic³ understanding of collocation networks. Brezina et al. (2015) criticise him for this by regarding his finds as ‘pseudo-synonyms’ rather than collocation networks. They say collocation networks are syntagmatic⁴ lexical sets, which they define as ‘sets of words that co-occur in sentences/discourse’ (Brezina, McEnery, and Wattam 2015). Baker (2016) finds interesting paradigmatic relationships between the lexical sets. He finds that those words that share the same collocates, but are not collocates themselves, may in fact be synonymous in the corpus from which they are extracted. Williams (2001) calls this a mediation between lexis and text.

The consensus, however, is that collocation networks can reveal something about the text or discourse through meaningful lexical patterns. Brezina et al. (2015) say ‘[c]ollocation networks […] demonstrate that meaningful patterns can be extended beyond this narrow scope and be identified at the level of the text or discourse’, while Williams (2001) also points out that these lexical patterns ‘are significant for texts emanating from a discourse community’. Most importantly, collocation networks can reveal something about the corpus and text in question which was not evident at first glance. This was found in Brezina et al. (2015) after revisiting McEnery’s (2006) study of swearing, where they confirmed previous findings but also found a new layer in the profanity. The ability to discover new layers is also pointed out by Baker (2016):

3 “Of or denoting the relationship between a set of linguistic items that form mutually exclusive choices in particular syntactic roles” (Oxford Dictionary of English via Ordnett.no).

4 “Of or denoting the relationship between two or more linguistic units used sequentially to make well-formed structures” (Oxford Dictionary of English via Ordnett.no).


[…] collocational networks give “added value” to corpus analysis by indicating relationships between multiple words which can help to suggest equivalencies, synonyms, rewordings, or related terms and concepts, which (in the case of a discourse-based analysis) may have ideological significance. They can also help to suggest relevant terms which may not have been considered for analysis in the first instance […]

Collocation networks are an extension of the simple collocation, so all the principles that pertain to collocation extraction must also be considered when producing collocation networks.

As discussed in section 2.3.3, it is important to keep in mind the distinction between collocation networks and other multi-word units, or extended units of meaning, such as n-grams, idioms, colligations and collostructions. One thing that separates the idea of collocation networks from these phrasal units, though without excluding them, is that the multi-word units are linear and sequential. The collocation networks are non-linear meaning units operating with the same set of parameters as simple collocations. There is still no generally agreed way of extracting collocation networks. Phillips (1985; 1989) extracted his collocation networks by cluster analysis. Williams (1998) takes another approach, which is to make ‘collocation-chains’, where his method is to consider each collocate as a new node, while Baker (2005; 2014) and McEnery (2006) calculate networks based on keywords in their corpora. But the idea of collocation networks did not really gain notice until the release of GraphColl, available through #LancsBox, launched by Vaclav Brezina, Tony McEnery and Stephen Wattam in 2015. The newest version (version 5) became available in 2020 (Brezina, McEnery, and Weill-Tessier 2020). They claim that before this software package was launched, the extraction of collocation networks was rather cumbersome. To my knowledge, this is the only corpus tool, still available and functional, that offers collocation network extraction. It offers a new and easy way to extract collocation networks ‘on the fly’ with a graphic visualisation. #LancsBox and its functions are further described in section 3.2.1, and suggestions and attempts as to how to analyse collocation networks are described below in section 2.4.1 Network Analysis.


Since collocation networks are not very well established, conclusive analysis methods are yet to be found. The field is still at the starting line in discovering what these networks should look like and how these networks of words shape the text, discourse, and discourse communities. Brezina (2018a) briefly exemplifies how collocation can be used for (1) discourse analysis, (2) language learning, and (3) lexicography. In the field of lexicography, Williams (1998) and Alonso et al. (2011) show how collocation networks can be used for building (dynamic) dictionaries based on observed language use. Most pertinent to this paper, Baker (2005; 2014) and McEnery (2006) use corpora and collocational networks to analyse different discourses and discourse communities. After 2015, there have been a few studies of collocation networks with the help of the GraphColl tool (Dong and Buckingham 2018; Germond and Ha 2019; Vu and Lynn 2020), but, again, the field of collocation networks is still a rather unexplored area of corpus linguistics.

2.4.1 Network Analysis

The first type of analysis suggested for locating collocational networks was proposed by Phillips (1985; 1989), who used a type of cluster analysis. While he himself claims that hierarchical cluster analysis is ‘appropriate for investigation of the syntagmatic organisation of text’ (Phillips 1985, 77), it was criticised by Brezina et al. (2015), who argue that this way of analysing the lexical sets reveals paradigmatic, not syntagmatic, relationships, as mentioned in the preceding section. Williams (1998; 2001) suggests a stepwise procedure. This also produces a hierarchical collocation chain, where he starts with a node and expands the network out from this word, each collocate turning into a possible node.

With the introduction of #LancsBox, a graph theory approach was suggested by Baker (2016) as a possible solution for an analysis of the networks. #LancsBox also operates in a fashion similar to Williams’s approach, where each collocate can turn into a node simply by clicking on the collocate. They suggest a collocation network ordered hierarchically, where the first collocates of the node are called first-order collocates, followed by second-order collocates. One can explore collocation strings up to n-order collocates. In this exploration of collocates, one can find that perhaps second- or third-order collocates of the original node also collocate with each other, creating a system which is rather more complex than a hierarchical cluster analysis and a contingent collocation-chain. These connections can create elaborate shapes, which Baker (2016) recognised as collocation graphs, suggesting an alternative analysis based on graph theory (for an account of graph theory see Harris (2008) and West (2001)) in which these different graphs can reveal the relationships between the words in the network. While the suggestion of a hierarchical structure of collocates in a collocation network can be an organised fashion in which to analyse the networks, I find it to conflict with Sinclair et al.’s (2004) definition of collocate and node, which is that there is no difference in status between the two.
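The stepwise expansion described above, in which each collocate can become a new node, can be sketched as a breadth-first walk over a table of collocates. This is only an illustration of the idea, not GraphColl’s implementation. The entries for power and coal-fired follow the collocates reported for the CCC in section 3.2.1; the entry for wind, the word turbines and the function names are hypothetical.

```python
# Collocates per node. "power" and "coal-fired" follow section 3.2.1;
# the "wind" entry is invented to show a second-order collocate.
NETWORK = {
    "power": {"coal-fired", "stations", "plants", "nuclear", "wind"},
    "coal-fired": {"power", "plants", "stations"},
    "wind": {"power", "turbines"},
}

def expand(network, start, max_order):
    """Return {word: order}, where order is the number of steps from start."""
    order = {start: 0}
    frontier = [start]
    for depth in range(1, max_order + 1):
        next_frontier = []
        for node in frontier:
            for collocate in sorted(network.get(node, ())):
                if collocate not in order:
                    order[collocate] = depth
                    next_frontier.append(collocate)
        frontier = next_frontier
    return order

def shared_collocates(network, a, b):
    """Collocates that two nodes have in common."""
    return network.get(a, set()) & network.get(b, set())

orders = expand(NETWORK, "power", max_order=2)
shared = shared_collocates(NETWORK, "power", "coal-fired")
print(orders)
print(sorted(shared))
```

The order assigned to each word depends entirely on the starting node, which illustrates why a hierarchical reading sits uneasily with the view that node and collocate have equal status.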

Considering the tolerance for differing opinions and interpretations in collocational studies, one should expect a similar open-mindedness towards various interpretations of its networks as well. If networks are embraced in further studies, one can perhaps expect as many perceptions of networks as of collocations.


Chapter 3: Material and Method

3.1 Material

3.1.1 Preliminaries

The media is often regarded as the fourth power of state, its role often thought to be holding the three governmental estates accountable for their conduct. The newspapers set the public agenda and direct the public conversation and communities’ information acquisition (Aronsen 2016), and can thus inspire reaction, or lack thereof, in the face of local, national, and global events. By being generally accessible to all, the media can, as a non-governmental power, shape public beliefs and mindsets to an extent no other controlling estate can. News media is therefore paramount in shaping its receivers’ opinion of climate change. This is the reason for the importance of media language study: to uncover the media’s account, narrative and perception of global climate change and how that is reflected in word choice and text structure.

The material chosen for this thesis is articles from the British newspaper the Guardian. The choice was made on the grounds of the newspaper’s active choice to change its vocabulary in its global climate change discourse to ‘more accurately describe the environmental crisis facing the world.’ (Carrington 2019). Their choice was made after the UK’s declaration of ‘a climate emergency’. One of the advantages of using the Guardian as source material is that, even though it strictly speaking is a British paper, it is also international. The Guardian represents several inner-circle English-speaking countries such as the US, Australia, and New Zealand in addition to Great Britain, and therefore covers more ground than many other national newspapers. Another reason for using the Guardian as the source of study is that they have claimed a deliberate consciousness of their approach to global climate change, have put climate change on the agenda, and wish to ‘play a leading role in reporting on the environmental catastrophe’ (the Guardian 2019). Moreover, they resolved to change their language in their approach to global climate change accordingly (Carrington 2019). One of the aims in this thesis is to explore what this attention to and awareness of language use, and attitude to a specific subject, does to the overall discourse.

Usually, the most recent articles would be preferable for an analysis such as this, but as 2020 has faced the Covid-19 pandemic, the media coverage from that year is coloured by this unique situation. I have made the decision not to include material from 2020 for this reason. The material that will be used is articles from 2019, which is fitting as the media was especially focused on global climate change that year, as mentioned in section 1.1.1. The events of 2019 include a young Swedish activist, Greta Thunberg, leading the largest climate change demonstration in history (Sengupta 2019; Barclay and Resnick 2019; Taylor, Watts, and Bartlett 2019) and Oxford Languages reporting from their own corpus the rise of climate-specific language in their article ‘Word of the Year’ in 2019, a title they awarded climate emergency (Oxford Languages 2019).

To achieve a lexical and collocation study of this type of material, a small subject-specific, or specialised, corpus was collected. It has been argued (Walter 2010; Sinclair 2004) that smaller corpora are insufficient for carrying out lexical research, as the results will be limited. Though Sinclair (2004) so harshly states: ‘There is no virtue in being small’, the advantage of using a smaller corpus is that all occurrences of an item may be examined. With smaller corpora, it is much easier to examine the contextual aspects of the specific corpus and its texts (Flowerdew 2004), which is the purpose of this thesis. The quantitative findings in small, specialised corpora ‘can be balanced and complemented with qualitative findings’ (Koester 2010; Flowerdew 2004), instead of being balanced through sheer size, and they can ‘reveal connections between linguistic patterning and context of use’ (O'Keeffe 2007 in Koester 2010). In smaller specialised corpora with even relatively small amounts of data, ‘specialized lexis and structures are likely to occur with more regular patterning and distribution’ (ibid.) than in the more general corpora, where a larger amount of data is needed to uncover lexical patterning (Koester 2010).


3.1.2 The Climate Change Corpus

As apparent in the paragraphs above, the parameters for the specialised corpus collected for this thesis are electronic newspaper articles, editorials and features concerning the subject matter ‘climate change’ from the Guardian in 2019. The articles were found using the Factiva newspaper database used with permission from MR, a web site from Dow Jones & Company through a University of Oslo subscription. In this search engine, the newspaper the Guardian was chosen with articles marked with the subject ‘Climate Change’ published from 1 January 2019 to 31 December 2019. This resulted in 1,288 hits.

A general assessment is that a sufficient size for a specialized corpus is up to 250 000 tokens (Koester 2010; Flowerdew 2004). To be certain to reach this minimal number of tokens, I aimed for 300 000 tokens. Many corpus compilers have set the average text sample of a corpus at around 2,000 tokens (see e.g., the Brown corpus or the International Corpus of English), while others claim that the text sample may be 1,000 tokens (Flowerdew 2004; Koester 2010). The average newspaper article is between 500 and 800 tokens, which means that it is impossible for all the sample texts in this corpus to conform to the average sample size. Flowerdew (2004), however, argues that it is more important that the corpus’s texts are complete texts than that they adhere to a specific size. The average newspaper article size suggests a sampling size of about 450 articles. A systematic sampling method was chosen, selecting every 3rd article, which would result in the preferred sampling size. When this was done, I eliminated irrelevant texts which did not qualify as newspaper articles, such as readers’ letters and correction statements, cleared the articles of elements that were not part of the running text, such as clippings from the social media platforms Twitter and Instagram, and corrected obvious spelling mistakes.
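The systematic sampling step can be sketched as taking every third item from the list of hits. The starting point is an assumption (the text does not say whether counting began at the first or the third hit), and the numbers 1 to 1,288 merely stand in for the Factiva results.

```python
hits = list(range(1, 1289))  # stand-ins for the 1,288 Factiva hits
sample = hits[2::3]          # assumed start: the 3rd hit, then every 3rd (3, 6, 9, ...)

print(len(sample), sample[:3], sample[-1])
```

Under this assumption the procedure yields 429 candidate articles before irrelevant texts are removed.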

The corpus contains 393 electronic newspaper articles with an average of 825 tokens per article. The corpus overall contains 324 029 tokens. All articles are from either the Guardian or the Observer (a Sunday paper published under the Guardian Media Group (Guardian Media Group 2018)), and represent articles, editorials, and features that in some way or other concern climate change. All texts from the Guardian and the Observer used single closing quotation marks instead of apostrophes, which I needed to change in order to compare the corpus accurately with the reference corpus. I decided to call the corpus ‘the Climate Change Corpus’, abbreviated the CCC. A list of all texts found in the corpus can be viewed in appendix 1.
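A normalisation of this kind can be sketched as a plain character replacement. The exact procedure used for the CCC is not described, so the function below is an assumption: it maps every right single quotation mark (U+2019) to the ASCII apostrophe (U+0027), which would also affect closing quotation marks around quoted speech.

```python
def normalise_apostrophes(text):
    """Replace the right single quotation mark (U+2019) with the ASCII apostrophe."""
    return text.replace("\u2019", "'")

raw = "won\u2019t change its climate change policy"
clean = normalise_apostrophes(raw)
print(clean)
```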

Due to copyright issues, I cannot publish the corpus created, making this study not easily replicated and the results not easily verified. However, the results can be tested on a corpus compiled according to the criteria described here. If the sampling method is replicated, one should acquire close to identical material. This material may give an understanding of patterns in the discourse of an international English newspaper that has made a hitherto unique choice on how to communicate one subject, and what the consequences of this choice may be.

3.1.3 Reference Corpus: BNC2014 Baby+

The BNC2014 Baby+ was chosen as a comparable or reference corpus. The BNC2014 Baby+ is a small subset of the new BNC2014 (The British National Corpus 2014), released to mark the second stage of the release of BNC2014, and is available through the #LancsBox software (Brezina and McEnery 2019). I originally wanted to use the reference material found in the ‘serious news’ section of the BNC2014 Baby+ corpus, as this section is approximately the same size as the CCC and contains material from the top serious newspapers in the UK such as the Times, the Sunday Times, and Financial Times, as well as the Guardian and the Observer. The material is electronic articles collected between 2014 and 2016. Even though these parameters make this section of the corpus an excellent comparable corpus for the CCC, two relatively small corpora yielded far too many false hits when comparing effect size, as many words had zero appearances in one or the other. I therefore had to choose the whole corpus as a reference corpus. The BNC2014 Baby+ is a 5 mill. word corpus, containing material from several written and spoken genres of British English.
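The zero-frequency problem can be illustrated with a plain ratio of relative frequencies. This is a simplification chosen for illustration and not necessarily the effect-size statistic used in the thesis or in #LancsBox; the function name and the word frequencies are invented, while the corpus sizes are those given in the text.

```python
def relative_frequency_ratio(freq_target, size_target, freq_ref, size_ref):
    """Ratio of per-token frequencies; None when the word is absent from the reference corpus."""
    if freq_ref == 0:
        return None  # the ratio is undefined, so the word cannot be ranked reliably
    return (freq_target / size_target) / (freq_ref / size_ref)

CCC_SIZE = 324_029      # tokens in the CCC (from the text)
REF_SIZE = 5_000_000    # approximate size of the BNC2014 Baby+ (from the text)

# Hypothetical frequencies for a single word.
absent_in_reference = relative_frequency_ratio(10, CCC_SIZE, 0, REF_SIZE)
present_in_reference = relative_frequency_ratio(10, CCC_SIZE, 15, REF_SIZE)
print(absent_in_reference, round(present_in_reference, 1))
```

The smaller the reference corpus, the more words fall into the undefined case, which is one way to read the many false hits reported above.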


3.2 Method

This corpus-assisted discourse analysis is a three-part stepwise procedure used to uncover significant words and lexical items. The aim is to see what patterns, trends and networks may reveal crucial information about the genre, theme, and sender, and the geographical and temporal location of the texts. Keywords (section 3.2.2), collocation (section 3.2.3) and collocational networks (section 3.2.4) will be identified in the abovementioned corpus, the Climate Change Corpus (the CCC), with the help of the relatively new and untried corpus tool #LancsBox (section 3.2.1). The essential keywords and collocations will be analysed with consideration of statistical values and morphological, syntactic, and semantic realisations, with special consideration for co- and context in the spirit of the Neo-Firthian school of thought.

3.2.1 Software: #LancsBox

#LancsBox (Brezina, McEnery, and Weill-Tessier 2020) is a relatively new corpus software package that allows you to upload your own, or existing, corpora and use familiar corpus tools, such as KWIC (keyword in context), dispersion analysis, word lists and n-grams. #LancsBox gives the user an easy way to work with ‘homemade’ corpora with minimal material preparation, and it incorporates TreeTagger (Schmid 1994), which automatically annotates any uploaded corpus with part-of-speech tags, using the TreeTagger tag set (Anthony 2015), and with lemmas. Part-of-speech tagging enables analysis of lexicosyntactic functions and realisations, while automatic lemmatisation groups the word-forms of each lemma together. In #LancsBox, traditional corpus tools such as keywords in context and concordance lines are found in the KWIC-tool, distribution analysis in the Whelk-tool, keyword extraction by corpus comparison and frequency analysis in the Words-tool, identification of n-grams and lexical bundles in the Ngram-tool, and a corpus overview in the Text-tool, while the Wizard-tool combines all the above-mentioned tools.

What #LancsBox offers that is new is the GraphColl-tool, first presented by Brezina et al. (2015). GraphColl allows for a completely new way of visualising collocation and for the creation of collocation networks. Brezina (2018a) points out that collocations have usually been displayed in tables. GraphColl instead lets one view collocation in a graphical format, where the graphs illustrate the information otherwise found in the tables. The strength of the association between node and collocate, measured by the association measure (section 2.3.2), is illustrated by the distance to the node: the closer the collocate is to the node, the stronger the association. Frequency is displayed by the intensity of the collocate’s colour: the darker the shade, the higher the frequency. The position of the collocate in the graph indicates whether the collocate tends to appear on the left or right side of the node. Compare table 3.2.1 with its visualisation in figure 3.2.1.
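The visual encoding described above can be approximated in a few lines of Python. This is only a sketch of the principle, not GraphColl's actual layout algorithm: the scaling formulas and the function name `encode_collocates` are assumptions made for illustration, while the collocates and values are those of table 3.2.1.

```python
# (position, collocate, MI score, collocation frequency), from table 3.2.1
COLLOCATES = [
    ("L", "coal-fired", 10.06404, 30),
    ("R", "stations", 10.03489, 21),
    ("R", "plants", 8.242036, 16),
    ("L", "nuclear", 8.023396, 10),
    ("L", "wind", 7.654162, 10),
    ("L", "our", 4.461004, 17),
]


def encode_collocates(collocates):
    """Map each collocate to illustrative display attributes.

    distance: 0.0 (strongest association) to 1.0 (weakest), assumed linear
    shade:    proportion of the highest frequency (1.0 = darkest)
    side:     'left' or 'right' of the node
    """
    scores = [score for _, _, score, _ in collocates]
    freqs = [freq for _, _, _, freq in collocates]
    lowest, highest = min(scores), max(scores)
    encoded = {}
    for position, word, score, freq in collocates:
        encoded[word] = {
            # Assumed linear scaling; guard against identical scores.
            "distance": (highest - score) / (highest - lowest)
            if highest > lowest
            else 0.0,
            "shade": freq / max(freqs),
            "side": "left" if position == "L" else "right",
        }
    return encoded


display = encode_collocates(COLLOCATES)
```

Under this encoding, coal-fired sits closest to the node and has the darkest shade, whereas our is placed furthest away even though its collocation frequency is higher than that of plants, nuclear and wind.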

Table 3.2.1 Collocation table of power in the CCC. Span +/- 3. MI: 4.0. Min. freq.: 10.

Index  Status  Position  Collocate   Stat      Freq (coll.)  Freq (corpus)
1      ○       L         coal-fired  10.06404  30            35
2      ○       R         stations    10.03489  21            25
3      ○       R         plants      8.242036  16            66
4      ○       L         nuclear     8.023396  10            48
5      ○       L         wind        7.654162  10            62
6      ○       L         our         4.461004  17            964

In addition to this new visualisation of the standard collocation, GraphColl also allows for the production of collocation networks. Simply by double-clicking one of the collocates, GraphColl creates a new graph in which we now have two nodes, their collocates and, most importantly, their shared collocates. In figure 3.2.2, we see both power and coal-fired (coal-fired being a collocate of power), along with their shared collocates plants and stations. Clicking on more collocates will result in even bigger collocation networks. For more on the software #LancsBox, see Brezina (2020).
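The principle behind a collocation network, namely finding the collocates that two nodes have in common, can be sketched with simple set operations. The collocates of power are those listed in table 3.2.1. For coal-fired, only the collocates named in the discussion of figure 3.2.2 are included, so that set is an incomplete, illustrative assumption. Note also that table 3.2.1 uses a minimum frequency of 10 whereas figure 3.2.2 uses 5, so the actual network contains more collocates than this sketch. The helper `shared_collocates` is hypothetical.

```python
# Collocates of 'power' as listed in table 3.2.1 (min. freq. 10).
power_collocates = {"coal-fired", "stations", "plants", "nuclear", "wind", "our"}

# Collocates of 'coal-fired' named in the text only (incomplete assumption).
coal_fired_collocates = {"power", "plants", "stations"}


def shared_collocates(first, second):
    """Return the collocates that two nodes have in common, sorted."""
    return sorted(first & second)


network = {
    "nodes": ["power", "coal-fired"],
    # Collocates linked to both nodes in the network graph.
    "shared": shared_collocates(power_collocates, coal_fired_collocates),
    # Collocates linked to 'power' only (the second node itself excluded).
    "only_power": sorted(
        power_collocates - coal_fired_collocates - {"coal-fired"}
    ),
}
```

Each additional node selected in GraphColl corresponds, in this simplified view, to one more set whose overlap with the existing sets is added to the network.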


Figure 3.2.1 Collocation visualisation of power in the CCC created by the GraphColl tool in #LancsBox. Span +/- 3. MI: 4.0. Min. freq.: 10.

Figure 3.2.2 Collocation network graph of power and coal-fired in the CCC created by the GraphColl tool in #LancsBox. Span +/- 3. MI: 4.0. Min. freq.: 5.
