The discipline of corpus linguistics provides a suitable methodology for studying authentic texts in their context. According to Hunston (2006), a “corpus is an electronically stored collection of samples of naturally occurring language”.

McEnery (2003) asserts that a corpus is machine-readable. He defines a corpus as “a body of machine-readable linguistic evidence, which is collected with reference to a sampling frame” (McEnery, 2003, 450). Corpus data are stored and indexed in such a way that they are searchable with computer software. Additionally, corpus data can be preprocessed and tagged with structural markers that identify documents, chapters, sections, paragraphs and sentences. The data can then be tokenized to identify each unit, annotated with part-of-speech tags, lemmatized and chunked. Other researchers prefer to store corpora without any of these annotations in order to keep the data as close as possible to the original text. Furthermore, corpora can be monolingual, parallel or multilingual (McEnery, 2003; Aijmer, 2008).
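To illustrate the annotation pipeline just described, the short Python sketch below tokenizes, lemmatizes and part-of-speech-tags a sample sentence with the spaCy library. It is a minimal sketch under stated assumptions: the model name en_core_web_sm and the sample sentence are illustrative choices, not the tools or texts of this study.

```python
# A minimal sketch of the tokenize -> lemmatize -> POS-tag pipeline
# described above, using spaCy. The model and the sample sentence
# are illustrative assumptions, not the corpus of this study.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("The parties signed a free trade agreement last year.")

for token in doc:
    # token.text: the unit identified by tokenization
    # token.lemma_: its lemmatized (dictionary) form
    # token.pos_: its part-of-speech tag
    print(f"{token.text}\t{token.lemma_}\t{token.pos_}")

# Chunking: noun phrases identified by the parser
for chunk in doc.noun_chunks:
    print(chunk.text)
```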

In contrast to doing linguistic research by means of examples obtained by the linguist through introspection, corpus linguistics relies heavily on finding real examples extracted from authentic material (McEnery and Wilson, 2001).

A corpus also allows researchers from disciplines other than linguistics, such as sociologists, lawyers, economists and anthropologists, to carry out studies based on authentic texts, such as those included in the corpus used for this research. However, users of corpora differ in their methods and approaches to the use of a corpus.

To carry out this study, a parallel, annotated corpus is a vital resource because it makes it possible to find occurrences of FTA terms together with the collocates of these terms in their occurring context, not in isolation.

A corpus is an efficient tool for generating a concordance of the words under consideration, in order to perform a vertical and a horizontal examination of the words and their surrounding context, each offering different insights into these lexical units. Tognini-Bonelli (2001) explains that a horizontal reading enables the researcher to focus on larger units such as clauses, sentences and paragraphs. In contrast, a vertical reading is suitable for scanning for patterns co-occurring with the node word. Thus, using a corpus-generated concordance to perform a vertical and horizontal reading of the words under consideration offers the researcher many advantages. According to Wynne (2009, 711),

reading concordances allows the user to examine what occurs in the corpus, to see how meaning is created in texts, how words co-occur and are combined in meaningful patterns, without any fixed preconceptions about what those units are. It can be a method of approaching the corpus in a theory-neutral way. This is part of what Tognini-Bonelli (2001) calls corpus-driven linguistics.
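To make the idea of vertical and horizontal reading concrete, the sketch below generates a simple keyword-in-context concordance with NLTK. The node word "tariff" and the sample sentences are illustrative assumptions; a real study would load the full corpus instead.

```python
# A minimal keyword-in-context (KWIC) concordance sketch using NLTK.
# The sample text and the node word "tariff" are illustrative
# assumptions, not data from the thesis corpus.
from nltk.text import Text

raw = (
    "The tariff schedule was annexed to the agreement. "
    "Each party shall reduce its tariff rates progressively. "
    "No new tariff may be introduced on originating goods."
)
tokens = raw.split()  # naive whitespace tokenization for the sketch

# Text.concordance prints each hit with its left and right context,
# aligned on the node word: scanning vertically reveals recurring
# patterns, scanning horizontally shows each full context line.
Text(tokens).concordance("tariff", width=70)
```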

Among corpus linguists there is no single, unified method for doing research using corpus linguistics. Rather, there are several complementary approaches to corpus exploitation, namely corpus-based, corpus-driven and corpus-assisted research.

2.7.1 Corpus-based vs. corpus-driven research

Corpus-based and corpus-driven are two different approaches to research done using corpus linguistics. These approaches have several features in common while differing in others. Corpus-based refers to a type of research in which the researcher uses a corpus as a test-bed. Instead of relying solely on his or her intuitions, the researcher draws on the corpus for examples to test or exemplify theories and descriptions that were formulated before the creation of large electronic corpora.

The second approach, corpus-driven research, refers to a type of linguistic research in which the researcher lets the corpus “speak for itself”, using tools and techniques that exploit frequency and other statistical information in the data, with no preconceived ideas about the theoretical constraints that might rule the types of possible queries. However, some authors criticize this approach for its full reliance on data and claim that, in the end, all corpus methods are “corpus-based” (McEnery and Hardie, 2011).

In my view, no corpus research can claim total adherence to either of the two approaches. Most modern approaches combine both and are thus hybrid in nature. One approach uses linguistic knowledge expressed in the form of rules obtained from grammars, while the other relies heavily on statistical data. Today, with the growing availability of computerized corpora and the production of corpus-aware grammars, linguists have more resources for carrying out research with the aid of corpora. Some linguists also apply statistical methods to huge repositories of data, with excellent results. In this way, a combination of both approaches gives the researcher the means to process amounts of data that could not be handled before.

In accordance with what is customary in corpus linguistics, lexicography and corpus-based terminology, I use a combination of both approaches for doing corpus linguistics. This work is corpus-based in the sense that morphosyntactic patterns that form collocations in English and Spanish are used to query a corpus that was previously lemmatized and annotated with part-of-speech tags. It is also corpus-based because a set of previously identified terms or candidate terms are used as “seeds” (Baroni and Bernardini, 2004).
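As an illustration of querying an annotated corpus with a morphosyntactic pattern, the sketch below uses spaCy's rule-based Matcher to retrieve adjective-plus-seed candidates. The ADJ+NOUN-style pattern, the seed lemma "tariff" and the sample sentence are illustrative assumptions, not the actual patterns or seed list of this study.

```python
# A sketch of a morphosyntactic query over a POS-tagged, lemmatized
# text using spaCy's Matcher. The pattern and the seed lemma
# "tariff" are illustrative assumptions.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("The parties agreed to eliminate preferential tariffs "
          "and to phase out the applied tariff rates.")

matcher = Matcher(nlp.vocab)
# Match an adjective followed by a token whose lemma is the seed term.
matcher.add("ADJ_SEED", [[{"POS": "ADJ"}, {"LEMMA": "tariff"}]])

for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "preferential tariffs"
```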

Other studies have used terms as seeds (Jacquemin et al., 1997; De Groc, 2011; Ljubešić et al., 2012; Burgos, 2014). In the case of this work, these seed terms serve as a starting point for semi-automatically identifying the collocates found in the list of terms. However, this work is also corpus-driven because several applications and techniques that rely on statistics, without a priori conceptions of what is in the corpus, are used to calculate the collocability between a term and its collocates. These applications are explained in Chapter 4.
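As a concrete example of such statistical techniques, the sketch below ranks bigram collocation candidates by pointwise mutual information and the log-likelihood ratio using NLTK. These two measures stand in for the applications detailed in Chapter 4, and the token list is an illustrative assumption.

```python
# A sketch of statistical collocability scoring with NLTK, using
# pointwise mutual information (PMI) and the log-likelihood ratio.
# The token list is an illustrative assumption; Chapter 4 describes
# the actual applications used in this study.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("the free trade agreement establishes a free trade area "
          "and the free trade area covers trade in goods").split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep bigrams occurring at least twice

# Rank candidate collocations by two association measures.
print(finder.nbest(measures.pmi, 3))
print(finder.score_ngrams(measures.likelihood_ratio)[:3])
```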

The remainder of this chapter is organized as follows. First, I present a theoretical background on collocations, followed by a review of the definitions proposed by representative authors in the field and the salient characteristics of collocations. Then, I present a view on collocations from different disciplinary perspectives. Before attempting to propose a definition of specialized collocation, I describe the criteria for collocability between two or more lexical units in Section 2.11. Then, in Section 2.12, I account for the features that give these units a specialized nature.