
Since the notion of language resources has been mentioned in the previous paragraphs, it is pertinent to define it at this point. In this work, language resources refer to sets of language data and descriptions in electronic form, used to build, improve or evaluate systems or algorithms for NLP (Godfrey and Zampolli, 1997).

Cunningham and Bontcheva (2006) call these resources “the raw material of language engineering” and differentiate between language resources and processing resources. Examples of language resources are dictionaries, term bases, corpora, treebanks and lexicons. Additionally, some examples of processing resources are automatic translators, parsers and speech recognition systems.

One of the most important aspects of NLP is lexical knowledge acquisition, since the performance of any system that processes written or spoken text relies heavily on the degree of “knowledge” that the system incorporates about the linguistic data being processed (Grishman and Calzolari, 1997).

Lexical knowledge acquisition is defined as “the production or augmentation of a lexicon for a natural language processing system” (McCarthy, 2006).

Since the manual creation of these language resources is an extremely difficult task, modern lexicography and terminography rely on lexical acquisition.

However, lexical acquisition is considered a bottleneck for the development of NLP tools, since the manual creation of a lexicon is expensive and requires a large team of qualified professionals, who are not always readily available. Furthermore, the manual creation of a lexicon is a tedious and time-consuming process, one that is prone to errors and inconsistencies, even though the same could be said of conventional printed dictionaries (Fontenelle, 1994; Matsumoto, 2003). Because of this, lexical acquisition has to be aided by automated tools to be feasible.

After processing the data, the resulting lexicon is a resource such as a dictionary or thesaurus in an electronic format, but presented in such a way that it is readable by a machine and not only by a human. This includes, for example, the enrichment of a lexicon through the inclusion of the forms, meanings, synonyms, antonyms, hypernyms and phraseological information (idioms and collocations) that a given word can take. Additional information includes the associated statistical information on their distribution, which may be of no interest to a human reader, but which proves vital for a computational system designed to perform complex operations such as word sense disambiguation, ATE, collocation extraction and similar tasks (Lyse, 2011).
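By way of illustration, such an enriched entry can be thought of as a structured record. The following Python sketch is purely illustrative: the field names, the example word and the frequency values are assumptions made for the sake of the example, not data taken from any actual lexicon.

# Illustrative sketch of an enriched, machine-tractable lexicon entry.
# All field names and values are invented for demonstration purposes.
entry = {
    "lemma": "interest",
    "pos": "noun",
    "forms": ["interest", "interests"],
    "senses": [
        {
            "id": "interest_1",
            "gloss": "money paid for the use of borrowed money",
            "synonyms": ["yield", "return"],
            "hypernyms": ["payment"],
            "collocations": ["compound interest", "interest rate"],
            # Distributional statistics of no interest to a human reader,
            # but vital for tasks such as word sense disambiguation.
            "relative_frequency": 0.62,
        },
        {
            "id": "interest_2",
            "gloss": "a feeling of wanting to know more about something",
            "synonyms": ["curiosity"],
            "hypernyms": ["feeling"],
            "collocations": ["keen interest", "lose interest"],
            "relative_frequency": 0.38,
        },
    ],
}

# A disambiguation component could, for instance, fall back on the most
# frequent sense when the context provides no conclusive evidence.
most_frequent = max(entry["senses"], key=lambda s: s["relative_frequency"])
print(most_frequent["id"])  # prints: interest_1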

Calzolari (1994) points out that it is almost a tautology to say that a good computational lexicon is an essential component of any linguistic application within the so-called “language industries”, ranging from NLP systems to lexicographic projects. In other words, if an automated system for the processing of lexica is going to perform its tasks in an efficient and effective manner, it has to rely on the most complete repertoire of lexical information available (Pustejovsky, 1998).

Language resources are relevant for this project because, with existing language processing tools, general and specialized lexicons and corpora, it is possible to find terms and the specialized collocations associated with these terms, which can in turn help create or improve other resources. The language resources used in this work are described in Section 4.4.2.

2.4.1 Dictionaries and Computational Lexicons

Currently, dictionaries are increasingly produced in electronic format because of the clear advantages it offers for faster and more efficient retrieval of the desired information. Electronic dictionaries are simple to use and some of them allow the user to copy and paste equivalents into a word processor or translation memory software. In contrast, the traditional way of finding equivalents in a bulky printed dictionary can be cumbersome and demands more of the user's time to find the precise information.

However, “traditional” dictionaries are not codified for computational processing, even if they have been published in electronic format to be read online, because they are designed to be read by humans and not by machines. This means that, initially, electronic dictionaries were faithful transcriptions of their printed counterparts, yet with some added value such as the possibility of carrying out faster and more comprehensive searches, listening to the pronunciation of an entry through audio files, and gaining access to synonyms or additional information by means of hyperlinks.

Besides, electronic dictionaries are not bound to the space limitations of their paper versions; it is therefore not necessary to save space when entering phraseological information, as is normally done in paper dictionaries, for example by inserting a symbol such as ~ to replace the current entry.
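A minimal sketch of what making such entries explicit involves is given below; the function name and the examples are hypothetical and serve only to show how the space-saving placeholder of a printed entry can be expanded to the full headword before any further processing.

# Hypothetical helper: expand the print-dictionary placeholder "~" to the
# headword so that each phraseological example becomes self-contained.
def expand_placeholder(headword: str, example: str) -> str:
    return example.replace("~", headword)

print(expand_placeholder("interest", "compound ~"))  # prints: compound interest
print(expand_placeholder("interest", "~ rate"))      # prints: interest rate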

Nonetheless, if a processing task is intended, electronic dictionaries present disadvantages for their use as a repository from which to extract the linguistic features of words, such as lexical, semantic, phonological or morphosyntactic data (Hanks, 2003). One reason for this is that in these dictionaries the data are not separated from the linguistic annotations, i.e., the linguistic information attached to each word. In other cases, there are no annotations at all, because in certain types of dictionaries they would be redundant, whereas a computer system needs the full explicitation of an entry to be able to process these annotations.

To overcome these problems, researchers and developers have proposed standardizing certain procedures for building electronic dictionaries more effectively, so that their information can be processed adequately. This is described in the following section.

2.4.2 Standardization of language resources

The standardization of language resources is relevant for the present work.

One of the objectives proposed in Chapter 1 is to assess the applicability of linguistic annotation schemes for the representation of specialized collocations in term bases and computational lexicons. This means that the protocols used to annotate the data should be in accordance with existing standards so that the data can be used, merged or imported into other resources that are based on the same standards.

Standardization emerged as a means of meeting the need to produce reusable resources in electronic format. It is essential for creating a dictionary that can be processed computationally and then exchanged, updated or merged with other resources in a transparent way (Hanks, 2003; Calzolari et al., 2013).

If each project for the creation of language resources uses its own annotation scheme to encode information, as has been the case over the years, then data reuse becomes difficult, to say the least, whenever an existing resource is combined with other resources or data are exported or imported, because the developers have to adapt their system to other data structures before they can reuse the data.

Francopoulo et al. (2006b) suggest some benefits derived from the implementation of standards for linguistic resources. One of these is the possibility of having a stable foundation for their representation and being able to deploy a solid infrastructure for a network of language resources. Besides, it facilitates the reuse of software and data that is not tied to proprietary formats.

This type of product is always subject to commercial issues and sometimes requires the use of a specific tool that could disappear from the market. This would leave the data linked to that product, or would require the periodic renewal of an expensive license whenever a new version is launched.

According to Moreno (2000), two decades ago researchers in the field of computational lexicography started to observe the importance of designing a set of standards for the creation of reusable and interoperable language resources. To this end, several projects have been undertaken to unify the coding of computational lexicons and terminologies through the creation of norms (Calzolari et al., 2013). Once a standard has been approved, one objective of its developers is to promote its implementation among organizations, research groups, companies and professionals of the field, for the sake of promoting the exchange of information without obstacles or loss in the transmission of data due to incompatibility caused by the use of dissimilar technologies or protocols.

Among these projects, several are worth mentioning:

• Preparatory Action for Linguistic Resources Organization for Language Engineering (PAROLE) (Zampolli, 1997);

• Generic model for reusable lexicons (GENELEX);6

• Multilingual Text Tools and Corpora (MULTEXT) (Ide and Véronis, 1994);

• Expert Advisory Group on Language Engineering Standards (EAGLES);7

• International Standards for Language Engineering (ISLE) (Calzolari et al., 2001) and

• Semantic Information for Multifunctional Plurilingual Lexica (SIMPLE).8

Regarding the information that is stored in computational lexicons, Maks et al. (2008) classify the pertinent information according to its intended consumer, distinguishing three categories:

• Humans, such as definitions, lexicographic comments and descriptions;

• Computational applications, such as semantic information, examples and complementary patterns, and

• Relevant information for both, where Maks et al. mention the lemma and word forms, part of speech, tagging of semantic and pragmatic information, phraseological units and translation equivalents.

6 http://llc.oxfordjournals.org/cgi/content/abstract/9/1/47

7 http://www.ilc.cnr.it/EAGLES/browse.html

Hanks (2003) argues that a dictionary in an electronic format that was originally meant for human reading, after an adequate preparation stage, can be an important data source. Similarly, Wilks et al. (2008) introduce the difference between dictionaries in an electronic format (“machine-readable dictionaries” or MRD) (Amsler, 1982), and processing-ready dictionaries (“machine-tractable dictionaries” or MTD), and present several strategies for the conversion from MRD to MTD. Likewise, Litkowski (2006) and McCarthy (2006) state that there are significant differences between the requirements of a lexicon meant for a computer system and the contents of a dictionary or thesaurus written for human readers.
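As a rough illustration of the general idea behind such a conversion, and not of the actual strategies presented by Wilks et al., the following toy Python fragment turns a flat, human-oriented entry into a structured record; the entry format and the field names are invented for the sake of the example.

import re

# Toy illustration of converting a flat MRD-style entry into a structured
# MTD-style record. The entry format and field names are invented; real
# conversion strategies are considerably more involved.
raw_entry = "bank, n. 1. land alongside a river. 2. a financial institution."

def to_structured(raw: str) -> dict:
    headword, pos, body = re.match(r"(\w+), (\w+)\. (.*)", raw).groups()
    senses = [s.strip() for s in re.split(r"\d+\.\s*", body) if s.strip()]
    return {"lemma": headword, "pos": pos, "senses": senses}

print(to_structured(raw_entry))
# prints: {'lemma': 'bank', 'pos': 'n',
#          'senses': ['land alongside a river.', 'a financial institution.']}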

For a dictionary to be prepared for computational processing, the metadata must be separated from the linguistic information. To meet this need, markup languages are used, such as the Standard Generalized Markup Language (SGML) and especially the eXtensible Markup Language (XML). Initially, SGML was a popular choice, but over the last decade XML has become the most widely used option due to its versatility and capabilities for data manipulation (Litkowski, 2006).
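To give a concrete, if simplified, impression of what such markup looks like, the following Python sketch builds a small XML entry in which the identifying metadata and the linguistic information are kept in separate, explicitly labelled elements; the element and attribute names are invented for the example and do not follow any particular standard.

import xml.etree.ElementTree as ET

# Illustrative XML encoding of a lexicon entry; element and attribute
# names are invented and do not follow any existing standard.
entry = ET.Element("entry", id="interest-n")
ET.SubElement(entry, "lemma").text = "interest"
ET.SubElement(entry, "pos").text = "noun"
sense = ET.SubElement(entry, "sense", n="1")
ET.SubElement(sense, "definition").text = "money paid for the use of borrowed money"
ET.SubElement(sense, "collocation").text = "interest rate"

# Serializes the entry as a single line of XML, e.g.
# <entry id="interest-n"><lemma>interest</lemma>...</entry>
print(ET.tostring(entry, encoding="unicode"))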

Language resources designed specifically for NLP, such as lexicons, dictionaries or thesauruses, should ideally include the lexical, syntactic, morphological, phonetic, semantic, pragmatic, phraseological and terminological information, besides examples, in a code processable by the machine. The most widely used machine-readable thesaurus to date is WordNet (Miller, 1995), according to McCarthy (2006).
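As a brief illustration, WordNet can be queried programmatically, for instance through the NLTK interface; the sketch below assumes that the nltk package is installed and that the WordNet data has been downloaded beforehand (e.g. with nltk.download("wordnet")).

from nltk.corpus import wordnet as wn

# Query WordNet for the first few noun senses of a word and print the
# synonyms and hypernyms attached to each sense.
for synset in wn.synsets("interest", pos=wn.NOUN)[:3]:
    print(synset.name(), "-", synset.definition())
    print("  synonyms:", synset.lemma_names())
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])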