“Representation” refers in this context to the XML code that can be used to encode specialized collocational information in a computational lexicon. The aim of this representation is to prepare the data for machine-readable lexicons which can be interchanged across different language resources (Litkowski, 2006). This representation is carried out by means of linguistic annotations that are applied automatically to the data after it has been prepared.

Wilcock (2009, 1) defines linguistic annotation in this way:

Linguistic annotations are notes about linguistic features of the annotated text that give information about the words and sentences of the text.

This means that, ideally, these annotations are meant to be a formalized explicitation, one that is readable by a computer system, of the implicit knowledge that humans have of words at different linguistic levels: their phonetics, morphology, syntax, semantics and pragmatics. In addition to this, terminological and phraseological information should also be included.

To be able to represent information on specialized collocations in machine-readable dictionaries, some prior considerations have to be taken into account.

Several questions arise regarding the computational representation of specialized collocations. To begin with, under which constituent should the collocation be listed: the node, the collocate, or both? In this regard, current lexicographical practice defines no standard procedure.

I agree with Thomas (1993), who argues that, for the sake of precision and time-saving, it is important to define consistent criteria for choosing the headword or “entry point” under which LSP collocations and terms made up of multiple lexical units are stored.

L’Homme (2009, 239) asserts that “specialised dictionaries that take into account collocations differ with respect to the method chosen to list and represent them in entries”. To illustrate, let us consider one example from two economics dictionaries, which employ different ways to list the related terms and their collocates. First, the Diccionario de comercio internacional: importación y exportación: inglés-español, Spanish-English (Alcaraz and Castro, 2007), under the entry for tariff, offers a list of complex terms including the term tariff, which is frequent in FTA texts, plus another noun, such as agreement, amendment, anomaly, barrier, benefit, classification or concession. Also, the Routledge Spanish Dictionary of Business, Commerce and Finance (Routledge, 1998) provides several complex terms that also include the same term, such as agreement, barrier, concession, cut, expenditures, legislation and level. The former dictionary includes all the related terms under the umbrella term tariff, while the latter lists separate entries for each term.

Unsurprisingly, a legal dictionary, the Diccionario de Términos Jurídicos, Español-Inglés English-Spanish (Ostojska-Asensio, 2002), offers the equivalent of tariff but does not provide any collocational information.
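The two listing strategies just contrasted can be sketched in XML. In a hypothetical node-oriented encoding, all collocates are grouped under the entry for the node (here tariff), whereas a flat encoding gives each complex term its own entry; the element and attribute names below are invented for illustration and are not taken from any standard:

```xml
<!-- Hypothetical node-oriented encoding: collocates grouped under the node -->
<entry headword="tariff" pos="noun">
  <collocations>
    <collocation pattern="N+N">tariff agreement</collocation>
    <collocation pattern="N+N">tariff barrier</collocation>
    <collocation pattern="N+N">tariff concession</collocation>
  </collocations>
</entry>

<!-- Hypothetical flat encoding: one separate entry per complex term -->
<entry headword="tariff agreement" pos="noun"/>
<entry headword="tariff barrier" pos="noun"/>
<entry headword="tariff concession" pos="noun"/>
```

The node-oriented variant mirrors the Alcaraz and Castro (2007) approach, the flat variant the Routledge (1998) one.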

A further question is which information should be encoded in the tags that represent the linguistic data related to the collocational information. This information could include morphosyntactic data, such as the part of speech and the subcategorization frame of the intervening lexical items, and semantic information, such as the domain(s) in which these lexical units are used. According to Matsumoto (2003), the subcategorization frame of a verb defines the set of syntactic constituents with which a certain verb can appear. These frames usually specify the syntactic constraints or preferences of a verb. Furthermore, information on the semantic constraints is not only desirable but mandatory.
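As an illustration, the pieces of information just listed could be encoded with tags along the following lines; the element names, the subcategorization notation and the domain label are hypothetical and serve only to show what such an annotation might contain:

```xml
<!-- Illustrative sketch only: tag and attribute names are invented -->
<collocation>
  <node lemma="tariff" pos="noun"/>
  <collocate lemma="impose" pos="verb">
    <!-- subcategorization frame: the verb takes a subject and a direct object -->
    <subcatFrame>NP_subject V NP_object</subcatFrame>
  </collocate>
  <!-- semantic information: domain(s) in which the lexical units are used -->
  <domain>international trade</domain>
</collocation>
```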

How can specialized collocations be represented in schemes for linguistic annotation issued by the International Organization for Standardization (ISO), specifically standards for terminological and computational lexicons?

Several of these schemes provide a model to represent phraseological information, such as the information contained in specialized collocations, with varying degrees of detail. In contrast, other schemes were not designed for the transmission of phraseological information. These standards are discussed in Section 2.6.

2.6 Standards for computational lexicons

Several initiatives have been developed with the aim of establishing a standard for the interchange of lexical data, especially for machine translation purposes. The ISO website offers a catalogue of standards.9

Some of these initiatives are:

9 http://www.iso.org/iso/home/store/catalogue_ics/catalogue_ics_browse.htm?ICS1=01&ICS2=020&

• the Machine-Readable Terminology Interchange Format (MARTIF) ISO 12200:1999,

• the Open Lexicon Interchange Format (OLIF),10

• the Terminological Markup Framework (TMF) ISO 16642:2003,11

• the TermBase eXchange (TBX) ISO 30042:2008 and

• the Lexical Markup Framework (LMF) ISO 24613:2008.

Other newer standards, not directly relevant for this work, have been released from 2012 to 2016:

• the ISO 24615 Syntactic annotation framework (SynAF), composed of two parts,

• ISO 24612:2012, Language resource management - Linguistic annotation framework (LAF),12

• ISO 24611:2012, Language resource management - Morpho-syntactic annotation framework (MAF),13 and

• the Semantic annotation framework (SemAF) ISO 24617, composed of eight parts (the third part is not yet available in the online ISO standards catalogue).

These standards are XML-compliant specifications for the implementation of a lexicon. Some of these standards, such as MARTIF, use an onomasiological or concept-oriented approach rather than a semasiological or lexically-oriented one, which, in my view, makes them unsuitable for representation in NLP or lexicographic applications.
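As a concrete example of such an XML-compliant specification, the following is a minimal sketch of what a lexical entry might look like in the LMF serialization, modelled after the informative DTD of ISO 24613:2008; the attribute values are illustrative, and feature names may differ across implementations:

```xml
<!-- Sketch of an LMF-style lexical entry (after the informative DTD of
     ISO 24613:2008); values are illustrative, not from an actual resource -->
<LexicalResource>
  <Lexicon>
    <feat att="language" val="en"/>
    <LexicalEntry>
      <feat att="partOfSpeech" val="noun"/>
      <Lemma>
        <feat att="writtenForm" val="tariff barrier"/>
      </Lemma>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>
```

Note the lexically-oriented design: the entry is anchored in a lemma (a word form), not in a language-independent concept as in concept-oriented standards such as MARTIF.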

The adoption of standards for the constitution of lexical and terminological resources raises several questions:

• How can language resources be encoded in an interoperable, scalable and interchangeable format? This would ensure that the data could be merged with or exported to other language resources and that the data would not be lost due to technology incompatibilities, which is known as blind interchange.

10 http://www.olif.net/

11 http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=32347

12 http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?ics1=01&ics2=020&ics3=&csnumber=37326/

13 http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?ics1=01&ics2=020&ics3=&csnumber=51934

• Are there commercial factors that affect the adoption and implementation of a given standard? This implies that the industry could prefer a certain technology while academia adopts a different protocol to store information, but the two might be incompatible, which would hamper the development of language resources.

Some aspects of the LMF, TMF, OLIF and TBX standards will be commented on in Subsections 6.1.1 and 6.1.2, with a focus on their suitability for the computational representation of MWEs, and specifically specialized collocations.

Corpora are another vital resource for NLP, and are described in the following section.