
Complex terms, specifically two-word terms, are the most prevalent in the English data. In detail, one-word terms account for 19.6% of the gold standard of terms and 16.8% of the list of candidate terms, while two-word terms correspond to 44% of the gold standard and 51.5% of the list of candidate terms; three-word terms represent 15.3% and 19.6%, respectively, while four-word terms account for 10.6% of the gold standard and 9.4% of the candidate terms. In other words, terms are more often composed of multiword strings than of simple lexemes. The word count distribution of the English gold standard and the candidate terms is presented in Table 5.3.

Terms made up of 1 to 4 tokens were included in the extraction, while terms composed of 5 to 7 tokens were not taken into account because of their low frequency.

Table 5.3: Word count distribution of the English gold standard and the candidate terms

Words   Gold standard   %      Candidate terms   %
1       87              19.6   1060              16.8
2       195             44.0   3238              51.5
3       68              15.3   1232              19.6
4       47              10.6   595               9.4
5       27              6.0    120               1.9
6       11              2.4    31                0.4
7       5               0.6    7                 0.0
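The effect of keeping only terms of one to four words can be checked against the counts in Table 5.3. The following is a minimal sketch in Python, not part of the original study: the dictionary and function names are illustrative, and the shares are computed over the rows listed in the table, so they may differ slightly from the table's own percentage columns.

```python
# Counts per term length (in words), copied from Table 5.3.
GOLD_COUNTS = {1: 87, 2: 195, 3: 68, 4: 47, 5: 27, 6: 11, 7: 5}
CANDIDATE_COUNTS = {1: 1060, 2: 3238, 3: 1232, 4: 595, 5: 120, 6: 31, 7: 7}


def coverage(counts, max_words):
    """Share of terms (0.0 to 1.0) whose length is at most max_words."""
    total = sum(counts.values())
    kept = sum(n for length, n in counts.items() if length <= max_words)
    return kept / total


if __name__ == "__main__":
    datasets = (("gold standard", GOLD_COUNTS), ("candidate terms", CANDIDATE_COUNTS))
    for name, counts in datasets:
        # Illustrative cut-off of four words, as used in the extraction.
        print(f"{name}: {coverage(counts, 4):.1%} of terms have 1 to 4 words")
```

Under these assumptions, the four-word cut-off discards a small fraction of the candidate list and a somewhat larger fraction of the gold standard, which is the trade-off behind excluding the longer units.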

In previous work in the field of ATE, some authors have excluded units longer than 4 words due to their low frequency (Daille, 1994), while other researchers have presented lists of morphosyntactic patterns to extract English and Spanish candidate terms that span up to 9 words (Quiroz, 2008; Burgos, 2014).

Table 5.4: Distribution of patterns for the English candidate terms

Pattern          Examples                                   Percentage   Freq.
Adj N            financial service                          33.2         2105
                 intellectual property
                 competent authority
                 financial institution
N N              service supplier                           18.3         1165
                 custom duty
Adj N N          regional value content                     3.6          234
                 financial service supplier
                 economic need test
                 intellectual property right
N Prep Adj N     supplier of public telecommunication       3.3          211
                 notice of intended procurement
                 enforcement of intellectual property
                 form of numerical quota
Adj Adj N        national central bank                      2.6          170
                 ordinary / special legislative procedure
                 equal annual stage
                 relevant international standard
Adj Conj Adj N   sanitary or phytosanitary measure          2.5          164
                 sanitary and phytosanitary measure
                 arbitrary or unjustifiable discrimination
                 natural or legal person

Figure 5.1 illustrates the word-count distribution for both the English gold standard and the candidate terms extracted with Termostat. As is evident from the figure, in both datasets, two-word terms are the most frequent type.

Of these, terms with the pattern Adjective + Noun are the most frequent ones.

Figure 5.1: Word count distribution of English gold standard and candidate terms

Table 5.4 presents the distribution of the eight most salient morphosyntactic patterns for the candidate terms. It also offers some examples of the candidate terms in English extracted semi-automatically with Termostat, after the list was manually cleaned to discard non-candidate terms. These eight patterns account for 91.6% of the whole list of candidate terms.

Out of this list, the first two patterns in frequency are Adjective + Noun, with 33.2% and 2,105 occurrences out of 6,285 terms, and Noun + Noun, with 18.3% and 1,165 occurrences. In third place come terms composed of a single noun, with 16.9% and 1,073 cases in the English data. The fourth most frequent pattern is Noun + Preposition + Noun, with 685 occurrences, which represents 10.8% of the candidate terms. Therefore, these four patterns, which together account for 80% of the whole list of candidate terms, were selected as the primary target to query the corpus in search of candidate specialized collocations.
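The 80% figure can be reproduced by accumulating the reported frequencies in rank order. The sketch below is only an illustration of that arithmetic: the names `PATTERN_FREQS`, `TOTAL_CANDIDATES` and `cumulative_coverage` are hypothetical, and the numbers are the ones quoted in the paragraph above.

```python
# Frequencies of the four most frequent English patterns, as reported in the text.
PATTERN_FREQS = {
    "Adj N": 2105,
    "N N": 1165,
    "N": 1073,
    "N Prep N": 685,
}
TOTAL_CANDIDATES = 6285  # size of the English candidate list reported in the text


def cumulative_coverage(freqs, total):
    """Return (pattern, cumulative share) pairs, most frequent pattern first."""
    running = 0
    result = []
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    for pattern, freq in ranked:
        running += freq
        result.append((pattern, running / total))
    return result


if __name__ == "__main__":
    for pattern, share in cumulative_coverage(PATTERN_FREQS, TOTAL_CANDIDATES):
        print(f"{pattern:<10} {share:.1%}")
```

The four frequencies sum to 5,028, which is 80% of 6,285, so the cumulative share after the fourth pattern matches the figure given in the text.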

These phraseological units are used in different disciplines. Some of the terms are mostly used in macroeconomics and finance, such as collective investment, debt instrument and service supplier. Other terms are more commonly associated with international trade, a subdomain of macroeconomics, which is itself a branch of economics, such as cross-border supply, customs duty and preferential tariff, while other terms are related to law, such as intellectual property, domestic law, domestic legislation, legal entity, legal person and legislative act. Other terms refer to the goods that are included in the agreements, such as animal hair, man-made fibre, milk powder, woven fabric and agricultural product.

These findings document the most productive patterns in term formation for this domain and suggest that extraction efforts should prioritize them. They are also useful for the teaching of LSP, specialized translation and specialized phraseology, where future practitioners should be taught to focus on these patterns as the most frequent carriers of specialized information in highly specialized texts from the domain of economics, including international trade.
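To make concrete what prioritizing these patterns could mean in an extraction step, the following sketch matches the four selected English patterns against a part-of-speech-tagged token sequence. It is a simplified illustration and not a description of how Termostat works: the tag set (`ADJ`, `NOUN`, `PREP`, `DET`), the `PATTERNS` mapping and the `match_patterns` function are assumptions made for the example, and the input sentence is hand-tagged toy data.

```python
# Target patterns as tuples of coarse part-of-speech tags (hypothetical tag set).
PATTERNS = {
    "Adj N": ("ADJ", "NOUN"),
    "N N": ("NOUN", "NOUN"),
    "N": ("NOUN",),
    "N Prep N": ("NOUN", "PREP", "NOUN"),
}


def match_patterns(tagged_tokens, patterns=PATTERNS):
    """Return (pattern name, candidate string) for every window matching a pattern."""
    tags = [tag for _, tag in tagged_tokens]
    matches = []
    for start in range(len(tagged_tokens)):
        for name, pattern in patterns.items():
            end = start + len(pattern)
            # Slices past the end are shorter than the pattern and never match.
            if tuple(tags[start:end]) == pattern:
                words = " ".join(word for word, _ in tagged_tokens[start:end])
                matches.append((name, words))
    return matches


if __name__ == "__main__":
    # Toy, hand-tagged input; a real pipeline would take tags from a POS tagger.
    sentence = [
        ("financial", "ADJ"), ("service", "NOUN"), ("supplier", "NOUN"),
        ("of", "PREP"), ("the", "DET"), ("party", "NOUN"),
    ]
    for name, words in match_patterns(sentence):
        print(name, "->", words)
```

The sketch returns every match, including single nouns nested inside longer candidates (for instance, service inside service supplier). Deciding which overlapping candidates to keep, and cleaning the list manually as described above, is left out.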

For the Spanish data, the morphosyntactic distribution of the list of 10,436 candidate terms extracted with Termostat is illustrated in Table 5.5.

The four most frequent patterns account for 87.4% of the list of candidate terms and were therefore selected to query the corpus to find the verbal collocates that these terms take in the FTA corpus. These patterns are relevant for term extraction, besides their interest for the teaching of LSP, terminology, specialized translation and phraseology. Combined, the patterns Noun + Preposition + Noun and Noun + Adjective, the two most frequent patterns for the Spanish candidate terms, account for 60.81% of the units. Next come two other frequent patterns: simple terms composed of a single noun, and complex terms consisting of four words with the pattern Noun + Preposition + Noun + Adjective, with roughly 14% and 12%, respectively.

5.4 Frequent Spanish and English verbs

As a preliminary step to focus the extraction efforts on finding the most relevant verbs that form specialized collocations in the FTA corpus, the most frequent verbs appearing in the corpus were identified and ranked according to their frequency. First, 1,205 lexical verbs were extracted. Of these, the 214 most frequent occur between 100 and 2,900 times in the Spanish data. Their frequency suggests that these verbs are the most representative ones that form specialized collocations in Spanish FTA texts.
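The ranking step described above can be sketched as a frequency count over lemmatized, part-of-speech-tagged tokens followed by a cut-off at 100 occurrences. This is an illustrative reconstruction under stated assumptions rather than the procedure actually used: the `(lemma, tag)` input format, the `VERB` tag and the function name `rank_verbs` are hypothetical, and the toy data (including the verbs aplicar and notificar) is invented for the example with a lowered threshold.

```python
from collections import Counter


def rank_verbs(tagged_lemmas, min_freq=100, verb_tag="VERB"):
    """Count verb lemmas and return (lemma, frequency) pairs at or above
    min_freq, ordered from most to least frequent."""
    counts = Counter(lemma for lemma, tag in tagged_lemmas if tag == verb_tag)
    return [(lemma, freq) for lemma, freq in counts.most_common() if freq >= min_freq]


if __name__ == "__main__":
    # Toy data standing in for the tagged corpus, which is not reproduced here.
    toy = (
        [("establecer", "VERB")] * 3
        + [("parte", "NOUN")] * 5
        + [("aplicar", "VERB")] * 2
        + [("notificar", "VERB")]
    )
    print(rank_verbs(toy, min_freq=2))
```

With the default `min_freq=100`, the same function would keep only verbs in the frequency band reported for the Spanish data; whether auxiliaries and modals are excluded beforehand depends on the tagger and is not handled here.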

Table 5.6 presents the top-20 Spanish and English lexical verbs in the data along with their frequencies. They are not translations of each other.

Table 5.5: Distribution of patterns for the Spanish candidate terms

Pattern                Examples                               Percentage   Freq.
N Prep N               proveedor de servicio                  31.17        3253
                       fecha de entrada
                       medida de salvaguardia
                       solución de controversia
                       derecho de propiedad
N Adj                  parte contendiente                     29.64        3093
                       contratación pública
N Prep N Adj           derecho de propiedad intelectual       12.57        1312
                       valor de contenido regional
                       proveedor de servicio financiero
                       prueba de necesidad económica
                       rama de producción nacional
N Adj Adj              procedimiento legislativo ordinario    5.66         591
                       transporte marítimo internacional
N Adj Coord Conj Adj   medida sanitaria y fitosanitaria       1.72         179
                       asunto exterior y político
                       disposición legal y reglamentaria
                       derecho antidumping y compensatorio
                       fibra artificial y sintética

For the English data, 1,555 unique lexical verbs were extracted. The 258 most frequent of these occur between 100 and 5,435 times in the English subcorpus and are likewise taken to be the most representative verbs that form specialized collocations in English FTA texts.

Table 5.6: Top 20 verbs for the Spanish and English data

Freq    Spanish verbs   Freq    English verbs
2,904   establecer      5,436   provide

5.4.1 Candidate terms found in the FTA corpus

A list of the 100 most frequent candidate terms that were extracted automatically was processed into a “cloud” of words by Termostat. The size of the font indicates the frequency of the term in the subcorpus. Figure 5.2 shows the 100 most frequent candidate terms in the English component of the FTA corpus, which highlights salient terms such as agreement, measure, service, procedure and supplier. Figure 5.3 presents the 100 most frequent candidate terms found in the Spanish component of the FTA corpus, which includes relevant terms such as mercancía, proveedor, servicio, subpartida and parte contendiente. Regarding their morphosyntactic composition, 86 out of the 100 most frequent candidate terms found in the cloud of words by Termostat are simple terms. Thus, only 14 are complex terms: one corresponds to the pattern Noun + Preposition + Noun, 8 correspond to the pattern Adjective + Noun and 5 to the pattern Noun + Noun.

Figure 5.2: Top 100 terms in the FTA English subcorpus

5.5 Candidate specialized collocations in the