Corpus Analysis Tool

The corpus analysis tool used in our experiment was WordSmith Tools, developed by Mike Scott. WordSmith provides the following features which we felt could be of use to translators:

a) a concordancer, which finds and displays, in an easy-to-read format (e.g. KWIC, or key word in context), all occurrences of a search term (and minor variations thereof);

b) a collocation viewer, which allows users to see which words "go together";

c) frequency operations, which provide statistical information about the centrality of a pattern (i.e. whether it is just one author's idiosyncratic usage or an accepted pattern in expert discourse).
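As an illustration of what a KWIC display involves, the following is a minimal Python sketch and not a description of how WordSmith Tools itself works. The function name `kwic`, the `width` parameter, and the trailing-asterisk wildcard handling are assumptions made for this example.

```python
import re

def kwic(text, pattern, width=30):
    """Return one key-word-in-context line per match of a wildcard pattern such as 'scan*'."""
    # Assumption: an asterisk stands for any run of word characters.
    body = re.escape(pattern).replace(r"\*", r"\w*")
    regex = re.compile(r"\b" + body + r"\b", re.IGNORECASE)
    lines = []
    for match in regex.finditer(text):
        # Take a fixed number of characters on each side of the search term.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search terms line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

sample = "The flatbed scanner scans the page. A scanned image is stored."
for line in kwic(sample, "scan*"):
    print(line)
```

Aligning the search term in a central column is what makes minor variations of a term (here scanner, scans, scanned) easy to compare at a glance.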

The students had been given training on how to use the tool during their corpus linguistics class, so they were familiar with these basic features; however, they had never previously tried to apply the tool to a translation situation.

8 Meta, XLIII, 4, 1998

Criteria	Requirements for specialized translation	Positive features of Computer Select	Drawbacks of Computer Select
Corpus size	large selection of texts (to determine if usage is widespread or idiosyncratic)	contains thousands of articles
Text size	complete texts (so examples of usage or explanations of concepts are not cut short)	many articles are complete	some articles are only abstracts and so do not fully explain some concepts
Text type	mixture of instructional, expert, and popularized texts (to help translators achieve an understanding of the subject field and allow them to see different registers of usage)	contains a wide range of expert and popularized texts with a few instructional texts12	the number of instructional texts is relatively low
Date of publication	mainly recent texts, but also some older texts (older texts are useful because concepts are better explained when they first come out; recent texts are needed to reflect the state of the field at present)	contains a mixture of relatively current and older texts (May 1989 - Feb. 1995)
Author	texts by a variety of authors (to determine if usage is widespread or idiosyncratic)	contains texts by thousands of different authors
Language	preferably texts by authors writing in their native language (to show idiomatic usage)	contains only English-language articles which come primarily from journals originally published in English-speaking countries	cannot easily verify if the texts are written in the author's native language
Culture	texts written by authors with different cultural backgrounds (e.g. British, American, etc.) (to show appropriate regional usage)	most articles come from journals originally published in the US, though some are from the UK and Canada; one journal is originally from the Netherlands	most articles seem to be of US origin, cannot easily verify the cultural backgrounds of the authors

Table 1

An evaluation of Computer Select with regard to the criteria required for a corpus that is a balanced and representative source for specialized translation

NATIVE-LANGUAGE CORPORA AS A TRANSLATION RESOURCE 9

Type of Publication	Instructional	Expert	Popularized
Learned Journals (e.g. AI Expert, IBM Journal of Research and Development)		X
Proceedings (e.g. Proceedings of the IEEE)		X
Popular Journals (e.g. Byte, MacUser, PC Magazine)			X
Newspapers (e.g. The New York Times, The Wall Street Journal)			X
Newsletters/Bulletins (e.g. Information Industry Bulletin, Soft*letter)			X

Table 2

The five basic types of publications found in Computer Select are classified according to the type of information they are intended to impart

Genres of texts	Instructional	Expert	Popularized
Buyers' Guides			X
Columns			X
Company Profiles			X
Correction Notices			X
Cover Stories			X
Directories			X
Editorials			X
Evaluations		X	X
Glossaries		X
Hardware Reviews		X	X
Industry Overviews			X
Interviews		X
Letters to the Editor			X
Obituaries			X
Panel Discussions		X
Product Announcements			X
Software Reviews		X	X
Technical13		X
Tutorials	X

Table 3

The nineteen genres of texts identified by the publishers of Computer Select are classified according to the type of information they are intended to impart


File name	Total number of words	Total number of bytes
scan1	229,407	1,534,695
scan2	466,095	3,172,317
scan3	416,810	2,728,605
scan4	385,948	2,519,623
All files combined	1,498,260	9,955,240

Table 4

Total number of words and bytes in the scanner corpus

Experiment

For the purpose of this experiment, all students were translating out of their foreign language (French) into their native language (English). Students were presented with two different extracts from an article on optical scanners. They were asked to translate one of the texts using the following resources as necessary: a selection of general bilingual dictionaries and a selection of monolingual specialized lexicographic and non-lexicographic resources.15 They were then asked to translate the second text using the following resources as necessary: a selection of general bilingual dictionaries and a specialized monolingual native-language corpus coupled with the WordSmith Tools corpus analysis tool. In an attempt to compensate for 1) any potential text-specific difficulties, and 2) differences in student ability, students 1-7 were asked to translate text i using the conventional resources and text ii using the corpus, whereas students 8-14 were asked to translate text ii using conventional resources and text i using the corpus.
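The counterbalanced assignment described above can be written out explicitly. The following Python sketch is only an illustration; the function name `assign_conditions` and the condition labels are assumptions, while the split (students 1-7 versus students 8-14) comes from the design described above.

```python
def assign_conditions(n_students=14):
    """Map each student number to the resource type used for text i and text ii."""
    half = n_students // 2
    plan = {}
    for student in range(1, n_students + 1):
        if student <= half:
            # First group: conventional resources for text i, corpus for text ii.
            plan[student] = {"text i": "conventional", "text ii": "corpus"}
        else:
            # Second group: the reverse, so each text is translated under both conditions.
            plan[student] = {"text i": "corpus", "text ii": "conventional"}
    return plan

plan = assign_conditions()
corpus_users_text_i = [s for s, p in plan.items() if p["text i"] == "corpus"]
print(corpus_users_text_i)
```

Under this plan every student works under both conditions and each text is translated by seven students per condition, which is what allows text-specific difficulty and individual ability to be partly separated from the effect of the resource.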

The students were given two hours to translate each text. They were also asked to comment on the usefulness of the monolingual resources they had at their disposal (whether it be the conventional resources or the corpus).

Data Analysis

The aim of the pilot study was to determine whether a specialized monolingual native-language corpus would help translators to produce translations of improved quality. Because our sample size was small in a number of respects (14 students, 2 texts, 1 subject field), we felt that we would not be able to draw any definitive conclusions, but could only reasonably measure general trends. Therefore, translations were assessed for the following broad categories of errors: 1) comprehension errors, specifically errors resulting from a lack of comprehension of the subject field; 2) production errors, including incorrect choice of term, non-idiomatic constructions, grammatical errors, and incorrect register.

3.7.1. Improved subject comprehension: an example

The following example shows how the corpus has the potential to help students to acquire an increased understanding of the subject field. In text i, most of the students had difficulty grasping one of the concepts in the opening sentence of the text.

quelle que soit leur sensibilité aux nuances, leur rapidité, leur précision,...

which should logically be translated along the following lines:

Regardless of such characteristics as colour-recognition capability, speed, precision,...


The specialized dictionaries treated the concept scanner in general, but did not discuss the colour-recognition capability of scanners. The user manual made a passing reference to "black-and-white vs. colour scanners." The desktop publishing monograph contained a half-page discussion on the difference between "non-greyscale scanners, greyscale scanners, and colour scanners," but this section did not feature specifically in the table of contents and was therefore not easy to locate (though there were index references to each of the terms individually). The journal article treated the issue of colour recognition in some depth, but this discussion was towards the end of the article and was not visibly set off from the rest of the text. In other words, the necessary information was there, but it was not easy to find. Not surprisingly, none of the students using the conventional resources came close to rendering the concept correctly; in fact, some of them proposed rather peculiar translations which showed a definite lack of subject field understanding.

The corpus users, however, were able to go directly to those areas of the text that dealt with this subject. Even if students did not know the correct translation for nuances, a collocation search on the word form "sensitiv*" (i.e. the translation of sensible) revealed that the following words were among those that appeared in its vicinity: colour (5), greyscale (4), shade* (122), shading (8).16 Students were able to read these particular contexts and achieve a somewhat better understanding of the subject field. As shown in Table 5, three of these students came quite close to expressing the correct idea by referring to shades or shading, and another student actually referred to colour. Admittedly, none of them came up with a particularly elegant rendering for that specific concept; however, they did at least seem to have a better understanding of the idea that was being referred to in the source text.

1-14=student; D=dictionary user; C=corpus user; i=text i; +=subject understanding
1-D-i	no matter how much attention to detail they pay...
2-D-i	no matter how sensitive they are...
3-D-i	even though their sensitivity to touch...
4-D-i	regardless of their sensitivity...
5-D-i	while adaptability...
6-D-i	no matter how good the resolution...
7-D-i	whatever their feeling to the suggestion...
8-C-i +	no matter their sensitivity to shading...
9-C-i +	despite differences in their sensitivity to shading...
10-C-i	whatever their sensitivity to detail...
11-C-i	regardless of how sensitive they are to differences...
12-C-i ++	whatever their sensitivity to colour...
13-C-i	whatever their sensitivity to small differences...
14-C-i +	whatever differences there may be in shade...

Table 5

Proposed translations of the expression quelle que soit leur sensibilité aux nuances... The plus symbol (+) indicates which students came up with a rendering that seems to show a reasonable understanding of the concept
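The kind of collocation search on "sensitiv*" mentioned above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used by WordSmith Tools: the function name `collocates`, the tokenization rule, and the window of four tokens on either side (`span`) are all choices made for the example.

```python
import re
from collections import Counter

def collocates(text, pattern, span=4):
    """Count word forms found within `span` tokens of each match of a wildcard pattern."""
    # Assumption: tokens are lower-cased alphabetic strings, optionally hyphenated.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    # Assumption: an asterisk stands for any continuation of the word form.
    node = re.compile(re.escape(pattern.lower()).replace(r"\*", r"[a-z-]*"))
    counts = Counter()
    for i, token in enumerate(tokens):
        if node.fullmatch(token):
            # Collect the neighbours on both sides, excluding the node word itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

sample = ("Greyscale scanners are sensitive to each shade of grey. "
          "Sensitivity to colour varies.")
print(collocates(sample, "sensitiv*").most_common(5))
```

In practice, function words such as "to" and "of" would dominate such a list, so a stop list or a statistical association measure would normally be applied before reading off the content-bearing collocates.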


3.7.2. Improved term choice: an example

The following example shows how the corpus has the potential to help students find and use the correct terms. In text i, the term vitre is most properly translated as either glass platen or scan bed. The specialized dictionaries provided did not contain the term glass platen, and the translations given in the general bilingual dictionaries included glass, pane of glass, and window. The term glass platen did appear in the user guide, but was not in the index and was therefore difficult to find. It did not appear (to the best of my knowledge) in the desktop publishing monograph or the journal article. None of the students using the conventional resources used either of these terms, whereas three of the students using the corpus correctly used the term glass platen, which appeared in the corpus 41 times and ranked highly as one of the collocates of glass.

1-14=student; D=dictionary user; C=corpus user; i=text i; +=correct term used
1-D-i	Glass
2-D-i	sheet of glass
3-D-i	pane of glass
4-D-i	Window
5-D-i	glass surface
6-D-i	Glass
7-D-i	Window
8-C-i +	glass platen
9-C-i	piece of glass
10-C-i +	glass platen
11-C-i	glass
12-C-i	screen
13-C-i	window
14-C-i +	glass platen

Table 6

Proposed equivalents for the term vitre. The plus symbol (+) indicates which students used one of the preferred terms

A similar example occurred with the term scanner à plat, which is properly translated as flatbed scanner. Of the students using the conventional resources, three translated the term improperly as flat scanner, even though the term flatbed scanner appeared several times in the monograph on desktop publishing. All of the students using the corpus came up with the correct term, and student 14 even made the following comment: "I was unsure of which spelling to use, flatbed or flat-bed, because I had seen both in the corpus. I looked up both terms in the frequency list and saw that flatbed occurred 1508 times and flat-bed only occurred 92 times, so I went with flatbed."
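The frequency-list check that the student describes amounts to counting word forms and comparing the counts for the competing spellings. A minimal Python sketch follows; the function name `frequency_list` and the tokenization rule (which keeps hyphenated forms as single tokens) are assumptions, and the sample sentence is invented for illustration.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return a word-form frequency list, keeping hyphenated forms as single tokens."""
    # Assumption: case is ignored and a token is a run of letters with optional internal hyphens.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(tokens)

sample = "A flatbed scanner and a flat-bed scanner differ from a Flatbed plotter."
freq = frequency_list(sample)
for variant in ("flatbed", "flat-bed"):
    print(variant, freq[variant])
```

Keeping the hyphen inside the token is the design choice that matters here: a tokenizer that split on hyphens would merge the evidence for the two spellings and hide the difference the student relied on.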

3.7.3. Improved idiomatic construction: an example

The following example shows how the corpus has the potential to help students create more idiomatic constructions. One of the phrases appearing in text i is: photodiodes sensibles à la lumière. Some of the specialized dictionaries contained the term photodiode and the desktop publishing monograph referred to light-sensitive elements (though not in the index, thereby making it difficult to locate). The journal article did not make any reference to photodiodes. I hypothesize that the majority of the students using the conventional resources verified in the dictionaries that photodiode was a term and then simply followed the syntax of the source text to produce a construction that, while grammatically correct, is not idiomatic according to the expert discourse. There were no instances in the corpus where this concept was expressed using the syntax pattern photodiodes sensitive to the light, and this may explain why all the students who had access to the corpus used one of the two more idiomatic constructions which appeared there: light-sensitive photodiodes (which appeared twice) or photosensitive diodes (which also appeared twice).

1-14=student; D=dictionary user; C=corpus user; i=text i; X=non-idiomatic
1-D-i	light sensitive photodiodes
2-D-i X	photo sensors that are sensitive to this light
3-D-i X	photodiodes sensitive to the light
4-D-i X	photodiodes which are sensitive to light
5-D-i	light sensitive photodiodes
6-D-i X	laser diodes which are sensitive to this light
7-D-i X	photodiodes sensitive to this light
8-C-i	light-sensitive photodiodes
9-C-i	light-sensitive photodiodes
10-C-i	light-sensitive photodiodes
11-C-i	light-sensitive photodiodes
12-C-i	photosensitive diodes
13-C-i	light-sensitive photodiodes
14-C-i	photosensitive diodes

Table 7

Proposed translations of the expression photodiodes sensibles à la lumière. The X indicates a non-idiomatic construction

A similar example occurred with the expression la tête de numérisation du scanner, which is best translated as scan head, but could also be translated as scanning head or scanner head. Of the students using the conventional resources, four of them used an idiomatic construction, but three of them followed the French syntax and rendered the phrase as head of the scanner. There were no instances in the corpus where this concept was expressed using the syntax pattern head of the scanner, and all of the students using the corpus employed one of the idiomatic constructions.
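Checking whether a candidate construction is attested in the corpus at all, as in the two examples above, can be sketched as a count of exact multiword phrases. The Python below is an illustrative sketch only: the function name `phrase_counts`, the normalization step, and the sample text are assumptions, and the counts it prints relate to the invented sample, not to the scanner corpus.

```python
import re

def phrase_counts(text, candidates):
    """Count exact occurrences of each candidate phrase in a normalized copy of the text."""
    # Assumption: case and punctuation are ignored, so a phrase may match across a sentence break.
    normalized = " ".join(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
    counts = {}
    for phrase in candidates:
        # Lookarounds ensure the phrase starts and ends on token boundaries.
        pattern = r"(?<!\S)" + re.escape(phrase.lower()) + r"(?!\S)"
        counts[phrase] = len(re.findall(pattern, normalized))
    return counts

sample = ("The scan head moves under the glass platen. "
          "Light-sensitive photodiodes sit behind the scan head.")
print(phrase_counts(sample, ["scan head", "head of the scanner",
                             "light-sensitive photodiodes"]))
```

A count of zero for a literal rendering, set against several hits for a competing construction, is the kind of evidence that would steer a translator away from following the source-text syntax.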
