Machine translation (MT) systems are now ubiquitous. This ubiquity is due to a combination of the increased need for translation in today's global marketplace and the exponential growth in computing power that has made such systems viable. Under the right circumstances, MT systems are an effective tool. They offer low-quality translation in cases where a low-quality translation is better than no translation at all, or where a rough translation of a large document delivered in seconds or minutes is more useful than a good translation delivered in three weeks. Unfortunately, despite the wide availability of MT, it is clear that the purpose and limitations of such systems are often misunderstood and their capabilities overestimated. In this article I give a brief overview of how MT systems work and of how to make the best use of them. I then present data on the use of Internet-based MT and show that there is a gap between the intended and actual use of such systems, and that users still need educating on how to use MT systems effectively.

It is commonly assumed that a computer program applies the grammatical rules of the languages in question, combining them with a kind of "in-memory" dictionary to produce the resulting translation. And indeed, this was essentially how some earlier systems worked. But the most up-to-date MT systems actually use a statistical approach that is fairly "linguistically blind". In essence, the system is trained on a corpus of example texts. The result is a statistical model that contains information such as:

– "if the words (a, b, c) occur in succession, there is an X% chance that the words (d, e, f) occur in the translation" (NB: the two sequences do not have to contain the same number of words);

– "if word (a) is followed by word (b) in the target language, there is an X% chance that word (b) ends in -Y".

Drawing on a huge number of such observations, the system then translates a sentence by considering various candidate translations, assembled almost at random (in practice through some "naive selection" process), and choosing the statistically most likely one.
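The kind of "X% chance" observation described above can be sketched in a few lines. This is a minimal illustration, not how any particular MT engine is implemented: the aligned phrase pairs, the function names `build_phrase_table` and `most_likely`, and the resulting percentages are all invented for the example, and a real system would learn from millions of sentences and combine many such models.

```python
from collections import Counter, defaultdict

# Hypothetical aligned phrase pairs, standing in for a real parallel corpus.
ALIGNED_PAIRS = [
    ("buenos días", "good morning"),
    ("buenos días", "good morning"),
    ("buenos días", "good day"),
    ("muchas gracias", "thank you very much"),
    ("muchas gracias", "many thanks"),
    ("muchas gracias", "thank you very much"),
    ("muchas gracias", "thank you very much"),
]


def build_phrase_table(pairs):
    """Turn raw co-occurrence counts into 'X% chance' estimates per source phrase."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    table = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        table[source] = {t: n / total for t, n in target_counts.items()}
    return table


def most_likely(table, phrase):
    """Pick the candidate translation with the highest estimated probability."""
    candidates = table[phrase]
    return max(candidates, key=candidates.get)


table = build_phrase_table(ALIGNED_PAIRS)
for source, candidates in table.items():
    print(source, "->", most_likely(table, source), candidates)
```

Note that nothing in the sketch knows what a noun or a verb is: the choice is driven purely by how often each pairing was observed.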

On hearing this high-level description of how MT works, most people are surprised that such a "linguistically blind" approach works at all. Even more surprisingly, it generally works better than a rule-based system. This is partly because relying on grammatical analysis itself introduces errors into the equation (automated analysis is not completely accurate, and people do not always agree on how a sentence should be analysed). In addition, training on "bare text" allows the system to be based on far more data than would otherwise be possible: grammatically analysed texts are small and few and far between, whereas pages of "bare text" are available in their trillions.

However, this approach means that the quality of the translations depends very much on how closely the elements of the source text match those of the texts originally used to train the system. If you accidentally type will returned or vous avez demander (instead of will return or vous avez demandé), the system is hampered by the fact that a sequence such as will returned probably did not occur in the training corpus (or, even worse, occurred with quite a different meaning, as in they needed their will returned to the attorney). And because the system has little notion of grammar (for example, that returned is a form of return, and that "the infinitive is likely after will"), it has in effect little to go on.
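The coverage problem can be made concrete with a small sketch. It assumes a toy "training corpus" of three invented sentences and simply checks which adjacent word pairs (bigrams) in a query were never seen; the names `bigrams`, `SEEN` and `unseen_bigrams` are illustrative, and real systems use far larger corpora and longer sequences.

```python
def bigrams(text):
    """Return the list of adjacent word pairs in a lower-cased sentence."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


# Toy stand-in for a training corpus; the sentences are invented for illustration.
TRAINING_SENTENCES = [
    "he will return tomorrow",
    "she will return the book",
    "vous avez demandé un café",
]

# Every word pair the "system" has observed during training.
SEEN = {pair for sentence in TRAINING_SENTENCES for pair in bigrams(sentence)}


def unseen_bigrams(query):
    """List the word pairs in the query for which the model has no evidence."""
    return [pair for pair in bigrams(query) if pair not in SEEN]


print(unseen_bigrams("he will return"))        # []
print(unseen_bigrams("he will returned"))      # [('will', 'returned')]
print(unseen_bigrams("vous avez demander"))    # [('avez', 'demander')]
```

A single mistyped ending is enough to leave the model with no observations for that part of the sentence, even though a human reader would correct it without noticing.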

Similarly, you can ask the system to translate a sentence that is perfectly grammatical and common in everyday use, but which contains features that happen not to be common in the training corpus. MT systems are usually trained on the types of text for which human translations are readily available, such as technical or business documents, or the proceedings of multilingual parliaments and conferences. Thus, MT systems have a natural bias towards certain types of formal or technical text. And even if everyday vocabulary is still covered by the training corpus, the grammar of everyday speech (for example, in Spanish, using the present tense instead of the future tense) may not be.

MT systems in practice

Researchers and developers of computer translation systems have always been aware that one of the greatest dangers is public misperception of the purpose and limitations of such systems. Somers (2003) [1], observing the use of MT on the Internet and in chat rooms, notes that the increased visibility of MT has had many side effects […], and that it is important for the public to understand the low quality of raw MT and why the quality is so low. Observing the use of MT in 2009 sadly provides little evidence that users are aware of these issues.

As an illustration, I present a small sample of data from a Spanish-English MT service offered on the Español-English web site. The service works by taking the user's input, applying some "cleaning" processes (for example, correcting some common errors and decoding frequent instances of "SMS-speak"), and then looking for a translation in (a) a Spanish-English dictionary and (b) an MT engine. Currently, Google Translate is used as the MT engine, although a custom engine may be used in the future. The figures presented here are from 549 Spanish-English queries analysed from machines located in Mexico [2]; in other words, we assume that most users are translating from their mother tongue.
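The flow just described (clean the input, then consult both a dictionary and an MT engine) might be sketched as follows. Everything here is an assumption made for illustration: the abbreviation table, the two-entry dictionary, the function names, and the `fake_engine` stub that stands in for a call to a real MT engine. The article does not describe the service's actual rules or code.

```python
import re

# Hypothetical "SMS-speak" table; the real service's rules are not given in the article.
SMS_ABBREVIATIONS = {"xq": "porque", "tb": "también", "q": "que"}

# Hypothetical miniature Spanish-English dictionary.
DICTIONARY = {"gato": "cat", "casa": "house"}


def clean(query):
    """Normalise whitespace and case, then expand known abbreviations."""
    words = re.sub(r"\s+", " ", query.strip().lower()).split(" ")
    return " ".join(SMS_ABBREVIATIONS.get(word, word) for word in words)


def translate(query, mt_engine):
    """Look the cleaned query up in the dictionary and pass it to the MT engine."""
    cleaned = clean(query)
    return {
        "cleaned": cleaned,
        "dictionary": DICTIONARY.get(cleaned),  # None when there is no entry
        "mt": mt_engine(cleaned),
    }


def fake_engine(text):
    """Stub standing in for a real MT engine, so the sketch runs offline."""
    return f"<MT output for: {text}>"


print(translate("  Gato ", fake_engine))
print(translate("xq  no vienes tb", fake_engine))
```

Passing the engine in as a parameter mirrors the point made above that the engine behind the service could be swapped for a custom one later.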

First of all, what are people using the MT system for? For each query, I made a "best guess" at the user's purpose. In many cases, the purpose is obvious; in some cases it is clearly ambiguous. With this caveat, I judge that in about 88% of cases the intended use is fairly clear, and I categorise these uses as follows:

  • Search for a single word or phrase: 38%
  • Translate formal text: 23%
  • Internet chat: 18%

A surprising (if not alarming!) observation is that in such a large proportion of cases, users are using the translator to look up a single word or phrase. In fact, 30% of queries consisted of a single word. This finding is somewhat surprising given that the site in question also has a Spanish-English dictionary, and it suggests that users are confused about the respective purposes of dictionaries and translators. Although this is not reflected in the raw figures, there were clear cases of successive searches in which a user appeared to have deliberately split up a sentence or phrase that would probably have been better translated if left together. Perhaps as a result of students being over-drilled in dictionary use, we see, for example, a query for "quarter to" followed immediately by a query for a number. Clearly, there is a need to educate students, and users in general, about the difference between an electronic dictionary and a machine translator [3]: in particular, a dictionary guides the user to choose the correct translation according to context but requires single-word or single-phrase searches, whereas a translator works on whole sentences and, given a single word or phrase, will simply return the statistically most common translation.

I estimate that in less than a quarter of cases users are using the MT system for its "trained-for" purpose of translating, or getting the gist of, a formal text (and entering a whole sentence, or at least a partial sentence, rather than an isolated noun). Of course, it is impossible to say whether any of the translations were intended to be published without further proofreading, which is not necessarily the purpose of the system.

The translation of formal texts is now almost rivalled by the translation of informal on-line chat, a context for which MT systems are not normally trained. The on-line chat context poses particular problems for MT systems because of frequent features such as non-standard spelling, lack of punctuation, and colloquialisms not found in other written contexts. To translate chat effectively, a dedicated system trained on a more suitable (and possibly custom-built) corpus would probably be required.

Not surprisingly, students use MT to do their homework. But it is interesting to note to what extent and how. In practice, homework use involves a mixture of "fair use" (understanding an exercise) and attempts to "get the computer to do the homework" (sometimes with potentially dire results). Queries classified as homework include sentences that are obviously the instructions to exercises, and sentences stating trivial generalities that would be unusual in a text or conversation but are typical of beginners' homework.

Whatever the use, an issue for both users and system designers is the frequency of errors in the source text that are liable to hamper the translation. In fact, more than 40% of queries contained such errors, with some queries containing several. The most common errors were the following (single-word and single-phrase queries were excluded from these figures):

  • Missing accents: 14% of queries
  • Missing punctuation: 13%
  • Other spelling mistakes: 8%
  • Grammatically incomplete sentence: 8%

Considering that, in the majority of cases, users are presumably translating from their mother tongue, they seem to underestimate the importance of using standard orthography to give themselves the best chance of a good translation. More subtly, users do not always understand that the translation of one word may depend on another, and that the translator's job is more difficult if grammatical constituents are incomplete, so queries like hoy and awards are not uncommon. Such queries hamper translation because the chance that the training corpus contains such a truncated sequence is slim.

There is currently a mismatch between the performance of MT systems and the expectations of users. I believe that responsibility for closing this gap lies with developers, users and teachers alike. Users need to think more about making their source text "MT-friendly" and learn how to evaluate the output of MT systems. Language courses should address these issues: learning to use computer translation tools should be seen as an integral part of language learning. And developers, myself included, need to think about how we can make the tools we offer better suited to the needs of users.

[1] Somers (2003), "Machine Translation: Latest Developments", in The Oxford Handbook of Computational Linguistics, OUP.

[2] This odd number is simply because queries matching the selection criteria were captured at random within a fixed time frame. Note that the system for deducing a machine's country from its IP address is not completely accurate.

[3] If a user enters a single word into the system in question, a message is displayed alongside the translation, indicating that the user would get a better result using the site's dictionary.
