[tt] Lost in translation: hands-on with Google's new stats-based translator

Brian Atkins <brian at posthuman.com> on Wed Oct 24 17:20:31 UTC 2007

(Quality is pretty horrible currently, but because Google directly controls the 
underlying code and users can add corrections, it should improve.)

http://arstechnica.com/news.ars/post/20071024-lost-in-translation-hands-on-with-googles-new-stats-based-translator.html

Automated translation systems, such as Alta Vista's Babelfish, have relied on a 
set of human-defined rules that attempt to encapsulate the underlying grammar 
and vocabulary used to construct a language. Although Google has been using that 
approach to power much of its translation service, it's not really in keeping 
with the company's philosophy of using some clever code and a massive data set. 
So it should be no surprise that the company has started developing its own 
statistical machine translation service. According to some Google-watchers, 
Google's homegrown translation process is now being used for all languages 
available through the service.

I took the new service for a spin. Five years of Spanish in high school and 
college, as well as countless years of exposure to the language through ads on 
the subway and watching the World Cup on Univision, have left me 
borderline-literate in the language. I chose a web page that was inspired by my 
contributions to Urs Technica: a description of the native bear population of 
the Iberian peninsula. The page contains a mix of some basic descriptive 
language, along with more detailed discussions of ursine biology. I ran the same 
page through Babelfish at the same time for comparison.

Overall, it was difficult to discern a difference in quality between the two. 
Each service had some difficulty with Spanish's sentence structure, which places 
adjectives after the nouns they modify. For example, instead of "Discover Bear 
Country," Google suggested that a link was inviting people to "Discover the 
Country Bears." Maybe Disney paid for that one.

Both also ran into a number of words they didn't know what to do with; for 
example, Spanish has a specific word for "bear den"—osera—that neither service 
recognized and so left untranslated. Neither correctly figured out the proper 
context for the use of "celo". This is a term that didn't come up during my 
years of Spanish, but it apparently can be used to describe the annual period of 
female fertility. Both services went literal when faced with "celo", with 
Babelfish choosing "fervor" and Google picking "zeal" as its translation. This 
caused Google to suggest that female bears "can be mounted by several different 
males over the same zeal."

There were also what might be termed Spanish 101 level errors. The verb 
"molesta" is generally used to mean "bother" or "harass." Yet Google made a 
novice-level mistake and translated it literally as "molest." Neither service 
demonstrated a human's ability to recognize when it was producing gibberish. 
Google, for example, described a group of bears gathering around a rich food 
source as "They can also occur by coincidence, rallies temporary copies in a few 
places with abundant food."

There was one case where Google's statistical method seemed to lead it astray. 
Both services went Spanish 101 on the term "crudo," which was used to describe 
the harshest or roughest part of winter, when bears hibernate. Google apparently 
applied undue statistical weight to the word "crude." In one case, this trashed 
the entire sentence that contained "crudo"—a photo of a cold winter scene was 
captioned: "The period of winter as crude bears spend winter." In a second 
instance, the more typical context of "crudo" was applied, with hilarious 
results: "The life of a bear begins as crude oil during winter."

To test a language that is more distant from English, I located a press release 
in both Japanese and English: the one announcing the 2002 Nobel Prize in 
Physics, which went to researchers running parallel experiments in the US and 
Japan. The release in Japanese was available only as a PDF, so I copied and 
pasted the text into the translation box. The results, which seem to have 
preserved the line breaks from the PDF, were practically poetic:

     I do so without interaction, thus detected is extremely
     Difficult for. For example, the trillions of pieces of New
     Torino is our second body to penetrate, but I
     We are absolutely not aware. Raymond Davis Jr.
     Coal giant tank is placed 600 tons of liquid meets applicable
     The construction of a completely new detection equipment. He was 30 years...

That bears a slight resemblance to Japanese Zen poetry, which is supposed to 
startle its readers out of their normal perception of reality, allowing them to 
reach a Buddhist enlightenment.

This may sound like I'm being excessively harsh regarding Google's new 
translation method, so I'll reemphasize that it appears to produce translations 
that are roughly equal in quality to those provided by other services. Where it 
really shines, however, is its interface. On a translated web page, you can 
hover the mouse over any translated sentence, and the untranslated version will 
appear. This is a tremendous aid for those who have a partial command of the 
language, as the immediate comparison between the texts can help eliminate any 
confusion caused by mistranslation.

This same feature may ultimately help Google move beyond the quality of other 
services. Each of these popups comes with a link that offers you the opportunity 
to suggest a better translation. If people are willing to spend the time 
suggesting fixes for mistranslations (and vandalism doesn't become a problem), 
Google may ultimately have a dataset that allows its service to provide an 
exceptional degree of accuracy.
-- 
Brian Atkins
Singularity Institute for Artificial Intelligence
http://www.singinst.org/
