[tt] Lost in translation: hands-on with Google's new stats-based translator
Brian Atkins
<brian at posthuman.com> on
Wed Oct 24 17:20:31 UTC 2007
(quality pretty horrible currently, but due to Google directly controlling the
underlying code, plus the ability for users to add corrections, it should improve)
http://arstechnica.com/news.ars/post/20071024-lost-in-translation-hands-on-with-googles-new-stats-based-translator.html
Automated translation systems, such as Alta Vista's Babelfish, have relied on a
set of human-defined rules that attempt to encapsulate the underlying grammar
and vocabulary used to construct a language. Although Google has been using that
approach to power much of its translation service, it's not really in keeping
with the company's philosophy of using some clever code and a massive data set.
So it should be no surprise that the company has started developing its own
statistical machine translation service. According to some Google-watchers,
Google's homegrown translation process is now being used for all languages
available through the service.
We took the new service for a spin. Five years of Spanish in high school and
college, as well as countless years of exposure to the language through ads on
the subway and watching the World Cup on Univision, have left me
borderline-literate in the language. I chose a web page that was inspired by my
contributions to Urs Technica: a description of the native bear population of
the Iberian peninsula. The page contains a mix of some basic descriptive
language, along with more detailed discussions of ursine biology. A second
translation using Babelfish was performed at the same time.
Overall, it was difficult to discern a difference in quality between the two.
Each service had some difficulty with Spanish's sentence structure, which places
adjectives after the nouns they modify. For example, instead of "Discover Bear
Country," Google suggested that a link was inviting people to "Discover the
Country Bears." Maybe Disney paid for that one.
Both also ran into a number of words they didn't know what to do with; for
example, Spanish has a specific word for "bear den"—osera—that neither service
recognized and so left untranslated. Neither correctly figured out the proper
context for the use of "celo". This is a term that didn't come up during my
years of Spanish, but it apparently can be used to describe the annual period of
female fertility. Both services went literal when faced with "celo", with
Babelfish choosing "fervor" and Google picking "zeal" as its translation. This
caused Google to suggest that female bears "can be mounted by several different
males over the same zeal."
There were also what might be termed Spanish 101 level errors. The verb
"molesta" is generally used to mean "bother" or "harass." Yet Google made a
novice-level mistake and translated it literally as "molest." Neither service
demonstrated a human's ability to recognize when it was producing gibberish.
Google, for example, described a group of bears gathering around a rich food
source as "They can also occur by coincidence, rallies temporary copies in a few
places with abundant food."
There was one case where Google's statistical method seemed to lead it astray.
Both services went Spanish 101 on the term "crudo," which was used to describe
the harshest or roughest part of winter, when bears hibernate. Google apparently
applied undue statistical weight to the word "crude." In one case, this trashed
the entire sentence that contained "crudo"—a photo of a cold winter scene was
captioned: "The period of winter as crude bears spend winter." In a second
instance, the more typical context of "crudo" was applied, with hilarious
results: "The life of a bear begins as crude oil during winter."
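As a rough illustration of how a purely frequency-driven word choice could land
on "crude" regardless of context, here is a toy sketch in Python. It is not
Google's actual system: the alignment counts, the ALIGNMENT_COUNTS table, and
both function names are invented for this example. It only shows that picking
the most frequently seen translation ignores what the sentence is about.

```python
from collections import Counter

# Hypothetical counts of how often "crudo" was paired with each English word
# in an imagined parallel corpus. The numbers are made up for illustration.
ALIGNMENT_COUNTS = {
    "crudo": Counter({"crude": 900, "raw": 60, "harsh": 40}),
}


def translation_probabilities(source_word):
    """Turn raw pairing counts into relative frequencies."""
    counts = ALIGNMENT_COUNTS[source_word]
    total = sum(counts.values())
    return {target: n / total for target, n in counts.items()}


def pick_translation(source_word):
    """Choose the single most frequent translation, ignoring context."""
    probs = translation_probabilities(source_word)
    return max(probs, key=probs.get)


if __name__ == "__main__":
    # With these invented counts the oil-related sense always wins,
    # even in a sentence about winter.
    print(pick_translation("crudo"))
```

A real statistical translator would also weigh the surrounding words, so this
sketch shows only the failure mode the article describes, not a full model.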
To test a language that is more distant from English, I located a press release
in both Japanese and English: the one announcing the 2002 Nobel Prize in
Physics, which went to researchers running parallel experiments in the US and
Japan. The release in Japanese was available only as a PDF, so I copied and
pasted the text into the translation box. The results, which seem to have
preserved the line breaks from the PDF, were practically poetic:
I do so without interaction, thus detected is extremely
Difficult for. For example, the trillions of pieces of New
Torino is our second body to penetrate, but I
We are absolutely not aware. Raymond Davis Jr.
Coal giant tank is placed 600 tons of liquid meets applicable
The construction of a completely new detection equipment. He was 30 years...
That bears a slight resemblance to Japanese Zen poetry, which is supposed to
startle its readers out of their normal perception of reality, allowing them to
reach a Buddhist enlightenment.
This may sound like I'm being excessively harsh regarding Google's new
translation method, so I'll reemphasize that it appears to produce translations
that are roughly equal in quality to those provided by other services. Where it
really shines, however, is its interface. On a translated web page, you can
hover the mouse over any translated sentence, and the untranslated version will
appear. This is a tremendous aid for those who have a partial command of the
language, as the immediate comparison between the texts can help eliminate any
confusion caused by mistranslation.
This same feature may ultimately help Google move beyond the quality of other
services. Each of these popups comes with a link that offers you the opportunity
to suggest a better translation. If people are willing to spend the time
suggesting fixes for mistranslations (and vandalism doesn't become a problem),
Google may ultimately have a dataset that allows its service to provide an
exceptional degree of accuracy.
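To make the feedback idea concrete, here is a small Python sketch of how user
suggestions might shift a frequency-based choice over time. How Google actually
uses the suggestions is not described in the article, so everything here is an
assumption: the starting counts, the apply_corrections helper, and the idea of
adding one count per suggestion are all hypothetical. "Heat" stands in for the
fertility sense of "celo" discussed above.

```python
from collections import Counter


def best_translation(counts):
    """Return the translation with the highest count."""
    return counts.most_common(1)[0][0]


def apply_corrections(counts, suggestions, weight=1):
    """Return new counts with each user suggestion added at the given weight."""
    updated = counts.copy()
    for suggestion in suggestions:
        updated[suggestion] += weight
    return updated


if __name__ == "__main__":
    # Invented starting counts for the Spanish word "celo".
    celo_counts = Counter({"zeal": 50, "heat": 5})
    print(best_translation(celo_counts))

    # Enough users suggesting "heat" eventually outvotes the original choice.
    corrected = apply_corrections(celo_counts, ["heat"] * 46)
    print(best_translation(corrected))
```

The same mechanism is what makes vandalism a concern: in this sketch, a batch of
bad suggestions would shift the counts just as easily as good ones.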
--
Brian Atkins
Singularity Institute for Artificial Intelligence
http://www.singinst.org/