[info] Connectionists: DGT-TM - Translation Memory for 231 language pairs available for distribution
Eugen Leitl
<eugen at leitl.org> on
Thu Nov 29 10:38:52 UTC 2007
----- Forwarded message from Ralf Steinberger <ralf.steinberger at jrc.it> -----
From: Ralf Steinberger <ralf.steinberger at jrc.it>
Date: Wed, 28 Nov 2007 14:53:19 +0100
To: connectionists at cs.cmu.edu
Subject: Connectionists: DGT-TM - Translation Memory for 231 language pairs
available for distribution
Organization: European Commission - Joint Research Centre
X-Mailer: Microsoft Office Outlook 11
Apologies for cross-postings.
This dataset may be of interest to people and organisations working on
Statistical Machine Translation and other multilingual Machine Learning
applications.
DGT-TM Translation Memory
Freely available
22 languages
231 language pairs
Format: TMX version 1
http://langtech.jrc.it/DGT-TM.html
The European Commission's Directorate General for Translation (DGT) and the
Joint Research Centre (JRC) have made available a multilingual Translation
Memory (sentences and their translations, in standard TMX format) for the 22
official European Union languages Bulgarian, Czech, Danish, Dutch, English,
Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian,
Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish
and Swedish.
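The figure of 231 language pairs follows from the 22 languages: each unordered pair is counted once, giving 22 x 21 / 2 = 231. A minimal Python sketch of that count, using only the language names listed above (the variable names are illustrative and not part of the release):

```python
from itertools import combinations

# The 22 languages named in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: (Danish, English) and (English, Danish) count once.
PAIRS = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES), len(PAIRS))
```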
This release follows the public release, in May 2006, of the JRC-Acquis
multilingual parallel corpus (http://langtech.jrc.it/JRC-Acquis.html), which
has sentence alignment for 231 language pairs and a total size of over
1 billion words.
The data releases of DGT and JRC are in line with the general effort of the
European Commission to support multilingualism, language diversity and the
re-use of Commission information.
The Translation Memory contains most, but not all, of the Acquis
Communautaire, which is the entire body of European legislation, including
all the treaties, regulations and directives adopted by the European Union
(EU) and the rulings of the European Court of Justice. Since each new
country joining the EU is required to accept the whole Acquis Communautaire,
this body of legislation is translated into 22 official EU languages. For
the 23rd official EU language, Irish, the Acquis is not translated on a
regular basis.
A translation memory is a collection of small text segments and their
translations. These segments can be sentences or sentence parts. Translation
memories are used to support translators by ensuring that pieces of text
that have already been translated do not need to be translated again.
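To illustrate what such segments look like in TMX, here is a hedged Python sketch that reads translation units from a TMX string. The element names (tu, tuv, seg) are those of the TMX format; the sample document, the language codes and the function name read_pairs are invented for illustration, and the actual DGT-TM files may be organised differently (check the download page for details).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample, not taken from DGT-TM.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="de"><seg>Diese Verordnung tritt in Kraft.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Article 1</seg></tuv>
      <tuv xml:lang="fr"><seg>Article premier</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_pairs(tmx_text, src, tgt):
    """Return (source, target) segment pairs for one language pair."""
    src, tgt = src.lower(), tgt.lower()
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Assumption: fall back to a plain "lang" attribute if present.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


EN_DE = read_pairs(SAMPLE_TMX, "en", "de")
print(EN_DE)
```

The second unit in the sample has no German segment, so it is skipped for the English-German pair.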
Both translation memories and parallel texts are important linguistic
resources that can be used for a variety of purposes, including:
* training automatic systems for Statistical Machine Translation
(SMT);
* producing monolingual or multilingual lexical and semantic resources
such as dictionaries and ontologies;
* training and testing multilingual information extraction software;
* checking translation consistency automatically;
* testing and benchmarking alignment software (for sentences, words,
etc.).
For usage conditions, details regarding the difference between DGT-TM and the
JRC-Acquis (http://langtech.jrc.it/JRC-Acquis.html), size information,
downloading instructions, etc., go to http://langtech.jrc.it/DGT-TM.html.
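As an illustration of one of the uses listed above, automatic consistency checking, the following Python sketch flags source segments that occur with more than one distinct translation. It assumes the segment pairs have already been extracted as (source, target) tuples; the function name and the sample pairs are invented for illustration and do not come from DGT-TM.

```python
from collections import defaultdict


def inconsistent_segments(pairs):
    """Map each source segment to its translations if there are several."""
    translations = defaultdict(set)
    for source, target in pairs:
        translations[source.strip()].add(target.strip())
    # Only keep sources that were translated in more than one way.
    return {s: sorted(t) for s, t in translations.items() if len(t) > 1}


# Invented sample pairs (English source, German target).
SAMPLE_PAIRS = [
    ("Article 1", "Artikel 1"),
    ("Article 1", "Artikel 1"),
    ("This Decision is addressed to the Member States.",
     "Diese Entscheidung ist an die Mitgliedstaaten gerichtet."),
    ("This Decision is addressed to the Member States.",
     "Dieser Beschluss ist an die Mitgliedstaaten gerichtet."),
]

FLAGGED = inconsistent_segments(SAMPLE_PAIRS)
for source, targets in FLAGGED.items():
    print(source, "->", targets)
```

A differing translation is not necessarily an error, since context can justify it; a check like this only narrows down what a human reviewer needs to look at.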
Achim Blatt
Directorate General for Translation (DGT)
Unit DGT.R.3 Informatics (http://ec.europa.eu/dgs/translation/)
Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology (http://langtech.jrc.it/)
The JRC's Language Technology group specialises in the development of highly
multilingual text analysis tools and in cross-lingual applications. Many
applications are accessible online, e.g.:
* NewsExplorer (http://press.jrc.it/NewsExplorer/): multilingual news
aggregation and analysis (19 languages); lets users navigate the news over
time and across languages; trend analysis; collects information about people
from the news; social network detection.
* NewsBrief (http://press.jrc.it/): breaking news detection and
display of the very latest thematic news from around the world; email
alerting (22+ languages).
* MedISys Medical Information System (http://medusa.jrc.it/): latest
health-related news from around the world according to themes and diseases
(22+ languages).
----- End forwarded message -----
--
Eugen* Leitl http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE