Tuesday, January 22, 2008

EC: 1 million sentences in 22 languages

The European Commission’s collection of about 1 million sentences and their high-quality translations in 22 of the 23 official EU languages, including those of the new Member States, is the biggest collection ever assembled in so many languages, and it is now freely available. The data can help the development of linguistic software tools such as grammar and spell checkers, online dictionaries and multilingual text classification systems. By offering free and open access to this JRC-Acquis collection of sentences, the EU hopes to foster multilingualism.

The EU institutions have more multilingual texts than any other organisation in the world because of the requirement that EU law exist in each of its 23 official languages. Their translation services work with 253 possible language pairs and produce around 1.5 million translated pages a year.
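The figure of 253 follows from counting unordered pairs among 23 languages. As a quick check, here is a minimal Python sketch; the function name `language_pair_count` is only illustrative, and it assumes that a pair such as Latvian-Maltese is counted once regardless of translation direction.

```python
from math import comb


def language_pair_count(n_languages: int) -> int:
    """Number of unordered pairs that can be formed from n_languages."""
    # n choose 2 = n * (n - 1) / 2; direction of translation is ignored.
    return comb(n_languages, 2)


# 23 official EU languages, as stated in the post.
official_pairs = language_pair_count(23)

# 22 languages are covered by the released collection (Irish is missing).
corpus_pairs = language_pair_count(22)

print(official_pairs, corpus_pairs)
```

If each translation direction were counted separately, the totals would simply double.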

Whereas many translations of English or French texts can be found on the Internet, such resources are scarce for languages such as Latvian, Romanian or Dutch, and they are practically nonexistent for pairs of languages that each have few resources.

Through co-operation between its translators and its in-house scientists, the EC is releasing large collections of sentences from legal documents covering technical, political and social issues, which are available in 22 languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish). In this translation repository it is possible to find sentences with their equivalents in all the other official languages. Only Irish translations are not yet available. This release of language data is a good example of the Commission's open policy of re-use of its information resources and follows the opening of the EU's documentary and terminological databases Eur-Lex and IATE.

The EC has extensive experience with the development of multilingual text processing tools and is at the forefront of multilingualism, offering publicly accessible news search sites covering up to 35 languages via its European Media Monitoring tool. The 7th Framework programme for research and development – in its Information and Communication Technologies strand – supports research on machine translation and other language related technologies.

Blog Posting Number: 985
