Parallel Corpora & Domains (bilingual and multilingual)

Full Official Name: Parallel Corpora & Domains (bilingual and multilingual)
Submission date: Oct. 11, 2023, 4:48 p.m.

Parallel corpora for nearly 400 language pairs and numerous multilingual combinations, including 10 million bilingual segments and 90 million tokens in 20 languages: Arabic, Chinese (Simplified), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Italian, Japanese, Korean, North Sami, Norwegian, Polish, Portuguese (Brazilian and European), Russian, Spanish, Swedish, and Turkish. The segments consist of full sentences and short phrases with translation equivalents, based on corpus evidence and frequency. They were originally created by editors and translators worldwide as usage examples for dictionary entries. Some of the bilingual pairs were generated via a third, pivot language. The data can be used to train machine learning models and large language models and to improve the performance of machine translation systems.

In addition to general-language vocabulary, there are segments for over a hundred vertical domains: administration, advertising, aeronautics, agriculture, anatomy, anthropology, archaeology, architecture, art, astrology, astronomy, automobiles, aviation, biology, botanics, cartography, chemistry, cinema, clothing, color, commerce, computers, construction, cosmetics, culinary, culture, dance, data, dress, drinks, drugs, ecology, economics, education, electricity, electronics, energy, engineering, entertainment, environment, family, fashion, finance, furniture, games, genetics, geography, geology, geometry, grammar, health, history, hygiene, industry, informatics, Internet, IT, journalism, law, leisure/hobbies, linguistics, literature, maritime, marketing, mathematics, measurements/units, mechanics, medicine, meteorology, military, music, mythology, nautical, occupation, oceanography, optics, pharmacology, philosophy, photography, physics, physiology, police, politics, post, psychology, publishing, radio, rail, religion, school, sex, sociology, space, sport, statistics, technical, technology, telecommunication, telephone, television, theatre, theology, time, tourism, transportation, university, zoology.

Note: Prices are indicated per segment unit. Please contact us for a quotation covering the languages and domains you require.
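
The description mentions that some bilingual pairs were generated via a third pivot language. The entry does not say how this was done, so the following is only a minimal sketch of the general idea, assuming exact-match joining on the pivot-side segment. The function name `pivot_pairs` and the toy Danish, English, and Finnish segments are hypothetical and are not taken from the corpus.

```python
from collections import defaultdict


def pivot_pairs(src_to_pivot, pivot_to_tgt):
    """Derive (source, target) pairs by joining two bilingual lists
    on their shared pivot-language segment (exact string match)."""
    # Index the pivot->target list by its pivot-side segment.
    by_pivot = defaultdict(list)
    for pivot_seg, tgt_seg in pivot_to_tgt:
        by_pivot[pivot_seg].append(tgt_seg)

    # For each source segment, look up targets that share its pivot segment.
    pairs = []
    for src_seg, pivot_seg in src_to_pivot:
        for tgt_seg in by_pivot.get(pivot_seg, []):
            pairs.append((src_seg, tgt_seg))
    return pairs


# Hypothetical toy data: Danish-English and English-Finnish segments.
da_en = [("god morgen", "good morning"), ("tak", "thank you")]
en_fi = [
    ("good morning", "hyvää huomenta"),
    ("thank you", "kiitos"),
    ("goodbye", "näkemiin"),
]

da_fi = pivot_pairs(da_en, en_fi)
for src, tgt in da_fi:
    print(f"{src}\t{tgt}")
```

Exact matching keeps the sketch short. A real pipeline would likely need normalization and sense disambiguation, since a pivot segment with several meanings can link unrelated source and target segments.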

Creator(s)
Distributor(s)
Right Holder(s)