81 languages, 142 bitexts
total number of files: 126
total number of tokens: 774.1M
total number of sentences: 63.1M
Please acknowledge the Wikimedia Foundation for the data and cite the following paper if you use data from this distribution:

@inproceedings{tiedemann-2020-ttc,
title = "The {T}atoeba {T}ranslation {C}hallenge -- {R}ealistic Data Sets for Low Resource and Multilingual {MT}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Fifth Conference on Machine Translation (Volume 1: Research Papers)",
year = "2020",
publisher = "Association for Computational Linguistics",
url = {https://arxiv.org/abs/2010.06354}
}
Below you can download data files for all language pairs in different formats and with different kinds of annotation (if available). Click on the links as explained below.
License: CC-BY-SA 4.0
You need to download the monolingual corpus files and the standoff alignment files between them:
Column languages: af, ar, az, bg, bn, br, bs, ca, ceb, cs, cy, da, de, el, en, eo, et, eu, fi, fr, fy, ga, gl, he, hr, hu, hy, ia, id, ilo, it, lb, lt, lv, mk, ml, ms, mt, nb, nds, nn, pl, pt, ro, ru, sk, sq, sr_Cyrl, sr_Latn, sv, sw, ta, th, tl, tr, uk, ur, uz, war, zh, zh_Hans, zh_Hant

Row language | Non-empty cells (left to right) |
---|---|
af | 1.6k |
ang | 55 |
ar | 22.0k |
ast | 0.3k |
az | 50.3k |
be | 1.2k |
bg | 0.1M |
br | 1.2k |
bs | 46.8k |
ca | 47.3k |
co | 0.2k |
cs | 68.2k, 59.6k |
cy | 1.7k |
da | 7.1k |
de | 71.9k, 70.6k |
el | 14.5k |
en | 0.9M, 0.9M, 10.6k, 0.9M, 0.9M, 0.9M, 0.7M, 0.8M, 0.9M, 0.9M, 0.9M, 0.9M, 0.9M, 0.9M, 0.9M, 0.9M, 0.9M, 0.9M, 0.9M, 0.9M, 0.9M, 0.8M, 1.0M, 0.8M, 0.9M, 0.9M, 14.9k, 0.9M, 0.9M, 0.9M, 0.8M, 0.9M, 0.9M, 0.9M, 1.0M, 0.9M, 0.9M, 0.8M, 65.9k, 0.8M, 0.9M, 0.9M, 0.9M, 0.9M, 0.8M, 0.7M, 0.9M, 0.9M, 1.0M, 0.9M, 0.9M, 0.9M, 0.9M, 0.9M, 0.8M, 0.9M, 0.8M, 0.9M, 0.8M |
eo | 6.9k |
es | 0.1M |
et | 14.0k |
eu | 6.6k |
fa | 0.1M |
fi | 29.4k, 31.2k, 28.8k, 28.8k, 31.2k |
fr | 0.1M |
gl | 3.7k |
gu | 5.2k |
hr | 28.7k |
hu | 59.0k, 59.8k |
hy | 36.5k |
id | 15.9k |
it | 0.6M |
ka | 2.5k |
kk | 0.1k |
ko | 12.6k |
lb | 15 |
lt | 31.9k, 31.7k |
ml | 0.7k |
mr | 1.8k |
nb | 10.2k |
nl | 6.7k |
no | 9.5k |
pl | 0.7M, 0.6M |
pt | 0.1M |
ro | 8.6k, 9.0k |
ru | 0.2M, 0.2M |
sa | 60 |
sah | 3.1k |
sk | 37.1k |
sl | 16.4k |
sq | 4.5k |
sr_Cyrl | 7.9k |
sv | 17.4k, 17.4k |
ta | 0.4k |
te | 6.2k |
th | 1.3k |
tr | 98.6k |
uk | 53.9k, 53.6k, 52.9k, 53.9k, 54.7k, 57.3k, 56.1k, 54.5k, 56.7k |
ur | 2.5k |
uz | 0.5k |
vi | 17.2k |
wo | 9 |
zh | 56.3k |
zh_Hans | 32.7k |
zh_Hant | 0.5k |
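
The standoff alignment files mentioned above do not contain text; they only list which sentence IDs in one monolingual file correspond to which IDs in the other. The following is a minimal sketch of reading such a file in Python. It assumes an XCES-style layout with `linkGrp` elements carrying `fromDoc`/`toDoc` attributes and `link` elements carrying an `xtargets` attribute of the form `source IDs;target IDs`. The sample document names and sentence IDs are hypothetical and serve only as illustration, so check the element and attribute names against the files you actually download.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in an assumed XCES-style layout.
# Document names and sentence IDs are made up for illustration.
SAMPLE = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/example.xml.gz" toDoc="fi/example.xml.gz">
    <link id="SL1" xtargets="s1;s1" />
    <link id="SL2" xtargets="s2 s3;s2" />
    <link id="SL3" xtargets=";s3" />
  </linkGrp>
</cesAlign>
"""


def read_links(xml_text):
    """Return a list of (from_doc, to_doc, source_ids, target_ids) tuples."""
    root = ET.fromstring(xml_text)
    links = []
    for group in root.iter("linkGrp"):
        from_doc = group.get("fromDoc")
        to_doc = group.get("toDoc")
        for link in group.iter("link"):
            # Assumed format: space-separated source IDs, ';', space-separated target IDs.
            src, _, tgt = link.get("xtargets", "").partition(";")
            links.append((from_doc, to_doc, src.split(), tgt.split()))
    return links


links = read_links(SAMPLE)
for from_doc, to_doc, src_ids, tgt_ids in links:
    print(from_doc, src_ids, "<->", to_doc, tgt_ids)
```

Each tuple names the two monolingual documents and the sentence IDs to look up in them. An empty list on one side means the sentence has no counterpart in the other language.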