
CCAligned v1

This corpus was created from 68 Common Crawl snapshots (up until March 2020). The documents are split into sentences based on punctuation, and deduplication is performed. No claims of intellectual property are made on the work of preparation of the corpus. The original distribution is available from http://www.statmt.org/cc-aligned/

CCAligned consists of parallel or comparable web-document pairs in 137 languages aligned with English. These web-document pairs were constructed by performing language identification on raw web documents and ensuring that the language codes in their URLs corresponded. This pattern-matching approach yielded more than 100 million aligned documents paired with English. Because each English document was often aligned to multiple documents in different target languages, we can join on English documents to obtain aligned documents that directly pair two non-English documents (e.g., Arabic-French).
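The URL matching and the join on English documents can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regular expression, the `/*` placeholder, and the helper names `strip_language`, `align_documents`, and `pivot_pairs` are assumptions made for this example, and real URLs encode languages in many more ways than a two-letter path segment.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumed pattern: a two-letter language code as its own path segment,
# e.g. "/en/" or "/fr". This is a simplification for illustration only.
LANG_SEGMENT = re.compile(r"/([a-z]{2})(?=/|$)")

def strip_language(url, known_langs):
    """Return (language-neutral URL, language code), or (url, None) if no code is found."""
    for match in LANG_SEGMENT.finditer(url):
        code = match.group(1)
        if code in known_langs:
            neutral = url[:match.start()] + "/*" + url[match.end():]
            return neutral, code
    return url, None

def align_documents(urls, known_langs):
    """Group URLs that differ only in their language segment."""
    groups = defaultdict(dict)
    for url in urls:
        neutral, lang = strip_language(url, known_langs)
        if lang is not None:
            groups[neutral][lang] = url
    return groups

def pivot_pairs(groups, pivot="en"):
    """Yield pairs of non-pivot documents that share a pivot-language counterpart."""
    for by_lang in groups.values():
        if pivot not in by_lang:
            continue  # no English document to join on
        others = sorted(lang for lang in by_lang if lang != pivot)
        for a, b in combinations(others, 2):
            yield (a, by_lang[a]), (b, by_lang[b])

urls = [
    "http://example.com/en/about",
    "http://example.com/ar/about",
    "http://example.com/fr/about",
    "http://example.com/fr/contact",  # no English counterpart, so it is skipped
]
groups = align_documents(urls, {"en", "ar", "fr"})
pairs = list(pivot_pairs(groups))
```

Here the Arabic and French "about" pages are paired because both align to the same English page, while the French "contact" page yields nothing.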

Sentence pairs were extracted from the document pairs using similarity scores of LASER embeddings (minimum similarity 1.04, sorted by decreasing similarity score). Some languages are missing because LASER does not cover them.
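The thresholding and ordering step can be illustrated with a short sketch. It assumes the similarity scores have already been computed elsewhere (how LASER scores are obtained is outside this example), that the minimum of 1.04 is inclusive, and that each candidate is a `(score, source, target)` tuple; the function name `select_pairs` is hypothetical.

```python
def select_pairs(scored_pairs, min_score=1.04):
    """Keep pairs scoring at least min_score, highest score first.

    scored_pairs: iterable of (score, source_sentence, target_sentence).
    The inclusive comparison is an assumption about how the minimum is applied.
    """
    kept = [pair for pair in scored_pairs if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept

candidates = [
    (1.02, "Hello", "Bonjour tout le monde"),
    (1.10, "Thank you", "Merci"),
    (1.04, "Good night", "Bonne nuit"),
]
selected = select_pairs(candidates)
```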

113 languages, 112 bitexts
total number of files: 36,185
total number of tokens: 26.39G
total number of sentence fragments: 2.25G

If you use the dataset or code, please cite (pdf):

@inproceedings{elkishky_ccaligned_2020,
author = {El-Kishky, Ahmed and Chaudhary, Vishrav and Guzmán, Francisco and Koehn, Philipp},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)},
month = {November},
title = {{CCAligned}: A Massive Collection of Cross-lingual Web-Document Pairs},
year = {2020},
address = {Online},
publisher = {Association for Computational Linguistics},
url = {https://www.aclweb.org/anthology/2020.emnlp-main.480},
doi = {10.18653/v1/2020.emnlp-main.480},
pages = {5960--5969}
}
Please also acknowledge OPUS (bib, pdf) for this service. For more information on the sentence pair mining method, see Chaudhary et al., WMT 2019 (bib, pdf).

Download

Below you can download data files for all language pairs in different formats and with different kinds of annotation (if available). You can click on the various links as explained below. In addition to the files shown on this webpage, OPUS also provides pre-compiled word alignments and phrase tables, bilingual dictionaries, and frequency counts; these files can be found through the resources search form on the top-level website of OPUS.

Bottom-left triangle: download files
  • ces = sentence alignments in XCES format
  • leftmost column language IDs = tokenized corpus files in XML
  • TMX and plain text files (Moses): see "Statistics" below
  • lower row language IDs = parsed corpus files (if they exist)
Upper-right triangle: sample files
  • view = bilingual XML file samples
  • upper row language IDs = monolingual XML file samples
  • rightmost column language IDs = untokenized corpus files
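For readers who want to process the XCES sentence alignment files, the following sketch reads alignment links from a small inline sample. The sample document, its file names, and the helper name `read_links` are assumptions for illustration; the layout (a `linkGrp` per document pair, with `link` elements whose `xtargets` attribute lists source and target sentence IDs separated by a semicolon) should be checked against an actual downloaded file.

```python
import xml.etree.ElementTree as ET

# Assumed, hand-written sample; real files are much larger and usually compressed.
SAMPLE = """<cesAlign version="1.0">
 <linkGrp targType="s" fromDoc="en/doc1.xml" toDoc="fr/doc1.xml">
  <link xtargets="1;1" />
  <link xtargets="2 3;2" />
 </linkGrp>
</cesAlign>"""

def read_links(xml_text):
    """Yield (source_doc, target_doc, source_ids, target_ids) for every link."""
    root = ET.fromstring(xml_text)
    for group in root.iter("linkGrp"):
        for link in group.iter("link"):
            # Sentence IDs on each side are space-separated; sides are split by ";".
            src, _, tgt = link.get("xtargets", "").partition(";")
            yield group.get("fromDoc"), group.get("toDoc"), src.split(), tgt.split()

links = list(read_links(SAMPLE))
```

The second link in the sample aligns two source sentences to one target sentence, which is why the IDs are returned as lists.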

Language IDs (rows and columns of the download matrix):
af ak am ar as ay az be bg bm bn br bs ca cb cs cx cy da de el en es et fa ff fi fr gu ha he hi hr ht hu hy id ig is it ja jv ka kg kk km kn ko ku ky lg ln lo lt lv mg mi mk ml mn mr ms mt my ne nl no ns ny om or pa pl ps pt qa qd ro ru si sk sl sn so sq sr ss st su sv sw sz ta te tg th ti tl tn tr ts tz uk ur ve vi wo xh yo zh_CN zh_TW zu zz

[Download matrix: the ces and view links appear only in the en row and en column, pairing en with each of the other languages.]

Statistics and TMX/Moses Downloads

Number of files, tokens, and sentences per language (including non-parallel ones if they exist)
Number of sentence alignment units per language pair

Upper-right triangle: download translation memory files (TMX)
Bottom-left triangle: download plain text files (MOSES/GIZA++)
Language IDs, first row: monolingual plain text files (tokenized)
Language IDs, first column: monolingual plain text files (untokenized)
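A minimal sketch for reading a plain text (Moses-style) bitext follows. It assumes the download unpacks into two line-aligned files, one sentence per line, where line n of one file is the translation of line n of the other; the file names in the usage note are hypothetical, and `read_moses` is a helper defined only for this example.

```python
from itertools import zip_longest

def read_moses(src_lines, tgt_lines):
    """Yield (source, target) sentence pairs from two line-aligned inputs.

    Raises ValueError if one input ends before the other, since the pairing
    would then be unreliable.
    """
    for n, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            raise ValueError(f"inputs differ in length at line {n}")
        yield src.rstrip("\n"), tgt.rstrip("\n")

pairs = list(read_moses(["Hello\n", "Goodbye\n"], ["Bonjour\n", "Au revoir\n"]))
```

With real files, the two arguments would be open file handles, for example `open("CCAligned.en-fr.en", encoding="utf-8")` and `open("CCAligned.en-fr.fr", encoding="utf-8")`, where both names are placeholders.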

The last two columns give the number of alignment units with en: first as listed in the language's own row of the original matrix, then as listed in the en row.

language  files   tokens   sentences  en (language row)  en (en row)
af        31      27.6M    2.1M       1.5M               1.5M
ak        1       5.0k     0.5k       0.5k               0.5k
am        7       5.4M     0.4M       0.3M               0.3M
ar        507     389.8M   25.7M      13.0M              13.1M
as        1       0.4M     27.3k      26.9k              27.0k
ay        1       11.0k    0.8k       0.5k               0.5k
az        25      14.3M    1.4M       1.2M               1.2M
be        23      21.2M    1.7M       1.1M               1.1M
bg        209     148.1M   13.4M      10.4M              10.4M
bm        1       1.3k     0.2k       0.1k               0.1k
bn        71      77.0M    3.6M       3.5M               3.5M
br        3       1.2M     0.1M       0.1M               0.1M
bs        4       6.5M     0.3M       0.2M               0.2M
ca        117     108.1M   7.4M       5.8M               5.8M
cb        2       0.4M     52.4k      52.2k              52.3k
cs        255     163.8M   16.1M      12.7M              12.7M
cx        5       3.5M     0.3M       0.2M               0.2M
cy        17      13.3M    1.0M       0.8M               0.8M
da        215     151.0M   13.7M      10.7M              10.7M
de        1,852   1.4G     121.7M     15.3M              15.3M
el        178     110.0M   10.5M      8.8M               8.9M
en        18,204  13.3G    1.1G       -                  -
es        1,967   1.5G     123.1M     15.3M              15.2M
et        83      53.0M    5.4M       4.1M               4.1M
fa        106     85.6M    5.4M       5.3M               5.2M
ff        2       0.9M     88.1k      73.0k              72.8k
fi        194     109.8M   12.5M      9.7M               9.7M
fr        2,067   1.9G     132.0M     15.6M              15.5M
gu        4       3.4M     0.2M       0.2M               0.2M
ha        7       4.1M     0.4M       0.3M               0.3M
he        107     65.8M    5.4M       5.3M               5.3M
hi        164     213.6M   8.3M       8.2M               8.1M
hr        188     114.2M   11.7M      9.4M               9.4M
ht        12      9.7M     0.7M       0.6M               0.6M
hu        232     139.7M   14.5M      11.6M              11.6M
hy        21      16.5M    1.2M       1.0M               1.0M
id        541     233.9M   30.4M      15.7M              15.6M
ig        3       2.8M     0.2M       0.1M               0.1M
is        24      20.6M    1.6M       1.2M               1.2M
it        1,161   950.8M   72.5M      14.5M              14.4M
ja        525     183.9M   26.4M      15.0M              14.9M
jv        31      6.7M     1.6M       1.5M               1.5M
ka        26      17.9M    1.3M       1.3M               1.3M
kg        1       0.6k     83         75                 75
kk        14      9.3M     0.9M       0.7M               0.7M
km        9       7.2M     0.4M       0.4M               0.4M
kn        4       3.0M     0.2M       0.2M               0.2M
ko        181     108.5M   9.4M       9.0M               8.7M
ku        3       2.9M     0.2M       0.1M               0.1M
ky        5       3.3M     0.3M       0.2M               0.2M
lg        1       80.6k    15.1k      14.7k              14.7k
ln        1       0.1M     22.3k      21.6k              21.5k
lo        4       1.5M     0.2M       0.2M               0.2M
lt        105     69.1M    6.7M       5.2M               5.2M
lv        98      68.0M    6.4M       4.9M               4.8M
mg        8       7.1M     0.5M       0.4M               0.4M
mi        3       2.6M     0.2M       0.1M               0.1M
mk        36      28.3M    2.3M       1.8M               1.8M
ml        12      15.5M    0.6M       0.6M               0.6M
mn        12      6.2M     0.7M       0.6M               0.6M
mr        15      15.6M    0.8M       0.7M               0.7M
ms        108     80.4M    6.9M       5.4M               5.4M
mt        1       86       26         27                 27
my        6       5.6M     0.3M       0.3M               0.3M
ne        10      10.4M    0.5M       0.5M               0.5M
nl        727     550.8M   48.0M      13.2M              13.1M
no        184     124.0M   11.7M      9.2M               9.2M
ns        1       0.1M     15.8k      14.1k              14.1k
ny        3       2.1M     0.2M       0.1M               0.1M
om        1       0.1M     22.9k      22.2k              22.1k
or        1       0.1M     5.6k       5.5k               5.5k
pa        4       3.2M     0.2M       0.2M               0.2M
pl        520     346.3M   33.0M      13.0M              12.9M
ps        6       4.3M     0.3M       0.3M               0.3M
pt        931     654.7M   56.5M      13.7M              13.6M
qa        1       1.8k     0.1k       0.1k               0.1k
qd        1       2.4k     0.2k       0.2k               0.2k
ro        211     154.9M   13.3M      10.5M              10.5M
ru        1,386   1.0G     90.2M      13.9M              13.8M
si        13      13.3M    0.6M       0.6M               0.6M
sk        139     94.2M    8.9M       6.9M               6.9M
sl        88      63.4M    5.7M       4.4M               4.4M
sn        2       1.5M     0.1M       86.9k              86.7k
so        8       3.7M     0.4M       0.4M               0.4M
sq        47      34.2M    2.9M       2.3M               2.3M
sr        40      40.1M    3.0M       2.0M               2.0M
ss        1       0.1M     27.2k      23.0k              22.9k
st        1       19.4k    1.2k       0.9k               0.9k
su        10      5.9M     0.6M       0.5M               0.5M
sv        251     182.0M   16.7M      12.5M              12.5M
sw        41      21.5M    2.4M       2.0M               2.0M
sz        1       0.1k     12         13                 13
ta        18      24.0M    0.9M       0.9M               0.9M
te        12      11.7M    0.6M       0.6M               0.6M
tg        6       3.8M     0.3M       0.3M               0.3M
th        215     70.7M    11.0M      10.7M              10.6M
ti        1       0.2M     7.7k       7.7k               7.6k
tl        132     50.5M    7.2M       6.6M               6.6M
tn        2       0.4M     74.1k      71.3k              70.9k
tr        406     245.5M   25.1M      13.7M              13.6M
ts        1       31.7k    2.5k       2.0k               2.0k
tz        1       0.4k     38         34                 34
uk        171     127.6M   11.6M      8.5M               8.5M
ur        28      25.8M    1.4M       1.4M               1.4M
ve        1       26.2k    2.0k       1.6k               1.6k
vi        248     204.3M   14.9M      12.4M              12.3M
wo        2       0.6M     94.5k      88.4k              88.3k
xh        3       1.6M     0.2M       0.1M               0.1M
yo        4       3.6M     0.2M       0.2M               0.2M
zh_CN     304     103.5M   15.3M      15.2M              15.1M
zh_TW     176     55.3M    8.8M       8.8M               8.7M
zu        3       1.7M     0.2M       0.1M               0.1M
zz        1       0.4k     34         35                 35

Note that TMX files only contain unique translation units and, therefore, the number of aligned units is smaller than for the distributions in Moses and XML format. Moses downloads include all non-empty alignment units including duplicates. Token counts for each language also include duplicate sentences and documents.
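The difference between the two counts can be illustrated with a small sketch. It assumes that an alignment unit is "non-empty" when both sides contain text and that uniqueness is judged on the exact (source, target) string pair; neither detail is specified above, and `count_units` is a hypothetical helper.

```python
def count_units(pairs):
    """Return (all non-empty units, unique units) for a list of (source, target) pairs.

    The first number mirrors a count that keeps duplicates (as in the Moses
    files); the second mirrors a count of unique translation units (as in TMX).
    """
    non_empty = [(src, tgt) for src, tgt in pairs if src.strip() and tgt.strip()]
    return len(non_empty), len(set(non_empty))

sample = [
    ("Contact us", "Contactez-nous"),
    ("Contact us", "Contactez-nous"),  # duplicate, counted once among unique units
    ("Home", "Accueil"),
    ("", "Accueil"),                   # empty source side, dropped from both counts
]
with_duplicates, unique = count_units(sample)
```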