This data set has 473M sentences, 9 billion source tokens and 9 billion English tokens after deduplication and cleaning. The raw data (train.raw.tsv.gz) is much larger.

$ (head -1 && tail -1) < train.v1.tok.stats.tsv
Lang    Sents        Source        English
Total   473791770    9001780032    9072887211
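The same stats file also gives the per-language breakdown. As a quick way to see which languages dominate, the one-liner below is a minimal sketch that assumes the tab-separated layout shown above (language code, sentence count, source tokens, English tokens) and that per-language rows start with a lowercase language code, so the header and the Total row are filtered out.

# list the largest languages first, sorted by sentence count (column 2)
$ grep '^[a-z]' train.v1.tok.stats.tsv | sort -t$'\t' -k2,2nr | head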
It has 560 languages on the source side; the target side is English. However, not all languages have sufficient training data.

$ cut -f1 train.v1.tok.stats.tsv | grep '^[a-z]' | wc -l
560
334 languages have at least 10,000 sentences.

$ cat train.v1.tok.stats.tsv | grep '^[a-z]' | awk -F '\t' 'int($2) >= 10000' | wc -l
334
214 languages have at least 1 million source tokens.

$ cat train.v1.tok.stats.tsv | grep '^[a-z]' | awk -F '\t' 'int($3) >= 1000000' | wc -l
214
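Beyond these two cut-offs, a rough picture of how skewed the per-language coverage is can be had by bucketing languages by the number of digits in their sentence counts. This is only a sketch and assumes, as above, that column 2 of train.v1.tok.stats.tsv holds the per-language sentence count.

# count languages per number of digits in the sentence count (a proxy for order of magnitude)
$ grep '^[a-z]' train.v1.tok.stats.tsv \
    | awk -F '\t' '{ n[length($2)]++ } END { for (d in n) printf "%s-digit sentence counts: %d languages\n", d, n[d] }' \
    | sort -n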