

Find your corpora


102,912,051,826 total sentence pairs
1005 languages available
This table displays 101 corpora, which together make up 93.50% of the entire OPUS collection.

| Corpus | Sentences | % of OPUS |
|---|---|---|
| OpenSubtitles | 27.2B | 26.47 |
| NLLB | 22.7B | 22.09 |
| CCMatrix | 17.1B | 16.61 |
| ParaCrawl | 4.6B | 4.50 |
| CCAligned | 3.1B | 3.05 |
| MultiParaCrawl | 2.8B | 2.74 |
| MultiHPLT | 2.7B | 2.59 |
| MultiCCAligned | 2.4B | 2.34 |
| HPLT | 2.2B | 2.10 |
| GNOME | 1.7B | 1.61 |
| LinguaTools-WikiTitles | 1.5B | 1.42 |
| DGT | 1.2B | 1.18 |
| XLEnt | 1.1B | 1.04 |
| WikiMatrix | 933.6M | 0.91 |
| UNPC | 543.9M | 0.53 |
| EUbookshop | 459.3M | 0.45 |
| ParaCrawl-Bonus | 316.2M | 0.31 |
| EMEA | 282.5M | 0.27 |
| translatewiki | 264.9M | 0.26 |
| MultiUN | 255.9M | 0.25 |
| EuroPat | 252.2M | 0.25 |
| KDE4 | 224.7M | 0.22 |
| Europarl | 217.4M | 0.21 |
| JRC-Acquis | 215.9M | 0.21 |
| TildeMODEL | 193M | 0.19 |
| QED | 191.9M | 0.19 |
| TED2020 | 153.1M | 0.15 |
| Samanantar | 151.2M | 0.15 |
| Mozilla-I10n | 124.3M | 0.12 |
| bible-uedin | 88.3M | 0.08576 |
| MaCoCu | 81.6M | 0.07929 |
| NeuLab-TedTalks | 79.7M | 0.07745 |
| JParaCrawl | 79.3M | 0.07705 |
| MultiMaCoCu | 79M | 0.07680 |
| wikimedia | 75.9M | 0.07380 |
| giga-fren | 70.1M | 0.06808 |
| GoURMET | 62.7M | 0.06094 |
| StanfordNLP-NMT | 58.3M | 0.05665 |
| Tanzil | 50M | 0.04856 |
| Anuvaad | 49.1M | 0.04772 |
| ECB | 45.9M | 0.04459 |
| Wikipedia | 38.9M | 0.03775 |
| ELITR-ECA | 28.3M | 0.02749 |
| SETIMES | 26.4M | 0.02566 |
| DOGC | 26.1M | 0.02531 |
| WikiTitles | 24.1M | 0.02342 |
| Tatoeba | 19.8M | 0.01921 |
| MBS | 15.1M | 0.01466 |
| GlobalVoices | 14M | 0.01359 |
| Finlex | 11.3M | 0.01099 |
| News-Commentary | 11.1M | 0.01077 |
| SciELO | 10.8M | 0.01051 |
| PHP | 10.6M | 0.01025 |
| JESC | 8.4M | 0.00816 |
| ParIce | 6.4M | 0.00626 |
| fiskmo | 6.3M | 0.00616 |
| EOPC | 6.1M | 0.00590 |
| MDN_Web_Docs | 6M | 0.00583 |
| TED2013 | 5.7M | 0.00558 |
| EhuHac | 5.7M | 0.00554 |
| IITB | 4.9M | 0.00473 |
| Nunavut_Hansard | 4.1M | 0.00394 |
| infopankki | 4M | 0.00385 |
| ChuBiCo | 3.7M | 0.00361 |
| SCB_MT_EN_TH | 3.5M | 0.00338 |
| CAPES | 3.5M | 0.00338 |
| MIZAN | 3.1M | 0.00301 |
| OpenOffice | 2.7M | 0.00262 |
| EUconst | 2.3M | 0.00227 |
| Books | 2.2M | 0.00214 |
| SUMMA | 2.1M | 0.00201 |
| Elhuyar | 2M | 0.00193 |
| EiTB-ParCC | 1.9M | 0.00189 |
| TEP | 1.8M | 0.00178 |
| ALT | 1.6M | 0.00160 |
| Joshua-IPC | 1.6M | 0.00157 |
| KFTT | 1.3M | 0.00129 |
| tldr-pages | 1.1M | 0.00109 |
| KDEdoc | 1M | 0.00099 |
| WMT-News | 1M | 0.00098 |
| tico-19 | 983.8K | 0.00096 |
| pmindia | 856.7K | 0.00083 |
| ECDC | 749.1K | 0.00073 |
| memat | 489.3K | 0.00048 |
| hrenWaC | 297K | 0.00029 |
| TedTalks | 260.3K | 0.00025 |
| FFR | 246.4K | 0.00024 |
| SPC | 219.7K | 0.00021 |
| MontenegrinSubs | 211.5K | 0.00021 |
| OfisPublik | 191.1K | 0.00019 |
| Bianet | 186.9K | 0.00018 |
| XhosaNavy | 154.8K | 0.00015 |
| WikiSource | 113.3K | 0.00011 |
| sardware | 19.3K | 0.000018711 |
| Salome | 15.9K | 0.000015423 |
| ada83 | 12.5K | 0.000012188 |
| InterdialectCorpus | 7.3K | 0.000007133 |
| RF | 2.2K | 0.000002150 |
| Ubuntu | Not specified | Not specified |
| liv4ever | Not specified | Not specified |
| komi | Not specified | Not specified |
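
The "% of OPUS" column appears to be each corpus's sentence count divided by the total number of sentence pairs stated above the table. The following is a minimal sketch of that arithmetic, assuming this interpretation is correct; the helper names `parse_count` and `percent_of_opus` are hypothetical and not part of any OPUS tool. Because the counts in the table are rounded (for example `27.2B`), the computed shares only approximate the listed percentages.

```python
# Total sentence pairs as stated at the top of the page.
TOTAL_SENTENCE_PAIRS = 102_912_051_826

# Multipliers for the abbreviated counts used in the table (assumed meaning).
_SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_count(text):
    """Convert an abbreviated count such as '27.2B' to a float.

    Returns None for rows marked 'Not specified'.
    """
    text = text.strip()
    if text == "Not specified":
        return None
    if text[-1] in _SUFFIXES:
        return float(text[:-1]) * _SUFFIXES[text[-1]]
    return float(text)


def percent_of_opus(text, total=TOTAL_SENTENCE_PAIRS):
    """Approximate share of the collection, in percent (None if unknown)."""
    count = parse_count(text)
    if count is None:
        return None
    return 100.0 * count / total


if __name__ == "__main__":
    for name, sentences in [("OpenSubtitles", "27.2B"), ("Tanzil", "50M"), ("Ubuntu", "Not specified")]:
        share = percent_of_opus(sentences)
        label = "n/a" if share is None else f"{share:.5f}"
        print(f"{name}: {label}")
```

Rows marked "Not specified" are returned as `None` rather than zero, so they are not mistaken for empty corpora.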





