HPLT and MultiHPLT v2 released

2025-01-25

MDN_Web_Docs

2023-09-25

NLLB

2023-09-07

Liv4ever and ELITR-ECA

2021-12-08

CCMatrix

2021-06-28

Updated: ParaCrawl and MultiParaCrawl

2021-06-11

New: MT560 dataset

2021-04-02

GoURMET and MIZAN

2020-11-27

EuroPat and tico-19

2020-10-31

OPUS-100 corpus

2020-06-30

ELRC public

2020-05-22

MultiParaCrawl

2019-10-16

Infopankki v1

2019-10-14

New corpus: memat (Xhosa/English)

2018-10-06

New corpora: ParaCrawl, XhosaNavy

2018-02-15

New version: OpenSubtitles2018

2017-11-06

An overview of the OPUS collection

1,213 corpora

59,953,757,221 total sentence pairs

1,005 languages available

This table displays 101 corpora, which make up a total of 94.94% of the entire OPUS collection

Corpus                  Sentences      % of OPUS
OpenSubtitles           20B            33.71
NLLB                    13B            21.69
CCMatrix                11B            18.12
MultiCCAligned          2.2B           3.73859
MultiHPLT               1.7B           2.82215
ParaCrawl               1.5B           2.50008
DGT                     1.1B           1.82274
XLEnt                   883M           1.47254
MultiParaCrawl          789M           1.31547
HPLT                    681M           1.13537
LinguaTools-WikiTitles  487M           0.81297
CCAligned               439M           0.73142
UNPC                    323M           0.53937
EUbookshop              279M           0.46538
EMEA                    243M           0.40524
GNOME                   225M           0.37483
KDE4                    201M           0.33505
Europarl                186M           0.31044
MultiUN                 159M           0.26602
JRC-Acquis              147M           0.24564
TED2020                 143M           0.23860
TildeMODEL              128M           0.21347
WikiMatrix              127M           0.21213
QED                     122M           0.20382
ParaCrawl-Bonus         102M           0.17025
EuroPat                 89M            0.14861
bible-uedin             85M            0.14234
NeuLab-TedTalks         74M            0.12360
Samanantar              50M            0.08302
Tanzil                  42M            0.07047
MultiMaCoCu             28M            0.04589
JParaCrawl              26M            0.04307
MaCoCu                  26M            0.04259
wikimedia               23M            0.03822
giga-fren               23M            0.03756
ELITR-ECA               20M            0.03364
Anuvaad                 18M            0.03023
StanfordNLP-NMT         16M            0.02656
ECB                     15M            0.02559
Wikipedia               13M            0.02160
SETIMES                 8.8M           0.01468
Tatoeba                 8.7M           0.01452
DOGC                    8.5M           0.01413
WikiTitles              8M             0.01339
GlobalVoices            7.3M           0.01210
News-Commentary         6.4M           0.01072
PHP                     6.1M           0.01023
MBS                     5M             0.008387693
SciELO                  3.8M           0.006295584
Finlex                  3.1M           0.005194237
infopankki              2.9M           0.004894437
JESC                    2.8M           0.004665911
fiskmo                  2.1M           0.003502700
ParIce                  2.1M           0.003497732
GoURMET                 2.1M           0.003495684
EUconst                 2.1M           0.003444473
OpenOffice              2M             0.003414301
EOPC                    2M             0.003359409
TED2013                 1.9M           0.003176892
EhuHac                  1.8M           0.002998114
SUMMA                   1.6M           0.002626376
IITB                    1.6M           0.002594890
ALT                     1.4M           0.002353034
Books                   1.3M           0.002085994
ChuBiCo                 1.2M           0.002004904
CAPES                   1.2M           0.001930838
Joshua-IPC              1.1M           0.001842066
MIZAN                   1M             0.001703975
SCB_MT_EN_TH            988K           0.001648370
MDN_Web_Docs            874K           0.001458614
ECDC                    683K           0.001140029
Elhuyar                 642K           0.001071404
EiTB-ParCC              637K           0.001062789
TEP                     612K           0.001020930
KDEdoc                  610K           0.001018158
WMT-News                447K           0.000746222
KFTT                    440K           0.000734376
pmindia                 422K           0.000704163
tldr-pages              370K           0.000616360
tico-19                 319K           0.000532717
memat                   155K           0.000258137
hrenWaC                 99K            0.000165129
TedTalks                86K            0.000144024
FFR                     82K            0.000136997
SPC                     68K            0.000113020
MontenegrinSubs         65K            0.000108487
OfisPublik              63K            0.000105785
XhosaNavy               50K            0.000083366
Bianet                  48K            0.000080891
WikiSource              33K            0.000055514
Salome                  9.4K           0.000015722
sardware                6.2K           0.000010258
ada83                   4.1K           0.000006875
InterdialectCorpus      2.4K           0.000004081
RF                      1.2K           0.000001928
komi                    Not specified  Not specified
liv4ever                Not specified  Not specified
Mozilla-I10n            Not specified  Not specified
Nunavut_Hansard         Not specified  Not specified
translatewiki           Not specified  Not specified
Ubuntu                  Not specified  Not specified
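
The percentage column appears to be each corpus's sentence count divided by the total of 59,953,757,221 sentence pairs stated in the overview. The following is a minimal Python sketch of that arithmetic, under that assumption. The helper names `parse_count` and `share_of_opus` are hypothetical and not part of any OPUS tool, and because the table shows rounded counts (e.g. 201M), the result only approximates the listed percentage.

```python
# Total sentence pairs as stated in the overview above.
TOTAL_SENTENCE_PAIRS = 59_953_757_221

# Multipliers for the abbreviated counts used in the table (K, M, B).
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '1.5B' or '874K' to a number."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def share_of_opus(sentences: str, total: int = TOTAL_SENTENCE_PAIRS) -> float:
    """Return a corpus's share of the collection, in percent (assumed formula)."""
    return 100.0 * parse_count(sentences) / total


if __name__ == "__main__":
    # KDE4 is listed with 201M sentences and 0.33505% of OPUS.
    print(f"KDE4: {share_of_opus('201M'):.5f}%")
```

Rows marked "Not specified" have no count to parse, so a caller would need to skip them before applying this calculation.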

OUR CONTRIBUTORS

NLPL, University of Helsinki, CSC, HPLT, Lets MT