MDN_Web_Docs

2023-09-25

NLLB

2023-09-07

Liv4ever and ELITR-ECA

2021-12-08

CCMatrix

2021-06-28

Updated: ParaCrawl and MultiParaCrawl

2021-06-11

New: MT560 dataset

2021-04-02

GoURMET and MIZAN

2020-11-27

EuroPat and tico-19

2020-10-31

OPUS-100 corpus

2020-06-30

ELRC public

2020-05-22

MultiParaCrawl

2019-10-16

Infopankki v1

2019-10-14

New corpus: memat (Xhosa/English)

2018-10-06

New corpora: ParaCrawl, XhosaNavy

2018-02-15

New version: OpenSubtitles2018

2017-11-06

An overview of the OPUS collection

1,212 corpora

46,838,693,917 total sentence pairs

744 languages available

This table displays 100 corpora, which make up a total of 93.53% of the entire OPUS collection.

Corpus                  Sentences      % of OPUS
NLLB                    13B            27.77
CCMatrix                11B            23.19
OpenSubtitles           8.5B           18.17
MultiCCAligned          2.2B           4.78542
ParaCrawl               1.5B           3.20011
DGT                     1.1B           2.33312
XLEnt                   883M           1.88486
MultiParaCrawl          789M           1.68381
MultiHPLT               583M           1.24533
LinguaTools-WikiTitles  487M           1.04060
CCAligned               439M           0.93622
HPLT                    381M           0.81281
UNPC                    323M           0.69039
EUbookshop              279M           0.59569
EMEA                    243M           0.51871
GNOME                   225M           0.47979
KDE4                    201M           0.42887
Europarl                186M           0.39737
MultiUN                 159M           0.34051
JRC-Acquis              147M           0.31442
TED2020                 143M           0.30541
TildeMODEL              128M           0.27324
WikiMatrix              127M           0.27153
QED                     122M           0.26090
ParaCrawl-Bonus         102M           0.21791
EuroPat                 89M            0.19023
bible-uedin             85M            0.18220
NeuLab-TedTalks         74M            0.15821
Samanantar              50M            0.10627
Tanzil                  42M            0.09021
MultiMaCoCu             28M            0.05874
JParaCrawl              26M            0.05514
MaCoCu                  26M            0.05452
wikimedia               23M            0.04893
giga-fren               23M            0.04808
ELITR-ECA               20M            0.04306
Anuvaad                 18M            0.03870
StanfordNLP-NMT         16M            0.03400
ECB                     15M            0.03276
Wikipedia               13M            0.02765
SETIMES                 8.8M           0.01879
Tatoeba                 8.7M           0.01858
DOGC                    8.5M           0.01809
GlobalVoices            7.3M           0.01549
News-Commentary         6.4M           0.01373
PHP                     6.1M           0.01310
MBS                     5M             0.01074
SciELO                  3.8M           0.008058378
Finlex                  3.1M           0.006648648
infopankki              2.9M           0.006264904
JESC                    2.8M           0.005972389
fiskmo                  2.1M           0.004483473
ParIce                  2.1M           0.004477115
GoURMET                 2.1M           0.004474493
EUconst                 2.1M           0.004408942
OpenOffice              2M             0.004370323
EOPC                    2M             0.004300060
TED2013                 1.9M           0.004066437
EhuHac                  1.8M           0.003837601
pmindia                 1.7M           0.003621117
SUMMA                   1.6M           0.003361774
IITB                    1.6M           0.003321472
Books                   1.3M           0.002670083
ChuBiCo                 1.2M           0.002566286
CAPES                   1.2M           0.002471482
Joshua-IPC              1.1M           0.002357854
MIZAN                   1M             0.002181096
SCB_MT_EN_TH            988K           0.002109922
MDN_Web_Docs            874K           0.001867033
ECDC                    683K           0.001459242
Elhuyar                 642K           0.001371402
EiTB-ParCC              637K           0.001360375
TEP                     612K           0.001306796
KDEdoc                  610K           0.001303247
WMT-News                447K           0.000955168
KFTT                    440K           0.000940005
tico-19                 319K           0.000681881
tldr-pages              258K           0.000550368
memat                   155K           0.000330417
hrenWaC                 99K            0.000211366
TedTalks                86K            0.000184352
FFR                     82K            0.000175357
SPC                     68K            0.000144667
MontenegrinSubs         65K            0.000138864
OfisPublik              63K            0.000135405
XhosaNavy               50K            0.000106709
Bianet                  48K            0.000103540
WikiSource              33K            0.000071059
ALT                     18K            0.000038620
Salome                  9.4K           0.000020124
sardware                6.2K           0.000013130
ada83                   4.1K           0.000008800
RF                      1.2K           0.000002468
komi                    Not specified  Not specified
liv4ever                Not specified  Not specified
Mozilla-I10n            Not specified  Not specified
MPC1                    Not specified  Not specified
Nunavut_Hansard         Not specified  Not specified
Ubuntu                  Not specified  Not specified
WikiTitles              Not specified  Not specified
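
The percentage column can be roughly cross-checked against the total of 46,838,693,917 sentence pairs reported in the overview. The following is a minimal sketch, not code from the OPUS project: the helper names `parse_count` and `approx_share` are hypothetical, and because the sentence counts in the table are rounded (for example "13B"), the computed shares only approximate the listed percentages.

```python
# Total sentence pairs as reported in the overview of the collection.
TOTAL_SENTENCE_PAIRS = 46_838_693_917

# Multipliers for the abbreviated counts used in the table (K, M, B).
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '8.5B' or '883M' to a number.

    Hypothetical helper for illustration; it does not handle 'Not specified'.
    """
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def approx_share(text: str, total: int = TOTAL_SENTENCE_PAIRS) -> float:
    """Approximate percentage of the whole collection for a rounded count."""
    return 100.0 * parse_count(text) / total


if __name__ == "__main__":
    # A few rows copied from the table; results are approximate because
    # the displayed counts are rounded.
    for name, count in [("NLLB", "13B"), ("CCMatrix", "11B"), ("XLEnt", "883M")]:
        print(f"{name}: ~{approx_share(count):.2f}% of OPUS")
```

Small differences from the listed percentages are expected, since the table presumably derives them from exact sentence counts rather than the rounded figures shown.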

OUR CONTRIBUTORS

NLPL, University of Helsinki, CSC, HPLT, LetsMT!