HPLT and MultiHPLT v2 released

2025-01-25

MDN_Web_Docs

2023-09-25

NLLB

2023-09-07

Liv4ever and ELITR-ECA

2021-12-08

CCMatrix

2021-06-28

Updated: ParaCrawl and MultiParaCrawl

2021-06-11

New: MT560 dataset

2021-04-02

GoURMET and MIZAN

2020-11-27

EuroPat and tico-19

2020-10-31

OPUS-100 corpus

2020-06-30

ELRC public

2020-05-22

MultiParaCrawl

2019-10-16

Infopankki v1

2019-10-14

New corpus: memat (Xhosa/English)

2018-10-06

New corpora: ParaCrawl, XhosaNavy

2018-02-15

New version: OpenSubtitles2018

2017-11-06

An overview of the OPUS collection

1,213 corpora

58,859,280,467 total sentence pairs

1,005 languages available

This table displays 101 corpora, which make up a total of 94.85% of the entire OPUS collection.

Corpus                  Sentences      % of OPUS
OpenSubtitles           20B            34.34
NLLB                    13B            22.10
CCMatrix                11B            18.46
MultiCCAligned          2.2B           3.80811
ParaCrawl               1.5B           2.54657
DGT                     1.1B           1.85664
MultiHPLT               897M           1.52481
XLEnt                   883M           1.49992
MultiParaCrawl          789M           1.33993
LinguaTools-WikiTitles  487M           0.82808
CCAligned               439M           0.74502
HPLT                    381M           0.64682
UNPC                    323M           0.54940
EUbookshop              279M           0.47403
EMEA                    243M           0.41277
GNOME                   225M           0.38180
KDE4                    201M           0.34128
Europarl                186M           0.31622
MultiUN                 159M           0.27097
JRC-Acquis              147M           0.25021
TED2020                 143M           0.24304
TildeMODEL              128M           0.21744
WikiMatrix              127M           0.21608
QED                     122M           0.20761
ParaCrawl-Bonus         102M           0.17341
EuroPat                 89M            0.15138
bible-uedin             85M            0.14499
NeuLab-TedTalks         74M            0.12590
Samanantar              50M            0.08456
Tanzil                  42M            0.07178
MultiMaCoCu             28M            0.04675
JParaCrawl              26M            0.04388
MaCoCu                  26M            0.04338
wikimedia               23M            0.03894
giga-fren               23M            0.03826
ELITR-ECA               20M            0.03427
Anuvaad                 18M            0.03079
StanfordNLP-NMT         16M            0.02706
ECB                     15M            0.02607
Wikipedia               13M            0.02200
SETIMES                 8.8M           0.01495
Tatoeba                 8.7M           0.01479
DOGC                    8.5M           0.01439
WikiTitles              8M             0.01364
GlobalVoices            7.3M           0.01233
News-Commentary         6.4M           0.01092
PHP                     6.1M           0.01042
MBS                     5M             0.008543660
SciELO                  3.8M           0.006412649
Finlex                  3.1M           0.005290822
infopankki              2.9M           0.004985448
JESC                    2.8M           0.004752673
fiskmo                  2.1M           0.003567832
ParIce                  2.1M           0.003562772
GoURMET                 2.1M           0.003560686
EUconst                 2.1M           0.003508522
OpenOffice              2M             0.003477790
EOPC                    2M             0.003421877
TED2013                 1.9M           0.003235965
EhuHac                  1.8M           0.003053863
SUMMA                   1.6M           0.002675213
IITB                    1.6M           0.002643141
ALT                     1.4M           0.002396788
Books                   1.3M           0.002124783
ChuBiCo                 1.2M           0.002042184
CAPES                   1.2M           0.001966742
Joshua-IPC              1.1M           0.001876319
MIZAN                   1M             0.001735660
SCB_MT_EN_TH            988K           0.001679022
MDN_Web_Docs            874K           0.001485737
ECDC                    683K           0.001161227
Elhuyar                 642K           0.001091327
EiTB-ParCC              637K           0.001082551
TEP                     612K           0.001039914
KDEdoc                  610K           0.001037090
WMT-News                447K           0.000760098
KFTT                    440K           0.000748032
pmindia                 422K           0.000717256
tldr-pages              370K           0.000627821
tico-19                 319K           0.000542623
memat                   155K           0.000262937
hrenWaC                 99K            0.000168199
TedTalks                86K            0.000146702
FFR                     82K            0.000139545
SPC                     68K            0.000115122
MontenegrinSubs         65K            0.000110504
OfisPublik              63K            0.000107752
XhosaNavy               50K            0.000084916
Bianet                  48K            0.000082395
WikiSource              33K            0.000056547
Salome                  9.4K           0.000016014
sardware                6.2K           0.000010449
ada83                   4.1K           0.000007003
InterdialectCorpus      2.4K           0.000004157
RF                      1.2K           0.000001964
komi                    Not specified  Not specified
liv4ever                Not specified  Not specified
Mozilla-I10n            Not specified  Not specified
Nunavut_Hansard         Not specified  Not specified
translatewiki           Not specified  Not specified
Ubuntu                  Not specified  Not specified

OUR CONTRIBUTORS

NLPL, University of Helsinki, CSC, HPLT, LetsMT