MDN_Web_Docs

2023-09-25

NLLB

2023-09-07

Liv4ever and ELITR-ECA

2021-12-08

CCMatrix

2021-06-28

Updated: ParaCrawl and MultiParaCrawl

2021-06-11

New: MT560 dataset

2021-04-02

GoURMET and MIZAN

2020-11-27

EuroPat and tico-19

2020-10-31

OPUS-100 corpus

2020-06-30

ELRC public

2020-05-22

MultiParaCrawl

2019-10-16

Infopankki v1

2019-10-14

New corpus: memat (Xhosa/English)

2018-10-06

New corpora: ParaCrawl, XhosaNavy

2018-02-15

New version: OpenSubtitles2018

2017-11-06

An overview of the OPUS collection

1,210 corpora

45,945,946,108 total sentence pairs

744 languages available

This map displays 10 corpora, which make up a total of 93.40% of the entire OPUS collection.

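| Corpus | Sentences | % of OPUS |
| --- | --- | --- |
| NLLB | 13B | 28.31 |
| CCMatrix | 11B | 23.64 |
| OpenSubtitles | 8.5B | 18.53 |
| MultiCCAligned | 2.2B | 4.87840 |
| ParaCrawl | 1.5B | 3.26229 |
| DGT | 1.1B | 2.37845 |
| XLEnt | 883M | 1.92148 |
| MultiParaCrawl | 789M | 1.71653 |
| LinguaTools-WikiTitles | 487M | 1.06082 |
| CCAligned | 439M | 0.95442 |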

OUR CONTRIBUTORS

NLPL, University of Helsinki, CSC, HPLT, LetsMT