Here you will find the corpora listed by name. The ELRC and ELRA links will take you to their entire collections.
ALT 20k Myanmar-English parallel sentences
Anuvaad links for popular Indian languages
Bianet Translated Turkish articles (tr, ku, en)
Books A collection of translated literature
CAPES Thesis and dissertation abstracts
CCAligned Parallel documents from Common Crawl
CCMatrix Parallel sentences from Common Crawl
ChuBiCo resources for the Chuvash language
DGT A collection of EU TMs provided by the JRC
DOGC Documents from the Catalan Government
ECB European Central Bank corpus
ECDC European Centre for Disease Prevention and Control corpus
ELITR-ECA European Court of Auditors documents
ELRA Collection
ELRC Collection
EMEA European Medicines Agency documents
EUbookshop documents from the EU bookshop
EUconst The European constitution
EhuHac Hizkuntzen Arteko Corpusa
EiTB-ParCC Parallel Corpus of Comparable News
Elhuyar foundation Elhuyar corpus
EuroPat Parallel corpus of patents
Europarl European Parliament Proceedings
FFR Fon and French sentences
Finlex Legislative and other judicial information of Finland
GNOME GNOME localization files
GlobalVoices News stories in various languages
GoURMET Parallel data from web crawls
IITB IIT Bombay English-Hindi corpus
JESC Japanese-English Subtitle Corpus
JParaCrawl English-Japanese parallel corpus
JRC-Acquis legislative EU texts
Joshua-IPC Indian-language corpus from Wikipedia pages
KDE4 KDE4 localization files (v.2)
KDEdoc the KDE manual corpus
KFTT Kyoto Free Translation Task corpus
LinguaTools-WikiTitles bilingual titles of Wikipedia articles
MBS Belgisch Staatsblad corpus
MDN_Web_Docs MDN web docs
MIZAN A large Persian-English corpus
MontenegrinSubs Montenegrin movie subtitles
MultiCCAligned Pivot-based Bitexts from CCAligned
MultiParaCrawl Non-English Bitexts from ParaCrawl
MultiUN Translated UN documents
MT-560 A Many-to-English MT Dataset
NeuLab-TedTalks TED talk subtitles
News-Commentary News Commentaries
NLLB based on Meta AI metadata
OfisPublik Breton - French parallel texts
OpenOffice the OpenOffice.org corpus
OpenSubtitles translated subtitles
OPUS-100 English-centric multilingual corpus
PHP the PHP manual corpus
ParIce English-Icelandic parallel corpus
ParaCrawl Parallel corpora from Web Crawls
QED subtitles for educational videos and lectures
RF Declarations of Government Policy by the Swedish Government
SETIMES A parallel corpus of the Balkan languages
Samanantar Largest Indic corpora collection
SPC Stockholm Parallel Corpora
SUMMA corpus from SUMMA project
Salome translations of Oscar Wilde’s Salomé
SciELO Articles from SciELO
StanfordNLP-NMT StanfordNLP-NMT
TED2013 TED talk subtitles
TED2020 a crawl of nearly 4000 TED/TEDX transcripts
TEP The Tehran English-Persian subtitle corpus
Tanzil A collection of Quran translations
Tatoeba A DB of translated sentences
TedTalks Croatian-English parallel corpus
TildeMODEL Multilingual Open Data for European Languages
UNPC The United Nations Parallel Corpus
Ubuntu Ubuntu localization files
WMT-News A parallel corpus of News Test Sets
WikiMatrix Parallel sentences extracted from Wikipedia
WikiSource small en-sv sample only
WikiTitles parallel Wikipedia titles
Wikipedia translated sentences from Wikipedia
XLEnt CCAligned, CCMatrix, and WikiMatrix parallel sentences
XhosaNavy South African Navy parallel corpus
ada83 Ada 83 manuals
bible-uedin Collection of Bible translations
fiskmo Data from the fiskmö project
giga-fren French-English Giga-Word Corpus
hrenWaC Croatian-English Parallel Web Corpus
infopankki infopankki.fi via the Open Data API
liv4ever Livonian 4-lingual parallel corpus
memat Xhosa/English parallel data
pmindia parallel corpus containing 13 Indian languages
sardware the sardware corpus
tico-19 Translation Initiative for COVID-19
wikimedia wikimedia article translation system
HPLT HPLT web crawled parallel sentences
MultiHPLT HPLT web crawled parallel sentences