Find the corpus you are looking for

Here you can find the corpora listed by name. The ELRC and ELRA links take you to their entire collections.

ALT 20k Myanmar-English parallel sentences

Anuvaad links for popular Indian languages

Bianet Translated Turkish articles (tr, ku, en)

Books A collection of translated literature

CAPES Thesis and dissertation abstracts

CCAligned Parallel documents from Common Crawl

CCMatrix Parallel sentences from Common Crawl

ChuBiCo resources for the Chuvash language

DGT A collection of EU TMs provided by the JRC

DOGC Documents from the Catalan Government

ECB European Central Bank corpus

ECDC European Centre for Disease Prevention and Control corpus

ELITR-ECA European Court of Auditors documents

ELRA Collection

ELRC Collection

EMEA European Medicines Agency documents

EUbookshop documents from the EU bookshop

EUconst The European constitution

EhuHac Hizkuntzen Arteko Corpusa

EiTB-ParCC Parallel Corpus of Comparable News

Elhuyar foundation Elhuyar corpus

EuroPat Parallel corpus of patents

Europarl European Parliament Proceedings

FFR Fon and French sentences

Finlex Legislative and other judicial information of Finland

GNOME GNOME localization files

GlobalVoices News stories in various languages

GoURMET Parallel data from web crawls

IITB IIT Bombay English-Hindi corpus

JESC Japanese-English Subtitle Corpus

JParaCrawl English-Japanese parallel corpus

JRC-Acquis legislative EU texts

Joshua-IPC Indian-language corpus from Wikipedia pages

KDE4 KDE4 localization files (v.2)

KDEdoc the KDE manual corpus

KFTT Kyoto Free Translation Task corpus

LinguaTools-WikiTitles bilingual titles of Wikipedia articles

MBS Belgisch Staatsblad corpus

MDN_Web_Docs MDN web docs

MIZAN A large Persian-English corpus

MontenegrinSubs Montenegrin movie subtitles

MultiCCAligned Pivot-based Bitexts from CCAligned

MultiParaCrawl Non-English Bitexts from ParaCrawl

MultiUN Translated UN documents

MT-560 A Many-to-English MT Dataset

NeuLab-TedTalks TED talk subtitles

News-Commentary News Commentaries

NLLB based on Meta AI metadata

OfisPublik Breton - French parallel texts

OpenOffice the OpenOffice.org corpus

OpenSubtitles translated subtitles

OPUS-100 English-centric multilingual corpus

PHP the PHP manual corpus

ParIce English-Icelandic parallel corpus

ParaCrawl Parallel corpora from Web Crawls

QED subtitles for educational videos and lectures

RF Declarations of Government Policy by the Swedish Government

SETIMES A parallel corpus of the Balkan languages

Samanantar Largest Indic corpora collection

SPC Stockholm Parallel Corpora

SUMMA corpus from SUMMA project

Salome translations of Oscar Wilde’s Salomé

SciELO Articles from SciELO

StanfordNLP-NMT StanfordNLP-NMT

TED2013 TED talk subtitles

TED2020 a crawl of nearly 4000 TED/TEDx transcripts

TEP The Tehran English-Persian subtitle corpus

Tanzil A collection of Quran translations

Tatoeba A DB of translated sentences

TedTalks Croatian-English parallel corpus

TildeMODEL Multilingual Open Data for European Languages

UNPC The United Nations Parallel Corpus

Ubuntu Ubuntu localization files

WMT-News A parallel corpus of News Test Sets

WikiMatrix Parallel sentences extracted from Wikipedia

WikiSource small en-sv sample only

WikiTitles parallel Wikipedia titles

Wikipedia translated sentences from Wikipedia

XLEnt CCAligned, CCMatrix, and WikiMatrix parallel sentences

XhosaNavy South African Navy parallel corpus

ada83 Ada 83 manuals

bible-uedin Collection of Bible translations

fiskmo Data from the fiskmö project

giga-fren French-English Giga-Word Corpus

hrenWaC Croatian-English Parallel Web Corpus

infopankki infopankki.fi via the Open Data API

liv4ever Livonian 4-lingual parallel corpus

memat Xhosa/English parallel data

pmindia parallel corpus containing 13 Indian languages

sardware the sardware corpus

tico-19 Translation Initiative for COVID-19

wikimedia wikimedia article translation system

HPLT HPLT web-crawled parallel sentences

MultiHPLT HPLT web-crawled parallel sentences