HPLT

51 languages, 50 bitexts
total number of files: 810
total number of tokens: 17.53G
total number of sentence fragments: 841.78M

Please acknowledge the HPLT project at https://hplt-project.org/. This version is derived from the original release on their website and adjusted for redistribution via the OPUS corpus collection. Please acknowledge OPUS as well for this service.

Download

Below you can download data files for all language pairs in different formats and with different kinds of annotation (if available). Click the links as explained below. In addition to the files shown on this page, OPUS provides pre-compiled word alignments and phrase tables, bilingual dictionaries, and frequency counts; these can be found through the resources search form on the top-level OPUS website.

Release history:


Statistics and TMX/Moses Downloads

Upper-right triangle: download translation memory files (TMX)
Bottom-left triangle: download plain text files (MOSES/GIZA++)
Language IDs, first row: monolingual plain text files (tokenized)
Language IDs, first column: monolingual plain text files (untokenized)
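
As a rough illustration of working with the TMX downloads, the sketch below streams translation units out of a TMX file using only the Python standard library. The function name read_tmx_pairs and the tiny inline sample are hypothetical and not taken from the corpus; the only assumption about the data is the usual TMX layout of <tu> elements holding one <tuv xml:lang="..."> with a <seg> per language.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang, trg_lang):
    """Yield (source_text, target_text) for each <tu> containing both languages.

    `source` may be a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src_lang in segs and trg_lang in segs:
            yield segs[src_lang], segs[trg_lang]
        elem.clear()  # release the processed unit to keep memory use flat


# Simplified, made-up sample (a real TMX header carries more attributes).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
      <tuv xml:lang="af"><seg>Goeie more</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you</seg></tuv>
      <tuv xml:lang="af"><seg>Dankie</seg></tuv>
    </tu>
  </body>
</tmx>
"""

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE), "en", "af"))
```

Streaming with iterparse rather than loading the whole tree is a design choice for large files; for a small sample either approach would do.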


language	files	tokens	sentences	af	ar	az	be	bg	bn	bs	ca	cy	en	eo	et	eu	fa	fi	ga	gl	gu	he	hi	hr	is	ja	kk	kn	ko	lt	lv	mk	ml	mr	ms	mt	nb	ne	nn	si	sk	sl	sq	sr	sw	ta	te	th	tr	uk	ur	uz	vi	xh
af	4	95.7M	4.3M										4.0M
ar	18	481.2M	17.7M										17.5M
az	4	62.3M	3.5M										3.2M
be	4	60.9M	3.3M										3.1M
bg	23	501.5M	24.4M										22.7M
bn	3	106.0M	2.3M										2.3M
bs	5	108.8M	4.9M										4.6M
ca	14	346.7M	13.9M										13.1M
cy	4	93.1M	4.1M										3.9M
en	405	8.7G	429.7M	4.0M	17.5M	3.2M	3.1M	22.7M	2.3M	4.6M	13.1M	3.9M		1.5M	8.8M	1.5M	3.4M	29.1M	2.7M	2.8M	0.7M	8.7M	9.9M	14.3M	2.7M	18.9M	1.9M	0.7M	18.4M	12.9M	11.3M	4.0M	0.5M	0.7M	8.4M	1.5M	22.9M	0.3M	0.6M	0.3M	20.1M	10.3M	4.2M	5.3M	2.0M	1.1M	0.9M	4.1M	21.6M	25.1M	1.4M	1.2M	19.2M	0.4M
eo	2	37.9M	1.7M										1.5M
et	9	160.0M	9.4M										8.8M
eu	2	30.5M	1.7M										1.5M
fa	4	107.7M	3.5M										3.4M
fi	30	460.4M	31.5M										29.1M
ga	3	66.3M	2.8M										2.7M
gl	3	67.6M	3.0M										2.8M
gu	1	29.5M	0.7M										0.7M
he	9	198.1M	8.8M										8.7M
hi	10	422.8M	10.0M										9.9M
hr	15	292.8M	15.4M										14.3M
is	3	53.9M	2.9M										2.7M
ja	19	550.9M	19.0M										18.9M
kk	2	36.9M	2.1M										1.9M
kn	1	28.7M	0.7M										0.7M
ko	19	368.4M	19.0M										18.4M
lt	13	248.5M	13.9M										12.9M
lv	12	222.4M	12.0M										11.3M
mk	4	91.2M	4.2M										4.0M
ml	1	23.3M	0.6M										0.5M
mr	1	28.8M	0.7M										0.7M
ms	9	173.6M	8.9M										8.4M
mt	2	53.9M	1.6M										1.5M
nb	23	458.4M	24.4M										22.9M
ne	1	8.4M	0.3M										0.3M
nn	1	12.4M	0.6M										0.6M
si	1	6.8M	0.3M										0.3M
sk	21	388.3M	21.4M										20.1M
sl	11	222.3M	11.0M										10.3M
sq	5	108.2M	4.4M										4.2M
sr	6	106.5M	5.6M										5.3M
sw	2	46.3M	2.2M										2.0M
ta	2	54.6M	1.2M										1.1M
te	1	28.4M	0.9M										0.9M
th	5	148.5M	4.2M										4.1M
tr	22	485.1M	31.8M										21.6M
uk	26	490.5M	27.2M										25.1M
ur	2	52.4M	1.4M										1.4M
uz	2	23.6M	1.2M										1.2M
vi	20	564.9M	20.7M										19.2M
xh	1	8.1M	0.4M										0.4M
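
The table above abbreviates counts with M and G suffixes. A minimal sketch for turning one tab-separated row into numbers follows; it assumes the suffixes are decimal (M = 10^6, G = 10^9), which the page does not state, and the helper names parse_count and parse_row are hypothetical.

```python
from decimal import Decimal

# Assumption: suffixes are decimal multipliers, not binary ones.
SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9}


def parse_count(text):
    """Convert strings such as '4.0M' or '8.7G' to an integer count."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        # Decimal avoids floating-point rounding in the multiplication.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(Decimal(text))


def parse_row(header, line):
    """Parse one table row, using the header row to label the pair columns."""
    names = header.rstrip("\n").split("\t")
    fields = line.rstrip("\n").split("\t")
    language, files, tokens, sentences = fields[:4]
    # Remaining cells are aligned-unit counts per partner language; most are empty.
    pairs = {
        name: parse_count(cell)
        for name, cell in zip(names[4:], fields[4:])
        if cell.strip()
    }
    return {
        "language": language,
        "files": int(files),
        "tokens": parse_count(tokens),
        "sentences": parse_count(sentences),
        "pairs": pairs,
    }


# Shortened header (three partner columns only) with the 'af' row values.
header = "language\tfiles\ttokens\tsentences\taf\tar\ten"
row = parse_row(header, "af\t4\t95.7M\t4.3M\t\t\t4.0M")
```

Because the displayed figures are rounded to one decimal place, sums computed this way are approximate and need not match the totals at the top of the page exactly.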

Note that TMX files only contain unique translation units and, therefore, the number of aligned units is smaller than for the distributions in Moses and XML format. Moses downloads include all non-empty alignment units including duplicates. Token counts for each language also include duplicate sentences and documents.
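
The difference between the two counts can be sketched as follows. This is an illustration under stated assumptions, not the procedure OPUS uses: it treats a Moses-style bitext as two line-aligned sequences, takes "non-empty" to mean that both sides contain text, and treats two units as duplicates when both sides match exactly after stripping whitespace. The function name count_units and the sample lines are made up.

```python
def count_units(src_lines, trg_lines):
    """Return (all_non_empty_units, unique_units) for two line-aligned sequences."""
    pairs = [(s.strip(), t.strip()) for s, t in zip(src_lines, trg_lines)]
    # Assumption: a unit counts as non-empty only if both sides have text.
    non_empty = [pair for pair in pairs if pair[0] and pair[1]]
    return len(non_empty), len(set(non_empty))


# Made-up example: one exact duplicate and one unit with an empty source side.
src = ["Good morning\n", "Good morning\n", "Thank you\n", "\n"]
trg = ["Goeie more\n", "Goeie more\n", "Dankie\n", "Dankie\n"]

moses_style, tmx_style = count_units(src, trg)
```

Here moses_style counts the three non-empty units including the duplicate, while tmx_style counts the two distinct ones, which mirrors why the TMX figures are smaller.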

HPLT v2
