ParaCrawl

42 languages, 43 bitexts
total number of files: 59,996
total number of tokens: 56.11G
total number of sentence fragments: 3.13G

Please acknowledge the ParaCrawl project at http://paracrawl.eu. This version is derived from the original release on the project website and has been adjusted for redistribution via the OPUS corpus collection. Please acknowledge OPUS as well for this service.

Download

Below you can download data files for all language pairs in different formats and with different kinds of annotation (if available). You can click on the various links as explained below. In addition to the files shown on this webpage, OPUS also provides pre-compiled word alignments, phrase tables, bilingual dictionaries, and frequency counts; these files can be found through the resources search form on the top-level OPUS website.

Release history:

	bg	ca	cs	da	de	el	en	es	et	eu	fi	fr	ga	gl	hr	hu	is	it	km	ko	lt		lv	mt	my	nb	ne	nl	nn	pl	ps	pt	ro	ru	si	sk	sl	so	sv	sw	tl	uk	zh
bg							view															bg																						bg
ca								view														ca																						ca
cs							view															cs																						cs
da							view															da																						da
de							view															de								view														de
el							view															el																						el
en	ces		ces	ces	ces	ces		view	view		view	view	view		view	view	view	view	view	view	view	en	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	en
es		ces					ces			view				view								es																						es
et							ces															et																						et
eu								ces														eu																						eu
fi							ces															fi																						fi
fr							ces															fr						view																fr
ga							ces															ga																						ga
gl								ces														gl																						gl
hr							ces															hr																						hr
hu							ces															hu																						hu
is							ces															is																						is
it							ces															it																						it
km							ces															km																						km
ko							ces															ko																						ko
lt							ces															lt																						lt
lv							ces															lv																						lv
mt							ces															mt																						mt
my							ces															my																						my
nb							ces															nb																						nb
ne							ces															ne																						ne
nl							ces					ces										nl																						nl
nn							ces															nn																						nn
pl					ces		ces															pl																						pl
ps							ces															ps																						ps
pt							ces															pt																						pt
ro							ces															ro																						ro
ru							ces															ru																						ru
si							ces															si																						si
sk							ces															sk																						sk
sl							ces															sl																						sl
so							ces															so																						so
sv							ces															sv																						sv
sw							ces															sw																						sw
tl							ces															tl																						tl
uk							ces															uk																						uk
zh							ces															zh																						zh

Statistics and TMX/Moses Downloads

Upper-right triangle: download translation memory files (TMX)
Bottom-left triangle: download plain text files (MOSES/GIZA++)
Language IDs, first row: monolingual plain text files (tokenized)
Language IDs, first column: monolingual plain text files (untokenized)
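
As an illustration of the plain text (MOSES/GIZA++) format mentioned above, the following minimal sketch assumes that a download consists of two text files with one sentence per line, where line i of the source file is aligned with line i of the target file. The function name `read_moses_pairs` and the file names in the comments are hypothetical and chosen only for this example.

```python
from itertools import zip_longest


def read_moses_pairs(src_lines, tgt_lines):
    """Yield (source, target) sentence pairs from two line-aligned streams.

    Assumption: both inputs are iterables of text lines (for example,
    open file objects) and line i of one is aligned with line i of the other.
    """
    missing = object()  # sentinel that marks the shorter stream's end
    merged = zip_longest(src_lines, tgt_lines, fillvalue=missing)
    for line_no, (src, tgt) in enumerate(merged, start=1):
        if src is missing or tgt is missing:
            raise ValueError(f"inputs differ in length at line {line_no}")
        yield src.rstrip("\n"), tgt.rstrip("\n")


# Usage with files on disk (hypothetical file names):
#   with open("corpus.de-en.de", encoding="utf-8") as f_src, \
#        open("corpus.de-en.en", encoding="utf-8") as f_tgt:
#       for de, en in read_moses_pairs(f_src, f_tgt):
#           ...
```

Reading the two files lazily in lockstep keeps memory use constant, which matters for pairs with hundreds of millions of sentence fragments.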


language	files	tokens	sentences	bg	ca	cs	da	de	el	en	es	et	eu	fi	fr	ga	gl	hr	hu	is	it	km	ko	lt	lv	mt	my	nb	ne	nl	nn	pl	ps	pt	ro	ru	si	sk	sl	so	sv	sw	tl	uk	zh
bg	266	249.1M	13.8M							13.3M
ca	345	458.7M	17.8M								17.2M
cs	1,013	738.6M	52.7M							50.6M
da	685	613.3M	35.6M							34.2M
de	5,586	4.7G	292.3M							278.3M																						0.9M
el	429	394.0M	22.1M							21.4M
en	29,475	27.4G	1.5G	13.3M		50.6M	34.2M	278.3M	21.4M		269.4M	8.5M		31.3M	216.6M	3.2M		3.2M	36.4M	3.0M	97.0M	65.1k	4.0M	13.2M	13.1M	1.2M	31.4k	19.3M	92.1k	89.1M	0.3M	40.1M	26.3k	84.9M	25.0M	5.4M	0.2M	22.9M	9.5M	14.9k	49.1M	0.1M	0.2M	14.0M	14.2M
es	5,838	6.0G	304.9M		17.2M					269.4M			3.3M				1.9M
et	171	123.6M	8.9M							8.5M
eu	67	56.9M	3.6M								3.3M
fi	627	410.7M	32.8M							31.3M
fr	4,387	5.0G	228.6M							216.7M																				2.7M
ga	65	72.3M	3.4M							3.2M
gl	38	47.8M	1.9M								1.9M
hr	65	81.0M	3.5M							3.2M
hu	729	538.6M	38.0M							36.4M
is	60	48.5M	3.1M							3.0M
it	1,940	2.0G	100.3M							97.0M
km	2	2.0M	68.2k							65.1k
ko	81	58.3M	4.1M							4.0M
lt	264	190.0M	13.8M							13.2M
lv	262	204.0M	13.6M							13.1M
mt	25	26.7M	1.3M							1.2M
my	1	0.9M	33.9k							31.4k
nb	386	350.0M	20.2M							19.3M
ne	2	4.0M	94.7k							92.1k
nl	1,837	1.6G	96.0M							89.1M					2.7M
nn	6	6.3M	0.3M							0.3M
pl	821	661.5M	42.6M					0.9M		40.1M
ps	1	0.9M	27.4k							26.3k
pt	1,699	1.6G	87.7M							84.9M
ro	501	476.3M	26.2M							25.0M
ru	108	98.4M	6.1M							5.4M
si	5	7.6M	0.2M							0.2M
sk	459	325.4M	23.7M							22.9M
sl	191	162.0M	9.9M							9.5M
so	1	0.5M	20.5k							14.9k
sv	983	755.6M	51.4M							49.1M
sw	3	3.8M	0.2M							0.1M
tl	5	7.8M	0.3M							0.2M
uk	283	718.8M	14.8M							14.1M
zh	284	98.9M	14.2M							14.2M
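
The counts in the table above are abbreviated with the suffixes k, M, and G. The sketch below converts them to integers, assuming the suffixes are decimal multipliers (10^3, 10^6, 10^9); that interpretation and the helper name `parse_count` are assumptions made for this example. Because the table values are rounded, the results are approximate.

```python
from decimal import Decimal

# Assumption: suffixes are decimal multipliers, not powers of 1024.
_MULTIPLIERS = {"k": 10**3, "M": 10**6, "G": 10**9}


def parse_count(text):
    """Convert an abbreviated count such as '13.3M' or '1,013' to an int."""
    text = text.strip().replace(",", "")
    if text and text[-1] in _MULTIPLIERS:
        # Decimal avoids binary floating-point rounding for values like 13.3
        return int(Decimal(text[:-1]) * _MULTIPLIERS[text[-1]])
    return int(Decimal(text))


# Parse the first four tab-separated fields of one table row.
row = "bg\t266\t249.1M\t13.8M"
language, files, tokens, sentences = row.split("\t")[:4]
stats = {
    "language": language,
    "files": parse_count(files),
    "tokens": parse_count(tokens),
    "sentences": parse_count(sentences),
}
```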

Note that TMX files only contain unique translation units and, therefore, the number of aligned units is smaller than for the distributions in Moses and XML format. Moses downloads include all non-empty alignment units including duplicates. Token counts for each language also include duplicate sentences and documents.
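
To make the difference between the two counts concrete, the following sketch removes duplicate sentence pairs from a list of aligned units while preserving their order. It only illustrates why a collection of unique translation units is smaller than one that keeps duplicates; it is not the procedure OPUS uses to build TMX files, and the helper name `unique_pairs` is hypothetical.

```python
def unique_pairs(pairs):
    """Return the aligned pairs with exact duplicates removed, in first-seen order."""
    seen = set()
    result = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# Toy data standing in for a Moses-style download that keeps duplicates.
moses_units = [
    ("Guten Morgen", "Good morning"),
    ("Danke", "Thank you"),
    ("Guten Morgen", "Good morning"),  # duplicate alignment unit
]
tmx_like_units = unique_pairs(moses_units)
```

Here `moses_units` holds three alignment units, while `tmx_like_units` holds two, mirroring the smaller TMX counts in the table.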

ParaCrawl v9
