ParaCrawl

40 languages, 41 bitexts
total number of files: 20,995
total number of tokens: 21.40G
total number of sentence fragments: 1.12G

Please acknowledge the ParaCrawl project at http://paracrawl.eu. This version is derived from the original release on their website, adjusted for redistribution via the OPUS corpus collection. Please acknowledge OPUS as well for this service.

Download

Below you can download data files for all language pairs in different formats and with different kinds of annotation (if available). You can click on the various links as explained below. In addition to the files shown on this webpage, OPUS also provides pre-compiled word alignments, phrase tables, bilingual dictionaries, and frequency counts; these files can be found through the resources search form on the top-level OPUS website.

Release history:

	bg	ca	cs	da	de	el	en	es	et	eu	fi	fr	ga	gl	ha	hr	hu	ig	is	it		km	lt	lv	mt	my	nb	ne	nl	nn	pl	pt	ro	ru	si	sk	sl	so	sv	sw	tl
bg							view														bg																					bg
ca								view													ca																					ca
cs							view														cs																					cs
da							view														da																					da
de							view														de										view											de
el							view														el																					el
en	ces		ces	ces	ces	ces		view	view		view	view	view		view	view	view	view	view	view	en	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	en
es		ces					ces			view				view							es																					es
et							ces														et																					et
eu								ces													eu																					eu
fi							ces														fi																					fi
fr							ces														fr								view													fr
ga							ces														ga																					ga
gl								ces													gl																					gl
ha							ces														ha																					ha
hr							ces														hr																					hr
hu							ces														hu																					hu
ig							ces														ig																					ig
is							ces														is																					is
it							ces														it																					it
km							ces														km																					km
lt							ces														lt																					lt
lv							ces														lv																					lv
mt							ces														mt																					mt
my							ces														my																					my
nb							ces														nb																					nb
ne							ces														ne																					ne
nl							ces					ces									nl																					nl
nn							ces														nn																					nn
pl					ces		ces														pl																					pl
pt							ces														pt																					pt
ro							ces														ro																					ro
ru							ces														ru																					ru
si							ces														si																					si
sk							ces														sk																					sk
sl							ces														sl																					sl
so							ces														so																					so
sv							ces														sv																					sv
sw							ces														sw																					sw
tl							ces														tl																					tl

Statistics and TMX/Moses Downloads

Upper-right triangle: download translation memory files (TMX)
Bottom-left triangle: download plain text files (MOSES/GIZA++)
Language IDs, first row: monolingual plain text files (tokenized)
Language IDs, first column: monolingual plain text files (untokenized)
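
For readers who want to work with a TMX download programmatically, the following is a minimal sketch of streaming translation units out of a TMX file with the Python standard library. It assumes the usual TMX layout (`tu` elements containing one `tuv` per language, each with a `seg` child and an `xml:lang` attribute); the function name `iter_tmx_pairs` and the in-memory sample are illustrative only and are not part of OPUS or ParaCrawl.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each <tu> in a TMX stream.

    `source` is a filename or a binary file object (hypothetical helper).
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release memory held by the finished unit


# Tiny in-memory sample standing in for a downloaded file (made-up content).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu><tuv xml:lang="en"><seg>Hello</seg></tuv><tuv xml:lang="de"><seg>Hallo</seg></tuv></tu>
  </body>
</tmx>
"""

pairs = list(iter_tmx_pairs(io.BytesIO(SAMPLE), "en", "de"))
print(pairs)
```

Streaming with `iterparse` rather than loading the whole tree is a design choice that may matter for the larger language pairs, where a single file can hold tens of millions of units.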


language	files	tokens	sentences	bg	ca	cs	da	de	el	en	es	et	eu	fi	fr	ga	gl	ha	hr	hu	ig	is	it	km	lt	lv	mt	my	nb	ne	nl	nn	pl	pt	ro	ru	si	sk	sl	so	sv	sw	tl
bg	130	132.2M	7.0M							6.5M
ca	138	145.7M	7.1M								6.9M
cs	282	240.2M	15.1M							14.1M
da	209	204.0M	11.1M							10.4M
de	1,672	1.6G	90.8M							82.6M																							0.9M
el	189	197.8M	10.0M							9.4M
en	10,263	10.3G	546.7M	6.5M		14.1M	10.4M	82.6M	9.4M		78.7M	3.2M		7.3M	104.4M	2.7M		19.7k	7.0M	6.7M	28.8k	2.4M	40.8M	65.1k	4.4M	4.0M	0.9M	31.4k	17.6M	92.1k	31.3M	0.3M	13.7M	31.5M	6.2M	5.4M	0.2M	4.9M	3.7M	14.9k	11.6M	0.1M	0.2M
es	1,723	1.9G	90.9M		6.9M					78.7M			0.5M				1.2M
et	64	53.7M	3.4M							3.2M
eu	11	9.1M	0.6M								0.5M
fi	146	112.5M	7.8M							7.3M
fr	2,142	2.6G	113.6M							104.4M																					2.7M
ga	54	69.4M	3.0M							2.7M
gl	25	19.6M	1.2M								1.2M
ha	1	0.5M	22.2k							19.7k
hr	140	122.6M	7.7M							7.0M
hu	134	113.2M	7.1M							6.7M
ig	1	0.7M	30.2k							28.8k
is	48	39.5M	2.6M							2.4M
it	816	904.7M	43.3M							40.8M
km	2	2.6M	68.2k							65.1k
lt	89	75.2M	4.7M							4.4M
lv	80	69.1M	4.2M							4.0M
mt	18	16.6M	0.9M							0.9M
my	1	1.3M	33.9k							31.4k
nb	352	321.1M	20.0M							17.6M
ne	2	4.4M	94.7k							92.1k
nl	680	591.3M	36.1M							31.3M					2.7M
nn	7	4.0M	0.3M							0.3M
pl	294	268.1M	15.6M					0.9M		13.7M
pt	630	643.2M	33.1M							31.5M
ro	124	126.4M	6.6M							6.2M
ru	108	98.6M	6.1M							5.4M
si	5	8.4M	0.2M							0.2M
sk	98	82.8M	5.2M							4.9M
sl	75	71.4M	4.0M							3.7M
so	1	0.5M	20.5k							14.9k
sv	233	204.1M	12.5M							11.6M
sw	3	3.7M	0.2M							0.1M
tl	5	7.8M	0.3M							0.2M
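
The counts in the table above are abbreviated (for example `6.5M`, `19.7k`, `1,672`). As a small sketch, such cells can be turned into integers as follows. It assumes that `k`, `M` and `G` stand for 10^3, 10^6 and 10^9, which the page does not state explicitly; `parse_count` is a hypothetical helper name. Since the published figures are rounded, the results are approximations of the true counts.

```python
from decimal import Decimal

# Assumed meaning of the suffixes used in the table: k = 10^3, M = 10^6, G = 10^9.
SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9}


def parse_count(text):
    """Convert a table cell such as '6.5M', '19.7k' or '1,672' to an integer."""
    text = text.strip().replace(",", "")
    factor = 1
    if text and text[-1] in SUFFIXES:
        factor = SUFFIXES[text[-1]]
        text = text[:-1]
    # Decimal avoids binary floating-point rounding in products like 19.7 * 1000.
    return int(Decimal(text) * factor)


# Cells taken from the "bg" row: files, tokens, sentences.
row = [parse_count(cell) for cell in ("130", "132.2M", "7.0M")]
print(row)
```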

Note that TMX files only contain unique translation units and, therefore, the number of aligned units is smaller than for the distributions in Moses and XML format. Moses downloads include all non-empty alignment units including duplicates. Token counts for each language also include duplicate sentences and documents.
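
To illustrate why the two counts differ, here is a minimal sketch that reads the two sides of a Moses-style download in parallel (line i of one file aligned with line i of the other) and reports both the total and the unique number of units. Treating a pair as empty when either side is empty, and treating two pairs as duplicates only when both sides match exactly, are assumptions made for this example; the way OPUS actually builds its TMX files may differ. The name `count_units` and the sample sentences are invented.

```python
import io


def count_units(src_lines, tgt_lines):
    """Return (total, unique) counts of non-empty aligned sentence pairs.

    The two iterables are read in parallel: line i of one side is the
    translation of line i of the other side.
    """
    total = 0
    seen = set()
    for src, tgt in zip(src_lines, tgt_lines):
        pair = (src.rstrip("\n"), tgt.rstrip("\n"))
        if not pair[0] or not pair[1]:
            continue  # assumption: skip pairs with an empty side
        total += 1
        seen.add(pair)
    return total, len(seen)


# In-memory stand-ins for the two plain text files of one language pair.
en_side = io.StringIO("Hello\nGood morning\nHello\n")
de_side = io.StringIO("Hallo\nGuten Morgen\nHallo\n")

total, unique = count_units(en_side, de_side)
print(total, unique)
```

With real downloads, the two `StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8")`. Holding every pair in a set is only practical for the smaller language pairs; larger ones would need hashing or an external sort.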

ParaCrawl v7.1
