ParaCrawl

39 languages, 40 bitexts
total number of files: 68,924
total number of tokens: 60.66G
total number of sentence fragments: 3.58G

Please acknowledge the ParaCrawl project at http://paracrawl.eu. This version is derived from the original release on their website, adjusted for redistribution via the OPUS corpus collection. Please acknowledge OPUS as well for this service.

Download

Below you can download data files for all language pairs in different formats and with different kinds of annotation (if available). You can click on the various links as explained below. In addition to the files shown on this webpage, OPUS also provides pre-compiled word alignments and phrase tables, bilingual dictionaries, and frequency counts; these files can be found through the resources search form on the top-level website of OPUS.

	bg	ca	cs	da	de	el	en	es	et	eu	fi	fr	ga	gl	hr	hu	is	it	km	ko		lt	lv	mt	my	ne	nl	no	pl	ps	pt	ro	ru	si	sk	sl	so	sv	sw	tl
bg							view														bg																				bg
ca								view													ca																				ca
cs							view														cs																				cs
da							view														da																				da
de							view														de								view												de
el							view														el																				el
en	ces		ces	ces	ces	ces		view	view		view	view	view		view	view	view	view	view	view	en	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	en
es		ces					ces			view				view							es																				es
et							ces														et																				et
eu								ces													eu																				eu
fi							ces														fi																				fi
fr							ces														fr						view														fr
ga							ces														ga																				ga
gl								ces													gl																				gl
hr							ces														hr																				hr
hu							ces														hu																				hu
is							ces														is																				is
it							ces														it																				it
km							ces														km																				km
ko							ces														ko																				ko
lt							ces														lt																				lt
lv							ces														lv																				lv
mt							ces														mt																				mt
my							ces														my																				my
ne							ces														ne																				ne
nl							ces					ces									nl																				nl
no							ces														no																				no
pl					ces		ces														pl																				pl
ps							ces														ps																				ps
pt							ces														pt																				pt
ro							ces														ro																				ro
ru							ces														ru																				ru
si							ces														si																				si
sk							ces														sk																				sk
sl							ces														sl																				sl
so							ces														so																				so
sv							ces														sv																				sv
sw							ces														sw																				sw
tl							ces														tl																				tl

Statistics and TMX/Moses Downloads

Upper-right triangle: download translation memory files (TMX)
Bottom-left triangle: download plain text files (MOSES/GIZA++)
Language IDs, first row: monolingual plain text files (tokenized)
Language IDs, first column: monolingual plain text files (untokenized)
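
The plain text (MOSES/GIZA++) downloads are line-aligned: line i of the source-language file corresponds to line i of the target-language file. A minimal Python sketch of reading such a pair of files is given here. The function name read_moses_pairs and the file names in the usage comment are hypothetical and only serve as illustration; check the names inside the archive you actually download.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_moses_pairs(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned streams.

    Raises ValueError if the two streams do not have the same number of
    lines, because the alignment would then be unreliable.
    """
    for src, tgt in zip_longest(src_lines, tgt_lines):
        if src is None or tgt is None:
            raise ValueError("source and target have different line counts")
        yield src.rstrip("\n"), tgt.rstrip("\n")


# Usage with real files (file names are assumptions, not confirmed by this page):
#
# with open("ParaCrawl.de-en.de", encoding="utf-8") as f_de, \
#      open("ParaCrawl.de-en.en", encoding="utf-8") as f_en:
#     for de, en in read_moses_pairs(f_de, f_en):
#         ...
```

Streaming the two files in parallel avoids loading a multi-gigabyte bitext into memory.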


language	files	tokens	sentences	bg	ca	cs	da	de	el	en	es	et	eu	fi	fr	ga	gl	hr	hu	is	it	km	ko	lt	lv	mt	my	ne	nl	no	pl	ps	pt	ro	ru	si	sk	sl	so	sv	sw	tl
bg	239	240.4M	12.9M							11.9M
ca	1,071	1.0G	54.9M								53.5M
cs	1,004	738.1M	52.4M							50.2M
da	839	690.2M	43.4M							41.9M
de	5,242	4.3G	274.9M							261.1M																					0.9M
el	692	592.1M	35.7M							34.6M
en	33,020	28.6G	1.7G	11.9M		50.2M	41.9M	261.1M	34.6M		396.5M	8.6M		15.3M	266.8M	2.0M		11.1M	12.7M	5.7M	120.1M	65.1k	4.0M	8.0M	8.2M	1.6M	31.4k	92.1k	98.5M	59.1M	45.4M	26.3k	102.6M	13.4M	5.4M	0.2M	13.0M	7.5M	14.9k	44.1M	0.1M	0.2M
es	9,300	8.2G	479.7M		53.5M					396.5M			2.4M				12.4M
et	172	130.5M	8.9M							8.6M
eu	49	36.2M	2.5M								2.4M
fi	307	226.2M	16.1M							15.3M
fr	5,391	5.9G	280.7M							266.9M																			2.7M
ga	40	48.7M	2.1M							2.0M
gl	249	152.1M	12.6M								12.4M
hr	222	175.9M	11.5M							11.1M
hu	254	208.2M	13.4M							12.7M
is	115	86.4M	6.0M							5.7M
it	2,403	2.3G	124.2M							120.1M
km	2	2.0M	68.2k							65.1k
ko	81	58.3M	4.1M							4.0M
lt	161	129.2M	8.4M							8.0M
lv	164	140.9M	8.6M							8.2M
mt	33	33.0M	1.7M							1.6M
my	1	0.9M	33.9k							31.4k
ne	2	4.0M	94.7k							92.1k
nl	2,024	1.7G	105.4M							98.5M					2.7M
no	1,182	867.6M	61.5M							59.1M
pl	927	738.9M	48.1M					0.9M		45.4M
ps	1	0.9M	27.4k							26.3k
pt	2,053	1.9G	106.1M							102.6M
ro	268	269.0M	14.4M							13.4M
ru	108	98.4M	6.1M							5.4M
si	5	7.6M	0.2M							0.2M
sk	261	210.8M	13.5M							13.0M
sl	151	141.9M	7.8M							7.5M
so	1	0.5M	20.5k							14.9k
sv	882	702.3M	46.0M							44.1M
sw	3	3.8M	0.2M							0.1M
tl	5	7.8M	0.3M							0.2M
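
The counts in the table use abbreviated suffixes (k, M, G) and thousands separators. If you want to compare or sum them programmatically, a small helper such as the following may be useful. It is a sketch that assumes the suffixes are decimal (k = 10^3, M = 10^6, G = 10^9), which this page does not state explicitly; the name parse_count is hypothetical. Because the table values are rounded, any sums computed from them are approximate.

```python
from decimal import Decimal

# Assumed decimal multipliers; the page does not define the suffixes.
_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9}


def parse_count(text: str) -> int:
    """Convert a table value such as '12.9M', '65.1k' or '1,071' to an int."""
    value = text.strip().replace(",", "")
    multiplier = 1
    if value and value[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[value[-1]]
        value = value[:-1]
    # Decimal avoids binary floating-point rounding on values like 12.9.
    return int(Decimal(value) * multiplier)
```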

Note that TMX files only contain unique translation units and, therefore, the number of aligned units is smaller than for the distributions in Moses and XML format. Moses downloads include all non-empty alignment units including duplicates. Token counts for each language also include duplicate sentences and documents.
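
To illustrate why the TMX counts are lower, the following Python sketch removes duplicate sentence pairs from a list of aligned units. It uses exact string matching on the (source, target) pair, which is an assumption: the page does not say how uniqueness is determined for the TMX files (for example, whether any normalization is applied). The name unique_pairs is hypothetical.

```python
from typing import Iterable, List, Tuple


def unique_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the pairs with exact duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# Three aligned units, one of which is repeated (as could occur in a Moses download).
moses_units = [("Hallo", "Hello"), ("Welt", "World"), ("Hallo", "Hello")]
tmx_like_units = unique_pairs(moses_units)
print(len(moses_units), len(tmx_like_units))  # prints "3 2"
```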

ParaCrawl v8
