Wikipedia

62 languages, 80 bitexts
total number of files: 448
total number of tokens: 7.5G
total number of sentences: 447.7M

Release history:

Download

Below you can download data files for all language pairs in different formats and with different kinds of annotation (if available). The links in each table are explained next to it.

To rebuild the bitexts from the first table, you need to download both the monolingual corpus files and the standoff alignment files that link them:

Links on the language IDs of the top row and first column: zip-files of untokenized monolingual XML files
Links on the language IDs of the bottom row and last column: zip-files of tokenized monolingual XML files (if they exist)
Links in the table: Sentence alignment files in XCES Align format (standoff annotation)
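Since the alignment files are standoff annotation, they hold only sentence IDs that point into the monolingual XML files. The sketch below reads such links with the Python standard library. It assumes the usual cesAlign layout (linkGrp elements with fromDoc/toDoc attributes, and link elements whose xtargets attribute holds source and target IDs separated by a semicolon). The sample string, document paths and the function name parse_links are made up for illustration and are not taken from the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed cesAlign layout (not real corpus data).
SAMPLE = """<cesAlign version="1.0">
<linkGrp targType="s" fromDoc="en/doc1.xml.gz" toDoc="uk/doc1.xml.gz">
<link xtargets="s1;s1" />
<link xtargets="s2 s3;s2" />
<link xtargets=";s3" />
</linkGrp>
</cesAlign>"""


def parse_links(xml_text):
    """Return (from_doc, to_doc, source_ids, target_ids) for every link."""
    root = ET.fromstring(xml_text)
    links = []
    for group in root.iter("linkGrp"):
        from_doc = group.get("fromDoc")
        to_doc = group.get("toDoc")
        for link in group.iter("link"):
            # "s2 s3;s2" means source sentences s2 and s3 align to target s2;
            # an empty side means the sentence has no counterpart.
            src, _, tgt = link.get("xtargets", "").partition(";")
            links.append((from_doc, to_doc, src.split(), tgt.split()))
    return links


links = parse_links(SAMPLE)
for from_doc, to_doc, src_ids, tgt_ids in links:
    print(from_doc, src_ids, "<->", to_doc, tgt_ids)
```

To get the sentence text, the IDs would still have to be looked up in the monolingual XML files named by fromDoc and toDoc.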

	af	ar	az	bg	bn	br	bs	ca	ceb	cs	cy	da	de	el	en	eo	et	eu	fi	fr	fy	ga	gl	he	hr		hu	hy	ia	id	ilo	it	lb	lt	lv	mk	ml	ms	mt	nb	nds	nn	pl	pt	ro	ru	sk	sq	sr_Cyrl	sr_Latn	sv	sw		ta	th	tl	tr	uk	ur	uz	war	zh	zh_Hans	zh_Hant
bs															1.1M											bs																											bs												bs
ceb															15.3M											ceb																											ceb												ceb
cs																										cs																											cs					7.4M							cs
en	3.0M	3.5M	19.4k	3.5M	3.6M	3.1M	3.0M	3.1M	3.0M	3.0M	3.1M	3.0M	2.8M	3.1M		3.0M	2.9M	3.0M	3.0M	3.2M	3.1M	3.4M	2.5M	3.5M	2.9M	en	2.8M	3.3M	36.3k	2.7M	3.2M	3.0M	2.9M	3.1M	2.2M	3.2M	3.8M	2.8M	3.2M	2.8M	0.2M	2.7M	3.2M	3.1M		2.2M		2.3M	3.0M	2.7M	3.2M	3.3M	en	3.5M	3.3M	3.1M	2.3M	2.2M	3.1M	2.8M	3.1M	3.2M	3.3M	3.2M	en
hr															2.7M											hr																											hr												hr
hu																										hu																											hu					0.9M							hu
id															4.9M											id																											id												id
nb															5.7M											nb																											nb												nb
pl																										pl																											pl					8.4M							pl
ro																										ro																											ro					3.1M							ro
sk																										sk																											sk					2.0M							sk
sr_Cyrl															5.2M											sr_Cyrl																											sr_Cyrl												sr_Cyrl
uk										13.2M		12.8M	12.8M													uk	13.3M																13.9M		13.6M		13.3M				13.7M		uk												uk
zh_Hans															6.5M											zh_Hans																											zh_Hans												zh_Hans
zh_Hant															58.3k											zh_Hant																											zh_Hant												zh_Hant
	af	ar	az	bg	bn	br	bs	ca	ceb	cs	cy	da	de	el	en	eo	et	eu	fi	fr	fy	ga	gl	he	hr		hu	hy	ia	id	ilo	it	lb	lt	lv	mk	ml	ms	mt	nb	nds	nn	pl	pt	ro	ru	sk	sq	sr_Cyrl	sr_Latn	sv	sw		ta	th	tl	tr	uk	ur	uz	war	zh	zh_Hans	zh_Hant

Links to zip-files with aligned plain text files, one per language (Moses format).
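In the Moses format, line n of one language file corresponds to line n of the other, so a pair of files can be read in parallel. The following is a minimal sketch under that assumption. The file names and the helper read_moses_pairs are hypothetical, and the demo writes two tiny files to a temporary directory rather than reading a real download.

```python
import os
import tempfile
from itertools import zip_longest


def read_moses_pairs(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for n, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                # One file ended early, so the line alignment is broken.
                raise ValueError(f"files differ in length at line {n}")
            yield s.rstrip("\n"), t.rstrip("\n")


# Demo with made-up data; real file names depend on the unpacked zip-file.
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "example.en-de.en")
    de_path = os.path.join(tmp, "example.en-de.de")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("Hello.\nGood morning.\n")
    with open(de_path, "w", encoding="utf-8") as f:
        f.write("Hallo.\nGuten Morgen.\n")
    pairs = list(read_moses_pairs(en_path, de_path))

print(pairs)
```

Raising on a length mismatch is a design choice of this sketch: silently truncating to the shorter file would hide a corrupted or partially downloaded pair.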

	af	ar	az	bg	bn	br	bs	ca	ceb	cs	cy	da	de	el	en	eo	et	eu	fi	fr	fy	ga	gl	he	hr		hu	hy	ia	id	ilo	it	lb	lt	lv	mk	ml	ms	mt	nb	nds	nn	pl	pt	ro	ru	sk	sq	sr_Cyrl	sr_Latn	sv	sw		ta	th	tl	tr	uk	ur	uz	war	zh	zh_Hans	zh_Hant
bs															1.1M											bs																											bs												bs
ceb															15.3M											ceb																											ceb												ceb
cs																										cs																											cs					7.4M							cs
en	3.0M	3.5M	19.4k	3.5M	3.6M	3.1M	3.0M	3.1M	3.0M	3.0M	3.1M	3.0M	2.8M	3.1M		3.0M	2.9M	3.0M	3.0M	3.2M	3.1M	3.4M	2.5M	3.5M	2.9M	en	2.8M	3.3M	36.3k	2.7M	3.2M	3.0M	2.9M	3.1M	2.2M	3.2M	3.8M	2.8M	3.2M	2.8M	0.2M	2.7M	3.2M	3.1M		2.2M		2.3M	3.0M	2.7M	3.2M	3.3M	en	3.5M	3.3M	3.1M	2.3M	2.2M	3.1M	2.8M	3.1M	3.2M	3.3M	3.2M	en
hr															2.7M											hr																											hr												hr
hu																										hu																											hu					0.9M							hu
id															4.9M											id																											id												id
nb															5.7M											nb																											nb												nb
pl																										pl																											pl					8.4M							pl
ro																										ro																											ro					3.1M							ro
sk																										sk																											sk					2.0M							sk
sr_Cyrl															5.2M											sr_Cyrl																											sr_Cyrl												sr_Cyrl
uk										13.2M		12.8M	12.8M													uk	13.3M																13.9M		13.6M		13.3M				13.7M		uk												uk
zh_Hans															6.5M											zh_Hans																											zh_Hans												zh_Hans
zh_Hant															58.3k											zh_Hant																											zh_Hant												zh_Hant
	af	ar	az	bg	bn	br	bs	ca	ceb	cs	cy	da	de	el	en	eo	et	eu	fi	fr	fy	ga	gl	he	hr		hu	hy	ia	id	ilo	it	lb	lt	lv	mk	ml	ms	mt	nb	nds	nn	pl	pt	ro	ru	sk	sq	sr_Cyrl	sr_Latn	sv	sw		ta	th	tl	tr	uk	ur	uz	war	zh	zh_Hans	zh_Hant

Links to compressed TMX files, one per language pair.
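A TMX file stores each aligned unit as a tu element containing one tuv element per language, with the text in a seg child. The sketch below extracts sentence pairs from such a structure with the standard library. The sample content and the function name tmx_pairs are illustrative assumptions, and the table below lists 0 for every pair, so the actual TMX files here may be empty.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample in TMX layout (not real corpus data).
SAMPLE_TMX = """<tmx version="1.4">
<header srclang="en" datatype="PlainText" segtype="sentence"
        adminlang="en" creationtool="example" creationtoolversion="0"
        o-tmf="none"/>
<body>
<tu><tuv xml:lang="en"><seg>Hello.</seg></tuv>
    <tuv xml:lang="de"><seg>Hallo.</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Good morning.</seg></tuv>
    <tuv xml:lang="de"><seg>Guten Morgen.</seg></tuv></tu>
</body>
</tmx>"""


def tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the two language codes."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        # Map each language code in this unit to its segment text.
        segs = {
            tuv.get(XML_LANG): (tuv.findtext("seg") or "")
            for tuv in unit.iter("tuv")
        }
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


tmx_result = tmx_pairs(SAMPLE_TMX, "en", "de")
print(tmx_result)
```

For a large file, ET.iterparse over the decompressed stream would avoid loading the whole document into memory.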

	af	ar	az	bg	bn	br	bs	ca	ceb	cs	cy	da	de	el	en	eo	et	eu	fi	fr	fy	ga	gl	he	hr		hu	hy	ia	id	ilo	it	lb	lt	lv	mk	ml	ms	mt	nb	nds	nn	pl	pt	ro	ru	sk	sq	sr_Cyrl	sr_Latn	sv	sw		ta	th	tl	tr	uk	ur	uz	war	zh	zh_Hans	zh_Hant
bs															0											bs																											bs												bs
ceb															0											ceb																											ceb												ceb
cs																										cs																											cs					0							cs
en	0	0	0	0	0	0	0	0	0	0	0	0	0	0		0	0	0	0	0	0	0	0	0	0	en	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0		0		0	0	0	0	0	en	0	0	0	0	0	0	0	0	0	0	0	en
hr															0											hr																											hr												hr
hu																										hu																											hu					0							hu
id															0											id																											id												id
nb															0											nb																											nb												nb
pl																										pl																											pl					0							pl
ro																										ro																											ro					0							ro
sk																										sk																											sk					0							sk
sr_Cyrl															0											sr_Cyrl																											sr_Cyrl												sr_Cyrl
uk										0		0	0													uk	0																0		0		0				0		uk												uk
zh_Hans															0											zh_Hans																											zh_Hans												zh_Hans
zh_Hant															0											zh_Hant																											zh_Hant												zh_Hant
	af	ar	az	bg	bn	br	bs	ca	ceb	cs	cy	da	de	el	en	eo	et	eu	fi	fr	fy	ga	gl	he	hr		hu	hy	ia	id	ilo	it	lb	lt	lv	mk	ml	ms	mt	nb	nds	nn	pl	pt	ro	ru	sk	sq	sr_Cyrl	sr_Latn	sv	sw		ta	th	tl	tr	uk	ur	uz	war	zh	zh_Hans	zh_Hant

Wikipedia v1syn

Release history:

Download

	af	ar	az	bg	bn	br	bs	ca	ceb	cs	cy	da	de	el	en	eo	et	eu	fi	fr	fy	ga	gl	he	hr		hu	hy	ia	id	ilo	it	lb	lt	lv	mk	ml	ms	mt	nb	nds	nn	pl	pt	ro	ru	sk	sq	sr_Cyrl	sr_Latn	sv	sw		ta	th	tl	tr	uk	ur	uz	war	zh	zh_Hans	zh_Hant
bs															view											bs																											bs												bs
ceb															view											ceb																											ceb												ceb
cs																										cs																											cs					view							cs
en	view	view	view	view	view	view	view	view	view	view	view	view	view	view		view	view	view	view	view	view	view	view	view	view	en	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view	view		view		view	view	view	view	view	en	view	view	view	view	view	view	view	view	view	view	view	en
hr															view											hr																											hr												hr
hu																										hu																											hu					view							hu
id															view											id																											id												id
nb															view											nb																											nb												nb
pl																										pl																											pl					view							pl
ro																										ro																											ro					view							ro
sk																										sk																											sk					view							sk
sr_Cyrl															view											sr_Cyrl																											sr_Cyrl												sr_Cyrl
uk										view		view	view													uk	view																view		view		view				view		uk												uk
zh_Hans															view											zh_Hans																											zh_Hans												zh_Hans
zh_Hant															view											zh_Hant																											zh_Hant												zh_Hant
	af	ar	az	bg	bn	br	bs	ca	ceb	cs	cy	da	de	el	en	eo	et	eu	fi	fr	fy	ga	gl	he	hr		hu	hy	ia	id	ilo	it	lb	lt	lv	mk	ml	ms	mt	nb	nds	nn	pl	pt	ro	ru	sk	sq	sr_Cyrl	sr_Latn	sv	sw		ta	th	tl	tr	uk	ur	uz	war	zh	zh_Hans	zh_Hant