Books

All texts are freely available for personal, educational and research use. Commercial use (e.g. reselling as parallel books) and mass redistribution are not permitted without explicit permission. Please acknowledge the source when using the data.

16 languages, 64 bitexts
total number of files: 158
total number of tokens: 19.50M
total number of sentence fragments: 0.91M

Download

Below you can download data files for all language pairs in different formats and with different kinds of annotation (if available). You can click on the various links as explained below. In addition to the files shown on this webpage, OPUS also provides pre-compiled word alignments and phrase tables, bilingual dictionaries, and frequency counts; these files can be found through the resources search form on the top-level website of OPUS.

Statistics and TMX/Moses Downloads

	ca	de	el	en	eo	es	fi	fr	hu	it	nl	no	pl	pt	ru	sv
ca		view		view					view		view						ca
de	ces			view	view	view		view	view	view	view			view	view		de
el				view		view		view	view								el
en	ces	ces	ces		view	view	view	view	view	view	view	view	view	view	view	view	en
eo		ces		ces		view		view	view	view				view			eo
es		ces	ces	ces	ces		view	view	view	view	view	view		view	view		es
fi				ces		ces		view	view			view	view				fi
fr		ces	ces	ces	ces	ces	ces		view	view	view	view	view	view	view	view	fr
hu	ces	ces	ces	ces	ces	ces	ces	ces		view	view	view	view	view	view		hu
it		ces		ces	ces	ces		ces	ces		view			view	view	view	it
nl	ces	ces		ces		ces		ces	ces	ces							nl
no				ces		ces	ces	ces	ces								no
pl				ces			ces	ces	ces								pl
pt		ces		ces	ces	ces		ces	ces	ces							pt
ru		ces		ces		ces		ces	ces	ces							ru
sv				ces				ces		ces							sv
	ca	de	el	en	eo	es	fi	fr	hu	it	nl	no	pl	pt	ru	sv

Upper-right triangle: download translation memory files (TMX)
Bottom-left triangle: download plain text files (MOSES/GIZA++)
Language IDs, first row: monolingual plain text files (tokenized)
Language IDs, first column: monolingual plain text files (untokenized)
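
The plain text (MOSES/GIZA++) downloads can be read as sentence pairs with a few lines of Python. The sketch below assumes that a download unpacks to two UTF-8 files, one per language, in which line N of one file is aligned with line N of the other. The function name `read_moses_pairs` and the file names in the usage note are hypothetical and should be adapted to the files actually downloaded.

```python
from itertools import zip_longest


def read_moses_pairs(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) pairs from two line-aligned plain text files.

    Assumption: both files have the same number of lines, and line N of
    the source file is the translation of line N of the target file.
    """
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        for src_line, tgt_line in zip_longest(src, tgt):
            # zip_longest pads the shorter file with None, so a None here
            # means the two files are not line-aligned.
            if src_line is None or tgt_line is None:
                raise ValueError("files have different numbers of lines")
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")
```

A call such as `list(read_moses_pairs("books.de", "books.en"))` (hypothetical file names) would then return the aligned units in file order, including any duplicates.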


language	files	tokens	sentences	ca	de	el	en	eo	es	fi	fr	hu	it	nl	no	pl	pt	ru	sv
ca	1	93.3k	5.0k		4.4k		4.6k					4.5k		4.3k
de	12	1.3M	71.0k	4.4k			51.1k	1.4k	27.3k		34.8k	51.4k	27.1k	15.6k			1.1k	17.1k
el	1	36.5k	1.6k				1.3k		1.1k		1.2k	1.1k
en	42	5.9M	0.2M	4.6k	51.5k	1.3k		1.6k	92.8k	3.6k	0.1M	0.1M	32.0k	38.5k	3.5k	2.8k	1.4k	17.3k	3.1k
eo	2	38.8k	2.0k		1.4k		1.6k		1.7k		1.6k	1.6k	1.5k				1.3k
es	18	2.4M	0.1M		27.5k	1.1k	93.5k	1.7k		3.3k	56.0k	78.3k	28.6k	32.2k	3.6k		1.3k	16.6k
fi	1	54.5k	3.8k				3.6k		3.3k		3.5k	3.5k			3.4k	2.8k
fr	29	3.6M	0.2M		34.9k	1.2k	0.1M	1.6k	56.3k	3.5k		88.9k	14.6k	39.9k	3.4k	2.8k	1.3k	8.2k	3.0k
hu	28	3.3M	0.2M	4.5k	51.8k	1.1k	0.1M	1.6k	78.8k	3.5k	89.3k		30.7k	43.3k	3.4k	2.9k	1.2k	25.8k
it	8	0.8M	36.0k		27.4k		32.3k	1.5k	28.9k		14.7k	30.9k		2.4k			1.2k	17.6k	3.0k
nl	9	1.3M	55.1k	4.3k	15.6k		38.7k		32.2k		40.0k	43.4k	2.4k
no	1	67.9k	4.0k				3.5k		3.6k	3.4k	3.4k	3.4k
pl	1	43.5k	3.3k				2.8k			2.8k	2.8k	2.9k
pt	1	32.3k	1.5k		1.1k		1.4k	1.3k	1.3k		1.3k	1.2k	1.2k
ru	3	0.5M	27.3k		17.4k		17.5k		16.8k		8.2k	26.1k	17.9k
sv	1	76.6k	3.2k				3.1k				3.0k		3.0k

Note that TMX files contain only unique translation units, so the number of aligned units in them is smaller than in the Moses and XML distributions. Moses downloads include all non-empty alignment units, including duplicates. Token counts for each language also include duplicate sentences and documents.
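
The difference between the two counts can be illustrated with a small sketch. It assumes that a unit counts as empty when both sides are blank after stripping whitespace, and that two units are duplicates when both sides match exactly; the deduplication actually used to build the TMX files may differ. The function name `count_units` is hypothetical.

```python
def count_units(pairs):
    """Return (non-empty units, unique units) for (source, target) pairs."""
    total = 0
    unique = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src and not tgt:
            continue  # assumption: a unit is empty only if both sides are blank
        total += 1               # Moses-style count: duplicates are kept
        unique.add((src, tgt))   # TMX-style count: exact duplicates collapse
    return total, len(unique)


pairs = [("Hallo", "Hello"), ("Hallo", "Hello"), ("Welt", "World"), ("", "")]
total, unique = count_units(pairs)
print(total, unique)  # 3 2
```

Here the repeated pair is counted twice in the first number and once in the second, which mirrors why the TMX figures are lower than the Moses figures for the same bitext.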
