
KDEdoc

A parallel corpus of KDE manuals.

24 languages, 226 bitexts
total number of files: 3,678
total number of tokens: 3.74M
total number of sentence fragments: 0.30M

Please cite the following article if you use any part of the corpus in your own work:
J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)

Download

Below you can download data files for all language pairs in different formats and with different kinds of annotation (if available). Click the links as explained below. In addition to the files shown on this page, OPUS provides pre-compiled word alignments, phrase tables, bilingual dictionaries, and frequency counts; these files can be found through the resources search form on the top-level OPUS website.

Bottom-left triangle: download files
  • ces = sentence alignments in XCES format
  • leftmost column language IDs = tokenized corpus files in XML
  • TMX and plain text files (Moses): see "Statistics" below
  • lower row language IDs = parsed corpus files (if they exist)
Upper-right triangle: sample files
  • view = bilingual XML file samples
  • upper row language IDs = monolingual XML file samples
  • rightmost column language IDs = untokenized corpus files
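
As a rough illustration of how the sentence alignment files listed above might be consumed, the sketch below reads sentence links from an XCES-style document. The element and attribute names (`cesAlign`, `linkGrp`, `link`, `fromDoc`, `toDoc`, `xtargets`), the sample content, and the document paths are assumptions for illustration, not taken from this page; check a downloaded file before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-link sample; real files may differ in detail.
SAMPLE = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en_GB/doc.xml.gz" toDoc="sv/doc.xml.gz">
    <link id="SL1" xtargets="s1;s1" />
    <link id="SL2" xtargets="s2 s3;s2" />
  </linkGrp>
</cesAlign>"""


def read_links(xml_text):
    """Return (from_doc, to_doc, source_ids, target_ids) for every link."""
    root = ET.fromstring(xml_text)
    links = []
    for group in root.iter("linkGrp"):
        from_doc = group.get("fromDoc")
        to_doc = group.get("toDoc")
        for link in group.iter("link"):
            # Assumed convention: source IDs and target IDs are separated
            # by ';', and IDs on each side are separated by spaces.
            src, _, tgt = link.get("xtargets", "").partition(";")
            links.append((from_doc, to_doc, src.split(), tgt.split()))
    return links


links = read_links(SAMPLE)
for from_doc, to_doc, src_ids, tgt_ids in links:
    print(from_doc, src_ids, "<->", to_doc, tgt_ids)
```

Each link only references sentence IDs; the sentence text itself would have to be looked up in the corresponding corpus files.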

Language IDs in the matrix (rows and columns): da de en_GB es et fr hu it ja nl nn pt pt_BR ro ru sk sl sr sv tr uk wa xh zh_TW

Statistics and TMX/Moses Downloads

Number of files, tokens, and sentences per language (including non-parallel ones if they exist)
Number of sentence alignment units per language pair

Upper-right triangle: download translation memory files (TMX)
Bottom-left triangle: download plain text files (MOSES/GIZA++)
Language IDs, first row: monolingual plain text files (tokenized)
Language IDs, first column: monolingual plain text files (untokenized)
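
For readers who have not worked with the plain text (Moses) downloads, the sketch below shows one way such data could be read, assuming the usual Moses convention of two files in which line i of one file is the translation of line i of the other. The file names and sentences are hypothetical and are written to a temporary directory only so that the example runs as-is.

```python
import os
import tempfile


def read_moses_pairs(src_path, tgt_path):
    """Pair line i of the source file with line i of the target file."""
    with open(src_path, encoding="utf-8") as f:
        src_lines = [line.rstrip("\n") for line in f]
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = [line.rstrip("\n") for line in f]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("files are not line-aligned")
    return list(zip(src_lines, tgt_lines))


# Hypothetical sample data, not taken from the corpus.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "sample.en_GB")
    tgt_path = os.path.join(tmp, "sample.sv")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("Open the file .\nSave\n")
    with open(tgt_path, "w", encoding="utf-8") as f:
        f.write("Öppna filen .\nSpara\n")
    pairs = read_moses_pairs(src_path, tgt_path)

print(pairs)
```

Raising an error on a length mismatch is a deliberate choice here: silently truncating to the shorter file would hide a misaligned download.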

language files tokens sentences da de en_GB es et fr hu it ja nl nn pt pt_BR ro ru sk sl sr sv tr uk wa xh zh_TW
da 369 0.5M 38.1k 29 28.0k 0.1k 0.2k 19 4.4k 0.2k
de 374 0.5M 39.8k 2.7k 17.1k 38 19.6k 7.7k 8.1k 0.2k 5.6k 96 11.1k 0.5k 0.2k 3.5k 11.9k 10.9k 29 25.7k 0.1k 0.2k 19 4.1k 0.2k
en_GB 132 41.5k 3.2k 3.0k 2.8k 18 2.1k 0.8k 1.7k 0.1k 1.4k 2.9k 57 0.6k 2.8k 2.0k 29 2.4k 0.1k 0.2k 19 0.8k 0.1k
es 466 0.5M 36.5k 21.5k 3.1k 39 15.6k 6.2k 6.2k 0.2k 3.9k 0.1k 8.3k 0.5k 0.2k 3.6k 10.0k 8.5k 29 24.6k 0.1k 0.2k 19 3.5k 0.2k
et 14 0.2k 44 42 23 42 42 3 36 39 20 36 38
fr 330 0.4M 30.9k 21.4k 2.3k 17.9k 42 7.2k 6.9k 0.1k 4.8k 88 7.3k 0.3k 0.2k 2.6k 9.3k 9.2k 27 21.9k 0.1k 0.2k 17 4.4k 0.2k
hu 231 0.1M 11.7k 8.3k 0.9k 7.4k 4 7.8k 3.6k 0.2k 1.5k 0.1k 2.0k 0.3k 0.2k 1.4k 4.8k 3.5k 29 8.7k 96 0.2k 19 1.0k 0.2k
it 134 0.1M 12.3k 10.0k 1.8k 8.7k 8.2k 4.3k 88 0.4k 2.6k 0.2k 0.2k 1.6k 4.0k 2.6k 29 8.7k 54 59 19 3.3k 0.2k
ja 10 6.2k 0.2k 0.2k 0.1k 0.2k 0.1k 0.2k 88 0.2k 0.2k 36 84 0.2k 0.1k 28 0.2k 24 68 18 0.1k 73
nl 160 94.9k 6.8k 6.1k 1.6k 4.5k 42 5.0k 1.6k 0.4k 0.2k 21 4.4k 61 0.5k 4.0k 4.3k 24 1.7k 0.1k 0.2k 19 0.4k 0.1k
nn 2 1.2k 0.1k 96 0.1k 90 0.1k 21 21 0.1k 98 99 19 21
pt 213 0.2M 16.9k 14.5k 3.1k 11.9k 42 9.2k 2.7k 4.5k 0.2k 6.2k 21 0.2k 1.3k 7.0k 6.1k 29 6.9k 0.1k 0.2k 19 2.0k 0.2k
pt_BR 25 9.7k 0.7k 0.6k 0.5k 0.3k 0.4k 0.2k 0.3k 0.3k 0.2k 0.6k 23
ro 6 2.2k 0.2k 0.2k 57 0.2k 0.2k 0.2k 0.2k 37 61 0.2k 0.1k 0.2k 0.2k 15 0.2k 16 14 5 0.2k 0.1k
ru 99 75.8k 4.4k 3.9k 0.6k 4.0k 21 2.8k 1.6k 1.9k 85 0.5k 1.7k 0.3k 0.1k 2.1k 2.6k 28 3.5k 46 58 18 1.1k 0.2k
sk 311 0.2M 19.9k 14.6k 3.1k 15.4k 42 11.1k 5.9k 6.2k 0.2k 4.3k 0.1k 10.8k 0.4k 0.2k 2.4k 7.5k 29 11.8k 0.1k 0.2k 19 3.1k 0.2k
sl 199 0.2M 14.0k 12.7k 2.2k 10.7k 42 9.8k 4.0k 3.6k 0.1k 5.6k 0.1k 8.8k 0.2k 0.2k 3.1k 9.0k 29 7.2k 0.1k 0.2k 19 2.9k 0.2k
sr 2 0.2k 29 29 29 29 29 27 29 29 28 24 29 15 28 29 29 29 19 19 29 28
sv 469 0.7M 55.9k 36.5k 32.2k 2.5k 32.6k 24.5k 10.5k 11.9k 0.2k 1.9k 0.1k 11.7k 0.6k 0.2k 4.0k 16.6k 9.3k 29 0.1k 0.2k 19 4.4k 0.2k
tr 31 0.8k 0.1k 0.1k 0.1k 0.1k 0.1k 0.1k 98 54 24 0.1k 0.1k 16 46 0.1k 0.1k 19 0.1k 55 18 50 36
uk 33 2.8k 0.2k 0.2k 0.2k 0.2k 0.2k 0.2k 0.2k 59 68 0.2k 19 0.2k 14 58 0.2k 0.2k 0.2k 63 0.1k 32
wa 1 0.2k 19 19 19 19 19 17 19 19 18 19 19 5 18 19 19 19 19 18 19 18
xh 61 74.3k 7.1k 5.1k 4.6k 0.8k 4.2k 4.7k 1.1k 4.0k 0.1k 0.4k 21 2.7k 23 0.2k 1.5k 3.9k 3.6k 29 5.1k 50 0.1k 19 0.2k
zh_TW 6 0 0.2k 0.2k 0.2k 0.1k 0.2k 0.2k 0.2k 0.2k 74 0.1k 0.2k 0.1k 0.2k 0.2k 0.2k 28 0.2k 36 32 18 0.2k

Note that TMX files only contain unique translation units and, therefore, the number of aligned units is smaller than for the distributions in Moses and XML format. Moses downloads include all non-empty alignment units including duplicates. Token counts for each language also include duplicate sentences and documents.
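
The difference between the two counts can be illustrated with a small sketch. It assumes that "non-empty" means both sides of a unit contain text and that uniqueness is judged on the exact pair of strings; the function name and the sample pairs are hypothetical, and the actual OPUS export may apply different rules.

```python
def count_units(pairs):
    """Return (all non-empty units, unique non-empty units)."""
    # Assumption: a unit counts as non-empty only if both sides have text.
    non_empty = [(s, t) for s, t in pairs if s.strip() and t.strip()]
    # Assumption: two units are duplicates when both strings match exactly.
    unique = set(non_empty)
    return len(non_empty), len(unique)


# Hypothetical alignment units: one duplicate and one unit with an empty side.
pairs = [
    ("Open the file.", "Öppna filen."),
    ("Open the file.", "Öppna filen."),
    ("Save", "Spara"),
    ("", "Spara"),
]

moses_count, tmx_count = count_units(pairs)
print(moses_count, tmx_count)  # prints "3 2"
```

Here the duplicate pair is counted twice in the Moses-style total but only once in the TMX-style total, which is why the TMX figure is never the larger of the two.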


Disclaimer

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please notify us.
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.