
OpenSubtitles2016 - Intra-Lingual Alignments

The following table lists alignments between subtitles in the same language. There are often various alternative subtitle files for each movie in the collection. Many of them are identical or near identical. We have processed them all and sorted the results in various ways. The resulting files are linked in the table for each language. Here is an explanation of the different columns:

Some alignment files exist as XCES only (standoff annotation of sentence alignment) and some of them are also available in TMX format (to make it easier to inspect the actual sentence pairs). If you use the XCES alignment files, then you will also need the corpus, which is linked in the first column.
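As a rough illustration of why the corpus is needed alongside an XCES file, the sketch below reads the standoff links from a small, made-up sample. It assumes the layout commonly seen in OPUS alignment files: `linkGrp` elements with `fromDoc` and `toDoc` attributes, containing `link` elements whose `xtargets` attribute holds the two sentence-id lists separated by a semicolon. The sample document names and the `read_links` helper are hypothetical. The links carry sentence ids only, so the sentence text still has to be looked up in the corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed XCES layout (not taken from the corpus).
SAMPLE_XCES = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/doc_a.xml.gz" toDoc="en/doc_b.xml.gz">
    <link id="SL1" xtargets="1;1" />
    <link id="SL2" xtargets="2 3;2" />
    <link id="SL3" xtargets=";3" />
  </linkGrp>
</cesAlign>"""


def read_links(xces_text):
    """Yield (from_doc, to_doc, source_ids, target_ids) for every link."""
    root = ET.fromstring(xces_text)
    for group in root.iter("linkGrp"):
        from_doc = group.get("fromDoc")
        to_doc = group.get("toDoc")
        for link in group.iter("link"):
            # "2 3;2" means sentences 2 and 3 on one side, sentence 2 on the other.
            src, _, tgt = link.get("xtargets", "").partition(";")
            yield from_doc, to_doc, src.split(), tgt.split()


links = list(read_links(SAMPLE_XCES))
for from_doc, to_doc, src_ids, tgt_ids in links:
    print(from_doc, src_ids, "<->", to_doc, tgt_ids)
```

An empty id list on one side, as in the third sample link, marks a sentence that has no counterpart in the other subtitle file.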

Please cite the following article if you use any part of the corpus in your own work:
Jörg Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles.
In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

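language | corpus | all | insert | misaligned | other | pct | spell
ar | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
bg | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
bn | zip | xml | xml tmx | xml tmx
br | zip | xml | xml tmx | xml tmx
bs | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
ca | zip | xml | xml tmx | xml tmx | xml tmx | xml tmx
cs | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
da | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
de | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
el | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
en | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
eo | zip | xml | xml | xml tmx | xml tmx
es | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
et | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
eu | zip | xml | xml tmx | xml | xml tmx | xml tmx
fa | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
fi | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
fr | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
gl | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
he | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
hi | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
hr | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
hu | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
id | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
is | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
it | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
ja | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
ka | zip | xml
ko | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
lt | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
lv | zip | xml | xml tmx | xml tmx | xml tmx | xml tmx
mk | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
ml | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
ms | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
nl | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
no | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
pl | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
pt | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
pt_br | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
ro | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
ru | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
si | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
sk | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
sl | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
sq | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
sr | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
sv | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
th | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
tl | zip | xml | xml tmx | xml tmx
tr | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
uk | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
vi | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
zh | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
zh_en | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
zh_tw | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx
zh_zh | zip | xml | xml tmx | xml | xml tmx | xml tmx | xml tmx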

Disclaimer

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.