A collection of documents from http://www.opensubtitles.org/.
Look at the latest package of OpenSubtitles2018!
IMPORTANT: If you use the OpenSubtitles corpus:
Please add a link to http://www.opensubtitles.org/ to your website and to any reports and publications produced with the data. I promised this when I got the data from the providers of that website.
This is a slightly cleaner and bigger version of the subtitle collection, built with improved sentence alignment and better language checking.
The previous release is still available here.
55 languages, 1,076 bitexts
total number of files: 1,415,879
total number of tokens: 8.48G
total number of sentence fragments: 1.24G
Please cite the following article if you use any part of the corpus in your own work:
Jörg Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
Bottom-left triangle: download files | Upper-right triangle: sample files