
Latest News

synOPUS: the synthetic open parallel corpus

OPUS is a growing collection of translated texts. synOPUS is a new edition that provides synthetic data sets, i.e., data that has been (partially) generated, for example by translating text into other languages with machine translation tools or large language models. Several tools were used to compile the current collection. All pre-processing is done automatically; no manual corrections have been carried out.

Contributions are very welcome! Please contact <opus-project@helsinki.fi>

Released Datasets

Tools & Resources
Please look at the publications below for more information about OPUS.
Please cite the first one in the list if you use any part of the corpus in your own work!

Publications

Jörg Tiedemann, 2012
Parallel Data, Tools and Interfaces in OPUS.
In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012)
Jörg Tiedemann and Santhosh Thottingal, 2020
OPUS-MT – Building open translation services for the World.
In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, 2020
Mikko Aulamo, Sami Virpioja, Jörg Tiedemann, 2020
OpusFilter: A Configurable Parallel Corpus Filtering Toolbox.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020
Mikko Aulamo, Umut Sulubacak, Sami Virpioja, Jörg Tiedemann, 2020
OpusTools and Parallel Corpus Diagnostics.
In Proceedings of the 12th Language Resources and Evaluation Conference, 2020
Jörg Tiedemann, 2016a
OPUS - Parallel Corpora for Everyone.
In Baltic Journal of Modern Computing (BJMC), Vol 4, No. 2, Special Issue: Proceedings of the 19th Annual Conference of the European Association for Machine Translation (EAMT), 2016
Jörg Tiedemann, 2016b
Finding Alternative Translations in a Large Corpus of Movie Subtitles.
In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016), 2016
Pierre Lison and Jörg Tiedemann, 2016
OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.
In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016), 2016
Raivis Skadiņš, Jörg Tiedemann, Roberts Rozis and Daiga Deksne, 2014
Billions of Parallel Words for Free.
In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'2014), Reykjavik, Iceland
Jörg Tiedemann, 2009
News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces.
In N. Nicolov, K. Bontcheva, G. Angelova and R. Mitkov (eds.) Recent Advances in Natural Language Processing (vol V), pages 237-248, John Benjamins, Amsterdam/Philadelphia
Jörg Tiedemann, 2011
Bitext Alignment.
Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers, now published by Springer Nature
Jörg Tiedemann, 2008
Synchronizing Translated Movie Subtitles.
In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC'2008)
Jörg Tiedemann, 2007
Building a Multilingual Parallel Subtitle Corpus.
In Proceedings of CLIN 17, Leuven, Belgium, 2007
Jörg Tiedemann, 2007
Improved Sentence Alignment for Movie Subtitles.
In Proceedings of RANLP '07, Borovets, Bulgaria, 2007
Jörg Tiedemann, 2003
OPUS - an open source parallel corpus.
In Proceedings of the 13th Nordic Conference on Computational Linguistics, University of Iceland, Reykjavik, 2003
Jörg Tiedemann, Lars Nygaard, 2004
The OPUS corpus - parallel & free.
In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, May 26-28
A textbook on alignment: