@misc{degibert2024new,
  title={A New Massive Multilingual Dataset for High-Performance Language Technologies},
  author={Ona de Gibert and Graeme Nail and Nikolay Arefyev and Marta Bañón and Jelmer van der Linde and Shaoxiong Ji and Jaume Zaragoza-Bernabeu and Mikko Aulamo and Gema Ramírez-Sánchez and Andrey Kutuzov and Sampo Pyysalo and Stephan Oepen and Jörg Tiedemann},
  year={2024},
  eprint={2403.14009},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Please also cite the following article if you use any part of the corpus in your own work: J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)

Languages | Bitexts | Number of files | Number of tokens | Sentence fragments |
---|---|---|---|---|
51 | 50 | 810 | 17G | 803M |
A note on formats: TMX files contain only unique translation units. Moses downloads include all non-empty alignment units including duplicates. Token counts for each language also include duplicate sentences and documents.
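
The difference between the two formats can be illustrated with a minimal Python sketch that reduces Moses-style data to unique pairs, which is roughly what the TMX files are described as containing. It assumes the Moses download is a pair of line-aligned plain-text files (line i of one file is the translation of line i of the other). The function name `unique_pairs` and the sample sentences are hypothetical and not part of any OPUS tooling, and the exact deduplication rule used to build the TMX files is not stated here, so exact string matching is an assumption.

```python
from typing import Iterable, Iterator, Tuple


def unique_pairs(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield each (source, target) pair once, in first-seen order.

    Assumption: two pairs are duplicates when both sides match exactly
    after stripping the trailing newline.
    """
    seen = set()
    for src, tgt in zip(src_lines, tgt_lines):
        pair = (src.rstrip("\n"), tgt.rstrip("\n"))
        if pair not in seen:
            seen.add(pair)
            yield pair


# In-memory stand-ins for two line-aligned Moses files (hypothetical data).
source = ["Hello.\n", "Thank you.\n", "Hello.\n"]
target = ["Hei.\n", "Kiitos.\n", "Hei.\n"]

pairs = list(unique_pairs(source, target))
print(len(source), "aligned lines,", len(pairs), "unique pairs")
```

With real downloads, the two lists would be replaced by open file handles for the source and target files. Note that the seen-set keeps every unique pair in memory, so for the largest bitexts a hashed or disk-backed variant may be more practical.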