@misc{degibert2024new,
title={A New Massive Multilingual Dataset for High-Performance Language Technologies},
author={Ona de Gibert and Graeme Nail and Nikolay Arefyev and Marta Bañón and Jelmer van der Linde and Shaoxiong Ji and Jaume Zaragoza-Bernabeu and Mikko Aulamo and Gema Ramírez-Sánchez and Andrey Kutuzov and Sampo Pyysalo and Stephan Oepen and Jörg Tiedemann},
year={2024},
eprint={2403.14009},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Please also cite the following article if you use any part of the corpus in your own work: J. Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
| Languages | Bitexts | Number of files | Number of tokens | Sentence fragments |
|---|---|---|---|---|
| 51 | 50 | 810 | 17G | 803M |
A note on formats: TMX files contain only unique translation units. Moses downloads include all non-empty alignment units including duplicates. Token counts for each language also include duplicate sentences and documents.
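The difference between the two counts can be illustrated with a minimal Python sketch. It assumes a Moses-style download consists of two line-aligned plain-text files (one sentence per line, same line number on each side), and it treats a unit as empty when either side is blank; that reading of "non-empty" is an assumption, as are the function name `unique_units` and the toy sentences. It is not the procedure used to build the TMX files.

```python
from typing import Iterable, List, Tuple


def unique_units(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Tuple[int, List[Tuple[str, str]]]:
    """Return (number of non-empty aligned pairs, de-duplicated pairs).

    Hypothetical helper: a pair counts as empty if either side is blank.
    De-duplicated pairs are kept in first-seen order.
    """
    total = 0
    seen = set()
    unique = []
    for src, tgt in zip(src_lines, tgt_lines):
        pair = (src.strip(), tgt.strip())
        if not pair[0] or not pair[1]:
            continue  # skip alignment units with an empty side
        total += 1
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return total, unique


# Toy data standing in for two line-aligned files (invented for illustration).
src = ["Hello.", "Good morning.", "Hello.", ""]
tgt = ["Hei.", "Hyvää huomenta.", "Hei.", ""]

total, unique = unique_units(src, tgt)
print(total, len(unique))  # prints "3 2"
```

Here `total` corresponds to the Moses-style count (duplicates included) and `len(unique)` to the number of distinct translation units, which is the kind of figure a TMX file would reflect.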