| Corpus | Sentences | en tokens | lb tokens |
|---|---|---|---|
| NLLB v1 | 12,765,998 | 117,495,841 | 111,166,632 |
| CCMatrix v1 | 11,978,495 | 184,813,130 | 352,793,513 |
| XLEnt v1.2 | 312,362 | 694,840 | 1,211,357 |
| WikiMatrix v1 | 22,282 | 323,087 | 274,973 |
| KDE4 v2 | 7,376 | 20,857 | 22,583 |
| OpenSubtitles v2024 | 4,954 | 32,726 | 27,850 |
| QED v2.0a | 1,387 | 13,953 | 12,535 |
| Tatoeba v2023-04-12 | 405 | 2,077 | 2,173 |
| wikimedia v20230407 | 218 | 4,761 | 5,244 |
| TED2020 v1 | 90 | 1,396 | 1,444 |
| EUbookshop v2 | 0 | 0 | 0 |
| Ubuntu v14.10 | 0 | 0 | 0 |
| Total | 25.1M | 303.4M | 465.5M |