Pre-processing, Tagging and Parsing
Tools and language-specific models for segmentation, tagging and parsing have been collected for the OPUS corpus. The language-specific models can be downloaded from the table below; the tools themselves are available here:
- Hunpos tagger: http://code.google.com/p/hunpos/downloads/list
- MaltParser: http://maltparser.org/download.html (use version 1.4.1, the version the models below were trained with)
- MElt tagger for French: https://gforge.inria.fr/frs/download.php/27240/
- SVMTool tagger (with pretrained models for English, Spanish and Catalan): http://www.lsi.upc.edu/~nlp/SVMTool/#DOWNLOAD
- Zpar statistical parser with language-specific features for Chinese and English: http://sourceforge.net/projects/zpar/
To keep tagging and parsing consistent, the same tools have been used for most of the languages: the Hunpos tagger (Péter Halácsy, András Kornai, Csaba Oravecz, 2007, Hunpos - an open source trigram tagger) and MaltParser (Joakim Nivre and Johan Hall, 2005, MaltParser: A language-independent system for data-driven dependency parsing). For some languages, alternative taggers and/or parsers are used.
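As an illustration of how the tagger and the parser can be chained, the following is a minimal sketch, not the procedure actually used for OPUS. It assumes that the tagger writes one `token<TAB>tag` pair per line with a blank line between sentences, and that the parser reads CoNLL-X style input with ten tab-separated columns. The function names, the file names and the `malt.jar` path are hypothetical; the MaltParser options `-c`, `-i`, `-o` and `-m parse` are the standard ones for parsing with an existing model.

```python
from typing import Iterable, Iterator, List


def hunpos_to_conll(tagged_lines: Iterable[str]) -> Iterator[str]:
    """Convert 'token<TAB>tag' lines (blank line = sentence break) to CoNLL-X rows."""
    token_id = 0
    for raw in tagged_lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence; repeated blanks are skipped.
            if token_id:
                yield ""
            token_id = 0
            continue
        # Keep only the first two fields, so a trailing tab is harmless.
        form, tag = line.split("\t")[:2]
        token_id += 1
        # Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
        # Everything the tagger does not provide is left as "_".
        yield "\t".join([str(token_id), form, "_", tag, tag, "_", "_", "_", "_", "_"])
    if token_id:
        # Close the last sentence if the input did not end with a blank line.
        yield ""


def malt_parse_command(model: str, infile: str, outfile: str,
                       jar: str = "malt.jar") -> List[str]:
    """Build (but do not run) a MaltParser command line for parsing.

    The jar path and file names are assumptions for illustration only.
    """
    return ["java", "-jar", jar, "-c", model,
            "-i", infile, "-o", outfile, "-m", "parse"]


if __name__ == "__main__":
    sample = ["The\tDT\n", "dog\tNN\n", "barks\tVBZ\n", "\n"]
    for row in hunpos_to_conll(sample):
        print(row)
    print(" ".join(malt_parse_command("english", "in.conll", "out.conll")))
```

The command list can be passed to `subprocess.run` once the converted rows have been written to the input file. Keeping the conversion separate from the tool invocation makes it easy to substitute another tagger for the languages where Hunpos is not used.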
Click on a language name for more information on the models available for this language.
Language | Tokenizer | Sentence splitter | Tagger(s) | Parser(s) |
Catalan | | | SVMTool | malt-1.4.1 |
Czech | | | hunpos | malt-1.4.1 |
Chinese | zpar | zpar | zpar | zpar |
Danish | | | hunpos | malt-1.4.1 |
Dutch | | | | malt |
English | | | hunpos | malt-1.4.1 |
French | | | MElt | malt-1.4.1 |
German | | | hunpos | malt-1.4.1 |
Hungarian | | | hunpos | |
Italian | | | TextPro | malt-1.4.1 |
Portuguese | | | hunpos | malt-1.4.1 |
Russian | | | hunpos | malt-1.4.1 |
Slovene | | | hunpos | malt-1.4.1 |
Spanish | | | SVMTool | malt-1.4.1 |
Swedish | | | hunpos | malt-1.4.1 |
Turkish | | | | malt-1.4.1 |
Other tools
- Language guesser: textcat with pre-trained language models