Tools

Various tools are used for preparing the data in OPUS. Most of them are standard open source tools available on common GNU/Linux systems. Many pre-processing tools are integrated in Uplug, which we use extensively for processing the parallel data. Uplug defines pre-processing pipelines for various languages and includes generic tools for tokenization and sentence alignment.

There are also packages for processing the data. Two main packages are available to make it easier to access the data and alignments in their native XML format:

Python package OpusTools

The original source code is available from GitHub at https://github.com/Helsinki-NLP/OpusTools and the package can also be installed via pip:

pip install opustools-pkg

It implements a library for processing OPUS data in their native XML format and also command-line tools for reading and extracting data: opus_read and opus_cat. More details about the usage of these tools can be found at https://pypi.org/project/opustools-pkg/

  • opus_read: read parallel data sets and convert to different output formats
  • opus_cat: extract a given OPUS document from release data

These tools are specifically implemented for accessing the released data that are distributed in compressed zip archives. They include various options and some examples are given below:

opus_read -d Books -s en -t fi
opus_read -d MultiUN -s en -t es -a certainty -tr 0
opus_read -d Books -s en -t fi -m 10
opus_read -d Books -s en -t fi -S 1 -T 1 -wm links

If you work on CSC, opus_read and opus_cat look for the OPUS data in its default location on taito. The tools can also access local copies of the data; see the online documentation for details.

Note that you may need to set the root directory of OPUS and the pre-processing type depending on what kind of data is available on your system (see the section on NLPL users below, which also covers Abel).
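Because the released data are distributed as compressed zip archives, extracting a single document amounts to reading one member of an archive. The following is a minimal sketch of that idea using only the Python standard library. It is not the implementation of opus_cat, and the member path `Books/xml/en/sample.xml` is a hypothetical name chosen for illustration; the archive is built in memory so that the example runs without any OPUS download.

```python
import io
import zipfile


def read_member(archive, member_name):
    """Return the decoded text of one file stored in an open ZipFile."""
    with archive.open(member_name) as handle:
        return handle.read().decode("utf-8")


# Build a tiny in-memory archive so the sketch is self-contained.
# The member path is hypothetical, not a documented OPUS layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr(
        "Books/xml/en/sample.xml",
        '<text><s id="s1">Hello .</s></text>',
    )

# Re-open the archive for reading, list its members and read one of them.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    document = read_member(archive, names[0])

print(names)
print(document)
```

With a real release archive you would pass a file path to `zipfile.ZipFile` instead of the in-memory buffer.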

Perl package opus-tools

The Perl package opus-tools provides an alternative way of processing OPUS data and can be downloaded from GitHub. It includes tools for reading, converting and indexing parallel corpora in OPUS format. Here are some of the most common tools that you may want to use:

  • opus-read: read a bitext by combining the standoff annotation for sentence alignment with the standalone XML files of the source and target language corpora; you can also use this tool to filter the sentence alignment according to some constraints
  • opus2moses: convert the XML-based bitext into aligned plain text format (Moses format)
  • opus2multi: make a truly multilingual corpus out of the bilingually aligned sub-corpora using one language as a pivot
  • opus-pt2dic: extract rough bilingual lexicons from Moses-style phrase translation tables
  • opus-pt2dice: similar to opus-pt2dic but produces Dice scores and includes co-occurrence frequencies
  • opus-index: convert and index OPUS data using the Corpus Workbench (CWB)
  • opus-udpipe: parse OPUS data (in raw XML format) and produce XML-based annotation with dependency relations and morphosyntactic annotation (this requires a proper installation of UDPipe and the Perl interface to UDPipe)
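The idea behind reading a bitext, as described for opus-read above, is to resolve the sentence IDs listed in a standoff alignment file against the separate source and target documents. The sketch below illustrates this with made-up sample data in the general shape of OPUS files (sentence elements with an `id`, and alignment links with an `xtargets` attribute of the form `source ids;target ids`). The function names, the sample content and the handling of the `certainty` threshold are assumptions for illustration and do not reproduce the behavior of the Perl or Python tools.

```python
import xml.etree.ElementTree as ET

# Made-up sample data (assumption): two documents and one alignment file.
SOURCE_XML = """<text>
  <s id="s1">Hello world .</s>
  <s id="s2">How are you ?</s>
  <s id="s3">Fine .</s>
</text>"""

TARGET_XML = """<text>
  <s id="s1">Hei maailma .</s>
  <s id="s2">Kuinka voit ? Hyvin .</s>
</text>"""

ALIGNMENT_XML = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/sample.xml" toDoc="fi/sample.xml">
    <link id="SL1" xtargets="s1;s1" certainty="0.9" />
    <link id="SL2" xtargets="s2 s3;s2" certainty="-0.2" />
  </linkGrp>
</cesAlign>"""


def sentences_by_id(xml_text):
    """Map each sentence id to its whitespace-normalized text."""
    root = ET.fromstring(xml_text)
    return {
        s.get("id"): " ".join("".join(s.itertext()).split())
        for s in root.iter("s")
    }


def read_bitext(alignment_xml, source_xml, target_xml, min_certainty=None):
    """Return (source, target) text pairs for every alignment link.

    If min_certainty is given, links whose certainty attribute is below
    it are skipped (a hypothetical filter for this sketch).
    """
    source = sentences_by_id(source_xml)
    target = sentences_by_id(target_xml)
    pairs = []
    for link in ET.fromstring(alignment_xml).iter("link"):
        certainty = link.get("certainty")
        if min_certainty is not None and certainty is not None:
            if float(certainty) < min_certainty:
                continue
        # xtargets lists source ids and target ids, separated by ';'.
        src_ids, tgt_ids = link.get("xtargets").split(";")
        pairs.append((
            " ".join(source[i] for i in src_ids.split()),
            " ".join(target[i] for i in tgt_ids.split()),
        ))
    return pairs


all_pairs = read_bitext(ALIGNMENT_XML, SOURCE_XML, TARGET_XML)
confident_pairs = read_bitext(ALIGNMENT_XML, SOURCE_XML, TARGET_XML, min_certainty=0.0)
for src, tgt in all_pairs:
    print(src, "|||", tgt)
```

Keeping the alignment separate from the documents means that the same monolingual files can be reused for every language pair, and filtering only has to touch the small alignment file.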

NLPL users

The tools are installed on taito and CSC and provided under the umbrella of NLPL. NLPL users can load the OPUS-related software with the module nlpl-opus, which includes OpusTools and opus-tools:

module load nlpl-opus

Running OpusTools scripts like opus_read on Abel requires some extra parameters because there is only a partial copy of the data and the root directory differs from the one on CSC/taito. Add flags for setting the root directory and use the pre-processing type raw, for example:

opus_read -rd /projects/nlpl/data/OPUS/ -p raw -d Books -s en -t fr
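If you call opus_read from scripts on several systems, it can be convenient to assemble the command line in one place. The following is a minimal sketch that uses only the flags shown on this page (`-rd`, `-p`, `-d`, `-s`, `-t`, `-m`). The helper name `build_opus_read_command` and its parameter names are hypothetical and not part of OpusTools; the sketch only builds and prints the command and does not run it.

```python
import shlex


def build_opus_read_command(corpus, source, target,
                            root_directory=None, preprocess=None,
                            maximum=None):
    """Build an argument list for opus_read from the flags used on this page."""
    args = ["opus_read"]
    if root_directory is not None:
        args += ["-rd", root_directory]  # root directory of the OPUS data
    if preprocess is not None:
        args += ["-p", preprocess]       # pre-processing type, e.g. raw
    args += ["-d", corpus, "-s", source, "-t", target]
    if maximum is not None:
        args += ["-m", str(maximum)]     # value passed to -m, as in "-m 10"
    return args


# Default call, relying on the default data location.
default_args = build_opus_read_command("Books", "en", "fi", maximum=10)

# Call with the extra flags from the Abel example.
abel_args = build_opus_read_command(
    "Books", "en", "fr",
    root_directory="/projects/nlpl/data/OPUS/",
    preprocess="raw",
)

print(shlex.join(default_args))
print(shlex.join(abel_args))
```

On a machine where opus_read is installed, the resulting list can be passed to `subprocess.run(args, check=True)`.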

Alternatively, you can use the script from the Perl package, which offers similar functionality and needs no extra flags even on Abel. Note that it does not work for some bigger corpora like OpenSubtitles because of the 32-bit restrictions of the library that is used to read zip files.

opus-read -d Books -s en -t fr

Old and mostly obsolete

OPUS provides models for PoS tagging and dependency parsing:

Last modified on Feb 15, 2020, 8:16:04 PM