opus2multi

opus2multi [OPTIONS] xmldir pivot [lang-ids]*

Combine the sentence alignments of several language pairs into one multilingual alignment, using a pivot language as the intermediate that links all other languages.

OUTPUT: sentence alignment files for all languages together with the pivot language

        <xmldir> should be the path to the XML directory that contains
                 sentence alignment files
                 for each individual language pair (e.g. xmldir/en-fr.xml.gz)
        <pivot> is the language ID of the pivot language (e.g. en)
        <lang-ids> are the language IDs of the other languages to be combined
                   in the multilingual corpus
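
The pivot idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the in-memory `Link` format and the function name `combine_via_pivot` are hypothetical, and the sketch joins only alignment units whose pivot side is identical in every language pair. Merging of overlapping pivot segments, which a real tool may perform, is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory form of one bitext: each link pairs a group of
# pivot sentence IDs with a group of sentence IDs in the other language.
Link = Tuple[Tuple[str, ...], Tuple[str, ...]]


def combine_via_pivot(
    bitexts: Dict[str, List[Link]],
) -> List[Dict[str, Tuple[str, ...]]]:
    """Join several pivot-to-language bitexts on identical pivot segments."""
    # Index every bitext by its pivot-side sentence group.
    indexed = {
        lang: {src: trg for src, trg in links}
        for lang, links in bitexts.items()
    }
    langs = sorted(indexed)
    if not langs:
        return []

    # Only pivot segments that are linked in every language pair survive.
    shared = set(indexed[langs[0]])
    for lang in langs[1:]:
        shared &= set(indexed[lang])

    units = []
    for pivot_ids in sorted(shared):
        unit = {"pivot": pivot_ids}
        for lang in langs:
            unit[lang] = indexed[lang][pivot_ids]
        units.append(unit)
    return units
```

With Swedish as pivot, a Swedish segment that is linked in sv-de but missing from sv-en produces no multilingual unit in this simplified version.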

SYNOPSIS

        # Combine all sentence alignments via Swedish
        # for German, English, Spanish and French.
        # The alignment units will cover the same Swedish (pivot) sentences.

        opus2multi /path/to/OPUS/corpus/RF/xml sv de en es fr

        # shortcut without full path to xml-dir
        # (requires OPUS in some standard directory)

        opus2multi RF sv de en es fr

        # use intralingual links (for pivot language) to extend the data set
        # (useful for OpenSubtitles-corpora)

        opus2multi -a OpenSubtitles2016 sv de en es fr

OPTIONS

        -e ................. keep segments with empty links in any of the languages
        -i pivot-links ..... intralingual pivot link file
        -a ................. same as -i but read intralingual links from xmldir/../alt/
        -s nr .............. max number of sentences in an alignment unit
        -h ................. this help
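
As a rough illustration of what the -e and -s options control, the following Python sketch filters one combined alignment unit. It makes assumptions the help text does not confirm: the `Unit` type and the function `keep_unit` are hypothetical, and the sentence limit is assumed to apply to each language separately rather than to the unit as a whole.

```python
from typing import Dict, Tuple

# Hypothetical representation: language ID -> sentence IDs in one unit.
Unit = Dict[str, Tuple[str, ...]]


def keep_unit(unit: Unit, keep_empty: bool = False, max_sentences: int = 0) -> bool:
    """Decide whether a combined alignment unit passes the filters."""
    for sentence_ids in unit.values():
        # Without keep_empty (mirrors -e), drop units in which any
        # language has no sentence at all.
        if not sentence_ids and not keep_empty:
            return False
        # A positive limit (mirrors -s) is assumed to apply per language.
        if max_sentences > 0 and len(sentence_ids) > max_sentences:
            return False
    return True
```

Under these assumptions, `keep_unit({"sv": ("s1",), "de": ()})` rejects the unit, while passing `keep_empty=True` retains it.
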
Last modified on Nov 16, 2017, 8:25:22 PM