SYNOPSIS
# convert sentence aligned bitexts to factored moses input
# (requires XML::Parser)
opus2moses [OPTIONS] < sentence-align-file.xml
OPTIONS:
-s srcfactors ......... specify source language factors besides surface words
-t trgfactors ......... the same for the target language (separated by ':')
                        factors should be attributes of <w> tags!!
                        (except 'word' which is the word itself)
-d dir ................ home directory of the OPUS subcorpus
-n file-pattern ....... skip bitext files that match pattern (e.g. ep-00-1*)
-i .................... inverse selection (only files matching file pattern)
-e src-data-file ...... output file for source language data (default = src)
-f trg-data-file ...... output file for target language data (default = trg)
-p sentence-pair-file . stores sentence ID pairs of the extracted pairs
-l .................... convert to lower case
-1 .................... 1:1 links only
-x max ................ max size of sentences (in nr of words)
-r .................... process untokenized (raw) XML (no length filtering)
-M .................... read all sentences into memory for each linked document
                        before extracting linked sentences (for non-monotonic links)
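The factor options above (-s, -t) can be illustrated with a small sketch. It is not the opus2moses implementation, only an approximation of the idea: each <w> element becomes one token whose factors are joined with '|', the separator Moses uses for factored input. The function name factored_sentence, the attribute names pos and lem, and the UNK placeholder for a missing attribute are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def factored_sentence(s_xml, factors=("word",), missing="UNK"):
    """Render one <s> element as a line of '|'-joined factored tokens.

    'word' is taken from the text of <w>; every other factor name is
    looked up as an attribute of <w>. The 'missing' placeholder is an
    assumption of this sketch, not documented opus2moses behaviour.
    """
    sentence = ET.fromstring(s_xml)
    tokens = []
    for w in sentence.iter("w"):
        values = []
        for name in factors:
            if name == "word":
                values.append((w.text or "").strip())
            else:
                values.append(w.get(name, missing))
        tokens.append("|".join(values))
    return " ".join(tokens)


# Hypothetical tokenized sentence with 'pos' and 'lem' attributes on <w>.
example = (
    '<s id="1">'
    '<w pos="DT" lem="the">The</w> '
    '<w pos="NN" lem="house">house</w>'
    "</s>"
)

if __name__ == "__main__":
    # Roughly what a factor specification such as word:pos:lem asks for.
    print(factored_sentence(example, factors=("word", "pos", "lem")))
```

With only the default factor the sketch emits plain surface words, which corresponds to running without -s or -t.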
Last modified on Nov 16, 2017, 8:20:52 PM