
Word Alignments and Phrase Translation Tables

For most corpora we also provide word alignments and phrase translation tables in Moses format. If you search for resources in the language pair you are interested in (using the form on the front page of OPUS), you will see links to the downloadable files in the column alg of the resource table. Follow the links for the packages that contain word alignments (alg) or phrase tables and related files (smt).

The word alignments are stored in the format used by Moses and refer to the tokenisation given in the tokenised versions of the OPUS corpora. Each line corresponds to one aligned sentence pair in the bitext.xml file (XCES align format) that is included in the package. Word positions refer to tokens in the tokenised XML corpus files.

0-1 1-0 1-2 2-3 3-4
6-0 7-1 8-2 9-3 10-4 11-5 13-6 14-7 15-8 16-9 17-10 18-11
0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-7 8-8 9-9 10-10 11-11 12-12
0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-7 9-8 10-9 11-10 12-11 13-12 14-13
...
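As a minimal sketch of how such a line can be read, the following Python snippet splits one alignment line into index pairs. The function name parse_alignment_line is hypothetical, and the snippet assumes the usual Moses convention that each pair is written source-target with zero-based token positions.

```python
def parse_alignment_line(line):
    """Turn a line such as '0-1 1-0' into a list of (source, target) index pairs.

    Assumption: each token has the form 'i-j', where i is a source-side
    position and j a target-side position, both counted from zero.
    """
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


# First line of the example above.
links = parse_alignment_line("0-1 1-0 1-2 2-3 3-4")
print(links)  # [(0, 1), (1, 0), (1, 2), (2, 3), (3, 4)]
```

A token position may occur in several pairs (position 1 on the first side above) or in none, so the result is kept as a list of pairs rather than a one-to-one mapping.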

The included bitext.xml file in XCES align format can be used to extract the data from the native XML files. Each word alignment package typically contains several alternative alignment files, produced with different symmetrisation heuristics for combining IBM-style word alignments. For more information, please read the Moses SMT documentation on word alignment and common symmetrisation heuristics.
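To illustrate what symmetrisation means, the sketch below combines two directional alignments for the same sentence pair using the two simplest heuristics, intersection and union. It is not the code used to build the OPUS packages: the function names and the two input lines are made up for illustration, both inputs are assumed to be written in the same source-target orientation, and the more elaborate grow-based heuristics described in the Moses documentation are not implemented here.

```python
def parse_links(line):
    """Read 'i-j' tokens into a set of (source, target) index pairs."""
    return {tuple(int(i) for i in token.split("-")) for token in line.split()}


def symmetrise(src2tgt, tgt2src, method="intersection"):
    """Combine two directional alignments (sets of pairs) into one.

    'intersection' keeps only links found in both directions (high precision);
    'union' keeps links found in either direction (high recall).
    """
    if method == "intersection":
        return src2tgt & tgt2src
    if method == "union":
        return src2tgt | tgt2src
    raise ValueError(f"unsupported method: {method}")


def format_links(links):
    """Write a set of pairs back in 'i-j' notation, sorted by position."""
    return " ".join(f"{s}-{t}" for s, t in sorted(links))


# Hypothetical directional alignments for one sentence pair.
forward = parse_links("0-0 1-1 2-1")
backward = parse_links("0-0 1-1 2-2")

print(format_links(symmetrise(forward, backward, "intersection")))  # 0-0 1-1
print(format_links(symmetrise(forward, backward, "union")))  # 0-0 1-1 2-1 2-2
```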

The smt packages include translation models that are extracted from the word-aligned bitexts: lexical translation probabilities and a filtered phrase translation table phrase-table-filtered.gz in Moses format. For filtering, we use the pruning method based on significance scores. The log file in the same directory (if available) shows information about the pruning result:

...
------------------------------------------------------
  unfiltered phrases pairs: 433767941

     P(f|e) filter [first]: 66846556   (15.4107%)
       significance filter: 328790557   (75.7987%)
            TOTAL FILTERED: 395637113   (91.2094%)

     FILTERED phrase pairs: 38130828   (8.79061%)
------------------------------------------------------
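The figures in this log excerpt are related by simple arithmetic, which the following sketch reproduces. The variable names are chosen here for illustration, and the reading that all percentages are taken relative to the unfiltered total is an assumption, although it matches the numbers shown.

```python
# Counts copied from the log excerpt above.
unfiltered = 433767941
removed_by_pfe_filter = 66846556
removed_by_significance = 328790557

# The two filters together account for the total number of removed pairs,
# and whatever is left is the size of the filtered phrase table.
total_filtered = removed_by_pfe_filter + removed_by_significance
kept = unfiltered - total_filtered


def percent(count):
    """Share of the unfiltered phrase pairs, in percent."""
    return 100.0 * count / unfiltered


print(total_filtered)  # 395637113
print(kept)  # 38130828
print(f"{percent(kept):.4f}% of the phrase pairs remain")
```

In this example the pruning removes a little over nine tenths of the extracted phrase pairs.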
Last modified on Jan 2, 2019, 7:16:27 PM
