
opus-read

Read sentence-aligned OPUS data in XCES align / XML format

opus-read is a simple script that reads sentence alignments stored in XCES align format and prints the aligned sentences to STDOUT. It requires monotonic alignments (ascending order, no crossing links) of sentences in linked XML files. The linked XML files are specified in the "toDoc" and "fromDoc" attributes (see below).

<cesAlign version="1.0">
  <linkGrp targType="s" toDoc="source1.xml" fromDoc="target1.xml">
    <link certainty="0.88" xtargets="s1.1 s1.2;s1.1" id="SL1" />
           ....
  <linkGrp targType="s" toDoc="source2.xml" fromDoc="target2.xml">
    <link certainty="0.88" xtargets="s1.1;s1.1" id="SL1" />
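To make the link format concrete, here is a minimal Python sketch that reads such a file and splits each "xtargets" value into its two ID lists. It is not part of opus-read. The function name parse_links and the key names are hypothetical, the sample is a closed, well-formed version of the fragment above, and the sketch assumes that the IDs before the ";" refer to sentences in "fromDoc" and the IDs after it to sentences in "toDoc".

```python
import xml.etree.ElementTree as ET

# Well-formed miniature of the fragment above (closing tags added for the sketch).
SAMPLE = """<cesAlign version="1.0">
  <linkGrp targType="s" toDoc="source1.xml" fromDoc="target1.xml">
    <link certainty="0.88" xtargets="s1.1 s1.2;s1.1" id="SL1" />
  </linkGrp>
</cesAlign>"""


def parse_links(xml_text):
    """Return one dict per <link>, with the sentence IDs of both sides as lists.

    Assumption: IDs before ';' belong to fromDoc, IDs after ';' to toDoc.
    """
    root = ET.fromstring(xml_text)
    links = []
    for grp in root.iter("linkGrp"):
        for link in grp.iter("link"):
            before, _, after = link.get("xtargets", "").partition(";")
            certainty = link.get("certainty")
            links.append({
                "fromDoc": grp.get("fromDoc"),
                "toDoc": grp.get("toDoc"),
                "id": link.get("id"),
                # certainty is treated as optional in this sketch
                "certainty": float(certainty) if certainty is not None else None,
                "from_ids": before.split(),  # an empty side yields an empty list
                "to_ids": after.split(),
            })
    return links


if __name__ == "__main__":
    for entry in parse_links(SAMPLE):
        print(entry["id"], entry["from_ids"], entry["to_ids"], entry["certainty"])
```

The first link in the sample aligns two sentences on one side with one sentence on the other, which is why both sides are kept as lists rather than single IDs.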

Several parameters can be set to filter the alignments and to print only certain types of alignments. The output will look something like this (example from the RF corpus):

# /proj/nlpl/data/OPUS/corpus/RF/xml/en/1988.xml.gz
# /proj/nlpl/data/OPUS/corpus/RF/xml/fr/1988.xml.gz

================================
(src)="s1.1"> Statement of Government Policy by the Prime Minister , Mr Ingvar Carlsson , at the Opening of the Swedish Parliament on Tuesday , 4 October , 1988 . 
(trg)="s1.1"> Declaration de Politique Générale du Gouvernement présentée mardi 4 octobre 1988 devant le Riksdag par Monsieur Ingvar Carlsson , Premier Ministre . 
================================
(src)="s2.1"> Your Majesties , Your Royal Highnesses , Mr Speaker , Members of the Swedish Parliament . 
(trg)="s2.1"> Majestés , Altesses Royales , Monsieur le Président , Mesdames et Messieurs les députés ! 
================================
(src)="s3.1"> Sweden 's policy of neutrality is of decisive importance for our peace and independence . 
(trg)="s3.1"> La politique suédoise de neutralité revêt une importance capitale pour la paix et l ' indépendance de notre pays . 
================================
(src)="s3.2"> It also contributes to stability and détente in our part of the world . 
(trg)="s3.2"> Elle contribue également à la stabilité et à la détente dans notre secteur du monde . 
================================
(src)="s3.3"> There is wide popular support for this policy . 
(trg)="s3.3"> Cette politique recueille une large adhésion populaire . 
...
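If the printed output needs to be processed further, it can be read back into sentence pairs. The following Python sketch does this under the assumption that the layout is exactly as in the example above: a line of "=" characters starts each alignment, and each sentence line begins with (src)="ID"> or (trg)="ID">. The names LINE_RE and parse_opus_read_output are hypothetical and not part of opus-read.

```python
import re

# Matches lines such as: (src)="s1.1"> Some sentence .
LINE_RE = re.compile(r'^\((src|trg)\)="([^"]*)">\s?(.*)$')

SAMPLE_OUTPUT = """# /proj/nlpl/data/OPUS/corpus/RF/xml/en/1988.xml.gz
# /proj/nlpl/data/OPUS/corpus/RF/xml/fr/1988.xml.gz

================================
(src)="s2.1"> Your Majesties , Your Royal Highnesses , Mr Speaker , Members of the Swedish Parliament . 
(trg)="s2.1"> Majestés , Altesses Royales , Monsieur le Président , Mesdames et Messieurs les députés ! 
================================
(src)="s3.3"> There is wide popular support for this policy . 
(trg)="s3.3"> Cette politique recueille une large adhésion populaire . 
"""


def parse_opus_read_output(text):
    """Group (src)/(trg) lines into one dict per alignment block."""
    blocks = []
    current = None
    for line in text.splitlines():
        if line.startswith("====="):
            # A separator closes the previous block (if any) and opens a new one.
            if current is not None and (current["src"] or current["trg"]):
                blocks.append(current)
            current = {"src": [], "trg": []}
            continue
        match = LINE_RE.match(line)
        if match and current is not None:
            side, sent_id, sentence = match.groups()
            current[side].append((sent_id, sentence.strip()))
        # Anything else (file name comments, blank lines) is ignored.
    if current is not None and (current["src"] or current["trg"]):
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    for block in parse_opus_read_output(SAMPLE_OUTPUT):
        print(block["src"], "<->", block["trg"])
```

Each side is a list because an alignment may contain more than one sentence on either side.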

opus-read can also be used to filter the XCES alignment files and to print the remaining links in the same XCES align format. Use the "-l" flag to enable this mode.

Examples

# read sentence alignments and print aligned sentences
# (run the command in a directory from which the
#  XML files named in fromDoc and toDoc can be
#  found by the system)
opus-read align-file.xml
opus-read align-file.xml.gz

# the script uses some heuristics to locate
# the home directory of OPUS;
# the following commands only work if your data follows the same directory structure
opus-read corpusname/lang-pair
opus-read -d corpusname lang-pair
opus-read -d corpusname -s srclang -t trglang

# print alignments with alignment certainty above the threshold (here LinkThr=0)
opus-read -c 0 align-file.xml

# print alignments with max 2 source sentences and 3 target sentences
opus-read -S 2 -T 3 align-file.xml

# print aligned sentences marked as 'de' (source) and 'en' (target)
# (this only works if sentences are marked with languages:
#  for example, in the German XML file: <s lang="de">...</s>)
opus-read -s de -t en align-file.xml

# wrap aligned sentences in simple HTML
opus-read -h align-file.xml

# print max 10 alignments
opus-read -m 10 align-file.xml

# specify home directory of aligned XML files
opus-read -d /path/to/xml/files align-file.xml

# print XCES align format of all 1:1 sentence alignments
opus-read -S 1 -T 1 -l align-file.xml
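
The filters shown in the examples above can be pictured as simple checks on each link. The Python sketch below only illustrates that idea and is not how opus-read is implemented. The function filter_links and its parameter names are hypothetical, links are given as (certainty, xtargets) pairs, and the handling of empty sides or missing certainty values in the real script is not covered here.

```python
def filter_links(links, min_certainty=None, max_src=None, max_trg=None, max_links=None):
    """Keep links that pass all given limits (a rough analogue of -c, -S, -T, -m).

    links: iterable of (certainty, xtargets) pairs, e.g. (0.88, "s1.1 s1.2;s1.1").
    """
    kept = []
    for certainty, xtargets in links:
        if max_links is not None and len(kept) >= max_links:
            break  # analogue of -m: stop once enough links are kept
        src, _, trg = xtargets.partition(";")
        n_src, n_trg = len(src.split()), len(trg.split())
        if min_certainty is not None and not certainty > min_certainty:
            continue  # analogue of -c: certainty must exceed the threshold
        if max_src is not None and n_src > max_src:
            continue  # analogue of -S: too many sentences before ';'
        if max_trg is not None and n_trg > max_trg:
            continue  # analogue of -T: too many sentences after ';'
        kept.append((certainty, xtargets))
    return kept


# Made-up links for illustration only.
LINKS = [
    (0.88, "s1.1 s1.2;s1.1"),
    (0.88, "s1.1;s1.1"),
    (-0.2, "s2.1;s2.1 s2.2 s2.3 s2.4"),
]

if __name__ == "__main__":
    print(filter_links(LINKS, max_src=1, max_trg=1))
    print(filter_links(LINKS, min_certainty=0))
    print(filter_links(LINKS, max_src=2, max_trg=3))
    print(filter_links(LINKS, max_links=1))
```

Combining the limits in one pass mirrors how several flags can be given on one command line, as in the "-S 1 -T 1 -l" example.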
Last modified on Nov 15, 2017, 5:40:40 PM