
Data Formats

OPUS provides its data in various formats. The native data format is XML, and all other formats are derived from those XML files. The general file structure is common to all corpora. Corpus files are stored in sub-directories named after the corresponding language ID (using 2-letter ISO-639 language codes wherever possible). Aligned documents usually have the same path relative to this language-specific sub-directory (but do not need to have exactly the same name). Tokenized corpus files are stored in a sub-directory xml and untokenized files in raw. Sentence alignment files are named after the language IDs, for example, en-fr.xml.gz for the alignments of English and French documents. Alignments are symmetric; therefore, only one direction is stored in the corpus, using alphabetically sorted language IDs (i.e. fr-en.xml.gz does not exist for aligning French and English sentences). Here is an example of the file structure in Europarl:

Europarl
Europarl/xml
Europarl/xml/en
Europarl/xml/en/ep-10-07-05-004.xml.gz
Europarl/xml/fr
Europarl/xml/fr/ep-10-07-05-004.xml.gz
...
Europarl/xml/en-fr.xml.gz
...
Europarl/raw
Europarl/raw/en
Europarl/raw/en/ep-10-07-05-004.xml.gz
Europarl/raw/fr
Europarl/raw/fr/ep-10-07-05-004.xml.gz
...
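
The naming rules above can be sketched in a few lines of Python. This is only an illustration of the layout described here; the helper names alignment_file and document_file are hypothetical and not part of any OPUS tool.

```python
from pathlib import PurePosixPath


def alignment_file(corpus, lang_a, lang_b):
    """Return the path of the sentence alignment file for a language pair."""
    # Only one direction is stored, with the language IDs sorted alphabetically.
    source, target = sorted([lang_a, lang_b])
    return PurePosixPath(corpus) / "xml" / f"{source}-{target}.xml.gz"


def document_file(corpus, lang, doc, tokenized=True):
    """Return the path of a corpus document in its tokenized or raw variant."""
    variant = "xml" if tokenized else "raw"
    return PurePosixPath(corpus) / variant / lang / doc


print(alignment_file("Europarl", "fr", "en"))
print(document_file("Europarl", "en", "ep-10-07-05-004.xml.gz", tokenized=False))
```

Asking for fr and en in either order yields the same alignment file, which mirrors the fact that only the alphabetically sorted direction exists.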

Native XML

The native data format in OPUS is a simple standalone XML format. The amount of markup varies for each corpus and each language, depending on the pre-processing pipeline that was available when the particular corpus was created. More about pre-processing tools can be found on the Tools page. All data sets are available in a "raw" untokenized format in which only sentence boundaries are added (and possibly some additional basic document markup such as paragraph boundaries).

A typical example of raw corpus data (taken from Europarl) looks like this:

<?xml version="1.0" encoding="utf-8"?>
<document>
  <CHAPTER ID="0">
    <P id="1"></P>
    <SPEAKER ID="1" LANGUAGE="DE" NAME="Rübig">
      <P id="2">
        <s id="1">Madam President, I saw a few boats landing at Parliament this week and notified the security service.</s>
        <s id="2">Not only were there language difficulties; the telephone line was so poor that it was almost impossible to communicate.</s>
        <s id="3">I would be most obliged if the number on which the security service can be reached could also be clearly displayed in the House, so that if anyone wants to report an incident, they can do so quickly and efficiently.</s>
      </P>
      <P id="3"></P>
    </SPEAKER>
    ...

Sentences are numbered with unique IDs (unique within the XML file, not unique within the entire corpus). These IDs are used for the sentence alignment (see below).
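
As a minimal sketch, the sentences of a raw file can be collected by ID with Python's standard xml.etree.ElementTree module. The sample below is a shortened copy of the excerpt above, with the closing tags added and the XML declaration left out; the function name sentences_by_id is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the raw Europarl excerpt; closing tags added for the sketch.
SAMPLE = """<document>
  <CHAPTER ID="0">
    <P id="1"></P>
    <SPEAKER ID="1" LANGUAGE="DE" NAME="Rübig">
      <P id="2">
        <s id="1">Madam President, I saw a few boats landing at Parliament this week and notified the security service.</s>
        <s id="2">Not only were there language difficulties; the telephone line was so poor that it was almost impossible to communicate.</s>
      </P>
    </SPEAKER>
  </CHAPTER>
</document>
"""


def sentences_by_id(root):
    """Map each sentence ID to its text (IDs are unique within one file only)."""
    return {s.get("id"): "".join(s.itertext()).strip() for s in root.iter("s")}


sentences = sentences_by_id(ET.fromstring(SAMPLE))
for sentence_id, text in sentences.items():
    print(sentence_id, text)
```

Because the IDs are only unique within a file, a corpus-wide key would have to combine the document path with the sentence ID.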

In the tokenized versions, token boundaries are added with <w> tags. Additional annotation may be included as well. A typical example for English (again, taken from Europarl) looks like the following:

<?xml version="1.0" encoding="utf-8"?>
<document><CHAPTER ID="0">
<P id="1" /><SPEAKER ID="1" LANGUAGE="DE" NAME="Rübig"><P id="2">
<s id="1">
 <chunk type="NP" id="c-1">
  <w hun="NNP" tree="NN" lem="madam" pos="NNP" id="w1.1">Madam</w>
  <w hun="NNP" tree="NP" lem="President" pos="NNP" id="w1.2">President</w>
 </chunk>
 <w hun="," tree="," lem="," pos="," id="w1.3">,</w>
 <chunk type="NP" id="c-3">
  <w hun="PRP" tree="PP" lem="I" pos="PRP" id="w1.4">I</w>
 </chunk>
 <chunk type="VP" id="c-4">
  <w hun="VBD" tree="VVD" lem="see" pos="VBD" id="w1.5">saw</w>
 </chunk>
...
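
Reading the tokens back out of such a file could look like the following sketch, again using xml.etree.ElementTree. The sample repeats the excerpt above with closing tags added so that it parses; the function name tokens is hypothetical, and the attribute names (lem, pos) are the ones shown in this example and may differ in other corpora.

```python
import xml.etree.ElementTree as ET

# The tokenized excerpt from above, closed off so that it is well-formed.
SAMPLE = """<document><CHAPTER ID="0">
<P id="1" /><SPEAKER ID="1" LANGUAGE="DE" NAME="Rübig"><P id="2">
<s id="1">
 <chunk type="NP" id="c-1">
  <w hun="NNP" tree="NN" lem="madam" pos="NNP" id="w1.1">Madam</w>
  <w hun="NNP" tree="NP" lem="President" pos="NNP" id="w1.2">President</w>
 </chunk>
 <w hun="," tree="," lem="," pos="," id="w1.3">,</w>
 <chunk type="NP" id="c-3">
  <w hun="PRP" tree="PP" lem="I" pos="PRP" id="w1.4">I</w>
 </chunk>
 <chunk type="VP" id="c-4">
  <w hun="VBD" tree="VVD" lem="see" pos="VBD" id="w1.5">saw</w>
 </chunk>
</s>
</P></SPEAKER></CHAPTER></document>
"""


def tokens(sentence):
    """Return (id, form, lemma, pos) for every <w> element, in document order."""
    # iter() walks the tree depth-first, so tokens inside and outside
    # <chunk> elements come out in their original order.
    return [(w.get("id"), w.text, w.get("lem"), w.get("pos")) for w in sentence.iter("w")]


root = ET.fromstring(SAMPLE)
for sentence in root.iter("s"):
    print(sentence.get("id"), [form for _, form, _, _ in tokens(sentence)])
```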

Sentence alignments are stored in a standoff annotation file using the XCES align format. Here is an example from Europarl:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE cesAlign PUBLIC "-//CES//DTD XML cesAlign//EN" "">
<cesAlign version="1.0">
<linkGrp targType="s" fromDoc="en/ep-00-01-17.xml.gz" toDoc="fr/ep-00-01-17.xml.gz">
<link xtargets="1;1" />
<link xtargets="2;2" />
<link xtargets="3;3 4" />
...

Each linkGrp includes attributes that specify the source language file (fromDoc) and the target language file (toDoc). All file names are relative to the xml directory of the current corpus. They can be used with both the raw and the tokenized versions of the corpus (as sentence boundaries are identical in both variants).

The actual links are stored in the xtargets attribute of the <link> elements. The source and target sides are separated by ; and multiple sentence IDs on the same side are separated by whitespace. For example, in the sample above, sentence 1 from file en/ep-00-01-17.xml.gz is aligned to sentence 1 in file fr/ep-00-01-17.xml.gz, whereas sentence 3 is aligned to sentences 3 and 4 in the French file.

There can be any number of sentences on each side of an alignment, even none at all. Sentence alignment files may also contain any number of aligned document pairs. Further document pairs are added by appending another <linkGrp> element.
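
The xtargets convention can be decoded with plain string splitting. The following is a minimal sketch under the rules described above; the function names parse_xtargets and read_links are hypothetical, and the sample is the Europarl excerpt with closing tags added and the XML declaration and DOCTYPE left out.

```python
import xml.etree.ElementTree as ET

# Europarl alignment excerpt, closed off so that it is well-formed.
SAMPLE = """<cesAlign version="1.0">
<linkGrp targType="s" fromDoc="en/ep-00-01-17.xml.gz" toDoc="fr/ep-00-01-17.xml.gz">
<link xtargets="1;1" />
<link xtargets="2;2" />
<link xtargets="3;3 4" />
</linkGrp>
</cesAlign>
"""


def parse_xtargets(xtargets):
    """Split an xtargets value into lists of source and target sentence IDs."""
    source, target = xtargets.split(";")
    # str.split() without arguments returns [] for an empty side.
    return source.split(), target.split()


def read_links(root):
    """Yield (fromDoc, toDoc, source IDs, target IDs) for every link."""
    for group in root.iter("linkGrp"):
        for link in group.iter("link"):
            source_ids, target_ids = parse_xtargets(link.get("xtargets"))
            yield group.get("fromDoc"), group.get("toDoc"), source_ids, target_ids


links = list(read_links(ET.fromstring(SAMPLE)))
for link in links:
    print(link)
```

An empty side, as in xtargets=";1 2", simply comes back as an empty list.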

In some cases, the sentence alignment files contain additional information about the link likelihood. These scores are taken, for example, from hunalign, which we use as our default alignment tool. Here is an example (from MultiUN) that includes this information:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE cesAlign PUBLIC "-//CES//DTD XML cesAlign//EN" "">
<cesAlign version="1.0">
 <linkGrp targType="s" toDoc="fr/2005/CES_AC71_2005_5SUMMARY.xml.gz" fromDoc="en/2005/CES_AC71_2005_5SUMMARY.xml.gz">
<link certainty="0" xtargets=";1 2" id="SL1" />
<link certainty="0.612088" xtargets="1;3" id="SL2" />
<link certainty="0.173077" xtargets="2;4" id="SL3" />
<link certainty="1.65065" xtargets="3;5" id="SL4" />
<link certainty="1.63824" xtargets="4;6" id="SL5" />
<link certainty="-0.3" xtargets=";7" id="SL6" />

Note that the certainty is not a likelihood in the sense of a proper probability. The values are taken verbatim from the output of the underlying alignment tool (hunalign in this case). However, they may still be useful for filtering, for example, for removing unreliable links from the set (e.g. the empty link to sentence number 7 in the example above).

Note that it is also easy to extract certain link types from the data. For example, 1-to-1 sentence alignments can be extracted by counting the aligned sentence IDs on each side and filtering out the links that are not 1-to-1.
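
A small sketch of both kinds of filtering, using the link IDs, certainty values and xtargets from the MultiUN example above. The function names and the threshold of 0.5 are arbitrary assumptions chosen for illustration; they are not a recommended setting.

```python
# (id, certainty, xtargets) triples copied from the MultiUN example.
LINKS = [
    ("SL1", 0.0, ";1 2"),
    ("SL2", 0.612088, "1;3"),
    ("SL3", 0.173077, "2;4"),
    ("SL4", 1.65065, "3;5"),
    ("SL5", 1.63824, "4;6"),
    ("SL6", -0.3, ";7"),
]


def link_type(xtargets):
    """Return (number of source sentences, number of target sentences)."""
    source, target = xtargets.split(";")
    return len(source.split()), len(target.split())


def one_to_one(links, min_certainty=None):
    """Keep the IDs of 1-to-1 links, optionally above a certainty threshold."""
    kept = []
    for link_id, certainty, xtargets in links:
        if link_type(xtargets) != (1, 1):
            continue
        if min_certainty is not None and certainty < min_certainty:
            continue
        kept.append(link_id)
    return kept


print(one_to_one(LINKS))
print(one_to_one(LINKS, min_certainty=0.5))  # 0.5 is an arbitrary example value
```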

UD Parsed XML

Some corpora have been parsed with universal dependencies using UDPipe. The dependency relations and all other morphosyntactic information have been converted to XML to be compatible with the original data and its sentence alignments. In this format, attributes based on the parser output in CoNLL-U format are added to each token of each sentence:

<?xml version="1.0" encoding="utf-8"?>

<document>
  <CHAPTER ID="002">
    <P id="1">
      <s id="1">
        <w xpos="NOUN" head="1.2" feats="Number=Plur" upos="NOUN" lemma="document" id="1.1" deprel="nsubj">Documents</w>
        <w xpos="VERB" head="0" feats="Mood=Ind|Tense=Past|VerbForm=Fin" upos="VERB" misc="SpaceAfter=No" lemma="receive" id="1.2" deprel="root">received</w>
        <w xpos="PUNCT" head="1.2" upos="PUNCT" lemma=":" id="1.3" deprel="punct">:</w>
        <w xpos="VERB" head="1.2" feats="Mood=Imp|VerbForm=Fin" upos="VERB" lemma="see" id="1.4" deprel="parataxis">see</w>
        <w xpos="PROPN" head="1.4" feats="Number=Plur" upos="PROPN" misc="SpaceAfter=No" lemma="Minutes" id="1.5" deprel="obj">Minutes</w>
      </s>
    </P>
  </CHAPTER>
</document>

Note that some sentences from the original corpus may be split into several sub-sentences by UDPipe, but the original sentence markup needs to stay the same to ensure compatibility with the standoff annotation of sentence alignments.
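
To illustrate how the head and deprel attributes fit together, the sketch below resolves each head reference to the word form it points to. It assumes, as in the example above, that head holds the id of the governing token and that "0" marks the root; the function name dependencies and the label ROOT are hypothetical.

```python
import xml.etree.ElementTree as ET

# The parsed sentence from the example above (XML declaration left out).
SAMPLE = """<document>
  <CHAPTER ID="002">
    <P id="1">
      <s id="1">
        <w xpos="NOUN" head="1.2" feats="Number=Plur" upos="NOUN" lemma="document" id="1.1" deprel="nsubj">Documents</w>
        <w xpos="VERB" head="0" feats="Mood=Ind|Tense=Past|VerbForm=Fin" upos="VERB" misc="SpaceAfter=No" lemma="receive" id="1.2" deprel="root">received</w>
        <w xpos="PUNCT" head="1.2" upos="PUNCT" lemma=":" id="1.3" deprel="punct">:</w>
        <w xpos="VERB" head="1.2" feats="Mood=Imp|VerbForm=Fin" upos="VERB" lemma="see" id="1.4" deprel="parataxis">see</w>
        <w xpos="PROPN" head="1.4" feats="Number=Plur" upos="PROPN" misc="SpaceAfter=No" lemma="Minutes" id="1.5" deprel="obj">Minutes</w>
      </s>
    </P>
  </CHAPTER>
</document>
"""


def dependencies(sentence):
    """Return (form, relation, head form) triples for one sentence."""
    forms = {w.get("id"): w.text for w in sentence.iter("w")}
    # A head that matches no token id (here "0") is reported as ROOT.
    return [
        (w.text, w.get("deprel"), forms.get(w.get("head"), "ROOT"))
        for w in sentence.iter("w")
    ]


root = ET.fromstring(SAMPLE)
arcs = dependencies(next(root.iter("s")))
for arc in arcs:
    print(arc)
```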

Plain Text / Moses

Plain text files are provided for each bitext in OPUS. They can be downloaded as zip archives and contain two files in which corresponding lines are aligned with each other. The file names follow the typical naming conventions used in Moses, i.e. they use file extensions that correspond to the language ID. For example, for the RF corpus the two files for English and French are called:

RF.en-fr.en
RF.en-fr.fr

The contents of the English file look like this:

Statement of Government Policy by the Prime Minister, Mr Ingvar Carlsson, at the Opening of the Swedish Parliament on Tuesday, 4 October, 1988.
Your Majesties, Your Royal Highnesses, Mr Speaker, Members of the Swedish Parliament.
Sweden's policy of neutrality is of decisive importance for our peace and independence.
It also contributes to stability and détente in our part of the world.
There is wide popular support for this policy.
It will be pursued with firmness and consistency.
...

And the corresponding French file looks like this:

Declaration de Politique Générale du Gouvernement présentée mardi 4 octobre 1988 devant le Riksdag par Monsieur Ingvar Carlsson, Premier Ministre.
Majestés, Altesses Royales, Monsieur le Président, Mesdames et Messieurs les députés!
La politique suédoise de neutralité revêt une importance capitale pour la paix et l' indépendance de notre pays.
Elle contribue également à la stabilité et à la détente dans notre secteur du monde.
Cette politique recueille une large adhésion populaire.
Elle sera poursuivie avec énergie et cohérence.
...

Note that the files are untokenized (raw format) and that a line may contain multiple sentences when these are aligned together with their corresponding sentence(s) in the other language. Empty alignments are excluded from the plain text files.
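
Since corresponding lines are aligned, the two files can be read in parallel. The following sketch writes two tiny stand-in files (using lines from the examples above) to a temporary directory and reads them back as pairs; the function name read_bitext is hypothetical.

```python
import tempfile
from pathlib import Path


def read_bitext(source_path, target_path):
    """Yield aligned (source line, target line) pairs from a Moses-style bitext."""
    with open(source_path, encoding="utf-8") as source, \
            open(target_path, encoding="utf-8") as target:
        for source_line, target_line in zip(source, target):
            yield source_line.rstrip("\n"), target_line.rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files holding two aligned lines from the RF example.
    en_file = Path(tmp) / "RF.en-fr.en"
    fr_file = Path(tmp) / "RF.en-fr.fr"
    en_file.write_text(
        "There is wide popular support for this policy.\n"
        "It will be pursued with firmness and consistency.\n",
        encoding="utf-8",
    )
    fr_file.write_text(
        "Cette politique recueille une large adhésion populaire.\n"
        "Elle sera poursuivie avec énergie et cohérence.\n",
        encoding="utf-8",
    )
    pairs = list(read_bitext(en_file, fr_file))

for english, french in pairs:
    print(english, "|||", french)
```

Note that zip stops at the shorter file, so a line-count check beforehand is a sensible safeguard against a truncated download.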

Monolingual files are also available for all languages. There are raw (untokenized) versions and tokenized versions available. In monolingual files, each sentence is strictly on one line.

A tokenized file looks like this:

Statement of Government Policy by the Prime Minister , Mr Ingvar Carlsson , at the Opening of the Swedish Parliament on Tuesday , 4 October , 1988.
Your Majesties , Your Royal Highnesses , Mr Speaker , Members of the Swedish Parliament .
Sweden 's policy of neutrality is of decisive importance for our peace and independence .
It also contributes to stability and détente in our part of the world .
There is wide popular support for this policy .
It will be pursued with firmness and consistency .
Our policy of neutrality is underpinned by a strong defence .
That safeguards our independence .
...

The raw file looks like this:

Statement of Government Policy by the Prime Minister, Mr Ingvar Carlsson, at the Opening of the Swedish Parliament on Tuesday, 4 October, 1988.
Your Majesties, Your Royal Highnesses, Mr Speaker, Members of the Swedish Parliament.
Sweden's policy of neutrality is of decisive importance for our peace and independence.
It also contributes to stability and détente in our part of the world.
There is wide popular support for this policy.
It will be pursued with firmness and consistency.
Our policy of neutrality is underpinned by a strong defence.
That safeguards our independence.

All plain text files are encoded in Unicode UTF-8.

TMX

Bitexts are also available in a simple TMX format. The TMX files use only minimal markup and contain untokenized raw text. Below you can see an example of such a TMX file from the OPUS collection:

<tmx version="1.4">
<header creationdate="Fri Aug 23 10:17:33 2013"
          srclang="en"
          adminlang="en"
          o-tmf="unknown"
          segtype="sentence"
          creationtool="Uplug"
          creationtoolversion="unknown"
          datatype="PlainText" />
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Statement of Government Policy by the Prime Minister, Mr Ingvar Carlsson, at the Opening of the Swedish Parliament on Tuesday, 4 October, 1988.</seg></tuv>
      <tuv xml:lang="fr"><seg>Declaration de Politique Générale du Gouvernement présentée mardi 4 octobre 1988 devant le Riksdag par Monsieur Ingvar Carlsson, Premier Ministre.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Your Majesties, Your Royal Highnesses, Mr Speaker, Members of the Swedish Parliament.</seg></tuv>
      <tuv xml:lang="fr"><seg>Majestés, Altesses Royales, Monsieur le Président, Mesdames et Messieurs les députés!</seg></tuv>
    </tu>
...
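
A TMX file like this can be read with xml.etree.ElementTree as sketched below. The sample is a shortened copy of the excerpt above with the header attributes reduced and closing tags added; the function name translation_units is hypothetical. The xml:lang attribute belongs to the predefined XML namespace, which ElementTree exposes under the expanded name used in XML_LANG.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the expanded XML namespace name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened copy of the TMX excerpt; closing tags added for the sketch.
SAMPLE = """<tmx version="1.4">
<header srclang="en" adminlang="en" segtype="sentence" datatype="PlainText" />
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Your Majesties, Your Royal Highnesses, Mr Speaker, Members of the Swedish Parliament.</seg></tuv>
      <tuv xml:lang="fr"><seg>Majestés, Altesses Royales, Monsieur le Président, Mesdames et Messieurs les députés!</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def translation_units(root):
    """Yield one {language: segment text} dictionary per <tu> element."""
    for tu in root.iter("tu"):
        yield {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}


units = list(translation_units(ET.fromstring(SAMPLE)))
for unit in units:
    print(unit["en"], "|||", unit["fr"])
```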
Last modified on Nov 15, 2017, 8:24:54 PM