Using OPUS corpora with Uplug is straightforward. Here is a small selection of simple tools for processing parallel corpora from OPUS:
uplug/tools/readalign
A simple Perl script that reads sentence-aligned OPUS corpora and prints them in plain-text format to STDOUT. It can also add some very simple HTML tags to display sentence alignments on web pages.
Example usage:
/path/to/uplug/tools/readalign xml/de-sv.ces.gz | less
(Note that you have to run readalign from the home directory of the corpus so that the relative file paths specified in the sentence alignment file xml/de-sv.ces.gz can be resolved.)
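Why the working directory matters becomes clearer when looking at the alignment file itself. The following Python sketch lists the document pairs that an alignment file points to. It assumes the XCES-style layout commonly seen in OPUS alignment files (`linkGrp` elements carrying `fromDoc`/`toDoc` attributes and containing `link` elements); the sample data and the function names `open_alignment` and `list_linked_documents` are purely illustrative and not part of Uplug.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def open_alignment(path):
    """Open an alignment file as a binary stream, gunzipping *.gz files."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def list_linked_documents(stream):
    """Yield (fromDoc, toDoc, number_of_links) for every linkGrp element.

    Assumption: XCES-style markup, i.e. <linkGrp fromDoc=".." toDoc="..">
    elements that contain one <link .../> element per sentence alignment.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "linkGrp":
            yield elem.get("fromDoc"), elem.get("toDoc"), len(elem.findall("link"))
            elem.clear()  # release the links of this group to save memory


# Made-up toy sample standing in for a file such as xml/de-sv.ces.gz
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<cesAlign version="1.0">
<linkGrp targType="s" fromDoc="de/doc1.xml.gz" toDoc="sv/doc1.xml.gz">
<link xtargets="1;1" />
<link xtargets="2 3;2" />
</linkGrp>
</cesAlign>
"""

pairs = list(list_linked_documents(io.BytesIO(SAMPLE)))
for from_doc, to_doc, n_links in pairs:
    print(from_doc, to_doc, n_links)
```

For a real corpus, the stream would come from something like `open_alignment("xml/de-sv.ces.gz")`. If the document paths it prints are relative, as in the toy sample, they only resolve when the tools are started from the corpus home directory.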
uplug/tools/opus2moses.pl
A simple Perl script (requires XML::Parser) that converts sentence-aligned OPUS corpora to Moses / GIZA++ input format (two separate files for source and target language, one sentence per line, aligned sentences on corresponding lines).
Example usage:
zcat OPUS/corpus/Europarl3/xml/de-sv.ces.gz | uplug/tools/opus2moses.pl -d OPUS/corpus/Europarl3 -s word:lem:pos -t pos:word -e de-sv.src -f de-sv.trg
(This will read aligned sentences from the German-Swedish Europarl corpus and write the factors word+lemma+POS-tag for German and the factors POS-tag+word for Swedish to the files de-sv.src (German) and de-sv.trg (Swedish).)
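Because aligned sentences sit on corresponding lines, the two output files can be read back in lockstep. The Python sketch below does this and splits each token into its factors. It assumes that the factors of a token are joined with `|`, as in Moses factored input; the helper names `split_factors` and `read_parallel` and the toy sentences are hypothetical and only serve as illustration.

```python
import io
from itertools import zip_longest


def split_factors(line, names):
    """Turn one line of factored tokens into a list of {factor name: value} dicts.

    Assumption: tokens are whitespace-separated and the factors inside a
    token are joined with "|" (so a token that itself contains "|" would
    be rejected here).
    """
    tokens = []
    for token in line.split():
        values = token.split("|")
        if len(values) != len(names):
            raise ValueError(f"expected {len(names)} factors, got {token!r}")
        tokens.append(dict(zip(names, values)))
    return tokens


def read_parallel(src, trg, src_factors, trg_factors):
    """Yield (source tokens, target tokens) for each pair of aligned lines."""
    missing = object()
    pairs = zip_longest(src, trg, fillvalue=missing)
    for line_no, (s, t) in enumerate(pairs, start=1):
        if s is missing or t is missing:
            raise ValueError(f"files differ in length at line {line_no}")
        yield split_factors(s, src_factors), split_factors(t, trg_factors)


# Made-up toy data standing in for de-sv.src and de-sv.trg
src_file = io.StringIO("das|das|ART Haus|Haus|NN\n")
trg_file = io.StringIO("NN|huset\n")

sentences = list(
    read_parallel(src_file, trg_file, ("word", "lem", "pos"), ("pos", "word"))
)
print(sentences[0][0][1]["lem"], sentences[0][1][0]["word"])
```

With real data, the two `io.StringIO` objects would be replaced by `open("de-sv.src", encoding="utf-8")` and `open("de-sv.trg", encoding="utf-8")`, and the factor names would follow the order given to the `-s` and `-t` options.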
uplug/tools/opus-indexer.pl
Yet another Perl script, this one for indexing OPUS corpora with the Corpus Workbench (CWB), with support for sentence alignment.
The following tools have been used for pre-processing, annotation and alignment (not including standard GNU tools):
srt2xml.pl & srtalign.pl: special scripts to convert and align movie subtitles (also included in the latest versions of Uplug)
tool | language | trained on | trained by |
---|---|---|---|
tagger | English | WSJ+Brown | Gann Bierner |
chunker | English | Penn Tree Bank | Jörg Tiedemann |
tool | language | trained on | trained by |
---|---|---|---|
tagger | German | NEGRA | Thorsten Brants |
tagger | English | WSJ | Thorsten Brants |
tagger | Swedish | SUC | Beáta Megyesi |
language | trained on | trained by |
---|---|---|
German | NEGRA | Helmut Schmid |
English | WSJ | Helmut Schmid |
French | | Achim Stein |
Italian | | Achim Stein |
Spanish | | |
Dutch | | |
language | trained on | trained by |
---|---|---|
English | 1 million words | Oliver Mason |
Portuguese | 500 million words | Tony Berber Sardinha and Rod de Lima-Lopes |
The following tools are used for data management: