INTRODUCTION

ParCor 1.0 is a parallel corpus of texts in which pronoun coreference -- reduced coreference in which pronouns are used as referring expressions -- has been annotated. It consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent.

The corpus is intended both as a resource from which to learn systematic differences in pronoun use between languages and, ultimately, as a basis for developing and testing informed Statistical Machine Translation systems that address the problem of pronoun coreference in translation.

If you make use of the ParCor corpus in your work, please cite the following article:

Liane Guillou, Christian Hardmeier, Aaron Smith, Jörg Tiedemann and Bonnie Webber (2014). ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT. In Proceedings of LREC 2014. Reykjavik, Iceland.


CONTENTS

This download contains the following:

* The annotated English and German texts (in folder: "Annotated_Texts")
* The annotation guidelines used by our human annotators (in folder: "Documentation")
* Copies of the texts without annotation (in folder: "Raw_Texts")
* Sentence aligned texts, also without annotation (in folder: "Sentence_Aligned_Texts")

These components are described in more detail in the following sections.


ORIGINAL SOURCE OF DATA

The following TED Talks were downloaded from WIT3:
https://wit3.fbk.eu/
They form the test set of the IWSLT13 Shared Task dataset.

767 - Bill Gates on Energy: Innovating to Zero!
769 - Aimee Mullins: The Opportunity of Adversity
779 - Daniel Kahneman: The Riddle of Experience vs. Memory
783 - Gary Flake: Is Pivot a Turning Point for Web Exploration?
785 - James Cameron: Before Avatar ... a Curious Boy
790 - Dan Barber: How I Fell in Love With a Fish
792 - Eric Mead: The Magic of the Placebo
799 - Jane McGonigal: Gaming Can Make a Better World
805 - Robert Gupta: Music is Medicine, Music is Sanity
824 - Michael Specter: The Danger of Science Denial
837 - Tom Wujec: Build a Tower, Build a Team

The following EU Bookshop documents were downloaded from the EU Bookshop online archive in E-Book format:
https://bookshop.europa.eu/en/home/
The raw text was extracted using the Calibre E-Book Management tool: http://www.calibre-ebook.com/

KEBC11002 - Social Dialogue
KEBC12001 - Demography, Active Ageing and Pensions
KH7911105 - Soil
MI3112464 - Road Transport
MJ3011331 - Energy
NA3211776 - Europe in 12 Lessons
QE3011322 - Shaping Europe
QE3211790 - Active citizenship


ANNOTATED TEXTS

The "Annotated_Texts" folder contains the completed annotations for each of the TED Talks and EU Bookshop documents in English and German. The annotations are provided in the form of a number of MMAX-2 format XML files (using UTF-8 encoding). The main annotation layer is the "coref_level" layer, which contains pronoun and NP markables output by automated pre-processing pipelines (described in the LREC 2014 paper) together with additional markables and pronoun-level features added by human annotators. For each genre and language, we provide the annotations of the "main" annotator (Annotator1). Where we annotated the same text in parallel with a second annotator (for the purpose of computing inter-annotator agreement), the annotations produced by the second annotator are provided in an additional subfolder (Annotator2).

The MMAX-2 projects were constructed from the tokenised, sentence split data in the "Raw_Texts" folder.

MMAX-2 is required for the visualisation of the pronoun-antecedent links and pronoun features provided in the coreference annotation layer for each text. MMAX-2 may be downloaded from:
http://mmax2.sourceforge.net/
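
For readers who want to inspect the annotations programmatically rather than through the MMAX-2 interface, the following is a minimal Python sketch. It assumes the usual MMAX-2 markable-level layout, in which each <markable> element carries a "span" attribute such as "word_5" or "word_5..word_7". The function names are illustrative and not part of this release, and attribute names other than "span" are simply passed through as found, so they should be checked against the files in "Annotated_Texts".

```python
import xml.etree.ElementTree as ET


def expand_span(span):
    """Expand an MMAX-2 span string into a list of word ids.

    Handles single tokens ("word_5"), ranges ("word_5..word_7") and
    comma-separated discontinuous spans ("word_1..word_2,word_9").
    """
    word_ids = []
    for part in span.split(","):
        if not part:
            continue  # tolerate an empty or missing span
        if ".." in part:
            start, end = part.split("..")
            prefix = start.rsplit("_", 1)[0]
            first = int(start.rsplit("_", 1)[1])
            last = int(end.rsplit("_", 1)[1])
            word_ids.extend(f"{prefix}_{i}" for i in range(first, last + 1))
        else:
            word_ids.append(part)
    return word_ids


def read_markables(xml_data):
    """Return one dict per <markable>, keeping all attributes as found.

    xml_data may be a str or the raw bytes of a markable-level file.
    """
    root = ET.fromstring(xml_data)
    markables = []
    for elem in root.iter():
        # Strip any XML namespace so the tag compares as plain "markable".
        if elem.tag.rsplit("}", 1)[-1] != "markable":
            continue
        record = dict(elem.attrib)
        record["words"] = expand_span(record.get("span", ""))
        markables.append(record)
    return markables
```

To read a file from disk, pass its raw bytes, e.g. read_markables(open(path, "rb").read()), where "path" points at one of the coref_level XML files. The resulting word ids can then be looked up in the corresponding MMAX-2 words file.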

N.B. Release 1.0 contains several amendments to the Annotator1 annotations of TED talks 001_769, 003_799 and 004_767 (in Annotated_Texts/TED/English/Annotator1):

* A small number of speaker and addressee reference pronouns that were erroneously marked with pronoun type "generic" (this option was removed from the TED annotation scheme during the annotation process) have been correctly marked as speaker/addressee reference pronouns and their audience level features set accordingly. This affects:
*   14 pronouns in 001_769 (11 speaker reference / 3 addressee reference)
*   14 pronouns in 003_799 (2 speaker reference / 12 addressee reference)
*   13 pronouns in 004_767 (0 speaker reference / 13 addressee reference)
* One instance of "they" (markable_472 in TED talk 003_799) was erroneously marked with pronoun type "generic". This has been corrected and marked as an anaphoric pronoun.

These amendments will have a small effect on the pronoun type counts in the 2014 LREC paper. The pronoun form counts are unaffected.


DOCUMENTATION

The "Documentation" folder contains the annotation guidelines given to our human annotators. They are based on the pronoun annotation guidelines from the MUC-7 Coreference Task Definition. The annotation guidelines document is split into three sections: general guidelines that apply to both genres, specific instructions for the annotation of TED Talks, and specific instructions for the annotation of EU Bookshop documents.


RAW TEXTS

The "Raw_Texts" folder contains the tokenised, sentence split text data that was used to build the MMAX-2 format annotation projects. Sentence splitting and tokenisation were performed using the relevant Moses scripts:
http://www.statmt.org/moses/


SENTENCE ALIGNED TEXTS

The "Sentence_Aligned_Texts" folder contains sentence aligned texts suitable for use in SMT experiments. These texts were first tokenised and sentence split (using Moses scripts) and then automatically aligned using the LF Aligner:
http://sourceforge.net/projects/aligner/

The output of the LF Aligner was manually checked and some minor adjustments were made to the sentence splitting / sentence alignment.
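
As an illustration of how aligned text might be loaded for an experiment, here is a small Python sketch. It assumes, without this README confirming it, that each language is stored in its own file with one sentence per line and that line i of one file is aligned to line i of the other. The function name is hypothetical; adapt it to the actual file layout in "Sentence_Aligned_Texts".

```python
def pair_sentences(source_lines, target_lines):
    """Pair up aligned sentences, assuming one sentence per line.

    Raises ValueError if the two sides differ in length, which would
    indicate that the files are not line-aligned as assumed.
    """
    source = [line.rstrip("\n") for line in source_lines]
    target = [line.rstrip("\n") for line in target_lines]
    if len(source) != len(target):
        raise ValueError(
            f"line counts differ: {len(source)} vs {len(target)}"
        )
    return list(zip(source, target))
```

The function accepts any iterables of lines, so two files opened with encoding="utf-8" can be passed directly.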


USING THE ANNOTATIONS FOR SMT

We recommend the following:

* Obtaining additional training data from:
*   TED Talks: https://wit3.fbk.eu/
*   EU Bookshop: http://opus.lingfil.uu.se/EUbookshop.php
* Using the provided sentence aligned texts
* Normalising unicode and punctuation to match that of the remainder of the data that you intend to use
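
The normalisation step in the last recommendation could look roughly like the Python sketch below. The choice of NFC and the particular punctuation mapping are assumptions made for illustration only; the right settings are whatever matches the rest of your training data.

```python
import unicodedata

# Illustrative mapping from typographic punctuation to plain ASCII.
# Extend or shrink this table to match your other data sets.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # low double quotation mark (common in German)
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
})


def normalise(text):
    """Apply Unicode NFC normalisation, then map punctuation variants."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(PUNCT_MAP)
```

Applying the same function to both the ParCor texts and any additional training data keeps character-level differences from being mistaken for vocabulary differences.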

Note: The English TED Talks texts in the test set for the IWSLT13 Shared Task dataset contain minor differences between the English-French and English-German pairs, because the pairs were created from snapshots of the TED Talks data taken at different times. The English texts that were annotated in this project were taken from the English-French dataset (as part of some initial experiments in English-French translation). In order to keep the texts in line with their translations, we obtained the German translations from the same data snapshot that the English-French texts were taken from. We therefore recommend our texts as a drop-in replacement for the IWSLT13 Shared Task English-German test set.
