ParCor - A Parallel Pronoun-Coreference Corpus

ParCor 1.0 is a parallel corpus of texts in which pronoun coreference -- reduced coreference in which pronouns are used as referring expressions -- has been annotated. It consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent.

The corpus is intended to serve both as a resource from which to learn systematic differences in pronoun use between languages and, ultimately, as a basis for developing and testing informed Statistical Machine Translation systems that address the problem of pronoun coreference in translation.

If you make use of the ParCor corpus in your work, please cite the following article:

  • Liane Guillou, Christian Hardmeier, Aaron Smith, Jörg Tiedemann and Bonnie Webber (2014): ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT, In Proceedings of LREC 2014, Reykjavik, Iceland.

Release v1.0

Download and Browse

Please see the README file for more information.

Release Information

The release contains the following:

ORIGINAL SOURCE OF DATA

The following TED Talks were downloaded from WIT3. They form the test set of the IWSLT13 Shared Task dataset.

767 - Bill Gates on Energy: Innovating to Zero!
769 - Aimee Mullins: The Opportunity of Adversity
779 - Daniel Kahneman: The Riddle of Experience vs. Memory
783 - Gary Flake: Is Pivot a Turning Point for Web Exploration?
785 - James Cameron: Before Avatar ... a Curious Boy
790 - Dan Barber: How I Fell in Love With a Fish
792 - Eric Mead: The Magic of the Placebo
799 - Jane McGonigal: Gaming Can Make a Better World
805 - Robert Gupta: Music is Medicine, Music is Sanity
824 - Michael Specter: The Danger of Science Denial
837 - Tom Wujec: Build a Tower, Build a Team

The following EU Bookshop documents were downloaded from the EU Bookshop online archive in E-Book format. The raw text was extracted using the Calibre E-Book Management tool.

KEBC11002 - Social Dialogue
KEBC12001 - Demography, Active Ageing and Pensions
KH7911105 - Soil
MI3112464 - Road Transport
MJ3011331 - Energy
NA3211776 - Europe in 12 Lessons
QE3011322 - Shaping Europe
QE3211790 - Active citizenship

Release v1.0pre

Acknowledgments

We would like to thank our annotators, Susanne Tauber, Petra Strom, Samuel Gibbon, David Lawrence and Aaron Smith, for their many hours of painstaking work, and Yannick Versley for making his German pre-processing pipeline available to us. The work was supported by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement 287658 (EU BRIDGE) and by the Swedish Research Council (Vetenskapsrådet) through the project on Discourse-Oriented Machine Translation (2012-916).