ParCor 1.0 is a parallel corpus of texts in which pronoun coreference -- reduced coreference in which pronouns are used as referring expressions -- has been annotated. It consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent.
The corpus is intended both as a resource from which to learn systematic differences in pronoun use between languages and, ultimately, as a basis for developing and testing informed Statistical Machine Translation systems that address the problem of pronoun coreference in translation.
If you make use of the ParCor corpus in your work, please cite the following article:
Please see the README file for more information.
The release contains the following:
ORIGINAL SOURCE OF DATA
The following TED Talks, which form the test set of the IWSLT13 Shared Task dataset, were downloaded from WIT3:
767 - Bill Gates on Energy: Innovating to Zero!
769 - Aimee Mullins: The Opportunity of Adversity
779 - Daniel Kahneman: The Riddle of Experience vs. Memory
783 - Gary Flake: Is Pivot a Turning Point for Web Exploration?
785 - James Cameron: Before Avatar ... a Curious Boy
790 - Dan Barber: How I Fell in Love With a Fish
792 - Eric Mead: The Magic of the Placebo
799 - Jane McGonigal: Gaming Can Make a Better World
805 - Robert Gupta: Music is Medicine, Music is Sanity
824 - Michael Specter: The Danger of Science Denial
837 - Tom Wujec: Build a Tower, Build a Team
The following EU Bookshop documents were downloaded from the EU Bookshop online archive in E-Book format, and their raw text was extracted using the Calibre E-Book Management tool:
KEBC11002 - Social Dialogue
KEBC12001 - Demography, Active Ageing and Pensions
KH7911105 - Soil
MI3112464 - Road Transport
MJ3011331 - Energy
NA3211776 - Europe in 12 Lessons
QE3011322 - Shaping Europe
QE3211790 - Active citizenship