A vast amount of usable electronic data is in the form of unstructured text. The relation extraction (RE) task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational database or RDF triple store that can be more effectively used for querying and automated reasoning. A number of resources have been developed for training and evaluation of automatic systems for RE in different domains. However, comparative evaluation is impeded by the fact that these corpora use different markup formats and different notions of what constitutes a relation.
The Edinburgh Regularised RE Corpora consist of publicly available data sets (ACE 2004, ACE 2005 and BioInfer) that have been standardised for evaluation of multi-type RE across domains.
Annotation includes: 1) a refactored version of the original data to a common XML document type, 2) linguistic information from LT-TTT2 and Minipar, and 3) a normalised version of the original RE markup that complies with a shared notion of what constitutes a relation across domains.
The corpora are converted to a common document type using token standoff, while maintaining compatibility with all information in the original annotation.
The original ACE data is encoded in SGML, does not include sentence or word token markup, and uses character standoff annotation for entities and relations. Therefore, conversion to RE XML requires SGML-to-XML conversion, tokenisation and mapping from character offset to token offset. [ACE example]
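The mapping from character offsets to token offsets can be illustrated with a minimal Python sketch. It is not the conversion code used to build the corpora: the regex tokeniser, the function names and the half-open span convention are all assumptions made for illustration.

```python
import re

def tokenise(text):
    """Return (token, start, end) triples from a simple regex tokeniser.

    Assumption: words and single punctuation marks are tokens; the real
    pipeline's tokenisation rules are not reproduced here.
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_span_to_token_span(tokens, char_start, char_end):
    """Map a half-open character span to an inclusive (first, last) token span.

    Returns None when no token overlaps the character span.
    """
    overlapping = [i for i, (_, start, end) in enumerate(tokens)
                   if start < char_end and end > char_start]
    if not overlapping:
        return None
    return overlapping[0], overlapping[-1]

text = "Amidu Berry, one half of PBS"
tokens = tokenise(text)

# Character standoff for the mention "Amidu Berry" (hypothetical offsets).
mention_start = text.index("Amidu Berry")
mention_end = mention_start + len("Amidu Berry")

token_span = char_span_to_token_span(tokens, mention_start, mention_end)
print(token_span)  # (0, 1)
```

Using token overlap rather than exact boundary matching means a character span that starts or ends inside a token still maps to whole tokens, which is one plausible way to handle mismatches between annotation and tokenisation.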
The original BioInfer data is already encoded in XML, includes sentence and word token markup, and uses token standoff annotation for entities and relations. Therefore, conversion to the RE XML document type is a matter of simple XML-to-XML transformation. [BioInfer example]
The data is enriched with several types of linguistic preprocessing information, produced with components available as part of LT-TTT2: part-of-speech tagging, lemmatisation, identification and interpretation of nominalisations, verb and noun phrase chunking, identification of chunk heads, and identification of the voice and polarity of verb phrases. [ACE example, BioInfer example]
The reannotation process normalised the two data sets so that they comply with a shared notion of relation that is intuitive, simple and informed by the semantic web. (See Hachey for details.)
For the ACE data, we automatically converted many relations involving nested, nominal entity mentions to relations involving non-nested, named or pronominal entity mentions. For example, in the text snippet 'Amidu Berry, one half of PBS', the first entity of the membership relation is mapped from 'one' to 'Amidu Berry'.
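As a rough illustration of this kind of mention mapping, the Python sketch below replaces a nominal relation argument with the nearest named or pronominal mention of the same entity. The `Mention` class, the `preferred_mention` function and the nearest-mention heuristic are assumptions for illustration only; the actual conversion rules, including the treatment of nesting, are not specified here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    entity_id: str  # identifier of the entity this mention refers to
    text: str       # surface string of the mention
    kind: str       # "NAM" (named), "NOM" (nominal) or "PRO" (pronominal)
    start: int      # token offset of the first token

def preferred_mention(mention, mentions):
    """Return a named or pronominal mention to stand in for a nominal one.

    Assumption: the closest coreferent NAM/PRO mention (by token distance)
    is chosen; if none exists, the original mention is kept.
    """
    if mention.kind != "NOM":
        return mention
    candidates = [m for m in mentions
                  if m.entity_id == mention.entity_id and m.kind in ("NAM", "PRO")]
    if not candidates:
        return mention
    return min(candidates, key=lambda m: abs(m.start - mention.start))

# Mentions in 'Amidu Berry, one half of PBS' (entity ids are hypothetical).
mentions = [
    Mention("E1", "Amidu Berry", "NAM", 0),
    Mention("E1", "one", "NOM", 3),
    Mention("E2", "PBS", "NAM", 6),
]

first_argument = preferred_mention(mentions[1], mentions)
print(first_argument.text)  # Amidu Berry
```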
For the BioInfer data, we flattened nested relations, mapped part-whole to part-part relations, and mapped n-ary to binary relations.
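One simple way to picture the n-ary to binary step is to expand each n-ary relation into a binary relation for every pair of its arguments. The Python sketch below does exactly that; it is an assumed strategy for illustration, not necessarily the mapping applied to BioInfer, and the relation label and protein names are hypothetical.

```python
from itertools import combinations

def nary_to_binary(relation_type, arguments):
    """Expand one n-ary relation into binary relations over all argument pairs.

    Assumption: arguments are unordered, so each pair appears once, in the
    order the arguments were listed.
    """
    return [(relation_type, first, second)
            for first, second in combinations(arguments, 2)]

# A hypothetical ternary relation over three entity mentions.
binary_relations = nary_to_binary("BIND", ["actin", "profilin", "cofilin"])
for relation in binary_relations:
    print(relation)
# ('BIND', 'actin', 'profilin')
# ('BIND', 'actin', 'cofilin')
# ('BIND', 'profilin', 'cofilin')
```

A relation with n arguments yields n(n-1)/2 binary relations under this scheme, so a binary relation passes through unchanged.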
The modified versions of the ACE 2004 and 2005 data are available through the Linguistic Data Consortium (LDC) as Edinburgh Regularised ACE (reACE). The modified corpus is free of charge for 2011 members of the LDC.
The modified version of the BioInfer data is available as Edinburgh Regularised BioInfer (reBioInfer). The modified corpus is free of charge under the same open-source license terms as the original BioInfer data set.
Please acknowledge use by citing:
Ben Hachey, Claire Grover and Richard Tobin (2011). Datasets for Generic Relation Extraction. Journal of Natural Language Engineering.
[Official: doi, Preprint: pdf]
Ben Hachey (2009). Multi-Document Summarisation Using Generic Relation Extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore.
Ben Hachey (2009). Generic Relation Identification: Models and evaluation. In: Australasian Language Technology Workshop, Sydney, NSW, Australia.
Original data sets:
Alexis Mitchell, Stephanie Strassel, Shudong Huang and Ramez Zakhary (2005). ACE 2004 Multilingual Training Corpus. Linguistic Data Consortium, Philadelphia.
[more information from corpus homepage]
Christopher Walker, Stephanie Strassel, Julie Medero and Kazuaki Maeda (2006). ACE 2005 Multilingual Training Corpus. Linguistic Data Consortium, Philadelphia.
[more information from corpus homepage]
Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen and Tapio Salakoski (2007). BioInfer: A corpus for information extraction in the biomedical domain. BMC Bioinformatics, 8:50.
[pdf, more information from corpus homepage]