Authors: Deanna Wong, Steve Cassidy and Pam Peters
To appear in Corpora, expected publication in 2012. Manuscript available on request.
The textual markup scheme of the International Corpus of English (ICE) corpus project evolved continuously from 1989 on, more or less independent of the Text Encoding Initiative (TEI). It was intended to standardise the annotation of all the regional ICE corpora, in order to facilitate inter-comparisons of their linguistic content. However this goal has proved elusive because of gradual changes in the ICE annotation system, and additions to it made by those working on individual ICE corpora. Further, since the project pre-dates the development of XML-based markup standards, the format of the ICE markup does not match that in many modern corpora and can be difficult to manipulate. As a goal of the original project was interoperability of the various ICE corpora, it is important that the markup of existing and new ICE corpora can be converted into a common format that can serve their ongoing needs, while allowing older markup to be fully included. This paper describes the most significant variations in annotation, and focuses on several points of difficulty inherent in the system: especially the non-hierarchical treatment of the visual and structural elements of written texts, and of overlapping speech in spontaneous conversation. We report on our development of a parser to validate the existing ICE markup scheme and convert it to other formats. The development of this tool not only brings the Australian version into line with the current ICE standard, it also allows for proper validation of all annotation in any of the regional corpora. Once the corpora have been validated, they can be converted easily to a standardised XML format for alternate systems of corpus annotation, such as that developed by the TEI.