Authors: Deanna Wong, Steve Cassidy and Pam Peters
To appear in Corpora, expected publication in 2012. Manuscript available on request.
The textual markup scheme of the International Corpus of English (ICE) project evolved continuously from 1989 on, more or less independently of the Text Encoding Initiative (TEI). It was intended to standardise the annotation of all the regional ICE corpora, in order to facilitate comparison of their linguistic content. However, this goal has proved elusive because of gradual changes in the ICE annotation system, and additions to it made by those working on individual ICE corpora. Further, since the project pre-dates the development of XML-based markup standards, the format of the ICE markup does not match that of many modern corpora and can be difficult to manipulate. As a goal of the original project was interoperability of the various ICE corpora, it is important that the markup of existing and new ICE corpora can be converted into a common format that can serve their ongoing needs, while allowing older markup to be fully included. This paper describes the most significant variations in annotation, and focuses on several points of difficulty inherent in the system, especially the non-hierarchical treatment of the visual and structural elements of written texts, and of overlapping speech in spontaneous conversation. We report on our development of a parser to validate the existing ICE markup scheme and convert it to other formats. The development of this tool not only brings the Australian version into line with the current ICE standard, but also allows for proper validation of all annotation in any of the regional corpora. Once the corpora have been validated, they can be converted easily to a standardised XML format for alternative systems of corpus annotation, such as that developed by the TEI.
The Graph Annotation Format (GrAF) is the XML data exchange format developed for the model of linguistic annotation described in the ISO Linguistic Annotation Framework (LAF). LAF is the abstract model of annotations represented as a graph structure; GrAF is an XML serialisation of that model, intended for moving data between different tools. Both were developed by Nancy Ide and Keith Suderman at Vassar, with input from the community involved in the ISO standardisation process around linguistic data.
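To make the idea of annotations as a graph a little more concrete, here is a minimal Python sketch. It is not GrAF itself and does not follow the GrAF XML schema; the class and method names (`Node`, `AnnotationGraph`, `span_text`, `children`) and the example labels are invented for illustration. It assumes only the general idea that nodes point at regions of the primary text, carry features, and are linked by labelled edges.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node anchored to a character span of the primary text."""
    node_id: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    features: dict = field(default_factory=dict)


@dataclass
class AnnotationGraph:
    """Hypothetical stand-off annotation graph; not the GrAF data model itself."""
    text: str
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (from_id, to_id, label) tuples

    def add_node(self, node_id, start, end, **features):
        self.nodes[node_id] = Node(node_id, start, end, dict(features))

    def add_edge(self, from_id, to_id, label):
        # Refuse edges that mention nodes we have not seen.
        if from_id not in self.nodes or to_id not in self.nodes:
            raise KeyError("both ends of an edge must already be nodes")
        self.edges.append((from_id, to_id, label))

    def span_text(self, node_id):
        node = self.nodes[node_id]
        return self.text[node.start:node.end]

    def children(self, node_id, label=None):
        return [to_id for from_id, to_id, edge_label in self.edges
                if from_id == node_id and (label is None or edge_label == label)]


def build_example():
    graph = AnnotationGraph(text="the dog barks")
    graph.add_node("t1", 0, 3, pos="DT")
    graph.add_node("t2", 4, 7, pos="NN")
    graph.add_node("t3", 8, 13, pos="VBZ")
    graph.add_node("np1", 0, 7, cat="NP")
    graph.add_edge("np1", "t1", "constituent")
    graph.add_edge("np1", "t2", "constituent")
    return graph


graph = build_example()
```

Because the primary text is never modified, several tools could add their own nodes and edges over the same text, which is the property that makes a graph model attractive as an interchange representation.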
The DADA project is developing software for managing language resources and exposing them on the web. Language resources are digital collections of language as audio, video and text, used to study language and build technology systems. The project has been going for a while, with some initial funding from the ARC to build the basic infrastructure and later funding from Macquarie University for some work on the Auslan corpus of Australian Sign Language collected by Trevor Johnston. Recently we have started two projects which DADA will be part of, and so the pace of development has picked up a little.
The Linguistic Annotation Framework defines a generalised graph-based model for annotation data intended as an interchange format for transfer of annotations between tools. The DADA system uses an RDF-based representation of annotation data and provides a web-based annotation store. The annotation model in DADA can be seen as an RDF realisation of the LAF model. This paper describes the relationship between the two models and makes some comments on how the standard might be stated in a more format-neutral way.
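As a rough illustration of what an RDF-style view of an annotation can look like, the sketch below writes one annotation out as subject, predicate, object triples using plain Python tuples. The namespace `http://example.org/annotation#`, the property names and both helper functions are assumptions made for this example; they are not the vocabulary DADA uses, and a real system would use an RDF library rather than tuples.

```python
# Hypothetical namespace, for illustration only; not DADA's actual vocabulary.
EX = "http://example.org/annotation#"


def annotation_to_triples(ann_id, start, end, features):
    """Express one annotation as (subject, predicate, object) triples."""
    subject = EX + ann_id
    triples = [
        (subject, EX + "start", start),
        (subject, EX + "end", end),
    ]
    # Each feature of the annotation becomes one more property of the subject.
    for name, value in sorted(features.items()):
        triples.append((subject, EX + name, value))
    return triples


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


triples = annotation_to_triples("a1", 0, 3, {"pos": "DT"})
determiners = match(triples, p=EX + "pos", o="DT")
```

The pattern matching in `match` is a toy stand-in for the kind of query a triple store answers; the point is only that graph nodes, their anchors and their features all reduce to the same uniform triple shape.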
Download PDF: An RDF Realisation of LAF in the DADA Annotation Server
Steve Cassidy and Trevor Johnston.
Download PDF: Ingesting the Auslan Corpus into the DADA Annotation Store
Annotation data is stored and manipulated in various formats and there have been a number of efforts to build generalised models of annotation to support sharing of data between tools. This work has shown that it is possible to store annotations from many different tools in a single canonical format and allow transformation into other formats as needed. However, moving data between formats is often a matter of importing or exporting from one tool to another. This paper describes a web-based interface to annotation data that makes use of an abstract model of annotation in its internal store but is able to deliver a variety of annotation formats to clients over the web.
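The general pattern described here, one internal representation with several delivery formats, can be sketched in a few lines of Python. This is only an illustration of the idea and not the interface described in the paper: the dictionary-based annotation records, the `SERIALISERS` registry and the `deliver` function are all hypothetical, and the two output formats were chosen simply because they are easy to show.

```python
import json


def to_json(annotations):
    """Serialise annotations as a JSON array."""
    return json.dumps(annotations, sort_keys=True)


def to_tsv(annotations):
    """Serialise annotations as tab-separated lines with a header row."""
    lines = ["start\tend\tlabel"]
    for ann in annotations:
        lines.append(f"{ann['start']}\t{ann['end']}\t{ann['label']}")
    return "\n".join(lines)


# Hypothetical registry mapping a requested format name to a serialiser.
SERIALISERS = {"json": to_json, "tsv": to_tsv}


def deliver(annotations, fmt):
    """Return the stored annotations in the format the client asked for."""
    try:
        serialiser = SERIALISERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return serialiser(annotations)


store = [{"start": 0, "end": 3, "label": "DT"},
         {"start": 4, "end": 7, "label": "NN"}]
tsv_output = deliver(store, "tsv")
json_output = deliver(store, "json")
```

Adding a new client format then means registering one more serialiser, without touching the stored data.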
Presented at the 2nd Linguistic Annotation Workshop (The LAW II) at LREC 2008, Marrakech.
I have a PhD scholarship available for a project applying Semantic Web technologies (RDF, SPARQL, Annotea) to the linguistic annotation problem. Here’s an outline:
Shared collaborative distributed annotation using semantic web technologies.
The Semantic Web augments the current Web with machine-processable information enabling humans and machines to work in cooperation; in our context, we are using it as the basis of a linguistic annotation system that is used by language researchers to annotate language resources. This project will look at the issues raised when we allow many people to collaborate on authoring these annotations and making shared annotations available to a community of researchers. This crosses a number of existing areas of research including the semantic web and social computing, and extends the range of interactions available to researchers over the web.
Of course, as usual there is scope for variation on this theme. If you’re interested in this problem space and want to pursue a PhD in Australia, please get in touch. The scholarship is open to Australian and international students.
Update: Unfortunately this scholarship is no longer available. However, Macquarie does have an active scholarships program, and from time to time new scholarships become available that could cover this research area. Please check the Macquarie scholarships page for current details, and feel free to contact me if you’d like to discuss options.