As part of the Alveo project we’ve been using the Galaxy Workflow Engine to provide a user-friendly, web-based interface to some language processing tools. Galaxy was originally developed for bioinformatics researchers, but we’ve been able to adapt it for language tools quite easily. Galaxy tools are scripts or executable command-line applications that read input data from files and write results out to new files. These files are presented as data objects in the Galaxy interface, and chains of tools can be run one after another to process data from input to final results.
One of the recent updates to Galaxy is the ability to group data objects together into datasets. These datasets can then form the input to a workflow, which can be run once for each object in the dataset. This is something we’ve wanted for Alveo for a long time, since applying the same process to every file in a collection is a common requirement in language processing. After a bit of exploration I’ve worked out how to write a tool that generates a dataset, and since the documentation for this is somewhat sparse and confusing, I thought I’d write up my findings.
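To give the flavour of the approach, here is a minimal sketch of the script side of a dataset-generating tool. The file name `split_collection.py`, the output directory and the file naming pattern are my own inventions for illustration, and the Galaxy tool configuration that tells Galaxy to discover the output files is omitted:

```python
#!/usr/bin/env python
# split_collection.py -- hypothetical Galaxy tool script that splits an
# input file into several output files, one per line, so that Galaxy
# can discover them as members of a new dataset.
import os
import sys


def main():
    input_path, output_dir = sys.argv[1], sys.argv[2]
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    with open(input_path) as f:
        for i, line in enumerate(f):
            # One output file per item; Galaxy's dataset discovery picks
            # these up by filename and adds each as a dataset element.
            part = os.path.join(output_dir, "part_%03d.txt" % i)
            with open(part, "w") as out:
                out.write(line)


if __name__ == "__main__":
    main()
```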
Authors: Deanna Wong, Steve Cassidy and Pam Peters
To appear in Corpora, expected publication in 2012. Manuscript available on request.
The textual markup scheme of the International Corpus of English (ICE) corpus project evolved continuously from 1989 on, more or less independently of the Text Encoding Initiative (TEI). It was intended to standardise the annotation of all the regional ICE corpora, in order to facilitate inter-comparisons of their linguistic content. However, this goal has proved elusive because of gradual changes in the ICE annotation system, and additions to it made by those working on individual ICE corpora. Further, since the project pre-dates the development of XML-based markup standards, the format of the ICE markup does not match that of many modern corpora and can be difficult to manipulate. As a goal of the original project was interoperability of the various ICE corpora, it is important that the markup of existing and new ICE corpora can be converted into a common format that can serve their ongoing needs, while allowing older markup to be fully included. This paper describes the most significant variations in annotation, and focuses on several points of difficulty inherent in the system: especially the non-hierarchical treatment of the visual and structural elements of written texts, and of overlapping speech in spontaneous conversation. We report on our development of a parser to validate the existing ICE markup scheme and convert it to other formats. The development of this tool not only brings the Australian version into line with the current ICE standard, but also allows for proper validation of all annotation in any of the regional corpora. Once the corpora have been validated, they can be converted easily to a standardised XML format compatible with alternate systems of corpus annotation, such as that developed by the TEI.
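To illustrate the core difficulty with overlapping speech: two speakers’ stretches of talk can overlap in time, and overlapping spans cannot be expressed as properly nested XML elements. One common way around this is stand-off annotation, sketched below with markers and names of my own, not the actual ICE tags:

```python
# A minimal sketch of stand-off annotation for overlapping speech.
# The example text and field names are simplified illustrations,
# not the actual ICE markup scheme.

# Two speakers whose utterances overlap in time; the overlap cannot be
# encoded as properly nested XML elements.
text = "A: we could maybe B: yeah maybe go tomorrow"

# Stand-off annotations point into the text by character offsets, so
# overlapping spans coexist without any nesting requirement.
annotations = [
    {"id": "u1", "speaker": "A", "start": 3, "end": 17},   # "we could maybe"
    {"id": "u2", "speaker": "B", "start": 21, "end": 31},  # "yeah maybe"
    {"id": "ov1", "type": "overlap", "targets": ["u1", "u2"]},
]

for a in annotations:
    if "start" in a:
        print(a["id"], a["speaker"], repr(text[a["start"]:a["end"]]))
```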
The Graph Annotation Format (GrAF) is the XML data exchange format developed for the model of linguistic annotation described in the ISO Linguistic Annotation Framework (LAF). LAF is the abstract model of annotations represented as a graph structure; GrAF is an XML serialisation of the model intended for moving data between different tools. Both were developed by Nancy Ide and Keith Suderman at Vassar College, with input from the community involved in the ISO standardisation process around linguistic data.
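As a rough Python rendering of the LAF idea (the class and field names here are my own, not taken from the standard): annotations form a graph whose nodes link, via regions, to spans of the primary data, and carry feature structures as their content.

```python
# A sketch of the LAF abstract model; names are illustrative, not from
# the standard. Annotations are nodes in a graph, regions anchor nodes
# to the primary data, and features carry the annotation content.
from dataclasses import dataclass, field


@dataclass
class Region:
    id: str
    start: int  # anchor into the primary data, e.g. a character offset
    end: int


@dataclass
class Node:
    id: str
    regions: list = field(default_factory=list)   # links to primary data
    features: dict = field(default_factory=dict)  # annotation content


@dataclass
class Edge:
    id: str
    source: str  # id of the source node
    target: str  # id of the target node


# A token node over characters 0-5 of the text, annotated as a noun.
token = Node("n1", regions=[Region("r1", 0, 5)],
             features={"label": "tok", "pos": "NN"})
```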
The DADA project is developing software for managing language resources and exposing them on the web. Language resources are digital collections of language, such as audio, video and text, used to study language and to build technology systems. The project has been going for a while, with some initial funding from the ARC to build the basic infrastructure and later from Macquarie University for some work on the Auslan corpus of Australian Sign Language collected by Trevor Johnston. Recently two new projects that DADA will be part of have come along, so the pace of development has picked up a little.
The Linguistic Annotation Framework defines a generalised graph-based model for annotation data intended as an interchange format for transfer of annotations between tools. The DADA system uses an RDF-based representation of annotation data and provides a web-based annotation store. The annotation model in DADA can be seen as an RDF realisation of the LAF model. This paper describes the relationship between the two models and makes some comments on how the standard might be stated in a more format-neutral way.
Download PDF: An RDF Realisation of LAF in the DADA Annotation Server
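For a flavour of what an RDF realisation of a LAF-style annotation might look like, here is a sketch using rdflib (rather than Redland, for brevity); the dada: namespace and the property names are invented for illustration and are not the actual DADA schema:

```python
# Sketch: one LAF-style annotation node as RDF triples.
# The namespace and property names are illustrative only.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

DADA = Namespace("http://example.org/dada/")

g = Graph()
g.bind("dada", DADA)

# An annotation node covering a region of the primary text, with a
# feature (part of speech) attached, mirroring LAF's node/region split.
g.add((DADA.n1, RDF.type, DADA.Annotation))
g.add((DADA.n1, DADA.targets, DADA.r1))
g.add((DADA.n1, DADA.pos, Literal("NN")))
g.add((DADA.r1, RDF.type, DADA.Region))
g.add((DADA.r1, DADA.start, Literal(0)))
g.add((DADA.r1, DADA.end, Literal(5)))

print(g.serialize(format="turtle"))
```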
Steve Cassidy and Trevor Johnston.
Download PDF: Ingesting the Auslan Corpus into the DADA Annotation Store
I gave a talk last week introducing the Arduino platform to some MQ students and staff. It seemed to go well and there is a bit of interest in carrying on with a regular meetup in the Electronics labs; more details to come when we organise a time. Meanwhile, here are my slides from the talk, not that they’re very informative by themselves, but I wanted to try out SlideShare.
This is a project idea for an Honours student or similar. Please contact me if you’d like to follow this up.
I’ve been having fun with Arduino boards lately; these are small single-chip development boards with input/output lines that can read sensors, control motors and so on. They are programmed in Wiring, which is really C with some sugar and libraries added. I’ve been thinking that the Arduino would make a nice platform to stimulate some interest in beginning programmers, as a break from the usual run of problems that we set them. This project would focus on developing a set of exercises suitable for a first or second programming class (I’m thinking COMP125) to develop some of the ideas explored there (data structures, simple algorithms) in a concrete context. Part of the project would be building a suitable platform (I fancy a Blimpduino) and then perhaps evaluating the use of the platform with real live first-year students.
Annotation data is stored and manipulated in various formats and there have been a number of efforts to build generalised models of annotation to support sharing of data between tools. This work has shown that it is possible to store annotations from many different tools in a single canonical format and allow transformation into other formats as needed. However, moving data between formats is often a matter of importing or exporting from one tool to another. This paper describes a web-based interface to annotation data that makes use of an abstract model of annotation in its internal store but is able to deliver a variety of annotation formats to clients over the web.
Presented at the 2nd Linguistic Annotation Workshop (LAW II) at LREC 2008, Marrakech.
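The delivery mechanism can be sketched as straightforward HTTP content negotiation: the server keeps annotations in its internal abstract model and picks a serialiser based on what the client asks for. The format names and serialiser functions below are illustrative only, not the actual server code:

```python
# Sketch of format negotiation for annotation delivery over the web.
# Serialiser names and the annotation dictionaries are hypothetical.
import json


def serialise_json(annotations):
    return "application/json", json.dumps(annotations)


def serialise_xml(annotations):
    rows = "".join(
        '<annotation id="%s" start="%d" end="%d"/>' %
        (a["id"], a["start"], a["end"]) for a in annotations)
    return "application/xml", "<annotations>%s</annotations>" % rows


SERIALISERS = {
    "application/json": serialise_json,
    "application/xml": serialise_xml,
}


def deliver(annotations, accept_header):
    # Pick the first media type the client accepts that we can produce;
    # a real server would also honour quality values in Accept.
    for media_type in accept_header.split(","):
        media_type = media_type.split(";")[0].strip()
        if media_type in SERIALISERS:
            return SERIALISERS[media_type](annotations)
    return serialise_json(annotations)  # default representation
```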
As part of DADA (and yes, that page is a bit out of date) I wanted to provide a Sparql endpoint to allow experimentation with querying the raw RDF annotation data. So far we’ve built everything using Redland in Python, but it seems there is no existing Sparql endpoint implementation for this combination. The Sparql protocol document is long, but as far as I can tell the core of the protocol is a simple GET request with an encoded Sparql query; results are returned as raw XML in the special Sparql result format, or as RDF/XML if the return type is a graph. This proves to be very easy to implement on top of Redland, since its query operator returns exactly those result types.
So, I present SparqlEndpoint-0.1, a Python module that provides a WSGI-conformant implementation of a Sparql endpoint for Redland. It almost certainly doesn’t implement all of the protocol standard and it can be improved no end, for example by making it independent of the RDF backend it queries (e.g. using RDFlib).
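The core of such an endpoint fits in a few lines. The sketch below is not the SparqlEndpoint-0.1 code itself, and the Redland calls are written from memory of the Python bindings, so treat them as approximate:

```python
# A stripped-down Sparql endpoint as a WSGI application over a Redland
# model. Not the actual SparqlEndpoint code; the Redland calls should
# be checked against the bindings you have installed.
from urllib.parse import parse_qs  # on Python 2: from urlparse import parse_qs

import RDF


def make_endpoint(model):
    def application(environ, start_response):
        params = parse_qs(environ.get("QUERY_STRING", ""))
        qstring = params.get("query", [""])[0]
        if not qstring:
            start_response("400 Bad Request",
                           [("Content-Type", "text/plain")])
            return [b"missing query parameter"]
        query = RDF.Query(qstring, query_language="sparql")
        results = query.execute(model)
        # Redland serialises SELECT results to the Sparql XML result
        # format, and graph (CONSTRUCT/DESCRIBE) results to RDF/XML.
        body = results.to_string()
        ctype = ("application/rdf+xml" if results.is_graph()
                 else "application/sparql-results+xml")
        start_response("200 OK", [("Content-Type", ctype)])
        return [body if isinstance(body, bytes) else body.encode("utf-8")]
    return application
```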
I’m not putting up a demo endpoint just yet as I’m having severe performance issues with my development server in combination with Redland. The triple store has grown rapidly into the millions of triples, and the result is huge latency (tens of minutes) on some queries. Given some recent discussion on the Redland list, I’m wondering whether a jump to one of the dedicated RDF stores is the thing to do. This would probably mean rewriting my code in Java, but based on the Berlin Sparql Benchmark numbers, Sesame and Jena have the kind of performance I need (sub-second query response times on 100M triples).
Well, enough of that. If you are interested in SparqlEndpoint, please download it and take a look. If there is interest I’m happy to share it and host development somewhere accessible.