The DADA project is developing software for managing language resources and exposing them on the web. Language resources are digital collections of language as audio, video and text, used to study language and build technology systems. The project has been running for a while, initially with funding from the ARC to build the basic infrastructure and later with funding from Macquarie University for work on the Auslan corpus of Australian Sign Language collected by Trevor Johnston. We have recently started two projects that DADA will be part of, and so the pace of development has picked up a little.
The Australian National Corpus (AusNC) is an effort to build a centralised collection of resources of language in Australia. The core idea is to take whatever existing collections we can get permission to publish and make them available under a common technical infrastructure. Using some funding from HCSNet, we built a small demonstration site that allowed free text search on two collections: the Australian Corpus of English and the Corpus of Oz Early English. We now have some funding to continue this work and expand both the size of the collection and the capability of the infrastructure that will support it. What we’ve already done is separate the text in these corpora from their meta-data (descriptions of each text) and their annotation (markup denoting things within the texts). While the pilot allows searching on the text, the next steps will allow search using the meta-data (look for this in texts written after 1900) and the annotation (find this in the titles of articles). This project is funded by the Australian National Data Service (ANDS) and is a collaboration with Michael Haugh at Griffith.
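To make the separation of text, meta-data and annotation a little more concrete, here is a minimal Python sketch of the two kinds of search described above. It is not DADA or AusNC code: the `Document` and `Annotation` classes, the `search` function, the `year` meta-data field and the `title` annotation label are all hypothetical names chosen for illustration, and the two sample texts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A label attached to a character span of a text (hypothetical model)."""
    label: str  # e.g. "title"
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text


@dataclass
class Document:
    """A text stored separately from its meta-data and its annotation."""
    text: str
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)


def search(docs, term, written_after=None, within=None):
    """Return (document, offset) pairs for each case-sensitive match of term.

    written_after: keep only documents whose assumed "year" meta-data field
                   is greater than this value.
    within:        only accept matches that lie entirely inside a span
                   carrying this annotation label.
    """
    hits = []
    for doc in docs:
        if written_after is not None:
            year = doc.metadata.get("year")
            if year is None or year <= written_after:
                continue  # filtered out by meta-data
        if within is None:
            spans = [(0, len(doc.text))]
        else:
            spans = [(a.start, a.end) for a in doc.annotations
                     if a.label == within]
        for start, end in spans:
            pos = doc.text.find(term, start, end)
            while pos != -1:
                hits.append((doc, pos))
                pos = doc.text.find(term, pos + 1, end)
    return hits


# Made-up sample data: the first line of each text is annotated as its title.
docs = [
    Document("Gold Rush\nNews of gold reached Sydney.",
             {"year": 1851}, [Annotation("title", 0, 9)]),
    Document("Wool Prices\nGold and wool exports rose.",
             {"year": 1905}, [Annotation("title", 0, 11)]),
]

after_1900 = search(docs, "Gold", written_after=1900)  # meta-data search
in_titles = search(docs, "Gold", within="title")       # annotation search
```

Because the annotation only refers to offsets in the text, the text itself never has to change when new meta-data fields or annotation layers are added, which is the main attraction of keeping the three apart.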
The Big Australian Speech Corpus, more recently renamed AusTalk, is an ARC-funded project to collect speech and video from 1000 Australian speakers for a new freely available corpus. The project involves many partners around the country, each of whom will have a ‘black box’ recording station to collect audio and stereo video of subjects reading words and sentences, being interviewed and doing the Map Task, a game designed to elicit natural speech between two people. Our part of the project is to provide the server infrastructure that will store the audio, video and annotation data that will make up the corpus. DADA will be part of this solution, but the main driver is to provide a secure and reliable store for the primary data as it comes in from the collection sites. An important feature of the collection is the meta-data that will describe the subjects in the recordings. Some annotation of the data will be done automatically, for example forced alignment of the read words and sentences. Later, we will move on to support manual annotation of some of the data, for example transcripts of the interviews and Map Task sessions. All of this will be published via the DADA server infrastructure to create a large, freely available research collection for Australian English.
Since the development of DADA now involves people outside Macquarie, we have started using a public Bitbucket repository for the code. As of this writing the code still needs some tidying and documentation before third parties can install and work on it, but we hope to have that done within a month. The public DADA demo site is down at the moment due to network upgrades at Macquarie (it’s only visible inside MQ). I hope to have that fixed soon, with some new sample data sets loaded up for testing. 2011 looks like it will be a significant year for DADA. We hope to end the year with a number of significant text, audio and video corpora hosted on DADA infrastructure and providing useful services to the linguistics and language technology communities.