Category Archives: Language Resources

Supporting accessibility and reproducibility in language research in the Alveo virtual laboratory

Our paper discussing Alveo in the context of reproducibility in language sciences is now available in Computer Speech & Language: DOI:10.1016/j.csl.2017.01.003


  • Reviews a number of publications in CSL regarding their practice in using and citing data collections.
  • Finds that authors are keen to identify and share data but that practices vary in how precise they are or how easy it is to get the data.
  • Reviews research workflows in speech and language, including the use of software tools.
  • Suggests a ‘hierarchy of needs’ for reproducibility in speech and language research.
  • Describes how the Alveo Virtual Laboratory supports a model of research that facilitates data sharing and citation of software tools.


Reproducibility is an important part of scientific research, and studies published in speech and language research usually make some attempt at ensuring that the work reported could be reproduced by other researchers. This paper looks at the current practice in the field relating to the citation and availability of both data and software methods. It is common to use widely available shared datasets in this field, which helps to ensure that studies can be reproduced; however, a brief survey of recent papers shows a wide range of styles of citation of data, only some of which clearly identify the exact data used in the study. Similarly, practices in describing and sharing software artefacts vary considerably, from detailed descriptions of algorithms to linked repositories. The Alveo Virtual Laboratory is a web-based platform to support research based on collections of text, speech and video. Alveo provides a central repository for language data and a set of services for discovery and analysis of data. We argue that some of the features of the Alveo platform may make it easier for researchers to share their data more precisely and cite the exact software tools used to develop published results. Alveo makes use of ideas developed in other areas of science, and we discuss these and how they can be applied to speech and language research.


Mobile Apps for Aboriginal Languages

My introduction to Darwin was on a borrowed bike, which I used to discover the streets around CDU before eventually making my way to the city and the Mindil Beach markets for a Sunday evening feast of Gado-Gado, watching the sunset on the sand. I’m in Darwin for a workshop organised by Steven Bird on building mobile apps for “Keeping our Languages Strong”. While a lot of the work with Australian languages is aimed at preservation and documentation, Steven’s work is aimed more at maintaining the living languages within their communities.
The invitees to the workshop were a mixture of technologists like me, linguists, people working with the language communities and members of the communities themselves. The premise was to bring us together to imagine what mobile apps we might build in the context of Aboriginal languages and then maybe even try to build some demonstrations as a proof of concept during the week. The first two days explored possibilities; the next two left the hackers alone to try to build something; the final morning was a show and tell and a reflection on what we’d managed to achieve.

Updating the ICE Annotation System: Tagging, Parsing and Validation

Authors: Deanna Wong, Steve Cassidy and Pam Peters

To appear in Corpora, expected publication in 2012. Manuscript available on request.

The textual markup scheme of the International Corpus of English (ICE) corpus project evolved continuously from 1989 on, more or less independently of the Text Encoding Initiative (TEI). It was intended to standardise the annotation of all the regional ICE corpora, in order to facilitate inter-comparisons of their linguistic content. However, this goal has proved elusive because of gradual changes in the ICE annotation system, and additions to it made by those working on individual ICE corpora. Further, since the project pre-dates the development of XML-based markup standards, the format of the ICE markup does not match that in many modern corpora and can be difficult to manipulate. As a goal of the original project was interoperability of the various ICE corpora, it is important that the markup of existing and new ICE corpora can be converted into a common format that can serve their ongoing needs, while allowing older markup to be fully included. This paper describes the most significant variations in annotation, and focuses on several points of difficulty inherent in the system: especially the non-hierarchical treatment of the visual and structural elements of written texts, and of overlapping speech in spontaneous conversation. We report on our development of a parser to validate the existing ICE markup scheme and convert it to other formats. The development of this tool not only brings the Australian version into line with the current ICE standard, it also allows for proper validation of all annotation in any of the regional corpora. Once the corpora have been validated, they can be converted easily to a standardised XML format for alternate systems of corpus annotation, such as that developed by the TEI.

Notes on Conversion of GrAF to RDF

The Graph Annotation Format (GrAF) is the XML data exchange format developed for the model of linguistic annotation described in the ISO Linguistic Annotation Framework (LAF). LAF is the abstract model of annotations represented as a graph structure; GrAF is an XML serialisation of that model, intended for moving data between different tools. Both were developed by Nancy Ide and Keith Suderman at Vassar, with input from the community involved in the ISO standardisation process around linguistic data.