Amplify is a project maintained by the State Library of NSW for crowdsourcing the transcription of oral history (OH) recordings. It is being used to generate transcripts for part of the Library’s extensive catalogue of OH recordings. Amplify starts with the output of an automatic speech recognition system and presents it to users of the platform for correction and validation. Amplify is great! I’d encourage you to go along and try out the correction process.
I’m interested in how OH transcripts might be enhanced with the application of Natural Language Processing (NLP) techniques. As part of the Alveo project, we’re developing tools to help with recording, archiving and enhancing OH recordings. As a platform for experimenting with this I’ve built a little web application that takes transcripts from the State Library’s Amplify platform, runs them through a few NLP processes and presents the results. This is now available online as Amplify Amplifier.
The app will process any transcript from Amplify that has a 100% completed rating. This should mean that it has been corrected and verified by users, but there are some wrinkles in Amplify (which SLNSW know about) that mean errors often slip through the net. The app currently applies three processes:
- Topic segmentation using the NLTK implementation of Marti Hearst’s TextTiling algorithm. This chunks the interview into topics based on the distribution of words in the text.
- For each topic, we find keywords that are more common in that topic than in the rest of the text using the TF-IDF metric. This is an attempt to identify the main concepts in each topic.
- For each topic, we use the DBpedia Spotlight named entity linking service to find the names of people, places and concepts and link them to relevant entries in DBpedia (a machine-readable version of Wikipedia).
All of this is presented on the page with buttons to allow you to play the audio for each topic.
The results are interesting. The topics that TextTiling finds do give an idea of the overall structure of the interview, although in many cases it over-segments. Keywords give some idea of what each topic is about and sometimes act as a nice summary; other times they are not useful at all, such as the single keyword ‘er’, although I guess this does suggest the speaker is hesitating a lot in that chunk.
Named entities are perhaps the most erratic part of the results. This is not too surprising, since we are applying a system trained on very different kinds of text to Australian oral histories. Spotlight will try to link a name or place to the most common matching entity, and so often ends up with Harry Potter or some other popular icon rather than the correct local alternative.
There is a lot of scope here for improving the results that are generated. However, I’m interested in feedback from OH researchers and other readers of these transcripts to see if this kind of presentation is useful at all. How could this be improved? What other elements of the text could be brought out in a useful way?
I’ll continue to experiment with this and hopefully develop some more useful tools with some feedback from interested users.
Our paper discussing Alveo in the context of reproducibility in language sciences is now available in Computer Speech & Language: DOI:10.1016/j.csl.2017.01.003
- Reviews a number of publications in CSL regarding their practice in using and citing data collections.
- Finds that authors are keen to identify and share data, but that practices vary in how precisely the data are identified and how easily they can be obtained.
- Reviews research workflows in speech and language, including the use of software tools.
- Suggests a ‘hierarchy of needs’ for reproducibility in speech and language research.
- Describes how the Alveo Virtual Laboratory supports a model of research that facilitates data sharing and citation of software tools.
Reproducibility is an important part of scientific research, and studies published in speech and language research usually make some attempt at ensuring that the work reported could be reproduced by other researchers. This paper looks at current practice in the field relating to the citation and availability of both data and software methods. It is common in this field to use widely available shared datasets, which helps to ensure that studies can be reproduced; however, a brief survey of recent papers shows a wide range of styles of data citation, only some of which clearly identify the exact data used in the study. Similarly, practices in describing and sharing software artefacts vary considerably, from detailed descriptions of algorithms to linked repositories. The Alveo Virtual Laboratory is a web-based platform to support research based on collections of text, speech and video. Alveo provides a central repository for language data together with a set of services for discovery and analysis of data. We argue that some features of the Alveo platform may make it easier for researchers to share their data more precisely and to cite the exact software tools used to develop published results. Alveo makes use of ideas developed in other areas of science, and we discuss how these can be applied to speech and language research.
Today’s problem was to write a wrapper for the MAUS automatic segmentation system in preparation for including it as a Galaxy tool. MAUS comes from Florian Schiel in Munich and is a collection of shell scripts and Linux executables that take a sound file and an orthographic transcription and generate an aligned phonetic segmentation. The core of this process is the HTK speech recognition system and getting it to work on anything other than Linux is a pain that is best not lived with.
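The wrapper itself mostly amounts to assembling MAUS’s KEY=value style command line and running it. A sketch of that idea follows; the parameter names (SIGNAL, BPF, OUT, LANGUAGE) are from memory and should be checked against the documentation for whatever MAUS version you have.

```python
import subprocess

def build_maus_command(maus_path, signal, bpf, out, language="aus"):
    """Assemble a MAUS command line in its KEY=value argument style.
    Parameter names here are illustrative - check your MAUS docs."""
    return [
        maus_path,
        f"SIGNAL={signal}",     # the wav file to segment
        f"BPF={bpf}",           # the transcription input file
        f"OUT={out}",           # where to write the aligned segmentation
        f"LANGUAGE={language}",
    ]

def run_maus(maus_path, signal, bpf, out, language="aus"):
    """Run MAUS and raise if it exits with an error."""
    subprocess.run(build_maus_command(maus_path, signal, bpf, out, language),
                   check=True)
```

Separating command construction from execution makes the wrapper easy to test without the MAUS binaries installed.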
I’ve been working for a while now on adapting the Galaxy Workflow engine for use in speech analysis, specifically for acoustic phonetic analysis of vowel sounds. Galaxy is a system used in bioinformatics for constructing workflows to do genetic analysis and other things. As part of the Alveo project we’ve been building tools for doing text and speech analysis for Galaxy. My recent work has been specifically looking at acoustic phonetic analysis with the Emu library with a goal of reproducing some work on children’s vowels that I did with Catherine Watson many years ago.
As part of the Alveo project we’ve been using the Galaxy Workflow Engine to provide a user-friendly, web-based interface to some language processing tools. Galaxy was originally developed for bioinformatics researchers, but we’ve been able to adapt it for language tools quite easily. Galaxy tools are scripts or executable command-line applications that read input data from files and write results out to new files. These files are presented as data objects in the Galaxy interface. Chains of tools can be run one after another to process data from input to final results.
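As a toy illustration of that pattern, here is what a minimal Galaxy tool script could look like; the word-count task and argument handling are my own invention for this sketch, not an actual Alveo tool.

```python
import sys
from collections import Counter

def run(input_path, output_path):
    """Read the file Galaxy supplies as input and write results to the
    file Galaxy supplies as output; Galaxy then shows the result as a
    new data object in the user's history."""
    with open(input_path) as f:
        counts = Counter(f.read().split())
    with open(output_path, "w") as out:
        for word, n in counts.most_common():
            out.write(f"{word}\t{n}\n")

# Galaxy substitutes real dataset paths on the command line, e.g.:
#   python wordcount.py $input $output
if __name__ == "__main__" and len(sys.argv) == 3:
    run(sys.argv[1], sys.argv[2])
```

Because the tool only touches the two file paths it is given, the same script runs unchanged on the command line and inside Galaxy.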
One of the recent updates to Galaxy is the ability to group data objects together into datasets. These datasets can then form the input to a workflow which can be run for each object in the dataset. This is something we’ve wanted for Alveo for a long time since applying the same process to all files in a collection is a common requirement for language processing. After a bit of exploration I’ve worked out how to write a tool that generates a dataset and since the documentation for this is somewhat sparse and confusing, I thought I’d write up my findings.
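On the tool side, generating a dataset mostly comes down to writing several output files into a directory; the collection behaviour itself is declared in the tool’s XML wrapper (via a dataset discovery pattern, if I have read the documentation correctly). A sketch with made-up names:

```python
import os

def split_transcript(input_path, out_dir, lines_per_chunk=50):
    """Write the input out as several chunk files; Galaxy can then
    gather the contents of out_dir into a dataset for later workflows."""
    os.makedirs(out_dir, exist_ok=True)
    with open(input_path) as f:
        lines = f.readlines()
    paths = []
    for i in range(0, len(lines), lines_per_chunk):
        # name.ext style file names so a discovery pattern can infer
        # both the element name and the datatype
        path = os.path.join(out_dir, f"chunk{i // lines_per_chunk + 1:03d}.txt")
        with open(path, "w") as out:
            out.writelines(lines[i:i + lines_per_chunk])
        paths.append(path)
    return paths
```

Each discovered file then becomes one element of the dataset, so a downstream workflow can be mapped over all of them at once.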
Authors: Deanna Wong, Steve Cassidy and Pam Peters
To appear in Corpora, expected publication in 2012. Manuscript available on request.
The textual markup scheme of the International Corpus of English (ICE) project evolved continuously from 1989 onwards, more or less independently of the Text Encoding Initiative (TEI). It was intended to standardise the annotation of all the regional ICE corpora, in order to facilitate inter-comparisons of their linguistic content. However, this goal has proved elusive because of gradual changes in the ICE annotation system, and additions to it made by those working on individual ICE corpora. Further, since the project pre-dates the development of XML-based markup standards, the format of the ICE markup does not match that of many modern corpora and can be difficult to manipulate. As a goal of the original project was interoperability of the various ICE corpora, it is important that the markup of existing and new ICE corpora can be converted into a common format that can serve their ongoing needs, while allowing older markup to be fully included. This paper describes the most significant variations in annotation, and focuses on several points of difficulty inherent in the system: especially the non-hierarchical treatment of the visual and structural elements of written texts, and of overlapping speech in spontaneous conversation. We report on our development of a parser to validate the existing ICE markup scheme and convert it to other formats. The development of this tool not only brings the Australian version into line with the current ICE standard, but also allows for proper validation of all annotation in any of the regional corpora. Once the corpora have been validated, they can easily be converted to a standardised XML format for alternate systems of corpus annotation, such as that developed by the TEI.
The Graph Annotation Format (GrAF) is the XML data exchange format developed for the model of linguistic annotation described in the ISO Linguistic Annotation Framework (LAF). LAF is an abstract model of annotations represented as a graph structure; GrAF is an XML serialisation of that model intended for moving data between different tools. Both were developed by Nancy Ide and Keith Suderman at Vassar College, with input from the community involved in the ISO standardisation process around linguistic data.
The DADA project is developing software for managing language resources and exposing them on the web. Language resources are digital collections of language (audio, video and text) used to study language and build technology systems. The project has been going for a while, with some initial funding from the ARC to build the basic infrastructure and later funding from Macquarie University for some work on the Auslan corpus of Australian Sign Language collected by Trevor Johnston. Recently, two new projects have started that DADA will be part of, and so the pace of development has picked up a little.
The Linguistic Annotation Framework defines a generalised graph-based model for annotation data intended as an interchange format for transfer of annotations between tools. The DADA system uses an RDF-based representation of annotation data and provides a web-based annotation store. The annotation model in DADA can be seen as an RDF realisation of the LAF model. This paper describes the relationship between the two models and makes some comments on how the standard might be stated in a more format-neutral way.
Download PDF: An RDF Realisation of LAF in the DADA Annotation Server
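To make the LAF-to-RDF idea concrete, here is a toy, dependency-free sketch of how an annotation node and its properties map onto triples. The namespace and property names are invented for illustration and are not DADA’s actual vocabulary.

```python
# Hypothetical namespace - not DADA's real vocabulary
DADA = "http://example.org/dada#"

def annotation_triples(ann_id, ann_type, label, start, end, parent=None):
    """Express one LAF-style annotation node as (subject, predicate,
    object) triples, the shape an RDF store would hold."""
    s = f"ann:{ann_id}"
    triples = {
        (s, DADA + "type", ann_type),
        (s, DADA + "label", label),
        (s, DADA + "start", str(start)),
        (s, DADA + "end", str(end)),
    }
    if parent is not None:
        # Edges in the annotation graph become triples between nodes
        triples.add((s, DADA + "partOf", f"ann:{parent}"))
    return triples

graph = annotation_triples(1, "phoneme", "ae", 0.512, 0.587, parent="syl1")
```

The point is that both the node’s feature structure (type, label, times) and the graph edges between annotations reduce to the same triple form, which is what lets an RDF store serve as a backend for the LAF model.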