Most of my work to date on spoken language dialog systems has been in the context of consultancy work in the commercial domain, where I have worked with companies such as Nuance, Philips and VeCommerce to deliver real deployed systems based on speech recognition; through this work I've been involved in a number of high-profile applications in everyday use in Australia.
As a consequence of this work, I have become very interested in the design and development of practical spoken language dialog systems, and in the question of how we can best design systems that are 'habitable'. I'm particularly conscious of what I perceive as a very wide gap between the kinds of systems that are deployed in the real world, and the kinds of systems that are developed in research laboratories and discussed at academic conferences. I am very interested in the reasons for this difference, and in the extent to which the gap might be narrowed. I think it's reasonable to ask whether attempting to narrow that gap by making systems more 'natural' is the right thing to do: increasingly, I'm of the view that it is not, principally because the way current technologies are able to detect and react to error is fundamentally different from what we as humans are able to do.
With Stephen Choularton, I am exploring the problem of how we can handle speech recognition errors [Choularton and Dale 2004], with a particular focus on 'early detection': can we determine, simply from properties of the speech signal itself, that the user's utterance is likely to be misunderstood by the recognizer? Our results here so far are promising, and are the subject of a patent application; we believe we can predict with high accuracy whether a speech recognizer will misrecognize a given utterance, thus allowing the system to take pre-emptive action before the error takes the dialog off-track.
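To make the shape of this idea concrete, here is a purely illustrative sketch of signal-level misrecognition prediction. The feature set (duration, mean energy, zero-crossing rate), the `features` and `NearestCentroid` names, and the classifier itself are all my own hypothetical choices for exposition; they are not the method developed with Choularton or described in the patent application.

```python
# Illustrative sketch only: flag utterances likely to be misrecognized
# using crude utterance-level signal features and a toy classifier.
# Features and classifier are hypothetical, not the actual method.

def features(samples, sample_rate):
    """Crude utterance-level features: duration (s), mean energy,
    and zero-crossing rate (crossings per sample)."""
    n = len(samples)
    duration = n / sample_rate
    energy = sum(s * s for s in samples) / n
    zero_crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return (duration, energy, zero_crossings / n)


class NearestCentroid:
    """Minimal classifier: label a feature vector by its nearest
    class centroid in Euclidean distance."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lbl in zip(X, y) if lbl == label]
            self.centroids[label] = tuple(
                sum(col) / len(rows) for col in zip(*rows)
            )
        return self

    def predict(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda lbl: sq_dist(self.centroids[lbl]))
```

In use, one would train on utterances labeled by whether the recognizer got them right, then score each incoming utterance before recognition results are acted on, letting the dialog manager ask for confirmation or a repeat when a misrecognition looks likely.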
From the beginning of 2007, we have received funding under the ARC's Thinking Systems Initiative, as part of a large multisite team and a five-year project, to build a sophisticated and reactive embodied conversational agent. Our role in this project is two-fold: first, we are interested in producing spoken output that is conveyed with appropriate intonational structure, and married to appropriate facial expressions; and second, we aim to pursue the idea of what we might call 'quasi-conversational systems', where the linguistic limitations of the system's interpretational capabilities are acknowledged and leveraged.