I have long been interested in how language technology can be used in the context of real documents and the processing of real texts, as opposed to a focus on simplified examples abstracted away from what other researchers sometimes consider 'noise'. There's much we can learn by looking at simplified examples, of course; but if we can't carry that knowledge over into the processing of language as it is found 'in the wild', the chances of our research making a significant impact on the world are small. This goes beyond a preference for corpus-based linguistics: I'm particularly interested in 'embodied language', which covers both language that is visually and typographically situated in a PDF file or on a web page, and spoken language that is 'situated' on a noisy telephone line. This page is concerned with the former; see here for more on my interests in the latter.
My early work in this area was primarily concerned with the automation of editorial assistance, where I was interested in rule-based methods for detecting and correcting transgressions of 'house style' [Dale 1989b, Dale 1990b, Dale and Douglas 1993, Dale 1997a], and in elegant methods for detecting and correcting grammatical errors [Douglas and Dale 1992]. Much of this work, which was funded in the UK as part of the Editor's Assistant project, is summed up in [Dale and Douglas 1996]. This work also led to an exploration of how editorial assistance could be provided for working with bibliographic data: see [Matheson and Dale 1993].
Much of my research in natural language generation also falls in this space, given its emphasis on producing real 'embodied text' from pre-existing underlying data sources: see here for more information.
More recently, the focus of my work in processing real documents has been on information extraction and related tasks. Here, the goal is to build technologies that can handle raw documents, irrespective of source format, extracting and re-purposing the key information they contain. Much of this work has been funded by the Capital Markets Co-operative Research Centre (CMCRC), and more recently by the Defence Science and Technology Organisation (DSTO).
The CMCRC-funded work revolves around the GainSpring project, which aims to extract information from company announcements so that it can be delivered to interested parties speedily and efficiently, using a combination of information extraction and text summarisation [Dale et al 2004, Dale et al 2005]. Particular problems we've looked at here are the extraction of information from tabular data [Dale et al 2002, Dale et al 2003, Long et al 2005, Long et al 2006] and the handling of complex named entities that involve conjunction [Mazur and Dale 2005, Mazur and Dale 2006a]. Current work with Scott Nowson returns to the summarisation task.
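To give a flavour of the conjunction problem: a span like 'Johnson and Johnson' names one entity, while 'IBM and Microsoft' names two. The sketch below is purely illustrative (it is not the method from the cited papers), using a hypothetical gazetteer lookup to decide whether the conjunction is internal to a single name:

```python
# Toy sketch only: deciding whether 'and' inside a candidate named-entity
# span joins two entities or is part of one entity's name.
# KNOWN_ENTITIES is an assumed gazetteer, not a real resource.
KNOWN_ENTITIES = {"Johnson and Johnson", "Procter and Gamble",
                  "IBM", "Microsoft"}

def split_conjoined(span):
    """Return the list of entity names in a candidate span.

    If the whole span is a known single name, keep it intact;
    otherwise split on ' and ' and treat each side as an entity."""
    if span in KNOWN_ENTITIES:
        return [span]
    parts = [p.strip() for p in span.split(" and ")]
    return parts if len(parts) > 1 else [span]
```

A real system would of course need to handle unknown names, nested coordination, and ambiguous cases that a simple lookup cannot resolve; this just makes the ambiguity concrete.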
The DSTO-funded work has so far focussed on the detection and interpretation of temporal expressions [Mazur and Dale 2006b, Dale and Mazur 2006]; under new funding for 2007, this work is growing to encompass a number of other subtasks in the ACE competition.
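The two halves of the temporal task can be illustrated with a toy recogniser: detection finds expressions like '5 March 2006' or 'tomorrow' in text, and interpretation grounds them to calendar dates, with relative expressions resolved against a reference date such as the document's publication date. This is a minimal sketch under those assumptions, not the system described above:

```python
# Illustrative sketch: detect a few temporal expression patterns and
# normalise them to ISO dates. Coverage here is deliberately tiny.
import re
from datetime import date, timedelta

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june",
     "july", "august", "september", "october", "november", "december"])}

PATTERNS = [
    # e.g. '5 March 2006'
    ("DATE_DMY", re.compile(r"\b(\d{1,2})\s+([A-Z][a-z]+)\s+(\d{4})\b")),
    # relative expressions needing a reference date
    ("RELATIVE", re.compile(r"\b(yesterday|today|tomorrow)\b", re.I)),
]

def extract_timex(text, reference_date):
    """Return (matched span, type, ISO value) triples.

    Relative expressions are grounded against reference_date --
    the interpretation step that makes them useful downstream."""
    results = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            if label == "DATE_DMY":
                month_num = MONTHS.get(m.group(2).lower())
                if month_num is None:
                    continue  # capitalised word wasn't a month name
                value = f"{int(m.group(3)):04d}-{month_num:02d}-{int(m.group(1)):02d}"
            else:
                offset = {"yesterday": -1, "today": 0, "tomorrow": 1}
                value = (reference_date +
                         timedelta(days=offset[m.group(1).lower()])).isoformat()
            results.append((m.group(0), label, value))
    return results
```

Real temporal expressions are far messier than this (underspecified dates, durations, sets, context-dependent anchoring), which is precisely what makes the interpretation problem interesting.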
As a whole, the intelligent text processing strand of my research interests is well-represented by my current students: most recently, I've been working with Stephen Wan on statistical methods for multidocument summarisation [Wan and Dale 2001, Wan et al 2003a, Wan et al 2003b, Wan et al 2005], Brett Powley and Andrew Foster on doing interesting things with bibliographic data in journal and conference papers, Andrew Lampert on determining the dialogic relationships between email messages, and Alex Rafalovitch on using information extraction techniques to mine data from United Nations documents.