Natural Language Generation for Aboriginal Languages


Background: Generating from Data

A number of research projects around the world aim to generate a textual description of numerical data using artificial intelligence techniques. One example is the Baby Talk project, where, starting from data on heart rate, blood pressure, O2 and CO2 levels in the blood, respiration rate, etc., the system produces a text-based description of the sort that a nurse might read. Another example is the SumTime project, where numerical data such as that found in weather forecasts is translated into the sort of short text you might read in the newspaper: "WSW 10-15 increasing 17-22 by early morning, then gradually easing 9-14 by midnight." In addition, there has been some work on generating texts from sporting data (such as for basketball and gridiron).

Generating from data in this way allows us to investigate many interesting research questions, especially concerning the relationship between human language and its processing by computers. We're interested in exploring these for an Aboriginal language, along with a parallel version for English. Given the extensive interest in AFL in Aboriginal communities, and the AFL's well-known interest in Indigenous issues through the AFL Foundation and elsewhere, the domain of AFL games seems like a natural fit.

This Project: Short Description

The goal of this project is to generate a simplified version of reports on AFL matches. For example, consider the article in Perth Now on the Collingwood-West Coast semi-final, written by Roger Vaughan and dated September 14, 2007. We have analysed this article (and others) to see which information could be generated from the available statistics, and which requires the judgement of a human author. A surprisingly large amount can be generated from the statistics, as in the first sentence:

COLLINGWOOD staged a comeback to beat West Coast by 19 points in extra
time last night in their semi-final at Subiaco.

Other information would be too complicated to capture, such as the judgement that the comeback was "amazing" or the semi-final "pulsating", or the interpretation involved in "veteran defender James Clement made his second clanger of the night". We therefore aim for a simplified version of the report.
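To make the idea of generating such a sentence from statistics concrete, here is a minimal sketch in Python of a template-based approach. It is an illustration only, not the project's actual system: the MatchStats record, its field names, and the opening_sentence function are all hypothetical, and the scores and break-time leads in the example are made-up values chosen to give the 19-point margin reported in the article.

```python
from dataclasses import dataclass


@dataclass
class MatchStats:
    """Hypothetical record of the statistics available for one match."""
    home_team: str
    away_team: str
    home_score: int
    away_score: int
    venue: str
    stage: str            # e.g. "semi-final"
    extra_time: bool
    # Winner's lead at each break; a negative value means the winner trailed.
    winner_lead_at_breaks: list


def opening_sentence(m: MatchStats) -> str:
    """Fill a fixed sentence template from the match statistics."""
    if m.home_score == m.away_score:
        raise ValueError("this sketch does not cover drawn matches")
    home_won = m.home_score > m.away_score
    winner = m.home_team if home_won else m.away_team
    loser = m.away_team if home_won else m.home_team
    margin = abs(m.home_score - m.away_score)
    # Assumption: a "comeback" means the winner trailed at some break.
    came_from_behind = any(lead < 0 for lead in m.winner_lead_at_breaks)
    verb = "staged a comeback to beat" if came_from_behind else "beat"
    unit = "point" if margin == 1 else "points"
    extra = " in extra time" if m.extra_time else ""
    return (f"{winner} {verb} {loser} by {margin} {unit}{extra} "
            f"in their {m.stage} at {m.venue}.")


# Illustrative values only; not the real scores from the match.
example = MatchStats(
    home_team="West Coast", away_team="Collingwood",
    home_score=81, away_score=100,
    venue="Subiaco", stage="semi-final", extra_time=True,
    winner_lead_at_breaks=[-10, -4, 2, 0],
)
print(opening_sentence(example))
```

Everything in the output sentence comes from a number or label in the record. Nothing in it corresponds to evaluative words such as "amazing", which is why the sketch, like the project, stops at a simplified report.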

We have two main reasons for investigating this:

  1. to carry out research into issues in generating text from data in general, and then
  2. to use the text generation framework as a vehicle for research into how computers can handle Aboriginal languages, which is a challenge because of their highly unusual structure.

This second reason is the main one. A lot of work has been done on how computers can handle the world's major languages (for example, there are many systems on the Web that can translate automatically between English and French), and some on smaller indigenous languages such as Maori in New Zealand and Inuktitut in Canada. However, very little work has been done on handling Aboriginal languages with computers.

Our language of choice is Arrernte, because of the expertise of the people who would be involved in the project. It is sufficiently well documented, and has sufficient existing resources, that a computational treatment is feasible; but building a system based on such a treatment is a big challenge, and an interesting one.

In terms of a practical outcome at the end of the project, we envisage that a system that generates texts about AFL matches could be of interest to the Arrernte-speaking community, for example in helping to promote literacy, given the huge level of interest in AFL there (as seen in, for example, the Yuendumu games). Our collaborators with expertise in Arrernte have a lot of experience in working with the community on applications beyond research.

Investigators

The following are the chief / partner investigators on the grant:

We'll also be working with Dr John Henderson, University of Western Australia.


Last updated 3 November 2009