In this project we aim to develop models and a system to generate natural language texts in the Aboriginal language Arrernte from data in the domain of Australian Rules Football. This will provide a framework for investigating fundamental theoretical issues in formalisms for natural language representation, and for developing some of the first language technology applications for the greatly underexplored indigenous languages of Australia. We discuss these issues below, followed by the reasons for the specific choices made in each aspect of the project.
The great majority of work on understanding and manipulating languages has been carried out on languages that are configurational (that is, languages that have a fairly rigid word order), such as English, French, Spanish and Chinese. Consequently, most formalisms for representing language, and applications to manipulate language, can be adapted to non-configurational languages only with difficulty. Apart from the pure scientific interest of understanding a phenomenon, trying to better understand such languages is potentially useful for at least two reasons. First, a number of major languages, such as German and Russian, have non-configurational aspects (free word order, null anaphora, or syntactically discontinuous expressions); any applications involving them, such as Machine Translation (MT), need to capture the differences from configurational languages. Second, investigating a broader range of languages with interesting characteristics can say something about what representations are necessary and sufficient to describe language in general. There is a strand of linguistics and computational linguistics, including Tree Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), and other mildly context-sensitive grammar formalisms, as well as Generalized Phrase Structure Grammar (GPSG) and Chomskyan linguistics pre-1973, which aims (among other things) to design a formalism with the minimum expressive power necessary for describing human language. The goal is to minimise the need for (arbitrary) stipulations and potentially to allow more efficient algorithms than formalisms with unrestricted computational power [Joshi et al., 1991], and, in parallel with this, to provide insight into the human language processing mechanism [Rambow and Joshi, 1994].
In this search for a suitable formalism, it was work on Swiss German and Bambara that established that natural languages require a more expressive computational representation than the widely used context-free grammars. Relatedly, it was recent work on German's non-configurational properties that uncovered some unexpected differences between formalisms previously believed to be equivalent [Hockenmaier and Young, 2008]. Even for formalisms with unrestricted computational power such as Head-driven Phrase Structure Grammar (HPSG), these non-configurational languages present a major representational challenge.
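The text above does not spell out why data from such languages exceed context-free power. As a hedged illustration only, arguments of this kind are usually presented by reducing a fragment of the language to an abstract string pattern; the symbols a, b, c, d and w below are placeholders rather than actual words, and the language names L_cross, L_copy and L_nested are labels introduced here for convenience.

```latex
% Cross-serial dependencies (the pattern usually associated with Swiss German):
% m items of one kind and n of another are each matched later, in the SAME order.
\[
  L_{\mathrm{cross}} = \{\, a^{m} b^{n} c^{m} d^{n} \mid m, n \ge 1 \,\}
\]

% Reduplication (the pattern usually associated with Bambara word formation):
% a string immediately followed by a copy of itself.
\[
  L_{\mathrm{copy}} = \{\, w\,w \mid w \in \{a, b\}^{+} \,\}
\]

% For contrast, NESTED dependencies, which a context-free grammar can generate:
\[
  L_{\mathrm{nested}} = \{\, a^{m} b^{n} c^{n} d^{m} \mid m, n \ge 1 \,\}
\]
```

Neither of the first two languages is context-free (both fail the pumping lemma for context-free languages), whereas the nested pattern is. Both are generally taken to be within reach of mildly context-sensitive formalisms such as TAG, which is one reason those formalisms are candidates for the minimal expressive power discussed above.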
Australia is rich in non-configurational languages, and its Aboriginal languages have attracted interest for decades, for a number of reasons. They have represented a significant new frontier of languages previously unknown to the rest of the world; how they are related even to each other, much less to other language groupings, is still very much an open question, unlike the case for Indo-European languages; and in particular, they are different in many ways from other classes of languages, in terms of phonology, morphology, syntax, and so on. As a result, linguistic investigation of Aboriginal languages has been quite broad, covering languages from both Pama-Nyungan and non-Pama-Nyungan language families: Warlpiri, Arrernte, Pitjantjatjara, and others. The three languages named are the most widely spoken, estimated to have between 1500 and 6000 native speakers each. The communities of speakers of these languages are engaged in measures to preserve the languages, such as through bilingual education at schools [Hartman and Henderson, 1994]. To date, there has been at best a moderate amount of work in computational linguistics on a few of the languages, either in analysis of the languages or in development of applications. Reasons for adding a computational aspect to pure linguistic analysis are twofold. First, computational linguistics can use the tools of computer science to verify the consistency of the analyses of linguists when these are scaled up to a large proportion of a language, much as model checking and theorem provers allow logicians to test their formalisms, as argued by Bender [2008b]; this has been the experience in building large-scale computational grammars for English, for example, for the TAG formalism (in the XTAG project [XTAG Research Group, 2001]) and for the HPSG formalism (in the LinGO ERG project [Bender et al., 2002]).
In doing so, issues are raised concerning what algorithms are most useful for processing these grammars; and the characteristics of unusual languages provide a challenge to existing algorithms that can inform language processing more generally. Second, the development of applications puts linguistic analyses to use in a way that allows them to be evaluated by a broader range of users.
For other indigenous languages around the world there have been some recent attempts to extend purely linguistic study and language preservation efforts to the application of computational linguistics to these languages. The most extensive programme is for Inuktitut, by the National Research Council Institute for Information Technology of Canada, which aims to develop information retrieval and other applications for the First Nation peoples of Canada [Johnson and Martin, 2003]. Another is with Maori at the University of Otago, where machine translation (MT) and human-computer dialogue applications are being developed [Knott et al., 2003]. Both of these have as goals both the questions of scientific interest related to the specific languages and the encouragement of language maintenance and preservation. However, there is no similar project for any Aboriginal language, or even much connection with Information Technology (IT) generally; witness the 2009 Puliima workshop attempting to bring together Aboriginal languages and IT, only the second ever such attempt.
Reasons for the choice of each aspect of the project are as follows:
language: Arrernte
Arrernte is divided into Western and Eastern/Central; this project will focus on the latter, for reasons of resources and location. Arrernte is a good choice for a language of interest because of existing work on the language, and because of its unusual characteristics. Morphosyntactic analyses have been proposed that describe these: extensive use of morphology, fundamentally free word order (but with word order preferences and restrictions on various subparts of the language), lack of a copula verb, "quasi-inflections" on verbs including a "category of associated motion", and so on. It is sufficiently well documented, and with sufficient existing resources, that a computational treatment is feasible; but a system based on such a treatment is a big challenge. In terms of linguistic analysis of Eastern Arrernte, there is good coverage of the grammar [Strehlow, 1944, Wilkins, 1989, Green, 1994, Henderson, 1998]. Arrernte also has a well-established mechanism for word-building, including incorporation of loan words from English to supplement any lack of vocabulary in the core language [Green, 1994, Henderson, 2002], making discourses on non-traditional topics feasible. There is also an electronic dictionary available for Eastern/Central Arrernte [Henderson and Dobson, 1994]. Further, it is one of the major Aboriginal languages in Australia, one of the few where children are still learning to speak the language as their first language. Eastern/Central Arrernte has a good deal of cultural support, for example through bilingual teaching in the Northern Territory (where it is taught inter alia as a compulsory language at primary schools) and as the first language of an estimated 25% of the population of Alice Springs, the urban center for much of the remote Northern Territory.
application: Natural Language Generation (NLG)
We are interested in output of a high quality for a language that has few computational resources. Unrestricted MT, in spite of significant improvements in the past decade, still produces quite poor output; in addition, the most successful current systems are statistical, and Arrernte has nowhere near enough text for training such systems for high quality output. Information retrieval applications such as those for Inuktitut rely on texts for searching. This is reasonable for Inuktitut, where for example bilingual English-Inuktitut parliamentary proceedings are mandated by the Legislative Assembly of Nunavut, but Arrernte has far fewer texts available. NLG, unlike other applications, does not require complete coverage of a language (which from the experience of the XTAG project is very difficult to achieve), only coverage for the required domain and set of linguistic constructions to be used, which can be scaled to whatever is feasible. Also, NLG systems which generate text from numerical and historical data are well established and quite successful: two current ones are Baby Talk [Portet et al., 2007], where starting from data on heart rate, blood pressure, O2 and CO2 levels in the blood, respiration rate, etc., the system produces a text-based description of the sort that a nurse might read; and SumTime [Reiter et al., 2005], where numerical data such as that found in weather predictions is translated into the sort of short text you might read in the newspaper. Note also that here we are interested in generating entire articles, not just sentences, necessitating an understanding of information structure, discourse and narrative as well as syntax. This is an interesting yet underexplored area for computational treatment of indigenous languages.
domain: Australian Rules (AFL) football
Language technology applications are generally more successful in limited domains. In particular, as noted above, generating from a combination of numerical data (such as game scores) and historical data (such as player information) is quite well established: the technique was introduced for basketball box scores by Robin [1994] and has been extended to other domains such as stock exchange data [Reiter and Dale, 2000] and the above-mentioned medical and weather data. We note that there is widespread interest in AFL in Aboriginal communities. See, for example, the discussion of the prominence of AFL football in everyday life and in events such as the Yuendumu Games in Tatz [1987].
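To make the idea of generating text from game scores concrete, the following is a minimal sketch of a template-based data-to-text step, not the project's system. It assumes a simple split into content determination (turning a numeric margin into a qualitative message) and surface realisation (filling a sentence template). The class and function names, the margin thresholds, and the team names are all hypothetical, and the output is English only because no Arrernte templates are given here; a realiser for Arrernte would have to handle its morphology and word order rather than fill fixed strings. The only domain fact relied on is the standard scoring rule that a goal is worth 6 points and a behind 1 point.

```python
from dataclasses import dataclass


@dataclass
class TeamScore:
    """One team's final score in an Australian Rules match (hypothetical type)."""
    name: str
    goals: int    # each goal is worth 6 points
    behinds: int  # each behind is worth 1 point

    @property
    def points(self) -> int:
        return 6 * self.goals + self.behinds


def describe_margin(margin: int) -> str:
    """Content determination: map a numeric margin to a qualitative label.

    The thresholds are arbitrary illustrative choices, not taken from the project.
    """
    if margin == 0:
        return "draw"
    if margin <= 6:
        return "narrow"
    if margin >= 40:
        return "comfortable"
    return "plain"


def realise_result(home: TeamScore, away: TeamScore) -> str:
    """Surface realisation: fill an English template with the chosen content."""
    margin = abs(home.points - away.points)
    kind = describe_margin(margin)
    if kind == "draw":
        return f"{home.name} and {away.name} drew, {home.points} points each."
    winner, loser = (home, away) if home.points > away.points else (away, home)
    verb = {
        "narrow": "narrowly beat",
        "comfortable": "comfortably beat",
        "plain": "beat",
    }[kind]
    return (
        f"{winner.name} {verb} {loser.name} by {margin} points, "
        f"{winner.goals}.{winner.behinds} ({winner.points}) to "
        f"{loser.goals}.{loser.behinds} ({loser.points})."
    )


if __name__ == "__main__":
    # Placeholder teams and scores, invented for illustration.
    print(realise_result(TeamScore("Team A", 12, 8), TeamScore("Team B", 10, 9)))
    # prints: Team A beat Team B by 11 points, 12.8 (80) to 10.9 (69).
```

Generating whole articles, as the project intends, would additionally need document planning over many such messages (quarter-by-quarter scores, player history) and discourse-level choices that this single-sentence sketch leaves out.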
As a practical outcome of this system, we would like to see the generation of football articles that could be used to help in developing literacy in Arrernte. There are existing readers used for this purpose; we would see the articles generated by our system as supplementing these.
Overall, then, the specific aims of this project are as follows:
The following are the chief / partner investigators on the grant:
We'll also be working with Dr John Henderson, University of Western Australia.
