Natural Language Generation for Aboriginal Languages

Aims and Background

In this project we aim to develop models and a system to generate natural language texts in the Aboriginal language Arrernte from data in the domain of Australian Rules Football. This will provide a framework for investigating fundamental theoretical issues in formalisms for natural language representation, and for developing some of the first language technology applications for the greatly underexplored indigenous languages of Australia. We discuss these issues below, followed by the reasons for the specific choices made for each aspect of the project.

The great majority of work on understanding and manipulating languages has been carried out on languages that are configurational (that is, languages that have a fairly rigid word order), such as English, French, Spanish, Chinese and so on. Consequently, most formalisms for representing language, and applications to manipulate language, are only adapted to non-configurational languages with difficulty. Apart from the pure scientific interest of understanding a phenomenon, trying to better understand such languages is potentially useful for at least two reasons. First, a number of major languages, such as German and Russian, have non-configurational aspects (free word order, null anaphora, or syntactically discontinuous expressions); any applications involving them, such as Machine Translation (MT), need to capture the differences from configurational languages. Second, investigating a broader range of languages with interesting characteristics can say something about what representations are necessary and sufficient to describe language in general. There is a strand of linguistics and computational linguistics, including Tree Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), and other mildly context-sensitive grammar formalisms, as well as Generalized Phrase Structure Grammar (GPSG) and Chomskyan linguistics pre-1973, which aims (among other things) to design a formalism which has the minimum expressive power necessary for describing human language: the goal is to minimise the need for (arbitrary) stipulations and potentially to allow more efficient algorithms than formalisms with unrestricted computational power [Joshi et al., 1991]; and in parallel with this, to provide insight into the human language processing mechanism [Rambow and Joshi, 1994].
In this search for a suitable formalism, it was work on Swiss German and Bambara that verified that natural languages required a more expressive computational representation than the widely used context-free grammars. Relatedly, it was recent work on German's non-configurational properties that uncovered some unexpected differences between formalisms previously believed to be equivalent [Hockenmaier and Young, 2008]. Even for formalisms with unrestricted computational power such as Head-driven Phrase Structure Grammar (HPSG), these non-configurational languages present a major representational challenge.
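
The Swiss German result mentioned above is usually explained through a formal abstraction of cross-serial dependencies. The following is only a brief sketch of that abstraction; the symbols a, b, c, d and the name L are illustrative choices made here, not notation taken from the cited work.

```latex
% Cross-serial dependencies abstracted as a formal language:
% the number of a's must match the number of c's, and the number of b's
% must match the number of d's, so the two dependencies cross.
L = \{\, a^m b^n c^m d^n \mid m, n \ge 1 \,\}
% L is not context-free (by the pumping lemma for context-free languages),
% but it can be generated by mildly context-sensitive formalisms such as TAG.
```

Because context-free languages are closed under intersection with regular languages, a natural language whose intersection with a suitable regular language has this shape (up to relabelling of words) cannot itself be context-free.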

Australia is rich in non-configurational languages, and its Aboriginal languages have attracted interest for decades, for a number of reasons. They have represented a significant new frontier of languages previously unknown to the rest of the world; how they are related even to each other, much less to other language groupings, is still very much an open question, unlike the case for Indo-European languages; and in particular, they are different in many ways from other classes of languages, in terms of phonology, morphology, syntax, and so on. As a result, linguistic investigation of Aboriginal languages has been quite broad, covering languages from both Pama-Nyungan and non-Pama-Nyungan language families: Warlpiri, Arrernte, Pitjantjatjara, and others. The three languages named are the most widely spoken, each estimated to have between 1500 and 6000 native speakers. The communities of speakers of these languages are engaged in measures to preserve the languages, such as through bilingual education at schools [Hartman and Henderson, 1994]. To date, there has been at best a moderate amount of work in computational linguistics on a few of the languages, either in analysis of the languages or in development of applications. Reasons for adding a computational aspect to pure linguistic analysis are twofold. First, computational linguistics can use the tools of computer science to verify the consistency of the analyses of linguists when these are scaled up to a large proportion of a language, much as model checking and theorem provers allow logicians to test their formalisms, as argued by Bender [2008b]; this has been the experience in building large-scale computational grammars for English, for example, for the TAG formalism (in the XTAG project [XTAG Research Group, 2001]) and for the HPSG formalism (in the LinGO ERG project [Bender et al., 2002]).
In doing so, issues are raised concerning what algorithms are most useful for processing these grammars; and the characteristics of unusual languages provide a challenge to existing algorithms that can inform language processing more generally. Second, the development of applications puts linguistic analyses to use in a way that allows them to be evaluated by a broader range of users.

For other indigenous languages around the world there have been some recent attempts to extend purely linguistic study and language preservation efforts to the application of computational linguistics to these languages. The most extensive programme is for Inuktitut, by the National Research Council Institute for Information Technology of Canada, which aims to develop information retrieval and other applications for the First Nation peoples of Canada [Johnson and Martin, 2003]. Another is with Maori at the University of Otago, where machine translation (MT) and human-computer dialogue applications are being developed [Knott et al., 2003]. Both of these have as goals both the questions of scientific interest related to the specific languages and the encouragement of language maintenance and preservation. However, there is no similar project for any Aboriginal language, or even much connection with Information Technology (IT) generally; witness the 2009 Puliima workshop attempting to bring together Aboriginal languages and IT, only the second ever such attempt.

Reasons for the choice of each aspect of the project are as follows:

language: Arrernte. Arrernte is divided into Western and Eastern/Central; this project will focus on the latter, for reasons of resources and location. Arrernte is a good choice for a language of interest because of existing work on the language, and because of its unusual characteristics. Morphosyntactic analyses have been proposed that describe these: extensive use of morphology, fundamentally free word order (but with word order preferences and restrictions on various subparts of the language), lack of a copula verb, "quasi-inflections" on verbs including a "category of associated motion", and so on. It is sufficiently well documented, and with sufficient existing resources, that a computational treatment is feasible; but a system based on such a treatment is a big challenge. In terms of linguistic analysis of Eastern Arrernte, there is good coverage of the grammar [Strehlow, 1944, Wilkins, 1989, Green, 1994, Henderson, 1998]. Arrernte also has a well-established mechanism for word-building, including incorporation of loan words from English to supplement any lack of vocabulary in the core language [Green, 1994, Henderson, 2002], making discourses on non-traditional topics feasible. There is also an electronic dictionary available for Eastern/Central Arrernte [Henderson and Dobson, 1994]. Further, it is one of the major Aboriginal languages in Australia, one of the few where children are still learning to speak the language as their first language. Eastern/Central Arrernte has a good deal of cultural support, for example through bilingual teaching in the Northern Territory (where it is taught inter alia as a compulsory language at primary schools) and as the first language of an estimated 25% of the population of Alice Springs, the urban center for much of the remote Northern Territory.

application: Natural Language Generation (NLG). We are interested in output of a high quality for a language that has few computational resources. Unrestricted MT, in spite of significant improvements in the past decade, still produces quite poor output; in addition, the most successful current systems are statistical, and Arrernte has nowhere near enough text for training such systems for high quality output. Information retrieval applications such as those for Inuktitut rely on texts for searching; this is reasonable for Inuktitut, where for example bilingual English–Inuktitut parliamentary proceedings are mandated by the Legislative Assembly of Nunavut, but Arrernte has far fewer texts available. NLG, unlike other applications, does not require complete coverage of a language (which from the experience of the XTAG project is very difficult to achieve), only coverage for the required domain and set of linguistic constructions to be used, which can be scaled to whatever is feasible. Also, NLG systems which generate text from numerical and historical data are well established and quite successful: two current ones are Baby Talk [Portet et al., 2007], where starting from data on heart rate, blood pressure, O2 and CO2 levels in the blood, respiration rate, etc., the system produces a text-based description of the sort that a nurse might read; and SumTime [Reiter et al., 2005], where numerical data such as that found in weather predictions is translated into the sort of short text you might read in the newspaper. Note also that here we are interested in generating entire articles, not just sentences, necessitating an understanding of information structure, discourse and narrative as well as syntax. This is an interesting yet underexplored area for computational treatment of indigenous languages.

domain: Australian Rules (AFL) football. Language technology applications are generally more successful in limited domains. In particular, as noted above, generating from a combination of numerical data (such as game scores) and historical data (such as player information) is quite well established: the technique was introduced for basketball box scores by Robin [1994] and has been extended to other domains such as stock exchange data [Reiter and Dale, 2000] and the above-mentioned medical and weather data. We note that there is widespread interest in AFL in Aboriginal communities. See for example http://www.aboriginalfootball.com.au, or the discussion of the prominence of AFL football in everyday life and in events such as the Yuendumu Games in Tatz [1987].
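
The general idea of generating text from game scores can be illustrated with a very small sketch. This is not the project's system: the names TeamScore, select_content, realise and generate are hypothetical, the output is English rather than Arrernte, the 12-point threshold for "narrowly" is arbitrary, and historical data such as player information is left out. It only shows the usual split between deciding what to say and deciding how to say it.

```python
from dataclasses import dataclass


@dataclass
class TeamScore:
    """One team's result; in AFL a goal is worth 6 points and a behind 1."""
    name: str
    goals: int
    behinds: int

    @property
    def points(self) -> int:
        return 6 * self.goals + self.behinds


def select_content(home: TeamScore, away: TeamScore) -> dict:
    """Content selection: decide which facts the text should mention."""
    # sorted() is stable, so in a draw the home team is listed first.
    winner, loser = sorted((home, away), key=lambda t: t.points, reverse=True)
    return {
        "winner": winner,
        "loser": loser,
        "margin": winner.points - loser.points,
        "draw": home.points == away.points,
    }


def realise(facts: dict) -> str:
    """Surface realisation using fixed English templates (placeholder only)."""
    w, l = facts["winner"], facts["loser"]
    if facts["draw"]:
        return f"{w.name} and {l.name} drew, {w.points} points each."
    # Arbitrary illustrative threshold: under two goals counts as "narrowly".
    verb = "narrowly beat" if facts["margin"] < 12 else "beat"
    return (
        f"{w.name} {verb} {l.name} by {facts['margin']} points, "
        f"{w.goals}.{w.behinds} ({w.points}) to {l.goals}.{l.behinds} ({l.points})."
    )


def generate(home: TeamScore, away: TeamScore) -> str:
    """Run the two stages in sequence for a single match."""
    return realise(select_content(home, away))


if __name__ == "__main__":
    print(generate(TeamScore("Team A", 10, 8), TeamScore("Team B", 9, 7)))
```

Fixed templates like these assume a fixed word order, so for a language such as Arrernte the realise step would presumably have to be replaced by a grammar-based realiser; the content selection step is the part most likely to carry over unchanged.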

As a practical outcome of this system, we would like to see the generation of football articles that could be used to help in developing literacy in Arrernte. There are existing readers used for this purpose; we would see the articles generated by our system as supplementing these.

Overall, then, the specific aims of this project are as follows:

To verify the consistency of existing analyses of Arrernte through a large-scale implemented grammar, and investigate what unexpected new analyses might need to be developed based on coverage requirements; and also, complementarily, to examine how these will inform the requirements of linguistic formalisms.
To investigate what kinds of syntax–semantics–information structure–discourse interfaces are required for end-to-end language processing of Arrernte; and also investigate what kinds of new data structures and efficient algorithms can be designed for non-configurational languages within this context.
To investigate how the differences between configurational and non-configurational languages will affect the standard architectures for generation from numerical and historical data.
To develop a system that can generate Arrernte-language texts that would be of interest to Arrernte speakers, and that could be used in efforts to maintain the language and promote literacy among those speakers.

Investigators

The following are the chief / partner investigators on the grant:

Dr Mark Dras, Macquarie University
Dr Myf Turpin, University of Queensland
Dr Owen Rambow, Columbia University
Prof Robert Dale, Macquarie University

We'll also be working with Dr John Henderson, University of Western Australia.

References

Elisabeth André, Kim Binsted, Kumiko Tanaka-Ishii, Sean Luke, Gerd Herzog, and Thomas Rist. Three RoboCup Simulation League Commentator Systems. AI Magazine, 21(1):57–66, 2000.

Wendy Baarda. The design and trial of an interactive computer program Lata-kuunu to support Warlpiri school children's literacy learning, 2003. Report for M.Ed., Northern Territory University.

Jason Baldridge. Lexically Specified Derivational Control in Combinatory Categorial Grammar. PhD thesis, University of Edinburgh, 2002.

Regina Barzilay and Mirella Lapata. Collective content selection for concept-to-text generation. In Proceedings of the HLT/EMNLP, pages 331–338, Vancouver, 2005.

Emily Bender. Radical Non-Configurationality without Shuffle Operators: An Analysis of Wambaya. In Proceedings of the Fifteenth Annual Conference on Head-Driven Phrase Structure Grammar (HPSG08), 2008a.

Emily Bender. Grammar Engineering for Linguistic Hypothesis Testing. In Proceedings of Texas Linguistic Society X, 2008b.

Emily Bender, Dan Flickinger, and Stephan Oepen. The Grammar Matrix: An Open-Source Starter-Kit for the Rapid Development of Cross-Linguistically Consistent Broad-Coverage Precision Grammars. In John Carroll, Nelleke Oostdijk, and Richard Sutcliffe, editors, Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics, pages 8–14, Taipei, Taiwan, 2002.

Emily M. Bender. Evaluating a Crosslinguistic Grammar Resource: A Case Study of Wambaya. In Proceedings of ACL-08: HLT, pages 977–985, Columbus, Ohio, June 2008c. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P/P08/P08-1111.

Joan Bresnan. Lexical-Functional Syntax. Blackwell, Oxford, UK, 2000.

Özlem Çetinoğlu and Kemal Oflazer. Morphology-Syntax Interface for Turkish LFG. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL (COLING-ACL06), pages 153–160, 2006.

Jean Clayton. Desert schools: An investigation of English languages and literacy among young aboriginal people in several communities. Queensland Journal of Educational Research, 15(1):101–112, 1999.

Cathryn Donohue and Ivan Sag. Domains in Warlpiri. In Proceedings of the Sixth Annual Conference on Head-Driven Phrase Structure Grammar (HPSG99), 1999.

Jenny Green. A Learner's Guide to Eastern and Central Arrernte. IAD Press, Alice Springs, Australia, 1994.

Barbara Grosz and Candy Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12:175–204, 1986.

Deborah Hartman and John Henderson, editors. Aboriginal Languages in Education. IAD Press, Alice Springs, 1994.

John Henderson. Topics in Eastern and Central Arrernte grammar. PhD thesis, University of Western Australia, 1998.

John Henderson. The word in Eastern/Central Arrernte. In R. M. W. Dixon and Alexandra Aikhenvald, editors, Word: A Cross-Linguistic Typology, pages 100–124. Cambridge University Press, Cambridge, UK, 2002.

John Henderson and Veronica Dobson. Eastern and Central Arrernte to English Dictionary. IAD Press, Alice Springs, 1994.

Julia Hockenmaier and Peter Young. Non-local scrambling: the equivalence of TAG and CCG revisited. In Proceedings of The Ninth International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+9), Tübingen, Germany, June 2008.

Beryl Hoffman. The Computational Analysis Of The Syntax And Interpretation Of "Free" Word Order In Turkish. PhD thesis, University of Pennsylvania, 1995.

Howard Johnson and Joel Martin. Unsupervised Learning of Morphology for English and Inuktitut. In Proceedings of Human Language Technology and North American Chapter of the Association for Computational Linguistics Conference (HLT-NAACL'03), Edmonton, Canada, May 2003.

Aravind Joshi, K. Vijay-Shanker, and David Weir. The Convergence of Mildly Context-Sensitive Grammar Formalisms. In Peter Sells, Stuart Shieber, and Thomas Wasow, editors, Foundational Issues in Natural Language Processing, pages 31–81. MIT Press, 1991.

Nikiforos Karamanis, Massimo Poesio, Chris Mellish, and Jon Oberlander. Evaluating Centering-based Metrics of Coherence Using a Reliably Annotated Corpus. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), pages 391–398, Barcelona, Spain, 2004.

Alistair Knott, J. Moorfield, T. Meaney, and L. Ng. A human-computer dialogue system for Māori language learning. In Proceedings of the World Conference on Educational Multimedia, Hypermedia and Telecommunications (ED-MEDIA), June 2003.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

William Mann and Sandra Thompson. Rhetorical Structure Theory: A Theory of Text Organization. Text, 8(3):243–281, 1988.

Christopher Manning, Kevin Jansz, and Nitin Indurkhya. Kirrkirr: Software for browsing and visual exploration of a structured Warlpiri dictionary. Literary and Linguistic Computing, 16(2):135–151, 2001.

Megan Moser and Johanna Moore. Investigating cue selection and placement in tutorial discourse. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, pages 130–135, 1995.

A. Nijholt, H. J. A. op den Akker, and F. M. G. de Jong. Language interpretation and generation for football commentary. In ACTAS-1: VIII Symposio Social, pages 594–599, 2003.

Carl Pollard and Ivan Sag. Head-driven phrase structure grammar. University of Chicago Press, Chicago, USA, 1994.

François Portet, Ehud Reiter, Jim Hunter, and Somayajulu Sripada. Automatic Generation of Textual Summaries from Neonatal Intensive Care Data. In Proceedings of the 11th Conference on Artificial Intelligence in Medicine (AIME 07), pages 227–236, 2007.

Owen Rambow. Multiset-valued linear index grammars: Imposing dominance constraints on derivations. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 263–270, Las Cruces, New Mexico, USA, June 1994. Association for Computational Linguistics.

Owen Rambow and Aravind Joshi. A Processing Model for Free Word Order Languages. In C. Clifton Jr., L. Frazier, and K. Rayner, editors, Perspectives on Sentence Processing. Lawrence Erlbaum Associates, 1994.

Owen Rambow, K. Vijay-Shanker, and David Weir. D-tree substitution grammars. Computational Linguistics, 27:87–121, 2001.

Ehud Reiter and Robert Dale. Building Natural Language Generation Systems. Cambridge University Press, 2000.

Ehud Reiter, Somayajulu Sripada, Jim Hunter, Jin Yu, and Ian Davy. Choosing Words in Computer-Generated Weather Forecasts. Artificial Intelligence, 167:137–169, 2005.

Jacques Robin. Revision-Based Generation of Natural Language Summaries Providing Historical Background. PhD thesis, Columbia University, 1994.

Jane Simpson. Expressing pragmatic constraints on word order in Warlpiri. In Annie Zaenen, Jane Simpson, Chris Manning, and Jane Grimshaw, editors, Architectures, Rules, and Preferences: A Festschrift for Joan Bresnan. 2005.

Jane Simpson. Warlpiri morphosyntax: a lexicalist approach. Kluwer, Dordrecht, 1991.

Mark Steedman. The Syntactic Process. MIT Press, 2000.

T. G. H. Strehlow. Aranda phonetics and grammar. Oceania Monographs, Sydney, 1944.

Colin Tatz. Aborigines in Sport. The Australian Society for Sports History, Flinders University of South Australia, 1987.

Michael White and Jason Baldridge. Adapting Chart Realization to CCG. In Proceedings of the 9th European Workshop on Natural Language Generation, 2003.

David Wilkins. Mparntwe Arrernte (Aranda): studies in the structure and semantics of grammar. PhD thesis, Australian National University, 1989.

David Wilkins. The verbalization of motion events in Arrernte. In Sven Strömqvist and Ludo Verhoeven, editors, Relating events in narrative: typological and contextual perspectives, pages 143–157. Lawrence Erlbaum Associates, Mahwah, NJ, USA, 2004.

David Wilkins. Towards an Arrernte grammar of space. In Stephen Levinson and David Wilkins, editors, Grammars of space: explorations in cognitive diversity, pages 24–62. Cambridge University Press, Cambridge, UK, 2006.

Sandra Williams and Ehud Reiter. Appropriate Microplanning Choices for Low-Skilled Readers. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, 2005.

XTAG Research Group. A Lexicalized Tree Adjoining Grammar for English. Technical Report IRCS-01-03, IRCS, University of Pennsylvania, 2001.

Juntae Yoon, Chung hye Han, Nari Kim, and Mee sook Kim. Customizing the XTAG system for efficient grammar development for Korean. In Proceedings of the Fifth International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+5), 2000.

Simon Zwarts and Mark Dras. Statistical Machine Translation of Australian Aboriginal Languages: Morphological Analysis with Languages of Differing Morphological Richness. In Proceedings of the Australasian Language Technology Workshop (ALTA 2007), pages 134–142, Sydney, Australia, 2007.

Last updated 3 November 2009