A recap from my main page:
In general terms, the theme of my research is language structure and transformation. I'm interested in the ways that natural languages and their mathematical or formal representations can be transformed, the contexts in which these transformations occur, and the situations where it's better to use one kind of representation or another; specifically, I work on machine translation, paraphrase and related areas. I'm also interested in sentiment analysis, and how sentiment affects language choices.
More specifically, here's what I'm working on:
One of the significant characteristics of language is that there are multiple ways of expressing the same idea. There may be slight differences of emphasis, nuances of meaning, that differentiate one expression of an idea from another, but the importance of this nuancing varies with the context. A paraphrase example is "The investigators made a distinction between the alleged attackers and their supporters" versus "The investigators distinguished between the alleged attackers and their supporters"; a near-synonym one is the difference between "frugal" and "stingy".
One question of interest is knowing when to apply these paraphrases. Altering, say, a document to fit specifications -- for example, to compress a document by 25% while maintaining the same density of concepts and improving readability -- can be seen as applying paraphrases under a set of constraints. This work has involved characterising this as a mathematical optimisation problem, and also finding formal representations for paraphrase.
Another is in characterising and automatically acquiring the differences between near-synonyms. People at the University of Toronto have taken one approach to this; I'm interested in how corpus statistics approaches might work here.
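As a toy illustration of the constrained-paraphrase idea above, here is a minimal sketch in Python. It is not the formulation used in this work: the candidate list, the "words saved" and "meaning cost" numbers, and the function name choose_paraphrases are all hypothetical, and the brute-force search is only practical for a handful of candidates.

```python
from itertools import combinations

# Hypothetical candidates: (label, words_saved, meaning_cost).
# The numbers are invented for illustration only.
CANDIDATES = [
    ("made a distinction -> distinguished", 2, 2),
    ("hypothetical rewrite B", 3, 5),
    ("hypothetical rewrite C", 1, 1),
]


def choose_paraphrases(candidates, target_saving):
    """Return (total_cost, chosen) for the cheapest subset of paraphrases
    that saves at least target_saving words, or None if no subset does."""
    best = None
    # Exhaustive search over all subsets, smallest subsets first.
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            saved = sum(c[1] for c in subset)
            cost = sum(c[2] for c in subset)
            if saved >= target_saving and (best is None or cost < best[0]):
                best = (cost, list(subset))
    return best


result = choose_paraphrases(CANDIDATES, target_saving=3)
```

With these made-up numbers, applying the first and third paraphrases together meets the three-word target more cheaply than applying the second one alone. A realistic version would need constraints on readability and concept density as well, which this sketch leaves out.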
In machine translation (MT), I'm particularly interested in issues of syntax.
One issue is in formal grammar, in constructions where there are structural difficulties in translating between languages. In this, I've been particularly interested in Synchronous Tree Adjoining Grammar (S-TAG). My work here has covered the representations necessary for translation among a range of languages, including English, French, Spanish, Korean, and recently Dutch; constructions that cause difficulties when paired, such as clitics or aspects of Korean word order; and efficient algorithms for dealing with these representations.
Questions to ask here include:
Another issue is how syntax can work with Statistical MT. While there's still debate in the field about whether syntax is helpful (a lot of work shows that it can be, with the counterargument being that perhaps all that's needed is more data, with no syntax), I'm interested in seeing whether it's possible to identify where syntax is helpful and where it's not. Questions to ask here include:
In brief: Australian Aboriginal languages have a number of interesting characteristics that make them a challenge for language technology applications; as yet, no such applications exist for them, unlike for the languages of the Inuit peoples of Canada and the Maori of New Zealand. This project will carry out a large-scale computational treatment of an Aboriginal language, including morphology, syntax and discourse; this will result in a data-to-text natural language generation system which takes data from the domain of Australian Rules Football and automatically constructs texts based on the data. This will provide insights into formalisms for representing languages and architectures for data-to-text systems, and has potential applications to literacy and language maintenance.
This has been funded by a three-year ARC Discovery grant; see the project webpage for details.
In evaluating the output of language technology applications -- MT, natural language generation, summarisation -- automatic evaluation techniques generally conflate measurement of faithfulness to source content with fluency of the resulting text. I'm looking at developing automatic evaluation metrics to estimate fluency alone, by examining the use of parser outputs as metrics, and examining how they correlate with human judgements of generated text fluency; developing machine learners based on these; and examining how these are affected by different language models and different domains.
This has applications in my other work as well. One is its use in assessing where syntax can helpfully be applied in Statistical Machine Translation; another is in the area of errorful texts, such as those found in phishing.
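To make the "correlate with human judgements" step concrete, here is a minimal sketch in Python of checking a candidate fluency metric against human ratings using Pearson's correlation coefficient. The choice of Pearson's r is an assumption (a rank correlation could equally be used), and the scores below are invented placeholders, not results from this work.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-sentence scores: a parser-derived metric (for example a
# log-probability of the best parse) and human fluency ratings on a 1-5 scale.
parser_scores = [-42.0, -35.5, -60.2, -28.1, -51.7]
human_ratings = [3, 4, 1, 5, 2]

r = pearson(parser_scores, human_ratings)
```

A value of r near 1 would suggest the parser-derived score tracks human fluency judgements on that sample; repeating the check across language models and domains is the kind of comparison described above.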
To fund some of this work, I've received a number of competitive grants.
2010-2012: ARC Discovery Grant - Australian Research Council (DP1095443). Dr M Dras, Dr MM Turpin, Dr O Rambow, Prof R Dale. Natural Language Generation for Aboriginal Languages ($425,000)
2010: MQSIS Research Infrastructure Block Grant - Macquarie University. Dr M Dras, Prof R Dale, Prof M Johnson, Dr D Molla, Prof B Mans, Prof M Orgun, Dr A Nayak, Dr S Zwarts. A High Performance Cluster for Computing ($60,000)
2009-2010: Microsoft Research Asia Research Grant. Dr S Zwarts, Dr M Dras, Prof R Dale. Statistical Machine Translation with Morphological Preprocessing for Asian Languages ($US30,000)
2007-2010: ARC Linkage Research Grant - Australian Research Council (LP0776267). Dr PA Watters, Mr BK Watson, Dr AC Ng, Dr M Dras, Dr S Cassidy, Mr S McCombie, Mr BJ Reardon, Prof JP Pieprzyk. Defence Against Phishing Attacks ($231,000)
2005-2008: ARC Linkage Research Grant - Australian Research Council (LP0561985). A/Prof MA Orgun, Dr M Dras, Dr WJ Graco, Dr WQ Lin. Classification and Prediction Modelling for Financial Distress, Tax Debt and Insolvency for ATO Clients ($72,000)
2005-2007: ARC Discovery Research Grant - Australian Research Council (DP0558852). Dr D Richards, Dr M Kavakli, Dr M Dras. Risk Management using Agent-Based Virtual Environments ($363,000)
Last updated 3 March 2010