Current students.
Completed students.
Most statistical spam filters try to detect spam using the content of the email as the basis of the feature set; spam senders can defeat this by including content in the emails that will fool the filters. It would be interesting to see how integrating stylistic and fluency features could work in a spam filter, as these might be more difficult to counterfeit than content. (After all, if spammers can generate grammatically immaculate text, they will have solved a major problem in language technology.)
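As a rough illustration of what "stylistic" features might mean here, the sketch below computes a handful of content-independent measurements for an email body. The function name `style_features` and the particular features chosen are assumptions made for illustration only; they are not taken from any existing filter, and fluency features (e.g. language-model scores) are not covered.

```python
import re
import string


def style_features(text: str) -> dict[str, float]:
    """Compute a few illustrative, content-independent features of a text."""
    # Crude tokenisation: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", text)
    # Crude sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = max(len(text), 1)   # avoid division by zero on empty input
    n_words = max(len(words), 1)
    return {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars,
    }


if __name__ == "__main__":
    for name, value in style_features("BUY NOW!!! Cheap pills.").items():
        print(f"{name}: {value:.3f}")
```

Feature vectors like this could be fed to any standard classifier alongside, or instead of, the usual word-based features.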
Three students at MIT have built SCIgen, "a program that generates random Computer Science research papers, including graphs, figures, and citations"; a paper generated by it has actually been accepted at a conference.
This is done using a hand-built context-free grammar. However, it would be interesting to see if the same can be done with a statistical generator (e.g. one that uses n-gram statistics). There are actually some interesting research questions here: Can the sentences be made grammatical? What additional information needs to be used to do this? What kinds of statistical language models work best? What is the quality like relative to the CFG-generated text? Since there is no underlying content to be represented, unlike in ordinary text generation or summarisation, this is a good testbed for these questions.
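To make the n-gram idea concrete, here is a minimal sketch of a bigram generator. The toy corpus and the function names `train_bigrams` and `generate` are assumptions for illustration; a real project would train on a large collection of papers and would likely use higher-order n-grams with smoothing.

```python
import random
from collections import defaultdict


def train_bigrams(tokens):
    """Map each token to the list of tokens observed immediately after it."""
    followers = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev].append(nxt)
    return followers


def generate(followers, start, max_len=20, seed=None):
    """Random walk over the bigram table, starting from `start`."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < max_len:
        options = followers.get(out[-1])
        if not options:  # no observed continuation: stop early
            break
        # Duplicates in the list make this sampling proportional to bigram counts.
        out.append(rng.choice(options))
    return out


if __name__ == "__main__":
    corpus = ("we present a novel method for parsing . "
              "we evaluate a novel parser for text .").split()
    table = train_bigrams(corpus)
    print(" ".join(generate(table, "we", seed=1)))
```

Output from a model like this is only locally coherent, which is exactly why the questions above about grammaticality and additional information arise.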
Machine translation is a very popular field of research in language technology, and there are many free online applications for automatically translating text (e.g. Google's). In terms of research projects, there are freely available statistical MT systems (e.g. Moses) that could form the basis of a project.
Starting from that, there's a range of possible projects, most in the area of integrating (syntactic) structure with the current state-of-the-art statistical approaches to MT. Ones that are related to some current work I'm doing are:
See research page for a description of the work that this would be part of.
The main part would be to build a system, using an existing broad-coverage parser together with an existing mathematical optimisation package, that would take a text (e.g. a paper) and fit it to a set of constraints (e.g. a 2000-word limit with sentences of middling complexity). An extension would be to look at discourse-level parsers, such as SPADE, and incorporate one.
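A heavily simplified sketch of the constraint-fitting idea follows. It treats the problem as choosing which sentences to keep so that the total stays within a word limit, given a per-sentence importance score. This is an assumption-laden toy: the scores are hypothetical inputs, the function name `fit_to_word_limit` is made up for illustration, no parser is involved, and a small dynamic program stands in for the optimisation package mentioned above.

```python
def fit_to_word_limit(sentences, scores, limit):
    """Keep the subset of sentences with the highest total score whose
    combined word count is at most `limit` (a 0/1 knapsack).
    Kept sentences are returned in their original order."""
    lengths = [len(s.split()) for s in sentences]
    # best[w] = (total score, chosen indices) using at most w words
    best = [(0.0, ())] * (limit + 1)
    for i, (n, score) in enumerate(zip(lengths, scores)):
        # Go through budgets downwards so each sentence is used at most once.
        for w in range(limit, n - 1, -1):
            candidate = best[w - n][0] + score
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - n][1] + (i,))
    return [sentences[i] for i in best[limit][1]]


if __name__ == "__main__":
    text = ["a b c", "d e", "f g h i", "j"]
    importance = [3, 2, 5, 1]  # hypothetical importance scores
    print(fit_to_word_limit(text, importance, limit=5))
```

In the project proper, the decision variables would more plausibly be rewrites of individual sentences identified via the parser, with constraints on complexity as well as length.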
Epstein, Joshua and Robert Axtell (1996), Growing Artificial Societies. MIT Press, Cambridge, MA.
The trajectory of an instance of language change is an interesting one -- it starts off slowly, gathers pace, then slows again, in the sort of S-shaped pattern found in evolutionary biology. An example is when English diverged from other Germanic languages and lost the requirement for the tensed verb to be placed second in the sentence. Analytical models don't describe this trajectory well; agent-based simulation offers an alternative. The project would involve:
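To give a flavour of the agent-based approach, here is a minimal toy simulation. Every modelling choice in it is an assumption made for illustration: a fixed population, one initial innovator, random pairwise encounters, a single adoption probability, and no reversion to the old form. It is not a model of any particular change.

```python
import random


def simulate(n_agents=200, n_steps=60, adopt_prob=0.3, seed=0):
    """Return the fraction of agents using the new variant after each step."""
    rng = random.Random(seed)
    uses_new = [False] * n_agents
    uses_new[0] = True  # a single innovator starts the change
    history = []
    for _ in range(n_steps):
        snapshot = uses_new[:]  # everyone reacts to the previous step's state
        for i in range(n_agents):
            if snapshot[i]:
                continue  # adopters never revert in this toy model
            partner = rng.randrange(n_agents)
            # An agent may adopt after hearing the new variant from a partner.
            if snapshot[partner] and rng.random() < adopt_prob:
                uses_new[i] = True
        history.append(sum(uses_new) / n_agents)
    return history


if __name__ == "__main__":
    for step, frac in enumerate(simulate(), start=1):
        print(f"{step:3d} {'#' * int(frac * 50)}")
```

Uptake in a model like this tends to be slow while adopters are rare, faster once they are common, and slow again as non-adopters run out. A more interesting simulation would give agents grammars, social structure and generational turnover.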
Here are the summary sheets of my student evaluations, carried out by the University's Learning and Teaching Centre. Students typically complete them in class, giving scores ranging from Strongly Agree (5) to Strongly Disagree (1) on a number of questions; the scores are then averaged across all students.
There's no official interpretation of the numbers; there used to be some guidelines ("Science classes always score worse, so don't be disheartened at lower scores"), but they've disappeared. But I think they're still more useful than no information. In interpreting them for myself I consider a score of Neutral (3) as a Pass.
Last updated 4 August 2010