Robert Dale: Possible Honours Projects

Robert Dale: Possible Honours and Masters Projects

This page lists some specific honours and masters projects I would be happy to supervise. Of course, there is usually scope to tailor projects to the interests of specific students. You might want to read my document on supervision before deciding whether you'd like me as a supervisor. If you'd like to discuss anything here further, just mail me to arrange a chat.

Record Linkage for Crowd-Sourced Bibliographic Metadata

Record linkage is a common problem in databases: it's concerned with ensuring that duplicate records are identified and collapsed together. It's also very important. For example, it can substantially reduce the size of a mailing list that contains near-duplicate entries, and consequently it can save companies a lot of money. If you have an extensive contacts list on your phone or your PC, you'll already be familiar with the problem, with the same person having a number of sets of contact information that you'd ideally like to merge. But it takes some smarts to know when merging should take place and when it should not, which is why this is not yet a solved problem.

The aim of this project is to explore record linkage in the context of bibliographic data: the kind of information you find in reference lists at the end of scholarly books, articles and papers. Managing the underlying data is a headache for most researchers; the very task of collecting the relevant publication information (authors, titles, journal names, page numbers, publishers ...) for the reference list for a paper is very laborious, and then of course you have to format it correctly.

Well, the formatting problem is already solved, thanks to tools like BibTeX and EndNote. But you still need to get the data from somewhere. And it's out there---many times over, on web pages and in web-accessible bibliographic databases. But any given record may be incomplete or may contain errors.

The aim of this project is to build a tool that mines the web for bibliographic data for a given paper, and then uses record linkage techniques to merge this data to build a complete and consistent metadata record that can be used by a tool like BibTeX or EndNote. You'll be up against some real challenges: sites like Mendeley, Mr. dLib and cb2Bib try to do something like this already, but we think we know how to do it better ...

For this project you'll need (a) good programming skills and (b) comfort with web-based technologies.
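
To give a feel for the merging step, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the approach the project would necessarily take: records are plain dictionaries, two records are linked when their normalised titles are nearly identical and their years do not conflict, and a merge keeps the longer value for each field. The function names, the field names and the 0.9 threshold are all hypothetical choices made for this example.

```python
import re
from difflib import SequenceMatcher


def normalise(title):
    """Lower-case a title and strip punctuation so trivial differences vanish."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()


def same_work(a, b, threshold=0.9):
    """Guess whether two records describe the same publication.

    The 0.9 title-similarity threshold is an arbitrary illustrative value.
    """
    if a.get("year") and b.get("year") and a["year"] != b["year"]:
        return False
    ratio = SequenceMatcher(None, normalise(a["title"]), normalise(b["title"])).ratio()
    return ratio >= threshold


def merge(a, b):
    """Combine two linked records, keeping the longer value for each field."""
    merged = dict(a)
    for field, value in b.items():
        if len(str(value)) > len(str(merged.get(field, ""))):
            merged[field] = value
    return merged


def link_records(records):
    """Greedily fold each record into the first merged record it matches."""
    linked = []
    for rec in records:
        for i, existing in enumerate(linked):
            if same_work(existing, rec):
                linked[i] = merge(existing, rec)
                break
        else:
            linked.append(dict(rec))
    return linked


# Made-up records of the kind a web crawl might return.
records = [
    {"title": "A Study of Record Linkage", "authors": "J. Smith", "year": "2010"},
    {"title": "A study of record linkage.", "authors": "Smith, Jane",
     "year": "2010", "pages": "1-10"},
    {"title": "Parsing PDF Documents", "authors": "A. Jones", "year": "2011"},
]
for record in link_records(records):
    print(record)
```

Keeping the longer value is a crude stand-in for the real question of which source to trust for which field, and comparing every record with every merged record will not scale to large crawls; both are among the places where the interesting work lies.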

Corpus-based Correction of OCR-introduced Spelling Errors

A common way to archive legacy documents is to run them through a scanner to produce a PDF file, to which a searchable text layer is added using optical character recognition (OCR). Unfortunately, OCR is not perfect, so spelling errors are introduced that damage the effectiveness of search techniques.

Using an existing corpus of several thousand scanned academic papers (in the ACL Anthology), this project aims to develop automatic spelling correction techniques that use the corpus itself as a source of evidence for spelling corrections. For example, if the misrecognised string spe11in8 appears in a document, a simple distance metric may find that a similar string, such as spelling, is much more frequent in the corpus, and so choose it as the correction. Of course it gets much more complicated than this, which is why it's interesting ...
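
The simple version of the idea can be sketched in a few lines of Python. This is an illustration only: it assumes the corpus has already been tokenised into a frequency table, uses plain Levenshtein distance as the "simple distance metric", and accepts a candidate only if it is far more frequent than the suspect string. The names suggest, max_distance and min_ratio, and their default values, are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def suggest(token, counts, max_distance=3, min_ratio=10):
    """Return the most frequent nearby corpus string, or the token unchanged.

    A candidate must be within max_distance edits of the token and at least
    min_ratio times as frequent; both limits are illustrative guesses.
    """
    own = counts.get(token, 0)
    best, best_count = token, own
    for candidate, count in counts.items():
        if count < min_ratio * max(own, 1) or count <= best_count:
            continue
        if edit_distance(token, candidate) <= max_distance:
            best, best_count = candidate, count
    return best


# Toy frequency table standing in for counts taken from the real corpus.
counts = Counter({"spelling": 120, "spe11in8": 1, "spilling": 4, "correction": 80})
print(suggest("spe11in8", counts))
```

A distance limit of three is generous: applied to short words it would happily "correct" rare but legitimate terms, which is one of the complications the project would need to address.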

Inferring Document Structure

Documents have a physical structure -- typically consisting of pages, columns, and paragraphs -- but they also have a logical structure, consisting of title information, sections, subsections, footnotes, tables and so on. PDF documents are primarily intended for rendering on a screen or a printer, and so are focussed on physical structure; they tend not to contain much information, if any, about the logical structure of the document. But that logical structure can be important for a variety of purposes; for example, knowing the logical structure of a document can assist in information retrieval, information extraction and text summarisation.

The aim of this project is to take a corpus of PDF documents, and to build a system that can automatically extract the logical structure of the document text, so that this can be provided in XML form for a variety of more sophisticated processing stages, or for a more flexible rendering model (for example as a hierarchically unfolding document in a web browser).
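
As a rough illustration of the kind of inference involved, the Python sketch below turns a flat list of text lines into nested XML using font size alone. It assumes some PDF extraction tool has already produced (font size, text) pairs, that the most frequent size is body text, and that larger sizes mark headings, with bigger fonts meaning higher-level headings. The input format and the element names are hypothetical, and real documents will break these assumptions regularly.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def infer_structure(lines):
    """Build a nested XML tree from (font_size, text) pairs."""
    # Assumption: the most common font size belongs to body text.
    body_size = Counter(size for size, _ in lines).most_common(1)[0][0]
    # Assumption: every larger size is a heading; largest font is level 1.
    heading_sizes = sorted({s for s, _ in lines if s > body_size}, reverse=True)
    level_of = {size: depth for depth, size in enumerate(heading_sizes, start=1)}

    root = ET.Element("document")
    stack = [(0, root)]  # (heading level, element) for each open section
    for size, text in lines:
        level = level_of.get(size)
        if level is None:
            # Body-sized or smaller text becomes a paragraph in the open section.
            ET.SubElement(stack[-1][1], "para").text = text
            continue
        while stack[-1][0] >= level:  # close sections at the same or deeper level
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section", title=text, level=str(level))
        stack.append((level, section))
    return root


# Made-up input of the kind a PDF text extractor might provide.
lines = [
    (18, "A Sample Paper"),
    (10, "Some introductory text."),
    (14, "Method"),
    (10, "First paragraph."),
    (10, "Second paragraph."),
    (14, "Results"),
    (10, "Closing paragraph."),
]
print(ET.tostring(infer_structure(lines), encoding="unicode"))
```

Font size is only one cue. Numbering patterns, position on the page, and typeface changes all carry evidence too, and footnotes, captions and tables need handling of their own.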

An Automated Newsreader

Automated newsreaders -- 'talking heads' that read out news stories in a synthesized voice -- have been constructed before. These take a textual news source and then use a text-to-speech synthesis engine, in conjunction with an animated head, to deliver that news in spoken language.

The aim of this project is to build such a system with increased realism, by incorporating both appropriate facial gestures and appropriate intonation in the voice. Watch some newsreaders carefully to see how they use their facial expressions to communicate information, and listen to how they use prosody to increase interest in what they are saying. The challenge here is to find techniques that will allow us to derive appropriate audio-visual features from a 'flat text' provided as input.

An Intelligent Agent for the Map Task

The Map Task is an experimental scenario used to gather data on how people interact in certain kinds of situations. It involves two persons -- the route giver and the route follower -- who are looking at their own copies of a map of an island on which there is some buried treasure. The two maps are different: the one belonging to the route giver shows the location of the treasure, and the one belonging to the route follower does not. So, the route giver has to give the route follower instructions as to how to navigate the map to find the treasure. Unfortunately, there are other differences between the two maps that mean instructions are often misunderstood or interpreted incorrectly.

The aim of this project is to build a computational agent that is able to act as either the route giver or the route follower. This involves a number of challenging subtasks: we need (a) a model of the domain using some form of knowledge representation; (b) a language generation system that can work out how a route through the map should be described; (c) a reasoning system that can work out how to recover from problems when the route follower has not understood the provided instructions; and (d) a reasoning system that can interpret the route giver's instructions to plot a path through the map.

The project is suitable for someone who has a strong interest in artificial intelligence or natural language processing.
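
To make subtasks (a) to (c) a little more concrete, here is a deliberately tiny Python sketch. It assumes the map can be represented as a graph of named landmarks joined by compass directions, finds a route by breadth-first search, turns each step into a templated instruction, and flags intermediate landmarks that the follower's map lacks. The map contents, landmark names and function names are all invented for illustration; a real agent would need far richer representations and dialogue handling.

```python
from collections import deque

# Hypothetical route giver's map: landmark -> {direction: neighbouring landmark}.
GIVER_MAP = {
    "start": {"east": "old mill"},
    "old mill": {"west": "start", "south": "stone bridge"},
    "stone bridge": {"north": "old mill", "east": "treasure"},
    "treasure": {"west": "stone bridge"},
}


def find_route(world, origin, goal):
    """Breadth-first search returning a list of (direction, landmark) steps."""
    queue = deque([(origin, [])])
    seen = {origin}
    while queue:
        place, steps = queue.popleft()
        if place == goal:
            return steps
        for direction, neighbour in world[place].items():
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, steps + [(direction, neighbour)]))
    return None  # no route exists


def describe(steps):
    """Render each step as a simple templated instruction."""
    return [f"Go {direction} until you reach the {landmark}."
            for direction, landmark in steps]


def unknown_landmarks(steps, follower_landmarks):
    """Intermediate landmarks the giver mentions that the follower's map lacks.

    The destination is skipped because it is expected to be missing.
    """
    return [lm for _, lm in steps[:-1] if lm not in follower_landmarks]


route = find_route(GIVER_MAP, "start", "treasure")
for sentence in describe(route):
    print(sentence)
# Landmarks likely to cause misunderstanding if the follower only knows these two.
print(unknown_landmarks(route, {"start", "old mill"}))
```

Spotting a mismatched landmark in advance is the easy case. In the real task the giver does not see the follower's map, so mismatches have to be detected and repaired through the dialogue itself.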