Mark Lauer's Thesis Abstract

Designing Statistical Language Learners: Experiments on Noun Compounds

Statistical language learning research takes the view that many traditional natural language processing tasks can be solved by training probabilistic models of language on a sufficient volume of training data. The design of statistical language learners therefore involves answering two questions: (i) Which of the multitude of possible language models will most accurately reflect the properties necessary to a given task? (ii) What will constitute a sufficient volume of training data? Regarding the first question, though a variety of successful models have been discovered, the space of possible designs remains largely unexplored. Regarding the second, exploration of the design space has so far proceeded without an adequate answer.

The goal of this thesis is to advance the exploration of the statistical language learning design space. In pursuit of that goal, the thesis makes two main theoretical contributions: it identifies a new class of designs by providing a novel theory of statistical natural language processing, and it presents the foundations for a predictive theory of data requirements to assist in future design explorations.

The first of these contributions is called the meaning distributions theory. This theory specifies an architecture for natural language analysis in which probabilities are given to semantic forms rather than to more superficial linguistic elements. Thus, rather than assigning probabilities to grammatical structures directly, grammatical forms inherit likelihoods from the semantic forms that they correspond to. The class of designs suggested by this theory represents a promising new area of the design space.
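As a rough illustration of the architecture described above, the following sketch places probabilities on semantic forms and lets each syntactic analysis inherit a score from the meanings it can express. The meaning labels, the mapping, the numbers, and the choice to sum inherited probabilities are all assumptions made for this example; they are not taken from the thesis.

```python
# Toy sketch: probabilities are attached to semantic forms, and syntactic
# analyses inherit their likelihood from the meanings they correspond to.
# Every name and number here is invented for illustration.

# Hypothetical distribution over semantic forms for one ambiguous input.
MEANING_PROB = {
    "meaning_1": 0.6,
    "meaning_2": 0.3,
    "meaning_3": 0.1,
}

# Hypothetical mapping from each syntactic analysis to the meanings it expresses.
EXPRESSES = {
    "[[a b] c]": ["meaning_1", "meaning_3"],
    "[a [b c]]": ["meaning_2"],
}


def inherited_scores(meaning_prob, expresses):
    """Score each analysis by summing the probabilities of its meanings.

    Summation is an assumption of this sketch, not a claim about the thesis.
    """
    return {
        analysis: sum(meaning_prob.get(m, 0.0) for m in meanings)
        for analysis, meanings in expresses.items()
    }


def best_analysis(meaning_prob, expresses):
    """Return the analysis with the highest inherited score."""
    scores = inherited_scores(meaning_prob, expresses)
    return max(scores, key=scores.get)


print(best_analysis(MEANING_PROB, EXPRESSES))
```

The point of the sketch is only the direction of inheritance: no probability is ever assigned to a bracketing directly.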

The second theoretical contribution concerns the development of a mathematical theory whose aim is to predict the expected accuracy of a statistical language learning system in terms of the volume of data used to train it. Since the availability of appropriate training data is a key design issue, such a theory constitutes an invaluable navigational aid. The work completed includes the development of a framework for viewing data requirements and a number of results allowing the prediction of necessary training data volumes under certain conditions.

The experimental contributions of this thesis illustrate the theoretical work by applying statistical language learning designs to the analysis of noun compounds. Both syntactic and semantic analysis of noun compounds have been approached using probabilistic models based on the meaning distributions theory.

In the experiments on syntax, a novel model, based on dependency relations between concepts, was developed and implemented. Empirical comparisons demonstrated that this model is significantly better than those previously proposed and approaches the performance of human judges on the same task. This model also correctly predicts the observed distribution of syntactic structures.
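To make the idea of a dependency-based analysis concrete, here is a minimal sketch for bracketing a three-noun compound. It compares how strongly the first noun is associated with the second against how strongly it is associated with the third. The sketch works over words rather than concepts, and the association counts, the tie-breaking rule, and the function name `bracket` are all assumptions made for illustration, so it should be read as one possible shape for such a model rather than the model developed in the thesis.

```python
# Toy sketch of dependency-style bracketing for a three-noun compound.
# Counts are invented; a real system would estimate them from a corpus
# (and, per the abstract, over concepts rather than raw words).

# Hypothetical association counts for (modifier, head) pairs.
ASSOC = {
    ("computer", "science"): 20,
    ("computer", "department"): 3,
    ("woman", "aid"): 1,
    ("woman", "worker"): 8,
}


def bracket(w1, w2, w3, assoc):
    """Choose a bracketing by asking which noun w1 most plausibly modifies.

    If w1 is more strongly tied to w2, group them first (left-branching);
    otherwise w1 modifies w3 and the compound is right-branching.
    Ties default to left-branching, which is an assumption of this sketch.
    """
    left = assoc.get((w1, w2), 0)
    right = assoc.get((w1, w3), 0)
    if left >= right:
        return f"[[{w1} {w2}] {w3}]"
    return f"[{w1} [{w2} {w3}]]"


print(bracket("computer", "science", "department", ASSOC))
print(bracket("woman", "aid", "worker", ASSOC))
```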

In the experiments on semantic analysis, a novel model, the first statistical model of this problem, was developed and implemented. The system uses statistics computed from prepositional phrases to predict a paraphrase with significantly better accuracy than the baseline strategy. The training data used is both sparse and noisy, and the experimental results support the need for a theory of data requirements. Without a predictive data requirements theory, statistical language learning remains an art form.
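The following sketch shows one way statistics from prepositional phrases might be used to choose a paraphrase for a two-noun compound. The preposition inventory, the counts, the scoring rule (a product of two pair counts), and the fallback preposition are all hypothetical choices for this example and are not drawn from the thesis.

```python
# Toy sketch: pick a prepositional paraphrase "head PREP modifier" for a
# compound "modifier head" using counts from prepositional phrases.
# Inventory, counts, and scoring rule are invented for illustration.

PREPOSITIONS = ["of", "for", "in", "on"]  # assumed inventory

# Hypothetical counts of (head noun, preposition) pairs.
HEAD_PREP = {("chair", "for"): 5, ("chair", "of"): 2}

# Hypothetical counts of (preposition, object noun) pairs.
PREP_MOD = {("for", "baby"): 4, ("of", "baby"): 1}


def paraphrase(modifier, head, head_prep, prep_mod, default="of"):
    """Return the paraphrase whose preposition has the highest score.

    The score multiplies two pair counts. When no preposition has any
    evidence (a likely case with sparse data), fall back to `default`.
    """
    best_prep, best_score = default, 0
    for prep in PREPOSITIONS:
        score = head_prep.get((head, prep), 0) * prep_mod.get((prep, modifier), 0)
        if score > best_score:
            best_prep, best_score = prep, score
    return f"{head} {best_prep} {modifier}"


print(paraphrase("baby", "chair", HEAD_PREP, PREP_MOD))
```

The fallback branch is where sparse data shows up in practice: with too few observations, most compounds receive the default answer, and the model does no better than always guessing one fixed preposition.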




Last updated December 1st, 2003, by Mark Lauer. Engineered with vi.