Language Technology Tutorial

Language Technology: Techniques for Knowledge Extraction and Presentation

Chris Manning and Robert Dale

Brief Description of the Tutorial

Language Technology is natural language processing with an applied slant: it is concerned with the construction of artifacts that analyse or produce human language. In this tutorial, we survey the two sides of the technology: we look first at how Language Technology can be used to do useful things in the analysis of natural language, and then we look at how Language Technology can be used to automatically generate natural language from an underlying information source. Our emphasis throughout is on what is possible now.

Detailed Outline of the Tutorial

Part 1: Language Analysis

While there has been a great deal of work on mining data from databases, the majority of data within most businesses (and the data available about their competitors) is not in databases, but in material written in human languages, such as reports, brochures, manuals, etc. For this material to be effectively used, intelligent language analysis is necessary.

The tutorial will examine the rapid progress in practical language analysis systems for such purposes within the last decade. The general problem of fully understanding an arbitrary natural language text is clearly "AI-complete", and may moreover be complicated by numerous real-world factors such as spelling errors, or tables within the text (that assume a monospace font!). Within the engineering focus of language technology, language analysis systems do not attempt to understand everything, but rather to do low-level parsing and extraction of relevant information. Such systems are imperfect, but can be made fast, robust, and practical for many applications.

The tutorial will outline some of the major methods, such as part-of-speech tagging, robust parsing, partial or chunk parsing, text clustering and classification, and probabilistic and other machine learning methods for the acquisition and use of linguistic knowledge. These will be discussed in relation to applications such as text segmentation and categorization, intelligent information access, and knowledge extraction. We will include a case study of an information extraction system that produces database records from natural language data.
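As a rough illustration of what "producing database records from natural language data" can mean, the following minimal sketch extracts one kind of event from text with a single hand-written pattern. It is not the case study system presented in the tutorial: the event type (company acquisitions), the pattern, the field names, and the function extract_acquisitions are all assumptions made for this example, and a practical system would rely on tagging and chunk parsing rather than on capitalisation alone.

```python
import re
from typing import Dict, List, Optional

# Hypothetical approximation of a company name: one or more capitalised words.
COMPANY = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

# Hypothetical pattern for a single event type:
#   "<buyer> acquired|bought <target> [for $<amount> million]"
ACQUISITION = re.compile(
    rf"\b(?P<buyer>{COMPANY}) (?:acquired|bought) (?P<target>{COMPANY})"
    r"(?: for \$(?P<amount_millions>\d+(?:\.\d+)?) million)?"
)


def extract_acquisitions(text: str) -> List[Dict[str, Optional[object]]]:
    """Return one record per acquisition event found in the text.

    Text that does not match the pattern is simply ignored, which is what
    makes this kind of shallow extraction robust but incomplete.
    """
    records = []
    for match in ACQUISITION.finditer(text):
        amount = match.group("amount_millions")
        records.append({
            "buyer": match.group("buyer"),
            "target": match.group("target"),
            # The amount is optional in the pattern, so it may be missing.
            "amount_millions": float(amount) if amount else None,
        })
    return records


if __name__ == "__main__":
    sample = (
        "Last week Acme Systems acquired Widget Works for $12.5 million. "
        "Analysts were surprised. Globex bought Initech last year."
    )
    for record in extract_acquisitions(sample):
        print(record)
```

Each returned dictionary corresponds to one row that could be inserted into a database table. The sketch deliberately does not try to understand the sentences it skips, in keeping with the engineering focus described above.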

Part 2: Language Generation

Natural language generation systems produce understandable texts in English or other human languages from some underlying non-linguistic representation of information. NLG systems combine knowledge about language and the application domain to automatically produce documents, reports, explanations, help messages, and other kinds of texts.

The late 1990s is an exciting time for applied NLG. Ten years ago, NLG was purely a research activity, but in 1997 there are several fielded NLG systems in everyday use, and many more systems under development. In this tutorial, we will describe some of the techniques that are being used to build practical working applications today; we will also provide pointers to leading-edge research developments in the field. The material is based around a popular architectural model of NLG that encompasses the three stages of text planning, sentence planning and linguistic realisation. We will include a case study showing how to construct an NLG system which produces textual meteorological summaries from underlying numeric data sets.
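The three-stage architecture can be sketched in a few lines of code. The following toy pipeline is only an illustration under stated assumptions, not the case study system itself: the Message structure, the 10% tolerance, the input format, and all function names are hypothetical. Text planning decides what to say, sentence planning groups messages into sentences and chooses words, and linguistic realisation handles agreement, capitalisation, and punctuation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Message:
    """A unit of content chosen by the text planner (hypothetical structure)."""
    topic: str      # e.g. "temperature" or "rainfall"
    relation: str   # "above", "below", or "near" the long-term average
    value: float
    unit: str


def text_planning(data: Dict[str, Tuple[float, str]],
                  averages: Dict[str, float],
                  tolerance: float = 0.1) -> List[Message]:
    """Stage 1: decide what to say and in what order."""
    messages = []
    for topic in ("temperature", "rainfall"):  # fixed content ordering
        value, unit = data[topic]
        average = averages[topic]
        if value > average * (1 + tolerance):
            relation = "above"
        elif value < average * (1 - tolerance):
            relation = "below"
        else:
            relation = "near"
        messages.append(Message(topic, relation, value, unit))
    return messages


def sentence_planning(messages: List[Message]) -> List[dict]:
    """Stage 2: choose words and aggregate adjacent messages that share a predicate."""
    phrases = {"above": "above average", "below": "below average",
               "near": "close to average"}
    specs: List[dict] = []
    for msg in messages:
        predicate = phrases[msg.relation]
        if specs and specs[-1]["predicate"] == predicate:
            specs[-1]["subjects"].append(msg)  # aggregation into one sentence
        else:
            specs.append({"subjects": [msg], "predicate": predicate})
    return specs


def linguistic_realisation(specs: List[dict]) -> str:
    """Stage 3: apply agreement, capitalisation, and punctuation."""
    sentences = []
    for spec in specs:
        nouns = [f"{m.topic} ({m.value:g} {m.unit})" for m in spec["subjects"]]
        verb = "was" if len(nouns) == 1 else "were"  # number agreement
        sentence = f"{' and '.join(nouns)} {verb} {spec['predicate']}."
        sentences.append(sentence[0].upper() + sentence[1:])
    return " ".join(sentences)


def generate_summary(data, averages) -> str:
    """Run the three stages in sequence."""
    return linguistic_realisation(sentence_planning(text_planning(data, averages)))


if __name__ == "__main__":
    month = {"temperature": (24.5, "degrees C"), "rainfall": (12.0, "mm")}
    normal = {"temperature": 21.0, "rainfall": 40.0}
    print(generate_summary(month, normal))
```

Keeping the stages separate means that the wording in the sentence planner or the grammar rules in the realiser can be changed without touching the logic that selects content from the numeric data.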

Necessary Background and The Target Audience

The tutorial should be useful for managers, implementors, and researchers. For managers, it will provide a broad overview of the field and what is possible today; for implementors, it will provide a realistic assessment of available techniques; and for researchers, it will highlight the issues that are important in current language technology projects. Most of the tutorial will be accessible to someone with basic knowledge of concepts in artificial intelligence or information systems. Some parts of it will briefly discuss issues that assume more knowledge of linguistics, natural language processing, or statistics, but we will try to provide as much context as is possible in the time available.

Why The Tutorial is of Interest

Language Technology is a growing area of both research and commercial interest, encouraged in part by the vast amounts of textual information available on the Internet and in other electronic forms. LT offers approaches to the processing and presentation of this resource that will have significant impact on how we use and view information in the future.

Brief Biography of the Presenters

Associate Professor Robert Dale is Director of the Microsoft Research Institute at Macquarie University, and leader of the Institute's Language Technology Group, Australia's largest concentration of expertise in natural language processing. He has been active in language technology research and development for 15 years. He received a PhD from the University of Edinburgh in the area of Computational Linguistics in 1988, and has published widely in the area; his 1992 book (Generating Referring Expressions; MIT Press) is widely cited, and he has edited two published collections of papers on natural language generation. He is co-author of the forthcoming first textbook in the area of natural language generation, to be published in 1999; for a tutorial-level introduction to the area, see:

E Reiter and R Dale [1997] Building Applied Natural Language Generation Systems. Journal of Natural Language Engineering, 3:57-87.

Christopher Manning is a lecturer at the University of Sydney, and manages SULTRY, the Sydney University Language Technology Research laboratorY. He received his PhD in Linguistics from Stanford University in 1994. His experience in language technology includes work as a consultant at Xerox PARC, as an Assistant Professor in the Computational Linguistics Program at Carnegie Mellon University, and doing contract research in Australia. He has worked and published in constraint-based syntax and linguistic typology, as well as in corpus-based computational linguistics. He is currently co-authoring a textbook on Statistical NLP, to be published by MIT Press.