A List of Summarization Projects



Sentence Extraction

Surrey University: Summ-It applet

This summarization system extracts sentences on the basis of lexical cohesion.

Royal Institute of Technology (Sweden): SweSum

SweSum extracts sentences to produce an extract-type summary. It is closely related to the work at ISI. Summaries are created from Swedish or English texts in either the newspaper or academic domains. Sentences are ranked according to weighted word-level features, and the highest-ranked ones are extracted. The system was trained on a tagged Swedish news corpus. The summarization tool can be hooked up to search engine results.
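The ranking step described above can be illustrated with a minimal sketch. It is not SweSum's implementation: the two features (word frequency and sentence position), the weights in WEIGHTS, and the function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical feature weights; the real features and weights are not
# given in the description above.
WEIGHTS = {"frequency": 1.0, "position": 0.5}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def rank_sentences(text, top_n=2):
    """Return the top_n highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    words = re.findall(r"\w+", text.lower())
    freq = Counter(words)
    total = len(words) or 1
    scored = []
    for i, sent in enumerate(sentences):
        tokens = re.findall(r"\w+", sent.lower())
        # Frequency feature: mean relative frequency of the sentence's words.
        f = sum(freq[t] for t in tokens) / (len(tokens) or 1) / total
        # Position feature: earlier sentences score higher.
        p = 1.0 - i / len(sentences)
        score = WEIGHTS["frequency"] * f + WEIGHTS["position"] * p
        scored.append((score, i, sent))
    best = sorted(scored, key=lambda x: -x[0])[:top_n]
    # Restore document order so the extract reads naturally.
    return [sent for _, _, sent in sorted(best, key=lambda x: x[1])]
```

In a trained system the weights would be estimated from a tagged corpus rather than fixed by hand as they are here.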

University of Ottawa: The Text Summarization Project

Not much is available about this research project except its project proposal. In it, the authors proposed to use machine learning techniques to identify keywords, which can then be used to select sentences for extraction. They planned to use surface-level statistics such as frequency analysis and surface-level linguistic features such as sentence position.

Columbia University: FociSum (1998)

The FociSum system takes a question-and-answer approach to summarization. Sentences that answer key questions about participants, organisations and other wh-questions are extracted. The result is a concatenation of sentence fragments and clauses found in the original document. The system first uses a named entity extractor to find the foci of the document. A question generator then suggests relationships between these entities. The document is parsed to find candidate answers to these questions on the basis of syntactic form. Finally, sentence fragments and clauses are pulled out of the selected sentences.

University of Southern California: ISI Summarist

Summarist produces summaries of web documents. It has been hooked up to the Systran translation system to provide a gisting tool for news articles in any language. Summarist first identifies the main topics of the document using statistical techniques on features such as position and word counts. Research is under way to use cue phrases and discourse structure. These concepts must then be interpreted so that, from a chain of lexically connected sentences, the sentence with the most general concept is selected and extracted. Subsequent work will use these extracted sentences to construct a more coherent summary.

Deep Understanding

Sheffield University: TRESTLE

This project produces summaries in the news domain. It uses MUC-style information extraction to identify the main concepts of the text, which are then presumably used to generate summaries. Unfortunately, not much information about the system architecture is available on the official website.

Columbia University: SUMMONS (1996)

Summons is a multi-document summarization system in the news domain. It begins with the results of a MUC-style information extraction process, namely a template with instantiated slots of pre-defined semantics. From this, it generates a summary using a sophisticated natural language generation stage. This stage was previously developed under other projects and includes a content selection substage, a sentence planning substage and a surface generation substage. Because the templates have well-defined semantics, the type of summary produced approaches that of human abstracts. That is, they are more coherent and readable. However, this approach is domain specific, relying on the layout of news articles for the information extraction stage.
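The path from a filled template to a sentence can be sketched as follows. This is only a toy illustration of template-driven generation, not the SUMMONS generator: the IncidentTemplate class, its slot names, and the single fixed sentence pattern in realize are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentTemplate:
    """Hypothetical MUC-style template; the slot names are illustrative."""
    incident_type: str
    location: Optional[str] = None
    date: Optional[str] = None
    perpetrator: Optional[str] = None
    victims: Optional[int] = None


def select_content(template):
    """Content selection: keep only the slots that were actually filled."""
    return {k: v for k, v in vars(template).items() if v is not None}


def realize(template):
    """Tiny surface realizer: fills one fixed sentence pattern."""
    slots = select_content(template)
    parts = [f"A {slots['incident_type']} occurred"]
    if "location" in slots:
        parts.append(f"in {slots['location']}")
    if "date" in slots:
        parts.append(f"on {slots['date']}")
    sentence = " ".join(parts)
    if "victims" in slots:
        sentence += f", affecting {slots['victims']} people"
    if "perpetrator" in slots:
        sentence += f"; {slots['perpetrator']} was reported responsible"
    return sentence + "."
```

Because every slot has a fixed meaning, the generator never has to interpret free text, which is also why the approach is tied to one domain.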

Hybrid Approaches (These combine extraction techniques with more traditional NLP techniques)

Columbia University: MultiGen (1999)

MultiGen is a multi-document system in the news domain. It extracts sentence fragments that represent key pieces of information in a set of related documents. This is done by using machine learning to group paragraph-sized chunks of text into clusters of related topics. Sentences from these clusters are parsed and the resulting trees are merged, building logical representations of propositions that contain the commonly occurring concepts. Each logical representation is turned into a sentence using the FUF/SURGE grammar. Matching concepts uses linguistic knowledge such as stemming, part-of-speech, synonymy and verb classes. Merging trees makes use of identified paraphrase rules.
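The grouping of paragraph-sized chunks into topic clusters can be sketched as below. MultiGen uses a learned similarity measure; this sketch swaps in plain Jaccard word overlap and a greedy single-pass clustering, so the function names, the threshold value and the technique itself are assumptions made for illustration.

```python
import re


def tokens(text):
    """Lower-cased set of word tokens in a paragraph."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_paragraphs(paragraphs, threshold=0.3):
    """Greedy clustering: a paragraph joins the first cluster whose seed
    (first member) it resembles, otherwise it starts a new cluster.
    Returns clusters as lists of paragraph indices."""
    clusters = []
    for i, paragraph in enumerate(paragraphs):
        current = tokens(paragraph)
        for cluster in clusters:
            seed = tokens(paragraphs[cluster[0]])
            if jaccard(current, seed) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([i])
    return clusters
```

Each resulting cluster would then be handed to the parsing and tree-merging stages, which this sketch does not attempt.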

Columbia University: Copy and Paste (1999)

The Copy and Paste system is a domain-independent, single-document summariser. It is designed to take the results of a sentence extraction summariser and extract key concepts from these sentences. These concepts are then combined to form new sentences. The system thus copies the surface form of these key concepts and pastes them into the new sentences. This is done by first reducing each sentence, removing any extraneous information. This step uses lexical links and probabilities learnt from a training corpus. The reduced sentences are then merged using rules such as adding extra information about speakers, adding conjunctions and merging common elements.



(These are mostly extraction based)
note: The following descriptions are derived from information found on the products' official websites