Sentence Extraction
Surrey University: Summ-It applet
This summarization system works by extracting sentences using lexical cohesion.
Royal Institute of Technology (Sweden): SweSum
SweSum extracts sentences to produce an extract-type summary. It is closely related to the work at ISI. Summaries are created from Swedish or English texts in either the newspaper or academic domains. Sentences are ranked according to weighted word-level features, and the system was trained on a tagged Swedish news corpus. The summarization tool can be hooked up to search engine results.
University of Ottawa: The Text Summarization Project
Not much is available about this research project except its project proposal. In it, the authors proposed to use machine learning techniques to identify keywords. Keyword identification can then be used to select sentences for extraction. They planned to use surface-level statistics such as frequency analysis and surface-level linguistic features such as sentence position.
Columbia University: FociSum (1998)
The FociSum system takes a question-and-answer approach to summarization. Sentences that answer key questions regarding participants, organisations and other wh-questions are extracted. The result is a concatenation of sentence fragments and clauses found in the original document. The system first uses a named entity extractor to find the foci of the document. A question generator is used to suggest relationships between these entities. The document is parsed to find candidate answers for these questions on the basis of syntactic form. Sentence fragments and clauses are pulled out of the selected sentences.
University of Southern California: ISI Summarist
Summarist produces summaries of web documents. It has been hooked up to the Systran translation system to provide a gisting tool for news articles in any language. Summarist first identifies the main topics of the document using statistical techniques on features such as position and word counts. Research is underway on using cue phrases and discourse structure. These concepts must be interpreted so that, from a chain of lexically connected sentences, the sentence with the most general concept is selected and extracted. Subsequent work will take these extracted sentences to construct a more coherent summary.
Deep Understanding
The Sheffield University TRESTLE
This project produces summaries in the news domain. It uses MUC to extract the main concepts of the text, which are then presumably used to generate summaries. Unfortunately, not much information is available on the official website regarding the system architecture.
Columbia University: SUMMONS (1996)
SUMMONS is a multi-document summary system in the news domain. It begins with the results of a MUC-style information extraction process, namely a template with instantiated slots of pre-defined semantics. From this, it can generate a summary by using a sophisticated natural language generation stage. This stage was previously developed under other projects and includes a content selection substage, a sentence planning substage and a surface generation stage. Because the templates have well-defined semantics, the type of summary produced approaches that of human abstracts. That is, they are more coherent and readable. However, this approach is domain specific, relying on the layout of news articles for the information extraction stage.
Hybrid Approaches (These combine extraction techniques with more traditional NLP techniques)
Columbia University: MultiGen (1999)
MultiGen is a multi-document system in the news domain. It extracts sentence fragments that represent key pieces of information in the set of related documents. This is done by using machine learning to group paragraph-sized chunks of text into clusters of related topics. Sentences from these clusters are parsed and the resulting trees are merged together, building logical representations of propositions containing the commonly occurring concepts. This logical representation is turned into a sentence using the FUF/SURGE grammar. Matching concepts uses linguistic knowledge such as stemming, part-of-speech, synonymity and verb classes. Merging trees makes use of identified paraphrase rules.
Copy and Paste (1999)
The Copy and Paste system is a single-document summariser that is domain independent. It is designed to take the results of a sentence extraction summariser and extract key concepts from these sentences. These concepts are then combined to form new sentences. The system thus copies the surface form of these key concepts and pastes them into the new sentences. This is done by first reducing each sentence, removing any extraneous information. This step uses probabilities learnt from a training corpus, and lexical links. The reduced sentences are merged by using rules such as adding extra information about speakers, adding conjunctives and merging common elements.
Commercial
Note: The following descriptions are derived from information found on the products' official websites.
Datahammer is a product designed to summarise online texts; it works in conjunction with the user's web browser. It extracts sentences by using an algorithm called 'Microword Tree Trimming', which its makers created. A demo version is available from their website.
Text Analyst extracts sentences from documents on the user's computer. The official website of Text Analyst describes the summarization process it uses. A semantic network is constructed from the source document using a neural network. The makers state that the construction of the semantic network is not dependent on prior domain-specific knowledge. A graphical representation of concepts and relationships in the source document is shown to the user for selection. Sentences with matching concepts and relationships are extracted.
IBM Japan incorporates summarization tools in two of its products: Internet King of Translation (Japanese) and Lotus Word Pro (Japanese Version). The summaries produced are sentence extracts, selected using rhetorical relations and position within the document. Extraction is done statistically and can utilise genre-specific features.
IBM also has a toolset called Text Analysis, which includes a summarization tool as one of its components. This toolset is part of the Intelligent Text Miner product. It produces summaries by extracting sentences. As with most commercial products, this is done by ranking the sentences by a measure of importance and then selecting the top-ranked ones. Ranking is based on word-level features, and the user can select the extract length.
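The rank-then-select pattern described above can be sketched in a few lines of Python. This is only a minimal illustration, not IBM's method: the importance measure (average word frequency), the stopword list, and the names `summarize`, `split_sentences` and `content_words` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "it", "that", "this"}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    # Lowercased alphabetic tokens with stopwords removed.
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]

def summarize(text, length=2):
    """Return the `length` highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        # Assumed importance measure: mean document frequency of the words.
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score; ties keep their document order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Keep the top `length` indices and restore document order.
    return [sentences[i] for i in sorted(ranked[:length])]

text = ("Cats chase mice. Mice eat cheese. "
        "The weather is nice. Cats and mice are enemies.")
print(summarize(text, length=2))
```

The `length` parameter plays the role of the user-selected extract length. Returning the chosen sentences in their original order, rather than in score order, keeps the extract readable.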
Mitre's WebSumm performs sentence extraction over single or multiple documents in conjunction with a search engine. The resulting summary is an extract of sentences based on a user's query. This is done by representing the source document(s) as a network of sentences. The query terms are used to select related nodes, and the corresponding sentences are extracted. The summarisation tool is able to handle both similar and contrasting sentences across multiple documents.
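To make the idea of a sentence network concrete, here is a small Python sketch under loudly stated assumptions: sentences are linked when they share a content word, the query selects seed sentences that contain a query term, and the extract is the seeds plus their direct neighbours. How WebSumm actually builds and traverses its network is not described here, so the linking rule and the names `build_graph` and `query_extract` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "for"}

def words(text):
    # Set of lowercased alphabetic tokens with stopwords removed.
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def build_graph(sentences):
    """Assumed linking rule: connect sentences sharing a content word."""
    bags = [words(s) for s in sentences]
    graph = defaultdict(set)
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if bags[i] & bags[j]:
                graph[i].add(j)
                graph[j].add(i)
    return graph, bags

def query_extract(sentences, query):
    """Select sentences that match the query, plus their graph neighbours."""
    graph, bags = build_graph(sentences)
    query_terms = words(query)
    seeds = {i for i, bag in enumerate(bags) if bag & query_terms}
    selected = set(seeds)
    for i in seeds:
        selected |= graph[i]
    # Return the selection in original order.
    return [sentences[i] for i in sorted(selected)]

sentences = [
    "The bank raised interest rates.",
    "Higher rates hurt borrowers.",
    "The river flooded the town.",
    "Borrowers asked for relief.",
]
print(query_extract(sentences, "interest"))
```

Because the input is just a list of sentences, the same sketch applies whether they come from one document or several.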
This search engine development kit is used to build a search engine for a particular website. In doing so, it is supposed to allow dynamic creation of summaries of documents in the website when requested by users. The product's website gives no information as to how this is done.
InText extracts key sentences by using key words, although the exact technique is not mentioned on the product's website. The description mentions that the user may choose one of several extraction techniques. InText is a product that the user installs and uses on documents already residing on the computer.
It is difficult to reach the official site, so this paragraph is based on second-hand descriptions. British Telecom produced a summarization tool for both offline and online texts which works by selecting key sentences and extracting them as a summary. However, since ProSum and NetSum are commercially available products, the internal mechanisms behind the extraction process are not disclosed. Alternatively, instead of extracting the sentences, the tool highlights them in the original document. The user is able to change the length of the summary produced. According to researchers at the University of Ottawa, ProSum works best with factual documents of a single theme. These include genres such as newspaper articles and technical journals. It doesn't work as well with lists and narrative works.
inXight's Summary Server is an application that creates extraction-based summaries offline. Users view summaries when they move their mouse over a hypertext link to a document that has been previously summarised. It uses statistical extraction techniques based on features such as sentence position, sentence length and keywords. The application allows the user to specify length and salience of certain keywords. It also provides the ability for further training on structured documents of other genres.
Extractor picks out keywords from a document. From my home page, Extractor produced the following list:
- Natural Language
- Communication Sciences
- Intelligent Interactive Technology
- CMIS
- Technology Group
- Macquarie University
- Honours student.
Extractor simply finds key phrases and the sentences that use them. The Extractor algorithm uses a set of parameters (such as stem length) which are tuned by a genetic algorithm (GenEx). The extracted phrases are then matched with their occurrences in the document and the corresponding sentences are extracted. Simple stemming and morphological analysis of the extracted key words is used to score non-noun phrases lower; however, there is no synonym detection. The list of extracted sentences is filtered based on heuristics regarding presentation.
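The matching step, pairing each key phrase with the sentences in which it occurs, can be sketched as follows. This is a rough Python illustration only: it assumes the key phrases are already given, uses plain case-insensitive substring matching, and leaves out the GenEx tuning, stemming and presentation filtering. The function names and the sample text are invented for the example.

```python
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentences_for_phrases(text, phrases):
    """Return (sentence, matched phrases) pairs for sentences with a hit."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        # Assumed matching rule: case-insensitive substring search.
        matched = [p for p in phrases if p.lower() in lowered]
        if matched:
            hits.append((sentence, matched))
    return hits

# Invented sample text; the phrases are taken from the list above.
sample = ("I am an Honours student at Macquarie University. "
          "I like hiking. "
          "My research is on natural language generation.")
phrases = ["Natural Language", "Macquarie University", "Honours student"]

for sentence, matched in sentences_for_phrases(sample, phrases):
    print(matched, "->", sentence)
```

Substring matching is the simplest choice that fits the description; without stemming or synonym detection, a phrase is only found when it appears in close to its surface form.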
"The importance of a sentence is determined by some surface clues such as the number of important keywords, the type of sentence (fact, conjecture, opinion, etc.), rhetorical relations in the context, and the location in which a sentence exists in a document."