[Graduation Project] Multi-Document Automatic Summarization for Disaster Events

Source: Internet
Author: User

What am I doing here?

As the name implies, multi-document automatic summarization extracts summary content from multiple documents.

Here, the research object is limited to disaster events.

Design Ideas

Before building a multi-document automatic summarizer, consider the following questions first:

    • What is the basic unit for content extraction?
    • What counts as important content?

Basic Unit

A document can be split at several levels: the document itself, paragraphs, sentences, words, and characters.

    • First, the summary must be readable, so words and characters cannot serve as the unit: they clearly cannot express complete semantics.
    • Second, a document is a complete description of an event and covers many aspects of it, so the document itself cannot be the unit: it contains too much and is not atomic.
    • Third, a paragraph sometimes describes a single aspect of an event, but because the documents come from the web, some documents consist of a single paragraph that covers too much.

Conclusion:

The sentence is the basic unit: it expresses complete semantics and describes only one specific piece of content.

Clustering

An event is made up of multiple elements. Take a news event such as a plane crash: coverage will inevitably involve the situation on site, property loss, casualties, the rescue effort, the aftermath, and other aspects. We therefore need to group the basic units by aspect. Unfortunately, we cannot predict in advance how many aspects an event will have, so a fixed set of categories will not work. This is where clustering comes in: it categorizes the sentences automatically.

Important Content

Important content is easy to understand: content that is mentioned in many documents must be important.

So we can assume that if a sentence, or a close variant of it, appears many times, it is important content.

Design Conclusion

    1. Split the documents into a collection of sentences.
    2. Cluster the sentence collection.
    3. Select the important clusters.
    4. For each selected cluster, produce the summary: delete near-duplicate sentences within the cluster and keep the important ones.

Implementation

For the first step, a collection of sentences can be obtained from a collection of documents with simple string splitting. For the second step, the clustering algorithm is described in detail below. For the third step, the more elements a cluster contains, the more important that cluster is. For the fourth step, two questions remain to be solved: which elements are similar, and which elements are important.
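The first step could be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `split_sentences` and the choice of sentence-ending punctuation (Chinese and English full stops, question marks, and exclamation marks) are assumptions, and the naive rule will also split on periods inside abbreviations or decimal numbers.

```python
import re

# A run of non-terminator characters followed by its terminator(s).
# The terminator set (Chinese and English) is an assumption for illustration.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]*")


def split_sentences(documents):
    """Split every document into sentences and return one flat list."""
    sentences = []
    for doc in documents:
        for piece in _SENTENCE.findall(doc):
            piece = piece.strip()
            if piece:  # skip whitespace-only fragments
                sentences.append(piece)
    return sentences


if __name__ == "__main__":
    docs = ["The plane crashed. Rescue began!", "飞机坠毁。救援开始。"]
    for sentence in split_sentences(docs):
        print(sentence)
```

Real web text usually needs extra cleaning (markup, navigation text, repeated headlines) before splitting; that is outside the scope of this sketch.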

Sentence Similarity Judgment

Research on sentence similarity calculation dates back to the 1950s. Work on Chinese sentence similarity started later, but it has also produced many results. This article does not go into the similarity algorithms in detail (note: this does not mean the algorithm is unimportant here); readers who want to go further can consult the relevant literature. This article uses a relatively simple sentence similarity calculation: edit distance vectors.
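As a rough illustration of an edit-distance-based similarity, the sketch below computes the plain Levenshtein distance between two sentences and normalizes it to a score in [0, 1]. The article does not spell out its exact "edit distance vector" formulation, so the normalization by the longer length and the names `edit_distance` and `similarity` are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical (assumed scaling)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(similarity("The plane crashed.", "The plane has crashed."))
```

This version compares character by character, which works for Chinese text without word segmentation. Comparing token lists instead would only require passing lists of words rather than strings.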

Clustering Algorithm

Building on the similarity calculation, we can use the following simple but neat clustering algorithm.

Clustering groups similar elements together. Here the set of sentences can be abstracted as an undirected graph: each sentence is a node, and an edge connects every pair of similar sentences.

The connected subgraphs of this graph are the final clusters. Note, however, that many subgraphs contain only a single node or a small number of nodes, which suggests they are probably not important, so these sets need not be considered.
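The graph view of clustering can be sketched as a connected-components search. In this hedged example the `threshold` that decides whether two sentences are "similar", the `min_size` cutoff for discarding small subgraphs, and the function names are all assumptions; `stand_in_similarity` uses the standard library's `difflib.SequenceMatcher` only as a placeholder for whatever similarity measure is actually chosen.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a, b):
    # Placeholder measure from the standard library, not the article's method.
    return SequenceMatcher(None, a, b).ratio()


def cluster_sentences(sentences, similarity=stand_in_similarity,
                      threshold=0.6, min_size=2):
    """Group sentences into connected components of the similarity graph."""
    n = len(sentences)
    # Build the undirected graph: an edge joins each pair of similar sentences.
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(sentences[i], sentences[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)

    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        # Depth-first search collects one connected subgraph.
        seen[start] = True
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node]:
                if not seen[nxt]:
                    seen[nxt] = True
                    stack.append(nxt)
        # Drop subgraphs with too few nodes; they are treated as unimportant.
        if len(component) >= min_size:
            clusters.append([sentences[k] for k in sorted(component)])

    # Larger clusters first: more members means a more important cluster.
    clusters.sort(key=len, reverse=True)
    return clusters
```

Comparing every pair costs O(n²) similarity calls, which is acceptable for a few thousand sentences but would need an indexing or blocking step for larger collections.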

Handling Similar Sentences and Important Sentences

Similar sentences are easy to handle: if several sentences within a cluster are highly similar to one another, keep only the most important one.

Important sentences: for any sentence a in a cluster A, let Va be the sum of the similarity values between a and every other element of A. Sorting the sentences by V in descending order puts the important sentences first.
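The V-value ranking and the removal of near-duplicates could look like the sketch below. The similarity function is passed in as a parameter; `duplicate_threshold`, `limit`, and the function names are assumptions made for illustration.

```python
def rank_sentences(cluster, similarity):
    """Sort a cluster by V: each sentence's summed similarity to the others."""
    scored = []
    for i, a in enumerate(cluster):
        v = sum(similarity(a, b) for j, b in enumerate(cluster) if j != i)
        scored.append((v, a))
    # Descending by V; ties keep their original order (sorted() is stable).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]


def pick_representatives(cluster, similarity,
                         duplicate_threshold=0.9, limit=1):
    """Walk the ranking and keep sentences that are not near-duplicates."""
    kept = []
    for sentence in rank_sentences(cluster, similarity):
        if all(similarity(sentence, k) < duplicate_threshold for k in kept):
            kept.append(sentence)
        if len(kept) == limit:
            break
    return kept
```

Because the walk follows the ranking, whenever two sentences are near-duplicates the one with the higher V-value is the one that survives, which matches the rule of keeping the most important sentence.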
