Study on Methods of Automatic Text Summarization

Last Update:2018-12-07 Source: Internet

Author: User

Http://xieweifeng008.blog.163.com/blog/static/548138272009101413136674/

After decades of research, automatic text summarization mainly adopts four methods: automatic summarization based on statistics, automatic summarization based on understanding, automatic summarization based on information extraction, and automatic summarization based on structure.

4.1 Automatic summarization based on statistics

Statistics-based automatic summarization is also called automatic excerpting. It treats a text as a linear sequence of sentences, and each sentence as a linear sequence of words. It proceeds in the following steps:

(1) Raw text processing: Enter the text in a form that a computer can recognize, for example through keyboard input, handwriting input, text scanning, image recognition, or speech recognition.

(2) Word weight calculation: Compute term-frequency statistics for the "keywords" in the original text.

(3) Sentence weight calculation: Calculate each sentence's weight from the word frequencies and other information in the sentence. The criteria are: the sentence weight is proportional to the number of "keywords" the sentence contains; the weight is increased when the sentence contains cue words; the weight is increased when the sentence occupies a special position in the text; the weight is reduced when the sentence contains negative indicator words (words signaling that the sentence is unimportant); and the weight is inversely proportional to the sentence length.

(4) Abstract sentence extraction: Sort all sentences in the original text in descending order of weight. The sentences with the highest weights are selected as abstract sentences.

(5) Abstract sentence output: Output all abstract sentences in the order in which they appear in the original text.

Word weights, sentence weights, and the selection of abstract sentences are based on the following six formal features of a text:

(1) F, frequency: The significant words that indicate the topic of an article are usually medium-frequency words. The sentence weight can be calculated from the number of significant (valid) words in the sentence.

(2) T, title: The title is a phrase chosen by the author to indicate the content of the article. With the help of a stop list, remove function words and nouns with only general meanings from titles and subtitles. The remaining words are usually closely related to the content of the text and can be used as valid words.

(3) L, location: Sentences in special positions, such as the first paragraph, the last paragraph, or the first or last sentence of a paragraph, should have their weight increased.

(4) S, syntactic structure: There is a link between the structure of a sentence and its importance. For example, sentences in an abstract are mostly declarative, while questions and exclamations are not suitable as abstract sentences.

(5) C, cue words: Some words or phrases in a sentence are not keywords, but they act as prompts that tell readers the sentence contains important information, such as "significant", "important", "so", and "in summary".

(6) I, indicative phrase: Phrases that point to the theme of the text, for example "the purpose of", "the main aim of", "this article puts forward", and "we think".

These six formal features are the basis of automatic excerpting. Each indicates the topic of the article from a different perspective, but none is accurate or comprehensive on its own. The features therefore need to be combined "organically", using W = f(F, T, L, S, C, I) as the criterion for calculating sentence weights.
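The source does not say what form f takes. As a minimal sketch, assuming f is a weighted linear sum (one simple choice among many), the following combines four of the six features. The cue-word list, the numeric weights, and the function names are all hypothetical, and S and I are left out because they would need a parser and a phrase list.

```python
import re

# Hypothetical cue words and feature weights, chosen only for illustration.
CUE_WORDS = {"significant", "important", "summary", "conclusion"}
WEIGHTS = {"F": 1.0, "T": 2.0, "L": 1.5, "C": 1.0}


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sentence_weight(sentence, index, total, keywords, title_words):
    """Combine four of the six features as a weighted linear sum."""
    words = tokenize(sentence)
    f = sum(1 for w in words if w in keywords)     # F: keyword hits
    t = sum(1 for w in words if w in title_words)  # T: title-word hits
    l = 1 if index in (0, total - 1) else 0        # L: first or last sentence
    c = sum(1 for w in words if w in CUE_WORDS)    # C: cue-word hits
    return (WEIGHTS["F"] * f + WEIGHTS["T"] * t
            + WEIGHTS["L"] * l + WEIGHTS["C"] * c)
```

In practice the weights would be tuned on sample documents rather than fixed by hand.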

The statistics-based method is not restricted to any domain, is fast, and allows the summary length to be adjusted. However, it is limited to the surface information of the text, so the summary quality is poor: the content may be incomplete, the statements redundant, and the text incoherent.

At present, many automatic summarization systems use this method. On this basis, different methods are used to calculate the weights of words and sentences so that the extraction of abstract sentences is continuously optimized.
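Steps (2) to (5) of the statistical method can be sketched as follows. This is only an illustration under simplifying assumptions: the stop list is tiny, the sentence splitter is naive, and the sentence weight uses term frequency alone, divided by sentence length to reflect the inverse relation between weight and length. The names `summarize` and `split_sentences` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "are"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, num_sentences=2):
    sentences = split_sentences(text)
    tokens = [[w for w in re.findall(r"[a-z]+", s.lower())
               if w not in STOP_WORDS] for s in sentences]
    # Step (2): word weight = term frequency over the whole text.
    freq = Counter(w for sent in tokens for w in sent)
    # Step (3): sentence weight = summed word weights divided by length.
    weights = [sum(freq[w] for w in sent) / len(sent) if sent else 0.0
               for sent in tokens]
    # Step (4): pick the indices of the highest-weighted sentences.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: weights[i], reverse=True)
    chosen = sorted(ranked[:num_sentences])
    # Step (5): output the chosen sentences in their original order.
    return [sentences[i] for i in chosen]
```

Changing `num_sentences` adjusts the summary length, which is the adjustability the method is credited with.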

4.2 Automatic summarization based on understanding

Understanding-based automatic summarization is centered on artificial intelligence technology, especially natural language understanding. In addition to analyzing the syntactic structure of a text, it uses domain knowledge to analyze the semantics of the text. Through judgment and reasoning, it obtains a semantic description of the abstract and then automatically generates the abstract from that description. It proceeds in the following steps:

Text analysis is the most important part. It includes syntactic analysis, semantic analysis, and pragmatic analysis.

(1) Syntactic analysis: The dictionary and grammar rules in the knowledge base are used to analyze the input text, determine the form and meaning of each word, segment the sentences, find the syntactic relationships between words, and describe these relationships in a data structure such as a syntax tree [4].

(2) Semantic analysis: Sentences are isolated from their context and their literal meaning is analyzed. The most important method is text annotation. It marks the dependencies between words, the semantic cohesion between sentences, the semantic aggregation or transfer relationships between paragraphs, and the knowledge described in the domain knowledge base, and then converts the semantic annotations into a semantic network that the machine can "understand".

(3) Pragmatic analysis: Each word in the document is analyzed and its contribution to the full text is determined, drawing on rhetorical, syntactic, and semantic knowledge and on the discourse-structure attributes of the document.

This method uses complex natural language understanding and generation technology to grasp the meaning of the document more accurately. The resulting abstract is therefore of good quality: concise, refined, comprehensive, accurate, and readable. However, understanding-based summarization requires the computer not only to understand and generate natural language but also to represent and organize a wide range of background and domain knowledge, which is extremely difficult. As a result, this method is limited to narrow application domains.

4.3 Automatic summarization based on information extraction

The understanding-based method requires a comprehensive analysis of the article to generate detailed semantic representations, which is difficult to do for large-scale real-world text. Information extraction, by contrast, analyzes only the useful text fragments, and only to a limited depth, which greatly improves efficiency and flexibility.

Automatic summarization based on information extraction is also called template-filling automatic summarization. It is centered on the digest framework and is divided into two stages: selection and generation. It proceeds in the following steps:

Because building the digest framework relies entirely on domain knowledge, information extraction is still subject to domain restrictions. To apply it to multiple domains, a digest framework must be compiled for each domain. When processing a text, the system must first identify the topic and then call the digest framework that corresponds to it. In addition, because the abstract is generated from templates, its language is uniform and rather dull.
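The two stages can be illustrated with a toy digest framework. Everything in this sketch is an assumption made for illustration: the domain (product announcements), the slot names, the regular expressions, and the template wording are hypothetical, and a real system would use far richer extraction rules.

```python
import re

# Hypothetical digest framework for one domain (product announcements).
FRAMEWORK = {
    "slots": {
        "company": re.compile(r"([A-Z][A-Za-z]+) (?:announced|released)"),
        "product": re.compile(
            r"(?:announced|released) (?:the )?([A-Z][A-Za-z0-9]+)"),
        "date": re.compile(r"on (\d{4}-\d{2}-\d{2})"),
    },
    "template": "{company} released {product} on {date}.",
}


def select(text, framework):
    """Selection stage: fill each slot with the first matching fragment."""
    filled = {}
    for name, pattern in framework["slots"].items():
        match = pattern.search(text)
        filled[name] = match.group(1) if match else "unknown"
    return filled


def generate(filled, framework):
    """Generation stage: render the filled slots through the fixed template."""
    return framework["template"].format(**filled)
```

Because every summary passes through the same template string, the output wording never varies, which is the uniformity noted above.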

4.4 Automatic summarization based on structure

Structure-based automatic summarization treats the text as a network of associated sentences and selects the central sentences, those associated with many other sentences, to form the summary.
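A minimal sketch of the association-network idea follows. It assumes, purely for illustration, that two sentences are associated when they share at least `min_shared` words, and it ranks sentences by how many other sentences they are linked to. The function names and the threshold are hypothetical, and no stop-word filtering is applied.

```python
import re


def word_set(sentence):
    """Return the set of lowercase alphabetic words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def central_sentences(sentences, num_sentences=1, min_shared=2):
    """Rank sentences by the number of other sentences they are linked to."""
    words = [word_set(s) for s in sentences]
    degree = []
    for i, wi in enumerate(words):
        # Count the sentences that share enough words with sentence i.
        links = sum(1 for j, wj in enumerate(words)
                    if i != j and len(wi & wj) >= min_shared)
        degree.append(links)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: degree[i], reverse=True)
    # Return the most connected sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:num_sentences])]
```

Shared-word counting is only one possible association measure; any sentence similarity score could take its place.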

A text is an organic structure. Its different parts serve different functions, and the parts are related to one another in complex ways. Once the structure of the text has been analyzed clearly, its core part can be found naturally. However, linguistics has not yet studied text structure in enough depth, and very few formal rules are available, so structure-based automatic summarization is not yet mature. The methods in use include automatic summarization based on association networks, on rhetorical structure, and on pragmatic functions.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
