Notes on keyword extraction from recent studies


Source: http://blog.csdn.net/caohao2008/article/details/3144639

These notes organize previously collected content.

Requirements: first, find the papers that propose methods and summarize the typical approaches; second, if we want to use one, decide which is the most practical or easiest to implement, and which makes the most sense for research.

 

First, the article "Finding Advertising Keywords on Web Pages" describes the typical features used for keyword extraction.

Concept-based keyword extraction uses concepts and classifications to assist extraction. Classic articles include "Discovering Key Concepts in Verbose Queries" and "A Study on Automatically Extracted Keywords in Text Categorization".

Keyword extraction based on query logs. Example articles: "Using the Wisdom of the Crowds for Keyword Generation" and "Keyword Extraction for Contextual Advertisement".

Keyword extension and generation: "Keyword Generation for Search Engine Advertising Using Semantic Similarity", "Using the Wisdom of the Crowds for Keyword Generation", and "N-Keyword Based Automatic IC Query Generation".

Second, the more common features mentioned by previous researchers:

Features mentioned in "Finding Advertising Keywords on Web Pages":

1. Part of speech (POS)

2. Capitalization

3. Whether the keyword is in hypertext (anchor text)

4. Whether the keyword is in metadata

5. Whether the keyword is in the title

6. Whether the keyword is in the URL

7. TF and DF

8. Keyword location information

9. Length of the sentence and document containing the keyword

10. Length of the candidate phrase

11. Query logs

 

Features I am considering:

1. Surrounding information content: the average information content of several nearby words, or even of the whole sentence.

2. Semantic distance, using co-occurrence.

3. NE (named entities). I used them in information extraction (IE).

4. The relationship between keywords in terms of semantic distance. Is the result better when the divergence is larger, when it is smaller, or is it unaffected?

 

2.3.2.1 Lin: Linguistic features.

The linguistic information used in feature extraction includes two types of POS tags, noun (NN & NNS) and proper noun (NNP & NNPS), and one type of chunk, noun phrase (NP). The variations used in the monolithic framework (Mo) are: whether the phrase contains these POS tags; whether all the words in the phrase share the same POS tag (either noun or proper noun); and whether the whole candidate phrase is a noun phrase. For the decomposed framework (De), they are: whether the word has the POS tag; whether the word is the beginning of a noun phrase; whether the word is in a noun phrase, but not the first word; and whether the word is outside any noun phrase.
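As a concrete illustration, here is a minimal sketch of these linguistic features, assuming tokens arrive already POS-tagged and NP-chunked (e.g. by any off-the-shelf tagger). All function and field names are mine, not from the paper.

```python
# Illustrative sketch of the Lin features; names and data shapes are
# assumptions, not the paper's implementation.

NOUN_TAGS = {"NN", "NNS"}        # common noun tags (Penn Treebank)
PROPER_TAGS = {"NNP", "NNPS"}    # proper noun tags

def mo_linguistic_features(tagged_phrase, chunk_is_np):
    """Monolithic (Mo) features for a candidate phrase.

    tagged_phrase: list of (word, pos) pairs
    chunk_is_np:   True if the whole phrase was chunked as a noun phrase
    """
    tags = [pos for _, pos in tagged_phrase]
    return {
        "contains_noun": any(t in NOUN_TAGS for t in tags),
        "contains_proper": any(t in PROPER_TAGS for t in tags),
        "all_nouns": all(t in NOUN_TAGS for t in tags),
        "all_proper": all(t in PROPER_TAGS for t in tags),
        "is_noun_phrase": chunk_is_np,
    }

def de_linguistic_features(pos, chunk_iob):
    """Decomposed (De) features for a single word.

    pos:       the word's POS tag
    chunk_iob: 'B-NP', 'I-NP', or 'O' (standard IOB chunk labels)
    """
    return {
        "is_noun": pos in NOUN_TAGS,
        "is_proper": pos in PROPER_TAGS,
        "np_begin": chunk_iob == "B-NP",
        "np_inside": chunk_iob == "I-NP",
        "np_outside": chunk_iob == "O",
    }
```

For the phrase "digital camera" tagged as JJ NN and chunked as an NP, Mo would fire `contains_noun` and `is_noun_phrase` but not `all_nouns`.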

2.3.2.2 C: Capitalization.

Whether a word is capitalized is an indication of its being part of a proper noun, or an important word. This set of features for Mo is defined as: whether all the words in the candidate phrase are capitalized; whether the first word of the candidate phrase is capitalized; and whether the candidate phrase has a capitalized word. For De, it is simply whether the word is capitalized.
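These boolean features are straightforward to compute; a minimal sketch (function names are mine):

```python
# Illustrative sketch of the C (capitalization) features.

def mo_capitalization_features(phrase):
    """Mo features for a candidate phrase given as a list of words."""
    caps = [w[:1].isupper() for w in phrase]
    return {
        "all_capitalized": all(caps),
        "first_capitalized": caps[0] if caps else False,
        "any_capitalized": any(caps),
    }

def de_capitalization_feature(word):
    """De feature: whether the single word is capitalized."""
    return word[:1].isupper()
```

For the phrase "New York hotels", only `all_capitalized` is false; for "iPhone", the De feature is false because the first character is lowercase.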

2.3.2.3 H: Hypertext.

Whether a candidate phrase or word is part of the anchor text of a hypertext link is extracted as the following features. For Mo, they are: whether the whole candidate phrase matches exactly the anchor text of a link; whether all the words of the candidate phrase are in the same anchor text; and whether any word of the candidate phrase belongs to the anchor text of a link. For De, they are: whether the word is the beginning of the anchor text; whether the word is in the anchor text of a link, but not the first word; and whether the word is outside any anchor text.

2.3.2.4 MS: Meta section features.

The header of an HTML document may provide additional information embedded in meta tags. Although text in this region is usually not seen by readers, whether a candidate appears in this meta section seems important. For Mo, the feature is whether the whole candidate phrase is in the meta section. For De, the features are: whether the word is the first word in a meta tag; and whether the word occurs somewhere in a meta tag, but not as the first word.

2.3.2.5 T: Title.

The only human-readable text in the HTML header is the title, which is usually put in the window caption by the browser. For Mo, the feature is whether the whole candidate phrase is in the title. For De, the features are: whether the word is the beginning of the title; and whether the word is in the title, but not the first word.

2.3.2.6 M: Meta features.

In addition to the title, several meta tags are potentially related to keywords, and are used to derive features. In the Mo framework, the features are: whether the whole candidate phrase is in the meta-description; whether the whole candidate phrase is in the meta-keywords; and whether the whole candidate phrase is in the meta-title. For De, the features are: whether the word is the beginning of the meta-description; whether the word is in the meta-description, but not the first word; whether the word is the beginning of the meta-keywords; whether the word is in the meta-keywords, but not the first word; whether the word is the beginning of the meta-title; and whether the word is in the meta-title, but not the first word.

2.3.2.7 U: URL.

A web document has one additional highly useful property: the name of the document, i.e. its URL. For Mo, the features are: whether the whole candidate phrase is part of the URL string; and whether any word of the candidate phrase is in the URL string. In the De framework, the feature is whether the word is in the URL string.

2.3.2.8 IR: Information-retrieval-oriented features.

We consider the TF (term frequency) and DF (document frequency) values of the candidate as real-valued features. The document frequency is derived by counting how many documents in our web page collection contain the given term. In addition to the original TF and DF frequency numbers, log(TF + 1) and log(DF + 1) are also used as features. The features used in the monolithic and the decomposed frameworks are basically the same, except that for De the "term" is the candidate word.
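A minimal sketch of these four IR features, using a toy in-memory collection in place of the paper's web page collection (function and variable names are mine):

```python
import math

# Illustrative sketch of the IR features: TF within a page, DF across a
# collection, plus their log-scaled variants.
def ir_features(term, page_tokens, collection):
    """page_tokens: list of tokens on the page;
    collection: iterable of per-document token sets."""
    tf = page_tokens.count(term)
    df = sum(1 for doc in collection if term in doc)
    return {
        "tf": tf,
        "df": df,
        "log_tf": math.log(tf + 1),
        "log_df": math.log(df + 1),
    }
```

The log variants compress the heavy-tailed frequency distribution so that very common terms do not dominate the learner.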

2.3.2.9 Loc: Relative location of the candidate.

The beginning of a document often contains an introduction or summary with important words and phrases. Therefore, the location of the occurrence of the word or phrase in the document is also extracted as a feature. Since the length of a document or a sentence varies considerably, we take only the ratio of the location instead of the absolute number. For example, if a word appears in the 10th position and the whole document contains 200 words, the ratio is 0.05. These features are the same for the monolithic and decomposed frameworks. When the candidate is a phrase, its first word is used as its location. There are three different relative locations used as features: wordratio, the relative location of the candidate in the sentence; sentratio, the location of the sentence containing the candidate divided by the total number of sentences in the document; and worddocratio, the relative location of the candidate in the document. In addition to these three real-valued features, we also use their logarithms: log(1 + wordratio), log(1 + sentratio), and log(1 + worddocratio).
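The three ratios and their log copies can be sketched as follows (the 0-based position convention is my assumption):

```python
import math

# Illustrative sketch of the Loc features; positions are 0-based indices.
def location_features(word_idx_in_sent, sent_len,
                      sent_idx, n_sents,
                      word_idx_in_doc, doc_len):
    feats = {
        "wordratio": word_idx_in_sent / sent_len,
        "sentratio": sent_idx / n_sents,
        "worddocratio": word_idx_in_doc / doc_len,
    }
    # each ratio also gets a log(1 + x) variant, as described above
    for name in list(feats):
        feats["log_" + name] = math.log(1 + feats[name])
    return feats
```

For the example in the text, a word at position 10 in a 200-word document yields worddocratio = 0.05.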

2.3.2.10 Len: Sentence and document length.

The length (in words) of the sentence where the candidate occurs (sentlen), and the length of the whole document (doclen; words in the header are not counted), are used as features. Similarly, log(1 + sentlen) and log(1 + doclen) are also included.

2.3.2.11 Phlen: Length of the candidate phrase.

For the monolithic framework, the length of the candidate phrase in words (phlen) and log(1 + phlen) are encoded as features. These features are not used in the decomposed framework.

2.3.2.12 Q: Query log.

The query log of a search engine reflects the distribution of the keywords people are most interested in. We use this information to create the following features. For these experiments, unless otherwise mentioned, we used a log file with the most frequent 7.5 million queries. For the monolithic framework, we consider one binary feature, whether the phrase appears in the query log, and two real-valued features, the frequency with which it appears and its log value, log(1 + frequency). For the decomposed framework, we consider more variations of this information: whether the word appears in the query log file as the first word of a query; whether the word appears as an interior word of a query; and whether the word appears as the last word of a query. The frequency values of these features and their log values (log(1 + F), where F is the corresponding frequency value) are also used as real-valued features. Finally, whether the word never appears in any query log entry is also a feature.
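A sketch of the Mo and De query-log features over a toy frequency table; a real version would scan a file of the most frequent queries (the paper used 7.5 million), and the function names are mine:

```python
import math

# Illustrative sketch of the Q features; query_freq maps query -> count.
def mo_query_features(phrase, query_freq):
    freq = query_freq.get(phrase, 0)
    return {"in_log": freq > 0,
            "freq": freq,
            "log_freq": math.log(1 + freq)}

def de_query_features(word, query_freq):
    first = interior = last = 0
    for query, count in query_freq.items():
        words = query.split()
        if word not in words:
            continue
        if words[0] == word:        # first word of a query
            first += count
        if words[-1] == word:       # last word of a query
            last += count
        if word in words[1:-1]:     # interior word of a query
            interior += count
    return {
        "first_freq": first,
        "interior_freq": interior,
        "last_freq": last,
        "never_in_log": first == interior == last == 0,
    }
```

The `never_in_log` bit is the final negative feature described above; the log(1 + F) variants of each frequency follow the same pattern as in the Mo case.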

 

Content organized after the discussion with Shijie.

Background and applications:

Content-based advertising keyword recommendation, as in online advertising systems such as Google, Yahoo, and eBay

Q&A systems

Keyword substitution and extension

Streamlining, adjusting, and reordering redundant queries

Assisting classification

Assisting topic tracking

 

Feature selection:

1. Linguistic features: use POS (part-of-speech) tags to mark parts of speech, such as nouns, verbs, adverbs, and adjectives.

2. Title: whether the keyword appears in the title of the document.

3. Position: the position of the keyword in the document, e.g. whether it appears in the first or last sentence of a paragraph, or in the last sentence of the entire article. This method is described in detail in "Automatic Keyword Extraction Using Linguistic Features".

4. TF and IDF: the most basic term-weighting features.

5. NE: whether the keyword is a named entity, such as a person or place name, or date information such as year, month, day, and time.

6. Relationship between keywords: the semantic distance between keywords. Is bigger better, is smaller better, or is it irrelevant?

7. Surrounding-word information content: is the information content of the words near the keyword high? Or how high is the information content of the sentence containing the keyword relative to the entire article?

8. Whether this keyword appears among other keywords: its probability of appearing as a keyword.

9. Document category: refer to classification-based keyword extraction and concept-based keyword extraction.

10. Whether the word appears in a summarizing sentence.
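Item 7 above is vague as stated; one way to make it concrete is to score a position by the average self-information, -log2 p(w), of the words in a window around it, with unigram probabilities estimated from the document itself. This is my reading of the note, not a method from any of the cited papers:

```python
import math
from collections import Counter

# Illustrative sketch of "surrounding information content": average
# self-information of the words in a window around position idx,
# using document-internal unigram probabilities. All choices here
# (window size, in-document estimation) are assumptions.
def surrounding_info_content(tokens, idx, window=2):
    total = len(tokens)
    probs = {w: c / total for w, c in Counter(tokens).items()}
    lo, hi = max(0, idx - window), min(total, idx + window + 1)
    context = [tokens[i] for i in range(lo, hi) if i != idx]
    if not context:
        return 0.0
    return sum(-math.log2(probs[w]) for w in context) / len(context)
```

Rare neighbors push the score up, common neighbors push it down, which matches the intuition that keywords tend to sit in information-dense regions.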

 

About NE:

1. Used in the paper "News-Oriented Automatic Chinese Keyword Indexing".

2. The information content of NEs is very high.

3. The discriminative power of NEs is very high.

 

Important issues:

1. How to define a keyword: is it the term with the largest discriminative power or the largest information content?

2. The impact of word segmentation: TF granularity issues. See the maximal duplicated string approach in "Chinese Keyword Extraction Based on Max-Duplicated Strings of the Documents".

 

"News-Oriented Automatic Chinese Keyword Indexing" is a very classic article on Chinese keyword extraction. It proposes computing character-string frequencies before word segmentation, which avoids the problems caused by inaccurate segmentation and segmentation granularity. It also mentions methods for filtering keywords: tag the strings with POS labels, and then filter out the parts of speech with low information content, for example conjunctions and adverbs.
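The segmentation-free idea can be sketched as counting character n-gram frequencies and keeping the recurring strings that are not covered by an equally frequent longer string. The thresholds and the crude maximal-string test below are simplifications of mine, not the paper's algorithm:

```python
from collections import Counter

# Illustrative sketch: frequent maximal character strings as keyword
# candidates, computed before (and independent of) word segmentation.
def frequent_char_ngrams(text, n_min=2, n_max=4, min_count=2):
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    # drop a string if an equally frequent longer string covers it
    # (a crude "maximal duplicated string" test)
    return {
        s: c for s, c in frequent.items()
        if not any(s != t and s in t and frequent[t] == c for t in frequent)
    }
```

On Chinese news text, this surfaces repeated multi-character strings (often names and topical terms) as candidates even when a segmenter would split them inconsistently.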

For more on how to select the most effective features with feature selection algorithms, see "Multi-Subset Selection for Keyword Extraction and Other Prototype Search Tasks".
