A Brief Discussion of SEO Data Analysis III: Maintaining a Thesaurus

Source: Internet
Author: User


Objective

It has been a long time since my last concrete article on SEO data analysis. Today a friend asked me online how to maintain a thesaurus once you have one, so I will take this opportunity to talk about it. After collecting a large number of keywords, the first step is to process them. From my own work, I have summed up the following tasks that I have either done or feel need to be done.

Entity extraction (put simply, finding the key words inside a keyword)

Deduplication

Controlled thesaurus

Classification

Extract entity

The idea of entity extraction is to find the key words inside a keyword. Take "Beijing hot springs where good" (roughly, "where are the good hot springs in Beijing"): "Beijing" and "hot springs" are the focus, while "where good" is only the question and contributes relatively little to describing the topic. So we need some technical means to process keywords and pull out the important words (entities) inside them.

Look at the following keywords first.

You can see the difference between the two. There are many ways to implement this. From an SEO point of view, our requirements for accuracy and recall are generally fairly low. Getting from 0% to 80% probably takes less thought than getting from 80% to 100%. Different industries will also call for slightly different approaches. So I use the following two methods.

1. Delete words and symbols according to part of speech (do not worry about deleting too much)

2. Filter high-frequency words according to TF (term frequency; look it up if it is unfamiliar)

As for word segmentation, academics have studied many Chinese word segmentation algorithms, but in actual use the differences between them are very small. Here are a few casual recommendations; choose according to the language you use.

ICTCLAS http://ictclas.nlpir.org/downloads Language: Java, C#

CRF++ http://crfpp.sourceforge.net/ Language: C++

SCWS http://www.xunsearch.com/scws/ Language: PHP

Jieba https://pypi.python.org/pypi/jieba/ Language: Python

Word segmentation is a field of study in its own right. If you are interested, look into the logic of models such as CRF and HMM. I will not go into that here.

The key requirements are speed and support for a custom dictionary. I use Jieba, which supports both. See the description in the author's GitHub README: https://github.com/fxsjy/jieba/blob/master/README.md

After segmentation, we filter by part of speech and exclude "stop words", which gives us the result set we want.
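The part-of-speech step can be sketched roughly as follows. This is a minimal illustration and not the author's actual code: the stop-word list, the tag prefixes, and the function name `extract_entities` are all assumptions, and the tokens are hand-written (word, tag) pairs in English standing in for real segmenter output. In practice the pairs might come from a part-of-speech tagger such as Jieba's `jieba.posseg` module.

```python
# Hypothetical stop words and part-of-speech tag prefixes to drop.
# The tags follow an ICTCLAS-style convention (an assumption):
# x = punctuation, u = particle, p = preposition, c = conjunction,
# r = pronoun, y = modal particle.
STOP_WORDS = {"where", "good", "how", "much"}
DROP_POS_PREFIXES = ("x", "u", "p", "c", "r", "y")


def extract_entities(tagged_tokens):
    """Keep tokens that are neither stop words nor of a dropped part of speech."""
    entities = []
    for word, pos in tagged_tokens:
        if word in STOP_WORDS:
            continue
        if pos.startswith(DROP_POS_PREFIXES):
            continue
        entities.append(word)
    return entities


# Hand-written (word, tag) pairs standing in for segmenter output.
tagged = [("Beijing", "ns"), ("hot spring", "n"), ("where", "r"), ("good", "a")]
print(extract_entities(tagged))
```

Deleting aggressively is acceptable here, as noted above: a word that is wrongly dropped can be restored later by removing it from the stop-word list.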

Next, filter high-frequency words. Jieba can compute TF values across the whole text. Some high-TF words are core terms and must not be removed, so the list needs review.

Take the words with high TF values from the segmentation results and review them manually. Using our travel-industry thesaurus as an example, place names are often among these words: their TF values may be very high, but they cannot be removed. So we need to prepare a library of Chinese place names and scenic spots. You can search for one online, and if you are lazy you can use the Sogou input method word libraries directly.

The remaining high-frequency words may be things like "July", "August", "Daquan" (complete collection), "line" (route), and so on. These words can be considered for removal from the entity words.
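The TF review described above might look like the sketch below. It assumes keywords are already segmented into token lists; the names `high_frequency_candidates`, `strip_noise`, and the sample data are hypothetical, and the protected set stands in for a place-name library.

```python
from collections import Counter


def high_frequency_candidates(segmented_keywords, protected, top_n=5):
    """Return the most frequent tokens that are not protected, for manual review."""
    tf = Counter(token for tokens in segmented_keywords for token in tokens)
    ranked = [(token, count) for token, count in tf.most_common()
              if token not in protected]
    return ranked[:top_n]


def strip_noise(tokens, noise):
    """Drop tokens that a reviewer has marked as noise."""
    return [token for token in tokens if token not in noise]


# Hypothetical segmented keywords (English renderings for readability).
keywords = [
    ["Hainan", "travel", "July"],
    ["Hainan", "travel", "line"],
    ["Sanya", "travel", "July"],
    ["Hainan", "hotel", "Daquan"],
]
protected = {"Hainan", "Sanya"}  # stand-in for a place-name library

candidates = high_frequency_candidates(keywords, protected, top_n=3)
print(candidates)

# After manual review, a person decides which candidates are really noise.
noise = {"July", "line", "Daquan"}
print([strip_noise(tokens, noise) for tokens in keywords])
```

The function only proposes candidates. In this sample "travel" ranks first, yet a reviewer would keep it as an entity, which is why the final decision stays manual.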

After a few rounds of review like this, the result is basically good enough; accuracy can be studied further later. Someone is bound to ask: after all this effort, what is it for?

1. Content Association

2. Automatic tagging

3. Improve the accuracy of the site search

These uses can be seen in the SEO work of SouFun (literally "search house") from last year to this year.

Deduplication

Once entities have been extracted, keywords can be deduplicated.

For example:

Hainan travel how much money

How much does Hainan travel cost

After processing:

Hainan | tourism

Hainan | tourism

These can now be deduplicated. Two keywords like these are resolved by comparing their extracted entities. But there are some keywords, such as "Maldives" and "Madai" (a Chinese short form for the Maldives), or "Great Wall" and "Badaling", where users may be referring to the same place, and we need to handle those as well. For that we need a "controlled thesaurus".
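Deduplicating by extracted entities can be sketched as below. The function name `dedupe_by_entities` and the sample pairs are assumptions for illustration; the entity lists are taken as given, as if an earlier extraction step had produced them.

```python
def dedupe_by_entities(keyword_entities):
    """Group keywords by their entity key; each group holds duplicates of one topic."""
    groups = {}
    for keyword, entities in keyword_entities:
        key = "|".join(entities)  # order-sensitive key, e.g. "Hainan|tourism"
        groups.setdefault(key, []).append(keyword)
    return groups


# Hypothetical (keyword, entities) pairs.
pairs = [
    ("Hainan travel how much money", ["Hainan", "tourism"]),
    ("How much does Hainan travel cost", ["Hainan", "tourism"]),
    ("Sanya hotel", ["Sanya", "hotel"]),
]

groups = dedupe_by_entities(pairs)
for key, members in groups.items():
    print(key, "->", members)
```

Keeping the whole group, rather than discarding duplicates outright, preserves the original phrasings in case they are needed later. If word order should not matter, sorting the entities before joining is one option.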

Controlled thesaurus

A controlled thesaurus is a way of controlling the meaning of words and tracking their related words. Going back to the example above, if a search for "Badaling" cannot surface content about the Great Wall, I believe users will leave quickly.

The controlled thesaurus has three major relationships: equivalence, hierarchy, and association.

Equivalence is easy to understand. "Maldives" and "Madai" are equivalent: they can be said to mean the same thing, and this relationship carries the highest weight. Equivalent terms must be presented in content recommendations.

Hierarchy means broader and narrower terms. For example, "Confucius Temple" is a narrower term of "Nanjing Attractions Encyclopedia", and "Dacheng Hall" is a narrower term of "Confucius Temple". In practice, when a user looks for "Dacheng Hall", the site can tell the user that it is located inside the "Confucius Temple" and recommend fun things around the Confucius Temple, which users will certainly like. The hierarchical relationship is also the information architecture of most websites: from the homepage, to the directory, to the column.

Association is a bit like equivalence but not exactly the same. Consider "Sanya group tour", "Haikou self-guided tour", and "Hainan round-trip flight tour". There is no clear hierarchy among them, yet they are not exactly the same either. We can treat words like these as related and record the relationship. In addition, some attributes that content carries by itself, such as "tall, rich, and handsome", can be used as related keywords so that content recommendations better match the user's taste.

One more thing worth mentioning: in our work we found that users sometimes express their needs with special words, such as "Sack" (a homophone of "Madai" in Chinese), or with damned input-method misspellings such as "hundred tore not to ride elder sister". These keywords need to be stored as well.
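One possible way to store the three relationships is sketched below. The class names `Term` and `ControlledThesaurus`, the field names, and the sample entries are all hypothetical; this is only one representation among many, not the structure the author used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Term:
    name: str
    broader: Optional[str] = None                        # hierarchy: parent term
    equivalents: Set[str] = field(default_factory=set)   # aliases, nicknames, typos
    related: Set[str] = field(default_factory=set)       # association


class ControlledThesaurus:
    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}
        self.alias_to_name: Dict[str, str] = {}

    def add(self, term: Term) -> None:
        self.terms[term.name] = term
        self.alias_to_name[term.name] = term.name
        for alias in term.equivalents:
            self.alias_to_name[alias] = term.name

    def resolve(self, word: str) -> Optional[Term]:
        """Map a word or any of its equivalents to the preferred term."""
        name = self.alias_to_name.get(word)
        return self.terms.get(name) if name else None

    def ancestors(self, word: str) -> List[str]:
        """Walk up the hierarchy from a term to its broadest ancestor."""
        chain: List[str] = []
        term = self.resolve(word)
        while term and term.broader and term.broader not in chain:
            chain.append(term.broader)
            term = self.terms.get(term.broader)
        return chain


thesaurus = ControlledThesaurus()
thesaurus.add(Term("Maldives", equivalents={"Madai", "Sack"}))
thesaurus.add(Term("Nanjing Attractions Encyclopedia"))
thesaurus.add(Term("Confucius Temple", broader="Nanjing Attractions Encyclopedia"))
thesaurus.add(Term("Dacheng Hall", broader="Confucius Temple"))
thesaurus.add(Term("Sanya group tour", related={"Haikou self-guided tour"}))

print(thesaurus.resolve("Sack").name)
print(thesaurus.ancestors("Dacheng Hall"))
```

Storing misspellings and nicknames as equivalents means a search for "Sack" resolves to the same term as "Maldives", which is the behavior the equivalence relationship is meant to give.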

The final result should be this:

  

Classification

How do you classify the keywords you have collected? First, you can classify them by intent: navigational, informational, and transactional. (Learning material on Zhihu: http://www.zhihu.com/question/20905145)

The advantage of doing this is that you can quickly tell which type of word to assign to which product line. Informational words, for example, should go as far as possible to channels such as news, Q&A, and the product library. Navigational words can be handled directly if they are your own brand; if they are a competitor's brand, you may set up a separate channel for them. Transactional words generally belong on the main product line, and the page should offer the matching function, such as "Add to Cart", "Download Link", or "Online Booking". This meets user needs to a reasonable degree and avoids mismatched content. For example, take this page for "iphone6 tieba": http://iphone.tgbus.com/tag/iphone6tieba/. Where is the Tieba (forum)? At the very least the page should give a link.
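The routing described above can be sketched with a few simple rules. Everything here is an assumption for illustration: the brand names are placeholders, the cue words are invented, and real intent classification would need far richer signals than token matching.

```python
# Placeholder brand names and cue words; all hypothetical.
OWN_BRANDS = {"ourbrand"}
COMPETITOR_BRANDS = {"rivalbrand"}
TRANSACTION_CUES = {"booking", "buy", "download", "price"}


def classify_intent(tokens):
    """Very rough rule-based intent guess from segmented keyword tokens."""
    lowered = {token.lower() for token in tokens}
    if lowered & (OWN_BRANDS | COMPETITOR_BRANDS):
        return "navigational"
    if lowered & TRANSACTION_CUES:
        return "transactional"
    return "informational"


def route(tokens):
    """Map a keyword to the kind of channel suggested in the text."""
    intent = classify_intent(tokens)
    lowered = {token.lower() for token in tokens}
    if intent == "navigational":
        if lowered & OWN_BRANDS:
            return "own brand pages"
        return "separate competitor channel"
    if intent == "transactional":
        return "main product line"
    return "news / Q&A / product library"


print(route(["Maldives", "booking"]))
print(route(["RivalBrand", "Maldives"]))
print(route(["Maldives", "climate"]))
```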

Besides classifying by intent, the following approach borrows classification methods from information architecture.

First, a manual method for sorting keywords: card sorting. Having tried it, I find it really is a kind of brainstorming. We extracted 500 keywords from the "Maldives" thesaurus and randomly assigned them to 5 groups. Each group sorted the keywords in hand into piles and named the piles themselves. We then pooled the names from all 5 groups, which identified about 10 small categories and surfaced some things we had not thought of before.

The final result was roughly:

  

With the classification in hand, we can organize the site structure in a more targeted way. For a concrete example, see the left-hand categories on maldives.tuniu.com. In actual operation we also applied some filtering and control over hierarchy; for example, money, language, and climate can all be grouped under the introduction. We also rarely build external links for these pages. The Maldives is just one of countless destinations, so it is impossible to have a lot of people watching this column. Instead we concentrate on producing the content users like most, and the channel performs quite well.

At this point it is very clear how to build content: pick keywords directly from the thesaurus and write content for them. That always performs better than writing worthless articles such as "Maldives quote" or "Maldives travel quote" to chase so-called core words.

We only sorted 500 keywords, but the Maldives thesaurus has tens of thousands of words to classify, and new keywords keep arriving over time. Machine learning can be used here. I am still studying this myself and am afraid of embarrassing myself by writing about it, so just briefly: use a decision tree, with the card-sorted keywords as training data and the metadata contained in the controlled thesaurus as features, and generate a decision tree to classify keywords automatically.
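To make the decision-tree idea concrete, here is a toy ID3-style tree over categorical features, written from scratch so it runs without extra libraries. It is a sketch under stated assumptions, not the author's implementation: the feature names (`broader`, `has_price_word`), the category labels, and the training rows are all invented, and a real project would more likely use an existing machine learning library.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_feature(rows, labels, features):
    """Pick the feature whose split leaves the lowest weighted entropy."""
    def split_entropy(feature):
        buckets = {}
        for row, label in zip(rows, labels):
            buckets.setdefault(row[feature], []).append(label)
        return sum(len(b) / len(labels) * entropy(b) for b in buckets.values())
    return min(features, key=split_entropy)


def build_tree(rows, labels, features):
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority  # leaf: a category name
    feature = best_feature(rows, labels, features)
    remaining = [f for f in features if f != feature]
    branches = {}
    for value in {row[feature] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[feature] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return {"feature": feature, "branches": branches, "default": majority}


def predict(tree, row):
    # Unseen feature values fall back to the majority label at that node.
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], tree["default"])
    return tree


# Hypothetical card-sorted keywords, described by thesaurus metadata.
rows = [
    {"broader": "climate", "has_price_word": False},
    {"broader": "currency", "has_price_word": False},
    {"broader": "island", "has_price_word": False},
    {"broader": "island", "has_price_word": True},
    {"broader": "hotel", "has_price_word": True},
    {"broader": "hotel", "has_price_word": False},
]
labels = ["introduction", "introduction", "islands", "quotes", "quotes", "hotels"]

tree = build_tree(rows, labels, ["broader", "has_price_word"])
print(predict(tree, {"broader": "island", "has_price_word": True}))
print(predict(tree, {"broader": "hotel", "has_price_word": False}))
```

With only six rows this is purely illustrative. The point is the shape of the pipeline: card sorting supplies the labels, and the controlled thesaurus supplies the features.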

Summary

1. The algorithm is not the problem; the key is a thesaurus suited to your own industry. As for building the thesaurus, there are many methods. See the keyword mining section of another article of mine: http://www.imyexi.com/?p=708

2. The relationships between words are a powerful tool for content recommendation and content operations, and they also improve the user experience. I have to grumble here: user experience needs technology behind it, not shouted slogans.

3. I had wanted to write about mining points of interest, but then realized that every word in the thesaurus is itself a point of interest. As long as thesaurus updates are kept under control, points of interest are not a problem.

4. Beyond ideas, execution also matters. The thesaurus is part of the foundation of a site's content. Get it right and you avoid a lot of duplicated and wasted work later. (I have been deep in that pit.)

5. I have no formal training in this field. Many of the technical terms and definitions here are based on my own study and understanding. If there are errors, please correct me.

Extended reading: A Brief Discussion of SEO Data Analysis I: Opening & Indexing; A Brief Discussion of SEO Data Analysis: How to Improve Site Indexing
