How does a search engine judge a valuable article?

Source: Internet
Author: User
Tags: hash, inheritance, keywords, md5, md5 hash, range, reference, square root

Many people have asked the author (Mr.zhao): how does Baidu tell pseudo-original content from original content? What kind of articles does Baidu like? What kind of articles are more likely to rank for long-tail keywords? And so on. Faced with these questions, I often do not know how to answer. If I give a general answer, such as "pay attention to the user experience", the questioner feels brushed off and complains that it is too vague. But I cannot give specifics either; after all, I am not Baidu, so how could I lay the exact algorithm out for you?

For this reason I started writing this "If it were me" series of articles. In this series I imagine that, if it were me racking my brains to provide a better search service for users, what would I do: how would I handle article content, how would I treat external links, how would I treat site structure, and so on. Of course my skills are limited and I can only write about the little that I understand. Baidu and the other commercial search engines have plenty of people far more talented than me, and their algorithms and ways of handling these problems are certainly far more refined than mine. Whatever I write here, I only hope it leaves you with a rough picture in mind. After you have been on the SEO road for a while, no one can really be anyone's teacher; these ideas are for reference only.

Important statement: I solemnly declare that none of the ideas, algorithms and procedures involved in this series were written by me; all of them were collected from public information. At the same time, I believe you can appreciate that if these free, public materials can already go this far, the real trade secrets go further still.

OK, here we go.

If it were me, what kind of articles would I like? I would like the articles my users like. If I had to force that into criteria, there are two kinds: 1. original and liked by users; 2. not original but liked by users. My attitude here is very clear: pseudo-original is not original. So what kind of articles do users like? Clearly, new ideas and new knowledge are usually what users like, which means original articles are usually liked by users; and even when users do not like them, an original site, as a producer of fresh content, still deserves a degree of protection. Does that mean users never like non-original articles? Some sites carefully compile and aggregate existing content; such sites are valuable to users, and their articles should get better rankings accordingly.

This shows that I need to pay attention to two types of articles: first, original articles; second, articles on sites that aggregate information in a valuable way.

The first thing to make clear is that the scope of this article is limited to content pages, not topic pages, list pages or the home page.

Before I can screen for these two types of articles, I need to collect the information first. This article will not go into the spider program. Once the spider has downloaded a page, the content-processing module must first remove the noise from the content.

Content de-noising is not, as many people mistakenly think, merely stripping out the code. For me it means removing every part of the page text that is not body content: the navigation bar, the footer text, the lists of other articles, and so on. After removing them I am left with a passage of text containing only the body content of the page. Webmasters who have written scraping rules know this is not difficult. But a search engine is, after all, a program; it cannot write a scraping rule for every individual site, so I need to build a general noise-removal algorithm.

Before that, let us be clear about the goal. The figure above shows that content 1 is what users need most, content 2 is what users may also be interested in, and the rest is invalid noise. From this, we can observe the following features:
1. The lists of related articles each sit in their own information block. These blocks are made up mostly of <a> tags; even the text that falls outside the tags is basically fixed, and it is heavily duplicated across the site's pages, so it is relatively easy to identify.

2. Content 2 is generally adjacent to content 1, and the anchor text of the links in content 2 is related to content 1.

3. Content 1 is a mixture of plain text and <a> tags, and in general its text is unique within the collection of crawled pages.

So I use the well-known tag-tree method to decompose the content page. Looking at the tag layout, a web page delivers its content through a number of information blocks, and these blocks are laid out with specific tags; the common ones are <div>, <ul>, <li>, <p>, <table>, <tr> and <td>. Following these tags, we can abstract the page into a tree structure.

The picture above is a simple tag tree I sketched by hand; with it I can easily identify the various information blocks. I then set a content threshold a. The content ratio is the number of characters of text in an information block divided by the number of tags appearing in it. When a block's content ratio is greater than a, I list it as a valid content block (this weeds out blocks stuffed with internal links, because an article buried in internal links hurts the user experience). I then compare the text of each content block, and when it is unique, I collect that block (or those blocks) as the "content 1" I need.

So what do I do about content 2? Before explaining how, let me explain what content 2 means. As I said earlier, a site that focuses on its users carefully classifies and correlates existing Internet content so that users can read better and more efficiently. For such a site, even if its articles are not original but excerpted from the Internet, I will still give it plenty of attention and ranking, because well-aggregated content often meets users' needs better. So for aggregation sites I can make a rough judgment from "content 2": in short, a good aggregation site's content pages must contain content 2, and content 2 must occupy a significant share of them.

Identifying content 2 is simple. Every information block whose content ratio falls below a certain value is judged to be a link module. From content 1 I extract a topic B by a method described later in this article. I then segment the anchor text of all the <a> tags in a link module, and if the anchor text matches topic B, that link module is judged to be content 2. Next I set a link threshold C: the number of <a> tags in content 2 divided by the number of <a> tags in all link modules. If this exceeds C, the site may be an aggregation site, and its content rankings will be computed with an aggregation-site-specific algorithm.
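To make the content-ratio idea concrete, here is a minimal sketch under my own assumptions: the threshold value, the tag counting and the sample blocks are invented for illustration and are not the actual algorithm.

```python
import re

# Classify an information block as "content" or "link module" by the ratio of
# visible text characters to the number of tags it contains.
TAG_RE = re.compile(r"<[^>]+>")
CONTENT_THRESHOLD_A = 5.0   # hypothetical value of the threshold "a"

def text_to_tag_ratio(block_html: str) -> float:
    """Visible characters divided by tag count for one information block."""
    tags = TAG_RE.findall(block_html)
    text = TAG_RE.sub("", block_html)
    visible = len("".join(text.split()))   # ignore whitespace between words
    return visible / max(len(tags), 1)

def classify_block(block_html: str) -> str:
    return "content" if text_to_tag_ratio(block_html) >= CONTENT_THRESHOLD_A else "link module"

body = '<div><p>A long paragraph of body text with an inline <a href="#">link</a>...</p></div>'
nav = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li><li><a href="/c">C</a></li></ul>'
print(classify_block(body))   # -> content
print(classify_block(nav))    # -> link module
```

A real crawler would walk the tag tree first and compute the ratio per block; the point of the sketch is only that body text and navigation lists separate cleanly on this one number.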
Expanded reading 1 (start): I believe many SEO practitioners, when they first entered the field, heard that the outbound links on a content page should be relevant, that related reading should appear below the article to draw users into deeper clicks, and that internal links should be moderate rather than excessive. But few people explain why, and more and more people neglect these details because they do not understand the reasoning behind them. Admittedly, the lack of attention that some earlier search-engine algorithms paid to content also fuelled this. But if I look at it from a "conspiracy" point of view, I can assume the truth is this: a search results page shows only 10 results on the first page, and excluding my own products usually only about 7; the average user clicks no further than page 3, so in practice fewer than 30 quality sites are enough to satisfy most users for a given query. Then, over 3-5 years of positioning, I gradually screen out the sites that endure the loneliness and work seriously on the details; at that point I adjust this part of the algorithm, filter those quality sites out, and push them to users. Of course there are more reference factors in the process, such as domain age, the amount of JS, site speed and so on. (Expanded reading 1 end)

Expanded reading 2 (start): Why does having many identical articles inside your own site quickly bring search-engine penalties? I am not talking here about excerpting versus originality, but about your own articles duplicating one another within your own site. The reason the search engine reacts so quickly and punishes so harshly is that, at root, it cannot extract content 1 from your articles. (Expanded reading 2 end)

Good. After this series of steps I have obtained content 1 and content 2; next comes the originality-recognition algorithm. Broadly speaking, today's search engines recognise originality by combining keyword matching with a vector space model. That is what Google does, and there is a corresponding article on its official blog. Here I will give a plain-language introduction and try to keep it easy to understand.

By analysing content 1, I obtain its highest-weighted keywords. Ranking them by weight and taking the top n, I call the keyword set K, so K = {k1, k2, ..., kn}. Each keyword then corresponds to the weight eigenvalue it obtains on the page: if k1's weight eigenvalue is t1, the set of eigenvalues for the top n keywords is T = {t1, t2, ..., tn}. With these features we can compute the corresponding feature vector W = {w1, w2, ..., wn}. Finally I concatenate K into a string Z, and MD5(Z) denotes the MD5 hash of Z.

Now suppose the two pages being compared are i and j. I apply two tests:

1. When MD5(Zi) = MD5(Zj), pages i and j are exactly identical and are judged to be a reprint.

2. Set a specific value α with 0 ≤ α ≤ 1; when the similarity computed from the two pages' feature vectors (by comparing their angle and length, as explained below) falls within the range determined by α, I judge the pages to be near-duplicates.

With that, the judgment of originality is done. That was the tedious, dry version; let me now repeat it in plain language. First, if your content is exactly the same, not a word changed, it is obviously a copy, and the MD5 hash value catches that instantly. Second, many SEOs are lazy about their so-called pseudo-originals: if, while rewriting, you insert your own viewpoints and information, fine; but if all you do is swap in a few synonyms, then I use the feature vector, and through the feature-vector comparison these shoddy pseudo-originals get caught.
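As a rough sketch of these two checks (my own reading, not Baidu's or Google's actual code): the formula behind α is not spelled out in the article, so here I combine the angle (via cosine) and the relative length of the two weight vectors into one score and compare it against a hypothetical threshold.

```python
import hashlib
import math

ALPHA = 0.9  # hypothetical near-duplicate threshold

def fingerprint(keywords):
    """MD5(Z), where Z is the concatenation of the top-n keywords K."""
    z = "".join(keywords)
    return hashlib.md5(z.encode("utf-8")).hexdigest()

def vector_similarity(t_i, t_j):
    """Combine angle and length agreement of two weight vectors into [0, 1]."""
    dot = sum(a * b for a, b in zip(t_i, t_j))
    len_i = math.sqrt(sum(a * a for a in t_i))
    len_j = math.sqrt(sum(b * b for b in t_j))
    if not len_i or not len_j:
        return 0.0
    angle_part = dot / (len_i * len_j)                 # cosine of the angle
    length_part = min(len_i, len_j) / max(len_i, len_j)
    return angle_part * length_part

def judge(page_i, page_j):
    """Each page is a (keywords, weights) pair extracted from its content 1."""
    k_i, t_i = page_i
    k_j, t_j = page_j
    if fingerprint(k_i) == fingerprint(k_j):
        return "exact reprint"
    if vector_similarity(t_i, t_j) >= ALPHA:
        return "near-duplicate (pseudo-original)"
    return "distinct"

# A lightly reworded page keeps almost the same keywords and weights.
page_a = (["seo", "article", "quality"], [0.9, 0.7, 0.4])
page_b = (["seo", "article", "quality"], [0.9, 0.7, 0.4])
page_c = (["seo", "articles", "quality"], [0.88, 0.69, 0.41])
print(judge(page_a, page_b))   # -> exact reprint
print(judge(page_a, page_c))   # -> near-duplicate (pseudo-original)
```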
The idea behind the judgment is simple: when your sets of the top n highest-weighted keywords are very similar, the pages are judged as duplicates. Similarity here includes, but is not limited to, overlap of the top n keywords: I also construct the feature vectors and compare the angle and length of the two vectors, and when the differences in angle and length are both below a certain value, I classify the two as similar articles.

Note 1 (start): Friends who follow the official blog of Google's anti-webspam team will have seen Google's post on similar-article judgment, which mainly uses the law of cosines to compute the angle. But having since read several papers, Mr.zhao believes that post only disclosed an approach Google had already abandoned, and that the general trend now is to compute both the angle and the length, which is why that is the algorithm presented here. (Note 1 end)

OK, at this point we should notice a few questions: 1. Is the range of values at which α judges duplication variable? 2. How are the keywords in the content extracted? 3. How are the keywords' weight values in the content assigned? I will answer each in turn.

First, α determines the range of values judged as duplicate, and this range is absolutely variable. As the SEO industry develops, more and more people look for shortcuts, which the search engine cannot accept. So every few years there is a major algorithm update, and each major update comes with a forecast of how many search results will be affected. How is that percentage calculated? Not from a single number, of course. On the content side (other factors I will explain in other articles) it is calculated from how the space of pages judged similar changes as α is adjusted: for every page I process, the computed similarity value is stored in the database, so each time I adjust the algorithm the risk can be kept under the tightest control.

So how are keywords extracted? That is word-segmentation technology, which I will come to shortly. The weighting of different keywords on the page will also be covered later.

On article similarity, in short: rewrites like turning "more and more SEOs are starting to pay attention to article quality" into "high-quality articles are getting more attention from SEOs" may not have been caught in the past. That is not because I lacked the technology to recognise them, but because I had relaxed the range; at any time I can, by adjusting the parameter's range of values, reconsider the worth of those pages.

If you are a little confused at this point, don't worry, I will take it slowly. In the algorithm above I need to know the top n important keywords and their corresponding weight eigenvalues. How do I get these values? First comes word segmentation. For segmentation I set up a pipeline and then apply forward maximum matching, reverse maximum matching, minimum segmentation and other methods; this is covered in my blog post "An Introduction to Common Chinese Word Segmentation Techniques", so I will not repeat it here. Through segmentation I obtain the keyword set K of this page's content 1.
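As a small illustration of one of those strategies, here is a minimal forward-maximum-matching segmenter; the tiny dictionary is a made-up stand-in, and a production segmenter would combine this with reverse maximum matching and minimum-cut heuristics.

```python
# Forward maximum matching: at each position, try the longest dictionary word
# first and shrink the window until something matches (falling back to a
# single character).
DICTIONARY = {"搜索", "搜索引擎", "引擎", "如何", "判断", "原创", "文章"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def forward_max_match(sentence: str) -> list:
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in DICTIONARY or length == 1:
                words.append(candidate)
                i += length
                break
    return words

print(forward_max_match("搜索引擎如何判断原创文章"))
# -> ['搜索引擎', '如何', '判断', '原创', '文章']
```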
By the time content 1 was identified I had already built the tag tree, so content 1 has in fact been decomposed by the tag tree into a tree-shaped structure of paragraphs. The figure above is the tag tree for content 1.

Here I ran into a question: when assigning weights over the tag tree, should I work on the tag tree of the entire page, or only on the tag tree of content 1? Many friends might think that since we are evaluating the keywords of content 1, handling content 1 alone is enough. In fact a search engine processes a staggering amount of data, so its demands on code and algorithm efficiency are extremely high.

Under normal circumstances no page of a website exists in isolation. When ranking a page for a keyword, besides off-site factors I must consider weight inheritance within the site; and once on-site weight inheritance is in play, I cannot avoid computing internal links. At the same time, internal links themselves need different weights to distinguish them, and when computing a link's weight I must consider its relevance to the page. In that case I should allocate weights to all the information blocks of the whole page in a single pass: this is efficient, and it also fully reflects how important relevance is to both content and links. As you can see everywhere on the Internet, relevance determines how valid a link's vote is.

OK, now that it is settled that the entire tag tree gets weights, let us begin.

First I need to identify the important keywords. Key keywords are identified in two ways: 1. key keywords of different industries; 2. key keywords by sentence structure and part of speech.

Every mature commercial search engine uses different algorithms for different industries, and the industry judgment relies on per-industry keyword lexicons. Baidu recently began returning a site's record-filing and certification information in the search results for certain specific keywords, which shows that these lexicons already exist.

So where does sentence structure come in? A Chinese sentence is built from subject, predicate, object, attributive, adverbial and complement, and the parts of speech are just nouns, verbs, prepositions, adjectives, adverbs, onomatopoeia, pronouns and numerals. I believe many people heard, when they first started SEO, that besides removing noise the search engine removes particles (such as 地) and pronouns; broadly speaking that is true, but it is not completely accurate. The underlying principle is that different sentence components and parts of speech are treated with different attitudes. We can be sure the subject is the most important part: when the subject of a sentence changes, its object and the meaning expressed usually change too, and a change of object is likely to change the industry the article belongs to. So the subject must be among the keywords I need. Why did I not say the pronouns in the subject position are removed? Because removing the subject often distorts the sentence, so I keep every word that acts as the subject, even seemingly meaningless pronouns.

What about attributives? An attributive often determines the degree or nature of a thing, so attributives are also very important. But the problem is that, to the user, "a beautiful painting" and "a pretty painting" mean the same thing, while "a beautiful painting" and "an ugly painting" mean the opposite. At the same time, other components such as the complement, which often carries location, time and similar information, are also very important. How, then, do I decide which keywords I consider the most important?

The problem is genuinely complicated, but the solution is as simple as it is hard: the accumulation of time and data. Some may find that an irresponsible answer, but it is the truth. If there were no SEO in the world and no pseudo-originals, the search engine could rest easy: with no pseudo-original interference it could quickly identify reprinted content and compute rankings with ease. With pseudo-originals around, each adjustment of the content-judgment algorithm is largely about recognising whatever pseudo-original tricks are currently common. Because pseudo-originals exist, if I were designing the strategy I would build two lexicons: lexicon A to tell which industry a piece of content belongs to, and lexicon B, which differs by industry, plus a set of rules associating the two.

An example. Medical SEO is rife with pseudo-originals, and a handful of disease words is enough to identify content as belonging to the medical industry. Suppose that at selection time, for whatever reason, I treat medical content strictly: I decide that in medical articles only the nouns serving as the subject are important; among those subject nouns, disease nouns get the highest priority, and so on down the order. If there are more than N subject nouns, blocks closer to the root node are preferred, and the same noun is selected only once. I then take the top n important keywords as the initial nodes of the assignment and assign the weights.
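Here is a toy sketch of that two-lexicon, subject-noun-first selection; the lexicons, the role/part-of-speech tags and the priorities are all invented for illustration and stand in for whatever a real segmenter and parser would produce.

```python
# Toy keyword selection: detect the industry via lexicon A, then keep only
# subject nouns and rank disease nouns ahead of other nouns (lexicon B rules).
INDUSTRY_LEXICON_A = {"diabetes": "medical", "hypertension": "medical", "mortgage": "finance"}
MEDICAL_PRIORITY_B = {"disease_noun": 0, "other_noun": 1}

def detect_industry(words):
    for w in words:
        if w in INDUSTRY_LEXICON_A:
            return INDUSTRY_LEXICON_A[w]
    return "general"

def pick_keywords(tagged_words, n):
    """tagged_words: (word, sentence_role, part_of_speech) triples."""
    industry = detect_industry(w for w, _, _ in tagged_words)
    seen, candidates = set(), []
    for word, role, pos in tagged_words:
        if role != "subject" or pos != "noun" or word in seen:
            continue  # keep subject nouns only, each noun at most once
        seen.add(word)
        kind = "disease_noun" if industry == "medical" and word in INDUSTRY_LEXICON_A else "other_noun"
        candidates.append((MEDICAL_PRIORITY_B[kind], word))
    candidates.sort()                      # disease nouns float to the front
    return [w for _, w in candidates[:n]]

sentence = [("diabetes", "subject", "noun"), ("diet", "subject", "noun"),
            ("affects", "predicate", "verb"), ("blood sugar", "object", "noun")]
print(pick_keywords(sentence, 2))          # -> ['diabetes', 'diet']
```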
When assigning, I set an assignment coefficient E, and I can decide the weight assigned according to the type of keyword sitting at each of these initial nodes. For example, a disease noun that also appears in the title gets coefficient E1, a disease noun that does not appear in the title gets E2, and other nouns get E3. Then I start traversing the tag tree.

The whole page itself carries weight Q, and I traverse once for each of the top n keywords, in order. My traversal principle is as follows: in the first traversal, the node holding the most important keyword gets weight Q*E1, its parent node gets Q*E1*b, and its child nodes get Q*E1*c; the same principle then continues outward to the parent's parent and the parent's other children, to the children's children, and so on.

An example. Suppose Q is 1 and E1 is 3, so the starting node is assigned 3 (as in the figure below). Then suppose that moving toward a parent takes the square root of the previous number, and moving toward a child takes the cube root of the previous number (second figure below). The traversal then continues through the remaining nodes, and the first traversal ends when every node in the page's tag tree has been assigned a value. Then comes the second traversal; note that this time E2 is multiplied not by Q but by the current weight value of the node holding the second most important keyword.

After n such traversals, every information block has its own weight value. I then pull out the information blocks of content 1 (the specifics are described above, with drawings, so I will not draw any more) and quantify content 1.
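A toy sketch of this propagation, with the assumptions spelled out: the tree, Q and the coefficients are invented; moving toward a parent takes the square root of the value it came from and moving toward a child takes the cube root; and since the article does not say how successive traversals combine, each traversal here simply adds its contribution onto the node weights.

```python
import math
from collections import deque

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children, self.weight = name, parent, [], 0.0
        if parent:
            parent.children.append(self)

def propagate(seed: Node, seed_value: float) -> None:
    """Spread seed_value outward: sqrt() toward parents, cube root toward children."""
    contribution = {seed: seed_value}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        value = contribution[node]
        neighbours = ([node.parent] if node.parent else []) + node.children
        for nb in neighbours:
            if nb in contribution:
                continue
            contribution[nb] = math.sqrt(value) if nb is node.parent else value ** (1 / 3)
            queue.append(nb)
    for node, value in contribution.items():
        node.weight += value

# Hypothetical page: html -> body -> (content block, link block)
html = Node("html")
body = Node("body", html)
content = Node("content block", body)
links = Node("link block", body)

Q = 1.0
E1, E2 = 3.0, 2.0   # Q=1, E1=3 follow the article's example; E2 is made up

propagate(content, Q * E1)               # first traversal seeds with Q * E1
propagate(content, content.weight * E2)  # later traversals seed from the node's
                                         # current weight (keyword 2 happens to
                                         # sit in the same block here)

for n in (html, body, content, links):
    print(f"{n.name}: {n.weight:.3f}")
```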
After quantification I can obtain the weight eigenvalues T = {t1, t2, ..., tn} that I needed above, and with that the first pass of the algorithm layer is complete. There are many possible quantification formulas; I will not give one here, because such an example would be meaningless and I am not actually writing a search engine.

Expanded reading 3 (start): The weight of a link module is ultimately passed on, through its hyperlinks, to the pages they point to. This also explains why links in different positions pass on different amounts of weight: the position of an internal link determines the weight it inherits and passes on. And the advice we often hear, that keywords should appear in the text around an internal link, derives from this same phenomenon. (Expanded reading 3 end)

At this point the algorithm layer is basically done.

Statement 1 (start): 1. Again, I emphasise that these algorithms were not written by me; I learned them from others. From whom? I have forgotten... there were many. 2. Every mature commercial search engine's algorithm is certainly layered, never just a single algorithm layer, so while this one layer has a great impact on rankings, rankings are not computed from this layer alone. 3. This article first appeared on Mr.zhao's SEO blog; when reprinting, please keep the original source: http://www.seozhao.com/379.html (Statement 1 end)

So what concrete help does a rough understanding of this algorithm layer give our actual work? 1. We know how to lay out a content page so that, when we reprint an article, Baidu can tell we reprinted it and can also see that we aggregated viewpoints around it for a better user experience. 2. We can better anticipate which articles will be judged as similar. 3. Most importantly, we can lay out content pages better. For genuine white-hat SEO, the on-page layout of a site's sections matters enormously when organising the site; an experienced SEO can make effective use of page weight inheritance to improve long-tail rankings, which is vital for portals and other sites with large numbers of content pages. Of course, for long-tail rankings, understanding and arranging weight transfer on the page is only the foundation; in a later article I will give my views on section-level structure and weight transfer. 4. We understand the general principle of weight inheritance through internal links.

Source: Mr.zhao.
