If it were me, how would I judge a valuable article?



A lot of people have asked me: Mr.zhao, how does Baidu tell pseudo-original content from genuinely original content? What kind of articles does Baidu like? What kind of articles are more likely to rank for long-tail keywords? And so on. Faced with these questions, I often don't know how to answer. If I give a general answer, such as "pay attention to user experience and usefulness," the questioner feels I'm brushing him off and complains that it is too vague. But I can't give specifics either; after all, I am not Baidu, so how could I lay out the exact algorithm for you?



That is why I started this "If it were me" series. In these articles, I assume that I am the one racking my brains to provide a better search service for users, and ask what I would do: how I would treat article content, how I would treat links, how I would treat site structure, and so on. Of course, my skills are limited, and I can only write about the little I understand. Baidu and the other commercial search engines have plenty of people far more talented than I am, and their algorithms and approaches are surely much more refined than mine. Whatever I write here, I only hope it gives you a rough picture. After a while on the SEO road, no one can really be anyone else's teacher; these ideas are for reference only.



Here I want to state solemnly that none of the ideas, algorithms, and procedures in this series were invented by me; all of them are collected from public information. At the same time, I believe you can see that if these free, public methods can already do this much, the commercial ones can do far more. OK, let's start.



If it were me, what kind of articles would I like? I would like the articles my users like. If I had to force criteria onto that, there are two kinds: 1. original and liked by users; 2. not original but liked by users. My attitude here is obvious: pseudo-original is not original. So what kind of articles do users like? Clearly, new ideas and new knowledge are what users usually like, which means original articles are usually the ones users like; and even when users don't like them, an original site, as a producer of fresh content, still deserves a certain amount of protection. Does that mean users necessarily dislike non-original articles? No. Some sites mainly compile and aggregate existing content; if those sites are valuable to users, their articles should also get decent rankings.



This shows that I only need to pay attention to two types of articles: first, original articles; second, articles on valuable aggregation sites. The first thing to make clear is that the scope of this article is limited to content pages, not topic pages, list pages, or the home page.





Content de-noising is not, as many people mistakenly think, just the removal of code. For me, it means stripping out the parts of the page text that are not body content: the navigation bar, the footer text, the various article lists, and so on. After removing them, I am left with a block of text containing only the body content of the page. If I were writing scraping rules for a single site, webmaster friends know this is not difficult. But a search engine is, after all, a program; it cannot write site-specific collection rules for every site, so I need to establish a general de-noising algorithm.



Before that, let us clarify our purpose.




[Figure: a content page divided into content 1 (the article body), content 2 (related links), and other noise blocks]


It is clear from the figure above that content 1 is what the user needs most, content 2 is what the user might be interested in, and the rest is invalid noise. From this we can identify the following features:



1. All the list calls sit inside their own information blocks. These blocks are made up mostly of tags; even where there is text outside the links, that text is largely fixed boilerplate and is repeated heavily across the site's pages, so it is easy to identify.



2. Content 2 is generally adjacent to content 1, and the link anchor text in content 2 is related to content 1.



3. Content 1 is a mix of body text and tags, and under normal circumstances that body text is unique within the collection of pages indexed for the site.



So, for this reason, I use the well-known tag tree to decompose content pages. Looking at the tag layout, a page delivers its content through a number of information blocks, and those blocks are marked out by specific block-level tags (div, ul, table, and the like). Following these tags, we can abstract the web page into a tree structure.
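
To make this concrete, here is a minimal sketch, entirely my own illustration rather than any engine's real code, of decomposing a page into a tree of block-level information blocks while tracking how much text and how many tags each block contains:

```python
# Minimal tag-tree sketch (illustrative only): block-level elements become nodes,
# and each node records the text and the number of tags that appear inside it.
from html.parser import HTMLParser

class Node:
    """One information block in the tag tree."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""       # visible text collected under this block
        self.tag_count = 0   # number of tags seen inside this block

class TagTreeBuilder(HTMLParser):
    BLOCK_TAGS = {"div", "section", "article", "ul", "ol", "li", "p", "table", "td"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # every tag counts toward the density of the block it sits in (and its ancestors)
        node = self.current
        while node is not None:
            node.tag_count += 1
            node = node.parent
        if tag in self.BLOCK_TAGS:           # open a new information block
            child = Node(tag, parent=self.current)
            self.current.children.append(child)
            self.current = child

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text += (" " if self.current.text else "") + text

def build_tag_tree(html: str) -> Node:
    builder = TagTreeBuilder()
    builder.feed(html)
    return builder.root
```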




[Figure: a hand-drawn tag tree of the page's information blocks]



The picture above is a simple tag tree I drew by hand. With it, I can easily identify the information blocks. Then I set a threshold value a for the content proportion. The content proportion is the ratio of the number of words of text in an information block to the number of tags that appear in it. When an information block's content proportion on the page is greater than a, I list it as a valid content block (this is to weed out excessive internal linking, because an article blanketed with internal links hurts user experience). Then I compare the text in the content blocks, and when it is unique, I collect that block or those blocks: that is the "content 1" I need.
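
Continuing the sketch above (the Node class comes from it, and the threshold value is an arbitrary placeholder of mine), the content-proportion test plus the uniqueness check might look like this:

```python
# Rough sketch of picking "content 1" blocks: text-dense blocks whose text is
# unique among the pages already collected for the site. threshold_a is invented.
def content_ratio(node) -> float:
    """Words of text in an information block divided by the tags it contains."""
    return len(node.text.split()) / max(node.tag_count, 1)

def extract_content_1(root, seen_texts, threshold_a=5.0):
    blocks, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.text and content_ratio(node) > threshold_a and node.text not in seen_texts:
            blocks.append(node)              # candidate "content 1" block
        stack.extend(node.children)
    return blocks
```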



So what do I do with content 2? Before explaining how to handle it, let me explain what it means. As I said earlier, if a site is focused on its users, its role is to carefully classify and cross-link existing Internet content so users can read it better and more efficiently. For such a site, even if its articles are not original but excerpted from around the Internet, I will still give it plenty of attention and ranking, because well-aggregated content often meets user needs better.



So for aggregation sites, I can make a rough judgment based on "content 2". In short, if a site is a good aggregation site, its content pages must have content 2, and content 2 must occupy an important share of the page.



OK, recognizing content 2 is simple. Any information block whose content proportion falls below a certain threshold I judge to be a link module. I take content 1 and extract its topic B by some means (covered later in this article), then split out the anchor text of all the links in each link module. If all of that anchor text matches topic B, the link module is judged to be content 2. Next I set a link threshold C: the number of link tags inside content 2 divided by the number of link tags in all link modules. If the ratio is greater than C, the site may be an aggregation site, and the ranking calculation for its content will refer to an aggregation-site-specific algorithm.
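
As a rough illustration of this test (the topic representation, function names, and threshold value are my own assumptions, not anything Baidu has published), topic B can be treated as a simple set of terms:

```python
# Sketch of the "content 2" / aggregation-site test: a link module counts as
# content 2 when all of its anchor texts relate to topic B, and the share of
# such anchors is compared against the link threshold C.
def is_content_2(anchor_texts, topic_terms) -> bool:
    """All anchors in the link module must relate to topic B (here: share a term)."""
    return all(any(term in anchor for term in topic_terms) for anchor in anchor_texts)

def looks_like_aggregation(link_modules, topic_terms, threshold_c=0.6) -> bool:
    """link_modules: one list of anchor texts per link module on the page."""
    content2 = sum(len(m) for m in link_modules if is_content_2(m, topic_terms))
    total = sum(len(m) for m in link_modules) or 1
    return content2 / total > threshold_c
```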



I believe many SEO practitioners, even those new to the field, have heard two things: outbound links on a content page should be relevant, and the bottom of the page should carry related reading to draw users into deeper clicks. You will also have heard that internal links should be used in moderation, not to excess.



But few people say why, and more and more people neglect these details because they don't understand the reasons underneath them. Of course, earlier search engine algorithms paid too little attention to content, which added fuel to the fire. But if I look at it from a conspiratorial point of view, I can assume the truth is as follows.



Most users searching see only 10 results on the first page, and once my own products are excluded, often only about 7; the average user clicks no further than page 3. So in practice fewer than 30 quality sites are enough to satisfy the great majority of users. Then, after three to five years of letting things settle, the sites that endure the loneliness and take details seriously are gradually screened out from the rest; at that point I adjust this part of the algorithm, filter out these quality sites, and push them to users. Of course, more factors come into play along the way, such as domain age, the amount of JS, site speed, and so on.



You may ask: why is it that when an article contains a lot of identical text, the search engine penalty comes so quickly? I am not talking here about excerpts versus originals, but about your own articles repeating one another within your own site. The reason the search engine reacts so quickly and punishes so harshly is that, at root, it cannot extract content 1 from your article.



OK, after this series of processing steps I have obtained content 1 and content 2; what follows is the originality-recognition algorithm. Today, search engines recognize originality broadly by combining keyword matching with a vector space model. That is what Google does, and there is a corresponding article on its official blog. Here I will give a plain-language version of the introduction and try to keep it easy to understand.



Well, by analyzing content 1, I get its highest-weighted keywords. Ranking them by weight, I name the set of the top n highest-weighted keywords K, so K = {k1, k2, ..., kn}. Each keyword then corresponds to a weight eigenvalue on the page; I set the weight eigenvalue corresponding to k1 to t1, so the set of eigenvalues for the top n weighted keywords is T = {t1, t2, ..., tn}. With these features we can compute the corresponding feature vector W = {w1, w2, ..., wn}. Finally, I concatenate K into a string Z, and MD5(Z) denotes the MD5 hash value of the string Z.
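
A small sketch of that bookkeeping (purely illustrative; the weights are assumed to come from the assignment procedure described later in this article):

```python
# Take the top-n weighted keywords K of content 1, keep their weight eigenvalues T,
# and hash the concatenated keyword string Z so exact reprints can be caught cheaply.
import hashlib

def fingerprint(keyword_weights, n=10):
    """keyword_weights: {keyword: weight}. Returns (K, T, MD5(Z))."""
    top = sorted(keyword_weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
    K = [k for k, _ in top]          # K = {k1, ..., kn}
    T = [w for _, w in top]          # T = {t1, ..., tn}
    Z = "".join(K)                   # K spelled out as one string Z
    return K, T, hashlib.md5(Z.encode("utf-8")).hexdigest()
```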



Now suppose the two pages I am judging are i and j. I then compute two tests.



1. When MD5(Zi) = MD5(Zj), page i and page j are exactly the same and are judged to be reprints.



2. Set a specific value α and compare the two pages' feature vectors:



[Formula: the similarity α of pages i and j, computed from their feature vectors Wi and Wj]



When the computed value falls within the range defined by α (0 ≤ α ≤ 1), I judge the pages to be similar enough to count as duplicates.



With that, the judgment of originality is finished. That was a tedious, dry explanation, so let me repeat it in plain language.



First, if your content is exactly the same, without a single word changed, it is obviously an excerpt, and the MD5 hash value identifies it instantly.



Second, many SEOs are lazy. When they do so-called pseudo-original rewriting, it would be fine if they inserted some views and information of their own; instead they just swap in a few synonyms. That is where the feature vector comes in: by comparing feature vectors, these lazy pseudo-originals are caught. The idea behind the judgment is simple: when your set of top-n highest-weighted keywords is very similar to another page's, the pages are judged duplicates. "Similar" here includes, but is not limited to, the top-n highest-weighted keywords coinciding; a feature vector is then constructed, and when the angle between the two vectors and the difference in their lengths are both below certain values, I define them as similar articles.
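
Putting the two checks together, a hedged sketch might look like the following; the angle and length thresholds are invented for illustration and are not any engine's real values:

```python
# Sketch of the duplicate tests: an MD5 match means a word-for-word reprint;
# otherwise the feature vectors are compared by angle and by length, and the
# pages count as near-duplicates when both differences are small.
import math

def near_duplicate(md5_i, md5_j, w_i, w_j, max_angle_deg=10.0, max_len_ratio=0.1):
    if md5_i == md5_j:
        return True                              # exact reprint
    dot = sum(a * b for a, b in zip(w_i, w_j))
    len_i = math.sqrt(sum(a * a for a in w_i))
    len_j = math.sqrt(sum(b * b for b in w_j))
    if len_i == 0 or len_j == 0:
        return False
    cos = max(-1.0, min(1.0, dot / (len_i * len_j)))
    angle = math.degrees(math.acos(cos))         # angle between the two vectors
    length_diff = abs(len_i - len_j) / max(len_i, len_j)
    return angle < max_angle_deg and length_diff < max_len_ratio
```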



Friends who follow the official blog of Google's anti-webspam team should have seen Google's post about its similar-article algorithm, in which the cosine rule is mainly used to compute the angle. But after reading several papers, Mr.zhao believes that post was probably only declassified after Google had already abandoned the approach; the general trend now is to compute both angle and length, which is why that is the algorithm chosen here. OK, at this point we notice a few questions.



1. Is the value range of α used to judge repetition variable?



2. How are the keywords extracted from the content?



3. How are the keyword weight values in the content assigned?



Let me answer each of these.



First, α determines the range of values judged as repetition, and this range is absolutely variable. As the SEO industry develops, more and more people try to game the system, which search engines cannot accept. So every few years there is a major algorithm update, and each major update comes with a prediction of what percentage of search results will be affected. How is that percentage calculated? Certainly not by making up a number. On the content side (other factors I will explain in other articles), it is calculated from how the set of pages judged similar changes as α is adjusted: when I process each page, I store its computed α value in a database, so every time I adjust the algorithm, the risk can be kept under the greatest possible control.



So how are the keywords extracted? That is word segmentation technology, which I will talk about later. The assignment of weights to different keywords on a page will also be covered later.



On article similarity, in short: in the past, rewriting a sentence such as "More and more SEOs are starting to pay attention to article quality" into "High-quality articles are getting more attention from SEOs" was not recognized as a duplicate. That was not because the technique could not be recognized, but because I had relaxed the range; I can, at any time, reassess the value of such pages simply by resetting the parameter's value range.



Well, if you are a little confused at this point, don't worry; I will take it slowly. In the algorithm above, I need to know the top n important keywords and their corresponding weight eigenvalues. How do I obtain these values?



First comes word segmentation. For segmentation, I first set up a process and then apply forward maximum matching, reverse maximum matching, minimum segmentation, and other methods. This is covered in my blog post "The Common Chinese Word Segmentation Technology Introduction", so I won't repeat it here. Through segmentation, I obtain the keyword set K of this page's content 1.
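
As an aside, the simplest of the methods just mentioned, forward maximum matching, can be sketched in a few lines; the dictionary here is a toy example, not a real lexicon:

```python
# Forward maximum matching: at each position take the longest dictionary word,
# falling back to a single character when nothing matches.
def forward_max_match(text, dictionary, max_word_len=4):
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# e.g. forward_max_match("搜索引擎优化", {"搜索引擎", "优化"}) -> ["搜索引擎", "优化"]
```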



By the time I identified content 1, I had already built the tag tree, so content 1 has in fact already been broken down by the tag tree into a tree-like structure of paragraphs.





[Figure: the tag tree of content 1]


The image above is the tag tree for content 1. Here I ran into a question: when assigning weights over the tag tree, should the assignment cover the tag tree of the entire page, or only the tag tree of content 1?



Many friends may think that, since we are evaluating the keywords of content 1, it is enough to process only content 1. In reality, a search engine handles data at an enormous scale, so its demands on efficient code and algorithms are extremely high.



Under normal circumstances, a website's pages do not exist in isolation. When ranking a page for a given keyword, besides off-site factors, I have to consider on-site weight inheritance; and in considering on-site weight inheritance, I cannot avoid computing the internal links. The internal links themselves should carry different weights, and when computing an internal link's weight, I must consider its relevance to the page. Given that, I should assign weights to the information blocks of the entire page in a single pass. That is efficient, and it also fully reflects how important relevance is to both content and links. As you can see everywhere on the web, relevance determines how much a link's vote is worth.



OK, now that we have settled on assigning weights over the entire tag tree, let's start from there. First, I need to identify the lexicon of important keywords. The key keywords are determined in two ways:



1. Key keywords in different industries.



2. Key keywords for sentence structure and part of speech.



Every reasonably mature commercial search engine uses different algorithms for different industries, and the industry judgment relies on per-industry keyword lexicons. Baidu has recently begun returning sites' ICP filing and certification information in the search results for certain specific keywords, which shows that such lexicons already exist.



So where does sentence structure come in? A Chinese sentence is built from subject, predicate, object, attributive, adverbial, and complement, and the parts of speech are just nouns, verbs, prepositions, adjectives, adverbs, onomatopoeia, pronouns, and numerals. Many people, when they first start doing SEO, hear that search engines remove particles and pronouns along with the noise. That statement is broadly right but not entirely accurate. The underlying principle is that the engine treats different sentence components and parts of speech differently. We can be sure the subject is the most important part: when the subject of a sentence changes, its object and the meaning expressed usually change too, and a change of object is likely to change the industry the article belongs to. So the subject must be among the keywords I need. Why did I not say that pronouns in the subject position should be removed? Because removing the subject often distorts the sentence, so I keep all the words that fill the subject role, even pronouns that seem meaningless.



What about attributives? An attributive often determines the degree or nature of a thing, so it is also very important. But here is the problem: to a user, a "beautiful painting" and a "pretty painting" mean the same thing, while a "beautiful painting" and an "ugly painting" mean opposite things. At the same time, other sentence components such as complements often carry location, time, and similar information, so they matter too. Given all this, how do I decide which keywords I consider most important?



This problem really is complicated, and its solution is both simple and hard: the accumulation of time and data. Some people may find that an irresponsible answer, but that is how it is. If there were no SEO in the world and no pseudo-originals, search engines could relax: with no pseudo-original interference, they could quickly identify reprinted content and then easily compute rankings. But with pseudo-originals around, each adjustment of the content-judgment algorithm is mostly about recognizing whatever pseudo-original tricks are currently common. Because pseudo-originals exist, if I were designing the strategy, I would design two lexicons: lexicon A to distinguish which industry a piece of content belongs to, and lexicon B maintained separately for each industry, together with a set of rules associating the two.



An example. Medical SEO, where pseudo-originals run rampant, can quickly be identified as belonging to the medical industry through a set of disease terms. Suppose that, for whatever reason, I decide to treat the medical industry strictly. Then for medical articles I treat only the nouns serving as subjects as important; among those subject nouns, disease nouns get the highest priority, and the rest are ranked after them. If, in that order, there are more than n subject nouns, the ones in information blocks closest to the root node are preferred, and the same noun is selected only once. The top n important keywords chosen this way become the starting nodes of the assignment, and the weights are then assigned.
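
As a toy sketch of that selection rule (the lexicons and ordering here are invented placeholders, not a published algorithm), choosing the starting keywords might look like this:

```python
# Subject nouns are the candidates; disease nouns get top priority; each noun is
# used once; the first n survivors become the starting nodes for weight assignment.
def pick_initial_keywords(subject_nouns, disease_lexicon, n=5):
    """subject_nouns: nouns serving as sentence subjects, ordered so that the ones
    in blocks nearest the root node come first. Returns at most n keywords."""
    seen, picked = set(), []
    # stable sort: disease nouns first, the rest keep their original order
    for word in sorted(subject_nouns, key=lambda w: w not in disease_lexicon):
        if word not in seen:
            seen.add(word)
            picked.append(word)
        if len(picked) == n:
            break
    return picked
```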



At assignment time I set an assignment coefficient E, and the weight assigned can be determined by the type of keyword at the node being assigned. For example, a disease noun that also appears in the title gets coefficient e1, a disease noun that does not appear in the title gets e2, and other nouns get e3. Then I start traversing the tag tree. The page as a whole carries weight Q, and the top n keywords are traversed in order. My traversal principle is as follows:



1. On the first traversal, the node holding the first important keyword gets weight value Q·e1, its parent node gets Q·e1·b, and its child nodes get Q·e1·c; the traversal then continues outward to the parent's parent, the parent's other children, the children's children, and so on.



Here is an example. Assume Q is 1 and e1 is 3.



It starts with the following figure



[Figure: the tag tree with the first important keyword's node assigned weight Q·e1 = 3]



Then assume that b takes the square root of the previous number and c takes the cube root of the previous number, as in the figure below.



[Figure: the parent and child nodes assigned square-root and cube-root damped weights]



Then start traversing the other nodes.









The first traversal ends when every node in the page's tag tree has been assigned a value. Then comes the second traversal; note that this time what is multiplied by e2 is not Q, but the current weight value of the node holding the second most important keyword.
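
A rough sketch of one such traversal, reusing the Node class from the earlier tag-tree sketch: the damping functions mirror the square-root and cube-root factors in the example above (which direction each applies to is my assumption), and for simplicity the walk only goes straight up to ancestors and straight down to descendants:

```python
# Spread weight outward from the node holding the i-th important keyword:
# the node itself gets Q*e_i, ancestors get successively square-rooted values,
# descendants get successively cube-rooted values.
import math

def spread_weight(start_node, q, e,
                  up=lambda w: math.sqrt(w), down=lambda w: w ** (1.0 / 3.0)):
    start_node.weight = getattr(start_node, "weight", 0.0) + q * e
    # walk up: each ancestor receives the damped value of the node below it
    w, node = q * e, start_node.parent
    while node is not None:
        w = up(w)
        node.weight = getattr(node, "weight", 0.0) + w
        node = node.parent
    # walk down: children, grandchildren, ... receive progressively damped values
    frontier = [(child, down(q * e)) for child in start_node.children]
    while frontier:
        node, w = frontier.pop()
        node.weight = getattr(node, "weight", 0.0) + w
        frontier.extend((c, down(w)) for c in node.children)
```

Under that assumption, with Q = 1 and e1 = 3 as in the example, the starting node gets 3, its parent gets √3 ≈ 1.73, and each child gets ∛3 ≈ 1.44.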



After n such traversals, every information block has its own weight value. I then extract the information blocks of content 1 (as drawn above; I won't draw it again) and quantify content 1. After quantization, I obtain the weight eigenvalues T = {t1, t2, ..., tn} that were needed above. With that, the corresponding part of this algorithm layer is complete. There are many possible quantization formulas; I won't give examples here, because such an example would be meaningless and I am not actually writing a search engine.



The weight of a link module is what is ultimately passed, through its hyperlinks, to the pages they point to. This also explains why links in different positions pass on different amounts of weight: the position of an internal link determines the weight it passes on. And the advice we often hear, that keywords should appear in the text around an internal link, is derived from this same behavior. At this point, the algorithm layer is basically finished.



1. Once again, I stress that these algorithms were not written by me; I learned them from others. From whom? I've forgotten... there were many.



2. Every mature commercial search engine certainly layers its algorithms; it will never be just one algorithm layer. So while this single algorithm layer has a large influence on rankings, rankings are not computed strictly according to it.



So, now that we have a general understanding of this algorithm layer, what concrete help does it give our actual work?



1. We know how to lay out a content page so that when we reprint an article, Baidu can see that we reprinted it, while the page still aggregates views from many articles for a better user experience.



2. We can better predict which articles will be judged to be similar articles.



3. Most importantly, we can lay out content pages better. For genuine white-hat SEO, when organizing a site, the page layout of its sections and columns is particularly important; an experienced SEO can make effective use of the weight a page inherits and thereby lift long-tail rankings, which is very important for portals and other sites with a large number of content pages. Of course, for long-tail rankings, understanding and laying out page weight transfer is only the foundation; in a later article I will set out my views on column-level structure and weight transfer.



4. We understand the general principle of internal-link weight inheritance.



This article first appeared on Mr.zhao's blog. Original address: http://www.seozhao.com/379.html. Please retain this link when reprinting.



