Chinese search engine technology: ranking technology

Source: Internet
Author: User
Keywords: search engine, Google


As the "eyeball economy" sweeps the Internet, large sums of money are flowing quickly into the search engine market, the market that attracts the most eyeballs. Many surveys show that the search engine market is developing rapidly and has become one of the most promising industries for the next few years. As search engines such as Google, Baidu, and China Search, each with its own characteristics, become the most commonly used network tools, corporate attention to search engines has escalated from "watching" to "acting."

As market capacity and the number of users grow, how to improve search so that it is fairer, more open, more standardized, and more user-friendly has become a topic of great concern. But one paradox keeps appearing: paid placement brings profits to search engine companies while reducing visitors' satisfaction. How should the balance between money and user needs be struck?

The secret of Google's success

By 2004, Google (http://www.google.com) had been named the world's number one brand for two consecutive years, even though Google was founded only five years ago and began as a research project by two Stanford University students. This is a miracle, just as Bill Gates created a miracle. Bill Gates could do so because he saw the trend in the personal computer software market, and so he named the company he created Microsoft: Micro (small) Soft (software). What about Google? Before Google appeared, there were already some very successful and very strong search engine companies, so Google was evidently not the only one to see the search trend. What, then, is the secret of Google's success?

Many factors contributed to Google's success, but the most important is that Google ranks search results better than other search engines. Google promises that most searchers will find the results they want on the first page of search results. A satisfied user comes back the next time and recommends Google to others, and so, as word passes from one person to the next, more and more people use it. In this way Google made itself the world's biggest brand without doing any advertising. What ranking technology does Google use? PageRank, that is, page rank.

One of Google's founders is Larry Page. The PageRank patent is said to be his application, and the name Page Rank comes from his name. China also has a very successful search engine company, Baidu (http://www.baidu.com). Baidu founder Robin Li has said that in 1996 he applied for a patent called hyperlink analysis, that the principle of PageRank is the same as the principle of hyperlink analysis, and that PageRank is still patent pending. The implication is that there is a question of patent ownership. We will not discuss patent ownership here, but it does show that successful search engine ranking technologies rest on a similar principle, namely link analysis. Both hyperlink analysis and PageRank are forms of link analysis.

What exactly is link analysis? Little has been published about Robin Li's hyperlink analysis; the only thing I have seen is his patent on the U.S. Patent Office website. PageRank has been described much more widely, and Google is currently the world's largest search engine, so PageRank is used here as the representative example to explain the principle of link analysis in detail.

PageRank

The principle of PageRank is similar to the citation mechanism in scientific papers: whoever's papers are cited most often is the authority. Put more plainly: if one person mentions Maggie Cheung in conversation, a second person mentions her too, and a third does as well, then Maggie Cheung must be a very famous person. On the Internet, a link is the equivalent of a "citation": a link to A on page B is the equivalent of B mentioning A in conversation. If pages C, D, E, and F all link to A as well, then page A is the most important and its PageRank value is the highest.

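There is a simple formula for calculating PageRank values:

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \frac{PR(T_2)}{C(T_2)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$

Where: the coefficient $d$ is a number greater than 0 and less than 1, generally set to 0.85. Page 1, page 2, through page n ($T_1, T_2, \ldots, T_n$) are all the web pages that link to A, and $C(T_i)$ is the number of outgoing links on page $T_i$.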

Three points can be seen from the formula above:

1. The more pages link to A, the higher A's rank. That is, A's rank increases with the number of pages pointing to A: in the formula, the larger n is, the higher A's rank.

2. The higher the rank of the pages that link to A, the higher A's rank. That is, A's rank is directly proportional to the rank of each page pointing to A: in the formula, the higher the rank of page n, the higher A's rank.

3. The more outgoing links a page that links to A has, the lower A's rank. That is, A's rank is inversely proportional to the number of outgoing links on each page pointing to A: in the formula, the more outgoing links page n has, the lower A's rank.

Every page has a PageRank value, so together the pages form one large system of equations; solving that system gives the PageRank value of each page. The Internet has an enormous number of web pages, and the system has one unknown for every page. Although the system does have a solution, solving it directly for all pages at once is impractical because the computation is too large. Readers interested in the specific computational methods can refer to books on numerical computation.
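The system of equations described above can be approximated by repeated substitution rather than solved directly. The following is a minimal sketch of that idea, not Google's implementation. It assumes the commonly published form of the formula, in which each linking page contributes its own value divided by its number of outgoing links. The function name `pagerank`, the `links` dictionary, and the four-page graph are all made up for illustration.

```python
def pagerank(links, coefficient=0.85, iterations=50):
    """Approximate PageRank values by repeated substitution.

    links maps each page to the list of pages it links to.
    coefficient is the damping value from the formula (0.85 by convention).
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Start from an arbitrary guess and refine it on every pass.
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each page linking here contributes its rank divided by
            # the number of links it hands out.
            incoming = sum(
                rank[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - coefficient) + coefficient * incoming
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical graph: B, C, and D all link to A; D also links to B.
    example = {"B": ["A"], "C": ["A"], "D": ["A", "B"], "A": []}
    for page, value in sorted(pagerank(example).items()):
        print(page, round(value, 4))
```

In this toy graph page A receives the most links and ends up with the highest value, which matches the first of the three points above. A fixed number of passes is used for simplicity; a real system would need a convergence check and a far more efficient data layout.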

In short, PageRank makes effective use of the Internet's huge link structure. A link from page A to page B is, in the words of Google's founders, a vote by page A for page B, and Google judges the importance of a page by the number of votes it receives. Besides counting the votes (links), however, Google also analyzes the voters (the linking pages). Votes cast by "important" pages carry more weight, and the pages that receive them are in turn regarded as "important". Three links to my page from the home pages of Sina, Yahoo, and Microsoft may well be worth more than 30 links from other websites. If the principle is still unclear, think of the idiom "three people make a tiger". If three people say there is a tiger on a Beijing street, many people will believe it; if those three people are national leaders, everyone will believe there is a tiger on a Beijing street.

Every page has a PageRank value. If you want to know the PageRank value of the pages on your own site, the easiest way is to download Google's free toolbar (http://toolbar.google.com/). Whenever you open a web page, you can clearly see the PageRank value of that page. Of course, the value shown is only approximate.

According to Google's technology director, besides using PageRank to measure the importance of a page, Google takes hundreds of other factors into account in its ranking. The same is true of other search engines: search results cannot be ranked by a single rule.

Other methods

Hilltop algorithm:

Hilltop is also a patent for ranking search engine results, obtained in 2001 by a Google engineer, Bharat. Google's ranking rules change constantly, but the biggest change has been based on the Hilltop algorithm. What is the principle behind Hilltop that makes Google favor it so much?

In fact, the guiding idea of the Hilltop algorithm is consistent with that of PageRank: both determine the ranking weight of search results from the number and quality of links to a page. But Hilltop holds that counting only links from related documents on the same topic is more valuable to searchers: links between topically related pages contribute more to the weight calculation than links from unrelated topics. If a site is about "clothing", 10 links to it from clothing-related sites contribute more than 10 links to it from sites about "electrical appliances". Bharat calls documents that carry influence on a topic "expert" documents, and the links from these expert documents to the target document determine the main part of the linked page's "weight score".

Combining the Hilltop algorithm with the basic PageRank ranking process to determine how well web pages match the search terms replaces an excessive reliance on PageRank values alone to find authoritative pages. Hilltop is especially important when ranking two pages that share the same topic and have similar PR. Hilltop also defeats many attempts to raise a page's PageRank value by adding large numbers of worthless links.
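The core idea, that on-topic links count for more than off-topic ones, can be illustrated with a heavily simplified sketch. This is not the Hilltop algorithm itself, which identifies expert documents rather than relying on hand-assigned topic labels. The function name, the `page_topics` mapping, the site names, and the two weights below are all assumptions chosen only for illustration.

```python
def topic_weighted_score(target, inbound_pages, page_topics,
                         on_topic_weight=1.0, off_topic_weight=0.25):
    """Score a page by its inbound links, favoring same-topic sources.

    page_topics maps a page name to a single topic label (hypothetical data).
    The two weights are illustrative values, not published constants.
    """
    target_topic = page_topics[target]
    score = 0.0
    for source in inbound_pages:
        if page_topics.get(source) == target_topic:
            score += on_topic_weight   # link from a related page
        else:
            score += off_topic_weight  # link from an unrelated page
    return score


if __name__ == "__main__":
    topics = {"shop": "clothing"}
    clothing_sites = [f"clothing-site-{i}" for i in range(10)]
    electrical_sites = [f"electrical-site-{i}" for i in range(10)]
    topics.update({name: "clothing" for name in clothing_sites})
    topics.update({name: "electrical" for name in electrical_sites})

    print(topic_weighted_score("shop", clothing_sites, topics))
    print(topic_weighted_score("shop", electrical_sites, topics))
```

With these assumed weights, ten links from clothing sites score higher for the clothing shop than ten links from electrical appliance sites, mirroring the example in the text.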

Anchor text

The name "anchor text" sounds hard to understand, but anchor text is simply the text of a link. For example, a personal website uses CCTV (www.cctv.com) as the target of its link to a news channel. Visitors who click "News channel" on that site arrive at http://www.cctv.com, so "News channel" is anchor text for the home page of the CCTV website.

On the one hand, anchor text can serve as an evaluation of the content of the page on which it appears. Normally, the links added to a page have some relationship to the content of the page itself: a site in the apparel industry will add links to peer sites or to well-known clothing companies. On the other hand, anchor text can serve as an evaluation of the page being pointed to. Anchor text can describe the content of the target page accurately: if a personal site adds a link to Google with the anchor text "search engine", the anchor text alone tells us that Google is a search engine.

Anchor text also helps search engines with files they cannot index. For example, a site adds a picture of Maggie Cheung as a jpg file, which is difficult for a search engine to index (search engines generally process only text). If the link to this photo has the anchor text "Maggie's photos", the search engine can recognize that the picture is a photo of Maggie Cheung, and when a visitor later searches for "Maggie", the picture can be found.
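The way anchor text makes a non-text file findable can be sketched as a small index that credits each word of the anchor text to the page or file the link points to. This is only an illustration under stated assumptions: the function names, the tuple layout, and the `example.com` URLs are hypothetical, and real search engines combine anchor text with many other signals.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_anchor_index(links):
    """Map each anchor-text word to the set of link targets it describes.

    links is an iterable of (source_url, anchor_text, target_url) tuples.
    """
    index = defaultdict(set)
    for _source, anchor_text, target in links:
        for word in tokenize(anchor_text):
            index[word].add(target)
    return index


def search(index, query):
    """Return the targets whose anchor text contains every query word."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


if __name__ == "__main__":
    links = [
        ("http://example.com/fans", "Maggie's photos",
         "http://example.com/maggie.jpg"),
        ("http://example.com/me", "search engine", "http://www.google.com"),
    ]
    index = build_anchor_index(links)
    print(search(index, "Maggie"))
    print(search(index, "search engine"))
```

The jpg file contains no text of its own, yet a search for "Maggie" finds it because the words of the linking anchor text were indexed on its behalf.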

This shows that choosing appropriate anchor text in web design raises the importance of both the page itself and the page it points to.

Page layout

Every page has a layout, including headings, fonts, tags, and so on. Search engines also use this layout information to judge how closely the search terms relate to the content of a page. Take static HTML as an example: after its web spider has fetched a page, the search engine needs to extract the body content and filter out the rest of the HTML code. While extracting the content, the search engine can record all the layout information, including which words appear in the title, which words appear in the body, which words are in a larger font than the others, which words are bold, which words are marked as keywords, and so on. This lets the search engine determine how closely each result relates to the search terms. For example, suppose a search for "Mao Zedong" has two results: one article is titled "Mao Zedong's Life", and the other is titled "Chiang Ching's Life" but mentions Mao Zedong in its text. The search engine will consider the former more relevant, because "Mao Zedong" appears in the title.

Therefore, sensible use of page layout will improve a page's position in the search results.
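The effect of layout on relevance can be sketched as a score in which a match in the title counts for more than a match in bold text, which in turn counts for more than a match in the body. The field names and the weights 3, 2, and 1 are assumptions made for this illustration; the article does not say what weights any search engine actually uses.

```python
# Hypothetical weights: a match in the title counts most, then bold text,
# then ordinary body text.
FIELD_WEIGHTS = {"title": 3.0, "bold": 2.0, "body": 1.0}


def layout_score(page, term):
    """Score a page for a search term using where the term appears.

    page is a dict with optional "title", "bold", and "body" strings,
    as they might be recorded while extracting content from HTML.
    """
    term = term.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = page.get(field, "").lower()
        score += weight * text.count(term)
    return score


if __name__ == "__main__":
    first = {"title": "Mao Zedong's Life", "body": "A biography."}
    second = {"title": "Chiang Ching's Life",
              "body": "The text mentions Mao Zedong once."}
    print(layout_score(first, "Mao Zedong"))
    print(layout_score(second, "Mao Zedong"))
```

Under these assumed weights the article with the term in its title outscores the one that only mentions it in the body, as in the example above.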

Paid ranking

Strictly speaking, paid ranking (which here also includes bid ranking) is not a ranking technology but a search engine profit model. But because paid ranking has the most direct impact on search engine rankings, it is briefly explained here as well.

Customers can buy a ranking for a keyword: by paying the search engine company a certain fee, they can have their site appear among the first few search results. Prices are set according to the keyword, the position, and the length of time. Prices range from thousands to hundreds of thousands of yuan (for example, a ranking for "lottery" on 3721 mostly costs hundreds of thousands of yuan).

Paid ranking brings revenue to search engine companies on the one hand and traffic to enterprises on the other, and it also has some benefit for visitors. If a visitor is looking for "suits" and an enterprise wants to sell "suits", the enterprise pays so that the visitor can find it, and buyer and seller meet immediately. But what paid ranking more often brings visitors is untrustworthy results: the ranking is no longer impartial, and sometimes it is full of junk. Search Baidu for "planet" and the first result is a graphite company, while the second is actually "Looking for planet? Go to eBay!" (see chart below). Visitors do not know whether to laugh or cry.

Of course, for companies, paid ranking is the most direct and easiest way to raise a site's ranking in search engines. Improving the ranking of web pages in search engines has now become a profession of its own, called SEO (Search Engine Optimization). SEO works from search engine ranking technology: by modifying the structure of a page (or site), actively increasing the number of links to the site, and using other methods, it leads search engines to regard the pages as very important and so raises their ranking in the search results.

The development trend of ranking technology

The technical improvements and optimizations of every search engine are reflected directly in the ranking of its search results. Many search engines are researching new ranking methods to improve user satisfaction. Professionals believe that current search engine ranking algorithms still have two major deficiencies.

First, relevance has not really been solved. Relevance is the degree to which the search terms and a page are related. Surface features such as links, fonts, and position cannot truly determine how relevant an article is to the search terms, all the more so because many of these features are not present at the same time. This is also why many search engine cheating methods work. In addition, some articles never contain the search term yet are highly relevant to it. For example, when searching for "terrorist", a page that describes acts of sabotage by Osama bin Laden but never uses the word "terrorist" cannot be found by the search engine. Relying on surface features treats the symptoms, not the root cause. The fundamental solution is to add semantic understanding, such as keyword extraction, and to analyze the relevance between the search terms and the page at the semantic level; the more accurate the analysis, the better the results.

Second, search results are uniform. In a search engine, anyone who searches for the same word gets the same results. This clearly does not satisfy visitors. A scientist searching for "planet" may want to learn about planets, while an ordinary person may be looking for the film "Star Wars", yet the search engine gives both the same results. Meeting the needs of these different kinds of visitors requires personalized search results. The foreign company Vivisimo (http://www.vivisimo.com) is trying to solve this problem by automatically clustering search results to meet the needs of different kinds of users. On the road from uniform to personalized ranking, Vivisimo has taken one step, but the ideal is that for each visitor the ranking relates directly to that visitor's search habits and intentions. In a search for "sports", football results should come first for people who like football, and basketball results should come first for people who like basketball.
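The "sports" example can be sketched as a re-ranking step that boosts results matching a visitor's recorded interests. This is a minimal illustration, not a description of Vivisimo's clustering or of any search engine's method. The result fields, the `interests` profile, the `example.com` URLs, and the multiplicative boost are all hypothetical.

```python
def personalize(results, interests):
    """Re-rank results by boosting topics the visitor cares about.

    results is a list of dicts with "url", "topics", and a base "score".
    interests maps a topic to a boost value (hypothetical visitor profile).
    """
    def adjusted(result):
        # Sum the visitor's interest in every topic the result covers.
        boost = sum(interests.get(topic, 0.0) for topic in result["topics"])
        return result["score"] * (1.0 + boost)

    return sorted(results, key=adjusted, reverse=True)


if __name__ == "__main__":
    sports_results = [
        {"url": "http://example.com/football", "topics": ["football"],
         "score": 1.0},
        {"url": "http://example.com/basketball", "topics": ["basketball"],
         "score": 1.0},
    ]
    football_fan = {"football": 0.5}
    basketball_fan = {"basketball": 0.5}
    print(personalize(sports_results, football_fan)[0]["url"])
    print(personalize(sports_results, basketball_fan)[0]["url"])
```

The same result list produces a different order for each visitor. The hard part, as the text notes, is collecting the visitor information that such a profile would be built from.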

Search engine ranking technology should therefore develop toward resolving these two deficiencies: semantic relevance and personalized ranking. The former requires mature natural language processing technology, and the latter requires recording huge amounts of visitor information and performing complex calculations. Neither is easy to achieve. The task of solving these problems falls on the shoulders of scientists and engineers, and whichever search engine solves them may become the next ruler of the search world.
