Understanding search engines to perform SEO


The workings of a search engine are complex, but the process can be divided into three basic stages:

· Crawling and fetching: search engine spiders visit pages by following links and store the pages' HTML code in a database.

· Preprocessing: the indexing program performs text extraction, Chinese word segmentation, indexing, and other processing on the fetched page data, ready for the ranking program to call.

· Ranking: after the user enters a keyword, the ranking program calls the index library data, computes relevance, and then builds the search results page in a specific format.

Crawling and Fetching

Crawling and fetching is the first step of search engine work; it accomplishes the task of data collection.

Spider

The program a search engine uses to crawl and visit pages is called a spider, also known as a robot (bot).

Common spider user-agent names:

· Baidu Spider: baiduspider+ (+http://www.baidu.com/search/spider.htm)

· Yahoo China Spider: mozilla/5.0 (compatible; Yahoo! slurp China; http://misc.yahoo.com.cn/help.html)

· English Yahoo Spider: mozilla/5.0 (compatible; Yahoo! slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)

· Google Spider: mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)

· Microsoft Bing Spider: msnbot/1.1 (+http://search.msn.com/msnbot.htm)

· Sogou Spider: sogou+web+robot+ (+http://www.sogou.com/docs/help/webmasters.htm#07)

· Soso Spider: sosospider+ (+http://help.soso.com/webspider.htm)

· Youdao Spider: mozilla/5.0 (compatible; yodaobot/1.0; http://www.yodao.com/help/webmaster/spider/;)

Tracking Links

To crawl as many pages on the web as possible, search engine spiders follow the links on each page, crawling from one page to the next, much like a spider moving across a web; this is where the name "spider" comes from. The simplest crawl traversal strategies come in two kinds: depth-first and breadth-first.

Depth-First Search

In depth-first search, the search expands only one child node at each level of the search tree, pushing deeper and deeper until it can advance no further (reaching a leaf node or a depth limit), then backtracks to the previous node and proceeds in another direction. The search tree of this method takes shape branch by branch from the root.

Depth-first search is also known as vertical search. Because a solution tree may contain infinite branches, a depth-first search that strays into an infinite branch (infinite depth) can never find the target node, so the depth-first strategy is incomplete. In addition, the solution it finds is not necessarily the best one (the shortest path).

Breadth-First Search

Breadth-first search is the opposite: nodes with smaller depth are expanded first, so that all the nodes at one level are processed before the search moves down to the next level. In other words, the earlier a node's generation, the earlier it is processed. This search algorithm is called breadth-first search.

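To make the two strategies concrete, here is a minimal sketch of link traversal in Python. The get_links helper is a hypothetical stand-in for the fetch-and-parse work a real spider does; the only difference between the two strategies is whether the frontier is used as a queue or a stack.

```python
from collections import deque

def crawl(seed_url, get_links, strategy="breadth", max_pages=100):
    """Traverse pages starting from seed_url by following links.

    get_links(url) -> list of URLs found on that page; a hypothetical
    helper standing in for fetching and parsing the page HTML.
    "breadth" uses the frontier as a FIFO queue (shallow pages first);
    "depth" uses it as a LIFO stack (one branch as far as it goes).
    """
    frontier = deque([seed_url])
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft() if strategy == "breadth" else frontier.pop()
        if url in visited:
            continue  # never fetch the same page twice
        visited.add(url)
        for link in get_links(url):
            if link not in visited:
                frontier.append(link)
    return visited
```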

Attracting spiders

Which pages are considered more important? Several factors come into play:

· Site and page weight. A high-quality site with a long track record is considered to carry high weight; pages on such a site are crawled to a greater depth, so more of its inner pages get included in the index.

· Page update frequency. Each time a spider crawls a page, it stores the page data. If a second crawl finds the page exactly the same as the first copy, the page is evidently not being updated, and the spider has no need to fetch it often. If the page content is updated frequently, the spider visits the page more often, and new links appearing on the page are naturally followed sooner, so new pages get crawled faster.

· Inbound links. Whether from external sites or from within the same site, a page must have inbound links for a spider to reach it at all; otherwise the spider has no way of knowing the page exists. High-quality inbound links also often increase the crawl depth of the links the page exports. In general, the home page carries the highest weight on a site: most external links point to the home page, and it is the page spiders visit most often. The closer a page is to the home page, the higher its weight and the greater its chance of being crawled by a spider.

Address Library

To avoid crawling the same URLs repeatedly, search engines maintain an address library that records which pages have yet to be crawled and which have already been crawled. URLs enter the address library from several sources:

(1) Seed sites entered manually.

(2) After the spider crawls a page, it parses new link URLs out of the HTML and compares them against the address library; a URL not already in the library is stored as a URL to be visited.

(3) URLs submitted by webmasters through the search engine's page submission forms.

The spider takes URLs from the to-be-visited portion of the address library, visits and crawls the pages, then moves those URLs from the to-be-visited list into the visited list. A toy version of this bookkeeping is sketched below.
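A minimal sketch, assuming a simple in-memory design; the class and method names are illustrative, not any engine's real implementation:

```python
class AddressLibrary:
    """Toy address library: records URLs waiting to be crawled and
    URLs already crawled, so no page is fetched twice."""

    def __init__(self, seed_urls):
        self.to_visit = list(seed_urls)   # (1) manually entered seeds
        self.visited = set()

    def add(self, url):
        # (2)/(3) URLs parsed from crawled pages or submitted through
        # forms: stored only if the library has never seen them
        if url not in self.visited and url not in self.to_visit:
            self.to_visit.append(url)

    def next_url(self):
        # the spider takes a URL to visit; once taken, it moves from
        # the to-be-visited list into the visited set
        url = self.to_visit.pop(0)
        self.visited.add(url)
        return url

lib = AddressLibrary(["http://example.com/"])
lib.add("http://example.com/page1")
print(lib.next_url())  # -> http://example.com/
```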

Most mainstream search engines provide a form for webmasters to submit URLs. However, submitted URLs are only stored in the address library; whether a page is actually included still depends on its importance. The vast majority of pages included by search engines are found by spiders following links themselves. It is fair to say that submitting pages is basically useless: search engines prefer to discover new pages along links.

File Storage

The data crawled by search engine spiders is stored in the original page database. The page data stored there is exactly the same HTML the user's browser receives. Each URL has a unique file number.

Duplicate Content Detection While Crawling

Detecting and removing duplicate content is usually done during the preprocessing stage described below, but spiders also perform a degree of duplicate-content detection while crawling and fetching pages. When a spider encounters large amounts of reprinted or copied content on a low-weight site, it may well stop crawling. This is why some webmasters find spider visits in their log files yet never see those pages actually included in the index.

Preprocessing

In some SEO materials, "preprocessing" is also referred to as "indexing", because indexing is the most important step of preprocessing.

The original pages crawled by search engine spiders cannot be used directly for query ranking. A search engine's database holds trillions of pages or more; if the ranking program had to analyze the relevance of that many pages in real time each time a user entered a search term, the computation would be far too large to return ranked results within a second or two. The crawled pages must therefore be preprocessed to prepare for the final query ranking.

Like crawling and fetching, preprocessing happens in advance in the background; the user does not perceive this process when searching.

1. Extracting Text

Today's search engines are still based primarily on text content. The HTML code of a page crawled by the spider contains, in addition to the visible text a user sees in the browser, large amounts of HTML formatting tags, JavaScript programs, and other content that cannot be used for ranking. The first thing search engine preprocessing does is strip the tags and programs out of the HTML file, extracting the text content of the page that can be used for ranking. Take, for example, a page whose code wraps a short phrase in layers of tags and scripts but displays to the user only:

April Fool's Day

After the HTML code is removed, the remaining text used for ranking is just that one line:

April Fool's Day

In addition to the visible text, search engines also extract text from certain special code, such as meta tag text, image alt (replacement) text, Flash file replacement text, and link anchor text.
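To make the idea concrete, here is a minimal sketch of text extraction using Python's standard html.parser; the tag choices and the sample snippet are illustrative assumptions, not the actual pipeline of any search engine:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags, scripts, and styles, keeping visible text plus a
    few special fields search engines also index: meta description
    text, image alt text, and (via handle_data) link anchor text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "img" and attrs.get("alt"):
            self.parts.append(attrs["alt"])              # image alt text
        elif tag == "meta" and attrs.get("name") == "description":
            self.parts.append(attrs.get("content", ""))  # meta text

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip -= 1

    def handle_data(self, data):
        if self.skip == 0 and data.strip():
            self.parts.append(data.strip())

html = ('<div id="post"><script>var x = 1;</script>'
        '<a href="/fool">April Fool\'s Day</a></div>')
extractor = TextExtractor()
extractor.feed(html)
print(" ".join(extractor.parts))  # -> April Fool's Day
```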

2. Chinese Word Segmentation

Word segmentation is a step unique to Chinese search engines. Search engines store and process pages, and handle user searches, on the basis of words. In English and other such languages, words are separated by spaces, so the indexing program can divide a sentence directly into a set of words. Between Chinese words there is no delimiter: all the characters of a sentence run together. A search engine must therefore first determine which characters form a word and which characters are words on their own. For example, "weight loss method" would be segmented into the two words "weight loss" and "method".

There are basically two kinds of Chinese word segmentation methods: one based on dictionary matching, the other based on statistics.

Dictionary-based matching compares the string of Chinese characters being analyzed against the entries of a pre-built dictionary; when a word already in the dictionary is found while scanning the string, the match succeeds, in other words a word is cut out.

By scanning direction, dictionary-based matching divides into forward matching and reverse matching; by which match length takes priority, it divides into maximum matching and minimum matching. Mixing scanning direction with length priority yields methods such as forward maximum matching and reverse maximum matching.

Dictionary matching is computationally simple, and its accuracy depends to a great extent on how complete and up to date the dictionary is.
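As an illustration of the forward maximum matching just described, here is a small sketch; the toy dictionary and strings are stand-ins for real Chinese lexicon entries:

```python
def forward_max_match(text, dictionary, max_len=6):
    """Forward maximum matching: scan left to right, at each position
    cutting out the longest dictionary word that matches; fall back
    to a single character when nothing matches. A reverse variant
    scans right to left instead."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# single letters stand in for Chinese characters, "abc" for a word
print(forward_max_match("abcd", {"ab", "abc", "d"}))  # -> ['abc', 'd']
```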

Statistics-based segmentation analyzes large samples of text and computes the statistical probability of characters appearing adjacent to one another: the more often several characters appear next to each other, the more likely they form a word. The advantage of the statistical approach is that it reacts more quickly to new words and also helps resolve ambiguity.

Dictionary-based matching and statistics-based segmentation each have advantages and disadvantages, so segmenters in actual use mix the two methods: fast and efficient, yet also able to recognize new words and resolve ambiguity.

The accuracy of Chinese word segmentation often affects the relevance of search engine rankings. For example, searching Baidu for "search engine optimization", the snapshot shows that Baidu treats the six characters of "search engine optimization" as a single word.

Searching Google for the same phrase, the snapshot shows Google cutting it into the two words "search engine" and "optimization". Baidu's segmentation is clearly more reasonable here: search engine optimization is a single, complete concept. Google's segmenter tends to cut more finely.

This difference in segmentation is probably one of the reasons some keywords rank differently in different search engines. For example, Baidu prefers pages on which the search phrase appears as a complete, contiguous match: when searching for "enough drama blog", pages where those words appear contiguously and in full more easily earn a good ranking in Baidu. Google is different and does not require a complete match. On some pages "enough drama" and "blog" both appear, but not contiguously: "enough drama" appears near the front and "blog" elsewhere on the page. Such a page can still rank well in a Google search for "enough drama blog".

How a search engine segments a page depends on the size and accuracy of its lexicon and on the quality of its segmentation algorithm, not on the page itself, so there is very little SEO staff can do about segmentation. The one thing that can be done is to hint to the search engine, in some form on the page, that certain characters should be treated as one word, especially where ambiguity is possible, for example by placing the keyword in the page title, an H1 tag, or bold text. If a page is about "kimono" (和服), those two characters can be deliberately set in bold; if a page is about "makeup and clothing" (化妆和服装), in which the same two characters happen to appear in sequence, the characters for "clothing" (服装) can be bolded instead. This way, when the search engine analyzes the page, it knows that the bolded characters should form a word.

3. Removing Stop Words

Whether in English or Chinese, pages contain some words that occur frequently yet contribute nothing to the content: particles, interjections, and function words such as certain adverbs and prepositions. Because they have no effect on the main meaning of a page, these words are called stop words. Common English stop words include the, a, an, to, and of.

Search engines remove these stop words before indexing a page; this makes the subject of the indexed data stand out more and cuts unnecessary computation.
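A minimal sketch of the step, assuming a small hand-picked stop list (real engines maintain much larger ones):

```python
# a tiny hand-picked stop list; real engines maintain much larger ones
STOP_WORDS = {"the", "a", "an", "to", "of"}

def remove_stop_words(tokens):
    """Drop high-frequency words that carry no topical meaning before
    the remaining tokens are indexed."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "history", "of", "search", "engines"]))
# -> ['history', 'search', 'engines']
```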

4. Eliminating Noise

Most pages also contain content that contributes nothing to the page's topic, such as copyright notices, navigation bars, and advertisements. Take common blog navigation as an example: navigation such as article categories and historical archives appears on almost every blog page, yet the page itself has nothing to do with the words "category" or "history". Returning a blog post for a search on "history" or "category" merely because those words appear in the navigation is meaningless and completely irrelevant to the user. Such blocks are noise; they can only dilute the page's topic.

Search engines need to identify and eliminate this noise and leave it out of ranking. The basic method is to divide the page into blocks according to its HTML tags, distinguishing header, navigation, body, footer, advertising, and other regions; blocks that repeat in large numbers across a site usually belong to the noise. Once a page is de-noised, what remains is its main content.

5. Deduplication

Search engines also need to deduplicate pages.

The same article is often reposted on different sites, and on different URLs of the same site; search engines do not like this duplicated content. If a user searches and the first two pages of results are the same article from different sites, the user experience is poor even though every result is relevant. The search engine wants to return only one copy of any given article, so duplicates must be identified and removed before indexing. This process is called "deduplication".

The basic method of deduplication is to compute a fingerprint from the page's feature keywords: select the most representative keywords from the page's main content (often the most frequently occurring ones) and compute a digital fingerprint from them. The keywords are selected after segmentation, stop-word removal, and noise elimination. Experiments show that selecting about 10 feature keywords usually achieves high accuracy; selecting more words contributes little further improvement.

A typical fingerprinting method is the MD5 algorithm (Message-Digest Algorithm 5). A characteristic of this class of fingerprint algorithms is that any slight change in the input (the feature keywords) leads to a very different computed fingerprint.
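Here is a minimal sketch of the fingerprint idea described above, using Python's standard hashlib; the choice of the ten most frequent tokens and the joining scheme are illustrative assumptions:

```python
import hashlib
from collections import Counter

def page_fingerprint(tokens, n=10):
    """Compute a deduplication fingerprint from a page's feature
    keywords. The tokens are assumed to have already been through
    segmentation, stop-word removal, and noise elimination; the n
    most frequent ones are hashed with MD5, so any change in the
    feature keywords produces a completely different digest."""
    top = sorted(word for word, _ in Counter(tokens).most_common(n))
    return hashlib.md5(" ".join(top).encode("utf-8")).hexdigest()

tokens = ["seo", "spider", "index", "seo", "rank", "index", "seo"]
print(page_fingerprint(tokens))
```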

Understanding the search engine's deduplication algorithm, SEO staff should realize that so-called pseudo-original tricks, such as simply inserting function words or shuffling paragraph order, cannot escape deduplication, because such operations do not change the article's feature keywords. Moreover, the search engine's deduplication likely works not only at the page level but at the paragraph level; mixing different articles together or swapping the order of paragraphs cannot turn a reprint or plagiarized copy into original content.

6. Forward Indexing

A forward index can also be called, simply, an index.

After text extraction, segmentation, noise elimination, and deduplication, the search engine is left with unique, word-level content that reflects the main substance of the page. The indexing program then extracts keywords according to the segmenter's output, converting the page into a set of keywords, while recording each keyword's frequency on the page, its number of occurrences, its format (for example, appearing in the title tag, bold, H tags, or anchor text), and its position (for example, the first paragraph of the page). In this way every page is recorded as a set of keywords, along with weight information such as each keyword's frequency, format, and position.

The search engine indexing program stores the pages and keywords in an index library. Table 2-1 shows a simplified form of the index library.

Each file corresponds to a file ID, and the file's content is represented as a set of keywords. In the actual search engine index library, the keywords have themselves been converted to keyword IDs. A data structure like this is called a forward index.
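In the spirit of Table 2-1, a toy forward index might look like this; the field names are illustrative assumptions:

```python
# Forward index sketch: each file ID maps to the keywords it contains,
# with per-keyword weight information such as frequency, format, and
# position (all field names here are illustrative).
forward_index = {
    "file_1": {
        "kw_1": {"frequency": 5, "format": ["title", "bold"],
                 "position": "first_paragraph"},
        "kw_2": {"frequency": 2, "format": ["anchor"],
                 "position": "body"},
    },
    "file_2": {
        "kw_2": {"frequency": 7, "format": ["h1"], "position": "body"},
    },
}
```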

7. Inverted index

A forward index cannot yet be used directly for ranking. Suppose a user searches for keyword 2: with only a forward index, the ranking program would have to scan every file in the index library, find those containing keyword 2, and then compute relevance. That amount of computation cannot satisfy the need to return ranking results in real time.

Therefore, the search engine rebuilds the forward index database into an inverted index, converting the file-to-keyword mapping into a keyword-to-file mapping, as shown in Table 2-2.

In the inverted index, the keyword is the primary key; each keyword corresponds to a series of files, each of which contains that keyword. When a user searches for a keyword, the ranking program locates the keyword in the inverted index and can immediately find every file containing it.
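Continuing the toy example, here is a minimal sketch of rebuilding a forward index into an inverted one; the simplified structure (keyword to per-file frequency) is an assumption for illustration:

```python
from collections import defaultdict

# simplified forward index: file ID -> {keyword: frequency}
forward_index = {
    "file_1": {"kw_1": 5, "kw_2": 2},
    "file_2": {"kw_2": 7, "kw_3": 1},
}

def invert(forward_index):
    """Rebuild the file-to-keyword mapping into a keyword-to-file
    mapping, so a query term can be looked up directly instead of
    scanning every file in the index library."""
    inverted = defaultdict(dict)
    for file_id, keywords in forward_index.items():
        for kw, freq in keywords.items():
            inverted[kw][file_id] = freq
    return inverted

inverted_index = invert(forward_index)
# inverted_index["kw_2"] -> {"file_1": 2, "file_2": 7}
```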

8. Link Relationship Calculation

Link relationship computation is another important part of preprocessing. All major search engines now include link-flow information between pages among their ranking factors. After crawling page content, the search engine must compute in advance which links on each page point to which other pages, what inbound links each page has, and what anchor text those links use. This complex web of link relationships forms the link weight of each site and page.

Google's PR (PageRank) value is the most important embodiment of this kind of link relationship. Other search engines perform similar computations, even though they do not call the result PR.
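As a rough illustration of the idea (a simplified sketch, not Google's actual algorithm), here is a minimal PageRank-style iteration in Python; the damping factor, iteration count, and toy link graph are all assumptions:

```python
def pagerank(links, damping=0.85, iterations=20):
    """Minimal PageRank-style iteration: links maps each page to the
    pages it links to. Each page repeatedly distributes its current
    score across its outgoing links, so well-linked pages accumulate
    higher scores. Real engines add many refinements (anchor text,
    dangling pages, spam detection) omitted here."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank

links = {"home": ["about", "post"], "about": ["home"],
         "post": ["home", "about"]}
print(pagerank(links))  # home accumulates the highest score
```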

Because the number of pages and links is enormous, and the links across the Internet are constantly changing, computing link relationships and PR takes a long time. PR and link analysis are covered in a dedicated section later.

9. Special File Handling

In addition to HTML files, search engines can usually crawl and index a variety of text-based file types, such as PDF, Word, WPS, XLS, PPT, and TXT files, and we often see these file types in search results. But current search engines cannot handle non-text content such as images, video, and Flash, nor can they execute scripts and programs.

Although search engines have made some progress in recognizing images and extracting text from Flash, they remain far from the goal of returning results by directly reading image, video, and Flash content. The ranking of image and video content is usually based on the associated text; see the integrated search section later for details.

Ranking

After the spiders have crawled pages and the indexing program has computed the inverted index, the search engine is ready to handle user searches at any moment. When a user enters a keyword in the search box, the ranking program calls the index library data, computes the ranking, and displays the results to the user. The ranking process is where the search engine interacts with the user directly.
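Tying the pieces together, here is a minimal sketch of what happens at query time, reusing the toy inverted index structure from above; the intersection-and-score logic is a deliberately crude stand-in for real relevance calculation:

```python
def search(query_terms, inverted_index):
    """Query-time sketch: look up each term in the inverted index,
    keep only files containing every term, and order them by a crude
    relevance score (summed term frequency). Real engines combine far
    more signals, such as link weight, format, and position."""
    candidates = None
    for term in query_terms:
        files = set(inverted_index.get(term, {}))
        candidates = files if candidates is None else candidates & files
    candidates = candidates or set()
    scores = {f: sum(inverted_index[t][f] for t in query_terms)
              for f in candidates}
    return sorted(scores, key=scores.get, reverse=True)

# toy structure: keyword -> {file ID: frequency}
inverted_index = {"kw_1": {"file_1": 5}, "kw_2": {"file_1": 2, "file_2": 7}}
print(search(["kw_2"], inverted_index))  # -> ['file_2', 'file_1']
```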


Source: Lu Songsong Blog (QQ: 13340454). You are welcome to share this article; if you reproduce it, please keep the source: http://lusongsong.com/reed/1589.html

