A brief analysis of Baidu Spider crawling

Source: Internet
Author: User

These days I have been busy with website and product promotion. I don't yet understand much of it, but many of the terms in this field caught my attention. First came SEO; while learning how SEO works I encountered "external links", and while studying external links I ran into "spider crawling". Receiving so much new information at once felt rather magical; SEO is indeed not simple.

Today I want to talk about the term "spider crawling". I am certainly not the first to write about it, since I am only a latecomer, but I hope my description helps more people understand it. After all, many professional introductions are so technical that ordinary readers find them impossible to follow.

First, let me introduce Baidu indexing. There are a great many websites in the world, and countless web pages, just like our population of more than six billion people. Some people are very influential, such as Jackie Chan and Bruce Lee, while the rest of us are nobodies, humble and unknown. If you make a big contribution to the world, fame follows naturally. Put another way: if you make a "contribution" on the network, Baidu will index you, that is, include your web address. If your site is indexed and carries enough prestige, you may appear at the top of Baidu's search results, and the top position always draws attention. Because everyone fights for that position, SEO (Search Engine Optimization) was born.

Next, the collected content is stored in an orderly way in a single repository, which in the network world has a fine name: the "database". I won't go into database theory here; for our purposes it is enough to know that it saves or records data in a fixed format, and "spider crawling" relies on it. As for the "spider" itself: of course it is not the spider we see every day. Simply put, it is a computer program, and the crawling process is the execution of an algorithm (which should not be understood as everyday arithmetic; think of it instead as the plan for an activity). Recently Baidu seems to have changed its search algorithm, but exactly how it changed is something we will come to understand slowly.
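To make the idea of "saving data in a fixed format" concrete, here is a minimal sketch of recording crawled pages in a database. This is not Baidu's actual system; the table layout and field names are my own invention for illustration.

```python
import sqlite3

# A minimal sketch (not Baidu's real schema): an index database that
# records each crawled URL along with its title and fetch time.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE pages (
           url        TEXT PRIMARY KEY,  -- the page's address
           title      TEXT,              -- title extracted at crawl time
           fetched_at TEXT               -- when the spider downloaded it
       )"""
)
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    ("http://example.com/", "Example Domain", "2013-01-01 00:00:00"),
)
conn.commit()

# Retrieval: look a page up by its address, just as a search index would.
row = conn.execute(
    "SELECT title FROM pages WHERE url = ?", ("http://example.com/",)
).fetchone()
print(row[0])  # Example Domain
```

The point is only that every record has the same fixed fields, so the crawler can store pages and the search side can look them up in an orderly way.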

"Spider crawling" image a little, there are vertical crawling also have horizontal crawling, that is, our computer professional terms of depth traversal and breadth traversal, and traversal of the content is large and small Web site or Web page, after the spider actively download the Web page, and then download the Web page through a variety of programs after the calculation before putting to the retrieval area, Will form a stable ranking, and then be included in the database Baidu, the final display on the Baidu page. And here, Baidu sent more than a "spider", but more than one, or 10, or hundreds, thousands, more or million, hundreds of thousands of, in short, it is certainly a lot of numbers, and sent spiders here is the computer terminology: thread. It is obvious that multiple spiders are multiple threads, and the efficiency of multi-threaded execution of search is high. A number of "spiders" together search, is a breadth of search, a "spider" along a certain rule to go down, is a deep search. and web search depth first and breadth first, Baidu spiders scratch the page from the starting site (that is, the seed site refers to some portals) is the breadth of first crawl is to crawl more URLs, the depth of the first crawl is to capture the quality of the Web page, this strategy is calculated and distributed by the Dispatch, Baidu Spider Spiders are only responsible for crawling, weight priority refers to the reverse connection more pages of priority crawl, this is a strategy of scheduling, the general situation of the Web crawl caught 40% is the normal range, 60% is very good, 100% is impossible, of course, the more the better. 
While learning about all this, I stumbled on an article about spider crawling and security. It explained that a spider chooses which sites to traverse first and automatically avoids sites with loopholes, lest it fall into them, which caught my attention. If I remember correctly, the article said spiders traverse static websites first, because a dynamic site may contain dead loops from which a spider cannot escape. Before searching a site, a spider generally checks its safety first, and when it detects such destructive traps, it avoids them. I think this is worth taking seriously: when building a dynamic website, be rigorous with your own program code so that the site has no such loopholes, otherwise in the end no spider will dare to enter.
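As a rough illustration of how a crawler might guard against such "dead loop" traps, here is a small sketch. These heuristics are hypothetical and of my own devising, not Baidu's actual rules: a depth cap plus a limit on query parameters, since dynamic pages (a calendar, for example) can generate endless unique links.

```python
from urllib.parse import urlparse, parse_qs

MAX_DEPTH = 5        # illustrative cap, not a real crawler's setting
MAX_QUERY_PARAMS = 4 # likewise hypothetical

def is_probable_trap(url, depth):
    """Guess whether following this URL risks a crawler trap.

    Refuses URLs found too many links away from the seed, or dynamic
    URLs carrying suspiciously many query parameters.
    """
    if depth > MAX_DEPTH:
        return True
    parsed = urlparse(url)
    if len(parse_qs(parsed.query)) > MAX_QUERY_PARAMS:
        return True
    return False

print(is_probable_trap("http://example.com/page?a=1&b=2", depth=2))             # False
print(is_probable_trap("http://example.com/cal?y=1&m=2&d=3&v=4&s=5", depth=2))  # True
```

Real crawlers use more sophisticated trap detection, but the underlying idea is the same: bound the traversal so that an endlessly self-generating dynamic site cannot hold the spider forever.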

That is all for today. There is much room for improvement, and I welcome your corrections! When reprinting, please credit: Asian Ceramics Mall: www.asiachinachina.com
