"Web spider" is a very vivid name: the Internet is likened to a spider's web, and the spider is a program that crawls back and forth across that web. A web spider finds pages through their link addresses. It starts from one page of a website (usually the home page), reads the content of that page, finds the other links it contains, and then follows those link addresses to the next pages, continuing this cycle until every page of the site has been crawled. If the whole Internet is treated as one website, then a web spider can use this principle to crawl every page on the Internet.
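A minimal sketch of this crawl cycle in Python, using only the standard library; the seed URL is a placeholder, and a real spider would add politeness delays, robots.txt checks, and better error handling:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed):
    visited = set()
    frontier = deque([seed])           # pages waiting to be fetched
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue                   # skip unreachable pages
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))  # resolve relative links
    return visited

# crawl("http://example.com/")         # hypothetical starting page
```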
For search engines, crawling every page on the Internet is almost impossible; by current figures, even the largest search engine has crawled only about 40% of all web pages. One reason is a bottleneck in crawling technology: the spider cannot traverse every page, since many pages cannot be reached through links from other pages. Another reason lies in storage and processing capacity: if an average page is 20 KB (including images), then 10 billion pages come to 100 × 2,000 GB. Even if that much could be stored, downloading it is still a problem: at a rate of 20 KB per second per machine, it would take 340 machines downloading non-stop for a year to fetch all the pages.
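These back-of-envelope figures from the text check out when worked through explicitly:

```python
# Reproducing the article's storage and download estimate.
pages    = 10_000_000_000             # 10 billion pages
page_kb  = 20                         # average page size, 20 KB
total_gb = pages * page_kb / 1_000_000
print(total_gb)                       # 200,000 GB, i.e. 100 x 2,000 GB

pages_per_sec = 1                     # 20 KB/s per machine = one page per second
machines      = 340
seconds       = pages / (pages_per_sec * machines)
print(seconds / (3600 * 24 * 365))    # ~0.93, so roughly one year
```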
At the same time, because the volume of data is so large, it also affects the efficiency of serving search results. As a result, the spiders of many search engines crawl only the important pages, and the importance of a page is judged mainly by its link depth. When crawling pages, web spiders generally follow one of two strategies: breadth-first or depth-first. Breadth-first means the spider first crawls all the pages linked from the starting page, then picks one of those linked pages and crawls all the pages it links to, and so on. This is the most common approach, because it lets the spider process pages in parallel and so improves its crawl speed. Depth-first means the spider starts from the starting page and follows one chain of links, link after link, and only after finishing that chain does it return to the next starting link and continue tracking links.
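The two strategies differ only in how the frontier of pending links is ordered; a sketch, where extract_links is a stub standing in for the fetch-and-parse step sketched earlier:

```python
from collections import deque

def extract_links(url):
    """Placeholder for the fetch-and-parse step sketched earlier."""
    return []

frontier = deque(["http://example.com/"])   # hypothetical seed page
visited = set()

while frontier:
    url = frontier.popleft()   # breadth-first: take the OLDEST queued link
    # url = frontier.pop()     # depth-first: take the NEWEST link instead
    if url in visited:
        continue
    visited.add(url)
    for link in extract_links(url):
        frontier.append(link)
```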
One advantage of the depth-first approach is that the spider is easier to design. Since crawling every page is impossible, some web spiders set an access-depth limit for less important sites. For example, suppose A is the starting page at layer 0; B, C, D, E, and F are at layer 1; G and H are at layer 2; and I is at layer 3. If the spider's access depth is set to 2, page I will never be visited. This is also why parts of some websites can be found through a search engine while other parts cannot. For web designers, a flatter site structure helps search engines crawl more of their pages.
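Depth limiting is a small change to the breadth-first loop: track each page's layer alongside its URL and stop expanding past the limit. A sketch, again with the same placeholder extract_links:

```python
from collections import deque

def extract_links(url):
    """Placeholder for the fetch-and-parse step sketched earlier."""
    return []

MAX_LAYER = 2                                  # the access layer from the example

def crawl_limited(seed):
    visited = set()
    frontier = deque([(seed, 0)])              # (url, layer); the seed is layer 0
    while frontier:
        url, layer = frontier.popleft()
        if url in visited or layer > MAX_LAYER:
            continue                           # page I, at layer 3, is skipped
        visited.add(url)
        for link in extract_links(url):
            frontier.append((link, layer + 1)) # linked pages sit one layer deeper
    return visited
```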
When web spiders visit websites, they often run into encrypted data or page permissions; some pages can only be accessed with member privileges. Of course, a site owner can use a protocol to keep web spiders out (described in the next section). But some websites, such as those that sell reports, want search engines to find their reports without letting searchers read them entirely for free; such sites need to give the web spider a username and password. With these credentials, the spider can crawl the protected pages and make them searchable. When a searcher clicks through to such a page, the searcher is likewise asked to provide the corresponding authorization.
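One common way to hand a spider credentials is HTTP Basic authentication; a minimal sketch with Python's standard library, where the URL, username, and password are all placeholders:

```python
import urllib.request

# Placeholder credentials granted to the spider by the site owner.
url, user, password = "http://example.com/reports/", "spider-user", "secret"

auth = urllib.request.HTTPPasswordMgrWithDefaultRealm()
auth.add_password(None, url, user, password)
opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(auth))

html = opener.open(url).read()   # fetched as an authorized member
```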
Between websites and web spiders: a spider's crawling differs from ordinary access, and if it is not well controlled, it can overload the web server. In April of this year, Taobao's servers became unstable because Yahoo's search-engine spider was crawling its data. So can websites really not communicate with web spiders? In fact, there are several ways for them to communicate: on the one hand, these let webmasters know where a spider comes from and what it has done; on the other hand, they tell the spider which pages should not be crawled and which pages should be updated.
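The best-known of these mechanisms is the robots exclusion protocol covered in the next section; as a preview, a well-behaved spider checks a site's robots.txt before fetching, which Python's standard library supports directly (example.com is a placeholder):

```python
import urllib.robotparser

# Check whether a given spider may fetch a page on this site.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

if rp.can_fetch("Googlebot", "http://example.com/private/page.html"):
    pass  # allowed: go ahead and crawl the page
```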
Every web spider has its own name, and when it crawls a page it identifies itself to the website. The spider sends a request when it fetches a page, and a field in that request, User-agent, identifies the spider. For example, Google's spider identifies itself as Googlebot, Baidu's as Baiduspider, and Yahoo's as Inktomi Slurp. If the site keeps access logs, the webmaster can learn which search engines' spiders have visited, when they came, how much data they read, and so on. If the webmaster finds a problem with a spider, he can contact its owner through this identification.
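A sketch of checking a server access log for spider visits by matching these User-agent strings; the log path, and the assumption that each request is one log line containing the user agent, are hypothetical:

```python
# Count spider visits in a web server access log.
SPIDERS = ("Googlebot", "Baiduspider", "Slurp")

counts = {name: 0 for name in SPIDERS}
with open("/var/log/access.log") as log:      # placeholder log location
    for line in log:
        for name in SPIDERS:
            if name in line:
                counts[name] += 1
print(counts)                                 # visits per spider identity
```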