What is a web crawler (Spider) program

A spider, also known as a web crawler or robot, is a program that roams a collection of Web documents by following links. It typically resides on a server: starting from a given URL, it reads documents using a standard protocol such as HTTP, then takes all of the URLs contained in each document as new starting points and continues roaming until no new URLs meet its criteria. The main function of a web crawler is to automatically fetch Web documents from each Web site on the Internet and extract information that describes each document, such as its title, length, file creation time, and the number of links in the HTML file, providing raw data for appending to and updating the database of a search engine site.

1. Search Strategies

① IP address search strategy. The crawler is first given a starting IP address; it then searches for documents at each WWW address obtained by incrementing that IP address, regardless of the hyperlinks in those documents that point to other Web sites. The advantage is that the search is comprehensive and can find sources of new documents that are not referenced by any other document; the disadvantage is that it is not suitable for large-scale searches.

② Depth-first search strategy. Depth-first search was widely used in early crawlers. Its goal is to reach the leaf nodes of the searched structure, that is, HTML files that contain no further hyperlinks. When a hyperlink in an HTML file is selected, the linked HTML file is itself searched depth-first, which means a single chain of links is followed all the way down before the remaining hyperlinks are processed. The crawler follows hyperlinks until it can go no deeper, then returns to a previous HTML file and selects another of its hyperlinks; when no hyperlinks remain to choose from, the search ends. The advantage is that it can traverse a single Web site or a deeply nested collection of documents; the disadvantage is that the Web's structure can be very deep, and the crawler may go in and never come back out.

③ Width-first (breadth-first) search strategy. In a breadth-first search, all hyperlinks in a Web page are searched before the next layer is processed, continuing down to the bottom layer. For example, if an HTML file has three hyperlinks, one of them is selected and its corresponding HTML file is processed, but no hyperlink inside that second HTML file is followed; instead the crawler returns, selects the second hyperlink and processes its HTML file, then returns again, selects the third hyperlink and processes its HTML file. Only after all the hyperlinks on one layer have been processed does the crawler start searching the hyperlinks found in the HTML files it has just processed. This guarantees that shallow layers are processed first, so that a deep, endless branch does not cause the crawler to sink into a deep part of the WWW and never return. One advantage of the breadth-first strategy is that it can find the shortest path between two HTML files. Breadth-first search is often the best strategy for implementing a crawler because it is easy to implement and provides most of the desired behavior. However, if the goal is to traverse a specified site or a deeply nested set of HTML files, breadth-first search takes a long time to reach the deep HTML files.
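The practical difference between the two traversal orders is simply whether newly discovered links are taken from the frontier first-in-first-out (breadth-first) or last-in-first-out (depth-first). The Python sketch below is illustrative only and not taken from the original text; the helper names `LinkExtractor`, `fetch_links`, and `crawl` are invented for this example, and it assumes the pages are plain HTML reachable over HTTP.

```python
# Minimal sketch contrasting breadth-first and depth-first crawl orders.
# Not production code: no politeness delays, error handling is minimal.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_links(url):
    """Download one page and return the absolute URLs it links to."""
    try:
        with urlopen(url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
    except OSError:
        return []
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(url, link) for link in parser.links]


def crawl(start_url, max_pages=50, breadth_first=True):
    """Visit up to max_pages pages in breadth-first or depth-first order."""
    frontier = deque([start_url])
    visited = set()
    order = []
    while frontier and len(visited) < max_pages:
        # BFS takes from the front of the queue, DFS from the back (a stack).
        url = frontier.popleft() if breadth_first else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return order


if __name__ == "__main__":
    # Example start URL: replace with a site you are permitted to crawl.
    for page in crawl("https://example.com/", max_pages=10, breadth_first=True):
        print(page)
```

With `breadth_first=True` the crawler processes every link on one layer before descending, which matches the behavior described above; flipping the flag turns the same frontier into a stack and yields a depth-first walk.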
Considering the characteristics of the above strategies, together with the search requirements of domestic information navigation systems, a combination of the breadth-first search strategy and a linear search strategy is adopted in China. For HTML files that are not referenced, or only rarely referenced, by other documents, the breadth-first strategy may miss these orphaned information sources; a linear search strategy can supplement it.

④ Crawler strategy of professional search engines. At present, the web crawlers of professional (topic-specific) search engines usually follow a "best-first" principle when visiting the Web: to quickly and effectively retrieve as many pages relevant to the topic as possible (the "return"), the crawler selects the "most valuable" link to visit at each step. Because links are contained in pages, and high-value pages often contain links that also have high value, evaluating the value of a link is sometimes converted into evaluating the value of the page that contains it.

⑤ Problems to pay attention to in crawler design. The first problem is URL normalization: on the WWW, one URL can have several representations, for example an IP address or a domain name, so addresses must be normalized to prevent the crawler from repeatedly visiting the same address. The second problem is avoiding network traps: links on the Web can be complex, and some static Web pages may form a closed loop. To keep the crawler from looping, check whether a URL is already in the list of addresses to be searched before adding it; for dynamic Web pages, the crawler should ignore all URLs with parameters. The third problem concerns pages that deny access: the crawler should follow the site's deny-access (robots exclusion) rules.
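As a rough illustration of how these three checks might be wired together, the sketch below uses Python's standard-library `urllib.robotparser` for the exclusion rules; the names `normalize_url` and `should_visit`, and the choice of which normalizations to apply, are this example's own assumptions rather than part of the original text.

```python
# Sketch of the three design checks described above: URL normalization,
# loop avoidance, and respecting robots exclusion rules.
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def normalize_url(url):
    """Reduce equivalent URL spellings to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop default ports so http://host:80/ and http://host/ compare equal.
    if (scheme, parts.port) in (("http", 80), ("https", 443)):
        netloc = parts.hostname
    path = parts.path or "/"
    # Per the text, dynamic URLs (with parameters) are ignored, so the
    # query string is not part of the canonical form.
    return urlunsplit((scheme, netloc, path, "", ""))


def should_visit(url, seen, robots):
    """Decide whether a URL should be added to the crawl frontier."""
    parts = urlsplit(url)
    if parts.query:                            # skip dynamic pages with parameters
        return False
    canonical = normalize_url(url)
    if canonical in seen:                      # avoid loops and repeat visits
        return False
    if not robots.can_fetch("*", canonical):   # obey deny-access rules
        return False
    seen.add(canonical)
    return True


if __name__ == "__main__":
    robots = RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()

    seen = set()
    for candidate in ("https://EXAMPLE.com:443/index.html",
                      "https://example.com/index.html",
                      "https://example.com/search?q=spider"):
        print(candidate, "->", should_visit(candidate, seen, robots))
```

In this sketch the first two candidate URLs normalize to the same canonical form, so only the first is accepted, and the query-string URL is rejected outright, matching the advice to ignore parameterized (dynamic) pages.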
