What is web crawler technology?

Source: Internet
Author: User

A web crawler is an Internet bot that systematically browses the World Wide Web and is typically used for Web indexing (Web spidering).

Web search engines and some other sites use web crawlers to update their own web content or their indexes of other sites' content. A crawler can copy every page it visits so that a search engine can later process and index the downloaded pages, letting users search more efficiently.
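To illustrate the indexing step, here is a minimal sketch of an inverted index built from downloaded pages. The URLs and page texts are placeholders, and a real search engine index involves far more (tokenization, ranking, storage), but the core idea is simply mapping each term to the pages that contain it:

```python
# Minimal sketch: map each word in a downloaded page to the URLs that
# contain it, so a query can be answered without rescanning every page.
# The pages below are placeholder strings, not real crawl results.
from collections import defaultdict

downloaded_pages = {
    "https://example.com/a": "web crawlers browse the web",
    "https://example.com/b": "search engines index downloaded pages",
}

inverted_index = defaultdict(set)
for url, text in downloaded_pages.items():
    for word in text.lower().split():
        inverted_index[word].add(url)

# Every page that mentions "web":
print(inverted_index["web"])  # {'https://example.com/a'}
```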

Because the number of pages on the Internet is extremely large, even the largest crawlers fall short of indexing it completely. For this reason, search engines struggled to return relevant results in the early years of the World Wide Web, before 2000. Modern search engines have greatly improved on this.

A web crawler starts with a list of URLs to visit, called the "seeds." As the crawler visits these URLs, it identifies all the hyperlinks in the pages and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are visited recursively according to a set of policies. If the crawler is archiving websites, it copies and saves the information as it goes. The archives are usually stored so that they can be viewed, read, and navigated as if they were on the live web, but they are preserved as "snapshots."
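This loop maps naturally to a short script. The following is a minimal sketch, not a production crawler: the seed URL and page limit are illustrative, the use of requests and BeautifulSoup is an assumption, and it ignores politeness rules such as robots.txt and rate limiting:

```python
# Minimal crawl loop: start from seed URLs, fetch pages, extract links,
# and push unseen links onto the frontier.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=50):
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # avoid queueing the same URL twice
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue          # skip unreachable pages
        visited += 1
        # An archiving crawler would save response.text to disk here
        # as a "snapshot" of the page.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen

# Example seed list (hypothetical):
# crawl(["https://example.com"])
```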

The Web's large volume means that a crawler can only download a limited number of pages in a given time, so it needs to prioritize its downloads. Its high rate of change means that pages may already have been updated or even deleted by the time the crawler reaches them.
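One common way to express such prioritization is to replace the first-in, first-out frontier with a priority queue. The sketch below assumes each URL carries a hypothetical priority score (for example, estimated importance or expected change rate); how that score is computed is a crawl policy decision not covered here:

```python
# Sketch of a prioritized frontier: the crawler always fetches the
# highest-priority URL next instead of processing URLs in arrival order.
import heapq

class PriorityFrontier:
    def __init__(self):
        self._heap = []
        self._queued = set()

    def push(self, url, priority):
        # Lower numbers are fetched earlier.
        if url not in self._queued:
            heapq.heappush(self._heap, (priority, url))
            self._queued.add(url)

    def pop(self):
        priority, url = heapq.heappop(self._heap)
        return url

frontier = PriorityFrontier()
frontier.push("https://example.com/news", priority=0)   # changes often
frontier.push("https://example.com/about", priority=5)  # rarely changes
print(frontier.pop())  # https://example.com/news
```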

The number of possible URLs generated by server-side software also makes it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small fraction actually return unique content. For example, a simple online photo gallery may offer users a few options specified through HTTP GET parameters in the URL. If there are four ways to sort the images, three choices of thumbnail size, two file formats, and an option to disable user-supplied content, the same set of content can be reached through 4 × 3 × 2 × 2 = 48 different URLs, all of which may be linked from the site. This combinatorial explosion creates a problem for crawlers, because they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
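A typical mitigation is to canonicalize URLs before adding them to the frontier. The sketch below assumes parameter names like "sort", "thumb", "format", and "show_user_content" (hypothetical names, for illustration) affect only presentation, not content, so stripping them and sorting what remains collapses the many variants into one canonical form:

```python
# Sketch of URL canonicalization: drop presentation-only query parameters
# (hypothetical names below) and sort the rest, so many URL variants map
# to a single entry in the crawler's "seen" set.
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

PRESENTATION_PARAMS = {"sort", "thumb", "format", "show_user_content"}

def canonicalize(url):
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in PRESENTATION_PARAMS]
    query = urlencode(sorted(kept))  # stable parameter order
    return urlunparse(parts._replace(query=query))

a = canonicalize("https://example.com/gallery?album=7&sort=date&thumb=large")
b = canonicalize("https://example.com/gallery?thumb=small&album=7&sort=name")
print(a == b)  # True: both variants collapse to the same canonical URL
```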

In addition to students and teachers, many people crawl data in order to build products based on data modeling. Such products usually need continuously updated data to stay useful. This immediately rules out companies that sell "dead" (stale) data, and it also places real demands on crawl efficiency. Companies in China that meet these requirements are not numerous; I know of two worth paying attention to:

Spider (http://w3.zmatrix.cn; as far as I know, its team previously took part in developing Baidu's crawler and has some reputation in the field).

The other is the category of collector tools such as Locomotive, Octopus, GooSeeker, and so on.
