Nine Chapters Algorithm (Jiuzhang) interview question 44: Design a Web Crawler

Source: Internet
Author: User

Nine Chapters Algorithm official website, original article:


http://www.jiuzhang.com/problem/44/


Question

How would you design a basic web crawler? What factors need to be considered?



Answer

There is no standard answer. Cover as many considerations as you can.


Interviewer's Perspective

This is a common design question in interviews. If you have never done this kind of design work, it is hard to give an answer that satisfies the interviewer. The question is not limited to interviews at search engine companies. Below, we answer it from two angles: junior level and senior level.


1. How to abstract the entire Internet

Junior: Model it as a directed graph: each web page is a node, and each link on a page is a directed edge.

Senior: Same as above.
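
One way to picture this abstraction is an adjacency list, with pages as nodes and links as directed edges. The sketch below is only an illustration: the `LINKS` data and the `build_graph` helper are hypothetical names, not part of the original answer. It also counts inbound links per page, since that count is one of the signals a crawler may want later.

```python
from collections import defaultdict

# Hypothetical toy data: one (source page, target page) pair per hyperlink.
LINKS = [
    ("a.example/", "a.example/about"),
    ("a.example/", "b.example/"),
    ("b.example/", "a.example/"),
    ("b.example/", "b.example/news"),
]


def build_graph(links):
    """Return (out_edges, in_degree) for a directed link graph."""
    out_edges = defaultdict(list)  # page -> pages it links to
    in_degree = defaultdict(int)   # page -> number of inbound links
    for src, dst in links:
        out_edges[src].append(dst)
        in_degree[dst] += 1
        in_degree[src] += 0        # make sure every page has an entry
    return dict(out_edges), dict(in_degree)


out_edges, in_degree = build_graph(LINKS)
print(out_edges["a.example/"])
print(in_degree)
```

Because the edges are directed, a page such as `b.example/news` can have inbound links without appearing as a key in `out_edges`.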


2. Crawl algorithm

Junior: Use BFS. Maintain a queue; each time a page is crawled, parse its links and add them to the queue.
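
The junior answer can be sketched roughly as follows. To keep the example runnable without network access, `FAKE_WEB` and `fetch_links` are hypothetical stand-ins for downloading and parsing a page. The `seen` set is an addition that the answer does not mention; it is assumed here because links form cycles and a plain queue would otherwise never drain.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> links found on that page.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://b.example/x", "http://a.example/1"],
    "http://b.example/x": [],
}


def fetch_links(url):
    """Stand-in for download + HTML parsing; returns outgoing links."""
    return FAKE_WEB.get(url, [])


def bfs_crawl(seed, max_pages=100):
    """Crawl breadth-first from seed, visiting each URL at most once."""
    queue = deque([seed])
    seen = {seed}   # URLs already queued, so cycles do not loop forever
    order = []      # URLs in the order they were crawled
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(bfs_crawl("http://a.example/"))
```

In a real crawler `fetch_links` would issue an HTTP request and extract links from the HTML; the traversal logic itself would stay the same.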

Senior: Schedule with a priority queue. Unlike plain BFS, each page is assigned a crawl weight, and pages with higher weights are crawled first. Factors to consider when setting the weight include: 1. whether the page belongs to a popular site; 2. the length of the link; 3. the weights of the pages that link to it; 4. the number of pages that link to it; and so on. Going further, even a popular site cannot be crawled without limit, so two-level scheduling is needed: first schedule which site to crawl, then, within the chosen site, schedule which pages to crawl. The advantage is politeness: the crawl rate on any single site is limited, and pages on other sites also get a chance to be fetched.
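
A minimal sketch of two-level scheduling is shown below, under some stated assumptions. The class name `TwoLevelScheduler` and its methods are hypothetical. The site level here is a simple round-robin rotation, which is only one possible policy (the answer does not say how sites should be chosen), and each page's weight is passed in as a number rather than computed from the factors listed above.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class TwoLevelScheduler:
    """Level 1 rotates over sites; level 2 picks the best page in a site."""

    def __init__(self):
        self._pages = {}       # site -> heap of (-weight, url)
        self._sites = deque()  # sites with pending pages, in rotation order
        self._seen = set()     # URLs already scheduled

    def add(self, url, weight):
        """Queue a URL under its site unless it was scheduled before."""
        if url in self._seen:
            return
        self._seen.add(url)
        site = urlsplit(url).netloc
        if site not in self._pages:
            self._pages[site] = []
            self._sites.append(site)
        # heapq is a min-heap, so negate the weight to pop the largest first.
        heapq.heappush(self._pages[site], (-weight, url))

    def next_url(self):
        """Return the next URL to crawl, or None when nothing is pending."""
        if not self._sites:
            return None
        site = self._sites.popleft()
        _, url = heapq.heappop(self._pages[site])
        if self._pages[site]:
            self._sites.append(site)  # send the site to the back of the line
        else:
            del self._pages[site]
        return url


scheduler = TwoLevelScheduler()
scheduler.add("http://a.example/1", 1.0)
scheduler.add("http://a.example/2", 5.0)
scheduler.add("http://a.example/3", 3.0)
scheduler.add("http://b.example/1", 0.5)
print([scheduler.next_url() for _ in range(4)])
```

Even though every page on `a.example` outweighs the one on `b.example`, the rotation lets `b.example` be served after a single fetch from `a.example`. A production crawler would more likely enforce a per-site minimum delay between requests instead of strict rotation.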

