Jiuzhang Algorithm (Nine Chapters) official website, original article URL:
http://www.jiuzhang.com/problem/44/
Question
If you were asked to design a basic web crawler, how would you design it? What factors need to be considered?
Answer
There is no standard answer. Cover as many considerations as you can.
Interviewer's Perspective
This is a common design question in interviews. If you have never done related design work, it is hard to give an answer that satisfies the interviewer. The question is not limited to interviews at search engine companies. Here we answer it from two angles: junior level and senior level.
1. How to abstract the entire Internet
Junior: Model it as a graph: each web page is a node, and each link in a page is a directed edge to the page it points to.
Senior: Same as above.
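The graph abstraction above can be sketched in a few lines of Python. This is only an illustration: the class name WebGraph, its methods, and the example.* URLs are hypothetical and do not come from the original article. Storing inbound links alongside outbound ones is a design choice that makes it cheap to count how many pages point to a given page.

```python
from collections import defaultdict


class WebGraph:
    """Directed graph: nodes are page URLs, edges are hyperlinks (hypothetical helper)."""

    def __init__(self):
        self.out_links = defaultdict(set)  # page -> pages it links to
        self.in_links = defaultdict(set)   # page -> pages that link to it

    def add_link(self, src, dst):
        # A link on page `src` pointing at page `dst` is one directed edge.
        self.out_links[src].add(dst)
        self.in_links[dst].add(src)


# Made-up pages, used only to show the structure.
graph = WebGraph()
graph.add_link("http://a.example/", "http://b.example/")
graph.add_link("http://a.example/", "http://c.example/")
graph.add_link("http://b.example/", "http://c.example/")
```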
2. Crawl algorithm
Junior: Use BFS. Maintain a queue; each time a page is crawled, parse the links it contains and put them into the queue.
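A minimal sketch of the BFS approach, under some assumptions: the function name bfs_crawl, the fetch_links callback, and the max_pages limit are hypothetical, and a small in-memory dictionary stands in for the real web so the example runs without network access. The seen set is an addition not mentioned above; without it, pages that link to each other would be queued forever.

```python
from collections import deque


def bfs_crawl(seed, fetch_links, max_pages=100):
    """Crawl breadth-first from `seed`; return pages in the order visited."""
    queue = deque([seed])
    seen = {seed}          # URLs already queued, so no page is crawled twice
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)                  # "crawl" the page
        for link in fetch_links(url):      # parse its links
            if link not in seen:
                seen.add(link)
                queue.append(link)         # throw them into the queue
    return order


# Toy stand-in for the web: page -> links found on that page.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}

print(bfs_crawl("a", lambda u: TOY_WEB.get(u, [])))  # ['a', 'b', 'c', 'd']
```

In a real crawler, fetch_links would download the page and extract its links; that part is left out here on purpose.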
Senior: Use priority-queue scheduling. Unlike plain BFS, each page is assigned a crawl weight, and pages with higher weights are crawled first. Factors to consider when setting the weight include: (1) whether the page belongs to a popular site; (2) the length of the link (URL); (3) the weights of the pages that link to it; (4) the number of pages that link to it; and so on. Going further, even a popular site cannot be crawled without limit, so two-level scheduling is needed: first schedule which site to crawl, then, within the chosen site, schedule which pages to crawl. The advantage is that limiting the crawl rate of any single site is polite to that site, and it also gives pages on other sites a chance to be crawled.
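The two-level scheduling idea can be sketched as follows. Several things here are assumptions made for illustration: the class name TwoLevelScheduler, the choice of round-robin for the site level (the text does not say how sites are chosen), and the weights, which the caller is expected to compute from factors like those listed above. The page level uses one priority queue per site.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class TwoLevelScheduler:
    """Level 1 picks a site (round-robin, an assumption); level 2 picks its best page."""

    def __init__(self):
        self.site_heaps = {}       # host -> heap of (-weight, seq, url)
        self.site_order = deque()  # hosts that still have pending URLs
        self.seq = 0               # tie-breaker: equal weights pop in insertion order

    def add(self, url, weight):
        host = urlsplit(url).netloc
        if host not in self.site_heaps:
            self.site_heaps[host] = []
            self.site_order.append(host)
        # heapq is a min-heap, so negate the weight to pop the highest first.
        heapq.heappush(self.site_heaps[host], (-weight, self.seq, url))
        self.seq += 1

    def next_url(self):
        if not self.site_order:
            return None
        host = self.site_order.popleft()   # level 1: which site to crawl
        heap = self.site_heaps[host]
        _, _, url = heapq.heappop(heap)    # level 2: which page within that site
        if heap:
            self.site_order.append(host)   # site goes to the back of the line
        else:
            del self.site_heaps[host]
        return url


# Made-up URLs and weights, for illustration only.
s = TwoLevelScheduler()
s.add("http://popular.example/1", 9.0)
s.add("http://popular.example/2", 8.0)
s.add("http://popular.example/3", 7.0)
s.add("http://small.example/x", 1.0)
order = [s.next_url() for _ in range(4)]
print(order)
```

With a single global priority queue, all three popular.example pages would be crawled before small.example/x. Here the low-weight page from the smaller site is served second, which is the "give other sites a chance" effect described above. A production crawler would likely also enforce a per-site delay between requests, which this sketch omits.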
Jiuzhang Algorithm interview question 44: Design a web crawler