"Getting Started" is a good motive, but it may be slow. If you have a project in your hands or in your mind, you will be driven by your goals and will not learn as slowly as the learning module. "Getting Started" is a good motive, but it may be slow. If you have a project in your hands or in your mind, you will be driven by your goals and will not learn as slowly as the learning module.
Besides, if you treat every knowledge point in the field as a vertex and every dependency as an edge, the result is not a directed acyclic graph, because the experience of learning A can help you learn B. So you do not need to figure out how to "get started", because such a starting point does not exist! What you need to learn is how to build something bigger; in the process, you will very quickly pick up whatever you need. Of course, you could argue that you need to know Python first, otherwise how could you learn to write a crawler in Python? But in fact, you can perfectly well learn Python while writing the crawler :D
If you see the "techniques" mentioned in many of the previous answers-how to crawl with any software, let me talk about "Tao" and "techniques"-how crawlers work and how to implement them in python.
Long story short, a summary first.
You need to learn:
The basic working principles of a crawler
Basic HTTP scraping tools, e.g. scrapy
Bloom Filter: Bloom Filters by Example
If you need large-scale web page crawling, you need to learn the concept of a distributed crawler. It is not as mysterious as it sounds: you just need to learn how to maintain a distributed queue that all the cluster machines can share effectively. The simplest implementation is python-rq: https://github.com/nvie/rq (a minimal sketch follows this list)
Combining rq and Scrapy: darkrho/scrapy-redis · GitHub
Post-processing: web page extraction (grangier/python-goose · GitHub), storage (MongoDB)
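For example, with python-rq the shared crawl queue is just a Redis-backed job queue that any machine in the cluster can push to. The following is only a minimal sketch under assumptions of my own: the host name "master", the queue name "crawl", and the crawl_page function are placeholders, not something taken from the rq docs or the original post.

```python
# tasks.py -- must be importable by every worker machine
def crawl_page(url):
    # download, store, and extract links from one page (details omitted)
    ...

# enqueue.py -- can be run from any machine in the cluster
from redis import Redis
from rq import Queue

from tasks import crawl_page

q = Queue("crawl", connection=Redis(host="master"))   # "master" is a placeholder host
q.enqueue(crawl_page, "http://www.renminribao.com")   # schedule one page for crawling
```

Each worker machine then runs `rq worker crawl` against the same Redis instance and pulls jobs off the shared queue.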
Now for the long version:
Let me talk about my experience of crawling the whole of Douban with a cluster.
1) First, you need to understand how a crawler works.
Imagine you are a spider, and you have just been dropped onto the Internet. You need to look at all the web pages. What should you do? No problem, just start somewhere, for example the homepage of People's Daily. This is called the initial pages, denoted by $.
On the homepage of People's Daily you can see the various links that page points to, so you happily crawl over to the "domestic news" page. Great, you have now crawled two pages (the homepage and the domestic news page)! For the moment, don't worry about how to handle the page you crawled; just imagine you copied the whole page into an html file and kept it with you.
Suddenly you notice that the domestic news page has a link back to the homepage. As a smart spider, you know you don't need to crawl back, because you have already seen that page. So you need to use your brain to remember the addresses of the pages you have already looked at. Then, every time you see a new link you might need to crawl, you first check whether that address is already in your head. If you have been there, skip it.
Okay. In theory, if every page is reachable from the initial page, then you can crawl all of them.
So how can we implement it in python?
It's simple:
import queue

initial_page = "http://www.renminribao.com"

url_queue = queue.Queue()
seen = set()

seen.add(initial_page)
url_queue.put(initial_page)

while True:  # keep running until the seas run dry and the rocks crumble
    if url_queue.qsize() > 0:
        current_url = url_queue.get()               # take the first url out of the queue
        store(current_url)                          # store the web page this url points to
        for next_url in extract_urls(current_url):  # extract the urls this page links to
            if next_url not in seen:
                seen.add(next_url)
                url_queue.put(next_url)
    else:
        break
This is already written pretty much as pseudo-code.
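The two helpers store() and extract_urls() are left undefined above. Purely as an illustration (this is my own sketch, not the author's code), they could be filled in with nothing but the standard library, at the cost of downloading each page twice:

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def store(url):
    # Save the raw page to disk, using a crudely sanitized url as the file name.
    html = urllib.request.urlopen(url, timeout=10).read()
    with open(url.replace("/", "_").replace(":", "_"), "wb") as f:
        f.write(html)

def extract_urls(url):
    # Download and parse the page, returning the absolute links found on it.
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(url, link) for link in parser.links]
```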
The backbone of every crawler is right there. Now let's look at why, in practice, crawlers are an extremely complicated business: search engine companies usually keep an entire team just to maintain and develop them.
2) Efficiency
If you take the code above and run it directly, it would take you a whole year to crawl all of Douban's content, never mind that a search engine like Google needs to crawl the entire web.
What is the problem? There are simply too many pages to crawl, and the code above is far too slow. Suppose there are N websites on the whole web; then the complexity of the duplicate check is N*log(N), because every page has to be traversed once, and each membership test against the set costs log(N). OK, I know Python's set is implemented as a hash table, but this is still too slow, and above all the memory usage is not efficient.
What is the usual way to do the duplicate check? The Bloom Filter. In short, it is still a hashing approach, but its special feature is that it uses a fixed amount of memory (which does not grow with the number of urls) to decide in O(1) whether a url is already in the set. Unfortunately there is no free lunch. The only catch is this: if a url is not in the set, the BF can tell you with 100% certainty that the url has not been seen; but if the url is in the set, it will tell you "this url should have appeared already, though I have 2% uncertainty". Note that this uncertainty can be made very small if the memory you allocate is large enough. A simple tutorial: Bloom Filters by Example
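To make that concrete, here is a toy Bloom filter of my own, a rough sketch only; for a real crawl you would use an existing library and size the bit array from the expected number of urls and the false-positive rate you can tolerate.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array plus k hash functions; membership tests are O(1)."""

    def __init__(self, size_bits=1 << 24, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)   # memory never grows with the url count

    def _positions(self, item):
        # Derive num_hashes bit positions from salted hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```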
Notice this property: if a url has been seen, it might (with a small probability) be looked at again (which is fine, looking at it a few extra times won't wear you out). But a url that has not been seen will definitely be visited (this is important, otherwise we would miss some pages!). [IMPORTANT: this paragraph has a problem; please skip it for now]
Well, we are now close to the fastest way of handling the duplicate check. On to the next bottleneck: you only have one machine. No matter how much bandwidth you have, as long as the speed at which your machine downloads pages is the bottleneck, that is the speed you need to increase. If one machine is not enough, use many machines! Of course, we assume each machine has already been pushed to maximum efficiency, using multithreading (for python, multiprocessing).
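As a rough sketch of the "one machine at maximum efficiency" part: downloading is network-bound, so in Python a thread pool already overlaps most of the waiting time (CPU-heavy parsing is what would push you to multiple processes). fetch_all and its parameters are names I made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

def fetch(url):
    # Network I/O releases the GIL, so threads can wait on many downloads at once.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def fetch_all(urls, workers=20):
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            try:
                pages[futures[fut]] = fut.result()
            except Exception:
                pass   # skip pages that failed to download
    return pages
```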
3) Cluster-based crawling
When I was crawling Douban, I used more than 100 machines running around the clock for a whole month. Imagine that with a single machine you would have to run for 100 months...
So, if you do have 100 machines, how do you implement a distributed crawling algorithm in python?
Call the 99 machines with the smaller computing power slaves, and the remaining big machine the master. Going back to the url_queue in the code above: if we put this queue on the master, then every slave can talk to the master over the network. Each time a slave finishes downloading a web page, it asks the master for a new page to crawl, and each time a slave crawls a new page, it sends all the links on that page to the master's queue. Likewise, the bloom filter lives on the master, and the master only hands out urls that have not been visited. The Bloom Filter sits in the master's memory, while the visited urls are kept in Redis running on the master, so that all operations are O(1). (At least amortized O(1); for Redis access efficiency, see LINSERT - Redis)
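Here is a minimal sketch of the shared state on the master using the redis-py client; for brevity it uses a plain Redis set instead of the in-memory Bloom filter described above, and the key names (seen_urls, url_queue) and the host name are invented for the example.

```python
import redis

r = redis.Redis(host="master", port=6379)   # placeholder address of the master's Redis

def report_links(urls):
    # Called when a slave sends back the links it found on a page.
    for url in urls:
        # SADD returns 1 only if the url was not already in the set, so the
        # membership test and the insert are a single O(1) round trip.
        if r.sadd("seen_urls", url):
            r.lpush("url_queue", url)

def next_url_for_slave():
    # BRPOP blocks until a url is available, then hands it to the asking slave.
    _key, url = r.brpop("url_queue")
    return url.decode()
```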
Consider how to implement this in python:
Install scrapy on every slave, so that each machine becomes a slave capable of crawling; install Redis and rq on the master to serve as the distributed queue.
The code then looks roughly like this:
#slave.py
current_url = request_from_master()
to_send = []
for next_url in extract_urls(current_url):
    to_send.append(next_url)
store(current_url)
send_to_master(to_send)

#master.py
distributed_queue = DistributedQueue()
bf = BloomFilter()

initial_pages = "www.renmingribao.com"

while(True):
    if request == 'GET':
        if distributed_queue.size() > 0:
            send(distributed_queue.get())
        else:
            break
    elif request == 'POST':
        bf.put(request.url)
Okay, as you may have guessed, someone has already built exactly what you need: darkrho/scrapy-redis · GitHub
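For completeness, the crawling logic on each slave still lives in an ordinary Scrapy spider; a minimal one looks roughly like this (the spider name and start url are placeholders, and scrapy-redis would additionally swap in its own scheduler and dedup filter via the project settings):

```python
import scrapy

class DoubanSpider(scrapy.Spider):
    name = "douban"
    start_urls = ["https://www.douban.com/"]

    def parse(self, response):
        # Hand the page to whatever storage pipeline is configured, then
        # follow every link on it; the scheduler deduplicates requests.
        yield {"url": response.url, "body": response.text}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```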
4) Outlook and post-processing
Although the word "simple" appears a lot above, building a usable crawler at commercial scale is far from easy. The code above is enough to crawl an entire single website with more or less no big problems.
However, if you also add the follow-up processing you will need, such as
Effective storage (how should the database be organized?)
Effective deduplication (page-level deduplication here; we don't want to crawl both People's Daily and the knock-off Big People's Daily that copied it) (a minimal sketch follows this list)
Effective information extraction (for example, extracting all the street addresses on a page, such as "Qianjin Road, Chaoyang District"); a search engine usually does not need to store all the information, for example, why would I bother storing videos...
Timely updates (predicting how often a given page gets updated)
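For the page deduplication point above, the crudest approach is to hash a normalized version of the page text and skip anything whose hash was already seen; this is only a sketch of my own, and real systems use near-duplicate detection (e.g. simhash) rather than exact hashes.

```python
import hashlib
import re

seen_content = set()

def normalize(text):
    # Collapse whitespace and lowercase so trivial formatting changes don't matter.
    return re.sub(r"\s+", " ", text).strip().lower()

def is_duplicate(page_text):
    # Exact-match dedup on the normalized text; misses near-duplicates.
    digest = hashlib.sha1(normalize(page_text).encode("utf-8")).hexdigest()
    if digest in seen_content:
        return True
    seen_content.add(digest)
    return False
```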