How do I get started with Python crawlers?

By Serco. Source: Zhihu
Link: https://www.zhihu.com/question/20899988/answer/24923424

Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.

"Getting Started" is a good motivation, but it may be slow to play. If you have a project in your hand or in your head, you will be driven by the goal instead of learning as you learn the module.

In addition, if every knowledge point in a body of knowledge is a node in a graph and every dependency is an edge, then that graph is certainly not a directed acyclic graph, because the experience of learning A can help you learn B, and vice versa. Therefore, you do not need to learn how to "get started", because such an "entry point" does not exist at all! What you need to learn is how to do something bigger, and along the way you will quickly pick up whatever needs to be learned. Of course, you can argue that you need to know some Python first; otherwise, how would you learn Python in order to write a crawler? But in fact, you can learn Python in the process of writing this crawler :D

Many of the earlier answers talk about the "technique" (which software to use and how to crawl with it), so I will talk about both the "way" and the "technique": how a crawler works and how to implement one in Python.

First, to make a long story short:
You need to learn
    1. Basic crawler working principles
    2. Basic HTTP scraping tools: Scrapy
    3. Bloom filter: Bloom Filters by Example
    4. If you need to crawl the web at scale, you need to learn the concept of a distributed crawler. It is not as mysterious as it sounds; you just have to learn how to maintain a distributed queue that all the cluster machines can share effectively. The simplest implementation is python-rq: https://github.com/nvie/rq (a small sketch follows this list)
    5. The combination of rq and Scrapy: darkrho/scrapy-redis on GitHub
    6. Subsequent processing: web page content extraction (grangier/python-goose on GitHub), storage (MongoDB)
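As a taste of item 4, here is a minimal sketch of what a shared job queue with python-rq could look like. The Redis connection details and the crawl_page function are illustrative placeholders, not part of the original answer:

from redis import Redis
from rq import Queue

def crawl_page(url):
    # placeholder job: download and store a single page
    print("crawling", url)

# connect to the Redis server that every machine in the cluster can reach
q = Queue(connection=Redis(host="localhost", port=6379))

# any machine can enqueue work; any process started with `rq worker` will pick it up
# (in practice crawl_page must live in a module that the workers can import)
q.enqueue(crawl_page, "http://www.renminribao.com")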

Now, the short story told long:

Let me talk about the experience of crawling the whole of Douban with a cluster.

1) First you need to understand how a crawler works.
Imagine you are a spider dropped onto the Internet, and you need to look through all the pages. What do you do? No problem, you just start somewhere, say the front page of the People's Daily; this is called the set of initial pages, call it $.

On the People's Daily front page, you see the various links that the page points to, so you happily crawl over to the "domestic news" page. Great, you have now crawled two pages (the front page and the domestic news page)! For the moment, never mind how to process the page you crawled down; just imagine you copy it in full into an HTML file that you carry with you.

Suddenly you find that on the domestic news page there is a link back to the front page. As a smart spider, you know you do not need to crawl back, because you have already seen it. So you need to use your brain to save the addresses of the pages you have already seen. Then, every time you see a new link that you might need to crawl, you first check whether you have already visited that address in your head; if you have been there, skip it.

Well, in theory, if every page is reachable from the initial page, then it can be shown that you will eventually crawl all the pages.

So how do you do it in Python?
Very simple:
from queue import Queue   # a simple FIFO queue

initial_page = "http://www.renminribao.com"

url_queue = Queue()
seen = set()

seen.add(initial_page)
url_queue.put(initial_page)

while True:  # keep going until the seas run dry and the rocks crumble
    if url_queue.qsize() > 0:
        current_url = url_queue.get()               # take the first url out of the queue
        store(current_url)                          # store the page that this url points to
        for next_url in extract_urls(current_url):  # extract the urls this page links to
            if next_url not in seen:
                seen.add(next_url)
                url_queue.put(next_url)
    else:
        break
This is pretty much pseudocode already (store() and extract_urls() are left undefined).
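To give a feel for what store() and extract_urls() could look like in practice, here is a minimal sketch using the requests library and the standard-library HTML parser, keeping the same signatures as the pseudocode above (so each page ends up being fetched twice; in practice you would fetch once and pass the HTML around). Everything here is illustrative rather than the author's code:

import hashlib
import requests
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    # collect the href of every <a> tag on a page
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def store(url):
    # download the page and save it to a file named after a hash of the url
    html = requests.get(url, timeout=10).text
    name = hashlib.md5(url.encode("utf-8")).hexdigest()
    with open(name + ".html", "w", encoding="utf-8") as f:
        f.write(html)

def extract_urls(url):
    # download the page again (wasteful, but keeps the pseudocode's signatures)
    # and return absolute versions of all links found on it
    html = requests.get(url, timeout=10).text
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(url, link) for link in parser.links]

store("http://www.renminribao.com")
print(extract_urls("http://www.renminribao.com")[:5])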

The backbone of every crawler is right here. Now let us analyze why a crawler is in fact a very complicated thing: a search engine company usually has an entire team maintaining and developing it.

2) Efficiency
If you run the code above directly, you would need a whole year to crawl all of Douban's content, not to mention that a search engine like Google needs to crawl the entire web.

What is the problem? There are too many pages to crawl, and the code above is too slow. Suppose the whole web has N websites; then the cost of deduplication is N*log(N), because every page has to be visited once, and each membership check in the set costs log(N). OK, OK, I know Python's set is implemented with a hash table, but this is still too slow, and in particular the memory usage is wasteful.

What is the usual method of deduplication? A Bloom filter. It is still a hashing approach, but its distinguishing feature is that it uses a fixed amount of memory (which does not grow with the number of URLs) to decide in O(1) time whether a URL is already in the set. Unfortunately, there is no free lunch: the catch is that if a URL is not in the set, the Bloom filter can tell you with 100% certainty that it has not been seen; but if the URL is in the set, it can only tell you "this URL should have appeared already, though I am, say, 2% unsure". Note that this uncertainty can be made very small when you allocate enough memory. A simple tutorial: Bloom Filters by Example.

Notice what this property means for a crawler: a URL that really has been seen will always be reported as seen, so we never waste time crawling it again; but a URL that has never been seen may, with a small probability, be falsely reported as seen and therefore be skipped, which means a few pages can be missed. As long as the false-positive rate is kept small enough, this is an acceptable price for the fixed memory footprint.
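To make the trade-off concrete, here is a tiny self-contained Bloom filter sketch: a fixed-size bit array with a few bit positions derived from one md5 digest. The sizes are arbitrary, and a real crawler would use a tuned library instead of this toy:

import hashlib

class BloomFilter:
    def __init__(self, size=2**20, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)   # fixed memory, regardless of url count

    def _positions(self, item):
        # derive several bit positions from a single md5 digest
        digest = hashlib.md5(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("http://www.renminribao.com")
print("http://www.renminribao.com" in seen)    # True
print("http://example.com/brand-new" in seen)  # almost certainly False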


Good, we are now close to the fastest way of handling deduplication. The next bottleneck: you only have one machine. No matter how much bandwidth you have, as long as the speed at which your machine can download pages is the bottleneck, that is the speed you need to increase. If one machine is not enough, use many! Of course, we assume that each machine is already running at maximum efficiency, using multiple threads (or, with Python, multiple processes).
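As an illustration of squeezing one machine first, here is a small sketch that downloads a batch of pages concurrently with a thread pool from the standard library. The URL list and worker count are made up; threads work well here because downloading is I/O bound (the GIL is released while waiting on the network), and a ProcessPoolExecutor could be swapped in for CPU-heavy parsing:

import concurrent.futures
import requests

urls = [
    "http://www.renminribao.com",
    "http://example.com",
    "http://example.org",
]

def fetch(url):
    # download one page; most of the time is spent waiting on the network
    return url, requests.get(url, timeout=10).text

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for url, html in pool.map(fetch, urls):
        print(url, len(html))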

3) Clustered Crawl
When I crawled Douban, I used more than 100 machines running around the clock for a month. Imagine that with only one machine you would have to run it for 100 months...

So, suppose you now have 100 machines available; how do you implement a distributed crawling algorithm in Python?

Let us call the 99 smaller machines among the 100 "slaves" and the remaining larger machine "master". Looking back at the url_queue in the code above: if we can put this queue on the master machine, then every slave can connect to the master over the network. Whenever a slave finishes downloading a web page, it asks the master for a new page to crawl, and every time a slave grabs a new page, it sends all the links found on that page to the master's queue. Likewise, the Bloom filter lives on the master, but now the master only hands out URLs that it has determined have not been visited. The Bloom filter sits in the master's memory, while the visited URLs are kept in a Redis instance running on the master, so that all operations are O(1). (At least amortized O(1); for the access efficiency of Redis, see: LINSERT – Redis.)


Consider how to implement this in Python:
Install Scrapy on each slave, so that every machine becomes a slave capable of crawling, and install Redis and rq on the master to serve as the distributed queue.


The code would then look roughly like this:
#slave.py
current_url = request_from_master()
to_send = []
for next_url in extract_urls(current_url):
    to_send.append(next_url)

store(current_url)
send_to_master(to_send)

#master.py
distributed_queue = DistributedQueue()
bf = BloomFilter()

initial_pages = "www.renmingribao.com"

while True:
    if request == 'GET':
        if distributed_queue.size() > 0:
            send(distributed_queue.get())
        else:
            break
    elif request == 'POST':
        bf.put(request.url)
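If you want to see the moving parts behind rq and scrapy-redis, here is a hedged sketch of how the master's queue could be backed by Redis directly, using the redis-py client. The hostname, key name, and timeout are all illustrative:

import redis

r = redis.Redis(host="master-host", port=6379)   # the master's Redis instance

def push_url(url):
    # the master (or a slave reporting new links) appends a url to the shared list
    r.rpush("crawler:todo", url)

def pop_url():
    # a slave pops the next url to crawl, blocking up to 5 seconds if the queue is empty
    item = r.blpop("crawler:todo", timeout=5)
    return item[1].decode("utf-8") if item else None

push_url("http://www.renminribao.com")
print(pop_url())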


Well, you can actually assume that someone has already written what you need: darkrho/scrapy-redis on GitHub.

4) Outlook and post-processing
Although the word "simple" appears a lot above, actually implementing a commercial-scale crawler is far from easy. The code above can be used to crawl a whole website with more or less no big problems.

But if you need further processing, such as
    1. Efficient storage (how should the database be arranged?)
    2. Effective deduplication of content (here meaning page content, not URLs: you do not want to crawl both the People's Daily and some knockoff daily that copies it)
    3. Effective information extraction (for example, extracting all the street addresses that appear on a page, such as "Endeavor Road and Zhonghua Avenue, Chaoyang District"); a search engine usually does not need to store all of the information, for example, why would I keep the images? (a small sketch follows this list)
    4. Timely updates (predicting how often a page will be updated)
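For items 1 and 3, a minimal sketch combining the python-goose extractor and MongoDB via pymongo might look like the following; the database and collection names are made up, and this is only one of many ways to wire the two together:

from goose import Goose              # grangier/python-goose
from pymongo import MongoClient

g = Goose()
article = g.extract(url="http://www.renminribao.com")   # fetch and clean one page

client = MongoClient("mongodb://localhost:27017/")
collection = client["crawler"]["articles"]              # made-up database/collection names

collection.insert_one({
    "url": "http://www.renminribao.com",
    "title": article.title,            # extracted headline
    "text": article.cleaned_text,      # body text with boilerplate stripped
})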

As you can imagine, every point here could keep many researchers busy for ten years. Even so,
"Long, long is the road and far is the journey; I will seek high and low."

So, do not ask how to get started; just get on the road :)
