Which methods and frameworks are best for writing crawlers in Python?

Source: Internet
Author: User
I have written a very simple Python crawler before, implementing it directly with the built-in libraries. Has anyone used Python to crawl data at a large scale? What approach did you use? Also, what advantages does an existing Python crawler framework have over using the built-in libraries directly, given that crawlers are already easy to write in plain Python?
Reply: Look at Scrapy (http://scrapy.org/) and write your own crawlers on top of that framework. I have collected and used a number of crawler-related libraries for project work and done some comparative analysis. These are the libraries I have run into:
  • Beautiful Soup. Well known, and it covers the parsing needs of most common crawling tasks. Disadvantage: it cannot execute JavaScript.
  • Scrapy. A powerful crawler framework that works well for straightforward page crawling (for example, when the URL patterns are known in advance). With this framework you can easily crawl data such as Amazon product information (see the minimal spider sketch after this list). However, for slightly more complex pages, such as Weibo's, it cannot meet the requirements on its own.
  • Mechanize. Advantage: JavaScript can be loaded. Disadvantage: the documentation is severely lacking, although the official examples plus some trial and error make it barely usable.
  • Selenium. This is a browser driver: through this library you can drive a real browser directly to perform operations such as entering a verification code.
  • Cola. A distributed crawler framework. The overall design of the project is somewhat rough and the coupling between modules is high, but it is worth studying.
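
To make the Scrapy item above concrete, here is a minimal spider sketch. Everything in it (the domain, the CSS selectors, the field names) is a placeholder of my own, not a real site's layout; it only shows the shape of a Scrapy spider.

```python
# Minimal Scrapy spider sketch; the domain and selectors are illustrative placeholders.
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder start page

    def parse(self, response):
        # One item per product block; these CSS selectors are assumptions.
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Follow the "next page" link if one exists.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as products_spider.py, it can be run with `scrapy runspider products_spider.py -o items.json`, which also demonstrates the JSON export mentioned later in this thread.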

Some of my practical experiences are as follows:

  • For simple requirements, such as information with a fixed pattern, almost any of the above will get the job done.
  • For complex requirements, such as crawling dynamic pages, handling state transitions and anti-crawler mechanisms, and achieving high concurrency, it is hard to find a library that covers everything; a lot of it you have to write yourself (for the dynamic-page part, a browser driver such as Selenium is one way out; see the sketch right after this list).
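
A minimal Selenium sketch for a JavaScript-rendered page, assuming Chrome and a matching chromedriver are installed; the URL is a placeholder of mine:

```python
# Minimal Selenium sketch for a JavaScript-rendered page.
# Assumes Chrome and a matching chromedriver are installed; the URL is a placeholder.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/dynamic-page")
    html = driver.page_source            # the HTML after JavaScript has run
    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.string if soup.title else "no <title> found")
finally:
    driver.quit()
```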

As the original poster asked:

In addition, what are the advantages of using an existing Python crawler framework compared with using the built-in libraries directly? After all, Python itself makes it easy to write crawlers.

A third-party library can do what the built-in libraries cannot do, or cannot do easily. Beyond that, how simple a crawler is depends entirely on the requirements; it has little to do with Python itself. To handle the output after JavaScript has run, you can use html5lib, though I think it is best to use it through the beautifulsoup4 interface so that BeautifulSoup uses html5lib internally. If you write the crawler yourself, an asynchronous, event-driven library such as gevent is much better than plain multithreading. A minimal sketch of both points follows.
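
Here is one way those two points might look in code: html5lib used through the BeautifulSoup interface, and gevent for concurrent fetching. The URLs are placeholders and the packages (requests, beautifulsoup4, html5lib, gevent) are assumed to be installed.

```python
# Sketch: html5lib through the BeautifulSoup interface, plus gevent for concurrency.
# Assumes requests, beautifulsoup4, html5lib and gevent are installed; URLs are placeholders.
from gevent import monkey
monkey.patch_all()  # patch the standard library before the network code runs

import gevent
import requests
from bs4 import BeautifulSoup


def fetch_title(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html5lib")  # html5lib used internally via bs4
    return url, soup.title.string if soup.title else None


urls = ["https://example.com/", "https://example.org/"]
jobs = [gevent.spawn(fetch_title, u) for u in urls]
gevent.joinall(jobs, timeout=30)
for job in jobs:
    print(job.value)
```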
When I was a sophomore, I wrote a crawler for http://amazon.com that collected all the reviews of the top-100 bestsellers in one category.

No framework was needed. On Linux, I used the BeautifulSoup library to help parse the HTML, and regular expressions also worked well.

Crawlers are slow. A small trick is to route requests through proxies: the target is a foreign website, so direct access is slow anyway, and proxies also prevent too many requests from arriving from the same IP address.
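
A minimal sketch of the proxy trick with requests; the proxy address and URL are placeholders, and you would substitute a proxy you actually control.

```python
# Routing requests through a proxy so repeated hits do not all come from one IP.
# The proxy address and URL are placeholders.
import requests

proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}
resp = requests.get("https://example.com/reviews", proxies=proxies, timeout=15)
print(resp.status_code, len(resp.text))
```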

There were roughly tens of thousands of web pages in total. BeautifulSoup then parses them and picks out the data you care about, such as scores, comments, sellers, and categories. After that, scientific libraries such as numpy, scipy, and matplotlib are enough for some simple statistics and reports. There are also plenty of JavaScript charting libraries on the web for generating reports, which look cool and nice :)
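
For the statistics-and-reporting step, a toy sketch with numpy and matplotlib; the score values are fabricated purely for illustration.

```python
# Toy example of the "simple statistics and reports" step with numpy and matplotlib.
# The scores are made up for illustration only.
import numpy as np
import matplotlib.pyplot as plt

scores = np.array([4.5, 3.0, 5.0, 4.0, 2.5, 4.5, 3.5, 5.0])
print("mean:", scores.mean(), "std:", scores.std())

plt.hist(scores, bins=5)
plt.xlabel("review score")
plt.ylabel("count")
plt.title("Score distribution")
plt.savefig("scores.png")
```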

Well, that's about it. Now let me answer the question.

If you want to crawl at a larger scale, there are two options: write a crawler framework yourself, or use an existing crawler framework.
1. Write a crawler framework yourself. I have never done this.
2. Use an existing, ready-made crawler framework.
Scrapy is the most widely used. Scrapy is asynchronous, and it can also be turned into a distributed crawler, so crawling a large dataset no longer takes forever.
Besides Scrapy there are pyspider and cola. I am still collecting more frameworks, but if you want to start with one right now these are about the only choices, because most of the others are documented in English, which most people do not want to wade through.

Some people mentioned cola, which was written by a Chinese developer.
I had heard of Scrapy before but had never looked at it closely. I just took a look and found that, apart from the distributed part, it really is quite similar.
Scrapy can export results to JSON files, which reduces the dependency on a database.

Thinking about it, the distributed part was my original motivation all along; I really did not expect the other parts to turn out so similar.
Actually, which framework to use is not much of a dilemma, because there is hardly any choice.

Second question: what is the difference between using a Python library and using a framework?
You ask this because your current crawling needs are still very simple!
If you only need to crawl static pages, and not very many of them, then honestly use whatever you like, or simply call a library directly; requests is the usual recommendation, and a few lines of code will do it (see the sketch below).
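
For example, a static page really does come down to a few lines with requests plus BeautifulSoup; the URL here is a placeholder.

```python
# "A few lines of code" for a static page: requests plus BeautifulSoup.
# The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/articles")
soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.select("a"):
    print(link.get("href"), link.get_text(strip=True))
```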

However, real life is not only static pages; there is also Ajax, JavaScript, and all kinds of inexplicable details.

The details are the terrible part. For example: should you use regular expressions or XPath to extract the data; why doesn't every page have a "next page" link; a whole night of crawling got me only 5,000 records when I need 200,000 in total; and then the crawler gets blocked. Ugh.
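
For reference, the XPath-style extraction mentioned above might look like this with lxml; the URL and XPath expressions are placeholders of mine, and note that the "next page" query can legitimately come back empty.

```python
# Sketch of XPath extraction with lxml; URL and XPath expressions are placeholders.
import requests
from lxml import html

resp = requests.get("https://example.com/list?page=1")
tree = html.fromstring(resp.text)
titles = tree.xpath("//h2[@class='title']/text()")
next_page = tree.xpath("//a[@class='next']/@href")  # may be empty: not every page has a next link
print(titles[:5], next_page)
```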

Sometimes it is annoying enough to shake even someone who thought they were unshakable.

At this point you will see the benefit of a framework. The biggest value of a framework is that it meets your needs in the simplest possible way. In other words, if what you have now already serves your work well, don't bother looking at frameworks. If work has become painful, take a look around: maybe someone else has already built the wheel and it is just waiting for your cart, my friend!

Here is the cola link:
Cola: a distributed crawler framework
Scrapy: a quick Baidu search will turn it up.
I have not used pyspider myself, so that one comes down to personal preference. You can start by looking at Scrapy and then use Redis to make it distributed; for a concrete implementation, refer to code on GitHub, such as chineking/cola · GitHub.
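
The Scrapy-plus-Redis combination is commonly done with the scrapy-redis package; here is a minimal settings sketch, assuming that package and a local Redis server (both are my assumptions, not something specified in the thread).

```python
# settings.py sketch for distributed crawling with the scrapy-redis package.
# Assumes `pip install scrapy-redis` and a Redis server reachable at the URL below.

# Put the request queue and the duplicate filter in Redis so several
# crawler processes can cooperate on a single crawl.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue between runs
REDIS_URL = "redis://127.0.0.1:6379"
```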
If you need MongoDB for storage, you will have to dig into that a bit as well. MongoDB can also be deployed as a cluster, which fully meets the OP's requirements.
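
If MongoDB is the storage target, the write side can be as small as this sketch with pymongo; the database, collection, and item fields are placeholders, and a local mongod is assumed.

```python
# Minimal MongoDB write with pymongo; assumes a mongod running on localhost.
# The database, collection and item fields are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://127.0.0.1:27017")
collection = client["crawler"]["items"]

item = {"title": "example item", "price": "9.99", "url": "https://example.com/item/1"}
collection.insert_one(item)
print(collection.count_documents({}))
```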
There are many other frameworks as well, for example those listed in "crawler frameworks | write code for yourself"; the OP can give them a try.
Large-scale data crawling can in fact be done with a distributed setup. My blog has a detailed write-up and source code, implemented in Python 3.4.
You are welcome to read the web resource search crawler (Python 3.4.1 implementation) that I wrote.

I once wrote a crawler that collected photos of students at my school; it ran for three or four days.

I have developed a cloud crawler development framework, X-arms, which lets developers write and run crawlers in JavaScript in the cloud. You are welcome to try it out; criticism is welcome ~
