Using Python for a crawler: the Scrapy framework, or requests + BS4?

I want to use Python (Python 3) to write a crawler for some needs of my own.
Looking at material online, I found there are two options:
1. Use the Scrapy framework
Everyone says the framework is powerful and simple to use, but it is not compatible with Python 3.
2. Build it myself with libraries such as requests and BS4
Compared with option one, I would probably have to write a lot more code, and the performance may not match that of an open-source framework.

I only learned Python 3 (many people say Python 3 is the trend, so I never learned Python 2). If I go with option one, there is the problem that Scrapy's Python 3 support is not good enough (the official Scrapy site says Python 3 support is in progress, but it is not done yet). Could someone familiar with Scrapy say how good its Python 3 support actually is? If I go with option two: how hard would it be to implement a simple version of Scrapy with requests, BS4, and similar libraries, and what would I need to learn?

Replies:

Really, don't agonize over 2 vs. 3. For crawlers there is practically no difference; apart from encoding and print, these are non-issues.
And requests and BS4 both support Python 3 (as far as I'm aware).

So what actually matters?
1. IP bans
Use proxies with requests: buy them, or use the free proxies found online (see the requests sketch after this list)
2. Masquerading as a browser
Switch the User-Agent header in requests (also shown in the sketch below)
3. Sites that require login: save the cookies
Use a requests Session to POST the login form, then crawl with the cookies it keeps (sketch below)
4. Too many URL parameters whose meaning you don't understand
webdriver (Selenium) with PhantomJS
5. JavaScript and Ajax
Analyze the request pattern with the browser's F12 developer tools and hit the endpoints directly with requests; or use webdriver with PhantomJS; if you're on Scrapy, use scrapyjs
6. Crawling too slowly
Multithreading. Don't bring up the GIL: crawling is usually network-IO bound, the CPU spends its time waiting on IO (see the threading sketch after this list)
7. Still too slow?
Scrapy's async engine (I've done a few projects with it, very useful), or pyspider (which supports Python 3)
8. Still too slow?
Go distributed (haven't needed it myself yet): Redis, scrapyd
9. Captchas
Sorry, can't help you much there. For simple ones, PIL: grayscale, binarize, cut, recognize (see the PIL sketch below)
10. Making asynchronous requests yourself
grequests is good (sketch below)
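A minimal requests sketch covering items 1-3 above. The proxy address, login URL, and credentials are placeholders for illustration, not real endpoints:

    import requests

    # Item 2: pretend to be a browser by setting a User-Agent header.
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

    # Item 1: route traffic through a proxy (placeholder address).
    proxies = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}

    # Item 3: a Session keeps cookies across requests: log in once, then crawl.
    session = requests.Session()
    session.headers.update(headers)

    # Hypothetical login endpoint and credentials, just to show the pattern.
    session.post("https://example.com/login",
                 data={"user": "me", "password": "secret"},
                 proxies=proxies)

    # Subsequent requests reuse the login cookies automatically.
    resp = session.get("https://example.com/protected-page", proxies=proxies)
    print(resp.status_code, len(resp.text))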
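For item 6, a multithreading sketch using the standard library's concurrent.futures; the URLs are placeholders. Each worker mostly waits on the network, during which the GIL is released, so the threads genuinely overlap:

    import requests
    from concurrent.futures import ThreadPoolExecutor

    # Placeholder URLs; replace with the pages you actually want.
    urls = [f"https://example.com/page/{i}" for i in range(20)]

    def fetch(url):
        # Return the status code, or the exception if the request failed,
        # so one bad page does not kill the whole batch.
        try:
            return url, requests.get(url, timeout=10).status_code
        except requests.RequestException as exc:
            return url, exc

    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, result in pool.map(fetch, urls):
            print(url, result)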
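For item 9, a sketch of the PIL (Pillow) preprocessing steps named above: grayscale, binarize, cut. The filename, the threshold value, and the assumption of four evenly spaced characters are all hypothetical; the actual recognition step (feeding the slices to a recognizer) is left out:

    from PIL import Image  # pip install Pillow

    img = Image.open("captcha.png")   # hypothetical captcha file
    gray = img.convert("L")           # step 1: grayscale
    # Step 2: binarize; pixels brighter than the (assumed) threshold turn white.
    binary = gray.point(lambda p: 255 if p > 140 else 0)
    # Step 3: cut; assume four evenly spaced characters and slice them apart.
    w, h = binary.size
    chars = [binary.crop((i * w // 4, 0, (i + 1) * w // 4, h)) for i in range(4)]
    # Step 4: recognize; here we just save the slices for whatever recognizer you use.
    for i, c in enumerate(chars):
        c.save(f"char_{i}.png")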
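And for item 10, grequests (a gevent-based wrapper around requests) in a few lines; the URLs are placeholders:

    import grequests  # pip install grequests; uses gevent under the hood

    urls = ["https://example.com/a", "https://example.com/b"]  # placeholders

    # Build unsent request objects, then fire them all concurrently.
    pending = (grequests.get(u, timeout=10) for u in urls)
    for resp in grequests.map(pending):
        if resp is not None:          # failed requests come back as None
            print(resp.status_code, resp.url)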



Replying from my phone; more to be added later.
PS: Without noticing, I've been using Python for a while now; I've written crawlers and web apps, and recently earned some money with Python. A few days ago I wrote a simple crawler with just a few libraries, but since I'm on Python 2.7 things may differ a little. Here is my experience.

Two months ago I learned the Scrapy framework and then wrote a few crawlers, basically with BaseSpider and CrawlSpider. At that time, writing a crawler felt very simple: the ready-made framework is right there, and you only have to define the class to crawl with and the parsing functions.

After a one-month break from Python, I was reading "Core Python Programming" and, when it got to the chapter on crawlers, I wondered why I had never written one from scratch, and started doing it.

That's when I realized a crawler is not as simple as it looks: you have to deal with things such as classes for storing the data, restricting the crawl to a domain, deduplicating data, and a whole series of other problems. In the end I wrote it two ways, one built around a class and one with plain functions, though they are basically similar. Of course a series of problems is still unresolved; the current version only crawls pages from an input URL down to a given crawl depth, but the basic prototype is there, and I'll fix the rest gradually. A minimal sketch of such a prototype follows.
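A sketch of the prototype described above, under assumed specifics: crawl from a placeholder start URL down to a given depth, stay on one domain, and deduplicate URLs. The crude href regex follows the "just use re" approach mentioned below:

    import re
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests

    START_URL = "https://example.com/"   # placeholder start page
    MAX_DEPTH = 2

    # Crude link pattern, in the "just use re" spirit.
    LINK_RE = re.compile(r'href="(.*?)"')

    def crawl(start, max_depth):
        domain = urlparse(start).netloc
        seen = {start}                   # deduplication: fetch each URL once
        queue = deque([(start, 0)])      # breadth-first queue of (url, depth)
        while queue:
            url, depth = queue.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue                 # skip pages that fail to load
            print(depth, url)
            if depth >= max_depth:
                continue
            for href in LINK_RE.findall(html):
                link = urljoin(url, href)
                # Domain restriction: stay on the site we started from.
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))

    crawl(START_URL, MAX_DEPTH)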

My personal advice is to learn Scrapy first. The biggest benefit I felt was that it taught me regular expressions, so my own crawler extracts URLs directly with regex and uses no other parsing library.

After learning Scrapy, try writing a crawler yourself, since by then you'll have the basic operations down. As the OP says, requests and BS4 alone are certainly not enough, but don't rush to learn every library up front; look things up when you need them. (I personally like regex, so my crawler uses only re. That's not to deny that the two libraries above are very strong; it's just personal preference.) You will definitely run into problems while writing, such as data storage, deduplication, and crawl scheduling; solving them one by one is what really improves you.

Watching your own crawler walk through web pages also gives a real sense of accomplishment. Don't agonize over the Python 2 vs. Python 3 question:
learning to program is not just learning syntax, it is learning computational thinking and programming ideas. The difference between Python 2 and Python 3 is not that big.

Given your situation, I'd recommend learning to crawl with the standard library or with requests: first master packet capture, simulating POST and GET requests, and automatic form filling, then move on to the Scrapy framework.

I suggest taking a look at a Python crawler video course first and learning the basics.

Search for "Python crawler" video tutorials; playback addresses are easy to find.

Come on! Try urllib and urllib2 first (in Python 3 they are merged into urllib.request) and get familiar with the basic ideas behind crawlers. Once comfortable, look at requests, which wraps the same kind of functionality in a friendlier API; get used to capturing and analyzing how pages are composed, and understand what POST and GET are in principle and in practice. Try writing crawlers for a few small sites, and when that no longer satisfies you, move on to Scrapy. Before jumping into that pit, though, I recommend first understanding Python's multithreading; I'm currently stuck halfway in it myself. It all depends on your use case.
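A minimal sketch of the urllib.request basics recommended above (Python 3's merge of urllib and urllib2); the URLs are placeholders:

    from urllib.request import Request, urlopen
    from urllib.parse import urlencode

    # GET: fetch a page with a browser-like User-Agent.
    req = Request("https://example.com/",
                  headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
    print(len(html))

    # POST: form data must be URL-encoded bytes.
    data = urlencode({"q": "python"}).encode("utf-8")
    resp = urlopen(Request("https://example.com/search", data=data), timeout=10)
    print(resp.status)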
If your crawler is just for fun or practice, or makes a small volume of requests to one site, you can use Scrapy.
If your crawler hits a site very frequently and at very large volume, I tend to use requests + BS + re.

The business logic of a crawler is simple. The real point is anti-crawling! Anti-crawling! Anti-crawling!


Scrapy's advantage is its abstraction of the task: configure the data format you need and you get results quickly. That is convenient when the request volume is small, but once the volume shoots up you will inevitably run into anti-crawling mechanisms blocking you in all sorts of ways, and Scrapy does not provide a particularly effective mechanism for dealing with them.

Besides, for the everyday job of extracting useful data, BeautifulSoup + re work fine, while with Scrapy you have to configure a pile of things, which feels cumbersome.
Since all the anti-crawling handling has to be done by yourself anyway, Scrapy's advantage turns out to be quite small, so I suggest requests + BS + re. The requests and BS4 libraries are quite powerful: write a few dozen lines, add proxies and multiprocessing or multithreading, and you can fetch a considerable amount of data. If you want to get started with these two libraries, you can search NetEase Cloud Classroom for a course on Python crawlers; I forget the exact name, but I personally think it's good. Beyond that, use the documentation: all the instructions are in the docs, and a search turns them up. A small BeautifulSoup + re sketch is below.
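A small sketch of the BeautifulSoup + re combination suggested above: BeautifulSoup for the document structure, re for flat text patterns. The URL and the email pattern are purely illustrative:

    import re

    import requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    html = requests.get("https://example.com/").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")

    # BeautifulSoup handles the document structure: every link and its text.
    for a in soup.find_all("a", href=True):
        print(a["href"], a.get_text(strip=True))

    # re handles flat text patterns: e.g. anything shaped like an email address.
    print(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html))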