What is the best way to use Python to write crawlers?

Source: Internet
Author: User
I've previously only written very simple Python crawlers, implemented directly with the built-in libraries. Has anyone used Python to crawl data at a larger scale, and what approach did you use?
Also, what advantages do the existing Python crawler frameworks have over using the built-in libraries directly? After all, writing a crawler in Python by hand is already very simple.

Reply content:

You can look at Scrapy (http://scrapy.org/) and write your own crawler on top of that framework. Because of project needs, I have collected and used a number of crawler-related libraries and done some comparative analysis. Here are some of the libraries I've worked with:
    • Beautiful Soup. Well known, and covers a number of common crawling needs. Disadvantage: cannot execute JS.
    • Scrapy. A seemingly powerful crawler framework that handles simple page crawls well (for example, when the URL pattern is clearly known). With this framework you can easily crawl data such as Amazon product listings. But for slightly more complex pages, such as Weibo pages, it doesn't meet the requirements.
    • Mechanize. Advantage: JS can be loaded. Disadvantage: documentation is severely lacking. But between the official examples and trial and error, it is still barely usable.
    • Selenium. This is a driver that calls a real browser; through this library you can directly drive the browser to perform certain operations, such as entering a CAPTCHA (see the sketch after this list).
    • Cola. A distributed crawler framework. The overall design of the project is a bit rough and the coupling between modules is high, but it is worth studying.
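
As an illustration of the Selenium point, here is a minimal sketch that drives a real browser; it assumes chromedriver is installed, and the URL and field names are placeholders, not anything from the original answer:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real browser (assumes chromedriver is installed and on PATH).
driver = webdriver.Chrome()
try:
    driver.get("http://example.com/login")  # placeholder URL
    # Fill in form fields, e.g. where a login or CAPTCHA is involved.
    driver.find_element(By.NAME, "username").send_keys("me")      # placeholder field names
    driver.find_element(By.NAME, "password").send_keys("secret")
    driver.find_element(By.TAG_NAME, "form").submit()
    # After the browser has executed the page's JS, the rendered HTML is available here.
    print(driver.page_source[:500])
finally:
    driver.quit()
```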

Here are some of my practical experiences:

    • For simple requirements, such as information with a fixed pattern, almost any of the above will do.
    • For more complex requirements, such as crawling dynamic pages, handling state transitions, dealing with anti-crawler mechanisms, or achieving high concurrency, it is hard to find a library that matches the requirements, and a lot of things have to be written yourself.

As for the question the OP raised:

Also, what advantages do the existing Python crawler frameworks have over using the built-in libraries directly? After all, writing a crawler in Python by hand is already very simple.

Third-party libraries can do things that the built-in libraries cannot do, or can only do with difficulty; that's all. Besides, whether it is "simple" depends entirely on the requirements and has nothing to do with Python. To handle the result of JS execution, you can use html5lib.
But I think the best approach is to use the BeautifulSoup4 interface and let it use html5lib internally as the parser. When writing your own crawler, using an asynchronous, event-driven library such as gevent works much better than plain multithreading. When I was a sophomore, I wrote a crawler that scraped all the reviews of the bestseller top 100 in certain product categories on Amazon.com.
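
A minimal sketch of that combination: gevent greenlets instead of threads, and BeautifulSoup4 told to use html5lib as its parser. The URLs are placeholders, and both gevent and html5lib have to be installed separately:

```python
from gevent import monkey
monkey.patch_all()  # patch the standard sockets so blocking requests calls yield to other greenlets

import gevent
import requests
from bs4 import BeautifulSoup

URLS = [
    "http://example.com/page1",  # placeholder URLs
    "http://example.com/page2",
]

def fetch(url):
    resp = requests.get(url, timeout=10)
    # Use the html5lib parser through the BeautifulSoup4 interface, as suggested above.
    soup = BeautifulSoup(resp.text, "html5lib")
    return url, soup.title.string if soup.title else None

# Spawn one greenlet per URL and wait for all of them to finish.
jobs = [gevent.spawn(fetch, u) for u in URLS]
gevent.joinall(jobs)
for job in jobs:
    print(job.value)
```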

I didn't use any framework; I used the BeautifulSoup library under Linux to help parse the HTML. Regular expressions would also work, but they can get very troublesome.

The crawler was slow. One small trick is to go through a proxy: since it's a foreign site, access is slow anyway, and a proxy also keeps the same IP from hitting the site too many times.
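
A small sketch of the proxy trick with requests; the proxy addresses and target URL are placeholders:

```python
import requests

# A pool of proxies to rotate through, so no single IP hits the site too often.
PROXIES = [
    "http://127.0.0.1:8080",  # placeholder proxy addresses
    "http://127.0.0.1:8081",
]

def fetch(url, attempt=0):
    # Pick a proxy in round-robin fashion and route the request through it.
    proxy = PROXIES[attempt % len(PROXIES)]
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)

resp = fetch("http://example.com/reviews?page=1")
print(resp.status_code)
```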

It was roughly tens of thousands of pages. I then used BeautifulSoup to parse them and picked out the data I was interested in, such as ratings, comments, sellers, and categories. Then I used some scientific libraries, such as NumPy, SciPy, and matplotlib, to do some simple statistics and reports. There are also plenty of JS libraries online for turning data into reports, which look very cool and work well :)
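
A small sketch of that last step, assuming the ratings have already been extracted from the crawled pages into a plain list (the numbers here are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for ratings scraped from review pages.
ratings = np.array([5, 4, 4, 3, 5, 2, 4, 5, 1, 4])

print("mean rating:", ratings.mean())
print("std deviation:", ratings.std())

# A simple histogram of the rating distribution, saved to a file.
plt.hist(ratings, bins=np.arange(0.5, 6, 1.0), edgecolor="black")
plt.xlabel("rating")
plt.ylabel("count")
plt.savefig("ratings.png")
```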

Well, that's about it.

Let me answer this too.

If the OP wants to crawl something at a larger scale, there are two options: write your own crawler framework, or use an existing crawler framework.
1. Writing your own framework: I haven't done that, so I can't comment.
2. Using an existing framework: the most widely used one is Scrapy. First, Scrapy is asynchronous; second, Scrapy can be extended into a distributed crawler, so when you face big data you don't have to crawl for a lifetime (a minimal spider sketch follows below).
There are also pyspider, Cola, and so on. I'm still collecting more crawlers, but if you're going to start with a framework you can probably only find these few; the reason is probably that many frameworks are documented in English, and most people don't want to wade through English docs.
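
A minimal Scrapy spider sketch of that idea; the domain, start URL, and CSS selectors below are placeholders, not anything from the original answer:

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["http://example.com/products"]  # placeholder start URL

    def parse(self, response):
        # One item per product block (selectors are placeholders).
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Follow pagination; Scrapy schedules these requests asynchronously.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Save it as spider.py and run it with "scrapy runspider spider.py -o items.json" to get the scraped items as a JSON file.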

Someone also mentioned Cola, which was written by a Chinese developer. The author said:
"Damn, I had only heard of Scrapy before and never actually looked at it. I just took a look and found that, apart from the distributed part, it really is surprisingly similar."
It takes inspiration from Scrapy; for example, results can be saved as JSON files, which reduces the dependency on a database.

"Come to think of it, distribution was my original goal anyway; I just didn't expect the other parts to end up so close."
In fact, which framework you use isn't really worth agonizing over, because there's almost no choice.

Now the second question: what's the difference between Python's built-in libraries and a framework?
If you're asking this, it's because your crawling needs are still very simple!
You're just crawling static pages, and not that many of them. In that case I really suggest you use whatever you're comfortable with, or just go straight to a plain library. I recommend requests; a few lines of code and you're done.
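
A minimal sketch of that with requests plus BeautifulSoup; the URL and the selector are placeholders:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a static page (placeholder URL).
resp = requests.get("http://example.com/", timeout=10)
resp.raise_for_status()

# Parse the HTML and pull out whatever you are interested in.
soup = BeautifulSoup(resp.text, "html.parser")
for link in soup.select("a"):
    print(link.get("href"), link.get_text(strip=True))
```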

However, real life isn't just static pages; there's also Ajax, JS, and all kinds of inexplicable details.

And the details can be quite scary. For example: should the data be extracted with regular expressions or with XPath? Why don't all the pages have a "next page" link? A whole night of crawling only got 5,000 records when I need 200,000 in total, so what do I do? And what do I do when the crawler's IP gets blocked?
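
For instance, the same field can be extracted with a regex or with XPath (via lxml); the toy HTML below is just for illustration:

```python
import re
from lxml import html

page = "<html><body><h1 class='title'>Hello</h1><a href='/p2'>next page</a></body></html>"

# Regex: quick to write, but brittle as soon as the markup changes.
match = re.search(r"<h1[^>]*>(.*?)</h1>", page)
print(match.group(1) if match else None)

# XPath: a bit more to learn, but far more robust for nested HTML.
tree = html.fromstring(page)
print(tree.xpath("//h1[@class='title']/text()"))
print(tree.xpath("//a/@href"))
```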

Sometimes it's really hard to handle all of this by yourself.

That's when you'll understand the benefits of a framework. The most important role of a framework is to help you meet your requirements in the simplest possible way. In other words, if what you have now already handles the job well, don't bother looking at frameworks; if the job gets a bit difficult, go take a look, because maybe someone has already built the wheel and it's just waiting for you to push it along!

Here's the link for Cola:
Cola: A distributed crawler framework
For Scrapy, a Baidu search will turn up plenty.
I haven't used Pyspider yet, so that one comes down to personal preference. You can start by reading up on Scrapy, then combine it with Redis to make it distributed; for a concrete implementation you can refer to code on GitHub, for example Chineking/cola.
For storage you'll want MongoDB. There's quite a lot to dig into there if you go that route, and MongoDB supports cluster-style storage, which can meet the OP's requirements.
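
A rough sketch of that combination, assuming a Redis list as the shared URL queue and MongoDB for storage; host names, the queue key, and the database/collection names are all placeholders, and de-duplication of URLs is omitted for brevity:

```python
import requests
import redis
from bs4 import BeautifulSoup
from pymongo import MongoClient

r = redis.Redis(host="localhost", port=6379)      # shared URL queue (placeholder host)
mongo = MongoClient("mongodb://localhost:27017")  # placeholder MongoDB instance
pages = mongo["crawler"]["pages"]                 # placeholder database/collection

def worker():
    # Each worker process, possibly on a different machine, pops URLs from the same queue.
    while True:
        url = r.lpop("url_queue")
        if url is None:
            break
        url = url.decode()
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        # Store the page in MongoDB.
        pages.insert_one({"url": url, "title": soup.title.string if soup.title else None})
        # Push newly discovered absolute links back onto the shared queue.
        for a in soup.select("a[href^='http']"):
            r.rpush("url_queue", a["href"])

if __name__ == "__main__":
    worker()
```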
There are plenty of other frameworks as well; the OP can give them a try.
Crawling large-scale data can in fact be done with a distributed setup. My blog has a very detailed write-up plus source code, implemented in Python 3.4.
You're welcome to look at my network resource search crawler (Python 3.4.1 implementation).

I wrote a small crawler that scraped my school's student photos and grades; it ran for three or four days and was interrupted a couple of times along the way.

I developed a cloud crawler development framework, God Archer, which lets developers write and run crawlers in the cloud using JavaScript. You're welcome to try it, and criticism is welcome too~