Python Crawler Essays - Scrapy Framework (1): Introduction to the Installation and Structure of the Scrapy Framework


Introduction to the Scrapy framework

Scrapy is a fast, high-level screen scraping and web crawling framework developed in Python, used for crawling websites and extracting structured data from pages. Scrapy has a wide range of applications, including data mining, monitoring, and automated testing. (Quoted from: Baidu Encyclopedia)

Scrapy Official website: https://scrapy.org/

Scrapy Official Document: https://doc.scrapy.org/en/latest/

Scrapy Frame Installation

First we install Scrapy, using the following command

pip install scrapy

At this point, many people will run into the following error:

error: Microsoft Visual C++ 10.0 is required.Get it with "Microsoft Windows SDK 7.1": www.microsoft.com/download/details.aspx?id=8279

This is because Scrapy makes heavy use of C++ code, so a C++ 10.0 build environment must be present before installation. The most straightforward workaround is to download and install Microsoft Visual C++ 10.0. But downloading and configuring such a large environment just for this is genuinely daunting.

So I suggest a second way. If we look carefully at the output just before pip reports the error, we see:

Running setup.py clean for Twisted
Failed to build Twisted

In other words, the error occurred while installing the Twisted module, so we can choose to download and install Twisted manually. Python libraries can be downloaded from many places, and many people will end up with a file like twisted-xx.x.x.tar.bz2, extract it, install it, and find that the same error occurs. If we look carefully at the earlier Scrapy installation output, we will notice that the pip command actually downloaded this same bz2 file, unpacked it, and ran the extracted setup.py, so this is no different from the manual installation process just described.

Instead, I recommend the website https://www.lfd.uci.edu/~gohlke/pythonlibs, which hosts prebuilt wheels for almost all commonly used Python libraries. Since we need the Twisted library this time, we search the page for Twisted and download the wheel matching our Python version and bitness. Then open cmd in the directory where the file was downloaded and run the following command (remember to substitute the name of the file you actually downloaded):

pip install Twisted-xx.x.x-cpxx-cpxxm-win_amd64.whl

Then we just wait for it to finish running; once it does, we have installed the Twisted library that Scrapy requires, and we can re-execute:

pip install scrapy

Installation is successful!

During the installation, we can see that pip downloaded many supporting libraries for us; these are what make Scrapy a complete, systematic crawler framework that greatly simplifies our programming and lowers the cost of learning.

Note that Scrapy itself is built on Twisted rather than on the requests library, but requests will come in handy later in this series, so we also execute the following command:

pip install requests

At this point, we have completed the installation of the Scrapy crawler framework.
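To verify that everything is in place, we can run the following command, which should print the installed Scrapy version:

scrapy version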

Scrapy Crawler Template Introduction

Scrapy is a full framework, and the preparation before use is slightly more complicated than for some other crawlers, but these few extra steps greatly reduce the difficulty of our subsequent programming.

Since we are not yet familiar with the Scrapy framework, we use Scrapy's own commands to generate a project template. This time we take Jobbole as an example.

First, we use the following command:

scrapy startproject jobboleSpider

This creates a Scrapy crawler project under the current path, but the project is not yet complete, so following its prompt we use the next two commands to generate a spider from the template:

cd jobboleSpider
scrapy genspider jobbole http://blog.jobbole.com/all-posts/

From the output, we can see that Scrapy used its "basic" template to successfully create the project and build its basic structure. Now let's open an IDE and look at the template we created.
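As a rough sketch, the generated project layout typically looks like this (the exact listing may vary slightly between Scrapy versions):

jobboleSpider/
    scrapy.cfg
    jobboleSpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            jobbole.py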

Here, scrapy.cfg is our global configuration file; it contains things like the path of our settings file and the project name.

jobbole.py, under the spiders directory, is the main file where our future crawler logic will live.
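With the "basic" template, the generated jobbole.py should look roughly like the following sketch; the exact contents depend on the Scrapy version and the arguments we passed to genspider:

import scrapy

class JobboleSpider(scrapy.Spider):
    # the name we use to run this spider with 'scrapy crawl jobbole'
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        # our crawling logic will go here in the next installment
        pass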

items.py is the file where we define the structure used to store our data.
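As a minimal sketch, an item definition might look like this; the title and url fields here are hypothetical placeholders rather than part of the generated template:

import scrapy

class JobboleArticleItem(scrapy.Item):
    # each Field() declares one attribute we intend to store
    title = scrapy.Field()
    url = scrapy.Field()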

middlewares.py contains various middleware, such as downloader middleware and redirect middleware; middleware is the important channel through which the Scrapy engine exchanges information with the other parts of the framework.
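As a minimal sketch, assuming we only want to observe outgoing requests, a downloader middleware hook might look like this (the class name is hypothetical):

class JobboleDownloaderMiddleware:
    def process_request(self, request, spider):
        # called for every request passing through the downloader;
        # returning None tells Scrapy to continue processing it normally
        spider.logger.debug('Requesting %s', request.url)
        return None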

pipelines.py, as its name suggests, defines pipelines, which are mainly used to store the data we obtain, for example in a database.
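A minimal pipeline sketch looks like the following; real storage logic, such as a database write, would replace the comment:

class JobboleSpiderPipeline:
    def process_item(self, item, spider):
        # store or transform the item here, then return it
        # so that any later pipelines can continue processing it
        return item

Note that a pipeline only runs once it is enabled in the ITEM_PIPELINES setting in settings.py.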

settings.py consists of a large number of Scrapy settings, such as whether the robots protocol is followed.
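For instance, the robots setting just mentioned appears in settings.py as:

# whether the crawler respects each site's robots.txt rules
ROBOTSTXT_OBEY = True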

Conclusion

At this point we have installed Scrapy and generated the basic framework, but we have not yet done any actual programming. In the next installment I will walk everyone through crawling all of the articles in Jobbole's "Latest Articles" section, to get a first taste of the fun of Scrapy crawlers. After that, we will gradually move on to simulated login, breaking anti-crawler restrictions, and so on.

I will try to explain every step of my operations in detail, to make this a tutorial that even complete beginners can follow. The following tutorials will use XPath and regular expressions; limited by space, I will only give some basic instruction on these two topics, so if you want to use them skillfully, it is best to consult other materials for more in-depth study.

