Introduction to the Scrapy framework
Scrapy is a fast, high-level screen scraping and web crawling framework developed in Python, used to crawl websites and extract structured data from their pages. Scrapy has a wide range of applications, including data mining, monitoring, and automated testing. (Quoted from: Baidu Encyclopedia)
Scrapy official website: https://scrapy.org/
Scrapy official documentation: https://doc.scrapy.org/en/latest/
Scrapy Framework Installation
First, we install Scrapy using the following command:
pip install scrapy
At this point, many people will run into the following error:
error: Microsoft Visual C++ 10.0 is required. Get it with "Microsoft Windows SDK 7.1": www.microsoft.com/download/details.aspx?id=8279
This is because Scrapy depends on compiled C++ extensions, so a Visual C++ 10.0 build environment must be present during installation. The most straightforward workaround is to download and install Microsoft Visual C++ 10.0, but that is a large environment to download and configure, which is quite daunting.
So I suggest a second approach. If we look carefully at the pip output just before the error, we see:
Running setup.py clean for Twisted
Failed to build Twisted
In other words, the error occurs while installing the Twisted module, so we can choose to download Twisted manually and install it ourselves. There are many download sources for Python libraries, and many people will end up with a file like twisted-xx.x.x.tar.bz2, unpack it, install from source, and find that the same error occurs. If we look carefully at the earlier Scrapy installation output, we will find that the pip command itself downloads the bz2 file, unpacks it, and runs the extracted setup.py, so this is no different from the manual source-install process and fails for the same reason.
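For reference, the manual source install described above boils down to roughly these steps (the version number is a placeholder; unpack with your archive tool of choice if tar is unavailable):

tar -xjf Twisted-xx.x.x.tar.bz2
cd Twisted-xx.x.x
python setup.py install

Without the C++ build environment, the last step fails in exactly the same way as pip does, which is why we use a prebuilt wheel instead.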
Instead, I recommend the website https://www.lfd.uci.edu/~gohlke/pythonlibs, which hosts prebuilt wheels for almost all commonly used Python libraries. For example, since we need the Twisted library this time, we search the page for Twisted and download the build matching our Python version and bitness. Then we open cmd in the download directory and run the following command (remember to substitute the file name you actually downloaded):
pip install Twisted-xx.x.x-cpxx-cpxxm-win_amd64.whl
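If you are unsure which wheel matches your interpreter, here is a minimal check you can run in Python (the cpXX tag in the wheel name comes from the Python version, and the pointer size gives the bitness):

import struct
import sys

print(sys.version_info[:2])       # e.g. (3, 6) corresponds to the cp36 tag
print(struct.calcsize("P") * 8)   # 64 -> win_amd64 wheels, 32 -> win32 wheels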
Then we simply wait for it to finish, and the Twisted library that Scrapy requires is installed. Now we re-execute
pip install scrapy
Installation is successful!
During the installation we can see that pip also downloads many supporting libraries for us; these are what make Scrapy a complete, systematic crawler framework that greatly simplifies our programming and lowers the learning cost.
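If you are curious which libraries those are, pip can list Scrapy's direct dependencies:

pip show scrapy

The "Requires:" line of the output names Twisted, lxml, and the other helpers (the exact list depends on your Scrapy version).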
Strictly speaking, Scrapy is built on Twisted rather than on requests, but the requests library is used so often alongside crawler projects that we install it as well:
pip install requests
At this point, we have completed the installation of the Scrapy crawler framework.
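To confirm that everything is in place, either of the following should print a version number if the installation succeeded:

scrapy version
python -c "import scrapy; print(scrapy.__version__)"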
Scrapy Crawler Template Introduction
Scrapy is a full framework, so the preparation before use is slightly more involved than with some other crawler tools, but these few extra steps greatly reduce the difficulty of the programming that follows.
Since we are not yet familiar with the Scrapy framework, we use Scrapy's own commands to generate a project template. This time we take Jobbole as our example.
First, we use the following command:
scrapy startproject jobboleSpider
This creates a Scrapy crawler project under the current path, but the project is not yet complete, so following its prompt we use the next commands to generate a spider from a template:
cd jobboleSpider
scrapy genspider jobbole http://blog.jobbole.com/all-posts/
From the output we learn that Scrapy used its "basic" template to successfully create the project and build its basic structure. Now let's open the IDE to see the template we created.
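The generated project layout should look roughly like this (assuming the project and spider names used above):

jobboleSpider/
    scrapy.cfg
    jobboleSpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            jobbole.py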
scrapy.cfg is our global configuration file; it records things such as the path to the settings file and the project name.
jobbole.py is the main file where our future crawler logic will live (see the generated skeleton after this list).
items.py is the file where we define the structure used to store our data.
middlewares.py contains the middleware, such as downloader middleware and redirect middleware; these are the important channels through which the Scrapy engine exchanges information with the other components.
pipelines.py is, as its name suggests, a pipeline, used primarily to store the data we obtain in a database.
settings.py holds a large number of Scrapy settings, such as whether the robots.txt protocol should be obeyed (the ROBOTSTXT_OBEY option).
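For reference, the skeleton that the "basic" template generates in jobbole.py typically looks like the following (the exact values of allowed_domains and start_urls depend on the argument passed to genspider, so treat these as illustrative):

import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        # Our crawling logic will go here in the next article
        pass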
Conclusion
At this point we have completed the installation of Scrapy and generated the basic framework, but we have not yet written any actual crawling code. In the next article I will walk everyone through crawling all of the articles in Jobbole's "latest articles" section, to get a first taste of the fun of Scrapy crawling, and then move step by step into simulated login, breaking through anti-crawler restrictions, and so on.
I will try to explain every step of my operations in detail, to make this a true beginner's tutorial. The following tutorials will use XPath and regular expressions; due to space constraints, I will only cover the basics of these two topics, so if you want to use them skillfully, it is best to consult other material for more in-depth study.