Scrapy is an application framework for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and historical data archiving. Although it was originally designed for web scraping (more precisely, web crawling), it can also be used to fetch data returned by APIs (such as web services) or as a general-purpose web crawler.
Scrapy also helps you build more advanced crawlers, handling complex tasks such as authenticated crawling, content analysis and processing, duplicate filtering, and distributed crawling.
Installing Scrapy
Environment: Ubuntu 14.04 + Python 2.7
I have recently been looking into web crawlers, and have long heard that Python is a great language for writing them (every Pythoner seems to say Python is great, but it is true that Python's libraries are very rich, so you rarely have to reinvent the wheel). It also has the powerful Scrapy framework, so I decided to give it a try.
To use Scrapy you of course have to install it first; this article covers installation on Ubuntu.
You need to install a series of dependencies before installing Scrapy.
* Python 2.7: Scrapy is a Python framework, so of course you have to install Python first; since Scrapy only supported Python 2.7 at the time of writing, make sure you install Python 2.7.
* lxml: most Linux distributions already ship with lxml.
* OpenSSL: already available on systems other than Windows.
* Python packages: pip and setuptools. Since pip now depends on setuptools, installing pip automatically installs setuptools.
With the above dependencies in place, installing Scrapy in a non-Windows environment is easy: all you really need is pip, and Scrapy itself is installed with pip.
Check whether the Scrapy dependencies are installed
You may not be sure whether the dependencies mentioned above are already present on your machine, so here is how to check (this article uses Ubuntu 14.04).
Check the Python version
Run the command shown below. If you see output like the following, Python is installed; mine shows Python 2.7.6, which satisfies the 2.7 requirement. If it does not appear, please install Python yourself; this article does not cover installing Python (there are plenty of guides online).
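A minimal check (standard command; the version string below is from my machine and yours may differ):
$ python --version
Python 2.7.6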
Check to see if lxml and OpenSSL are installed
Assuming Python is installed, type python in a terminal to enter the Python interactive interpreter.
Then enter import lxml and import OpenSSL separately; if neither raises an error, both dependencies are installed.
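A typical session looks like this (the interpreter banner is omitted; an import that prints nothing has succeeded):
$ python
>>> import lxml
>>> import OpenSSL
>>>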
Install python-dev and libevent
python-dev is an important package for Python development on Linux; you need it in the following cases:
* You need to build a third-party Python library from source that contains C/C++ extension code which must be compiled against the Python API.
* You write your own program that has to be compiled and linked against libpythonX.Y.(a|so).
libevent is a high-performance, event-driven network library; many frameworks use libevent under the hood.
Both of these need to be installed, otherwise you will run into errors later. Install them with:
$ sudo apt-get install python-dev
$ sudo apt-get install libevent-dev
Install pip
Since Scrapy is easiest to install with pip, we need to install pip first:
$ sudo apt-get install python-pip
Install Scrapy with pip
Install Scrapy using the following command:
$ sudo pip install Scrapy
Remember to run this with root privileges! (You can switch to root with: sudo su)
That completes the Scrapy installation. Use the following command to check whether Scrapy was installed successfully:
$ scrapy version
Output like the following indicates that the installation succeeded; the version installed here is 1.0.2.
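The output is a single line similar to this (the exact number depends on which version pip installed; it was 1.0.2 in this walkthrough):
Scrapy 1.0.2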
Simple Example
Create a Python source file named stackoverflow.py, which reads as follows:
import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # Extract the link of every question on the listing page and
        # schedule a request for each question page.
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # Extract the fields we want from a single question page.
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
Run:
scrapy runspider stackoverflow.py -o top-stackoverflow-questions.json
The result is similar to the following:
[{
    "body": "... LONG HTML here ...",
    "link": "http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array",
    "tags": ["java", "c++", "performance", "optimization"],
    "title": "Why is processing a sorted array faster than an unsorted array?",
    "votes": "9924"
},
{
    "body": "... LONG HTML here ...",
    "link": "http://stackoverflow.com/questions/1260748/how-do-i-remove-a-git-submodule",
    "tags": ["git", "git-submodules"],
    "title": "How do I remove a git submodule?",
    "votes": "1764"
},
...]
When you run scrapy runspider somefile.py, Scrapy looks for a spider defined in that source file and runs it with its crawler engine. The start_urls attribute defines the starting URLs; the crawler builds the initial requests from it, and when a response comes back the default callback method parse is invoked with that response as its argument. In the parse callback we extract the href attribute of every question link with a CSS selector, yield another request for the full URL, and register parse_question as the callback to be executed when that request completes.
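If you want to experiment with these CSS selectors interactively, you can open Scrapy's shell on the start URL (a quick sketch; the selector is the one from the spider above and assumes Stack Overflow's markup at the time of writing):
$ scrapy shell "http://stackoverflow.com/questions?sort=votes"
>>> response.css('.question-summary h3 a::attr(href)').extract()[:3]  # first few question links
>>> exit()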
Process flowchart:
Scrapy framework composition
One of the benefits of Scrapy is that all requests are scheduled and processed asynchronously; even if one request fails, the other requests continue to be processed.
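Concurrency and retry behaviour can be tuned with standard Scrapy settings. A minimal sketch of a project's settings.py (the setting names are standard, the values are only illustrative; for the standalone runspider example above you could instead put the same keys in the spider's custom_settings attribute):
# settings.py in a Scrapy project -- illustrative values only
CONCURRENT_REQUESTS = 16   # how many requests are processed in parallel
RETRY_ENABLED = True       # retry requests that fail
RETRY_TIMES = 2            # extra attempts per failed request
DOWNLOAD_DELAY = 0.5       # seconds to wait between requests to the same site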
In our example, the scraped results are exported as JSON, but you can also export them in other formats (such as XML or CSV) or store them on an FTP server or Amazon S3. You can also write them to a database through an item pipeline; the data can be stored in many different ways.
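To give a rough idea of what an item pipeline looks like, here is a minimal sketch (the class name and output file are made up for illustration; it writes items to a JSON-lines file rather than a real database, and it would be enabled through the ITEM_PIPELINES setting):
import json

class JsonWriterPipeline(object):
    # Called once when the spider is opened.
    def open_spider(self, spider):
        self.file = open('questions.jl', 'w')

    # Called once when the spider is closed.
    def close_spider(self, spider):
        self.file.close()

    # Called for every item the spider yields.
    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item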
Scrapy Features Overview
You can already use Scrapy to crawl data from a website and save the parsed results, but that only scratches the surface. Scrapy provides many more features to make crawling easier and more efficient, for example:
* Built-in support for selecting and extracting data from HTML/XML sources with extended CSS selectors and XPath expressions, plus helper methods that use regular expressions.
* An interactive shell console for trying out CSS and XPath expressions, which is very useful when debugging your spiders.
* Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, the local file system).
* Robust encoding support and auto-detection for handling foreign, non-standard, and broken encodings.
* Extensibility: you can plug in your own functionality using signals and a friendly API (middlewares, extensions, and pipelines).
* A wide range of built-in extensions and middlewares for handling cookies and sessions, HTTP features such as compression, authentication, and caching, user-agent spoofing, robots.txt, crawl depth restriction, and more.
* Many other goodies, such as reusable spiders for crawling sites from sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images associated with the scraped items, a caching DNS resolver, and much more.
Original link: http://www.pycoding.com/2016/03/08/scrapy-01.html