Scrapy is a mature crawling framework that fetches web pages and extracts structured data from them. Many companies already use it in production. For more information, visit the official website (www.scrapy.org).
We will install it step by step, following the installation guide on the official website. The guide is at http://doc.scrapy.org/en/latest/intro/install.html and lists the following requirements:
Requirements
- Python 2.5, 2.6, 2.7 (3.x is not yet supported)
- Twisted 2.5.0, 8.0 or above (Windows users: you'll need to install zope.interface and maybe pywin32 because of this Twisted bug)
- w3lib
- lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
- simplejson (not required if using Python 2.6 or above)
- pyOpenSSL (for HTTPS support. Optional, but highly recommended)
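As a rough way to see which of these requirements are already present, the following sketch tries to import each one and reports the result. The import names used here (for example "OpenSSL" for pyOpenSSL) and the split into REQUIRED and OPTIONAL are assumptions of this sketch, not something the installation guide prescribes.

```python
import importlib

# Import names for the packages in the requirements list.
# The grouping below is an assumption made for illustration.
REQUIRED = ["twisted", "zope.interface", "w3lib"]
OPTIONAL = ["lxml", "libxml2", "simplejson", "OpenSSL"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    result = {}
    for name in names:
        try:
            importlib.import_module(name)
            result[name] = True
        except ImportError:
            result[name] = False
    return result


def report(names):
    """Print one line per module, marking the ones that failed to import."""
    for name, ok in sorted(check_modules(names).items()):
        print("%-16s %s" % (name, "ok" if ok else "MISSING"))
```

Calling report(REQUIRED) and report(OPTIONAL) with the interpreter you plan to use for Scrapy shows what still needs to be installed.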
The rest of this article records the whole process, from installing Python through installing Scrapy, and finishes by running a fetch command to verify the installation.
Preparations
Operating System: RHEL 5
Python version: Python-2.7.2
zope.interface version: zope.interface-3.8.0
Twisted version: Twisted-11.1.0
libxml2: libxml2-2.7.4.tar.gz
w3lib: w3lib-1.0
Scrapy: Scrapy-0.14.0.2841
Installation and Configuration
1. Install zlib
First, check whether zlib is already installed on your system. zlib is a data compression library that the Scrapy framework depends on. On RHEL 5, query the RPM database:
- [root@localhost scrapy]# rpm -qa zlib
- zlib-1.2.3-3
zlib is installed by default on my system, so I skipped this step. If it is not installed on yours, download the package from http://www.zlib.net/ and install it. For example, after downloading zlib-1.2.5.tar.gz, run the following commands:
- [root@localhost scrapy]# tar -xvzf zlib-1.2.5.tar.gz
- [root@localhost scrapy]# cd zlib-1.2.5
- [root@localhost zlib-1.2.5]# ./configure
- [root@localhost zlib-1.2.5]# make
- [root@localhost zlib-1.2.5]# make install
2. Install Python
My system came with Python 2.4 installed. Following the requirements and recommendations on the official website, I chose Python 2.7.2, which can be downloaded from:
http://www.python.org/download/ (a proxy may be required)
http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
I downloaded the Python source code and compiled it. The installation process is as follows:
- [root@localhost scrapy]# tar -zvxf Python-2.7.2.tgz
- [root@localhost scrapy]# cd Python-2.7.2
- [root@localhost Python-2.7.2]# ./configure
- [root@localhost Python-2.7.2]# make
- [root@localhost Python-2.7.2]# make install
By default, Python is installed under /usr/local/lib/python2.7.
If your system had no other Python installed, running the python command now starts the new interpreter:
- [root@localhost scrapy]# python
- Python 2.7.2 (default, Dec 5 2011, 22:04:07)
- [GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
- Type "help", "copyright", "credits" or "license" for more information.
- >>>
This indicates that the newly installed Python is ready for use.
If your system has another Python version, such as the Python 2.4 on mine, you need to back up the old executable and create a symbolic link to the new one:
- [root@localhost python2.7]# mv /usr/bin/python /usr/bin/python.bak
- [root@localhost python2.7]# ln -s /usr/local/bin/python /usr/bin/python
After this operation, running python starts the new version.
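To confirm which interpreter the python command now resolves to, a small script along these lines can help. It is only a sketch: the function name interpreter_info is made up for illustration, and the zlib check assumes that a Python built without the zlib headers simply lacks the zlib module.

```python
import sys


def interpreter_info():
    """Describe the interpreter that is running this script."""
    try:
        import zlib
        zlib_version = zlib.ZLIB_VERSION
    except ImportError:
        # The interpreter was built without zlib support.
        zlib_version = None
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "zlib": zlib_version,
    }


info = interpreter_info()
for key in ("executable", "version", "zlib"):
    print("%-10s %s" % (key, info[key]))
```

If the reported version is still 2.4, or zlib shows None, the symbolic link or the zlib step probably needs another look before you continue.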
3. Install setuptools
setuptools is a tool for managing Python modules. Skip this step if it is already installed. If you need to install it, refer to the following links:
http://pypi.python.org/pypi/setuptools/0.6c11#installation-instructions
http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea
Alternatively, the unpacked Python-2.7.2 source tree contains a setup.py script. Running it builds and installs the modules bundled with Python (it does not install setuptools itself):
- [root@localhost Python-2.7.2]# python setup.py install
After installation, Python modules are installed in the /usr/local/lib/python2.7/site-packages directory.
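The site-packages directory can also be looked up from Python itself instead of being assumed. The sketch below uses the standard sysconfig module; the helper names site_packages_dir and installed_entries are hypothetical and exist only for this example.

```python
import os
import sysconfig


def site_packages_dir():
    """Return the directory where pure-Python packages are installed."""
    return sysconfig.get_paths()["purelib"]


def installed_entries(prefix):
    """List site-packages entries whose names start with prefix (case-insensitive)."""
    directory = site_packages_dir()
    if not os.path.isdir(directory):
        return []
    wanted = prefix.lower()
    return sorted(e for e in os.listdir(directory) if e.lower().startswith(wanted))


print(site_packages_dir())
```

After each of the following install steps, a call such as installed_entries("zope") or installed_entries("twisted") is a quick way to check that the package and its egg-info landed where expected.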
4. Install zope.interface
The download links are as follows:
http://pypi.python.org/pypi/zope.interface/3.8.0
http://pypi.python.org/packages/source/z/zope.interface/zope.interface-3.8.0.tar.gz#md5=8ab837320b4532774c9c89f030d2a389
The installation process is as follows:
- [root@localhost scrapy]$ tar -xvzf zope.interface-3.8.0.tar.gz
- [root@localhost scrapy]$ cd zope.interface-3.8.0
- [root@localhost zope.interface-3.8.0]$ python setup.py build
- [root@localhost zope.interface-3.8.0]$ python setup.py install
After the installation is complete, you can see zope and zope.interface-3.8.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
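The PyPI download links in this article end in an #md5=... fragment, which can be used to check that a tarball arrived intact. The following is a minimal sketch of that check using hashlib; the function names are made up for illustration, and MD5 here only guards against a corrupted download, not against tampering.

```python
import hashlib


def expected_md5(url):
    """Return the hex digest from a '#md5=' URL fragment, or None if absent."""
    marker = "#md5="
    if marker not in url:
        return None
    return url.split(marker, 1)[1].strip().lower()


def file_md5(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, url):
    """Return True if the file's MD5 matches the digest in the URL fragment."""
    expected = expected_md5(url)
    return expected is not None and file_md5(path) == expected
```

For example, verify_download("zope.interface-3.8.0.tar.gz", url) with the second link above should return True for an intact download.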
5. Install Twisted
The download links are as follows:
http://twistedmatrix.com/trac/
http://pypi.python.org/packages/source/T/Twisted/Twisted-11.1.0.tar.bz2#md5=972f3497e6e19318c741bf2900ffe31c
The installation process is as follows:
- [root@localhost scrapy]# bzip2 -d Twisted-11.1.0.tar.bz2
- [root@localhost scrapy]# tar -xvf Twisted-11.1.0.tar
- [root@localhost scrapy]# cd Twisted-11.1.0
- [root@localhost Twisted-11.1.0]# python setup.py install
After the installation is complete, you can see twisted and Twisted-11.1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
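Directory names inside these tarballs are case-sensitive, so it can help to look inside an archive before unpacking it. The sketch below uses the standard tarfile module to list an archive's top-level names; it reads .tar.gz and .tar.bz2 files alike. The function name top_level_dirs is hypothetical.

```python
import tarfile


def top_level_dirs(archive_path):
    """Return the sorted top-level names inside a tar archive.

    Mode "r:*" lets tarfile detect gzip or bzip2 compression on its own.
    """
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
    tops = set()
    for name in names:
        if name.startswith("./"):
            name = name[2:]
        if name:
            tops.add(name.split("/", 1)[0])
    return sorted(tops)
```

Calling top_level_dirs("Twisted-11.1.0.tar.bz2") would be expected to return a single entry, which is the directory to cd into after extraction.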
6. Install w3lib
The download links are as follows:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
The installation process is as follows:
- [root@localhost scrapy]# tar -xvzf w3lib-1.0.tar.gz
- [root@localhost scrapy]# cd w3lib-1.0
- [root@localhost w3lib-1.0]# python setup.py install
After the installation is complete, you can see w3lib and w3lib-1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
7. Install libxml2
The tarball for the version you need is available from http://xmlsoft.org.
The installation process is as follows:
- [root@localhost scrapy]# tar -xvzf libxml2-2.7.4.tar.gz
- [root@localhost scrapy]# cd libxml2-2.7.4
- [root@localhost libxml2-2.7.4]# ./configure
- [root@localhost libxml2-2.7.4]# make
- [root@localhost libxml2-2.7.4]# make install
8. Install pyOpenSSL
This step is optional. The package is available at:
https://launchpad.net/pyopenssl
If you need it, pick the version you want. I skipped this step.
9. Install Scrapy
The download links are as follows:
http://scrapy.org/download/
http://pypi.python.org/pypi/Scrapy
http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.0.2841.tar.gz#md5=fe63c5606ca4c0772d937b51869be200
The installation process is as follows:
- [root@localhost scrapy]# tar -xvzf Scrapy-0.14.0.2841.tar.gz
- [root@localhost scrapy]# cd Scrapy-0.14.0.2841
- [root@localhost Scrapy-0.14.0.2841]# python setup.py install
Installation Verification
After the installation and configuration steps above, Scrapy is installed. We can verify this from the command line:
- [root@localhost scrapy]# scrapy
- Scrapy 0.14.0.2841 - no active project
-
- Usage:
-   scrapy <command> [options] [args]
-
- Available commands:
-   fetch         Fetch a URL using the Scrapy downloader
-   runspider     Run a self-contained spider (without creating a project)
-   settings      Get settings values
-   shell         Interactive scraping console
-   startproject  Create new project
-   version       Print Scrapy version
-   view          Open URL in browser, as seen by Scrapy
-
- Use "scrapy <command> -h" to see more info about a command
The output above lists a fetch command, which downloads a specified web page. First, look at the help for the fetch command:
- [root@localhost scrapy]# scrapy fetch --help
- Usage
- =====
-   scrapy fetch [options] <url>
-
- Fetch a URL using the Scrapy downloader and print its content to stdout. You
- may want to use --nolog to disable logging
-
- Options
- =======
- --help, -h              show this help message and exit
- --spider=SPIDER         use this spider
- --headers               print response HTTP headers instead of body
-
- Global Options
- --------------
- --logfile=FILE          log file. if omitted stderr will be used
- --loglevel=LEVEL, -L LEVEL
-                         log level (default: DEBUG)
- --nolog                 disable logging completely
- --profile=FILE          write python cProfile stats to FILE
- --lsprof=FILE           write lsprof profiling stats to FILE
- --pidfile=FILE          write process ID to FILE
- --set=NAME=VALUE, -s NAME=VALUE
-                         set/override setting (may be repeated)
Following the usage above, pass a URL to the command. Running it fetches the page, and redirecting stdout saves the page body to a file:
- [root@localhost scrapy]# scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
- 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
- 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
- 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: Enabled, disabled, UserAgentMiddleware, RetryMiddleware, disabled, RedirectMiddleware, CookiesMiddleware, disabled, disabled, DownloaderStats
- 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
- 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines:
- 23:40:05+0800 [default] INFO: Spider opened
- 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
- 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
- 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
- 23:40:07+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
- 23:40:07+0800 [default] INFO: Closing spider (finished)
- 23:40:07+0800 [default] INFO: Dumping spider stats:
- {'downloader/request_bytes': 227,
-  'downloader/request_count': 1,
-  'downloader/request_method_count/GET': 1,
-  'downloader/response_bytes': 22676,
-  'downloader/response_count': 1,
-  'downloader/response_status_count/200': 1,
-  'finish_reason': 'finished',
-  'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
-  'scheduler/memory_enqueued': 1,
-  'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
- 23:40:07+0800 [default] INFO: Spider closed (finished)
- 23:40:07+0800 [scrapy] INFO: Dumping global stats:
- {'memusage/max': 17711104, 'memusage/startup': 17711104}
- [root@localhost scrapy]# ll install.html
- -rw-r--r-- 1 root root 22404 Dec 5 23:40 install.html
- [root@localhost scrapy]#
We can see that we have successfully captured a webpage.
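The same check can be driven from a script. The sketch below builds a scrapy fetch command line using only the --nolog and --headers flags shown in the help text, and writes the command's stdout to a file. The function names are hypothetical, and fetch_to_file assumes the scrapy executable is on the PATH.

```python
import subprocess


def build_fetch_command(url, nolog=True, headers=False):
    """Build the argument list for `scrapy fetch` from flags in its help text."""
    cmd = ["scrapy", "fetch"]
    if nolog:
        cmd.append("--nolog")      # disable logging completely
    if headers:
        cmd.append("--headers")    # print response headers instead of the body
    cmd.append(url)
    return cmd


def fetch_to_file(url, out_path):
    """Run `scrapy fetch` and write its stdout to out_path.

    Returns the exit code of the scrapy process.
    """
    with open(out_path, "wb") as out:
        return subprocess.call(build_fetch_command(url), stdout=out)
```

For instance, fetch_to_file("http://doc.scrapy.org/en/latest/intro/install.html", "install.html") mirrors the shell redirection used above, with logging turned off.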
From here, you can go further with the Scrapy framework by following the documentation on the official website. The tutorial is at http://doc.scrapy.org/en/latest/intro/tutorial.html.