Installing the Scrapy-0.14.0.2841 crawler framework on RHEL 5

Document directory
  • 1. Install zlib
  • 2. Install Python
  • 3. Install setuptools
  • 4. Install Zope.Interface
  • 5. Install twisted
  • 6. Install w3lib
  • 7. Install libxml2
  • 8. Install pyopenssl
  • 9. Install scrapy

Scrapy is a very mature crawler framework that can crawl web pages and extract structured data from them. Many enterprises currently use it in production environments. For more information, visit the official website, www.scrapy.org.

Following the installation guide provided on the official website, we will install it step by step. For details, see http://doc.scrapy.org/en/latest/intro/install.html:


Requirements

  • Python 2.5, 2.6, 2.7 (3.x is not yet supported)
  • Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of this Twisted bug)
  • w3lib
  • lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
  • simplejson (not required if using Python 2.6 or above)
  • pyopenssl (for HTTPS support. Optional, but highly recommended)

Next, I record the process from installing Python to installing Scrapy, and finally run a command to verify the installation and configuration.

Preparations


Operating System: RHEL 5
Python version: Python-2.7.2
Zope.Interface version: zope.interface-3.8.0
Twisted version: Twisted-11.1.0
libxml2 version: libxml2-2.7.4
w3lib version: w3lib-1.0
Scrapy version: Scrapy-0.14.0.2841


Installation and configuration


1. Install zlib

First, check whether zlib is already installed on your system. This library is a data compression toolkit, and the Scrapy framework depends on it. Check whether it is present on RHEL 5:


[root@localhost scrapy]# rpm -qa zlib
zlib-1.2.3-3

My system has it installed by default, so I can skip this step. If it is not installed on yours, download the package from http://www.zlib.net/ and install it. For example, with zlib-1.2.5.tar.gz downloaded, run the following commands:


[root@localhost scrapy]# tar -xvzf zlib-1.2.5.tar.gz
[root@localhost scrapy]# cd zlib-1.2.5
[root@localhost zlib-1.2.5]# ./configure
[root@localhost zlib-1.2.5]# make
[root@localhost zlib-1.2.5]# make install
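
If you built from source, you can check that the library landed in the default prefix (zlib installs under /usr/local unless told otherwise; the file names below are illustrative, not captured from the original session):

[root@localhost scrapy]# ls /usr/local/lib | grep libz
libz.a
libz.so.1.2.5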


2. Install Python


Python 2.4 was already installed on my system. Following the requirements and suggestions on the official website, I chose Python-2.7.2, available below:

http://www.python.org/download/ (a proxy may be required)
http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz

I downloaded the Python source code and re-compiled it. The installation process is as follows:


[root@localhost scrapy]# tar -zvxf Python-2.7.2.tgz
[root@localhost scrapy]# cd Python-2.7.2
[root@localhost Python-2.7.2]# ./configure
[root@localhost Python-2.7.2]# make
[root@localhost Python-2.7.2]# make install


By default, Python's library files are installed in /usr/local/lib/python2.7.

If no other Python was previously installed on your system, running the python command should now show the following:


[root@localhost scrapy]# python
Python 2.7.2 (default, Dec  5 2011, 22:04:07) 
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

This indicates that the newly installed Python is ready for use.

If other Python versions exist on your system, such as Python 2.4 on mine, you need to create a symbolic link:


[root@localhost python2.7]# mv /usr/bin/python /usr/bin/python.bak
[root@localhost python2.7]# ln -s /usr/local/bin/python /usr/bin/python

After this, running python invokes the new version.
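
To confirm that the link now points at the new interpreter (the output shown is what you should expect, not a capture from the original session):

[root@localhost python2.7]# python -V
Python 2.7.2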

3. Install setuptools

setuptools is a tool for managing Python modules. Skip this step if it is already installed. If you need to install it, refer to the following links:

http://pypi.python.org/pypi/setuptools/0.6c11#installation-instructions
http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea

However, after installing Python-2.7.2, you can see a setup.py script in the unpacked Python source directory. You can use this script to install some Python-related modules by executing:


[root@localhost Python-2.7.2]# python setup.py install

After installation, Python modules are installed in the /usr/local/lib/python2.7/site-packages directory.
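
Whichever way setuptools ends up on your machine, a bare import is a quick sanity check; a silent exit means it is importable, while an ImportError means you still need the egg from the links above:

[root@localhost scrapy]# python -c "import setuptools"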

4. Install Zope.Interface

The download addresses are as follows:

http://pypi.python.org/pypi/zope.interface/3.8.0
http://pypi.python.org/packages/source/z/zope.interface/zope.interface-3.8.0.tar.gz#md5=8ab837320b4532774c9c89f030d2a389

The installation process is as follows:


[root@localhost scrapy]$ tar -xvzf zope.interface-3.8.0.tar.gz
[root@localhost scrapy]$ cd zope.interface-3.8.0
[root@localhost zope.interface-3.8.0]$ python setup.py build
[root@localhost zope.interface-3.8.0]$ python setup.py install

After the installation is complete, you can see zope and zope.interface-3.8.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
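
You can verify the module from the interpreter as well; if the import raises nothing, the installation is usable:

[root@localhost scrapy]$ python -c "import zope.interface"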

5. Install twisted

The download addresses are as follows:

http://twistedmatrix.com/trac/
http://pypi.python.org/packages/source/T/Twisted/Twisted-11.1.0.tar.bz2#md5=972f3497e6e19318c741bf2900ffe31c

The installation process is as follows:


[root@localhost scrapy]# bzip2 -d Twisted-11.1.0.tar.bz2
[root@localhost scrapy]# tar -xvf Twisted-11.1.0.tar
[root@localhost scrapy]# cd Twisted-11.1.0
[root@localhost Twisted-11.1.0]# python setup.py install

After the installation is complete, you can see twisted and Twisted-11.1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
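
Twisted exposes its version at import time, which makes for a slightly more informative check (the exact formatting of the printed version object varies between releases):

[root@localhost scrapy]# python -c "import twisted; print twisted.version"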

6. Install w3lib

The download addresses are as follows:

http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21

The installation process is as follows:


[root@localhost scrapy]# tar -xvzf w3lib-1.0.tar.gz
[root@localhost scrapy]# cd w3lib-1.0
[root@localhost w3lib-1.0]# python setup.py install

After the installation is complete, you can see w3lib and w3lib-1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
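
As with the previous modules, a bare import confirms that w3lib is visible to the Python built in step 2:

[root@localhost scrapy]# python -c "import w3lib"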

7. Install libxml2

The download addresses are as follows:

http://download.chinaunix.net/download.php?id=28497&resourceid=6095
http://download.chinaunix.net/down.php?id=28497&resourceid=6095&site=1

Or, you can find the corresponding version of the compressed package on the website http://xmlsoft.org.

The installation process is as follows:


[root@localhost scrapy]# tar -xvzf libxml2-2.7.4.tar.gz
[root@localhost scrapy]# cd libxml2-2.7.4
[root@localhost libxml2-2.7.4]# ./configure
[root@localhost libxml2-2.7.4]# make
[root@localhost libxml2-2.7.4]# make install
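
Scrapy uses libxml2 through its Python bindings, and the source build only produces them when configure detects a Python interpreter, so make sure the Python from step 2 was on the PATH during the build. A quick check (a silent exit means the bindings are importable):

[root@localhost scrapy]# python -c "import libxml2"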


8. Install pyopenssl

This step is optional and the corresponding installation package is:

https://launchpad.net/pyopenssl

If necessary, select the desired version there. I skipped this step.
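
If you do want HTTPS support, the build follows the same pattern as the other Python modules. A sketch, assuming a tarball named pyOpenSSL-0.13.tar.gz (substitute whichever version you actually download; the OpenSSL development headers, e.g. the openssl-devel RPM, must be present for the C extension to compile):

[root@localhost scrapy]# tar -xvzf pyOpenSSL-0.13.tar.gz
[root@localhost scrapy]# cd pyOpenSSL-0.13
[root@localhost pyOpenSSL-0.13]# python setup.py install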

9. Install scrapy

The download addresses are as follows:

http://scrapy.org/download/
http://pypi.python.org/pypi/Scrapy
http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.0.2841.tar.gz#md5=fe63c5606ca4c0772d937b51869be200

The installation process is as follows:


[root@localhost scrapy]# tar -xvzf Scrapy-0.14.0.2841.tar.gz
[root@localhost scrapy]# cd Scrapy-0.14.0.2841
[root@localhost Scrapy-0.14.0.2841]# python setup.py install


Installation Verification



After the above installation and configuration process, Scrapy is installed. We can verify the installation from the command line:


[root@localhost scrapy]# scrapy
Scrapy 0.14.0.2841 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

The output above shows a fetch command, which fetches a specified web page. First, look at the help information for the fetch command:


[root@localhost scrapy]# scrapy fetch --help
Usage
=====
  scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h              show this help message and exit
--spider=SPIDER         use this spider
--headers               print response HTTP headers instead of body

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--lsprof=FILE           write lsprof profiling stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)

Following that usage, specify a URL, and the page's data is fetched after execution, as shown below:


[root@localhost scrapy]# scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
2011-12-05 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines: 
2011-12-05 23:40:05+0800 [default] INFO: Spider opened
2011-12-05 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2011-12-05 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-12-05 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-12-05 23:40:07+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
2011-12-05 23:40:07+0800 [default] INFO: Closing spider (finished)
2011-12-05 23:40:07+0800 [default] INFO: Dumping spider stats:
        {'downloader/request_bytes': 227,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 22676,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
         'scheduler/memory_enqueued': 1,
         'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
2011-12-05 23:40:07+0800 [default] INFO: Spider closed (finished)
2011-12-05 23:40:07+0800 [scrapy] INFO: Dumping global stats:
        {'memusage/max': 17711104, 'memusage/startup': 17711104}
[root@localhost scrapy]# ll install.html 
-rw-r--r-- 1 root root 22404 Dec  5 23:40 install.html
[root@localhost scrapy]#

We can see that a web page has been fetched successfully.

Next, you can go further with the Scrapy framework by following the instructions on the official website. The tutorial is at http://doc.scrapy.org/en/latest/intro/tutorial.html.
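
As a taste of what the tutorial covers, below is a minimal self-contained spider written against the 0.14-era BaseSpider API. The spider name, file name, and log message are illustrative, not part of the original walkthrough. Save it as install_spider.py and run it with scrapy runspider install_spider.py; it should produce a session similar to the fetch output above.

from scrapy.spider import BaseSpider

class InstallDocSpider(BaseSpider):
    # arbitrary identifier shown in Scrapy's log output
    name = "installdoc"
    # pages the spider starts crawling from
    start_urls = ["http://doc.scrapy.org/en/latest/intro/install.html"]

    def parse(self, response):
        # called once per downloaded page; log URL and size as a smoke test
        self.log("Fetched %s (%d bytes)" % (response.url, len(response.body)))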
