Install the Scrapy-0.14.0.2841 crawler framework under RHEL 5

Scrapy is a very mature crawler framework that can crawl web pages and extract structured data from them. It is already used by many companies in production environments. For more information, visit the official website, www.scrapy.org.

Based on the installation guide provided on the official website, we will install it step by step. For more information, see http://doc.scrapy.org/en/latest/intro/install.html. The requirements are:

  1. Python 2.5, 2.6, or 2.7 (3.x is not yet supported)
  2. Twisted 2.5.0, 8.0 or above (Windows users: you'll need to install zope.interface and maybe pywin32 because of a Twisted bug)
  3. w3lib
  4. lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
  5. simplejson (not required if using Python 2.6 or above)
  6. pyOpenSSL (for HTTPS support; optional, but highly recommended)
The following records the process from installing Python through installing Scrapy. Finally, a fetch command is run to verify the installation and configuration.

Preparations


Operating system: RHEL 5
Python version: Python-2.7.2
zope.interface version: zope.interface-3.8.0
Twisted version: Twisted-11.1.0
libxml2 version: libxml2-2.7.4
w3lib version: w3lib-1.0
Scrapy version: Scrapy-0.14.0.2841


Installation and Configuration


1. Install zlib

First, check whether zlib is already installed on your system. This library is a data compression toolkit on which the Scrapy framework depends. Check on RHEL 5 as follows:

  [root@localhost scrapy]# rpm -qa zlib
  zlib-1.2.3-3
My system has it installed by default, so this step can be skipped. If it is not installed, download the package from http://www.zlib.net/ (for example, zlib-1.2.5.tar.gz) and install it with the following commands:
  [root@localhost scrapy]# tar -xvzf zlib-1.2.5.tar.gz
  [root@localhost scrapy]# cd zlib-1.2.5
  [root@localhost zlib-1.2.5]# ./configure
  [root@localhost zlib-1.2.5]# make
  [root@localhost zlib-1.2.5]# make install

2. Install Python

Python 2.4 was already installed on my system. Based on the requirements and suggestions on the official website, I chose Python-2.7.2, as shown below:

http://www.python.org/download/ (a proxy may be required)
http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz

I downloaded the Python source code and compiled it myself. The installation process is as follows:

  [root@localhost scrapy]# tar -xvzf Python-2.7.2.tgz
  [root@localhost scrapy]# cd Python-2.7.2
  [root@localhost Python-2.7.2]# ./configure
  [root@localhost Python-2.7.2]# make
  [root@localhost Python-2.7.2]# make install

By default, Python is installed in /usr/local/lib/python2.7.

After the installation, run the python command to verify it:

  [root@localhost scrapy]# python
  Python 2.7.2 (default, Dec  5 2011, 22:04:07)
  [GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>>
This indicates that the newly installed Python is ready for use.

If other Python versions exist on your system, such as Python 2.4 on mine, you need to create a symbolic link:

  [root@localhost python2.7]# mv /usr/bin/python /usr/bin/python.bak
  [root@localhost python2.7]# ln -s /usr/local/bin/python /usr/bin/python
After this operation, running python picks up the new version.
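
A quick way to confirm that the link now points at the new interpreter (on this setup it should print the 2.7.2 version string):

  [root@localhost python2.7]# python -V
  Python 2.7.2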

3. Install setuptools

setuptools is a tool for managing Python modules. Skip this step if it is already installed. If you need to install it, refer to the following links:

http://pypi.python.org/pypi/setuptools/0.6c11#installation-instructions
http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea

However, after installing Python-2.7.2, you can see that there is a setup.py script in the unpacked Python source directory. You can use this script to install some Python-related modules by executing:

  [root@localhost Python-2.7.2]# python setup.py install
After installation, the Python modules are placed in the /usr/local/lib/python2.7/site-packages directory.
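
Alternatively, if you downloaded the setuptools-0.6c11-py2.7.egg file linked above, the setuptools installation instructions note that the egg is a self-installing archive; assuming python on the PATH is now the 2.7 interpreter, it can be run directly:

  [root@localhost scrapy]# sh setuptools-0.6c11-py2.7.egg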

4. Install zope.interface

The download links are as follows:

http://pypi.python.org/pypi/zope.interface/3.8.0
http://pypi.python.org/packages/source/z/zope.interface/zope.interface-3.8.0.tar.gz#md5=8ab837320b4532774c9c89f030d2a389

The installation process is as follows:

  [root@localhost scrapy]# tar -xvzf zope.interface-3.8.0.tar.gz
  [root@localhost scrapy]# cd zope.interface-3.8.0
  [root@localhost zope.interface-3.8.0]# python setup.py build
  [root@localhost zope.interface-3.8.0]# python setup.py install
After the installation is complete, you can see zope and zope.interface-3.8.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
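
As a quick sanity check (assuming /usr/bin/python now points at the new 2.7 interpreter), confirm that the module imports cleanly:

  [root@localhost scrapy]# python -c "import zope.interface; print 'zope.interface OK'"
  zope.interface OK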

5. Install Twisted

The download links are as follows:

http://twistedmatrix.com/trac/
http://pypi.python.org/packages/source/T/Twisted/Twisted-11.1.0.tar.bz2#md5=972f3497e6e19318c741bf2900ffe31c

The installation process is as follows:

  [root@localhost scrapy]# bzip2 -d Twisted-11.1.0.tar.bz2
  [root@localhost scrapy]# tar -xvf Twisted-11.1.0.tar
  [root@localhost scrapy]# cd Twisted-11.1.0
  [root@localhost Twisted-11.1.0]# python setup.py install
After the installation is complete, you can see twisted and Twisted-11.1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
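
Twisted can be checked the same way; printing twisted.version should report 11.1.0 (the exact formatting of the version string may differ):

  [root@localhost scrapy]# python -c "import twisted; print twisted.version"
  [twisted, version 11.1.0]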

6. Install w3lib

The download links are as follows:

http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21

The installation process is as follows:

  [root@localhost scrapy]# tar -xvzf w3lib-1.0.tar.gz
  [root@localhost scrapy]# cd w3lib-1.0
  [root@localhost w3lib-1.0]# python setup.py install
After the installation is complete, you can see w3lib and w3lib-1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
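
Again, a quick import check; w3lib 1.0 ships the w3lib.url and w3lib.html modules that Scrapy uses:

  [root@localhost scrapy]# python -c "import w3lib.url, w3lib.html; print 'w3lib OK'"
  w3lib OK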

7. Install libxml2

You can find a source tarball of the appropriate version at http://xmlsoft.org.

The installation process is as follows:

  [root@localhost scrapy]# tar -xvzf libxml2-2.7.4.tar.gz
  [root@localhost scrapy]# cd libxml2-2.7.4
  [root@localhost libxml2-2.7.4]# ./configure
  [root@localhost libxml2-2.7.4]# make
  [root@localhost libxml2-2.7.4]# make install
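
To confirm the build, xml2-config reports the installed version. If configure also detected Python, the libxml2 Python bindings are installed as well; the import check below assumes the bindings were built, which depends on your configure output:

  [root@localhost scrapy]# xml2-config --version
  2.7.4
  [root@localhost scrapy]# python -c "import libxml2; print 'libxml2 bindings OK'"
  libxml2 bindings OK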

8. Install pyOpenSSL

This step is optional; the installation package is available at:

https://launchpad.net/pyopenssl

If you need HTTPS support, select the desired version there. I skip this step here.

9. Install Scrapy

The download links are as follows:

http://scrapy.org/download/
http://pypi.python.org/pypi/Scrapy
http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.0.2841.tar.gz#md5=fe63c5606ca4c0772d937b51869be200

The installation process is as follows:

  [root@localhost scrapy]# tar -xvzf Scrapy-0.14.0.2841.tar.gz
  [root@localhost scrapy]# cd Scrapy-0.14.0.2841
  [root@localhost Scrapy-0.14.0.2841]# python setup.py install

Installation Verification


After the above installation and configuration, Scrapy is installed. We can verify it from the command line:

  [root@localhost scrapy]# scrapy
  Scrapy 0.14.0.2841 - no active project

  Usage:
    scrapy <command> [options] [args]

  Available commands:
    fetch         Fetch a URL using the Scrapy downloader
    runspider     Run a self-contained spider (without creating a project)
    settings      Get settings values
    shell         Interactive scraping console
    startproject  Create new project
    version       Print Scrapy version
    view          Open URL in browser, as seen by Scrapy

  Use "scrapy <command> -h" to see more info about a command
The listing above includes a fetch command, which fetches a specified web page. First, look at the help information for the fetch command:
  [root@localhost scrapy]# scrapy fetch --help
  Usage
  =====
    scrapy fetch [options] <url>

  Fetch a URL using the Scrapy downloader and print its content to stdout. You
  may want to use --nolog to disable logging

  Options
  =======
    --help, -h              show this help message and exit
    --spider=SPIDER         use this spider
    --headers               print response HTTP headers instead of body

  Global Options
  --------------
    --logfile=FILE          log file. if omitted stderr will be used
    --loglevel=LEVEL, -L LEVEL
                            log level (default: DEBUG)
    --nolog                 disable logging completely
    --profile=FILE          write python cProfile stats to FILE
    --lsprof=FILE           write lsprof profiling stats to FILE
    --pidfile=FILE          write process ID to FILE
    --set=NAME=VALUE, -s NAME=VALUE
                            set/override setting (may be repeated)
Following the usage message, specify a URL to fetch a web page, as shown below:
  [root@localhost scrapy]# scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
  2011-12-05 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
  2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
  2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: ..., UserAgentMiddleware, RetryMiddleware, ..., RedirectMiddleware, CookiesMiddleware, ..., DownloaderStats
  2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
  2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines:
  2011-12-05 23:40:05+0800 [default] INFO: Spider opened
  2011-12-05 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
  2011-12-05 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
  2011-12-05 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
  2011-12-05 23:40:07+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
  2011-12-05 23:40:07+0800 [default] INFO: Closing spider (finished)
  2011-12-05 23:40:07+0800 [default] INFO: Dumping spider stats:
      {'downloader/request_bytes': 227,
       'downloader/request_count': 1,
       'downloader/request_method_count/GET': 1,
       'downloader/response_bytes': 22676,
       'downloader/response_count': 1,
       'downloader/response_status_count/200': 1,
       'finish_reason': 'finished',
       'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
       'scheduler/memory_enqueued': 1,
       'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
  2011-12-05 23:40:07+0800 [default] INFO: Spider closed (finished)
  2011-12-05 23:40:07+0800 [scrapy] INFO: Dumping global stats:
      {'memusage/max': 17711104, 'memusage/startup': 17711104}
  [root@localhost scrapy]# ll install.html
  -rw-r--r-- 1 root root 22404 Dec  5 23:40 install.html
  [root@localhost scrapy]#
We can see that a web page was successfully fetched.
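
Note that the log lines go to stderr (or to the file given by --logfile) while the page body goes to stdout, which is why the redirect captures only the page. For a quieter run, the --nolog option from the help output above can be added:

  [root@localhost scrapy]# scrapy fetch --nolog http://doc.scrapy.org/en/latest/intro/install.html > install.html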

Next, you can explore the Scrapy framework further by following the instructions on the official website. The tutorial is at http://doc.scrapy.org/en/latest/intro/tutorial.html.
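
Before working through the full tutorial, you can also try the runspider command listed earlier, which runs a single spider file without creating a project. The sketch below assumes the 0.14-era BaseSpider API; the file name myspider.py and the spider name "example" are arbitrary. Create myspider.py with this content:

  from scrapy.spider import BaseSpider

  class MySpider(BaseSpider):
      name = "example"
      start_urls = ["http://doc.scrapy.org/en/latest/intro/install.html"]

      def parse(self, response):
          # Log the page size; a real spider would extract structured data here
          self.log("Fetched %s (%d bytes)" % (response.url, len(response.body)))

Then run it:

  [root@localhost scrapy]# scrapy runspider myspider.py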
