Document directory
- 1. Install zlib
- 2. Install Python
- 3. Install setuptools
- 4. Install Zope.Interface
- 5. Install Twisted
- 6. Install w3lib
- 7. Install libxml2
- 8. Install pyOpenSSL
- 9. Install Scrapy
Scrapy is a mature crawler framework that can crawl web pages and extract structured data from them. Many companies already use it in production environments. For more information, visit the official website, www.scrapy.org.
Following the installation guide on the official website (http://doc.scrapy.org/en/latest/intro/install.html), we will install it step by step. The guide lists the following requirements:
Requirements:
- Python 2.5, 2.6, or 2.7 (3.x is not yet supported)
- Twisted 2.5.0, 8.0 or above (Windows users: you'll need to install Zope.Interface and maybe pywin32 because of a Twisted bug)
- w3lib
- lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
- simplejson (not required if using Python 2.6 or above)
- pyOpenSSL (for HTTPS support; optional, but highly recommended)
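Before starting, it can save time to see which of these dependencies the current Python can already import. A minimal sketch (the loop uses the importable module names, which can differ from the package names; "missing" just means the module will be installed in the steps below):

```shell
# Probe each dependency with a bare import; report OK or missing.
for mod in zlib zope.interface twisted w3lib lxml simplejson OpenSSL; do
    python -c "import $mod" 2>/dev/null \
        && echo "$mod: OK" \
        || echo "$mod: missing"
done
```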
Below is a record of the process, from installing Python through installing Scrapy. Finally, we run a command to verify the installation and configuration.
Preparations
Operating System: RHEL 5
Python version: Python-2.7.2
Zope.Interface version: zope.interface-3.8.0
Twisted version: Twisted-11.1.0
libxml2 version: libxml2-2.7.4.tar.gz
w3lib version: w3lib-1.0
Scrapy: Scrapy-0.14.0.2841
Install configurations
1. Install zlib
First, check whether zlib is already installed on your system. This library is a data-compression toolkit, and the Scrapy framework depends on it. Check on the RHEL 5 system:
[root@localhost scrapy]# rpm -qa zlib
zlib-1.2.3-3
It is installed by default on my system, so this step can be skipped. If it is not installed on yours, download the package from http://www.zlib.net/ and build it from source. For example, with zlib-1.2.5.tar.gz, run the following commands:
[root@localhost scrapy]# tar -xvzf zlib-1.2.5.tar.gz
[root@localhost scrapy]# cd zlib-1.2.5
[root@localhost zlib-1.2.5]# ./configure
[root@localhost zlib-1.2.5]# make
[root@localhost zlib-1.2.5]# make install
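Whether zlib came from the RPM or from source, a quick round trip through Python's standard zlib module confirms the library is usable (this assumes a python interpreter is already on the PATH):

```shell
# Compress and decompress a sample buffer; prints "zlib OK" on success.
python -c 'import zlib; data = b"scrapy" * 100; assert zlib.decompress(zlib.compress(data)) == data; print("zlib OK")'
```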
2. Install Python
Python 2.4 is already installed on my system. Based on the requirements and suggestions on the official website, I chose Python-2.7.2, available from:
http://www.python.org/download/ (a proxy may be required)
http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
I downloaded the Python source code and re-compiled it. The installation process is as follows:
[root@localhost scrapy]# tar -zvxf Python-2.7.2.tgz
[root@localhost scrapy]# cd Python-2.7.2
[root@localhost Python-2.7.2]# ./configure
[root@localhost Python-2.7.2]# make
[root@localhost Python-2.7.2]# make install
By default, the python program is installed in /usr/local/lib/python2.7.
Run the python command to check that the new installation works:
[root@localhost scrapy]# python
Python 2.7.2 (default, Dec 5 2011, 22:04:07)
[GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
This banner indicates that the newly installed Python is ready for use.
If you have other Python versions in your system, such as Python 2.4 in my system, you need to create a symbolic link:
[root@localhost python2.7]# mv /usr/bin/python /usr/bin/python.bak
[root@localhost python2.7]# ln -s /usr/local/bin/python /usr/bin/python
After this, running python invokes the new version.
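To double-check which interpreter the shell now resolves, you can print the executable path and version; the paths in the comments assume the default /usr/local prefix used above:

```shell
# Show which python the shell finds (should follow the new symlink),
# then print the interpreter path and version it reports about itself.
which python
python -c 'import sys; print(sys.executable); print(sys.version.split()[0])'
```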
3. Install setuptools
setuptools is a tool for managing Python modules; skip this step if it is already installed. If you need to install it, refer to the following links:
http://pypi.python.org/pypi/setuptools/0.6c11#installation-instructions
http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea
However, after unpacking Python-2.7.2, you can see that there is a setup.py script in the source tree. You can use this script to install some Python-related modules by running:
[root@localhost Python-2.7.2]# python setup.py install
After installation, Python modules are installed in the /usr/local/lib/python2.7/site-packages directory.
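To confirm which package directories this interpreter actually searches — useful for checking where each of the modules installed below will land — you can filter sys.path (a quick sketch; directory names vary by distribution):

```shell
# List the package directories on the import path.
python - <<'EOF'
import sys
for p in sys.path:
    if p.endswith("site-packages") or p.endswith("dist-packages"):
        print(p)
EOF
```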
4. Install Zope.Interface
The download links are as follows:
http://pypi.python.org/pypi/zope.interface/3.8.0
http://pypi.python.org/packages/source/z/zope.interface/zope.interface-3.8.0.tar.gz#md5=8ab837320b4532774c9c89f030d2a389
The installation process is as follows:
[root@localhost scrapy]$ tar -xvzf zope.interface-3.8.0.tar.gz
[root@localhost scrapy]$ cd zope.interface-3.8.0
[root@localhost zope.interface-3.8.0]$ python setup.py build
[root@localhost zope.interface-3.8.0]$ python setup.py install
After the installation is complete, you can see zope and zope.interface-3.8.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
5. Install Twisted
The download links are as follows:
http://twistedmatrix.com/trac/
http://pypi.python.org/packages/source/T/Twisted/Twisted-11.1.0.tar.bz2#md5=972f3497e6e19318c741bf2900ffe31c
The installation process is as follows:
[root@localhost scrapy]# bzip2 -d Twisted-11.1.0.tar.bz2
[root@localhost scrapy]# tar -xvf Twisted-11.1.0.tar
[root@localhost scrapy]# cd Twisted-11.1.0
[root@localhost Twisted-11.1.0]# python setup.py install
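As an aside, GNU tar can decompress bzip2 archives in a single step with the -j flag, which avoids the separate bzip2 -d call. Demonstrated here on a scratch archive so no assumption is made about the original tarball:

```shell
# Create a bzip2-compressed archive, then extract it in one step with -j.
cd "$(mktemp -d)"
echo "hello" > demo.txt
tar -cjf demo.tar.bz2 demo.txt   # -c create, -j bzip2, -f file name
rm demo.txt
tar -xjf demo.tar.bz2            # extract and decompress in one step
cat demo.txt                     # prints "hello"
```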
After the installation is complete, you can see twisted and Twisted-11.1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
6. Install w3lib
The download links are as follows:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
The installation process is as follows:
[root@localhost scrapy]# tar -xvzf w3lib-1.0.tar.gz
[root@localhost scrapy]# cd w3lib-1.0
[root@localhost w3lib-1.0]# python setup.py install
After the installation is complete, you can see w3lib and w3lib-1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
7. Install libxml2
The download links are as follows:
http://download.chinaunix.net/download.php?id=28497&resourceid=6095
http://download.chinaunix.net/down.php?id=28497&resourceid=6095&site=1
Or, you can find the corresponding version of the compressed package on the website http://xmlsoft.org.
The installation process is as follows:
[root@localhost scrapy]# tar -xvzf libxml2-2.7.4.tar.gz
[root@localhost scrapy]# cd libxml2-2.7.4
[root@localhost libxml2-2.7.4]# ./configure
[root@localhost libxml2-2.7.4]# make
[root@localhost libxml2-2.7.4]# make install
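After make install, the xml2-config helper shipped with libxml2 can report the installed version. The check below degrades gracefully if the tool is not on the PATH:

```shell
# Print the installed libxml2 version, or a note if the helper is absent.
xml2-config --version 2>/dev/null || echo "xml2-config not found on PATH"
```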
8. Install pyOpenSSL
This step is optional. The corresponding installation package is available at:
https://launchpad.net/pyopenssl
If you need HTTPS support, download the desired version and install it; otherwise, this step can be skipped. I skip it here.
9. Install Scrapy
The download links are as follows:
http://scrapy.org/download/
http://pypi.python.org/pypi/Scrapy
http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.0.2841.tar.gz#md5=fe63c5606ca4c0772d937b51869be200
The installation process is as follows:
[root@localhost scrapy]# tar -xvzf Scrapy-0.14.0.2841.tar.gz
[root@localhost scrapy]# cd Scrapy-0.14.0.2841
[root@localhost Scrapy-0.14.0.2841]# python setup.py install
Installation Verification
After the installation and configuration steps above, Scrapy is installed. We can verify it with the following command:
[root@localhost scrapy]# scrapy
Scrapy 0.14.0.2841 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
The usage message above lists a fetch command, which downloads a specified web page. First, look at the help for fetch:
[root@localhost scrapy]# scrapy fetch --help
Usage
=====
  scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h            show this help message and exit
--spider=SPIDER       use this spider
--headers             print response HTTP headers instead of body

Global Options
--------------
--logfile=FILE        log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                      log level (default: DEBUG)
--nolog               disable logging completely
--profile=FILE        write python cProfile stats to FILE
--lsprof=FILE         write lsprof profiling stats to FILE
--pidfile=FILE        write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                      set/override setting (may be repeated)
Following the usage message, specify a URL to fetch a web page:
[root@localhost scrapy]# scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
2011-12-05 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines:
2011-12-05 23:40:05+0800 [default] INFO: Spider opened
2011-12-05 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2011-12-05 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-12-05 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-12-05 23:40:07+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
2011-12-05 23:40:07+0800 [default] INFO: Closing spider (finished)
2011-12-05 23:40:07+0800 [default] INFO: Dumping spider stats:
        {'downloader/request_bytes': 227,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 22676,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
         'scheduler/memory_enqueued': 1,
         'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
2011-12-05 23:40:07+0800 [default] INFO: Spider closed (finished)
2011-12-05 23:40:07+0800 [scrapy] INFO: Dumping global stats:
        {'memusage/max': 17711104,
         'memusage/startup': 17711104}
[root@localhost scrapy]# ll install.html
-rw-r--r-- 1 root root 22404 Dec 5 23:40 install.html
[root@localhost scrapy]#
As we can see, the page was successfully fetched and saved to install.html.
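Beyond fetch, the runspider command listed above runs a standalone spider file without creating a project. The sketch below targets the 0.14-era API (BaseSpider); the spider name and start URL are illustrative, and the spider only logs each fetched URL:

```shell
# Write a minimal standalone spider for Scrapy 0.14.
cat > demo_spider.py <<'EOF'
# Minimal spider: log the URL of each crawled page.
from scrapy.spider import BaseSpider

class DemoSpider(BaseSpider):
    name = "demo"
    start_urls = ["http://doc.scrapy.org/en/latest/intro/install.html"]

    def parse(self, response):
        # Real spiders would extract and return items here.
        self.log("Fetched %s" % response.url)
EOF
# Then run it with: scrapy runspider demo_spider.py
```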
Next, you can explore the Scrapy framework further by following the official tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html.