Scrapy is an open-source, standalone crawler written in Python on top of the Twisted framework. It bundles the tools most web crawlers need for downloading pages and extracting data from them.
Installation environment:
CentOS 5.4, Python 2.7.3
Installation steps:
1. Download and install Python 2.7: http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz
# wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tgz -P /opt
# cd /opt
# tar xvf Python-2.7.3.tgz
# cd Python-2.7.3
# ./configure
# make && make install
Verify the Python 2.7 installation:
# python2.7
Python 2.7.3 (default, Feb 28 2013, 03:08:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
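The same check can also be scripted rather than done interactively. The following is a minimal sketch using only the standard library; the helper name `meets_minimum` is hypothetical and not part of the original steps.

```python
import sys


def meets_minimum(version_info, minimum=(2, 7)):
    """Return True if the interpreter version is at least `minimum`."""
    # Compare only the (major, minor) pair, e.g. (2, 7) for Python 2.7.3.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
    print("OK" if meets_minimum(sys.version_info) else "Too old for this guide")
```

Run it with the interpreter you just built, for example `python2.7 check_version.py` (the file name is arbitrary).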
2. Install setuptools: http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz
# wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz -P /opt/
# cd /opt
# tar zxvf setuptools-0.6c11.tar.gz
# cd setuptools-0.6c11
# python2.7 setup.py install
3. Install Twisted
# easy_install Twisted
......
Installed /usr/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg
......
Installed /usr/local/lib/python2.7/site-packages/zope.interface-4.0.4-py2.7-linux-x86_64.egg
zope.interface is installed automatically along with Twisted. If you need to install either package manually, download it from the addresses below.
zope.interface: http://pypi.python.org/packages/source/z/zope.interface/zope.interface-4.0.1.tar.gz
Twisted: http://twistedmatrix.com/Releases/Twisted/12.1/Twisted-12.1.0.tar.bz2
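Because easy_install may pick a different release than the tarballs linked above, it can help to confirm which versions actually ended up installed. This is a rough sketch, not part of the original article: the helper `installed_version` is a hypothetical name, and it assumes either `importlib.metadata` (newer Python versions) or setuptools' `pkg_resources` (the Python 2.7 setup described here) is available.

```python
def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        # Newer Python versions: standard-library metadata lookup.
        from importlib import metadata
    except ImportError:
        metadata = None

    if metadata is not None:
        try:
            return metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            return None

    # Python 2.7 with setuptools: fall back to pkg_resources.
    import pkg_resources
    try:
        return pkg_resources.get_distribution(dist_name).version
    except pkg_resources.DistributionNotFound:
        return None


if __name__ == "__main__":
    for name in ("Twisted", "zope.interface"):
        print("%s: %s" % (name, installed_version(name) or "not installed"))
```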
5. Install w3lib
# easy_install -U w3lib
Searching for w3lib
Reading http://pypi.python.org/simple/w3lib/
Reading http://github.com/scrapy/w3lib
Best match: w3lib 1.2
Downloading http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz#md5=f929d5973a9fda59587b09a72f185a9e
Processing w3lib-1.2.tar.gz
Running w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-wm_1BB/w3lib-1.2/egg-dist-tmp-2DQHY_
zip_safe flag not set; analyzing archive contents...
Adding w3lib 1.2 to easy-install.pth file
Installed /usr/local/lib/python2.7/site-packages/w3lib-1.2-py2.7.egg
Processing dependencies for w3lib
Finished processing dependencies for w3lib
w3lib: http://pypi.python.org/packages/source/w/w3lib/w3lib-1.2.tar.gz
6. Install lxml with easy_install, or install libxml2
# easy_install lxml
Verify the lxml installation:
# python2.7
Python 2.7.3 (default, Feb 28 2013, 03:08:43)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml
>>> exit()
Alternatively, you can install libxml2. The official site recommends version 2.6.28 or later, but I could not find that version there. I first installed version 2.6.9, and running scrapy then failed with the following error:
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.14.4', 'scrapy')
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script
  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in run_script
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 112, in execute
    cmds = _get_commands_dict(inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules
    submod = __import__(fullpath, {}, {}, [''])
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>
    from scrapy.shell import Shell
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/shell.py", line 14, in <module>
    from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>
    from scrapy.selector.libxml2sel import *
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>
    from .factories import xmlDoc_from_html, xmlDoc_from_xml
  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>
    libxml2.HTML_PARSE_NOERROR +
AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'
I then upgraded to version 2.6.21.
libxml2-python 2.6.21: ftp://xmlsoft.org/libxml2/python/libxml2-python-2.6.21.tar.gz
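The traceback above fails because the installed libxml2 bindings lack a constant that Scrapy references. A quick way to test a given libxml2 build before running scrapy is to look for those constants directly. This is only a sketch: the helper `missing_attributes` is a hypothetical name, and the list of constants is limited to the two that appear in the traceback, so Scrapy may well need others.

```python
def missing_attributes(module, names):
    """Return the names from `names` that `module` does not define."""
    return [name for name in names if not hasattr(module, name)]


# Only the constants named in the traceback; this list is an assumption
# and is not guaranteed to be complete.
REQUIRED = ("HTML_PARSE_RECOVER", "HTML_PARSE_NOERROR")

if __name__ == "__main__":
    try:
        import libxml2  # the libxml2 Python bindings
    except ImportError:
        print("libxml2 bindings are not importable")
    else:
        missing = missing_attributes(libxml2, REQUIRED)
        print("missing: %s" % (", ".join(missing) or "none"))
```

If the script reports a missing constant, the bindings are too old for that Scrapy release and a newer libxml2-python is needed.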
7. Install pyOpenSSL (optional; mainly needed so that Scrapy can support HTTPS)
Running easy_install pyopenssl tries to install pyOpenSSL 0.13, but that installation failed, so I downloaded version 0.11 manually and installed it.
# wget http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz -P /opt
# cd /opt
# tar zxvf pyOpenSSL-0.11.tar.gz
# cd pyOpenSSL-0.11
# python2.7 setup.py install
pyOpenSSL: http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz
8. Install Scrapy
# easy_install -U Scrapy
Verify the installation:
# scrapy
Scrapy 0.16.4 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Scrapy: http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.4.tar.gz
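As a final sanity check, you can confirm that every package from steps 3 to 8 imports cleanly under python2.7. The sketch below is an illustration rather than part of the original procedure; `check_imports` is a hypothetical helper, and the module names are the usual import names for these packages.

```python
import importlib


def check_imports(module_names):
    """Map each module name to True if it imports cleanly, else False."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            # Only a failed import counts as "missing"; other errors
            # propagate so they stay visible.
            results[name] = False
    return results


# Import names for the packages installed in steps 3 to 8.
MODULES = ("twisted", "zope.interface", "w3lib", "lxml", "OpenSSL", "scrapy")

if __name__ == "__main__":
    for name, ok in sorted(check_imports(MODULES).items()):
        print("%-15s %s" % (name, "ok" if ok else "MISSING"))
```

Only `ImportError` is caught on purpose: a problem such as the `AttributeError` shown in step 6 would still surface as a traceback instead of being reported as a missing package.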
Summary:
pyOpenSSL could not be installed on its own with easy_install. Instead, download pyOpenSSL 0.11 and install it manually, then run easy_install -U Scrapy to complete the installation.
Original article: http://www.cnblogs.com/xiaoruoen/archive/2013/02/27/2933854.html
Install Scrapy on CentOS