First, an introduction to the Scrapy crawler framework
Scrapy is a fast, high-level screen-scraping and web-crawling framework. It crawls web sites and extracts structured data from web pages, and it has a wide range of uses, from data mining to monitoring and automated testing. Scrapy is implemented entirely in Python, is fully open source, and its code is hosted on GitHub. It runs on Linux, Windows, Mac, and BSD, and it handles network communication with the Twisted asynchronous networking library. Users only need to customize a few modules to easily implement a crawler that scrapes web content and images.
Second, scrapy Installation Guide
These installation steps assume that you have already installed: <1> Python 2.7 <2> lxml <3> OpenSSL. We use Python's package management tools pip or easy_install to install Scrapy.
Installing with pip:
pip install Scrapy
Installing with easy_install:
easy_install Scrapy
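After either command finishes, you can confirm the install worked with a short check (a minimal sketch; the `scrapy_version` helper name is my own, not part of Scrapy):

```python
def scrapy_version():
    """Return the installed Scrapy version string, or None if Scrapy is missing."""
    try:
        import scrapy
        return scrapy.__version__
    except ImportError:
        return None

print(scrapy_version() or "Scrapy is not installed")
```

If this prints "Scrapy is not installed", revisit the steps above before moving on.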
Third, environment configuration on the Ubuntu platform
1. Python's package management tools
The current package-management tool chain is easy_install/pip + distribute/setuptools:
distutils: Python's built-in basic installation tool, suitable only for very simple scenarios;
setuptools: extends distutils considerably, most notably with a package-dependency mechanism; in part of the Python community it is already the de facto standard;
distribute: because setuptools developed slowly, did not support Python 3, and had messy code, a group of programmers reinvented the wheel: they refactored the code and added features, hoping to replace setuptools and be accepted as the official standard library. They worked very hard, and in a short time the community accepted distribute. Both setuptools and distribute are just extensions of distutils;
easy_install: setuptools and distribute ship their own install script, so once setuptools or distribute is installed, easy_install is available. Its biggest feature is automatic discovery of PyPI, Python's officially maintained package index, which makes installing third-party Python packages very convenient;
pip: pip's goal is very clear – to replace easy_install. easy_install has many shortcomings: installation is not an atomic transaction, only SVN is supported, there is no uninstall command, and installing a series of packages requires writing a script. pip solves all of these problems and has become the new de facto standard; virtualenv and pip have become a great pair of partners.
Installation process:
Install distribute:
$ curl -O http://python-distribute.org/distribute_setup.py
$ python distribute_setup.py
Install pip:
$ curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
$ [sudo] python get-pip.py
2. Installing Scrapy
On the Windows platform, you can download the various dependencies as binary packages, either via a package manager or manually: pywin32, Twisted, zope.interface, lxml, pyOpenSSL.
On Ubuntu 9.10 and later, the Scrapy project officially recommends against the python-scrapy package that Ubuntu provides: those packages are either too old or too slow to keep up with the latest Scrapy. The solution is to use the official Ubuntu packages published by the Scrapy project, which provide all of the dependent libraries, receive ongoing updates with the latest bug fixes, and are more stable; they are built continuously from the GitHub repository (master and stable branches). On Ubuntu 9.10 and later, Scrapy is installed as follows:
<1> Import the GPG key
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
<2> Create the /etc/apt/sources.list.d/scrapy.list file
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
<3> Update the package list and install scrapy-VERSION, replacing VERSION with the actual version, for example scrapy-0.22:
sudo apt-get update && sudo apt-get install scrapy-VERSION
3. Installing Scrapy's dependent libraries
Installing Scrapy's dependent libraries under Ubuntu 12.04. Each error below is followed by the command that fixes it:
ImportError: No module named w3lib.http
pip install w3lib
ImportError: No module named twisted
pip install twisted
ImportError: No module named lxml.html
pip install lxml
Error: libxml/xmlversion.h: No such file or directory
apt-get install libxml2-dev libxslt-dev
apt-get install python-lxml
ImportError: No module named cssselect
pip install cssselect
ImportError: No module named OpenSSL
pip install pyopenssl
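All of the ImportError cases above can be checked in one pass before running Scrapy. The sketch below uses only the standard library; the `is_installed` helper is illustrative, not part of Scrapy:

```python
import importlib.util

def is_installed(module_name):
    """Return True if the named module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

# The top-level dependencies covered by the fixes above
for name in ("w3lib", "twisted", "lxml", "cssselect", "OpenSSL"):
    print(name, "OK" if is_installed(name) else "missing -> see the fix above")
```

Any module reported as missing maps directly to one of the pip/apt-get commands listed above.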
4. Developing your own crawler
Switch to your working directory and create a new project:
scrapy startproject test