Python Crawler Framework Scrapy: Installation and Usage Steps


First, an introduction to the Scrapy crawler framework
Scrapy is a fast, high-level screen-scraping and web-crawling framework for crawling websites and extracting structured data from their pages. It has a wide range of uses, from data mining to monitoring and automated testing. Scrapy is implemented entirely in Python, is fully open source with its code hosted on GitHub, and runs on Linux, Windows, Mac, and BSD. It handles network communication through the Twisted asynchronous networking library, so a user only needs to customize a few modules to build a crawler that fetches web content and images.
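To make "extracting structured data from pages" concrete, here is a minimal sketch of the kind of extraction Scrapy's selectors automate, written with only Python's standard library (this is not Scrapy code; the class and variable names are illustrative):

```python
# Illustrative only: collect the text of every <a> tag on a page
# using the standard-library HTML parser. Scrapy does this kind of
# extraction for you with XPath/CSS selectors.
from html.parser import HTMLParser

class LinkTextExtractor(HTMLParser):
    """Accumulate the text content of all <a> elements."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links.append(data.strip())

page = '<html><body><a href="/a">First</a> <a href="/b">Second</a></body></html>'
parser = LinkTextExtractor()
parser.feed(page)
print(parser.links)  # → ['First', 'Second']
```

Scrapy replaces this hand-written parsing with declarative selectors, plus scheduling, retries, and pipelines for the extracted items.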

Second, the Scrapy installation guide

The installation steps below assume you already have the following installed: (1) Python 2.7, (2) lxml, (3) OpenSSL. Scrapy can then be installed with Python's package management tools pip or easy_install.
To install with pip:

$ pip install scrapy


To install with easy_install:

$ easy_install scrapy

Third, environment configuration on the Ubuntu platform

1. Python's package management tools
The current package management tool chain is easy_install/pip on top of distribute/setuptools:
distutils: Python's built-in basic installation tool, suitable only for very simple scenarios.
setuptools: extends distutils considerably, most notably with a package-dependency mechanism; in much of the Python community it is already the de facto standard.
distribute: because setuptools development had stalled, did not support Python 3, and its code was messy, a group of developers forked it, refactored the code, and added features, hoping to replace setuptools and be accepted into the official standard library; they worked hard, and in a short time the community adopted distribute. Note that both setuptools and distribute are just extensions of distutils.
easy_install: setuptools and distribute each ship their own installation script, so once either of them is installed, easy_install is available. Its biggest feature is automatic discovery of packages on PyPI, Python's officially maintained package index, which makes installing third-party packages very convenient.
pip: pip's goal is explicit: to replace easy_install. easy_install has many shortcomings: installation is not an atomic operation, only SVN is supported for version control, there is no uninstall command, and installing a series of packages requires writing a script. pip solves all of these problems, has become the new de facto standard, and pairs well with virtualenv.

Installation process:
Install distribute:

$ curl -O http://python-distribute.org/distribute_setup.py
$ python distribute_setup.py


Install pip:

$ curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
$ [sudo] python get-pip.py

2. Installing Scrapy
On the Windows platform, the dependencies (pywin32, Twisted, zope.interface, lxml, pyOpenSSL) can be downloaded as binary packages via a package management tool or manually. On Ubuntu 9.10 and later, the official documentation recommends against the Ubuntu-provided python-scrapy package: it is either too old or updated too slowly to match the latest Scrapy. The solution is to use the official Ubuntu packages from the Scrapy apt repository, which provide all of the dependent libraries, receive ongoing updates with the latest bug fixes, are more stable, and are built continuously from the GitHub repository (master and stable branches). Installing Scrapy this way on Ubuntu 9.10 and later works as follows:
<1> Add the GPG key:

$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7


<2> Create the file /etc/apt/sources.list.d/scrapy.list:

$ echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list


<3> Update the package list and install the desired Scrapy version, replacing VERSION with the actual version, e.g. scrapy-0.22:

$ sudo apt-get update && sudo apt-get install scrapy-VERSION

3. Installing Scrapy's dependency libraries
Installing Scrapy's dependencies under Ubuntu 12.04; each ImportError below is fixed by installing the missing package:
ImportError: No module named w3lib.http

$ pip install w3lib


ImportError: No module named twisted

$ pip install twisted


ImportError: No module named lxml.html

$ pip install lxml


Fix for "error: libxml/xmlversion.h: No such file or directory":

$ apt-get install libxml2-dev libxslt-dev
$ apt-get install python-lxml


ImportError: No module named cssselect

$ pip install cssselect


ImportError: No module named OpenSSL

$ pip install pyopenssl
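As a quick way to check all of the dependencies above at once, the following standard-library snippet (a hypothetical helper, not part of Scrapy) reports which modules still fail to import:

```python
# Illustrative diagnostic: report which of Scrapy's dependency
# modules cannot be imported in the current interpreter.
import importlib

def missing_modules(names):
    """Return the subset of module names that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# The dependency modules from the ImportError fixes above:
deps = ["w3lib.http", "twisted", "lxml.html", "cssselect", "OpenSSL"]
print(missing_modules(deps))  # e.g. ['cssselect'] if only cssselect is absent
```

Any name it prints maps directly to one of the `pip install` / `apt-get install` commands listed above.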

4. Developing your own crawler
Switch to your working directory and create a new project:

$ scrapy startproject test
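The command generates a project skeleton along these lines (layout as in 0.22-era Scrapy; details may differ slightly between versions):

```
test/
    scrapy.cfg            # deployment configuration
    test/
        __init__.py
        items.py          # item definitions (the structured data to extract)
        pipelines.py      # post-processing of scraped items
        settings.py       # project settings
        spiders/          # your spider modules go here
            __init__.py
```

From there, you add a spider module under spiders/ and run it with the scrapy command-line tool.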

