Install and use the Python crawler framework Scrapy

I. Introduction to the crawler framework Scrapy
Scrapy is a fast, high-level web crawling and screen-scraping framework. It crawls websites and extracts structured data from their pages, and it serves a wide range of purposes, from data mining to monitoring and automated testing. Scrapy is written entirely in Python and is fully open source; the code is hosted on GitHub, and it runs on Linux, Windows, Mac, and BSD. It is built on the Twisted asynchronous networking library to handle network communication, so you only need to customize a few modules to implement a crawler that captures webpage content and images.
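To make the "customize a few modules" claim concrete, here is a minimal sketch of a spider, written against the 0.22-era API this article targets; the domain and all names are placeholders, not taken from any real project:

# A minimal sketch of a Scrapy spider (0.22-era API; example.com and
# every name here are placeholders for illustration only).
from scrapy.spider import Spider
from scrapy.selector import Selector

class ExampleSpider(Spider):
    name = "example"                          # used as "scrapy crawl example"
    allowed_domains = ["example.com"]         # keep the crawl on one domain
    start_urls = ["http://www.example.com/"]  # seed URLs downloaded first

    def parse(self, response):
        # parse() is called with each downloaded page; extract data via XPath.
        sel = Selector(response)
        self.log("Page title: %s" % sel.xpath("//title/text()").extract())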

II. Scrapy Installation Guide

Assume that you have already installed <1> Python 2.7, <2> lxml, and <3> OpenSSL. We use the Python package management tools pip or easy_install to install Scrapy.
Pip installation method:
pip install Scrapy
easy_install installation method:
easy_install Scrapy
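Either command pulls Scrapy from PyPI; a quick way to confirm the install worked is to ask Scrapy for its version:

scrapy version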

III. Environment configuration on the Ubuntu Platform

1. Python package management tools
The current package management tool chain is easy_install/pip + distribute/setuptools:
Distutils: the basic installation tool that ships with Python, suitable only for very simple scenarios;
Setuptools: extends distutils substantially, most notably with a package dependency mechanism, and is already a de facto standard in parts of the Python community;
Distribute: a fork created because setuptools development was slow, Python 3 was unsupported, and the code was messy; a group of programmers refactored the code and added features, hoping to replace setuptools and be accepted into the official standard library. Note that both distribute and setuptools are only extensions of distutils;
Easy_install: the installation script that ships with setuptools and distribute, so once either of them is installed, easy_install is available too. Its biggest feature is automatically resolving packages from PyPI, the official Python package index, which makes installing third-party packages very convenient;
Pip: pip has one clear goal: to replace easy_install, which has many shortcomings (installation is not an atomic transaction, svn is the only supported version control system, no uninstall command is provided, and installing a list of packages requires a script). pip solves all of these problems, has become the new de facto standard, and pairs naturally with virtualenv. A few pip commands that easy_install lacks are shown below.
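As a quick illustration of the gaps pip fills, here are a few commands that easy_install has no clean equivalent for:

pip install Scrapy               # install from PyPI
pip uninstall Scrapy             # clean uninstall (missing in easy_install)
pip freeze > requirements.txt    # record every installed package and version
pip install -r requirements.txt  # reproduce that set on another machine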

Installation Process:
Install distribute:
$ curl -O http://python-distribute.org/distribute_setup.py
$ python distribute_setup.py
Install pip:
$ curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
$ [sudo] python get-pip.py
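If both scripts run without errors, pip should now be on your PATH; a quick sanity check:

$ pip --version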

2. Scrapy Installation
On Windows, you can use a package management tool or manually download the binary packages for each dependency: pywin32, Twisted, zope.interface, lxml, and pyOpenSSL. On Ubuntu 9.10 and later, the official recommendation is not to use the python-scrapy package that Ubuntu provides: it is either too old or too slow to match the latest Scrapy. The solution is to use the official Ubuntu packages published by the Scrapy project, which supply all the dependent libraries, are continuously updated with the latest bug fixes, and offer higher stability; they are built continuously from the GitHub repository (master and stable branches). Scrapy is installed on Ubuntu 9.10 and later as follows:
<1> Import the GPG key:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
<2> Create the file /etc/apt/sources.list.d/scrapy.list:
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
<3> Update the package list and install the scrapy-VERSION package, replacing VERSION with the actual version, such as scrapy-0.22:
sudo apt-get update && sudo apt-get install scrapy-VERSION
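For example, targeting the 0.22 series mentioned above, the concrete commands would be as follows (the package name follows the scrapy-VERSION pattern; check the repository for the versions actually published):

sudo apt-get update && sudo apt-get install scrapy-0.22
scrapy version   # confirm which Scrapy ended up installed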

3. Install Scrapy's dependent libraries
Installing Scrapy's dependent libraries on Ubuntu 12.04: each error below is paired with the command that resolves it.
ImportError: No module named w3lib.http
Solution: pip install w3lib

ImportError: No module named twisted
Solution: pip install twisted

ImportError: No module named lxml.html
Solution: pip install lxml

error: libxml/xmlversion.h: No such file or directory
Solution: apt-get install libxml2-dev libxslt-dev
          apt-get install python-lxml

ImportError: No module named cssselect
Solution: pip install cssselect

ImportError: No module named OpenSSL
Solution: pip install pyOpenSSL
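With all of the above resolved, a one-line import check (an ad-hoc sanity test, not an official Scrapy tool) should exit silently:

python -c "import w3lib.http, twisted, lxml.html, cssselect, OpenSSL"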

4. Develop your own custom crawler
Switch to your working directory and start a new project (the project name test is just an example):
scrapy startproject test
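startproject generates a skeleton for you to fill in. For the 0.22-era Scrapy used in this article it looks roughly like this (layouts differ slightly between versions):

test/
    scrapy.cfg        # deployment configuration
    test/             # the project's Python module
        __init__.py
        items.py      # item definitions (the structured data you extract)
        pipelines.py  # post-processing for scraped items
        settings.py   # project settings
        spiders/      # put your spider modules here
            __init__.py

Drop a spider such as the sketch from Section I into test/test/spiders/, then run it from the project root:

scrapy crawl example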

