Python Scrapy Framework

Source: Internet
Author: User
Tags openssl library print format

Scrapy is a fast, extensible crawler framework, written in Python, for crawling website content.
It is a high-level screen-scraping and web-crawling framework used to crawl websites and extract structured data from their pages. Scrapy can be used in a wide range of applications, including data mining, monitoring, and automated testing.
The appeal of Scrapy is that it is a framework that anyone can easily modify as needed. It also provides base classes for several types of crawlers, such as BaseSpider and the sitemap crawler, and the latest version adds support for Web 2.0 crawlers.
Scrapy provides a tool for generating a project. The generated project comes with some files already in place, and users add their own code to those files. It does, however, depend on quite a lot of third-party libraries.

git clone https://github.com/scrapy/scrapy.git
or
wget https://github.com/scrapy/scrapy/archive/0.14.zip

One
1, Install the dependent libraries with yum;

yum install gcc gcc-c++ mysql mysql-server mysql-devel libffi libxml2 libxml2-devel libxslt libxslt-devel libxslt1-devel ruby

2, Python-2.7.6.tgz
Python 2.7 or later;
(This example uses 2.7. Python 3.0 and later, such as 3.6, use a different print format, so when downloading third-party dependent libraries, check which Python versions each release supports;
some of the third-party libraries here need changes to setup.py before they will install.)

wget http://www.python.org/ftp/python/2.7.6/Python-2.7.6.tgz
./configure --prefix=/usr/local/ && /usr/bin/python /usr/bin/ -s /usr/local/python/bin/python2.7 /usr/bin/ -V
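The "Python 2.7 or later" requirement can also be checked from Python itself before installing anything else. The following is a minimal sketch: the function name `python_is_supported` and the `MIN_VERSION` constant are illustrative assumptions, not part of Scrapy. Each `print` call is given a single string so that it behaves the same under the Python 2 and Python 3 print formats.

```python
import sys

# Assumption for illustration: the minimum version this walkthrough targets.
MIN_VERSION = (2, 7)


def python_is_supported(version_info=None):
    """Return True if the (major, minor, ...) tuple is at least MIN_VERSION."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    current = "%d.%d.%d" % tuple(sys.version_info[:3])
    # A single parenthesized string prints identically on 2.7 and 3.x.
    if python_is_supported():
        print("Python %s is new enough" % current)
    else:
        print("Python %s is too old; 2.7 or later is needed" % current)
```

Passing an explicit tuple, for example `python_is_supported((2, 6, 9))`, makes it possible to test the check without switching interpreters.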

3, pip-9.0.1.tar.gz

pip is the Python package management tool;

wget https://pypi.python.org/packages/11/b6/abcb525026a4be042b486df43905d6893fb04f05aac21c32c638e939e447/pip-9.0.1.tar.gz#md5=35f01da33009719497f01a4ba69d63c9
tar -xf pip-9.0.1.tar.gz
cd pip
python setup.py build
python setup.py install
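The download URLs in this article carry a `#md5=` fragment, which can be used to check that an archive arrived intact before building it. The sketch below is one possible way to do that with the standard library only; the helper names `expected_md5`, `file_md5`, and `archive_matches` are hypothetical and are not part of pip or Scrapy.

```python
import hashlib

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def expected_md5(url):
    """Return the lowercase hex digest from a '#md5=...' fragment, or None."""
    fragment = urlparse(url).fragment
    if fragment.lower().startswith("md5="):
        return fragment[4:].lower()
    return None


def file_md5(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches(url, path):
    """True if the file at `path` has the MD5 advertised in `url`."""
    wanted = expected_md5(url)
    return wanted is not None and file_md5(path) == wanted
```

For example, `archive_matches(url, "pip-9.0.1.tar.gz")` would be called with the same URL that was passed to wget. A URL without an `#md5=` fragment is treated as unverifiable and yields `False`.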

4, setuptools-11.3.tar.gz
setuptools is a sub-project of the Python Enterprise Application Kit (PEAK). It is a set of enhancements to Python's distutils (for Python 2.3.5 and later; on 64-bit platforms, Python 2.4 and later) that makes it easier for programmers to create and publish Python packages, particularly those that depend on other packages. The setuptools module is used to build, install, upgrade, and uninstall Python packages. (setuptools version 11.3 is sufficient; a version that is too high brings in more dependencies.)
setuptools comes with the easy_install tool, which is useful and convenient for installing third-party modules and tools in Python.

Install pip before installing setuptools;

wget https://pypi.python.org/packages/34/a9/65ef401499e6878b3c67c473ecfd8803eacf274b03316ec8f2e86116708d/setuptools-11.3.tar.gz
tar -xf setuptools-11.3.tar.gz
cd setuptools
python setup.py build
python setup.py install

5, zope.interface-4.1.1.tar.gz
Python supports multiple inheritance but has no built-in interfaces; zope.interface is a third-party library that implements interfaces, and it is used by Twisted;

wget https://pypi.python.org/packages/a2/af/c4a17a2ab696c84c304f7c6c66236ee0ea019cf79852af32c7d3f89e0b8e/zope.interface-4.1.1.tar.gz#md5=edcd5f719c5eb2e18894c4d06e29b6c6
tar -xf zope.interface-4.1.1.tar.gz
cd zope.interface/
python setup.py install

6, twisted-12.1.0.tar.bz2
Twisted is an event-driven network engine framework implemented in Python;

wget https://twistedmatrix.com/releases/twisted/12.1/twisted-12.1.0.tar.bz2
tar -xf twisted-12.1.0.tar.bz2
cd twisted
python setup.py build
python setup.py install

7, six-1.10.0.tar.gz
As the name implies (2 x 3 = 6), six wraps the differences between Python 2 and Python 3;

wget https://pypi.python.org/packages/b3/b2/238e2590826bfdd113244a40d9d3eb26918bd798fc187e2360a8367068db/six-1.10.0.tar.gz#md5=34eed507548117b2ab523ab14b2f8b55
tar -xf six-1.10.0.tar.gz
mv six-1.10.0/ six
cd six/
python setup.py build
python setup.py install

8, w3lib-1.17.0.tar.gz
The w3lib module; this package is used to remove redundant HTML tags;

wget https://pypi.python.org/packages/ac/b6/91ae356d48dd1d48732967eb79b2e41be4b2493b4e43a89be57b1f3be37d/w3lib-1.17.0.tar.gz#md5=03f4d6160208c547e4c31a63486b9516
tar -xf w3lib-1.17.0.tar.gz
cd w3lib-1.17.0
python setup.py build
python setup.py install

9, Mysql-python-1.2.5.zip
MySQLdb is a popular Python interface to the MySQL database server (it is needed here because a MySQL database backs the crawl process)

wget https://pypi.python.org/packages/a5/e9/51b544da85a36a68debe7a7091f068d802fc515a3a202652828c73453cad/mysql-python-1.2.5.zip#md5=654f75b302db6ed8dc5a898c625e030c
unzip mysql-python-1.2.5.zip
cd MySQL-python/
python setup.py build
python setup.py install

Two
Third-party dependent libraries;
Install the required plug-ins, then install each one with python setup.py install.
Modules can be installed with pip install, or you can download the package and install it directly.
(Install any missing dependent modules at the version the error prompt asks for; if a version is too high, it may require more plug-ins.)
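Every "download the package and install it directly" step in this article follows the same pattern: fetch the archive, unpack it, enter the source directory, then build and install. The sketch below generates that command sequence for one archive URL. The helper name `source_install_commands` is hypothetical, and the code assumes the archive unpacks into a directory named after the file without its suffix, which may not match the renamed directories used elsewhere in this article.

```python
import posixpath

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7

# Archive suffixes seen in this article, mapped to the command that unpacks them.
UNPACKERS = [
    (".tar.gz", "tar -xf"),
    (".tar.bz2", "tar -xf"),
    (".tgz", "tar -xf"),
    (".zip", "unzip"),
]


def source_install_commands(url):
    """Build the shell commands for a setup.py install of one archive."""
    # The '#md5=...' fragment is not part of the path, so it is dropped here.
    filename = posixpath.basename(urlparse(url).path)
    for suffix, unpack in UNPACKERS:
        if filename.endswith(suffix):
            # Assumption: the archive unpacks into '<name>-<version>/'.
            directory = filename[: -len(suffix)]
            break
    else:
        raise ValueError("unrecognized archive type: %s" % filename)
    return [
        "wget %s" % url,
        "%s %s" % (unpack, filename),
        "cd %s" % directory,
        "python setup.py build",
        "python setup.py install",
    ]
```

The function only returns strings and runs nothing, so the list can be reviewed or adjusted (for example, to rename the directory) before it is pasted into a shell.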

1, lxml-3.4.4.tar.gz
lxml is an XML toolkit for Python that binds the C libraries libxml2 and libxslt; (it can also be installed with yum)

wget https://pypi.python.org/packages/63/c7/4f2a2a4ad6c6fa99b14be6b3c1cece9142e2d915aa7c43c908677afc8fa4/lxml-3.4.4.tar.gz#md5=a9a65972afc173ec7a39c585f4eea69c
tar -xf lxml-3.4.4.tar.gz
cd lxml/
python setup.py build
python setup.py install

2, pyopenssl-17.0.0.tar.gz
pyOpenSSL, a Python wrapper around the OpenSSL library

wget https://pypi.python.org/packages/9f/32/80fe4fddeb731b7766cd09fe0b2032a91b43dae655e216792af2a6ae3190/pyopenssl-17.0.0.tar.gz#md5=0704ca95106960375cfe78259453094a
tar -xf pyopenssl-17.0.0.tar.gz
cd pyopenssl/
python setup.py build
python setup.py install

3, cffi-1.10.0.tar.gz
A foreign function interface for Python, for calling C code based on C declarations;

wget https://pypi.python.org/packages/5b/b9/790f8eafcdab455bcd3bd908161f802c9ce5adbf702a83aa7712fcc345b7/cffi-1.10.0.tar.gz#md5=2b5fa41182ed0edaf929a789e602a070
tar -xf cffi-1.10.0.tar.gz
cd cffi/
python setup.py build
python setup.py install

4, cryptography-1.8.1.tar.gz
cryptography is a package that provides Python developers with cryptographic recipes and primitives

wget https://pypi.python.org/packages/ec/5f/d5bc241d06665eed93cd8d3aa7198024ce7833af7a67f6dc92df94e00588/cryptography-1.8.1.tar.gz#md5=9f28a9c141995cd2300d0976b4fac3fb
tar -xf cryptography-1.8.1.tar.gz
cd cryptography/
python setup.py build
python setup.py install


5, pyparsing-1.5.7.tar.gz
The pyparsing module is an alternative approach to creating and executing simple grammars, compared with the traditional lex/yacc approach or the use of regular expressions. It provides a library of classes that client code uses to construct the grammar directly in Python code.

wget https://pypi.python.org/packages/6f/2c/47457771c02a8ff0f302b695e094ec309e30452232bd79198ee94fda689f/pyparsing-1.5.7.tar.gz#md5=9be0fcdcc595199c646ab317c1d9a709
tar -xf pyparsing-1.5.7.tar.gz
cd pyparsing
python setup.py build
python setup.py install


6, idna-2.5.tar.gz
The idna module provides support for Internationalized Domain Names in Applications (IDNA), alongside the Python standard library

wget https://pypi.python.org/packages/d8/82/28a51052215014efc07feac7330ed758702fc0581347098a81699b5281cb/idna-2.5.tar.gz#md5=fc1d992bef73e8824db411bb5d21f012
tar -xf idna-2.5.tar.gz
cd idna
python setup.py build
python setup.py install



7, pycparser-2.17.tar.gz
pycparser is a C language parser module built on the PLY module; it can easily be integrated into applications that need to parse C source code;

wget https://pypi.python.org/packages/be/64/1bb257ffb17d01f4a38d7ce686809a736837ad4371bcc5c42ba7a715c3ac/pycparser-2.17.tar.gz#md5=ca98dcb50bc1276f230118f6af5a40c7
tar -xf pycparser-2.17.tar.gz
cd pycparser/
python setup.py build
python setup.py install

8, ipaddress-1.0.18.tar.gz
The functions and classes of the ipaddress module make it simple to handle various tasks related to IP addresses, including checking whether two hosts are on the same subnet, iterating over all hosts in a particular subnet, and checking whether a string represents a valid IP address or network definition;

wget https://pypi.python.org/packages/4e/13/774faf38b445d0b3a844b65747175b2e0500164b7c28d78e34987a5bfe06/ipaddress-1.0.18.tar.gz#md5=310c2dfd64eb6f0df44aa8c59f2334a7
tar -xf ipaddress-1.0.18.tar.gz
cd ipaddress
python setup.py build
python setup.py install
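The three tasks mentioned for the ipaddress module (same-subnet check, host iteration, and address validation) can be sketched as follows. The helper names `same_subnet` and `is_valid_ip` are illustrative assumptions; only the `ipaddress` calls themselves come from the module. The example is written for Python 3, where `ipaddress` is in the standard library; with the Python 2.7 backport installed above, the address strings would need to be unicode literals.

```python
import ipaddress


def same_subnet(host_a, host_b, network):
    """True if both host addresses fall inside the given network."""
    net = ipaddress.ip_network(network)
    return (ipaddress.ip_address(host_a) in net
            and ipaddress.ip_address(host_b) in net)


def is_valid_ip(text):
    """True if the string parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


# Iterate over every usable host address in a small subnet.
hosts = [str(host) for host in ipaddress.ip_network("192.0.2.0/30").hosts()]
```

The addresses used here come from ranges reserved for documentation, so they are safe to experiment with.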

9, enum34-1.1.6.tar.gz
The enum34 module provides the enum type (user-defined enumerated types) for Python

wget https://pypi.python.org/packages/bf/3e/31d502c25302814a7c2f1d3959d2a3b3f78e509002ba91aea64993936876/enum34-1.1.6.tar.gz#md5=5f13a0841a61f7fc295c514490d120d0
tar -xf enum34-1.1.6.tar.gz
cd /usr/local/enum34/
python setup.py install
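As a brief illustration of what an enumerated type looks like once the `enum` module is importable, here is a minimal sketch. The `CrawlState` class and the `next_state` helper are hypothetical names invented for this example and do not come from Scrapy or enum34.

```python
from enum import Enum


class CrawlState(Enum):
    """Hypothetical states for a crawl job, used only for illustration."""
    PENDING = 1
    RUNNING = 2
    DONE = 3


def next_state(state):
    """Advance to the following state, staying at DONE once it is reached."""
    if state is CrawlState.DONE:
        return state
    # Members can be looked up by value, so value + 1 selects the next one.
    return CrawlState(state.value + 1)
```

Because members are singletons, they can be compared with `is`, and iterating over `CrawlState` yields the members in definition order.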

10, packaging-16.8.tar.gz
The core packaging utilities module for Python packages

wget https://pypi.python.org/packages/c6/70/bb32913de251017e266c5114d0a645f262fb10ebc9bf6de894966d124e35/packaging-16.8.tar.gz#md5=53895cdca04ecff80b54128e475b5d3b
tar -xf packaging-16.8.tar.gz
cd packaging/
python setup.py build
python setup.py install

11, asn1crypto-0.11.1.tar.gz
The asn1crypto module; a fast, pure-Python library for parsing and serializing ASN.1 structures

wget https://pypi.python.org/packages/97/a4/bf830df887ea2312d3114ea6f01c8ff0af3fe4d6fd088402bd99b5515746/asn1crypto-0.11.1.tar.gz#md5=d3c24181d33a355e389b6fbece7e24cf
tar -xf asn1crypto-0.11.1.tar.gz
cd asn1crypto-0.11.1
python setup.py build
python setup.py install

Three
Once the third-party libraries are all in place, switch to the scrapy directory and install it;
cd scrapy/
python setup.py build
python setup.py install

# whereis scrapy
scrapy: /usr/local/scrapy
# cp -rp /usr/local/scrapy/bin/scrapy /usr/bin
# scrapy version
Scrapy 0.14.4


"Example: crawling information from a site"

1. Create a project

scrapy startproject my_project
# tree
.
└── my_project
    ├── my_project
    │   ├── __init__.py
    │   ├── items.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       └── __init__.py
    └── scrapy.cfg

scrapy.cfg is the entry file for running the crawler; when you enter scrapy crawl, the crawler first reads the configuration in this file.
my_project/items.py defines the data that the crawler captures, that is, the form in which the information is stored;
For example, the result of a crawl may be a title string, a structured JSON object, or the byte stream of an image; items define the attributes of the structured object.
my_project/pipelines.py defines how the information is saved;
Crawled content is held in in-memory objects, and you can customize whether it is written to a file, stored in a database, or printed directly to the console;
Scrapy uses the pipeline approach, handing the in-memory information to each pipeline in sequence.
my_project/settings.py holds the configuration information that the crawler relies on when it runs.
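The division of labor between items.py and pipelines.py can be sketched in plain Python. This is not Scrapy's own API: `PageItem`, `StripTitlePipeline`, `CollectPipeline`, and `run_pipelines` are hypothetical stand-ins written for this illustration. The only thing borrowed from Scrapy is the `process_item(item, spider)` method name that its pipelines use.

```python
class PageItem(dict):
    """Hypothetical structured item: a dict restricted to declared fields."""
    fields = ("title", "url")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("undeclared field: %s" % key)
        dict.__setitem__(self, key, value)


class StripTitlePipeline(object):
    """First stage: normalize the title text."""

    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        return item


class CollectPipeline(object):
    """Last stage: save the item (a list here; a file or database also works)."""

    def __init__(self):
        self.saved = []

    def process_item(self, item, spider):
        self.saved.append(dict(item))
        return item


def run_pipelines(item, pipelines, spider=None):
    """Hand the in-memory item to each pipeline in order."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item
```

Each stage returns the item so that the next stage receives it, which is what lets the stages run in a fixed sequence. In a real project, the ordering would come from the project settings rather than from a hand-built list.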
