Scrapy Crawler Framework Installation and Demo Example


Scrapy is a general-purpose crawler framework written in Python. A colleague on a recent project team has been using Scrapy to collect information from several large e-commerce sites for the big-data side of the project. Since I had to modify the project a little, I am also recording some notes on Scrapy here, as much as I can manage. The Scrapy source code is hosted on GitHub, and the official website is http://scrapy.org. It is currently at version 1.0.x.

First, installation

Scrapy requires a Python 2.7+ environment. Ubuntu has shipped Python 2.7+ for a long time, so installation there is relatively simple, and CentOS 7.x also uses Python 2.7+ by default. The two mainstream distributions are used as installation examples below.

1. Under Ubuntu

Installation under Ubuntu is easy and can be done directly via the apt-get command (though the Scrapy version in the default Ubuntu repository may be a bit old):

sudo apt-get update && sudo apt-get install scrapy-version
Here, version stands for the release number of the build you want to install.

Alternatively, you can add Scrapy's official APT repository by importing its key and creating scrapy.list, as follows:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
Of course, you can also install it with pip using the following command:

pip install scrapy
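
As a quick sanity check (a minimal example; the exact version string depends on what pip resolved), you can confirm the install worked:

# prints the installed version, e.g. "Scrapy 1.0.3"
scrapy version
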
2. Under CentOS 7

Because the Python version has to be upgraded on CentOS 6, it is easier to use CentOS 7 directly. The Python package manager available by default on CentOS 7 is easy_install (and you can, of course, install pip via easy_install), so the following command takes care of the Scrapy installation:

easy_install scrapy
However, the installation process may not go that smoothly, because the build depends on a number of RPM packages. The following packages need to be installed first:

yum install libxslt-devel libffi libffi-devel python-devel gcc openssl openssl-devel
If you do not install the packages above in advance, you may run into the errors below.

Error 1:

ERROR: /bin/sh: xslt-config: command not found
** make sure the development packages of libxml2 and libxslt are installed **
Solution: yum -y install libxslt-devel.

Error 2:

Using build configuration of libxslt 1.1.28
Building against libxml2/libxslt in the following directory: /usr/lib64
src/lxml/lxml.etree.c:85:20: fatal error: Python.h: No such file or directory
 #include "Python.h"
                    ^
Compilation terminated.
Compile failed: command 'gcc' failed with exit status 1
error: Setup script exited with error: command 'gcc' failed with exit status 1
The python-devel package is missing, because the Python.h file is provided by python-devel. You can install it directly with yum -y install python-devel.

Error 3:

removing: _configtest.c _configtest.o
c/_cffi_backend.c:13:17: fatal error: ffi.h: No such file or directory
 #include <ffi.h>
                 ^
Compilation terminated.
error: Setup script exited with error: command 'gcc' failed with exit status
This error is easy to sort out on CentOS: list the related packages with yum list | grep ffi, and it turns out that running yum -y install libffi-devel (the package that provides ffi.h) fixes it.

Second, a simple test

Let's start by creating a demo project and looking at its directory structure:

Create the demo project:

# scrapy startproject demo
2015-12-26 12:24:09 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-12-26 12:24:09 [scrapy] INFO: Optional features available: ssl, http11
2015-12-26 12:24:09 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'demo' created in:
    /root/demo

You can start your first spider with:
    cd demo
    scrapy genspider example example.com
The demo project directory structure:

# tree demo/
demo/
├── demo
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2 directories, 6 files
The files created above serve the following purposes:

scrapy.cfg: the project's configuration file, which generally does not need to be modified.
demo/: the project's Python module; you will add your code here.
demo/items.py: the project's items file, which defines the classes that hold crawled data, behaving much like a dict.
demo/pipelines.py: the project's pipelines file, which defines how data is processed after it has been crawled.
demo/settings.py: the project's settings file, where you can set request headers, cookies, and so on (see the sketch after this list).
demo/spiders/: the directory where the spider code is placed.
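
The kind of options that live in settings.py look roughly like the following. This is only a minimal sketch: BOT_NAME, SPIDER_MODULES, and NEWSPIDER_MODULE are what startproject generates, while the user-agent, header, and cookie values below are placeholder assumptions, not part of the generated project.

# demo/settings.py (excerpt) -- a hypothetical sketch of common tweaks
BOT_NAME = 'demo'
SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'

# Send a custom User-Agent and default headers with every request (example values)
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en',
}

# Cookies are enabled by default; set this to False to disable them project-wide
COOKIES_ENABLED = False
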
Before going into more detail, let's look at an official example.

Modify the items.py file as follows:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
Here we define the three pieces of information we want to crawl from the site: the title, the link (URL), and the description. The item is then used from within a spider.
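
As a minimal sketch of how such an item behaves (the values below are placeholders, not real crawled data), an item is created and filled in like a dictionary:

from demo.items import DmozItem   # the project module is "demo", as created above

item = DmozItem()
item['title'] = 'Some page title'         # placeholder value
item['link'] = 'http://www.example.com/'  # placeholder value
item['desc'] = 'A short description'      # placeholder value
print item['title']                       # items support dict-style access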

Following the prompt shown above, you can use the scrapy genspider example example.com command to create a spider named example with start_urls based on example.com, or you can manually create a file under the demo/spiders directory, as in the official example:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
The code above is very simple: it defines a spider named dmoz with two start pages, no crawl rules, and no follow-up pages. No special callback is specified either; only the default handler, parse, is implemented, which fetches the contents of the two pages and saves them to files. So the definitions in items.py are not even used yet. The command to run the crawl is as follows:

scrapy crawl dmoz
Note: spider names must be unique within a project and must never be repeated. You can see how many spiders a project contains with scrapy list:

# scrapy list
dmoz
example
To use items, you must apply selectors to the corresponding response content, using XPath rules (for XML or HTML) or re module rules, assign the extracted values to the items, and return them. For example:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc
When you run scrapy crawl dmoz, the three corresponding values are printed to the screen. Alternatively, you can replace the print with a return and save the output to a file using the -o parameter of the scrapy crawl command; the supported file types include CSV, XML, JSON, and so on by default. A sketch of this is shown below.
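
As a minimal sketch of that idea (not copied from this article verbatim, but following the usual Scrapy pattern), the parse method builds DmozItem objects defined earlier in items.py and yields them instead of printing:

import scrapy
from demo.items import DmozItem   # the item class defined in items.py above

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item   # yield items instead of printing them

Running scrapy crawl dmoz -o items.json would then write the collected items to items.json (the filename here is just an example).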

A Comprehensive List of Python Crawler Tools


When it comes to crawlers, the following is a collection of Python crawler-related packages gathered from the Internet: Python libraries for web scraping and data processing.

Network

General

urllib – network library (stdlib).

requests – network library.

grab – network library (based on pycurl).

pycurl – network library (binds libcurl).

urllib3 – a Python HTTP library with connection pooling, file post support, and high availability.

httplib2 – network library.

RoboBrowser – a simple, Pythonic library for browsing web pages without a standalone browser.

MechanicalSoup – a Python library for automating interaction with websites.

mechanize – a stateful, programmable web-browsing library.

socket – low-level networking interface (stdlib).

Unirest for Python – Unirest is a lightweight HTTP library available in multiple languages.

hyper – an HTTP/2 client for Python.

PySocks – an updated and actively maintained fork of SocksiPy, including bug fixes and other features; works as a drop-in replacement for the socket module.

Asynchronous

treq – a requests-like API (based on Twisted).

aiohttp – an HTTP client/server for asyncio (PEP-3156).

Web crawler Framework

Fully-featured crawlers

grab – a web crawler framework (based on pycurl/multicurl).

scrapy – a web crawler framework (based on Twisted); does not support Python 3.

pyspider – a powerful crawler system.

cola – a distributed crawler framework.

Other

portia – a visual crawler based on Scrapy.

restkit – an HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.

demiurge – a crawler micro-framework based on PyQuery.

HTML/XML Parsers

General

lxml – an efficient HTML/XML processing library written in C. Supports XPath.

cssselect – works with the DOM tree using CSS selectors.

pyquery – works with the DOM tree using jQuery-like selectors.

BeautifulSoup – a slower HTML/XML processing library, implemented in pure Python.

html5lib – builds the DOM of HTML/XML documents according to the WHATWG specification, which all modern browsers follow.

feedparser – parses RSS/Atom feeds.

MarkupSafe – provides safely escaped strings for XML/HTML/XHTML.

xmltodict – a Python module that makes working with XML feel like working with JSON.

xhtml2pdf – converts HTML/CSS to PDF.

untangle – easily converts an XML file into a Python object.

Clean

bleach – cleans HTML (requires html5lib).

sanitize – brings sanity to a messy data world.

Text Processing

Libraries for parsing and manipulating plain text.

General

difflib – (Python standard library) helps perform diff-style comparisons.

levenshtein – quickly computes Levenshtein distance and string similarity.

fuzzywuzzy – fuzzy string matching.

esmre – a regular expression accelerator.

ftfy – automatically cleans up Unicode text that has been mangled.

Transformation

unidecode– converts Unicode text to ASCII.

Character encoding

uniout – prints readable characters instead of escaped strings.

chardet – a character-encoding detector compatible with Python 2 and 3.

xpinyin – a library for converting Chinese characters to pinyin.

pangu.py – adds spacing between CJK and alphanumeric characters in text.

Slugify

awesome-slugify – a Python slugify library that can preserve Unicode.

python-slugify – a Python slugify library that converts Unicode to ASCII.

unicode-slugify – a tool that generates Unicode slugs.

pytils – simple tools for handling Russian strings (including pytils.translit.slugify).

General Parsers

ply – a Python implementation of the lex and yacc parsing tools.

pyparsing – a general framework for generating parsers.

Person's name

Python-nameparser-the component that parses the person's name.

Phone number

Phonenumbers-Parse, format, store, and validate international phone numbers.

User Agent String

python-user-agents– the parser for the browser user agent.

httpagentparser – a Python HTTP agent (user-agent) parser.

Specific file format processing

Libraries for parsing and processing specific file formats.

General

tablib– A module that exports data to XLS, CSV, JSON, YAML, and other formats.

textract– extracts text from a variety of files, such as Word, PowerPoint, PDF, and so on.

messytables– a tool for parsing messy tabular data.

rows – a common data interface that supports many formats (currently CSV, HTML, XLS, TXT; more to come!).

Office

python-docx – reads, queries, and modifies Microsoft Word 2007/2008 .docx files.

xlwt / xlrd – write and read data and formatting information in Excel files.

XlsxWriter – a Python module for creating Excel .xlsx files.

xlwings – a BSD-licensed library that makes it easy to call Python from Excel, and vice versa.

openpyxl – a library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

marmir – takes Python data structures and turns them into spreadsheets.

PDF

pdfminer– a tool to extract information from a PDF document.

pypdf2– a library that can split, merge, and convert PDF pages.

reportlab– allows you to quickly create rich PDF documents.

pdftables– directly extracts the table from the PDF file.

Markdown

Python-Markdown – a Python implementation of John Gruber's Markdown.

mistune – the fastest, full-featured pure-Python Markdown parser.

markdown2 – a fast Markdown parser implemented completely in Python.

YAML

PyYAML – a YAML parser for Python.

CSS

cssutils– a Python CSS library.

Atom/RSS

feedparser– a generic feed parser.

SQL

sqlparse– a non-validated SQL statement parser.

HTTP

http-parser – an HTTP request/response message parser implemented in C.

Microformats

opengraph – a Python module to parse Open Graph protocol tags.

Portable Executables

pefile – a multi-platform module for parsing and working with Portable Executable (PE) files.

PSD

psd-tools – reads Adobe Photoshop PSD files into Python data structures.

Natural language Processing

Libraries for dealing with human language.

NLTK – the best platform for writing Python programs to work with human language data.

Pattern – a web-mining module for Python. It includes natural language processing tools, machine learning, and more.

TextBlob – provides a consistent API for diving into common natural language processing tasks. Built on the shoulders of the NLTK and Pattern giants.

jieba– Chinese Word segmentation tool.

snownlp– Chinese Text Processing library.

loso– another Chinese word segmentation library.

genius – Chinese word segmentation based on conditional random fields.

langid.py– an independent language recognition system.

korean– a Korean-language morphological library.

pymorphy2– Russian Morphological Analyzer (POS tagging + morphological change engine).

PyPLN – a distributed natural language processing pipeline written in Python. The goal of the project is to create an easy way to use NLTK to process large corpora over a web interface.

Browser automation and Simulation

selenium – automates real browsers (Chrome, Firefox, Opera, IE).

Ghost.py – a wrapper around PyQt's WebKit (requires PyQt).

spynner – a wrapper around PyQt's WebKit (requires PyQt).

splinter – a general-API browser emulator (Selenium WebDriver, Django client, Zope).

Multiprocessing

threading – the Python standard library's threading module. Effective for I/O-bound tasks, but useless for CPU-bound tasks because of the Python GIL.

multiprocessing – the standard Python library for running multiple processes.

celery – an asynchronous task queue/job queue based on distributed message passing.

concurrent.futures – a module that provides a high-level interface for asynchronously executing callables.

Asynchronous

Asynchronous network programming libraries.

asyncio – (in the Python 3.4+ standard library) asynchronous I/O, event loops, coroutines, and tasks.

Twisted – an event-driven networking engine framework.

Tornado – a web framework and asynchronous networking library.

pulsar – an event-driven concurrency framework for Python.

diesel – a greenlet-based event I/O framework for Python.

gevent – a coroutine-based Python networking library built on greenlet.

eventlet – an asynchronous framework with WSGI support.

tomorrow – magic decorator syntax for asynchronous code.

Queue

celery – an asynchronous task queue/job queue based on distributed message passing.

huey – a small multithreaded task queue.

mrq – Mr. Queue, a distributed task queue for Python using Redis and gevent.

RQ – a lightweight task queue manager based on Redis.

simpleq – a simple, scalable queue based on Amazon SQS.

python-gearman – a Python API for Gearman.

Cloud computing

PiCloud – executes Python code in the cloud.

dominoup.com – executes R, Python, and MATLAB code in the cloud.

Email

Email parsing libraries.

flanker – an email address and MIME parsing library.

talon – a Mailgun library for extracting quotations and signatures from messages.

URL and network address manipulation

Libraries for parsing and modifying URLs and network addresses.

URL

furl – a small Python library that makes manipulating URLs simple.

purl – a simple, immutable URL class with a clean API for inspection and manipulation.

urllib.parse – splits a Uniform Resource Locator (URL) string into its components (addressing scheme, network location, path, and so on), recombines components into a URL string, and converts a "relative URL" into an absolute URL given a "base URL" (see the sketch after this list).

tldextract – accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
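
As a minimal sketch of what that splitting and joining looks like (Python 2 spelling, to match the rest of this article; the URLs are just examples):

from urlparse import urlsplit, urljoin   # in Python 3 this lives in urllib.parse

parts = urlsplit('http://www.dmoz.org/Computers/Programming/?lang=en')
print parts.scheme, parts.netloc, parts.path   # http www.dmoz.org /Computers/Programming/

# resolve a relative URL against a base URL
print urljoin('http://www.dmoz.org/Computers/', 'Programming/Languages/')
# -> http://www.dmoz.org/Computers/Programming/Languages/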

Network address

netaddr– a python library for displaying and manipulating network addresses.

Web content Extraction

Libraries for extracting web page content.

Text and metadata from HTML pages

newspaper – news extraction, article extraction, and content curation in Python.

html2text – converts HTML to Markdown-formatted text.

python-goose – an HTML content/article extractor.

lassie – a humane web content retrieval tool.

micawber – a small library for extracting rich content from URLs.

sumy – a module for automatic summarization of text documents and HTML pages.

haul – an extensible image crawler.

python-readability – a fast Python port of the arc90 Readability tool.

scrapely – a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

Video

youtube-dl– a small command-line program that downloads video from YouTube.

you-get – a Python 3 video downloader for YouTube, Youku, and Niconico.

Wiki

WikiTeam – tools for downloading and preserving wikis.

WebSocket

Libraries for working with WebSocket.

Crossbar – an open-source application messaging router (a Python implementation of WebSocket and WAMP for Autobahn).

AutobahnPython – an open-source Python implementation of the WebSocket protocol and the WAMP protocol.

WebSocket-for-Python – WebSocket client and server libraries for Python 2, Python 3, and PyPy.

DNS resolution

dnsyo– checks your DNS on more than 1500 DNS servers worldwide.

pycares – an interface to c-ares, the C library for DNS requests and asynchronous name resolution.

Computer Vision

OpenCV – an open-source computer vision library.

SimpleCV – a readable interface for cameras, image processing, feature extraction, and format conversion (based on OpenCV).

mahotas – fast computer vision/image processing algorithms (implemented entirely in C++), built entirely on NumPy arrays as its data type.

Proxy Server

shadowsocks – a fast tunnel proxy that helps you get through firewalls (supports TCP and UDP, TFO, multi-user, graceful restart, and destination IP blacklists).

tproxy – a simple TCP routing proxy (layer 7) based on gevent, configured in Python.

List of other Python tools

Awesome-python

Pycrumbs

Python-github-projects

Python_reference

Pythonidae
