Learning the Scrapy Crawler Framework in Python
First, the four steps:
New project (Project): create a new crawler project
Define the goals (Items): identify the data you want to crawl
Make the spider (Spider): write a spider that starts crawling web pages
Store the content (Pipeline): design a pipeline to store the crawled content
1. New project
scrapy startproject <projectname>, for example: scrapy startproject quotes
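This generates a project skeleton roughly like the following (newer Scrapy versions also add a middlewares.py):

quotes/
    scrapy.cfg          # deployment configuration
    quotes/
        __init__.py
        items.py        # item definitions go here
        pipelines.py    # item pipelines go here
        settings.py     # project settings
        spiders/        # your spiders live here
            __init__.py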
2. Define the goals
In Scrapy, Items are the containers used to hold the crawled data. They work somewhat like a Python dict, but provide some extra protection against errors, such as typos in field names.
In general, an item is created by subclassing the scrapy.Item class, and scrapy.Field objects are used to define its attributes (you can think of this as something like an ORM mapping).
Next, we build the item model.
The fields we want are:
Author (author)
Content (text)
Tags (tags)
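A minimal items.py for this model might look like the following sketch (the class name QuotesItem matches the import used in the spider below):

import scrapy

class QuotesItem(scrapy.Item):
    # one Field per piece of data we want to crawl
    author = scrapy.Field()
    text = scrapy.Field()
    tags = scrapy.Field()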
3. Make the spider (the most important step)
# -*- coding: utf-8 -*-
import scrapy
import sys
sys.path.append("D:\\pycodes\\quotes")
from quotes.items import QuotesItem

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # each quote sits in a <div class="quote"> block
        for sel in response.xpath('//div[@class="quote"]'):
            item = QuotesItem()
            item['text'] = sel.xpath('span[@class="text"]/text()').extract()
            item['author'] = sel.xpath('span/small/text()').extract()
            item['tags'] = sel.xpath('div/a/text()').extract()
            yield item
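To run the spider and save what it scrapes, use the scrapy crawl command with the spider's name (the output file name here is just an example):

scrapy crawl books -o quotes.json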
4. Design the pipeline
Process the item data by designing a pipeline.
class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item

class DoubanInfoPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.f = open("result.txt", "w")

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item
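A pipeline only runs if it is registered in settings.py via ITEM_PIPELINES (the module path below assumes the project is named quotes; adjust it to your own layout):

ITEM_PIPELINES = {
    'quotes.pipelines.DoubanInfoPipeline': 300,  # value is 0-1000, lower runs first
}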
Additional notes:
1. Using XPath selectors
response.xpath('//div/@href').extract()
response.xpath('//div[@href]/text()').extract()
response.xpath('//div[contains(@href, "image")]/@href').extract()
To select <p> elements under a div that are not direct children, you need a relative path:
div.xpath('.//p')  # note the leading dot
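The leading dot matters: without it, the expression is absolute and searches the whole document rather than just the selected node. A small sketch of the difference:

divs = response.xpath('//div')
for d in divs:
    p_wrong = d.xpath('//p')   # matches every <p> in the whole document
    p_right = d.xpath('.//p')  # matches only <p> descendants of this div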
2. Using .re() on selectors
Selector also has a .re() method for extracting data with regular expressions. However, unlike .xpath() or .css(), .re() returns a list of unicode strings instead of a new selector, so you cannot construct nested .re() calls.
Here is an example of extracting image names from link text:
response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']
3. Using re:test() in XPath
The re:test() function can be useful when XPath's starts-with() or contains() does not meet your needs.
For example, select the links in a list whose "class" attribute ends with a digit:
from scrapy import Selector
doc = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
sel = Selector(text=doc, type="html")
sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
4. Enumerating links
for index, link in enumerate(links):
    print(index, link)
0 link1
1 link2
...
5. You do not have to follow the four steps exactly
Sometimes you can leave items.py unchanged
and yield the resulting dict directly in spider.py, and so on; see the sketch below.
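A minimal sketch, reusing the XPath expressions from the spider above but with no Item class at all:

def parse(self, response):
    for sel in response.xpath('//div[@class="quote"]'):
        # a plain dict works in place of a QuotesItem
        yield {
            'text': sel.xpath('span[@class="text"]/text()').extract_first(),
            'author': sel.xpath('span/small/text()').extract_first(),
            'tags': sel.xpath('div/a/text()').extract(),
        }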
6. Recursive links: crawling page after page
In the parse(self, response) method, add something like the following, filling in the XPath for the next-page link (a concrete version follows below):
next_page = response.xpath("").extract_first()
if next_page:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
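For quotes.toscrape.com, the "Next" link lives in an <li class="next"> element, so a concrete parse method might look like this (the XPath here is an assumption based on that site's markup):

def parse(self, response):
    for sel in response.xpath('//div[@class="quote"]'):
        yield {'text': sel.xpath('span[@class="text"]/text()').extract_first()}
    # follow the pagination link, if any; assumes <li class="next"><a href=...>
    next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)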
7. How to prevent 403 errors
Adjust the settings.py file:
set USER_AGENT to mimic browser access, e.g.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'