Python Crawler: Scrapy Framework Learning

Source: Internet
Author: User
Tags: xpath

First, the steps:
New project (Project): create a new crawler project
Clear goals (Items): define the data you want to crawl
Make the spider (Spider): write the spider that crawls the web pages
Store the content (Pipeline): design a pipeline to store the crawled content

1. New project
scrapy startproject quotes
(The general form is scrapy startproject <project_name>; the project in this article is called quotes.)
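Running the command generates the standard Scrapy project skeleton (this is the stock layout Scrapy creates; depending on the Scrapy version a middlewares.py may also appear):

quotes/
    scrapy.cfg            # deployment configuration
    quotes/
        __init__.py
        items.py          # item definitions (step 2)
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings
        spiders/          # spider code goes here (step 3)
            __init__.py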

2. Clear targets
In Scrapy, items are containers used to hold the crawled content. They work much like a Python dict, but provide some additional protection against errors such as mistyped field names.
In general, an item is created by subclassing scrapy.Item, and scrapy.Field objects are used to define its attributes (this can be understood as resembling an ORM mapping).
Next, we start building the item model. What we want to extract is:
author (the quote's author)
text (the quote's content)
tags (the quote's tags)
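The spider in step 3 imports QuotesItem from the project's items module; a minimal items.py matching the three fields above could look like this (the class name is taken from the spider's import):

import scrapy

class QuotesItem(scrapy.Item):
    # one Field per piece of data we want to extract
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()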

3. Making the spider (this is also the most important step)

# -*- coding: utf-8 -*-
import scrapy
import sys
sys.path.append("D:\\pycodes\\quotes")
from quotes.items import QuotesItem

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # each quote on the page sits in a <div class="quote"> block
        for sel in response.xpath('//div[@class="quote"]'):
            item = QuotesItem()
            item['text'] = sel.xpath('span[@class="text"]/text()').extract()
            item['author'] = sel.xpath('span/small/text()').extract()
            item['tags'] = sel.xpath('div/a/text()').extract()
            yield item
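With the spider saved under the project's spiders/ directory, it can be run from the project root; the output file name here is only an example:

scrapy crawl books -o quotes.json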

4. Design the pipeline

Process the item data by designing a pipeline.

class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item

class DoubanInfoPipeline(object):
    def open_spider(self, spider):
        # open the output file when the spider starts
        self.f = open("result.txt", "w")

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item
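Pipelines only run if they are enabled in settings.py. A minimal ITEM_PIPELINES entry for the two classes above might look like this (the quotes.pipelines module path assumes the project is named quotes; lower numbers run earlier):

ITEM_PIPELINES = {
    'quotes.pipelines.DoubanPipeline': 300,
    'quotes.pipelines.DoubanInfoPipeline': 400,
}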

1. Using the selector's XPath
response.xpath('//div/@href').extract()
response.xpath('//div[@href]/text()').extract()
response.xpath('//div[contains(@href, "image")]/@href').extract()

To select a p that is a descendant of a div but not necessarily a direct child, you need
div.xpath(".//p") - note the leading dot, which makes the expression relative to the current node.
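A minimal example of the difference (the HTML snippet here is made up for illustration):

from scrapy import Selector

doc = '<div><span><p>inner</p></span></div><p>outer</p>'
sel = Selector(text=doc, type="html")
div = sel.xpath('//div')[0]
print(div.xpath('.//p/text()').extract())  # ['inner'] - relative to the selected div
print(div.xpath('//p/text()').extract())   # ['inner', 'outer'] - searches the whole document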

2. Applying .re() to selectors
Selectors also have a .re() method, which extracts data using a regular expression. However, unlike the .xpath() or .css() methods, .re() returns a list of unicode strings rather than new selectors, so you cannot construct nested .re() calls.

Here is an example of extracting the image names from the HTML code above:

response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']

3. Using regular expressions in XPath (re:test)
The test() function can be useful when XPath's starts-with() or contains() does not meet the requirements.

For example, select the links of the list items whose "class" attribute ends with a digit:

from scrapy import Selector

doc = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
sel = Selector(text=doc, type="html")
sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']

4. Enumerating links:
for index, link in enumerate(links):
    print(index, link)
0 link1
1 link2
...

5. You do not necessarily have to follow all four steps
Sometimes you can leave items.py unchanged and generate the resulting dictionary directly in spider.py, for example:
yield {
    ...
}

and so on.
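A minimal sketch of this pattern, reusing the quote fields from earlier (no QuotesItem required):

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for sel in response.xpath('//div[@class="quote"]'):
            # yield a plain dict instead of an Item instance
            yield {
                'text': sel.xpath('span[@class="text"]/text()').extract_first(),
                'author': sel.xpath('span/small/text()').extract_first(),
                'tags': sel.xpath('div/a/text()').extract(),
            }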

6. Recursive links, crawling page by page

Inside the parse(self, response) method, add the following (the XPath of the next-page link is left to fill in):

next_page = response.xpath('...').extract_first()  # XPath of the "next page" link
if next_page:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

7. How to prevent 403 errors:
Adjust the settings.py file by setting USER_AGENT, so that requests simulate a browser visit:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
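If changing the global setting is not desirable, one alternative (a sketch, assuming you construct the request yourself) is to send the header on individual requests:

yield scrapy.Request(
    url,
    callback=self.parse,
    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'},
)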
