Learning the Scrapy Crawler Framework in Python
First, the four steps:
New project (Project): create a new crawler project
Define the goals (Items): identify the data you want to crawl
Make the spider (Spider): write a spider that starts crawling web pages
Store the content (Pipeline): design a pipeline to store the crawled content
1. New project
scrapy startproject <projectname>, for example: scrapy startproject quotes
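This generates a project skeleton roughly like the following (newer Scrapy versions also add a middlewares.py):

quotes/
    scrapy.cfg          # deployment configuration
    quotes/
        __init__.py
        items.py        # item definitions go here
        pipelines.py    # item pipelines go here
        settings.py     # project settings
        spiders/        # your spiders live here
            __init__.py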
2. Define the goals
In Scrapy, Items are the containers used to hold the crawled data. They work somewhat like a Python dict, but provide some extra protection against errors, such as typos in field names.
In general, an item is created by subclassing the scrapy.Item class, and scrapy.Field objects are used to define its attributes (you can think of this as something like an ORM mapping).
Next, we build the item model.
The fields we want are:
Author (author)
Content (text)
Tags (tags)
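A minimal items.py for this model might look like the following sketch (the class name QuotesItem matches the import used in the spider below):

import scrapy

class QuotesItem(scrapy.Item):
    # one Field per piece of data we want to crawl
    author = scrapy.Field()
    text = scrapy.Field()
    tags = scrapy.Field()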
3. Make the spider (the most important step)
# -*- coding: utf-8 -*-
import scrapy
import sys
sys.path.append("D:\\pycodes\\quotes")
from quotes.items import QuotesItem

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # each quote sits in a <div class="quote"> block
        for sel in response.xpath('//div[@class="quote"]'):
            item = QuotesItem()
            item['text'] = sel.xpath('span[@class="text"]/text()').extract()
            item['author'] = sel.xpath('span/small/text()').extract()
            item['tags'] = sel.xpath('div/a/text()').extract()
            yield item
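To run the spider and save what it scrapes, use the scrapy crawl command with the spider's name (the output file name here is just an example):

scrapy crawl books -o quotes.json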
4. Design the pipeline
Process the item data by designing a pipeline.
class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item

class DoubanInfoPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.f = open("result.txt", "w")

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item
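A pipeline only runs if it is registered in settings.py via ITEM_PIPELINES (the module path below assumes the project is named quotes; adjust it to your own layout):

ITEM_PIPELINES = {
    'quotes.pipelines.DoubanInfoPipeline': 300,  # value is 0-1000, lower runs first
}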
Additional notes:
1. Using XPath selectors
response.xpath('//div/@href').extract()
response.xpath('//div[@href]/text()').extract()
response.xpath('//div[contains(@href, "image")]/@href').extract()
To select <p> elements under a div that are not direct children, you need a relative path:
div.xpath('.//p')  # note the leading dot
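The leading dot matters: without it, the expression is absolute and searches the whole document rather than just the selected node. A small sketch of the difference:

divs = response.xpath('//div')
for d in divs:
    p_wrong = d.xpath('//p')   # matches every <p> in the whole document
    p_right = d.xpath('.//p')  # matches only <p> descendants of this div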
2. Using .re() on selectors
Selector also has a .re() method for extracting data with regular expressions. However, unlike .xpath() or .css(), .re() returns a list of unicode strings instead of a new selector, so you cannot construct nested .re() calls.
Here is an example of extracting image names from link text:
response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']
3. Using re:test() in XPath
The re:test() function can be useful when XPath's starts-with() or contains() does not meet your needs.
For example, select the links in a list whose "class" attribute ends with a digit:
from scrapy import Selector
doc = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
sel = Selector(text=doc, type="html")
sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
4. Enumerating links
for index, link in enumerate(links):
    print(index, link)
0 link1
1 link2
...
5. You do not have to follow the four steps exactly
Sometimes you can leave items.py unchanged
and yield the resulting dict directly in spider.py, and so on; see the sketch below.
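A minimal sketch, reusing the XPath expressions from the spider above but with no Item class at all:

def parse(self, response):
    for sel in response.xpath('//div[@class="quote"]'):
        # a plain dict works in place of a QuotesItem
        yield {
            'text': sel.xpath('span[@class="text"]/text()').extract_first(),
            'author': sel.xpath('span/small/text()').extract_first(),
            'tags': sel.xpath('div/a/text()').extract(),
        }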
6. Recursive links: crawling page after page
In the parse(self, response) method, add something like the following, filling in the XPath for the next-page link (a concrete version follows below):
next_page = response.xpath("").extract_first()
if next_page:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
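For quotes.toscrape.com, the "Next" link lives in an <li class="next"> element, so a concrete parse method might look like this (the XPath here is an assumption based on that site's markup):

def parse(self, response):
    for sel in response.xpath('//div[@class="quote"]'):
        yield {'text': sel.xpath('span[@class="text"]/text()').extract_first()}
    # follow the pagination link, if any; assumes <li class="next"><a href=...>
    next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)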
7. How to prevent 403 errors
Adjust the settings.py file:
set USER_AGENT to mimic browser access, e.g.
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'