The road of Scrapy exploration

Source: Internet
Author: User
Tags: python, web crawler


Table of Contents
    • 1 What is Scrapy?
    • 2 How to learn
      • 2.1 Read the manual
      • 2.2 Installation
      • 2.3 Getting started
      • 2.4 Some tools
    • 3 Some problems encountered
      • 3.1 Linking a request and its response
      • 3.2 How to POST data
      • 3.3 Requests filtered by Scrapy
      • 3.4 What is a Scrapy Item?
      • 3.5 Displaying Chinese
      • 3.6 Complex start_urls
      • 3.7 Small crawlers without creating a project
      • 3.8 Other
    • 4 Postscript
1 What is Scrapy?
    • Scrapy is a Python web crawler framework. What is a crawler, and what can a crawler do? Look it up, or read some of the posts about web crawlers and the interesting things people build with them. You can also read the description on the official Scrapy website.
    • (Version: Scrapy 0.24.6)
2 How to learn

2.1 Read the manual

The Scrapy website has its own getting-started manual, available both as a PDF and online.

2.2 Installation

The manual covers installation; work through it yourself.

2.3 Getting started
    • The manual
      For beginners, the manual gives detailed examples, and most articles on the web are basically excerpted from the manual and translated. I have read the manual several times; more than 80% of questions can be answered by it, so I recommend reading it through.
    • XPath, lxml and BeautifulSoup
      You can refer to tutorials available online, then try to fetch and parse a few pages yourself. Below is a piece of code I wrote while learning lxml, to get acquainted with XPath.
from lxml import etree
from StringIO import StringIO

xmlfile = """<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <author>Kurt Cagle</author>
    <author>James Linn</author>
    <author>Vaidyanathan Nagarajan</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
  <book category="WEB">
    <title lang="ch">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>"""

f = StringIO(xmlfile)
tree = etree.parse(f)
tree.xpath('//title')[0].tag
tree.xpath('//title')[0].text

# Closure that runs an XPath query, prints (tag, text) pairs for element
# results or the raw values for string results, and returns the result count.
def qresf(tree):
    def qresfunction(query):
        res = tree.xpath(query)
        try:
            reslst = [q.tag for q in res]
            resls = [q.text for q in res]
            print zip(reslst, resls)
        except AttributeError:
            resls = '\n'.join([q for q in res])
            print resls
        return len(res)
    return qresfunction

q = qresf(tree)
q('//title')
q('//author')
q('/bookstore/book/title')
q('/bookstore/book[price>35]')  # the predicate value was truncated in the original; 35 is a placeholder
q('//book/title[@*]')
q('//@*')
q('//*')
q('//@lang')
q('//book/title|//book/price')
q('//bookstore/child::year')
q('//nothing')
q('/book/child::year')
q('//book/child::*')
q('//book/title[@lang]')
q('//text()')
    • Learn some of Python's network libraries
      Such as urllib and urllib2. A basic understanding is very helpful: for example, how to add parameters to a URL and how to encode Chinese parameters (see the sketch after this list).
    • Learn Twisted
      Scrapy is implemented on top of it. It is worth understanding; I have read through a basic tutorial and have some grasp of it, though I generally do not use it directly.
    • Learn JavaScript
      At least be able to read JS files, because many pages load their data through asynchronous callbacks; that is, your first request to a page often does not return the content you want, and then you have to analyse the background JS code to get at the data. There are plugins that mimic browser behaviour, but personally I find them too cumbersome; the simplest and quickest way is to analyse the background requests directly (which, of course, depends on your patience and skill at reading code).
    • Learn the HTTP protocol
      HTTP is a very important, fundamental protocol on the Internet; once you understand it, working with crawlers becomes much more relaxed. I recommend HTTP: The Definitive Guide (the English edition; English material is easier to find and more direct; when something is unclear, look it up, or try the small dictionary tool on my GitHub, implemented with lxml + XPath).
    • Write your own crawler
      You can search for other people's crawler projects online; most of them take scraping forum threads or pretty pictures as examples. Read their source code and see how more experienced people implement things.
    • Keep re-reading the manual
      The basic principles in the manual are the cornerstone for solving problems.
    • Study Scrapy's own source code
      This is what you need when there is nowhere else to turn for help. Reading the source also helps you understand the manual. I have tried reading the source alongside the manual (not thoroughly, just parts of it), and it helped my understanding of the manual a great deal (an example is given below).
    • Do small projects that interest you
      Practice deepens your understanding of the concepts (for example, scrape posts from a forum, or write a small dictionary tool). Plenty of problems surface naturally at this stage, and getting something working gives quite a sense of accomplishment.
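
As a concrete illustration of the urllib/urllib2 point above, here is a minimal sketch (Python 2, matching the Scrapy 0.24 era of this article) of adding query parameters, including a Chinese one, to a URL; the URL and parameter names are invented for the example.

# -*- coding: utf-8 -*-
# Minimal Python 2 sketch: add query parameters (including Chinese text) to a
# URL with urllib, then build a request with urllib2. URL and parameter names
# are illustrative only.
import urllib
import urllib2

params = {
    'page': 1,
    'keyword': u'爬虫'.encode('utf-8'),  # encode Chinese text to bytes before urlencoding
}
query = urllib.urlencode(params)         # e.g. 'page=1&keyword=%E7%88%AC%E8%99%AB'
url = 'http://example.com/search?' + query

req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# response = urllib2.urlopen(req)        # uncomment to actually send the request
# print response.read()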
2.4 Some tools

Some tools are mentioned in the manual and can be consulted there. Many browsers also ship with their own network monitor. In Chrome, for example, right-click a page and choose "Inspect element" to monitor the current page: you can watch the requests the page sends, test XPath expressions, edit elements, and study the structure of the page. On Firefox you can install Firebug. I also recommend the Postman extension for Chrome to simulate the browser sending requests.

3 Some problems encountered

The following are some of the problems I ran into while exploring. Most of them can still be solved with the manual, but the relevant passages are not always easy to notice.

3.1 Linking a request and its response

When the item built from a response needs to be combined with data carried by the request that produced it, the data can be passed through the request's meta attribute. See the manual:

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta,
                          encoding='utf-8', priority=0, dont_filter=False, errback])
...
meta
    A dict that contains arbitrary metadata for this request. This dict is empty for new
    Requests, and is usually populated by different Scrapy components (extensions,
    middlewares, etc). So the data contained in this dict depends on the extensions you
    have enabled. See Request.meta special keys for a list of special meta keys recognized
    by Scrapy. This dict is shallow copied when the Request is cloned using the copy() or
    replace() methods, and can also be accessed, in your spider, from the response.meta
    attribute.

The last sentence is the key point: the request's meta dict is carried over and can be read from response.meta.
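
To make this concrete, here is a minimal sketch of a callback pair that carries data from a list page into the detail-page callback via meta; the URLs, XPaths, item fields and spider name are invented, and the imports follow the Scrapy 0.24-era API used in this article.

# Sketch: pass data between callbacks via Request.meta and read it back from
# response.meta. All URLs, XPaths and field names are illustrative only.
import urlparse

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.item import Item, Field

class ThreadItem(Item):
    list_title = Field()
    body = Field()

class ForumSpider(Spider):
    name = 'meta_demo'
    start_urls = ['http://example.com/list']

    def parse(self, response):
        sel = Selector(response)
        list_title = sel.xpath('//h1/text()').extract()
        for href in sel.xpath('//a[@class="thread"]/@href').extract():
            # Stash data known on the list page into the request's meta dict...
            yield Request(urlparse.urljoin(response.url, href),
                          callback=self.parse_thread,
                          meta={'list_title': list_title})

    def parse_thread(self, response):
        # ...and read it back from response.meta on the detail page.
        item = ThreadItem()
        item['list_title'] = response.meta['list_title']
        item['body'] = Selector(response).xpath('//div[@class="post"]/text()').extract()
        yield item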

3.2 How to POST data

Again, look at the parameters of the Request class: there is a method parameter, which defaults to GET, so it should be set to POST. As for where to put the data, the HTTP protocol tells us that POST data goes in the body. With that you can construct a POST request by hand. A more convenient approach is to construct a FormRequest, which the manual illustrates:

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
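
If you do want to build the POST by hand, as described above, a sketch along these lines should also work (the URL and field names are invented): set method='POST' and put the urlencoded payload in the body yourself.

# Sketch: POST with a plain Request instead of FormRequest. The method is set
# to POST and the urlencoded payload goes in the body, as the HTTP protocol
# dictates. URL and field names are illustrative only.
import urllib
from scrapy.http import Request

def make_post_request(callback):
    data = {'name': 'John Doe', 'age': '27'}
    return Request(url='http://www.example.com/post/action',
                   method='POST',
                   body=urllib.urlencode(data),
                   headers={'Content-Type': 'application/x-www-form-urlencoded'},
                   callback=callback)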
3.3 Requests filtered by Scrapy

The problem was this: I send a request and get some initial data back; that data contains header information describing the subsequent pages, plus the data of the current page. But I wanted to process every page's data in one function, so in effect I send the request for the current page twice, and because Scrapy automatically filters duplicate requests, the repeated request is dropped. The workaround can also be found in the manual:

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta,
                          encoding='utf-8', priority=0, dont_filter=False, errback])

Notice the dont_filter argument: with it you can mark a request so that Scrapy does not filter it out as a duplicate. The people who designed Scrapy clearly thought things through.
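
A minimal sketch of that workaround: the current page is requested a second time with dont_filter=True so the duplicate filter lets it through, and every page is then handled in one callback. URLs, the page count and names are invented for illustration.

# Sketch: re-request the already-visited page with dont_filter=True so that
# Scrapy's duplicate filter does not drop it, and handle all pages uniformly.
from scrapy.spider import Spider
from scrapy.http import Request

class PagedSpider(Spider):
    name = 'dont_filter_demo'
    start_urls = ['http://example.com/list?page=1']

    def parse(self, response):
        # The first response carries header info (e.g. the total page count)
        # plus page 1's own data; re-issue page 1 so it goes through
        # parse_page like every other page.
        yield Request(response.url, callback=self.parse_page, dont_filter=True)
        for page in range(2, 5):  # the real page count would come from the header info
            yield Request('http://example.com/list?page=%d' % page,
                          callback=self.parse_page)

    def parse_page(self, response):
        # Unified handling of each page's data goes here.
        pass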

3.4 What is a Scrapy Item?
Item Fields

    Field objects are used to specify metadata for each field. For example, the serializer
    function for the last_updated field illustrated in the example above.

    You can specify any kind of metadata for each field. There is no restriction on the
    values accepted by Field objects. For this same reason, there is no reference list of
    all available metadata keys. Each key defined in Field objects could be used by a
    different component, and only those components know about it. You can also define and
    use any other Field key in your project too, for your own needs. The main goal of
    Field objects is to provide a way to define all field metadata in one place. Typically,
    those components whose behaviour depends on each field use certain field keys to
    configure that behaviour. You must refer to their documentation to see which metadata
    keys are used by each component.

    It's important to note that the Field objects used to declare the item do not stay
    assigned as class attributes. Instead, they can be accessed through the Item.fields
    attribute. And that's all you need to know about declaring items.

A large, foggy chunk of text; it is clearer to look directly at the source code of Item.

# Excerpt from scrapy/item.py (Scrapy 0.24)
class BaseItem(object_ref):
    """Base class for all scraped items."""
    pass


class Field(dict):
    """Container of field metadata"""


class ItemMeta(type):

    def __new__(mcs, class_name, bases, attrs):
        fields = {}
        new_attrs = {}
        for n, v in attrs.iteritems():
            if isinstance(v, Field):
                fields[n] = v
            else:
                new_attrs[n] = v

        cls = super(ItemMeta, mcs).__new__(mcs, class_name, bases, new_attrs)
        cls.fields = cls.fields.copy()
        cls.fields.update(fields)
        return cls


class DictItem(DictMixin, BaseItem):

    fields = {}

    def __init__(self, *args, **kwargs):
        self._values = {}
        if args or kwargs:  # avoid creating dict for most common case
            for k, v in dict(*args, **kwargs).iteritems():
                self[k] = v

    # ...


class Item(DictItem):

    __metaclass__ = ItemMeta

Suddenly everything became clear.
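
In other words, Field is just a dict of per-field metadata, and the metaclass collects the declared fields into Item.fields instead of leaving them as class attributes. A small sketch of what that means in practice (the item class, field names and serializer are made up for illustration):

# Sketch: Field holds arbitrary metadata, and declared fields live in
# cls.fields rather than staying as class attributes. Names are illustrative.
from scrapy.item import Item, Field

class Product(Item):
    name = Field()
    price = Field(serializer=str)   # arbitrary per-field metadata

p = Product(name='Laptop', price=999)
print p['name']                 # 'Laptop' -- items behave like dicts
print Product.fields['price']   # {'serializer': <type 'str'>}
# print Product.name            # AttributeError: fields are not class attributes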

3.5 Displaying Chinese

Scrapy saves text as UTF-8 by default. However, if you export from the command line like this:

scrapy crawl dmoz -o items.json

what you get in the file is the escaped in-memory representation of the Unicode strings, for example:

"\U5C97\U4F4D\U804C\U8D23\UFF1A"

I tried exporting through a pipeline of my own to solve the Chinese problem, but it could not cope with nested dictionaries or lists. On the other hand, writing straight into a database has no problem with Chinese at all.
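
One commonly used workaround (not necessarily what I did at the time) is a small pipeline that serialises each item with json.dumps(..., ensure_ascii=False), which writes readable Chinese and also copes with nested dicts and lists. A sketch, with an arbitrary output filename; enable it through ITEM_PIPELINES in settings.py.

# -*- coding: utf-8 -*-
# Sketch of a pipeline that writes human-readable Chinese instead of \uXXXX
# escapes; json.dumps with ensure_ascii=False also handles nested structures.
# The output filename is arbitrary.
import codecs
import json

class Utf8JsonPipeline(object):

    def open_spider(self, spider):
        self.fp = codecs.open('items_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()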

3.6 Complex start_urls

The problem here was that multiple request URLs had to be constructed from a configuration file, rather than coming directly from start_urls. This can be handled by overriding the spider's start_requests method. The manual mentions it, and it is also plain from the source code:

class Spider(object_ref):
    """Base class for Scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    # ...

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)
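
A minimal sketch of overriding start_requests to build the initial requests from a configuration file instead of start_urls (the config filename, its format and the URL pattern are all invented for illustration):

# Sketch: build the initial requests from a config file by overriding
# start_requests. Config filename/format and URLs are illustrative only.
import json

from scrapy.spider import Spider
from scrapy.http import Request

class ConfiguredSpider(Spider):
    name = 'configured_demo'

    def start_requests(self):
        with open('crawl_config.json') as f:
            conf = json.load(f)
        for entry in conf['targets']:
            url = 'http://example.com/%s?page=%d' % (entry['board'], entry['page'])
            yield Request(url, callback=self.parse)

    def parse(self, response):
        pass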
3.7 Small crawlers without creating a project

For a small, simple crawler, a single file is enough; there is no need to create a whole project. Scrapy provides a command for exactly this:

runspider
    Syntax: scrapy runspider <spider_file.py>
    Requires project: no

    Run a spider self-contained in a Python file, without having to create a project.
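
A minimal single-file spider of the kind runspider expects might look like the sketch below (the start URL, XPath and names are invented). Save it as, say, myspider.py and run: scrapy runspider myspider.py

# myspider.py -- a self-contained spider, runnable without a project:
#   scrapy runspider myspider.py
# Start URL, XPath and names are illustrative only.
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.item import Item, Field

class TitleItem(Item):
    title = Field()

class QuickSpider(Spider):
    name = 'quick'
    start_urls = ['http://example.com/']

    def parse(self, response):
        for t in Selector(response).xpath('//h2/text()').extract():
            item = TitleItem()
            item['title'] = t
            yield item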
3.8 Other

I ran into many other problems while practising; owing to space and time constraints, I will share the specific problems and their solutions as I write them up.

4 Postscript
    • I first came across Scrapy six months ago and have been exploring it in my spare time since. I have also written a few simple small projects for practice, so Scrapy is now fairly clear to me (though of course there are still many problems left to work out). I wrote this article partly as a summary of the process for myself, and partly to share some of my own ways of exploring; I believe beginners like me will run into similar problems. In fact, most problems can be solved by Scrapy's own manual, so I recommend reading the manual, and really reading it through. Then learn the networking basics. Then search online for answers. Finally, I suggest reading the source code.

(Thank you for reading; corrections and discussion are welcome.)

Date: 2015-07-13T11:36+0800

Author: Walker knows
