First Look at Scrapy: Crawling Images from Moko.cc in Practice


I spent the last couple of days studying the Scrapy crawler framework, so I decided to write a crawler to get some practice. What I usually do most is browse pictures, the artistic-photo kind; I like to think that looking at more beautiful photos will improve my aesthetic sense and make me a more elegant programmer. o(∩_∩)o~ Joking aside, let's cut to the chase and write an image crawler.

Design idea: the crawl target is the model photos on Moko.cc. Use CrawlSpider to extract the URL of each photo, write the extracted image URLs into a static HTML file as storage, and open that file to view the pictures.

My environment is Windows 8.1 with Python 2.7 + Scrapy 0.24.4; I won't cover how to set up the environment here, you can look that up. Referring to the official documentation, I concluded that building a crawler takes roughly four steps:
    • Create a Scrapy Project
    • Define the items to be extracted from the web pages
    • Implement a spider class that crawls the URLs and extracts the items
    • Implement an item pipeline class to store the extracted items
The rest is straightforward; just follow the steps. First, create a project in the terminal; let's name the project Moko. Enter the command scrapy startproject Moko, and Scrapy will create a Moko directory in the current directory containing some initial files. If you are interested in what each file does, see the documentation; here I will only introduce the files we use this time.
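For reference, the project skeleton that scrapy startproject generates looks roughly like this (standard Scrapy template; only items.py, pipelines.py, settings.py, and the spiders folder are touched in this article):

Moko/
    scrapy.cfg          # project configuration file
    Moko/               # the project's Python package
        __init__.py
        items.py        # item definitions (edited below)
        pipelines.py    # item pipeline (edited below)
        settings.py     # project settings (edited below)
        spiders/        # spiders go here
            __init__.py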

Next, define the item: in items.py we declare the data we want to crawl:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MokoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
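As a quick aside (not part of the project files), a Scrapy Item behaves like a dictionary whose keys are restricted to the declared fields, which is exactly how the url field gets filled in by the spider later. A minimal sketch, using a hypothetical URL:

import scrapy

class MokoItem(scrapy.Item):
    url = scrapy.Field()

item = MokoItem()
item['url'] = 'http://img.example.com/photo.jpg'   # hypothetical value
print item['url']    # prints the stored URL
print dict(item)     # {'url': 'http://img.example.com/photo.jpg'}
# Assigning to an undeclared field, e.g. item['author'] = '...', raises KeyError.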

    • The url field here is used to hold the final result in the item's dict-like storage, which will be used later; the field name is up to you. For example, if I also wanted to crawl the name of each photo's author, I could add name = scrapy.Field(), and so on.
    • Next, go into the spiders folder and create a Python file; let's name it mokospider.py. Add the core code that implements the spider:
    • The spider is a Python class that inherits from scrapy.contrib.spiders.CrawlSpider and has three required members:

      name: the spider's identifier; it must be unique, and different crawlers must use different names.

      start_urls: a list of URLs from which the spider starts crawling.

      parse(): the parsing method. When called, it is passed the response object returned from each URL as its only parameter; it is responsible for parsing the crawled data (into items) and for following further URLs.

# -*- coding: utf-8 -*-
# File name: spiders/mokospider.py
# Author: Jhonny Zhang
# mail: [Email protected]
# create time: 2014-11-29
#############################################################################

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from Moko.items import MokoItem
import re
from scrapy.http import Request
from scrapy.selector import Selector


class MokoSpider(CrawlSpider):
    name = "Moko"
    allowed_domains = ["moko.cc"]
    start_urls = ["http://www.moko.cc/post/aaronsky/list.html"]
    rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html',)), callback='parse_img', follow=True),)

    def parse_img(self, response):
        urlitem = MokoItem()
        sel = Selector(response)
        for divs in sel.xpath('//div[@class="pic dBd"]'):
            img_url = divs.xpath('.//img/@src2').extract()[0]
            urlitem['url'] = img_url
            yield urlitem

Our project is named Moko. allowed_domains restricts the crawler to moko.cc; this is the crawler's constrained area, and it will only crawl pages under that domain. The crawl starts from http://www.moko.cc/post/aaronsky/list.html. Then we set the crawl rules; this is where CrawlSpider differs from a basic spider. We start crawling from one page, that page contains many hyperlinks, and the crawler follows whichever links match the rules we set, then repeats the process on the pages it reaches. callback is the callback function invoked to process each matching page; I do not use the default name parse, because the official documentation warns that the crawler framework may itself call parse, causing conflicts.

The target page http://www.moko.cc/post/aaronsky/list.html contains many pictures, and the link to each picture page follows a pattern: a randomly opened one is http://www.moko.cc/post/1052776.html. The prefix http://www.moko.cc/post/ is always the same, and the part that differs is the number at the end. So we use a regular expression and write the rule rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html',)), callback='parse_img', follow=True),), which means: starting from the current page, crawl every linked page matching /post/\d*\.html and call parse_img to process it.

Next we define the parsing function parse_img; this part is the key. The parameter passed to it is the response object returned after the crawler opens a URL, and the response contains a large amount of text; the crawler's job is to filter out what we need. How do we filter it? The Selector class parses the content with its xpath() path expressions. Before parsing, we need to study the page itself; inspecting it with a tool such as Firebug shows that what we need is the src2 attribute of the <img> tag inside the <div class="pic dBd"> element. So we first instantiate a MokoItem() object, urlitem, as defined in items.py, hand the response to a Selector, and loop over the matching divs, handling one URL per iteration. The XPath expression extracts the image URL (for the details of XPath, look it up yourself), the result is stored in urlitem under the url field we defined in items.py, and the item is yielded.
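If you want to sanity-check the XPath without running the whole crawl, here is a small self-contained sketch; the HTML fragment is invented to mimic the structure described above (it is not real moko.cc markup):

from scrapy.selector import Selector

# Hypothetical markup imitating the page structure described in the article.
html = '''
<div class="pic dBd"><img src="loading.gif" src2="http://img.example.com/photo1.jpg"/></div>
<div class="pic dBd"><img src="loading.gif" src2="http://img.example.com/photo2.jpg"/></div>
'''

sel = Selector(text=html)
for divs in sel.xpath('//div[@class="pic dBd"]'):
    # Same expression as in parse_img: take the src2 attribute of the nested <img>.
    print divs.xpath('.//img/@src2').extract()[0]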

Then define the pipeline in pipelines.py; this is the part that stores our content:

from Moko.items import MokoItem


class MokoPipeline(object):

    def __init__(self):
        # Create a test.html file to hold the results.
        self.mfile = open('test.html', 'w')

    def process_item(self, item, spider):
        # Wrap each image URL in an <img> tag so the images display directly
        # when test.html is opened in a browser.
        text = '<img src="' + item['url'] + '"/>'
        self.mfile.writelines(text)
        return item

    def close_spider(self, spider):
        self.mfile.close()

The constructor creates a test.html file to store the results. Note that process_item writes a little HTML around each URL so that the images are displayed directly when the file is opened. close_spider closes the file and is called when the crawler finishes.

Last, configure settings.py:

BOT_NAME = 'Moko'

SPIDER_MODULES = ['Moko.spiders']
NEWSPIDER_MODULE = 'Moko.spiders'

# Crawl responsibly by identifying yourself (and your website) on the User-agent
#USER_AGENT = 'Moko (+http://www.yourdomain.com)'

ITEM_PIPELINES = {'Moko.pipelines.MokoPipeline': 1,}

That completes the project. Finally, I wish everyone happy crawling! ^_^
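As a closing practical note, the crawl itself is started from inside the project directory with Scrapy's crawl command and the spider name defined above; once it finishes, open the generated test.html in a browser to view the collected images:

scrapy crawl Moko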
