First Encounter with Scrapy: Crawling Images from moko.cc


I spent the past two days studying the Scrapy crawler framework, so I decided to write a small practice crawler. I spend quite a bit of time browsing pictures anyway (right, that kind of artistic photo), and I like to flatter myself that looking at more beautiful photos will improve my aesthetics and make me a more elegant programmer. O(∩_∩)O~ Just kidding. Without further ado, let's get down to business and write an image crawler.

Design concept: crawl the model photos on moko.cc (the Moko site), use CrawlSpider to extract the URL of each photo, and write the extracted image URLs into a static HTML file for storage; opening that file in a browser lets you view the images.

My environment is Windows 8.1 with Python 2.7 and Scrapy 0.24.4. I won't go into how to set up the environment. Based on the official documentation, I have summarized four steps for building a crawler:

  • Create a Scrapy project
  • Define the items (the elements to be extracted from the web pages)
  • Implement a spider class that crawls URLs and extracts items
  • Implement an item pipeline class to store the extracted items
The rest is straightforward if you follow the steps. First, create the project in a terminal and name it moko by entering the command scrapy startproject moko. Scrapy will create a moko directory in the current directory containing a set of skeleton files. You can read about all of them in the documentation; here I will mainly introduce the files we use this time.
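For orientation, the skeleton generated by scrapy startproject moko looks roughly like this (the exact file list can vary slightly between Scrapy versions):

moko/
    scrapy.cfg          # deploy/configuration file
    moko/               # the project's Python module
        __init__.py
        items.py        # item definitions (edited below)
        pipelines.py    # item pipelines (edited below)
        settings.py     # project settings (edited below)
        spiders/        # directory where our spider will live
            __init__.py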

Define the Item in items.py to specify the data to be captured:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MokoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()

 

  • The url field defined here is the key under which the final result is stored (an Item behaves like a dict); its use will be explained later. The field name is arbitrary: for example, if we also wanted to crawl the name of the image's author, we could simply add name = scrapy.Field(), and so on (a short sketch of such an extended item follows the spider member list below).
  • Next, go to the spiders folder, create a Python file named mokospider.py, and add the core code that implements the spider.
  • The spider is a Python class that inherits from scrapy.contrib.spiders.CrawlSpider and has three members that must be defined:

    name: the spider's identifier. It must be unique; different crawlers must use different names.

    start_urls: a list of URLs from which the spider starts crawling.

    parse(): the parsing method. When called, the Response object returned for each URL is passed in as its only parameter. It is responsible for parsing the captured data (resolving it into items) and for following further URLs.
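As mentioned in the first bullet above, adding further fields to the item is a one-line change each. A minimal sketch (the name field here is my own hypothetical example and is not used anywhere else in this project):

import scrapy

class MokoItem(scrapy.Item):
    url = scrapy.Field()   # the image URL we actually extract
    name = scrapy.Field()  # hypothetical extra field, e.g. the image author's name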

# -*- coding: utf-8 -*-
# File name : spyders/mokospider.py
# Author: Jhonny Zhang
# mail: veinyy@163.com
# create Time : 2014-11-29
#############################################################################

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from moko.items import MokoItem
import re
from scrapy.http import Request
from scrapy.selector import Selector


class MokoSpider(CrawlSpider):
    name = "moko"
    allowed_domains = ["moko.cc"]
    start_urls = ["http://www.moko.cc/post/aaronsky/list.html"]
    rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html')), callback='parse_img', follow=True),)

    def parse_img(self, response):
        urlItem = MokoItem()
        sel = Selector(response)
        for divs in sel.xpath('//div[@class="pic dBd"]'):
            img_url = divs.xpath('.//img/@src2').extract()[0]
            urlItem['url'] = img_url
            yield urlItem

 

Our project is named moko. The crawler's allowed_domains is restricted to moko.cc; this is the crawler's boundary, meaning the spider may only crawl pages under that domain. The crawler starts from http://www.moko.cc/post/aaronsky/list.html.

Next we set the crawling rules, which is what distinguishes CrawlSpider from a basic spider. Suppose we start crawling from webpage A: that page contains many hyperlinks, and the crawler then follows only the URLs that match the rules we set. When a matching page has been fetched, the callback function is invoked. We do not use the default name parse for the callback, because the crawler framework itself may call parse (as noted in the official documentation), which would cause a conflict. The target page http://www.moko.cc/post/aaronsky/list.html contains many image links, and each image page's link follows the same pattern, so we express the rule with a regular expression: rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html')), callback='parse_img', follow=True),). This means that, starting from the current page, every linked page whose URL matches /post/\d*\.html is crawled and handed to parse_img for processing.

Next we define the parsing function parse_img, which is the critical part. Its input parameter is the response object the crawler gets back after opening a URL. The response object contains a large amount of raw text, and we need to filter out just the content we want. How do we filter it? Haha, there is a powerful Selector class whose xpath() method takes an XPath path expression and parses the content for us. Before writing the expression we need to analyze the page; the tool I use here is Firebug. What we need is the src2 attribute of the <img> tag inside the <div class="pic dBd"> element. In parse_img, we first create urlItem, an instance of the MokoItem class defined in items.py, and wrap the response in a Selector. I use a loop to handle each matching block, extract the image URL with an XPath expression (as for how to use XPath, just search for it), and store the result in urlItem under the url field defined in items.py. A quick way to test the XPath expression on its own is shown in the sketch below; after that, we define the pipeline, which is responsible for storing our content.
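As a quick sanity check (this is my own illustrative snippet, not part of the original project), you can exercise the same XPath against a small hand-written HTML fragment using Scrapy's Selector. The fragment and the example.com URLs below are made up; they only mimic the structure seen in Firebug (div.pic.dBd with an src2 attribute):

# -*- coding: utf-8 -*-
# Standalone check of the XPath used in parse_img (Python 2 / Scrapy 0.24).
from scrapy.selector import Selector

sample_html = '''
<div class="pic dBd">
    <img src="loading.gif" src2="http://img.example.com/photo1.jpg"/>
</div>
<div class="pic dBd">
    <img src="loading.gif" src2="http://img.example.com/photo2.jpg"/>
</div>
'''

sel = Selector(text=sample_html)
for div in sel.xpath('//div[@class="pic dBd"]'):
    print div.xpath('.//img/@src2').extract()[0]
# expected output:
# http://img.example.com/photo1.jpg
# http://img.example.com/photo2.jpg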
from moko.items import MokoItem


class MokoPipeline(object):

    def __init__(self):
        # the results are written into a static HTML file
        self.mfile = open('test.html', 'w')

    def process_item(self, item, spider):
        # wrap each crawled URL in an <img> tag so the picture shows up
        # directly when test.html is opened in a browser
        text = '<img src="' + item['url'] + '"/><br>'
        self.mfile.writelines(text)
        return item  # hand the item back so Scrapy can continue processing it

    def close_spider(self, spider):
        # called once when the spider finishes; close the output file
        self.mfile.close()

 

The pipeline creates a test.html file to store the results. Note that process_item writes a little HTML (an <img> tag per item) so that the images are displayed directly when the file is opened in a browser. At the end, close_spider closes the file; it is called when the crawler finishes. Finally, configure settings.py.
BOT_NAME = 'moko'

SPIDER_MODULES = ['moko.spiders']
NEWSPIDER_MODULE = 'moko.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'moko (+http://www.yourdomain.com)'

ITEM_PIPELINES = {
    'moko.pipelines.MokoPipeline': 1,
}
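With the item, spider, pipeline, and settings in place, the crawler is started from the project's root directory with the standard Scrapy command:

scrapy crawl moko

When the crawl finishes, open the generated test.html in a browser to view the collected images.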

 


Finally, give it a run and take a look at the results. I wish everyone lots of fun. ^_^
