Python Crawler: Crawling Comics with the Scrapy Framework


This article implements the same function as before, this time through the Scrapy framework. Scrapy is an application framework for crawling web site data and extracting structured data. More details on using the framework are available in the official documentation; this article shows the overall implementation of crawling comic pictures.

Scrapy Environment Configuration

Installation

First is the installation of Scrapy. The blogger uses a Mac system, so it is installed by running this directly at the command line:

pip install scrapy

HTML node information is extracted with the BeautifulSoup library; its approximate usage can be seen in a previous article. Install it directly with the command:

pip install beautifulsoup4

The html5lib parser is required when initializing the BeautifulSoup object for the target web page. The install command:

pip install html5lib

When the installation is complete, run the command directly at the command line:

scrapy

You can see the following output, which proves that the Scrapy installation is complete.

Scrapy 1.2.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  ...
Project Creation

Create a project named comics from the command line under the current path:

scrapy startproject comics

When the creation is complete, the corresponding project folder appears under the current directory, and you can see that the resulting comics file structure is:

|____comics
| |______init__.py
| |______pycache__
| |____items.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |______init__.py
| | |______pycache__
|____scrapy.cfg

P.S. The command to print the current file structure is:

find . -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'

The specific function each file serves can be consulted in the official documentation; this implementation does not involve most of these files, so they are set aside for now.

Create Spider Class

Create a class to implement the specific crawl function. All of our processing will be implemented in this class, and it must be a subclass of scrapy.Spider.

Create a comics.py file under the comics/spiders path.

The concrete implementation of comics.py:

#coding: utf-8

import scrapy

class Comics(scrapy.Spider):

    name = "comics"

    def start_requests(self):
        urls = ['http://www.xeall.com/shenshi']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log(response.body)

The custom class is a subclass of scrapy.Spider, where the name property is the unique identifier of the crawler and serves as the argument to the scrapy crawl command. The other methods and properties are explained later.

Run

After creating the custom class, switch to the comics path and run the following command to start the crawler task and begin crawling pages.

scrapy crawl comics

What gets printed is information about the crawler's run, along with the HTML source of the target page.

2016-11-26 22:04:35 [scrapy] INFO: Scrapy 1.2.1 started (bot: comics)
2016-11-26 22:04:35 [scrapy] INFO: Overridden settings: {'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'comics', 'NEWSPIDER_MODULE': 'comics.spiders', 'SPIDER_MODULES': ['comics.spiders']}
2016-11-26 22:04:35 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
 ...

At this point, the creation of a basic crawler is complete; what follows is the implementation of the specific crawl process.

Crawl Comic Pictures

Start Address

The starting address of the crawler is:

http://www.xeall.com/shenshi

Our main focus is the comic list in the middle of the page, along with the controls below the list that show the page numbers, as shown in the following figure.

(Figure: 1.jpg)

The main task of the crawler is to crawl the pictures of each comic in the list; after crawling through the current page, it moves to the next page of the comic list and continues crawling, looping until all the comics have been crawled.
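Sketched in code, this flow might look like the following. This is only a structural sketch: the spider name, the two extract helpers, and the parse_comic callback are hypothetical placeholders, and the real extraction logic is developed in the rest of the article.

import scrapy

class ComicsFlow(scrapy.Spider):
    """Structural sketch of the crawl loop described above."""

    name = "comics_flow"
    start_urls = ['http://www.xeall.com/shenshi']

    def parse(self, response):
        # 1. Crawl every comic on the current list page.
        for comic_url in self.extract_comic_urls(response):
            yield scrapy.Request(url=comic_url, callback=self.parse_comic)

        # 2. Move on to the next page of the list and repeat,
        #    until there is no next page.
        next_page = self.extract_next_page_url(response)
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

    def extract_comic_urls(self, response):
        return []    # placeholder, implemented later with BeautifulSoup

    def extract_next_page_url(self, response):
        return None  # placeholder

    def parse_comic(self, response):
        pass         # placeholder, would download the comic's pictures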

We put the URL of the starting address into the urls array of the start_requests function. start_requests is a method that overrides the parent class and is executed at the start of the crawler task.

The main execution of the start_requests method is in this line of code: request the specified URL, and call the corresponding callback function self.parse when the request completes:

scrapy.Request(url=url, callback=self.parse)

There is another way to implement the previous code:

#coding: utf-8

import scrapy

class Comics(scrapy.Spider):

    name = "comics"
    start_urls = ['http://www.xeall.com/shenshi']

    def parse(self, response):
        self.log(response.body)

start_urls is a property provided by the framework: an array holding the URLs of the target pages. With the start_urls value set, there is no need to override the start_requests method; the crawler will crawl the addresses in start_urls in turn, and automatically invokes parse as the callback method after each request completes.

However, in order to make it easier to pass other callback functions during processing, the demo still uses the previous implementation.
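For instance (a hypothetical variation, not from the original article), keeping the start_requests override makes it easy to send different start URLs to different callbacks, which the start_urls shortcut does not allow:

def start_requests(self):
    # the comic list page goes to the normal parse callback
    yield scrapy.Request(url='http://www.xeall.com/shenshi',
                         callback=self.parse)
    # a second, hypothetical entry page could go to its own handler
    yield scrapy.Request(url='http://www.xeall.com/other',
                         callback=self.parse_other)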

Crawl Comic URLs

Starting from the start page, we first have to crawl the URL of each comic.

Current Page Comic List

The start page is the first page of the comic list. We need to extract the required information from the current page, which we do by implementing the callback method parse.

Import the BeautifulSoup library at the beginning:

from bs4 import BeautifulSoup

Initialize BeautifulSoup with the HTML source returned by the request.

def parse(self, response):
    content = response.body
    soup = BeautifulSoup(content, "html5lib")

The initialization specifies the html5lib parser; if it is not installed, an error is raised here. If BeautifulSoup is initialized without a specified parser, it automatically picks the best matching parser. There is a pitfall here: for this target page's source code, the default best match is lxml, and with that parser the page is parsed incorrectly, which breaks the subsequent data extraction. So when you find that the results are sometimes problematic, print soup to check whether it is correct.
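To see the kind of difference the parser makes, here is a standalone sketch (independent of the spider, using deliberately broken sample HTML) that parses the same markup with several parsers and prints each result:

from bs4 import BeautifulSoup

# deliberately unclosed tags, similar to the sloppy HTML real pages contain
html = "<ul class='listcon'><li><a href='/shenshi/1.html'>comic"

for parser in ("html5lib", "lxml", "html.parser"):
    try:
        soup = BeautifulSoup(html, parser)
        # each parser repairs the broken markup differently
        print(parser, "->", soup.find('ul', class_='listcon'))
    except Exception as e:
        # the parser library is not installed
        print(parser, "unavailable:", e)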

Looking at the HTML source, the part of the page that displays the comic list is a ul tag with the class listcon; the listcon class uniquely identifies the corresponding tag.

(Figure: 2.jpg)

Extract the tag containing the comic list:

listcon_tag = soup.find('ul', class_='listcon')

The find method above means looking for a ul tag whose class is listcon.
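The original article breaks off at this point. As a plausible continuation (a sketch only, assuming each list item wraps the comic's link in an a tag, which the truncated source does not confirm), the next step would be to pull each comic's href out of listcon_tag and request it with a hypothetical parse_comic callback:

def parse(self, response):
    content = response.body
    soup = BeautifulSoup(content, "html5lib")
    listcon_tag = soup.find('ul', class_='listcon')

    # assumption: every comic in the list is an <a> tag whose
    # href points at the comic's own page
    for a_tag in listcon_tag.find_all('a'):
        href = a_tag.get('href')
        if href:
            # response.urljoin resolves a relative href against the page URL
            yield scrapy.Request(url=response.urljoin(href),
                                 callback=self.parse_comic)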
