This article implements the same function with the Scrapy framework. Scrapy is an application framework for crawling web site data and extracting structured data. More details on the use of the framework are available in the official documentation; this article shows the overall implementation of crawling comic pictures.
Scrapy Environment Configuration and Installation
First comes the installation of Scrapy. The blogger uses a Mac, so run directly at the command line:
pip install scrapy
Extraction of HTML node information uses the BeautifulSoup library; its approximate usage can be found in a previous article. Install it directly with the command:
pip install beautifulsoup4
The html5lib parser is required when initializing the BeautifulSoup object for the target web page. Install it with the command:
pip install html5lib
When the installation is complete, run the command directly at the command line:
scrapy
You can see the following output, which proves that the Scrapy installation is complete.
Scrapy 1.2.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  ...
Project Creation
Create a project named comics from the command line under the current path:
scrapy startproject comics
When the creation is complete, the corresponding project folder appears under the current directory, and you can see that the resulting comics file structure is:
|____comics
| |______init__.py
| |______pycache__
| |____items.py
| |____pipelines.py
| |____settings.py
| |____spiders
| | |______init__.py
| | |______pycache__
|____scrapy.cfg
PS: the command to print the current file structure is:
find . -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'
The specific function of each file can be found in the official documentation; this implementation does not involve most of them, so they are not covered here.
Create the Spider Class
Create a class to implement the specific crawling function. All of our processing will be implemented in this class, and it must be a subclass of scrapy.Spider.
Create a comics.py file under the comics/spiders path.
The concrete implementation of comics.py:
#coding: utf-8
import scrapy

class Comics(scrapy.Spider):

    name = "comics"

    def start_requests(self):
        urls = ['http://www.xeall.com/shenshi']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log(response.body)
The custom class is a subclass of scrapy.Spider, and its name property is the unique identifier of the crawler, used as the argument to the scrapy crawl command. The other properties and methods are explained later.
Run
After creating the custom class, switch to the comics path and run the following command to start the crawler task and begin crawling pages.
scrapy crawl comics
What is printed is information about the crawler's running process, together with the HTML source of the target page.
2016-11-26 22:04:35 [scrapy] INFO: Scrapy 1.2.1 started (bot: comics)
2016-11-26 22:04:35 [scrapy] INFO: Overridden settings: {'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'comics', 'NEWSPIDER_MODULE': 'comics.spiders', 'SPIDER_MODULES': ['comics.spiders']}
2016-11-26 22:04:35 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
...
At this point, the creation of a basic crawler is complete. What follows is the implementation of the specific crawling process.
Crawl Comic Pictures
Start Address
The starting address of the crawler is:
http://www.xeall.com/shenshi
Our main focus is the comic list in the middle of the page, along with the paging controls below the list that show the number of pages, as shown in the following figure.
[Figure 1: comic list page with paging controls]
The main task of the crawler is to crawl the pictures of each comic in the list; after the current page has been crawled, move on to the next page of the comic list and continue crawling comics, looping until all comics have been crawled. The overall loop is sketched below.
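For orientation, here is a minimal, hypothetical sketch of that loop. The listcon class used to locate the list is introduced later in this article; the spider name, the parse_comic callback, and the omitted pagination handling are illustrative assumptions, not the article's final code.

# -*- coding: utf-8 -*-
# Hypothetical outline of the crawl loop, for illustration only.
import scrapy
from bs4 import BeautifulSoup

class ComicsOutline(scrapy.Spider):
    name = "comics_outline"                          # hypothetical spider name
    start_urls = ['http://www.xeall.com/shenshi']

    def parse(self, response):
        soup = BeautifulSoup(response.body, "html5lib")
        # 1. request every comic linked from the current list page
        listcon = soup.find('ul', class_='listcon')
        if listcon is not None:
            for a in listcon.find_all('a', href=True):
                yield scrapy.Request(url=response.urljoin(a['href']),
                                     callback=self.parse_comic)
        # 2. locate the "next page" link in the paging controls and request it
        #    with parse as the callback again, so the loop continues
        #    (the actual selector for the paging controls is covered later)

    def parse_comic(self, response):
        # 3. download the pictures of one comic (implemented later in the article)
        self.log(response.url)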
We put the starting URL into the urls array used by the start_requests function. start_requests is a method that overrides the parent class and is executed when the crawler task starts.
The main work of the start_requests method is done in this line of code: request the specified URL, and call the corresponding callback function self.parse when the request completes.
scrapy.Request(url=url, callback=self.parse)
There is another way to implement the previous code:
#coding: utf-8
import scrapy

class Comics(scrapy.Spider):

    name = "comics"
    start_urls = ['http://www.xeall.com/shenshi']

    def parse(self, response):
        self.log(response.body)
start_urls is a property provided by the framework: an array containing the URLs of the target pages. When start_urls is set, there is no need to override the start_requests method; the crawler will crawl the addresses in start_urls in turn and automatically invoke parse as the callback method after each request completes.
However, to make it easier to pass processing through other callback functions, the demo still uses the earlier implementation.
Crawl Comic URLs
Starting from the start page, we first have to crawl the URL of each comic.
Comic List on the Current Page
The start page is the first page of the comic list, and we need to extract the required information from the current page by implementing the parse callback method.
Import the BeautifulSoup library at the beginning of the file:
from bs4 import BeautifulSoup
Use the HTML source returned by the request to initialize the BeautifulSoup object.
def parse(self, response):
    content = response.body
    soup = BeautifulSoup(content, "html5lib")
The initialization specifies the html5lib parser; if it is not installed, an error will occur here. If no parser is specified when a BeautifulSoup object is initialized, the best matching parser is chosen automatically, and there is a pitfall here: for the source of this target page, the default best parser is lxml, whose parsing result is problematic and breaks the data extraction that follows. So when you find that the results are sometimes wrong, print soup to check whether it is correct.
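As a side note, a small helper like the following (an illustration, not part of the original article) makes the parser choice explicit so the pitfall above cannot occur silently:

from bs4 import BeautifulSoup

def make_soup(html):
    # Explicitly request html5lib (requires `pip install html5lib`).
    # Omitting the second argument lets BeautifulSoup pick the "best"
    # installed parser (often lxml), which mis-parses this particular page.
    return BeautifulSoup(html, "html5lib")

Printing make_soup(response.body) while debugging is a quick way to confirm the page was parsed correctly.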
Looking at the HTML source, the part of the page that shows the comic list is a ul tag whose class is listcon; the listcon class uniquely identifies the corresponding tag.
[Figure 2: HTML source of the comic list, a ul tag with class listcon]
Extract the tag that contains the comic list:
listcon_tag = soup.find('ul', class_='listcon')
The find method above means looking for the ul tag whose class is listcon
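As an illustrative follow-up (an assumption about the tag structure, not the article's confirmed code), the links to the individual comics could then be collected from listcon_tag like this:

# Illustrative only: collect candidate comic links from the extracted list tag.
comic_links = []
if listcon_tag is not None:
    for a in listcon_tag.find_all('a', href=True):   # every link inside the list
        comic_links.append(a['href'])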