Writing a Python crawler to download GIF images from Baozou Manhua
This article shows how to write a crawler that downloads the funny GIF images from the Baozou Manhua (rage comic) site for offline viewing. The crawler is developed with Python 3.3 and mainly uses the urllib.request and BeautifulSoup modules.
The urllib module provides high-level interfaces for retrieving data from the web. Opening a URL with urlopen() is much like opening a file with Python's built-in open(). The difference is that urlopen() takes a URL as its argument, and the returned stream does not support seek() (under the hood it reads from a socket, so seeking is impossible), whereas open() takes a local file name and returns a seekable file object.
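A minimal sketch of that difference (example.com and demo.txt are just placeholders):

import urllib.request

# urlopen() takes a URL; the result reads like a file but cannot seek,
# because the data actually comes from a socket
response = urllib.request.urlopen('http://example.com/')
print(response.read(100))     # read the first 100 bytes
print(response.seekable())    # False

# open() takes a local file name, and the resulting stream is seekable
with open('demo.txt', 'w') as f:
    f.write('hello')
with open('demo.txt') as f:
    f.seek(2)                 # works on a real file
    print(f.read())           # 'llo'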
Python's BeautifulSoup module helps you parse HTML and XML documents.
To write a web crawler, you first fetch the HTML source of the target page and then analyze it to extract the content you want.
For pages with simple content, a little matching with the regular-expression module re is usually enough for this analysis.
For HTML with complicated content, however, you will find that re is difficult or even impossible to apply.
Using the BeautifulSoup module to analyze the HTML source instead makes the process much simpler and greatly improves the efficiency of the analysis.
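For example, extracting every image address from a snippet of HTML looks like this with each approach (a sketch; the markup is invented for illustration):

import re
import bs4

html = '<div><img src="a.gif" alt="one"><img src="b.gif" alt="two"></div>'

# with re: fine for simple, regular markup, but fragile as soon as
# attribute order, quoting, or nesting changes
print(re.findall(r'<img src="([^"]+)"', html))        # ['a.gif', 'b.gif']

# with BeautifulSoup: the document is actually parsed, so the query
# does not depend on the exact formatting
soup = bs4.BeautifulSoup(html)
print([img['src'] for img in soup.findAll('img')])    # ['a.gif', 'b.gif']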
Note: BeautifulSoup is a third-party library; I use bs4. In Python 3, urllib2 has been merged into urllib.request. The note from the official documentation reads as follows.
Note: The urllib2 module has been split across several modules in Python 3 named urllib.request and urllib.error.
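In practice this means that Python 2 code such as urllib2.urlopen(url) is written as follows in Python 3, with failures reported through urllib.error (a minimal sketch using the article's URL):

import urllib.request
import urllib.error

try:
    html = urllib.request.urlopen('http://baozoumanhua.com/gif/year1').read()
except urllib.error.HTTPError as e:   # the server answered with an error status
    print('HTTP error:', e.code)
except urllib.error.URLError as e:    # the request never reached the server
    print('URL error:', e.reason)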
The crawler source code is as follows:
# -*- coding: utf-8 -*-
import urllib.request
import bs4
import os

page_sum = 1  # set the number of pages to download

path = os.getcwd()
path = os.path.join(path, 'runaway GIF')
if not os.path.exists(path):
    os.mkdir(path)  # create the download folder

url = "http://baozoumanhua.com/gif/year"  # url address
headers = {  # disguise as a browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36'
}

for count in range(page_sum):
    req = urllib.request.Request(url=url + str(count + 1), headers=headers)
    print(req.full_url)
    content = urllib.request.urlopen(req).read()

    soup = bs4.BeautifulSoup(content)  # parse the page with BeautifulSoup
    img_content = soup.findAll('img', attrs={'style': 'width:460px'})

    url_list = [img['src'] for img in img_content]    # image urls (list comprehension)
    title_list = [img['alt'] for img in img_content]  # image names

    for i in range(len(url_list)):
        imgurl = url_list[i]
        filename = path + os.sep + title_list[i] + ".gif"
        print(filename + ": " + imgurl)               # print the download information
        urllib.request.urlretrieve(imgurl, filename)  # download the image
You can change the number of pages to download by editing the page_sum variable near the top of the script. Save the file as baozougif.py and run it with python baozougif.py; a folder named "runaway GIF" is created in the same directory, and all the images are downloaded into it automatically.
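One possible hardening step, if some images fail to download (dead links, network hiccups), is to wrap the retrieval so a single failure does not abort the whole run; a sketch of that idea, using a hypothetical helper:

import urllib.request
import urllib.error

def download_gif(imgurl, filename):
    # hypothetical helper: skip a broken image instead of crashing the loop
    try:
        urllib.request.urlretrieve(imgurl, filename)
    except (urllib.error.URLError, OSError) as e:
        print('download failed:', imgurl, e)

urlretrieve raises urllib.error exceptions for network problems (and OSError for filesystem ones), so catching them per image keeps the crawler running through the rest of the page.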