This article explains example code for a Python 3 crawler that downloads GIF images from a comic site, using the urllib.request and BeautifulSoup modules; readers who need it can refer to it.
The crawler introduced here downloads funny GIF images from the Baozou comic site so they can be viewed offline. It was developed with Python 3.3 and relies mainly on the urllib.request and BeautifulSoup modules.
The urllib module provides a high-level interface for fetching data from the World Wide Web. Opening a URL with urlopen() is much like opening a file with Python's built-in open(), with two differences: urlopen() receives a URL rather than a local file name, and the file-like object it returns cannot seek() (under the hood it reads from a socket, which does not support seeking).
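For example (this sketch uses a data: URL so it runs without network access; the call looks exactly the same for http:// URLs):

```python
import urllib.request

# urlopen() takes a URL, not a local file name, and returns a
# file-like object you can read() from; for HTTP responses the
# underlying socket cannot be seek()'d.
resp = urllib.request.urlopen("data:,hello")
data = resp.read()
print(data)  # b'hello'
```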
Python's BeautifulSoup module helps you parse HTML and XML documents.
Generally speaking, writing a web crawler means fetching a page's HTML source and then parsing it to extract the content you want. For simple pages, matching with the ordinary regular-expression module re is basically good enough.
For more demanding work, where the HTML to parse is complex, the re module quickly proves inadequate, or at least very hard to get right.
Using the BeautifulSoup module to parse the HTML source instead makes the job remarkably simple and greatly improves efficiency.
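To illustrate the trade-off, here is a minimal regex-based extraction over a made-up HTML fragment (the tag and file names are hypothetical, not from the article). It works on this simple input, but it silently breaks if the attribute order changes or the markup nests, which is exactly where BeautifulSoup earns its keep:

```python
import re

html = '<img style="width:460px" src="/img/funny1.gif" alt="funny1">'

# A regex match is fine for a simple, regular page like this one,
# but it assumes src comes before alt and that quoting is uniform.
m = re.search(r'<img[^>]*src="([^"]+)"[^>]*alt="([^"]+)"', html)
print(m.group(1))  # /img/funny1.gif
print(m.group(2))  # funny1
```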
Note: BeautifulSoup is a third-party library; I am using bs4. In Python 3, urllib2 was merged into urllib.request, and the official documentation puts it this way:
Note: the urllib2 module has been split across several modules in Python 3 named urllib.request and urllib.error.
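This renaming matters when porting Python 2 crawler code. A quick sketch of where the old urllib2 names now live (the mapping is from the standard library, the comments are mine):

```python
import urllib.request
import urllib.error

# Python 2:            Python 3:
# urllib2.urlopen   -> urllib.request.urlopen
# urllib2.Request   -> urllib.request.Request
# urllib2.URLError  -> urllib.error.URLError
# urllib2.HTTPError -> urllib.error.HTTPError

# The names are plain attributes of the new modules:
print(callable(urllib.request.urlopen))  # True
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
```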
The crawler source code is as follows:
```python
# -*- coding: utf-8 -*-
import os
import urllib.request

import bs4

page_sum = 1  # number of pages to download

path = os.getcwd()
path = os.path.join(path, 'burst walk gif')
if not os.path.exists(path):
    os.mkdir(path)  # create the download folder

url = "http://baozoumanhua.com/gif/year"  # base URL
headers = {  # spoof a browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36'
}

for count in range(page_sum):
    req = urllib.request.Request(url=url + str(count + 1), headers=headers)
    print(req.full_url)
    content = urllib.request.urlopen(req).read()
    soup = bs4.BeautifulSoup(content, 'html.parser')
    img_content = soup.findAll('img', attrs={'style': 'width:460px'})
    url_list = [img['src'] for img in img_content]    # image URLs (list comprehension)
    title_list = [img['alt'] for img in img_content]  # image names
    for i in range(len(url_list)):
        imgurl = url_list[i]
        filename = path + os.sep + title_list[i] + ".gif"
        print(filename + ":" + imgurl)  # print download info
        urllib.request.urlretrieve(imgurl, filename)  # download the image
```
To change how many pages are downloaded, modify the page_sum variable near the top of the script. Save the file as baozougif.py and run it with the command python baozougif.py. After it finishes, a GIF folder is created in the same directory and all the images are downloaded into it automatically.
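One caveat: the script stops with a traceback if any single image fails to download. A minimal sketch of a more forgiving variant (the download helper name is mine, not from the article); the demo uses a data: URL so it runs without network access:

```python
import os
import tempfile
import urllib.error
import urllib.request

def download(imgurl, filename):
    """Download one image, skipping over failures instead of crashing."""
    try:
        urllib.request.urlretrieve(imgurl, filename)
        return True
    except (urllib.error.URLError, OSError) as e:
        print("failed: %s (%s)" % (imgurl, e))
        return False

# Demo: a data: URL carrying the GIF magic bytes stands in for a real image URL.
dest = os.path.join(tempfile.gettempdir(), "demo.gif")
ok = download("data:,GIF89a", dest)
print(ok)  # True
```

In the crawler's inner loop you would call download(imgurl, filename) in place of the bare urllib.request.urlretrieve call.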