Today I wanted to download a batch of images. Doing it by hand would have been tedious, so I wrote a small program, and ran into quite a few problems along the way.
The most important one: some web pages return 403 Forbidden, which goes away once header information is added. Recording it here.
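The 403 fix boils down to sending a browser-like User-Agent. A minimal sketch in modern Python 3, where urllib2 became urllib.request (the URL below is a placeholder of my own; no network request is actually made here):

```python
import urllib.request

# An opener that always sends a browser-like User-Agent; many servers
# answer bare urllib requests with 403 Forbidden otherwise.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]

# The same idea expressed with a Request object:
req = urllib.request.Request('http://example.com/page.html',
                             headers={'User-Agent': 'Mozilla/5.0'})
print(req.get_header('User-agent'))  # → Mozilla/5.0
```

Either form works; the opener variant is convenient when every request in the program should carry the same headers.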
The program uses regular expressions and urllib web programming, among other things. I hadn't used them in a long time, so this was a good review.
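As a quick refresher on the regex side, here is a sketch of pulling image addresses out of an HTML fragment with re.findall; the sample HTML and the pattern are my own illustrations, not taken from the original page:

```python
import re

# A sample core block, standing in for the page's msgfont <div>
html = ('<div class="msgfont">'
        '<img src="http://example.com/a.jpg">'
        '<img src="http://example.com/b.jpg">'
        '</div>')

# The non-greedy group captures just the address inside src="..."
regimg = r'<img src="(.*?)"'
imglist = re.findall(regimg, html)
print(imglist)  # → ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```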
Code
# -*- coding: utf-8 -*-
import re
import urllib
import urllib2

def getpage(url):
    '''Download the page's HTML and extract the first-floor core block'''
    opener = urllib2.build_opener()
    # Error 403 and garbled characters are returned if no header information is added
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    htmlall = opener.open(url).read()
    reg1floor = '<div class="msgfont">(.*?)</div>'
    html = re.search(reg1floor, htmlall)
    html = html.group()
    # The file storage encoding and the editor encoding are both UTF-8, so decode
    # once; otherwise garbled characters appear, though the result is not affected
    return html.decode('utf-8')

def getimg(url):
    '''Extract the image addresses from the core code, then download, save and name them'''
    regimg = ''  # the original pattern was lost; an <img src="...">-style regex goes here
    dir = 'f:\\my_document\\Desktop\\temp\\'
    pagehtml = getpage(url)
    # Find all image addresses
    imglist = re.findall(regimg, pagehtml)
    # print imglist
    for index in xrange(1, len(imglist) + 1):
        filename = dir + str(index) + '.jpg'
        urllib.urlretrieve(imglist[index - 1], filename)
        print filename + ' OK!'

if __name__ == '__main__':
    getimg('http://response')  # placeholder; the real URL was lost
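For reference, the save-and-name loop in getimg translates to Python 3 roughly as follows: urllib.urlretrieve becomes urllib.request.urlretrieve, and enumerate replaces the xrange indexing. The function names and directory here are my own illustrations:

```python
import os
import urllib.request

def numbered_names(count, dest_dir):
    '''Build destination filenames 1.jpg .. count.jpg under dest_dir.'''
    return [os.path.join(dest_dir, '%d.jpg' % i) for i in range(1, count + 1)]

def save_all(imglist, dest_dir):
    '''Download each image in imglist to a numbered .jpg file in dest_dir.'''
    for imgurl, filename in zip(imglist, numbered_names(len(imglist), dest_dir)):
        urllib.request.urlretrieve(imgurl, filename)  # fetch and write to disk
        print(filename + ' OK!')
```

Keeping the name generation in its own small function makes it easy to check the numbering logic without touching the network.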