Using Python to crawl pictures from a pornographic website: a little toy for the technical geek. Here I will show you the full steps of downloading images from such a site in Python, and along the way you will see both the simplicity of Python and the idle boredom of the technical geek.
First you need the URL of such a website. Of course I will not give you one; find it yourself! I will, however, tell you the URL pattern:
http://www.*****.com/htm/piclist"1"/"2".htm
In the picture area of such a site, the URLs differ only in the two places marked "1" and "2". By observation you can find that "1" is the picture category (stockings and legs, pure beauties, ****, and so on) and "2" is the nth page of the directory under the current category. So first we can build the URL addresses of the directory pages that list the pictures:
```python
baseUrl = 'http://www.*****.com/htm/piclist%s/%s.htm'
urls = [baseUrl % (x, y) for x in xrange(1, 10) for y in xrange(1, 10)]
```
This gives us the first 9 directory pages of each of the first 9 image categories on the current site.
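To see what that comprehension produces, here is a minimal, self-contained sketch of the URL-grid construction. The domain is a placeholder, just as it is masked in the article, and `range()` is used instead of `xrange()` so the snippet also runs under Python 3:

```python
# Build the 9x9 grid of directory-page URLs; the domain is a placeholder.
baseUrl = 'http://www.example.com/htm/piclist%s/%s.htm'
urls = [baseUrl % (x, y) for x in range(1, 10) for y in range(1, 10)]

print(len(urls))   # 81 URLs: 9 categories x 9 pages
print(urls[0])     # http://www.example.com/htm/piclist1/1.htm
```

The outer loop variable `x` varies slowest, so the list is grouped by category first, then by page number within each category.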
Next, analyze the structure of each directory page:
As you can see, each directory entry corresponds to an <a> tag nested inside an <li>, so we can build a parser for this kind of HTML page:
```python
import sgmllib

# root of the site, prepended to the relative hrefs found on the page
WEBSITE = 'http://www.*****.com'

class ImgIdxPageParser(sgmllib.SGMLParser):
    '''parse an image index page'''
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.imgurlList = []
        self.tagList = []
        self.onTitle = False

    def unknown_starttag(self, tag, attrs):
        if tag == 'li':
            self.tagList.append('li')
        elif tag == 'a' and len(self.tagList):
            # only collect links nested inside an <li>
            self.imgurlList.append(WEBSITE + attrs[0][1])
        elif tag == 'title':
            self.onTitle = True

    def unknown_endtag(self, tag):
        if tag == 'li':
            self.tagList.pop()
        elif tag == 'title':
            self.onTitle = False

    def handle_data(self, data):
        if self.onTitle:
            title = data.decode('utf-8').encode('gb2312').strip()
            self.title = title[:title.find('_')]
```
Because we use sgmllib.SGMLParser, we must import sgmllib first. In ImgIdxPageParser, title saves the category of the current page, and imgurlList saves the address of each directory entry. For an introduction to sgmllib.SGMLParser, please click here.
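Note that sgmllib was removed in Python 3. As a rough, runnable equivalent of the same idea (collect only <a> links that sit inside an <li>, and grab the page title), here is a sketch using html.parser from the Python 3 standard library; the HTML snippet and the WEBSITE prefix are invented for illustration:

```python
from html.parser import HTMLParser

WEBSITE = 'http://www.example.com'  # placeholder site root

class IndexParser(HTMLParser):
    '''Python 3 sketch of ImgIdxPageParser's logic.'''
    def __init__(self):
        HTMLParser.__init__(self)
        self.imgurlList = []
        self.tagList = []
        self.onTitle = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.tagList.append('li')
        elif tag == 'a' and self.tagList:
            # collect hrefs only for links nested inside an <li>
            for key, value in attrs:
                if key == 'href':
                    self.imgurlList.append(WEBSITE + value)
        elif tag == 'title':
            self.onTitle = True

    def handle_endtag(self, tag):
        if tag == 'li':
            self.tagList.pop()
        elif tag == 'title':
            self.onTitle = False

    def handle_data(self, data):
        if self.onTitle:
            self.title = data.strip()

html = ('<html><head><title>Category_site</title></head><body>'
        '<ul><li><a href="/htm/pic1.htm">p1</a></li>'
        '<li><a href="/htm/pic2.htm">p2</a></li></ul>'
        '<a href="/nav.htm">nav</a></body></html>')

parser = IndexParser()
parser.feed(html)
print(parser.imgurlList)  # the two <li>-nested links, with the site prefix
print(parser.title)       # Category_site
```

The link outside any <li> (`/nav.htm`) is correctly skipped, which is exactly what the tagList stack is for.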
With this parser, we construct a function that returns the category name and the list of directory entries of a given directory page:
```python
import urllib2

def getPageUrlList(page):
    '''get title and all the pages list of a picIndex page'''
    parser = ImgIdxPageParser()
    parser.feed(urllib2.urlopen(page).read())
    return parser.imgurlList, parser.title
```
This function returns all of the directory entries in the page together with the category name (stockings and legs, pure beauties, ****, and so on).
Then we open one of the picture pages and analyze it:
Naturally, the image links are placed in the 'src' attribute, so we build a second parser, one that extracts the address of every image on a picture page:
```python
class ImagePageParser(sgmllib.SGMLParser):
    '''parse an image page's image URLs'''
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.imgurlList = []

    def unknown_starttag(self, tag, attrs):
        if tag == 'meta' and attrs[0][1] == 'description':
            self.title = attrs[1][1].decode('utf-8').encode('gb2312').strip()
        elif tag == 'img':
            for key, value in attrs:
                # collect the image address from the src attribute
                if key == 'src':
                    self.imgurlList.append(value)
```
In ImagePageParser, we save the title of the current page in title and the address of every image in imgurlList. We also need a function that drives ImagePageParser and returns all of the image addresses along with the page's caption:
```python
def getImageUrlList(pageUrl):
    '''get all image urls from a page'''
    parser = ImagePageParser()
    parser.feed(urllib2.urlopen(pageUrl).read())
    return parser.imgurlList, parser.title
```
The getImageUrlList function returns the addresses of all the images in the page pageUrl, together with the title of that page.
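The src-extraction idea can be checked in isolation. Here is a small Python 3 sketch (again using html.parser in place of sgmllib, which Python 3 removed) fed with an invented HTML fragment:

```python
from html.parser import HTMLParser

class ImgSrcParser(HTMLParser):
    '''collect the src attribute of every <img> tag'''
    def __init__(self):
        HTMLParser.__init__(self)
        self.imgurlList = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for key, value in attrs:
                if key == 'src':
                    self.imgurlList.append(value)

p = ImgSrcParser()
p.feed('<div><img src="/a.jpg" alt="x"><img src="/b.jpg"></div>')
print(p.imgurlList)   # ['/a.jpg', '/b.jpg']
```

Iterating over the (key, value) pairs in attrs, rather than indexing a fixed position, keeps the parser working even when the <img> tag carries extra attributes in varying order.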
Since all of the addresses we get come back as lists, we need a function that downloads every image in a list in one go:
```python
import os
import urllib

def downloadImage(imgList, save=''):
    '''download images of a list'''
    if not (os.path.exists(save) and os.path.isdir(save)):
        try:
            os.makedirs(save)
        except Exception, e:
            pass
    for x in imgList:
        try:
            # the part of the URL after the last '/' becomes the file name
            filename = save + '\\' + x[x.rfind('/') + 1:]
            print filename
            urllib.urlretrieve(x, filename)
        except Exception, e:
            print 'exception caught:', e
```
In downloadImage, we receive a list of picture addresses and a save path (optional; if omitted, the current directory is used). The if at the start of the function ensures that when the given save path does not exist, the folder is created first. Then we traverse the list of image addresses and download each image with urllib.urlretrieve(). For more information about urllib, please click here. The try-except inside the for loop exists so that an error while downloading a single image (network disconnection, file I/O error, or any unknown error) does not crash the whole program.
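The file-naming rule is simple enough to check on its own: everything after the last '/' of the image URL becomes the local file name. The URL below is made up, and os.path.join is shown as a portable alternative to the hard-coded '\\' separator:

```python
import os

url = 'http://www.example.com/images/2012/photo_001.jpg'
filename = url[url.rfind('/') + 1:]       # slice off everything up to the last '/'
path = os.path.join('save_dir', filename) # portable across operating systems

print(filename)   # photo_001.jpg
```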
With the preparation finished, we can begin downloading images; the remaining task is to assemble the building blocks we just made:
```python
# assumes the imports and definitions from the snippets above
if __name__ == '__main__':
    basePath = 'g:\\__temp\\'
    baseUrl = 'http://www.*****.com/htm/piclist%s/%s.htm'
    urls = [baseUrl % (x, y) for x in xrange(1, 10) for y in xrange(1, 10)]
    for url in urls:
        try:
            pageList, outTitle = getPageUrlList(url)
            print outTitle, pageList
            for page in pageList:
                imgList, inTitle = getImageUrlList(page)
                downloadImage(imgList, save=basePath + outTitle + '\\' + inTitle)
        except Exception, e:
            print 'exception caught:', e
```
This completes the program: it can now download the site's pictures automatically, and thanks to the handling of the titles, the images are saved in a directory hierarchy that mirrors the structure of the site.
Of course, there is still plenty the program could handle better. For example, to improve efficiency we could add multithreading, and the errors that can occur while downloading pictures deserve more complete handling. For that, remember one saying: do it yourself, and you will have ample food and clothing!
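The article leaves multithreading as homework, but one minimal sketch of the idea uses a thread pool from concurrent.futures (Python 3 standard library). The fetch() function here is a stand-in for the real urllib.urlretrieve call so that the example runs offline, and the URLs are invented:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # placeholder: a real version would download and return the saved filename
    return url[url.rfind('/') + 1:]

urls = ['http://www.example.com/img/%d.jpg' % i for i in range(5)]
with ThreadPoolExecutor(max_workers=4) as pool:
    names = list(pool.map(fetch, urls))   # map() preserves input order

print(names)   # ['0.jpg', '1.jpg', '2.jpg', '3.jpg', '4.jpg']
```

Because downloads are I/O-bound, a handful of worker threads can overlap the network waits and noticeably speed up a crawl like this one.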
80 lines of Python code to automatically crawl pornographic website pictures