80 lines of Python code to automatically crawl pornographic website pictures

Last Update:2014-11-05 Source: Internet

Author: User

Crawling the pictures of a pornographic website with Python is a small toy project for the technically minded homebody. Here I walk through the full steps of downloading such a site's images in Python; along the way you will see how simple Python is, and how bored a tech geek can get.

First you need the URL of a pornographic website. I will of course not give you one; find it yourself! I will only describe the pattern:

http://www.*****.com/htm/piclist"1"/"2".htm

This is the picture area of one such site. Its URLs differ only in the two places marked "1" and "2". By observation, "1" is the picture category (stockings and legs, pure beauty, * * * *, and so on) and "2" is the page number of the directory listing within that category. So we can first build the list of directory URLs that represent the picture lists:

    baseurl = 'http://www.*****.com/htm/piclist%s/%s.htm'
    urls = [baseurl % (x, y) for x in xrange(1, 10) for y in xrange(1, 10)]

This covers the first 9 pages of each image category numbered 1 to 9 on the site.
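As a quick check of what the comprehension produces, here is a minimal Python 3 sketch (`range` stands in for Python 2's `xrange`). The host `example.invalid` and the name `BASE_URL` are placeholders chosen for illustration, since the article masks the real site.

```python
# Placeholder host (assumption); the article deliberately masks the real site.
BASE_URL = 'http://example.invalid/htm/piclist%s/%s.htm'

# Categories 1-9 (outer loop) combined with pages 1-9 (inner loop).
urls = [BASE_URL % (category, page)
        for category in range(1, 10)
        for page in range(1, 10)]

print(len(urls))   # 81
print(urls[0])     # http://example.invalid/htm/piclist1/1.htm
print(urls[-1])    # http://example.invalid/htm/piclist9/9.htm
```

The outer loop is the category, so the list walks every page of category 1 before moving on to category 2.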

Next, analyze the structure of each directory page:

As you can see, each directory entry is an <a> tag nested inside an <li>, so we can build a parser for this HTML page:

    import sgmllib

    # Site root prepended to the relative links. It is not defined in the
    # source listing; this masked value is an assumption.
    WEBSITE = 'http://www.*****.com'

    class ImgIdxPageParser(sgmllib.SGMLParser):
        '''Parse a image index page'''
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.imgurllist = []
            self.taglist = []      # stack of currently open <li> tags
            self.ontitle = False   # True while inside <title>

        def unknown_starttag(self, tag, attrs):
            if tag == 'li':
                self.taglist.append('li')
            elif tag == 'a' and len(self.taglist):
                # Only links inside an <li> are directory entries.
                self.imgurllist.append(WEBSITE + attrs[0][1])
            elif tag == 'title':
                self.ontitle = True

        def unknown_endtag(self, tag):
            if tag == 'li':
                self.taglist.pop()
            elif tag == 'title':
                self.ontitle = False

        def handle_data(self, data):
            if self.ontitle:
                title = data.decode('utf-8').encode('gb2312').strip()
                # Keep the part of the title before the first underscore.
                self.title = title[:title.find('_')]

Because the parser is built on sgmllib.SGMLParser, we must import sgmllib first. In ImgIdxPageParser, title stores the category of the current page and imgurllist stores the address of each directory entry.
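sgmllib exists only in Python 2; it was removed in Python 3. For readers on Python 3, the same idea can be sketched with the standard library's `html.parser` instead. This is a swapped-in technique, not the article's code: the class name `IndexPageParser`, the placeholder `WEBSITE` value, and the sample HTML are all assumptions made for illustration. It looks up `href` by name rather than by attribute position, and counts `<li>` nesting instead of keeping a tag stack.

```python
from html.parser import HTMLParser

WEBSITE = 'http://example.invalid'  # placeholder site root (assumption)


class IndexPageParser(HTMLParser):
    """Collect links found inside <li> elements, plus the page title."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ''
        self._li_depth = 0
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self._li_depth += 1
        elif tag == 'a' and self._li_depth > 0:
            href = dict(attrs).get('href')
            if href:
                self.links.append(WEBSITE + href)
        elif tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'li' and self._li_depth > 0:
            self._li_depth -= 1
        elif tag == 'title':
            self._in_title = False
            # Keep only the text before the first underscore.
            full = ''.join(self._title_parts).strip()
            self.title = full.split('_', 1)[0]

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


sample = '''<html><head><title>Landscapes_Example Site</title></head>
<body><a href="/about.htm">about</a>
<ul><li><a href="/htm/pic1/101.htm">one</a></li>
<li><a href="/htm/pic1/102.htm">two</a></li></ul></body></html>'''

parser = IndexPageParser()
parser.feed(sample)
parser.close()
print(parser.title)   # Landscapes
print(parser.links)
```

The link outside any `<li>` (`/about.htm`) is skipped, which mirrors the check on the tag list in the parser above.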

With this parser, we write a function that returns the directory list and the category name for a given directory page:

    import urllib2

    def getpageurllist(page):
        '''Get title and all the pages list of a picindex page'''
        parser = ImgIdxPageParser()
        parser.feed(urllib2.urlopen(page).read())
        return parser.imgurllist, parser.title

This function returns all the directory entries on the page together with the category the page belongs to (stockings and legs, pure beauty, * * *, and so on).

Then we open one of the picture pages and analyze it:

As expected, each image link sits in the src attribute, so we build a second parser, which extracts the address of every image on a picture page:

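    class ImagePageParser(sgmllib.SGMLParser):
        '''Parse a image page's image URLs'''
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.imgurllist = []

        def unknown_starttag(self, tag, attrs):
            # sgmllib lowercases tag and attribute names before calling this.
            if tag == 'meta' and attrs[0][1] == 'description':
                self.title = attrs[1][1].decode('utf-8').encode('gb2312').strip()
            elif tag == 'img':
                for key, value in attrs:
                    # The source listing is cut off after the for line; this
                    # body is reconstructed from the surrounding text, which
                    # says the image link is in src.
                    if key == 'src':
                        self.imgurllist.append(value)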

In ImagePageParser, title stores the title of the current page and imgurllist stores the address of each image. We also need a function that uses ImagePageParser and returns all the image addresses together with the title of those images:

    def getimageurllist(pageurl):
        '''Get all image URLs from a page'''
        parser = ImagePageParser()
        parser.feed(urllib2.urlopen(pageurl).read())
        return parser.imgurllist, parser.title

The getimageurllist function returns the addresses of all the images on the page at pageurl, along with the page title.
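A Python 3 sketch of the same second parser, again using `html.parser` in place of sgmllib. The names `ImageSrcParser` and `get_image_urls` and the sample HTML are hypothetical. To keep it runnable offline, the function takes an HTML string rather than fetching a URL, and it reads attributes by name, where the article's version relies on their position.

```python
from html.parser import HTMLParser


class ImageSrcParser(HTMLParser):
    """Collect every <img src=...> and the <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.image_urls = []
        self.title = ''

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta' and attrs.get('name') == 'description':
            self.title = (attrs.get('content') or '').strip()
        elif tag == 'img' and attrs.get('src'):
            self.image_urls.append(attrs['src'])


def get_image_urls(html):
    """Return (image URLs, page title) for an HTML string."""
    parser = ImageSrcParser()
    parser.feed(html)
    parser.close()
    return parser.image_urls, parser.title


sample = '''<html><head>
<meta name="description" content=" Sample gallery "></head>
<body><img src="http://example.invalid/img/a.jpg">
<img alt="no source">
<img src="http://example.invalid/img/b.jpg" /></body></html>'''

image_urls, title = get_image_urls(sample)
print(title)
print(image_urls)
```

An `<img>` without a `src` attribute is ignored, so the returned list only contains addresses that can actually be downloaded.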

Since all the addresses come back as a list, we need a function that downloads every image in a list in one go:

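    import os
    import urllib

    def downloadimage(imglist, save=''):
        '''Download images of a list'''
        # Create the target folder if it does not exist yet.
        if not (os.path.exists(save) and os.path.isdir(save)):
            try:
                os.makedirs(save)
            except Exception, e:
                pass
        for x in imglist:
            try:
                # File name = last path segment of the URL; os.path.join
                # keeps an empty save path pointing at the current directory.
                filename = os.path.join(save, x[x.rfind('/') + 1:])
                print filename
                urllib.urlretrieve(x, filename)
            except Exception, e:
                print 'exception caught: ', e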

The downloadimage function takes a list of image addresses and a save path (optional; when omitted, files go to the current directory). The leading if makes sure the save path exists, creating the folder first when it does not. The function then walks the list of image addresses and downloads each one with urllib.urlretrieve(). The try-except inside the for loop keeps a single failed download (network disconnection, file I/O error, or any unknown error) from crashing the whole program.
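For Python 3, a download helper along the same lines might look like the sketch below. `urllib.request.urlretrieve` is the Python 3 home of `urlretrieve`; the names `filename_for` and `download_images` and the injectable `fetch` parameter are assumptions added here so the loop can be exercised without a network connection.

```python
import os
from urllib.request import urlretrieve


def filename_for(url, save=''):
    """Build the local path for a URL: <save>/<last path segment>."""
    return os.path.join(save, url[url.rfind('/') + 1:])


def download_images(img_list, save='', fetch=urlretrieve):
    """Download each URL in img_list into save; return the paths that succeeded."""
    if save:
        # exist_ok avoids an error when the folder is already there.
        os.makedirs(save, exist_ok=True)
    saved = []
    for url in img_list:
        filename = filename_for(url, save)
        try:
            fetch(url, filename)
            saved.append(filename)
        except Exception as exc:
            # One bad image must not stop the remaining downloads.
            print('exception caught:', exc)
    return saved


# The file name is simply the last path segment of the URL.
print(filename_for('http://example.invalid/img/a.jpg'))
```

Returning the list of saved paths is a small addition over the article's function; it lets the caller see which downloads failed without parsing the printed output.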

With the preparation done, we can start downloading; all that remains is to put together the building blocks we have just written:

    if __name__ == '__main__':
        basepath = 'g:\\__temp\\'
        baseurl = 'http://www.*****.com/htm/piclist%s/%s.htm'
        urls = [baseurl % (x, y) for x in xrange(1, 10) for y in xrange(1, 10)]
        for url in urls:
            try:
                pagelist, outtitle = getpageurllist(url)
                print outtitle, pagelist
                for page in pagelist:
                    imglist, intitle = getimageurllist(page)
                    # Save under <basepath>\<category>\<page title>.
                    downloadimage(imglist, save=basepath + outtitle + '\\' + intitle)
            except Exception, e:
                print 'exception caught: ', e

This completes the program: it now downloads the site's pictures automatically and, because we extracted the titles, saves them in a directory hierarchy that mirrors the structure of the site.

Of course, much is left to handle. For example, to improve efficiency we should add multithreading, and possible download errors deserve more complete handling. For that, remember the saying: do it yourself, and you will be well fed and clothed!
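One possible starting point for the multithreading mentioned above is a thread pool from Python 3's `concurrent.futures`. This is only a sketch under assumptions: `download_all` is a hypothetical helper, `download` is any callable taking `(url, path)` (for example `urllib.request.urlretrieve`), and the stand-in `fake_download` replaces real network access so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(jobs, download, max_workers=4):
    """Run download(url, path) for every (url, path) job on a thread pool.

    Returns a list of (url, error) pairs, where error is None on success.
    """
    def run(job):
        url, path = job
        try:
            download(url, path)
            return url, None
        except Exception as exc:
            # Report the failure instead of letting it kill the worker.
            return url, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as jobs.
        return list(pool.map(run, jobs))


def fake_download(url, path):
    """Stand-in for a real downloader; fails for one URL on purpose."""
    if url.endswith('broken.jpg'):
        raise IOError('simulated network error')


jobs = [('http://example.invalid/a.jpg', 'a.jpg'),
        ('http://example.invalid/broken.jpg', 'broken.jpg')]
for url, error in download_all(jobs, fake_download):
    print(url, 'ok' if error is None else 'failed: %s' % error)
```

Threads suit this workload because the time goes into waiting on the network rather than into computation; collecting each error alongside its URL also gives a basis for the more complete error handling the article calls for, such as retrying only the failed items.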

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
