[Crawler] A crawler for downloading graduation photos

Source: Internet
Author: User


No wonder the page loads so slowly, and no wonder the server is under so much strain. What were the editors thinking, attaching the full-size images directly to the page? Each photo is 8 MB+, and my bandwidth is limited. So I simply wrote a crawler to fetch them slowly in the background, and got some practice in along the way. (PS: On Windows, using Thunder to "download all links on the page" didn't work at all, and I have no idea why.)

There are 192 groups of photos in total. The first 20 groups are out of order on the web page; after finishing the crawler I was too lazy to correct that, so it stays as it is.


The code is as follows:

If you have a Python environment, you can save the script and run it yourself. If you're not interested in the code, the photos are on Baidu online storage:

http://pan.baidu.com/s/1hSvH8

(My impression is that running the script from Sublime Text with Python support installed (F5) is no faster than running it directly by filename on Linux; possibly the speed just varies with when you test.)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib2


# Return the web page source code
def getHtml(url):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, headers=headers)
    html = urllib2.urlopen(req)
    srcCode = html.read()
    return srcCode


# srcCode is the source of the page that contains the image links
def getImg(srcCode, startNum, endNum):
    # Regular expression for the image links on the page; findall returns a list
    pattern = re.compile(r'<.*?href="(.*?)" title="no.*?group.jpg">')
    imgSrcHtml = pattern.findall(srcCode)
    print imgSrcHtml
    num = startNum
    for i in imgSrcHtml[startNum - 1:endNum - 1]:
        # Turn the relative link into a complete address
        i = 'http://www.online.sdu.edu.cn' + i
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(i, headers=headers)
        print "Downloading..."
        print i
        print "and saving as %d.jpg\n" % (num + 8)
        # html = urllib2.urlopen(req, timeout=5)  # a timeout could be set here
        html = urllib2.urlopen(req)
        f = open("./pics/%d.jpg" % (num + 8), 'w+b')
        f.write(html.read())
        f.close()
        num += 1
    print 'All tasks completed!'


ImgUrl = "http://www.online.sdu.edu.cn/news/article-17317.html"
print "A total of 184 photos; please enter the start and end numbers in turn"
start = int(raw_input("Start number\n"))
end = int(raw_input("End number\n"))
print "Starting..."
getImg(getHtml(ImgUrl), start, end)
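The core of the crawler is the regular expression that pulls href values out of the page source. A minimal sketch of that extraction step, in Python 3 syntax; note the sample HTML below is an illustrative stand-in, not the real markup of the article page:

```python
import re

# Illustrative sample HTML, standing in for the real article page markup
sample_html = (
    '<a href="/pics/group1.jpg" title="no.1 group.jpg">photo</a>'
    '<a href="/pics/group2.jpg" title="no.2 group.jpg">photo</a>'
)

# Same idea as the crawler's pattern: non-greedy capture of the href
# that is followed by a matching title attribute
pattern = re.compile(r'<a.*?href="(.*?)" title="no.*?group.jpg">')
links = pattern.findall(sample_html)

# Prefix the site root to turn relative links into absolute URLs,
# as the crawler does inside its download loop
base = 'http://www.online.sdu.edu.cn'
full_links = [base + link for link in links]
print(full_links)
```

Non-greedy quantifiers (`.*?`) matter here: a greedy `.*` would swallow everything up to the last quote on the line and capture far too much.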

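As written, the script aborts if any single download hangs or fails. A small retry wrapper is one way to make the loop more robust; this is a sketch in Python 3 syntax, and `fetch` is a placeholder for the real `urlopen` call, faked here so the example is self-contained:

```python
import time

def download_with_retry(fetch, url, retries=3, delay=0.1):
    """Call fetch(url), retrying up to `retries` times on failure."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as e:
            print('attempt %d failed: %s' % (attempt + 1, e))
            time.sleep(delay)
    return None  # give up after all retries fail

# A flaky stand-in for urlopen: fails twice, then succeeds
calls = {'n': 0}
def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('timed out')
    return b'image bytes'

print(download_with_retry(flaky_fetch, 'http://example.com/1.jpg', delay=0))
```

In the real script, each `urllib2.urlopen(req)` call in the download loop could be routed through such a wrapper, with a timeout set on the request.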



Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.
