Started learning Python today and wrote a crawler. First, I found an example via Baidu:
 1  import urllib.request, re, sys, os
 2  def get_bing_backphoto():
 3      if not os.path.exists('photos'):
 4          os.mkdir('photos')
 5      for i in range(0, 30):
 6          url = 'http://cn.bing.com/HPImageArchive.aspx?format=js&idx=' + str(i) + '&n=1&nc=1361089515117&FORM=HYLH1'
 7          html = urllib.request.urlopen(url).read()
 8          if html == b'null':  # read() returns bytes
 9              print('open & read bing error!')
10              sys.exit(-1)
11          html = html.decode('utf-8')
12          reg = re.compile('"url":"(.*?)","urlbase"', re.S)
13          text = re.findall(reg, html)
14          # e.g. http://s.cn.bing.net/az/hprichbg/rb/LongJi_ZH-CN8658435963_1366x768.jpg
15          for imgurl in text:
16              right = imgurl.rindex('/')
17              name = imgurl.replace(imgurl[:right+1], '')
18              savepath = 'photos/' + name
19              urllib.request.urlretrieve(imgurl, savepath)
20              print(name + ' save success!')
21  get_bing_backphoto()
Note that in line 1, Python 3 needs import urllib.request, while the 2.x version is import urllib.
Similarly, in line 7 the 3.x version needs to write urllib.request.urlopen(), while the 2.x version is urllib.urlopen();
otherwise it will raise AttributeError: 'module' object has no attribute 'request'.
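The version difference can also be smoothed over with a guarded import. This snippet is my own sketch, not part of the original script:

```python
# Sketch: import urlopen/urlretrieve so the same code runs on Python 2 and 3.
try:
    from urllib.request import urlopen, urlretrieve   # Python 3
except ImportError:
    from urllib import urlopen, urlretrieve           # Python 2

# Both names are now usable regardless of interpreter version.
print(callable(urlopen), callable(urlretrieve))
```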
Running it, I successfully grabbed 17 pictures into the photos folder under the script's directory.
Below is a detailed analysis of the script's execution steps:
1. The overall structure has three parts: first import the toolkit, then define the grab function, and finally call that function.
2. Lines 3-4 check whether a photos folder exists under the current path; if not, they create one.
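As a side note, on Python 3.2+ the exists-check and the creation can be done in one call. This is my own sketch, using a throwaway temp directory so it can run anywhere:

```python
import os
import tempfile

# Demo of the folder setup from lines 3-4; os.makedirs(exist_ok=True)
# combines the exists-check and mkdir into a single call.
base = tempfile.mkdtemp()               # throwaway directory for the demo
photos = os.path.join(base, 'photos')
os.makedirs(photos, exist_ok=True)      # creates the folder
os.makedirs(photos, exist_ok=True)      # calling again is also safe
print(os.path.isdir(photos))
```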
3. Line 5 starts a loop that fetches 30 pages in a row; lines 6-14 are the loop body, which pulls out all the picture addresses on the current page.
Line 12 is the key. To see exactly what it means, paste http://cn.bing.com/HPImageArchive.aspx?format=js&idx=1&n=1&nc=1361089515117&FORM=HYLH1 into a browser and look at the response:
{"Images": [{"StartDate": "20150908", "fullstartdate": "201509081600", "EndDate": "20150909","url": "Http://s.cn.bing.net/az/hprichbg/rb/CoalTitVideo_ZH-CN7865623960_1920x1080.jpg", "URLBase" :"/az/hprichbg/rb/coaltitvideo_zh-cn7865623960 "," Copyright ":" Great Britain, Scotland, the coal-and-the-saggy-dwelling on the lichen-covered branches (©nhpa/superstock) "," Copyrightlink ":" Http://www.bing.com/search?q=%E7%85%A4%E5%B1%B1%E9%9B%80&form=hpcapt&mkt=zh-cn "," WP ": false," HSH ":" E0422C438EB020D7102D69B4405D1CC3 "," DRK ": 1," Top ": 1," bot ": 1," HS ": [{" desc ":" Of course I am sweet small and painful little public who, "link": "Http://www.bing.com/images/search?q=%E7%85%A4%E5%B1%B1%E9%9B%80&form=hphot1&mkt=zh-cn "," Query ":" Why else would anyone call me Belle? " "," Locx ":" Locy ": 35},{" desc ":" Pseet~pseet~ today's breakfast is delicious, "" link ":" Http://www.bing.com/search?q=%E7%85%A4%E5%B1%B1 %e9%9b%80+%e5%8f%ab%e5%a3%b0&form=hphot2&mkt=zh-cn "," Query ":" Hey, hey, you're not coming near, Tsee. See See see! "," Locx ": Notoginseng," locy ": 37},{" desc ":" Since it is a small public who, "," link ":" Http://www.bing.com/images/search?q=%E9%B8%9F%E7%AC%BC&form=hphot3&mkt=zh-cn "," Query ": Of course, live in a fairytale house. "," Locx ": Locy," msg ": [{" title ":" Today's picture Story "," link ":" Http://www.bing.com/search?q=%E7%85%A4%E5%B1%B1%E9%9B%80&form=pgbar1&mkt=ZH-CN "," text ":" Coal Saggy "}]}]," ToolTips ": {" Loading ":" Loading ... "," previous ":" Previous Page "," Next ":" Next Page "," Walle ":" This picture cannot be downloaded as a wallpaper. "," Walls ":" Download today's beauty map. Use as desktop wallpaper only. "}}
From this it is easy to guess that the regex takes the part between "url":" and ","urlbase".
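To see the regex at work without hitting the network, here is a small self-contained demo; the sample string is a fragment I abbreviated from the response above:

```python
import re

# Run the regex from line 12 against a trimmed-down sample of the JSON.
sample = ('{"images":[{"url":"http://s.cn.bing.net/az/hprichbg/rb/'
          'CoalTitVideo_ZH-CN7865623960_1920x1080.jpg","urlbase":'
          '"/az/hprichbg/rb/coaltitvideo_zh-cn7865623960"}]}')
reg = re.compile('"url":"(.*?)","urlbase"', re.S)  # non-greedy capture
text = re.findall(reg, sample)
print(text)
# → ['http://s.cn.bing.net/az/hprichbg/rb/CoalTitVideo_ZH-CN7865623960_1920x1080.jpg']
```

Since the response is JSON, a more robust alternative would be to parse it with the json module and read the "url" field directly, but the regex works for this page.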
4. Line 15 starts an inner for loop that takes out every picture the current page contains (a page can sometimes contain more than one picture, or none at all).
Lines 16-17 cut the trailing xxx.jpg out of the URL and use it as the file name for saving.
Line 19 actually downloads and saves the picture.
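As a quick check of the name-cutting logic, here is a standalone sketch using the sample URL from the comment in line 14; os.path.basename is a simpler way to do the same thing:

```python
import os

# Demo of lines 16-17: cutting the file name off the image URL.
imgurl = 'http://s.cn.bing.net/az/hprichbg/rb/LongJi_ZH-CN8658435963_1366x768.jpg'
right = imgurl.rindex('/')                    # index of the last '/'
name = imgurl.replace(imgurl[:right+1], '')   # drop everything up to it
print(name)                                   # LongJi_ZH-CN8658435963_1366x768.jpg

# os.path.basename does the same job in one call:
print(os.path.basename(imgurl))               # LongJi_ZH-CN8658435963_1366x768.jpg
```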
Python really is concise and easy to read; the only part that is a bit confusing is the regular expression, which just takes practice.
The first crawler script executed successfully!
Python crawler Learning