Started learning Python today and wrote a crawler. First, I found an example via Baidu:
 1  import urllib.request, re, sys, os
 2  def get_bing_backphoto():
 3      if not os.path.exists('photos'):
 4          os.mkdir('photos')
 5      for i in range(0, 30):
 6          url = 'http://cn.bing.com/HPImageArchive.aspx?format=js&idx=' + str(i) + '&n=1&nc=1361089515117&FORM=HYLH1'
 7          html = urllib.request.urlopen(url).read()
 8          if html == b'null':  # read() returns bytes
 9              print('open & read bing error!')
10              sys.exit(-1)
11          html = html.decode('utf-8')
12          reg = re.compile('"url":"(.*?)","urlbase"', re.S)
13          text = re.findall(reg, html)
14          # e.g. http://s.cn.bing.net/az/hprichbg/rb/LongJi_ZH-CN8658435963_1366x768.jpg
15          for imgurl in text:
16              right = imgurl.rindex('/')
17              name = imgurl.replace(imgurl[:right+1], '')
18              savepath = 'photos/' + name
19              urllib.request.urlretrieve(imgurl, savepath)
20              print(name + ' save success!')
21  get_bing_backphoto()
Note that in line 1, Python 3 needs import urllib.request, while the 2.x version is import urllib.
Similarly, in line 7 the 3.x version needs to write urllib.request.urlopen(), while the 2.x version is urllib.urlopen();
otherwise it will raise AttributeError: 'module' object has no attribute 'request'.
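The version difference can also be smoothed over with a guarded import. This snippet is my own sketch, not part of the original script:

```python
# Sketch: import urlopen/urlretrieve so the same code runs on Python 2 and 3.
try:
    from urllib.request import urlopen, urlretrieve   # Python 3
except ImportError:
    from urllib import urlopen, urlretrieve           # Python 2

# Both names are now usable regardless of interpreter version.
print(callable(urlopen), callable(urlretrieve))
```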
Running it, I successfully grabbed 17 pictures into the photos folder under the script's directory.
Below is a detailed analysis of the script's execution steps:
1. The overall structure has three parts: first import the toolkit, then define the grab function, and finally call that function.
2. Lines 3-4 check whether a photos folder exists under the current path; if not, they create one.
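As a side note, on Python 3.2+ the exists-check and the creation can be done in one call. This is my own sketch, using a throwaway temp directory so it can run anywhere:

```python
import os
import tempfile

# Demo of the folder setup from lines 3-4; os.makedirs(exist_ok=True)
# combines the exists-check and mkdir into a single call.
base = tempfile.mkdtemp()               # throwaway directory for the demo
photos = os.path.join(base, 'photos')
os.makedirs(photos, exist_ok=True)      # creates the folder
os.makedirs(photos, exist_ok=True)      # calling again is also safe
print(os.path.isdir(photos))
```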
3. Line 5 starts a loop that fetches 30 pages in a row; lines 6-14 are the loop body, which pulls out all the picture addresses on the current page.
Line 12 is the key. To see exactly what it means, paste http://cn.bing.com/HPImageArchive.aspx?format=js&idx=1&n=1&nc=1361089515117&FORM=HYLH1 into a browser and look at the response:
{"Images": [{"StartDate": "20150908", "fullstartdate": "201509081600", "EndDate": "20150909","url": "Http://s.cn.bing.net/az/hprichbg/rb/CoalTitVideo_ZH-CN7865623960_1920x1080.jpg", "URLBase" :"/az/hprichbg/rb/coaltitvideo_zh-cn7865623960 "," Copyright ":" Great Britain, Scotland, the coal-and-the-saggy-dwelling on the lichen-covered branches (©nhpa/superstock) "," Copyrightlink ":" Http://www.bing.com/search?q=%E7%85%A4%E5%B1%B1%E9%9B%80&form=hpcapt&mkt=zh-cn "," WP ": false," HSH ":" E0422C438EB020D7102D69B4405D1CC3 "," DRK ": 1," Top ": 1," bot ": 1," HS ": [{" desc ":" Of course I am sweet small and painful little public who, "link": "Http://www.bing.com/images/search?q=%E7%85%A4%E5%B1%B1%E9%9B%80&form=hphot1&mkt=zh-cn "," Query ":" Why else would anyone call me Belle? " "," Locx ":" Locy ": 35},{" desc ":" Pseet~pseet~ today's breakfast is delicious, "" link ":" Http://www.bing.com/search?q=%E7%85%A4%E5%B1%B1 %e9%9b%80+%e5%8f%ab%e5%a3%b0&form=hphot2&mkt=zh-cn "," Query ":" Hey, hey, you're not coming near, Tsee. See See see! "," Locx ": Notoginseng," locy ": 37},{" desc ":" Since it is a small public who, "," link ":" Http://www.bing.com/images/search?q=%E9%B8%9F%E7%AC%BC&form=hphot3&mkt=zh-cn "," Query ": Of course, live in a fairytale house. "," Locx ": Locy," msg ": [{" title ":" Today's picture Story "," link ":" Http://www.bing.com/search?q=%E7%85%A4%E5%B1%B1%E9%9B%80&form=pgbar1&mkt=ZH-CN "," text ":" Coal Saggy "}]}]," ToolTips ": {" Loading ":" Loading ... "," previous ":" Previous Page "," Next ":" Next Page "," Walle ":" This picture cannot be downloaded as a wallpaper. "," Walls ":" Download today's beauty map. Use as desktop wallpaper only. "}}
From this it is easy to guess that the regex takes the part between "url":" and ","urlbase".
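To see the regex at work without hitting the network, here is a small self-contained demo; the sample string is a fragment I abbreviated from the response above:

```python
import re

# Run the regex from line 12 against a trimmed-down sample of the JSON.
sample = ('{"images":[{"url":"http://s.cn.bing.net/az/hprichbg/rb/'
          'CoalTitVideo_ZH-CN7865623960_1920x1080.jpg","urlbase":'
          '"/az/hprichbg/rb/coaltitvideo_zh-cn7865623960"}]}')
reg = re.compile('"url":"(.*?)","urlbase"', re.S)  # non-greedy capture
text = re.findall(reg, sample)
print(text)
# → ['http://s.cn.bing.net/az/hprichbg/rb/CoalTitVideo_ZH-CN7865623960_1920x1080.jpg']
```

Since the response is JSON, a more robust alternative would be to parse it with the json module and read the "url" field directly, but the regex works for this page.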
4. Line 15 starts an inner for loop that takes out every picture the current page contains (a page can sometimes contain more than one picture, or none at all).
Lines 16-17 cut the trailing xxx.jpg out of the URL and use it as the file name for saving.
Line 19 actually downloads and saves the picture.
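As a quick check of the name-cutting logic, here is a standalone sketch using the sample URL from the comment in line 14; os.path.basename is a simpler way to do the same thing:

```python
import os

# Demo of lines 16-17: cutting the file name off the image URL.
imgurl = 'http://s.cn.bing.net/az/hprichbg/rb/LongJi_ZH-CN8658435963_1366x768.jpg'
right = imgurl.rindex('/')                    # index of the last '/'
name = imgurl.replace(imgurl[:right+1], '')   # drop everything up to it
print(name)                                   # LongJi_ZH-CN8658435963_1366x768.jpg

# os.path.basename does the same job in one call:
print(os.path.basename(imgurl))               # LongJi_ZH-CN8658435963_1366x768.jpg
```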
Python really is concise and easy to read; the only part that is a bit confusing is the regular expression, which just takes practice.
The first crawler script executed successfully!
Python crawler Learning