Python Crawler Learning

Source: Internet
Author: User
Tags: save file

I started learning Python today and wrote a crawler.

First, I found an example via Baidu to study.

 1  import urllib.request, re, sys, os
 2  def get_bing_backphoto():
 3      if os.path.exists('photos') == False:
 4          os.mkdir('photos')
 5      for i in range(0, 30):
 6          url = 'http://cn.bing.com/HPImageArchive.aspx?format=js&idx=' + str(i) + '&n=1&nc=1361089515117&FORM=HYLH1'
 7          html = urllib.request.urlopen(url).read()
 8          if html == b'null':  # read() returns bytes, so compare against bytes
 9              print('open & read bing error!')
10              sys.exit(-1)
11          html = html.decode('utf-8')
12          reg = re.compile('"url":"(.*?)","urlbase"', re.S)
13          text = re.findall(reg, html)
14          # e.g. http://s.cn.bing.net/az/hprichbg/rb/LongJi_ZH-CN8658435963_1366x768.jpg
15          for imgurl in text:
16              right = imgurl.rindex('/')
17              name = imgurl.replace(imgurl[:right+1], '')
18              savepath = 'photos/' + name
19              urllib.request.urlretrieve(imgurl, savepath)
20              print(name + ' save success!')
21  get_bing_backphoto()

Note that on line 1, Python 3 needs import urllib.request; the 2.x versions use import urllib.

Likewise, on line 7, Python 3.4 writes the call as urllib.request.urlopen(), while 2.x writes urllib.urlopen();

otherwise it will raise AttributeError: 'module' object has no attribute 'request'.
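If the script must run under both interpreter families, a guarded import keeps one code path. A minimal sketch, assuming only urlopen and urlretrieve are needed:

try:
    # Python 3: both functions live in urllib.request
    from urllib.request import urlopen, urlretrieve
except ImportError:
    # Python 2: they live directly in urllib
    from urllib import urlopen, urlretrieve

html = urlopen('http://cn.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1').read()
print(html[:80])  # first bytes of the JSON response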

Running it successfully grabbed 17 pictures into the photos folder under the script directory.

The following is a step-by-step analysis of the script:

1. The overall structure has three parts: first import the modules, then define the grabbing function, and finally call that function.

2. Lines 3-4 check whether a photos folder exists under the current path, and create one if not; a more compact equivalent is sketched below.
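In Python 3.2+, the same check-and-create step collapses into a single standard-library call; a minimal sketch:

import os

# exist_ok=True makes the call a no-op when the folder already exists,
# so it replaces the explicit os.path.exists test on lines 3-4.
os.makedirs('photos', exist_ok=True)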

3. Line 5 writes a loop that requests 30 pages in a row; lines 6-14 form the loop body, which takes all the picture addresses out of the current page.

Line 12 is the key. To understand what it does, enter http://cn.bing.com/HPImageArchive.aspx?format=js&idx=1&n=1&nc=1361089515117&FORM=HYLH1 into a browser and look at the response:

{"Images": [{"StartDate": "20150908", "fullstartdate": "201509081600", "EndDate": "20150909","url": "Http://s.cn.bing.net/az/hprichbg/rb/CoalTitVideo_ZH-CN7865623960_1920x1080.jpg", "URLBase" :"/az/hprichbg/rb/coaltitvideo_zh-cn7865623960 "," Copyright ":" Great Britain, Scotland, the coal-and-the-saggy-dwelling on the lichen-covered branches (©nhpa/superstock) "," Copyrightlink ":" Http://www.bing.com/search?q=%E7%85%A4%E5%B1%B1%E9%9B%80&form=hpcapt&mkt=zh-cn "," WP ": false," HSH ":" E0422C438EB020D7102D69B4405D1CC3 "," DRK ": 1," Top ": 1," bot ": 1," HS ": [{" desc ":" Of course I am sweet small and painful little public who, "link": "Http://www.bing.com/images/search?q=%E7%85%A4%E5%B1%B1%E9%9B%80&form=hphot1&mkt=zh-cn "," Query ":" Why else would anyone call me Belle? " "," Locx ":" Locy ": 35},{" desc ":" Pseet~pseet~ today's breakfast is delicious, "" link ":" Http://www.bing.com/search?q=%E7%85%A4%E5%B1%B1 %e9%9b%80+%e5%8f%ab%e5%a3%b0&form=hphot2&mkt=zh-cn "," Query ":" Hey, hey, you're not coming near, Tsee. See See see! "," Locx ": Notoginseng," locy ": 37},{" desc ":" Since it is a small public who, "," link ":" Http://www.bing.com/images/search?q=%E9%B8%9F%E7%AC%BC&form=hphot3&mkt=zh-cn "," Query ": Of course, live in a fairytale house. "," Locx ": Locy," msg ": [{" title ":" Today's picture Story "," link ":" Http://www.bing.com/search?q=%E7%85%A4%E5%B1%B1%E9%9B%80&form=pgbar1&mkt=ZH-CN "," text ":" Coal Saggy "}]}]," ToolTips ": {" Loading ":" Loading ... "," previous ":" Previous Page "," Next ":" Next Page "," Walle ":" This picture cannot be downloaded as a wallpaper. "," Walls ":" Download today's beauty map. Use as desktop wallpaper only. "}}

From this it is easy to guess what line 12 does: the regular expression takes the part between "url":" and ","urlbase".
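Because the response is JSON, the standard json module is a more robust alternative to the regular expression on line 12. A minimal sketch, assuming the endpoint still returns the structure shown above:

import json
import urllib.request

url = 'http://cn.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1'
data = json.loads(urllib.request.urlopen(url).read().decode('utf-8'))

# Each entry in the "images" array carries the picture URL directly,
# so no pattern matching is needed.
for image in data['images']:
    print(image['url'])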

4. Line 15 writes an inner for loop that takes out every picture the current page contains (a page may contain more than one picture, or none at all);

Lines 16-17 cut the trailing xxx.jpg out of the URL to use as the save file name.

Line 19 actually downloads the picture.
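The slicing on lines 16-17 can also be written with os.path.basename, which returns everything after the last '/'. A minimal sketch, using the sample URL from the comment on line 14 (assuming it is still live and the photos folder exists):

import os.path
import urllib.request

imgurl = 'http://s.cn.bing.net/az/hprichbg/rb/LongJi_ZH-CN8658435963_1366x768.jpg'

# basename is equivalent to the rindex/replace pair on lines 16-17
name = os.path.basename(imgurl)   # 'LongJi_ZH-CN8658435963_1366x768.jpg'
savepath = 'photos/' + name
urllib.request.urlretrieve(imgurl, savepath)  # same as line 19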

Python is quite concise and easy to read; the only part that takes some getting used to is the regular expression, which simply comes with practice.
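To see how the expression on line 12 behaves, here is a minimal sketch that runs it against a fragment of the JSON response shown above:

import re

fragment = ('{"images":[{"url":"http://s.cn.bing.net/az/hprichbg/rb/'
            'CoalTitVideo_ZH-CN7865623960_1920x1080.jpg","urlbase":"/az/..."}]}')

# The non-greedy (.*?) captures the shortest run of text between the
# literal anchors "url":" and ","urlbase".
reg = re.compile('"url":"(.*?)","urlbase"', re.S)
print(re.findall(reg, fragment))
# ['http://s.cn.bing.net/az/hprichbg/rb/CoalTitVideo_ZH-CN7865623960_1920x1080.jpg']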

The first crawler script executed successfully!
