Question No. 0013: Use Python to write a program that crawls images, and download the pictures of the Japanese girl from this link :-)
Heh... never mind the girl; it is late at night, so let's crawl something to eat instead. Food album: sip, lick, twist~
There are many ways to write a simple crawler.
This attempt uses urllib.request.
Read the source code of the image page, use re.compile to build a pattern that finds the required img tags and produces a list of image URLs, and finally use urllib.request.urlretrieve to download the pictures to the local disk.
Code:
import os
import re
import urllib.request


def pic_collector(url):
    # Fetch the raw page source (bytes)
    content = urllib.request.urlopen(url).read()
    # The pattern depends on the page source: capture the src value after height="..."
    r = re.compile(r'height="\d+" src="(.*?)"')
    pic_list = r.findall(content.decode('utf-8'))

    # Create the target folder and make it the working directory
    os.mkdir('pic_collection')
    os.chdir(os.path.join(os.getcwd(), 'pic_collection'))

    for i in range(len(pic_list)):
        pic_num = str(i) + '.jpg'
        urllib.request.urlretrieve(pic_list[i], pic_num)
        print("success!" + pic_list[i])


pic_collector("http://tieba.baidu.com/p/4341640851")
Note:
1. The pattern passed to re.compile() is determined by the source code of the web page. For example, for the page I picked, I used Chrome to view the source and found the tag of an image I wanted to download. Its full content is as follows (taking one picture as an example):
<img class="BDE_Image" pic_type="1" width="450" height="450" src="http://imgsrc.baidu.com/forum/w%3d580/sign=a6080fca870a19d8cb03840d03fb82c9/2683ea039245d688be88e4dfa3c27d1ed31b2445.jpg" size="259380">
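That is, in the picture tag the image URL is the value of src="...", which comes right after the height attribute, so that is the part the regular expression captures.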
2. The content passed to r.findall() must first go through decode('utf-8'): urlopen().read() returns bytes, and decoding turns the UTF-8 page source into text the pattern can match.
3. os.mkdir(name) creates a new folder; os.chdir(path) changes the working directory to that folder; os.getcwd() returns the path of the current working directory (a string).
4. urllib.request.urlretrieve(pic, pic_name) saves the image into the above path under the given file name.
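Notes 1 and 2 can be tried out without touching the network. The following is only a minimal sketch: it assumes the tag from note 1 and a pattern of the form height="\d+" src="(.*?)", and the names sample and pattern are made up for illustration. A different page would need a different pattern.

```python
import re

# Bytes, as urlopen(...).read() would return them; the tag is the one from note 1.
sample = (
    b'<img class="BDE_Image" pic_type="1" width="450" height="450" '
    b'src="http://imgsrc.baidu.com/forum/w%3d580/sign=a6080fca870a19d8cb03840d03fb82c9/'
    b'2683ea039245d688be88e4dfa3c27d1ed31b2445.jpg" size="259380">'
)

# Assumed pattern: capture the src value that follows the height attribute.
# The non-greedy (.*?) stops at the first closing quote.
pattern = re.compile(r'height="\d+" src="(.*?)"')

# A str pattern cannot be applied to bytes, so decode first.
pic_list = pattern.findall(sample.decode('utf-8'))
print(pic_list)
```

With a greedy (.*) the match could run on to the last quote on the line, which is why the sketch uses (.*?).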
The saved files are as follows:
From now on, when I come across photos of pretty boys, I no longer have to right-click endlessly to save them. My heart is very comforted _(:3」∠)_
Oh, and what if you want to download the pictures from page xx to page xx of a thread on Tieba, Duitang, or a similar site?
For example, for the image thread above, the URLs look like this:
http://tieba.baidu.com/p/4341640851?pn=1  # page 1
http://tieba.baidu.com/p/4341640851?pn=2  # page 2
http://tieba.baidu.com/p/4341640851?pn=3  # page 3
http://tieba.baidu.com/p/4341640851?pn=4  # page 4
...
http://tieba.baidu.com/p/4341640851?pn=n  # page n
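The URL pattern above can be generated with a few lines of Python. This is a small sketch; page_urls is a hypothetical helper name, not something from the original code.

```python
def page_urls(base_url, first, last):
    """Return the URLs for pages first..last (inclusive) of one thread."""
    return [base_url + "?pn=" + str(page) for page in range(first, last + 1)]


# Print the URLs of pages 1 to 3 of the thread used in this post
for u in page_urls("http://tieba.baidu.com/p/4341640851", 1, 3):
    print(u)
```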
Then change the code:
import os
import re
import urllib.request


def fetch_pictures(url, m, n):
    # Create the folder if it does not exist yet, then work inside it
    os.makedirs('pic_collection', exist_ok=True)
    os.chdir(os.path.join(os.getcwd(), 'pic_collection'))
    temp = 1  # record number of pictures, used as the file name
    for x in range(n - m + 1):
        # key! append ?pn=<page> to walk through pages m to n
        html_content = urllib.request.urlopen(url + "?pn=" + str(m + x)).read()
        r = re.compile(r'height="\d+" src="(.*?)"')
        picture_url_list = r.findall(html_content.decode('utf-8'))
        print(picture_url_list)
        for i in range(len(picture_url_list)):
            picture_name = str(temp) + '.jpg'
            urllib.request.urlretrieve(picture_url_list[i], picture_name)
            print("success!" + picture_url_list[i])
            temp += 1


fetch_pictures("http://tieba.baidu.com/p/4341640851", 1, 3)
This downloads the pictures on pages 1 through 3. To download the pictures of the entire thread, check how many pages it has and change the arguments accordingly.
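One side effect of using os.chdir is that the working directory changes for the whole process. As a possible alternative, here is a minimal sketch that passes a full path to urllib.request.urlretrieve instead. The helper name save_all and its start parameter are assumptions for illustration, not part of the original code.

```python
import os
import urllib.request


def save_all(url_list, folder, start=1):
    """Download every URL in url_list into folder as <start>.jpg, <start+1>.jpg, ...

    Returns the list of file paths that were written.
    """
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    saved = []
    for number, pic_url in enumerate(url_list, start):
        target = os.path.join(folder, str(number) + '.jpg')
        # Full target path, so the working directory stays untouched
        urllib.request.urlretrieve(pic_url, target)
        saved.append(target)
    return saved
```

For instance, a call such as save_all(picture_url_list, 'pic_collection', start=temp) could stand in for the inner download loop, with temp then advanced by the number of paths returned.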
"Python Mini Practice" 0013