Disclaimer: the following code was tested on Python 3.6 and runs as written.
First, the general approach
Different image sites use different anti-crawler mechanisms, so the method has to be adapted to each specific site:
1. Browse the site in a browser and work out how the URL changes with category and page number.
2. Write a small Python test script that fetches the page content and extracts the image addresses.
3. Write a second test script that downloads one picture; if it saves successfully, the full crawler is feasible.
Second, Douban beauties (difficulty: ★)
1. Website: https://www.dbmeinv.com/dbgroup/show.htm
After clicking around in a browser, you get a parameterized address with category and page number: "https://www.dbmeinv.com/dbgroup/show.htm?cid=%s&pager_offset=%s" % (cid, index)
(where cid selects the category: 2-chest, 3-leg, 4-face, 5-miscellaneous, 6-hips, 7-socks; index is the page number)
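Filling in the template above is plain %-formatting; the cid and index values here are just example choices:

```python
# URL template from the analysis above; cid=2, index=1 are example values
douban_url = "https://www.dbmeinv.com/dbgroup/show.htm?cid=%s&pager_offset=%s"
cid, index = 2, 1
print(douban_url % (cid, index))  # → https://www.dbmeinv.com/dbgroup/show.htm?cid=2&pager_offset=1
```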
2. Check the page content from Python; below is test_url.py:
from urllib import request
import re
from bs4 import BeautifulSoup


def get_html(url):
    req = request.Request(url)
    return request.urlopen(req).read()


if __name__ == '__main__':
    url = "https://www.dbmeinv.com/dbgroup/show.htm?cid=2&pager_offset=2"
    html = get_html(url)
    data = BeautifulSoup(html, "lxml")
    print(data)
    r = r'(https://\S+\.jpg)'  # \S (non-whitespace), not \s
    p = re.compile(r)
    get_list = re.findall(p, str(data))
    print(get_list)
The page is requested via urllib.request.Request(url), BeautifulSoup parses the returned bytes, and re.findall() matches the image addresses.
The final print(get_list) prints the list of image addresses.
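Note that the pattern must use \S (non-whitespace); with a lowercase \s the regex matches nothing, since no whitespace follows "https://" in the markup. A small self-contained check on fabricated HTML (the sample URLs are made up):

```python
import re

# Same pattern shape as in test_url.py; \S stops at the closing quote
pattern = re.compile(r'(https://\S+\.jpg)')
sample = '<img src="https://img.example.com/a1.jpg"> <img src="https://img.example.com/b2.jpg">'
print(pattern.findall(sample))  # → ['https://img.example.com/a1.jpg', 'https://img.example.com/b2.jpg']
```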
3. Download one picture from Python; below is test_down.py:
from urllib import request


def get_image(url):
    req = request.Request(url)
    get_img = request.urlopen(req).read()
    with open('e:/python_doc/images/downtest/001.jpg', 'wb') as fp:
        fp.write(get_img)
    print("Download success!")


if __name__ == '__main__':
    url = "https://ww2.sinaimg.cn/bmiddle/0060lm7Tgy1fn1cmtxkrcj30dw09a0u3.jpg"
    get_image(url)
The picture is fetched with urllib.request.Request(image_url) and written to a local file; if a picture appears under the path, the whole crawler is achievable.
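test_down.py hardcodes the file name 001.jpg, and the full crawler later uses url[-11:]. A sketch of a sturdier alternative, deriving the name from the URL path with only the standard library:

```python
import os
from urllib.parse import urlsplit


def filename_from_url(url):
    # Take the last path segment of the URL as the local file name
    return os.path.basename(urlsplit(url).path)


url = "https://ww2.sinaimg.cn/bmiddle/0060lm7Tgy1fn1cmtxkrcj30dw09a0u3.jpg"
print(filename_from_url(url))  # → 0060lm7Tgy1fn1cmtxkrcj30dw09a0u3.jpg
```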
4. With the analysis above done, write the full crawler, douban_spider.py:
from urllib import request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import os
import time
import re


# Global settings; these could live in a config file, but are kept here for readability
# Root directory for images
pic_path = r'E:\Python_Doc\Images'
# Douban URL template
douban_url = "https://www.dbmeinv.com/dbgroup/show.htm?cid=%s&pager_offset=%s"


# Folder to save into; created if missing (os.mkdir cannot create parent folders)
def setpath(name):
    path = os.path.join(pic_path, name)
    if not os.path.isdir(path):
        os.mkdir(path)
    return path


# Fetch HTML content
def get_html(url):
    req = request.Request(url)
    return request.urlopen(req).read()


# Extract image addresses
def get_imageurl(html):
    data = BeautifulSoup(html, "lxml")
    r = r'(https://\S+\.jpg)'
    p = re.compile(r)
    return re.findall(p, str(data))


# Save one picture
def save_image(savepath, url):
    content = urlopen(url).read()
    # url[-11:] keeps the last 11 characters of the URL as the file name
    with open(savepath + '/' + url[-11:], 'wb') as code:
        code.write(content)


def do_task(savepath, cid, index):
    url = douban_url % (cid, index)
    html = get_html(url)
    image_list = get_imageurl(html)
    # This check rarely fires in practice; the program is usually stopped manually
    if not image_list:
        print(u'All pages have been crawled.')
        return
    # Progress output, useful for monitoring the run
    print("=============================================================================")
    print(u'Start crawling cid=%s page %s' % (cid, index))
    for image in image_list:
        save_image(savepath, image)
    # Crawl the next page
    do_task(savepath, cid, index + 1)


if __name__ == '__main__':
    # Folder name
    filename = "DouBan3"
    filepath = setpath(filename)

    # 2-chest 3-leg 4-face 5-miscellaneous 6-hips 7-socks
    for i in range(2, 8):
        do_task(filepath, i, 1)
Run the program and open the folder: the pictures have been written to disk!
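One caveat: do_task calls itself for each next page, so a long-running crawl can hit Python's default recursion limit (about 1000 frames). A loop-based variant, sketched with injected fetch/save callables (hypothetical stand-ins for get_imageurl and save_image) so it runs without the network:

```python
def crawl_pages(fetch_page, save_image, start_index=1):
    # Loop page by page until a page yields no image URLs
    index = start_index
    saved = 0
    while True:
        image_list = fetch_page(index)
        if not image_list:
            break
        for image in image_list:
            save_image(image)
        saved += len(image_list)
        index += 1
    return saved


# Demo with a fake fetcher: two pages of images, then an empty page
pages = {1: ["a.jpg", "b.jpg"], 2: ["c.jpg"]}
print(crawl_pages(lambda i: pages.get(i, []), lambda img: None))  # → 3
```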
5. Analysis: Douban pictures can be downloaded with a fairly simple crawler. The site's only apparent control is that it must not be called too frequently, so Douban is not suited to multi-threaded crawling.
Douban also has another address, https://www.dbmeinv.com/dbgroup/current.htm, which interested readers can explore on their own.
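Since the site only tolerates infrequent calls, the simplest protection is a fixed delay before each request plus a basic retry. A sketch; the fetch callable, delay, and retry count are all placeholder choices:

```python
import time


def polite_fetch(url, fetch, delay=1.0, retries=3):
    # Sleep before every attempt; OSError covers urllib's network failures
    for attempt in range(retries):
        time.sleep(delay)
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise


# Demo with a fake fetcher (delay=0 only so the demo is instant)
print(polite_fetch("http://example.com/x.jpg", lambda u: b"bytes", delay=0))  # → b'bytes'
```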
Third, MM131 (difficulty: ★★)
1. Website: http://www.mm131.com
To be completed ...
Fourth, Jandan (difficulty: ★★)
1. Website: http://jandan.net/ooxx
To be completed ...
Fifth, Yesky pictures (difficulty: ★★)
1. Website: http://pic.yesky.com
To be completed ...
Sixth, summary and additions
1. There are three ways to get the content of a web page:
urllib.request.Request plus urllib.request.urlopen
-- fast and easy to use, but cannot get content rendered by JavaScript
requests with custom headers
-- fast, can disguise itself as a browser, but still cannot get JavaScript-rendered content
headless Chrome
-- slow, equivalent to real browser access, and can get the page content after JavaScript runs
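For the first two methods, the "disguise" amounts to sending browser-like headers. Even plain urllib accepts a headers dict when building the Request; the User-Agent string below is only illustrative:

```python
from urllib import request

# Browser-like headers; the User-Agent value is an illustrative example
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
req = request.Request("https://www.dbmeinv.com/dbgroup/show.htm", headers=headers)
print(req.get_header("User-agent"))  # → Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

With the requests library the equivalent call is requests.get(url, headers=headers).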
Python crawler download Beautiful pictures (various methods)