Downloading beautiful pictures with a Python crawler (various methods)


Note: all of the code below runs on Python 3.6.


First, introduction to the approach

Different image sites use different anti-crawler mechanisms, so the approach has to be tailored to each site:

1. Browse the site in a browser and work out the pattern behind its address changes

2. Write a small Python test script that fetches the page content and extracts the image addresses

3. Write a second test script that downloads and saves a picture; once that succeeds, the full crawler is feasible

Second, Douban beauties (difficulty: ?)

1. Website: https://www.dbmeinv.com/dbgroup/show.htm

Clicking through the categories and pages in a browser shows that the address follows the pattern "https://www.dbmeinv.com/dbgroup/show.htm?cid=%s&pager_offset=%s" % (cid, index)

(where cid selects the category: 2-chest, 3-legs, 4-face, 5-miscellaneous, 6-hips, 7-socks; and index is the page number)
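For example, the addresses of the first few pages of one category can be generated like this (an illustrative snippet, not part of the original scripts):

base = "https://www.dbmeinv.com/dbgroup/show.htm?cid=%s&pager_offset=%s"
# First three pages of cid=2
urls = [base % (2, page) for page in range(1, 4)]
print(urls)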

2. Inspect the page content from Python; below is test_url.py:

from urllib import request
import re
from bs4 import BeautifulSoup


def get_html(url):
    req = request.Request(url)
    return request.urlopen(req).read()


if __name__ == '__main__':
    url = "https://www.dbmeinv.com/dbgroup/show.htm?cid=2&pager_offset=2"
    html = get_html(url)
    data = BeautifulSoup(html, "lxml")
    print(data)
    r = r'(https://\S+\.jpg)'  # match absolute .jpg addresses
    p = re.compile(r)
    get_list = re.findall(p, str(data))
    print(get_list)

urllib.request.Request(url) requests the site, BeautifulSoup parses the returned bytes, and re.findall() matches the image addresses.

The final print(get_list) prints the list of image addresses.
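As an aside, the image addresses can usually also be pulled straight from the img tags instead of regex-matching the serialized soup. A minimal sketch, assuming the pictures sit in plain src attributes on this page (get_image_urls is my own illustrative helper):

from bs4 import BeautifulSoup


def get_image_urls(html):
    soup = BeautifulSoup(html, "lxml")
    # Collect the src attribute of every img tag that points at a .jpg
    return [img["src"] for img in soup.find_all("img")
            if img.get("src", "").endswith(".jpg")]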

3. Download an image from Python; below is test_down.py:

from urllib import request


def get_image(url):
    req = request.Request(url)
    get_img = request.urlopen(req).read()
    with open('e:/python_doc/images/downtest/001.jpg', 'wb') as fp:
        fp.write(get_img)
        print("Download success!")


if __name__ == '__main__':
    url = "https://ww2.sinaimg.cn/bmiddle/0060lm7Tgy1fn1cmtxkrcj30dw09a0u3.jpg"
    get_image(url)

urllib.request.Request(image_url) fetches the picture, which is then written to a local file. If a picture appears under the target path, the whole crawler is achievable.
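Here the file name 001.jpg is hardcoded; a crawler saving many pictures needs one name per URL. One option (filename_from_url is my own illustrative helper, not from the original) is to take the basename of the URL path:

import os
from urllib.parse import urlsplit


def filename_from_url(url):
    # '.../0060lm7Tgy1fn1cmtxkrcj30dw09a0u3.jpg' -> '0060lm7Tgy1fn1cmtxkrcj30dw09a0u3.jpg'
    return os.path.basename(urlsplit(url).path)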

4. With the analysis above done, write the complete crawler, douban_spider.py:

from urllib import request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import os
import re


# Global settings; these could be moved to a configuration file, but for the
# reader's convenience everything is kept in this one file.
# Root folder for saved images
picpath = r'E:\Python_Doc\Images'
# Douban URL template
douban_url = "https://www.dbmeinv.com/dbgroup/show.htm?cid=%s&pager_offset=%s"


# Folder the pictures are saved in; it is created if missing, but the
# parent folder must already exist (os.mkdir does not create parents)
def setpath(name):
    path = os.path.join(picpath, name)
    if not os.path.isdir(path):
        os.mkdir(path)
    return path


# Get the HTML content
def get_html(url):
    req = request.Request(url)
    return request.urlopen(req).read()


# Get the image addresses
def get_imageurl(html):
    data = BeautifulSoup(html, "lxml")
    r = r'(https://\S+\.jpg)'
    p = re.compile(r)
    return re.findall(p, str(data))


# Save one picture
def save_image(savepath, url):
    content = urlopen(url).read()
    # url[-11:] uses the last 11 characters of the URL as the file name
    with open(savepath + '/' + url[-11:], 'wb') as code:
        code.write(content)


def do_task(savepath, cid, index):
    url = douban_url % (cid, index)
    html = get_html(url)
    image_list = get_imageurl(html)
    # This check rarely fires in practice: the program is usually
    # terminated by hand long before the pictures run out
    if not image_list:
        print('Everything has been crawled.')
        return
    # Real-time progress output; this is essential
    print("=============================================================")
    print('Start crawling cid=%s page %s' % (cid, index))
    for image in image_list:
        save_image(savepath, image)
    # Crawl the next page
    do_task(savepath, cid, index + 1)


if __name__ == '__main__':
    # Folder name
    filename = "DouBan3"
    filepath = setpath(filename)

    # 2-chest 3-legs 4-face 5-miscellaneous 6-hips 7-socks
    for i in range(2, 8):
        do_task(filepath, i, 1)

Run the program and open the folder: the pictures have been written to disk!

5. Analysis: Douban images can be downloaded with a fairly simple crawler. The site's only apparent control is that it cannot be called too frequently, so Douban is not well suited to multi-threaded crawling.
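If call frequency is the concern, a simple throttle between requests helps. Below is a minimal sketch reusing the helpers from douban_spider.py; do_task_throttled and the one-second default delay are my own illustrative choices, not part of the original:

import time


# Hypothetical throttled variant of do_task from douban_spider.py
def do_task_throttled(savepath, cid, index, delay=1.0):
    url = douban_url % (cid, index)
    image_list = get_imageurl(get_html(url))
    if not image_list:
        return
    for image in image_list:
        save_image(savepath, image)
        time.sleep(delay)  # pause between picture downloads
    time.sleep(delay)  # pause before requesting the next page
    do_task_throttled(savepath, cid, index + 1, delay)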

Douban also has another address, https://www.dbmeinv.com/dbgroup/current.htm, which interested readers can explore on their own.

Third, MM131 (difficulty: ??)

1. Website: http://www.mm131.com

To be completed...

Fourth, Jandan (difficulty: ??)

1. Website: http://jandan.net/ooxx

To be completed...

Fifth, Yesky Pictures (difficulty: ??)

1. Website: http://pic.yesky.com

To be completed...

Sixth, summary and additions

1. There are three ways to get the content of a web page (the second and third are sketched after this list):

urllib.request.Request with urllib.request.urlopen

-- fast and simple, but cannot get page content generated by JavaScript

requests with custom headers

-- fast and can disguise itself as a browser, but still cannot get page content generated by JavaScript

Headless Chrome

-- slow, equivalent to real browser access, and can get the page content after JavaScript has executed
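A rough sketch of the last two approaches, assuming the requests and selenium packages (plus a matching chromedriver) are installed; the User-Agent string is only an example:

import requests
from selenium import webdriver

url = "https://www.dbmeinv.com/dbgroup/show.htm?cid=2&pager_offset=1"

# Method 2: requests with a disguised User-Agent header
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
html = requests.get(url, headers=headers).content

# Method 3: headless Chrome executes JavaScript before handing back the page
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)  # older selenium versions use chrome_options=options
driver.get(url)
rendered_html = driver.page_source
driver.quit()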
