Disclaimer: the following code was tested on Python 3.6 and runs as written.
First, the general approach
Different image sites use different anti-crawler mechanisms, so the method has to be adapted to each specific site:
1. Browse the site in a browser and work out how the URL changes with category and page number.
2. Write a small Python test script that fetches the page content and extracts the image addresses.
3. Write a second test script that downloads one picture; if it saves successfully, the full crawler is feasible.
Second, Douban beauties (difficulty: ★)
1. Website: https://www.dbmeinv.com/dbgroup/show.htm
After clicking around in a browser, you get a parameterized address with category and page number: "https://www.dbmeinv.com/dbgroup/show.htm?cid=%s&pager_offset=%s" % (cid, index)
(where cid selects the category: 2-chest, 3-leg, 4-face, 5-miscellaneous, 6-hips, 7-socks; index is the page number)
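Filling in the template above is plain %-formatting; the cid and index values here are just example choices:

```python
# URL template from the analysis above; cid=2, index=1 are example values
douban_url = "https://www.dbmeinv.com/dbgroup/show.htm?cid=%s&pager_offset=%s"
cid, index = 2, 1
print(douban_url % (cid, index))  # → https://www.dbmeinv.com/dbgroup/show.htm?cid=2&pager_offset=1
```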
2. Check the page content from Python; below is test_url.py:
from urllib import request
import re
from bs4 import BeautifulSoup


def get_html(url):
    req = request.Request(url)
    return request.urlopen(req).read()


if __name__ == '__main__':
    url = "https://www.dbmeinv.com/dbgroup/show.htm?cid=2&pager_offset=2"
    html = get_html(url)
    data = BeautifulSoup(html, "lxml")
    print(data)
    r = r'(https://\S+\.jpg)'  # \S (non-whitespace), not \s
    p = re.compile(r)
    get_list = re.findall(p, str(data))
    print(get_list)
The page is requested via urllib.request.Request(url), BeautifulSoup parses the returned bytes, and re.findall() matches the image addresses.
The final print(get_list) prints the list of image addresses.
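Note that the pattern must use \S (non-whitespace); with a lowercase \s the regex matches nothing, since no whitespace follows "https://" in the markup. A small self-contained check on fabricated HTML (the sample URLs are made up):

```python
import re

# Same pattern shape as in test_url.py; \S stops at the closing quote
pattern = re.compile(r'(https://\S+\.jpg)')
sample = '<img src="https://img.example.com/a1.jpg"> <img src="https://img.example.com/b2.jpg">'
print(pattern.findall(sample))  # → ['https://img.example.com/a1.jpg', 'https://img.example.com/b2.jpg']
```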
3. Download one picture from Python; below is test_down.py:
from urllib import request


def get_image(url):
    req = request.Request(url)
    get_img = request.urlopen(req).read()
    with open('e:/python_doc/images/downtest/001.jpg', 'wb') as fp:
        fp.write(get_img)
    print("Download success!")


if __name__ == '__main__':
    url = "https://ww2.sinaimg.cn/bmiddle/0060lm7Tgy1fn1cmtxkrcj30dw09a0u3.jpg"
    get_image(url)
The picture is fetched with urllib.request.Request(image_url) and written to a local file; if a picture appears under the path, the whole crawler is achievable.
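test_down.py hardcodes the file name 001.jpg, and the full crawler later uses url[-11:]. A sketch of a sturdier alternative, deriving the name from the URL path with only the standard library:

```python
import os
from urllib.parse import urlsplit


def filename_from_url(url):
    # Take the last path segment of the URL as the local file name
    return os.path.basename(urlsplit(url).path)


url = "https://ww2.sinaimg.cn/bmiddle/0060lm7Tgy1fn1cmtxkrcj30dw09a0u3.jpg"
print(filename_from_url(url))  # → 0060lm7Tgy1fn1cmtxkrcj30dw09a0u3.jpg
```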
4. With the analysis above done, write the full crawler, douban_spider.py:
from urllib import request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import os
import time
import re


# Global settings; these could live in a config file, but are kept here for readability
# Root directory for images
pic_path = r'E:\Python_Doc\Images'
# Douban URL template
douban_url = "https://www.dbmeinv.com/dbgroup/show.htm?cid=%s&pager_offset=%s"


# Folder to save into; created if missing (os.mkdir cannot create parent folders)
def setpath(name):
    path = os.path.join(pic_path, name)
    if not os.path.isdir(path):
        os.mkdir(path)
    return path


# Fetch HTML content
def get_html(url):
    req = request.Request(url)
    return request.urlopen(req).read()


# Extract image addresses
def get_imageurl(html):
    data = BeautifulSoup(html, "lxml")
    r = r'(https://\S+\.jpg)'
    p = re.compile(r)
    return re.findall(p, str(data))


# Save one picture
def save_image(savepath, url):
    content = urlopen(url).read()
    # url[-11:] keeps the last 11 characters of the URL as the file name
    with open(savepath + '/' + url[-11:], 'wb') as code:
        code.write(content)


def do_task(savepath, cid, index):
    url = douban_url % (cid, index)
    html = get_html(url)
    image_list = get_imageurl(html)
    # This check rarely fires in practice; the program is usually stopped manually
    if not image_list:
        print(u'All pages have been crawled.')
        return
    # Progress output, useful for monitoring the run
    print("=============================================================================")
    print(u'Start crawling cid=%s page %s' % (cid, index))
    for image in image_list:
        save_image(savepath, image)
    # Crawl the next page
    do_task(savepath, cid, index + 1)


if __name__ == '__main__':
    # Folder name
    filename = "DouBan3"
    filepath = setpath(filename)

    # 2-chest 3-leg 4-face 5-miscellaneous 6-hips 7-socks
    for i in range(2, 8):
        do_task(filepath, i, 1)
Run the program and open the folder: the pictures have been written to disk!
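One caveat: do_task calls itself for each next page, so a long-running crawl can hit Python's default recursion limit (about 1000 frames). A loop-based variant, sketched with injected fetch/save callables (hypothetical stand-ins for get_imageurl and save_image) so it runs without the network:

```python
def crawl_pages(fetch_page, save_image, start_index=1):
    # Loop page by page until a page yields no image URLs
    index = start_index
    saved = 0
    while True:
        image_list = fetch_page(index)
        if not image_list:
            break
        for image in image_list:
            save_image(image)
        saved += len(image_list)
        index += 1
    return saved


# Demo with a fake fetcher: two pages of images, then an empty page
pages = {1: ["a.jpg", "b.jpg"], 2: ["c.jpg"]}
print(crawl_pages(lambda i: pages.get(i, []), lambda img: None))  # → 3
```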
5. Analysis: Douban pictures can be downloaded with a fairly simple crawler. The site's only apparent control is that it must not be called too frequently, so Douban is not suited to multi-threaded crawling.
Douban also has another address, https://www.dbmeinv.com/dbgroup/current.htm, which interested readers can explore on their own.
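Since the site only tolerates infrequent calls, the simplest protection is a fixed delay before each request plus a basic retry. A sketch; the fetch callable, delay, and retry count are all placeholder choices:

```python
import time


def polite_fetch(url, fetch, delay=1.0, retries=3):
    # Sleep before every attempt; OSError covers urllib's network failures
    for attempt in range(retries):
        time.sleep(delay)
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise


# Demo with a fake fetcher (delay=0 only so the demo is instant)
print(polite_fetch("http://example.com/x.jpg", lambda u: b"bytes", delay=0))  # → b'bytes'
```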
Third, MM131 (difficulty: ★★)
1. Website: http://www.mm131.com
To be completed ...
Fourth, Jandan (difficulty: ★★)
1. Website: http://jandan.net/ooxx
To be completed ...
Fifth, Yesky pictures (difficulty: ★★)
1. Website: http://pic.yesky.com
To be completed ...
Sixth, summary and additions
1. There are three ways to get the content of a web page:
urllib.request.Request plus urllib.request.urlopen
-- fast and easy to use, but cannot get content rendered by JavaScript
requests with custom headers
-- fast, can disguise itself as a browser, but still cannot get JavaScript-rendered content
headless Chrome
-- slow, equivalent to real browser access, and can get the page content after JavaScript runs
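For the first two methods, the "disguise" amounts to sending browser-like headers. Even plain urllib accepts a headers dict when building the Request; the User-Agent string below is only illustrative:

```python
from urllib import request

# Browser-like headers; the User-Agent value is an illustrative example
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
req = request.Request("https://www.dbmeinv.com/dbgroup/show.htm", headers=headers)
print(req.get_header("User-agent"))  # → Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

With the requests library the equivalent call is requests.get(url, headers=headers).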
Python crawler download Beautiful pictures (various methods)