To crawl each model's pictures, we first need to find her personal page. Looking at the source of the listing page, we can see that the links to the models' pages share the following characteristics:
We can get each model's page address by finding the <a> tags whose class attribute is lady-name and reading their href attributes.
html = urlopen(url)
bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
girls = bs.find_all("a", {"class": "lady-name"})
for item in girls:
    linkurl = item.get('href')
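To see this extraction step in isolation, here is a self-contained sketch run against a made-up snippet that mimics the listing page's structure (the hrefs and names are placeholders, not real data):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real listing page's HTML.
sample = '''
<div>
  <a class="lady-name" href="//mm.taobao.com/self/model_card.htm?user_id=1">Model A</a>
  <a class="lady-name" href="//mm.taobao.com/self/model_card.htm?user_id=2">Model B</a>
  <a class="other" href="//example.com">ignored</a>
</div>
'''

bs = BeautifulSoup(sample, "html.parser")
# Keep only the anchors whose class is lady-name, then read their hrefs.
girls = bs.find_all("a", {"class": "lady-name"})
links = [item.get('href') for item in girls]
print(links)
```

The third anchor is skipped because its class does not match, which is exactly the filtering the crawler relies on.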
Continuing to analyze the models' pages: after opening a model's page, the layout looks like this:
From this page we want to extract the model's personal domain name; opening that domain leads to the model's pictures. So the key question is how to extract this domain name. Based on what we learned earlier, we would simply look for the tag that contains it, but if we open the page source we find that this information is not there: this part of the page is generated dynamically by JavaScript. So what do we do in this situation?
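To see the problem concretely: if we parse only the static source that urlopen returns, find() simply comes back None for the JS-generated block. A minimal sketch, using a made-up snippet standing in for the static source:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the static page source: the domain-info div is NOT here,
# because on the real page it is injected later by JavaScript.
static_source = '<html><body><div class="mm-p-base-info">basic info only</div></body></html>'

bs = BeautifulSoup(static_source, "html.parser")
# Searching the static source for the dynamically generated block fails.
links = bs.find("div", {"class": "mm-p-info mm-p-domain-info"})
print(links)  # None, because the tag only exists after JavaScript has run
```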
The answer is to use Selenium together with PhantomJS; you can look up the details of each yourself. In short, PhantomJS is a headless browser (a browser without a user interface), and Selenium is a tool for automating browsers. Combining the two, we can parse dynamically generated pages.
The code to get the model's personality domain name is as follows:
def geturls(url):
    driver = webdriver.PhantomJS()
    html = urlopen(url)
    bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
    girls = bs.find_all("a", {"class": "lady-name"})
    namewithurl = {}
    for item in girls:
        linkurl = item.get('href')
        driver.get("https:" + linkurl)
        bs1 = BeautifulSoup(driver.page_source, "html.parser")
        links = bs1.find("div", {"class": "mm-p-info mm-p-domain-info"})
        if links is not None:
            links = links.li.span.get_text()
            namewithurl[item.get_text()] = links
            print(links)
    return namewithurl
Here we use PhantomJS to load the dynamic page and then hand the loaded source to BeautifulSoup; from that point on, the work is the same as for an ordinary static page.
Next we analyze the model's personal homepage, which visually looks like this:
Analyzing its source, we find that the addresses of the model's images can be obtained like this:
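The function returns a plain dict mapping each model's name to her personal-domain URL; downstream code just iterates its items. A sketch of the output shape (the names and URLs below are invented placeholders, not real results):

```python
# Hypothetical output shape of geturls() -- names and URLs are placeholders.
namewithurl = {
    "Model A": "//mm.taobao.com/modela",
    "Model B": "//mm.taobao.com/modelb",
}

# Each (name, url) pair is exactly what the download step will receive later.
pairs = list(namewithurl.items())
for name, url in pairs:
    print(name, "https:" + url)
```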
html = urlopen(personurl)
bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
contents = bs.find("div", {"class": "mm-aixiu-content"})
imgs = contents.find_all("img", {"src": re.compile(r'//img\.alicdn\.com/.*\.jpg')})
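The filter above relies on a regular expression matching protocol-relative CDN URLs; in isolation it behaves like this (the sample src values are made up for illustration):

```python
import re

img_pattern = re.compile(r'//img\.alicdn\.com/.*\.jpg')

# Made-up src values illustrating what the filter keeps and drops.
candidates = [
    "//img.alicdn.com/imgextra/abc/photo1.jpg",  # kept: CDN host + .jpg
    "//img.alicdn.com/imgextra/abc/photo2.png",  # dropped: wrong extension
    "//gtms01.alicdn.com/tps/logo.jpg",          # dropped: different host
]
# BeautifulSoup applies the compiled pattern with search(), so we do the same.
kept = [s for s in candidates if img_pattern.search(s)]
print(kept)
```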
This gives us the addresses of the pictures on the model's personal domain; the next question is how to save them.
We can use the urlretrieve function from urllib.request to do the saving.
Its usage is urlretrieve(imgurl, savepath).
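urlretrieve simply streams a URL into a local file. A minimal offline sketch (it uses a file:// URL built from a temporary file so it runs without network; in the crawler the URL would be an https image address):

```python
import os
import tempfile
from urllib.request import urlretrieve, pathname2url

# Create a small local "remote" file so the example needs no network access.
src = os.path.join(tempfile.mkdtemp(), "remote.jpg")
with open(src, "wb") as f:
    f.write(b"fake image bytes")

imgurl = "file:" + pathname2url(os.path.abspath(src))  # stand-in for an https image URL
savepath = os.path.join(tempfile.mkdtemp(), "0.jpg")

# Copy the contents of imgurl into savepath.
urlretrieve(imgurl, savepath)
print(os.path.exists(savepath))
```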
After adding multithreading and a few other pieces, the complete crawler code is:
# coding=utf-8
from urllib.request import urlopen
from urllib.request import urlretrieve
from urllib.error import HTTPError
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool
import sys, os
import re

savepath = r".\save"

def mkdir(path):
    if os.path.exists(path):
        return
    os.mkdir(path)

def geturls(url):
    driver = webdriver.PhantomJS()
    html = urlopen(url)
    bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
    girls = bs.find_all("a", {"class": "lady-name"})
    namewithurl = {}
    for item in girls:
        linkurl = item.get('href')
        driver.get("https:" + linkurl)
        bs1 = BeautifulSoup(driver.page_source, "html.parser")
        links = bs1.find("div", {"class": "mm-p-info mm-p-domain-info"})
        if links is not None:
            links = links.li.span.get_text()
            namewithurl[item.get_text()] = links
            print(links)
    return namewithurl

def getimgs(parms):
    personname = parms[0]
    personurl = "https:" + parms[1]
    html = urlopen(personurl)
    bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
    contents = bs.find("div", {"class": "mm-aixiu-content"})
    imgs = contents.find_all("img", {"src": re.compile(r'//img\.alicdn\.com/.*\.jpg')})
    savefilename = os.path.join(savepath, personname)
    mkdir(savefilename)
    print("img num:", len(imgs))
    cnt = 0
    for img in imgs:
        try:
            urlretrieve(url="https:" + img.get("src"),
                        filename=os.path.join(savefilename, str(cnt) + ".jpg"))
            cnt += 1
        except HTTPError as e:
            continue

if __name__ == "__main__":
    mkdir(savepath)
    pagenum = 10
    for i in range(1, pagenum):
        urls = geturls("https://mm.taobao.com/json/request_top_list.htm" + "?page=" + str(i))
        pool = ThreadPool(4)
        pool.map(getimgs, urls.items())
        pool.close()
        pool.join()
        # for (k, v) in urls.items():
        #     getimgs((k, v))
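The multithreading used above boils down to mapping a worker over the dict's items with a thread pool. Stripped of the network code, the pattern looks like this (the data is a placeholder, not real crawl output):

```python
from multiprocessing.dummy import Pool as ThreadPool

# Placeholder for the {name: url} dict that geturls() returns.
urls = {"Model A": "//a.example", "Model B": "//b.example", "Model C": "//c.example"}

results = []

def getimgs(parms):
    # parms is one (name, url) tuple, the same shape as in the crawler above.
    personname, personurl = parms
    results.append((personname, "https:" + personurl))

pool = ThreadPool(4)             # 4 worker threads
pool.map(getimgs, urls.items())  # one call per model, run concurrently
pool.close()
pool.join()
print(sorted(results))
```

Because multiprocessing.dummy is backed by threads rather than processes, the workers share the results list directly.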
Then open the save folder, and you will find that all the pictures have been downloaded automatically.