Python web crawler example: crawling Taobao model pictures


In order to crawl the models' pictures, we first need to find each model's own page. Looking at the source of the listing page, we can see that the links to the models' pages have the following characteristic:

We can get each model's page address by looking for the a tags whose class attribute is lady-name and taking their href attribute.

html = urlopen(url)  # url is the listing page address
bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
girls = bs.find_all("a", {"class": "lady-name"})
for item in girls:
    linkurl = item.get('href')

Next we analyze the models' own pages. After opening a model's page, the layout looks like this:

On this page we want to extract the model's personal domain name; opening that domain reveals the model's pictures. So the key question is how to extract this domain name. Based on what we did earlier, we would simply look for the corresponding tag, but if we open the page source we find that it does not contain this information, because this part of the page is generated dynamically with JavaScript. So what do we do in this situation?

The answer is to use Selenium together with PhantomJS (you can look up the details of both yourself). In short, PhantomJS is a browser without a user interface (a headless browser), and Selenium is a tool for driving browsers; combining the two lets us render and then parse dynamically generated pages.
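As an illustration, here is a minimal, self-contained sketch of that pattern. It assumes Selenium 3.x or earlier (where the PhantomJS driver is still supported) and that the phantomjs executable is on your PATH; the URL is only a placeholder. Newer Selenium releases have dropped PhantomJS, in which case a headless Chrome or Firefox driver plays the same role.

from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://example.com/some-dynamic-page"  # placeholder URL for any JS-rendered page

driver = webdriver.PhantomJS()   # headless browser (Selenium 3.x and earlier)
driver.get(url)                  # PhantomJS fetches the page and executes its JavaScript
rendered = driver.page_source    # the HTML after the scripts have run
driver.quit()

bs = BeautifulSoup(rendered, "html.parser")
print(bs.title)                  # the dynamically generated content is now parseable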

The code to get each model's personal domain name is as follows:

def geturls(url):
    driver = webdriver.PhantomJS()
    html = urlopen(url)
    bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
    girls = bs.find_all("a", {"class": "lady-name"})
    namewithurl = {}
    for item in girls:
        linkurl = item.get('href')
        driver.get("https:" + linkurl)  # let PhantomJS render the JS-generated part
        bs1 = BeautifulSoup(driver.page_source, "html.parser")
        links = bs1.find("div", {"class": "mm-p-info mm-p-domain-info"})
        if links is not None:
            links = links.li.span.get_text()
            namewithurl[item.get_text()] = links
            print(links)
    return namewithurl
Here we use PhantomJS to load the dynamic page and BeautifulSoup to parse the rendered source; from that point on, the work is the same as for an ordinary static page.

Next, let's analyze the model's personal homepage, which visually looks like this:

Analyzing its source, we find that the addresses of the model's pictures can be obtained as follows:

html = urlopen(personurl)
bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
contents = bs.find("div", {"class": "mm-aixiu-content"})
imgs = contents.find_all("img", {"src": re.compile(r'//img\.alicdn\.com/.*\.jpg')})

With this we can get the picture addresses from each model's personal domain; the next question is how to save the pictures.

We can use the urlretrieve function from urllib.request to do the saving.

Its usage is urlretrieve(imgurl, savepath).
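As a quick illustration (the image URL and save path below are placeholders, not values from this crawler), saving a single picture might look like this:

from urllib.request import urlretrieve
from urllib.error import HTTPError

imgurl = "https://img.alicdn.com/example/photo.jpg"  # hypothetical image URL
savefile = "./save/photo.jpg"                        # hypothetical target file

try:
    urlretrieve(url=imgurl, filename=savefile)  # downloads the image to savefile
    print("saved", savefile)
except HTTPError:
    print("download failed:", imgurl)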

After adding multithreading and a few other details, the complete crawler code is as follows:

# coding=utf-8
from urllib.request import urlopen
from urllib.request import urlretrieve
from urllib.error import HTTPError
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool
import sys, os
import re

savepath = r".\save"


def mkdir(path):
    # create the directory only if it does not already exist
    if os.path.exists(path):
        return
    os.mkdir(path)


def geturls(url):
    # collect {model name: personal domain URL} from one listing page
    driver = webdriver.PhantomJS()
    html = urlopen(url)
    bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
    girls = bs.find_all("a", {"class": "lady-name"})
    namewithurl = {}
    for item in girls:
        linkurl = item.get('href')
        driver.get("https:" + linkurl)  # let PhantomJS render the JS-generated part
        bs1 = BeautifulSoup(driver.page_source, "html.parser")
        links = bs1.find("div", {"class": "mm-p-info mm-p-domain-info"})
        if links is not None:
            links = links.li.span.get_text()
            namewithurl[item.get_text()] = links
            print(links)
    return namewithurl


def getimgs(parms):
    # download every picture found on one model's personal domain page
    personname = parms[0]
    personurl = "https:" + parms[1]
    html = urlopen(personurl)
    bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
    contents = bs.find("div", {"class": "mm-aixiu-content"})
    imgs = contents.find_all("img", {"src": re.compile(r'//img\.alicdn\.com/.*\.jpg')})
    savefilename = os.path.join(savepath, personname)
    mkdir(savefilename)
    print("img num:", len(imgs))
    cnt = 0
    for img in imgs:
        try:
            urlretrieve(url="https:" + img.get("src"),
                        filename=os.path.join(savefilename, str(cnt) + ".jpg"))
            cnt += 1
        except HTTPError as e:
            continue


if __name__ == "__main__":
    mkdir(savepath)
    pagenum = 10
    for i in range(1, pagenum):
        urls = geturls("https://mm.taobao.com/json/request_top_list.htm" + "?page=" + str(i))
        pool = ThreadPool(4)
        pool.map(getimgs, urls.items())
        pool.close()
        pool.join()
    # for (k, v) in urls.items():
    #     getimgs((k, v))
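The multithreading above comes from multiprocessing.dummy, whose Pool exposes the multiprocessing.Pool API but is backed by threads. As a stripped-down illustration of the pool.map pattern used in __main__ (the worker function and dictionary below are placeholders, not part of the original crawler):

from multiprocessing.dummy import Pool as ThreadPool

def worker(parms):
    # parms is one (name, url) pair produced by dict.items()
    name, url = parms
    print("would download pictures for", name, "from", url)

# hypothetical data standing in for the result of geturls()
urls = {"model-a": "//example.com/a", "model-b": "//example.com/b"}

pool = ThreadPool(4)            # 4 worker threads
pool.map(worker, urls.items())  # each pair is handled by one thread
pool.close()
pool.join()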

Open the save folder afterwards and you will find a pleasant surprise: all of the pictures have been downloaded automatically.
