Python web crawler example: crawling Taobao model pictures


In order to crawl the models' pictures, we first need to find each model's own page. Looking at the source of the listing page, we can see that the links to the models' pages have the following characteristic:

We can get each model's page address by looking for the a tags whose class attribute is lady-name and taking their href attribute.

html = urlopen(url)  # url is the listing page address
bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
girls = bs.find_all("a", {"class": "lady-name"})
for item in girls:
    linkurl = item.get('href')

Next we analyze the models' own pages. After opening a model's page, the layout looks like this:

On this page we want to extract the model's personal domain name; opening that domain reveals the model's pictures. So the key question is how to extract this domain name. Based on what we did earlier, we would simply look for the corresponding tag, but if we open the page source we find that it does not contain this information, because this part of the page is generated dynamically with JavaScript. So what do we do in this situation?

The answer is to use Selenium together with PhantomJS (you can look up the details of both yourself). In short, PhantomJS is a browser without a user interface (a headless browser), and Selenium is a tool for driving browsers; combining the two lets us render and then parse dynamically generated pages.
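As an illustration, here is a minimal, self-contained sketch of that pattern. It assumes Selenium 3.x or earlier (where the PhantomJS driver is still supported) and that the phantomjs executable is on your PATH; the URL is only a placeholder. Newer Selenium releases have dropped PhantomJS, in which case a headless Chrome or Firefox driver plays the same role.

from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://example.com/some-dynamic-page"  # placeholder URL for any JS-rendered page

driver = webdriver.PhantomJS()   # headless browser (Selenium 3.x and earlier)
driver.get(url)                  # PhantomJS fetches the page and executes its JavaScript
rendered = driver.page_source    # the HTML after the scripts have run
driver.quit()

bs = BeautifulSoup(rendered, "html.parser")
print(bs.title)                  # the dynamically generated content is now parseable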

The code to get each model's personal domain name is as follows:

def geturls(url):
    driver = webdriver.PhantomJS()
    html = urlopen(url)
    bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
    girls = bs.find_all("a", {"class": "lady-name"})
    namewithurl = {}
    for item in girls:
        linkurl = item.get('href')
        driver.get("https:" + linkurl)  # let PhantomJS render the JS-generated part
        bs1 = BeautifulSoup(driver.page_source, "html.parser")
        links = bs1.find("div", {"class": "mm-p-info mm-p-domain-info"})
        if links is not None:
            links = links.li.span.get_text()
            namewithurl[item.get_text()] = links
            print(links)
    return namewithurl
Here we use PhantomJS to load the dynamic page and BeautifulSoup to parse the rendered source; from that point on, the work is the same as for an ordinary static page.

Next, let's analyze the model's personal homepage, which visually looks like this:

Analyzing its source, we find that the addresses of the model's pictures can be obtained as follows:

html = urlopen(personurl)
bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
contents = bs.find("div", {"class": "mm-aixiu-content"})
imgs = contents.find_all("img", {"src": re.compile(r'//img\.alicdn\.com/.*\.jpg')})

With this we can get the picture addresses from each model's personal domain; the next question is how to save the pictures.

We can use the urlretrieve function from urllib.request to do the saving.

Its usage is urlretrieve(imgurl, savepath).
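As a quick illustration (the image URL and save path below are placeholders, not values from this crawler), saving a single picture might look like this:

from urllib.request import urlretrieve
from urllib.error import HTTPError

imgurl = "https://img.alicdn.com/example/photo.jpg"  # hypothetical image URL
savefile = "./save/photo.jpg"                        # hypothetical target file

try:
    urlretrieve(url=imgurl, filename=savefile)  # downloads the image to savefile
    print("saved", savefile)
except HTTPError:
    print("download failed:", imgurl)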

After adding multithreading and a few other details, the complete crawler code is as follows:

# coding=utf-8
from urllib.request import urlopen
from urllib.request import urlretrieve
from urllib.error import HTTPError
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool as ThreadPool
import sys, os
import re

savepath = r".\save"


def mkdir(path):
    # create the directory only if it does not already exist
    if os.path.exists(path):
        return
    os.mkdir(path)


def geturls(url):
    # collect {model name: personal domain URL} from one listing page
    driver = webdriver.PhantomJS()
    html = urlopen(url)
    bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
    girls = bs.find_all("a", {"class": "lady-name"})
    namewithurl = {}
    for item in girls:
        linkurl = item.get('href')
        driver.get("https:" + linkurl)  # let PhantomJS render the JS-generated part
        bs1 = BeautifulSoup(driver.page_source, "html.parser")
        links = bs1.find("div", {"class": "mm-p-info mm-p-domain-info"})
        if links is not None:
            links = links.li.span.get_text()
            namewithurl[item.get_text()] = links
            print(links)
    return namewithurl


def getimgs(parms):
    # download every picture found on one model's personal domain page
    personname = parms[0]
    personurl = "https:" + parms[1]
    html = urlopen(personurl)
    bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
    contents = bs.find("div", {"class": "mm-aixiu-content"})
    imgs = contents.find_all("img", {"src": re.compile(r'//img\.alicdn\.com/.*\.jpg')})
    savefilename = os.path.join(savepath, personname)
    mkdir(savefilename)
    print("img num:", len(imgs))
    cnt = 0
    for img in imgs:
        try:
            urlretrieve(url="https:" + img.get("src"),
                        filename=os.path.join(savefilename, str(cnt) + ".jpg"))
            cnt += 1
        except HTTPError as e:
            continue


if __name__ == "__main__":
    mkdir(savepath)
    pagenum = 10
    for i in range(1, pagenum):
        urls = geturls("https://mm.taobao.com/json/request_top_list.htm" + "?page=" + str(i))
        pool = ThreadPool(4)
        pool.map(getimgs, urls.items())
        pool.close()
        pool.join()
    # for (k, v) in urls.items():
    #     getimgs((k, v))
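The multithreading above comes from multiprocessing.dummy, whose Pool exposes the multiprocessing.Pool API but is backed by threads. As a stripped-down illustration of the pool.map pattern used in __main__ (the worker function and dictionary below are placeholders, not part of the original crawler):

from multiprocessing.dummy import Pool as ThreadPool

def worker(parms):
    # parms is one (name, url) pair produced by dict.items()
    name, url = parms
    print("would download pictures for", name, "from", url)

# hypothetical data standing in for the result of geturls()
urls = {"model-a": "//example.com/a", "model-b": "//example.com/b"}

pool = ThreadPool(4)            # 4 worker threads
pool.map(worker, urls.items())  # each pair is handled by one thread
pool.close()
pool.join()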

Open the save folder afterwards and you will find a pleasant surprise: all of the pictures have been downloaded automatically.
