Target page: http://www.axlcg.com/wmxz/1.html
First get the URL of each atlas on the first page
Looking at the page source, each atlas URL sits in an a tag inside an li under ul class="homeboy-ul clearfix line-dot", so we have to drill down to the target level by level.
```python
allsoup = BeautifulSoup(allurldigit)  # the parsed HTML
allpage = allsoup.find('ul', attrs={'class': 'homeboy-ul clearfix line-dot'})
allpage2 = allpage.find_all('a')  # all the a tags in one step
for allpage2index in allpage2:
    allpage3 = allpage2index['href']  # the atlas URL
    if allpage3 not in allurl:  # only append if it is not already in the container
        allurl.append(allpage3)  # store it in the allurl list
```
Get the URL for each page
Fetching just one page hardly counts as a crawler; we want to get multiple pages.
You can see that the next page's URL sits in an a tag inside an li under ul class="information-page-ul clearfix". The trouble is that all the li tags look the same, so how do we find the one that points to the next page?
The a tag for the next page contains the text "下一页" ("next page"), so we can test the text content of each li: when it reads "next page", we follow that link, and in this way crawl through every list page of atlases.
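The check described above can be sketched against a small, hypothetical fragment of the pagination markup (the ul class name comes from the page source; the hrefs here are made-up examples):

```python
from bs4 import BeautifulSoup

# A minimal sketch of the next-page check. The ul class name comes from
# the page source discussed above; the hrefs are made-up examples.
html = """
<ul class="information-page-ul clearfix">
  <li><a href="/wmxz/1.html">1</a></li>
  <li><a href="/wmxz/2.html">2</a></li>
  <li><a href="/wmxz/2.html">下一页</a></li>
</ul>
"""

def find_next_page(soup):
    ul = soup.find('ul', attrs={'class': 'information-page-ul clearfix'})
    if ul is None:
        return None
    for li in ul.find_all('li'):
        a = li.find('a')
        # identify the next-page link by its text ("下一页" = next page),
        # not by its position among the otherwise identical li tags
        if a is not None and a.get_text() == "下一页" and a.get('href'):
            return a['href']
    return None

soup = BeautifulSoup(html, 'html.parser')
print(find_next_page(soup))  # /wmxz/2.html
```

When the returned URL is None we have reached the last list page and can stop.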
Get the IMG address you really want
Click into an atlas and we can see the address of the image.
Copy it into the browser to verify that it is correct.
It is indeed the image we want.
In the same way we get the URL of each image and put it into a list. Inside an atlas we also jump to the next page by its URL and grab the image URL there, because each page holds only one image.
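That collection step can be sketched against a hypothetical fragment of an atlas page (the div class name is taken from the article's code; the image address is invented):

```python
from bs4 import BeautifulSoup

# Sketch of collecting the single image URL an atlas page holds.
# The div class name comes from the article's code; the image
# address is a made-up example.
html = """
<div class="slidebox-detail">
  <ul><li><img src="http://img.example.com/pic/001.jpg"/></li></ul>
</div>
"""

img = []  # collected image URLs
soup = BeautifulSoup(html, 'html.parser')
detail = soup.find('div', attrs={'class': 'slidebox-detail'})
for li in detail.find_all('li'):
    tag = li.find('img')
    if tag is not None and tag['src'] not in img:  # avoid duplicates
        img.append(tag['src'])
print(img)  # ['http://img.example.com/pic/001.jpg']
```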
Download images to Local
urllib.request.urlretrieve(m, "D:/Desktop//image/" + str(count) + ".jpg")
The first parameter is the URL of the image; the second is the destination path plus the file name.
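A hedged sketch of the download step: download_all and its folder default are illustrative names, not the article's exact code; the os.makedirs call guards against the folder not existing, which urlretrieve does not do by itself.

```python
import os
import urllib.request

# Sketch of the download step. download_all and the folder default are
# illustrative names; the article writes to "D:/Desktop/image/" directly.
def download_all(img_urls, folder="D:/Desktop/image/"):
    os.makedirs(folder, exist_ok=True)  # urlretrieve will not create folders
    for count, m in enumerate(img_urls):
        # first argument: the image URL; second: path + file name
        urllib.request.urlretrieve(m, folder + str(count) + ".jpg")
```

urlretrieve also accepts file:// URLs, which is handy for checking the naming logic without touching the network.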
Results
Code
```python
#!/usr/bin/env python
# encoding=utf-8
# python crawler for http://www.axlcg.com/wmxz
import urllib.request

import requests
from bs4 import BeautifulSoup

allurl = []   # atlas URLs
img = []      # image URLs of the current atlas
count = 0     # running number for the saved files


# pretend to be a browser
def download_page(url):
    return requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/63.0.3236.0 Safari/537.36'}).content


# crawl the URL of every atlas and put them in a list
def get_all_url():
    firsturl = "http://www.axlcg.com/wmxz/"
    pageindex = 0
    while pageindex < 8:  # the bound was garbled in the source; 8 list pages assumed
        allurldigit = download_page(firsturl)
        allsoup = BeautifulSoup(allurldigit, 'html.parser')  # parsed HTML
        allpage = allsoup.find('ul', attrs={'class': 'homeboy-ul clearfix line-dot'})
        allpage2 = allpage.find_all('a')  # all the a tags in one step
        for allpage2index in allpage2:
            allpage3 = allpage2index['href']  # the atlas URL
            if allpage3 not in allurl:       # only append if not seen yet
                allurl.append(allpage3)
        # find the URL of the next list page
        next_page1 = allsoup.find('ul', attrs={'class': 'information-page-ul clearfix'})
        next_page2 = next_page1.find_all('li')
        for next_page2_index in next_page2:
            next_page3 = next_page2_index.find('a')
            # the link text "下一页" means "next page"
            if (next_page3 is not None and next_page3.get_text() == "下一页"
                    and next_page3.get("href") is not None):
                firsturl = next_page3.get("href")
                pageindex = pageindex + 1
                print("total page " + firsturl)
                break
        else:
            break  # no next-page link: stop
    print(allurl)
    print(len(allurl))


# walk every atlas and download its images
def main():
    get_all_url()
    i = 0
    pagecount = 0  # pages visited inside the current atlas (up to eight pages)
    url = download_page(allurl[i])
    soup = BeautifulSoup(url, 'html.parser')
    i = i + 1
    while i < len(allurl):
        page0 = soup.find("div", attrs={'class': 'slidebox-detail'})
        page = page0.find_all("li")
        for pageindex in page:
            page2 = pageindex.find("img")
            img.append(page2['src'])  # each page holds only one image
        nextul = soup.find('ul', attrs={'class': 'information-page-ul clearfix'})
        next2 = nextul.find_all('li')
        for next_url in next2:
            next_page = next_url.find("a")
            if (pagecount < 7 and next_page is not None
                    and next_page.get_text() == "下一页"
                    and next_page.get("href") is not None):
                url = next_page.get('href')
                pagecount = pagecount + 1
                soup = BeautifulSoup(download_page(url), 'html.parser')
                break
            elif pagecount >= 7:
                # atlas finished: download what we collected, then move on
                url = download_page(allurl[i])
                soup = BeautifulSoup(url, 'html.parser')
                pagecount = 0
                print(len(img))
                download()
                print("new page " + allurl[i])
                i = i + 1
                break


def download():
    global img, count
    print("start downloading pictures")
    for m in img:
        urllib.request.urlretrieve(m, "D:/Desktop/632/" + str(count) + ".jpg")
        count = count + 1
        print("downloading picture no. " + str(count))
    img = []
    print("download complete")


if __name__ == '__main__':
    main()
```
Python crawls 暖享 (warm-enjoy) pictures