Target page: http://www.axlcg.com/wmxz/1.html
First get the URL of each atlas on the first page
Looking at the page source, each atlas URL sits in an a tag inside an li under ul class="homeboy-ul clearfix line-dot", so we have to drill down to the target level by level.
```python
allsoup = BeautifulSoup(allurldigit)  # the parsed HTML
allpage = allsoup.find('ul', attrs={'class': 'homeboy-ul clearfix line-dot'})
allpage2 = allpage.find_all('a')  # all the a tags in one step
for allpage2index in allpage2:
    allpage3 = allpage2index['href']  # the atlas URL
    if allpage3 not in allurl:  # only append if it is not already in the container
        allurl.append(allpage3)  # store it in the allurl list
```
Get the URL for each page
Fetching just one page hardly counts as a crawler; we want to get multiple pages.
You can see that the next page's URL sits in an a tag inside an li under ul class="information-page-ul clearfix". The trouble is that all the li tags look the same, so how do we find the one that points to the next page?
The a tag for the next page contains the text "下一页" ("next page"), so we can test the text content of each li: when it reads "next page", we follow that link, and in this way crawl through every list page of atlases.
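The check described above can be sketched against a small, hypothetical fragment of the pagination markup (the ul class name comes from the page source; the hrefs here are made-up examples):

```python
from bs4 import BeautifulSoup

# A minimal sketch of the next-page check. The ul class name comes from
# the page source discussed above; the hrefs are made-up examples.
html = """
<ul class="information-page-ul clearfix">
  <li><a href="/wmxz/1.html">1</a></li>
  <li><a href="/wmxz/2.html">2</a></li>
  <li><a href="/wmxz/2.html">下一页</a></li>
</ul>
"""

def find_next_page(soup):
    ul = soup.find('ul', attrs={'class': 'information-page-ul clearfix'})
    if ul is None:
        return None
    for li in ul.find_all('li'):
        a = li.find('a')
        # identify the next-page link by its text ("下一页" = next page),
        # not by its position among the otherwise identical li tags
        if a is not None and a.get_text() == "下一页" and a.get('href'):
            return a['href']
    return None

soup = BeautifulSoup(html, 'html.parser')
print(find_next_page(soup))  # /wmxz/2.html
```

When the returned URL is None we have reached the last list page and can stop.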
Get the IMG address you really want
Click into an atlas and we can see the address of the image.
Copy it into the browser to verify that it is correct.
It is indeed the image we want.
In the same way we get the URL of each image and put it into a list. Inside an atlas we also jump to the next page by its URL and grab the image URL there, because each page holds only one image.
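That collection step can be sketched against a hypothetical fragment of an atlas page (the div class name is taken from the article's code; the image address is invented):

```python
from bs4 import BeautifulSoup

# Sketch of collecting the single image URL an atlas page holds.
# The div class name comes from the article's code; the image
# address is a made-up example.
html = """
<div class="slidebox-detail">
  <ul><li><img src="http://img.example.com/pic/001.jpg"/></li></ul>
</div>
"""

img = []  # collected image URLs
soup = BeautifulSoup(html, 'html.parser')
detail = soup.find('div', attrs={'class': 'slidebox-detail'})
for li in detail.find_all('li'):
    tag = li.find('img')
    if tag is not None and tag['src'] not in img:  # avoid duplicates
        img.append(tag['src'])
print(img)  # ['http://img.example.com/pic/001.jpg']
```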
Download images to Local
urllib.request.urlretrieve(m, "D:/Desktop//image/" + str(count) + ".jpg")
The first parameter is the URL of the image; the second is the destination path plus the file name.
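A hedged sketch of the download step: download_all and its folder default are illustrative names, not the article's exact code; the os.makedirs call guards against the folder not existing, which urlretrieve does not do by itself.

```python
import os
import urllib.request

# Sketch of the download step. download_all and the folder default are
# illustrative names; the article writes to "D:/Desktop/image/" directly.
def download_all(img_urls, folder="D:/Desktop/image/"):
    os.makedirs(folder, exist_ok=True)  # urlretrieve will not create folders
    for count, m in enumerate(img_urls):
        # first argument: the image URL; second: path + file name
        urllib.request.urlretrieve(m, folder + str(count) + ".jpg")
```

urlretrieve also accepts file:// URLs, which is handy for checking the naming logic without touching the network.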
Results
Code
```python
#!/usr/bin/env python
# encoding=utf-8
# python crawler for http://www.axlcg.com/wmxz
import urllib.request

import requests
from bs4 import BeautifulSoup

allurl = []   # atlas URLs
img = []      # image URLs of the current atlas
count = 0     # running number for the saved files


# pretend to be a browser
def download_page(url):
    return requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/63.0.3236.0 Safari/537.36'}).content


# crawl the URL of every atlas and put them in a list
def get_all_url():
    firsturl = "http://www.axlcg.com/wmxz/"
    pageindex = 0
    while pageindex < 8:  # the bound was garbled in the source; 8 list pages assumed
        allurldigit = download_page(firsturl)
        allsoup = BeautifulSoup(allurldigit, 'html.parser')  # parsed HTML
        allpage = allsoup.find('ul', attrs={'class': 'homeboy-ul clearfix line-dot'})
        allpage2 = allpage.find_all('a')  # all the a tags in one step
        for allpage2index in allpage2:
            allpage3 = allpage2index['href']  # the atlas URL
            if allpage3 not in allurl:       # only append if not seen yet
                allurl.append(allpage3)
        # find the URL of the next list page
        next_page1 = allsoup.find('ul', attrs={'class': 'information-page-ul clearfix'})
        next_page2 = next_page1.find_all('li')
        for next_page2_index in next_page2:
            next_page3 = next_page2_index.find('a')
            # the link text "下一页" means "next page"
            if (next_page3 is not None and next_page3.get_text() == "下一页"
                    and next_page3.get("href") is not None):
                firsturl = next_page3.get("href")
                pageindex = pageindex + 1
                print("total page " + firsturl)
                break
        else:
            break  # no next-page link: stop
    print(allurl)
    print(len(allurl))


# walk every atlas and download its images
def main():
    get_all_url()
    i = 0
    pagecount = 0  # pages visited inside the current atlas (up to eight pages)
    url = download_page(allurl[i])
    soup = BeautifulSoup(url, 'html.parser')
    i = i + 1
    while i < len(allurl):
        page0 = soup.find("div", attrs={'class': 'slidebox-detail'})
        page = page0.find_all("li")
        for pageindex in page:
            page2 = pageindex.find("img")
            img.append(page2['src'])  # each page holds only one image
        nextul = soup.find('ul', attrs={'class': 'information-page-ul clearfix'})
        next2 = nextul.find_all('li')
        for next_url in next2:
            next_page = next_url.find("a")
            if (pagecount < 7 and next_page is not None
                    and next_page.get_text() == "下一页"
                    and next_page.get("href") is not None):
                url = next_page.get('href')
                pagecount = pagecount + 1
                soup = BeautifulSoup(download_page(url), 'html.parser')
                break
            elif pagecount >= 7:
                # atlas finished: download what we collected, then move on
                url = download_page(allurl[i])
                soup = BeautifulSoup(url, 'html.parser')
                pagecount = 0
                print(len(img))
                download()
                print("new page " + allurl[i])
                i = i + 1
                break


def download():
    global img, count
    print("start downloading pictures")
    for m in img:
        urllib.request.urlretrieve(m, "D:/Desktop/632/" + str(count) + ".jpg")
        count = count + 1
        print("downloading picture no. " + str(count))
    img = []
    print("download complete")


if __name__ == '__main__':
    main()
```
Python crawls 暖享 (warm-enjoy) pictures