I wrote a crawler that grabs pictures from Taobao. It is written entirely with if, for, and while, so it is fairly crude: a beginner's first piece of work.
It extracts photos of Taobao models from the list pages at http://mm.taobao.com/json/request_top_list.htm?type=0&page= (the page number is appended to this URL).
# -*- coding: cp936 -*-
import urllib2
import urllib

mmurl = "http://mm.taobao.com/json/request_top_list.htm?type=0&page="
i = 0  # one personal page on the second list page has no picture, which causes an IO error
while i < 15:
    url = mmurl + str(i)
    # print url  # print the URL of the list page
    up = urllib2.urlopen(url)  # open the list page and keep the handle
    cont = up.read()
    # print len(cont)  # length of the page
    ahref = '<a href="http'  # keyword that marks a page link within the list page
    target = "target"
    pa = cont.find(ahref)  # head position of a page link
    pt = cont.find(target, pa)  # tail position of that page link
    for a in range(0, 20):  # What if I don't hard code 20? How do I find the end of the file?
        urlx = cont[pa + len(ahref) - 4:pt - 2]  # slice from head to tail to get the link
        if len(urlx) < 60:  # only accept links of a plausible length (remember to use len()!)
            urla = urlx
            print urla  # this is the model's personal URL
            # ---- work on the model's personal URL ----
            mup = urllib2.urlopen(urla)  # open the model's personal page and keep the handle
            mcont = mup.read()  # read the personal page into the string mcont
            imgh = ""  # keyword that marks a picture link (the keyword itself did not survive in this copy of the script)
            imgt = ".jpg"
            iph = mcont.find(imgh)  # head position of a picture link
            ipt = mcont.find(imgt, iph)  # tail position of that picture link
            for b in range(0, 10):  # hard coded again
                mpic = mcont[iph:ipt + len(imgt)]  # raw picture link, still very noisy
                iph1 = mpic.find("http")  # filter the raw link once more
                ipt1 = mpic.find(imgt)  # same as above
                picx = mpic[iph1:ipt1 + len(imgt)]
                if len(picx) < 150:  # some URLs still look like "http:ss.png><dfsdf>.jpg" (a limit of 100 unexpectedly rejected valid links)
                    pica = picx  # compare len(picx), not picx itself, otherwise nothing is shown
                    print pica
                    # ---- download the picture pica ----
                    # the counters of every loop go into the file name to avoid duplicates
                    urllib.urlretrieve(pica, "pic\\tb" + str(i) + "x" + str(a) + "x" + str(b) + ".jpg")
                    # ---- download of pica complete ----
                iph = mcont.find(imgh, iph + len(imgh))  # move on to the next picture link
                ipt = mcont.find(imgt, iph)
            # ---- picture links on the model's personal page are done ----
        pa = cont.find(ahref, pa + len(ahref))  # continue searching for the next head after the current one
        pt = cont.find(target, pa)  # and for the next tail
    i += 1
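
The comments in the script ask how to avoid hard-coding 20 and 10 as loop counts. One possible answer, shown here only as a minimal sketch: str.find returns -1 when the keyword no longer occurs, so the loop can simply run until that happens. The helper name iter_between and the sample page below are made up for illustration and are not part of the original script.

```python
def iter_between(text, head, tail):
    """Yield every substring of text that sits between head and tail."""
    start = text.find(head)
    while start != -1:  # find() returns -1 once head no longer occurs
        begin = start + len(head)
        end = text.find(tail, begin)
        if end == -1:  # a head with no matching tail: nothing more to extract
            break
        yield text[begin:end]
        start = text.find(head, end + len(tail))  # search on after this match


# Hypothetical sample page, made up for illustration only.
sample = ('<a href="http://example.com/a" target="_blank">A</a>'
          '<a href="http://example.com/b" target="_blank">B</a>')

links = list(iter_between(sample, '<a href="', '" target'))
print(links)
```

With a helper like this, both the outer loop over personal pages and the inner loop over picture links could iterate over whatever the page actually contains, instead of a fixed count.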
A simple Python crawler.