I wrote a crawler that grabs Taobao pictures, written with nothing but if, for, and while — fairly crude, a beginner's exercise.
It extracts photos of Taobao models from the page http://mm.taobao.com/json/request_top_list.htm?type=0&page=.
The code is as follows:
# -*- coding: cp936 -*-
import urllib2
import urllib

mmurl = "http://mm.taobao.com/json/request_top_list.htm?type=0&page="
i = 0  # some personal pages have no pictures, which raises an IO error
while i < 15:
    url = mmurl + str(i)
    # print url  # print the URL of each list page
    up = urllib2.urlopen(url)  # open the page and keep the handle
    cont = up.read()
    # print len(cont)  # length of the page
    ahref = '<a href="http'  # start marker of a link (the tag was eaten by the blog engine; reconstructed)
    target = 'target'        # marker for the end of the link
    pa = cont.find(ahref)        # start position of a link
    pt = cont.find(target, pa)   # end position of that link
    for a in range(0, 20):  # how can I avoid hard-coding 20 here? How do I find the end?
        urlx = cont[pa + len(ahref) - 4:pt - 2]  # slice from start to end, keeping the leading "http"
        if len(urlx) < 60:  # if the link length looks right (note: it is len(urlx), not urlx!; the original threshold was garbled, 60 is a guess)
            urla = urlx   # then get ready to print it
            print urla    # this is the model's personal URL we want
            # ------- now start working on the model's personal URL -------
            mup = urllib2.urlopen(urla)  # open the model's own page and keep the handle
            mcont = mup.read()           # read the handle into the string mcont
            imgh = '<img src="'  # start marker of a picture link (tag eaten by the blog engine; reconstructed)
            imgt = '.jpg'        # end marker of a picture link
            iph = mcont.find(imgh)        # start position of the picture link
            ipt = mcont.find(imgt, iph)   # end position of the picture link
            for b in range(0, 10):  # hard-coded again...
                mpic = mcont[iph:ipt + len(imgt)]  # raw picture link; still far too noisy
                iph1 = mpic.find("http")  # filter the link above once more
                ipt1 = mpic.find(imgt)    # same as above
                picx = mpic[iph1:ipt1 + len(imgt)]
                if len(picx) < 150:  # some URLs are still like "http:ss.png> .jpg" (100 turned out to be accidentally too strict)
                    pica = picx  # note: the test is len(picx) < 150, not picx! Otherwise nothing shows up
                    print pica
                    # ------- start downloading the picture pica -------
                    urllib.urlretrieve(pica, "pic\\tb" + str(i) + "x" + str(a) + "x" + str(b) + ".jpg")
                    # pica downloaded (the loop counters go into the name to avoid duplicate file names)
                iph = mcont.find(imgh, iph + len(imgh))  # move on to the next picture
                ipt = mcont.find(imgt, iph)
            # ------- the picture links inside this personal URL are done -------
        pa = cont.find(ahref, pa + len(ahref))  # use the old start position as the origin and find the next start
        pt = cont.find(target, pa)              # keep going and find the next end
    i += 1
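The question left open in the comments — how to avoid hard-coding 20 iterations, and how to know when the page is exhausted — can be answered with find itself: str.find returns -1 once the marker no longer occurs, so a while loop can simply run until then. A minimal sketch over a made-up page string (the HTML below is illustrative, not real Taobao markup):

```python
# str.find returns -1 when the marker is not found, so loop until then
# instead of guessing an iteration count.
page = ('<a href="http://example.com/a" target="_blank">a</a>'
        '<a href="http://example.com/b" target="_blank">b</a>'
        '<a href="http://example.com/c" target="_blank">c</a>')

ahref = '<a href="'
links = []
pa = page.find(ahref)            # position of the first link marker
while pa != -1:                  # -1 means: no more links, stop
    start = pa + len(ahref)      # first character of the URL itself
    end = page.find('"', start)  # closing quote of the href attribute
    links.append(page[start:end])
    pa = page.find(ahref, end)   # search for the next marker
print(links)
```

Slicing up to the closing quote also removes the need for the length-based guesswork in the script above.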
Isn't that simple? With a few small changes you can crawl other content too...
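For anyone adapting the script: instead of the find/slice dance, a regular expression can pull every .jpg URL out of a page in one call, which also does away with the noise filters. A sketch under the same assumptions (the sample HTML is made up):

```python
import re

# A regex grabs all jpg URLs at once -- no hard-coded loop counts and
# no length heuristics needed to throw away noisy matches.
html = ('<img src="http://img.example.com/tb1.jpg" />'
        '<p>some text</p>'
        '<img src="http://img.example.com/tb2.jpg" />')

pics = re.findall(r'<img src="(http[^"]+?\.jpg)"', html)
for n, pic in enumerate(pics):
    # unique file names, like the loop counters in the original script
    print("tb" + str(n) + ".jpg <- " + pic)
```

A real HTML parser (HTMLParser in the standard library, or BeautifulSoup if it is installed) is more robust than either approach, since it is not fooled by attribute order or whitespace.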