A simple Python crawler for Taobao images
I wrote a crawler that captures Taobao images. It is built with nothing more than if, for, and while, so it is fairly simple, entry-level work.
It extracts the Taobao model photos from http://mm.taobao.com/json/request_top_list.htm?type=0&page=<n>.
The code is as follows:
# -*- coding: cp936 -*-
import urllib2
import urllib

mmurl = "http://mm.taobao.com/json/request_top_list.htm?type=0&page="
i = 0                                   # an I/O error may occur when a personal page has no images on it
while i < 15:
    url = mmurl + str(i)
    # print url                         # print the URL of the list page
    up = urllib2.urlopen(url)           # open the list page and keep the handle
    cont = up.read()
    # print len(cont)                   # page length
    ahref = '<a href="http'             # keyword that marks a link on the page
    target = "target"
    pa = cont.find(ahref)               # locate the head of a link
    pt = cont.find(target, pa)          # locate the tail of the link
    for a in range(20):                 # hard-coded count; how to detect the end of the data instead?
        urlx = cont[pa + len(ahref) - 4:pt - 2]   # the link text between head and tail
        if len(urlx) < 60:              # keep only links of a plausible length [note the len()!]
            urla = urlx
            print urla                  # this is the model's personal page URL we want
            ######## operate on the model's personal page ########
            mup = urllib2.urlopen(urla) # open the personal page and keep the handle
            mcont = mup.read()          # read the personal page into the string mcont
            imgh = '<img style='        # head keyword of an image link (lost in the original listing; assumed)
            imgt = '.jpg'               # tail keyword of an image link
            iph = mcont.find(imgh)      # locate the head of the image link
            ipt = mcont.find(imgt, iph) # locate the tail of the image link
            for b in range(10):         # hard-coded count again
                mpic = mcont[iph:ipt + len(imgt)]  # raw image link; still very noisy
                iph1 = mpic.find("http")           # filter the link above again
                ipt1 = mpic.find(imgt)             # same as above
                picx = mpic[iph1:ipt1 + len(imgt)]
                if len(picx) < 150:     # some URLs still look like "http:ss.png><dfsdf>.jpg" (a limit of 100 drops valid ones)
                    pica = picx         # note: test len(picx), not picx, otherwise nothing is printed
                    print pica
                    ########## download the pica image ##########
                    urllib.urlretrieve(pica, "pic\\tb" + str(i) + "x" + str(a) + "x" + str(b) + ".jpg")
                    # the three loop indices go into the file name to avoid duplicate names
                iph = mcont.find(imgh, iph + len(imgh))   # move on to the next image link
                ipt = mcont.find(imgt, iph)
            ######## end of image-link extraction on the personal page ########
        pa = cont.find(ahref, pa + len(ahref))   # continue searching for the next head from the previous one
        pt = cont.find(target, pa)               # continue to find the next tail
    i += 1
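The two for-loops above run a hard-coded number of times, which is exactly what the inline comments complain about. Since str.find returns -1 once the pattern no longer occurs, the scan can simply loop until that happens. A minimal sketch of that idea, using the same ahref/target markers and slicing offsets as the code above (extract_links is a hypothetical helper name, not from the original):

```python
def extract_links(cont, head='<a href="http', tail='target'):
    # Collect every substring between a `head` marker and the next `tail`
    # marker, looping until find() fails instead of a fixed 20 iterations.
    links = []
    pa = cont.find(head)
    while pa != -1:                     # find() returns -1 at end of data
        pt = cont.find(tail, pa)
        if pt == -1:
            break                       # a head with no tail: stop scanning
        links.append(cont[pa + len(head) - 4:pt - 2])
        pa = cont.find(head, pa + len(head))   # continue past this match
    return links
```

The same pattern would remove the hard-coded range(10) in the inner image loop.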
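The comment at the top of the listing notes that an I/O error can abort the whole crawl when a personal page has no images. Wrapping each download in try/except lets the crawler record the failure and move on. A sketch under that assumption (download_all and fetch are hypothetical names; fetch is injected so the logic can be exercised without a network, and in the code above it would be urllib.urlretrieve):

```python
def download_all(urls, fetch, name_pattern="pic/tb%d.jpg"):
    # Try every URL; skip the ones that raise IOError instead of
    # letting one bad link abort the whole crawl.
    saved, failed = [], []
    for n, url in enumerate(urls):
        filename = name_pattern % n     # unique name per loop index
        try:
            fetch(url, filename)
        except IOError:
            failed.append(url)          # note the failure and continue
        else:
            saved.append(filename)
    return saved, failed
```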