A simple Python crawler
I wrote a crawler for grabbing Taobao images. It is written with nothing more than if, for, and while, so it is fairly simple, entry-level work.
It extracts the Taobao model photos from the page http://mm.taobao.com/json/request_top_list.htm?type=0&page= .
# -*- coding: cp936 -*-
import urllib2
import urllib

mmurl = "http://mm.taobao.com/json/request_top_list.htm?type=0&page="
i = 0                                # when a personal page has no image, an I/O error occurs
while i < 15:
    url = mmurl + str(i)
    # print url                      # print the list-page url
    up = urllib2.urlopen(url)        # open the list page and keep the handle
    cont = up.read()
    # print len(cont)                # page length
    ahref = '<a href="http'          # keyword marking the start of a link
    target = "target"                # keyword marking the end of a link
    pa = cont.find(ahref)            # head position of the link
    pt = cont.find(target, pa)       # tail position of the link
    for a in range(20):              # is there a way to avoid hard-coding 20 and detect the end of the page instead?
        urlx = cont[pa + len(ahref) - 4 : pt - 2]   # slice the link (head to tail) out of the page
        if len(urlx) < 60:           # keep it only if the link length looks reasonable [it must be len(urlx)!]
            urla = urlx
            print urla               # this is the model's personal page URL we want
            ######## start to operate on the model's personal page ########
            mup = urllib2.urlopen(urla)   # open the personal page and keep the handle
            mcont = mup.read()            # read the personal page into the mcont string
            # The exact marker strings were garbled in this copy of the code;
            # the image head/tail keywords below are a best guess at the original.
            imgh = '<img src="http'       # keyword marking the start of an image link
            imgt = '.jpg'                 # keyword marking the end of an image link
            iph = mcont.find(imgh)
            ipt = mcont.find(imgt, iph)
            for b in range(10):           # loop count is also a guess; the original value was lost
                picx = mcont[iph + len(imgh) - 4 : ipt + 4]
                if len(picx) < 100:       # it must be len(picx) < 100, not picx! (a larger limit catches junk by mistake)
                    pica = picx
                    print pica
                    ######## start to download the pica image ########
                    urllib.urlretrieve(pica, "pic\\tb" + str(i) + "x" + str(a) + "x" + str(b) + ".jpg")
                    # image downloaded; the loop counters in the name avoid repeated file names
                iph = mcont.find(imgh, iph + len(imgh))   # move on to the next image link
                ipt = mcont.find(imgt, iph)
            ######## end of image-link extraction on the personal page ########
        pa = cont.find(ahref, pa + len(ahref))   # continue from the previous head to find the next link head
        pt = cont.find(target, pa)               # continue to find the next tail
    i += 1
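One of the comments in the code above asks how to avoid hard-coding the loop count of 20. A common alternative is to let a regular expression collect every matching link in one call, so no fixed loop count is needed. The sketch below is my own illustration of that idea, not part of the original post; the pattern is an assumption about the page markup.

# -*- coding: cp936 -*-
# Sketch: use re.findall() so the number of links per page does not have to
# be hard-coded. The regex is an assumed pattern for the page markup.
import re
import urllib2

url = "http://mm.taobao.com/json/request_top_list.htm?type=0&page=0"
cont = urllib2.urlopen(url).read()

# findall() returns every match at once, so there is no fixed loop count
links = re.findall(r'<a href="(http[^"]*)"[^>]*target', cont)
for urla in links:
    print urla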
Teach me to write a small Python crawler
Link: pan.baidu.com/s/1qWsE43q password: sorn
Take a look at this Python crawler video tutorial. I hope it helps you write your own crawler.
Can you share your Python crawler? The one I wrote is very simple, but feel free to use it for reference.
I have also seen this kind of question posted with no details and never quite understood it. Today I finally see that it is a request for targeted help, so I am asking for help too...
The Python crawler I wrote was done under Linux. I wiped my Linux install a couple of days ago, and the only copy left is a version backed up on Gmail. Looking at it again, that backup is not the final version and has a lot of bugs, and the final version was not much better either; I threw it together in a hurry, so I would rather not dig it out.
Let me give you a reference URL instead. I followed it myself when I was learning; it is well written and adds multithreading on top of a basic Python crawler framework, so you can read the code and see how to extend it piece by piece (a rough sketch of the idea follows the reference below).
Reference: blog.csdn.net/..208194
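Since the blog link above is truncated, here is a minimal sketch of the general idea it describes: a work queue plus a few worker threads that fetch pages. This is my own assumption of such a skeleton, not the code from the referenced post, written in the same Python 2 style as the crawler above.

# -*- coding: cp936 -*-
# Minimal multithreaded crawler skeleton (an assumed sketch, not the code
# from the referenced blog): a queue of URLs and a small pool of workers.
import threading
import urllib2
from Queue import Queue

task_queue = Queue()

def worker():
    while True:
        url = task_queue.get()
        try:
            page = urllib2.urlopen(url).read()
            print url, len(page)      # replace with real parsing / downloading
        except Exception, e:
            print "failed:", url, e
        finally:
            task_queue.task_done()

# start a few daemon worker threads
for _ in range(4):
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

# feed the queue with list pages and wait until every task is processed
for page_no in range(15):
    task_queue.put("http://mm.taobao.com/json/request_top_list.htm?type=0&page=" + str(page_no))
task_queue.join()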