When Google's founders wrote their first crude crawler in Python, running on a rudimentary server, few could have imagined how it would reshape the internet, and even the human world, in the decades that followed.
On the web there are crawlers, known in English as spiders: programs that fetch data from websites. For example, a program might periodically crawl data from sites like Baidu Nuomi or Dianping, store the information in a database, and then add a display page; just like that, a group-buying navigation site is born. Crawlers are, without doubt, the original data source for many websites.
I. Implementing a basic crawler function
Get the URL of the first article in the blog's article list.
The first step is to import the urllib module, locate landmarks in the page source with the string find method, and slice out the URL between them.
```python
#!/usr/bin/env python
# Python 2: fetch the listing page and slice out the first article URL.
import urllib

con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
title = con.find(r'<a title=')      # locate the first article anchor
href = con.find(r'href=', title)    # its href attribute
html = con.find(r'.html', href)     # end of the article URL
url = con[href + 6:html + 5]        # +6 skips 'href="', +5 keeps '.html'
print url
```
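The scripts in this post are Python 2 (`urllib.urlopen`, `print` statements). As a rough sketch only, the same find-and-slice technique in Python 3 looks like this; the HTML snippet is an invented stand-in for the listing page, so the offsets (+6 past `href="`, +5 to keep `.html`) can be checked without a network request:

```python
# Python 3 sketch of the same find-and-slice extraction.
# The snippet below is a made-up stand-in for the blog listing page;
# a live fetch would be urllib.request.urlopen(url).read().decode('utf-8').
listing = '<a title="First Post" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102dz84.html">First Post</a>'

title = listing.find('<a title=')     # anchor on the title attribute
href = listing.find('href=', title)   # the href that follows it
html = listing.find('.html', href)    # end of the article URL
url = listing[href + 6:html + 5]      # +6 skips 'href="', +5 keeps '.html'
print(url)  # -> http://blog.sina.com.cn/s/blog_4701280b0102dz84.html
```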
II. Get the URLs of all articles on the first page of the article list
A:
```python
#!/usr/bin/env python
# Python 2: collect up to 40 article URLs from the first listing page.
import urllib

url = [''] * 40
i = 0
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
while title != -1 and href != -1 and html != -1 and i < 40:
    url[i] = con[href + 6:html + 5]
    print url[i]
    # continue each search after the match just consumed
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1
```
or B:
```python
#!/usr/bin/env python
# Python 2: same idea as version A, but without storing the URLs in a list.
import urllib

i = 0
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
url = con[href + 6:html + 5]
while title != -1 and href != -1 and html != -1 and i < 50:
    print url
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    url = con[href + 6:html + 5]
    i = i + 1
```
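Chaining three `find` calls per link is fragile. As an alternative sketch (not from the original post), a single regular expression can collect every article URL in one pass; the snippet below runs on an invented two-link sample rather than the live page:

```python
import re

# Made-up stand-in for one page of the article list.
listing = '''
<a title="Post One" target="_blank" href="http://blog.sina.com.cn/s/blog_001.html">Post One</a>
<a title="Post Two" target="_blank" href="http://blog.sina.com.cn/s/blog_002.html">Post Two</a>
'''

# One pass instead of a find/slice loop:
# capture every href that follows an <a title= anchor.
urls = re.findall(r'<a title=[^>]*?href="([^"]+?\.html)"', listing)
print(urls)
```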
III. Download all articles on the first page of the article list
A:
```python
#!/usr/bin/env python
# Python 2: collect answer URLs from a Zhihu collection page, then download them.
import time
import urllib

i = 0
j = 0
url = [''] * 40
con = urllib.urlopen('http://www.zhihu.com/collection/19668036').read()
target = con.find(r'<a target="_blank')
base = con.find(r'href=', target)
end = con.find('>', base)
while target != -1 and base != -1 and end != -1 and i < 20:
    url[i] = 'http://www.zhihu.com' + con[target + 25:end - 1]
    print url[i]
    target = con.find(r'<a target="_blank', end)
    base = con.find(r'href=', target)
    end = con.find('>', base)
    i = i + 1
while j < i:
    content = urllib.urlopen(url[j]).read()
    # use the URL tail as the file name (the full URL contains '/' and
    # cannot be a file name)
    open(r'zhihu/' + url[j][-8:] + '.html', 'w+').write(content)
    print 'downloading', url[j]
    j = j + 1
    time.sleep(15)   # be polite: pause between requests
```
or B:
```python
#!/usr/bin/env python
# Python 2: version B also slices out the link text to use as the file name.
import time
import urllib

i = 0
url = [''] * 30
name = [''] * 30
con = urllib.urlopen('http://www.zhihu.com/collection/19668036').read()
target = con.find(r'<a target="_blank')
base = con.find(r'href=', target)
end = con.find('>', base)
while target != -1 and base != -1 and end != -1 and i < 30:
    url[i] = 'http://www.zhihu.com' + con[target + 25:end - 1]
    name[i] = con[base + 16:end - 1]    # slice used as the file name
    target = con.find(r'<a target="_blank', end)
    base = con.find(r'href=', target)
    end = con.find('>', base)
    content = urllib.urlopen(url[i]).read()
    open(r'zhihu/' + name[i] + '.html', 'w+').write(content)
    print 'downloading', name[i]
    time.sleep(5)   # pause between requests
    i = i + 1
```
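Version B uses a scraped slice directly as a file name, which fails as soon as it contains characters like `/`, `?`, or spaces. A small hypothetical helper (`safe_filename` is my name, not the original's) can sanitize it first:

```python
import re

def safe_filename(name, ext='.html'):
    """Replace characters unsafe in file names with underscores.

    Hypothetical helper, not part of the original scripts.
    """
    return re.sub(r'[\\/:*?"<>|\s]+', '_', name.strip()) + ext

print(safe_filename('How do I learn Python?'))  # -> How_do_I_learn_Python_.html
```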
IV. Download all articles
A:
```python
# Python 2: walk all 7 listing pages, collect every article URL, then download.
import time
import urllib

page = 1
url = [''] * 350
i = 0
link = 1
while page <= 7:
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]
        print link, url[i]
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        link = link + 1
        i = i + 1
    else:
        print 'find end!'
    page = page + 1
else:
    print 'all find end'

j = 0
while j < i:
    content = urllib.urlopen(url[j]).read()
    open(r'tmp/' + url[j][-26:], 'w+').write(content)   # URL tail as file name
    j = j + 1
    time.sleep(5)
else:
    print 'download over!'
```
B:
```python
#!/usr/bin/env python
# Python 2: version B downloads each article as soon as its URL is found.
import time
import urllib

i = 0
link = 1
page = 1
url = [''] * 350
while page <= 7:
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]
        print link, url[i]
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        content = urllib.urlopen(url[i]).read()
        open(r'/tmp/sina/' + url[i][-26:], 'w+').write(content)
        time.sleep(5)   # pause between requests
        link = link + 1
        i = i + 1
    page = page + 1
else:
    print 'download over!'
```
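Both versions build the per-page listing URL by string concatenation inside the loop. Factoring that into a tiny function (a sketch; `page_url` is not in the original scripts) makes the pagination explicit and easy to test:

```python
def page_url(page):
    """Build the listing URL for one page of the blog's article list."""
    return 'http://blog.sina.com.cn/s/articlelist_1191258123_0_%d.html' % page

# The article list spans 7 pages in the original scripts.
for page in range(1, 8):
    print(page_url(page))
```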
Result: (screenshot of the run omitted)
This article is from the "World" blog; please keep the source: http://xiajie.blog.51cto.com/6044823/1679997
Python simple crawler implementation