Next, having crawled the first page of the blog from its link, it is not hard to notice that the links of the individual listing pages differ only in the page number. So we just need to wrap the previous code in an outer loop, and we can crawl the posts on every page of the blog, that is, all of the posts.
# -*- coding: utf-8 -*-
import urllib
import time

url = [''] * 350
page = 1
link = 1

while page <= 7:  # there are currently 7 listing pages in total
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    i = 0
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]
        content = urllib.urlopen(url[i]).read()
        open(r'allboke/' + url[i][-26:], 'w+').write(content)  # save the post as soon as its link is found
        print 'link', link, url[i]
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        i = i + 1
        link = link + 1
    else:
        print 'page', page, 'find end!'
    page = page + 1
else:
    print 'all find end'

# The earlier two-pass version: collect the links first, then download and save them afterwards.
#i = 0
#while i < 350:
#    content = urllib.urlopen(url[i]).read()
#    open(r'save/' + url[i][-26:], 'w+').write(content)
#    print 'downloading', i, url[i]
#    i = i + 1
#    time.sleep(1)
#else:
#    print 'download article finished!'
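The script above targets Python 2. For reference, below is a rough Python 3 sketch of the same approach; it assumes the article-list URL pattern and blogger ID (1191258123) from the code above still apply, that article links are quoted and end in .html, and it uses a regular expression instead of repeated find() calls, so treat it as an outline rather than a drop-in replacement.

import os
import re
import time
import urllib.request

LIST_URL = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_%d.html'
os.makedirs('allboke', exist_ok=True)

link = 0
for page in range(1, 8):                             # 7 listing pages at the time of writing
    listing = urllib.request.urlopen(LIST_URL % page).read().decode('utf-8', 'ignore')
    # Pull the href out of every '<a title=... href="....html">' entry on the listing page.
    for match in re.finditer(r'<a title=[^>]*?href="([^"]+?\.html)"', listing):
        article_url = match.group(1)
        content = urllib.request.urlopen(article_url).read()
        with open(os.path.join('allboke', article_url[-26:]), 'wb') as f:
            f.write(content)                          # save the post as soon as its link is found
        link += 1
        print('link', link, article_url)
        time.sleep(1)                                 # keep the request rate modest
    print('page', page, 'find end!')
print('all find end')

The time.sleep(1) call simply keeps the download rate modest, mirroring the delay in the commented-out block of the original script.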
With the commented-out block at the end of the code, only about 50 pages were ever saved, and I could not figure out where it went wrong (judging from the code, a likely cause is that i is reset to 0 on every listing page, so url ends up holding only the last page's links by the time the deferred save loop runs).
So I simply moved the saving code into the search loop: each post is saved the moment its link is found.
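If you do want to keep the two-pass layout of the commented-out block (collect all links first, download afterwards), the simplest fix is to append the links to a list that grows across pages instead of overwriting a fixed-size array. A minimal Python 3 sketch under the same assumptions as the earlier sketch (the URL pattern, the regex for quoted .html links, and the last-26-characters file name are all carried over, not taken from the original script):

import os
import re
import time
import urllib.request

urls = []                                            # grows across pages, never reset
for page in range(1, 8):                             # 7 listing pages at the time of writing
    listing = urllib.request.urlopen(
        'http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html'
    ).read().decode('utf-8', 'ignore')
    urls += re.findall(r'<a title=[^>]*?href="([^"]+?\.html)"', listing)

os.makedirs('save', exist_ok=True)
for i, article_url in enumerate(urls):               # second pass: download and save
    content = urllib.request.urlopen(article_url).read()
    with open(os.path.join('save', article_url[-26:]), 'wb') as f:
        f.write(content)
    print('downloading', i, article_url)
    time.sleep(1)                                    # keep the request rate modest
print('download articles finished!')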
Output of a correct run:
Execution Result:
Python crawls all of Han Han's Sina blog posts