To download Han Han's blog posts with a web crawler, you mainly need the following:
1. A basic understanding of the HTML markup language and the HTTP protocol, so you can spot the patterns in the page markup
2. Familiarity with the urllib module
3. Familiarity with Python
Here I used the IE8 developer tools; you could also use the better-known Firebug, a Firefox plugin that is very handy.
Central idea: extract the URL of each post, then fetch it and write the content to a local file.
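This fetch-then-save idea can be sketched in Python 3 (the original scripts below use Python 2's urllib). To keep the sketch runnable offline, a data: URL stands in for a real blog URL; urllib.request handles data: URLs out of the box:

```python
# Python 3 sketch of "fetch a URL, write the response to a local file".
# The data: URL below is a stand-in for a real blog post URL.
import urllib.request

url = 'data:text/plain;charset=utf-8,hello%20blog'
content = urllib.request.urlopen(url).read()  # returns bytes

# Write the raw bytes to a local file, as the crawler does for each post
with open('post.txt', 'wb') as f:
    f.write(content)

print(content.decode('utf-8'))
```

With a real URL you would replace the data: string with the post's address and derive the file name from the URL, as the scripts below do.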
First: download a single article
#coding:utf-8
import urllib

# A sample anchor tag copied from the blog's article list page
str0 = '<a title="On the Seven Elements of Film -- my views on film and some news of the later days" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html" target="_blank">'

title = str0.find(r'<a title')
print title
href = str0.find(r'href=')
print href
html = str0.find(r'.html')
print html

url = str0[href + 6:html + 5]
print url
request = urllib.urlopen(url).read()
# print request
filename = url[-26:]
open(filename, 'w').write(request)

Second: download all 50 articles on the first page of the article list

#!/usr/bin/env python
#coding=utf-8
import urllib

url = [''] * 50
i = 0
stt = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'
str1 = urllib.urlopen(stt).read()

title = str1.find(r'<a title')
# print title
href = str1.find(r'href=', title)
# print href
html = str1.find(r'.html', href)
# print html

while title != -1 and href != -1 and html != -1 and i < 50:
    url[i] = str1[href + 6:html + 5]
    print url[i]
    title = str1.find(r'<a title', html)
    # print title
    href = str1.find(r'href=', title)
    # print href
    html = str1.find(r'.html', href)
    # print html
    # url = str1[href + 6:html + 5]  # rebinding url here would clobber the list, so this line stays out
    # print url
    i += 1
else:
    print 'find end'

i = 0
while i < 50:
    con = urllib.urlopen(url[i]).read()
    # Relative path: my 2.py sits in the hanhan folder, so the bare file name is enough
    open(url[i][-26:], 'w+').write(con)
    print 'downloading:', url[i]
    i += 1
else:
    print 'all find end'

Below are the articles downloaded by the crawler:
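The string-search trick used above (chained str.find calls for '<a title', 'href=', and '.html') can be isolated and tested without touching the network. Here is a Python 3 sketch of that extraction loop; the sample HTML fragment and the function name extract_urls are made up for illustration:

```python
# Extract article URLs from an HTML fragment using only str.find,
# mirroring the crawler's approach (no HTML parser involved).
def extract_urls(page, limit=50):
    urls = []
    pos = 0
    while len(urls) < limit:
        title = page.find('<a title', pos)
        if title == -1:
            break
        href = page.find('href="', title)
        html = page.find('.html', href)
        if href == -1 or html == -1:
            break
        # href + 6 skips past 'href="'; html + 5 keeps the '.html' suffix
        urls.append(page[href + 6:html + 5])
        pos = html  # continue searching after this match
    return urls

sample = ('<a title="post one" href="http://blog.sina.com.cn/s/blog_a.html" target="_blank">'
          '<a title="post two" href="http://blog.sina.com.cn/s/blog_b.html" target="_blank">')
print(extract_urls(sample))
```

This works on Sina's article list because every post link follows the same `<a title=... href=...>` shape; on less regular markup a real parser would be the safer choice.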
[Figure: screenshot of the Han Han blog articles downloaded by the crawler]
Finish
This article is from the "Genius Strength" blog, please be sure to keep this source http://8299474.blog.51cto.com/8289474/1566906
Writing a web crawler in Python