#coding=utf-8
# Sina blog downloader
import urllib
import re
import os

url = [''] * 1500    # URL of each blog post
title = [''] * 1500  # title of each blog post
page = 1             # current article-list page
count = 1            # running article count

while page <= 9:
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1193491727_0_' + str(page) + '.html').read()
    i = 0
    # Locate the first article link on this list page and its title text.
    hrefstart = con.find(r'href="http://blog.sina.com.cn/s/blog_')
    print hrefstart
    hrefend = con.find(r'.html', hrefstart)
    print hrefend
    titlestart = con.find(r'>', hrefend)
    print titlestart
    titleend = con.find(r'</a>', titlestart)
    print titleend
    while i <= 50 and hrefstart != -1 and hrefend != -1 and titleend != -1:
        url[i] = con[hrefstart + 6:hrefend + 5]   # skip 'href="', keep '.html'
        title[i] = con[titlestart + 1:titleend]   # skip the '>' itself
        print page, i, count, title[i]
        print url[i]
        # Find the next link before downloading the current article.
        hrefstart = con.find(r'href="http://blog.sina.com.cn/s/blog_', titleend)
        hrefend = con.find(r'.html', hrefstart)
        titlestart = con.find(r'>', hrefend)
        titleend = con.find(r'</a>', titlestart)
        content = urllib.urlopen(url[i]).read()
        filename = url[i][-26:]   # last 26 characters of the URL, used as the file name
        print filename
        if not os.path.isdir("1"):
            os.mkdir("1")
        target = open('1/' + filename, 'w')
        target.write(content)
        target.close()
        i = i + 1
        count = count + 1
    else:
        print page, 'reached the end of this page'
    page = page + 1
else:
    print 'task finished'
This script uses Python 2.7 to collect the articles on Wang Shi's Sina blog.
It walks through the multi-page article list and downloads every article to the local disk.
I wrote it as practice; if you have better code, please share it with me.
Discussion is welcome.
A few things are still not done:
1. Use regular expressions to extract the article content from each page.
2. Name the download directory automatically according to the download time.
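The two open items could be approached roughly as follows. This is only a sketch under assumptions, not a finished solution: the names LINK_RE, extract_articles, download_dir_name and ensure_download_dir are made up for illustration, and the directory name format (YYYYMMDD_HHMMSS) is one possible choice. The regular expression reuses the same link prefix that the script searches for with find(), so it pulls the URL and title of each article out of a list page. Extracting the article body itself would need a pattern matched to the article page's markup, which is not shown here, so that part is left out. The code avoids print statements so that it should run on Python 2.7 as well as Python 3.

```python
# -*- coding: utf-8 -*-
import os
import re
from datetime import datetime

# Same link prefix the script looks for with find(); the two capture
# groups are (article URL, link text). re.S lets the title span lines.
LINK_RE = re.compile(
    r'href="(http://blog\.sina\.com\.cn/s/blog_\w+\.html)"[^>]*>(.*?)</a>',
    re.S)


def extract_articles(list_html):
    """Return [(url, title), ...] for every article link in a list page."""
    return [(url, title.strip()) for url, title in LINK_RE.findall(list_html)]


def download_dir_name(when=None):
    """Build a directory name from a datetime (assumed format: YYYYMMDD_HHMMSS)."""
    if when is None:
        when = datetime.now()
    return when.strftime('%Y%m%d_%H%M%S')


def ensure_download_dir(base='.', when=None):
    """Create the time-named directory under base if needed and return its path."""
    path = os.path.join(base, download_dir_name(when))
    if not os.path.isdir(path):
        os.makedirs(path)
    return path
```

In the script, ensure_download_dir() would be called once before the page loop and its result stored, so that all files from one run land in the same directory; the fixed '1/' + filename would then become os.path.join(that_dir, filename).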