I had nothing to do over the weekend, so I wrote a web crawler. First, a quick description of what it does: it is a small program for scraping article pages, blog posts, and the like. Find the articles you want to crawl, for example Han's Sina blog. Go to his article directory and note the directory URL, e.g. http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html. The directory contains a link to each article; all we need to do is follow each link and copy the article into a file on our own machine. That is the whole crawl. Without further ado, here is the code:
import urllib
import time

url = [''] * 50
j = 0
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()  # directory page
i = 0
title = con.find(r'<a title=')    # find the first occurrence of <a title=
href = con.find(r'href=', title)  # find the href= that follows <a title=
html = con.find(r'.html', href)   # likewise, find the .html that follows href=
while title != -1 and href != -1 and html != -1 and i < 50:  # the directory holds about 50 articles
    url[i] = con[href + 6:html + 5]  # grab each article's link
    print url[i]
    title = con.find(r'<a title=', html)  # loop to grab each remaining article
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1
while j < 50:
    content = urllib.urlopen(url[j]).read()  # read the content behind each link
    #print content
    filename = url[j][-26:]
    open(filename, 'w+').write(content)  # write the content to a file of your own
    print 'downloading', url[j]
    j = j + 1
    time.sleep(1)  # pause between requests
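The three chained find() calls in the loop above can also be replaced by a single regular expression. Here is a minimal sketch of that idea; the helper name `extract_links` and the inline sample HTML are my own illustration, not part of the original program:

```python
import re

# Hypothetical helper (not from the original post): pull every article URL
# of the form <a title=... href="....html" out of a directory page with
# one regular expression instead of repeated find() calls.
def extract_links(page):
    # [^>]*? lazily skips the title attribute inside the same anchor tag
    return re.findall(r'<a title=[^>]*?href="([^"]+?\.html)"', page)

# Small inline sample standing in for the real directory page
sample = ('<a title="Post 1" href="http://blog.sina.com.cn/s/blog_1.html">'
          '<a title="Post 2" href="http://blog.sina.com.cn/s/blog_2.html">')
print(extract_links(sample))
```

This collapses the whole first while loop into one call, and there is no risk of the three index variables drifting out of sync.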
This article is from the "Midnight" blog; please do not reprint.