While browsing Zhihu I came across a favorites collection called "How to Tucao Properly", and some of the replies in it are hilarious. But reading them page by page is a hassle, and you have to open the web page every time. Wouldn't it be nicer to crawl everything down into a single file, where it can all be read at once? So I got started.
Tools
1. Python 2.7
2. BeautifulSoup
Analyze Web pages
Let's first take a look at the information on this page.
URL: as you can easily see, the address follows a regular pattern in which only the page number increases, so we can crawl every page.
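Since only the page parameter changes, building the full list of page URLs is a one-liner. A minimal sketch, using the collection address and the 20-page range that appear in the code later in this post:

```python
# Build the list of page URLs; only the "page" query parameter changes.
base = "http://www.zhihu.com/collection/27109279?page="
urls = [base + str(n) for n in range(1, 21)]  # pages 1 through 20
```

Each entry in `urls` can then be fetched in turn.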
Let's take a look at what we're going to crawl:
We are going to crawl two things: the questions and the answers, and we only take answers that display their full content. An answer like this one can't be crawled because it doesn't expand (at least I can't expand it), and crawling a truncated answer is useless anyway, so incomplete answers are skipped.
OK, so below we'll find out where they are in the source code of the page:
The content we want is contained in <h2 class="zm-item-title"><a tar...>, so we can find the questions inside this tag.
And then the reply:
The reply appears in two places. The upper one also contains <span...> and other markup that is inconvenient to process, so we crawl the lower one instead, since its content is clean and unpolluted.
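To make the "match tags by class attribute" idea concrete without a live page, here is a minimal sketch using only the standard library's HTMLParser on a made-up HTML snippet. The snippet and the ClassMatcher name are illustrative; the depth counting is deliberately simplified (it ignores void tags like <br>), since it only needs to show how the two class values are matched:

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class ClassMatcher(HTMLParser):
    """Collect the text inside tags whose class is one of the two targets."""
    TARGETS = ('zm-item-title', 'zh-summary summary clearfix')

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0   # >0 while we are inside a matched tag
        self.hits = []   # collected question/answer text

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get('class') in self.TARGETS:
            self.depth += 1
        elif self.depth:
            self.depth += 1  # nested tag (e.g. <a>) inside a matched tag

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.hits.append(data.strip())

sample = ('<h2 class="zm-item-title"><a href="#">Question?</a></h2>'
          '<div class="zh-summary summary clearfix">Answer text.</div>')
p = ClassMatcher()
p.feed(sample)
# p.hits now holds the question text followed by the answer text
```

BeautifulSoup does the same matching for us in one findAll call, which is why it is used in the real code below.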
Code
Okay, now let's write the Python code:
The code is as follows:
# -*- coding: cp936 -*-
import urllib2
from BeautifulSoup import BeautifulSoup

f = open('HowtoTucao.txt', 'w')  # open the output file
for pagenum in range(1, 21):  # crawl from page 1 to page 20
    strpagenum = str(pagenum)  # the page number as a string
    print "Getting Data for Page " + strpagenum  # progress message shown in the shell
    url = "http://www.zhihu.com/collection/27109279?page=" + strpagenum  # page URL
    page = urllib2.urlopen(url)  # open the page
    soup = BeautifulSoup(page)  # parse it with BeautifulSoup
    # find every tag whose class attribute is one of the two below
    all = soup.findAll(attrs={'class': ['zm-item-title', 'zh-summary summary clearfix']})
    for each in all:  # iterate over all questions and answers
        if each.name == 'h2':  # an h2 tag means this is a question
            # the question text is nested in an <a> tag, so read each.a.string
            print each.a.string
            if each.a.string:  # write it if non-empty
                f.write(each.a.string)
            else:  # otherwise write "No Answer"
                f.write("No Answer")
        else:  # this is an answer; write it the same way
            print each.string
            if each.string:
                f.write(each.string)
            else:
                f.write("No Answer")
f.close()  # close the file
Although the code is not long, it took me half a day to write, with all sorts of problems at the start.
Run
Then we run it, and the crawl begins:
Results
After the run finishes, open the file HowtoTucao.txt and you can see the crawl succeeded. Only the formatting still has a problem: originally I did not add a line break after writing "No Answer", so it gets mixed into the surrounding text; appending two newlines fixes it.
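That formatting fix amounts to appending two newlines after every entry written to the file. A minimal sketch (the helper name format_entry is mine, not from the original code):

```python
def format_entry(text):
    """Return one output record: the text, or a placeholder when the
    tag had no usable string, followed by a blank line so entries
    don't run together in HowtoTucao.txt."""
    return (text if text else "No Answer") + "\n\n"

# Usage in the loop above: f.write(format_entry(each.a.string))
```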