Python implementation of a simple crawler to scrape Zhihu's wittiest replies (code sharing)

Source: Internet
Author: User

While browsing Zhihu I came across a collection called "How to complain properly" (如何正确地吐槽). Some of the witty replies in it are hilarious, but reading them page by page is a bit of a hassle, and you have to open the web page every time. If I could crawl them all into a single file, I could read everything at once, so I got started.

Tools

1. Python 2.7
2. BeautifulSoup (version 3, imported as "from BeautifulSoup import BeautifulSoup")

Analyze Web pages

Let's take a look at the information on this page first.

URL: http://www.zhihu.com/collection/27109279?page=1 — it is easy to see that the URL is regular: only the trailing page number changes, increasing one page at a time, so we can crawl every page by iterating over that number.
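The pattern can be sketched in a couple of lines; the base URL below is the one used in the crawler code later in the article:

```python
# Only the trailing page number changes from page to page, so every
# page URL can be generated from one template.
BASE = "http://www.zhihu.com/collection/27109279?page="

urls = [BASE + str(n) for n in range(1, 21)]  # pages 1 through 20
print(urls[0])   # first page
print(urls[-1])  # last page
```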

Let's take a look at what we're going to crawl:

We are going to crawl two things: the questions and the answers. We only take answers whose full text is displayed on the page; collapsed answers like this one cannot be crawled, because they are not expanded in the HTML (and I don't know how to expand them anyway). A truncated answer is of little use, so incomplete answers are simply skipped.

OK, so below we'll find out where they are in the source code of the page:

The question text is contained in an <h2 class="zm-item-title"><a tar...> tag, so we can locate each question inside this tag.
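As a minimal standard-library sketch (not the article's BeautifulSoup call), here is how the question text can be pulled out of that tag; the sample HTML snippet is invented for illustration:

```python
# Collect the text inside <h2 class="zm-item-title"> tags using only
# the standard library's HTML parser, to show where the question sits.
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False   # currently inside a matching <h2>?
        self.titles = []        # collected question texts

    def handle_starttag(self, tag, attrs):
        if tag == 'h2' and ('class', 'zm-item-title') in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

# invented sample markup mimicking the structure described above
sample = '<h2 class="zm-item-title"><a target="_blank" href="/q/1">How do I reply wittily?</a></h2>'
p = TitleParser()
p.feed(sample)
print(p.titles)  # -> ['How do I reply wittily?']
```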

And then the reply:

The reply appears in two places on the page. The upper one also contains <span...> and other tags mixed in, which is inconvenient to process, so we crawl the lower block instead: its content is plain text with nothing polluting it.

Code

Okay, now let's write the Python code:

# -*- coding: cp936 -*-
import urllib2
from BeautifulSoup import BeautifulSoup

f = open('HowtoTucao.txt', 'w')  # open the output file

for pagenum in range(1, 21):  # crawl pages 1 through 20
    strpagenum = str(pagenum)  # page number as a string
    print "Getting Data for Page " + strpagenum  # progress message in the shell
    url = "http://www.zhihu.com/collection/27109279?page=" + strpagenum
    page = urllib2.urlopen(url)  # open the web page
    soup = BeautifulSoup(page)   # parse it with BeautifulSoup

    # find every tag whose class attribute is one of the two below
    all = soup.findAll(attrs={'class': ['zm-item-title', 'zh-summary summary clearfix']})

    for each in all:  # iterate over all questions and answers
        if each.name == 'h2':    # an h2 tag means it is a question
            print each.a.string  # the question sits inside an <a> tag, so take each.a.string
            if each.a.string:    # write it if non-empty
                f.write(each.a.string)
            else:                # otherwise write "No Answer"
                f.write("No Answer")
        else:  # it is an answer; write it the same way
            print each.string
            if each.string:
                f.write(each.string)
            else:
                f.write("No Answer")

f.close()  # close the file

The code is not long, but it still took me half a day to write; at first I ran into all sorts of problems.

Run

Then we run it and the crawl begins:

Results

After the run finishes, open the file HowtoTucao.txt and you can see the crawl succeeded. Only the formatting may still be off: originally I did not write a newline after each entry, so the "No Answer" markers got mixed into the surrounding text. Writing two newlines after each entry fixes it.
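A minimal sketch of that fix (the helper name write_entry is mine, not from the article): append two newlines after every write so that questions, answers, and the "No Answer" markers stay on separate lines.

```python
import io

def write_entry(f, text):
    # write the entry, falling back to "No Answer" when it is empty,
    # then a blank line so consecutive entries do not run together
    f.write(text if text else "No Answer")
    f.write("\n\n")

# usage example with an in-memory file
buf = io.StringIO()
write_entry(buf, "Question?")
write_entry(buf, None)
print(buf.getvalue())
```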
