Python implementation of a simple crawler to scrape Zhihu's wittiest replies (code sharing)

Source: Internet
Author: User

While browsing Zhihu I came across a collection called "How to complain properly" (如何正确地吐槽). Some of the witty replies in it are hilarious, but reading them page by page is a bit of a hassle, and you have to open the web page every time. If I could crawl them all into a single file, I could read everything at once, so I got started.

Tools

1. Python 2.7
2. BeautifulSoup (version 3, imported as "from BeautifulSoup import BeautifulSoup")

Analyze Web pages

Let's take a look at the information on this page first.

URL: http://www.zhihu.com/collection/27109279?page=1 — it is easy to see that the URL is regular: only the trailing page number changes, increasing one page at a time, so we can crawl every page by iterating over that number.
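The pattern can be sketched in a couple of lines; the base URL below is the one used in the crawler code later in the article:

```python
# Only the trailing page number changes from page to page, so every
# page URL can be generated from one template.
BASE = "http://www.zhihu.com/collection/27109279?page="

urls = [BASE + str(n) for n in range(1, 21)]  # pages 1 through 20
print(urls[0])   # first page
print(urls[-1])  # last page
```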

Let's take a look at what we're going to crawl:

We are going to crawl two things: the questions and the answers. We only take answers whose full text is displayed on the page; collapsed answers like this one cannot be crawled, because they are not expanded in the HTML (and I don't know how to expand them anyway). A truncated answer is of little use, so incomplete answers are simply skipped.

OK, so below we'll find out where they are in the source code of the page:

The question text is contained in an <h2 class="zm-item-title"><a tar...> tag, so we can locate each question inside this tag.
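As a minimal standard-library sketch (not the article's BeautifulSoup call), here is how the question text can be pulled out of that tag; the sample HTML snippet is invented for illustration:

```python
# Collect the text inside <h2 class="zm-item-title"> tags using only
# the standard library's HTML parser, to show where the question sits.
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False   # currently inside a matching <h2>?
        self.titles = []        # collected question texts

    def handle_starttag(self, tag, attrs):
        if tag == 'h2' and ('class', 'zm-item-title') in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'h2':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

# invented sample markup mimicking the structure described above
sample = '<h2 class="zm-item-title"><a target="_blank" href="/q/1">How do I reply wittily?</a></h2>'
p = TitleParser()
p.feed(sample)
print(p.titles)  # -> ['How do I reply wittily?']
```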

And then the reply:

The reply appears in two places on the page. The upper one also contains <span...> and other tags mixed in, which is inconvenient to process, so we crawl the lower block instead: its content is plain text with nothing polluting it.

Code

Okay, now let's write the Python code:

# -*- coding: cp936 -*-
import urllib2
from BeautifulSoup import BeautifulSoup

f = open('HowtoTucao.txt', 'w')  # open the output file

for pagenum in range(1, 21):  # crawl pages 1 through 20
    strpagenum = str(pagenum)  # page number as a string
    print "Getting Data for Page " + strpagenum  # progress message in the shell
    url = "http://www.zhihu.com/collection/27109279?page=" + strpagenum
    page = urllib2.urlopen(url)  # open the web page
    soup = BeautifulSoup(page)   # parse it with BeautifulSoup

    # find every tag whose class attribute is one of the two below
    all = soup.findAll(attrs={'class': ['zm-item-title', 'zh-summary summary clearfix']})

    for each in all:  # iterate over all questions and answers
        if each.name == 'h2':    # an h2 tag means it is a question
            print each.a.string  # the question sits inside an <a> tag, so take each.a.string
            if each.a.string:    # write it if non-empty
                f.write(each.a.string)
            else:                # otherwise write "No Answer"
                f.write("No Answer")
        else:  # it is an answer; write it the same way
            print each.string
            if each.string:
                f.write(each.string)
            else:
                f.write("No Answer")

f.close()  # close the file

The code is not long, but it still took me half a day to write; at first I ran into all sorts of problems.

Run

Then we run it and the crawl begins:

Results

After the run finishes, open the file HowtoTucao.txt and you can see the crawl succeeded. Only the formatting may still be off: originally I did not write a newline after each entry, so the "No Answer" markers got mixed into the surrounding text. Writing two newlines after each entry fixes it.
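A minimal sketch of that fix (the helper name write_entry is mine, not from the article): append two newlines after every write so that questions, answers, and the "No Answer" markers stay on separate lines.

```python
import io

def write_entry(f, text):
    # write the entry, falling back to "No Answer" when it is empty,
    # then a blank line so consecutive entries do not run together
    f.write(text if text else "No Answer")
    f.write("\n\n")

# usage example with an in-memory file
buf = io.StringIO()
write_entry(buf, "Question?")
write_entry(buf, None)
print(buf.getvalue())
```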
