How to find all the "god replies" on Zhihu

Source: Internet
Author: User

Sometimes a reply on Zhihu simply amazes me. Let's look at a few first.

When you were a child, what methods did your parents use to make you study hard? -Zhihu
Gave me this face.

Is it appropriate for a job seeker to accept an interview with an enterprise without any instructions? -Zhihu
Your company also often says "Go back and wait for our call" and then never follows up.

What is this? -Zhihu

Enjoy the best and bear the worst

A "god reply" is always concise and accurate, and it hits the mark. It may even leave people a little uncomfortable. How can we find all the god replies? Some people say to look for answers with many votes and few words, but for a computer science student, checking them one by one by hand is out of the question. Let's write a crawler to fetch Zhihu questions, keeping only the top-voted answer for each question. First, we need a list of questions. The question IDs can be found at http://www.zhihu.com/log/questions. For example:

    import cStringIO
    import gzip
    import json
    import urllib
    import urllib2

    from bs4 import BeautifulSoup


    def getQuestions(start, offset='20'):
        #cookies = urllib2.HTTPCookieProcessor()
        #opener = urllib2.build_opener(cookies)
        #urllib2.install_opener(opener)
        # Request headers copied from a logged-in browser session.
        # Content-Length is left out so that urllib2 computes it from the body.
        header = {
            "Accept": "*/*",
            "Accept-Encoding": "gbk,utf-8,gzip,deflate,sdch",
            "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
            "Connection": "keep-alive",
            "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
            'Cookie': '*************',
            "Host": "www.zhihu.com",
            "Origin": "http://www.zhihu.com",
            "Referer": "http://www.zhihu.com/log/questions",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest"
        }
        parms = {'start': start,
                 'offset': offset,
                 '_xsrf': '*************'}
        url = 'http://www.zhihu.com/log/questions'
        req = urllib2.Request(url, headers=header, data=urllib.urlencode(parms))
        content = urllib2.urlopen(req).read()
        # The response is gzip-compressed, so decompress it first.
        html = gzip.GzipFile(fileobj=cStringIO.StringIO(content)).read()
        # The body is JSON; parse it with json.loads rather than eval.
        # The HTML fragment is the second element of 'msg'.
        html = json.loads(html)['msg'][1]
        pageSoup = BeautifulSoup(html)
        questions = []
        items = pageSoup.find_all('div', {'class': 'zm-item'})
        for item in items:
            # The question ID is the last path segment of the link.
            url = item.find_all('a', {'target': '_blank'})[0].get('href').rsplit('/', 1)[1]
            questions.append(url)
        # The ID of the last log item is returned along with the question IDs.
        lastId = items[-1].get('id').split('-')[1]
        return questions, lastId
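To build a long question list, the function above has to be called repeatedly. The following is a minimal sketch of such a loop, assuming that the returned lastId is meant to be passed as the next start value (the post does not say this explicitly). The names collect_questions and fake_fetch are hypothetical; fake_fetch only stands in for getQuestions so the sketch runs without network access.

```python
def collect_questions(fetch_page, start, max_pages):
    """Follow the paging cursor until max_pages pages are read or a page is empty."""
    all_ids = []
    cursor = start
    for _ in range(max_pages):
        ids, last_id = fetch_page(cursor)
        if not ids:
            break
        all_ids.extend(ids)
        # Assumption: the ID of the last item is the next `start` value.
        cursor = last_id
    return all_ids


def fake_fetch(cursor):
    """Hypothetical stand-in for getQuestions, returning canned pages."""
    pages = {"0": (["q1", "q2"], "2"), "2": (["q3"], "3"), "3": ([], "3")}
    return pages[cursor]


print(collect_questions(fake_fetch, "0", 10))
```

In a real crawl, fetch_page would be getQuestions, and a short pause between requests would probably be wise.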

After obtaining the question list, fetch the top-voted answer for each question. The code is as follows:

    def getArticle(url):
        # getPage and formatStr are helpers from the rest of the script (not shown here).
        page = getPage(url)
        pageSoup = BeautifulSoup(page)
        title = str(pageSoup.title).replace('<title>', '').replace('</title>', '').strip()
        # The first answer on the page is taken as the top-voted one.
        item = pageSoup.find_all('div', {'class': 'zm-item-answer'})
        if item is None or len(item) == 0:
            return None
        answer = item[0].find('div', {'class': 'fixed-summary zm-editable-content clearfix'}).get_text().strip()
        vote = item[0].find('div', {'class': 'zm-item-vote-info '}).get('data-votecount').strip()
        answer = formatStr(answer)
        # Record the full length before truncating the stored text to 100 characters.
        ans_len = len(answer)
        if ans_len > 100:
            answer = answer[0:100]
        title = formatStr(title)
        out = [title, answer, str(ans_len), vote, url]
        return out

Now we have each question's title, its top-voted answer with the vote count, and the question link. Next, we need to quantify the rule "short reply, high votes". Obviously, a god reply is directly proportional to the number of votes and inversely proportional to the length of the reply text, but the implementation needs attention to some details. For example, some god replies are just pictures, so the text length is 0 and the score needs smoothing. In addition, the longer and more detailed a reply is, the less it counts as a god reply, so I defined the following formula: $score = \frac{vote}{5 + \frac{answer\_len^2}{10}}$

In short, this formula treats a god reply as proportional to the number of votes and inversely proportional to the square of the reply length, and the constant 5 smooths the score for picture-only replies.
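To get a feel for the formula, here is a worked example. The vote count of 1000 and the three lengths are made-up values chosen only for illustration.

```latex
\begin{align*}
answer\_len = 0:   &\quad score = \frac{1000}{5 + \frac{0^2}{10}} = \frac{1000}{5} = 200 \\
answer\_len = 10:  &\quad score = \frac{1000}{5 + \frac{10^2}{10}} = \frac{1000}{15} \approx 66.7 \\
answer\_len = 100: &\quad score = \frac{1000}{5 + \frac{100^2}{10}} = \frac{1000}{1005} \approx 0.995
\end{align*}
```

With the same number of votes, a picture-only reply stays finite thanks to the constant 5, while a 100-character reply scores about 200 times lower.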

After crawling for one night, I had 20 thousand questions. I then sorted them by score and kept the top 1000. To enjoy the god replies, go to GitHub.
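The ranking step could look roughly like the sketch below. It assumes rows shaped like the list returned by getArticle ([title, answer, ans_len, vote, url], with ans_len and vote stored as strings). The names score and top_replies and the sample rows are hypothetical, and the post does not show how the original ranking was actually implemented.

```python
import heapq


def score(vote, answer_len):
    """score = vote / (5 + answer_len^2 / 10), as defined in the post."""
    return float(vote) / (5.0 + answer_len ** 2 / 10.0)


def top_replies(rows, n=1000):
    """Return the n rows with the highest score, best first."""
    return heapq.nlargest(n, rows, key=lambda r: score(int(r[3]), int(r[2])))


# Made-up sample rows: [title, answer, ans_len, vote, url]
rows = [
    ["Question A", "short", "5", "300", "url-a"],
    ["Question B", "x" * 100, "400", "5000", "url-b"],
    ["Question C", "", "0", "120", "url-c"],
]

for row in top_replies(rows, n=2):
    print(row[0])
```

Here the long answer to Question B drops out despite having the most votes, because its length is squared in the denominator.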

Too few questions were crawled, so many more brilliant replies are still waiting to be found.

If you reprint this article, please indicate the source: http://www.cnblogs.com/fengfenggirl/
