I have been studying front-end material for a long time and was afraid my Python was getting rusty, so I wrote a simulated login to get back into practice.
Some of Zhihu's content can only be crawled after logging in, so it is worth a try.
1 First log in through the browser and capture the traffic with Fiddler
The capture shows that logging in to Zhihu requires POSTing the following data:
Huh? The captcha is gone. Never mind, all the better.
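As a rough sketch of the form data mentioned above: the field names are the ones used by the script later in this post (`_xsrf`, `password`, `phone_num`, `captcha_type`), while the helper name `build_login_payload` and the sample values are invented for illustration. It is written for Python 3 and sends nothing over the network.

```python
def build_login_payload(xsrf_token, phone, password):
    """Assemble the form fields carried by the login POST request."""
    return {
        "_xsrf": xsrf_token,     # anti-CSRF token read from the login page
        "password": password,
        "phone_num": phone,
        "captcha_type": "cn",    # value used by the script in this post
    }


# Placeholder values only; real values come from the page and the user.
payload = build_login_payload("dummy-token", "13800000000", "secret")
print(sorted(payload))
```

Keeping the payload in one small function makes it easy to compare against a fresh Fiddler capture if the site changes its form fields.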
Next comes the code. But wait, first take a look at Zhihu's response.
The response body is JSON, and on inspection the value of msg is the login status, so we print that value to show whether the login succeeded.
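Reading the status out of the response can be sketched as follows. The sample body is invented (only the `msg` key comes from this post), and the helper name `login_message` is hypothetical. Python 3, standard library only.

```python
import json


def login_message(body_text):
    """Return the 'msg' field of a JSON response body, or None if it is missing."""
    try:
        body = json.loads(body_text)
    except ValueError:  # the body was not valid JSON (for example an HTML error page)
        return None
    if not isinstance(body, dict):
        return None
    return body.get("msg")


sample_body = '{"msg": "login ok"}'  # invented sample, not a captured response
print(login_message(sample_body))
```

Returning None instead of raising keeps the caller simple when the server answers with something other than JSON.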
2 Without further ado, here is the code
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import cookielib
import json

homepage = 'https://www.zXXXu.com/'  # home page URL
# filename = r'zhihu_cookies.txt'
session = requests.session()
cookie = cookielib.CookieJar()  # this can be used to hold cookies temporarily

# Optional: keep the cookies in a file instead.
# session.cookies = cookielib.MozillaCookieJar(filename)
# try:
#     # ignore_discard: also load cookies that are marked to be discarded
#     # ignore_expires: also load cookies that have already expired
#     session.cookies.load(filename=filename, ignore_discard=True, ignore_expires=True)
# except:
#     print 'Cookie can not load!'

headers = {
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-us,en;q=0.8,zh-hans-cn;q=0.5,zh-hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Host': 'www.zXXXu.com',
}


def get_xsrf():
    # Fetch the home page and read the hidden _xsrf value from the sign-in form
    text = session.get(homepage, headers=headers).text
    soup = BeautifulSoup(text, 'html.parser')
    result = soup.find('div', class_='view view-signin').find('input')['value']
    return result


# Get the captcha
def get_captcha():
    pass


def login_zhihu(phone, passwd):
    login_url = homepage + 'login/phone_num'  # homepage already ends with '/'
    data = {
        '_xsrf': '%s' % get_xsrf(),
        'password': passwd,
        'phone_num': phone,
        'captcha_type': 'cn',
    }
    result = session.post(login_url, data=data, headers=headers)
    # The body of result is JSON, and the value of 'msg' is the login status
    print json.loads(result.text)['msg']
    return


if __name__ == '__main__':
    phone = raw_input('Please input phone_num:')
    passwd = raw_input('Please input password:')
    url = homepage + 'settings/profile'  # the profile page is only accessible after login
    login_zhihu(phone, passwd)
    # Redirects are disabled here
    resp_status = session.get(url, headers=headers, allow_redirects=False).status_code
    print resp_status  # the result is the HTTP status code
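The cookie-file part of the script is easier to follow with a small offline experiment. The sketch below uses Python 3, where the `cookielib` module is named `http.cookiejar`. The cookie name, its value, the domain `example.com` and both helper names are invented for illustration. It shows what `ignore_discard` does: a session cookie (one marked to be discarded) is only written to and read back from the file when the flag is True.

```python
import http.cookiejar
import os
import tempfile


def make_session_cookie(name, value, domain):
    """Build a cookie with no expiry date, i.e. one marked to be discarded."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def save_and_reload(cookie, ignore_discard):
    """Save the cookie to a temporary file, load it into a fresh jar, return the names found."""
    path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.set_cookie(cookie)
    jar.save(ignore_discard=ignore_discard, ignore_expires=True)

    reloaded = http.cookiejar.MozillaCookieJar(path)
    reloaded.load(ignore_discard=ignore_discard, ignore_expires=True)
    return [c.name for c in reloaded]


cookie = make_session_cookie("session_id", "abc123", "example.com")
kept = save_and_reload(cookie, ignore_discard=True)
dropped = save_and_reload(cookie, ignore_discard=False)
print(kept, dropped)
```

Login cookies are often session cookies, which is presumably why the script passes `ignore_discard=True`.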
There are two points to explain.
2.1 Cookie handling: I used a CookieJar to store the cookies, but this step can also be skipped.
2.2 The headers must be filled in completely. It used to be enough to change only the User-Agent to log in, but now the full set is needed. Zhihu is fighting crawlers too (I only realized this after many attempts, so don't be as silly as I was).
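To illustrate the point about headers without touching the network, the sketch below attaches the full header set from the script to a request object. It uses Python 3's standard `urllib.request` in place of `requests` purely so that it runs without third-party packages. The names `FULL_HEADERS` and `build_request` are made up for this example.

```python
import urllib.request

HOMEPAGE = "https://www.zXXXu.com/"

# The complete browser-like header set used by the script in this post.
FULL_HEADERS = {
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-us,en;q=0.8,zh-hans-cn;q=0.5,zh-hans;q=0.3",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Host": "www.zXXXu.com",
}


def build_request(url, headers):
    """Create a request object carrying every header; nothing is sent."""
    return urllib.request.Request(url, headers=headers)


request = build_request(HOMEPAGE, FULL_HEADERS)
# urllib stores header names capitalized, e.g. "User-agent".
for name, value in sorted(request.header_items()):
    print(name, "=", value)
```

With `requests`, the same dictionary can simply be passed as `headers=` on each call, as the script does.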
3 The final result is what the script prints: the login message and the status code.
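One way to read the printed status code is sketched below. It assumes, without confirmation from this post, that the site answers 200 for the profile page when logged in and replies with a 3xx redirect otherwise. That assumption is the reason redirects are switched off in the script. The function name `describe_profile_status` is hypothetical.

```python
def describe_profile_status(status_code):
    """Interpret the status code of the profile request made with redirects disabled."""
    if status_code == 200:
        return "logged in"
    if 300 <= status_code < 400:
        # Assumption: anonymous visitors are redirected away from the profile page.
        return "redirected, probably not logged in"
    return "unexpected status"


for code in (200, 302, 500):
    print(code, describe_profile_status(code))
```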
Finally, I recommend a Zhihu crawler written by a Jianshu author, which also handles the captcha (I find typing it in by hand really annoying). Link address
Simulated login to a well-known Chinese knowledge-sharing website