Today I stumbled upon a fairly well-known open-source crawler for Zhihu, written in Python and called zhihu_oauth. It has quite a few stars on GitHub and the documentation looks detailed, so I spent a little time studying it and found it very useful. Here is a quick introduction to how to use it.
The project's home page is https://github.com/7sDream/zhihu-oauth. The author's Zhihu profile is https://www.zhihu.com/people/7sdream/.
The project documentation is at http://zhihu-oauth.readthedocs.io/zh_CN/latest/index.html. The original author has already explained in great detail how to use this library, so repeating all of it here would be superfluous. If you want to learn more about how the library works, go read the official documentation. I will just add a few points that I think are worth noting.
First, installation. The author has uploaded the project to PyPI, so we can install it directly with pip. According to the author, the project supports Python 3 better and Python 2 compatibility is only lukewarm at the moment, so it is best to use Python 3. It can be installed with pip3 install -U zhihu_oauth.
The first step after installation is to log in. You can log in directly using the code below.
from zhihu_oauth import ZhihuClient
from zhihu_oauth.exception import NeedCaptchaException

client = ZhihuClient()
user = 'email_or_phone'
pwd = 'password'

try:
    client.login(user, pwd)
    print(u"Login successful!")
except NeedCaptchaException:  # handle the case where a captcha is required
    # save the captcha image, prompt for input, then log in again
    with open('a.gif', 'wb') as f:
        f.write(client.get_captcha())
    captcha = input('Please input captcha: ')
    client.login('email_or_phone', 'password', captcha)

client.save_token('token.pkl')  # save the login token
# With a saved token, the token file can be loaded directly the next time you log in:
# client.load_token('filename')
The code above logs in directly with the account name and password, and saves the login token at the end. The next time we log in we can load the token directly, without having to enter the password every time.
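As a quick illustration, here is a minimal sketch of that token-based re-login flow. It only uses the save_token / load_token / me() calls that already appear in the snippets in this post, and it assumes the token.pkl file was written by the login code above:

from zhihu_oauth import ZhihuClient

client = ZhihuClient()
client.load_token('token.pkl')  # reuse the token written by client.save_token('token.pkl')

me = client.me()
# print the title of the most recent answer, just to confirm the token still works
for _, answer in zip(range(1), me.answers):
    print(answer.question.title)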
Once logged in, we can of course do a lot of things. For example, the code below fetches the basic information of our own Zhihu account.
from __future__ import print_function  # use the Python 3 print function
from zhihu_oauth import ZhihuClient

client = ZhihuClient()
client.load_token('token.pkl')  # load the token file
# display information about ourselves
me = client.me()

# get the 5 most recent answers
for _, answer in zip(range(5), me.answers):
    print(answer.question.title, answer.voteup_count)

print('----------')

# get the top 5 answers by vote count
for _, answer in zip(range(5), me.answers.order_by('votenum')):
    print(answer.question.title, answer.voteup_count)

print('----------')

# get the 5 most recently asked questions
for _, question in zip(range(5), me.questions):
    print(question.title, question.answer_count)

print('----------')

# get the 5 most recently published articles
for _, article in zip(range(5), me.articles):
    print(article.title, article.voteup_count)
Of course we can do much more than this. For example, given the URL or ID of a question, we can get how many answers it has, information about each answer's author, and a whole series of other details. The developer has really been thoughtful; basically all of the commonly needed information is covered. I will not post specific code for everything, so please refer to the official documentation.
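As one small, hedged example of what I mean, here is a sketch that looks a question up by its numeric ID, the same way the image-saving script at the end of this post does. client.question(), question.title, question.answer_count, question.answers and answer.voteup_count all appear elsewhere in this post; answer.author.name is my assumption about the author attribute, so verify it against the official documentation (or with the dir() tip below):

from zhihu_oauth import ZhihuClient

client = ZhihuClient()
client.load_token('token.pkl')

question = client.question(24400664)  # look a question up by its numeric ID
print(question.title, question.answer_count)

# walk the first few answers and print who wrote them and how many upvotes they got
for _, answer in zip(range(3), question.answers):
    # .author.name is an assumption, not shown elsewhere in this post
    print(answer.author.name, answer.voteup_count)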
A small tip: the library has a lot of classes, such as a class for author information, a class for article information, and so on, and every class has many attributes. When I looked at the official documentation, I noticed that not every attribute of every class is fully listed. So how do we see all of the attributes of a class? It is actually simple: just use Python's built-in dir function. dir(obj) returns a list of all attributes of an object (or class). For example, if we have an Answer object, dir(answer) returns a list of all of that object's attributes. After ignoring the default dunder attributes, we can find the attributes we need, which is very handy; a short snippet after the list below shows one way to filter the output. (The following is the full attribute list of a Collection, i.e. favorites folder, object.)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_build_data', '_build_params', '_build_url', '_cache', '_data', '_get_data', '_id', '_method', '_refresh_times', '_session', 'answer_count', 'answers', 'articles', 'comment_count', 'comments', 'contents', 'created_time', 'creator', 'description', 'follower_count', 'followers', 'id', 'is_public', 'pure_data', 'refresh', 'title', 'updated_time']
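Here is a minimal sketch of that trick. The filtering itself is plain built-in Python and does not depend on anything zhihu_oauth-specific; as an example object I pass in client.me(), which already appears in the snippets above:

from zhihu_oauth import ZhihuClient

def public_attrs(obj):
    # keep only the names that do not start with an underscore
    return [name for name in dir(obj) if not name.startswith('_')]

client = ZhihuClient()
client.load_token('token.pkl')

# for example, list the usable attributes of our own People object
print(public_attrs(client.me()))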
Finally, I used this library to crawl all of the images in the answers to one question (grabbing pretty pictures, haha), in under 30 lines of code (excluding comments). Here it is, shared with everyone.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2017/5/3 14:27
# @Author: Lyrichu
# @Email: [Email protected]
# @File: save_images.py
'''
@Description: Save the images from all answers to a Zhihu question
'''
from __future__ import print_function  # use the Python 3 print function
from zhihu_oauth import ZhihuClient
import re
import os
import urllib

client = ZhihuClient()
# login
client.load_token('token.pkl')  # load the token file
id = 24400664  # https://www.zhihu.com/question/24400664 (What is it like to be good-looking?)
question = client.question(id)
print(u"Question:", question.title)
print(u"Number of answers:", question.answer_count)
# create a folder to store the images
os.mkdir(question.title + u"(image)")
path = question.title + u"(image)"
index = 1  # image number
for answer in question.answers:
    content = answer.content  # answer content (HTML)
    # NOTE: the original regular expression was lost when the post was formatted;
    # this single-group pattern matching <img src="..."> tags is a reconstruction
    re_compile = re.compile(r'<img src="(https?://[^"]+)"')
    img_lists = re.findall(re_compile, content)
    if img_lists:
        for img_url in img_lists:
            # on Python 3 this would be urllib.request.urlretrieve
            urllib.urlretrieve(img_url, path + u"/%d.jpg" % index)
            print(u"Saved image %d successfully" % index)
            index += 1
If you write a crawler yourself, directly crawling and parsing the pages won't give you all of the answers, so you would have to reverse-engineer the API, which is more trouble; using this ready-made wheel is much easier. From now on I won't have to worry when I want to browse pretty pictures at my leisure, heh heh.