I want to crawl users' follow information on Zhihu, for example to see whom user A follows, by fetching the list of followees from the page www.zhihu.com/people/XXX/followees, but the crawl runs into a 403 error.
1. The crawler only collects users' follow information for academic research, not for commercial or any other purpose.
2. I use PHP, build the requests with curl, and parse the documents with simple_html_dom.
3. The followees list on a user's page appears to load more followees dynamically through AJAX, so I want to crawl the interface data directly. Looking at it in Firebug, loading more followees seems to go through http://www.zhihu.com/node/ProfileFolloweesListV2, and the POST data is _xsrf, method, params. So, while simulating a logged-in session, I POST the required parameters to that URL, but the response is 403.
4. With the same simulated login, however, I can parse data that does not need AJAX, such as the number of upvotes and the number of thanks.
5. I use curl_setopt($ch, CURLOPT_HTTPHEADER, $header) to set the request headers so that they match the headers of the request I submitted in the browser, but I still get a 403 error.
6. I tried to print curl's request headers to compare them with the browser's, but did not find the right way to do it (a Baidu search suggests curl_getinfo() can print the corresponding information).
7. Many people have run into 403 because they did not set User-Agent or X-Requested-With, but I have set both in the request headers from point 5.
8. If the description is unclear and you need to see the code, I can post it.
9. This crawler is part of my graduation project, and I need the data for the next stage of the work. As point 1 says, the data is crawled purely for academic research.
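The header comparison described in point 6 can be sketched in a few lines. This is only an illustration in Python, not the asker's PHP code: `diff_headers` is a hypothetical helper, and the header values in the usage example are made up. It assumes you have copied the browser's request headers (for example from Firebug) and the crawler's headers into two dictionaries.

```python
def diff_headers(browser, crawler):
    """Compare two header dicts case-insensitively.

    Returns (missing, different): names the crawler does not send at all,
    and names it sends with a value that differs from the browser's.
    """
    b = {name.lower(): value.strip() for name, value in browser.items()}
    c = {name.lower(): value.strip() for name, value in crawler.items()}
    missing = sorted(name for name in b if name not in c)
    different = sorted(name for name in b if name in c and b[name] != c[name])
    return missing, different


if __name__ == "__main__":
    # Made-up values, for illustration only.
    browser = {
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",
        "Cookie": "_xsrf=abc",
    }
    crawler = {"user-agent": "Mozilla/5.0", "Cookie": "_xsrf=xyz"}
    missing, different = diff_headers(browser, crawler)
    print("missing:", missing)
    print("different:", different)
```

Header names are compared in lower case because HTTP header names are case-insensitive, so a difference in capitalization alone is not reported.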
Reply content:
If the server has a firewall function, continuous crawling may get you blocked, unless you have a lot of proxy servers. Or, simplest of all, keep redialing your ADSL connection to get a new IP.

First find a browser and study the HTTP headers of the request it sends.

These past two days I happened to write a crawler that fetches a user's followees and followers. It is collecting data right now, and it is written in Python. Here is the Python code; you can compare it with your own to find the problem.
A 403 most likely means that some data in the request is wrong. The code below opens a text file whose content is the user IDs; I put a screenshot of the file's format at the end.
# encoding=utf8
# Python 2 script (uses urllib2, raw_input and print statements)
import urllib2
import json
import requests
from bs4 import BeautifulSoup

Default_header = {'X-Requested-With': 'XMLHttpRequest',
                  'Referer': 'http://www.zhihu.com',
                  'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
                                'rv:39.0) Gecko/20100101 Firefox/39.0',
                  'Host': 'www.zhihu.com'}
_session = requests.session()
_session.headers.update(Default_header)

# Input: one user ID per line. Output: the IDs found for each user.
resourceFile = open('/root/desktop/userid.text', 'r')
resourceLines = resourceFile.readlines()
resultFollowerFile = open('/root/desktop/useridfollowees.text', 'a+')
resultFolloweeFile = open('/root/desktop/useridfollowers.text', 'a+')

BASE_URL = 'https://www.zhihu.com/'
CAPTURE_URL = BASE_URL + 'captcha.gif?r=1466595391805&type=login'
PHONE_LOGIN = BASE_URL + 'login/phone_num'


def login():
    '''Log in to Zhihu'''
    username = ''  # user name
    password = ''  # password; I log in with a phone number, for email login change the login URL
    # Save the captcha image to disk so it can be read and typed in by hand
    cap_content = urllib2.urlopen(CAPTURE_URL).read()
    cap_file = open('/root/desktop/cap.gif', 'wb')
    cap_file.write(cap_content)
    cap_file.close()
    captcha = raw_input('capture:')
    data = {"phone_num": username, "password": password, "captcha": captcha}
    r = _session.post(PHONE_LOGIN, data)
    print (r.json())['msg']


def readFollowerNumbers(followerId, followType):
    '''Read each user's followees or followers, depending on followType'''
    print followerId
    personUrl = 'https://www.zhihu.com/people/' + followerId.strip('\n')
    xsrf = getXsrf()
    hash_id = getHashId(personUrl)
    headers = dict(Default_header)
    headers['Referer'] = personUrl + '/follow' + followType
    followerUrl = 'https://www.zhihu.com/node/ProfileFollow' + followType + 'ListV2'
    params = {"offset": 0, "order_by": "created", "hash_id": hash_id}
    data = {"method": "next", "params": json.dumps(params), '_xsrf': xsrf}
    offset = 0
    # Keep requesting pages until the server returns an empty one
    while True:
        params['offset'] = offset
        data['params'] = json.dumps(params)
        followerUrlJson = _session.post(followerUrl, data=data, headers=headers)
        followerHtml = (followerUrlJson.json())['msg']
        if not followerHtml:
            break
        for everHtml in followerHtml:
            everHtmlSoup = BeautifulSoup(everHtml, 'html.parser')
            personId = everHtmlSoup.a['href']
            resultFollowerFile.write(personId + '\n')
            print personId
        offset = offset + len(followerHtml)


def getXsrf():
    '''Get the _xsrf token; this one belongs to the currently logged-in user'''
    soup = BeautifulSoup(_session.get(BASE_URL).content, 'html.parser')
    _xsrf = soup.find('input', attrs={'name': '_xsrf'})['value']
    return _xsrf


def getHashId(personUrl):
    '''This is the hash_id of the user being crawled, not of the logged-in user'''
    soup = BeautifulSoup(_session.get(personUrl).content, 'html.parser')
    hashIdText = soup.find('script', attrs={'data-name': 'current_people'})
    return json.loads(hashIdText.text)[3]


def main():
    login()
    # Python 2 input() evaluates what is typed, so entering 0 yields the integer 0
    followType = input('Please configure the crawl category: 0 - whom the user follows, anything else - who follows the user: ')
    followType = 'ees' if followType == 0 else 'ers'
    for followerId in resourceLines:
        try:
            readFollowerNumbers(followerId, followType)
            resultFollowerFile.flush()
        except Exception:
            # Skip users whose pages fail to load or parse
            pass


if __name__ == '__main__':
    main()
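One detail of the request above is easy to miss: the `params` field is itself a JSON string nested inside ordinary form data, next to `method` and `_xsrf`. A minimal sketch of building one page's form fields, assuming the field names used in the script; `build_followees_payload` is a hypothetical helper name, and the argument values in the usage line are placeholders.

```python
import json


def build_followees_payload(hash_id, offset, xsrf):
    """Build the form fields for one page of the follow list."""
    params = {"offset": offset, "order_by": "created", "hash_id": hash_id}
    return {
        "method": "next",
        "params": json.dumps(params),  # JSON string nested inside the form data
        "_xsrf": xsrf,
    }


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(build_followees_payload("some-hash-id", 0, "some-token"))
```

Sending `params` as a nested dictionary instead of a JSON string would produce a different request body, which is one way a request can differ from what the browser sends.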
It comes down to nothing more than these: User-Agent, Referer, token, cookie.

I think there could be two reasons for this:
- No cookies were sent.
- The _xsrf or hash_id is wrong.
Let me answer this. Zhihu plays a small trick with the "_xsrf" field: the value to submit is not the _xsrf taken from the home page, but the "_xsrf" value returned in the cookie after a successful login. You need that value, otherwise you will always get a 403 error. (I found this while posting a question, and I believe you have hit a similar problem. Here is the code.)
/// <summary>
/// Ask a question
/// </summary>
/// <param name="question_title">Question title</param>
/// <param name="question_detail">Detailed content</param>
/// <param name="cookie">Cookies obtained after login</param>
public void ZhihuFatie(string question_title, string question_detail, CookieContainer cookie)
{
    question_title = "Question content";
    question_detail = "Detailed description of the problem";
    // Traverse the cookies to get the value of _xsrf
    var list = GetAllCookies(cookie);
    foreach (var item in list)
    {
        if (item.Name == "_xsrf")
        {
            xsrf = item.Value;
            break;
        }
    }
    // Post the question (xsrf, topicStr, topicId and nhp are members defined elsewhere in the class)
    var fatiePostUrl = "http://www.zhihu.com/question/add";
    var dd = topicStr.ToCharArray();
    var fatiePostStr = "question_title=" + HttpUtility.UrlEncode(question_title) + "&question_detail=" + HttpUtility.UrlEncode(question_detail) + "&anon=0&topic_ids=" + topicId + "&new_topics=&_xsrf=" + xsrf;
    var fatieResult = nhp.PostResultHtml(fatiePostUrl, cookie, "http://www.zhihu.com/", fatiePostStr);
}

/// <summary>
/// Traverse the CookieContainer
/// </summary>
/// <param name="cc"></param>
/// <returns></returns>
public static List<Cookie> GetAllCookies(CookieContainer cc)
{
    List<Cookie> lstCookies = new List<Cookie>();
    // CookieContainer does not expose its cookies publicly, so read its private fields via reflection
    Hashtable table = (Hashtable)cc.GetType().InvokeMember("m_domainTable",
        System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField |
        System.Reflection.BindingFlags.Instance, null, cc, new object[] { });
    foreach (object pathList in table.Values)
    {
        SortedList lstCookieCol = (SortedList)pathList.GetType().InvokeMember("m_list",
            System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField
            | System.Reflection.BindingFlags.Instance, null, pathList, new object[] { });
        foreach (CookieCollection colCookies in lstCookieCol.Values)
            foreach (Cookie c in colCookies) lstCookies.Add(c);
    }
    return lstCookies;
}
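The same idea, taking `_xsrf` from the cookies received after login rather than from the home page, can be sketched in Python with the standard library's cookie jar. This is an illustrative sketch, not the answerer's code: `find_cookie_value` and `make_cookie` are hypothetical helpers, and the token value is made up.

```python
from http.cookiejar import Cookie, CookieJar


def find_cookie_value(jar, name):
    """Return the value of the first cookie called `name`, or None."""
    for cookie in jar:
        if cookie.name == name:
            return cookie.value
    return None


def make_cookie(name, value, domain="www.zhihu.com"):
    """Build a minimal session cookie for the demo below (hypothetical helper)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


if __name__ == "__main__":
    jar = CookieJar()
    # Stand-in for the cookie a real login response would set.
    jar.set_cookie(make_cookie("_xsrf", "token-from-login"))
    print(find_cookie_value(jar, "_xsrf"))
```

With the `requests` library, `session.cookies` is a `CookieJar` subclass, so it can be passed to `find_cookie_value` after the login request completes.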
Modify the X-Forwarded-For field in the request header to disguise the IP.

What a coincidence, I ran into this problem just last night. There can be many causes; I can only describe the one I hit, for reference only, as one line of thinking. I was crawling Sina Weibo and using a proxy. The 403 means the site refused access. When I did the same thing in a browser, a 403 would appear after viewing only a few pages, but after refreshing a few times it was fine again. In code, the equivalent is simply to send the request a few more times.

After reading the answers upstairs I was instantly intimidated. There really are a lot of experts here, but I suggest the asker go ask Kai-Fu Lee, haha.

Could someone say how the interface was captured? Why can't I capture the interface with Firebug? Chrome's Network panel does not capture it either.
That said, requesting the followees page directly also returns the data; the rest is just regular expressions.
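One reply above suggests that an occasional 403 can be handled by simply repeating the request. A minimal sketch of that idea, assuming the caller supplies a zero-argument `fetch` function that returns a `(status_code, body)` pair; `fetch_with_retry` and its parameters are hypothetical names, and the growing delay is an added assumption rather than something the reply describes.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch() until it returns a status other than 403.

    fetch is any zero-argument callable returning (status_code, body).
    At most max_attempts calls are made; the last result is returned.
    """
    status, body = fetch()
    for attempt in range(1, max_attempts):
        if status != 403:
            break
        sleep(delay_seconds * attempt)  # wait a little longer before each retry
        status, body = fetch()
    return status, body


if __name__ == "__main__":
    # Simulated server: refuses twice, then answers.
    replies = iter([(403, ""), (403, ""), (200, "ok")])
    print(fetch_with_retry(lambda: next(replies), delay_seconds=0.0))
```

Retrying only helps with intermittent refusals. If the 403 comes from a wrong `_xsrf` value or a missing cookie, every attempt fails in the same way.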