I. Must you log in to crawl Weibo?
At present, most Weibo crawlers work by simulating a login with a Weibo account. In real operation this is quite painful: you may spend every day worrying that "Daddy Sina" will ban your accounts, and now that real-name registration is required, the channels for obtaining accounts keep shrinking.
But life has to go on, and under such harsh conditions crawlers must evolve to survive. The good news is that when God closes a door, he opens a window: under pressure from other new media platforms such as Toutiao, Weibo has gradually loosened its restrictions on viewing content. Today you can still see a lot of posts without logging in, and that is our foothold.
This article describes in detail how to obtain the relevant cookies and load them into httpclient, so that a crawler can collect data from Weibo without logging in. We start from the Weibo home page, http://weibo.com.
II. Preparatory work
Preparation is simple: a modern browser (you know why I wrote "modern") and httpclient (I use version 4.5.3).
Like a crawler that logs in, a login-free crawler also needs to load cookies. These cookies identify the visitor, and with them the crawler can access whatever content Weibo allows visitors to see.
We can use the browser's network tool to see what the server returns after a request to http://weibo.com. Clear the browser cache beforehand, of course.
My blog: Code encyclopedia: www.codedq.net; amateur grass: www.xttblog.com; love to share: www.ndislwf.com or ifxvn.com.
If nothing unexpected happens, you should see something like the following figure:
The 1st request to weibo.com returns status 302 (a redirect), which means the page has not really started loading yet, while the last request to weibo.com returns status 200, which means it succeeded. Compare the headers of the two requests:
Clearly, the steps in between load various cookies onto the client so that the page can be accessed successfully. Let's analyze them one by one.
III. Following the threads
The 2nd request is https://passport.weibo.com/vi ... You can copy this URL and access it separately with httpclient. It returns an HTML page containing a large block of JavaScript, and its head also references a JS file, mini_original.js, which is the 3rd request. The script does many things that I will not go through one by one; in short, it handles Weibo's access control. The function worth our attention is:
// Give the user a visitor identity.
var incarnate = function (tid, where, conficence) {
    var gen_conf = "";
    var from = "weibo";
    var incarnate_intr = window.location.protocol
        + "//" + window.location.host + "/visitor/visitor?a=incarnate&t="
        + encodeURIComponent(tid) + "&w=" + encodeURIComponent(where)
        + "&c=" + encodeURIComponent(conficence) + "&gc="
        + encodeURIComponent(gen_conf) + "&cb=cross_domain&from="
        + from + "&_rand=" + Math.random();
    url.l(incarnate_intr);
};
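The same URL can be assembled on the crawler side. The following is a minimal Java sketch, not Weibo's or the author's code: the class and method names are hypothetical, the base address is passed in by the caller (the host of the truncated passport.weibo.com URL is assumed in the example call), and `URLEncoder` is only a rough stand-in for JavaScript's `encodeURIComponent`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that mirrors the string concatenation in incarnate().
class IncarnateUrl {

    // Approximation of encodeURIComponent: URLEncoder turns a space into "+",
    // so it is rewritten to "%20". A literal "+" is already encoded as "%2B".
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // base is protocol + "//" + host of the page that runs the script (assumed by the caller).
    static String build(String base, String tid, int where, String confidence, double rand) {
        String genConf = ""; // gen_conf is an empty string in the script
        return base + "/visitor/visitor?a=incarnate&t=" + encode(tid)
                + "&w=" + encode(String.valueOf(where))
                + "&c=" + encode(confidence)
                + "&gc=" + encode(genConf)
                + "&cb=cross_domain&from=weibo"
                + "&_rand=" + rand;
    }

    public static void main(String[] args) {
        // Placeholder tid; a real one comes from the server.
        System.out.println(build("https://passport.weibo.com", "ab+c/d=", 2, "100", Math.random()));
    }
}
```

Passing the random value in as an argument keeps the builder deterministic, which makes it easier to check the encoding of a base64-style tid containing `+`, `/` and `=`.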
This gives the requester a visitor identity. The redirect URL it builds is a concatenation of several parameters, and it is the 6th request in the figure above. So the next job is to obtain these 3 parameters: tid, w (where), and c (conficence; judging from what follows it should be confidence, presumably a typo by a Sina engineer). Reading further into the source, you can see that this function is the callback passed to the tid.get method, and that tid is an object defined in mini_original.js. Part of its source is:
var tid = {
    key: 'tid',
    value: '',
    recover: 0,
    confidence: '',
    postInterface: postUrl,
    fpCollectInterface: sendUrl,
    callbackStack: [],
    init: function () {
        tid.get();
    },
    runStack: function () {
        var f;
        while (f = tid.callbackStack.pop()) {
            f(tid.value, tid.recover, tid.confidence); // note: these correspond to the 3 parameters above
        }
    },
    get: function (callback) {
        callback = callback || function () {};
        tid.callbackStack.push(callback);
        if (tid.value) {
            return tid.runStack();
        }
        Store.DB.get(tid.key, function (v) {
            if (!v) {
                tid.getTidFromServer();
            } else {...}
        });
    },
    ......
}
...
getTidFromServer: function () {
    tid.getTidFromServer = function () {};
    if (window.use_fp) {
        getFp(function (data) {
            util.postData(window.location.protocol + '//' + window.location.host + '/' + tid.postInterface,
                'cb=gen_callback&fp=' + encodeURIComponent(data),
                function (res) { if (res) { eval(res); } });
        });
    } else {
        util.postData(window.location.protocol + '//' + window.location.host + '/' + tid.postInterface,
            'cb=gen_callback',
            function (res) { if (res) { eval(res); } });
    }
},
...
// get the parameters
window.gen_callback = function (fp) {
    var value = false, confidence;
    if (fp) {
        if (fp.retcode == 20000000) {
            confidence = typeof(fp.data.confidence) != 'undefined' ? '000' + fp.data.confidence : '100';
            tid.recover = fp.data.new_tid ? 3 : 2;
            tid.confidence = confidence = confidence.substring(confidence.length - 3);
            value = fp.data.tid;
            Store.DB.set(tid.key, value + '__' + confidence);
        }
    }
    tid.value = value;
    tid.runStack();
};
Clearly, tid.runStack() is where the callbacks are actually executed, and there you can see the 3 parameters being passed in. In the get method, when the stored value (the cookie) is empty, tid calls getTidFromServer, which produces the 5th request, https://passport.weibo.com/vi ... This request needs two parameters, cb and fp, whose values can be treated as constants.
The request returns a string of JSON:
{
    "msg": "succ",
    "data": {
        "new_tid": false,
        "confidence": "",
        "tid": "Kirvlolhrcr5iscc80twqdymwbvlrvlny2+yvcq1vva="
    },
    "retcode": 20000000
}
It contains tid and confidence. My guess is that confidence is a measure of how likely the client is to be genuine. It is not always present, and according to the window.gen_callback method it defaults to 100 when absent. In addition, the where parameter equals 3 when new_tid is true and 2 otherwise.
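These two rules are easy to port to the crawler. The sketch below is a hedged Java rendering of what window.gen_callback appears to do; the class and method names are hypothetical, and the zero-padding to three characters is an assumption based on how the script keeps only the last three characters of the confidence string.

```java
// Hypothetical helper that derives the w and c parameters from the tid response.
class VisitorParams {

    // where (w): 3 when the server reports new_tid = true, otherwise 2.
    static int where(boolean newTid) {
        return newTid ? 3 : 2;
    }

    // confidence (c): "100" when the field is absent (null here);
    // otherwise left-pad with zeros and keep the last three characters (assumed padding).
    static String confidence(String raw) {
        if (raw == null) {
            return "100";
        }
        String padded = "000" + raw;
        return padded.substring(padded.length() - 3);
    }

    public static void main(String[] args) {
        System.out.println(where(false) + " " + confidence(null));
    }
}
```

Representing a missing field as null keeps the "absent means 100" rule explicit instead of hiding it in JSON parsing code.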
Now that all 3 parameters have been obtained, you can use httpclient to issue the 6th request above, which returns another string of JSON:
{
    "msg": "succ",
    "data": {
        "sub": "_2akmu428tf8nxqwjrmpacxwzmzyh_Zqjeiekyv572jrmxhrl-yt83qnmgtrcnhyr4ezqqzqrbro3gvmwm5zb2hq ...",
        "subp": "0033WRSXQPXFM72-WS9JQGMF55529P9D9WWU2MGYNITKSS2AWP.AX-DQ"
    },
    "retcode": 20000000
}
Compare this with the header of the last request to weibo.com: sub and subp here are exactly the cookie values we need. You may have one small question: where does the first cookie come from? It is set when weibo.com is first visited, and testing shows that it does not need to be loaded.
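As an illustration, the two values can be pulled out of the response text and joined into a single Cookie header value. This is only a sketch with hypothetical names: it uses a regular expression rather than a JSON library, and it assumes the cookies are sent under the names SUB and SUBP, which should be checked against the request header in the browser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that turns the 6th response into a Cookie header value.
class VisitorCookies {

    // Returns the string value of "name": "..." in the body, or null if not found.
    static String field(String body, String name) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(body);
        return m.find() ? m.group(1) : null;
    }

    // Builds "SUB=...; SUBP=..." (cookie names are an assumption, see above).
    static String cookieHeader(String body) {
        String sub = field(body, "sub");
        String subp = field(body, "subp");
        if (sub == null || subp == null) {
            throw new IllegalArgumentException("sub or subp missing from response");
        }
        return "SUB=" + sub + "; SUBP=" + subp;
    }

    public static void main(String[] args) {
        // Placeholder values, not real cookies.
        String body = "{\"msg\":\"succ\",\"data\":{\"sub\":\"abc\",\"subp\":\"xyz\"},\"retcode\":20000000}";
        System.out.println(cookieHeader(body));
    }
}
```

The resulting string can then be set as the Cookie request header, or the two values can be added to the client's cookie store individually.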
Finally, load these two cookies into httpclient and request weibo.com again, and you get the complete HTML page. Now for the moment of truth:
<!doctype html>