Python crawler Sina Weibo login

Source: Internet
Author: User
Tags base64

Having previously covered Fiddler and some common anti-crawling measures, we now turn to one of the harder ones: JavaScript encryption. The Weibo login flow encrypts its parameters in JS, so today we work through it.

Analysis Process

First we capture the traffic, from the moment of logging in until the Weibo home page has finished loading. We look at the login operation first and the home page request second. A login is generally a POST request, so let's search for one:

We find that the login URL is https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19). Click the WebForms tab to view the parameters:

Many parameters have to be submitted. In general, a parameter value falls into one of 3 cases:

    1. The value is fixed. If the value stays the same across several captures, we can treat it as a constant.
    2. The value comes from an earlier server response. If the value changes between captures, search for it in Fiddler and see whether it appears in an earlier response. Examples are nonce, rsakv and servertime.
    3. The value is generated by JS. If the value is neither fixed nor found in any earlier response, it is most likely computed by the JS code.
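The three cases above can be checked mechanically once a few login attempts have been captured. The following is only a minimal sketch: the function name `classify_parameter` and the sample values are hypothetical, and the captured parameters and response bodies are assumed to have been exported from Fiddler by hand.

```python
from typing import Dict, List


def classify_parameter(name: str,
                       captures: List[Dict[str, str]],
                       earlier_responses: List[str]) -> str:
    """Classify one POST parameter using several captured login attempts.

    captures: one dict of form parameters per captured login request.
    earlier_responses: bodies received before the most recent login request.
    """
    values = [capture[name] for capture in captures]
    # Case 1: the value never changes between captures.
    if len(set(values)) == 1:
        return "fixed"
    # Case 2: the most recent value appears verbatim in an earlier response.
    latest = values[-1]
    if latest and any(latest in body for body in earlier_responses):
        return "from an earlier response"
    # Case 3: neither fixed nor found, so it is probably computed in JS.
    return "probably generated by JS"


# Hypothetical sample data standing in for two captured login requests.
captures = [
    {"entry": "weibo", "nonce": "OLD111", "sp": "aa11"},
    {"entry": "weibo", "nonce": "ABC123", "sp": "bb22"},
]
earlier_responses = ['{"nonce":"ABC123","rsakv":"1330428213"}']

for name in ("entry", "nonce", "sp"):
    print(name, "->", classify_parameter(name, captures, earlier_responses))
```

A plain substring search is enough here because Fiddler's own search works the same way; it will miss values that are transformed (for example URL-encoded) between the response and the request.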

We search for the nonce in Fiddler:

Fiddler highlights an earlier request, which means the value already appeared there. Clicking that request, we find the value in its response:

The parameter is indeed there. So to log in we need the value of nonce, and to get it we must first send this earlier request. Its URL is https://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=mtgzmti0otmxmdc%3d&rsakt=mod&checkpin=1&client=ssologin.js(v1.4.19)&_=1533119627438. The su parameter in this URL is explained later. The last parameter looks like a timestamp, so for now we can simulate it with the current timestamp. The other required parameters, such as servertime and rsakv, can also be found in this response.
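As a rough illustration of this pre-login step, the sketch below builds the URL with a millisecond timestamp and pulls the parameters out of a response body. It assumes the body is a JSONP call wrapping a JSON object; the helper names and the sample body (including its values) are hypothetical and not taken from a real capture.

```python
import json
import re
import time
from urllib.parse import quote

PRELOGIN_URL = (
    "https://login.sina.com.cn/sso/prelogin.php?entry=weibo"
    "&callback=sinaSSOController.preloginCallBack&su={su}"
    "&rsakt=mod&checkpin=1&client=ssologin.js(v1.4.19)&_={ts}"
)


def get_timestamp() -> int:
    """Return the current time as a 13-digit millisecond timestamp."""
    return int(time.time() * 1000)


def build_prelogin_url(su: str) -> str:
    """Fill in the URL-quoted su value and the timestamp."""
    return PRELOGIN_URL.format(su=quote(su), ts=get_timestamp())


def parse_prelogin(body: str) -> dict:
    """Extract the JSON object from a callback({...}) style body."""
    match = re.search(r"\((\{.*\})\)", body, re.S)
    if match is None:
        raise ValueError("unexpected prelogin response")
    return json.loads(match.group(1))


# Hypothetical response body; a real one carries a much longer pubkey.
sample = ('sinaSSOController.preloginCallBack({"retcode":0,'
          '"servertime":1533119627,"nonce":"ABC123",'
          '"pubkey":"00ff","rsakv":"1330428213"})')

params = parse_prelogin(sample)
print(params["nonce"], params["rsakv"], params["servertime"])
print(build_prelogin_url("dGVzdA=="))
```

Parsing the body as JSON is one option; matching each field with its own regular expression, as the implementation later in this article does, works as well and does not depend on the body being valid JSON.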

Now we have dealt with most of the parameters, but two tough ones remain: sp and su. Neither value appears in any earlier response. We also notice that the account and password we typed in do not appear anywhere among the parameters, so we make a bold guess: su and sp are the encrypted account and password. How do we work out their values? The answer is to find the corresponding JS code and rewrite it in Python.

The problem now is how to locate the JS code that produces these two values. When we covered Chrome debugging earlier we learned how to set breakpoints, and that is the technique to use here. Every login is triggered by clicking the "Login" button on the page, so we fill in the account and password, set a click event listener breakpoint, and then click Login. Execution pauses while the login is in progress, which is exactly when su and sp are being encrypted.

We then use the stepping buttons in the upper right corner of the debugger to step through the code.

Note: plain assignments can usually be stepped over. Function calls are worth stepping into, especially when important parameters are passed to the function. We can also inspect the values of variables in the console.

In addition, if we step out of a function and the cursor is still on the same line, that line contains another function call. Do not step over it directly, because a lot of key information is often inside that function.

We will not go through the whole process of locating the Weibo login JS here. In the end we locate the code that encrypts su and sp.

It shows that su is encoded with base64 and sp is encrypted with RSA, so we can reimplement the JavaScript logic in Python.
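A minimal sketch of the two transformations, using only the standard library: su is the base64-encoded username (URL-quoted when placed in a query string), and the plaintext that gets RSA-encrypted into sp is assembled from servertime, nonce and the password, following the implementation given later in this article. The function names and sample values are hypothetical, and the RSA step itself is left out here because it needs the public key returned by the pre-login request.

```python
import base64
from urllib.parse import quote, unquote


def encode_su(username: str) -> str:
    """Base64-encode the username, as the login JS does for su."""
    return base64.b64encode(username.encode()).decode()


def build_sp_plaintext(servertime, nonce: str, password: str) -> bytes:
    """Assemble the message that is later RSA-encrypted into sp."""
    return "{}\t{}\n{}".format(servertime, nonce, password).encode()


# Hypothetical account and pre-login values.
su = encode_su("user@example.com")
su_for_url = quote(su)  # '=' padding must be percent-encoded in a URL
plaintext = build_sp_plaintext(1533119627, "ABC123", "secret")

print(su_for_url)
print(plaintext)
```

The resulting plaintext is then encrypted with the server's public key (public exponent 0x10001) and hex-encoded to give sp.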

At this point the login problem is solved. Now let's look at the request for the home page. Going through the captured requests one by one, we find that the home page request is as follows:

The URL is https://weibo.com/u/6505689778/home?wvr=5&lf=reg, and it contains the value 6505689778. Searching for this value in Fiddler, we find it in the response to the request https://passport.weibo.com/wbsso/login?ticket=ST-NjUwNTY4OTc3OA%3D%3D-1533119623-gz-0deaf5775e6f1d983147b0b96ee915b9-1&ssosavestate=1564655623&callback=sinaSSOController.doCrossDomainCallBack&scriptId=ssoscript0&client=ssologin.js(v1.4.19)&_=1533119634900.
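To illustrate how that value can be turned into the home page URL, here is a small sketch. It assumes the response contains a `"uniqueid":"..."` field, which is what the regular expression in the implementation later in this article looks for; the sample body and the helper name are hypothetical.

```python
import re


def extract_uid(body: str) -> str:
    """Pull the numeric user id out of the passport.weibo.com response."""
    match = re.search(r'"uniqueid":"(.*?)"', body)
    if match is None:
        raise ValueError("uniqueid not found in response")
    return match.group(1)


# Hypothetical, simplified response body.
sample = ('sinaSSOController.doCrossDomainCallBack({"result":true,'
          '"userinfo":{"uniqueid":"6505689778","displayname":"example"}});')

uid = extract_uid(sample)
home_url = "https://weibo.com/u/{}/home?wvr=5&lf=reg".format(uid)
print(home_url)
```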

To request that page we also need the values of the ticket and ssosavestate parameters. Searching again, we find both values in the response to another request: https://login.sina.com.cn/crossdomain2.php?action=login&entry=weibo&r=https%3a%2f%2fpassport.weibo.com%2fwbsso%2flogin%3fssosavestate%3d1564655623%26url%3dhttps%253a%252f%252fweibo.com%252fajaxlogin.php%253fframelogin%253d1%2526callback%253dparent.sinaSSOController.feedBackUrlCallBack%2526sudaref%253dweibo.com%26display%3d0%26ticket%3dST-NjUwNTY4OTc3OA%3d%3d-1533119623-gz-39b6b6d3d3979d6da2860b54e4e61a01-1%26retcode%3d0&login_time=1533119622&sign=0db5e9f42ceb691c&sr=1536%2a864.

So where does this long URL come from? Searching once more, we find it in the response to the login request.
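The chain described above (login response, then jump URL, then ticket and ssosavestate) can be sketched with two regular expressions. The patterns mirror the ones used in the implementation later in this article; the helper names and both sample bodies are hypothetical and heavily simplified, so real responses will look different in detail.

```python
import re
from typing import Tuple


def extract_redirect_url(login_body: str) -> str:
    """Find the jump URL inside the location.replace("...") call."""
    match = re.search(r'location\.replace\("(.*?)"\)', login_body)
    if match is None:
        raise ValueError("no redirect found in login response")
    return match.group(1)


def extract_ticket(crossdomain_body: str) -> Tuple[str, str]:
    """Return (ticket, ssosavestate) from the jump page's response."""
    match = re.search(r'ticket=(.*?)&ssosavestate=(.*?)"', crossdomain_body)
    if match is None:
        raise ValueError("ticket not found in jump page")
    return match.group(1), match.group(2)


# Hypothetical, simplified bodies standing in for the two responses.
login_body = ('<script>location.replace("https://login.sina.com.cn/'
              'crossdomain2.php?action=login&entry=weibo");</script>')
crossdomain_body = ('{"arrURL":["https://passport.weibo.com/wbsso/login'
                    '?ticket=ST-abc-1&ssosavestate=1564655623"]}')

print(extract_redirect_url(login_body))
print(extract_ticket(crossdomain_body))
```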

With that, we have traced the whole flow.

Let's go through the steps:

1. Encrypt the account and password to get the ciphertext.

2. Request https://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=mtgzmti0otmxmdc%3d&rsakt=mod&checkpin=1&client=ssologin.js(v1.4.19)&_=1533119627438 to get nonce, rsakv and the other parameters.

3. Build the parameters and request the login URL https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19), then extract the jump URL from the response.

4. Request the jump URL https://login.sina.com.cn/crossdomain2.php?action=login&entry=weibo&r=https%3a%2f%2fpassport.weibo.com%2fwbsso%2flogin%3fssosavestate%3d1564655623%26url%3dhttps%253a%252f%252fweibo.com%252fajaxlogin.php%253fframelogin%253d1%2526callback%253dparent.sinaSSOController.feedBackUrlCallBack%2526sudaref%253dweibo.com%26display%3d0%26ticket%3dST-NjUwNTY4OTc3OA%3d%3d-1533119623-gz-39b6b6d3d3979d6da2860b54e4e61a01-1%26retcode%3d0&login_time=1533119622&sign=0db5e9f42ceb691c&sr=1536%2a864 to get the values of the ticket and ssosavestate parameters.

5. Request https://passport.weibo.com/wbsso/login?ticket=ST-NjUwNTY4OTc3OA%3D%3D-1533119623-gz-0deaf5775e6f1d983147b0b96ee915b9-1&ssosavestate=1564655623&callback=sinaSSOController.doCrossDomainCallBack&scriptId=ssoscript0&client=ssologin.js(v1.4.19)&_=1533119634900 to get the uniqueid parameter.

6. Request the home page: https://weibo.com/u/6505689778/home?wvr=5&lf=reg

At this point we have successfully logged in to Weibo. Any data you want from Weibo can now be requested with the same session.

Implementation code
    import base64
    import random
    import re
    import time
    from binascii import b2a_hex
    from urllib.parse import quote

    import requests
    import rsa
    import urllib3

    urllib3.disable_warnings()  # suppress the warnings caused by verify=False


    def get_timestamp():
        """Return a 13-digit (millisecond) timestamp."""
        return int(time.time() * 1000)


    class WeiBo:
        def __init__(self, username, password):
            self.username = username
            self.password = password
            self.session = requests.session()  # log in with a session so cookies persist
            self.session.headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                              '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
            }
            self.session.verify = False  # disable certificate verification

        def prelogin(self):
            """Pre-login: fetch the parameters required for the login request."""
            # Reading the JS shows that the username is base64-encoded
            self.su = base64.b64encode(self.username.encode())
            url = ('https://login.sina.com.cn/sso/prelogin.php?entry=weibo'
                   '&callback=sinaSSOController.preloginCallBack&su={}&rsakt=mod'
                   '&checkpin=1&client=ssologin.js(v1.4.19)&_={}'
                   ).format(quote(self.su), get_timestamp())  # note: su must be URL-quoted
            response = self.session.get(url).content.decode()
            # print(response)
            self.nonce = re.findall(r'"nonce":"(.*?)"', response)[0]
            self.pubkey = re.findall(r'"pubkey":"(.*?)"', response)[0]
            self.rsakv = re.findall(r'"rsakv":"(.*?)"', response)[0]
            self.servertime = re.findall(r'"servertime":(.*?),', response)[0]
            return self.nonce, self.pubkey, self.rsakv, self.servertime

        def get_sp(self):
            """RSA-encrypt the plaintext password; the rules come from reading the JS."""
            publickey = rsa.PublicKey(int(self.pubkey, 16), int('10001', 16))
            message = str(self.servertime) + '\t' + str(self.nonce) + '\n' + str(self.password)
            self.sp = rsa.encrypt(message.encode(), publickey)
            return b2a_hex(self.sp)

        def login(self):
            url = 'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19)'
            data = {
                'entry': 'weibo',
                'gateway': '1',
                'from': '',
                'savestate': '7',
                'qrcode_flag': 'false',
                'useticket': '1',
                'pagerefer': 'https://login.sina.com.cn/crossdomain2.php?action=logout'
                             '&r=https%3A%2F%2Fweibo.com%2Flogout.php%3Fbackurl%3D%252F',
                'vsnf': '1',
                'su': self.su,
                'service': 'miniblog',
                'servertime': str(int(self.servertime) + random.randint(1, 20)),
                'nonce': self.nonce,
                'pwencode': 'rsa2',
                'rsakv': self.rsakv,
                'sp': self.get_sp(),
                'sr': '1536*864',
                'encoding': 'UTF-8',
                'prelt': '*',  # value is garbled in the source; use the one from your own capture
                'url': 'https://weibo.com/ajaxlogin.php?framelogin=1'
                       '&callback=parent.sinaSSOController.feedBackUrlCallBack',
                'returntype': 'META',
            }
            # Submit the account, password and other parameters
            response = self.session.post(url, data=data, allow_redirects=False).text
            # Weibo jumps after the data is submitted; extract the jump URL here
            redirect_url = re.findall(r'location.replace\("(.*?)"\);', response)[0]
            # Request the jump page
            result = self.session.get(redirect_url, allow_redirects=False).text
            # Get the ticket and ssosavestate parameters
            ticket, ssosavestate = re.findall(r'ticket=(.*?)&ssosavestate=(.*?)"', result)[0]
            uid_url = ('https://passport.weibo.com/wbsso/login?ticket={}&ssosavestate={}'
                       '&callback=sinaSSOController.doCrossDomainCallBack&scriptId=ssoscript0'
                       '&client=ssologin.js(v1.4.19)&_={}'
                       ).format(ticket, ssosavestate, get_timestamp())
            # Request the page that returns the uid
            data = self.session.get(uid_url).text
            uid = re.findall(r'"uniqueid":"(.*?)"', data)[0]
            print(uid)
            home_url = 'https://weibo.com/u/{}/home?wvr=5&lf=reg'.format(uid)
            # Request the home page
            html = self.session.get(home_url)
            html.encoding = 'utf-8'
            print(html.text)

        def main(self):
            self.prelogin()
            self.get_sp()
            self.login()


    if __name__ == '__main__':
        username = 'xxxxxxxxx'  # Weibo account
        password = 'xxxxxxxxx'  # Weibo password
        weibo = WeiBo(username, password)
        weibo.main()
