Simulating Sina Weibo login with Python (Sina Weibo crawler)

Source: Internet
Author: User

1. Main function (WeiboMain.py):

The code is as follows:
import urllib2
import cookielib

import WeiboEncode
import WeiboSearch

if __name__ == '__main__':
    weiboLogin = WeiboLogin('×××@gmail.com', '××××')  # email (account) and password
    if weiboLogin.Login():
        print "Login successful!"

The first two imports load Python's network programming modules; the next two load the other two files, WeiboEncode.py and WeiboSearch.py (introduced later). The main function creates a WeiboLogin object and then logs in.

2. WeiboLogin class (WeiboMain.py):

The code is as follows:
class WeiboLogin:
    def __init__(self, user, pwd, enableProxy=False):
        "Initialize WeiboLogin. enableProxy indicates whether to use a proxy server; it is disabled by default."

        print "Initializing WeiboLogin..."
        self.userName = user
        self.passWord = pwd
        self.enableProxy = enableProxy

        self.serverUrl = "http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=&rsakt=mod&client=ssologin.js(v1.4.11)&_=1379834957683"
        self.loginUrl = "http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.11)"
        self.postHeader = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/123456'}
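The long serverUrl string can also be assembled from its query fields, which makes each field easier to see. The following is a minimal Python 3 sketch (the article's own code targets Python 2). The helper name build_prelogin_url is hypothetical, and treating the trailing `_` field as a millisecond timestamp is an assumption based on the sample value 1379834957683. Note that urlencode percent-encodes the parentheses in the client value, which is an equivalent form of the same URL.

```python
import time
from urllib.parse import urlencode

PRELOGIN_ENDPOINT = "http://login.sina.com.cn/sso/prelogin.php"


def build_prelogin_url(encoded_user="", timestamp_ms=None):
    """Return a prelogin URL with the same query fields as serverUrl."""
    if timestamp_ms is None:
        # Assumption: "_" is a millisecond timestamp.
        timestamp_ms = int(time.time() * 1000)
    params = [
        ("entry", "weibo"),
        ("callback", "sinaSSOController.preloginCallBack"),
        ("su", encoded_user),
        ("rsakt", "mod"),
        ("client", "ssologin.js(v1.4.11)"),
        ("_", timestamp_ms),
    ]
    return PRELOGIN_ENDPOINT + "?" + urlencode(params)
```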

The initialization function defines two key URL members. self.serverUrl is used for the first step of login (obtaining servertime, nonce, and so on); this first step actually covers steps 1 and 2 of the Sina Weibo login process. self.loginUrl is used in the second step: after the user name and password are encrypted, they are POSTed to this URL, with self.postHeader as the POST header information. This corresponds to step 3 of the Sina Weibo login process. The class has three more functions:

The code is as follows:
    def Login(self):
        "Login program"
        self.EnableCookie(self.enableProxy)  # configure cookies and, optionally, a proxy server

        serverTime, nonce, pubkey, rsakv = self.GetServerTime()  # step 1 of login
        postData = WeiboEncode.PostEncode(self.userName, self.passWord, serverTime, nonce, pubkey, rsakv)  # encrypt the user name and password
        print "Post data length:\n", len(postData)

        req = urllib2.Request(self.loginUrl, postData, self.postHeader)
        print "Posting request..."
        result = urllib2.urlopen(req)  # step 2 of login (step 3 of the Sina Weibo login process)
        text = result.read()
        try:
            loginUrl = WeiboSearch.sRedirectData(text)  # parse the redirect result
            urllib2.urlopen(loginUrl)
        except Exception:
            print 'Login error!'
            return False

        print 'Login success!'
        return True

self.EnableCookie sets up cookies and, optionally, a proxy server. Many free proxy servers are available online, and using one can help keep Sina from blocking your IP address. The first login step then contacts the Sina server to obtain serverTime and the other values, which are used to encrypt the user name and password and to build the POST request. The second step sends the encrypted user name and password to self.loginUrl, receives the redirect information, and parses the final redirect URL out of it. When that URL is opened, the server writes the user's login information into the cookie, and the login is complete.
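For readers on Python 3, where urllib2 was split into urllib.request and urllib.parse, the POST request of the second step could be built roughly as follows. This is a sketch only: the helper name build_login_request is hypothetical, the field values are placeholders rather than real credentials, and the request is constructed but never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(login_url, post_fields, headers):
    """Build (but do not send) the POST request used in step 2 of login."""
    # In Python 3 the request body must be bytes, not str.
    body = urlencode(post_fields).encode("utf-8")
    return Request(login_url, data=body, headers=headers)


request = build_login_request(
    "http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.11)",
    {"entry": "weibo", "su": "<encoded user>", "sp": "<encrypted password>"},
    {"User-Agent": "Mozilla/5.0"},
)
print(request.get_method())  # POST, because a body is attached
```

Attaching a body is what turns the request into a POST; passing the resulting object to urlopen would then send it through whatever opener is installed.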

The code is as follows:
    def EnableCookie(self, enableProxy):
        "Enable cookie & proxy (if needed)."

        cookiejar = cookielib.LWPCookieJar()  # create a cookie jar
        cookie_support = urllib2.HTTPCookieProcessor(cookiejar)

        if enableProxy:
            proxy_support = urllib2.ProxyHandler({'http': 'http://xxxxx.pac'})  # use a proxy
            opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
            print "Proxy enabled"
        else:
            opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)

        urllib2.install_opener(opener)  # install the cookie-aware opener globally

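The EnableCookie function is relatively simple: it creates a cookie jar, optionally adds a proxy handler, and installs the resulting opener so that later urllib2.urlopen calls use it.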
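A rough Python 3 counterpart is sketched below, using http.cookiejar and urllib.request in place of cookielib and urllib2. The function name build_opener_with_cookies and its proxy_url parameter are hypothetical. Unlike EnableCookie, this version returns the opener instead of installing it globally, which keeps the example free of side effects.

```python
import http.cookiejar
import urllib.request


def build_opener_with_cookies(proxy_url=None):
    """Return (opener, cookie_jar); cookies set by responses land in the jar."""
    cookie_jar = http.cookiejar.LWPCookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy_url:
        # Route plain-HTTP requests through the given proxy.
        handlers.insert(0, urllib.request.ProxyHandler({"http": proxy_url}))
    opener = urllib.request.build_opener(*handlers)
    return opener, cookie_jar
```

Calling urllib.request.install_opener(opener) afterwards would reproduce the global behaviour of the original function.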

The code is as follows:
    def GetServerTime(self):
        "Get server time and nonce, which are used to encode the password"

        print "Getting server time and nonce..."
        serverData = urllib2.urlopen(self.serverUrl).read()  # fetch the response body
        print serverData

        try:
            serverTime, nonce, pubkey, rsakv = WeiboSearch.sServerData(serverData)  # parse serverTime, nonce, etc.
            return serverTime, nonce, pubkey, rsakv
        except Exception:
            print 'Get server time & nonce error!'
            raise  # Login() unpacks four values, so returning None would only fail later

The functions in WeiboSearch.py mainly parse the data returned by the server and are relatively simple.

3. sServerData function (WeiboSearch.py):

The code is as follows:
import re
import json

def sServerData(serverData):
    "Search the server time & nonce from server data"

    p = re.compile('\((.*)\)')  # the JSON payload sits inside the callback's parentheses
    jsonData = p.search(serverData).group(1)
    data = json.loads(jsonData)
    serverTime = str(data['servertime'])
    nonce = data['nonce']
    pubkey = data['pubkey']  # hex string, used as the RSA modulus in get_pwd
    rsakv = data['rsakv']  # sent back unchanged in the POST data
    print "Server time is:", serverTime
    print "Nonce is:", nonce
    return serverTime, nonce, pubkey, rsakv

Parsing relies mainly on regular expressions and JSON and is easy to follow. The function that Login uses to parse the redirect result is also in this file:

The code is as follows:
def sRedirectData(text):
    p = re.compile('location\.replace\([\'"](.*?)[\'"]\)')
    loginUrl = p.search(text).group(1)
    print 'loginUrl:', loginUrl
    return loginUrl
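To see what the two regular expressions extract, here is a small self-contained Python 3 sketch that runs the same patterns against made-up sample responses. The sample strings and the function names parse_prelogin and parse_redirect are illustrative assumptions, not captured server output; the sketch also raises a clear error when a pattern does not match, instead of failing on None.

```python
import json
import re

PRELOGIN_RE = re.compile(r"\((.*)\)")
REDIRECT_RE = re.compile(r"""location\.replace\(['"](.*?)['"]\)""")


def parse_prelogin(server_data):
    """Extract the JSON object wrapped in the JSONP callback."""
    match = PRELOGIN_RE.search(server_data)
    if match is None:
        raise ValueError("no JSONP payload found")
    data = json.loads(match.group(1))
    return str(data["servertime"]), data["nonce"], data["pubkey"], data["rsakv"]


def parse_redirect(text):
    """Extract the URL passed to location.replace(...)."""
    match = REDIRECT_RE.search(text)
    if match is None:
        raise ValueError("no location.replace(...) redirect found")
    return match.group(1)


# Made-up samples with the same shape as the responses described above.
sample_prelogin = (
    'sinaSSOController.preloginCallBack({"servertime": 1379834957, '
    '"nonce": "ABC123", "pubkey": "EB2A", "rsakv": "0000000000"})'
)
sample_redirect = '<script>location.replace("http://example.com/next?retcode=0");</script>'

print(parse_prelogin(sample_prelogin))
print(parse_redirect(sample_redirect))
```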

4. Between step 1 and step 2: encrypting the user name and password (WeiboEncode.py)

The code is as follows:
import urllib
import base64
import rsa
import binascii

def PostEncode(userName, passWord, serverTime, nonce, pubkey, rsakv):
    "Used to generate POST data"

    encodedUserName = GetUserName(userName)  # the user name is base64-encoded
    encodedPassWord = get_pwd(passWord, serverTime, nonce, pubkey)  # the password is now encrypted with RSA
    postPara = {
        'entry': 'weibo',
        'gateway': '1',
        'from': '',
        'savestate': '7',
        'userticket': '1',
        'ssosimplelogin': '1',
        'vsnf': '1',
        'vsnval': '',
        'su': encodedUserName,
        'service': 'miniblog',
        'servertime': serverTime,
        'nonce': nonce,
        'pwencode': 'rsa2',
        'sp': encodedPassWord,
        'encoding': 'UTF-8',
        'prelt': '123',
        'rsakv': rsakv,
        'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
        'returntype': 'META'
    }
    postData = urllib.urlencode(postPara)  # URL-encode the form fields
    return postData

The PostEncode function builds the POST message body, which must match what a real login sends. The difficult part is how the user name and password are encoded and encrypted:

The code is as follows:
def GetUserName(userName):
    "Used to encode user name"

    userNameTemp = urllib.quote(userName)
    userNameEncoded = base64.encodestring(userNameTemp)[:-1]  # drop the trailing newline
    return userNameEncoded


def get_pwd(password, servertime, nonce, pubkey):
    rsaPublickey = int(pubkey, 16)
    key = rsa.PublicKey(rsaPublickey, 65537)  # create the public key
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)  # plaintext is concatenated the same way as in the site's JS encryption file
    passwd = rsa.encrypt(message, key)  # encrypt
    passwd = binascii.b2a_hex(passwd)  # convert the ciphertext to hexadecimal
    return passwd

Sina's login process used to hash the password with SHA1 and has since switched to RSA; it may change again in the future. Python has implementations of all the common encryption algorithms, though, so once you have identified the encryption method in use, the program is easy to adapt.
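The two encoding steps can be illustrated with the Python 3 standard library alone. The sketch below is not the article's code: instead of the third-party rsa package it hand-rolls RSA encryption with PKCS#1 v1.5 padding (the padding scheme that rsa.encrypt applies), purely to show what happens to the servertime, nonce, and password string. The function names are hypothetical, and hand-written RSA like this is for illustration only, not for production use.

```python
import base64
import binascii
import os
from urllib.parse import quote


def encode_username(user_name):
    """URL-quote the user name, then base64-encode it."""
    return base64.b64encode(quote(user_name).encode("ascii")).decode("ascii")


def pkcs1_v15_encrypt(message, n, e=65537):
    """Encrypt bytes with the RSA public key (n, e) using PKCS#1 v1.5 padding."""
    k = (n.bit_length() + 7) // 8  # modulus length in bytes
    if len(message) > k - 11:
        raise ValueError("message too long for this key size")
    pad_len = k - len(message) - 3
    padding = b""
    while len(padding) < pad_len:
        # The padding bytes must be random and non-zero.
        chunk = os.urandom(pad_len - len(padding))
        padding += chunk.replace(b"\x00", b"")
    block = b"\x00\x02" + padding + b"\x00" + message
    cipher_int = pow(int.from_bytes(block, "big"), e, n)
    return cipher_int.to_bytes(k, "big")


def encode_password(password, servertime, nonce, pubkey_hex):
    """Join servertime, nonce and password as get_pwd does, encrypt, hex-encode."""
    n = int(pubkey_hex, 16)
    message = "{}\t{}\n{}".format(servertime, nonce, password).encode("utf-8")
    return binascii.b2a_hex(pkcs1_v15_encrypt(message, n)).decode("ascii")
```

Because the padding is random, encrypting the same password twice gives different ciphertexts, so the result can only be checked by decrypting with the matching private key.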

At this point, Python has successfully simulated a login to Sina Weibo. Running the program prints:

loginUrl: http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack&ssosavestate=1390390056&ticket=ST-MzQ4NzQ5NTYyMA=-found&retcode=0
Login success!

To crawl information from Weibo, add crawler and parsing modules after the main function. For example, to read the content of a Weibo page:

The code is as follows:
htmlContent = urllib2.urlopen(myurl).read()  # fetch the full HTML content of the page at myurl

You can design different crawler modules for different requirements. That concludes the simulated login code.
