Simulating Sina Weibo login with Python (Sina Weibo crawler)

Source: Internet
Author: User
1. Main function (weibomain.py):

The code is as follows:

    import urllib2
    import cookielib

    import weiboencode
    import weibosearch

    if __name__ == '__main__':
        weibologin = WeiboLogin('xxx@gmail.com', 'xxxx')  # email (account), password
        if weibologin.Login() == True:
            print "Login successful!"

The first two imports load Python's network programming modules; the next two load the other two files, weiboencode.py and weibosearch.py (described later). The main function creates a login object and then logs in.

2. WeiboLogin class (weibomain.py):

The code is as follows:

    class WeiboLogin:
        def __init__(self, user, pwd, enableproxy=False):
            "Initialize WeiboLogin; enableproxy indicates whether to use a proxy server (off by default)"

            print "Initializing WeiboLogin ..."
            self.username = user
            self.password = pwd
            self.enableproxy = enableproxy

            self.serverurl = "http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=&rsakt=mod&client=ssologin.js(v1.4.11)&_=1379834957683"
            self.loginurl = "http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.11)"
            self.postheader = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0'}

The initialization function defines two key URL members. self.serverurl is used for the first step of the login (fetching servertime, nonce, and so on); this first step covers steps 1 and 2 of the "Sina Weibo login process analysis". self.loginurl is used for the second step: after the user name and password are encrypted, they are POSTed to this URL, with self.postheader as the header of the POST. This corresponds to step 3 of the login process analysis. The class also has three member functions:

The code is as follows:

        def Login(self):
            "Login procedure"
            self.EnableCookie(self.enableproxy)  # cookie / proxy server setup

            serverTime, nonce, pubkey, rsakv = self.GetServerTime()  # first step of the login
            postData = weiboencode.PostEncode(self.username, self.password, serverTime, nonce, pubkey, rsakv)  # encrypt the user name and password
            print "Post data length:\n", len(postData)

            req = urllib2.Request(self.loginurl, postData, self.postheader)
            print "Posting request ..."
            result = urllib2.urlopen(req)  # second step of the login (step 3 of the Sina Weibo login process analysis)
            text = result.read()
            try:
                loginUrl = weibosearch.sRedirectData(text)  # parse the redirect result
                urllib2.urlopen(loginUrl)
            except:
                print 'Login error!'
                return False

            print 'Login sucess!'
            return True

self.EnableCookie sets up cookies and the proxy server. Many free proxy servers are available on the network and can be used to keep Sina from blocking your IP. The first login step then accesses the Sina server to get servertime and related information, which is used to encrypt the user name and password and build the POST request. The second step sends the user name and password to self.loginurl. The response contains redirect information; parsing it yields the final URL to jump to. When that URL is opened, the server automatically writes the user's login information into the cookie, and the login succeeds.
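The second step boils down to URL-encoding a form body and attaching it to a request together with the header. A minimal sketch, assuming the Python 3 standard library (urllib.request and urllib.parse, the successors of the Python 2 urllib2 and urllib modules used in this article). The helper name build_login_request and the placeholder form values are illustrative assumptions, and nothing is sent over the network:

```python
from urllib.parse import urlencode
from urllib.request import Request

LOGIN_URL = "http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.11)"
POST_HEADER = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0"
}


def build_login_request(form_fields):
    """Return a POST request for the login URL (hypothetical helper)."""
    # urlencode() yields a str; urllib.request needs the body as bytes.
    body = urlencode(form_fields).encode("ascii")
    # Passing data makes urllib issue a POST instead of a GET.
    return Request(LOGIN_URL, data=body, headers=POST_HEADER)


# Placeholder values only; the real fields come from the encoding step.
request = build_login_request({"entry": "weibo", "su": "placeholder", "sp": "placeholder"})
print(request.get_method())
print(request.data)
```

Building the request separately from sending it makes it easy to inspect the body before anything reaches the server.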

The code is as follows:

        def EnableCookie(self, enableproxy):
            "Enable cookie & proxy (if needed)."

            cookiejar = cookielib.LWPCookieJar()  # create the cookie jar
            cookie_support = urllib2.HTTPCookieProcessor(cookiejar)

            if enableproxy:
                proxy_support = urllib2.ProxyHandler({'http': 'http://xxxxx.pac'})  # use a proxy
                opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
                print "Proxy enabled"
            else:
                opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)

            urllib2.install_opener(opener)  # install the opener that carries the cookie jar

The EnableCookie function is relatively simple.
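For readers on Python 3, the same idea can be sketched with http.cookiejar and urllib.request, which replace cookielib and urllib2. This is only an illustration: the helper name build_cookie_opener and its proxy_url parameter are assumptions, and the opener is returned rather than installed globally.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener(proxy_url=None):
    """Return (opener, cookie_jar); proxy_url is optional (hypothetical helper)."""
    cookie_jar = http.cookiejar.LWPCookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy_url:
        # Route plain-HTTP traffic through the given proxy.
        handlers.insert(0, urllib.request.ProxyHandler({"http": proxy_url}))
    opener = urllib.request.build_opener(*handlers)
    return opener, cookie_jar


opener, cookie_jar = build_cookie_opener()
# Cookies set by any response fetched with opener.open() end up in cookie_jar.
print(len(cookie_jar))
```

Returning the opener keeps the cookie state explicit; calling urllib.request.install_opener(opener) afterwards would mirror the global behaviour of the function above.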

The code is as follows:

        def GetServerTime(self):
            "Get server time and nonce, which are used to encode the password"

            print "Getting server time and nonce ..."
            serverData = urllib2.urlopen(self.serverurl).read()  # fetch the page content
            print serverData

            try:
                serverTime, nonce, pubkey, rsakv = weibosearch.sServerData(serverData)  # parse out serverTime, nonce, etc.
                return serverTime, nonce, pubkey, rsakv
            except:
                print 'Get server time & nonce error!'
                return None

The functions in weibosearch.py mainly parse the data returned by the server and are relatively straightforward.

3. sServerData function (weibosearch.py):

The code is as follows:

    import re
    import json

    def sServerData(serverData):
        "Search the server time & nonce from server data"

        p = re.compile('\((.*)\)')
        jsonData = p.search(serverData).group(1)
        data = json.loads(jsonData)
        serverTime = str(data['servertime'])
        nonce = data['nonce']
        pubkey = data['pubkey']
        rsakv = data['rsakv']
        print "Server time is:", serverTime
        print "Nonce is:", nonce
        return serverTime, nonce, pubkey, rsakv

Parsing mainly uses a regular expression and JSON and is easy to follow. The function that Login uses to parse the redirect result is also in this file:

The code is as follows:

    def sRedirectData(text):
        p = re.compile('location\.replace\([\'"](.*?)[\'"]\)')
        loginUrl = p.search(text).group(1)
        print 'loginUrl:', loginUrl
        return loginUrl
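To see what the two regular expressions pick out, here is a small Python 3 sketch run against made-up inputs. The sample strings, their values, and the function names parse_prelogin and parse_redirect are assumptions for illustration; only the callback wrapper, the four JSON keys, and the location.replace pattern come from the code above.

```python
import json
import re

# Hypothetical sample responses; the values are invented.
SAMPLE_PRELOGIN = (
    'sinaSSOController.preloginCallBack({"servertime":1379834957,'
    '"nonce":"ABC123","pubkey":"EB2A","rsakv":"0000000000"})'
)
SAMPLE_REDIRECT = '<script>location.replace("http://weibo.com/ajaxlogin.php?retcode=0");</script>'


def parse_prelogin(server_data):
    """Pull the JSON object out of the callback(...) wrapper."""
    json_text = re.search(r"\((.*)\)", server_data).group(1)
    data = json.loads(json_text)
    return str(data["servertime"]), data["nonce"], data["pubkey"], data["rsakv"]


def parse_redirect(text):
    """Return the URL passed to location.replace(...)."""
    return re.search(r"location\.replace\(['\"](.*?)['\"]\)", text).group(1)


print(parse_prelogin(SAMPLE_PRELOGIN))
print(parse_redirect(SAMPLE_REDIRECT))
```

Note that re.search returns None when nothing matches, so calling .group(1) raises AttributeError on unexpected input; the try/except blocks in the login class rely on that.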

4. Encrypting and encoding the user name and password between the first and second steps (weiboencode.py):

The code is as follows:

    import urllib
    import base64
    import rsa
    import binascii

    def PostEncode(userName, passWord, serverTime, nonce, pubkey, rsakv):
        "Used to generate POST data"

        encodedUserName = GetUserName(userName)  # the user name is base64-encoded
        encodedPassWord = get_pwd(passWord, serverTime, nonce, pubkey)  # the password is currently RSA-encrypted
        postPara = {
            'entry': 'weibo',
            'gateway': '1',
            'from': '',
            'savestate': '7',
            'userticket': '1',
            'ssosimplelogin': '1',
            'vsnf': '1',
            'vsnval': '',
            'su': encodedUserName,
            'service': 'miniblog',
            'servertime': serverTime,
            'nonce': nonce,
            'pwencode': 'rsa2',
            'sp': encodedPassWord,
            'encoding': 'UTF-8',
            'prelt': '115',
            'rsakv': rsakv,
            'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
            'returntype': 'META'
        }
        postData = urllib.urlencode(postPara)  # URL-encode the form
        return postData

The PostEncode function builds the body of the POST message, which must match what a real login sends. The difficult part is how the user name and password are encrypted:

The code is as follows:

    def GetUserName(userName):
        "Used to encode user name"

        userNameTemp = urllib.quote(userName)
        userNameEncoded = base64.encodestring(userNameTemp)[:-1]
        return userNameEncoded


    def get_pwd(password, servertime, nonce, pubkey):
        rsaPublickey = int(pubkey, 16)
        key = rsa.PublicKey(rsaPublickey, 65537)  # create the public key
        message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)  # plaintext layout taken from the JS encryption file
        passwd = rsa.encrypt(message, key)  # encrypt
        passwd = binascii.b2a_hex(passwd)  # convert the ciphertext to hexadecimal
        return passwd
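The user name step is plain encoding rather than encryption: URL-quote the account name, then base64-encode the result. A minimal Python 3 sketch follows; base64.encodestring is no longer available in recent Python 3 releases, so b64encode is used here instead. The names encode_username and decode_username are illustrative assumptions.

```python
import base64
from urllib.parse import quote, unquote


def encode_username(user_name):
    """URL-quote the account name, then base64-encode it."""
    quoted = quote(user_name)  # e.g. "@" becomes "%40"
    # b64encode adds no trailing newline, so no [:-1] slice is needed.
    return base64.b64encode(quoted.encode("ascii")).decode("ascii")


def decode_username(encoded):
    """Inverse of encode_username, useful for checking the result."""
    return unquote(base64.b64decode(encoded).decode("ascii"))


su = encode_username("xxx@gmail.com")
print(su)
print(decode_username(su))
```

Because the transformation is reversible without any key, the su field should not be treated as protecting the account name.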

In the Sina login process, the password was originally hashed with SHA1 and is now encrypted with RSA; this may change again. Python has implementations of the common encryption algorithms, so once the encryption method in use has been identified, the program is fairly easy to implement.
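To make the RSA step less opaque, here is a minimal standard-library sketch of what such an encryption involves. The plaintext is laid out as servertime, tab, nonce, newline, password; it is padded in the PKCS#1 v1.5 style (the scheme the third-party rsa package applies in rsa.encrypt), raised to the public exponent 65537, and hex-encoded. The modulus below is a toy value built from two Mersenne primes so the result can be decrypted and checked. It is an assumption for illustration, is not secure, and has nothing to do with Sina's real key. All function names are hypothetical, and Python 3.8 or later is assumed for the modular inverse.

```python
import binascii
import os

E = 65537
# Toy key for demonstration only: the product of two Mersenne primes.
P, Q = 2**521 - 1, 2**607 - 1
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent, used only for the check


def pkcs1_v15_pad(message, key_bytes):
    """Build 00 02 <nonzero random bytes> 00 <message>."""
    pad_len = key_bytes - len(message) - 3
    if pad_len < 8:
        raise ValueError("message too long for this key")
    padding = b""
    while len(padding) < pad_len:
        # Zero bytes are dropped because 00 marks the end of the padding.
        padding += os.urandom(pad_len - len(padding)).replace(b"\x00", b"")
    return b"\x00\x02" + padding + b"\x00" + message


def encrypt_password(password, servertime, nonce, n=N, e=E):
    """Return the hex ciphertext of servertime, tab, nonce, newline, password."""
    message = (str(servertime) + "\t" + str(nonce) + "\n" + password).encode("utf-8")
    key_bytes = (n.bit_length() + 7) // 8
    padded = pkcs1_v15_pad(message, key_bytes)
    cipher = pow(int.from_bytes(padded, "big"), e, n)
    return binascii.hexlify(cipher.to_bytes(key_bytes, "big")).decode("ascii")


def decrypt_for_check(hex_cipher, n=N, d=D):
    """Undo encrypt_password with the toy private exponent."""
    key_bytes = (n.bit_length() + 7) // 8
    padded = pow(int(hex_cipher, 16), d, n).to_bytes(key_bytes, "big")
    # Skip 00 02, then split at the first 00 separator.
    return padded[2:].split(b"\x00", 1)[1]


sp = encrypt_password("xxxx", "1379834957", "ABC123")
print(len(sp))
print(decrypt_for_check(sp))
```

Because the padding is random, encrypting the same password twice gives different ciphertexts, which is expected. In real use the rsa package (or another vetted library) should do this work; the hand-rolled version is only meant to show the moving parts.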

At this point the simulated login to Sina Weibo succeeds. Running the program prints:

    loginUrl: http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack&ssosavestate=1390390056&ticket=st-mzq4nzq5ntyyma==-1387798056-xd-284624bfc19fe242bbae2c39fb3a8ca8&retcode=0
    Login sucess!

To crawl information from Weibo, add a crawling and parsing module after the main function, for example to read the content of a Weibo page:

The code is as follows:

    htmlContent = urllib2.urlopen(myurl).read()  # get the full content (HTML) of the page at myurl
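A Python 3 version of this fetch might look like the sketch below. The helper name fetch_html and its parameters are assumptions; passing the cookie-carrying opener created at login is what would let the request reuse the session. A data: URL stands in for a real Weibo page so the example runs without network access.

```python
import urllib.request


def fetch_html(url, opener=None, encoding="utf-8"):
    """Return the decoded body of url (hypothetical helper)."""
    # Use the given opener (e.g. one holding login cookies) if there is one.
    open_url = opener.open if opener is not None else urllib.request.urlopen
    with open_url(url) as response:
        return response.read().decode(encoding, errors="replace")


# Offline stand-in for a real page: the payload is a percent-encoded <p>hello</p>.
html_content = fetch_html("data:text/html,%3Cp%3Ehello%3C%2Fp%3E")
print(html_content)
```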

Different crawler modules can be designed for different requirements; the simulated login code is provided here.
