First, the basics
http://blog.csdn.net/pi9nc/article/details/9734437
Second, the simulated login
Last semester I took part in a big data competition and needed to collect data, so I decided to write a crawler for Sina Weibo.
The crawl is not aimless, of course: I need to fetch the posts that match given keywords.
Weibo's advanced search does exactly this, but it only returns more posts to a logged-in user, so the crawler has to simulate a login.
The code below simulates the login using the RSA encryption module (the rsa package). Note that Sina has anti-crawler measures, so the login request disguises itself as a browser.
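To make the encryption step less opaque, here is a minimal sketch of what the listing's password step does, written for Python 3 with only the standard library (the listing itself is Python 2 and calls rsa.encrypt, which applies PKCS#1 v1.5 padding). The function names pkcs1_v15_pad and encrypt_password are hypothetical and chosen for illustration; the plaintext layout (servertime, tab, nonce, newline, password) and the exponent 65537 are taken from the listing.

```python
import os


def pkcs1_v15_pad(message: bytes, key_len: int) -> bytes:
    """Pad message to key_len bytes: 00 02 <nonzero random bytes> 00 <message>."""
    pad_len = key_len - len(message) - 3
    if pad_len < 8:
        raise ValueError("message too long for this key size")
    padding = b""
    while len(padding) < pad_len:
        # os.urandom may return zero bytes, which the padding must not contain.
        chunk = os.urandom(pad_len - len(padding))
        padding += chunk.replace(b"\x00", b"")
    return b"\x00\x02" + padding + b"\x00" + message


def encrypt_password(pubkey_hex: str, servertime: str, nonce: str,
                     password: str, exponent: int = 65537) -> str:
    """Return the hex ciphertext that the listing sends as the 'sp' field."""
    modulus = int(pubkey_hex, 16)  # the server sends the modulus as hex
    key_len = (modulus.bit_length() + 7) // 8
    # Plaintext layout used by the listing: servertime \t nonce \n password
    plaintext = (servertime + "\t" + nonce + "\n" + password).encode("utf-8")
    padded = int.from_bytes(pkcs1_v15_pad(plaintext, key_len), "big")
    cipher = pow(padded, exponent, modulus)  # textbook RSA: c = m^e mod n
    return format(cipher, "0%dx" % (key_len * 2))
```

Because the padding is random, the same password produces a different ciphertext on every call. In real use the rsa package is the safer choice; this sketch only shows what happens inside it.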
I did not write this code myself, so this post is marked as a repost. My own code would be nearly identical, so I will not rewrite it, and I will not go through the code and its problems in detail, because the simulated login is not my focus. In the following posts I will cover the crawling and page-parsing parts. As for the login itself, the article linked at the beginning has a detailed tutorial for anyone interested.
    #!/usr/bin/env python
    # coding=utf8
    # Python 2 code (urllib2, cookielib); requires the third-party rsa package.
    import urllib
    import urllib2
    import cookielib
    import base64
    import re
    import json
    import hashlib
    import rsa
    import binascii

    # Install a global opener that keeps cookies across requests.
    cj = cookielib.LWPCookieJar()
    cookie_support = urllib2.HTTPCookieProcessor(cj)
    opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)

    # Template of the login form; the empty fields are filled in by login().
    postdata = {
        'entry': 'weibo',
        'gateway': '1',
        'from': '',
        'savestate': '7',
        'userticket': '1',
        'ssosimplelogin': '1',
        'vsnf': '1',
        'vsnval': '',
        'su': '',
        'service': 'miniblog',
        'servertime': '',
        'nonce': '',
        'pwencode': 'rsa2',  # encryption algorithm
        'sp': '',
        'encoding': 'UTF-8',
        'prelt': '401',
        'rsakv': '',
        'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
        'returntype': 'META'
    }


    class WeiboLogin(object):
        def __init__(self, username, password):
            self.username = username
            self.password = password

        def __get_spwd(self):
            rsa_public_key = int(self.pubkey, 16)  # the modulus arrives as a hex string
            key = rsa.PublicKey(rsa_public_key, 65537)  # create the public key
            # Plaintext layout, taken from the site's JS encryption file
            message = self.servertime + '\t' + self.nonce + '\n' + self.password
            passwd = rsa.encrypt(message, key)  # encrypt
            passwd = binascii.b2a_hex(passwd)  # convert the ciphertext to hex
            return passwd

        def __get_suser(self):
            # URL-quote the username, then base64-encode it
            username_ = urllib.quote(self.username)
            username = base64.b64encode(username_)
            return username

        def __prelogin(self):
            prelogin_url = 'http://login.sina.com.cn/sso/prelogin.php?entry=sso&callback=sinaSSOController.preloginCallBack&su=%s&rsakt=mod&client=ssologin.js(v1.4.4)' % self.username
            response = urllib2.urlopen(prelogin_url)
            # The response is JSON wrapped in callback(...); take what is inside the parentheses
            p = re.compile(r'\((.*)\)')
            strurl = p.search(response.read()).group(1)
            dic = json.loads(strurl)  # parse as JSON instead of calling eval()
            self.pubkey = str(dic.get('pubkey'))
            self.servertime = str(dic.get('servertime'))
            self.nonce = str(dic.get('nonce'))
            self.rsakv = str(dic.get('rsakv'))

        def login(self):
            url = 'http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.18)'
            try:
                self.__prelogin()  # pre-login
            except Exception:
                print 'Prelogin error'
                return
            # Work on a copy so the module-level template stays a dict
            form = dict(postdata)
            form['servertime'] = self.servertime
            form['nonce'] = self.nonce
            form['su'] = self.__get_suser()
            form['sp'] = self.__get_spwd()
            form['rsakv'] = self.rsakv
            data = urllib.urlencode(form)
            # Disguise the request as a browser
            headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:37.0) Gecko/20100101 Firefox/37.0'}
            req = urllib2.Request(url=url, data=data, headers=headers)
            result = urllib2.urlopen(req)
            text = result.read()
            # The response redirects via location.replace('...'); accept either quote style
            p = re.compile(r'location\.replace\([\'"](.*?)[\'"]\)')
            try:
                login_url = p.search(text).group(1)
                urllib2.urlopen(login_url)
                print 'Login succeeded!'
            except Exception:
                print 'Login error!'
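For readers on Python 3, the non-cryptographic helpers of the listing above could be ported roughly as follows. This is only a sketch under stated assumptions: the function names parse_prelogin, encode_username and build_login_request are hypothetical, nothing here contacts the network, and whether the endpoints still behave as in the listing is not verified.

```python
import base64
import json
import re
import urllib.parse
import urllib.request

# Browser-like User-Agent string, the same value the listing sends.
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:37.0) "
              "Gecko/20100101 Firefox/37.0")


def parse_prelogin(body: str) -> dict:
    """Extract the JSON object from a callback(...) wrapper."""
    match = re.search(r"\((.*)\)", body, re.S)
    if match is None:
        raise ValueError("no JSONP payload found")
    return json.loads(match.group(1))


def encode_username(username: str) -> str:
    """URL-quote the username, then base64-encode it (the 'su' field)."""
    quoted = urllib.parse.quote(username)
    return base64.b64encode(quoted.encode("ascii")).decode("ascii")


def build_login_request(url: str, fields: dict) -> urllib.request.Request:
    """Build a POST request that carries a browser-like User-Agent."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=data,
                                  headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    # Made-up sample response, for illustration only.
    sample = ('sinaSSOController.preloginCallBack('
              '{"retcode": 0, "servertime": 1400000000, "nonce": "ABCDEF"})')
    print(parse_prelogin(sample)["nonce"])
    print(encode_username("user@example.com"))
```

To keep the session cookies as the listing does, the request can be sent through an opener built with urllib.request.HTTPCookieProcessor and an http.cookiejar.CookieJar, the Python 3 counterparts of urllib2.HTTPCookieProcessor and cookielib.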
"Python Network Programming": simulating a Sina Weibo login with the RSA encryption module