I started learning Python some time ago, but I could never think of a good small project to practice on. Out of frustration I finally settled on crawling Sina Weibo: a simple project that gathers statistics from crawled data. At first I thought that knowing a bit of Python and regular expressions would be enough. Unexpectedly, I stumbled on the automated login: getting from the pre-login request to actually fetching data took four or five days. I had never written automated login code before, so the initial version of this project is owed entirely to experts online. I only pieced together code from several of them and added a few lines of comments.

# Imports. Only the rsa module needs to be installed; the others are built in.
import re
import urllib.parse
import urllib.request
import http.cookiejar
import base64
import binascii
import rsa

# The next four lines make every later GET and POST request carry the
# cookies obtained so far, because login verification on larger websites
# relies on cookies.
cj = http.cookiejar.LWPCookieJar()
cookie_support = urllib.request.HTTPCookieProcessor(cj)
opener = urllib.request.build_opener(cookie_support, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)

# Wrap GET in a function. The content Sina Weibo returns for GET requests is
# UTF-8 encoded, so utf-8 is hard-coded here. In a real project, determine
# the encoding from the actual response.
def getData(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    text = response.read().decode('utf-8')
    return text

# Wrap POST in a function. The user name and password are both verified via
# POST, so in this demo postData is used only for that verification.
def postData(url, data):
    # We have to fake the request headers ourselves.
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)'}
    # urlencode joins the dict into a query string with "&"; the result is
    # then encoded to UTF-8 bytes.
    data = urllib.parse.urlencode(data).encode('utf-8')
    request = urllib.request.Request(url, data, headers)
    response = urllib.request.urlopen(request)
    text = response.read().decode('gbk')
    return text

def login_weibo(nick, pwd):
    # ========== get servertime, pcid, pubkey, rsakv ==========
    # Send a pre-login request to obtain several parameters.
    prelogin_url = 'http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=%s&rsakt=mod&checkpin=1&client=ssologin.js(v1.4.15)&_=1400822309846' % nick
    preLogin = getData(prelogin_url)
    # Extract the following four values from the response.
    servertime = re.findall('"servertime":(.*?),', preLogin)[0]
    pubkey = re.findall('"pubkey":"(.*?)",', preLogin)[0]
    rsakv = re.findall('"rsakv":"(.*?)",', preLogin)[0]
    nonce = re.findall('"nonce":"(.*?)",', preLogin)[0]

    # ========== encrypt the user name and password ==========
    # This is the hardest part of the Sina Weibo login. Without pointers from
    # the experts it would have been too hard for me. I won't go into detail:
    # it is all encryption, and the end result is the encrypted su and sp.
    su = base64.b64encode(bytes(urllib.parse.quote(nick), encoding='utf-8'))
    rsaPublickey = int(pubkey, 16)
    key = rsa.PublicKey(rsaPublickey, 65537)
    # Some articles I found online do not convert the concatenated string to
    # bytes. This seems to be a Python 3 change: rsa.encrypt requires a bytes
    # argument, unlike before. The same applies to base64.b64encode above.
    message = bytes(str(servertime) + '\t' + str(nonce) + '\n' + str(pwd), encoding='utf-8')
    sp = binascii.b2a_hex(rsa.encrypt(message, key))

    # ========== log in ==========
    # param holds the POST parameters for the login. It uses the data
    # obtained in the first step. There is not much more to say about it.
    param = {
        'entry': 'weibo',
        'gateway': 1,
        'from': '',
        'savestate': 7,
        'useticket': 1,
        'pagerefer': 'http://login.sina.com.cn/sso/logout.php?entry=miniblog&r=http%3A%2F%2Fweibo.com%2Flogout.php%3Fbackurl%3D',
        'vsnf': 1,
        'su': su,
        'service': 'miniblog',
        'servertime': servertime,
        'nonce': nonce,
        'pwencode': 'rsa2',
        'rsakv': rsakv,
        'sp': sp,
        'sr': '2017*66661',
        'encoding': 'utf-8',
        'prelt': 961,
        'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
    }
    # This is the only place where postData is used, and it is very simple.
    # The part of the login URL before "?" is missing from the source text.
    s = postData('...?client=ssologin.js(v1.4.15)', param)

    # Once your code reaches this point, most of the work is done. However,
    # many fellow crawler writers trip up here, as I did. If you skip the
    # next step and go straight to fetching data such as your followers, you
    # will find that you still get the page asking you to log in. It is
    # really frustrating; I was stuck here for a whole day.
    # urll is a URL for a further login step, defined in a script that Sina
    # returns after the POST. Its parameters and verification data come from
    # the earlier steps. This request is the real login, so extract urll and
    # fetch it with GET.
    urll = re.findall(r"location\.replace\('(.*?)'\);", s)[0]
    getData(urll)

    # ========== fetch data ==========
    # If you did not skip the urll request, congratulations: you are logged
    # in and can crawl Sina Weibo for any data you want.
    # If you fetch your own Weibo home page, you will find that it is a file
    # of several hundred KB.
    text = getData('http://weibo.com/527891819/home?wvr=5&lf=reg')
    with open('yeah.txt', 'w', encoding='utf-8') as fp:
        fp.write(text)
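
The parsing and encoding steps of the login can be tried offline, without touching the network or the rsa module. The following is a minimal sketch, not the original author's code: the two sample response strings contain made-up values, and the helper names `parse_prelogin`, `encode_username`, `build_message` and `extract_redirect` are hypothetical, introduced only for illustration. The real responses from Sina may be shaped differently.

```python
import base64
import re
import urllib.parse

# Hypothetical sample of a pre-login callback body (all values made up).
SAMPLE_PRELOGIN = (
    'sinaSSOController.preloginCallBack({"retcode":0,"servertime":1400822309,'
    '"pcid":"xd-abc","nonce":"ABC123","pubkey":"EB2A","rsakv":"1330428213",'
    '"exectime":1})'
)

# Hypothetical sample of the script returned after the login POST.
SAMPLE_LOGIN_REPLY = (
    "<script>location.replace('http://example.com/next?ticket=ST-1');</script>"
)


def parse_prelogin(body):
    """Pull servertime, pubkey, rsakv and nonce out of the callback text."""
    return {
        'servertime': re.findall(r'"servertime":(.*?),', body)[0],
        'pubkey': re.findall(r'"pubkey":"(.*?)",', body)[0],
        'rsakv': re.findall(r'"rsakv":"(.*?)",', body)[0],
        'nonce': re.findall(r'"nonce":"(.*?)",', body)[0],
    }


def encode_username(nick):
    """URL-quote the user name, then base64-encode it (the su value)."""
    return base64.b64encode(urllib.parse.quote(nick).encode('utf-8'))


def build_message(servertime, nonce, pwd):
    """Build the bytes that would be handed to rsa.encrypt."""
    return (str(servertime) + '\t' + str(nonce) + '\n' + str(pwd)).encode('utf-8')


def extract_redirect(reply):
    """Find the URL inside location.replace('...') in the login reply."""
    return re.findall(r"location\.replace\('(.*?)'\);", reply)[0]


fields = parse_prelogin(SAMPLE_PRELOGIN)
su = encode_username('user@example.com')
message = build_message(fields['servertime'], fields['nonce'], 'secret')
redirect = extract_redirect(SAMPLE_LOGIN_REPLY)
```

In the full script, `int(fields['pubkey'], 16)` would become the RSA modulus and `message` would be encrypted with it; keeping those steps in small functions makes it easier to see which one breaks if Sina changes its response format.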