Using Python to Implement a Sina Weibo Crawler

Source: Internet
Author: User
Tags: base64, eval, sha1, urlencode

For a newer version of the Sina Weibo simulated login, see: http://blog.csdn.net/monsion/article/details/8656690

The solution for dynamically loaded content described later in this article is still valid.

(Edited again; a few things were wrong.)

The first module simulates logging in to Sina Weibo. Create a file named weibologin.py and enter the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import urllib
import urllib2
import cookielib
import base64
import re
import json
import hashlib

class WeiboLogin:
    # Install a cookie-aware opener so the login session is kept.
    cj = cookielib.LWPCookieJar()
    cookie_support = urllib2.HTTPCookieProcessor(cj)
    opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)

    postdata = {
        'entry': 'weibo',
        'gateway': '1',
        'from': '',
        'savestate': '7',
        'userticket': '1',
        'ssosimplelogin': '1',
        'vsnf': '1',
        'vsnval': '',
        'su': '',
        'service': 'miniblog',
        'servertime': '',
        'nonce': '',
        'pwencode': 'wsse',
        'sp': '',
        'encoding': 'UTF-8',
        'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
        'returntype': 'META'
    }

    def get_servertime(self):
        url = 'http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=dW5kZWZpbmVk&client=ssologin.js(v1.3.18)&_=1329806375939'
        data = urllib2.urlopen(url).read()
        # The response is JSONP; pull the JSON object out of the parentheses.
        p = re.compile('\((.*)\)')
        try:
            json_data = p.search(data).group(1)
            data = json.loads(json_data)
            servertime = str(data['servertime'])
            nonce = data['nonce']
            return servertime, nonce
        except:
            print 'Get servertime error!'
            return None

    def get_pwd(self, pwd, servertime, nonce):
        # sp = sha1(sha1(sha1(password)) + servertime + nonce)
        pwd1 = hashlib.sha1(pwd).hexdigest()
        pwd2 = hashlib.sha1(pwd1).hexdigest()
        pwd3_ = pwd2 + servertime + nonce
        pwd3 = hashlib.sha1(pwd3_).hexdigest()
        return pwd3

    def get_user(self, username):
        # su = base64(urlquote(username)); strip the trailing newline.
        username_ = urllib.quote(username)
        username = base64.encodestring(username_)[:-1]
        return username

    def login(self, username, pwd):
        url = 'http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.3.18)'
        try:
            servertime, nonce = self.get_servertime()
        except:
            print 'Get servertime error!'
            return
        WeiboLogin.postdata['servertime'] = servertime
        WeiboLogin.postdata['nonce'] = nonce
        WeiboLogin.postdata['su'] = self.get_user(username)
        WeiboLogin.postdata['sp'] = self.get_pwd(pwd, servertime, nonce)
        postdata = urllib.urlencode(WeiboLogin.postdata)
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0'}
        req = urllib2.Request(url=url, data=postdata, headers=headers)
        result = urllib2.urlopen(req)
        text = result.read()
        # On success the reply contains location.replace('...') with the real login URL.
        p = re.compile('location\.replace\(\'(.*?)\'\)')
        try:
            login_url = p.search(text).group(1)
            urllib2.urlopen(login_url)
            print 'Login success!'
        except:
            print 'Login error!'
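The su/sp encoding used by get_user() and get_pwd() can be checked in isolation. Here is a minimal Python 3 sketch of the same scheme (the function names `encode_username` and `encode_password` are my own, not part of the original code):

```python
import hashlib
import base64
from urllib.parse import quote

def encode_username(username):
    # su = base64(urlquote(username)), as in get_user() above
    return base64.b64encode(quote(username).encode('ascii')).decode('ascii')

def encode_password(pwd, servertime, nonce):
    # sp = sha1(sha1(sha1(pwd)) + servertime + nonce), as in get_pwd() above
    pwd1 = hashlib.sha1(pwd.encode('utf-8')).hexdigest()
    pwd2 = hashlib.sha1(pwd1.encode('ascii')).hexdigest()
    return hashlib.sha1((pwd2 + servertime + nonce).encode('ascii')).hexdigest()

# 'undefined' encodes to the su value visible in the prelogin URL.
print(encode_username('undefined'))  # dW5kZWZpbmVk
```

This also explains the hard-coded `su=dW5kZWZpbmVk` in the prelogin URL: it is simply the string "undefined" run through the same encoding.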

Then create the main.py file and enter the following code:

#!/usr/bin/env python
#-*-coding:utf-8-*-

import weibologin
import urllib
import urllib2

username = 'Your Weibo username'
pwd = 'Your Weibo password'

wblogin = weibologin.WeiboLogin()
wblogin.login(username, pwd)

Note: if login fails, your account may require a verification code at login. Try logging in to your account in a web browser; in the account settings you can allow logins from certain regions without a verification code.

Reference: http://www.douban.com/note/201767245/
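The servertime/nonce fetch in get_servertime() boils down to unwrapping a JSONP response of the form `sinaSSOController.preloginCallBack({...})`. A minimal Python 3 sketch of that step (the helper name `unwrap_jsonp` and the sample payload are illustrative, not from the real endpoint):

```python
import json
import re

def unwrap_jsonp(text):
    # Extract the JSON object between the callback's parentheses.
    m = re.search(r'\((.*)\)', text)
    return json.loads(m.group(1)) if m else None

sample = 'sinaSSOController.preloginCallBack({"servertime":1329806400,"nonce":"ABCDEF"})'
data = unwrap_jsonp(sample)
print(data['servertime'], data['nonce'])  # 1329806400 ABCDEF
```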

Next, let's implement crawling the microblog content itself.

At this point, when you fetch a user's Weibo page, only the first 15 posts are returned; the rest are loaded lazily via Ajax. The first time you drag the scroll bar to the bottom, the second batch is loaded; drag to the bottom again and the third batch appears, completing one page of posts. So, to get all the posts on one page, you need to request that page three times. Create the getweibopage.py file with the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib2
import sys
import time

reload(sys)
sys.setdefaultencoding('utf-8')

class GetWeiboPage:
    body = {
        '__rnd': '',
        '_k': '',
        '_t': '0',
        'count': '15',
        'end_id': '',
        'max_id': '',
        'page': 1,
        'pagebar': '',
        'pre_page': '0',
        'uid': ''
    }
    uid_list = []
    charset = 'utf8'

    def get_msg(self, uid):
        GetWeiboPage.body['uid'] = uid
        url = self.get_url(uid)
        self.get_firstpage(url)
        self.get_secondpage(url)
        self.get_thirdpage(url)

    def get_firstpage(self, url):
        # Initial request: the first 15 posts of the page.
        GetWeiboPage.body['pre_page'] = GetWeiboPage.body['page'] - 1
        url = url + '&' + urllib.urlencode(GetWeiboPage.body)
        req = urllib2.Request(url)
        result = urllib2.urlopen(req)
        text = result.read()
        self.writefile('./output/text1', text)
        self.writefile('./output/result1', eval("u'''" + text + "'''"))

    def get_secondpage(self, url):
        # First lazy-load batch (pagebar=0).
        GetWeiboPage.body['count'] = '15'
        # GetWeiboPage.body['end_id'] = '3490160379905732'
        # GetWeiboPage.body['max_id'] = '3487344294660278'
        GetWeiboPage.body['pagebar'] = '0'
        GetWeiboPage.body['pre_page'] = GetWeiboPage.body['page']
        url = url + '&' + urllib.urlencode(GetWeiboPage.body)
        req = urllib2.Request(url)
        result = urllib2.urlopen(req)
        text = result.read()
        self.writefile('./output/text2', text)
        self.writefile('./output/result2', eval("u'''" + text + "'''"))

    def get_thirdpage(self, url):
        # Second lazy-load batch (pagebar=1).
        GetWeiboPage.body['count'] = '15'
        GetWeiboPage.body['pagebar'] = '1'
        GetWeiboPage.body['pre_page'] = GetWeiboPage.body['page']
        url = url + '&' + urllib.urlencode(GetWeiboPage.body)
        req = urllib2.Request(url)
        result = urllib2.urlopen(req)
        text = result.read()
        self.writefile('./output/text3', text)
        self.writefile('./output/result3', eval("u'''" + text + "'''"))

    def get_url(self, uid):
        url = 'http://weibo.com/' + uid + '?from=otherprofile&wvr=3.6&loc=tagweibo'
        return url

    def get_uid(self, filename):
        fread = file(filename)
        for line in fread:
            GetWeiboPage.uid_list.append(line)
            print line
            time.sleep(1)

    def writefile(self, filename, content):
        fw = file(filename, 'w')
        fw.write(content)
        fw.close()
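The three requests per page differ only in a few query parameters. This Python 3 sketch shows just the parameter variation (the helper `page_params` is my own; values are inferred from the code above, and parameters left empty there are omitted here):

```python
from urllib.parse import urlencode

def page_params(uid, page=1):
    # The three AJAX calls that together fetch one full profile page.
    base = {'uid': uid, 'page': page, 'count': 15}
    first = dict(base, pre_page=page - 1)          # initial 15 posts
    second = dict(base, pre_page=page, pagebar=0)  # first scroll-to-bottom batch
    third = dict(base, pre_page=page, pagebar=1)   # second scroll-to-bottom batch
    return [urlencode(p) for p in (first, second, third)]

for qs in page_params('1624087025'):
    print(qs)
```

The key point is that `pagebar` is absent on the initial request and takes the values 0 and 1 on the two lazy-load requests, while `pre_page` switches from `page - 1` to `page`.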


Add the appropriate content to main.py; the complete file is:


#!/usr/bin/env python
#-*-coding:utf-8-*-

import weibologin
import getweibopage
import urllib
import urllib2

username = 'Your Weibo username'
pwd = 'Your Weibo password'

wblogin = weibologin.WeiboLogin()
wblogin.login(username, pwd)

wbmsg = getweibopage.GetWeiboPage()
url = 'http://weibo.com/1624087025?from=otherprofile&wvr=3.6&loc=tagweibo'

wbmsg.get_firstpage(url)
wbmsg.get_secondpage(url)
wbmsg.get_thirdpage(url)



Reference: http://www.cnblogs.com/sickboy/archive/2012/01/08/2316248.html
Run python main.py and it should work; the results are saved in the ./output/ folder, which must be created in advance.
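To avoid that manual step, a small guard at the top of main.py can create the folder. A Python 3 sketch (on Python 2, check os.path.exists first, since exist_ok is Python 3 only):

```python
import os

# Create the output folder the crawler writes into, if it is missing.
os.makedirs('./output', exist_ok=True)
```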
I spent an afternoon on this yesterday and a lot remains to be done; comments and discussion are welcome.

