Using Python to Implement a Sina Weibo Crawler

Source: Internet
Author: User
Tags: base64, eval, sha1, urlencode

For a newer version of the Sina Weibo simulated login, see: http://blog.csdn.net/monsion/article/details/8656690

The solution for dynamically loaded content described later in this article is still valid.

(Edited again; a few things were wrong.)

The first module simulates logging in to Sina Weibo. Create a file named weibologin.py and enter the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import urllib
import urllib2
import cookielib
import base64
import re
import json
import hashlib

class WeiboLogin:
    # Install a cookie-aware opener so the login session is kept.
    cj = cookielib.LWPCookieJar()
    cookie_support = urllib2.HTTPCookieProcessor(cj)
    opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)

    postdata = {
        'entry': 'weibo',
        'gateway': '1',
        'from': '',
        'savestate': '7',
        'userticket': '1',
        'ssosimplelogin': '1',
        'vsnf': '1',
        'vsnval': '',
        'su': '',
        'service': 'miniblog',
        'servertime': '',
        'nonce': '',
        'pwencode': 'wsse',
        'sp': '',
        'encoding': 'UTF-8',
        'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
        'returntype': 'META'
    }

    def get_servertime(self):
        url = 'http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=dW5kZWZpbmVk&client=ssologin.js(v1.3.18)&_=1329806375939'
        data = urllib2.urlopen(url).read()
        # The response is JSONP; pull the JSON object out of the parentheses.
        p = re.compile('\((.*)\)')
        try:
            json_data = p.search(data).group(1)
            data = json.loads(json_data)
            servertime = str(data['servertime'])
            nonce = data['nonce']
            return servertime, nonce
        except:
            print 'Get servertime error!'
            return None

    def get_pwd(self, pwd, servertime, nonce):
        # sp = sha1(sha1(sha1(password)) + servertime + nonce)
        pwd1 = hashlib.sha1(pwd).hexdigest()
        pwd2 = hashlib.sha1(pwd1).hexdigest()
        pwd3_ = pwd2 + servertime + nonce
        pwd3 = hashlib.sha1(pwd3_).hexdigest()
        return pwd3

    def get_user(self, username):
        # su = base64(urlquote(username)); strip the trailing newline.
        username_ = urllib.quote(username)
        username = base64.encodestring(username_)[:-1]
        return username

    def login(self, username, pwd):
        url = 'http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.3.18)'
        try:
            servertime, nonce = self.get_servertime()
        except:
            print 'Get servertime error!'
            return
        WeiboLogin.postdata['servertime'] = servertime
        WeiboLogin.postdata['nonce'] = nonce
        WeiboLogin.postdata['su'] = self.get_user(username)
        WeiboLogin.postdata['sp'] = self.get_pwd(pwd, servertime, nonce)
        postdata = urllib.urlencode(WeiboLogin.postdata)
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0'}
        req = urllib2.Request(url=url, data=postdata, headers=headers)
        result = urllib2.urlopen(req)
        text = result.read()
        # On success the reply contains location.replace('...') with the real login URL.
        p = re.compile('location\.replace\(\'(.*?)\'\)')
        try:
            login_url = p.search(text).group(1)
            urllib2.urlopen(login_url)
            print 'Login success!'
        except:
            print 'Login error!'
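The su/sp encoding used by get_user() and get_pwd() can be checked in isolation. Here is a minimal Python 3 sketch of the same scheme (the function names `encode_username` and `encode_password` are my own, not part of the original code):

```python
import hashlib
import base64
from urllib.parse import quote

def encode_username(username):
    # su = base64(urlquote(username)), as in get_user() above
    return base64.b64encode(quote(username).encode('ascii')).decode('ascii')

def encode_password(pwd, servertime, nonce):
    # sp = sha1(sha1(sha1(pwd)) + servertime + nonce), as in get_pwd() above
    pwd1 = hashlib.sha1(pwd.encode('utf-8')).hexdigest()
    pwd2 = hashlib.sha1(pwd1.encode('ascii')).hexdigest()
    return hashlib.sha1((pwd2 + servertime + nonce).encode('ascii')).hexdigest()

# 'undefined' encodes to the su value visible in the prelogin URL.
print(encode_username('undefined'))  # dW5kZWZpbmVk
```

This also explains the hard-coded `su=dW5kZWZpbmVk` in the prelogin URL: it is simply the string "undefined" run through the same encoding.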

Then create the main.py file and enter the following code:

#!/usr/bin/env python
#-*-coding:utf-8-*-

import weibologin
import urllib
import urllib2

username = 'Your Weibo username'
pwd = 'Your Weibo password'

wblogin = weibologin.WeiboLogin()
wblogin.login(username, pwd)

Note: if login fails, your account may require a verification code at login. Try logging in to your account in a web browser; in the account settings you can allow logins from certain regions without a verification code.

Reference: http://www.douban.com/note/201767245/
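The servertime/nonce fetch in get_servertime() boils down to unwrapping a JSONP response of the form `sinaSSOController.preloginCallBack({...})`. A minimal Python 3 sketch of that step (the helper name `unwrap_jsonp` and the sample payload are illustrative, not from the real endpoint):

```python
import json
import re

def unwrap_jsonp(text):
    # Extract the JSON object between the callback's parentheses.
    m = re.search(r'\((.*)\)', text)
    return json.loads(m.group(1)) if m else None

sample = 'sinaSSOController.preloginCallBack({"servertime":1329806400,"nonce":"ABCDEF"})'
data = unwrap_jsonp(sample)
print(data['servertime'], data['nonce'])  # 1329806400 ABCDEF
```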

Next, let's implement crawling the microblog content itself.

At this point, when you fetch a user's Weibo page, only the first 15 posts are returned; the rest are loaded lazily via Ajax. The first time you drag the scroll bar to the bottom, the second batch is loaded; drag to the bottom again and the third batch appears, completing one page of posts. So, to get all the posts on one page, you need to request that page three times. Create the getweibopage.py file with the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib2
import sys
import time

reload(sys)
sys.setdefaultencoding('utf-8')

class GetWeiboPage:
    body = {
        '__rnd': '',
        '_k': '',
        '_t': '0',
        'count': '15',
        'end_id': '',
        'max_id': '',
        'page': 1,
        'pagebar': '',
        'pre_page': '0',
        'uid': ''
    }
    uid_list = []
    charset = 'utf8'

    def get_msg(self, uid):
        GetWeiboPage.body['uid'] = uid
        url = self.get_url(uid)
        self.get_firstpage(url)
        self.get_secondpage(url)
        self.get_thirdpage(url)

    def get_firstpage(self, url):
        # Initial request: the first 15 posts of the page.
        GetWeiboPage.body['pre_page'] = GetWeiboPage.body['page'] - 1
        url = url + '&' + urllib.urlencode(GetWeiboPage.body)
        req = urllib2.Request(url)
        result = urllib2.urlopen(req)
        text = result.read()
        self.writefile('./output/text1', text)
        self.writefile('./output/result1', eval("u'''" + text + "'''"))

    def get_secondpage(self, url):
        # First lazy-load batch (pagebar=0).
        GetWeiboPage.body['count'] = '15'
        # GetWeiboPage.body['end_id'] = '3490160379905732'
        # GetWeiboPage.body['max_id'] = '3487344294660278'
        GetWeiboPage.body['pagebar'] = '0'
        GetWeiboPage.body['pre_page'] = GetWeiboPage.body['page']
        url = url + '&' + urllib.urlencode(GetWeiboPage.body)
        req = urllib2.Request(url)
        result = urllib2.urlopen(req)
        text = result.read()
        self.writefile('./output/text2', text)
        self.writefile('./output/result2', eval("u'''" + text + "'''"))

    def get_thirdpage(self, url):
        # Second lazy-load batch (pagebar=1).
        GetWeiboPage.body['count'] = '15'
        GetWeiboPage.body['pagebar'] = '1'
        GetWeiboPage.body['pre_page'] = GetWeiboPage.body['page']
        url = url + '&' + urllib.urlencode(GetWeiboPage.body)
        req = urllib2.Request(url)
        result = urllib2.urlopen(req)
        text = result.read()
        self.writefile('./output/text3', text)
        self.writefile('./output/result3', eval("u'''" + text + "'''"))

    def get_url(self, uid):
        url = 'http://weibo.com/' + uid + '?from=otherprofile&wvr=3.6&loc=tagweibo'
        return url

    def get_uid(self, filename):
        fread = file(filename)
        for line in fread:
            GetWeiboPage.uid_list.append(line)
            print line
            time.sleep(1)

    def writefile(self, filename, content):
        fw = file(filename, 'w')
        fw.write(content)
        fw.close()
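The three requests per page differ only in a few query parameters. This Python 3 sketch shows just the parameter variation (the helper `page_params` is my own; values are inferred from the code above, and parameters left empty there are omitted here):

```python
from urllib.parse import urlencode

def page_params(uid, page=1):
    # The three AJAX calls that together fetch one full profile page.
    base = {'uid': uid, 'page': page, 'count': 15}
    first = dict(base, pre_page=page - 1)          # initial 15 posts
    second = dict(base, pre_page=page, pagebar=0)  # first scroll-to-bottom batch
    third = dict(base, pre_page=page, pagebar=1)   # second scroll-to-bottom batch
    return [urlencode(p) for p in (first, second, third)]

for qs in page_params('1624087025'):
    print(qs)
```

The key point is that `pagebar` is absent on the initial request and takes the values 0 and 1 on the two lazy-load requests, while `pre_page` switches from `page - 1` to `page`.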


Add the appropriate content to main.py; the complete file is:


#!/usr/bin/env python
#-*-coding:utf-8-*-

import weibologin
import getweibopage
import urllib
import urllib2

username = 'Your Weibo username'
pwd = 'Your Weibo password'

wblogin = weibologin.WeiboLogin()
wblogin.login(username, pwd)

wbmsg = getweibopage.GetWeiboPage()
url = 'http://weibo.com/1624087025?from=otherprofile&wvr=3.6&loc=tagweibo'

wbmsg.get_firstpage(url)
wbmsg.get_secondpage(url)
wbmsg.get_thirdpage(url)



Reference: http://www.cnblogs.com/sickboy/archive/2012/01/08/2316248.html
Run python main.py and it should work; the results are saved in the ./output/ folder, which must be created in advance.
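To avoid that manual step, a small guard at the top of main.py can create the folder. A Python 3 sketch (on Python 2, check os.path.exists first, since exist_ok is Python 3 only):

```python
import os

# Create the output folder the crawler writes into, if it is missing.
os.makedirs('./output', exist_ok=True)
```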
I spent an afternoon on this yesterday and a lot remains to be done; comments and discussion are welcome.

