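Developers on Sina Weibo's open platform are increasingly active. Beyond commercial interests, there is a strong grassroots engineering community: many researchers and engineers interested in group behavior, natural language processing, machine learning, and data mining have started using Sina's real data and platform to build better applications for users, or to uncover patterns in group behavior, including summary statistics. This post uses the API provided by the Sina open platform to segment a Weibo user's tags into words, and then uses the resulting keywords to recommend people the user may be interested in. I am recording it here for future reference.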
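Requirements:
Python + Sina Weibo Python SDK + ICTCLAS
Note: ICTCLAS is a Chinese word segmentation package provided by the Institute of Computing Technology, Chinese Academy of Sciences.
Now for the code:
1. First, register as a Sina developer to obtain APP_KEY and APP_SECRET.
2. Following the Python SDK's how-to, obtain authorization through the OAuth2 mechanism (get the code, then exchange it for access_token and expires_in). The code is as follows: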
    #-*-coding:UTF-8-*-
    '''
    Created on 2012-12-10

    @author: jixianwu
    '''
    from weibo import APIClient,APIError
    import urllib,httplib

    class AppClient(object):
        ''' initialize a app client '''
        def __init__(self,*aTuple):
            self._appKey = aTuple[0] #your app key
            self._appSecret = aTuple[1] #your app secret
            self._callbackUrl = aTuple[2] #your callback url
            self._account = aTuple[3] #your weibo user name (eg.email)
            self._password = aTuple[4] # your weibo pwd
            self.AppCli = APIClient(app_key=self._appKey,app_secret=self._appSecret,redirect_uri=self._callbackUrl)
            self._author_url = self.AppCli.get_authorize_url()
            self._getAuthorization()

        def __str__(self):
            return 'your app client is created with callback %s' %(self._callbackUrl)

        def _get_code(self):
            # avoids typing the code in by hand: simulates the user granting
            # authorization and reads the code from the redirect
            conn = httplib.HTTPSConnection('api.weibo.com')
            postdict = {"client_id": self._appKey,
                        "redirect_uri": self._callbackUrl,
                        "userId": self._account,
                        "passwd": self._password,
                        "isLoginSina": "0",
                        "action": "submit",
                        "response_type": "code",
                        }
            postdata = urllib.urlencode(postdict)
            conn.request('POST', '/oauth2/authorize', postdata, {'Referer':self._author_url,'Content-Type': 'application/x-www-form-urlencoded'})
            res = conn.getresponse()
            location = res.getheader('location')
            code = location.split('=')[1]
            conn.close()
            return code

        def _getAuthorization(self):
            # sends the code obtained above to Sina's authentication server,
            # which returns access_token and expires_in to the client;
            # with these two values we can call the API
            ''' get the authorization from sinaAPI with oauth2 authentication method '''
            code = self._get_code()
            r = self.AppCli.request_access_token(code)
            access_token = r.access_token # The token return by sina
            expires_in = r.expires_in
            self.AppCli.set_access_token(access_token, expires_in)
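One fragile spot in `_get_code` is `location.split('=')[1]`, which only works when `code` is the first and only query parameter of the redirect URL. Below is a minimal sketch of a more tolerant way to pull the code out, using only the standard library. The helper name `extract_code` and the example URL are my own illustration, not part of the SDK, and I have not checked what extra parameters Sina actually appends to the redirect.

```python
try:  # Python 3
    from urllib.parse import urlparse, parse_qs
except ImportError:  # Python 2, which the listing above targets
    from urlparse import urlparse, parse_qs


def extract_code(location):
    """Return the OAuth2 authorization code from a redirect URL, or None.

    `location` is the value of the Location header; it may be None when
    the server did not redirect (for example on a failed login).
    """
    if not location:
        return None
    query = urlparse(location).query
    values = parse_qs(query).get('code')
    return values[0] if values else None


# Hypothetical redirect URL, for illustration only.
print(extract_code('http://example.com/callback?state=x&code=abc123'))
```

Returning `None` instead of raising lets the caller decide whether a missing code means bad credentials or an unexpected response.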
3. Get the user's tags through the API:
    import operator  # module-level import needed by getTags

    def getTags(self,userid):
        ''' get last three tags sorted by weight of this user'''
        try:
            tags = self.AppCli.tags.get(uid=userid)
        except Exception:
            print 'get tags failed'
            return
        userTags = []
        sortedT = sorted(tags,key=operator.attrgetter('weight'),reverse=True)
        # the list is in descending order, so [-3:] keeps the three lowest
        # weights; use sortedT[:3] instead for the three highest
        if len(sortedT) > 3:
            sortedT = sortedT[-3:]
        for tag in sortedT:
            for item in tag:
                if item != 'weight':
                    userTags.append(tag[item])
        return userTags
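The tag selection can be tried out without calling the API. The sketch below assumes each tag is a dict holding one id-to-name entry plus a `weight` field, which is what the inner loop of `getTags` implies. It also assumes the weight may arrive as a string, so it converts to `int` before sorting; string comparison would otherwise rank '9' above '120'. The function name `top_tag_names` and the sample data are made up for illustration.

```python
def top_tag_names(tags, n=3):
    """Return the names of the n highest-weight tags."""
    # Sort numerically so that a weight of '120' outranks '9'.
    ranked = sorted(tags, key=lambda tag: int(tag['weight']), reverse=True)
    names = []
    for tag in ranked[:n]:
        for key, value in tag.items():
            if key != 'weight':  # every other key is a tag id
                names.append(value)
    return names


# Hypothetical sample data, not real API output.
sample_tags = [
    {'101': 'machine learning', 'weight': '9'},
    {'102': 'python', 'weight': '120'},
    {'103': 'nlp', 'weight': '45'},
    {'104': 'travel', 'weight': '3'},
]
print(top_tag_names(sample_tags))
```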
4. Get the people the user already follows:
    def getFocus(self,userid):
        ''' get focused users list by current user '''
        try:
            # the API call itself can fail, so it belongs inside the try
            focus = self.AppCli.friendships.friends.ids.get(uid=userid)
            return focus.get('ids')
        except Exception:
            print 'get focus failed'
            return
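5. Segment the user tags obtained in step 3 (this requires a segmentation class written beforehand; the complete source is given at the end of this post):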
    from wordSegmentation import tokenizer

    tkr = tokenizer()
    lstrwords = ''  # accumulator; must exist before the += below
    #concatenate all the tags of the user into a string, then segment the string
    for tag in userTags:
        utf8_tag = tag.encode('utf-8')
        #print utf8_tag
        lstrwords += utf8_tag
    words = tkr.parse(lstrwords)
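For readers without ICTCLAS installed, here is a rough stand-in that shows what "segmenting the concatenated tag string" means. It is not ICTCLAS and not my `tokenizer` class: it is plain forward maximum matching against a tiny dictionary, and the function name, the dictionary, and the input string are all invented for the example. A real segmenter handles unknown words and ambiguity far better.

```python
# -*- coding: utf-8 -*-
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy dictionary segmentation: at each position take the longest known word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when nothing in the vocabulary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical mini-dictionary and tag string, for illustration only.
vocabulary = {u'機器學習', u'資料', u'採礦', u'旅行'}
print(forward_max_match(u'機器學習資料採礦旅行', vocabulary))
```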
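6. Using the keywords from step 5 together with the search interface of the Sina API, produce the users that the current user does not yet follow but may be interested in: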
    recommendUsers = {}  # uid -> screen_name
    for keyword in words:
        print keyword.decode('utf-8').encode('gbk')
        searchUsers = self.AppCli.search.suggestions.users.get(q=keyword.decode('utf-8'),count=10)

        #recommend the top ten users
        '''
        if len(searchUsers) >6:
            searchUsers = searchUsers[-6:]
        '''
        for se_user in searchUsers:
            #print se_user
            uid = se_user['uid']
            #filter out those already followed by the current user
            if uid not in userFocus:
                recommendUsers[uid] = se_user['screen_name'].encode('utf-8')
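The filtering step can be isolated into a small pure function that is easy to test offline. This is only a sketch: the name `recommend` and the sample data are hypothetical, and the only fields it relies on are `uid` and `screen_name`, the two that the loop above reads from each search result.

```python
def recommend(search_results, followed_ids):
    """Map uid -> screen_name for search hits the user does not follow yet."""
    followed = set(followed_ids)  # a set makes each membership check cheap
    recommended = {}
    for user in search_results:
        uid = user['uid']
        if uid not in followed:
            recommended[uid] = user['screen_name']
    return recommended


# Hypothetical data shaped like the fields used above.
hits = [
    {'uid': 1, 'screen_name': 'alice'},
    {'uid': 2, 'screen_name': 'bob'},
    {'uid': 3, 'screen_name': 'carol'},
]
print(recommend(hits, followed_ids=[2]))
```

Converting the followed list to a set matters once a user follows a few thousand accounts, since the check runs once per search hit per keyword.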
------
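Actual run:
Below is an example using my own Weibo account. My tags are:
After running the recommendation program, the result is:
The recommendations are shown in the red box. These Weibo users share tags with the user receiving the recommendations and have relatively high influence; they are also the users most likely to deliver high-utility information to that user. (Only some of the users are marked in the figure.)
This completes the whole recommendation task. The complete source code is on GitHub; corrections from interested readers are welcome.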