Using Python to segment Sina Weibo user tags and recommend related users

Last updated: 2018-12-07 Source: Internet

Uploaded by: User


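Developers on the Sina Weibo open platform are increasingly active. Besides commercial interests, there is a large grassroots engineering community: many researchers and engineers keen on group-behavior research, natural language processing, machine learning, and data mining have started using Sina's real data and platform to build better applications for users, or to discover patterns in group behavior, including statistical information. This article uses the API provided by the Sina open platform to segment the words in a Weibo user's tags, then recommends people the user may be interested in based on the resulting keywords. I am recording it here for future reference.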

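Requirements:

    Python + Sina Weibo Python SDK + ICTCLAS

Note: ICTCLAS is the Chinese word segmentation package provided by the Institute of Computing Technology, Chinese Academy of Sciences.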

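On to the code:

1. First register as a Sina developer to obtain an APP_KEY and APP_SECRET.

2. Following the Python SDK how-to, obtain authorization through the OAuth2 mechanism (get a code, then exchange it for access_token and expires_in). The code is as follows: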

    # -*- coding: UTF-8 -*-
    '''
    Created on 2012-12-10

    @author: jixianwu
    '''
    from weibo import APIClient, APIError
    import urllib, httplib

    class AppClient(object):
        ''' initialize a app client '''
        def __init__(self, *aTuple):
            self._appKey = aTuple[0]       # your app key
            self._appSecret = aTuple[1]    # your app secret
            self._callbackUrl = aTuple[2]  # your callback url
            self._account = aTuple[3]      # your weibo user name (eg. email)
            self._password = aTuple[4]     # your weibo pwd
            self.AppCli = APIClient(app_key=self._appKey,
                                    app_secret=self._appSecret,
                                    redirect_uri=self._callbackUrl)
            self._author_url = self.AppCli.get_authorize_url()
            self._getAuthorization()

        def __str__(self):
            return 'your app client is created with callback %s' % (self._callbackUrl)

        def _get_code(self):
            # Avoids typing the code in by hand: this simulates the user granting
            # authorization and then reads the code from the redirect.
            conn = httplib.HTTPSConnection('api.weibo.com')
            postdict = {"client_id": self._appKey,
                        "redirect_uri": self._callbackUrl,
                        "userId": self._account,
                        "passwd": self._password,
                        "isLoginSina": "0",
                        "action": "submit",
                        "response_type": "code",
                        }
            postdata = urllib.urlencode(postdict)
            conn.request('POST', '/oauth2/authorize', postdata,
                         {'Referer': self._author_url,
                          'Content-Type': 'application/x-www-form-urlencoded'})
            res = conn.getresponse()
            # the redirect target carries the code as a query parameter
            location = res.getheader('location')
            code = location.split('=')[1]
            conn.close()
            return code

        def _getAuthorization(self):
            # Send the code obtained above to Sina's authentication server, which
            # returns access_token and expires_in; with these two values we can
            # call the API.
            ''' get the authorization from sinaAPI with oauth2 authentication method '''
            code = self._get_code()
            r = self.AppCli.request_access_token(code)
            access_token = r.access_token  # the token returned by Sina
            expires_in = r.expires_in
            self.AppCli.set_access_token(access_token, expires_in)
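The `_get_code` method takes the code with `location.split('=')[1]`, which only works when `code` is the first query parameter of the redirect and the login succeeded. A more defensive parse could look roughly like the sketch below. It is written for Python 3 (the listing above targets Python 2), and the helper name `extract_code` and the example URL are made up for illustration; they are not part of the SDK.

```python
from urllib.parse import urlparse, parse_qs


def extract_code(location):
    """Return the OAuth2 authorization code from a redirect URL.

    Raises ValueError when the redirect carries no `code` parameter,
    for example when the login was rejected.
    """
    if not location:
        raise ValueError("no redirect location returned")
    query = parse_qs(urlparse(location).query)
    codes = query.get("code")
    if not codes:
        raise ValueError("redirect has no 'code' parameter: %r" % location)
    return codes[0]


# Hypothetical redirect URL, for illustration only.
example = "https://example.com/callback?state=xyz&code=abc123"
print(extract_code(example))
```

Raising an explicit error keeps a failed login from surfacing later as a confusing `IndexError` or an invalid token request.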

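3. Get the user's tags through the API:

        # needs "import operator" alongside the other imports at the top of the module
        def getTags(self, userid):
            ''' get last three tags sorted by weight of this user'''
            try:
                tags = self.AppCli.tags.get(uid=userid)
            except Exception:
                print 'get tags failed'
                return
            userTags = []
            # sort by weight, heaviest first
            sortedT = sorted(tags, key=operator.attrgetter('weight'), reverse=True)
            if len(sortedT) > 3:
                # after a descending sort, [-3:] keeps the three lowest-weight tags;
                # use [:3] if the three highest-weight tags are wanted
                sortedT = sortedT[-3:]
            for tag in sortedT:
                # every key other than 'weight' holds a tag name
                for item in tag:
                    if item != 'weight':
                        userTags.append(tag[item])
            return userTags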
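To make the tag selection easier to check in isolation, here is a minimal Python 3 sketch that works on plain dictionaries. It assumes each tag looks like `{"<tag id>": "<tag name>", "weight": "<number>"}` and that weights may arrive as strings, so they are converted to integers before sorting. The function name `top_tags` and the sample data are hypothetical.

```python
def top_tags(tags, n=3):
    """Return the names of the n highest-weight tags."""
    # Convert weights to int so that "9" does not sort above "50".
    ranked = sorted(tags, key=lambda t: int(t["weight"]), reverse=True)
    names = []
    for tag in ranked[:n]:
        # Every key other than 'weight' is a tag id whose value is the tag name.
        names.extend(v for k, v in tag.items() if k != "weight")
    return names


# Made-up sample data in the assumed response shape.
sample = [
    {"101": "python", "weight": "9"},
    {"102": "music", "weight": "50"},
    {"103": "nlp", "weight": "120"},
    {"104": "travel", "weight": "7"},
]
print(top_tags(sample))
```

Unlike the `[-3:]` slice in `getTags`, this sketch takes `[:n]` and therefore keeps the heaviest tags; which end is intended depends on how the weights should be interpreted.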

4. Get the users this user already follows:

        def getFocus(self, userid):
            ''' get focused users list by current user '''
            try:
                # the API call itself can fail, so it belongs inside the try block
                focus = self.AppCli.friendships.friends.ids.get(uid=userid)
                return focus.get('ids')
            except Exception:
                print 'get focus failed'
                return
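A single call may not return every followed id when the list is long. The Python 3 sketch below shows one way to gather all of them, assuming the response carries an `ids` list and a `next_cursor` field that becomes 0 on the last page, as cursor-paginated endpoints commonly do. `collect_followed_ids` and `fake_fetch_page` are hypothetical names; the fake function only stands in for the real API call so the loop can be run locally.

```python
def collect_followed_ids(fetch_page, uid, page_size=500):
    """Gather every followed id by following next_cursor until it is 0."""
    followed = set()
    cursor = 0
    while True:
        page = fetch_page(uid=uid, count=page_size, cursor=cursor)
        followed.update(page.get("ids", []))
        cursor = page.get("next_cursor", 0)
        if not cursor:
            break
    return followed


def fake_fetch_page(uid, count, cursor):
    """Stand-in for the real API call; serves ids 1..5 in slices."""
    all_ids = [1, 2, 3, 4, 5]
    chunk = all_ids[cursor:cursor + count]
    nxt = cursor + count
    return {"ids": chunk, "next_cursor": nxt if nxt < len(all_ids) else 0}


print(sorted(collect_followed_ids(fake_fetch_page, uid=42, page_size=2)))
```

Returning a set also makes the later "already followed?" check a constant-time lookup.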

5. Segment the user tags obtained in step 3 (a class that performs the segmentation has to be written first; the complete source is referenced at the end of this article):

    from wordSegmentation import tokenizer

    tkr = tokenizer()
    # concatenate all the tags of the user into a string, then segment the string
    lstrwords = ''
    for tag in userTags:
        utf8_tag = tag.encode('utf-8')
        #print utf8_tag
        lstrwords += utf8_tag
    words = tkr.parse(lstrwords)
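The `tokenizer` class wraps ICTCLAS and is not shown here. To illustrate what "concatenate the tags, then segment" produces, the Python 3 sketch below swaps in a much simpler technique, greedy forward maximum matching against a tiny hand-written vocabulary. It is not ICTCLAS and not the author's tokenizer; the function name, vocabulary, and sample tags are all made up for illustration.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Segment text greedily, preferring the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a word matches;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


vocabulary = {"机器学习", "数据挖掘", "自然语言", "处理"}
tags = ["机器学习", "自然语言处理", "数据挖掘"]
joined = "".join(tags)
print(forward_max_match(joined, vocabulary))
```

Joining tags without a separator lets a segmenter form words across tag boundaries, so segmenting each tag separately may be worth considering.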

6. Using the keywords obtained in step 5 together with the search interface of the Sina API, produce the users this user is likely interested in but does not follow yet:

    # userFocus is the id list returned by getFocus(userid)
    recommendUsers = {}  # uid -> screen name
    for keyword in words:
        # re-encode for a GBK console
        print keyword.decode('utf-8').encode('gbk')
        searchUsers = self.AppCli.search.suggestions.users.get(q=keyword.decode('utf-8'), count=10)

        #recommendation the top ten users
        '''
        if len(searchUsers) >6:
            searchUsers = searchUsers[-6:]
        '''
        for se_user in searchUsers:
            #print se_user
            uid = se_user['uid']
            #filter those had been focused by the current user
            if uid not in userFocus:
                recommendUsers[uid] = se_user['screen_name'].encode('utf-8')
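The filtering step can be tried without network access. The Python 3 sketch below assumes each search hit is a dictionary with `uid` and `screen_name` keys, as in the fragment above. `recommend` and `fake_search` are hypothetical names; `fake_search` returns canned results in place of the real suggestion search.

```python
def recommend(keywords, search_users, followed_ids, per_keyword=10):
    """Map uid -> screen_name for search hits the user does not follow yet."""
    followed = set(followed_ids or [])  # tolerate None from a failed lookup
    recommended = {}
    for keyword in keywords:
        for user in search_users(keyword, per_keyword):
            uid = user["uid"]
            if uid not in followed:
                recommended[uid] = user["screen_name"]
    return recommended


def fake_search(keyword, count):
    """Stand-in for the suggestion search; returns canned results."""
    canned = {
        "python": [{"uid": 1, "screen_name": "py_daily"},
                   {"uid": 2, "screen_name": "snake_charmer"}],
        "nlp": [{"uid": 2, "screen_name": "snake_charmer"},
                {"uid": 3, "screen_name": "parse_tree"}],
    }
    return canned.get(keyword, [])[:count]


print(recommend(["python", "nlp"], fake_search, followed_ids=[1]))
```

Keying the result by uid removes duplicates when the same account matches several keywords, and treating a missing followed list as empty avoids a `TypeError` if the lookup in step 4 failed and returned `None`.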

------

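In practice:

Below is an example from my own Weibo account. My tags are:

After running the recommendation program, the result is:

    The recommended users are outlined in red. These Weibo users share tags with the user receiving the recommendation and have relatively high influence, so they are also the ones most likely to deliver useful information to that user. (Only some of the users are marked in the figure.)

That completes the whole recommendation task. The complete source code is on GitHub; corrections from interested readers are welcome.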

This article was originally written in Chinese and published on aliyun.com; an English version is also provided, for informational purposes only.
