"Python Sina Weibo crawler" python crawls Sina Weibo 24-hour hot Topic Top500__python

Source: Internet
Author: User
Tags: base64

I. Requirements analysis
Simulate a login to Sina Weibo, then crawl the top 500 topics on the 24-hour hot topic board: for each topic, collect the topic name, reading count, discussion count, and fan count, plus the topic moderator and the moderator's follow count, fan count, and number of Weibo posts.

II. Development language
python2.7

III. Modules to import
import requests
import json
import base64
import re
import time
import pandas as pd

IV. Crawl process
First, simulate a login to Sina through the mobile site.
Then send requests to fetch the page source.
Finally, parse the data out of the source with regular expressions.

V. Field descriptions

Topic name: topic_name
Reading count: topic_reading
Discussion count: topic_discuss
Topic fan count: topic_fans
Topic moderator: host_name
Moderator's follow count: host_follow
Moderator's fan count: host_fans
Moderator's Weibo post count: host_weibo

VI. Crawl steps
1. Simulate a login to Sina Weibo (mobile version): the username is Base64-encoded, the login POST request is sent with requests, and a Session object keeps the login state.

########### 模拟登录新浪 (simulate Sina Weibo login)
def login(username, password):
    username = base64.b64encode(username.encode('utf-8')).decode('utf-8')
    postdata = {
        "entry": "sso",
        "gateway": "1",
        "from": "null",
        "savestate": "30",
        "useticket": "0",
        "pagerefer": "",
        "vsnf": "1",
        "su": username,
        "service": "sso",
        "sp": password,
        "sr": "1440*900",
        "encoding": "UTF-8",
        "cdult": "3",
        "domain": "sina.com.cn",
        "prelt": "0",
        "returntype": "TEXT",
    }
    loginurl = r'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.15)'
    session = requests.Session()
    res = session.post(loginurl, data=postdata)
    jsonstr = res.content.decode('gbk')
    info = json.loads(jsonstr)
    if info["retcode"] == "0":
        print(u"Login succeeded")
        # Add the cookies to the headers; this step is required, otherwise later API calls fail
        cookies = session.cookies.get_dict()
        cookies = [key + "=" + value for key, value in cookies.items()]
        cookies = "; ".join(cookies)
        session.headers["cookie"] = cookies
    else:
        print(u"Login failed, reason: %s" % info["reason"])
    return session

session = login('Sina Weibo account', 'Sina Weibo password')

2. Define lists to store the data.

################## 定义数据结构列表存储数据 (define lists to store the data)
top_name = []
top_reading = []
top_discuss = []
top_fans = []
host_name = []
host_follow = []
host_fans = []
host_weibo = []
url_new1 = []
url_new2 = []

3. Request the pages. Use the Fiddler packet-capture tool to find the real request URL: click to the next page, inspect the captured request, and then build the URL for every page by splicing the page number into it, as sketched below.
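A minimal sketch of the splicing, reusing the session from step 1 and the same query parameters that appear in the full source at the end (the _t value is simply the timestamp Fiddler captured):

# Sketch: fetch each page by splicing the page number into the captured request URL.
for i in range(1, 501):    # pages 1..500
    url2 = ("http://d.weibo.com/100803?pids=Pl_Discover_Pt6Rank__5&cfs=920"
            "&Pl_Discover_Pt6Rank__5_filter=hothtlist_type=1"
            "&Pl_Discover_Pt6Rank__5_page=" + str(i) +
            "&ajaxpagelet=1&__ref=/100803&_t=FM_149273744327929")
    html = session.get(url2).content    # session comes from the login step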

4. Parse the page with the regular expression function re.findall(). # 正则表达式匹配 (regex matching)

    name = re.findall('"Pl_Discover_Pt6Rank__5"(.*?)</script>', html, re.S)
    for each in name:
        # print each

5. The moderator's three statistics are crawled from the mobile site. Analyzing the XHR requests with the browser's F12 developer tools shows that the mobile API URL is http://m.weibo.cn/api/container/getIndex?type=uid&value=5710151998.
It returns JSON containing the user's follow count, Weibo post count, and fan count.
The three fields are obtained by parsing that JSON data.

The method is as follows:

Url= "http://m.weibo.cn/api/container/getIndex?type=uid&value=5710151998"
html=session.get (URL). Content
html=json.loads (HTML)
userinfo=html[' UserInfo ']
statuses_count=userinfo[' Statuses_count '
followers_count=userinfo[' Followers_count ']
follow_count=userinfo[' follow_count ']

print Statuses_count, Followers_count,follow_count

6. Crawl the data in a loop, assemble it into a DataFrame (table), and write it to Excel; a sketch follows.
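A minimal sketch of this step, assuming the lists from step 2 have been filled with equal numbers of entries (the full source additionally slices each list to at most 500 rows before building the table):

# Sketch: collect the lists into a pandas DataFrame and write it to Excel,
# mirroring the pd.ExcelWriter call used in the full source below.
import pandas as pd

data = pd.DataFrame({"top_name": top_name, "top_reading": top_reading,
                     "top_discuss": top_discuss, "top_fans": top_fans,
                     "host_name": host_name, "host_follow": host_follow,
                     "host_fans": host_fans, "host_weibo": host_weibo})

writer = pd.ExcelWriter(r'C:\\sina_weibo_topic500.xlsx', engine='xlsxwriter',
                        options={'strings_to_urls': False})
data.to_excel(writer, index=False)
writer.close()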

7. Throttle the request rate with time.sleep(4) from the time module.

VII. Result screenshot

VIII. Full source code

# encoding: utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import requests
import json
import base64
import re
import time
import pandas as pd

time1 = time.time()

########### 模拟登录新浪 (simulate Sina Weibo login)
def login(username, password):
    username = base64.b64encode(username.encode('utf-8')).decode('utf-8')
    postdata = {
        "entry": "sso",
        "gateway": "1",
        "from": "null",
        "savestate": "30",
        "useticket": "0",
        "pagerefer": "",
        "vsnf": "1",
        "su": username,
        "service": "sso",
        "sp": password,
        "sr": "1440*900",
        "encoding": "UTF-8",
        "cdult": "3",
        "domain": "sina.com.cn",
        "prelt": "0",
        "returntype": "TEXT",
    }
    loginurl = r'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.15)'
    session = requests.Session()
    res = session.post(loginurl, data=postdata)
    jsonstr = res.content.decode('gbk')
    info = json.loads(jsonstr)
    if info["retcode"] == "0":
        print(u"Login succeeded")
        # Add the cookies to the headers; this step is required, otherwise later API calls fail
        cookies = session.cookies.get_dict()
        cookies = [key + "=" + value for key, value in cookies.items()]
        cookies = "; ".join(cookies)
        session.headers["cookie"] = cookies
    else:
        print(u"Login failed, reason: %s" % info["reason"])
    return session

session = login('Fill in your Weibo account here', 'Fill in your Weibo password here')

################## 定义数据结构列表存储数据 (define lists to store the data)
top_name = []
top_reading = []
top_discuss = []
top_fans = []
host_name = []
host_follow = []
host_fans = []
host_weibo = []
url_new1 = []
url_new2 = []

##################### 开始循环抓取 (start the crawl loop)
for i in range(1, 501):
    try:
        print "Crawling page " + str(i) + " ..."
        url2 = ("http://d.weibo.com/100803?pids=Pl_Discover_Pt6Rank__5&cfs=920"
                "&Pl_Discover_Pt6Rank__5_filter=hothtlist_type=1"
                "&Pl_Discover_Pt6Rank__5_page=" + str(i) +
                "&ajaxpagelet=1&__ref=/100803&_t=FM_149273744327929")
        html = session.get(url2).content

        ########### 正则表达式匹配 (regex matching)
        name = re.findall('"Pl_Discover_Pt6Rank__5"(.*?)</script>', html, re.S)
        for each in name:
            # print each
            k = re.findall('"html":"(.*?)"}', each, re.S)
            for each1 in k:
                k1 = str(each1).replace('\\t', '').replace('\\n', '').replace('\\', '').replace('#', '')
                # print k1

                # topic name
                k2 = re.findall('alt="(.*?)" class="pic">', str(k1), re.S)
                for each2 in k2:
                    print each2
                    top_name.append(each2)

                # topic page URL
                k3 = re.findall('</span><a target="_blank" href="(.*?)" class="S_txt1">', str(k1), re.S)
                for each3 in k3:
                    print each3
                    url_new1.append(each3)
                    heads = {
                        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                        "Accept-Encoding": "gzip, deflate, sdch",
                        "Accept-Language": "zh-CN,zh;q=0.8",
                        "Cache-Control": "max-age=0",
                        "Connection": "keep-alive",
                        "Host": "weibo.com",
                        "Upgrade-Insecure-Requests": "1",
                        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
                    }
                    html2 = session.get(each3, headers=heads).content
                    time.sleep(4)

                    # discussion count and fan count on the topic page
                    # (the bare "r" in the patterns is what remains of "\r" after the backslash-stripping above)
                    p1 = re.findall('Pl_Core_T8CustomTriColumn__12(.*?)</script>', str(html2), re.S)
                    for each3_1 in p1:
                        p2 = str(each3_1).replace('\\t', '').replace('\\n', '').replace('\\', '').replace('#', '')
                        # print p2
                        p3 = re.findall('Read</span>r<td class="S_line1">r<strong(.*?)</strong><span class="S_txt2">Discussion</span>', str(p2), re.S)
                        for each3_2 in p3:
                            val = (str(each3_2).replace('class="">', '')
                                   .replace('class="W_f12">', '').replace('class="W_f16">', '')
                                   .replace('class="W_f14">', '').replace('class="W_f18">', ''))
                            print val
                            top_discuss.append(val)
                        p4 = re.findall('><strong class(.*?)</strong><span class="S_txt2">Fans', str(p2), re.S)
                        for each3_3 in p4:
                            val = (str(each3_3).replace('="">', '')
                                   .replace('="W_f12">', '').replace('="W_f16">', '')
                                   .replace('="W_f14">', '').replace('="W_f18">', ''))
                            print val
                            top_fans.append(val)

                # reading count
                k4 = re.findall('Read: <span><span class="number">(.*?)</span></div> <div class="sub_box W_fl">', str(k1), re.S)
                for each4 in k4:
                    print each4
                    top_reading.append(each4)

                # moderator profile URL -> query the mobile API with the moderator's uid
                k5 = re.findall('Moderator: <span><a target="_blank" href="(.*?)" class="tlink S_txt1"', str(k1), re.S)
                for each5 in k5:
                    print each5
                    mm = re.findall(r'\d+', str(each5), re.S)
                    for mm_1 in mm:
                        pp1 = "http://m.weibo.cn/api/container/getIndex?type=uid&value=" + str(mm_1)
                        html3 = session.get(pp1).content
                        html3 = json.loads(html3)
                        userinfo = html3['userInfo']
                        statuses_count = userinfo['statuses_count']
                        followers_count = userinfo['followers_count']
                        follow_count = userinfo['follow_count']
                        print statuses_count, followers_count, follow_count
                        host_follow.append(follow_count)
                        host_fans.append(followers_count)
                        host_weibo.append(statuses_count)
                        url_new2.append(pp1)

                # moderator name
                k6 = re.findall('" class="tlink S_txt1">(.*?)</a></div> </div><div class="opt_box"', str(k1), re.S)
                for each6 in k6:
                    print each6
                    host_name.append(each6)
    except:
        pass

print len(top_name), len(top_reading), len(top_discuss), len(top_fans), len(host_name), len(url_new2), len(host_follow), len(host_fans), len(host_weibo)

data = pd.DataFrame({"top_name": top_name[0:501], "top_reading": top_reading[0:501],
                     "top_discuss": top_discuss[0:501], "top_fans": top_fans[0:501],
                     "host_name": host_name[0:501], "host_follow": host_follow[0:501],
                     "host_fans": host_fans[0:501], "host_weibo": host_weibo[0:501]})
print len(data)

# 写入Excel (write to Excel)
writer = pd.ExcelWriter(r'C:\\sina_weibo_topic500.xlsx', engine='xlsxwriter',
                        options={'strings_to_urls': False})
data.to_excel(writer, index=False)
writer.close()

time2 = time.time()
print u'OK, the crawler has finished!'
print u'Total time consumed: ' + str(time2 - time1) + ' s'