How to use "keyword + time period + region" to collect Sina Weibo data


As a pioneer of social media in China, Sina Weibo does not provide an official API for retrieving posts by "keyword + time + region". When we see foreign research results built on social media data retrieved by a keyword, we can't help feeling a little disheartened, or we simply switch to Twitter instead. We hope Weibo becomes more open!


1. Entry Point

Fortunately, Sina provides an advanced search function. Can't find it? That's because it is only available after you log in... It doesn't matter. The following describes in detail how to retrieve Sina Weibo posts by "keyword + time + region" without logging in.

First, log in and take a look at what the advanced search form offers.


Then let's look at the address bar:

http://s.weibo.com/wb/%25E4%25B8%25AD%25E5%259B%25BD%25E5%25A5%25BD%25E5%25A3%25B0%25E9%259F%25B3&xsort=time&region=custom:11:1000&timescope=custom:2014-07-09-2:2014-07-19-4&Refer=g


So long? It's actually quite clear and simple. Let's break it down (a small sketch that assembles such a URL follows the list):


Fixed address part: http://s.weibo.com/wb/

Keyword (URL-encoded twice): %25E4%25B8%25AD%25E5%259B%25BD%25E5%25A5%25BD%25E5%25A3%25B0%25E9%259F%25B3

Sort order of the returned posts ("real-time" here): xsort=time

Search region (Beijing in this example): region=custom:11:1000

Search time range: timescope=custom:2014-07-09-2:2014-07-19-4

Ignorable item: Refer=g

Show similar posts (not shown in the URL): nodup=1. Note: with this option you can collect more posts, so adding it is recommended; by default it is omitted and similar posts are filtered out.

Page number within the request (not shown in the URL): page=1
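As a quick illustration, here is a minimal sketch (in the same Python 2 style as the code later in this article) that assembles such a URL from the pieces above. The function name build_search_url and its parameters are mine, not part of the original script; the key point is that the keyword must be urlencoded twice.

    # -*- coding: utf-8 -*-
    import urllib

    def build_search_url(keyword_utf8, region, timescope, page=1):
        # urlencode once -> %E4%B8%AD..., urlencode again -> %25E4%25B8%25AD...
        once = urllib.urlencode({"kw": keyword_utf8})[3:]   # strip the leading "kw="
        twice = urllib.urlencode({"kw": once})[3:]
        return ("http://s.weibo.com/wb/" + twice
                + "&xsort=time"
                + "&region=custom:" + region
                + "&timescope=custom:" + timescope
                + "&nodup=1"
                + "&page=" + str(page))

    if __name__ == '__main__':
        # "The Voice of China", Beijing, 2014-07-09 02:00 to 2014-07-19 04:00
        print build_search_url("中国好声音", "11:1000", "2014-07-09-2:2014-07-19-4")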


Now that we have worked this out, we can use a web crawler to retrieve Weibo posts by "keyword + time + region"...

2. Collection ideas

The general idea is as follows: construct the URL, crawl the page, and parse the Weibo information on it. Weibo officially provides an API for querying a post by its Weibo ID, so this article only covers how to collect Weibo IDs.
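To make the parsing step concrete before the full class, here is a minimal sketch. It assumes the search-result HTML has already been unescaped out of the page source (the download() method below shows how that is done); it then pulls the mid attribute, i.e. the Weibo ID, from each <dl> element with lxml. The helper name extract_mids and the sample markup are mine.

    # -*- coding: utf-8 -*-
    from lxml import etree

    def extract_mids(result_html):
        seen = set()
        page = etree.HTML(result_html)
        for dl in page.xpath(u"//dl"):      # each post is rendered as a <dl> element
            mid = dl.attrib.get('mid')      # the mid attribute is the Weibo ID
            if mid and mid not in seen:
                seen.add(mid)
        return seen

    if __name__ == '__main__':
        sample = u'<div><dl mid="3456789012345678"><dt>...</dt></dl></div>'
        print extract_mids(sample)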

In addition, advanced search returns at most 50 pages per request, so the smallest feasible time interval is recommended: set the time range (timescope) to one hour, for example 2013-07-01-2:2013-07-01-2.
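For example, a one-hour timescope string can be advanced to the next hour with datetime.timedelta, which is essentially what the getTimescope method below does. The helper name next_hour_timescope is mine.

    # -*- coding: utf-8 -*-
    import datetime

    def next_hour_timescope(timescope):
        # timescope format: YYYY-mm-dd-HH:YYYY-mm-dd-HH
        fmt = "%Y-%m-%d-%H"
        end = datetime.datetime.strptime(timescope.split(':')[-1], fmt)
        nxt = end + datetime.timedelta(hours=1)
        return nxt.strftime(fmt) + ":" + nxt.strftime(fmt)

    if __name__ == '__main__':
        print next_hour_timescope("2013-07-01-2:2013-07-01-2")   # -> 2013-07-01-03:2013-07-01-03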

No simulated login is performed at the moment, so you need to insert a random sleep between two adjacent URL requests; if requests come too frequently, you will be treated as a robot, you know.
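A minimal sketch of that politeness logic, assuming a base interval of about 50 seconds as in the class below; the helper name fetch_politely is mine.

    # -*- coding: utf-8 -*-
    import random
    import time
    import urllib2

    def fetch_politely(url, interval=50):
        # randomized gap between two adjacent page requests
        time.sleep(random.randint(interval - 10, interval + 10))
        return urllib2.urlopen(url, timeout=12).read()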



3. Specific implementation

Python is very well suited as a crawler tool. As a Python beginner, please don't blame me for writing it like Java. First, implement a class that collects one hour of data at a time.

# -*- coding: utf-8 -*-
# Imports needed by the class and the main() function below
import os
import sys
import time
import random
import logging
import datetime
import urllib
import urllib2
from lxml import etree


class CollectData():
    """Hourly collection class: uses the Weibo advanced search function to collect
    posts that match a keyword within a certain time range.

    General idea: construct the URL, crawl the page, and parse the Weibo IDs in it.
    Importing the posts into a database via the official Weibo API comes later;
    this class is only responsible for collecting Weibo IDs.

    Log on to Sina Weibo and open advanced search. Enter the keyword "空气污染"
    (air pollution), select "real-time", set the time to 2013-07-02-2:2013-07-09-2
    and the region to Beijing. After sending the request, the address bar becomes:
        Fixed address part: http://s.weibo.com/wb/
        Keyword, URL-encoded twice: %25E7%25A9%25BA%25E6%25B0%2594%25E6%25B1%25A1%25E6%259F%2593
        Sorted by "real-time": xsort=time
        Search region: region=custom:11:1000
        Search time range: timescope=custom:2013-07-02-2:2013-07-09-2
        Ignorable item: Refer=g
        Show similar posts (not shown): nodup=1 -- with it you collect more posts,
            so adding it is recommended; by default it is omitted and similar
            posts are filtered out.
        Page number within the request (not shown): page=1

    Advanced search returns at most 50 pages per request, so this class collects
    at most 50 pages of posts for a given time range.
    """

    def __init__(self, keyword, startTime, region, savedir, interval='50',
                 flag=True, begin_url_per="http://s.weibo.com/weibo/"):
        self.begin_url_per = begin_url_per  # fixed address part, "http://s.weibo.com/weibo/" or "http://s.weibo.com/wb/"
        self.setKeyword(keyword)            # set the keyword
        self.setStartTimescope(startTime)   # set the search start time
        self.setRegion(region)              # set the search region
        self.setSave_dir(savedir)           # set the result storage directory
        self.setInterval(interval)          # set the basic interval between adjacent page requests (too frequent = treated as a robot)
        self.setFlag(flag)                  # set the "treated as a robot" flag
        self.logger = logging.getLogger('main.CollectData')  # initialize the log

    ## Set the keyword; re-encode the console (GBK) input as UTF-8
    def setKeyword(self, keyword):
        self.keyword = keyword.decode('GBK').encode("utf-8")
        print 'twice encode:', self.getKeyWord()

    ## Set the start of the search range; the interval is 1 hour
    ## Format: yyyy-mm-dd-HH
    def setStartTimescope(self, startTime):
        if not (startTime == '-'):
            self.timescope = startTime + ":" + startTime
        else:
            self.timescope = '-'

    ## Set the search region
    def setRegion(self, region):
        self.region = region

    ## Set the result storage directory
    def setSave_dir(self, save_dir):
        self.save_dir = save_dir
        if not os.path.exists(self.save_dir):
            os.mkdir(self.save_dir)

    ## Set the basic interval between adjacent page requests
    def setInterval(self, interval):
        self.interval = int(interval)

    ## Set the "treated as a robot" flag. If False, open the page and enter the verification code manually
    def setFlag(self, flag):
        self.flag = flag

    ## Construct the URL
    def getURL(self):
        return (self.begin_url_per + self.getKeyWord()
                + "&region=custom:" + self.region
                + "&xsort=time&timescope=custom:" + self.timescope
                + "&nodup=1&page=")

    ## urlencode the keyword twice
    def getKeyWord(self):
        once = urllib.urlencode({"kw": self.keyword})[3:]
        return urllib.urlencode({"kw": once})[3:]

    ## Crawl all pages of one request; at most 50 pages are returned
    def download(self, url, maxTryNum=4):
        content = open(self.save_dir + os.sep + "weibo_ids.txt", "ab")  # result file for the Weibo IDs

        hasMore = True        # a request may return fewer than 50 pages; marks whether there is a next page
        isCaught = False      # marks whether this request was treated as a robot; if caught, open the URL from the log and enter the verification code
        mid_filter = set([])  # filters duplicate Weibo IDs

        i = 1  # number of pages returned by this request
        while hasMore and i < 51 and (not isCaught):  # at most 50 pages; parse each page and write to the result file
            source_url = url + str(i)  # URL of one page
            data = ''                  # page data
            goon = True                # network-interruption mark

            ## Poor network conditions: try the request several times
            for tryNum in range(maxTryNum):
                try:
                    html = urllib2.urlopen(source_url, timeout=12)
                    data = html.read()
                    break
                except:
                    if tryNum < (maxTryNum - 1):
                        time.sleep(10)
                    else:
                        print 'Internet Connect Error!'
                        self.logger.error('Internet Connect Error!')
                        self.logger.info('filePath: ' + self.save_dir)
                        self.logger.info('url: ' + source_url)
                        self.logger.info('page: ' + str(i))
                        self.flag = False
                        goon = False
                        break
            if goon:
                lines = data.splitlines()
                isCaught = True
                for line in lines:
                    ## Check whether Weibo content is present; if this line appears, the request was not treated as a robot
                    if line.startswith('<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_weibo_direct"'):
                        isCaught = False
                        n = line.find('html":"')
                        if n > 0:
                            j = line[n + 7: -12].encode("utf-8").decode('unicode_escape').encode("utf-8").replace("\\", "")
                            ## No-more-results page
                            if j.find('<div class="search_noresult">') > 0:
                                hasMore = False
                            ## Page with results
                            else:
                                page = etree.HTML(j)
                                dls = page.xpath(u"//dl")  # parse with XPath
                                for dl in dls:
                                    mid = str(dl.attrib.get('mid'))
                                    if mid != 'None' and mid not in mid_filter:
                                        mid_filter.add(mid)
                                        content.write(mid)
                                        content.write('\n')
                        break
                lines = None

                ## Handle the case of being treated as a robot
                if isCaught:
                    print 'Be Caught!'
                    self.logger.error('Be Caught Error!')
                    self.logger.info('filePath: ' + self.save_dir)
                    self.logger.info('url: ' + source_url)
                    self.logger.info('page: ' + str(i))
                    data = None
                    self.flag = False
                    break

                ## If there are no more results, end this request and jump to the next one
                if not hasMore:
                    print 'No More Results!'
                    if i == 1:
                        time.sleep(random.randint(55, 75))
                    else:
                        time.sleep(15)
                    data = None
                    break

                i += 1

                ## Random sleep between two adjacent page requests, you know. No simulated login yet.
                sleeptime_one = random.randint(self.interval - 30, self.interval - 10)
                sleeptime_two = random.randint(self.interval + 10, self.interval + 30)
                if i % 2 == 0:
                    sleeptime = sleeptime_two
                else:
                    sleeptime = sleeptime_one
                print 'sleeping ' + str(sleeptime) + ' seconds...'
                time.sleep(sleeptime)
            else:
                break
        content.close()
        content = None

    ## Advance the search time range; this helps collect the most data
    def getTimescope(self, perTimescope, hours):
        if not (perTimescope == '-'):
            times_list = perTimescope.split(':')
            start_datetime = datetime.datetime.fromtimestamp(
                time.mktime(time.strptime(times_list[-1], "%Y-%m-%d-%H")))
            start_new_datetime = start_datetime + datetime.timedelta(seconds=3600)
            end_new_datetime = start_new_datetime + datetime.timedelta(seconds=3600 * (hours - 1))
            start_str = start_new_datetime.strftime("%Y-%m-%d-%H")
            end_str = end_new_datetime.strftime("%Y-%m-%d-%H")
            return start_str + ":" + end_str
        else:
            return '-'

With the hourly collection class in place, we only need to set a start time and collect hour by hour.

def main():
    logger = logging.getLogger('main')
    logFile = './collect.log'
    logger.setLevel(logging.DEBUG)
    filehandler = logging.FileHandler(logFile)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s: %(message)s')
    filehandler.setFormatter(formatter)
    logger.addHandler(filehandler)

    while True:
        ## Accept keyboard input
        keyword = raw_input('Enter the keyword (type \'quit\' to exit): ')
        if keyword == 'quit':
            sys.exit()
        startTime = raw_input('Enter the start time (format: yyyy-mm-dd-HH): ')
        region = raw_input('Enter the region ([BJ]11:1000, [SH]31:1000, [GZ]44:1, [CD]51:1): ')
        savedir = raw_input('Enter the save directory (like C://data//): ')
        interval = raw_input('Enter the time interval (> 30, default 50): ')

        ## Instantiate the collection class and collect posts for the given keyword and start time
        cd = CollectData(keyword, startTime, region, savedir, interval)
        while cd.flag:
            print cd.timescope
            logger.info(cd.timescope)
            url = cd.getURL()
            cd.download(url)
            cd.timescope = cd.getTimescope(cd.timescope, 1)  # advance the search time to the next hour
        else:
            cd = None
            print '-----------------------------------------------------'
    else:
        logger.removeHandler(filehandler)
        logger = None
Everything is ready. Run it!

if __name__ == '__main__':
    main()

That's it ......

If you want to package it into a Windows executable or turn it into a small crawler of your own, pull requests on GitHub are welcome!



