Last semester I took part in a big-data competition and needed to crawl a lot of data, so I started with Sina Weibo. I originally planned to use Sina's API, but Sina does not expose a keyword-search API, so a crawler was the only option. Fortunately, Weibo offers an advanced search feature, which gives us a good starting point for crawling data.
After reading some material and looking at a few crawler examples, the general idea is: construct the URL, fetch the page, then parse the page.
Let's go through the details ~
On Sina Weibo, open the advanced search, enter a keyword (for example as in the screenshot), and submit the search. The address bar then changes to something like this: http://s.weibo.com/weibo/%25E4%25B8%25AD%25E5%25B1%25B1%25E5%25A4%25A7%25E5%25AD%25A6&region=custom:44:1&typeall=1&suball=1&timescope=custom:2015-08-07-0:2015-08-08-0&refer=g which breaks down as follows:
Fixed address part: http://s.weibo.com/weibo/
Keyword, URL-encoded twice from its UTF-8 form: %25E4%25B8%25AD%25E5%25B1%25B1%25E5%25A4%25A7%25E5%25AD%25A6
Search region: region=custom:44:1
Search time range: timescope=custom:2015-08-07-0:2015-08-08-0
Parameter that can be ignored: refer=g
Page number of the request: page=1 (can be omitted for the first page)
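To make the URL construction concrete, here is a minimal sketch in Python 2 (the same flavour as the full script later in this post). It double-encodes the keyword and assembles the search URL; the function name is mine, and the parameter values are simply the ones from the example above.

# -*- coding: utf-8 -*-
# A minimal sketch: double URL-encode the keyword and build the advanced-search URL.
import urllib

def build_search_url(keyword, region, timescope, page=1):
    # urlencode the UTF-8 keyword once, then urlencode the result again;
    # [3:] strips the leading 'kw=' each time
    once = urllib.urlencode({'kw': keyword})[3:]
    twice = urllib.urlencode({'kw': once})[3:]
    return ('http://s.weibo.com/weibo/' + twice +
            '&region=custom:' + region +
            '&typeall=1&suball=1' +
            '&timescope=custom:' + timescope +
            '&page=' + str(page))

print build_search_url('中山大学', '44:1', '2015-08-07-0:2015-08-08-0')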
Now let's look at the page source and see what we are dealing with:
Anyone seeing this for the first time will surely think "oh my god", it really is a dizzying wall of text.
Don't worry, let me explain.
First, locate the spot shown in the figure where the string <script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_weibo_direct" appears; this is where the HTML of the Weibo posts returned by the search lives ~
The content is Unicode-escaped, so the Chinese characters are not displayed normally ~ and there is no formatting at all, which is why it looks like such a mess.
We can fetch the page first and then process what we crawled. For that we use lxml's etree, which builds a tree out of the page content, as in the sketch below.
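Here is a minimal sketch (again Python 2) of that step. It assumes data already holds the raw page source (e.g. from urllib2.urlopen(...).read()), pulls the embedded result HTML out of the STK.pageletM.view(...) script line, un-escapes it, and builds an lxml tree. The slicing offsets mirror the ones used in the full script at the end of the post.

# A minimal sketch: extract the embedded search-result HTML and build an lxml tree.
from lxml import etree

def build_tree(data):
    # data: raw page source, e.g. urllib2.urlopen(source_url, timeout=12).read()
    for line in data.splitlines():
        if line.startswith('<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_weibo_direct"'):
            n = line.find('html":"')
            if n > 0:
                # strip the JSON wrapper, turn the \uXXXX escapes into real
                # characters, then drop the leftover backslashes
                raw = line[n + 7: -12].decode('unicode_escape').encode('utf-8').replace('\\', '')
                return etree.HTML(raw.decode('utf-8'))
    return None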
Let's take a look at one of the nodes:
<a class=\ "W_texta w_fb\" nick-name=\ "\u554a\u5be7\u5504\" href=\ "http:\/\/weibo.com\/612364698\" _ Blank\ "title=\" \u554a\u5be7\u5504\ "usercard=\" id=1884932730&usercardkey=weibo_mp\ "\t\tsuda-data=\" key= Tblog_search_weibo&value=weibo_ss_1_name\ "class=\" Name_txt w_fb\ ">
From this node we can get some information about the author of the post, such as the nickname (nick-name) and the profile address (href).
Let's look at another node:
<p class=\ "comment_txt\" node-type=\ "feed_list_content\" nick-name=\ "\u554a\u5be7\u5504\" >\u8fd9\u4e48\ u52aa\u529b \u5c45\u7136\u5012\u6570\u7b2c\u4e94 \u5509 \u4e0d\u884c\u6211\u8981\u8ffd\u56de\u6765 \u8d8a\u632b\ U8d8a\u52c7 \u4e0d\u53ef\u4ee5\u81ea\u66b4\u81ea\u5f03 \u4e0d\u53ef\u4ee5\u8ba9\u8d1f\u9762\u60c5\u7eea\u8dd1\ u51fa\u6765 \u83dc\u575a\u5f3a \u52a0\u6cb9\u52a0\u6cb9\u52a0\u6cb9 \u6211\u8981\u4e0a<em class=\ "red\" >\ U4e2d\u5c71\u5927\u5b66<\/em> \u6211\u8981\u548c\u5c0f\u54c8\u5427\u4e00\u6240\u5927\u5b66 \u62fc\u4e86 <\/p>
This node contains the data we actually want: the text content of the post.
That is a lot clearer. As for how to find the right nodes and extract their attributes and text, we use XPath.
For an introduction to XPath, see http://blog.csdn.net/raptor/article/details/4516441
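As a small illustration (assuming page is the lxml tree built by the sketch above), the nickname, profile link and post text can be pulled out roughly like this; the class names are the ones that appear in the two nodes quoted above.

# A minimal sketch: extract nickname, profile link and post text with XPath.
# Assumes `page` is the lxml tree built from the un-escaped HTML above.
ps = page.xpath("//p[@node-type='feed_list_content']")  # post content nodes
addrs = page.xpath("//a[@class='W_texta W_fb']")        # blogger link nodes

for p, a in zip(ps, addrs):
    name = p.attrib.get('nick-name')  # blogger nickname
    txt = p.xpath('string(.)')        # full post text, with inner tags stripped
    href = a.attrib.get('href')       # blogger's profile address
    print name, href, txt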
Once the data has been extracted it needs to be saved. I export it to Excel, using the xlwt and xlrd modules (plus xlutils to append to an existing workbook).
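Appending to an .xls file with these modules means copying the whole workbook first, since xlwt alone cannot modify an existing file. Here is a minimal sketch of the pattern used in the script below; it assumes weiboData.xls already exists and keeps a row counter in cell (0, 0).

# A minimal sketch of the xlrd/xlwt/xlutils append pattern used in the script below.
# Assumes weiboData.xls exists and stores the current row count in cell (0, 0).
import xlrd
from xlutils.copy import copy

def append_row(name, timescope, addr, txt):
    old_wb = xlrd.open_workbook('weiboData.xls', formatting_info=True)
    rows = int(old_wb.sheet_by_index(0).cell(0, 0).value)  # next free row
    new_wb = copy(old_wb)  # writable copy of the workbook
    ws = new_wb.get_sheet(0)
    ws.write(rows, 0, str(rows))
    ws.write(rows, 1, name)
    ws.write(rows, 2, timescope)
    ws.write(rows, 3, addr)
    ws.write(rows, 4, txt)
    ws.write(0, 0, str(rows + 1))  # bump the counter
    new_wb.save('weiboData.xls')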
What the final data looks like is shown below (I actually collected more detailed information, which required visiting each blogger's personal homepage; to keep things readable, that part has been removed from the code below):
Code:
# coding: utf-8
'''Collect Sina Weibo posts by keyword'''
import sys
import urllib
import urllib2
import time
from datetime import datetime
from datetime import timedelta
import random
from lxml import etree
import logging
import xlwt
import xlrd
from xlutils.copy import copy


class CollectData():
    """Data collection class.
       Uses Weibo's advanced search to collect posts for a keyword within a time range.
    """
    def __init__(self, keyword, startTime, interval='50', flag=True, begin_url_per="http://s.weibo.com/weibo/"):
        self.begin_url_per = begin_url_per  # fixed address part
        self.setKeyword(keyword)            # search keyword
        self.setStartTimescope(startTime)   # start time of the search
        #self.setRegion(region)             # search region
        self.setInterval(interval)          # base interval between neighbouring requests (too frequent and you are treated as a robot)
        self.setFlag(flag)
        self.logger = logging.getLogger('main.CollectData')  # initialize the log

    ## Set the keyword; it is decoded (from the console encoding) and re-encoded as utf-8
    def setKeyword(self, keyword):
        self.keyword = keyword.decode('GBK', 'ignore').encode('utf-8')
        print 'twice encode:', self.getKeyWord()

    ## The keyword must be urlencoded twice
    def getKeyWord(self):
        once = urllib.urlencode({"kw": self.keyword})[3:]
        return urllib.urlencode({"kw": once})[3:]

    ## Set the start range, with an interval of one day
    ## Format: yyyy-mm-dd
    def setStartTimescope(self, startTime):
        if not (startTime == '-'):
            self.timescope = startTime + ":" + startTime
        else:
            self.timescope = '-'

    ## Set the search region
    #def setRegion(self, region):
    #    self.region = region

    ## Set the base interval between neighbouring page requests
    def setInterval(self, interval):
        self.interval = int(interval)

    ## Flag for whether we have been taken for a robot.
    ## If False, you need to open the page and enter the captcha by hand.
    def setFlag(self, flag):
        self.flag = flag

    ## Build the URL
    def getURL(self):
        return self.begin_url_per + self.getKeyWord() + "&typeall=1&suball=1&timescope=custom:" + self.timescope + "&page="

    ## Crawl all pages of one request; at most 50 pages are returned
    def download(self, url, maxTryNum=4):
        hasMore = True          # a request may return fewer than 50 pages; flag for whether there is another page
        isCaught = False        # flag for being taken for a robot; if caught, open the page and enter the captcha
        name_filter = set([])   # filter out duplicate Weibo IDs

        i = 1                   # number of pages returned by this request so far
        while hasMore and i < 51 and (not isCaught):    # at most 50 pages; parse each page and write the results
            source_url = url + str(i)   # URL of this page
            data = ''                   # raw page data
            goon = True                 # network-failure flag

            ## If the network is bad, retry the request a few times
            for tryNum in range(maxTryNum):
                try:
                    html = urllib2.urlopen(source_url, timeout=12)
                    data = html.read()
                    break
                except:
                    if tryNum < (maxTryNum - 1):
                        time.sleep(10)
                    else:
                        print 'Internet Connect Error!'
                        self.logger.error('Internet Connect Error!')
                        self.logger.info('url: ' + source_url)
                        self.logger.info('page: ' + str(i))
                        self.flag = False
                        goon = False
                        break
            if goon:
                lines = data.splitlines()
                isCaught = True
                for line in lines:
                    ## if this line appears, the page contains posts and we were not taken for a robot
                    if line.startswith('<script>STK && STK.pageletM && STK.pageletM.view({"pid":"pl_weibo_direct"'):
                        isCaught = False
                        n = line.find('html":"')
                        if n > 0:
                            j = line[n + 7: -12].encode("utf-8").decode('unicode_escape').encode("utf-8").replace("\\", "")  # remove all the backslashes
                            ## "no more results" page
                            if (j.find('<div class="search_noresult">') > 0):
                                hasMore = False
                            ## page with results
                            else:
                                # j has to be decoded here because it was encoded to utf-8 above
                                page = etree.HTML(j.decode('utf-8'))
                                ps = page.xpath("//p[@node-type='feed_list_content']")  # post content via xpath
                                addrs = page.xpath("//a[@class='W_texta W_fb']")        # blogger address via xpath
                                addri = 0
                                # get the nickname and post content
                                for p in ps:
                                    name = p.attrib.get('nick-name')        # nickname
                                    txt = p.xpath('string(.)')              # post content
                                    addr = addrs[addri].attrib.get('href')  # blogger address
                                    addri += 1
                                    if (name != 'None' and txt != 'None' and name not in name_filter):
                                        # export the data to Excel
                                        name_filter.add(name)
                                        oldWb = xlrd.open_workbook('weiboData.xls', formatting_info=True)
                                        oldWs = oldWb.sheet_by_index(0)
                                        rows = int(oldWs.cell(0, 0).value)
                                        newWb = copy(oldWb)
                                        newWs = newWb.get_sheet(0)
                                        newWs.write(rows, 0, str(rows))
                                        newWs.write(rows, 1, name)
                                        newWs.write(rows, 2, self.timescope)
                                        newWs.write(rows, 3, addr)
                                        newWs.write(rows, 4, txt)
                                        newWs.write(0, 0, str(rows + 1))
                                        newWb.save('weiboData.xls')
                                        print "save with same name ok"
                        break
                lines = None
                ## handle being taken for a robot
                if isCaught:
                    print 'Be Caught!'
                    self.logger.error('Be Caught Error!')
                    self.logger.info('url: ' + source_url)
                    self.logger.info('page: ' + str(i))
                    data = None
                    self.flag = False
                    break
                ## no more results: end this request and jump to the next one
                if not hasMore:
                    print 'No more results!'
                    if i == 1:
                        time.sleep(random.randint(3, 8))
                    else:
                        time.sleep(10)
                    data = None
                    break
                i += 1
                ## random sleep between two neighbouring URL requests, to avoid being caught
                sleeptime_one = random.randint(self.interval - 25, self.interval - 15)
                sleeptime_two = random.randint(self.interval - 15, self.interval)
                if i % 2 == 0:
                    sleeptime = sleeptime_two
                else:
                    sleeptime = sleeptime_one
                print 'sleeping ' + str(sleeptime) + ' seconds...'
                time.sleep(sleeptime)
            else:
                break

    ## Move the search time range forward one day, which helps collect as much data as possible
    def getTimescope(self, perTimescope):
        if not (perTimescope == '-'):
            times_list = perTimescope.split(':')
            start_date = datetime(int(times_list[-1][0:4]), int(times_list[-1][5:7]), int(times_list[-1][8:10]))
            start_new_date = start_date + timedelta(days=1)
            start_str = start_new_date.strftime("%Y-%m-%d")
            return start_str + ":" + start_str
        else:
            return '-'


def main():
    logger = logging.getLogger('main')
    logFile = './collect.log'
    logger.setLevel(logging.DEBUG)
    filehandler = logging.FileHandler(logFile)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s: %(message)s')
    filehandler.setFormatter(formatter)
    logger.addHandler(filehandler)

    while True:
        ## read keyboard input
        keyword = raw_input('Enter the keyword(type \'quit\' to exit):')
        if keyword == 'quit':
            sys.exit()
        startTime = raw_input('Enter the start time(Format:YYYY-mm-dd):')
        #region = raw_input('Enter the region([BJ]11:1000,[SH]31:1000,[GZ]44:1,[CD]51:1):')
        interval = raw_input('Enter the time interval( >30 and default:50):')

        ## instantiate the collection class and collect posts for the given keyword and start time
        cd = CollectData(keyword, startTime, interval)
        while cd.flag:
            print cd.timescope
            logger.info(cd.timescope)
            url = cd.getURL()
            cd.download(url)
            cd.timescope = cd.getTimescope(cd.timescope)    # move the search time to the next day
        else:
            cd = None
            print '-----------------------------------------------------'
            print '-----------------------------------------------------'
    else:
        logger.removeHandler(filehandler)
        logger = None

##if __name__ == '__main__':
##    main()
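One thing to note: the script expects weiboData.xls to exist beforehand, with the row counter in cell (0, 0). A minimal sketch to create it (the header labels are my own choice, not part of the original script):

# Create an empty weiboData.xls with the row counter the crawler expects.
import xlwt

wb = xlwt.Workbook()
ws = wb.add_sheet('sheet1')
ws.write(0, 0, '1')  # next free row; the crawler updates this as it writes
ws.write(0, 1, 'nick-name')
ws.write(0, 2, 'timescope')
ws.write(0, 3, 'address')
ws.write(0, 4, 'content')
wb.save('weiboData.xls')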
That completes the data crawling. Combine it with the simulated login from the previous article and you can happily grab data to your heart's content ~