"Crawler" research and the implementation of a Sina Weibo search crawler

Tags: xpath

Overview

Features: crawl search results on Sina Weibo, with support for the time limits available in advanced search.
Website: http://s.weibo.com/
Implementation: use the selenium testing tool to simulate a Weibo login, combined with PhantomJS/Firefox; analyze the DOM nodes, use XPath to extract the information from each node, crawl the important information, and store it in Excel.
The Weibo information obtained includes: blogger nickname, blogger home page, Weibo verification, Weibo Daren status, Weibo content, publish time, Weibo address, Weibo source, reposts, comments, and likes.

Please see the code on GitHub: weibo_search_spider

Implementation

First, Weibo login
A typical simulated Weibo login passes cookies to the server, but with selenium the login can be done by simulating the clicks; logging in may require entering a verification code. After logging in through Sina Passport (http://login.sina.com.cn/), you can then open Weibo in the logged-in state.

# -*- coding: utf-8 -*-
import time
import datetime
import xlwt
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()

def LoginWeibo(username, password):
    try:
        # Enter the username/password and log in
        print u'Preparing to log in to the weibo.cn website ...'
        driver.get("http://login.sina.com.cn/")
        elem_user = driver.find_element_by_name("username")
        elem_user.send_keys(username)                       # username
        elem_pwd = driver.find_element_by_name("password")
        elem_pwd.send_keys(password)                        # password
        elem_sub = driver.find_element_by_xpath("//input[@class='smb_btn']")
        elem_sub.click()                                    # click login (the button has no name attribute)

        try:
            # A verification code has to be entered
            time.sleep(10)
            elem_sub.click()
        except:
            # No verification code needed
            pass

        print u'Current URL: ', driver.current_url
        print u'Cookie key/value pairs:'
        for cookie in driver.get_cookies():
            print cookie
            for key in cookie:
                print key, cookie[key]
        print u'Logged in successfully ...'
    except Exception, e:
        print "Error:", e
    finally:
        print u'End LoginWeibo!\n'

Note: whether a verification code is required at login differs between machines; for example, my own computer can log in without one, while another server requires it. If a verification code has to be entered, this only works with Firefox, because PhantomJS is a headless browser and offers no way to type the code in.
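If you want to switch between the two browsers depending on whether a verification code is expected, one simple option is to choose the driver at startup. The sketch below is only an illustration of that idea and not part of the original code (the rest of the article assumes a single global driver created with Firefox):

# A small sketch (not part of the original code): pick the driver at startup.
# PhantomJS is headless, so it cannot be used when the verification code
# has to be typed in by hand.
from selenium import webdriver

def make_driver(need_captcha_input=False):
    if need_captcha_input:
        return webdriver.Firefox()      # visible browser, allows manual captcha input
    return webdriver.PhantomJS()        # headless and faster, but no manual input

driver = make_driver(need_captcha_input=True)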

Second, search and process the results
Visit the http://s.weibo.com/ page, enter the keyword, click search, limit the time range of the search, and process the resulting pages.

Overall scheduling
The overall scheduler for the search is as follows:

# Overall scheduler for the search
def getSearchContent(key):
    driver.get("http://s.weibo.com/")
    print u'Searching hot topic: ', key.decode('utf-8')

    # Enter the keyword and press Enter to search directly
    item_inp = driver.find_element_by_xpath("//input[@class='searchInp_form']")
    item_inp.send_keys(key.decode('utf-8'))
    item_inp.send_keys(Keys.RETURN)

    # Keep only the base search URL; the time range is appended to it later
    current_url = driver.current_url
    current_url = current_url.split('&')[0]
    # e.g. http://s.weibo.com/weibo/%25E7%258E%2589%25E6%25A0%2591%25E5%259C%25B0%25E9%259C%2587

    global start_stamp
    global page

    # Start and end dates of the crawl window
    start_date = datetime.datetime(2016, 4, 1, 0)
    end_date = datetime.datetime(2016, 4, 30, 0)
    delta_date = datetime.timedelta(days=1)     # crawl one day at a time

    start_stamp = start_date
    end_stamp = start_date + delta_date

    global outfile
    global sheet
    outfile = xlwt.Workbook(encoding='utf-8')

    while end_stamp <= end_date:
        page = 1

        # One sheet per day
        sheet = outfile.add_sheet(str(start_stamp.strftime("%Y-%m-%d-%H")))
        initXls()

        # Query one day at a time by constructing the URL
        url = current_url + '&typeall=1&suball=1&timescope=custom:' \
              + str(start_stamp.strftime("%Y-%m-%d-%H")) + ':' \
              + str(end_stamp.strftime("%Y-%m-%d-%H")) + '&refer=g'
        driver.get(url)

        handlePage()    # process the content of the current page

        start_stamp = end_stamp
        end_stamp = end_stamp + delta_date
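For completeness, here is a rough sketch of how the pieces can be wired together. It is not part of the original listing; in particular, the article never shows where the xlwt workbook is saved, so the call to outfile.save() and the file name are my own assumptions:

# Hedged sketch of an entry point; account, keyword and output file name are placeholders.
if __name__ == '__main__':
    LoginWeibo('your_account', 'your_password')     # log in first so search results are fully visible
    getSearchContent('玉树地震')                      # UTF-8 keyword (the example used throughout this article)
    outfile.save('weibo_search_result.xls')         # persist the xlwt workbook (assumed file name)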

Construct Search Time
As for constructing the search time, advanced search does have a time-selection button that could be used to pick the search period. But simulating all of those clicks with selenium seemed overly complex, and during implementation I found that, because the two dates share a common calendar widget, the end date kept being selected incorrectly for reasons I could not figure out.

So here is an easier way: construct a datetime object and splice it into the URL.
By analyzing a search URL with a time limit, you can see that the time restriction only needs to be appended to the base URL:
http://s.weibo.com/weibo/%25E7%258E%2589%25E6%25A0%2591%25E5%259C%25B0%25E9%259C%2587&typeall=1&suball=1&timescope=custom:2016-05-01-0:2016-05-02-0&refer=g
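Concretely, the same stitching used in getSearchContent produces that kind of URL from two datetime objects, for example:

# Small illustration of the URL stitching: append a one-day custom timescope
# to the base search URL (the base URL below is the example from above).
import datetime

base_url = "http://s.weibo.com/weibo/%25E7%258E%2589%25E6%25A0%2591%25E5%259C%25B0%25E9%259C%2587"
start_stamp = datetime.datetime(2016, 5, 1, 0)
end_stamp = start_stamp + datetime.timedelta(days=1)

url = (base_url + '&typeall=1&suball=1&timescope=custom:'
       + start_stamp.strftime("%Y-%m-%d-%H") + ':'
       + end_stamp.strftime("%Y-%m-%d-%H") + '&refer=g')
# url ends with ...timescope=custom:2016-05-01-00:2016-05-02-00&refer=g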

Single page data processing
After each page has loaded, a series of checks is needed to decide whether there is any content to fetch.

# Process the content of the page after it has finished loading
def handlePage():
    while True:
        # I first thought sleep was needed to wait for the page to load, but it turns out
        # execution already waits for the page to finish loading.
        # The sleep here is to cope with Weibo's anti-crawling mechanism: crawling too fast
        # may get you treated as a robot and asked for a verification code.
        time.sleep(2)

        # First check whether there is any content at all
        if checkContent():
            print "getContent"
            getContent()

            # Then check whether there is a "next page" button
            if checkNext():
                # Grab the "next page" button and click it
                next_page_btn = driver.find_element_by_xpath("//a[@class='page next S_txt1 S_line1']")
                next_page_btn.click()
            else:
                print "no next"
                break
        else:
            print "no content"
            break

Page load result judgment
After clicking the search button, the result page may fall into the following cases:

1. No search results on the page



Within the specified time range the keyword may have no search results; in that case the page shows featured and recommended Weibo posts, which are easily mistaken for normal results. Comparing this with a page that does have results, the most typical feature of a no-result page is a div with class "pl_noresult", so whether there are search results can be determined by checking for the node "//div[@class='pl_noresult']" with an XPath lookup, as follows:

# Check whether the page has any content after it finishes loading
def checkContent():
    # Is having a navigation bar the sign of having content? No! A one-page result has no navigation bar either.
    # What does mark a page with no content is the presence of "pl_noresult".
    try:
        driver.find_element_by_xpath("//div[@class='pl_noresult']")
        flag = False
    except:
        flag = True
    return flag

2. The page has search results, but only one page

Careful analysis shows that the most typical feature of such a page is that it has no navigation bar, i.e. no "next page" button.

3. The page has search results, and there is more than one page

In this case, the most typical feature is the "next page" button, so cases 2 and 3 can be distinguished by whether a "next page" button exists. If it does, it can be clicked to go to the next page!

# Check whether there is a "next page" button
def checkNext():
    try:
        driver.find_element_by_xpath("//a[@class='page next S_txt1 S_line1']")
        flag = True
    except:
        flag = False
    return flag

Get page content
Once the page is known to have content, extracting it is the most important part. By analyzing the DOM nodes of the page, each piece of required information can be obtained in turn:

# Extract the content of the page (only called when the page does have content)
def getContent():
    # Locate every Weibo card on the page
    nodes = driver.find_elements_by_xpath("//div[@class='WB_cardwrap S_bg2 clearfix']")

    # If the number of cards is 0, Weibo's anti-crawling mechanism has probably kicked in
    # and a verification code has to be entered manually in the browser
    if len(nodes) == 0:
        raw_input("Please enter the verification code on the Weibo page!")
        url = driver.current_url
        driver.get(url)
        getContent()
        return

    dic = {}
    global page
    print str(start_stamp.strftime("%Y-%m-%d-%H"))
    print u'Page: ', page
    page = page + 1
    print u'Number of Weibo posts: ', len(nodes)

    for i in range(len(nodes)):
        dic[i] = []

        try:    # blogger nickname
            BZNC = nodes[i].find_element_by_xpath(".//div[@class='feed_content wbcon']/a[@class='W_texta W_fb']").text
        except:
            BZNC = ''
        print u'Blogger nickname: ', BZNC
        dic[i].append(BZNC)

        try:    # blogger home page
            BZZY = nodes[i].find_element_by_xpath(".//div[@class='feed_content wbcon']/a[@class='W_texta W_fb']").get_attribute("href")
        except:
            BZZY = ''
        print u'Blogger home page: ', BZZY
        dic[i].append(BZZY)

        try:    # Weibo verification (the node does not exist if the account is not verified)
            WBRZ = nodes[i].find_element_by_xpath(".//div[@class='feed_content wbcon']/a[@class='approve_co']").get_attribute('title')
        except:
            WBRZ = ''
        print u'Weibo verification: ', WBRZ
        dic[i].append(WBRZ)

        try:    # Weibo Daren (the node does not exist if the account is not a Daren)
            WBDR = nodes[i].find_element_by_xpath(".//div[@class='feed_content wbcon']/a[@class='ico_club']").get_attribute('title')
        except:
            WBDR = ''
        print u'Weibo Daren: ', WBDR
        dic[i].append(WBDR)

        try:    # Weibo content
            WBNR = nodes[i].find_element_by_xpath(".//div[@class='feed_content wbcon']/p[@class='comment_txt']").text
        except:
            WBNR = ''
        print u'Weibo content: ', WBNR
        dic[i].append(WBNR)

        try:    # publish time
            FBSJ = nodes[i].find_element_by_xpath(".//div[@class='feed_from W_textb']/a[@class='W_textb']").text
        except:
            FBSJ = ''
        print u'Publish time: ', FBSJ
        dic[i].append(FBSJ)

        try:    # Weibo address
            WBDZ = nodes[i].find_element_by_xpath(".//div[@class='feed_from W_textb']/a[@class='W_textb']").get_attribute("href")
        except:
            WBDZ = ''
        print u'Weibo address: ', WBDZ
        dic[i].append(WBDZ)

        try:    # Weibo source
            WBLY = nodes[i].find_element_by_xpath(".//div[@class='feed_from W_textb']/a[@rel]").text
        except:
            WBLY = ''
        print u'Weibo source: ', WBLY
        dic[i].append(WBLY)

        try:    # reposts
            ZF_TEXT = nodes[i].find_element_by_xpath(".//a[@action-type='feed_list_forward']//em").text
            ZF = 0 if ZF_TEXT == '' else int(ZF_TEXT)
        except:
            ZF = 0
        print u'Reposts: ', ZF
        dic[i].append(str(ZF))

        try:    # comments (the em element may be missing)
            PL_TEXT = nodes[i].find_element_by_xpath(".//a[@action-type='feed_list_comment']//em").text
            PL = 0 if PL_TEXT == '' else int(PL_TEXT)
        except:
            PL = 0
        print u'Comments: ', PL
        dic[i].append(str(PL))

        try:    # likes (may be empty)
            ZAN_TEXT = nodes[i].find_element_by_xpath(".//a[@action-type='feed_list_like']//em").text
            ZAN = 0 if ZAN_TEXT == '' else int(ZAN_TEXT)
        except:
            ZAN = 0
        print u'Likes: ', ZAN
        dic[i].append(str(ZAN))

        print '\n'

    # Write this page's rows to Excel
    writeXls(dic)
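The listing above calls initXls() and writeXls(dic), which the article does not show. A minimal sketch of what they might look like with xlwt follows; the column layout simply mirrors the eleven fields collected above, while the header labels and the row-counter variable are my own assumptions:

# Hedged sketch of the two helpers referenced above (not shown in the original article).
# They rely on the global `sheet` created in getSearchContent.
XLS_HEADERS = [u'Blogger nickname', u'Blogger home page', u'Weibo verification', u'Weibo Daren',
               u'Weibo content', u'Publish time', u'Weibo address', u'Weibo source',
               u'Reposts', u'Comments', u'Likes']

def initXls():
    # Write the header row into the current day's sheet and reset the row counter
    global xls_row
    xls_row = 0
    for col, title in enumerate(XLS_HEADERS):
        sheet.write(0, col, title)

def writeXls(dic):
    # Append one row per Weibo post; the counter keeps growing across pages of the same day
    global xls_row
    for i in sorted(dic.keys()):
        xls_row += 1
        for col, value in enumerate(dic[i]):
            sheet.write(xls_row, col, value)
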
Written at the end

As I mentioned in a previous blog post, there are generally two ways to crawl data: one is to capture packets with a tool (such as Fiddler), analyze them to get the URL of the AJAX request, and fetch the data through that URL; this is the more general and recommended method. The other is the approach used here, a crawler that simulates browser behavior.
I know that the data-capture method used in this article is the less efficient one, but it is also the quicker one to get started with: you only need to master the basic usage of selenium and XPath to build a crawler quickly. The next step is to dig deeper into more efficient crawling approaches.
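For reference, the packet-capture route mentioned above usually boils down to something like the sketch below. The endpoint is a purely hypothetical placeholder standing in for whatever request URL you would find with Fiddler; it is not a real Weibo API:

# Hedged sketch of the AJAX-based alternative; AJAX_URL is a hypothetical placeholder.
import requests

AJAX_URL = 'http://example.com/ajax-url-found-with-fiddler'   # hypothetical, found via packet capture
resp = requests.get(AJAX_URL)
print(resp.json())      # or resp.text, depending on what the endpoint returns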
I hope this simple idea and method can help you, and I welcome further exchange and advice.

Thanks to Eastmount and Nine Tea for their help.

(by Mrhammer, 2016-05-02, 6 p.m. @Bin House, rainy)
