[Python crawler] Selenium crawling Sina Weibo hot topics and comments (Part 2)


This article describes how to use Python + Selenium to crawl Sina Weibo hot topics and their comments. The drawbacks of this kind of crawler are its very low efficiency and its simplistic, non-parallel execution; its advantages are that it locates information by analyzing the DOM tree structure of the page source, that the intermediate steps can be watched and verified in the browser, and that a verification code can be entered by hand. This article no longer discusses the crawling process in detail; it mainly provides runnable code and the results of running it. I hope the article is helpful to you.

Reference articles
[Python crawler] Selenium crawling Sina Weibo content and user information
[Python crawler] Selenium crawling Sina Weibo client user information, hot topics and comments (Part 1)
[Python crawler] Installing pip + PhantomJS + Selenium under Windows
[Python crawler] Selenium automatic login to the 163 mailbox and an introduction to locating elements
http://selenium-python.readthedocs.org/locating-elements.html

Implementation process
As shown in the run below, the crawler calls the Firefox browser, opens the mobile login URL, and automatically enters the username and password; the user then has to type the verification code within the 20-second pause I set. After login it automatically jumps to the Weibo topic search page. When the user enters the keyword "Ode to Joy" (欢乐颂), the returned posts and their comments are crawled; during crawling, pay attention to page turning.



Run Results
The results are as follows: two files are produced, one containing the Weibo posts and the URLs of their comment pages, the other containing the comment information crawled from those URLs.





Source Code
The source code is as follows:
# coding=utf-8
"""
Created on 2016-04-24
@author: Eastmount
Function: crawl Sina Weibo user information and Weibo comments
Site:     http://weibo.cn/ (smaller data volume) vs. http://weibo.com/
          The correct approach is to collect all URLs first and then visit them
"""
import time
import re
import os
import sys
import codecs
import shutil
import urllib
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.action_chains import ActionChains

# First call a headless browser (PhantomJS) or Firefox
#driver = webdriver.PhantomJS(executable_path="G:\phantomjs-1.9.1-windows\phantomjs.exe")
driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)

# Global variables: file handles for reading/writing information
inforead = codecs.open("SinaWeibo_List_best_1.txt", 'r', 'utf-8')
infofile = codecs.open("SinaWeibo_Info_best_1.txt", 'w', 'utf-8')

#*******************************************************************************
# Step 1: log in to weibo.cn
# This method works for weibo.cn (it transfers data in plain text);
# for weibo.com, set up the POST data and headers instead
#     LoginWeibo(username, password)  parameters: username, password
#*******************************************************************************
def LoginWeibo(username, password):
    try:
        # Enter the username/password to log in
        print u'Preparing to log in to weibo.cn...'
        driver.get("http://login.weibo.cn/login/")
        elem_user = driver.find_element_by_name("mobile")
        elem_user.send_keys(username)    # username
        elem_pwd = driver.find_element_by_xpath("/html/body/div[2]/form/div/input[2]")
        elem_pwd.send_keys(password)     # password, name=password_6785
        #elem_rem = driver.find_element_by_name("remember")
        #elem_rem.click()                # remember the login state

        # Key point: pause so the verification code can be typed in
        # (on the mobile page http://login.weibo.cn/login/)
        time.sleep(20)

        # Log in by clicking the submit button or by pressing the Enter key
        #elem_sub = driver.find_element_by_name("submit")
        #elem_sub.click()
        elem_pwd.send_keys(Keys.RETURN)
        time.sleep(2)

        # Get the cookies
        print driver.current_url
        print driver.get_cookies()       # cookie information, stored as dicts
        print u'Cookie key-value pairs:'
        for cookie in driver.get_cookies():
            #print cookie
            for key in cookie:
                print key, cookie[key]
        # driver.get_cookies() is a list containing a single cookie dict
        print u'Login succeeded...'
    except Exception, e:
        print "Error: ", e
    finally:
        print u'End LoginWeibo!\n\n'

#*******************************************************************************
# Step 2: visit a personal page, e.g. http://weibo.cn/5824697471, and get info
#     VisitPersonPage()
# Common encoding error -- UnicodeEncodeError: 'ascii' codec can't encode
# characters -- the files use utf-8 encoding
#*******************************************************************************
def VisitPersonPage(user_id):
    try:
        global infofile                  # global file variable
        url = "http://weibo.com/" + user_id
        driver.get(url)
        print u'Preparing to visit the personal page...', url
        print u'Person details'
        # user id
        print u'User id: ' + user_id
        # nickname, follow count, fan count, post count and other information
        # URL: http://weibo.cn/5824697471/follow
    except Exception, e:
        print "Error: ", e
    finally:
        print u'End VisitPersonPage!\n\n'

#*******************************************************************************
# Step 3: visit the http://weibo.cn/search/ (mobile) page and search hot topics
# Crawl the posts and comments; note the effect of page turning and post counts
#*******************************************************************************
def GetComment(key):
    try:
        global infofile                  # global file variable
        driver.get("http://weibo.cn/search/")
        print u'Searching hot topic keyword:', key

        # Enter the topic and search for it
        item_inp = driver.find_element_by_xpath("//div[@class='c']/form/div/input")  # name=keyword
        item_inp.send_keys(key)
        item_inp.send_keys(Keys.RETURN)  # search by pressing the Enter key

        # Content
        #content = driver.find_elements_by_xpath("//div[@class='content clearfix']/div/p")
        comment = driver.find_elements_by_xpath("//a[@class='cc']")
        content = driver.find_elements_by_xpath("//div[@class='c']")
        print content
        all_comment_url = []             # store all comment URLs
        i = 0
        j = 0
        infofile.write(u'Start:\r\n')
        print u'Length', len(content)
        while i < len(content):
            #print content[i].text
            # u'收藏' = "favorite", u'评论' = "comment"; filter out other tags
            if (u'收藏' in content[i].text) and (u'评论' in content[i].text):
                print content[i].text
                infofile.write(u'Weibo info:\r\n')
                infofile.write(content[i].text + '\r\n')
                div_id = content[i].get_attribute("id")
                print div_id
                while (1):               # other class=cc links exist, e.g. "original comment"
                    url_com = comment[j].get_attribute("href")
                    if ('comment' in url_com) and ('uid' in url_com):
                        print url_com
                        infofile.write(u'Comment info:\r\n')
                        infofile.write(url_com + '\r\n')
                        all_comment_url.append(url_com)  # save into the list
                        j = j + 1
                        break
                    else:
                        j = j + 1
            i = i + 1

        # http://weibo.cn/search/?pos=search
        print driver.current_url
        # Converting Chinese to URL encoding in Python: urllib.quote(key);
        # urllib.unquote converts it back -- but the conversion failed here
        # http://weibo.cn/search/mblog?hideSearchFrame=&keyword=欢乐颂&page=2
        #url = "http://weibo.cn/search/mblog?hideSearchFrame=&keyword=" + key_url + "&page=2"

        # Get the next 10 pages
        n = 2
        while n <= 10:
            # Turn pages via the "next page" link; the first search conveniently
            # used the search box to enter the keyword and visit the results
            url_get = driver.find_element_by_xpath("//div[@id='pagelist']/form/div/a")
            url = url_get.get_attribute("href")
            print url
            # Get the next page
            driver.get(url)
            comment = driver.find_elements_by_xpath("//a[@class='cc']")
            content = driver.find_elements_by_xpath("//div[@class='c']")
            print content
            i = 0
            j = 0                        # the first <a class='cc' href> is redundant
            print u'Length', len(content)
            infofile.write(u'\r\nNext page:\r\n')
            while i < len(content):
                #print content[i].text
                if (u'收藏' in content[i].text) and (u'评论' in content[i].text):
                    print content[i].text
                    infofile.write(u'Weibo info:\r\n')
                    infofile.write(content[i].text + '\r\n')
                    # Get the id of this post, then get the comments via the id
                    # First get:  <div id="M_DU3NPZQSD" class="c">
                    # Then get:   <a class="cc" href="http://weibo.cn/comment/...#cmtfrm"></a>
                    div_id = content[i].get_attribute("id")
                    print div_id
                    '''
                    url = driver.find_elements_by_xpath("//div[@id=" + div_id + "]/a")
                    print url
                    for u in url:
                        print u.get_attribute("href")
                    '''
                    while (1):           # other class=cc links exist, e.g. "original comment"
                        url_com = comment[j].get_attribute("href")
                        if ('comment' in url_com) and ('uid' in url_com):
                            print url_com
                            infofile.write(u'Comment info:\r\n')
                            infofile.write(url_com + '\r\n')
                            all_comment_url.append(url_com)
                            j = j + 1
                            break
                        else:
                            j = j + 1
                i = i + 1
            n = n + 1
        else:
            print u'Finished collecting the comment URL queue; leaving the while loop'

        # Visit the comment URLs and crawl them
        print u'\n\nCrawling comments'
        infocomment = codecs.open("SinaWeibo_Info_best_2.txt", 'w', 'utf-8')
        for url in all_comment_url:
            print url
            driver.get(url)
            #driver.refresh()
            time.sleep(2)
            infocomment.write(url + '\r\n')
            test = driver.find_elements_by_class_name('c')
            print len(test)
            # Error: Message: Element not found in the cache -
            #        perhaps the page has changed since it was looked up
            # http://www.51testing.com/html/21/n-862721-2.html
            # The exception is self-explanatory: the element was not found in the
            # cache because the page changed after the element was looked up.
            # When the current page jumps, its cached elements are cleared.
            k = 0
            while k < len(test):
                print test[k].text
                infocomment.write(test[k].text + '\r\n')
                k = k + 1
            infocomment.write('\r\n')
        infocomment.close()
    except Exception, e:
        print "Error: ", e
    finally:
        print u'End GetComment!\n\n'
        print '**********************************************\n'

#*******************************************************************************
# Program entry
# Note: Sina Weibo added a verification code; log in with Firefox and type it in,
# or jump directly to a celebrity's Weibo, e.g. http://weibo.cn/guangxianliuyan
#*******************************************************************************
if __name__ == '__main__':
    # Define variables
    username = '15201615157'         # enter your username
    password = '*********'           # enter your password

    # Call the functions
    LoginWeibo(username, password)   # log in to Weibo

    # Inside "if __name__ == '__main__':" a global variable can be referenced
    # without the global declaration
    print 'Read file:'
    user_id = inforead.readline()
    while user_id != "":
        user_id = user_id.rstrip('\r\n')
        print user_id
        VisitPersonPage(user_id)     # visit the personal page, e.g. http://weibo.cn/guangxianliuyan
        user_id = inforead.readline()
        #break

    # Search hot Weibo posts and crawl the comments
    key = u'欢乐颂'                  # the keyword "Ode to Joy"
    GetComment(key)

    infofile.close()
    inforead.close()




Login section core code
The core code of the login section, analyzed against the corresponding DOM tree structure, is as follows:
# Call the Firefox browser
driver = webdriver.Firefox()
driver.get("http://login.weibo.cn/login/")
# Username
elem_user = driver.find_element_by_name("mobile")
elem_user.send_keys(username)
# Password, name=password_6785
elem_pwd = driver.find_element_by_xpath("/html/body/div[2]/form/div/input[2]")
elem_pwd.send_keys(password)
# Remember the login state
elem_rem = driver.find_element_by_name("remember")
elem_rem.click()
# Key point: pause so the verification code can be typed in
# (on the mobile page http://login.weibo.cn/login/)
time.sleep(20)
# Log in by clicking the submit button or by pressing the Enter key
elem_sub = driver.find_element_by_name("submit")
elem_sub.click()
elem_pwd.send_keys(Keys.RETURN)
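A side note for readers on newer environments: the find_element_by_* helpers used throughout this article were removed in Selenium 4. A minimal sketch of the same login steps with the current locator API, assuming the page structure is unchanged (the real weibo.cn login page has since changed, so treat this purely as an illustration):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

username = 'your_username'   # replace with your account
password = 'your_password'

driver = webdriver.Firefox()
driver.get("http://login.weibo.cn/login/")
driver.find_element(By.NAME, "mobile").send_keys(username)                      # username field
elem_pwd = driver.find_element(By.XPATH, "/html/body/div[2]/form/div/input[2]") # password field
elem_pwd.send_keys(password)
elem_pwd.send_keys(Keys.RETURN)                                                 # press Enter to log in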



Why log in from the mobile site?
Sina Weibo has two entrances:
Sina Weibo common login interface: http://login.sina.com.cn/
Corresponding main interface: http://weibo.com/
However, I personally recommend the mobile Weibo portal: http://login.weibo.cn/login/
Corresponding main interface: http://weibo.cn/
Because when you log in on the client (weibo.com) side and crawl the comment information, the comments are always loaded dynamically by JavaScript and only appear after clicking a button; simply fetching the comment nodes, or extracting the comment part of the HTML source with regular expressions, returns empty values. That is why the mobile site is used for crawling.
The main difference between the two is that the mobile data is more concise: the content is basically the same, but the pictures are smaller, the follow and fan lists only display 20 pages, some personal information is missing, and so on. For the crawler, however, the information corresponds to what the client shows.
In the DOM, such a dynamically loaded comment appears as a link with href="javascript:void(0)", together with some script functions that implement the dynamic loading.
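For illustration only, a sketch of how one might handle such a dynamically loaded list on the client side with Selenium: click the loading link first, then read the rendered DOM. The XPath selectors below are hypothetical placeholders, not taken from the real weibo.com page:

import time
# 'driver' is assumed to be a logged-in webdriver.Firefox() instance.
# Hypothetical selectors, for illustration; the real weibo.com markup differs.
more_link = driver.find_element_by_xpath("//a[@href='javascript:void(0)']")
more_link.click()                                # run the JS that loads the comments
time.sleep(2)                                    # give the page time to render
for c in driver.find_elements_by_xpath("//div[@class='comment_list']"):
    print c.text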



Jumping to the next page
The common method, used in my earlier crawling articles (such as crawling Hupu pictures), is to analyze the URL and the parameters joined by &. Sina Weibo is the same:
http://weibo.cn/search/mblog?hideSearchFrame=&keyword=欢乐颂&page=2
This searches the keyword "Ode to Joy" (欢乐颂); only the page parameter needs to change. However, converting the characters of "欢乐颂" to URL encoding kept failing. Python's method for converting Chinese into URL encoding is urllib.quote(key).
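For reference, in Python 2 urllib.quote() raises a KeyError when given a unicode string containing Chinese characters; encoding the keyword as UTF-8 bytes first is a common fix. A minimal sketch under that assumption (not tested against Weibo itself):

# -*- coding: utf-8 -*-
import urllib

key = u'欢乐颂'                                 # the keyword "Ode to Joy"
key_url = urllib.quote(key.encode('utf-8'))    # quote() expects a byte string
url = "http://weibo.cn/search/mblog?hideSearchFrame=&keyword=" + key_url + "&page=2"
print url
# http://weibo.cn/search/mblog?hideSearchFrame=&keyword=%E6%AC%A2%E4%B9%90%E9%A2%82&page=2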
So the second method is used instead: get the next-page URL from the page itself, then visit it with driver.get(url), looping n from 2 to 10.
# Get the next page
url_get = driver.find_element_by_xpath("//div[@id='pagelist']/form/div/a")
url = url_get.get_attribute("href")
driver.get(url)
# Comment URLs
comment = driver.find_elements_by_xpath("//a[@class='cc']")
# Weibo content; other class='c' nodes need to be filtered out
content = driver.find_elements_by_xpath("//div[@class='c']")


Key point: get a URL queue and crawl the comments sequentially
Since Sina Weibo requires a simulated login, the comment URLs are collected in the same session; but the browser driver used here cannot visit the comment pages at the same time, because it is still in use inside the loop (find_elements_by_xpath fetches multiple values). You could of course open a second driver2, but it would also need to log in before driver2.get(url_comment) works.
Instead, the usual crawler concept of a URL queue is applied: the crawled comment URLs are stored in a queue or array. The program loops n from 2 to 10, crawling 10 pages of posts and comment URLs, and then crawls the comment information; given a URL, it could likewise crawl the content of a post.
Key: use an array as the URL queue to store all comment URLs, then call driver.get(url) on each one to crawl all the posts and comments.
# Visit the comment URLs and crawl them
print u'\n\nCrawling comments'
infocomment = codecs.open("SinaWeibo_Info_best_2.txt", 'w', 'utf-8')
for url in all_comment_url:
    print url
    driver.get(url)
    #driver.refresh()
    time.sleep(2)
    infocomment.write(url + '\r\n')
    test = driver.find_elements_by_class_name('c')
    print len(test)
    k = 0
    while k < len(test):
        print test[k].text
        infocomment.write(test[k].text + '\r\n')
        k = k + 1
    infocomment.write('\r\n')
infocomment.close()
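The alternative mentioned above, opening a second browser so one instance keeps paging while the other fetches comments, would look roughly like this. A sketch only: LoginWeibo2 is a hypothetical copy of the login routine that operates on driver2 instead of the global driver.

import time
from selenium import webdriver

driver2 = webdriver.Firefox()              # a second, independent browser instance
LoginWeibo2(driver2, username, password)   # hypothetical: same login steps, run on driver2
for url_comment in all_comment_url:
    driver2.get(url_comment)               # fetch comments without disturbing the first driver
    time.sleep(2)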


Error: element not found in the cache
During crawling you may hit the error: Error: Message: Element not found in the cache - perhaps the page has changed since it was looked up
Reference: http://www.51testing.com/html/21/n-862721-2.html
The exception is fairly self-explanatory: the element was not found in the cache because the page changed after the element was looked up. In other words, when the current page jumps, the cached elements belonging to that page are cleared.
Experiments showed that this happens when driver.get(url) visits a comment page and the elements are fetched before the page has finished loading; the workaround used here is time.sleep(2) to wait 2 seconds. A shorter delay also works; a better solution would be welcome.
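One such better alternative to a fixed sleep is Selenium's explicit wait, which polls until the elements are present or a timeout expires. A minimal sketch against the same class name used above, assuming 'driver' is the logged-in instance from the article:

import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the comment nodes instead of sleeping a fixed time.
wait = ui.WebDriverWait(driver, 10)
test = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'c')))
for elem in test:
    print elem.text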


Hot topics pinned
Comparing the client and the mobile site, you will find that searching a Weibo hot topic first shows the most popular posts on the topic, such as Ziyi's post about "Ode to Joy".




Comparing mobile and client Weibo information
This compares the posts shown on the mobile site and on the client. The information is basically consistent; slight differences are probably just due to sort order, while the client also pins hot topics and displays some recommended content, presumably driven by recommendation algorithms, such as "a person who woke up before five o'clock...".





Below that is the specific comment information, and you can see that it corresponds; but only the comments displayed on the mobile site can be crawled, while crawling the client returns empty results.







PS: Finally, I hope this article is helpful to you! The method is actually very simple; what I hope you take away is the idea: how to analyze the HTML source and the DOM tree structure, and then dynamically obtain the information you need. (By: Eastmount, late at night 4:30, 2016-05-06, http://blog.csdn.net/eastmount/)

