[Python crawler] Crawling CSDN Blog Summaries with Selenium: Summary and Questions


This article mainly uses selenium to crawl CSDN blog summaries, in preparation for later data analysis of the blogs that CSDN hot-technology experts have published in recent years. The author uses selenium because crawling the site with BeautifulSoup (urllib) returns the error "HTTPError: Forbidden". During the crawl the author also ran into the problem of partial dynamic page updates, where the changed part of the page could not be located; Firebug was used for the analysis, and readers are welcome to suggest better methods.
Code:


I. CSDN Blog Website Analysis and Problems

This article mainly crawls the blogs of CSDN experts, because the experts' articles are of relatively high quality and they maintain more columns, which makes them more representative. Website: http://blog.csdn.net/experts.html
Inspecting elements in the browser lets you get the URL link of each expert's blog, as shown below:


For example, to visit all my blogs, the address is: http://blog.csdn.net/eastmount/
Through analysis, the expert information is located under the <div class='experts_list_wrap clearfix'> node and can be crawled with the following code:
urls = driver.find_elements_by_xpath("//div[@class='experts_list_wrap clearfix']/dl")
for u in urls:
    print u.text

Next comes the paging operation, with elements inspected in the same way. The difficulty is that the page numbers are loaded dynamically, and the "next page" link is always displayed as:
<a value= "/peoplelist.html?channelid=0&amp;page=2" href= "#list" > Next </a>


and no matter which page number or "next page" you click, the URL in the address bar is always: http://blog.csdn.net/experts.html#list

Problem:
When writing a web crawler, you often need to fetch the next page. The traditional method is to analyze the URL, for example:
http://www.medlive.cn/pubmed/pubmed_search.do?q=protein&page=1
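As a hedged illustration of this traditional approach, a minimal sketch might simply rewrite the page parameter in a loop (the result XPath below is only a placeholder assumption, not taken from the real site):

# Minimal sketch of paging by URL analysis: rewrite the page parameter in a loop.
from selenium import webdriver

driver = webdriver.Firefox()
base = "http://www.medlive.cn/pubmed/pubmed_search.do?q=protein&page="
for page in range(1, 4):                    # first 3 pages
    driver.get(base + str(page))            # each page has its own URL
    rows = driver.find_elements_by_xpath("//div[@class='result']")   # placeholder XPath
    for r in rows:
        print r.text
driver.close()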
However, there are two kinds of websites that do not work this way.

1. The first kind loads new content as the page is scrolled down; for this, refer to the article below:
"Python crawler: crawling dynamic pages, ideas + example (II)" - Kong Tianyi
Core code, recorded from my online notes:
# Pull down the scroll bar so the browser loads the dynamically loaded content.
# You may need to do this several times, with an appropriate delay in between
# (depending on connection speed). If the content is long, increase the scroll distance.
driver.execute_script("window.scrollBy(0, 10000)")
time.sleep(3)

# The page often consists of multiple <frame> or <iframe> elements. WebDriver positions
# itself in the outermost frame by default, so you need to switch frames here,
# otherwise the page elements needed below cannot be found.
driver.switch_to.frame("app_canvas_frame")
soup = BeautifulSoup(driver.page_source, 'xml')
contents = soup.find_all('pre', {'class': 'content'})          # contents
times = soup.find_all('a', {'class': 'c_tx c_tx3 goDetail'})   # publish time
for content, _time in zip(contents, times):
    # the underscore in _time is to distinguish it from the time module
    print content.get_text(), _time.get_text()

# When the end has been reached, the "next page" button has no id, so the loop can end
if driver.page_source.find('pager_next_' + str(next_num)) == -1:
    break
# Find the "next page" button
elem = driver.find_element_by_id('pager_next_' + str(next_num))
# Click "next page"
elem.click()
# id of the next "next page" button
next_num += 1
# Because the page has to be pulled down first in the next loop iteration, jump back to the outer frame
driver.switch_to.parent_frame()
2. The second kind uses dynamic loading: the content can be paged, but there is no convenient per-page URL, as on CSDN.
At the same time, the position of the "next page" link keeps changing: it may start out as the 5th <a></a>, but after you click, "Home" and "previous page" appear and it becomes the 7th <a></a>.


If it were like the picture above, you could locate the link by class="next"; but with a CSDN-style page number bar, how do you solve it? Should I locate "next page" by its text, simulate the HTTP request directly, or execute the JavaScript? A headache.
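One possible answer, offered only as a sketch and not as the method used later in this article: locate the link by its text instead of its position, since the text does not move when "Home" and "previous page" appear (the anchor text u'下一页' is assumed to match the page):

# Sketch: locate "next page" by its link text instead of its position.
try:
    nextPage = driver.find_element_by_link_text(u'下一页')   # the "next page" anchor
    nextPage.click()
except Exception:
    print 'No "next page" link found, stop paging'
# an equivalent XPath would be //div[@class='page_nav']/a[text()='下一页']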


Below is the Python code that jumps to the next page once, where:
nextPage = driver.find_element_by_xpath("//div[@class='page_nav']/a[6]")
gets the 6th <a></a>; but after the jump, the position of "next page" changes.
# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
import re
import os
import time

# Open the Firefox browser, set the wait/load time and visit the URL
driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)
driver.get("http://blog.csdn.net/experts.html")

# Get the number of list pages from <div class="page_nav"> (e.g. "... 8 pages in total")
def getPage():
    number = 0
    texts = driver.find_element_by_xpath("//div[@class='page_nav']").text
    print texts
    m = re.findall(r'(\w*[0-9]+)\w*', texts)    # regular expression to find the numbers
    print 'pages: ' + str(m[1])
    return int(m[1])

# Main function
def main():
    pageNum = getPage()
    print pageNum
    i = 1
    # Loop to get the titles and URLs
    while i <= 2:        # should be pageNum to cover all pages
        urls = driver.find_elements_by_xpath("//div[@class='experts_list_wrap clearfix']/dl")
        for u in urls:
            print u.text
        nextPage = driver.find_element_by_xpath("//div[@class='page_nav']/a[6]")
        print nextPage.text
        nextPage.click()
        time.sleep(2)
        i = i + 1
    else:
        print 'Load over'

main()
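One note on the script above: it creates a WebDriverWait but never uses it, relying on time.sleep(2) instead. A minimal sketch of an explicit wait (reusing the driver and wait objects from the script, and assuming the list element is replaced when the content refreshes) could look like this:

# Sketch: wait until the old expert list has gone stale instead of sleeping a fixed time.
from selenium.webdriver.support import expected_conditions as EC

old_list = driver.find_element_by_xpath("//div[@class='experts_list_wrap clearfix']")
nextPage = driver.find_element_by_xpath("//div[@class='page_nav']/a[6]")
nextPage.click()
wait.until(EC.staleness_of(old_list))   # returns once the old element has been replaced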
The output of the full script is as follows:




II. Firebug Review Elements

Next, Firebug is used to analyze the page data. It is a tool I use very frequently; I will write a separate article introducing it later.
Installation is simple: install the plugin directly from within the Firefox browser, as shown below:


After a successful installation a small bug icon is displayed, and you can then right-click and use Firebug to inspect elements.


Then click on "Network", can be found through the following URL access.
http://blog.csdn.net/peoplelist.html?channelid=0&page=2


In fact, this could also be seen from the HTML shown earlier: the page refreshes locally by jumping to the address in the value attribute.
<a value="/peoplelist.html?channelid=0&amp;page=2" href="#list">next page</a>
So selenium can crawl this URL directly, which makes analyzing the site much more convenient.
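A minimal sketch of that idea, assuming the "next page" anchor keeps the value attribute shown above (the anchor text u'下一页' is also an assumption):

# Sketch: build the real list URL from the "next page" anchor's value attribute.
nextPage = driver.find_element_by_xpath(u"//a[@href='#list'][text()='下一页']")
value = nextPage.get_attribute("value")        # e.g. /peoplelist.html?channelid=0&page=2
nextUrl = "http://blog.csdn.net" + value
print nextUrl
driver.get(nextUrl)                            # load the page that the local refresh would have shown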


PS: There must be a better way to crawl this directly instead of going through this method, but the main point here is to introduce the Firebug plug-in, which will be covered in detail another time.


III. Selenium Crawling Expert Information and URLs

After obtaining the new list URL via Firebug, first get each expert's personal information and the URL of their blog.
The analysis looks like this:


by <dl class= "Experts_list" > Get personal Information, the href attribute is a URL.
Detailed code is as follows, "csdn_blog_url.py" file.
# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
import re
import os
import codecs

# Open the Firefox browser and set the wait/load time
driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)

# Main function
def main():
    page = 1
    allPage = 8        # total number of list pages shown in page_nav
    infofile = codecs.open("Blog_name.txt", 'a', 'utf-8')
    urlfile = codecs.open("Blog_url.txt", 'a', 'utf-8')
    # Loop to get the titles and URLs
    while page <= 2:   # crawl only 2 pages here; normally it should be allPage
        url = "http://blog.csdn.net/peoplelist.html?channelid=0&page=" + str(page)
        print url
        driver.get(url)
        # Get the URLs
        name_urls = driver.find_elements_by_xpath("//dl[@class='experts_list']/dt/a")
        for url in name_urls:
            u = url.get_attribute("href")
            print u
            urlfile.write(u + "\r\n")
        # Save the personal information
        info = driver.find_elements_by_xpath("//dl[@class='experts_list']/dd")
        for u in info:
            content = u.text
            content = content.replace('\n', ' ')   # replace newlines with spaces for easier file writing
            print content
            infofile.write(content + "\r\n")
        page = page + 1
        infofile.write("\r\n")
    else:
        infofile.close()
        urlfile.close()
        print 'Load over'

main()
The output is as follows and is written to files; only 2 pages of information are crawled here.
Blog_name.txt stores personal information.


Blog_url.txt stores the expert bloggers' URLs.




IV. Selenium Crawling Blog Information

Finally, crawl each blogger's blog information. The page needs to be analyzed here as well, but blog list pages can be reached through a plain URL, which is more convenient, for example: http://blog.csdn.net/eastmount/article/list/2
Therefore you only need to: 1. get the total page number; 2. crawl the information on each page; 3. loop over the pages by changing the URL; 4. repeat the crawl.
Alternatively, you can click "next page" to jump, stop when there is no "next page", finish that blogger, and then move on to the next one, as in the sketch below.
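A hedged sketch of that alternative, assuming the article items keep the list_item article_item class used later and that the "next page" link text is u'下一页':

# Sketch: keep clicking "next page" until the link no longer exists, then move on.
import time

while True:
    items = driver.find_elements_by_xpath("//div[@class='list_item article_item']")
    for item in items:
        print item.text
    try:
        nextPage = driver.find_element_by_link_text(u'下一页')
    except Exception:
        break                 # no "next page" link left: this blogger is finished
    nextPage.click()
    time.sleep(2)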


Then inspect the elements on each blog page. Crawling with BeautifulSoup (urllib) produces a "Forbidden" error.
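As an aside, this kind of "Forbidden" response is often just the server rejecting the default urllib User-Agent; a hedged sketch of retrying with a browser-like User-Agent header (not the approach taken in this article, and not guaranteed to get past CSDN's checks) is:

# Sketch: fetch a page with a browser-like User-Agent, then parse it with BeautifulSoup.
import urllib2
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36'}
request = urllib2.Request('http://blog.csdn.net/eastmount/article/list/1', headers=headers)
html = urllib2.urlopen(request).read()
soup = BeautifulSoup(html, 'html.parser')
print soup.title

This article sticks with selenium, which drives a real browser and avoids the issue altogether.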
Each article is contained in one <div></div>, as shown below, so you only need to locate that element.


That element can be crawled directly; sometimes the title, summary and time need to be located separately, but the method is the same.
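For example, a sketch of pulling those fields out of each located <div> with relative XPaths; the child class names article_title, article_description and article_manage are assumptions about the page structure at the time and are not verified:

# Sketch: locate the title, summary and time relative to each article <div>.
# The child class names below are illustrative assumptions.
articles = driver.find_elements_by_xpath("//div[@class='list_item article_item']")
for art in articles:
    title = art.find_element_by_xpath(".//div[@class='article_title']").text
    summary = art.find_element_by_xpath(".//div[@class='article_description']").text
    posted = art.find_element_by_xpath(".//div[@class='article_manage']").text
    print title, posted
    print summary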


Here is the detailed code, which crawls the details of two bloggers.

# coding=utf-8
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
import re
import time
import os
import codecs

# Open the Firefox browser and set the wait/load time
driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)

# Get the total page number shown at the bottom of each blogger's blog page
def getPage():
    number = 0
    texts = driver.find_element_by_xpath("//div[@class='pagelist']").text
    print texts
    m = re.findall(r'(\w*[0-9]+)\w*', texts)    # regular expression to find the numbers
    print 'pages: ' + str(m[1])
    return int(m[1])

# Main function
def main():
    # Get the total number of lines in the txt file
    count = len(open("Blog_url.txt", 'rU').readlines())
    print count
    n = 0
    urlfile = open("Blog_url.txt", 'r')
    content = codecs.open("Blog_content.txt", 'a', 'utf-8')
    # Loop to get each blogger's article summary information
    while n < 2:           # crawl only 2 bloggers' blogs here; normally it should be count
        url = urlfile.readline()
        url = url.strip("\n")
        print url
        driver.get(url)
        # Get the total page number
        allPage = getPage()
        print u'The total number of pages is:', allPage
        time.sleep(2)
        m = 1              # page 1
        while m <= 2:      # only 2 pages per blogger here; normally it should be allPage
            ur = url + "/article/list/" + str(m)
            print ur
            driver.get(ur)
            article_title = driver.find_elements_by_xpath("//div[@class='list_item article_item']")
            for title in article_title:
                con = title.text
                print con + '\n'
                con = con.replace('\n', ' ')   # replace newlines with spaces for easier file writing
                content.write(con + "\r\n")
            m = m + 1
        else:
            content.write("\r\n")
        print u"Finished crawling one blogger's articles\n"
        n = n + 1
    else:
        content.close()
        urlfile.close()
        print 'Load over'

main()
The results of the crawl are as follows:

Finally, I hope this article is helpful to you. If there are errors or shortcomings, please forgive me~
The new semester has begun: time to get rid of many bad habits, improve efficiency, do research better, teach seriously, and live a beautiful life.
(by:eastmount 2017-02-22 2:30 P.M. http://blog.csdn.net/eastmount/)
