Recently I thought again about scraping Baidu Xueshu (Baidu Scholar) references. You can look at my earlier post: http://www.cnblogs.com/ybf-yyj/p/7351493.html, but that method is painful because it needs manual copying and pasting. So here I show how to do it with Selenium. The code:
# -*- coding: utf-8 -*-
from selenium import webdriver
import time
from bs4 import BeautifulSoup

# stitch together the search URL
titlename = 'application of Biosorption for the removal of organic Pollutants:a review'
url_name = titlename.split(' ')
url = 'http://xueshu.baidu.com/s?wd=' + '+'.join(url_name)

# open Firefox
diver = webdriver.Firefox()
diver.get(url)

# there may be many references, so keep clicking 'load more' until it no longer exists
try:
    for i in range(0, 50):
        # wait for the page to finish loading
        time.sleep(0.2)
        diver.find_elements_by_class_name('request_situ')[1].click()
except:
    print '********************************************************'

# wait until loading completes, then get the page source
time.sleep(10)

# extract the references with BeautifulSoup
soup = BeautifulSoup(diver.page_source, 'lxml')
items = soup.find('div', {'class': 'con_reference'}).find_all('li')
for i in items:
    print i.find('a').get_text()

# close the browser
diver.close()
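The URL-stitching step at the top of the script can be pulled out into a small helper, which makes it easy to sanity-check without launching a browser. This is just a sketch; `build_search_url` is a name I am introducing here, not something from the original script.

```python
def build_search_url(title):
    """Build a Baidu Xueshu search URL by joining the title words with '+'.
    (Hypothetical helper; mirrors the '+'.join(titlename.split(' ')) step above.)"""
    base = 'http://xueshu.baidu.com/s?wd='
    return base + '+'.join(title.split(' '))

title = 'application of Biosorption for the removal of organic Pollutants:a review'
print(build_search_url(title))
```

Note that this reproduces the original script's behavior exactly: it only replaces spaces with '+', so other characters (like the ':' in the title) are left unencoded.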
Attention:
The part of the code highlighted in red in the original post is where I slipped up; that one mistake cost me half a day.
I also ran into a problem: on the first crawl the click event does not respond, yet when I step through with breakpoints it works fine, and I do not know why. (Under Chrome the click event does not execute at all.)
If you do not want a browser window to appear, you can replace diver = webdriver.Firefox() with diver = webdriver.PhantomJS(). The above assumes PhantomJS and geckodriver.exe are already installed.