[This article is from Sky Cloud's blog on Blog Park (cnblogs)]
Previous article
Using WebDriver + PhantomJS for browserless (headless) automation
Approach and implementation
I want to crawl the "my flash" section of Blog Park (cnblogs) to a local file, using WebDriver with PhantomJS, a headless (no-interface) browser. To obtain and verify the XPath expressions, use the Firefox browser with the Firebug and FirePath plugins installed. The code is as follows:
# -*- coding: utf-8 -*-
# Note: this script targets Python 2 and an older Selenium release that still
# provides webdriver.PhantomJS and the find_element_by_* helpers.
import os, time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import selenium.webdriver.support.ui as ui

def crawl_memeory(username, pwd):
    # Start login cnblogs.
    driver = webdriver.PhantomJS()
    driver.get("http://passport.cnblogs.com/user/signin?ReturnUrl=http%3A%2F%2Fwww.cnblogs.com%2F")
    wait = ui.WebDriverWait(driver, 10)
    wait.until(lambda dr: dr.find_element_by_id('signin').is_displayed())
    driver.find_element_by_id("input1").send_keys(username)
    driver.find_element_by_id("input2").send_keys(pwd)
    driver.find_element_by_id("signin").click()
    time.sleep(3)
    # Navigate to my memory.
    memory_url = "https://ing.cnblogs.com#my"
    driver.get(memory_url)
    wait.until(lambda dr: dr.find_element_by_id('feed_list').is_displayed())
    # The second-to-last pager link holds the total number of pages.
    element = driver.find_element_by_xpath(".//*[@id='pager_bottom']/a[last()-1]")
    page_num = int(element.text)
    # For each page, crawl the memory.
    store_dir_path = os.path.join(os.path.abspath(os.path.dirname(__file__)), "cnblogs_memory")
    if not os.path.exists(store_dir_path):
        os.mkdir(store_dir_path)
    # Set the HTML's local storage path; "w" starts from an empty file.
    store_html_path = os.path.join(store_dir_path, "cnblogs_memory.txt")
    with open(store_html_path, "w") as f:
        f.write("<!DOCTYPE html>")
    for i in range(page_num):
        wait.until(lambda dr: dr.find_element_by_id('feed_list').is_displayed())
        memory_contents = driver.find_elements_by_xpath(".//*[@id='feed_list']/ul/li")
        # Append the inner HTML of every entry on this page.
        with open(store_html_path, "a") as f:
            for memory_content in memory_contents:
                inner_content = memory_content.get_attribute("innerHTML")
                f.write(inner_content.encode("utf-8"))
        # Save a screenshot of the current page.
        pic_name = "cnblogs_memory_" + str(i + 1) + ".jpg"
        store_pic_path = os.path.join(store_dir_path, pic_name)
        driver.save_screenshot(store_pic_path)
        # The last pager link is "Next" unless this is the final page.
        last_page_button = driver.find_element_by_xpath(".//*[@id='pager_bottom']/a[last()]")
        if last_page_button.text.startswith("Next"):
            last_page_button.click()
    driver.quit()
    with open(store_html_path, "a") as f:
        f.write("</body>")

if __name__ == '__main__':
    pwd = "Password"
    username = "User name"
    crawl_memeory(username, pwd)

How to use
Save the code above to a local file named "cnblogs_memory_crawl.py", replace the user name and password placeholders with your own, and run it with Python from the command line.
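If you would rather not store the password in the script, the credentials could be read at run time instead. The following is only a sketch written for Python 3 with the standard library; the helper name load_credentials and the environment variable names CNBLOGS_USER and CNBLOGS_PWD are assumptions made for illustration and are not part of the original script.

```python
import getpass
import os


def load_credentials(env=os.environ):
    """Return (username, password), prompting for whatever is not set.

    CNBLOGS_USER and CNBLOGS_PWD are hypothetical variable names
    chosen for this example.
    """
    username = env.get("CNBLOGS_USER") or input("User name: ")
    pwd = env.get("CNBLOGS_PWD") or getpass.getpass("Password: ")
    return username, pwd
```

In the script's __main__ block, the two hard-coded strings would then be replaced by a call such as username, pwd = load_credentials() before crawl_memeory(username, pwd) is invoked.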
Result
The script creates a cnblogs_memory folder in the same directory as the script and generates a TXT file and one screenshot per page inside it. The TXT file holds the content of all my flash pages from Blog Park.
Manually change the TXT file's extension to .html and open it in a browser to view the saved content.
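The manual rename can also be scripted. A minimal sketch, assuming Python 3 and only the standard library; rename_to_html is a hypothetical helper name, not something from the original script.

```python
import os


def rename_to_html(txt_path):
    """Rename e.g. cnblogs_memory.txt to cnblogs_memory.html.

    Returns the new path. os.replace overwrites an existing
    target file with the same name.
    """
    base, _ext = os.path.splitext(txt_path)
    html_path = base + ".html"
    os.replace(txt_path, html_path)
    return html_path
```

For example, rename_to_html(os.path.join("cnblogs_memory", "cnblogs_memory.txt")) would leave cnblogs_memory.html in the same folder.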
Further optimization
You can write a script to strip unwanted content from the locally saved file, keeping only the parts you want.
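One possible form of such a cleanup script is sketched below. It assumes Python 3 and that you only want the visible text of each entry, with all tags dropped; the names TextExtractor and extract_text are made up for this example, and which parts are worth keeping depends on your own needs.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty pieces of text between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of text pieces found in the given HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks
```

Reading the saved file with UTF-8 encoding and passing its content to extract_text gives a list of text pieces that can be filtered further or written to a new file.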
Python + WebDriver: crawl Blog Park "my flash" and save it locally