Python + WebDriver: crawl the Blog Park "my flash" section and save it locally

Source: Internet
Author: User

[This article is from Sky cloud-owned's blog on Blog Park]

Previous article

Using WebDriver + PhantomJS to automate without a browser window

Approach and implementation

I want to crawl the "my flash" section of Blog Park (cnblogs) and save it to a local file, using WebDriver with PhantomJS, a headless browser with no user interface. To obtain and verify the XPath expressions, use the Firefox browser with the Firebug and FirePath plugins installed. The code is as follows:

# -*- coding: utf-8 -*-
# Note: this script appears to target Python 2 and an older Selenium release
# (it relies on webdriver.PhantomJS and the find_element_by_* methods).
import os, time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import selenium.webdriver.support.ui as ui

def crawl_memeory(username, pwd):
    # Start login cnblogs.
    driver = webdriver.PhantomJS()
    driver.get("http://passport.cnblogs.com/user/signin?ReturnUrl=http%3A%2F%2Fwww.cnblogs.com%2F")
    wait = ui.WebDriverWait(driver, 10)
    wait.until(lambda dr: dr.find_element_by_id('signin').is_displayed())
    driver.find_element_by_id("input1").send_keys(username)
    driver.find_element_by_id("input2").send_keys(pwd)
    driver.find_element_by_id("signin").click()
    time.sleep(3)
    # Navigate to my memory.
    memory_url = "https://ing.cnblogs.com#my"
    driver.get(memory_url)
    wait.until(lambda dr: dr.find_element_by_id('feed_list').is_displayed())
    # The second-to-last pager link holds the number of the last page.
    element = driver.find_element_by_xpath(".//*[@id='pager_bottom']/a[last()-1]")
    page_num = int(element.text)
    # For each page, crawl the memory.
    store_dir_path = os.path.join(os.path.abspath(os.path.dirname(__file__)), "cnblogs_memory")
    if os.path.exists(store_dir_path):
        pass
    else:
        os.mkdir(store_dir_path)
    # Set the html's local storage path.
    store_html_path = os.path.join(store_dir_path, "cnblogs_memory.txt")
    # Create the file, or empty it if it is left over from an earlier run.
    f = open(store_html_path, "w")
    f.close()
    memory_url = "https://ing.cnblogs.com#my/p"
    with open(store_html_path, "a") as file:
        file.write("<!DOCTYPE html>")
    for i in range(page_num):
        wait.until(lambda dr: dr.find_element_by_id('feed_list').is_displayed())
        memory_contents = driver.find_elements_by_xpath(".//*[@id='feed_list']/ul/li")
        # Append the inner HTML of every entry on this page to the local file.
        for memory_content in memory_contents:
            inner_content = memory_content.get_attribute("innerHTML")
            with open(store_html_path, "a+") as file:
                file.write(inner_content.encode("utf-8"))
        # Save a screenshot of the current page as well.
        pic_name = "cnblogs_memory_" + str(i + 1) + ".jpg"
        store_pic_path = os.path.join(store_dir_path, pic_name)
        driver.save_screenshot(store_pic_path)
        # The last pager link is the "Next" button, except on the final page.
        last_page_button = driver.find_element_by_xpath(".//*[@id='pager_bottom']/a[last()]")
        if (last_page_button.text.startswith("Next")):
            last_page_button.click()
    driver.quit()
    with open(store_html_path, "a") as file:
        file.write("</body>")

if __name__ == '__main__':
    pwd = "Password"
    username = "User name"
    crawl_memeory(username, pwd)

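The two pager XPath expressions and the feed-list expression used in the code can also be sanity-checked offline, as a complement to FirePath. The following is only a sketch in Python 3: it uses the standard library's ElementTree, which supports a limited XPath subset, rather than a real browser, and the SAMPLE markup is a made-up, well-formed stand-in for the real page (the actual cnblogs markup may differ).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page structure the crawler expects.
SAMPLE = """
<div>
  <div id="feed_list"><ul><li>first</li><li>second</li></ul></div>
  <div id="pager_bottom"><a>1</a><a>2</a><a>3</a><a>Next &gt;</a></div>
</div>
"""

def probe(markup):
    """Evaluate the crawler's three XPath expressions against well-formed markup."""
    root = ET.fromstring(markup.strip())
    # Every entry in the feed list.
    items = root.findall(".//*[@id='feed_list']/ul/li")
    # Second-to-last pager link: assumed to hold the last page number.
    last_page = root.find(".//*[@id='pager_bottom']/a[last()-1]")
    # Last pager link: assumed to be the "Next" button.
    next_link = root.find(".//*[@id='pager_bottom']/a[last()]")
    return [li.text for li in items], int(last_page.text), next_link.text
```

Calling probe(SAMPLE) returns the entry texts, the page count, and the label of the last pager link. This only confirms that the expressions select what is intended on the sample; a check against the live page still needs a browser.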
How to use

Save the code above to a local file named "cnblogs_memory_crawl.py", replace the user name and password with your own, and run it with Python from the command line.
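As an alternative to editing the user name and password directly into the script, the credentials could be read at run time. This is a hedged sketch in Python 3, not part of the original script: the function load_credentials and the environment variable names CNBLOGS_USERNAME and CNBLOGS_PASSWORD are hypothetical names chosen for illustration.

```python
import getpass
import os

def load_credentials(env=os.environ, prompt=getpass.getpass):
    """Return (username, password) without hard-coding them in the script.

    CNBLOGS_USERNAME and CNBLOGS_PASSWORD are hypothetical variable names.
    """
    username = env.get("CNBLOGS_USERNAME")
    if not username:
        raise ValueError("CNBLOGS_USERNAME is not set")
    password = env.get("CNBLOGS_PASSWORD")
    if password is None:
        # Fall back to an interactive prompt that does not echo the input.
        password = prompt("cnblogs password: ")
    return username, password
```

The returned pair could then be passed to crawl_memeory(username, pwd) in place of the placeholder strings.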

Result

The script creates a cnblogs_memory folder in the directory that contains the script, and generates a TXT file and screenshot image files in it. The TXT file holds the content of all of my "my flash" pages on Blog Park.

Manually change the TXT file's extension to .html and open it in a browser to view the saved pages.
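The manual renaming step could also be scripted. Below is a minimal sketch in Python 3, assuming you prefer to keep the original TXT file and work on a copy; copy_as_html is a hypothetical helper name, not something from the original script.

```python
import os
import shutil

def copy_as_html(txt_path):
    """Copy the saved TXT file next to itself with an .html extension."""
    base, _ext = os.path.splitext(txt_path)
    html_path = base + ".html"
    # Copy rather than rename, so the crawler's original output is kept.
    shutil.copyfile(txt_path, html_path)
    return html_path
```

For example, copy_as_html("cnblogs_memory/cnblogs_memory.txt") would produce cnblogs_memory/cnblogs_memory.html, which can then be opened in any browser.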

Further optimization

You can write a script to strip unwanted content from the locally saved file, keeping only the parts you want.
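One possible shape for such a cleanup script is sketched below in Python 3, using only the standard library's html.parser. It assumes that "the part you want" is the visible text of each entry; the class TextExtractor and the function extract_text are illustrative names, and a real cleanup would likely need rules tailored to the saved markup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text chunks, ignoring script and style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped element.
        if text and not self._skip_depth:
            self.chunks.append(text)

def extract_text(html):
    """Return the list of visible text chunks found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

To apply it, read the saved file as UTF-8 text, pass the contents to extract_text, and write the returned chunks to a new file, one per line.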
