Python data capture with selenium, and introduction to selenium resources

Source: Internet
Author: User

And so the third blog begins ~

The topic this time is data scraping. We have finally reached the core of the discussion, and I am very excited! If you have searched Baidu or Google (if you can reach it) for data scraping or crawling, you will have found thousands of examples. But most of that code is verbose, and much of it only fetches static data; it is helpless against data written dynamically by JavaScript. At best, it parses the HTML for a URL, then digs through the JS-generated pages hunting for the desired data.

But! I wonder if you have ever noticed this: open the inspect-element panel of Chrome, Safari, or any other browser, and the data you see on the web page is all loaded there. Yet switch to the page source, and the JS-written data is gone! So why should we have to dig the data out of the source code instead of reading it straight from the inspected DOM? With this question in mind, I went through countless trials, across a variety of foreign community forums and plug-in packages, and finally found the answer.
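To make that gap concrete, here is a tiny illustrative sketch (my own hypothetical example, not any real page): the raw HTML the server sends contains only an empty element plus a script, so a static scraper that reads the source finds nothing, even though the rendered page shows the data.

```python
import re

# Hypothetical raw HTML as the server sends it: the div is empty,
# and a script fills it in only after a browser executes the JS.
raw_html = """
<div id="sidebar_scorerank"></div>
<script>
  document.getElementById("sidebar_scorerank").textContent = "Rank: 12345";
</script>
"""

# A static scraper that only reads the source sees an empty div:
match = re.search(r'<div id="sidebar_scorerank">(.*?)</div>', raw_html)
print(repr(match.group(1)))  # prints '' -- the data is simply not in the source
```

This is exactly why a tool that runs the JavaScript (like Selenium driving a browser) is needed: the data only exists after the script executes.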

First, let me introduce today's protagonists!

    • Library: Selenium
    • Headless browser: PhantomJS

Since Selenium is a Python package, it can be installed following the approach in my first blog post. PhantomJS can be downloaded directly through the link I gave. Once both are installed, you can start scraping data in earnest. The example, of course, is my own blog ~

First, the sample code!

    # -*- coding: utf-8 -*-
    from selenium import webdriver

    def crawling_webdriver():
        # get local session of PhantomJS
        driver = webdriver.PhantomJS(executable_path='/users/yirugao/phantomjs/bin/phantomjs', port=65000)
        driver.set_window_size(1024, 768)  # optional
        # load page
        driver.get("http://www.cnblogs.com/Jerrold-Gao/")
        # start crawling data
        data = driver.find_element_by_id("sidebar_scorerank").text
        # print to check the result
        print(data)
        # quit the driver
        driver.quit()

    if __name__ == '__main__':
        crawling_webdriver()

  

Have you ever been amazed by how streamlined Python is? This is where Python's ecosystem of packages shows its power.

A few points deserve attention. First, the executable path and port of PhantomJS must be set correctly. A domestic article somehow omitted the path and the port, and magically the program still ran; I tried it a couple of times and could not get it to run that way. Second, be sure to quit the driver at the end, otherwise memory usage will keep growing.
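One hedged way to guarantee the quit (a sketch of my own, not part of the original code): wrap driver creation in a context manager so quit() runs even when the crawl raises. The FakeDriver below is a stand-in so the pattern can be demonstrated without PhantomJS installed.

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(factory):
    """Create a driver and guarantee quit() runs, even on errors."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # always release the browser's memory

# Stand-in driver, so the pattern can be shown without a real browser.
class FakeDriver:
    def __init__(self):
        self.alive = True
    def quit(self):
        self.alive = False

captured = None
try:
    with managed_driver(FakeDriver) as drv:
        captured = drv
        raise RuntimeError("simulated crash mid-crawl")
except RuntimeError:
    pass

print(captured.alive)  # prints False: quit() ran despite the error
```

In real use you would pass a factory that builds the PhantomJS driver; the try/finally inside the context manager is what prevents the growing-memory problem.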

This code will scrape the data from that page; the result is:

But what if you want to scrape a hyperlink? Copying the code above mechanically, you still cannot extract it. Here is a simple tip:

    # crawling a link
    data = driver.find_element_by_id("homepage1_homepagedays_dayslist_ctl00_daylist_titleurl_0")
    url = data.get_attribute("href")

Selenium locates the element with that ID, the href attribute is extracted from it, and the link is found. Here is what we get:
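If you want every link on a page rather than a single one, the same idea generalizes. As an offline sketch (my own illustrative snippet, not the blog's actual HTML), the extraction step can be shown with a plain regex over a sample fragment:

```python
import re

# Hypothetical HTML fragment standing in for the blog's post list.
html = (
    '<a href="http://www.cnblogs.com/Jerrold-Gao/p/1.html">post 1</a>'
    '<a href="http://www.cnblogs.com/Jerrold-Gao/p/2.html">post 2</a>'
)

# Pull out every href value, in document order.
urls = re.findall(r'href="([^"]+)"', html)
print(urls)
```

With a live driver, the analogous Selenium approach is to look up multiple elements (the plural find_elements lookups) and call get_attribute("href") on each one.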

================================= dividing line: on to another topic ===================================

Selenium has been quite popular for a while, both at home and abroad, and its documentation carries much clearer commentary than openpyxl's. Here are a few resources I find most useful.

Because its official website inexplicably went offline for me, the first resource is the Python-Selenium site, which has many annotated code examples and is very practical.

Second, for students in China, the best choice is the Chinese-language site; the posts are not numerous, but the discussion is quite rich.

Finally, there is the omnipotent Stack Overflow; access from inside China is always a bit slow, which is genuinely frustrating.

================================= dividing line: time for an encore ======================================

Today's banter: I once visited a museum with a friend. My friend said museums are actually quite cunning: hang a single painting on a huge white wall, and with nothing nearby to compare it to, you immediately get that wonderful feeling of "what a fine sense of art." I replied that it is just like Western European cuisine: a small pile of food on a large white plate feels exquisite. My friend said, "Exactly, with that much blank space, anything looks good." I lowered my head, smiled, and said, "So when do I get to hang my code up like that?"

          French sweet duck breast meat at home (Magret de Canard)
