And so the third blog post begins ~
The topic this time is data fetching. We're finally at the core of the discussion, and I'm very excited! If you search Baidu or Google (if you can reach it) for data capture or web crawling, you'll find thousands of examples. But most of the code is very verbose, and much of it can only fetch static data; against data written dynamically by JS it is helpless. Or it parses URLs out of the HTML, then hunts through the JS-generated pages for the data it wants.
But! I wonder if you've ever noticed this: open the inspect-element panel of Chrome or Safari or whatever browser, and the data you can see on the web page is all loaded right there. Yet switch to the page source, and the JS-written data is gone! So why do we have to find the data in the source code rather than take it straight from the inspected elements? Carrying this question, I went through countless trials, combed through various foreign community forums and plugin packages, and finally found the answer.
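To see that gap for yourself, here is a minimal sketch (my own illustration, not from the original post): a plain HTTP fetch returns only the raw HTML the server sends, before any JavaScript has run, so JS-filled content is simply not in it.

from urllib.request import urlopen

# Fetch the raw page source, i.e. exactly what "view source" shows.
html = urlopen("http://www.cnblogs.com/Jerrold-Gao/").read().decode("utf-8")

# The sidebar container may sit here as an empty placeholder, but the
# score/rank numbers are only written into it after JavaScript runs,
# so grepping this raw HTML for the rendered values finds nothing.
print("sidebar_scorerank" in html)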
First, let me introduce today's protagonists!
- Library: Selenium
- App: PhantomJS
Since Selenium is a Python library, it can be installed following the approach in my first blog post. PhantomJS you can download directly through the link I gave. Once both are installed, you can formally begin capturing data. The example, of course, is my own blog ~
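Before the real scraping, a quick sanity check can save some head-scratching. This is just my own sketch, not part of the original tutorial; the executable path is the one used in the example below, so adjust it to wherever you installed PhantomJS:

from selenium import webdriver

# Confirm that Selenium can locate and drive the PhantomJS binary.
driver = webdriver.PhantomJS(executable_path='/users/yirugao/phantomjs/bin/phantomjs')
print(driver.name)  # should report 'phantomjs'
driver.quit()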
First, the sample code!
# -*- coding: utf-8 -*-
from selenium import webdriver

def crawling_webdriver():
    # get local session of PhantomJS
    driver = webdriver.PhantomJS(executable_path='/users/yirugao/phantomjs/bin/phantomjs', port=65000)
    driver.set_window_size(1024, 768)  # optional
    driver.get("http://www.cnblogs.com/Jerrold-Gao/")  # load page
    # start crawling data
    data = driver.find_element_by_id("sidebar_scorerank").text
    # print to check my result
    print(data)
    # quit the driver
    driver.quit()

if __name__ == '__main__':
    crawling_webdriver()
Have you ever been amazed by how streamlined Python's language is? This is exactly where Python's third-party packages show their power.
A few points deserve attention. First, the executable path and port of PhantomJS must be filled in properly. A domestic article somehow omitted both the path and the port, and magically the program still ran; I tried it a couple of times and could not get it to run that way. Second, please be sure to quit the driver at the end, otherwise memory usage will keep growing and growing.
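If you worry about forgetting that last step, a minimal sketch like the following (my own habit, not from the original code) guarantees the quit happens even when the scrape throws an exception:

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='/users/yirugao/phantomjs/bin/phantomjs', port=65000)
try:
    driver.get("http://www.cnblogs.com/Jerrold-Gao/")
    print(driver.find_element_by_id("sidebar_scorerank").text)
finally:
    driver.quit()  # runs no matter what, so the PhantomJS process is always released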
This code alone is enough to crawl the data on the page (the original post shows a screenshot of the result here).
But what if you want to crawl a hyperlink? Copying the pattern above gets you nowhere, because .text only returns the visible text, not the link target. Here's a simple tip:
# crawling a link
data = driver.find_element_by_id("homepage1_homepagedays_dayslist_ctl00_daylist_titleurl_0")
url = data.get_attribute("href")
This pulls the href attribute out of the element with that ID, and there is your link. Here is what we get (again, a screenshot in the original post):
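Those auto-generated ASP.NET IDs are brittle, so when you want every post link rather than a single one, a CSS selector is handier. A small sketch, reusing the driver from the example above; the ".postTitle a" selector is my guess at the blog theme's markup, so treat it as an assumption:

# Grab every post-title link on the page in one go.
links = driver.find_elements_by_css_selector(".postTitle a")  # selector is an assumption about the theme
for link in links:
    print(link.text, link.get_attribute("href"))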
================================= here is the divider for another topic ===================================
Selenium has been something of a hit for a while now, both in China and abroad. Its homepage documentation is much clearer than OpenPyXL's. Here are a few resources I find especially useful.
Because its official website inexplicably goes offline at times, the first resource is the Python-Selenium site, which carries many well-commented code examples and is very practical.
Second, for readers in China, the best choice is the Chinese-language site; the posts are not numerous, but the discussion is quite rich.
Finally, there is the omnipotent Stack Overflow; access from China is always a bit slow, which honestly is rather exasperating.
================================= the divider comes back for an encore ======================================
Today's rant: I once visited a museum with a friend. The friend said museums are actually quite crafty: hang a single painting on a huge white wall, and with nothing nearby to compare it against, you immediately get that "what a marvelous sense of art" feeling. I replied that it is just like Western European cuisine: a small stack of food on a large white plate instantly feels exquisite. My friend said, "Exactly, with a backdrop that big, anything looks good on it." I lowered my head, smiled, and said, "So when do I get to hang my code up like that?"