Crawling web images with Selenium + Python (2): Baidu

Source: Internet
Author: User
Tags: xpath

An earlier blog post described how to use Selenium + Python to crawl images from search engines such as Soso, Google, and Haosou ("good search"), but it did not cover Baidu, because Baidu is a special case. On the one hand, Baidu's image data is of good quality: each picture carries a "data-desc" description that works well as a semantic label for the image, and Baidu's search returns highly relevant images, so less manual filtering is needed afterwards. On the other hand, Baidu's image data is not easy to crawl. If you take the src value of the img tag as the download URL, as in the previous article, you do not get the image; what comes back is only 167 bytes of non-image data.
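A quick way to notice this kind of failure is to look at the first bytes of whatever was downloaded. The sketch below is not from the original post: `looks_like_image` is a hypothetical helper, and the list of signatures (JPEG, PNG, GIF, BMP) is an assumption about which formats matter here.

```python
# Hypothetical helper (not part of the original post): guess whether
# downloaded bytes are really an image by checking common file signatures.

def looks_like_image(data: bytes) -> bool:
    """Return True if data starts with a JPEG, PNG, GIF or BMP signature."""
    signatures = (
        b"\xff\xd8\xff",       # JPEG
        b"\x89PNG\r\n\x1a\n",  # PNG
        b"GIF87a",             # GIF (1987 variant)
        b"GIF89a",             # GIF (1989 variant)
        b"BM",                 # BMP
    )
    # bytes.startswith accepts a tuple and matches any of its members
    return data.startswith(signatures)


if __name__ == "__main__":
    fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16  # made-up JPEG-like header
    not_an_image = b"<html><body>not an image</body></html>"
    print(looks_like_image(fake_jpeg))
    print(looks_like_image(not_an_image))
```

Running such a check on each saved file would flag small non-image responses like the 167-byte one described above before they end up in the dataset.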


So how can Baidu images be crawled? I tried two methods. The first was never fully implemented, but the idea is complete; the second reaches the Baidu source images in a simpler way. The two approaches are described in turn.


Scenario 1:

Use Selenium to simulate mouse actions: hover over the image, right-click, and choose "Save Image As" to save it. The code is as follows:

    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    # init
    url = 'http://image.baidu.com/i?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%e6%89%8b%e6%9c%ba&oq=shouji&rsp=1'
    xpath = '//ul/li/div/a/img'

    # set profile (download preferences)
    options = webdriver.FirefoxOptions()
    options.set_preference('browser.download.folderList', 2)
    options.set_preference('browser.download.manager.showWhenStarting', False)
    options.set_preference('browser.download.dir', './yourfolder/')
    options.set_preference('browser.helperApps.neverAsk.saveToDisk', 'image/jpeg')

    # launch driver
    driver = webdriver.Firefox(options=options)
    driver.maximize_window()
    driver.get(url)

    for element in driver.find_elements(By.XPATH, xpath):
        img_url = element.get_attribute('src')
        img_desc = element.get_attribute('data-desc')

        # hover, right-click, then pick "Save Image As" from the context menu
        action = ActionChains(driver).move_to_element(element)
        action.context_click(element)
        action.send_keys(Keys.ARROW_DOWN)
        action.send_keys('v')
        action.perform()
        # click save image

    driver.close()

However, you will find that saving each picture still requires confirming a save dialog, which is very tedious. I searched Google for a long time without finding a direct solution. The root cause is that Selenium cannot operate dialogs at the operating-system level, and the claim that the "set profile" settings in the code above solve the problem proved unreliable. So if you use the right-click "Save Image As" approach, you need an additional plug-in or hook program to simulate the click automatically. AutoIt has been recommended online and may do the job, but I have not tested it myself.


Scenario 2:

The src of a Baidu image's img tag does not download the original image; only its data-desc attribute is useful. However, when the mouse hovers over a Baidu picture, a download button appears on it.


Following the link behind this download button retrieves the original image. The button is an anchor (a) tag, so once its XPath is worked out the problem is solved. The Python code is given below:
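To make the XPath pairing concrete without launching a browser, here is a minimal offline sketch. The markup in `SAMPLE` is hypothetical: it is a simplified stand-in shaped to match the two XPath expressions used in this article, not a copy of Baidu's real page, and the class name simply mirrors the one printed in the article's XPath. `pair_images_with_links` is likewise an illustrative name. The standard library's ElementTree supports only a subset of XPath, which happens to be enough here.

```python
# Sketch only: hypothetical markup shaped like the XPaths in this article.
import xml.etree.ElementTree as ET

SAMPLE = """
<html><body><ul>
  <li><div>
    <a><img src="thumb1.jpg" data-desc="phone one"/></a>
    <div><span><a class="Downloadicon" href="http://example.com/full1.jpg"/></span></div>
  </div></li>
  <li><div>
    <a><img src="thumb2.jpg" data-desc="phone two"/></a>
    <div><span><a class="Downloadicon" href="http://example.com/full2.png"/></span></div>
  </div></li>
</ul></body></html>
"""


def pair_images_with_links(markup):
    """Pair each thumbnail's data-desc with the href of its download link."""
    root = ET.fromstring(markup.strip())
    # thumbnails: ul > li > div > a > img
    imgs = root.findall(".//ul/li/div/a/img")
    # download links: ul > li > div > div > span > a with the given class
    links = root.findall(".//ul/li/div/div/span/a[@class='Downloadicon']")
    # zip relies on both lists being in the same document order
    return [(img.get("data-desc"), link.get("href"))
            for img, link in zip(imgs, links)]


if __name__ == "__main__":
    for desc, href in pair_images_with_links(SAMPLE):
        print(desc, href)
```

Note that `zip` silently stops at the shorter list, so if some result items lacked a download link the descriptions and URLs could drift out of step; pairing per `li` element would be a more defensive design.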

    import time
    import urllib.request

    from selenium import webdriver
    from selenium.webdriver.common.by import By


    class Crawler:
        def __init__(self):
            # URL to crawl
            self.url = 'http://image.baidu.com/i?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%e6%89%8b%e6%9c%ba&oq=shouji&rsp=1'
            # XPath of img element
            self.img_xpath = '//ul/li/div/a/img'
            # XPath of download link element (check the class name's
            # capitalization against the live page)
            self.download_xpath = '//ul/li/div/div/span/a[@class="Downloadicon"]'
            # URLs already downloaded, used to skip duplicates
            self.img_url_dic = {}

        # kernel function
        def launch(self):
            # launch driver
            driver = webdriver.Firefox()
            driver.maximize_window()
            driver.get(self.url)

            img_xpath = self.img_xpath
            download_xpath = self.download_xpath
            img_url_dic = self.img_url_dic

            # simulate scrolling the window to load and download more pictures
            pos = 0
            for i in range(10):
                pos += i * 500  # scroll further down the page each time
                js = "document.documentElement.scrollTop=%d" % pos
                driver.execute_script(js)

                # get image desc and download
                for img_element, link_element in zip(
                        driver.find_elements(By.XPATH, img_xpath),
                        driver.find_elements(By.XPATH, download_xpath)):
                    img_desc = img_element.get_attribute('data-desc')  # description of image
                    img_desc = self.filter_filename_str(img_desc)

                    img_url = link_element.get_attribute('href')  # URL of source image
                    if img_url is not None and img_url not in img_url_dic:
                        img_url_dic[img_url] = ''
                        ext = img_url.split('.')[-1]
                        filename = img_desc + '.' + ext
                        print(img_desc, img_url)
                        urllib.request.urlretrieve(img_url, './yourfolder/%s' % filename)
                time.sleep(1)

            driver.close()

        # filter invalid characters in filename
        def filter_filename_str(self, s):
            invalid_set = ('\\', '/', ':', '*', '?', '"', '<', '>', '|', ' ')
            for i in invalid_set:
                s = s.replace(i, '_')
            return s


    if __name__ == '__main__':
        crawler = Crawler()
        crawler.launch()

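The filename handling in the listing above can also be tried on its own. The sketch below is an illustrative variant, not the author's code: `safe_filename`, `INVALID_CHARS`, and the `default_ext` fallback are hypothetical names and choices. It replaces the same kind of invalid characters, but takes the extension from the URL path instead of splitting the whole URL on '.', so a query string after the file name does not leak into the extension.

```python
# Illustrative variant of the filename logic; names here are hypothetical.
import os
from urllib.parse import urlparse

# Characters replaced with '_' (assumed set: Windows-invalid characters plus space)
INVALID_CHARS = '\\/:*?"<>| '


def safe_filename(desc, img_url, default_ext=".jpg"):
    """Build a filename from an image description and its download URL."""
    cleaned = "".join("_" if ch in INVALID_CHARS else ch for ch in desc)
    # Use only the path part of the URL so "?size=large" etc. is ignored
    ext = os.path.splitext(urlparse(img_url).path)[1]
    # default_ext is an assumption for URLs that carry no extension at all
    return cleaned + (ext or default_ext)


if __name__ == "__main__":
    print(safe_filename("a/b: c?", "http://example.com/x/y.png?size=large"))
    print(safe_filename("phone", "http://example.com/image"))
```

Two pictures with the same description would still map to the same filename and overwrite each other; appending a counter or a hash of the URL is one way to avoid that.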
The crawl result is as follows:



The code above is only an example implementation of the approach, meant to verify its feasibility. It may contain omissions; it is offered as a reference for anyone who needs it, and corrections are welcome.






