People working on image processing often need to collect and organize large image data sets. In academic research there are standard data sets that everyone can use directly, but real projects frequently have to gather their own pictures, and crawling images from the Internet is a very common task. To accomplish this task in Python, two questions need to be addressed:
1. Where does the image material come from? The first instinct is image search engines: for example, to collect pictures of mobile phones, entering the keyword into a search engine returns a large number of related pictures.
2. How do we handle dynamic pages? The content of a dynamic website is usually loaded asynchronously via Ajax, so reading the page directly with Python's urllib returns incomplete HTML: the content we actually need is loaded asynchronously and is not directly accessible. Some tasks only involve static pages, but unfortunately dynamic pages are now the mainstream, and the image search sites we want to crawl load their results through Ajax. So how do we crawl a dynamic site? The answer is to use the selenium library to drive a real browser, let the page load completely, and then process the rendered content. For background, see the earlier blog post that summarizes four scenarios in which selenium is better suited to this kind of task. A minimal sketch of the idea is given below.
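As a minimal sketch (assuming Firefox and its WebDriver are installed; the two-second wait is an arbitrary choice), the snippet below loads the search page in a real browser, gives the Ajax requests a moment to finish, and then reads the fully rendered DOM from driver.page_source. Fetching the same URL with urllib would only return the initial HTML skeleton, without the asynchronously loaded image nodes.

from selenium import webdriver
import time

driver = webdriver.Firefox()                                 # open a real browser window
driver.get('http://pic.sogou.com/pics?query=%CA%D6%BB%FA')   # the Sogou image search result page
time.sleep(2)                                                # give the Ajax requests time to finish (arbitrary wait)
html = driver.page_source                                    # full DOM, including asynchronously loaded image nodes
print(len(html))                                             # noticeably longer than what urllib would return
driver.quit()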
This article presents an image crawler based on the Sogou image search engine; the same approach works for Google and 360 Haosou. Crawling Baidu images is slightly more troublesome and is covered in another blog post.
First, open the Sogou image search and enter the query word "mobile phone"; a large number of mobile phone pictures appear.
The address in the URL bar is the URL of the page to crawl. The Python + selenium crawler code is given below:
from selenium import webdriver
import time
import urllib

# Address of the page to crawl (the Sogou image search result for "mobile phone")
url = 'http://pic.sogou.com/pics?query=%CA%D6%BB%FA&w=05009900&p=40030500&_asf=pic.sogou.com&_ast=1422627003&sc=index&sut=1376&sst0=1422627002578'
# XPath of the target <img> elements
xpath = '//div[@id="imgid"]/ul/li/a/img'

# Start the Firefox browser
driver = webdriver.Firefox()
# Maximize the window, because each pass only sees the pictures inside the visible window
driver.maximize_window()
# Record downloaded image addresses to avoid repeated downloads
img_url_dic = {}

# Open the page to crawl in the browser
driver.get(url)

# Simulate scrolling the window to load and download more pictures
pos = 0
m = 0  # picture number
for i in range(10):  # number of scroll steps; adjust as needed
    pos += i * 500  # scroll down a further 500 pixels each time
    js = "document.documentElement.scrollTop=%d" % pos
    driver.execute_script(js)
    time.sleep(1)

    for element in driver.find_elements_by_xpath(xpath):
        img_url = element.get_attribute('src')
        # Save the picture to the specified path (skip empty and already-seen URLs)
        if img_url != None and not img_url_dic.has_key(img_url):
            img_url_dic[img_url] = ''
            m += 1
            ext = img_url.split('.')[-1]
            filename = str(m) + '.' + ext
            # Download and save the picture (Python 2: urllib.urlopen)
            data = urllib.urlopen(img_url).read()
            f = open('./yourfolder/' + filename, 'wb')
            f.write(data)
            f.close()

driver.close()
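Two practical notes before running this script (assumptions about your environment): the target directory ./yourfolder must already exist, since open() will not create it; and the code is written for Python 2 (urllib.urlopen and dict.has_key do not exist in Python 3, where you would use urllib.request.urlopen and the in operator instead).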
The above code only downloads the first page; selenium can also simulate clicking on the page to trigger loading more images (a sketch of that idea follows the next snippet). In addition, the code that saves an image can be simplified to a single line:
urllib.urlretrieve(img_url, './yourfolder/%s' % filename)
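As a hedged sketch of the click-to-load idea, continuing from the script above (driver and time are already defined), the locator for the "load more" element below is hypothetical and must be found by inspecting the actual result page:

# Hypothetical locator for a "load more" element; inspect the page for the real selector
more_btn = driver.find_element_by_xpath('//div[@id="more"]')
for _ in range(3):  # click a few times to trigger additional Ajax loads
    more_btn.click()
    time.sleep(1)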
Finally, the crawled pictures are as shown in the figure. The next article describes how to crawl Baidu images.