I heard about an interesting website with a community board full of funny girl posts, so I went over to have a look. It really is good: pictures with captions; any otaku will understand once he has seen it for himself.

The next step is to crawl the pictures of these girls. And not just the pictures: the dialogues turn out to be just as funny, so I grabbed those as well.

So here is the question: which tool to use? In earlier exercises I used urllib2, but regular-expression matching is really tedious. This time let's try something a little more advanced: Selenium.

What is Selenium? It is actually a web automation testing tool; driving it feels almost like operating the browser yourself. Enough talk, let's get started.
Tools: Python 2.7 + Selenium 2 (Python edition)

1. First import the required modules, then define the local storage directory.
# coding: utf-8

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from time import *
import os
import urllib2

# file save path
file_path = r'F:\taqu'
2. Then define three functions: one that creates a directory for each girl, one that saves each girl's pictures, and one that saves the text description and dialogues. The pictures themselves are downloaded with urllib2.
# ------- three functions: create a directory per girl, save pictures, write text -------

def mkdir_for_girl(path, name):
    """Create a directory named after the title.
    :param name: directory name
    :return: the path of the directory created
    """
    path = os.path.join(path, name)
    if not os.path.exists(path):
        os.mkdir(path)
    return path

def save_pictures(path, url_list):
    """Save pictures into a local folder.
    :param path: folder to save into, returned by mkdir_for_girl
    :param url_list: list of picture URLs to save
    :return: None
    """
    for (index, url) in enumerate(url_list):
        try:
            print u'%s saving picture %d' % (ctime(), index)
            pic_name = str(index) + '.jpg'
            file_name = os.path.join(path, pic_name)
            # if the picture already exists, do not save it again
            if os.path.exists(file_name):
                print u'%s picture already exists' % ctime()
                continue
            req = urllib2.Request(url, headers={
                'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) '
                              r'Gecko/20100101 Firefox/45.0'})
            data = urllib2.urlopen(req, timeout=30).read()
            f = open(file_name, 'wb')
            f.write(data)
            f.close()
        except Exception, e:
            print u'%s picture %d failed to save, skipping to the next' % (ctime(), index)

def write_text(path, info):
    """Create (or append to) info.txt in the given directory and write
    info (the girl's text description and dialogues) into it.
    :param path: directory of the txt file, returned by mkdir_for_girl
    :param info: text content to write
    :return: None
    """
    filename = os.path.join(path, 'info.txt')
    with open(filename, 'a+') as fp:
        fp.write(info.encode('utf-8'))
        fp.write('\n'.encode('utf-8'))
        fp.write('\n'.encode('utf-8'))
3. Create the webdriver and open the target page, then switch to the destination page through the hyperlink's anchor text.
# ---------- open the page, set the timeout, maximize the window ----------
driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.maximize_window()
driver.get(r'http://www.taqu.cn/community/')

# ---------- switch to the funny-girls board ----------
# the original passes the board's Chinese link text here
driver.find_element_by_partial_link_text(u'funny girl').click()
4. All the girls are on a single page, but that page is long, really long: you have to scroll for quite a while to reach the bottom. Why scroll to the bottom at all?

Because the HTML source held by the browser only contains the elements that have already been rendered in the current window; the part that has not been shown yet has no code at all, so naturally Selenium cannot locate those elements either.

So if you want Selenium to be able to find every girl, you first have to scroll the page to the bottom and wait until they have all loaded; only then does the HTML contain all of them.

Which raises the next question: how do you scroll the page? Scrolling happens through the browser scrollbar, but the scrollbar is not an HTML element, so Selenium cannot control it directly. Here we have to fall back on JavaScript; executing JS code from Selenium is no problem at all. Here is the code:
# ---------- scroll to the bottom of the window to load all the girls ----------
# The page is long and loads more content as you pull down, so we drive the
# scrollbar down via JavaScript. A single scroll does not reach the bottom;
# it takes several. How many? The approach here is to keep scrolling, and
# after every scroll check whether the bottom has been reached. How to check?
# By looking for a marker image at the very bottom: if it is not found we
# have not reached the bottom yet and keep scrolling; once it is found we
# break out of the loop.

# to scroll quickly, drop the timeout to 1 second
driver.implicitly_wait(1)

# keep scrolling and scrolling...
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    try:
        # locate the marker image at the bottom of the page
        driver.find_element_by_xpath(
            ".//*[@id='waterfall-loading']/img[@src='/img/no-more.png']")
        # no exception was thrown, so the bottom marker was found: leave the loop
        break
    except NoSuchElementException:
        # exception thrown: no bottom marker yet, keep scrolling down
        pass

# change the timeout back to 10 seconds
driver.implicitly_wait(10)
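The scroll-and-check logic above is really just a bounded poll loop, and it can be factored out so that the browser-specific parts (execute_script, find_element) are passed in as callables. A minimal Python 3 sketch (scroll_until and its max_tries bound are my additions; the original loops forever until the marker appears):

```python
def scroll_until(bottom_reached, scroll_once, max_tries=100):
    """Call scroll_once() repeatedly until bottom_reached() returns True.
    Returns the number of scrolls performed, or None if max_tries ran
    out before the bottom marker appeared."""
    for tries in range(max_tries):
        if bottom_reached():
            return tries
        scroll_once()
    return None
```

With Selenium this would be called roughly as scroll_until(lambda: marker_present(driver), lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")), where marker_present is a hypothetical helper wrapping the find_element call in a try/except. The max_tries cap means a page that never shows the marker cannot hang the crawler.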
5. Once the full page has loaded, all the girl cover pictures can be located in one go. The covers are clickable; clicking one pops up all of that girl's pictures and dialogues. CSS selectors are used here: compared with XPath, I prefer CSS.
# ---------- find all the girl cover pictures ----------
# the covers are clickable; clicking one pops up all of that girl's
# pictures and text descriptions
girls = driver.find_elements_by_css_selector("div#container img")
num = len(girls)
print u'total girls: %d' % num
6. The last step: loop over all the cover pictures, click each one in turn, then crawl each girl's pictures and dialogues and save them locally.
# ---------- click each cover and extract each girl's information ----------
for girl in girls:
    # click the cover of each girl in turn
    girl.click()

    # after each click, click the pop-up box to refresh the driver, otherwise
    # the driver cache still holds the previous girl
    # do not skip this step -- I missed it and wasted a long time on it
    driver.find_element_by_xpath("html/body/div[3]/div[2]").click()

    # extract the title; ':' and '|' cannot appear in file names,
    # so replace them
    title = driver.find_element_by_css_selector(
        "p.waterfall-detail-header-title").text
    title = title.encode('utf-8')
    title = title.replace(":", "：")
    title = title.replace("|", "丨")
    title = title.decode('utf-8')

    # under file_path, create a directory named after the girl's title
    path = mkdir_for_girl(file_path, title)

    # extract the URLs of all the girl's pictures
    pics = driver.find_elements_by_css_selector("div.water-detail-content img")
    pic_url = [x.get_attribute('src') for x in pics]
    print u'number of pictures for this girl: %d' % len(pic_url)

    # save the pictures into the local directory named after the title
    save_pictures(path, pic_url)

    # extract the girl's basic introduction and write it to info.txt
    info = driver.find_element_by_xpath(
        "html/body/div[3]/div[2]/div[2]/div[2]").text
    write_text(path, info)

    # extract all the dialogues and write them to info.txt as well
    talks = driver.find_elements_by_css_selector("div.water-detail-content p")
    for t in talks:
        write_text(path, t.text)

    # close this girl's pictures
    driver.find_element_by_xpath("html/body/div[3]/div[2]/div[1]/div/img").click()
    print u'this girl is done, moving on to the next'
    sleep(1)

# ---------- all girl information extracted ----------
driver.close()
print u'congratulations, all girl information has been extracted!'
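The two replace() calls only cover ':' and '|', but Windows forbids nine characters in file names, and any of them could show up in a title. A generalized sanitizer, as a Python 3 sketch (sanitize_title is my own helper; mapping to full-width look-alikes is one possible convention, and the original maps '|' to '丨' instead):

```python
def sanitize_title(title):
    """Map every character Windows forbids in file names
    (\\ / : * ? " < > |) to its full-width look-alike, so titles
    stay readable but are always legal directory names."""
    forbidden = '\\/:*?"<>|'
    fullwidth = '＼／：＊？＂＜＞｜'
    return title.translate({ord(f): w for f, w in zip(forbidden, fullwidth)})
```

str.translate does all nine substitutions in a single pass, so there is no chain of replace() calls to keep in sync.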
7. Run it and watch the output. Exciting, isn't it?
Sample output (the counts were lost in the original):

total girls: ...
09:24:54 ...
09:24:54 ...
09:24:54 ...
...
09:33:37 saving picture 11
this girl is done, moving on to the next
congratulations, all girl information has been extracted!
8. Now look at the local directory. Excited yet?
Note: the code above is complete; merge all the fragments together and it runs as-is. As you can see, Selenium is very powerful: it is not just a web automation tool, it is also a fine weapon for crawling. One drawback remains, though: Selenium opens a browser UI, so it runs a bit slowly. Is there a better way? There is!
Using Selenium to implement a simple crawler that grabs girl pictures