In earlier posts I showed how to extract Wikipedia infoboxes with BeautifulSoup and how to fetch page content with a spider. I have recently been studying Selenium with PhantomJS, and in this post I use them to get the infoboxes (INFOBOX) of tourist attractions from Baidu Encyclopedia. This is also preliminary work for the entity alignment and attribute alignment in my graduation project. I hope the article is helpful to you ~
Source code
# coding=utf-8
"""
Created on 2015-09-04 @author: Eastmount
"""

import time
import re
import os
import sys
import codecs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.action_chains import ActionChains

# Open PhantomJS
driver = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")
# driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)
global info  # global variable

# Get the infobox of 5A tourist spots
def getInfobox(name):
    try:
        # Create the output directory and txt file
        global info
        basePathDirectory = "Tourist_spots_5A"
        if not os.path.exists(basePathDirectory):
            os.makedirs(basePathDirectory)
        baiduFile = os.path.join(basePathDirectory, "BaiduSpider.txt")
        if not os.path.exists(baiduFile):
            info = codecs.open(baiduFile, 'w', 'utf-8')
        else:
            info = codecs.open(baiduFile, 'a', 'utf-8')

        # Locate the input box. Notice: 1. visit the URL with unicode 2. write files
        print name.rstrip('\n')  # delete char '\n'
        driver.get("http://baike.baidu.com/")
        elem_inp = driver.find_element_by_xpath("//form[@id='searchForm']/input")
        elem_inp.send_keys(name)
        elem_inp.send_keys(Keys.RETURN)
        info.write(name.rstrip('\n') + '\r\n')  # codecs does not support the '\n' line break
        # print driver.current_url
        time.sleep(5)

        # Load the infobox
        elem_name = driver.find_elements_by_xpath("//div[@class='basic-info']/dl/dt")
        elem_value = driver.find_elements_by_xpath("//div[@class='basic-info']/dl/dd")

        # Create the key-value dictionary
        # A dictionary is a hash table: entries are hashed on insertion and the
        # original order is not recorded, so tuples are suggested if order matters
        elem_dic = dict(zip(elem_name, elem_value))
        for key in elem_dic:
            print key.text, elem_dic[key].text
            info.writelines(key.text + " " + elem_dic[key].text + '\r\n')
        time.sleep(5)

    except Exception, e:  # 'utf8' codec can't decode byte
        print "Error: ", e
    finally:
        print '\n'
        info.write('\r\n')

# Main function
def main():
    global info
    # Get the information through the function
    source = open("Tourist_spots_5A_BD.txt", 'r')
    for name in source:
        name = unicode(name, "utf-8")
        if u'Forbidden City' in name:  # else add a '?'
            name = u'Beijing Forbidden City'
        getInfobox(name)
    print 'End Read Files!'
    source.close()
    info.close()
    driver.close()

main()
Run results
The script reads the names of the national 5A-level scenic areas from a TXT file on the F drive, then calls the phantomjs.exe browser to visit each page and extract the infobox values. If you run into the encoding problem "'ascii' codec can't encode characters", you can set the interpreter's default encoding to UTF-8 with the following code:
# Set the default encoding to utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Show the current default encoding
print sys.getdefaultencoding()
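To make the cause of this error concrete, here is a minimal sketch written for Python 3 (the article's own code is Python 2). It is not the author's method: it shows that encoding non-ASCII text with the ASCII codec fails, and that naming UTF-8 explicitly works. The sample string is an assumption chosen only for illustration.

```python
# Sample non-ASCII text (assumed for illustration only).
name = u"\u6545\u5bab"

# Encoding with the ASCII codec fails, which is the error quoted above.
try:
    name.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as err:
    ascii_failed = True
    print("Error:", err)

# Naming the codec explicitly avoids relying on any default encoding.
encoded = name.encode("utf-8")
decoded = encoded.decode("utf-8")
print(len(name), len(encoded))
```

Encoding explicitly at the point where text is written keeps the fix local, instead of changing a process-wide default.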
Corresponding source code
The code above works against the HTML source of the corresponding Baidu Encyclopedia infobox. For the basics behind the code, see my previous blog posts or my Python crawler column. Selenium is not only good for automated testing; it is equally suitable for simple crawlers.
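As a rough illustration of the structure that the XPath expressions //div[@class='basic-info']/dl/dt and .../dd select, the sketch below pairs dt and dd text from a small HTML fragment. The fragment is a hypothetical stand-in inferred from those XPath expressions, not the real Baidu Encyclopedia page, and it uses Python 3's standard html.parser instead of Selenium so that it runs without a browser. The class name InfoboxParser and the sample values are assumptions.

```python
from html.parser import HTMLParser

# Hypothetical markup, inferred from the XPath used in the script.
SAMPLE_HTML = u"""
<div class="basic-info">
  <dl>
    <dt>Chinese name</dt><dd>Forbidden City</dd>
    <dt>Location</dt><dd>Beijing</dd>
  </dl>
</div>
"""


class InfoboxParser(HTMLParser):
    """Collect the text of <dt> and <dd> tags inside div.basic-info."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while inside div.basic-info
        self.current = None  # "dt", "dd" or None
        self.names = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif dict(attrs).get("class") == "basic-info":
                self.depth = 1
        elif self.depth > 0 and tag in ("dt", "dd"):
            self.current = tag
            (self.names if tag == "dt" else self.values).append(u"")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
        elif tag in ("dt", "dd"):
            self.current = None

    def handle_data(self, data):
        if self.current == "dt":
            self.names[-1] += data
        elif self.current == "dd":
            self.values[-1] += data


parser = InfoboxParser()
parser.feed(SAMPLE_HTML)
parser.close()

# Pair each attribute name with its value, in document order.
pairs = [(n.strip(), v.strip()) for n, v in zip(parser.names, parser.values)]
for key, value in pairs:
    print(key, value)
```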
Encoding issues
At this point you may still encounter the "'ascii' codec can't encode characters" encoding problem.
This happens because the TXT file you create is written with the ASCII codec by default, while your text is in UTF-8 format, so the file has to be opened with an explicit encoding as follows.
import os
import codecs

# Use the open method provided by codecs to specify the encoding of the opened
# file; the content is automatically converted to internal unicode at read time
if not os.path.exists(baiduFile):
    info = codecs.open(baiduFile, 'w', 'utf-8')
else:
    info = codecs.open(baiduFile, 'a', 'utf-8')

# This is not the io open method, so the line break is written as '\r\n'
info.writelines(key.text + ":" + elem_dic[key].text + '\r\n')
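A possible simplification, sketched here for Python 3 and not taken from the original script: mode 'a' creates the file when it does not exist, so one open call can cover both branches. The sketch swaps in io.open for codecs.open; the helper name append_pair, the demo file name, and the sample values are hypothetical.

```python
import io
import os
import tempfile


def append_pair(path, key, value):
    """Append one 'key: value' line to a UTF-8 text file."""
    # Mode "a" creates the file if it is missing, so no exists-check is needed.
    # newline="" turns off newline translation, so "\r\n" is written verbatim.
    with io.open(path, "a", encoding="utf-8", newline="") as f:
        f.write(u"%s: %s\r\n" % (key, value))


# Write into a temporary directory so the demo leaves real data untouched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "BaiduSpider_demo.txt")

append_pair(demo_path, u"name", u"\u6545\u5bab")
append_pair(demo_path, u"location", u"Beijing")

# Read the file back without newline translation to check what was written.
with io.open(demo_path, "r", encoding="utf-8", newline="") as f:
    content = f.read()
print(repr(content))
```

Opening and closing the file inside the helper also avoids leaving a handle open between scenic spots.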
Summary
The code demonstrates a basic automated crawling method, and shows how to display key-value pairs with a for loop so that each displayed attribute is matched with its attribute value. The pairing is implemented by the following code:
elem_dic = dict(zip(elem_name, elem_value))
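But the final output is not in the same order as the infobox. Why? As the comment in the source code notes, a dictionary is a hash table and does not record the order in which the data was entered; to keep the order, use tuples instead.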
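One way to keep the infobox order, sketched in Python 3 under the assumption that the attribute names and values have already been extracted as plain strings (the sample data below is made up): keep the zipped pairs as a list of tuples, or use collections.OrderedDict when lookup by key is also needed.

```python
from collections import OrderedDict

# Hypothetical attribute names and values, in page order.
names = ["Chinese name", "Location", "Type", "Level"]
values = ["Forbidden City", "Beijing", "Museum", "5A"]

# A list of tuples keeps the pairs exactly in the order they were zipped.
pairs = list(zip(names, values))

# OrderedDict remembers insertion order and still allows lookup by key.
ordered = OrderedDict(pairs)

lines = [u"%s: %s" % (key, value) for key, value in pairs]
for line in lines:
    print(line)
```

On Python 3.7 and later a plain dict also preserves insertion order; the Python 2 dict used in the article does not.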
Finally, I hope this article helps you. I also wrote a basic introductory article, but publishing it kept triggering CSDN's sensitive-content system, which locked it automatically, and I do not know what triggered it. I recommend reading it ~
[Python crawler] Introduction to common element-locating methods and operations in Selenium
(By: Eastmount, late at night, 2015-9-6 2:30 http://blog.csdn.net/eastmount/)
[Python crawler] Selenium get Baidu Encyclopedia tourist attractions infobox message box