[Python crawler] Using Selenium to get Baidu Encyclopedia infoboxes (message boxes) for tourist attractions

Earlier I described how to get Wikipedia infoboxes with BeautifulSoup and how to fetch website content with a spider. Recently I have been studying Selenium + PhantomJS, and I plan to use them to get the infoboxes (message boxes) of tourist attractions on Baidu Encyclopedia. This is also preliminary work for the entity alignment and attribute alignment in my graduation project. I hope the article is helpful to you ~

Source code

# coding=utf-8
"""
Created on 2015-09-04 @author: Eastmount
"""

import time
import re
import os
import sys
import codecs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.action_chains import ActionChains

# Open PhantomJS
driver = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")
# driver = webdriver.Firefox()
wait = ui.WebDriverWait(driver, 10)
global info  # global variable

# Get the infobox of 5A tourist spots
def getInfobox(name):
    try:
        # Create the output directory and txt file
        global info
        basePathDirectory = "tourist_spots_5a"
        if not os.path.exists(basePathDirectory):
            os.makedirs(basePathDirectory)
        baiduFile = os.path.join(basePathDirectory, "BaiduSpider.txt")
        if not os.path.exists(baiduFile):
            info = codecs.open(baiduFile, 'w', 'utf-8')
        else:
            info = codecs.open(baiduFile, 'a', 'utf-8')

        # Locate the search input. Notice: 1. visit URL by unicode 2. write files
        print name.rstrip('\n')  # delete char '\n'
        driver.get("http://baike.baidu.com/")
        elem_inp = driver.find_element_by_xpath("//form[@id='searchform']/input")
        elem_inp.send_keys(name)
        elem_inp.send_keys(Keys.RETURN)
        info.write(name.rstrip('\n') + '\r\n')  # codecs does not translate '\n', so write '\r\n'
        # print driver.current_url
        time.sleep(5)

        # Load the infobox
        elem_name = driver.find_elements_by_xpath("//div[@class='basic-info']/dl/dt")
        elem_value = driver.find_elements_by_xpath("//div[@class='basic-info']/dl/dd")

        # Create a key-value dictionary
        # A dictionary is a hash table: entries are stored by hash and the
        # original order is not recorded, so tuples are suggested if order matters
        elem_dic = dict(zip(elem_name, elem_value))
        for key in elem_dic:
            print key.text, elem_dic[key].text
            info.writelines(key.text + " " + elem_dic[key].text + '\r\n')
        time.sleep(5)

    except Exception, e:  # 'utf8' codec can't decode byte
        print "Error: ", e
    finally:
        print '\n'
        info.write('\r\n')

# Main function
def main():
    global info
    # Read each name and fetch its infobox
    source = open("Tourist_spots_5a_bd.txt", 'r')
    for name in source:
        name = unicode(name, "utf-8")
        if u'Forbidden City' in name:  # else add a '?'
            name = u'Beijing Forbidden City'
        getInfobox(name)
    print 'End Read files!'
    source.close()
    info.close()
    driver.close()

main()

Run results
The script reads the names of the national 5A-level scenic areas from a TXT file on the F: drive, then calls the phantomjs.exe browser to visit each entry and extract its infobox values. If you hit the encoding error "'ascii' codec can't encode characters", you can set the interpreter's default encoding to UTF-8 with the following code:

# Set the default encoding to utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Show the current default encoding
print sys.getdefaultencoding()
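To see where this error comes from, here is a minimal sketch in Python 3 (an assumption, since the snippets in this post are Python 2; the sample string is purely illustrative). The ASCII codec cannot represent Chinese characters, while UTF-8 can:

```python
# Python 3 sketch: reproduce the error by forcing the ASCII codec.
# The name is an illustrative two-character Chinese string, written
# with escapes so the snippet itself stays pure ASCII.
name = "\u6545\u5bab"

try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print("Error:", exc)

# UTF-8 can encode the same text and decode it back unchanged.
utf8_bytes = name.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
print(len(utf8_bytes), round_trip == name)
```

In Python 3 strings are Unicode by default, so the reload(sys) workaround is neither needed nor available there; the encoding is chosen explicitly wherever text is written out.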





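Corresponding source code
The infobox values come from the corresponding HTML source of the Baidu Encyclopedia page: the XPath expressions in the script select the dt (attribute name) and dd (attribute value) elements under the div whose class is basic-info. For the basics used in the code, see my previous blog posts or my Python crawler column. Selenium is not only good at automated testing, it is equally suitable for simple crawlers.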
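As a rough illustration of that dt/dd structure, the sketch below pairs names and values using only Python 3's standard html.parser instead of Selenium. The SAMPLE_HTML markup and the InfoboxParser class are hypothetical and simplified (the real page carries more markup, and this parser does not filter on the basic-info class):

```python
from html.parser import HTMLParser

# Hypothetical, simplified infobox markup for illustration only.
SAMPLE_HTML = """
<div class="basic-info">
  <dl>
    <dt>Chinese name</dt><dd>Forbidden City</dd>
    <dt>Location</dt><dd>Beijing</dd>
  </dl>
</div>
"""

class InfoboxParser(HTMLParser):
    """Collect (dt, dd) text pairs in document order."""

    def __init__(self):
        super().__init__()
        self.pairs = []     # finished (name, value) pairs
        self._tag = None    # "dt" or "dd" while inside one of them
        self._name = None   # text of the most recent <dt>
        self._buf = []      # text fragments of the current element

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buf).strip()
        if tag == "dt":
            self._name = text
        elif self._name is not None:
            self.pairs.append((self._name, text))
            self._name = None
        self._tag = None

parser = InfoboxParser()
parser.feed(SAMPLE_HTML)
for name, value in parser.pairs:
    print(name, value)
```

Because the pairs are kept in a list, they come out in the same order as they appear in the markup.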


Encoding issues
At this point you may still run into the "'ascii' codec can't encode characters" encoding problem.

This happens because the TXT file you create is written with the ASCII encoding by default, while your text is UTF-8, so you need to convert it as follows.

import codecs

# Use the open method provided by codecs to specify the encoding of the file;
# content is automatically converted to internal unicode at read time
if not os.path.exists(baiduFile):
    info = codecs.open(baiduFile, 'w', 'utf-8')
else:
    info = codecs.open(baiduFile, 'a', 'utf-8')

# codecs.open works in binary mode and does not translate '\n',
# so the line break is written explicitly as '\r\n'
info.writelines(key.text + ":" + elem_dic[key].text + '\r\n')
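For comparison, the same idea can be sketched in Python 3 (an assumption; the post targets Python 2), where the built-in open accepts an encoding argument. Append mode creates the file when it is missing, so the existence check becomes optional. The helper name append_pairs and the temporary directory are placeholders for illustration:

```python
import os
import tempfile

def append_pairs(path, title, pairs):
    """Append a title line and 'name:value' lines to a UTF-8 text file."""
    # Mode 'a' creates the file if it does not exist yet.
    # newline='' stops Python from translating the explicit '\r\n'.
    with open(path, "a", encoding="utf-8", newline="") as info:
        info.write(title + "\r\n")
        for name, value in pairs:
            info.write(name + ":" + value + "\r\n")
        info.write("\r\n")  # blank line between entries

# Usage: a temporary directory stands in for the output folder.
out_dir = tempfile.mkdtemp()
out_file = os.path.join(out_dir, "BaiduSpider.txt")
append_pairs(out_file, "\u6545\u5bab", [("Location", "Beijing")])
append_pairs(out_file, "Second entry", [("Level", "5A")])

with open(out_file, encoding="utf-8", newline="") as f:
    content = f.read()
print(repr(content))
```

The with block also closes the file after every entry, so no handle is left open between calls.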


Summary
From the code you can learn a basic automated crawling method, and how to display key-value pairs through a for loop, matching each displayed attribute with its attribute value. This is implemented by the following code:
elem_dic = dict(zip(elem_name, elem_value))
But the final output does not follow the order in the infobox. Why? Because a dictionary is a hash table: it stores entries by hash and does not record the order in which they were inserted.
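One way to keep the infobox order is sketched below in Python 3. The placeholder strings stand in for the .text of the dt/dd elements (they are assumptions, not scraped data). Either iterate over the zipped pairs directly, or use collections.OrderedDict when a mapping is needed:

```python
from collections import OrderedDict

# Placeholder strings standing in for the .text of the <dt>/<dd> elements.
elem_name = ["Name", "Location", "Level"]
elem_value = ["Forbidden City", "Beijing", "5A"]

# Option 1: a list of tuples keeps the original order by construction.
pairs = list(zip(elem_name, elem_value))
lines = [name + " " + value for name, value in pairs]

# Option 2: an OrderedDict remembers insertion order and still allows lookup.
ordered = OrderedDict(pairs)

for line in lines:
    print(line)
print(list(ordered.keys()))
```

Since Python 3.7 a plain dict also preserves insertion order, but on Python 2, which the script targets, the list of tuples or OrderedDict is the safer choice.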
Finally, I hope this article helps you. I also wrote an introductory article on the basics, but every attempt to publish it triggered CSDN's sensitive-content system and it was locked automatically, and I do not know what triggers it. I recommend reading the following ~
[Python crawler] Introduction to the method and operation of common element localization in selenium
(By: Eastmount, 2015-9-6, 2:30 at night, http://blog.csdn.net/eastmount/)
