[Python crawler] Using Selenium to get Baidu Encyclopedia tourist attraction infoboxes

Last Update:2015-12-18 Source: Internet

Author: User

Earlier I described how to get Wikipedia infoboxes with BeautifulSoup and how to fetch website content with a spider. Recently I have been studying Selenium + PhantomJS, and in this article I use them to get the infoboxes (message boxes) of tourist attractions on Baidu Encyclopedia. This is also preparatory work for the entity alignment and attribute alignment in my graduation project. I hope the article is helpful to you.

Source

    # coding=utf-8
    """
    Created on 2015-09-04 @author: Eastmount
    """
    # Note: this is Python 2 code, written against the Selenium API of that time.

    import time
    import re
    import os
    import sys
    import codecs
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import selenium.webdriver.support.ui as ui
    from selenium.webdriver.common.action_chains import ActionChains

    # Open PhantomJS (raw string so the backslashes are taken literally)
    driver = webdriver.PhantomJS(executable_path=r"G:\phantomjs-1.9.1-windows\phantomjs.exe")
    # driver = webdriver.Firefox()
    wait = ui.WebDriverWait(driver, 10)
    info = None  # global variable: the output file

    # Get the infobox of a 5A tourist spot
    def getInfobox(name):
        global info
        try:
            # Create the output directory and txt file
            basePathDirectory = "tourist_spots_5a"
            if not os.path.exists(basePathDirectory):
                os.makedirs(basePathDirectory)
            baiduFile = os.path.join(basePathDirectory, "BaiduSpider.txt")
            if not os.path.exists(baiduFile):
                info = codecs.open(baiduFile, 'w', 'utf-8')
            else:
                info = codecs.open(baiduFile, 'a', 'utf-8')

            # Locate the search input. Notice: 1. visit the URL with unicode 2. write to the file
            print name.rstrip('\n')  # delete the trailing '\n'
            driver.get("http://baike.baidu.com/")
            # The id is case-sensitive; adjust it if the page markup differs
            elem_inp = driver.find_element_by_xpath("//form[@id='searchForm']/input")
            elem_inp.send_keys(name)
            elem_inp.send_keys(Keys.RETURN)
            # codecs files do not translate '\n', so write '\r\n' explicitly
            info.write(name.rstrip('\n') + '\r\n')
            # print driver.current_url
            time.sleep(5)

            # Load the infobox: dt holds attribute names, dd holds attribute values
            elem_name = driver.find_elements_by_xpath("//div[@class='basic-info']/dl/dt")
            elem_value = driver.find_elements_by_xpath("//div[@class='basic-info']/dl/dd")

            # Create the key-value dictionary.
            # A dict is a hash table: entries are hashed on insertion and the
            # original order is not recorded. Use a list of tuples to keep order.
            elem_dic = dict(zip(elem_name, elem_value))
            for key in elem_dic:
                print key.text, elem_dic[key].text
                info.writelines(key.text + " " + elem_dic[key].text + '\r\n')
            time.sleep(5)

        except Exception, e:  # e.g. 'utf8' codec can't decode byte
            print "Error:", e
        finally:
            print '\n'
            # info is still None if opening the file failed
            if info is not None:
                info.write('\r\n')
                info.close()  # each call reopens the file, so close it here

    # Main function
    def main():
        # Read the spot names from the file and fetch the infobox of each one
        source = open("Tourist_spots_5a_bd.txt", 'r')
        for name in source:
            name = unicode(name, "utf-8")
            if u'故宫' in name:  # Forbidden City; otherwise a '?' is added
                name = u'北京故宫'  # Beijing Forbidden City
            getInfobox(name)
        print 'End Read files!'
        source.close()
        driver.close()

    main()
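The script above creates a `WebDriverWait` object but then relies on fixed `time.sleep(5)` pauses. The idea behind an explicit wait is to poll a condition until it holds or a timeout expires. The following is a minimal pure-Python sketch of that polling idea, not Selenium's own implementation; the helper name `wait_until` and its parameters are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses.

    Hypothetical helper illustrating the polling idea behind explicit waits.
    Returns the truthy value, or raises RuntimeError on timeout.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


if __name__ == "__main__":
    # Stand-in condition: becomes true on the third call.
    attempts = []

    def page_ready():
        attempts.append(1)
        return len(attempts) >= 3

    wait_until(page_ready, timeout=2.0, poll=0.01)
    print("ready after %d polls" % len(attempts))
```

In the crawler, the condition could be a lambda that returns the list from `driver.find_elements_by_xpath(...)`, so the loop continues as soon as the infobox rows appear instead of always sleeping for five seconds.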

Run results
The script reads the names of China's national 5A-level scenic areas from a TXT file on the F: drive, then drives the PhantomJS browser (phantomjs.exe) to look up each name and extract its infobox values. If you run into the encoding error "'ascii' codec can't encode characters", you can set the interpreter's default encoding to UTF-8 with the following code:

    # Set the default encoding to utf-8 (Python 2 only)
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    # Show the current default encoding
    print sys.getdefaultencoding()
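Changing the default encoding affects the whole process, and `sys.setdefaultencoding` no longer exists in Python 3. An alternative is to decode explicitly where the spot names are read. The sketch below assumes the names file is UTF-8 with one name per line; the helper name `read_names` and the sample file are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_names(path):
    """Return the non-empty lines of a UTF-8 text file as unicode strings."""
    # io.open decodes while reading, so no default-encoding change is needed.
    with io.open(path, 'r', encoding='utf-8') as source:
        return [line.strip() for line in source if line.strip()]


if __name__ == "__main__":
    # Hypothetical sample file standing in for the real list of spot names.
    sample = os.path.join(tempfile.mkdtemp(), "names.txt")
    with io.open(sample, 'w', encoding='utf-8') as f:
        f.write(u"故宫\n\n颐和园\n")
    for name in read_names(sample):
        print(name)
```

`io.open` is available in both Python 2.6+ and Python 3, so the same function works on either version.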

Corresponding source code
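This is the HTML behind the Baidu Encyclopedia infobox that the XPath expressions in the code target: a div with class basic-info containing dl lists, in which each dt holds an attribute name and the following dd holds its value. For the basics behind the code, see my previous blog posts or my Python crawler column. Selenium is not only good for automated testing, it is also suitable for simple crawlers.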
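To illustrate how the two XPath queries pair attribute names with values, here is a small sketch that runs equivalent queries on a simplified fragment using the standard library. The markup is an assumption modelled on the XPath in the script (the real page is HTML and much larger), and the names and values are placeholders rather than real infobox data.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the infobox markup (assumed structure).
SAMPLE = """
<html><body>
<div class="basic-info">
  <dl>
    <dt>name-1</dt><dd>value-1</dd>
    <dt>name-2</dt><dd>value-2</dd>
  </dl>
  <dl>
    <dt>name-3</dt><dd>value-3</dd>
  </dl>
</div>
</body></html>
"""


def infobox_pairs(markup):
    """Return (name, value) tuples in document order."""
    root = ET.fromstring(markup)
    names = root.findall(".//div[@class='basic-info']/dl/dt")
    values = root.findall(".//div[@class='basic-info']/dl/dd")
    # zip pairs the n-th dt with the n-th dd, as the crawler does.
    return [(n.text.strip(), v.text.strip()) for n, v in zip(names, values)]


if __name__ == "__main__":
    for name, value in infobox_pairs(SAMPLE):
        print(name + " " + value)
```

Pairing by position only works when every dt has exactly one dd; if the page ever has a name without a value, the pairs shift out of alignment.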

Encoding issues
At this point you may still run into the "'ascii' codec can't encode characters" encoding problem.

This is because a TXT file created with the plain open() is written with the default ASCII codec, while your text is in UTF-8, so it has to be converted with the following method.

    import codecs

    # Use the open method provided by codecs to specify the encoding of the file;
    # content is automatically converted to internal unicode when read
    if not os.path.exists(baiduFile):
        info = codecs.open(baiduFile, 'w', 'utf-8')
    else:
        info = codecs.open(baiduFile, 'a', 'utf-8')

    # codecs does not translate newlines the way io does, so the line break is '\r\n'
    info.writelines(key.text + ":" + elem_dic[key].text + '\r\n')
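The same write pattern can be tried outside the crawler. The sketch below is a self-contained version under stated assumptions: the helper names `append_record` and `read_records` are hypothetical, and it relies on mode 'a' creating the file when it is missing, which makes the exists() check optional.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def append_record(path, key, value):
    """Append one 'key:value' line to a UTF-8 file, creating it if needed."""
    # codecs.open opens the file in binary mode, so '\r\n' is written as-is.
    with codecs.open(path, 'a', 'utf-8') as f:
        f.write(key + u":" + value + u"\r\n")


def read_records(path):
    """Read the whole file back as one unicode string."""
    with codecs.open(path, 'r', 'utf-8') as f:
        return f.read()


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "BaiduSpider.txt")
    append_record(out, u"中文名", u"故宫")
    append_record(out, u"name-2", u"value-2")  # placeholder values
    print(repr(read_records(out)))
```

Using a `with` block closes the file after each write, which avoids leaving a handle open when the same file is reopened for every scenic spot.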

Summary
From the code you can learn a basic automated crawling method, and how to pair each displayed attribute with its value as key-value pairs and print them in a for loop. The pairing is done by the following code:
    elem_dic = dict(zip(elem_name, elem_value))
But the final output is not in the order of the infobox. Why? A Python 2 dict is a hash table and does not record insertion order; a list of tuples keeps the original order.
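One way to keep the infobox order is to hold the pairs in an ordered structure instead of a plain dict. This is a minimal sketch using `collections.OrderedDict`; the function name `pair_in_order` and the placeholder names and values are hypothetical.

```python
from collections import OrderedDict


def pair_in_order(names, values):
    """Pair names with values while preserving the order of the inputs."""
    # zip yields pairs in input order and OrderedDict remembers that order.
    return OrderedDict(zip(names, values))


if __name__ == "__main__":
    # Placeholder data standing in for the dt and dd texts of an infobox.
    names = ["name-1", "name-2", "name-3"]
    values = ["value-1", "value-2", "value-3"]
    for key, value in pair_in_order(names, values).items():
        print(key + " " + value)
```

If no lookup by key is needed, iterating over `zip(names, values)` directly is enough. From Python 3.7 on, a plain dict also preserves insertion order, so the problem is specific to older interpreters such as the Python 2 used in this article.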
Finally, I hope this article helps you. I also wrote an introductory article on the basics, but publishing it kept triggering CSDN's sensitive-content system, which locked it automatically, and I do not know what triggered it. I recommend reading it:
[Python crawler] Introduction to common element-locating methods and operations in Selenium
(By: Eastmount, late at night, 2015-9-6 2:30 http://blog.csdn.net/eastmount/)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
