Time: 3:40 a.m., August 6, 2017. Insomnia, so I'm idly writing a short piece about nothing in particular.
A few days ago, a friend asked me whether I could help him crawl the article titles and links from a website, so he could organize them into a simple list for studying. The site: http://www.bianceng.cn/Programming/cplus/
Requirement: save the text content and the corresponding hyperlinks from the pages to a local file; 60 pages in total, 1773 items.
The URL of the first page is http://www.bianceng.cn/Programming/cplus/; the URLs of pages 2 through 60 follow the pattern url = 'http://www.bianceng.cn/Programming/cplus/index' + str(page_number) + '.htm'. The content to crawl is simple and the URLs are highly regular, so brute force gets the job done.
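To make the pattern concrete, here is a minimal sketch of how the URLs for pages 2 to 60 expand (the full script below builds them the same way inside its loop):

# Page 1 has no index suffix; pages 2-60 are index2.htm ... index60.htm
urls = ['http://www.bianceng.cn/Programming/cplus/index%d.htm' % page
        for page in range(2, 61)]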
The approach:
1. Use selenium to drive Firefox and load the pages: the first page is loaded directly, and the URLs for pages 2 to 60 are generated in a for loop.
2. Browsing the page source shows that every article sits inside an li tag, so the first step is to locate the li tags and collect all the HTML elements on each page that contain article titles and links.
3. Loop over the items found in step 2, pull out each article's text and href, and append them to a pre-defined list container.
4. Convert the list to DataFrame format and write it straight to disk. A little lazy, but whatever gets it done fastest.
No more rambling; straight to the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver
import time
import pandas as pd

'''The code for pages 2-60 is basically the same as for the first page,
so it is not annotated separately.'''

path = '/home/ycxu/download/geckodriver'
browser = webdriver.Firefox(executable_path=path)
browser.set_page_load_timeout(30)  # page-load timeout in seconds; adjust to taste

l = []  # container for the scraped (title, link) pairs

'''Load the first page'''
browser.get('http://www.bianceng.cn/Programming/cplus/')
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Locate the li tags that hold the articles
page_texts_one = browser.find_element_by_css_selector(
    'html body.articlelist div.w960.center.clear.mt1 div.list_pleft div.listbox ul.e3'
).find_elements_by_tag_name('li')

print 'First page content:'
for i in page_texts_one:
    a = i.find_elements_by_tag_name('a')[0]
    print a.get_attribute('text'), a.get_attribute('href')
    # Store the article title and link in the list container
    l.append([a.get_attribute('text'), a.get_attribute('href')])

'''Load pages 2 to 60'''
for page in xrange(2, 61):
    url = 'http://www.bianceng.cn/Programming/cplus/index' + str(page) + '.htm'
    browser.get(url)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # otherwise the page may not finish loading
    page_texts_two = browser.find_element_by_css_selector(
        'html body.articlelist div.w960.center.clear.mt1 div.list_pleft div.listbox ul.e3'
    ).find_elements_by_tag_name('li')
    print 'Page %d content:' % page
    for i in page_texts_two:
        a = i.find_elements_by_tag_name('a')[0]
        print a.get_attribute('text'), a.get_attribute('href')
        l.append([a.get_attribute('text'), a.get_attribute('href')])

# Convert the list container to DataFrame format; a single line saves it
# locally, which is a nice shortcut for lazy people
h = pd.DataFrame(l)
h.to_csv('/home/ycxu/desktop/page_info.csv', encoding='utf-8')
h.to_csv('/home/ycxu/desktop/page_info.txt', encoding='utf-8')
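A closing note for anyone running this today: the script above is Python 2 (print statements, xrange), and the find_element_by_* helpers it relies on were removed in Selenium 4. A minimal sketch of the same element-location step under Python 3 and Selenium 4.6+ (which bundles Selenium Manager, so no geckodriver path is needed) might look like this; the CSS selector is the one from the script above, and the rest is the same logic:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()  # Selenium Manager locates geckodriver automatically

browser.get('http://www.bianceng.cn/Programming/cplus/')
# Same container as above: the ul.e3 list that holds one li per article
items = browser.find_element(
    By.CSS_SELECTOR,
    'html body.articlelist div.w960.center.clear.mt1 div.list_pleft div.listbox ul.e3'
).find_elements(By.TAG_NAME, 'li')

for li in items:
    a = li.find_elements(By.TAG_NAME, 'a')[0]
    print(a.text, a.get_attribute('href'))

browser.quit()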