Using Selenium to drive Firefox to crawl web page text and links


Time: 3:40 a.m., August 6, 2017. Couldn't sleep, so I'm idly writing this up.

A few days ago, a friend asked me whether I could help him crawl the text and links from a website so he could organize them into a simple reference for studying. The site: http://www.bianceng.cn/Programming/cplus/

Requirement: save the text content and the corresponding hyperlinks from the web pages to a local file; 60 pages in total, 1773 items.

Note that the URL of the first page is http://www.bianceng.cn/Programming/cplus/, while the URLs of pages 2 through 60 follow the pattern url = 'http://www.bianceng.cn/Programming/cplus/index' + str(page_number) + '.htm'. The content to crawl is simple and the URLs are highly regular, so brute force gets the job done.
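
As a quick illustration, here is a minimal sketch of generating all 60 page URLs from that pattern (the base and page_urls names are mine, not from the original script):

# Build the full list of page URLs from the observed pattern
base = 'http://www.bianceng.cn/Programming/cplus/'
page_urls = [base]  # page 1 has no index suffix
for page_number in range(2, 61):
    page_urls.append(base + 'index' + str(page_number) + '.htm')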


Approach: 1. Use Selenium to drive Firefox to load the pages: load the first page directly, and generate the URLs for pages 2-60 in a for loop

2. Browsing the page source, you will find that every article sits inside an li tag, so the first step is to locate the li tags and collect the HTML for all the articles and links on the page

3. A for loop walks the items found in step 2, pulls out each article's text and href, and loads them into a predefined list container (see the sketch after this list)

4. Convert the list to DataFrame format and write it straight to disk; a little lazy, but whatever is fastest wins
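
Steps 2 and 3 boil down to one extraction pattern. Here is a minimal sketch, assuming a browser WebDriver is already open on a listing page and each li wraps a single anchor (the items and link names are mine):

# Locate the article list (ul.e3) and walk its li items
items = browser.find_element_by_css_selector('ul.e3').find_elements_by_tag_name('li')
L = []  # container for [text, href] pairs
for li in items:
    link = li.find_elements_by_tag_name('a')[0]  # first anchor in the item
    L.append([link.get_attribute('text'), link.get_attribute('href')])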

No more rambling; here is the full script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver
import time
import pandas as pd

'''Pages 2-60 are processed basically the same way as the first page, so they are not commented separately'''
path = '/home/ycxu/download/geckodriver'
browser = webdriver.Firefox(executable_path=path)
browser.set_page_load_timeout(30)  # page-load timeout in seconds (30 is an assumed value)

L = []  # storage container

'''Load the first page'''
browser.get('http://www.bianceng.cn/Programming/cplus/')
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Locate the li tags
page_texts_one = browser.find_element_by_css_selector(
    'html body.articlelist div.w960.center.clear.mt1 div.list_pleft div.listbox ul.e3'
).find_elements_by_tag_name('li')

print 'First page content:'
for i in page_texts_one:
    print i.find_elements_by_tag_name('a')[0].get_attribute('text'), \
        i.find_elements_by_tag_name('a')[0].get_attribute('href')
    # Store the article text and link in the list container
    L.append([i.find_elements_by_tag_name('a')[0].get_attribute('text'),
              i.find_elements_by_tag_name('a')[0].get_attribute('href')])

'''Load pages 2 to 60'''
for page in xrange(2, 61):
    url = 'http://www.bianceng.cn/Programming/cplus/index' + str(page) + '.htm'
    browser.get(url)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # otherwise the page may load incompletely
    page_texts_two = browser.find_element_by_css_selector(
        'html body.articlelist div.w960.center.clear.mt1 div.list_pleft div.listbox ul.e3'
    ).find_elements_by_tag_name('li')
    print 'Page %d content:' % page
    for i in page_texts_two:
        print i.find_elements_by_tag_name('a')[0].get_attribute('text'), \
            i.find_elements_by_tag_name('a')[0].get_attribute('href')
        L.append([i.find_elements_by_tag_name('a')[0].get_attribute('text'),
                  i.find_elements_by_tag_name('a')[0].get_attribute('href')])

# Convert the list container to DataFrame format; one line writes it to disk,
# a handy trick for lazy people
h = pd.DataFrame(L)
h.to_csv('/home/ycxu/desktop/page_info.csv', encoding='utf-8')
h.to_csv('/home/ycxu/desktop/page_info.txt', encoding='utf-8')
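
The script above uses the Python 2 / Selenium 3 APIs; the find_element_by_* helpers were removed in Selenium 4. If you want to reproduce this today, a rough Python 3 / Selenium 4 equivalent might look like the sketch below, assuming the site's markup is unchanged (the rows and urls names are mine):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
import time
import pandas as pd

browser = webdriver.Firefox(service=Service('/home/ycxu/download/geckodriver'))
browser.set_page_load_timeout(30)

rows = []
urls = ['http://www.bianceng.cn/Programming/cplus/'] + \
       ['http://www.bianceng.cn/Programming/cplus/index%d.htm' % p for p in range(2, 61)]
for url in urls:
    browser.get(url)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # give the page time to finish loading
    # Selenium 4 uses By locators instead of find_element_by_* helpers
    items = browser.find_element(By.CSS_SELECTOR, 'ul.e3').find_elements(By.TAG_NAME, 'li')
    for li in items:
        a = li.find_elements(By.TAG_NAME, 'a')[0]
        rows.append([a.get_attribute('text'), a.get_attribute('href')])

pd.DataFrame(rows).to_csv('page_info.csv', encoding='utf-8')
browser.quit()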



