Using Selenium to drive Firefox to crawl web page text and links

Source: Internet
Author: User

Time: 3:40 a.m., August 6, 2017. Couldn't sleep, so I idly wrote this short piece.

A few days ago a friend asked me whether I could help him crawl the text and links of a website, so he could organize them into something simple for studying. Website:

Requirement: save the text content and the corresponding hyperlinks from the web pages to local files; 60 pages and 1,773 items in total.

Note that the URL of the first page is:; the URLs for pages 2 through 60 follow the pattern url = ' ' + str(page_number) + '.htm'. The content to crawl is simple and the page URLs are highly regular, so a direct brute-force approach gets it done.
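Since the URLs are this regular, generating them is a one-liner. A minimal sketch (the base URL is taken from the crawl loop further down in the article; the first-page URL is elided in the source, so it is left out here):

```python
# Base URL as it appears in the article's crawl loop for pages 2-60
base = 'http://www.bianceng.cn/programming/cplus/index'

# Pages 2 through 60 follow the pattern: base + page number + '.htm'
urls = [base + str(page) + '.htm' for page in range(2, 61)]

print(urls[0])    # http://www.bianceng.cn/programming/cplus/index2.htm
print(len(urls))  # 59
```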

Approach: 1. Use Selenium to drive Firefox to load the pages; load the first page directly, and generate the URLs for pages 2-60 in a for loop.

2. Browsing the page source, you will find that every article sits inside an li tag, so locate the li tags to collect the HTML for all the articles and links on each page.

3. A for loop walks the items found in step 2, extracts each article's text and href, and appends them to a pre-defined container list.

4. Convert the list to DataFrame format and write it straight to a local file. A bit lazy, but whatever is fastest wins.
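Step 4 can be sketched on its own. The rows below are hypothetical, and the column names are my addition for clarity (the article's code writes the DataFrame without headers):

```python
import pandas as pd

# Hypothetical scraped rows: [article text, hyperlink], as built up in steps 2-3
l = [['Article one', 'http://example.com/1.htm'],
     ['Article two', 'http://example.com/2.htm']]

h = pd.DataFrame(l, columns=['text', 'href'])  # column names added for clarity
h.to_csv('page_info.csv', encoding='utf-8', index=False)
```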

No more rambling, straight to the code:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from selenium import webdriver
    import time
    import pandas as pd

    '''Pages 2-60 are handled basically the same way as the first page, so they are not annotated separately.'''
    path = '/home/ycxu/download/geckodriver'
    browser = webdriver.Firefox(executable_path=path)
    browser.set_page_load_timeout(30)  # timeout value is garbled in the source; 30 s is a guess
    l = []  # storage container

    '''Load the first page'''
    browser.get('http://www.')
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Locate the li tags
    page_texts_one = browser.find_element_by_css_selector(
        'html body.articlelist div.list_pleft div.listbox ul.e3').find_elements_by_tag_name('li')
    print 'First page content:'
    for i in page_texts_one:
        a = i.find_elements_by_tag_name('a')[0]
        print a.get_attribute('text'), a.get_attribute('href')
        # Store the article text and link in the list container
        l.append([a.get_attribute('text'), a.get_attribute('href')])

    '''Load pages 2 to 60'''
    for page in xrange(2, 61):
        url = 'http://www.bianceng.cn/programming/cplus/index' + str(page) + '.htm'
        browser.get(url)
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)  # otherwise the page loads incompletely
        page_texts_two = browser.find_element_by_css_selector(
            'html body.articlelist div.w960.center.clear.mt1 div.list_pleft div.listbox ul.e3').find_elements_by_tag_name('li')
        print 'Page %d content:' % page
        for i in page_texts_two:
            a = i.find_elements_by_tag_name('a')[0]
            print a.get_attribute('text'), a.get_attribute('href')
            l.append([a.get_attribute('text'), a.get_attribute('href')])

    # Convert the list container to a DataFrame; one line saves it locally, a handy shortcut for the lazy
    h = pd.DataFrame(l)
    h.to_csv('/home/ycxu/desktop/page_info.csv', encoding='utf-8')
    h.to_csv('/home/ycxu/desktop/page_info.txt', encoding='utf-8')
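As a side note, if the list pages are static HTML, the same li/a extraction can be done without a browser at all. A minimal sketch using Python's standard-library html.parser on a made-up fragment shaped like the article's ul.e3 list (the fragment and its URLs are hypothetical):

```python
from html.parser import HTMLParser

# Hypothetical fragment shaped like the article's listbox (ul.e3 > li > a)
HTML = '''
<ul class="e3">
  <li><a href="http://example.com/1.htm">First article</a></li>
  <li><a href="http://example.com/2.htm">Second article</a></li>
</ul>
'''

class LinkParser(HTMLParser):
    """Collects [text, href] pairs, like the article's list container l."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None  # href of the <a> tag currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_data(self, data):
        # First non-blank text after an <a> open tag is the link text
        if self._href is not None and data.strip():
            self.rows.append([data.strip(), self._href])
            self._href = None

p = LinkParser()
p.feed(HTML)
print(p.rows)  # [['First article', 'http://example.com/1.htm'], ['Second article', 'http://example.com/2.htm']]
```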
