This article introduces how to crawl the Jianshu site with Selenium and Python. It has some reference value, and readers with a similar need are welcome to use it as a starting point.
Page Load Logic
Once you have learned the basics of crawling from tutorials on the Internet, you naturally want a target to practice on. Jianshu, with its large number of concise articles full of valuable information, is a natural choice. But if you try it, you will find it is not as simple as you imagined, because a lot of the data transfer happens through JS. Let me start with a traditional crawler walkthrough:
Open the Jianshu homepage; nothing seems special at first glance.
Jianshu Home
Open Chrome's developer tools and locate an article title: the title text and the href all sit inside a tags, and nothing seems unusual there either.
The a tags in developer tools
The next step would be to collect all the a tags on the page. But wait a second: if you look closely, you will find that the page loads more content whenever you scroll about halfway down, and only after this repeats three times does the 阅读更多 (Read More) button appear at the bottom.
Scrolling the page
Worse, the Read More button at the bottom has no href telling us where the rest of the page's information comes from; the only way to load it is to keep clicking that Read More button.
The Read More button
Scrolling halfway down the page three times and then clicking a button over and over is not something a plain HTTP request can do; it looks much more like a JS operation. Exactly: Jianshu's articles are not served by regular HTTP requests that we can replay by switching URLs; instead, actions on the page itself trigger the loading of new information.
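To see the problem concretely, here is a minimal sketch of the traditional approach, assuming the requests and beautifulsoup4 packages and the title class visible in the screenshots above; it only ever sees the first batch of articles:

    # Naive crawler: fetch the homepage once and parse the a tags.
    # Everything loaded later by in-page JS is invisible to it.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://www.jianshu.com/").text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select("a.title"):
        print(a.get_text(strip=True), a.get("href"))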
Selenium Introduction
Selenium is a web automation testing tool that supports many languages. We can use Python's selenium bindings as a crawler to scrape Jianshu. The way it works here is to keep injecting JS code so that the page loads more and more content, and finally to extract all the a tags. First you have to install the selenium package for Python:
    pip3 install selenium
ChromeDriver
Selenium has to drive a real browser, and here I am using ChromeDriver, the open-source driver for Chrome. Its biggest selling point is headless mode, which lets you visit pages without rendering a visible browser window.
Operations in Python
Before writing the code, be sure to put chromedriver in the same folder as the script, because we will refer to it by a relative path, which is convenient. Our first task is to make the 加载更多 (Load More) button appear, which requires scrolling halfway down the page three times; for convenience I simply scroll all the way to the bottom each time.
    from selenium import webdriver
    import time

    browser = webdriver.Chrome("./chromedriver")
    browser.get("https://www.jianshu.com/")

    for i in range(3):
        # execute_script injects JS code into the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # loading takes time; 2 seconds is reasonable
Take a look at the effect:
The button has appeared
The next step is to click the button repeatedly to load more of the page. Keep adding to the py file:
    for j in range(10):  # here I simulate 10 clicks
        try:
            button = browser.execute_script(
                "var a = document.getElementsByClassName('load-more');"
                "a[0].click();")
            time.sleep(2)
        except:
            pass  # the button may not be present yet

To explain the JS code above: var a = document.getElementsByClassName('load-more'); selects the load-more element, and a[0].click(); clicks it. Because a is a collection, we take index 0 and then call its click() function.
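As a side note, the same click can also be driven through Selenium's own element API instead of injected JS; this is a minimal sketch of my own, assuming the button keeps its load-more class:

    # Alternative: let Selenium find and click the button itself.
    from selenium.common.exceptions import NoSuchElementException, WebDriverException

    for j in range(10):
        try:
            button = browser.find_element_by_class_name("load-more")
            button.click()
            time.sleep(2)
        except (NoSuchElementException, WebDriverException):
            pass  # the button may not exist or be clickable yet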
I will not include a screenshot of this step; when it succeeds, the page simply keeps loading until the loop ends. The remaining work is much simpler: find the a tags, take their text and href attributes, and write them straight into a txt file.
    titles = browser.find_elements_by_class_name("title")
    with open("article_jianshu.txt", "w", encoding="utf-8") as f:
        for t in titles:
            try:
                f.write(t.text + " " + t.get_attribute("href"))
                f.write("\n")
            except TypeError:
                pass
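A variation of my own, in case you prefer parsing over driver queries: once Selenium has finished loading everything, you can hand the final HTML to a parser instead. A sketch assuming beautifulsoup4 is installed; note that unlike get_attribute("href"), bs4 returns the raw attribute, which on Jianshu is a relative path:

    # Parse the fully loaded page source instead of querying the driver.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(browser.page_source, "html.parser")
    with open("article_jianshu.txt", "w", encoding="utf-8") as f:
        for a in soup.select("a.title"):
            href = a.get("href")
            if href:
                # prepend the site root because the raw href is relative (e.g. /p/...)
                f.write(a.get_text(strip=True) + " https://www.jianshu.com" + href + "\n")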
Final result
Jianshu articles
Headless mode
Watching the browser load page after page is certainly annoying, so once the test succeeds we no longer want the browser window to show at all. For that we add headless mode:
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    # replace the browser created earlier with one that receives the chrome_options parameter
    browser = webdriver.Chrome("./chromedriver", chrome_options=options)
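One caveat from me rather than the original article: newer Selenium releases (4.x) dropped both the chrome_options keyword and the positional driver path, so on a current install the equivalent would look roughly like this, assuming chromedriver is on PATH or handled by Selenium Manager:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    browser = webdriver.Chrome(options=options)  # Selenium 4 style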
Summary
When a normal HTTP request cannot crawl a page, we can use selenium to operate the browser and grab the content we want. This approach has pros and cons.

Advantages

Crawling can be done by brute force

Jianshu does not require a cookie to view articles, so there is no need to bother hunting for proxies, and we can crawl without being banned

The homepage is delivered via AJAX, so no additional HTTP requests need to be crafted
Disadvantages
The crawl speed is far too slow: think about our program, which waits 2 seconds after every click, so 600 clicks would take 1200 seconds, i.e. 20 minutes...
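A common way to soften this, and an addition of my own rather than part of the article: replace the fixed time.sleep(2) with an explicit wait, so each iteration moves on as soon as the button is actually clickable:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    wait = WebDriverWait(browser, 10)  # wait at most 10 s instead of always sleeping 2 s
    for j in range(10):
        try:
            button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "load-more")))
            button.click()
        except Exception:
            break  # stop once the button no longer appears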
Additional
Here is the complete code:
    from selenium import webdriver
    import time

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    browser = webdriver.Chrome("./chromedriver", chrome_options=options)
    browser.get("https://www.jianshu.com/")

    for i in range(3):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
    # print(browser)

    for j in range(10):
        try:
            button = browser.execute_script(
                "var a = document.getElementsByClassName('load-more');"
                "a[0].click();")
            time.sleep(2)
        except:
            pass

    titles = browser.find_elements_by_class_name("title")
    with open("article_jianshu.txt", "w", encoding="utf-8") as f:
        for t in titles:
            try:
                f.write(t.text + " " + t.get_attribute("href"))
                f.write("\n")
            except TypeError:
                pass