Selenium + Python: how to crawl the Jianshu website

Source: Internet
Author: User

This article shows how to crawl the Jianshu (简书) website with Selenium and Python. It may be a useful reference for anyone facing a similar task.


Page load Logic

Once you have learned the basics of web crawling, you will want a target to practice on. Jianshu hosts a large number of articles with plenty of valuable content, so it is a natural choice. When you try it, though, it turns out to be less simple than expected, because much of the page data is delivered through JS. Let me start by walking through it the way a traditional crawler would:

Open the Jianshu home page; nothing seems special.

Jianshu Home

Open Chrome's developer tools and look for the article titles: each title and its href sit in an a tag, so again nothing seems unusual.

A.png

The next step would be to find all the a tags on the page. But wait a second. If you look closely, the page loads more content when you scroll about halfway down. This repeats three times, after which a "Read more" (阅读更多) button appears at the bottom.

Scroll wheel

On top of that, the "Read more" button at the bottom has no href that would tell us where the rest of the page content comes from. The only way to load it is to keep clicking the "Read more" button.

Load_more.png

What? Scrolling down three times and then repeatedly clicking a button is not something a plain HTTP request can do; it looks more like JS-driven behavior. Exactly. Jianshu's article list is not served through ordinary page requests. We cannot simply follow a series of different URLs; instead, the page loads more content in response to actions performed on it.
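For comparison, the traditional approach described above can be sketched with the standard library alone. This is only an illustration: SAMPLE_HTML is made-up markup standing in for the initial response, and the assumption that article links are a tags with the class "title" is taken from the extraction step later in this article. A static parser like this sees only the links present in the HTML it is given, which is why it misses everything loaded later by scrolling and clicking.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose class list includes 'title'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "title" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def extract_title_links(html):
    """Return the title links found in a static HTML string."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links


# Made-up sample markup (an assumption, not Jianshu's real HTML).
SAMPLE_HTML = """
<ul>
  <li><a class="title" href="/p/aaa111">First article</a></li>
  <li><a class="title" href="/p/bbb222">Second article</a></li>
  <li><a class="avatar" href="/u/someone">Author</a></li>
</ul>
"""

static_links = extract_title_links(SAMPLE_HTML)
print(static_links)
```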

Selenium Introduction

Selenium is a web automation testing tool that supports many languages. We can use Selenium from Python to write the crawler. To crawl Jianshu, the approach is to repeatedly inject JS code so that the page keeps loading more content, and then extract all the a tags at the end. First, install the selenium package for Python.

$ pip3 install selenium

ChromeDriver

Selenium needs a browser to drive. Here I use ChromeDriver, the open-source driver program that lets Selenium control Chrome. With it you can run Chrome in headless mode, that is, load pages without displaying a browser window, which is its most useful feature here.

Operations in Python

Before writing the code, put chromedriver in the same folder as the script, because we refer to it by path and this keeps things simple. Our first task is to make the "Load more" (加载更多) button appear, which means repeating the scroll three times. The page only needs a scroll to the middle, but for convenience I scroll all the way to the bottom.

from selenium import webdriver
import time

browser = webdriver.Chrome("./chromedriver")
browser.get("https://www.jianshu.com/")

for i in range(3):
    # execute_script injects JS code into the page
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # loading takes time; 2 seconds is reasonable
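The fixed count of three scrolls can also be replaced by a check on the page height. The helper below is a sketch, not the author's method: scroll_until_stable is a hypothetical name, and it assumes only that the browser object has an execute_script method that returns the value of a JS return statement, as Selenium's WebDriver does.

```python
import time


def scroll_until_stable(browser, max_rounds=3, pause=2.0):
    """Scroll to the bottom up to max_rounds times.

    Stops early when the page height no longer grows after a scroll.
    Returns the number of scrolls performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight;")
    rounds = 0
    for _ in range(max_rounds):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to load more content
        new_height = browser.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

With a real driver this would be called as scroll_until_stable(browser) right after browser.get(...). The early exit avoids waiting on scrolls that load nothing.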

Here is the result:

The button has appeared

The next step is to click the button repeatedly so that the page keeps loading. Append the following to the .py file.

for j in range(10):  # here I simulate 10 clicks
    try:
        browser.execute_script("var a = document.getElementsByClassName('load-more'); a[0].click();")
        time.sleep(2)
    except Exception:
        pass

'''
The JS code above, explained:
var a = document.getElementsByClassName('load-more');  selects the load-more element
a[0].click();  a is a collection, so take index 0 and then call click()
'''
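A variant of the click loop can let the JS report whether the button was found, so the loop ends as soon as there is nothing left to click instead of swallowing exceptions. This is a sketch under assumptions: click_load_more and CLICK_LOAD_MORE_JS are hypothetical names, and the class name load-more is the one used in the snippet above.

```python
import time

# JS that clicks the first 'load-more' element and reports whether one existed.
CLICK_LOAD_MORE_JS = """
var a = document.getElementsByClassName('load-more');
if (a.length === 0) { return false; }
a[0].click();
return true;
"""


def click_load_more(browser, max_clicks=10, pause=2.0):
    """Click the load-more button up to max_clicks times.

    Stops early when the button is no longer on the page.
    Returns the number of clicks performed.
    """
    clicks = 0
    for _ in range(max_clicks):
        found = browser.execute_script(CLICK_LOAD_MORE_JS)
        if not found:
            break
        clicks += 1
        time.sleep(pause)  # wait for the new articles to load
    return clicks
```

Returning the click count makes it easy to log how many pages were actually loaded.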

I won't include a screenshot for this step. If it works, the page keeps loading until the loop ends. The remaining work is much simpler: find the a tags and read their text and href attributes. Here I write them straight to a txt file.

titles = browser.find_elements_by_class_name("title")
with open("article_jianshu.txt", "w", encoding="utf-8") as f:
    for t in titles:
        try:
            # get_attribute returns None when href is missing, which raises TypeError
            f.write(t.text + " " + t.get_attribute("href"))
            f.write("\n")
        except TypeError:
            pass
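The extraction and the file writing can also be split into two small functions, which makes the missing-href case explicit instead of relying on TypeError. This is a sketch with assumptions: collect_title_links and save_links are hypothetical helpers, the tab separator is my choice, and find_elements("class name", "title") is the Selenium 4 form of the lookup (the string "class name" is the value of By.CLASS_NAME).

```python
def collect_title_links(browser):
    """Return (text, href) pairs for every element with class 'title' that has an href."""
    rows = []
    for element in browser.find_elements("class name", "title"):
        href = element.get_attribute("href")
        if href is None:
            continue  # skip elements without a link
        rows.append((element.text, href))
    return rows


def save_links(rows, path):
    """Write one 'title<TAB>href' line per row, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for text, href in rows:
            f.write(f"{text}\t{href}\n")
```

With a real driver the two calls would be save_links(collect_title_links(browser), "article_jianshu.txt").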

Final result

Jianshu articles

Headless mode

Watching the page load over and over is annoying, so once the test succeeds we would rather not show the browser at all. For that we add headless mode.

options = webdriver.ChromeOptions()
options.add_argument('headless')
# pass the options to the browser created earlier through the chrome_options parameter
browser = webdriver.Chrome("./chromedriver", chrome_options=options)
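On Selenium 4 and later, the driver path and the options are passed differently from the snippet above. The function below is a sketch assuming Selenium 4 is installed; make_headless_chrome is a hypothetical helper name, and the imports sit inside the function so that defining it does not require selenium.

```python
def make_headless_chrome(driver_path="./chromedriver"):
    """Create a headless Chrome driver using the Selenium 4 style API (assumed installed)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    # Selenium 4 takes the driver path through a Service object
    # and the options through the 'options' keyword.
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

Calling browser = make_headless_chrome() would then stand in for the two-step setup shown above.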

Summary

When a site cannot be crawled with ordinary HTTP requests, we can use Selenium to drive a browser and grab what we want. This approach has pros and cons, for example:

    • Advantages

    1. It allows brute-force crawling

    2. Jianshu does not require a cookie to view articles, so there is no need to go looking for proxies, and we can keep crawling without getting banned

    3. The home page appears to load its data through AJAX, so we do not have to craft any additional HTTP requests ourselves

    • Disadvantages

    1. The crawl is too slow. Consider our program: each click waits 2 seconds, so 600 clicks need 1200 seconds, which is 20 minutes ...
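The time estimate in the last item is simple arithmetic and can be checked in a couple of lines. estimated_crawl_seconds is a hypothetical helper; it counts only the fixed waits and ignores the time spent loading and clicking.

```python
def estimated_crawl_seconds(clicks, pause_seconds=2):
    """Lower bound on crawl time: one fixed pause per click."""
    return clicks * pause_seconds


total_seconds = estimated_crawl_seconds(600)
total_minutes = total_seconds / 60
print(total_seconds, "seconds, about", total_minutes, "minutes")
```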

Additional

Here is the complete code

from selenium import webdriver
import time

options = webdriver.ChromeOptions()
options.add_argument('headless')
browser = webdriver.Chrome("./chromedriver", chrome_options=options)
browser.get("https://www.jianshu.com/")

# scroll to the bottom three times so the load-more button appears
for i in range(3):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# print(browser)

# click the load-more button 10 times, as in the walkthrough above
for j in range(10):
    try:
        browser.execute_script("var a = document.getElementsByClassName('load-more'); a[0].click();")
        time.sleep(2)
    except Exception:
        pass

# write each title and its link to a text file
titles = browser.find_elements_by_class_name("title")
with open("article_jianshu.txt", "w", encoding="utf-8") as f:
    for t in titles:
        try:
            f.write(t.text + " " + t.get_attribute("href"))
            f.write("\n")
        except TypeError:
            pass

