This article introduces how to crawl the Jianshu site with Selenium and Python. It has some reference value, and readers with a similar need are welcome to use it as a starting point.
Page Load Logic
Once you have learned the basics of crawling from tutorials on the Internet, you naturally want a target to practice on. Jianshu, with its large number of concise articles full of valuable information, is a natural choice. But if you try it, you will find it is not as simple as you imagined, because a lot of the data transfer happens through JS. Let me start with a traditional crawler walkthrough:
Open the Jianshu homepage; nothing seems special at first glance.
Jianshu Home
Open Chrome's developer tools and locate an article title: the title text and the href all sit inside a tags, and nothing seems unusual there either.
The a tags in developer tools
The next step would be to collect all the a tags on the page. But wait a second: if you look closely, you will find that the page loads more content whenever you scroll about halfway down, and only after this repeats three times does the 阅读更多 (Read More) button appear at the bottom.
Scrolling the page
Worse, the Read More button at the bottom has no href telling us where the rest of the page's information comes from; the only way to load it is to keep clicking that Read More button.
The Read More button
Scrolling halfway down the page three times and then clicking a button over and over is not something a plain HTTP request can do; it looks much more like a JS operation. Exactly: Jianshu's articles are not served by regular HTTP requests that we can replay by switching URLs; instead, actions on the page itself trigger the loading of new information.
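To see the problem concretely, here is a minimal sketch of the traditional approach, assuming the requests and beautifulsoup4 packages and the title class visible in the screenshots above; it only ever sees the first batch of articles:

    # Naive crawler: fetch the homepage once and parse the a tags.
    # Everything loaded later by in-page JS is invisible to it.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://www.jianshu.com/").text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select("a.title"):
        print(a.get_text(strip=True), a.get("href"))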
Selenium Introduction
Selenium is a web automation testing tool that supports many languages. We can use Python's selenium bindings as a crawler to scrape Jianshu. The way it works here is to keep injecting JS code so that the page loads more and more content, and finally to extract all the a tags. First you have to install the selenium package for Python:
    pip3 install selenium
ChromeDriver
Selenium has to drive a real browser, and here I am using ChromeDriver, the open-source driver for Chrome. Its biggest selling point is headless mode, which lets you visit pages without rendering a visible browser window.
Operations in Python
Before writing the code, be sure to put chromedriver in the same folder as the script, because we will refer to it by a relative path, which is convenient. Our first task is to make the 加载更多 (Load More) button appear, which requires scrolling halfway down the page three times; for convenience I simply scroll all the way to the bottom each time.
    from selenium import webdriver
    import time

    browser = webdriver.Chrome("./chromedriver")
    browser.get("https://www.jianshu.com/")

    for i in range(3):
        # execute_script injects JS code into the page
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # loading takes time; 2 seconds is reasonable
Take a look at the effect:
The button has appeared
The next step is to click the button repeatedly to load more of the page. Keep adding to the py file:
    for j in range(10):  # here I simulate 10 clicks
        try:
            button = browser.execute_script(
                "var a = document.getElementsByClassName('load-more');"
                "a[0].click();")
            time.sleep(2)
        except:
            pass  # the button may not be present yet

To explain the JS code above: var a = document.getElementsByClassName('load-more'); selects the load-more element, and a[0].click(); clicks it. Because a is a collection, we take index 0 and then call its click() function.
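As a side note, the same click can also be driven through Selenium's own element API instead of injected JS; this is a minimal sketch of my own, assuming the button keeps its load-more class:

    # Alternative: let Selenium find and click the button itself.
    from selenium.common.exceptions import NoSuchElementException, WebDriverException

    for j in range(10):
        try:
            button = browser.find_element_by_class_name("load-more")
            button.click()
            time.sleep(2)
        except (NoSuchElementException, WebDriverException):
            pass  # the button may not exist or be clickable yet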
I will not include a screenshot of this step; when it succeeds, the page simply keeps loading until the loop ends. The remaining work is much simpler: find the a tags, take their text and href attributes, and write them straight into a txt file.
    titles = browser.find_elements_by_class_name("title")
    with open("article_jianshu.txt", "w", encoding="utf-8") as f:
        for t in titles:
            try:
                f.write(t.text + " " + t.get_attribute("href"))
                f.write("\n")
            except TypeError:
                pass
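A variation of my own, in case you prefer parsing over driver queries: once Selenium has finished loading everything, you can hand the final HTML to a parser instead. A sketch assuming beautifulsoup4 is installed; note that unlike get_attribute("href"), bs4 returns the raw attribute, which on Jianshu is a relative path:

    # Parse the fully loaded page source instead of querying the driver.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(browser.page_source, "html.parser")
    with open("article_jianshu.txt", "w", encoding="utf-8") as f:
        for a in soup.select("a.title"):
            href = a.get("href")
            if href:
                # prepend the site root because the raw href is relative (e.g. /p/...)
                f.write(a.get_text(strip=True) + " https://www.jianshu.com" + href + "\n")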
Final result
Jianshu articles
Headless mode
Watching the browser load page after page is certainly annoying, so once the test succeeds we no longer want the browser window to show at all. For that we add headless mode:
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    # replace the browser created earlier with one that receives the chrome_options parameter
    browser = webdriver.Chrome("./chromedriver", chrome_options=options)
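One caveat from me rather than the original article: newer Selenium releases (4.x) dropped both the chrome_options keyword and the positional driver path, so on a current install the equivalent would look roughly like this, assuming chromedriver is on PATH or handled by Selenium Manager:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    browser = webdriver.Chrome(options=options)  # Selenium 4 style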
Summary
When a normal HTTP request cannot crawl a page, we can use selenium to operate the browser and grab the content we want. This approach has pros and cons.

Advantages

Crawling can be done by brute force

Jianshu does not require a cookie to view articles, so there is no need to bother hunting for proxies, and we can crawl without being banned

The homepage is delivered via AJAX, so no additional HTTP requests need to be crafted
Disadvantages
The crawl speed is far too slow: think about our program, which waits 2 seconds after every click, so 600 clicks would take 1200 seconds, i.e. 20 minutes...
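A common way to soften this, and an addition of my own rather than part of the article: replace the fixed time.sleep(2) with an explicit wait, so each iteration moves on as soon as the button is actually clickable:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    wait = WebDriverWait(browser, 10)  # wait at most 10 s instead of always sleeping 2 s
    for j in range(10):
        try:
            button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "load-more")))
            button.click()
        except Exception:
            break  # stop once the button no longer appears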
Additional
Here is the complete code:
    from selenium import webdriver
    import time

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    browser = webdriver.Chrome("./chromedriver", chrome_options=options)
    browser.get("https://www.jianshu.com/")

    for i in range(3):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
    # print(browser)

    for j in range(10):
        try:
            button = browser.execute_script(
                "var a = document.getElementsByClassName('load-more');"
                "a[0].click();")
            time.sleep(2)
        except:
            pass

    titles = browser.find_elements_by_class_name("title")
    with open("article_jianshu.txt", "w", encoding="utf-8") as f:
        for t in titles:
            try:
                f.write(t.text + " " + t.get_attribute("href"))
                f.write("\n")
            except TypeError:
                pass