Python crawler tutorial - 26 - Selenium + PhantomJS
- Dynamic front-end pages:
- JavaScript:
JavaScript is an interpreted scripting language that is dynamically typed, weakly typed, and prototype-based, with built-in types. Its interpreter, known as the JavaScript engine, is part of the browser, and the language is widely used for client-side scripting. It was first used in HTML (an application of Standard Generalized Markup Language, SGML) to add dynamic functionality to HTML pages.
- jQuery:
jQuery is a fast, concise JavaScript library (often loosely called a framework), another well-regarded JavaScript code base that came after Prototype. Its design motto is "write less, do more": write less code and get more done. It wraps common JavaScript functionality, provides a simple JavaScript design pattern, and streamlines HTML document manipulation, event handling, animation, and Ajax interaction.
- Ajax:
Ajax stands for "Asynchronous JavaScript and XML" (XML being a subset of Standard Generalized Markup Language) and refers to a web development technique for creating interactive web applications.
Ajax is a technique for creating fast, dynamic web pages.
Ajax updates parts of a web page without reloading the entire page, by exchanging data with the server in the background.
- DHTML:
DHTML is short for Dynamic HTML. It is a way of building web pages that contrasts with traditional static HTML (an application of Standard Generalized Markup Language). DHTML is not really a new language; it is an integration of HTML, CSS, and client-side scripting, so a page consists of HTML + CSS + JavaScript (or another client-side script), with the CSS and scripts written directly in the page rather than linked from separate files. DHTML is not a technology, standard, or specification in its own right. It is a web design concept that combines existing web technologies and language standards so that page elements can still change in real time after the page has been downloaded.
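For a crawler, the practical consequence of all four technologies is the same: the HTML that the server sends may not yet contain the data a user sees. The sketch below illustrates this with two hypothetical page sources (`RAW_HTML` and `RENDERED_HTML` are made up for the example) and the standard library's `html.parser`. It only counts `<li>` tags, so treat it as an illustration rather than a real crawler.

```python
from html.parser import HTMLParser

# Hypothetical source as sent by the server: the list is empty
# until the script runs inside a browser.
RAW_HTML = """
<html><body>
  <ul id="news"></ul>
  <script>
    var li = document.createElement("li");
    li.textContent = "Loaded by JavaScript";
    document.getElementById("news").appendChild(li);
  </script>
</body></html>
"""

# Hypothetical DOM after a browser has executed the script.
RENDERED_HTML = """
<html><body>
  <ul id="news"><li>Loaded by JavaScript</li></ul>
</body></html>
"""


class ListItemCounter(HTMLParser):
    """Counts the <li> start tags found in the markup."""

    def __init__(self):
        super().__init__()
        self.li_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1


def count_list_items(html):
    """Return how many <li> elements a plain HTML parser can see."""
    parser = ListItemCounter()
    parser.feed(html)
    parser.close()
    return parser.li_count


print("static source:", count_list_items(RAW_HTML))
print("after rendering:", count_list_items(RENDERED_HTML))
```

A parser that never runs the script finds no list items in the first source, which is why the rest of this note turns to tools that execute JavaScript.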
Collecting dynamic data with Python
- Work from the JavaScript code itself to collect the data
- Use a third-party Python library that runs the JavaScript and captures the page exactly as you see it in the browser
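One way to read the first approach is: look at the page's JavaScript, find the request it sends, and then handle the response yourself. Such responses are often JSON. The sketch below assumes that case; the payload in `SAMPLE_AJAX_RESPONSE` and the `items`/`title` field names are hypothetical, and no real endpoint is contacted.

```python
import json

# Hypothetical response body, shaped like what an Ajax endpoint might return.
SAMPLE_AJAX_RESPONSE = (
    '{"items": ['
    '{"title": "First post", "views": 120}, '
    '{"title": "Second post", "views": 45}'
    ']}'
)


def extract_titles(response_text):
    """Parse a JSON response and return the title of every item in it."""
    payload = json.loads(response_text)
    # .get() keeps the function safe when the "items" key is missing.
    return [item["title"] for item in payload.get("items", [])]


print(extract_titles(SAMPLE_AJAX_RESPONSE))
```

This approach is fast, but it depends on understanding each site's scripts. The second approach, covered next, trades speed for convenience by letting a real browser engine do the work.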
Selenium + PhantomJS
- Selenium: a web automated testing tool
- Official Selenium documentation: https://www.seleniumhq.org/docs/
- Features of Selenium:
- 1. Loading pages automatically
- 2. Getting data
- 3. Taking screenshots
- PhantomJS: a WebKit-based browser with no user interface (a headless browser)
- Selenium drives PhantomJS
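The three Selenium features listed above map onto a handful of WebDriver members: `get()`, `page_source`, `title`, and `save_screenshot()`. Below is a minimal sketch; the helper names `capture_page` and `make_phantomjs_driver` are hypothetical, and the driver is passed in as an argument so the helper can be tried with any WebDriver-like object. As far as I know, `webdriver.PhantomJS` only exists in Selenium 3.x and earlier, so the second helper assumes such a version is installed.

```python
def capture_page(driver, url, screenshot_path):
    """Load a page, read its rendered data, and save a screenshot."""
    driver.get(url)                                   # 1. load the page (JavaScript runs)
    html = driver.page_source                         # 2. get data: the DOM after rendering
    saved = driver.save_screenshot(screenshot_path)   # 3. screenshot; True on success
    return {"title": driver.title, "html": html, "screenshot_saved": saved}


def make_phantomjs_driver():
    """Create a PhantomJS driver (assumes Selenium 3.x or earlier is installed)."""
    # Imported here so that capture_page stays usable without Selenium.
    from selenium import webdriver
    return webdriver.PhantomJS()


# Usage sketch (needs Selenium and PhantomJS, so it is left commented out):
# driver = make_phantomjs_driver()
# try:
#     result = capture_page(driver, "http://www.baidu.com", "baidu.png")
#     print(result["title"])
# finally:
#     driver.quit()
```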
Installation of Selenium
- If you are using Anaconda:
- It can also be installed directly from PyCharm:
- "PyCharm" > "File" > "Settings" > "Project Interpreter" > "+" > search for "selenium" > "Install"
Installation of PhantomJS
- Download page: http://phantomjs.org/download.html
- Download the build for your operating system; it is ready to use once unzipped
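After unzipping, it can be worth checking whether Python can actually find the executable. The sketch below uses the standard library's `shutil.which`, which searches the directories listed in `PATH`. The helper name `find_phantomjs` and its `fallback_path` parameter are hypothetical and only serve the illustration.

```python
import os
import shutil


def find_phantomjs(fallback_path=None, name="phantomjs"):
    """Return the path to the executable, or None if it cannot be found."""
    found = shutil.which(name)  # searches the directories listed in PATH
    if found:
        return found
    # Not on PATH: fall back to an explicit location if one was given.
    if fallback_path and os.path.isfile(fallback_path):
        return fallback_path
    return None


location = find_phantomjs()
if location:
    print("PhantomJS found at:", location)
else:
    print("PhantomJS is not on PATH; pass its full path explicitly.")
```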
Use of Selenium
- The Selenium library provides a WebDriver API
- WebDriver can interact with the elements on a page, so it can be used for crawling
- Note: when a PhantomJS driver is created, the browser executable is located automatically through the environment variables; if you have not configured the environment variable, pass the path to the executable as a parameter
- Case code, file 28dhtml.py: https://xpwi.github.io/py/py%E7%88%AC%E8%99%AB/py28dhtml.py
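    # Using Selenium
    # Drive Baidu through WebDriver to run a search
    from selenium import webdriver
    import time

    # Keys simulates the keyboard:
    # pass in whatever needs to be typed, so nothing is entered on a real keyboard
    from selenium.webdriver.common.keys import Keys

    # Create an instance for whichever browser you want to drive; here it is PhantomJS
    # The browser is located automatically through the environment variables;
    # if they are not configured, pass the path as a parameter
    driver = webdriver.PhantomJS(executable_path=r"D:\app\phantomjs-2.1.1-windows\bin\phantomjs.exe")
    driver.get("http://www.baidu.com")

    # Read the page's <title> through the driver
    print("Title: {0}".format(driver.title))

    # Shut down the PhantomJS process when finished
    driver.quit()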
Run results
Note: If you have not configured the environment variable, pass the path to your own PhantomJS executable as a parameter
The red text in the output is not an error (it is most likely a warning printed by Selenium); if the title is printed, the setup works
- No person or organization may reprint this note