There are many open source Java components that support a variety of ways to crawl Web pages, so crawling Web pages with Java is quite easy. The main web crawling technologies are:
HttpClient
HttpClient is a sub-project under Apache Jakarta Commons. It provides an efficient, up-to-date, feature-rich client-side programming toolkit supporting the HTTP protocol, including the latest versions and recommendations of the protocol.
The main features provided by HttpClient are listed below; for a more detailed feature list, see the HttpClient homepage.
(1) Implementation of all HTTP methods (GET, POST, PUT, HEAD, etc.)
(2) Support for automatic redirects
(3) Support for the HTTPS protocol
(4) Support for proxy servers
(5) Support for automatic cookie management, etc.
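As a quick illustration, here is a minimal sketch (not from the original text) of fetching a page's raw HTML with Apache HttpClient; it assumes the 4.x API, and the class name and URL are only placeholders.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpClientExample {
    public static void main(String[] args) throws Exception {
        // Create a default client and issue a simple GET request
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet get = new HttpGet("http://www.baidu.com");
            try (CloseableHttpResponse response = client.execute(get)) {
                // The response body is the raw HTML text returned by the server
                String html = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(html.length());
            }
        }
    }
}
```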
Jsoup
Jsoup is a Java HTML parser that can parse a URL address or HTML text content directly. It provides a very labor-saving API for extracting and manipulating data through DOM, CSS, and jQuery-like operations.
Page fetching and parsing are very fast; it is recommended.
The main functions are as follows:
- Parse HTML from a URL, file, or string;
- Find and extract data using DOM traversal or CSS selectors;
- Manipulate HTML elements, attributes, and text.
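Below is a minimal Jsoup sketch covering these three functions: parsing, CSS-selector extraction, and attribute manipulation. The HTML snippet, selector, and class name are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p class='msg'>Hello, Jsoup</p></body></html>";
        // Parse HTML from a string (Jsoup.connect(url).get() would fetch from a URL instead)
        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());       // prints "Demo"
        Element p = doc.selectFirst("p.msg");   // find data with a CSS selector
        System.out.println(p.text());           // prints "Hello, Jsoup"
        p.attr("id", "greeting");               // manipulate an attribute
    }
}
```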
Htmlunit
Htmlunit is an open source Java page analysis tool. After loading a page, you can use Htmlunit to analyze its content effectively. The project can emulate browser behavior and is known as an open source Java browser implementation. This browser has no graphical interface, runs very fast, and uses the Rhino JS engine to simulate JavaScript execution.
Put plainly, Htmlunit is a browser written in Java with no graphical interface; because it is headless, it executes very quickly. Htmlunit provides a series of APIs that can do quite a lot, such as filling out forms, submitting forms, and imitating link clicks, and because of the built-in Rhino JS engine, it can execute JavaScript.
Page fetching and parsing are fast with good performance; it is recommended for application scenarios that need to execute page scripts.
Watij
Watij (pronounced "wattage") is a Web application testing tool developed in Java. Given the simplicity of Watij and the power of the Java language, Watij enables you to automate Web application tests in a real browser. Because a local browser is invoked, CSS rendering and JS execution are supported.
Page fetching is fast, but low IE versions (6/7) may cause memory leaks.
The following mainly introduces the Htmlunit web crawling technology:
It is recommended to use Htmlunit to crawl Web pages mainly because:
- For web crawlers implemented in Java, we can generally use the Apache HttpClient component to fetch HTML page information; the response returned by an HttpClient request is generally a plain-text document, that is, the original HTML page.
- For a static HTML page, HttpClient is enough to crawl out the information we need. But for the growing number of dynamic Web pages, much of the data is fetched and rendered by asynchronous JS code, so the initial HTML page does not contain that part of the data.
Htmlunit is a "GUI-less browser for Java programs." It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, and so on, just as you would in a "normal" browser. It has quite good JavaScript support (continuously improving) and can even work with fairly complex AJAX libraries, emulating Chrome, Firefox, or Internet Explorer depending on the configuration used. It is typically used for testing or for retrieving information from a Web site.
Maven dependencies (jar packages):
```xml
<!-- crawler toolkit -->
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.30</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit-core-js</artifactId>
    <version>2.28</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit-cssparser</artifactId>
    <version>1.0.0</version>
</dependency>
<!-- Jsoup, used to parse the Web page -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>neko-htmlunit</artifactId>
    <version>2.30</version>
</dependency>
```
Here's an example of creating a web client and having it load the Baidu home page. Then we print the page title to verify that it is correct. getPage() can return different types of pages based on the content type of the returned data. In this case, we expect a content type of text/html, so we cast the result to com.gargoylesoftware.htmlunit.html.HtmlPage.
Very convenient:
```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Test {

    public void homePage() throws Exception {
        try (final WebClient webClient = new WebClient()) {
            final HtmlPage page = webClient.getPage("http://www.baidu.com");
            System.out.println(page.getTitleText());
        }
    }

    public static void main(String[] args) {
        try {
            new Test().homePage();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

Printed result:

```
百度一下，你就知道

Process finished with exit code 0
```
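Building on the basic example, the sketch below (an illustration under stated assumptions, not from the original article) shows how one might enable JavaScript, wait for background scripts on a dynamic page, and then hand the rendered markup to Jsoup for CSS-selector extraction. The URL, timeout, and class name are placeholders.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DynamicPageCrawler {

    public static void main(String[] args) throws Exception {
        // Emulate Chrome; Htmlunit also offers Firefox and Internet Explorer versions
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(true);   // run page scripts
            webClient.getOptions().setCssEnabled(false);         // skip CSS for speed
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            // Resynchronize AJAX calls so asynchronously loaded data ends up in the DOM
            webClient.setAjaxController(new NicelyResynchronizingAjaxController());

            final HtmlPage page = webClient.getPage("http://www.baidu.com");
            webClient.waitForBackgroundJavaScript(5000);         // give async JS time to finish

            // Hand the rendered markup to Jsoup for CSS-selector based extraction
            Document doc = Jsoup.parse(page.asXml());
            System.out.println(doc.title());
        }
    }
}
```

If the target page does not need script execution, JavaScript can be left disabled, which makes Htmlunit behave much like a plain HTTP client and run considerably faster.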
Specific introductory examples and the API can be found in the official documentation.