I suddenly got interested in web crawlers, searched around online, and found this particularly good write-up, which I'm sharing with you.
More and more people are keen to build web crawlers (web spiders), and more and more places need them, such as search engines, information collection, and public-opinion monitoring. The technology (algorithms/strategies) involved in web crawling is broad and complex: page fetching, page tracking, page parsing, page searching, page ranking, structured/unstructured data extraction, and, later on, finer-grained data mining. A beginner cannot fully master and skillfully apply all of this overnight, and the author cannot explain it all clearly in a single article. So in this article we focus only on the most basic crawler technique: fetching web pages.
When it comes to page fetching, two points always come up: one is recognizing the page's character encoding, the other is support for executing page scripts. Beyond those, whether a library supports submitting requests with the POST method and managing cookies automatically is also a major concern for many people. In fact, the Java world has many open-source components that support fetching pages in a variety of ways, covering all four of the points mentioned above, so it is easy to write a web crawler in Java. Below, the author focuses on six of these approaches.
HttpClient
HttpClient is a subproject under Apache Jakarta Commons that provides an efficient, up-to-date, feature-rich client-side programming toolkit supporting the HTTP protocol, including its latest versions and recommendations.
The main features provided by HttpClient are listed below; more detail can be found on the HttpClient home page.
(1) Implements all HTTP methods (GET, POST, PUT, HEAD, etc.)
(2) Supports automatic redirection
(3) Supports the HTTPS protocol
(4) Supports proxy servers
(5) Supports automatic cookie management, etc.
HttpClient is the most widely used page-fetching technology in Java crawler development, with first-class speed and performance. Its feature support is relatively low-level: it does not offer browser-like capabilities such as JavaScript execution or CSS parsing and rendering. It is recommended for quickly fetching pages in scenarios that do not require parsing scripts or CSS.
The example code is as follows:
package cn.ysh.studio.crawler.httpclient;

import org.apache.http.client.HttpClient;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.BasicResponseHandler;
import org.apache.http.impl.client.DefaultHttpClient;

/**
 * Crawl page content based on HttpClient
 *
 * @author www.yshjava.cn
 */
public class HttpClientTest {
    public static void main(String[] args) throws Exception {
        // Target page
        String url = "http://www.jjwxc.net/";
        // Create a default HttpClient
        HttpClient httpClient = new DefaultHttpClient();
        try {
            // Request the page with the GET method
            HttpGet httpGet = new HttpGet(url);
            // Print the request address
            System.out.println("Executing request " + httpGet.getURI());
            // Create a response handler to process the server's response content
            ResponseHandler<String> responseHandler = new BasicResponseHandler();
            // Execute the request and obtain the result
            String responseBody = httpClient.execute(httpGet, responseHandler);
            System.out.println("----------------------------------------");
            if (responseBody != null) {
                // Re-encode the body to fix garbled characters on GB2312 pages
                responseBody = new String(responseBody.getBytes("ISO-8859-1"), "GB2312");
            }
            System.out.println(responseBody);
            System.out.println("----------------------------------------");
        } finally {
            // Shut down the connection manager
            httpClient.getConnectionManager().shutdown();
        }
    }
}
Jsoup
Jsoup is a Java HTML parser that can parse HTML directly from a URL address or from HTML text. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like operations.
It fetches and parses pages quickly; recommended.
The main functions are as follows:
1. Parse HTML from a URL, file, or string;
2. Find and extract data using DOM traversal or CSS selectors (see the selector sketch after the example below);
3. Manipulate HTML elements, attributes, and text.
The example code is as follows:
package cn.ysh.studio.crawler.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;

/**
 * Crawl web content based on Jsoup
 *
 * @author www.yshjava.cn
 */
public class JsoupTest {
    public static void main(String[] args) throws IOException {
        // Target page
        String url = "http://www.jjwxc.net";
        // Connect to the target page with Jsoup, execute the request,
        // and get the server's response body
        String html = Jsoup.connect(url).execute().body();
        // Print the page content
        System.out.println(html);
    }
}
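As a minimal sketch of point 2 above (finding data with DOM traversal or CSS selectors), the following pulls the title and all links out of the same page; the selector strings are generic examples, not tailored to the target site:

package cn.ysh.studio.crawler.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupSelectorTest {
    public static void main(String[] args) throws IOException {
        // Parse the target page into a Document
        Document doc = Jsoup.connect("http://www.jjwxc.net").get();
        // Print the page title
        System.out.println(doc.title());
        // Select all links with a CSS selector and print text and absolute href
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}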
HtmlUnit
HtmlUnit is an open-source Java page-analysis tool: after fetching a page, you can use HtmlUnit to analyze its content effectively. The project can simulate a browser and is known as an open-source Java browser implementation. This browser has no user interface and runs very fast, using the Rhino JavaScript engine to simulate JS execution.
Page fetching and parsing are fairly fast with good performance; recommended for application scenarios that need to execute page scripts.
The example code is as follows:
package cn.ysh.studio.crawler.htmlunit;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;

/**
 * Crawl web content based on HtmlUnit
 *
 * @author www.yshjava.cn
 */
public class HtmlUnitSpider {
    public static void main(String[] s) throws Exception {
        // Target page
        String url = "http://www.jjwxc.net/";
        // Simulate a specific browser (Firefox 3)
        WebClient spider = new WebClient(BrowserVersion.FIREFOX_3);
        // Fetch the target page
        Page page = spider.getPage(url);
        // Print the page content
        System.out.println(page.getWebResponse().getContentAsString());
        // Close all windows
        // spider.closeAllWindows();
        spider.close();
    }
}
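Since HtmlUnit's main draw is script execution, here is a minimal sketch of fetching a page with JavaScript enabled and waiting for background scripts to settle. It assumes a reasonably recent HtmlUnit release (one that has WebClientOptions and waitForBackgroundJavaScript), and the 10-second wait is an arbitrary choice:

package cn.ysh.studio.crawler.htmlunit;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitJsSpider {
    public static void main(String[] s) throws Exception {
        WebClient spider = new WebClient();
        // Turn on JavaScript and don't abort on script errors from messy pages
        spider.getOptions().setJavaScriptEnabled(true);
        spider.getOptions().setThrowExceptionOnScriptError(false);
        // Fetch the target page
        HtmlPage page = spider.getPage("http://www.jjwxc.net/");
        // Give background JavaScript up to 10 seconds to finish (arbitrary)
        spider.waitForBackgroundJavaScript(10000);
        // Print the DOM serialized as XML, after scripts have run
        System.out.println(page.asXml());
        spider.close();
    }
}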
Watij
Watij (pronounced "wattage") is a web-application testing tool developed in Java. Watij lets you automate tests of web applications in a real browser, combining the simplicity of Watij with the power of the Java language. Because it drives the local browser, CSS rendering and JS execution are supported.
Page-fetching speed is average, and low IE versions (6/7) may cause memory leaks.
The example code is as follows:
package cn.ysh.studio.crawler.ie;

import watij.runtime.ie.IE;

/**
 * Crawl web content based on Watij, Windows platform only
 *
 * @author www.yshjava.cn
 */
public class WatijTest {
    public static void main(String[] s) {
        // Target page
        String url = "http://www.jjwxc.net/";
        // Instantiate the IE browser object
        IE ie = new IE();
        try {
            // Start the browser
            ie.start();
            // Go to the target page
            ie.goTo(url);
            // Wait for the page to finish loading
            ie.waitUntilReady();
            // Print the page content
            System.out.println(ie.html());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                // Close the IE browser
                ie.close();
            } catch (Exception e) {
            }
        }
    }
}
Selenium
Selenium is also a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE, Mozilla Firefox, Mozilla Suite, and so on. The tool's main features include: browser compatibility testing (test your application to see whether it works well on different browsers and operating systems), system functional testing (create regression tests to verify software functionality and user requirements), and support for recording actions and automatically generating test scripts in different languages such as .NET, Java, and Perl. Selenium is an acceptance-testing tool written by ThoughtWorks specifically for web applications.
Page fetching is slow; for crawlers, it is not a good choice.
The example code is as follows:
package cn.ysh.studio.crawler.selenium;

import org.openqa.selenium.htmlunit.HtmlUnitDriver;

/**
 * Crawl web content based on HtmlUnitDriver
 *
 * @author www.yshjava.cn
 */
public class HtmlDriverTest {
    public static void main(String[] s) {
        // Target page
        String url = "http://www.jjwxc.net/";
        HtmlUnitDriver driver = new HtmlUnitDriver();
        try {
            // Disable the JS script feature
            driver.setJavascriptEnabled(false);
            // Open the target page
            driver.get(url);
            // Get the current page source
            String html = driver.getPageSource();
            // Print the page source
            System.out.println(html);
        } catch (Exception e) {
            // Print the stack trace
            e.printStackTrace();
        } finally {
            try {
                // Close and exit
                driver.close();
                driver.quit();
            } catch (Exception e) {
            }
        }
    }
}
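The example above deliberately disables JavaScript. If you do want scripts executed, HtmlUnitDriver can be constructed with JavaScript enabled instead; a minimal sketch:

package cn.ysh.studio.crawler.selenium;

import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HtmlDriverJsTest {
    public static void main(String[] s) {
        // Pass true to enable JavaScript execution in the embedded HtmlUnit
        HtmlUnitDriver driver = new HtmlUnitDriver(true);
        try {
            // Open the target page and let its scripts run
            driver.get("http://www.jjwxc.net/");
            // The page source now reflects any DOM changes made by scripts
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}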
WebSpec
WebSpec is an open-source Java browser with a user interface that supports script execution and CSS rendering. Speed is average.
The example code is as follows:
package cn.ysh.studio.crawler.webspec;

import org.watij.webspec.dsl.WebSpec;

/**
 * Crawl page content based on WebSpec
 *
 * @author www.yshjava.cn
 */
public class WebSpecTest {
    public static void main(String[] s) {
        // Target page
        String url = "http://www.jjwxc.net/";
        // Instantiate the browser object
        WebSpec spec = new WebSpec().mozilla();
        // Hide the browser window
        spec.hide();
        // Open the target page
        spec.open(url);
        // Print the page source
        System.out.println(spec.source());
        // Close all windows
        spec.closeAll();
    }
}