First, HtmlUnit is an open-source Java tool for page analysis. Once it has loaded a page, you can use HtmlUnit to analyze the content of that page effectively. The project emulates a running browser and is known as an open-source Java implementation of a browser. Because this browser has no user interface, it runs very fast.
Second, the project page: http://sourceforge.net/projects/htmlunit/?source=directory
Third, visit the specified page
The first problem a web crawler faces is how to fetch web pages. Fetching is actually quite easy and not as complex as you might think: with the open-source HtmlUnit package, four lines of core code are enough.
import java.io.IOException;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Main {

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        // Create the headless browser.
        final WebClient mWebClient = new WebClient();
        // Load the page.
        final HtmlPage mHtmlPage = mWebClient.getPage("http://www.baidu.com");
        // Print the text content of the page.
        System.out.println(mHtmlPage.asText());
        // Release the resources held by the client.
        mWebClient.closeAllWindows();
    }

}
Run results:
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: ':checked' error: Invalid selector: *:checked).] sourceName=[http://s1.bdstatic.com/r/www/cache/static/jquery/jquery-1.10.2.min_f2fb5194.js] line=[14] lineSource=[null] lineOffset=[0]
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: ':enabled' error: Invalid selector: *:enabled).] sourceName=[http://s1.bdstatic.com/r/www/cache/static/jquery/jquery-1.10.2.min_f2fb5194.js] line=[14] lineSource=[null] lineOffset=[0]
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://s1.bdstatic.com/r/www/cache/static/jquery/jquery-1.10.2.min_f2fb5194.js] line=[10] lineSource=[null] lineOffset=[0]
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'http://www.baidu.com/' [1:81] Error in expression. (Invalid token ";". Was expecting one of: <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, .)
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'http://www.baidu.com/' [1:143] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
WARNING: CSS warning: 'http://www.baidu.com/' [1:143] Ignoring the following declarations in this rule.
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'http://www.baidu.com/' [1:339] Error in expression. (Invalid token ";". Was expecting one of: <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, .)
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'http://www.baidu.com/' [2:204] Error in declaration. (Invalid token "normal". Was expecting one of: <S>, ":".)
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'http://www.baidu.com/' [2:970] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
WARNING: CSS warning: 'http://www.baidu.com/' [2:970] Ignoring the following declarations in this rule.
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'http://www.baidu.com/' [4:856] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
WARNING: CSS warning: 'http://www.baidu.com/' [4:856] Ignoring the following declarations in this rule.
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'http://www.baidu.com/' [4:1016] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
WARNING: CSS warning: 'http://www.baidu.com/' [4:1016] Ignoring the following declarations in this rule.
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'http://www.baidu.com/' [5:68] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
WARNING: CSS warning: 'http://www.baidu.com/' [5:68] Ignoring the following declarations in this rule.
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'http://www.baidu.com/' [6:751] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
WARNING: CSS warning: 'http://www.baidu.com/' [6:751] Ignoring the following declarations in this rule.
Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'http://www.baidu.com/' [8:127] Error in expression; ':' found after identifier 'progid'.
Feb 03, 2015 11:46:03 AM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
Feb 03, 2015 11:46:03 AM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
Baidu it and you will know
Baidu it
News hao123 Maps Video Tieba Login Settings More products
Set Baidu as homepage About Baidu
©2015 Baidu Must read before using Baidu Beijing ICP License No. 030173
Running the program above retrieves the full content of the Baidu home page, but the run also produces a lot of warnings. There are two main reasons for them:
1. HtmlUnit's support for JavaScript is not very good
2. HtmlUnit's support for CSS is not very good
With these two points in mind, rewrite the code and disable whatever can be disabled. Turning off unnecessary features also improves the program's efficiency, and a web crawler does not need CSS support anyway.
import java.io.IOException;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Main {

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        final WebClient mWebClient = new WebClient();
        // Disable CSS processing; a crawler does not need it.
        mWebClient.getOptions().setCssEnabled(false);
        // Disable JavaScript execution to avoid the script errors seen earlier.
        mWebClient.getOptions().setJavaScriptEnabled(false);
        final HtmlPage mHtmlPage = mWebClient.getPage("http://www.baidu.com");
        // Print the text content of the page.
        System.out.println(mHtmlPage.asText());
        // Release the resources held by the client.
        mWebClient.closeAllWindows();
    }

}
Run results:
Baidu it and you will know
Search settings | Login
News Web Tieba Zhidao MP3 Images Video Maps
Baidu it
Input method
Handwriting
Pinyin
Close
Space Baike hao123 | More >>
Set Baidu as your homepage
Join Baidu Promotion | Search Leaderboard | About Baidu | About Baidu
© Baidu Must read before using Baidu Beijing ICP License No. 030173
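If a page needs JavaScript to stay enabled, another option is to leave the features on and silence the log output instead. The sketch below assumes HtmlUnit's messages end up in java.util.logging, which the timestamp-and-class layout of the warnings in the first run suggests; with a different logging backend (log4j, for example) it would have no effect. The class name QuietHtmlUnitLogs and the helper silenceHtmlUnit are made up for illustration, and the logger name is simply the package prefix seen in the log output.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogs {

    // Package prefix of the HtmlUnit classes that appear in the log output
    // (assumption: HtmlUnit logs under its class names via java.util.logging).
    static final String HTMLUNIT_LOGGER = "com.gargoylesoftware.htmlunit";

    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // configured logger could otherwise be garbage-collected and lose its level.
    static final Logger HTMLUNIT_LOG = Logger.getLogger(HTMLUNIT_LOGGER);

    /** Hypothetical helper: turns off logging for HtmlUnit and its sub-packages. */
    static void silenceHtmlUnit() {
        HTMLUNIT_LOG.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silenceHtmlUnit();
        // Child loggers have no level of their own, so they inherit OFF.
        Logger child = Logger.getLogger(HTMLUNIT_LOGGER + ".DefaultCssErrorHandler");
        System.out.println(child.isLoggable(Level.WARNING)); // prints false
        // Create the WebClient and call getPage(...) after this point.
    }
}
```

Call silenceHtmlUnit() before creating the WebClient. Keep in mind that this hides genuine errors as well, so it is probably best left off while debugging a crawler.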