HtmlUnit Web Crawler Beginner's Study Notes (II)


This time I used crawling Sina Weibo as the example. The process was quite tangled; I consulted many posts by more experienced people, but a lot of problems remain. Let's go through it slowly. I hope readers who know better will correct me; my method is, for now, still fairly clumsy.

Login issues

To crawl Sina Weibo you must log in first. The picture site I crawled earlier required no login, so that step didn't exist; for Sina Weibo it is unavoidable, and that brings up the problem of the verification code (CAPTCHA). From what I have found on Baidu so far, and from my own understanding, I can't solve it automatically for the time being, short of typing it in by hand, since the whole point of a CAPTCHA is to block automated logins and crawlers. So if you want to try this, I suggest using an account that does not currently trigger a CAPTCHA. (If anyone has hints on handling the CAPTCHA, I'd love to hear them.)

Here is the demo code

WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage htmlPage = null;
try {
    htmlPage = webClient.getPage("http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.18)");
    HtmlInput username = (HtmlInput) htmlPage.getElementsByName("username").get(0);
    HtmlInput password = (HtmlInput) htmlPage.getElementsByName("password").get(0);
    HtmlInput btn = htmlPage.getFirstByXPath(".//*[@id='vForm']/div[3]/ul/li[6]/div[2]/input");
    username.setValueAttribute("***");
    password.setValueAttribute("****");
    btn.click();
} catch (IOException e) {
    e.printStackTrace();
}

The code is much the same as before, just a few option settings; the method names should make the meaning clear enough, so I won't over-explain. But some may ask: why isn't the URL passed to webClient.getPage() simply www.weibo.com? Because if you GET www.weibo.com directly and inspect the HTML in the debugger, it's all JS code with no login module; evidently Sina's login form is mostly drawn by scripts. As for where this URL comes from, I started from an earlier post: http://blog.csdn.net/bob007/article/details/29589059

That URL is in fact the Sina passport (SSO) login page.

At first I wondered why it was this interface, so I logged in through www.weibo.com and watched the whole request flow with HttpWatch.

You can see that after entering the account and password, the address we POST to is exactly this one. In other words, logging in to Sina Weibo works roughly like this:

1. Request www.weibo.com to get the login page, and enter the username and password

2. The credentials are submitted to the Sina passport endpoint; clicking Login there completes the login

Our everyday Weibo account is, from Sina's point of view, a kind of Sina passport account, which makes this easy to understand.

Back to HtmlUnit. Once we have the Sina passport login page (an HtmlPage object), how do we get hold of the actual username and password controls? You can work this out by debugging, and the htmlPage.asXml() method lets you read through all the elements of the page.
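For example, a minimal way to look around while debugging (reusing the htmlPage variable from the snippet above):

// Dump the page DOM as XML; search the output for "username" / "password"
// to locate the input elements and their attributes.
System.out.println(htmlPage.asXml());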

The login button is obtained the same way the earlier post does it, with the getFirstByXPath method. The XPath syntax is the standard one; you can study it at http://www.w3school.com.cn/xpath/. With it, a little analysis will get you to basically any node on the page.
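A couple of hedged examples of the same API (these expressions are made up for illustration, not taken from the Weibo page):

// Illustrative XPath lookups on an HtmlPage.
HtmlAnchor firstLink = htmlPage.getFirstByXPath("//a[@href]");
List<?> listItems = htmlPage.getByXPath("//ul/li");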

After that it's simple. With the username input, the password input, and the submit button in hand, the last three lines of code set the username and password and click the button, completing the login. Doesn't the code's logic mirror the page operations exactly? That's what "simulating a browser" means.

Now that we're logged in, we can browse the home page and read the posts of the people we follow.

But note a very important issue here: the way Weibo displays posts. You can observe it in actual use; posts appear through this process:

1. Part of the posts are shown

2. Scrolling down loads some more posts

3. After a few scrolls, the pagination control appears at the bottom of the page

4. Clicking the next page shows part of that page's posts, and steps 2-4 repeat

So I ran into the following problem. I started with the code below to request my home page and print my posts, but unfortunately got only 15 of them, because the scroll bar never moves, so no further posts are requested.

HtmlPage page3 = webClient.getPage("http://weibo.com/5622296819/profile?topnav=1&wvr=6");
// The original snippet is cut off after "List"; presumably it collected the
// post text nodes, e.g. (selector guessed from the Jsoup query used later):
List<?> posts = page3.getByXPath("//div[@class='WB_text W_f14']");

So I started searching Baidu everywhere for how to simulate scroll-bar scrolling with HtmlUnit. I found plenty, and concluded that simulating the scroll action itself basically can't be done. (You can see this poor fellow who hit the same problem back in 2014; his last two paragraphs say exactly what's in my heart: http://stackoverflow.com/questions/23491638/is-there-a-way-to-trigger-scroll-event-with-htmlunit-or-is-it-not-possible-at-al)

However, it was suggested that instead of simulating the scroll action you can simulate the request that the scroll triggers, as in http://blog.csdn.net/zhoujianfeng3/article/details/21395223. That code is incomplete, though, and the URL it simulates is from 2014; I tried it without much hope and, sure enough, it no longer works. Still, the idea inspired me a lot. Following that approach, I watched with HttpWatch, and nowadays the scroll request looks basically like this (captured from an account I follow that has many posts, good for crawling practice):

http://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&from=myfollow_all&pre_page=1&page=1&max_id=&end_id=3882220290581087&pagebar=0&filtered_min_id=&pl_name=Pl_Official_MyProfileFeed__22&id=1005051645851277&script_uri=/u/1645851277&feed_type=0&domain_op=100505&__rnd=1441077128511

After several rounds of analysis, note these two parameters: page=1 and pagebar=0. page is the current page, and pagebar is the sequence number of the scroll within that page. That is, as you keep scrolling down, page stays 1 while pagebar becomes 1; when you page over, page becomes 2 (and so on for later pages) and pagebar starts again from 0. Given the display process described above, this is not hard to understand.
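As a sketch of that pattern (the fixed query parts are copied from the captured URL above but trimmed to the parameters the analysis cares about; whether the server accepts the trimmed form, and the helper name buildScrollUrl itself, are my own assumptions):

// Build the mbloglist URL for a given page and scroll step (pagebar).
static String buildScrollUrl(int page, int pagebar) {
    return "http://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505"
            + "&pl_name=Pl_Official_MyProfileFeed__22"
            + "&id=1005051645851277&script_uri=/u/1645851277&feed_type=0"
            + "&page=" + page
            + "&pagebar=" + pagebar;
}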

Note, though, that the URL above is the scrolling URL; clicking the pager issues a different one. You can see it with HttpWatch yourself, so I won't repeat it here.

With the analysis done, here is the approach.

The scroll request returns JSON data, not an HTML page, so webClient.getPage() can no longer be used. Also note that, from what I have found on Baidu so far, HtmlUnit still cannot parse JSON very well, so here I follow the idea from the post mentioned earlier and bring in another crawler tool, Jsoup, to parse the HTML fragment inside the JSON. Demo code follows.

WebRequest requestOne = new WebRequest(new URL(url), HttpMethod.GET);
WebResponse jsonOne = webClient.loadWebResponse(requestOne);
JSONObject jsonObj = JSONObject.fromObject(jsonOne.getContentAsString());
String data = (String) jsonObj.get("data");
Document doc = Jsoup.parse(data);
Elements elementsTwo = doc.select("div[class=WB_text W_f14]");
for (int i = 0; i < elementsTwo.size(); i++) {
    Element element = elementsTwo.get(i);
    System.out.println("No. " + (counter++) + " Weibo: " + element.text());
}

Here url is the scrolling (paging) URL. We use HtmlUnit to build a WebRequest, fetch the returned JSON, convert it to a JSONObject, and then hand its data field to Jsoup to parse. The syntax of doc.select() is CSS query syntax, which differs little from XPath; you can Baidu it, and it's even easier to pick up. Finally, print all the posts.
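Putting the pieces together, a minimal sketch of the whole fetch loop might look like this. It assumes the buildScrollUrl helper sketched earlier, the logged-in webClient from the login code, two scroll loads per page, and a three-page cap; those bounds are my guesses, not anything Weibo documents:

// Sketch only: run inside a method declared "throws Exception".
int counter = 1;
for (int page = 1; page <= 3; page++) {
    for (int pagebar = 0; pagebar <= 1; pagebar++) {
        WebRequest req = new WebRequest(new URL(buildScrollUrl(page, pagebar)), HttpMethod.GET);
        WebResponse resp = webClient.loadWebResponse(req);
        JSONObject json = JSONObject.fromObject(resp.getContentAsString());
        Document doc = Jsoup.parse((String) json.get("data"));
        for (Element el : doc.select("div[class=WB_text W_f14]")) {
            System.out.println("No. " + (counter++) + " Weibo: " + el.text());
        }
    }
}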

I almost forgot: a few new dependencies need to be added.

<!-- json begin -->
<dependency>
    <groupId>net.sf.json-lib</groupId>
    <artifactId>json-lib</artifactId>
    <classifier>jdk15</classifier>
    <version>2.2</version>
</dependency>
<!-- json end -->
<!-- jsoup begin -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.6.3</version>
</dependency>
<!-- jsoup end -->

In the end I crawled the 5,000-plus posts of that account and wrote them straight to a file, which completes the most basic version. Of course I know this is only the most basic little crawler demo; the boss hasn't assigned me further tasks yet, so I'll keep studying. Corrections from those who know better are very welcome.
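For completeness, the "wrote them to a file" part is just ordinary Java I/O; a minimal sketch (the file name weibo.txt is my own choice) would replace the println in the loop above:

// Append each post's text to a file instead of printing it.
try (PrintWriter out = new PrintWriter(new FileWriter("weibo.txt", true))) {
    for (Element el : doc.select("div[class=WB_text W_f14]")) {
        out.println(el.text());
    }
}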

Really, all I'm offering here is a line of thinking; with plenty of debug analysis you'll get there. My method is admittedly on the clumsy side, so if anyone has a slicker way, please tell me O(^▽^)O

