Htmlunit Web crawler Beginner's study notes (iii)

Source: Internet
Author: User
Tags cdata

The previous article, if the analysis of each URL, including the scroll bar scrolling url and paging URL can be constructed to access the URL to crawl information, but so if you want to my attention of all the microblogging of the people all the output, it is not the URL of each person concerned to see and analysis, This is going to be a lot of work.

So today I just carefully analyzed the URL, found that in fact the change in addition to page and Pagebar The two parameters, there are other parameters need to pay attention to: the person's home page Id,domain and ID, the following explanation

Refer to the URL of a paging page

http://weibo.com/u/1645851277? pids=pl_official_myprofilefeed__22&is_search=0&visible=0&is _tag=0&profile_ftype=1&page= "+ (page++) +" &ajaxpagelet=1&ajaxpagelet_v6=1&__ref=/u/ 1645851277&_t=fm_144101367516634

Refer to the URL of a scroll bar

http://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&from=myfollow_all&pre_page=1&page=1&max_id=&end_id=3881954464220733&pagebar=0& filtered_min_id=&pl_name=pl_official_myprofilefeed__22&id=1005051645851277&script_uri=/u/1645851277&feed_type=0&domain_op=100505&__rnd=1441013708418

Above the red can be seen, 1645851277 is the home page Id,domain is 100505,id is 1005051645851277, is actually domain+ home page ID get

How do you get these parameters?

First is the homepage ID, this actually in my concern of that page can be found, relatively simple, search the ID, see, and then get the way to get the label

Obtained after the construction of a visit to this person's microblog homepage URL, and then need to find domain and ID, with Webclient.getpage in, here debug can see this configuration, hey, is not very obvious

             But the question came, how to get this $config parameter, at first I am also very puzzled, this can do. , put in the CDATA block, and then Baidu a lot of Htmlunit get CDATA method, did not find, very depressed, but this time suddenly found that this is placed in the <script> tag, that is, this is a script, I think Htmlunit an API

             Htmlpage.executejavascript (" Script ")

             This method is to allow you to execute some scripting languages directly, So decisively use the HtmlPage object of this homepage to execute this method            

Configmap = (map<string, string>) page3.executejavascript ("$CONFIG"). Getjavascriptresult ();

This is a parameter, so directly take, strong to map, and then directly get out the required properties, fix, the code after the modification can be found in accordance with the various IDs to the people I care about the microblog can be crawled out.

Finally, you are welcome to continue to criticize the great God, ah, brother in this thank you ('? ') )





Htmlunit Web crawler Beginner's study notes (iii)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.