The previous article, if the analysis of each URL, including the scroll bar scrolling url and paging URL can be constructed to access the URL to crawl information, but so if you want to my attention of all the microblogging of the people all the output, it is not the URL of each person concerned to see and analysis, This is going to be a lot of work.
So today I just carefully analyzed the URL, found that in fact the change in addition to page and Pagebar The two parameters, there are other parameters need to pay attention to: the person's home page Id,domain and ID, the following explanation
Refer to the URL of a paging page
http://weibo.com/u/1645851277? pids=pl_official_myprofilefeed__22&is_search=0&visible=0&is _tag=0&profile_ftype=1&page= "+ (page++) +" &ajaxpagelet=1&ajaxpagelet_v6=1&__ref=/u/ 1645851277&_t=fm_144101367516634
Refer to the URL of a scroll bar
http://weibo.com/p/aj/v6/mblog/mbloglist?ajwvr=6&domain=100505&from=myfollow_all&pre_page=1&page=1&max_id=&end_id=3881954464220733&pagebar=0& filtered_min_id=&pl_name=pl_official_myprofilefeed__22&id=1005051645851277&script_uri=/u/1645851277&feed_type=0&domain_op=100505&__rnd=1441013708418
Above the red can be seen, 1645851277 is the home page Id,domain is 100505,id is 1005051645851277, is actually domain+ home page ID get
How do you get these parameters?
First is the homepage ID, this actually in my concern of that page can be found, relatively simple, search the ID, see, and then get the way to get the label
Obtained after the construction of a visit to this person's microblog homepage URL, and then need to find domain and ID, with Webclient.getpage in, here debug can see this configuration, hey, is not very obvious
But the question came, how to get this $config parameter, at first I am also very puzzled, this can do. , put in the CDATA block, and then Baidu a lot of Htmlunit get CDATA method, did not find, very depressed, but this time suddenly found that this is placed in the <script> tag, that is, this is a script, I think Htmlunit an API
Htmlpage.executejavascript (" Script ")
This method is to allow you to execute some scripting languages directly, So decisively use the HtmlPage object of this homepage to execute this method
Configmap = (map<string, string>) page3.executejavascript ("$CONFIG"). Getjavascriptresult ();
This is a parameter, so directly take, strong to map, and then directly get out the required properties, fix, the code after the modification can be found in accordance with the various IDs to the people I care about the microblog can be crawled out.
Finally, you are welcome to continue to criticize the great God, ah, brother in this thank you ('? ') )
Htmlunit Web crawler Beginner's study notes (iii)