I recently found out that my job involves a lot of data collection tasks, and a friend recommended the WebMagic project, so I started playing with it. It turns out to be a very handy crawler framework: for a static site you barely have to write any code (for small crawlers, at least). Enough chatter; this post records the process of crawling a JS-rendered page. First, find a site whose content is rendered by JavaScript. Here I simply take a URL from the learning docs: http://angularjs.cn/
Opening the page in a browser looks like this:
Viewing the page source looks like this:
With this little source, the content is obviously rendered by JavaScript. To confirm, pick a record from the page and search the source for it; as expected, it cannot be found there.
Then start analyzing the requests; the browser's developer tools show all of them:
Look first at the request that returns the largest amount of data, such as the one marked with the red line above. The XHR tab shows it is an AJAX request; opening it, the returned data looks like this:
Take a record from the page that cannot be found in the page source and search for it in this JSON data. Luckily, it turns up:
Clearly this is the data we want. The next step is to parse this JSON directly to get all of the rendered links.
Click into one of the articles from the page; its link looks like this:
Then go back to the JSON data and find the corresponding title:
And here is the nice find: the _id field is exactly the tail end of that link. A bold guess: every link follows this pattern. (In fact, I clicked a few more links to confirm it.)
Next, write code to parse the JSON, piece the IDs together into links, and add them to the crawl queue; a minimal standalone sketch of that step is shown below.
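Here is a small, self-contained sketch of the ID-to-link step, assuming WebMagic's JsonPathSelector is on the classpath. The JSON string is an abbreviated stand-in for the real list response, and the base prefix is a placeholder for whatever link pattern you confirmed in the browser:

import java.util.List;

import us.codecraft.webmagic.selector.JsonPathSelector;

public class IdExtractionSketch {
    public static void main(String[] args) {
        // Abbreviated stand-in for the list response seen in the XHR tab
        String json = "{\"data\":[{\"_id\":\"A2KW\",\"title\":\"t1\"},{\"_id\":\"B3XZ\",\"title\":\"t2\"}]}";

        // "$.data[*]._id" selects every _id under the data array
        List<String> ids = new JsonPathSelector("$.data[*]._id").selectList(json);

        // Piece each id onto the confirmed link pattern before queueing it
        String base = "http://angularjs.cn/";  // placeholder prefix for illustration
        for (String id : ids) {
            System.out.println(base + id);
        }
    }
}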
It turned out that the detail pages reached through those first-page links are also JS-rendered...
No choice but to take one of those links and keep analyzing its requests.
Capture all of the requests it makes:
Go straight to the XHR tab, which lists the AJAX requests.
Again check the JSON responses from largest to smallest, matching them against the page content; the third one turns out to be the right one.
That gives the final data request link: http://angularjs.cn/api/article/A2KW
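Before wiring this into the crawler, you can sanity-check the endpoint outside the browser with a few lines of plain JDK code. This throwaway snippet is only for inspecting the response and is not part of the final spider:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ArticleApiCheck {
    public static void main(String[] args) throws Exception {
        // Fetch the detail API found in the XHR tab and dump the raw JSON for inspection
        URL url = new URL("http://angularjs.cn/api/article/A2KW");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}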
Then you can write the code:
import java.util.List;

import org.apache.commons.collections.CollectionUtils;
import org.junit.Test;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.JsonPathSelector;

public class SpiderTest implements PageProcessor {

    // Crawl configuration: encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    // The matching rule below comes from the hidden request found in the browser first
    private static final String urlRule = "http://angularjs\\.cn/api/article/latest.*";
    private static String firstUrl = "http://angularjs.cn/api/article/";

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        /*
         * Filter out all URLs that match the rule and manually add the
         * assembled detail requests to the crawl queue.
         */
        if (page.getUrl().regex(urlRule).match()) {
            // Get the _id values from the JSON via JsonPath, then piece together the crawl links
            List<String> endUrls = new JsonPathSelector("$.data[*]._id").selectList(page.getRawText());
            if (CollectionUtils.isNotEmpty(endUrls)) {
                for (String endUrl : endUrls) {
                    page.addTargetRequest(firstUrl + endUrl);
                }
            }
        } else {
            // Extract title and content from the crawled JSON via JsonPath
            page.putField("title", new JsonPathSelector("$.data.title").select(page.getRawText()));
            page.putField("content", new JsonPathSelector("$.data.content").select(page.getRawText()));
        }
    }

    @Test
    public void test() {
        Spider.create(new SpiderTest()).addUrl("http://angularjs.cn/api/article/latest?s=20").run();
    }
}
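As written, the extracted fields only go to WebMagic's default console output. If you want to keep the results, you can attach a pipeline when starting the spider; a minimal sketch using WebMagic's JsonFilePipeline is below (the output directory is an arbitrary example):

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;

public class SpiderRunner {
    public static void main(String[] args) {
        // Same entry URL as the test above, but each result is written as a JSON
        // file under /tmp/angularjs (arbitrary example path) instead of the console.
        Spider.create(new SpiderTest())
                .addUrl("http://angularjs.cn/api/article/latest?s=20")
                .addPipeline(new JsonFilePipeline("/tmp/angularjs"))
                .thread(2)
                .run();
    }
}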
And with that, a JS-rendered page has been crawled. Done.