Crawling a JS-Rendered Site with WebMagic


I recently learned that a batch of data-collection tasks was coming up at work, and a friend recommended the WebMagic project, so I gave it a try. It turns out the framework is very handy: crawling a static site requires almost no code (for small crawlers, at least). Enough chatter; this post records the process of crawling a JS-rendered page. First, find a site that renders its content with JavaScript. Here I simply take a URL from the learning material: http://angularjs.cn/

Opening the page in a browser, it looks like this:

Viewing the page source, it looks like this:

With so little markup in the source, the content is clearly rendered by JavaScript. Searching the source for one of the records visible on the page confirms it: sure enough, the record cannot be found there.

Next, analyze the requests. Open the browser's developer tools and look at everything the page loads.

Start with the request that returns the largest amount of data, such as the one marked with the red line above. The XHR tab shows it is an AJAX request, in this case http://angularjs.cn/api/article/latest?s=20. Opening the response, the data looks like this:

Take a record from the web page that could not be found in the HTML source and search for it in this JSON data. Luck is on our side; it turns up:

Needless to say, this is it! The next step is to parse the JSON directly to get all the rendered links; a minimal sketch follows.
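As a quick sketch of that step (assuming the list response has roughly the shape {"data":[{"_id":"..."},...]}, which is also what the JsonPath expressions in the final crawler rely on), WebMagic's JsonPathSelector can pull the IDs out directly. "A2KW" is a real ID from later in this post; the second ID is a made-up placeholder:

import java.util.List;
import us.codecraft.webmagic.selector.JsonPathSelector;

public class JsonPathDemo {
    public static void main(String[] args) {
        // Assumed shape of the list API response; the real response carries more fields.
        // "XXXX" is a hypothetical placeholder ID.
        String json = "{\"data\":[{\"_id\":\"A2KW\"},{\"_id\":\"XXXX\"}]}";
        // "$.data[*]._id" selects every _id in the data array.
        List<String> ids = new JsonPathSelector("$.data[*]._id").selectList(json);
        System.out.println(ids); // prints [A2KW, XXXX]
    }
}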

Click into one of the records from the web page; its link looks like this:

Then go back to the JSON data and find that article's title:

A great find! The _id field is exactly the tail of the article's link. Bold speculation: all the links follow this pattern. (In fact, I clicked through a few more links to confirm it.)

Next, write code to parse the JSON data, piece together all the links, and add them to the crawl queue.

It turns out that the detail pages reached through the first-page links are also JS-rendered...

No way around it; take one of those links and analyze its requests as well.

Capture all of its requests:

Go straight to the XHR tab, which lists the AJAX requests.

As before, check the JSON responses from largest to smallest and match them against the page content; the third one turns out to be the right match.

This gives the final data request link: http://angularjs.cn/api/article/A2KW

Then you can write the code:

import java.util.List;

import org.apache.commons.collections.CollectionUtils;
import org.junit.Test;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.JsonPathSelector;

public class SpiderTest implements PageProcessor {
    // Crawl configuration: encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);
    // The matching rule below comes from the hidden request found earlier in the browser
    private static final String URL_RULE = "http://angularjs\\.cn/api/article/latest.*";
    private static String firstUrl = "http://angularjs.cn/api/article/";

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        /*
         * Filter out all matching URLs and manually add them to the crawl queue.
         */
        if (page.getUrl().regex(URL_RULE).match()) {
            // Pull the _id values out of the list JSON via JsonPath, then piece together the detail links
            List<String> endUrls = new JsonPathSelector("$.data[*]._id").selectList(page.getRawText());
            if (CollectionUtils.isNotEmpty(endUrls)) {
                for (String endUrl : endUrls) {
                    page.addTargetRequest(firstUrl + endUrl);
                }
            }
        } else {
            // Extract title and content from the crawled detail JSON via JsonPath
            page.putField("title", new JsonPathSelector("$.data.title").select(page.getRawText()));
            page.putField("content", new JsonPathSelector("$.data.content").select(page.getRawText()));
        }
    }

    @Test
    public void test() {
        Spider.create(new SpiderTest()).addUrl("http://angularjs.cn/api/article/latest?s=20").run();
    }
}
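To run the spider outside of JUnit, the same thing can be launched from a plain main method; a minimal sketch is below. The ConsolePipeline (part of WebMagic) just prints each page's extracted fields to stdout, and the thread count of 5 is an arbitrary choice for this sketch:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;

public class SpiderMain {
    public static void main(String[] args) {
        Spider.create(new SpiderTest())
                // Seed with the list API request discovered in the browser
                .addUrl("http://angularjs.cn/api/article/latest?s=20")
                // Print extracted title/content fields to the console
                .addPipeline(new ConsolePipeline())
                // 5 threads is an arbitrary choice for this sketch
                .thread(5)
                .run();
    }
}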

At this point, the rendered site has been crawled. Done.
