I recently found out that my job involves a lot of data collection tasks, and a friend recommended the WebMagic project, so I started playing with it. It turns out to be a very handy crawler framework: for a static site you barely have to write any code (for small crawlers, at least). Enough chatter; this post records the process of crawling a JS-rendered page. First, find a site whose content is rendered by JavaScript. Here I simply take a URL from the learning docs: http://angularjs.cn/
Opening the page in a browser looks like this:
Viewing the page source looks like this:
With this little source, the content is obviously rendered by JavaScript. To confirm, pick a record from the page and search the source for it; as expected, it cannot be found there.
Then start analyzing the requests; the browser's developer tools show all of them:
Look first at the request that returns the largest amount of data, such as the one marked with the red line above. The XHR tab shows it is an AJAX request; opening it, the returned data looks like this:
Take a record from the page that cannot be found in the page source and search for it in this JSON data. Luckily, it turns up:
Clearly this is the data we want. The next step is to parse this JSON directly to get all of the rendered links.
Click into one of the articles from the page; its link looks like this:
Then go back to the JSON data and find the corresponding title:
And here is the nice find: the _id field is exactly the tail end of that link. A bold guess: every link follows this pattern. (In fact, I clicked a few more links to confirm it.)
Next, write code to parse the JSON, piece the IDs together into links, and add them to the crawl queue; a minimal standalone sketch of that step is shown below.
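Here is a small, self-contained sketch of the ID-to-link step, assuming WebMagic's JsonPathSelector is on the classpath. The JSON string is an abbreviated stand-in for the real list response, and the base prefix is a placeholder for whatever link pattern you confirmed in the browser:

import java.util.List;

import us.codecraft.webmagic.selector.JsonPathSelector;

public class IdExtractionSketch {
    public static void main(String[] args) {
        // Abbreviated stand-in for the list response seen in the XHR tab
        String json = "{\"data\":[{\"_id\":\"A2KW\",\"title\":\"t1\"},{\"_id\":\"B3XZ\",\"title\":\"t2\"}]}";

        // "$.data[*]._id" selects every _id under the data array
        List<String> ids = new JsonPathSelector("$.data[*]._id").selectList(json);

        // Piece each id onto the confirmed link pattern before queueing it
        String base = "http://angularjs.cn/";  // placeholder prefix for illustration
        for (String id : ids) {
            System.out.println(base + id);
        }
    }
}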
It turned out that the detail pages reached through those first-page links are also JS-rendered...
No choice but to take one of those links and keep analyzing its requests.
Capture all of the requests it makes:
Go straight to the XHR tab, which lists the AJAX requests.
Again check the JSON responses from largest to smallest, matching them against the page content; the third one turns out to be the right one.
That gives the final data request link: http://angularjs.cn/api/article/A2KW
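Before wiring this into the crawler, you can sanity-check the endpoint outside the browser with a few lines of plain JDK code. This throwaway snippet is only for inspecting the response and is not part of the final spider:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ArticleApiCheck {
    public static void main(String[] args) throws Exception {
        // Fetch the detail API found in the XHR tab and dump the raw JSON for inspection
        URL url = new URL("http://angularjs.cn/api/article/A2KW");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}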
Then you can write the code:
import java.util.List;

import org.apache.commons.collections.CollectionUtils;
import org.junit.Test;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.JsonPathSelector;

public class SpiderTest implements PageProcessor {

    // Crawl configuration: encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    // The matching rule below comes from the hidden request found in the browser first
    private static final String urlRule = "http://angularjs\\.cn/api/article/latest.*";
    private static String firstUrl = "http://angularjs.cn/api/article/";

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        /*
         * Filter out all URLs that match the rule and manually add the
         * assembled detail requests to the crawl queue.
         */
        if (page.getUrl().regex(urlRule).match()) {
            // Get the _id values from the JSON via JsonPath, then piece together the crawl links
            List<String> endUrls = new JsonPathSelector("$.data[*]._id").selectList(page.getRawText());
            if (CollectionUtils.isNotEmpty(endUrls)) {
                for (String endUrl : endUrls) {
                    page.addTargetRequest(firstUrl + endUrl);
                }
            }
        } else {
            // Extract title and content from the crawled JSON via JsonPath
            page.putField("title", new JsonPathSelector("$.data.title").select(page.getRawText()));
            page.putField("content", new JsonPathSelector("$.data.content").select(page.getRawText()));
        }
    }

    @Test
    public void test() {
        Spider.create(new SpiderTest()).addUrl("http://angularjs.cn/api/article/latest?s=20").run();
    }
}
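As written, the extracted fields only go to WebMagic's default console output. If you want to keep the results, you can attach a pipeline when starting the spider; a minimal sketch using WebMagic's JsonFilePipeline is below (the output directory is an arbitrary example):

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;

public class SpiderRunner {
    public static void main(String[] args) {
        // Same entry URL as the test above, but each result is written as a JSON
        // file under /tmp/angularjs (arbitrary example path) instead of the console.
        Spider.create(new SpiderTest())
                .addUrl("http://angularjs.cn/api/article/latest?s=20")
                .addPipeline(new JsonFilePipeline("/tmp/angularjs"))
                .thread(2)
                .run();
    }
}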
And with that, a JS-rendered page has been crawled. Done.