Teach You How to Write an E-commerce Crawler - Lesson 3: Handling Showjoy's Ajax Requests and Extracting the Content

Source: Internet
Author: User
Tags: chrome developer

Tutorial Series:

Teach You How to Write an E-commerce Crawler - Lesson 1: Find a Soft Persimmon to Squeeze (Pick an Easy Target)

Teach You How to Write an E-commerce Crawler - Lesson 2: A Product Collection Crawler for Showjoy Pages

After reading the first two lessons, I believe everyone has been promoted from beginner rookie to intermediate rookie. Good, then let's continue our crawler course.

In the last lesson the opponent was simply too strong, so we did not manage to finish the Showjoy crawler.

We continue in this lesson, and we will strive to take Showjoy down completely, leaving no regrets.

Let's first review the previous lesson, which left us two problems, both related to Ajax.

1. Because the next page is loaded via Ajax, the next-page URL is not discovered automatically by the framework.

2. The price on the product page is loaded via Ajax, so we cannot get it directly from the page itself.

OK, let's solve the first problem first:

The first problem is actually a very common one in crawling: URL discovery. By default, URL discovery is handled automatically by the God Archer framework, but when content is loaded via Ajax the framework cannot find the URLs, and we have to handle URL discovery ourselves. For this, the framework gives us a handy callback function that lets us take over URL discovery:

onProcessHelperUrl(url, content, site)

This callback function receives three parameters: the URL currently being processed, the content of that page, and the site object for the whole crawl. We can analyze the page content to see whether it contains a new URL that we need, and add it to the URL queue with site.addUrl(). Here we can see that once we request a page number past the end, Showjoy returns a page like this, which tells us that we have run out of pages and no new page URL needs to be added:

This page is easy to recognize: just check whether the content contains the keyword "无匹配商品" ("no matching products").

Here we need some basic JS skills; the code is as follows:

    function(url, content, site) {
        // If the "no matching products" keyword is absent, we are not on the last page yet
        if (content.indexOf("无匹配商品") < 0) {
            // Add 1 to the current page number and queue the next page
            var currentPage = parseInt(url.substring(url.indexOf("&page=") + 6));
            var page = currentPage + 1;
            var nextUrl = url.replace("&page=" + currentPage, "&page=" + page);
            site.addUrl(nextUrl);
        }
    }

The principle is very simple: if the content does not contain the "no matching products" keyword, we take the current URL, bump its page number by one, and add the next page to the crawl queue.
OK, the Ajax paging problem is completely solved. Next comes the harder Ajax content-loading problem, namely how to get the price information on the product page.

For this kind of problem, we usually have two approaches:

1. Render the whole page with a JS engine and then extract the content. For complex JS pages this is the only option, and it is easy to do with the God Archer framework, but because the JS has to be executed the crawl becomes very slow, so unless there is no other way we will not fire this nuclear weapon.

2. Drawing on the experience we just gained handling pagination, we can analyze the Ajax request in advance and then associate that extra request with the original page request. This approach suits relatively simple JS pages.

OK, now that both approaches have been introduced: from experience, Showjoy's Ajax loading does not look very complex, so we choose the second approach to handle this Ajax page loading.

As before, we capture this Ajax request with the Chrome developer tools. Here is a small trick: in the developer tools you can filter the requests to show only XHR, i.e. the asynchronous requests, which makes it easy to spot our suspect URL:

http://item.showjoy.com/product/getPrice?skuId=22912

Looking back at the page, how can we extract 22912 most conveniently? We quickly find this tag:

    type="hidden" value="22912" id="J_UItemId" /> 

This tag is clean, and the XPath to get the value is simple:

input[@id = "J_uitemid"]/@value

With that settled, let's look at what this price request returns:

{"Count ":0, "Data ":{"discount": discountmoney ": " 43.00 "," originalprice ": 112," price ": " 69.00 ","  Showjoyprice ": " 69.00 "},"  Isredirect ": 0," isSuccess ": Span class= "Hljs-value" >0, "login": 0}          

As you can see, it is a typical JSON object. That makes things easy: the God Archer framework lets us extract content with JsonPath, so we can easily pull out the price, that is, the value of the price field.
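
To make the JsonPath expression concrete, here is a minimal sketch in plain JavaScript (not framework code, and the response literal is trimmed for illustration) of what $.data.price resolves to on a response shaped like the one above:

    // Illustration only: how the JsonPath $.data.price maps onto the getPrice response
    var response = JSON.parse('{"data": {"discountMoney": "43.00", "originalPrice": 112, "price": "69.00"}, "isSuccess": 0}');
    // "$" is the root object, ".data" steps into the data field, ".price" picks the price
    var price = response.data.price;   // "69.00"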

So how do we finally associate this request with the product page? The framework provides a feature for exactly this scenario, called AttachedUrl, which is designed for the case where the value of a field has to be extracted from the content of an associated request. I won't go through the syntax; here is the code:

    {        name: "skuid", selector: "//input[@id=‘J_UItemId‘]/@value", }, { name: "price", sourceType: SourceType.AttachedUrl, attachedUrl: "http://item.showjoy.com/product/getPrice?skuId={skuid}", selectorType: SelectorType.JsonPath, selector: "$.data.price", } 

A brief explanation of how AttachedUrl is used. First we set sourceType to SourceType.AttachedUrl, and we set attachedUrl to the address of the associated request. One value in that address is dynamic, so we need to extract it before this field; that is why we added an extraction item named skuid just before it. The {skuid} in attachedUrl is automatically replaced with the value produced by the skuid extraction item when the real request is made. Then, since the request returns JSON, we extract with JsonPath, and finally we write the extraction rule. JsonPath is even simpler than XPath, so I'm sure everyone can follow it.
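
To visualize the {skuid} placeholder, here is a minimal sketch (an illustration of the idea, not the framework's actual internals) of how the previously extracted value is substituted into attachedUrl before the associated request is fired:

    // Illustration only: conceptually, {skuid} is filled in with the value
    // extracted by the "skuid" field before the associated request is made.
    var attachedUrl = "http://item.showjoy.com/product/getPrice?skuId={skuid}";
    var extracted = { skuid: "22912" };   // value pulled by the skuid XPath selector
    var requestUrl = attachedUrl.replace("{skuid}", extracted.skuid);
    // requestUrl is now "http://item.showjoy.com/product/getPrice?skuId=22912"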

Well, after all that talk, the complete code is as follows:

    var configs = {domains: ["Www.showjoy.com","List.showjoy.com","Item.showjoy.com"], Scanurls: ["Http://list.showjoy.com/search/?q=cateIds%3A1,cateName%3A%E9%9D%A2%E8%86%9C"], contenturlregexes: ["Http://item\\.showjoy\\.com/sku/\\d+\\.html"], helperurlregexes: ["Http://list\\.showjoy\\.com/search/\\?q=cateids%3a1,catename%3a%e9%9d%a2%e8%86%9c (\\&page=\\d+)?"],You can leave fields blank: [{The first extract name:"title", selector:"//h3[contains (@class, ' Choose-hd ')]",XPath required is used by default:Truecannot be empty}, {The second extract name:"Comment", selector:"//div[contains (@class, ' DTABS-HD ')]/ul/li[2]",Using regular extraction rules required:Falsecannot be empty}, {The third extract name:"Sales", selector:"//div[contains (@class, ' DTABS-HD ')]/ul/li[3]",Using regular extraction rules required:Falsecannot be empty}, {name:"Skuid", selector:"//input[@id = ' j_uitemid ']/@value",}, {name:"Price", SOURCETYPE:SOURCETYPE.ATTACHEDURL, Attachedurl:"Http://item.showjoy.com/product/getprice?skuid={skuid}", SelectorType:SelectorType.JsonPath, selector:"$.data.price",}]}; Configs.onprocesshelperurl =function (URL, content, site) { Span class= "Hljs-keyword" >if (!content.indexof ( "no matching goods")) {// If not to the last page, add the page 1 var currentpage = parseint (url.substring ( Url.indexof ( "&page=") + 6)); var page = currentpage + 1; var Nexturl = url.replace ( "&page=" + currentPage,  "&page=" + page); Site.addurl (Nexturl); } return true;} var crawler = new crawler (configs); Crawler.start ();    

Finally, let's run it and look at the crawl results:

Admiring the fruits of our own labor is quite satisfying, but the crawl results still leave something to be desired: the comment count and sales we get are whole sentences, while what we want are the specific numbers. How do we handle that? This is post-processing of an extracted field, and the framework provides a callback function for it:

afterExtractField(fieldName, data)

This function receives the name of the extracted field and the extracted data; we only need to process the data further with JS string functions. Here is the complete modified code:

var configs = {domains: ["Www.showjoy.com","List.showjoy.com","Item.showjoy.com"], Scanurls: ["Http://list.showjoy.com/search/?q=cateIds%3A1,cateName%3A%E9%9D%A2%E8%86%9C"], contenturlregexes: ["Http://item\\.showjoy\\.com/sku/\\d+\\.html"], helperurlregexes: ["Http://list\\.showjoy\\.com/search/\\?q=cateids%3a1,catename%3a%e9%9d%a2%e8%86%9c (\\&page=\\d+)?"],You can leave fields blank: [{The first extract name:"title", selector:"//h3[contains (@class, ' Choose-hd ')]",XPath required is used by default:Truecannot be empty}, {The second extract name:"Comment", selector:"//div[contains (@class, ' DTABS-HD ')]/ul/li[2]",Using regular extraction rules required:Falsecannot be empty}, {The third extract name:"Sales", selector:"//div[contains (@class, ' DTABS-HD ')]/ul/li[3]",Using regular extraction rules required:Falsecannot be empty}, {name:"Skuid", selector:"//input[@id = ' j_uitemid ']/@value",}, {name:"Price", SOURCETYPE:SOURCETYPE.ATTACHEDURL, Attachedurl:"Http://item.showjoy.com/product/getprice?skuid={skuid}", SelectorType:SelectorType.JsonPath, selector:"$.data.price",}]}; Configs.onprocesshelperurl =function(URL, content, site) {if (!content.indexof ("No matching items")) {If not to the last page, add 1 pagesvar currentpage =parseint (Url.substring (Url.indexof ("&page=") +6));var page = currentpage +1; var Nexturl = url.replace ( "&page=" + currentPage,  "&page=" + page); Site.addurl (Nexturl); } return true;} Configs.afterextractfield = function (fieldName, data) {if (fieldName = = " comment "| | fieldName = " sales ") {var regex = /.* ((\d+)). */; return (Data.match (regex)) [1];} return data;} var crawler = new crawler (configs); Crawler.start ();    


Here we check whether the field is comment or sales, and if so we extract the number inside the parentheses directly with a regex. It is important to note that the parentheses on the web page are full-width characters, so be careful not to write them as the normal half-width ones.
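
To see why the full-width parentheses matter, here is a tiny check (the sample string is made up for illustration, in the style of the counts shown on the page):

    // Illustration only: the regex with full-width parentheses pulls out the number
    var regex = /.*（(\d+)）.*/;
    var sample = "评论（123）";             // made-up sample in the page's style
    console.log(sample.match(regex)[1]);   // prints "123"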

This time we can finally enjoy looking at our crawler's data results:

