How to Write an Internet Finance Crawler - Lesson 4: Xueqiu (Snowball) Stock Crawler (Multiple Records per Page)

Source: Internet
Author: User
Tags: xpath

Previously in this series:

How to Write an Internet Finance Crawler - Lesson 1: P2P Lending Crawler (XPath Primer)

How to Write an Internet Finance Crawler - Lesson 2: Xueqiu (Snowball) Stock Crawler (Introduction to Regular Expressions)

How to Write an Internet Finance Crawler - Lesson 3: Xueqiu (Snowball) Stock Crawler (Ajax Analysis)


Ha, here we are again. This tutorial series is a bit free-spirited, but let's strike while the iron is hot: in the last lesson we finished the analysis but never wrote the code, so let's complete it now.


Tool Requirements:

This tutorial mainly uses:

1. The God Archer cloud crawler framework — the foundation the crawler runs on.

2. The Chrome browser with the XPath Helper extension — for testing whether your XPath expressions are correct.

3. Advanced REST Client — for simulating request submissions.

Basic knowledge:

This tutorial mainly uses basic JavaScript and XPath syntax. If you're unfamiliar with either, it's worth learning the basics in advance; both are very simple.


Remember the steps we laid out in the first lesson of the e-commerce crawler series? Let's walk the same path again:


Step 1: Determine the entry URL

For now, we'll use the Ajax URL for the first page as the entry:

http://xueqiu.com/stock/cata/stocklist.json?page=1&size=30&order=desc&orderby=percent&type=11%2c12

Step 2: Distinguish content pages from intermediate pages

This may be a little puzzling: although every stock has its own page, the list page already carries plenty of information, and crawling just the list pages is enough. So how do we distinguish content pages from intermediate pages? In fact, we simply set the content page regex and the intermediate page regex to be the same, as follows:

http://xueqiu.com/stock/cata/stocklist\\.json\\?page=\\d+&size=30&order=desc&orderby=percent&type=11%2c12

As a reminder, the reason for the double backslashes is that in God Archer the regex is set as a string, so every escape character has to be escaped once more.
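In code form this is simply the regex written inside a JavaScript string, with each backslash doubled; the same strings appear again in the full configuration at the end of this lesson:

    // "\." becomes "\\." and "\d+" becomes "\\d+" once the regex lives inside a JS string.
    contentUrlRegexes: ["http://xueqiu.com/stock/cata/stocklist\\.json\\?page=\\d+&size=30&order=desc&orderby=percent&type=11%2c12"],
    helperUrlRegexes: ["http://xueqiu.com/stock/cata/stocklist\\.json\\?page=\\d+&size=30&order=desc&orderby=percent&type=11%2c12"],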

Step 3: Content page extraction rules

Since the Ajax call returns JSON, and JsonPath is supported as an extraction method, the extraction rules are simple. Note, however, that because we are extracting data from a list page, the top level of the data is effectively a list, so the top-level field must be marked as repeated. The specific extraction rules are as follows:

    fields: [
        {
            name: "stocks",
            selector: "$.stocks",
            selectorType: SelectorType.JsonPath,
            repeated: true,
            children: [
                {
                    name: "code",
                    alias: "Code",
                    selector: "$.code",
                    selectorType: SelectorType.JsonPath
                },
                {
                    name: "name",
                    alias: "Name",
                    selector: "$.name",
                    selectorType: SelectorType.JsonPath
                },
                {
                    name: "current",
                    alias: "Current Price",
                    selector: "$.current",
                    selectorType: SelectorType.JsonPath
                },
                {
                    name: "high",
                    alias: "Highest Price",
                    selector: "$.high",
                    selectorType: SelectorType.JsonPath
                },
                {
                    name: "low",
                    alias: "Lowest Price",
                    selector: "$.low",
                    selectorType: SelectorType.JsonPath
                }
            ]
        }
    ]

I've only extracted a few fields here; the remaining fields follow the same pattern.

Well, the main extraction code is written. Two problems remain to be solved:

1. Before crawling, we need to visit the home page once to obtain a cookie.

2. Although we can simply keep adding the next page, we don't know the total number of pages in advance.

For the first problem, we only need to request the home page in the beforeCrawl callback; God Archer will automatically process and save the cookie. The specific code is as follows:

    configs.beforeCrawl = function(site) {
        site.requestUrl("http://xueqiu.com");
    };

OK, apart from the pagination there shouldn't be any problems. Let's run a quick test and see the results:

[Screenshot: crawl results for the first page]

The data is coming out fine, and the first page is fully captured. So how do we handle the following pages? We have two options:


Scenario 1:

We can see that the returned JSON contains a count field, which appears to be the total number of records. With that value and the number of records per page, we can work out the total number of pages ourselves. For example, if count were 5000 and each page returned 30 records, Math.ceil(5000 / 30) = 167 pages.
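Judging from the selectors and the count handling used below, the response body is roughly shaped like this (the values here are made up purely for illustration):

    {
        "count": { "count": 5000 },
        "stocks": [
            { "code": "SH600000", "name": "...", "current": "...", "high": "...", "low": "..." }
        ]
    }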

Scenario 2:

Alternatively, we can request a page number that is far too large and see what Xueqiu returns. If we try page 500, the stocks array in the response is empty, so we could decide whether to queue the next page based on whether the current page still contains data.
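For reference only, here is a minimal sketch of what scenario 2 could look like (this is not the approach used below; it assumes the same page and site objects as the other callbacks in this lesson):

    // Scenario 2 sketch: keep queuing the next page as long as the current page still has data.
    configs.onProcessHelperPage = function(page, content, site) {
        var result = JSON.parse(page.raw);
        if (result.stocks && result.stocks.length > 0) {
            // Read the current page number from the URL and queue the next one.
            var current = parseInt(page.url.match(/page=(\d+)/)[1], 10);
            site.addUrl("http://xueqiu.com/stock/cata/stocklist.json?page=" + (current + 1) + "&size=30&order=desc&orderby=percent&type=11%2c12");
        }
    };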

Both scenarios have pros and cons. We'll go with the first one; the specific code is as follows:

    configs.onProcessHelperPage = function(page, content, site) {
        if (page.url.indexOf("page=1&size=30") !== -1) {
            // If this is the first page, compute the total page count and queue the rest
            var result = JSON.parse(page.raw);
            var count = result.count.count;
            var page_num = Math.ceil(count / 30);
            if (page_num > 1) {
                for (var i = 2; i <= page_num; i++) {
                    site.addUrl("http://xueqiu.com/stock/cata/stocklist.json?page=" + i + "&size=30&order=desc&orderby=percent&type=11%2c12");
                }
            }
        }
    };


Well, after three lessons of hard work, we have finally conquered the Xueqiu stock list. Let's take a look at the results of a run first.

[Screenshot: full crawl results]


The complete code is as follows:

    var configs = {
        domains: ["xueqiu.com"],
        scanUrls: ["http://xueqiu.com/stock/cata/stocklist.json?page=1&size=30&order=desc&orderby=percent&type=11%2c12"],
        contentUrlRegexes: ["http://xueqiu.com/stock/cata/stocklist\\.json\\?page=\\d+&size=30&order=desc&orderby=percent&type=11%2c12"],
        helperUrlRegexes: ["http://xueqiu.com/stock/cata/stocklist\\.json\\?page=\\d+&size=30&order=desc&orderby=percent&type=11%2c12"],
        fields: [
            {
                name: "stocks",
                selector: "$.stocks",
                selectorType: SelectorType.JsonPath,
                repeated: true,
                children: [
                    {
                        name: "code",
                        alias: "Code",
                        selector: "$.code",
                        selectorType: SelectorType.JsonPath
                    },
                    {
                        name: "name",
                        alias: "Name",
                        selector: "$.name",
                        selectorType: SelectorType.JsonPath
                    },
                    {
                        name: "current",
                        alias: "Current Price",
                        selector: "$.current",
                        selectorType: SelectorType.JsonPath
                    },
                    {
                        name: "high",
                        alias: "Highest Price",
                        selector: "$.high",
                        selectorType: SelectorType.JsonPath
                    },
                    {
                        name: "low",
                        alias: "Lowest Price",
                        selector: "$.low",
                        selectorType: SelectorType.JsonPath
                    }
                ]
            }
        ]
    };

    configs.onProcessHelperPage = function(page, content, site) {
        if (page.url.indexOf("page=1&size=30") !== -1) {
            // If this is the first page, compute the total page count and queue the rest
            var result = JSON.parse(page.raw);
            var count = result.count.count;
            var page_num = Math.ceil(count / 30);
            if (page_num > 1) {
                for (var i = 2; i <= page_num; i++) {
                    site.addUrl("http://xueqiu.com/stock/cata/stocklist.json?page=" + i + "&size=30&order=desc&orderby=percent&type=11%2c12");
                }
            }
        }
    };

    configs.beforeCrawl = function(site) {
        site.requestUrl("http://xueqiu.com");
    };

    var crawler = new Crawler(configs);
    crawler.start();

With that, our Xueqiu stock crawler is done. We could also set data types on the template fields, but that is a more advanced feature we'll cover later in the course.

Finally, anyone interested in crawlers is welcome to join our QQ group for discussion: 566855261.

