Collection Tutorial and Collection Paging Setup problem


Last Update:2017-01-18 Source: Internet

Author: User


Collection can be understood like this: it all comes down to defining a head (start marker) and a tail (end marker). Apart from the link markers, which do not need to be checked for uniqueness, every head and tail you define must occur only once in the HTML page. Why? Because every step of the collection (except link extraction) cuts content out of the page according to the head and tail you define. So the markers must not only be unique, they should also be chosen to exclude as much unwanted content as possible. Once you understand how to define heads and tails, you can collect from most simple pages. Here is an example to illustrate:

Example settings:

News list URL: http://ent.qq.com/newxw/thd_sjym.htm
List start code: <TD style= "PADDING-LEFT:6PX;" ><table border= "0" cellpadding= "0" cellspacing= "0" class= "Table_logo" >
List end code: <TD height= "5" colspan= "2" ></td>
Link start code: <a target= "_blank" href= "
Link end code: ">
Title start tag: <title>
Title end tag: </title>
Body start tag: <div id= "articlecnt" >
Body end tag: <div id= "Articletopic" ></div>
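
Each start/end pair above can be read as a plain substring cut. The following is a minimal Python sketch of that idea, not the collection program's actual implementation; the function name `extract_between` and the sample page fragment are hypothetical and chosen only for illustration.

```python
from typing import Optional


def extract_between(page: str, head: str, tail: str) -> Optional[str]:
    """Return the text between the first `head` and the next `tail`, or None."""
    start = page.find(head)
    if start == -1:
        return None  # head marker not found
    start += len(head)
    end = page.find(tail, start)
    if end == -1:
        return None  # tail marker not found after the head
    return page[start:end]


# Hypothetical page fragment, not taken from the real site.
sample = '<title>Some headline</title><div id="body">Text<div id="end"></div></div>'

title = extract_between(sample, "<title>", "</title>")
body = extract_between(sample, '<div id="body">', '<div id="end">')
```

Because the cut starts at the first occurrence of the head, a head that appears more than once in the page would make the result depend on which copy comes first, which is why the markers need to be unique.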

The page collected above is fairly standard. Now let's analyze each setting:

List URL: this is the page you want to collect from. This step is very important. I used to handle it carelessly, and only later found that it determines whether you can collect all the content. When you open the page to be collected, first check whether the list has more than one page. If it does, open the second page and see whether the URLs of the first and second pages follow a regular pattern, such as xxxx_1.htm, xxxx_2.htm; pay special attention to the number. If the first page already follows the _id pattern, use the first page as the list URL. If the first page does not follow the pattern but the pages from the second one onward do, use the second page as the list URL and set the first page aside; after all the other data has been collected, collect the first page separately. Either way, the first page comes last. Why? Because new updates generally appear on the first page.

Start and end of list: this tells the program the general area that holds the content you want to collect. The list URL in the example contains a lot of content, and I only want the news on the right. Search the HTML source for the first news item, then look upward from it for a piece of code that occurs only once in the whole file. That is how you define the head. Note that spaces count: for example, if <a href is preceded by four spaces, that is also a distinguishing feature; all that matters is that the line is unique in the full text. Use the same method for the tail: search for the last news item, then look downward from it for a piece of code that is unique in the full text. Locating the items by search saves you the time of scanning the whole source by hand.
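
The uniqueness requirement can be checked mechanically before a marker is entered into the tool. This is a small sketch under the assumption that the page source is already available as a string; `is_unique_marker` and the sample HTML are hypothetical names and data.

```python
def is_unique_marker(page: str, marker: str) -> bool:
    """True if `marker` occurs exactly once in `page` (whitespace included)."""
    return page.count(marker) == 1


# Hypothetical list page source; the indentation is part of the text.
page = (
    '<td class="nav"><a href="/home">Home</a></td>\n'
    '<td class="news">\n'
    '    <a href="/n/1.htm">First</a>\n'
    '    <a href="/n/2.htm">Second</a>\n'
    '</td>\n'
)

good_head = is_unique_marker(page, '<td class="news">')  # occurs once
bad_head = is_unique_marker(page, '<a href="')           # occurs three times
```

Leading spaces are compared literally, so a marker copied with its indentation can be unique even when the same tag without the indentation is not.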

Link start and end: compare the collection page with its HTML source. Links generally start with <a href= and end with >. The collection program captures whatever lies between the two markers as the article link.
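
Unlike the other markers, the link markers repeat once per article, so the cut is applied over and over inside the list block. A possible Python sketch follows; `extract_links`, the sample block, and the example.com URLs are assumptions for illustration, and resolving relative links with `urljoin` is a choice made here, not something the original tool is known to do.

```python
from typing import List
from urllib.parse import urljoin


def extract_links(list_block: str, link_start: str, link_end: str,
                  base_url: str) -> List[str]:
    """Collect every URL found between `link_start` and `link_end`."""
    links = []
    pos = 0
    while True:
        start = list_block.find(link_start, pos)
        if start == -1:
            break  # no more links
        start += len(link_start)
        end = list_block.find(link_end, start)
        if end == -1:
            break  # unterminated link, stop scanning
        # Resolve relative links against the list page URL.
        links.append(urljoin(base_url, list_block[start:end]))
        pos = end + len(link_end)
    return links


# Hypothetical list block and base URL.
block = (
    '<a target="_blank" href="/a/1.htm">One</a>'
    '<a target="_blank" href="http://example.com/b/2.htm">Two</a>'
)
links = extract_links(block, '<a target="_blank" href="', '">',
                      "http://example.com/list/index.htm")
```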
At this point you are close to done. To be safe, open five article pages at random, find what they have in common, and fill in the start and end of the title and the start and end of the body accordingly.
Finally, run a test. It should work.
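
Checking several article pages for common markers can also be scripted. The sketch below assumes the sample pages have already been downloaded into strings; `missing_markers` and the sample data are hypothetical.

```python
from typing import Dict, List


def missing_markers(pages: Dict[str, str],
                    markers: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each page name to the marker labels that do not occur in it."""
    report = {}
    for name, html in pages.items():
        missing = [label for label, marker in markers.items()
                   if marker not in html]
        if missing:
            report[name] = missing
    return report


# Hypothetical sample pages and candidate markers.
pages = {
    "page1": '<title>A</title><div id="cnt">x',
    "page2": '<title>B</title><div class="cnt">y',
}
markers = {"title start": "<title>", "body start": '<div id="cnt">'}
report = missing_markers(pages, markers)
```

An empty report means every candidate marker was found on every sample page; any entry points to a marker that needs to be replaced with something the pages really share.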

Now for pagination. There are two kinds: pagination of the collection (list) page, and pagination of the article content page.

Pagination of the collection page:

Example:

Example settings:

News list URL: http://www.pconline.com.cn/mobile/news/hgxz/index_1.html
List start code: 1px solid; "> Articles List </TD>
List end code: <div align= "CENTER" >
List index paging: Batch generation: http://www.pconline.com.cn/mobile/news/hgxz/index_{$ID}.html
Build scope: 4to1
Link start code: <a href= "
Link end code: target= "_blank"
Title start tag: <title>
Title end tag: -Pacific Internet pconline-[Mobile phone new Express]</title>
Body start tag: advertising:ad_top</iframe>
Body end tag: <br clear=all>

Note the difference between the list URL and the list index paging URL: index_1.html becomes index_{$ID}.html.
Build scope: the range of page numbers, written as ? to ?. It can run in either direction, from the last page to the first or from the first to the last, whichever you prefer.
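
The batch generation setting amounts to substituting each page number for {$ID}. Here is a minimal sketch, assuming the scope is written as two integers separated by "to" as in the example; `build_list_urls` is a hypothetical helper, not part of the collection program.

```python
from typing import List


def build_list_urls(template: str, scope: str) -> List[str]:
    """Expand `{$ID}` for every page number in a scope such as "4to1"."""
    first_text, last_text = scope.lower().split("to")
    first, last = int(first_text), int(last_text)
    step = 1 if last >= first else -1  # count up or down as requested
    return [template.replace("{$ID}", str(n))
            for n in range(first, last + step, step)]


urls = build_list_urls(
    "http://www.pconline.com.cn/mobile/news/hgxz/index_{$ID}.html", "4to1")
```

With the scope 4to1 the list starts at index_4.html and ends at index_1.html, so the first page, which usually holds the newest items, is processed last.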

Pagination of content pages:

Example:

Example settings:

News list URL: http://www.enet.com.cn/emobile/inforcenter/articlelist.jsp?page=1&atype=A&acid=4146
List start code: <TD class= "Filter4" width= "><font color=" "#FFFFFF" > Mobile Information
List end code: <TD height= "2" ></td>
Batch generation: http://www.enet.com.cn/emobile/inforcenter/articlelist.jsp?page={$ID}&atype=a&acid=4146
Build scope: 10to1 (the advantage of this order is that the latest news ends up in front; as with the collection page, the latest news is on the page collected last)
Link start code: <td><a href= "
Link end code: target= "_blank"
Title start tag: <strong class= "P24" >
Title end tag: <td align= "center" > (copy the whitespace in front of it as well, otherwise there will be errors)
Body start tag: <table width= "100%" border= "0" cellspacing= "0" cellpadding= "2" align= "Center" >
Body end tag: <p id= "Adv_under_cont" ></p>
Next page start tag: <a href= "./ (find the next-page link and copy everything from <a href= through "./, spaces included)
Next page end tag: "> Next page </a>

Note the start and end of the next page: find the pagination code in the page source, locate the code for the next-page link, and define the next-page start and end around it. Try several variants here, because there are only a few pieces of code to choose from.
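
Following the next-page markers can be pictured as a loop: cut the body, cut the next-page link, move to that page, and repeat until no link is found. The sketch below is an assumption about how such a loop might look, not the tool's real code; `between`, `collect_article`, the `fetch` callback, the comment-style markers, and the example.com pages are all hypothetical.

```python
from typing import Callable, Optional
from urllib.parse import urljoin


def between(text: str, head: str, tail: str) -> Optional[str]:
    """Text between the first `head` and the next `tail`, or None."""
    start = text.find(head)
    if start == -1:
        return None
    start += len(head)
    end = text.find(tail, start)
    return None if end == -1 else text[start:end]


def collect_article(url: str, fetch: Callable[[str], str],
                    body_head: str, body_tail: str,
                    next_head: str, next_tail: str,
                    max_pages: int = 20) -> str:
    """Join the body of every page reachable through the next-page link."""
    parts = []
    seen = set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guards against next-page loops
        html = fetch(url)
        body = between(html, body_head, body_tail)
        if body is not None:
            parts.append(body)
        nxt = between(html, next_head, next_tail)
        # The link is relative to the current page, so resolve it.
        url = urljoin(url, nxt) if nxt else None
    return "".join(parts)


# Hypothetical two-page article served from a dict instead of the network.
site = {
    "http://example.com/a/1.htm":
        '<!--body-->Part one. <!--/body--><a href="./2.htm">Next page</a>',
    "http://example.com/a/2.htm": '<!--body-->Part two.<!--/body-->',
}
article = collect_article("http://example.com/a/1.htm", site.__getitem__,
                          "<!--body-->", "<!--/body-->",
                          '<a href="./', '">Next page</a>')
```

If another link beginning with the same start marker appears before the next-page link, this simple cut would capture too much, which is one reason to test several marker variants.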

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
