Collection Tutorial and Collection Paging Setup Problems

Source: Internet
Author: User
Collection can be understood as defining heads and tails, that is, start and end markers. Link extraction needs no uniqueness check, but every other head and tail you define must be unique within the same HTML page. Why? Because every collection step (except link extraction) cuts content out of the page according to the head and tail you defined. So the markers must not only be unique; they should also exclude as much unwanted content as possible. Once you understand how heads and tails are defined, you can collect from basically any simple page. Here is an example to illustrate:


The following is the collection configuration:

News list URL: http://ent.qq.com/newxw/thd_sjym.htm
List start code: <td style="PADDING-LEFT:6PX;"><table border="0" cellpadding="0" cellspacing="0" class="Table_logo">
List end code: <td height="5" colspan="2"></td>
Link start code: <a target="_blank" href="
Link end code: ">
Title start tag: <title>
Title end tag: </title>
Body start tag: <div id="articlecnt">
Body end tag: <div id="Articletopic"></div>


The page collected above is fairly standard. Now let's analyze each setting:

List URL: This is the page you want to collect from. This step is very important. I used to be careless about it, and have since found that it determines whether you can collect all of the content. When you open the list page, first check whether it has more than one page. If it does, open the second page and see whether its URL changes in a regular way relative to the first, such as xxx_1.htm, xxx_2.htm; pay special attention to the number. If the first page follows the numbered rule, use the first page as the list URL. If the first page does not fit the rule but the rule holds from the second page onward, use the second page as the list URL and set the first page aside. Collect all the other pages first, then collect the first page separately, so that the first page always comes last. Why? Because updates generally appear on the first page.
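The check described above (do two page URLs differ only in a number?) can be sketched in Python. This is only an illustration: `infer_template` is a hypothetical helper, not part of any collection program, and the example.com URLs are made up. It reuses the `{$ID}` placeholder that appears in the paging examples later in this tutorial.

```python
import re

def infer_template(url_a, url_b):
    """Return a {$ID} template if the two URLs differ in exactly one number, else None."""
    # re.split with a capturing group keeps the digit runs at the odd indexes.
    parts_a = re.split(r"(\d+)", url_a)
    parts_b = re.split(r"(\d+)", url_b)
    if len(parts_a) != len(parts_b):
        return None  # different shape, e.g. index.html vs index_2.html
    diff = [i for i, (a, b) in enumerate(zip(parts_a, parts_b)) if a != b]
    if len(diff) != 1 or diff[0] % 2 == 0:
        return None  # no difference, several differences, or a non-numeric difference
    parts_a[diff[0]] = "{$ID}"
    return "".join(parts_a)

# Page 1 and page 2 follow the same numbered rule: page 1 can be the list URL.
regular = infer_template("http://example.com/news/index_1.html",
                         "http://example.com/news/index_2.html")

# Page 1 does not fit the rule: start from page 2 and collect page 1 last.
irregular = infer_template("http://example.com/news/index.html",
                           "http://example.com/news/index_2.html")
```

Here `regular` is the template `http://example.com/news/index_{$ID}.html` and `irregular` is `None`, which is the signal to set the first page aside.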

Start and end of list: These tell the program the general region that contains the content you want. Take the list URL in the example: the page holds a lot of content, but I only want the news on the right. Search the HTML source for the first news item, then look upward from it for a piece of code that occurs only once in the whole file. One rule applies to every head and tail: spaces count. For example, four spaces in front of <a href are also a distinguishing feature; all that matters is that the string is unique in the full text. Use the same method for the end: search for the last news item to locate it directly, which saves you from hunting through the whole file, then look downward for a piece of code that is unique in the full text.
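As a rough sketch of the idea, the two operations described here (checking that a marker is unique, and cutting out the region between a start and an end marker) can be written in Python with plain string functions. The helper names and the sample HTML are made up for illustration; the real collection program may work differently.

```python
def marker_is_unique(html, marker):
    """A marker is usable only if it occurs exactly once; whitespace counts."""
    return html.count(marker) == 1

def cut_between(html, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    i = html.find(start)
    if i == -1:
        return None
    i += len(start)
    j = html.find(end, i)
    if j == -1:
        return None
    return html[i:j]

# Made-up page: a navigation cell, the news cell we want, and a spacer cell.
SAMPLE = (
    '<td class="nav"><a href="/home">Home</a></td>\n'
    '<td class="news">\n'
    '    <a href="/a1.htm">First</a>\n'
    '    <a href="/a2.htm">Last</a>\n'
    '</td>\n'
    '<td height="5"></td>\n'
)

region = cut_between(SAMPLE, '<td class="news">', '<td height="5">')
```

In this sample `<a href` occurs three times, so it cannot serve as a list marker on its own, while `<td class="news">` occurs once and can. The resulting `region` contains the two news links but not the navigation link.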

Link start and end: Look at the list page and then at its HTML source. A link generally starts with <a href=" and ends with ">. The collection program extracts whatever lies in between for you.
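Unlike the list markers, the link markers are expected to match many times inside the list region. A minimal Python sketch of that repeated extraction follows; `find_all_between` is a hypothetical helper and the region string is invented for the example.

```python
def find_all_between(text, start, end):
    """Collect every piece of text that sits between `start` and the next `end`."""
    results = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            break
        results.append(text[i:j])
        pos = j + len(end)  # continue searching after this match
    return results

# Made-up list region using the link markers from the first example.
REGION = ('<a target="_blank" href="/a1.htm">First</a>'
          '<a target="_blank" href="/a2.htm">Second</a>')

links = find_all_between(REGION, '<a target="_blank" href="', '">')
```

With this input, `links` holds the two addresses `/a1.htm` and `/a2.htm`.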
At this point you are close to success. To be safe, open five article pages at random, find what the five pages have in common, and fill in the title start and end and the body start and end accordingly.
Finally, run a test. It should work.
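Putting the steps together, the whole flow (cut the list region, extract the links, then cut the title and body from each article page) might look like the Python sketch below. Everything here is an assumption made for illustration: the `rules` dictionary keys, the `fetch` parameter (any function that returns the HTML for a URL) and the in-memory sample pages. It is not the collection program's actual implementation.

```python
from urllib.parse import urljoin

def between(text, start, end):
    """Text between the first `start` and the next `end`, or None."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return None if j == -1 else text[i:j]

def all_between(text, start, end):
    """Every piece of text between `start` and the next `end`."""
    out, pos = [], 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            return out
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            return out
        out.append(text[i:j])
        pos = j + len(end)

def collect(list_url, rules, fetch):
    """Collect title and body for every link found in the list region."""
    region = between(fetch(list_url), rules["list_start"], rules["list_end"])
    if region is None:
        return []
    articles = []
    for link in all_between(region, rules["link_start"], rules["link_end"]):
        url = urljoin(list_url, link)  # resolve relative links
        page = fetch(url)
        title = between(page, rules["title_start"], rules["title_end"])
        body = between(page, rules["body_start"], rules["body_end"])
        if title is not None and body is not None:
            articles.append({"url": url, "title": title.strip(), "body": body.strip()})
    return articles

# Made-up pages held in memory, so the sketch runs without network access.
PAGES = {
    "http://example.com/list.htm":
        '<div id="nav"><a target="_blank" href="/about.htm">About</a></div>'
        '<td class="news"><a target="_blank" href="/a1.htm">One</a>'
        '<a target="_blank" href="/a2.htm">Two</a></td><td height="5"></td>',
    "http://example.com/a1.htm":
        '<title>One</title><div id="articlecnt">Body one<div id="Articletopic"></div></div>',
    "http://example.com/a2.htm":
        '<title>Two</title><div id="articlecnt">Body two<div id="Articletopic"></div></div>',
}
RULES = {
    "list_start": '<td class="news">', "list_end": '<td height="5">',
    "link_start": '<a target="_blank" href="', "link_end": '">',
    "title_start": "<title>", "title_end": "</title>",
    "body_start": '<div id="articlecnt">', "body_end": '<div id="Articletopic">',
}

articles = collect("http://example.com/list.htm", RULES, PAGES.__getitem__)
```

Because the list region is cut first, the About link in the navigation block is never followed, even though it matches the link markers.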

Now let's talk about pagination. There are two kinds: pagination of the list page, and pagination of the article content page.

Pagination of the list page:

Example:

The following is the collection configuration:

News list URL: http://www.pconline.com.cn/mobile/news/hgxz/index_1.html
List start code: 1px solid; "> Articles List </TD>
List end code: <div align="CENTER">
List index paging: Batch generation: http://www.pconline.com.cn/mobile/news/hgxz/index_{$ID}.html
Build scope: 4 to 1
Link start code: <a href="
Link end code: target="_blank"
Title start tag: <title>
Title end tag: -Pacific Internet pconline-[Mobile phone new Express]</title>
Body start tag: advertising:ad_top</iframe>
Body end tag: <br clear=all>


Notice the difference between the list URL and the list index paging link: index_1.html is changed to index_{$ID}.html.
Build scope: how many pages there are, written as "? to ?". There are two options, from back to front or from front to back; choose whichever you like.
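To make the batch generation concrete, here is a small Python sketch that expands a `{$ID}` template over a build scope such as `4to1`. The `{$ID}` placeholder and the scope notation come from the example above; the function itself and the way it parses the scope are assumptions for illustration.

```python
def generate_urls(template, scope):
    """Expand a {$ID} template over a scope such as '4to1' or '1 to 4'."""
    first, last = (int(part) for part in scope.lower().split("to"))
    step = 1 if last >= first else -1  # count down when the scope runs backward
    return [template.replace("{$ID}", str(n))
            for n in range(first, last + step, step)]

urls = generate_urls(
    "http://www.pconline.com.cn/mobile/news/hgxz/index_{$ID}.html", "4to1")
```

With the scope `4to1`, `urls` starts at index_4.html and ends at index_1.html, so the first page, where updates usually appear, is collected last.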

Pagination of content pages:

Example:

The following is the collection configuration:

News list URL: http://www.enet.com.cn/emobile/inforcenter/articlelist.jsp?page=1&atype=A&acid=4146
List start code: <td class="Filter4" width= "><font color="#FFFFFF"> Mobile Information
List end code: <td height="2"></td>
Batch generation: http://www.enet.com.cn/emobile/inforcenter/articlelist.jsp?page={$ID}&atype=a&acid=4146
Build scope: 10 to 1 (the advantage of this order is that the latest news ends up in front; otherwise the result follows the list pages and the latest news ends up on the last page)
Link start code: <td><a href="
Link end code: target="_blank"
Title start tag: <strong class="P24">
Title end tag: <td align="center"> (copy the whitespace in front of it as well, otherwise there will be errors)
Body start tag: <table width="100%" border="0" cellspacing="0" cellpadding="2" align="Center">
Body end tag: <p id="Adv_under_cont"></p>
Next page start tag: <a href="./ (find the next-page link and copy <a href="./ including any spaces)
Next page end tag: ">Next page</a>

Note on the next-page start and end: find the pagination code in the page source, locate the code for the next-page link, and define the next-page start and end from it. Try a few variations here, because the code leaves only a small range of choices.
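The next-page markers can be pictured as a loop: cut the body from the current page, look for the next-page link, and repeat until no such link is found. The Python sketch below is one possible reading, not the collection program's actual logic. The helper names, the `rules` keys, the `fetch` parameter and the sample pages are all invented. One design choice is worth pointing out: the sketch finds the end marker first and then searches backward for the nearest start marker, because a start such as `<a href="./` may also match other links earlier on the page.

```python
from urllib.parse import urljoin

def between(text, start, end):
    """Text between the first `start` and the next `end`, or None."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return None if j == -1 else text[i:j]

def next_link(page, start, end):
    """Find `end` first, then the closest `start` before it."""
    j = page.find(end)
    if j == -1:
        return None
    i = page.rfind(start, 0, j)
    if i == -1:
        return None
    return page[i + len(start):j]

def collect_paged_body(first_url, rules, fetch, max_pages=20):
    """Follow next-page links and join the body parts of every page."""
    parts, seen, url = [], set(), first_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guards against pages that link back to each other
        page = fetch(url)
        body = between(page, rules["body_start"], rules["body_end"])
        if body is not None:
            parts.append(body.strip())
        target = next_link(page, rules["next_start"], rules["next_end"])
        url = urljoin(url, target) if target else None
    return "\n".join(parts)

# Made-up two-page article held in memory.
PAGED = {
    "http://example.com/news/a1.htm":
        '<div id="body">Part one</div><a href="./a1_2.htm">Next page</a>',
    "http://example.com/news/a1_2.htm":
        '<div id="body">Part two</div><a href="./a1.htm">Previous page</a>',
}
PAGED_RULES = {
    "body_start": '<div id="body">', "body_end": "</div>",
    "next_start": '<a href="./', "next_end": '">Next page</a>',
}

full_body = collect_paged_body("http://example.com/news/a1.htm",
                               PAGED_RULES, PAGED.__getitem__)
```

The second page has no "Next page" link, so the loop stops there and `full_body` contains both parts in order.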
