A step-by-step guide to writing website collection rules

Source: Internet
Author: User

Step 1: Determine the website to collect from (this demonstration uses the DedeCMS official website as the collection site)

  http://www.dedecms.com/plus/list.php?tid=10



Step 2: Determine the encoding of the site to be collected. Open the page to be collected and view its source code (in IE: View -> Source).

Find the charset value in the source code; it normally appears in a meta tag near the top of the page.
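
As a rough illustration of what to look for, the following Python sketch pulls the charset out of a page's source. It is not part of the collection tool; the function name `find_charset` and the sample markup are made up for this example, and it assumes the charset is declared in a meta tag.

```python
import re
from typing import Optional


def find_charset(html: str) -> Optional[str]:
    """Return the charset named in a <meta> tag, lowercased, or None."""
    # Handles both <meta charset="utf-8"> and
    # <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
    match = re.search(
        r'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', html, re.IGNORECASE
    )
    return match.group(1).lower() if match else None


# Hypothetical page head, used only for illustration.
sample = (
    '<head><meta http-equiv="Content-Type" '
    'content="text/html; charset=gb2312" /></head>'
)
print(find_charset(sample))  # prints gb2312
```

Whatever value this finds (for example gb2312 or utf-8) is the encoding to choose for the collection node.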

Step 3: Write the rules for getting the collection list

The source URL clearly contains pageno, which is the page number. Therefore, to collect a list that spans multiple pages, replace the page number with "[Var:Paging]", as shown below:
http://www.dedecms.com/plus/list.php?tid=10&pageno=[Var:Paging]
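
To make the substitution concrete, here is a minimal Python sketch of how a paging placeholder can be expanded into one URL per list page. The function `expand_list_urls`, the page range 1 to 3, and the exact spelling of the placeholder string are assumptions for illustration; the collection tool performs this step internally.

```python
from typing import List


def expand_list_urls(template: str, first_page: int, last_page: int,
                     placeholder: str = "[Var:Paging]") -> List[str]:
    """Replace the paging placeholder with each page number in the range."""
    return [
        template.replace(placeholder, str(page))
        for page in range(first_page, last_page + 1)
    ]


template = "http://www.dedecms.com/plus/list.php?tid=10&pageno=[Var:Paging]"
for url in expand_list_urls(template, 1, 3):
    print(url)
# http://www.dedecms.com/plus/list.php?tid=10&pageno=1
# http://www.dedecms.com/plus/list.php?tid=10&pageno=2
# http://www.dedecms.com/plus/list.php?tid=10&pageno=3
```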


 


 

The "article URL must contain" and "article URL must not contain" fields are filters. They are usually needed only when the collection list area contains many unwanted links.
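
The effect of these two filters can be sketched in Python as follows. This assumes plain substring matching, which may differ from how the tool actually compares URLs; the function `filter_links` and the sample links are hypothetical.

```python
from typing import List


def filter_links(links: List[str], must_contain: str = "",
                 must_not_contain: str = "") -> List[str]:
    """Keep links that include must_contain and exclude must_not_contain."""
    kept = []
    for link in links:
        if must_contain and must_contain not in link:
            continue  # required text is missing
        if must_not_contain and must_not_contain in link:
            continue  # forbidden text is present
        kept.append(link)
    return kept


# Hypothetical links found inside the list area.
links = [
    "/a/news/1.html",
    "/plus/list.php?tid=10&pageno=2",
    "/a/ad/banner.html",
]
print(filter_links(links, must_contain="/a/", must_not_contain="/ad/"))
# ['/a/news/1.html']
```

An empty filter is skipped, which mirrors leaving the field blank when the list area contains only article links.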

The article URLs above do not include http://www.dedecms.com. As for why it has to be added in front, that should not need explaining.
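
For readers who do want the reason spelled out: links in a list page are often relative, so they only become fetchable once they are resolved against the site address. A short Python sketch using the standard library's `urljoin` shows the idea; the article paths are invented for the example.

```python
from urllib.parse import urljoin

# The list page from Step 1 serves as the base for resolving links.
base = "http://www.dedecms.com/plus/list.php?tid=10"

# Hypothetical hrefs as they might appear in the list page source.
print(urljoin(base, "/a/news/1.html"))
# http://www.dedecms.com/a/news/1.html
print(urljoin(base, "view.php?aid=5"))
# http://www.dedecms.com/plus/view.php?aid=5
print(urljoin(base, "http://example.com/x.html"))
# http://example.com/x.html (already absolute, left unchanged)
```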


If there is only one list page, simply enter its URL in the source URL field.


 


 

Note that this part is the key.

The following describes how to write the rules for retrieving the article list.

Use the source code of the list page opened above. Find the code just before the article list that has nothing else like it on the page.

On the list page of the DedeCMS official site, the nearest unique pieces of code before and after the article list are class="newslist"> and class="pages">. Enter them as "Start html" and "End html" respectively.
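
The idea behind "Start html" and "End html" can be sketched in Python: cut out the text between two unique markers, then read the links from that region only. The function `cut_between` and the sample markup are assumptions for illustration and do not reproduce the tool's own implementation.

```python
import re
from typing import Optional


def cut_between(html: str, start_html: str, end_html: str) -> Optional[str]:
    """Return the text between the first start marker and the next end marker."""
    start = html.find(start_html)
    if start == -1:
        return None
    start += len(start_html)
    end = html.find(end_html, start)
    if end == -1:
        return None
    return html[start:end]


# Hypothetical list page source.
sample = (
    '<div class="top"><a href="/index.html">Home</a></div>'
    '<ul class="newslist"><li><a href="/a/1.html">One</a></li></ul>'
    '<div class="pages">1 2 3</div>'
)
region = cut_between(sample, 'class="newslist">', 'class="pages">')
print(re.findall(r'href="([^"]+)"', region))
# ['/a/1.html']
```

The navigation link outside the markers is ignored, which is why the markers must be unique on the page: a marker that occurs earlier in the source would start the region in the wrong place.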





Step 4: Write the rules for collecting the article's title, content, author, source, and other fields, and for collecting paginated articles.
For how to write "Start html" and "End html", refer to the article list rules in Step 3.



 


 

The following describes how to collect paginated content; see the circled part of the screenshot.

For the article's pagination, choose whether to select "list of all pages".

For how to write "Start html" and "End html", refer to the article list rules in Step 3.


 





There is one more screenshot for this step. Because of the forum configuration, it is displayed at the top of the post.

Select "Paging content field" for the article content. If this option is not selected, the paginated content cannot be collected.


"Download multimedia resources in the field" downloads multimedia resources (videos, software, images, and so on) during collection to your own server, that is, your website.


The following are filter rules:


Filter rules must be written as regular expressions. For beginners this is harder than climbing to the sky, and I do not fully understand it myself. :)
When the steps above are complete, save the rule.
Click "test".

If the result looks like the screenshot above, the rule works.
Then click "collect".
After the collection is complete, export the data to your topic.
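
For beginners, a small Python sketch may make regular-expression filtering less intimidating. It shows the general technique only, not the tool's own filter syntax; the function `apply_filters`, the two patterns, and the sample content are assumptions chosen for illustration.

```python
import re
from typing import List


def apply_filters(content: str, patterns: List[str]) -> str:
    """Delete every match of each pattern from the collected content."""
    for pattern in patterns:
        content = re.sub(pattern, "", content, flags=re.IGNORECASE | re.DOTALL)
    return content


rules = [
    r"<script.*?</script>",  # drop whole script blocks
    r"</?a\b[^>]*>",         # drop link tags but keep the link text
]
sample = '<p>Read <a href="/x.html">this</a>.</p><script>track();</script>'
print(apply_filters(sample, rules))
# <p>Read this.</p>
```

The `.*?` form matches as little as possible, so each script block is removed on its own instead of everything between the first opening tag and the last closing tag.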
