A detailed description of some functions, using a collection example
Today we use the 163 (Netease) entertainment channel as an example. The rules shown here are common and practical. Let's start.
If you are a veteran user of the train collector, treat this as a reference only, because what I explain here departs from the traditional way of thinking. If you are a newcomer, read it closely: it will speed up your start and save you a lot of time later. The following are the basic steps of data collection, which you can apply flexibly:
I. Create a site
1. Open the train collector and create a new site:
To make management easier, you can give your site any name you find easy to remember. However, it is recommended that you use the name of the target source as the site name, which simplifies later management, as shown in the figure.
Most websites use only one set of templates, or several similar ones, so the "template marks" on their pages are very similar. What is a template mark? It is the start or end mark of a section of content. For example, many well-organized websites (usually ones with a relatively large amount of content, such as sina and 163) use the same marker code on every page to indicate the beginning of the content. There are two reasons for this. One is the large amount of content: the marks make cooperation between departments and project handover easier. The other is the need for content control. With the popularity of XHTML, more and more layer-based layouts also make collection marks easier to find (you will understand this later). I explain this because we are about to cover the content rules for the entire site.
2. Title tag. Sample page: http://ent.163.com/06/1029/11/2UJNHOS3000322EL.html
First, switch from "basic site information" to "whole site content rules", copy the URL of the content page into "typical page", and click "test" to read the source code. Starting with the title tag, we find that the title collected by the default tag carries an extra "_ Netease entertainment". Double-click the title tag (or select it and click Modify), add "_ Netease entertainment" to the excluded content box, and the title tag is complete.
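As a rough illustration of what the excluded content box does to the title, here is a minimal Python sketch. The function name `clean_title` and the sample headline are hypothetical, and this is not the train collector's own code; it only mirrors the behaviour described above, where the listed text is removed from the collected title.

```python
def clean_title(raw_title: str, excluded: list[str]) -> str:
    """Remove every excluded fragment from the title, then trim whitespace."""
    for fragment in excluded:
        raw_title = raw_title.replace(fragment, "")
    return raw_title.strip()


# Hypothetical collected title; the suffix is the one mentioned in the text.
collected = "Sample headline_ Netease entertainment"
print(clean_title(collected, ["_ Netease entertainment"]))
```

Keeping the exclusions in a list means more fragments can be added later without changing the function.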
3. Content tag. The most important part of creating a tag for a collection rule (task) is finding the marks where the content begins and ends. Most collectors currently require the start and end marks to be unique within the entire source code, that is, each mark may appear only once in the whole HTML source. The train collector does not require this. The HTML may contain n identical start marks (the same applies to end marks), as long as the mark that bounds the content we want to collect is the first occurrence from top to bottom. Open any content page; here we take http://ent.163.com/06/1029/11/2UJNHOS3000322EL.html as an example. We find that its content starts at "go to Forum", so double-click the code test box and find the corresponding code.
We could use this as the start mark of the content, but it is not ideal. Open several content pages, right-click each page and choose "view source code", then compare the code and extract the part they have in common.
Use that common part as the start mark of the content.
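Comparing the source of several pages by eye works, but the same idea can be sketched in Python. This is only an illustration: the helper names and the sample snippets are hypothetical, and it uses the standard `difflib` module rather than anything built into the train collector. Note that reducing pairwise, as done here, yields a substring shared by all snippets but not necessarily the longest one.

```python
from difflib import SequenceMatcher
from functools import reduce


def common_part(a: str, b: str) -> str:
    """Return the longest contiguous block of text shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def shared_marker(snippets: list[str]) -> str:
    """Fold common_part over the code found near the content start on each page."""
    return reduce(common_part, snippets)


# Hypothetical snippets copied from around the content start of three pages.
candidates = ["xxMARKERyy", "abMARKERcd", "MARKER12"]
print(shared_marker(candidates))
```

If the result is very short, the pages probably do not share a usable start mark at that position, and a different spot in the source should be tried.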
Next, let's look at the content ending mark, as shown in the following two figures:
The following content is collected based on my settings.
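The "first occurrence from top to bottom" rule for start and end marks described in this step can be sketched as follows. This is a minimal Python sketch, not the collector's actual code: the function name `extract_between` and the comment-style markers are hypothetical, and it assumes the marks themselves are not part of the collected content.

```python
from typing import Optional


def extract_between(html: str, start_mark: str, end_mark: str) -> Optional[str]:
    """Return the text between the first start mark and the first end mark after it."""
    start = html.find(start_mark)          # first occurrence from the top
    if start == -1:
        return None
    start += len(start_mark)               # assumption: the mark itself is dropped
    end = html.find(end_mark, start)       # first end mark after the start mark
    if end == -1:
        return None
    return html[start:end]


# Hypothetical page where both marks appear twice; only the first pair is used.
page = "<p>nav</p><!--start-->body<!--end--><!--start-->other<!--end-->"
print(extract_between(page, "<!--start-->", "<!--end-->"))
```

Because the search always takes the first match, the marks do not need to be unique in the page; they only need to appear first around the content you want.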
In general, the content we collect between the start mark and the end mark also contains things that must be excluded, such as advertisements or links. What we need to exclude here is "Topics> the sixth Golden Eagle TV Art Festival". To exclude it, find the corresponding code and copy the complete code into the content exclusion window, replacing the part that changes from page to page.
Because this is a whole-site rule, you must check several more categories. For example, 163 entertainment currently includes "Stars | images | movies | TVs | music | forums | topics | celebrity visits"; here I use only "stars, images, and movies" for the explanation. Checking the other categories only serves to make the rule general and complete. If you need only one category, such as "Images", you can build the rule from that category directly.
The page http://ent.163.com/06/1018/15/2TNNT7EU00031H2L.html happens to be paginated, so let's also set up the previous-page and next-page settings. Its "Previous Page" and "next page" links use images, so you only need to copy the image name (right-click the corresponding image, view its properties, and copy the image name) into the corresponding code box. For details, see the image:
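To make the image-based paging idea concrete, here is a minimal Python sketch that finds the link wrapped around a given image name. The function name, the file names `prev.gif` and `next.gif`, and the sample HTML are all hypothetical; the real page's image names have to be read from the image properties as described above. A regular expression is used here only for brevity and assumes simple, double-quoted attributes.

```python
import re
from typing import Optional


def paging_link(html: str, image_name: str) -> Optional[str]:
    """Return the href of the first <a> whose <img> src ends with image_name."""
    pattern = re.compile(
        r'<a\b[^>]*\bhref="([^"]+)"[^>]*>\s*<img\b[^>]*\bsrc="[^"]*'
        + re.escape(image_name) + r'"',
        re.IGNORECASE,
    )
    match = pattern.search(html)
    return match.group(1) if match else None


# Hypothetical paging bar that uses images as links.
bar = ('<a href="page_1.html"><img src="/img/prev.gif"></a>'
       '<a href="page_3.html"><img src="/img/next.gif"></a>')
print(paging_link(bar, "next.gif"))
```

Following the returned link repeatedly until it is `None` would walk through all pages of a paginated article.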
If you want to exclude any other content, just copy the complete corresponding code into the code exclusion window and replace the variable part. Because this site has no advertisements, the whole-site rule is now complete; click Save to go to the single-task creation page. That is all for the whole-site rule. I have covered only these two tags; add other tags as needed by following the steps above, and keep these steps in mind. For other questions, please join the discussion on the train collector forum: http://bbs.locoy.com
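The exclusion step used in the whole-site rule above (copy the complete code, then replace the variable part) can be sketched in Python as follows. This is an illustration only: the placeholder `(*)` for the variable part is an assumption made for this sketch, and the function names and sample HTML are hypothetical.

```python
import re

WILDCARD = "(*)"  # assumed placeholder for the part that changes between pages


def exclusion_regex(pattern: str) -> "re.Pattern[str]":
    """Turn an exclusion pattern into a regex; each wildcard matches any text lazily."""
    fixed_parts = [re.escape(part) for part in pattern.split(WILDCARD)]
    return re.compile(".*?".join(fixed_parts), re.DOTALL)


def apply_exclusions(content: str, patterns: list[str]) -> str:
    """Remove every piece of content that matches one of the exclusion patterns."""
    for pattern in patterns:
        content = exclusion_regex(pattern).sub("", content)
    return content


# Hypothetical collected content with a topic link that should be dropped.
collected = 'Intro <a href="/special/123.html">Topics</a> body'
print(apply_exclusions(collected, ['<a href="/special/(*)">Topics</a>']))
```

The fixed parts are escaped so that characters such as `.` or `?` in the copied code are matched literally; only the placeholder behaves as a wildcard.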
II. Creating a single-task rule:
1. Many people may not yet understand how the train collector works at this point. This is definitely a unique feature of the train collector (at least so far; I don't know whether anyone will offer the same function in the future!).
The train collector can go straight to content collection without first making a URL rule, so you can decide whether the chosen target source is worth collecting based on how difficult the site is. You don't have to wait until the URLs are collected to discover that the content cannot be collected, or that it is not worth your time (in which case the earlier effort is wasted!).
One of the most important features of train collector v3.0 is the inheritance of site rules. As long as the rules you created earlier are general, you do not need to create content collection rules for any subsequent task. Because the content collection rules we created above are general, there is nothing more to explain here: simply inherit them from the site.
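The idea of a task inheriting the site's content rule can be pictured with a small data model. This Python sketch is purely illustrative: the class and field names are hypothetical and say nothing about how the train collector stores its rules internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContentRule:
    start_mark: str
    end_mark: str


@dataclass
class Site:
    name: str
    content_rule: ContentRule


@dataclass
class Task:
    name: str
    site: Site
    content_rule: Optional[ContentRule] = None  # None means "inherit from the site"

    def effective_rule(self) -> ContentRule:
        """Use the task's own rule if it has one, otherwise the site's rule."""
        if self.content_rule is not None:
            return self.content_rule
        return self.site.content_rule


# Hypothetical marks; a task without its own rule falls back to the site's.
site = Site("163 entertainment", ContentRule("<!--start-->", "<!--end-->"))
print(Task("stars", site).effective_rule())
```

With this shape, a general site rule is written once, and only tasks that genuinely differ need a rule of their own.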
2. Creating the URL collection rule
Step: "New" -- "new task". Other operations include:
To write rules, you must be good at spotting patterns; once you can do that, this kind of collection is no problem. The sample addresses we want to collect are on this page: http://ent.163.com/special/00031HI0/entnews.html
For this section we collect only pages 1-3 as an example. We find that on each list page the URLs we want begin after "past entertainment hotspots" and end before "1st 2 ...... Page", so copy the corresponding code from the HTML source into the "collect within a specific area" setting. In addition, require that the URL contain "/06/", and the URLs can then be collected (it is simple, try it yourself), as shown:
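The two settings in this step, a collection area bounded by two marks and a required string in the URL, can be sketched in Python like this. The function name and the sample HTML are hypothetical, and the sketch assumes simple double-quoted `href` attributes; it does not fetch pages 1-3, it only filters the source of one list page.

```python
import re
from urllib.parse import urljoin


def collect_links(list_html: str, base_url: str, area_start: str,
                  area_end: str, must_contain: str) -> list[str]:
    """Collect links inside the area between two marks that contain a given string."""
    start = list_html.find(area_start)
    if start == -1:
        return []
    start += len(area_start)
    end = list_html.find(area_end, start)
    if end == -1:
        return []
    links: list[str] = []
    for href in re.findall(r'href="([^"]+)"', list_html[start:end]):
        url = urljoin(base_url, href)            # resolve relative links
        if must_contain in url and url not in links:
            links.append(url)                    # keep order, skip duplicates
    return links


# Hypothetical list page; only the link inside the area with "/06/" is kept.
sample = ('<a href="/06/0101/00/OUTSIDE.html">x</a>'
          'past entertainment hotspots'
          '<a href="/06/1029/11/2UJNHOS3000322EL.html">a</a>'
          '<a href="/special/00031HI0/other.html">b</a>'
          '1st 2')
print(collect_links(sample, "http://ent.163.com/special/00031HI0/entnews.html",
                    "past entertainment hotspots", "1st 2", "/06/"))
```

Restricting the search area first keeps navigation and footer links out, so the "/06/" filter only has to separate article links from other links inside the list itself.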
3. Publishing method. There are five publishing methods. Here we take the most common one, "online publishing", as an example.
Select web (online publishing to a website), click "define global publishing method", and then follow the steps the system prompts: select the publishing module -- enter the website/CMS root address -- log in with the built-in train browser -- close the built-in browser -- refresh the list -- test the module until the test succeeds -- save the configuration -- save the task. The highlighted parts are the steps you need to perform, from left to right and from top to bottom:
The following are two screenshots of content that I just collected and published to a local forum for testing: