How to write CMS collection rules with the LocoySpider collector
LocoySpider collector tutorial video: writing CMS collection rules in practice
**************************************** ****************************
LocoySpider collector tutorial in practice: writing CMS collection rules
Finished by Victor qq: 99767290
**************************************** ****************************
First, let's take a look at the basic functions of LocoySpider V3.
The basic functions of LocoySpider that we use today are as follows:
1. Create a site
2. Create a task
3. The "Save to software database" data publishing mode
Of course, this tutorial is centered on the topic "writing CMS collection rules", so it cannot cover every function of the LocoySpider collector. Please forgive me!
Now we will explain each of them through actual practice.
**************************************** *****************
1. Create a site
1. Function: groups collection tasks that share the same content rules under one site
2. Benefits:
A. Clear classification, which makes tasks easy to find and call;
B. Collection tasks created under the site inherit the site's content rules by default, which avoids the trouble of writing the same collection rules repeatedly;
3. Practice:
We use Daily Economic News as the example. First we open the site http://www.nbd.com.cn and browse articles in different columns. We find that the site's article pattern (template) is almost identical across columns.
(Of course, there are small differences: some articles mark paragraphs with <P> </P>, and some divide them with <DIV> </DIV>. If your website layout uses <Table> </table>, this is no big deal, but if your website uses a <DIV> </DIV> layout, the leftover DIV tags may damage your original layout. We will discuss the solution to this situation later, so I will not go into details here.)
Now we have reason to believe that one set of site-level "content rules" can cover all columns of this website.
Click "Create" and select "Daily Economic News".
We will first write the "title" rule.
Writing the title tag rule
Note: when choosing the start string of a tag, pay attention to two points: 1. uniqueness; 2. the closeness principle, that is, stay as close as possible to the target collection area;
Start string: <span class="txt181">
End string: </span> <span class="hui">
Note: to confirm that a string is unique, copy it and press Ctrl + F to search for it. If the string is unique, searching again after the first match shows the prompt "XXX cannot be found".
To make sure the tag rule works across the whole site, we can test it on several different articles. We will not demonstrate that here.
HTML tag exclusion: choose "Select all".
Note: we can keep "spaces (placeholders)", because some sites separate the parts of a long title not with punctuation or plain white space but with placeholders. In that case we need to keep the "space (placeholder)" option. (Try it after class.)
At this point we can run a collection test directly on a "typical page" to check the result. If we are satisfied, we then write the rule for the article content.
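The start/end string idea and the uniqueness check can be sketched in a few lines of Python. This is only an illustration of the principle, not how LocoySpider works internally; the function names and the sample page are made up for this example.

```python
def count_occurrences(page_html: str, marker: str) -> int:
    """Count how often a marker appears; 1 means the marker is unique."""
    return page_html.count(marker)


def extract_between(page_html: str, start: str, end: str):
    """Return the text between the first `start` and the next `end`, or None."""
    begin = page_html.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = page_html.find(end, begin)
    if finish == -1:
        return None
    return page_html[begin:finish]


# Hypothetical page source, shaped like the markers used in this tutorial.
sample_page = (
    '<html><body>'
    '<span class="txt181">Sample headline</span> <span class="hui">2008-01-01</span>'
    '<span id="zoom" class="content">Body text</span>'
    '</body></html>'
)

title_start = '<span class="txt181">'
title_end = '</span> <span class="hui">'

unique = count_occurrences(sample_page, title_start) == 1
title = extract_between(sample_page, title_start, title_end)
print(unique, title)
```

If the count is greater than 1, the start string is not unique and a longer or closer string should be chosen.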
Writing the content tag rule
Start string: <span id="zoom" class="content">
End string: <br> <iframe
HTML tag exclusion: here we need to keep the strings commonly used to divide paragraphs, "<br/>", "P", and "<DIV", and also keep the tag commonly used for images.
Note: we have chosen to exclude "<table", but some articles contain data tables. In that case we can only look after the overall result and check for missing information afterwards. Unless you can confirm that no extra layout table exists in your target collection area, excluding the table is still the better choice.
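The effect of HTML tag exclusion can be approximated with a small regular-expression sketch. It is an assumption about the behaviour, not LocoySpider's own code: tags on the keep list stay, every other tag is removed while its inner text remains. The keep list uses `img` for the image tag, which is my assumption, and the function name is hypothetical.

```python
import re

# Tags to keep; "img" is assumed here as the image tag.
KEEP_TAGS = ("br", "p", "div", "img")

# Matches an opening, closing, or self-closing tag and captures its name.
TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")


def exclude_tags(html: str, keep=KEEP_TAGS) -> str:
    """Remove every tag whose name is not in `keep`; text content stays."""
    def replace(match):
        return match.group(0) if match.group(1).lower() in keep else ""
    return TAG_RE.sub(replace, html)


raw = "<p>Hi <b>there</b><br/><table><tr><td>x</td></tr></table></p>"
cleaned = exclude_tags(raw)
print(cleaned)
```

Note that the table markup disappears but the cell text "x" stays in the paragraph, which is the kind of flattened data table the note above warns about.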
Writing the author tag rule
The key points are the same as for the title tag rule.
Start string: <div align="center" style=font-size:9pt>
End string: [200
HTML tag exclusion: choose "Select all". (Test)
Writing the time tag rule
The key points are the same as above.
Start string: <span id="zoom" class="content">
End string: <br> <iframe
HTML tag exclusion: choose "Select all". (Test)
Writing the source tag rule
By default this is generally set, as "fixed format data", to the target website we collect from. If you want your website to show better copyright awareness, you can adjust it for articles that the target website has itself reprinted. We will not go into that here.
Now we have set the "content rules" for the entire site. Next we describe how to set up the collection task.
**************************************** **********************************
2. Create a collection task
Right-click the collection site you just created and select "Create task from this site". In the dialog box that appears, check "content rules". The result is as mentioned above: "collection tasks created under the site inherit the site's content rules by default". Now we can go straight to writing the "collection URL" rules.
"Collection URL depth" setting
For flexibility and convenience, we usually start collecting from the article list page, so we can use the default value "1". Deeper collection will be covered in later tutorials, so we will not go into detail here.
Writing the start URL collection rules
Click "Add wizard". The dialog box that appears has three options: "Single-page URL", "Batch/multi-page", and "Text import". We generally do not use the "Text import" method, so here we only explain the first two.
We will first look at the "Single-page URL" setting. Here we select the "Real Estate" column for learning.
The list page URL is
http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74
Copy it into the text field, click "Add", and then "Complete adding".
Go back to "New task" - "Collection URL" and set "Select a region to collect URLs":
From: align='left'>homepage   To: class=right_font>total
Test: the result is 40 article pages, and all collection tests pass. Satisfied. (We will not actually collect here.) Let's continue learning.
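What "select a region to collect URLs" does can be pictured with the sketch below: cut the list page down to the part between the "From" and "To" strings, then gather the links inside it. This only illustrates the idea. The sample page, the ShowNews.asp article links, and the function name are all invented for the example.

```python
import re
from urllib.parse import urljoin

# Captures the value of an href attribute, quoted or unquoted.
HREF_RE = re.compile(r'href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)


def links_in_region(page_html, region_start, region_end, base_url):
    """Return absolute, de-duplicated links found between the two markers."""
    begin = page_html.find(region_start)
    if begin == -1:
        return []
    begin += len(region_start)
    finish = page_html.find(region_end, begin)
    if finish == -1:
        return []
    found = []
    for href in HREF_RE.findall(page_html[begin:finish]):
        url = urljoin(base_url, href)
        if url not in found:
            found.append(url)
    return found


# Hypothetical list page: navigation and footer links sit outside the region.
sample_list_page = (
    '<a href="/index.asp">nav</a>'
    "<td align='left'>homepage"
    '<a href="ShowNews.asp?id=1">Article A</a>'
    '<a href="ShowNews.asp?id=2">Article B</a>'
    '<td class=right_font>total 2'
    '<a href="/about.asp">footer</a>'
)

article_urls = links_in_region(
    sample_list_page,
    "align='left'>homepage",
    "class=right_font>total",
    "http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74",
)
print(article_urls)
```

Limiting the search to the region is what keeps navigation and footer links out of the collected article URLs.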
Now let's learn "Batch/multi-page".
Click "Add wizard" and, in the pop-up dialog box, select "Batch/multi-page".
To determine the variable, perform the following steps:
1. On the web page, click "Next page". The address bar now shows the URL http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74&page=2
2. Hover the mouse over "Next page". The status bar at the bottom left of the browser shows the address http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74&page=3;
3. Hover the mouse over "Last page". The status bar at the bottom left of the browser shows the address http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74&page=58;
4. Hover the mouse over "Home page". The status bar at the bottom left of the browser shows the address http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74&page=1;
http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74&page=2
http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74&page=3
http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74&page=58
http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74&page=1
From this we can determine that "&page=(*)" is the variable part of the list URL, so we can set it as follows:
Multi-page similar URL form: http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74&page=(*);
The number ranges from 1 to 58, with an increment of 1;
Click "Add" and complete adding.
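The batch setting above amounts to substituting each page number for the (*) placeholder. A minimal sketch of that expansion, with a hypothetical function name and no claim about how the software implements it:

```python
def expand_pages(pattern: str, first: int, last: int, step: int = 1):
    """Replace the (*) placeholder with every number from first to last."""
    if "(*)" not in pattern:
        raise ValueError("pattern must contain the (*) placeholder")
    return [pattern.replace("(*)", str(n)) for n in range(first, last + 1, step)]


list_urls = expand_pages(
    "http://www.nbd.com.cn/ClassNews.asp?D_SClassID=74&page=(*)", 1, 58
)
print(len(list_urls))
print(list_urls[0])
print(list_urls[-1])
```

With the range 1 to 58 and an increment of 1, this yields one list-page URL per page, from page=1 through page=58.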
Here the "Select a region to collect URLs" setting is the same as for "Single-page URL", so it is not described again.
Click "Start test URL" (this process is very long, so I paused the video recording).
Of course, in actual operation, if the data volume is large you can skip the test and collect directly. Even if some data is lost because the rules do not apply to every page, I think that can be ignored.
Here, I select only 2 pages for collection.
The test results show a total of 80 article pages.
Next step: Set "data publishing method"
**************************************** **********************************
Method 1: "Save to software database".
At the same time, select method 3, "Publish to the website online via Web", choose the custom publishing method, set "Custom category ID" to 3, name the task "Real Estate", and click "Save, update" to save the collection task. Since we have only just started the tutorial series, we will not study this in depth.
Return to the main interface of LocoySpider, right-click the "Real Estate" task, and select "Start" to complete the collection.
The collected data is automatically published through method 3 to the specified column (ID = 3), and at the same time saved to:
LocoySpider installation directory/DATA/No.-Task Name/SpiderResult.mdb
Oh, yesterday a netizen pointed out an error of mine.
I had to write the text, record the video, and collect the information onto my website within three hours. I nearly fainted several times, and the tutorial was written in a hurry, entirely by feel. Sorry, please forgive me! The correction is as follows:
Method 1 and method 3 are parallel options: they can be selected at the same time, or you can choose either one of them. If you are not publishing to a website module, you can collect directly into the local software database. The "local software database" is a Microsoft Access database, which you can open to view and check the data.
Method 3, "Publish to the website online via Web", will be explained in a later tutorial. I hope you will be patient.
All right, this tutorial ends here! See you in the next lesson!