User operation manual for site collectors

Source: Internet
Author: User

Source: Visual mining website collector

1. Product Introduction

DM visual mining website collector is visual data mining software. It can be used for website collection, forum collection, article collection, blog collection, dedecms collection, mobile collection, new cloud collection, forum posting, forum top posts, and so on.

2. Rules

A rule defines the units (data fields) to be collected, such as a title field and a link field. It also describes how these fields are processed. Different unit rules are defined by different modules and executed by those modules.

The modules that define and process units are called designers. The system includes the following designers, which can be added or removed dynamically.

1) webpage designer

Collect data on the webpage.

2) unit Filter

Filter or replace characters in a unit.

3) unit wrapper

Merge unit data and additional characters.

4) CSV Generator

Submit the collected data to the user's website in the format of a CSV file. The user's website reads the CSV data and publishes it to the user's website.

5) content publisher

Publish the collected data to the user's website by simulating the user.

6) Forum publisher

Publish the collected data to the user's forum by simulating the user.

 

For more information, see the relevant chapter.

2.1 list

The rule list displays the rule files saved in the system. You can select a rule to delete, modify, or run.

Click rules > List.

Go to the rule list page.

E: Modify the rule. You can modify the rule name and category.

D: Delete the rule.

R: run the rule immediately.

2.2 New

When you collect new data, you need to define a rule that tells the system how to extract the data you need. A rule is defined with one or more designers. Generally, you first use the webpage designer to define a starting web page and the units to collect, and then select other designers to define further processing.

2.2.1 webpage designer

The webpage designer collects webpage data. It provides a visual interface for rule definition.

2.2.1.1 unit definition

The unit definition refers to the data to be collected.

 

Click rules > Create.

Go to the create rule page.

Enter the URL to be collected in the URL dialog box, for example http://java.csdn.net/c_channelrecomm/tag/1

Click OK.

Page layout:

1) Below the system menu is the designer toolbar.

2) The left side of the page shows the structure of the target webpage. This is called the webpage Structure View.

3) The upper right part of the page is the page display area, called the data view.

4) The lower right part of the page lists the rule designers. This is called the designer view.

 

Determine the data to be collected.

Suppose you want to collect the title and the URL of each entry on the page. Follow these steps to define the title and URL units:

1) Title

Click the title text "using buffering technology to improve JSP program performance and stability" on the page. A red area appears.

In the webpage Structure View, the red area indicates the HTML tag that holds the data. In the data view, the red area indicates the data currently selected.

 

Click the unit button on the designer toolbar to enter the unit definition view.

The blue text in the figure is a preview of the current data. Move the scroll bar downward.

Enter the unit name in the content unit name input box, for example "title".

If you need to convert or filter HTML tags, click the template setting button to go to the template setting view:

Right-click and select a data conversion script. To convert HTML tags to bbcode, click the bbcode menu item.

 

The template is defined in XSL. You can write your own HTML data conversion script in the input box.

Templates are applied in order: a template further down overrides the ones before it. Defining templates requires some knowledge of XSL.

Click OK to close the window.

The rule for the title unit is now defined. To check the template output, click the Save button in the "Data View". The result is shown in the original preview view.

 

2) URL

Select the row with the property name href and enter "url" in the unit name input box. Move the scroll bar downward.

Click Save. The title and url entries appear in the designer view.

Click the "web designer-Java technology Product Channel-csdn ...." tab to return to the data view.

If you need to collect data from multiple regions, repeat the preceding operations until all units are defined.

 

Click "preview" on the designer toolbar to check what the current rule matches. (You must preview the data after defining the units. Otherwise, the data cannot be matched accurately.)

The blue area in the figure indicates the data that the current rule matches. If the match is what you expect, the rule is defined successfully. Otherwise, go back to the "unit definition View" and adjust the rule's feature definition. For more information, see "adjust rules".

 

If you do not need to set other data, refer to the CSV generator and subsequent chapters.

2.2.1.2 data publishing

You can use the CSV generator, content publisher, and Forum publisher to publish data. For more information, see the relevant chapter.

 

2.2.1.3 wildcard expression

If you need to match strings with wildcards, the system provides the following wildcard keywords.

$s: matches any string.

$n: matches a number.

$c: matches English letters (A-z) and digits (0-9).

$(...$): group expression.

$$: the $ character.

 

You can put a $ sign before a character to pass that character through as a regular expression character. This lets you write your own regular expressions for more advanced matching.

For example, $^$($\d$+$)$[A$-Z$]$+$$

generates the regular expression ^(\d+)[A-Z]+$

 

For example, take the string http://java.csdn.net/page/4518975c-9fdc-429a-a7d7-34fa2be9d08

Its feature expression can be written as http://java.csdn.net/page/$s

You can also write it as http://java.csdn.net/page/$s-$s-$s-$s-$s
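The wildcard keywords can be read as shorthand for regular expression fragments. The following is a minimal sketch of such a translation, not the collector's actual implementation: the function name wildcard_to_regex and the regular expression chosen for each keyword are assumptions, and the $$ keyword and the "$ before a character" escape are left out for brevity.

```python
import re

# Assumed meaning of each wildcard keyword; the real collector may differ.
KEYWORDS = {
    "$s": ".*?",           # any string (non-greedy)
    "$n": r"\d+",          # a number
    "$c": "[A-Za-z0-9]+",  # English letters and digits
    "$(": "(",             # start of a group
    "$)": ")",             # end of a group
}


def wildcard_to_regex(expr: str) -> str:
    """Translate a wildcard expression into a regular expression string."""
    out = []
    i = 0
    while i < len(expr):
        token = expr[i:i + 2]
        if token in KEYWORDS:
            out.append(KEYWORDS[token])
            i += 2
        else:
            # Everything else is treated as literal text.
            out.append(re.escape(expr[i]))
            i += 1
    return "".join(out)


url = "http://java.csdn.net/page/4518975c-9fdc-429a-a7d7-34fa2be9d08"
pattern = wildcard_to_regex("http://java.csdn.net/page/$s-$s-$s-$s-$s")
print(re.fullmatch(pattern, url) is not None)
```

Escaping the literal parts keeps characters such as "." in a URL from being interpreted as regular expression operators.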

 

2.2.1.4 adjust rules

2.2.1.4.1 how to expand the Matching Scope?

In the attribute list of the "unit definition View", delete or modify the value of the expression. Click "Save". Return to the data view and preview the matching result. Repeat until the match is correct.

2.2.1.4.2 how to narrow the Matching Scope?

1) modify the matching expression

In the attribute list of the "unit definition View", add or modify the value of the expression. Click "Save". Return to the data view and preview the matching result. Repeat until the match is correct.

 

2) limit the parent tag

In "Data View" > "webpage Structure View", find the HTML tag corresponding to the current unit. The tag is marked in red.

Click the parent node of the red tag, "li" in the figure.

The blue-yellow area in the figure indicates the unit already defined.

A red area appears. Click "unit" on the "designer toolbar" to go to the "unit definition View".

Move the scroll bar downward.

Enter or modify the expression value and leave the unit name empty. Click "Save". An entry named after the tag appears in the designer view, here "<li>".

Return to the data view and preview the matching result. Repeat until the match is correct.

 

2.2.1.5 paging Definition

To collect data from multiple pages, you need to define paging rules to tell the system how to obtain other pages.

 

Scroll down the page shown above.

Right-click the paging address.

Click the attribute menu to obtain the page URL. In this example it is http://java.csdn.net/c_channelrecomm/tag/2.

Click the page button on the designer toolbar to open the page design dialog box.

Choose either to specify the page address by page number or to extract the page address automatically.

 

2.2.1.5.1 page address

Copy the paging address to the "address expression" input box.

In this example, http://java.csdn.net/c_channelrecomm/tag/2.

 

Modify the page keyword:

Replace the characters representing the page number with ${count}.

For example, http://java.csdn.net/c_channelrecomm/tag/${count}.
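How an address expression expands into concrete page addresses can be sketched as follows. This is only an illustration of the ${count} placeholder: the function name page_urls and the explicit first and last page numbers are assumptions, and how the collector itself decides the page range is not described here.

```python
from typing import List


def page_urls(address_expression: str, first: int, last: int) -> List[str]:
    """Replace the ${count} placeholder with each page number in turn."""
    return [
        address_expression.replace("${count}", str(number))
        for number in range(first, last + 1)
    ]


urls = page_urls("http://java.csdn.net/c_channelrecomm/tag/${count}", 1, 3)
for u in urls:
    print(u)
```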

 

2.2.1.5.2 automatically extract the paging address

Automatic paging address extraction matches an address expression against the source code of the current page. The address expression is a wildcard expression that describes the paging address as it appears in the page source.

For the feature expression syntax, see the "wildcard expression" section.

If the feature expression matches more than the paging address itself, enclose the paging address in parentheses within the feature expression.

For example: http://www.caijiqi.net/$n-$n-$n-page.jsp or <a href="(http://www.caijiqi.net/$s)"> next page </a>

 

Click Preview to view the page address extraction information.
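The role of the parentheses can be illustrated with an ordinary regular expression. This is a sketch under assumptions: the sample HTML and the address list-2.html are made up, $s is assumed to mean "any string", and the regular expression is a hand-written equivalent of the feature expression rather than the collector's own output.

```python
import re
from typing import Optional

# Hypothetical page source containing a "next page" link.
html = '<li><a href="http://www.caijiqi.net/list-2.html"> next page </a></li>'

# Hand-written regex equivalent of the feature expression
#   <a href="(http://www.caijiqi.net/$s)"> next page </a>
# The parentheses mark the part that is kept as the paging address.
pattern = re.compile(
    r'<a href="(http://www\.caijiqi\.net/.*?)">\s*next page\s*</a>'
)


def extract_page_address(source: str) -> Optional[str]:
    """Return the paging address captured by the group, or None."""
    match = pattern.search(source)
    return match.group(1) if match else None


print(extract_page_address(html))
```

Without the group, the whole anchor tag would be returned instead of only the address.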

2.2.1.6 define a lower-level (sub-) webpage

If the data is spread over several pages and a unit on the upper-level page holds the address of the lower-level page, you can define rules for the page that the unit points to.

 

For example, you can define the URL of an article on the list page, obtain the page of the article through the URL in the list page, and then define the title and body of the article on the article page.

 

Right-click "url" in the "designer View".

Click "HTML designer" to go to the unit design steps for the subpage.

Repeat the steps used for the first page until all units are defined.

2.2.1.7 content template Definition

The content unit contains HTML tags and data. By default, the system strips all tags. You can use XSL templates to retain or convert tags. Templates are applied in order: a template further down overrides earlier definitions.

 

The system has preset various templates that can be directly called or expanded.

2.2.1.7.1 retain the Tag Name

The "retain Tag Name" template keeps the specified tags together with the content they contain.

<!-- 1. Keep tag name [a | img | br | p] -->
<xsl:template match="a | img | br | p">
    <xsl:call-template name="includetagname"/>
</xsl:template>

The value of the match attribute (shown in red in the original) is an XPath expression. For example, to retain a, div, span, and h1, change it to a | div | span | h1. For details on XPath, see the relevant documentation.

2.2.1.7.2 bbcode-conversion tag

This template converts tags such as a and img into bbcode. You can add to or modify this template.

<!-- 2. bbcode-conversion tag -->
<xsl:template match="a">
    [url=<xsl:value-of select="@href"/>]<xsl:apply-templates/>[/url]
</xsl:template>

<xsl:template match="img">
    [img]<xsl:value-of select="@src"/>[/img]
</xsl:template>
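To make the effect of the bbcode templates concrete, the same conversion can be sketched outside XSL. The following uses Python's standard html.parser and is only an approximation of what the templates do: the class and function names are illustrative, and every tag other than a and img is simply dropped, mirroring the default behavior of stripping tags.

```python
from html.parser import HTMLParser


class BBCodeConverter(HTMLParser):
    """Turn <a> and <img> into bbcode and drop every other tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self.parts.append("[url=%s]" % (attributes.get("href") or ""))
        elif tag == "img":
            self.parts.append("[img]%s[/img]" % (attributes.get("src") or ""))

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("[/url]")

    def handle_data(self, data):
        # Text content is kept unchanged.
        self.parts.append(data)


def to_bbcode(html: str) -> str:
    parser = BBCodeConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(to_bbcode('<p>See <a href="http://example.com">this</a> <img src="a.png"></p>'))
```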

2.2.1.7.3 filter tags

The "filter tag" template removes the specified tags or the content they contain.

<!-- 3. Filter tags -->
<xsl:template match="div | a">
    <xsl:call-template name="removetag"/>
</xsl:template>

The value of the match attribute (shown in red in the original) is an XPath expression. Replace it with the XPath expression for the tags you want to filter.

2.2.2 unit Filter

The Unit filter can filter the units defined by the upper-level designer. You can use a wildcard expression to filter or replace the content.

 

Suppose the units under "web designer ..." need to be filtered. Right-click that item.

Click "unit filter" in the menu to go to the "unit filter designer View".

Scroll down the "unit filter View".

Click Add.

Right-click the target unit item. The "unit list" menu is displayed.

Click the unit to filter in the menu, such as the title unit. The system fills in the unit name automatically. If it differs from the actual name, you can edit it manually.

Enter a wildcard expression as the replacement condition. For more information, see the "wildcard expression" section.

Enter the replacement value. To simply remove the matched text, leave it empty.

If the replacement value needs to contain part of the matched text, enclose the corresponding part of the wildcard expression in a group. Use $1 and $2 in the replacement value to reference the groups in the wildcard expression.

 

Click a blank area outside the red area and click "New Code" or "New View" to preview the result.

 

Click the Add button to filter other units.

 

Click Save to save the settings. If you do not need to set another designer, click System Menu> rules> Save, save the current rule, and end the rule definition.
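The replacement condition, the replacement value, and the $1 and $2 group references of the unit filter behave much like a regular expression substitution. Here is a minimal sketch under assumptions: the condition is given directly as a regular expression (as if the wildcard expression had already been translated), and the function name apply_filter and the sample values are illustrative.

```python
import re


def apply_filter(value: str, condition: str, replacement: str) -> str:
    """Replace every match of `condition` (a regex) in a unit value."""
    # Double backslashes so they stay literal, then convert $1, $2, ...
    # into Python's \g<1>, \g<2>, ... group references.
    escaped = replacement.replace("\\", "\\\\")
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", escaped)
    return re.sub(condition, py_replacement, value)


# Keep part of the matched text by referencing group 1.
print(apply_filter("Java: using buffering", r"^(\w+): ", "[$1] "))

# An empty replacement value filters the matched text out.
print(apply_filter("Hello (ad) world", r" \(ad\)", ""))
```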

2.2.3 unit wrapper

The unit wrapper can combine multiple upper-level units into a single unit, or add text to a unit.

Suppose the units under "unit filter ..." need additional text. Right-click that item.

Click "unit wrapper" in the pop-up menu to go to the unit wrapper view.

Scroll down the "unit wrapper View".

Click Add.

Right-click the value item. The "unit list" menu is displayed.

Click the unit to be wrapped in the menu, such as the title unit. The system fills in the unit name automatically. If it differs from the actual name, you can edit it manually.

You can enter additional text in the value or select other units.

When you have finished, click a blank area outside the red area and preview the result in "code" or "View".

 

Click the Add button to set other units.

 

Click Save to save the settings. If you do not need to set another designer, click System Menu> rules> Save, save the current rule, and end the rule definition.
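Conceptually, the unit wrapper builds a new value from existing unit values plus fixed text. A minimal sketch follows, assuming a hypothetical {unit name} placeholder syntax; the collector's own value syntax is set through the "unit list" menu and is not reproduced here.

```python
from typing import Dict


def wrap_units(template: str, units: Dict[str, str]) -> str:
    """Fill {unit name} placeholders in the template with unit values."""
    return template.format(**units)


units = {
    "title": "Using buffering technology",
    "url": "http://java.csdn.net/c_channelrecomm/tag/1",
}
print(wrap_units("[Repost] {title} (source: {url})", units))
```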

2.2.4 CSV generator (publisher)

The CSV generator publishes unit data to your website in CSV format. Your website accepts the CSV file, reads the CSV data, and publishes it into its own system. The website can be a forum or a CMS.

You need to download the publishing plug-in for your website from the official website and install it according to the plug-in instructions. If the official website does not provide the plug-in you need, you can download the plug-in development kit and develop your own, or ask the official team for help.

Suppose the units under "unit package ..." and their parent units must be published to your website in CSV format. Right-click that item.

Click "CSV generator" in the pop-up menu to go to the CSV generator view.

Enter the URL of the CSV plug-in in the upload URL field, for example http://www.caijiqi.net/csv-dm-taker.php. Select the encoding of your website. If the data is so large that the transfer times out, set the size of each split file in the maximum split input box.

 

Click the import from template button on the designer toolbar.

Select a publishing template from the pop-up menu. If you do not use a template, you need to set the publishing parameters manually.

Scroll down the "CSV generator View".

If no template is selected, click Add and enter the parameter name expected by the plug-in in the unit name column. For more information, see the plug-in instructions.

Right-click the value item. The "unit list" menu is displayed.

Click the unit to be published in the menu, such as the title unit.

 

When all settings are complete, click the Save button to save them. You can click "Save as template" on the "designer toolbar" to save the current parameters for later reuse. If you do not need to set another designer, click System Menu > rules > Save to save the current rule and finish the rule definition.
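The idea of writing unit data as CSV and splitting it when it exceeds a maximum size can be sketched as follows. This is an illustration, not the generator's actual file layout: the function name build_csv_parts, the repeated header row in every part, and the byte-based limit are all assumptions, and the upload to the plug-in URL is not shown.

```python
import csv
import io
from typing import Dict, List


def build_csv_parts(rows: List[Dict[str, str]], fields: List[str],
                    max_bytes: int, encoding: str = "utf-8") -> List[bytes]:
    """Serialize rows to CSV, starting a new part before max_bytes is exceeded."""

    def encode_row(values: List[str]) -> bytes:
        buffer = io.StringIO()
        csv.writer(buffer).writerow(values)
        return buffer.getvalue().encode(encoding)

    header = encode_row(fields)  # assumed: every part repeats the header
    parts = []
    current = header
    for row in rows:
        line = encode_row([row.get(name, "") for name in fields])
        if len(current) + len(line) > max_bytes and current != header:
            parts.append(current)
            current = header
        current += line
    if current != header:
        parts.append(current)
    return parts


rows = [
    {"title": "A", "url": "http://a"},
    {"title": "B", "url": "http://b"},
    {"title": "C", "url": "http://c"},
]
for part in build_csv_parts(rows, ["title", "url"], max_bytes=30):
    print(part.decode("utf-8"))
```

A row that is larger than the limit on its own still gets a part to itself, so no data is dropped.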

2.2.4.1 rule template Definition

You can save frequently-used Rule definitions as templates so that the previous settings can be extracted during the next rule definition. You cannot create a template separately. It is part of the designer. When the designer supports the template function, the template function button is displayed on the designer toolbar. A template can be added or loaded only when rules are defined.

You can delete unnecessary templates from the System Menu> Tools> dictionary.

2.2.5 content publisher

The content publisher publishes data by simulating a user submitting a web form. It cannot be used on pages with special restrictions, such as verification codes or release time restrictions. If you are the website administrator and are not subject to such restrictions, we recommend this module for publishing data. You do not need to install any plug-in on your website. If the published data needs special processing, such as downloading attachments, you only need to install a public plug-in on the website. This plug-in can be downloaded from the official website, or you can develop your own. This method ensures the integrity and security of data publishing.

If you can modify the system code of your website to provide a special channel for a specific user ID, that user ID is not subject to the special restrictions when publishing data. In this case, you can also use this publisher.

Suppose the units under "web designer ..." and their parent units must be published to your website by the content publisher. Right-click that item.

Click "content publisher" in the pop-up menu to go to the content publisher view.

The content publisher is divided into a logon form and a release form. The logon form is used to log the user on. The release form is used to submit data.

Click the import from template button on the designer toolbar. Select a publishing template from the pop-up menu. If you do not use a template, you need to set the publishing parameters manually.

 

1) logon form:

Action: the URL of the logon form. It can be determined from the logon form on the logon page of the user's website.

Encoding: the webpage encoding. It must match the encoding of the user's website.

Click Add.

Enter the logon parameters and values in the red area. The specific parameters can be determined from the logon form on the logon page of the user's website.

2) Release form:

Action: the URL of the release form. It can be determined from the form on the publishing page of the user's website.

Encoding: the webpage encoding. It must match the encoding of the user's website.

Click Add.

Enter the release parameters in the red area. The specific parameters can be determined from the form on the publishing page of the user's website.

Right-click the item corresponding to the value.

Select a unit in the pop-up menu.

 

Data Plugin: a plug-in installed on the website to process published data. This plug-in can filter data and download attachments. It can be left empty.

 

Click "Save settings" on the "designer toolbar" to save the settings. You can click "Save as template" on the "designer toolbar" to save the current parameters for later reuse. If you do not need to set another designer, click System Menu > rules > Save to save the current rule and finish the rule definition.
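Simulating a form submission amounts to sending the form's parameters to its action URL in the website's encoding. The sketch below only builds the requests and does not send them. The function name build_form_request, the example.com URLs, and the parameter names are hypothetical. The real names come from the logon and release forms of your own website.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(action: str, params: dict, encoding: str = "utf-8") -> Request:
    """Build a POST request that mimics a browser submitting a form."""
    body = urlencode(params, encoding=encoding).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(action, data=body, headers=headers, method="POST")


# Logon form: action URL and parameters are hypothetical.
login = build_form_request(
    "http://www.example.com/login.php",
    {"username": "jwd", "password": "123d"},
)

# Release form: the values would come from collected units.
publish = build_form_request(
    "http://www.example.com/post.php",
    {"title": "Hello world", "content": "abc"},
)

print(login.get_method(), login.full_url)
print(publish.data)
```

To actually submit, both requests would be opened with the same cookie-aware opener (for example one built with urllib.request.build_opener and HTTPCookieProcessor) so that the session from the logon form carries over to the release form.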

 

2.2.6 Forum publisher

The Forum publisher publishes data by simulating a user submitting a web form. It cannot be used on pages with special restrictions, such as verification codes or release time restrictions. If you are the website administrator and are not subject to such restrictions, we recommend this module for publishing data. You must install the BBS plug-in on your website. This plug-in returns the post ID so that the system can automatically follow up. If the published data needs special processing, such as downloading attachments, you only need to install a public plug-in on the website. This plug-in can be downloaded from the official website, or you can develop your own. This method ensures the integrity and security of data publishing.

You need to download the BBS plug-in for your website from the official website and install it according to the plug-in instructions. If the official website does not provide the plug-in you need, you can download the plug-in development kit and develop your own, or ask the official team for help.

If you can modify the system code of your website to provide a special channel for a specific user ID, that user ID is not subject to the special restrictions when publishing data. In this case, you can also use this publisher.

Suppose the units under "web designer ..." and their parent units must be published to the forum website. Right-click that item.

Click "Forum publisher" in the pop-up menu to go to the Forum publisher view.

The Forum publisher is divided into a logon form and a release form. The logon form is used to log the user on. The release form is used to submit data.

Click the import from template button on the designer toolbar. Select a publishing template from the pop-up menu. If you do not use a template, you need to set the publishing parameters manually.

1) Login User:

The login users are the user IDs that are allowed to post. The format is "username@password". Multiple users are separated by line breaks. The system randomly selects a user to publish data.

For example:

JWD@123d

Haoren@hao13

Didi@didi

2) user unit:

The user unit is the unit that holds the author of each collected post. The system uses it to assign posts by the same original author to the same user ID on the website.

Right-click the input box and select the corresponding unit.
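The login user list format (one user name and password per line, separated by @) and the random choice of a posting user can be sketched as follows. The function name parse_login_users is illustrative, and splitting at the first @ is an assumption about how user names and passwords are told apart.

```python
import random
from typing import List, Tuple


def parse_login_users(text: str) -> List[Tuple[str, str]]:
    """Parse one 'username@password' entry per line, skipping blank lines."""
    users = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumed: the first @ separates the user name from the password.
        name, _, password = line.partition("@")
        users.append((name.strip(), password.strip()))
    return users


login_users = parse_login_users("JWD@123d\nHaoren@hao13\nDidi@didi")
name, password = random.choice(login_users)  # one user is picked per post
print(name)
```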

3) logon form:

Action: the URL of the logon form. It can be determined from the logon form on the logon page of the user's website.

Encoding: the webpage encoding. It must match the encoding of the user's website.

Click Add.

Enter the logon parameters in the red area. The specific parameters can be determined from the logon form on the user's website.

Right-click the item corresponding to the value.

Select a menu item in the pop-up menu.

4) Release form:

Action: the URL of the release form. It can be determined from the form on the publishing page of the user's website.

Encoding: the webpage encoding. It must match the encoding of the user's website.

Click Add.

Enter the release parameters in the red area. The specific parameters can be determined from the form on the publishing page of the user's website.

Right-click the item corresponding to the value.

Select a unit in the pop-up menu. The parameter corresponding to the topic ID must be specified. Otherwise, the system cannot follow up.

 

Forum plug-in: this plug-in must already be installed on the user's website. It returns the post ID so that the system can follow up.

 

Data Plugin: a plug-in installed on the website to process published data. This plug-in can filter data and download attachments. It can be left empty.

 

Click "Save settings" on the "designer toolbar" to save the settings. You can click "Save as template" on the "designer toolbar" to save the current parameters for later reuse. If you do not need to set another designer, click System Menu > rules > Save to save the current rule and finish the rule definition.

 

3. Task

You can use the task function to specify the collection rules to be automatically started by the system at a fixed time point for data collection. Tasks allow rules to automatically collect data. You can use the task function to synchronize data between your website and the target website, or monitor data on the target website.

3.1 list

The task list displays the task settings that the user saves in the system. You can select a task to delete and modify it.

Click task> List

Go to the task list page

E: Modify the task. You can modify the rule and running time.

D: Delete the task.

 

3.2 new

When you need to automatically collect data, you need to define a task to specify the collection rules and running time.

Click task> Create

Go to the Create task page

Click a rule in the rule list box, then set the execution cycle, execution period, and remarks. Click Save to save the task.

Rule List: the collection rule to be started on schedule.

Execution cycle: the interval between executions, in days.

Execution period: the time of day at which the task runs, on a 24-hour clock.

Remarks: a description of the task.

4. Categories

You can use the category function to classify collection rules to facilitate rule management.

4.1 List

The category List displays the category settings that the user saves in the system. You can select a category to delete or modify it.

Click Category> List

Go to the category list page

E: Modify the category. You can modify the name and description.

D: Delete the category. It deletes all the rules and associated tasks under this category.

 

4.2 new

Click Category> New

Go to the create category page

Enter the name and description, and click Save to save the category.

5. Support information

Official website: www.caijiqi.net (publishes project documents and provides system downloads)
QQ: 107175884
Mail: hotheartboy@gmail.com

 
