Introduction to train collector terminology

Source: Internet
Author: User

1. Acquisition Tasks

An acquisition task is the complete configuration of one data collection and publishing job in the train collector. It consists of the acquisition rules and the release module.

2. Acquisition rules

The acquisition rules are the settings that tell the collector what to collect and how to collect it, so that the collector follows them when it runs.

These settings can be exported from the train collector and saved as a .ljobx file, and that file can be imported into the train collector again.

3. Release module

In the train collector, the release module is the set of settings that determines where the collected data is published.

There are two kinds, the Web publishing module and the database publishing module, which can be exported and saved as .wpm files and .dbm files, respectively.

These files can be imported into the train collector again and reused many times.

4. Publishing interface

The publishing interface is a small web page program, usually used together with the Web publishing module, that meets a user's specific needs.

The collector sends the collected data to the publishing interface file. The interface file receives the data and processes it according to the user's specific needs.

5. Label

A label is a field name used to extract a particular piece of content. The user specifies it when editing the rule.

For example, information collected under labels such as title, mobile phone number, mail, author, and content can be referenced in the release module through the corresponding label name.

The format is [Label: label name], for example [Label: title].
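The substitution described above can be pictured with a short Python sketch. It is only an illustration, not the collector's own code: the `fill_labels` function, the template string, and the sample values are all hypothetical, and the placeholder syntax is assumed to be exactly the `[Label: name]` form shown above.

```python
import re

# Assumed placeholder syntax: "[Label: name]", with optional space after the colon.
LABEL_PATTERN = re.compile(r"\[Label:\s*([^\]]+?)\s*\]")

def fill_labels(template, values):
    """Replace each [Label: name] placeholder with the collected value."""
    def substitute(match):
        name = match.group(1)
        # Unknown labels are left untouched so missing data is easy to spot.
        return values.get(name, match.group(0))
    return LABEL_PATTERN.sub(substitute, template)

# Hypothetical collected data and publishing template.
collected = {"title": "Hello", "author": "User"}
result = fill_labels("<h1>[Label: title]</h1> by [Label:author]", collected)
print(result)
```

Leaving unknown placeholders in place is a design choice of this sketch; a real release module may behave differently.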

There are two types of labels in the train collector: list page labels and content page labels.

A list page label captures content while the list page is being fetched (that is, while URLs are being collected).

A content page label captures content while the content page, or a multi-page, is being fetched.

Note: the word "tag" is also commonly used for HTML tags, which are markup identifiers in HTML code such as <p> or <a>. They are not the same thing as the labels described here.

6. (*)

This symbol appears often when using the train collector. It is a wildcard that stands for any variable content.

If we only need to know that a part of the text varies, and do not care what it contains, we can use this symbol in its place.
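One way to picture the wildcard is as a "match anything" slot in a regular expression. The Python sketch below works under that assumption. The `wildcard_to_regex` helper and the sample HTML are hypothetical, and the collector's real matching rules may differ.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a pattern containing (*) into a compiled regular expression."""
    # Escape every literal piece, then join the pieces with a lazy "match anything".
    literal_parts = [re.escape(part) for part in pattern.split("(*)")]
    return re.compile(".*?".join(literal_parts), re.DOTALL)

# Hypothetical rule and page fragment.
rule = wildcard_to_regex('<a href="/news/(*).html">')
html = '<li><a href="/news/20240101.html">First</a></li>'
matched = rule.search(html) is not None
print(matched)
```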

7. [parameters]

A placeholder label that matches the information to be extracted, for example to extract values from a piece of code and combine them into an address.

Take extracting a new address format from the code "mclk(this, '108484', '134217', '168475', '1');" as an example.

Written as "mclk(this, '[parameters]', '[parameters]', '[parameters]', '1');", the placeholders are numbered in order of appearance: 108484 is parameter 1, and so on.

The address format that is actually required is the following: bbs/read.php?id=[parameter 1]&sort=[parameter 3]&action=[parameter 2]

The 3 parameters in the code above supply the values of the id, sort and action parameters in this address: 108484 becomes id, 168475 becomes sort, and 134217 becomes action.

The numbering follows the order in the code and must not be changed. The values are then combined into the new address.
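The extraction and recombination above can be sketched in Python. This is an illustration under assumptions, not the collector's implementation: the `build_address` helper is hypothetical, and each [parameters] slot is assumed to behave like a lazy capture group.

```python
import re

def build_address(code, code_pattern, address_format):
    """Extract the [parameters] slots from code and fill them into an address."""
    # Escape the literal text, then turn each [parameters] slot into a capture group.
    regex = re.escape(code_pattern).replace(re.escape("[parameters]"), "(.*?)")
    match = re.search(regex, code)
    if match is None:
        return None
    address = address_format
    # Slots are numbered from 1 in the order they appear in the code.
    for number, value in enumerate(match.groups(), start=1):
        address = address.replace(f"[parameter {number}]", value)
    return address

address = build_address(
    "mclk(this, '108484', '134217', '168475', '1');",
    "mclk(this, '[parameters]', '[parameters]', '[parameters]', '1');",
    "bbs/read.php?id=[parameter 1]&sort=[parameter 3]&action=[parameter 2]",
)
print(address)
```

With the values from the example, the sketch yields bbs/read.php?id=108484&sort=168475&action=134217.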

8. Starting URL

An entry URL from which subordinate link addresses are obtained. There can be one or more of them.

You can add multiple URLs of the same format, or import URLs from a text file, through the Add Start URL wizard.

If no multi-level URLs are defined, these addresses are collected as content page URLs.

9. Multi-level URLs

The collector fetches and analyzes addresses level by level, following the multi-level URL list, until it obtains the content page addresses.

Multi-level URL collection can use either automatic page analysis or manually defined rules to obtain subordinate URLs.

During collection, you can collect the list pages and capture additional list page parameters at the same time.

10. Cookie

A cookie is a string exchanged with the server in HTTP requests. It records a piece of user information, typically login information.

In a browser, it is usually stored as text in a local cache directory, such as the IE cache directory.

As a result, within the validity period, you can keep accessing pages that require authentication without entering your user information again.

11. User-Agent

The browser identifier, which tells the server what type of client you are using.

Some web pages that require login may verify both the cookie and the User-Agent.

In that case, set it to the same value that your local browser sends.
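To make the roles of these two headers concrete, here is a minimal Python sketch that attaches both to a request with the standard `urllib.request` module. The URL, cookie value, and User-Agent string are placeholders chosen for illustration, and the `build_request` helper is hypothetical. The request is only built, not sent.

```python
import urllib.request

def build_request(url, cookie, user_agent):
    """Create a request that carries the given Cookie and User-Agent headers."""
    request = urllib.request.Request(url)
    request.add_header("Cookie", cookie)
    request.add_header("User-Agent", user_agent)
    return request

request = build_request(
    "http://example.com/member/list",             # placeholder address
    "sessionid=abc123",                           # placeholder cookie value
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder browser identifier
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
# urllib stores header names capitalized, hence "User-agent" on lookup.
print(request.get_header("Cookie"))
print(request.get_header("User-agent"))
```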

12. Paging

When a list or a content page is long, it is split across several pages, and all the sub-pages must be combined to obtain the complete content.

These sub-pages are called paging (list paging or content paging).

13. Multi-page

In some cases, part of the content to be collected for a URL, such as pictures and other information, is not on the page itself.

A new page has to be opened to capture this information, and these additionally opened pages are called multi-pages.

14. Web page encoding

The character encoding that a Web page declares for itself, for example through a line like the following in the page:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

Such a line indicates that the character set encoding of the page is GB2312.

The train collector recognizes the encoding of common Web pages automatically.

It also lists most Web page encodings, so you can select the appropriate one manually in the collector.
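Automatic recognition of a declared encoding can be approximated by searching the start of the page for a charset declaration. The Python sketch below shows that idea only. The `detect_charset` helper and the sample page are hypothetical, and real detection, including whatever the collector does, usually also considers HTTP headers and byte patterns.

```python
import re

def detect_charset(raw, default="utf-8"):
    """Look for a charset declaration near the start of a page given as bytes."""
    # Markup is ASCII, so non-ASCII bytes can be dropped for the search.
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(r"charset\s*=\s*[\"']?([\w-]+)", head, re.IGNORECASE)
    return match.group(1).lower() if match else default

# Hypothetical page encoded as GB2312.
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=gb2312"></head>'
        '<body>\u4e2d\u6587</body></html>').encode("gb2312")
charset = detect_charset(page)
text = page.decode(charset)
print(charset)
```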

15. Proxy

A network proxy server, which obtains network information on behalf of network users.

A proxy can be used to get around access restrictions tied to your own IP address, for example to reach foreign sites or the internal resources of certain organizations or groups.

It can also get past IP blocks imposed by a telecom provider and hide your real IP address.

16. Plugin

In the train collector, a plug-in is an external program that performs specific processing on the collected data.

Once the plug-in has been written and compiled, the collector passes the data to the plug-in, the plug-in processes it, and then passes the result back to the collector.

(Plug-ins can be self-developed, or you can contact customer service for customization.)

17. Cron expression

In the settings of the train collector's scheduled task manager, you can enter a full cron expression to define when a task runs.

It is a string made up of 6 or 7 subexpressions separated by spaces. Each subexpression is a field,

and each field describes one detail of the schedule. There are two formats:

Seconds Minutes Hours DayOfMonth Month DayOfWeek Year

Seconds Minutes Hours DayOfMonth Month DayOfWeek

Each field normally holds a number but may also contain the special characters described below. The fields are:

1. Seconds (allowed values 0-59; special characters allowed: , - * /)

2. Minutes (allowed values 0-59; special characters allowed: , - * /)

3. Hours (allowed values 0-23; special characters allowed: , - * /)

4. DayOfMonth, the day of the month (allowed values 1-31; special characters allowed: , - * / ? L W C)

5. Month (allowed values 1-12 or JAN-DEC; special characters allowed: , - * /)

6. DayOfWeek, the day of the week (allowed values 1-7 or SUN-SAT; special characters allowed: , - * / ? L C #)

7. Year (optional field; may be left blank or hold 1970-2099; special characters allowed: , - * /)

Meaning of the special characters:

(1) * represents any value of the field. If used in the Minutes field, it means the event is triggered every minute.

(2) ? can only be used in the DayOfMonth and DayOfWeek fields. It means "no specific value" and is needed because these two fields affect each other. For example, to trigger on the 20th of every month regardless of the day of the week, write 13 13 15 20 * ?. The last field must be ? and not *. If * were used there, it would state that the task fires on every day of the week, which is not the intent.

(3) - represents a range. For example, 5-20 in the Minutes field triggers once per minute from minute 5 through minute 20.

(4) / means trigger at the start value and then at regular intervals. For example, 5/20 in the Minutes field triggers at minute 5 and every 20 minutes after that, that is, at minutes 5, 25 and 45.

(5) , lists enumerated values. For example, 5,20 in the Minutes field triggers at minute 5 and at minute 20.

(6) L means last and can only appear in the DayOfMonth and DayOfWeek fields.

(7) W indicates a weekday (Monday to Friday) and can only appear in the DayOfMonth field. The system triggers the event on the weekday closest to the specified date. The search for the closest weekday does not cross a month boundary.

(8) # identifies the nth occurrence of a weekday in the month and can only appear in the DayOfWeek field. For example, with 1 standing for Sunday, 4#2 is the second Wednesday of the month.

A complete cron expression such as 0 15 08 ? * MON-FRI indicates that data is scheduled to be updated at 8:15 A.M. every Monday to Friday.
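The numeric special characters can be illustrated with a small Python sketch that expands a single field into the values it matches. It is a simplified illustration, not a full cron parser and not the collector's scheduler: the `expand_field` helper and the field names are hypothetical, and only *, -, / and , in their simple forms are handled (no ?, L, W, C, #, month or weekday names, and no range combined with a step).

```python
def expand_field(field, low, high):
    """Expand one numeric cron field into the sorted list of values it matches."""
    values = set()
    for part in field.split(","):            # "," separates enumerated values
        if "/" in part:                       # "start/step": start, then every step
            start_text, step_text = part.split("/")
            start = low if start_text == "*" else int(start_text)
            values.update(range(start, high + 1, int(step_text)))
        elif part == "*":                     # "*": every value of the field
            values.update(range(low, high + 1))
        elif "-" in part:                     # "a-b": inclusive range
            first, last = part.split("-")
            values.update(range(int(first), int(last) + 1))
        else:                                 # a single number
            values.add(int(part))
    return sorted(values)

# Split a complete expression into named fields (names chosen for this sketch).
names = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week"]
fields = dict(zip(names, "0 15 08 ? * MON-FRI".split()))

every_20_from_5 = expand_field("5/20", 0, 59)
hours = expand_field(fields["hours"], 0, 23)
print(every_20_from_5, hours)
```

For the 5/20 example from the text, the sketch produces minutes 5, 25 and 45.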

18. Task URL library

Located in the collector's folder under Data\LocoySpider\PageUrl.

Each task generates its own URL library here, or shares a common one, which is used to check for duplicate URLs.
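The duplicate check can be pictured as a set of already-seen URLs. The Python sketch below is an in-memory stand-in for illustration only. The `UrlLibrary` class is hypothetical and says nothing about how the collector actually stores its library on disk.

```python
class UrlLibrary:
    """In-memory stand-in for a task URL library (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        """Record the URL and report whether it had not been seen before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True

library = UrlLibrary()
candidates = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
# Only URLs not already in the library are kept for collection.
to_collect = [url for url in candidates if library.add_if_new(url)]
print(to_collect)
```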

19. HTTP Request

When a browser opens a Web page, it is actually sending one or more HTTP requests.

The train collector works the same way: to fetch content from a specified address, it sends an HTTP request and then processes the content returned for that request.

When the browser sends a request to the Web server, it passes a block of data to the server, which is the request information.

The HTTP request information consists of 3 parts: the request line (request method, URI, and protocol/version), the request headers, and the request body.
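The three parts can be seen in a small Python sketch that assembles the raw text of a request. The host, path, and form data are placeholders, and the `build_http_request` helper is hypothetical. It only builds a string and sends nothing.

```python
def build_http_request(method, uri, headers, body=""):
    """Assemble the raw text of an HTTP/1.1 request from its three parts."""
    request_line = f"{method} {uri} HTTP/1.1"   # part 1: method, URI, protocol/version
    header_lines = [f"{name}: {value}" for name, value in headers.items()]  # part 2
    # A blank line separates the headers from the body (part 3).
    return "\r\n".join([request_line, *header_lines, "", body])

raw = build_http_request(
    "POST",
    "/bbs/read.php",
    {
        "Host": "example.com",  # placeholder host
        "Content-Type": "application/x-www-form-urlencoded",
        "Content-Length": "9",  # length of the body below
    },
    "id=108484",
)
print(raw)
```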

(The figure illustrating this structure is not reproduced here.)
