How Website Data Collectors Work

Last Update:2014-12-12 Source: Internet

Author: User

Website collector: a program that quickly gathers information from the web and publishes it. It has two functions: collecting and processing information, and publishing information.

As a way to add content to a site quickly, collectors have long drawn the attention of individual webmasters. On the one hand, webmasters try to prevent others from collecting their own sites; on the other hand, they want to use collectors to enrich their own sites with content from elsewhere. We have no way of knowing when the first collector appeared, but today the major domestic article management systems have collection functions built in, and even some large domestic websites use collected information to a greater or lesser degree. The enthusiasm is easy to understand: collecting saves time. There are now many collection products, each with its own strengths. Yet for a long time, whatever the collector and however simple the program, collectors have been hard for most ordinary users to use. So this article explains how a collector works, in the hope that it helps you when you use one.

In fact, the basic principle and workflow of a collector are very simple. Broadly, the work divides into two stages: getting the data and publishing it.

Get the data

How the data is obtained differs somewhat with the type of collector and the language it is written in, but every collector works by visiting the target site and extracting the relevant information from it. The collection program reads the collection rules to decide how to access the target site, which addresses on the site are valid, what content to collect, and how to extract the useful information. All of this is specified by the collection rules.

Take the long-established BFC collector as an example (its free version is feature-rich and does not insert advertising into the collected content). A collection rule first needs the address of the content list, which BFC calls the "list URL". This list page contains the links to the content you want to collect. For example, to collect the "BFC Collector Application Exchange" board of the official BFC forum, the link address is: http://bbs.bfcstudio.com/thread.php?fid=9

We can set the list URL to this address. Now we have the list address, but what if we only want to collect links from one area of this page? That requires setting the list range, using the list start string and the list end string. As the names suggest, the list start string marks where in the page code the content you need begins, and the list end string marks where it ends.

This is the hardest part of any collection program to understand, and also the hardest part of setting up rules. In fact, as long as you are willing to look carefully at the code of the list page, it is easy. Keep the following basic principles in mind and the start and end strings will not stump you when you write rules:

Start string standard: in the page's HTML code, it appears before the desired content, ideally only once (if it appears more than once, the first occurrence is used).

End string standard: in the page's HTML code, it appears after the start string, ideally only once (if it appears more than once, the first occurrence is used). Remember that the search begins at the start string.

Start and end strings come in pairs, and the collector takes the content between them as the valid content. A pair does not have to be unique in the code, but every matching pair must enclose content you need (this is useful when collecting forum replies). Searching the page source (for example with Ctrl+F) helps you find strings that meet the standard.

Put another way:

Start string:

A string that precedes the valid text in the collected code. It must satisfy the following conditions: it is unique in the content before the valid information (if it is not unique, the first occurrence is used), and at least one start string must exist in the content before the valid information, otherwise extraction fails.

End string:

A string that follows the valid text in the collected code. It must satisfy the following conditions: it must not occur anywhere between the start string and the end of the valid information, and at least one end string must exist in the content after the valid information (the program uses the first occurrence after the start string), otherwise extraction fails. Some users have come up with a handier way to choose these strings: use DW (Dreamweaver) or another visual page design tool to pick out the keywords. For the specific steps, see: http://bbs.bfcstudio.com/read.php?tid=692
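The start/end string rule can be sketched in a few lines of Python. This is only an illustration of the principle, not BFC's actual code; the function names and the sample page are made up for the example.

```python
from typing import List, Optional


def extract_between(html: str, start: str, end: str) -> Optional[str]:
    """Return the text between the first `start` and the first `end` after it."""
    start_pos = html.find(start)
    if start_pos == -1:
        return None  # no start string: extraction fails
    content_pos = start_pos + len(start)
    end_pos = html.find(end, content_pos)  # search only after the start string
    if end_pos == -1:
        return None  # no end string after the start string: extraction fails
    return html[content_pos:end_pos]


def extract_all_between(html: str, start: str, end: str) -> List[str]:
    """Return the content of every start/end pair, e.g. each reply in a thread."""
    results = []
    pos = 0
    while True:
        start_pos = html.find(start, pos)
        if start_pos == -1:
            break
        content_pos = start_pos + len(start)
        end_pos = html.find(end, content_pos)
        if end_pos == -1:
            break
        results.append(html[content_pos:end_pos])
        pos = end_pos + len(end)
    return results


# Hypothetical list page used only for this example.
page = (
    '<div id="nav">menu</div>'
    '<div id="list"><a href="read.php?tid=1">First</a></div>'
    '<div id="foot"></div>'
)
print(extract_between(page, '<div id="list">', "</div>"))
# prints: <a href="read.php?tid=1">First</a>
```

Note that `</div>` occurs several times in the page, yet the result is correct because the search for the end string only begins after the start string has been found.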

To use a collector you have to understand how to set start and end strings. They are the foundation of every collection program: no computer today can know by itself what you need, and that is not merely a software problem.

With that said, once the start and end strings are set, the valid range of the list is delineated, and the collection program automatically extracts the links found in that area.

If you do not need every link in this area, you can use a more fine-grained link filter. The BFC collector filters on the content of the URL: you can specify text the URL must contain or must not contain. These are the "URL inclusion" and "URL exclusion" settings in the BFC Rule Manager.

Other collectors provide broadly similar functions, and with flexible use they achieve the same result.
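As a rough illustration of link extraction with inclusion and exclusion filters, here is a sketch using Python's standard library. The function and parameter names (`collect_links`, `must_contain`, `must_not_contain`) are assumptions for the example and do not come from BFC.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a fragment of HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_links(fragment, base_url, must_contain="", must_not_contain=""):
    """Return absolute URLs from `fragment` that pass both URL filters."""
    parser = LinkParser()
    parser.feed(fragment)
    parser.close()
    kept = []
    for href in parser.links:
        url = urljoin(base_url, href)  # resolve relative links against the list page
        if must_contain and must_contain not in url:
            continue
        if must_not_contain and must_not_contain in url:
            continue
        kept.append(url)
    return kept


# Hypothetical list area, already cut out with the start and end strings.
area = (
    '<a href="read.php?tid=692">Topic A</a>'
    '<a href="profile.php?uid=3">Author</a>'
    '<a href="read.php?tid=1159">Topic B</a>'
)
links = collect_links(area, "http://bbs.bfcstudio.com/thread.php?fid=9",
                      must_contain="read.php")
print(links)
```

Here the inclusion filter keeps the two topic links and drops the author profile link.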

About list paging: most collectors offer fairly complete list paging settings. The most widely used type is regular paging, like the following:

thread.php?fid=2&search=&page=1

thread.php?fid=2&search=&page=2

thread.php?fid=2&search=&page=3

thread.php?fid=2&search=&page=4

thread.php?fid=2&search=&page=5

Paging like this is simple to set up. In the BFC collector you can use the batch method and set the URL string to thread.php?fid=2&search=&page={page}

Set the range of {page} from 1 to 5 (fill in however many pages there are).

{page} is the BFC collector's paging variable; it is automatically incremented or decremented within the specified range.
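The effect of the {page} variable can be imitated with a short function. This is a sketch of the idea only; `expand_pages` is a hypothetical name, and how BFC expands the variable internally is not documented here.

```python
def expand_pages(template, first, last, step=1):
    """Expand a URL template containing {page} into a list of page URLs.

    A positive step counts up from `first` to `last`; a negative step counts down.
    """
    if step == 0:
        raise ValueError("step must not be zero")
    stop = last + 1 if step > 0 else last - 1  # make `last` inclusive
    return [template.replace("{page}", str(n)) for n in range(first, stop, step)]


urls = expand_pages("thread.php?fid=2&search=&page={page}", 1, 5)
for url in urls:
    print(url)
```

Calling `expand_pages(template, 5, 1, step=-1)` yields the same pages in descending order, which corresponds to a decrementing range.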

Another way to set paging is cruder but simple: add the addresses manually. Choose this option and fill in the list addresses you want to collect, one per line, as many as you have time for.

The third way is to set the start and end code surrounding the next-page link. The program then locates the next-page link in the current page automatically. This is more trouble to set up, but it works very well.
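Following a next-page link can be sketched as below. The sketch assumes the start code ends right before the link address and the end code begins right after it; the function names and the in-memory "site" are invented for the example, and a real collector would download each page over HTTP instead.

```python
from urllib.parse import urljoin


def find_next_page(html, page_url, link_start, link_end):
    """Return the absolute URL of the next page, or None if there is none."""
    start_pos = html.find(link_start)
    if start_pos == -1:
        return None
    href_pos = start_pos + len(link_start)
    end_pos = html.find(link_end, href_pos)
    if end_pos == -1:
        return None
    return urljoin(page_url, html[href_pos:end_pos].strip())


def walk_pages(first_url, fetch, link_start, link_end, max_pages=50):
    """Follow next-page links from `first_url`; `fetch(url)` returns page HTML."""
    urls, seen, url = [], set(), first_url
    while url and url not in seen and len(urls) < max_pages:
        seen.add(url)  # guard against paging loops
        urls.append(url)
        url = find_next_page(fetch(url), url, link_start, link_end)
    return urls


# A tiny in-memory site standing in for real downloads.
site = {
    "http://example.com/thread.php?fid=2&page=1":
        '<a class="next" href="thread.php?fid=2&page=2">Next</a>',
    "http://example.com/thread.php?fid=2&page=2":
        '<a class="next" href="thread.php?fid=2&page=3">Next</a>',
    "http://example.com/thread.php?fid=2&page=3":
        "<p>last page, no next link</p>",
}
pages = walk_pages(
    "http://example.com/thread.php?fid=2&page=1",
    lambda url: site.get(url, ""),
    '<a class="next" href="',
    '"',
)
print(pages)
```

The `seen` set and `max_pages` limit are there because a site can link its last page back to the first, which would otherwise loop forever.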

Those are the three ways to set list paging. How the collection program operates and tells them apart internally need not concern us much. The same three methods also apply to content paging.

Now that we have the list of addresses to collect, the next step is setting up content collection.

Content Extraction Settings:

From another site we generally want the article title and the article content. During collection, the collector downloads the HTML code of each address in the list and extracts the article's information according to the rules you have set.

First, title extraction. The collector's data processing module cuts the title out of the current article's code according to the "title start string" and the "title end string". These follow the same principles as the list range strings described earlier.

If you want to use the link text as the title directly, the BFC collector provides a simpler way: select the option to extract the content title automatically. Once it is selected, you do not need to fill in the title start string and title end string.

(Illustration: there is no need to set a title rule in the BFC collector)

Of course, if the link text in the list is empty, or the link is an image, you still need to set the title start and end strings.

Next, body extraction:

It works the same way as title and list range extraction: just set the body start string and body end string.
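Title and body extraction from a downloaded article page might look like the following sketch. The rule is a plain dictionary whose keys (`title_start`, `body_start`, `auto_title` and so on) are hypothetical names chosen for the example, not BFC's rule format.

```python
def cut(html, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    _, found_start, rest = html.partition(start)
    if not found_start:
        return None
    content, found_end, _ = rest.partition(end)
    return content if found_end else None


def extract_article(html, rule, link_text=""):
    """Extract title and body; with auto_title the list link text is the title."""
    if rule.get("auto_title"):
        title = link_text
    else:
        title = cut(html, rule["title_start"], rule["title_end"])
    body = cut(html, rule["body_start"], rule["body_end"])
    return {"title": title.strip() if title else None, "body": body}


# Hypothetical article page and rule.
page = "<html><h1>Paging rules</h1><div class='post'>Set {page} from 1 to 5.</div></html>"
rule = {
    "title_start": "<h1>", "title_end": "</h1>",
    "body_start": "<div class='post'>", "body_end": "</div>",
}
article = extract_article(page, rule)
print(article["title"])  # prints: Paging rules
print(article["body"])   # prints: Set {page} from 1 to 5.
```

With `auto_title` set, the title strings are not needed, but an empty link text (for example an image link) leaves the title as `None`, which is why explicit title strings are still required in that case.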

What matters here is processing the body content. What we have just collected is a piece of HTML code, and we do not know what it contains: perhaps malicious code, or tags that affect the layout, such as table, tr, td, tbody and so on. So if you publish to a forum, it is best to publish in UBB code to ensure the forum's security and compatibility (the account you use may not be allowed to post HTML, which would make the post fail). For this reason, nearly all collectors can convert between code formats.
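A very small HTML-to-UBB converter can be sketched with regular expressions. It handles only a handful of tags and is nowhere near a complete or hardened converter; the rule table and the function name `html_to_ubb` are assumptions for illustration, not the conversion script that ships with any collector.

```python
import html as html_lib
import re

# (pattern, replacement) pairs applied in order; everything else is stripped.
_RULES = [
    (re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S), ""),  # drop code blocks
    (re.compile(r"<br\s*/?>", re.I), "\n"),
    (re.compile(r"<(/?)(b|i|u)>", re.I),
     lambda m: "[" + m.group(1) + m.group(2).lower() + "]"),
    (re.compile(r"<(/?)strong>", re.I), r"[\1b]"),
    (re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>', re.I), r"[url=\1]"),
    (re.compile(r"</a>", re.I), "[/url]"),
    (re.compile(r'<img\s[^>]*src="([^"]*)"[^>]*>', re.I), r"[img]\1[/img]"),
]
_ANY_TAG = re.compile(r"<[^>]+>")


def html_to_ubb(fragment):
    """Convert a few common HTML tags to UBB code and strip all other tags."""
    for pattern, replacement in _RULES:
        fragment = pattern.sub(replacement, fragment)
    fragment = _ANY_TAG.sub("", fragment)   # table, tr, td, tbody, ... disappear
    return html_lib.unescape(fragment)      # &amp; -> & and so on


sample = '<td><b>Hi</b> <a href="http://example.com/">site</a><br>bye &amp; thanks</td>'
print(html_to_ubb(sample))
```

Because unknown tags are removed rather than passed through, layout tags and script blocks never reach the forum, which is the safety benefit described above.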

What if you need to publish to a CMS or another system that does not support UBB code? Publishing as HTML is easy, but it is best to set up filtering in the rule for tags that might clutter the layout. This is very convenient in the BFC collector:

Just check the tags you want to filter.
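Filtering a chosen set of tags while keeping the text inside them could be done roughly like this. The function name `strip_tags` and the tag list are illustrative assumptions.

```python
import re


def strip_tags(fragment, tag_names):
    """Remove the opening and closing tags named in `tag_names`, keeping their content."""
    if not tag_names:
        return fragment
    names = "|".join(re.escape(name) for name in tag_names)
    # \b stops "tr" from matching longer tag names such as <track>.
    pattern = re.compile(rf"</?(?:{names})\b[^>]*>", re.I)
    return pattern.sub("", fragment)


body = '<table><tbody><tr><td class="x">Text <b>bold</b></td></tr></tbody></table>'
print(strip_tags(body, ["table", "tbody", "tr", "td"]))
# prints: Text <b>bold</b>
```

Only the listed tags are removed, so harmless formatting such as `<b>` survives for systems that publish HTML directly.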

As for body paging, there is nothing more to say: it is set up the same way as list paging, so just set a paging rule.

Now let's look at how to handle content in the body or title that we do not need or want to replace. In BFC this is done with various elements, most often filter elements and replacement elements:

Filter element: deletes content you do not need. Its scope can be the title or the body.

Replacement element: replaces original content with content you set yourself. Its scope can be the title or the body.

Used together, these two elements go a long way toward cleaning up what you have collected.
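The idea behind filter and replacement elements can be sketched as follows. The sketch assumes elements match literal text; the tuple layout and the name `apply_elements` are invented for the example, and BFC's real elements may match content differently.

```python
def apply_elements(article, elements):
    """Apply filter and replacement elements to a collected article.

    article:  dict with "title" and "body" strings.
    elements: list of (kind, scope, target, replacement) tuples, where kind is
              "filter" or "replace" and scope is "title" or "body".
    """
    result = dict(article)
    for kind, scope, target, replacement in elements:
        if kind == "filter":
            replacement = ""  # filtering is replacing with nothing
        elif kind != "replace":
            raise ValueError(f"unknown element kind: {kind}")
        result[scope] = result[scope].replace(target, replacement)
    return result


article = {
    "title": "[Repost] Setting list rules",
    "body": "Visit oldsite.example for more. Thanks!",
}
elements = [
    ("filter", "title", "[Repost] ", ""),
    ("replace", "body", "oldsite.example", "mysite.example"),
]
cleaned = apply_elements(article, elements)
print(cleaned["title"])  # prints: Setting list rules
print(cleaned["body"])   # prints: Visit mysite.example for more. Thanks!
```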

For more detail on how to use filter elements and replacement elements, see:

http://bbs.bfcstudio.com/read.php?tid=1159

http://bbs.bfcstudio.com/read.php?tid=1160

In addition to these two elements, BFC also provides insert elements and reference elements.

Insert element: inserts specified content (dynamic or static) at a specified position in the title or body.

Reference element: takes specified content (either cut dynamically from the collected content with a start/end string, or static content you supply) and assigns it to the element's reference target field, which is sent as part of the submitted packet, that is, as the value of a single form field. Because its use is so flexible, we will not cover it in detail here.

More in-depth data processing:

What if these processing functions still do not meet your requirements and you need more complex conversions?

Then use extension functions. Extension functions are independent of the BFC collector and can be customized. If you are comfortable with JavaScript or VBScript, you can write function code in either scripting language to suit your needs. Examples include the BFC collector's bundled "Martian text" and simplified/traditional Chinese conversion scripts, and a UBB conversion script that can replace the program's built-in UBB conversion. With the documentation and those sample scripts, you can write your own extension functions.

Now that we have the content, where do we publish it? In BFC the publishing target is specified by the rule, and each rule publishes to one board (though you can also specify it dynamically before collection starts). This differs from other collectors. Set the target forum and target board on the first page of the rule information. There you can also choose whether the target settings window pops up at each collection (to reassign the target forum and board), and whether to collect only without publishing (content is saved locally but not published to the site, for those who like to browse locally).

That covers collecting the content.

Publish the data

Publishing data is much simpler than collecting it (unless you want to write your own publishing plug-ins). You only need to enter your website's information, paying attention to the following points:

1. Website address: fill it in exactly as the program requires. Different programs have different requirements, so follow the actual situation.

2. Login address: this is very important. Without it the collection program cannot log the user in and cannot submit content.

3. Submit address: needless to say, this must be set (plug-ins generally come with default values, and the defaults usually work).

4. User information: collection programs now provide multi-user publishing, so keep your user list maintained, and check whether each user has permission to post, including the various types of posts.

5. Board information.

One more point: your login information can expire. Most collectors log the user in automatically, but some require you to supply the cookie information obtained after logging in. If that login information has expired, publishing will fail, so it is best to refresh it regularly. How often depends on the login expiration time you chose when logging in.
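The publishing step, logging in and then submitting a post while keeping the session cookies, can be outlined with Python's standard library. Everything site-specific here is an assumption: the URLs, the form field names, and the function names are hypothetical, and a real forum will have its own login and posting forms. The `publish` function is defined but not called, since it needs a live site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def form_request(url, fields):
    """Build a POST request carrying `fields` as a URL-encoded form."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=data)


def publish(opener, login_url, submit_url, credentials, post_fields):
    """Log in first so the session cookie exists, then submit the post."""
    opener.open(form_request(login_url, credentials), timeout=30).close()
    with opener.open(form_request(submit_url, post_fields), timeout=30) as response:
        return response.status


# Building the requests does not touch the network.
opener, jar = make_session()
login = form_request("http://example.com/login.php",
                     {"username": "collector", "password": "secret"})
print(login.get_method())  # prints: POST
print(login.data)          # prints: b'username=collector&password=secret'
```

If the cookies in the jar have expired, the second request is treated as coming from a logged-out user and the post fails, which is why the login has to be repeated or refreshed periodically.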

Do all of the above and your collected content should publish normally.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
