Introduction to the Train Collector: Collection Principle and Workflow

Source: Internet
Author: User
First, what is data collection? Put simply: we open a website, see a very good article, copy its title and content, and post it on our own website. This process is called collection: it transfers useful information from other sites to our own website.

A collector does the same thing, except that the whole process is carried out by software. When we copy the title and content of an article by hand, we can see where the title is and where the content is. The software cannot, so we have to tell it, and this is the process of writing rules. After copying, we open our own website, for example the forum's posting page, paste the article, and publish it. For the software, this means imitating the posting process; how the article gets published is defined by the publishing module.
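To make the idea of a rule concrete, here is a minimal sketch in Python. It assumes a rule is simply a pair of start and end markers and that the software captures whatever lies between them. The `Rule` class, the marker strings, and the sample page are all hypothetical illustrations, not the train collector's actual rule format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical extraction rule: capture the text between two markers."""
    start: str
    end: str

    def extract(self, html: str) -> Optional[str]:
        # Locate the start marker; without it the rule does not apply.
        begin = html.find(self.start)
        if begin == -1:
            return None
        begin += len(self.start)
        # Locate the end marker after the start marker.
        finish = html.find(self.end, begin)
        if finish == -1:
            return None
        return html[begin:finish].strip()


# A made-up page standing in for a downloaded article.
page = "<html><h1>Hello collectors</h1><div id='body'>Article text.</div></html>"

# The rules tell the software where the title is and where the content is.
title_rule = Rule(start="<h1>", end="</h1>")
content_rule = Rule(start="<div id='body'>", end="</div>")

title = title_rule.extract(page)
content = content_rule.extract(page)
print(title, "|", content)
```

Marker pairs are easy to write by looking at the page source, but they break when the site changes its markup, so rules of this kind usually need occasional maintenance.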

The train collector is software for collecting data. It is currently the most powerful collector available online and can collect almost any web page content you can see.

  Data capture principle of the train collector:

How the train collector crawls data depends on your rules. To obtain all the content of the pages in a given section, you must first collect the URLs of those pages; this step is URL collection. The program crawls the list page according to your rules, extracts the URLs from it, and then fetches the page behind each URL. It then analyzes each downloaded page according to your collection rules, separates out the title, content, and other fields, and saves them. If you choose to download images and other network resources, the program analyzes the collected data, finds the addresses of those resources, and downloads them to the local machine.
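The two stages described above, URL collection followed by content collection, can be sketched as follows. This is an illustration under stated assumptions: the `PAGES` dictionary stands in for real HTTP responses, the regular expressions play the role of user-written rules, and every function name is hypothetical rather than part of the train collector itself.

```python
import re
from urllib.parse import urljoin

# A tiny in-memory "site" standing in for real HTTP responses (illustrative only).
PAGES = {
    "http://example.com/list": '<a href="/post/1">One</a> <a href="/post/2">Two</a>',
    "http://example.com/post/1": '<h1>First</h1><div class="c">Body 1 <img src="/img/a.png"></div>',
    "http://example.com/post/2": '<h1>Second</h1><div class="c">Body 2</div>',
}


def fetch(url):
    """Stand-in for an HTTP GET; a real collector would download the page."""
    return PAGES[url]


def collect_urls(list_url, link_pattern):
    """Stage 1: crawl the list page and pull out the content-page URLs."""
    html = fetch(list_url)
    return [urljoin(list_url, href) for href in re.findall(link_pattern, html)]


def collect_content(url, field_patterns):
    """Stage 2: download one content page and split it into named fields."""
    html = fetch(url)
    record = {"url": url}
    for name, pattern in field_patterns.items():
        match = re.search(pattern, html, re.S)
        record[name] = match.group(1).strip() if match else None
    # Find image addresses that could be downloaded to local storage afterwards.
    record["images"] = [urljoin(url, src) for src in re.findall(r'<img src="([^"]+)"', html)]
    return record


# The "rules": one pattern for links on the list page, one pattern per field.
rules = {"title": r"<h1>(.*?)</h1>", "content": r'<div class="c">(.*?)</div>'}
urls = collect_urls("http://example.com/list", r'href="(/post/\d+)"')
records = [collect_content(u, rules) for u in urls]

for record in records:
    print(record["title"], record["images"])
```

Replacing `fetch` with a real HTTP client would turn the sketch into a working crawler; the rest of the logic only depends on the HTML text it returns.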

  Data publishing principle of the train collector:

After data is collected, it is stored locally by default. It can then be handled in any of the following ways.

1. No processing. Because the data is already stored in a database (Access or db3), if you only want to look at it you can open it directly with suitable software.

2. Web publishing to a website. The program imitates a browser and sends the data to your website, achieving the same result as publishing by hand.

3. Direct import into a database. You only need to write a few SQL statements, and the program imports the data into the database according to them.

4. Save as local files. The program reads the data from the database and saves it as local SQL or text files in a specified format.
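The publishing options listed above can be sketched with Python's standard library. Everything here is an assumption made for illustration: SQLite stands in for the local store and for the target database, and the table names, form field names (`subject`, `message`), and the URL are hypothetical. The web request is only built, not sent.

```python
import sqlite3
import tempfile
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request

records = [("First", "Body 1"), ("Second", "Body 2")]

# Local storage: collected rows sit in a local database (in-memory SQLite here).
local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE collected (title TEXT, content TEXT)")
local.executemany("INSERT INTO collected VALUES (?, ?)", records)
rows = local.execute("SELECT title, content FROM collected ORDER BY rowid").fetchall()


# Web publishing: build the POST request that a browser form would send.
def build_post(title, content):
    body = urlencode({"subject": title, "message": content}).encode("utf-8")
    return Request("http://example.com/post.php", data=body, method="POST")


# Direct database import: a user-written SQL statement with placeholders.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE articles (title TEXT, body TEXT)")
publish_sql = "INSERT INTO articles (title, body) VALUES (?, ?)"
target.executemany(publish_sql, rows)
target.commit()

# Save as a local text file, with a blank line between records.
out_path = Path(tempfile.mkdtemp()) / "export.txt"
out_path.write_text("\n\n".join(f"{t}\n{c}" for t, c in rows), encoding="utf-8")

request = build_post(*rows[0])
published = target.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(published, request.get_method(), out_path.name)
```

Passing `request` to `urllib.request.urlopen` would actually submit the form. A real target site will usually also expect login cookies and its own field names, which is what a publishing module has to be configured for.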

Train collector workflow:

The train collector's work can be divided into two steps: data collection and data publishing. The two steps can be carried out separately.

1. Collecting data, which covers URL collection and content collection. This is the step that obtains the data: we write the rules, and the content is processed during collection.

2. Publishing content, which means publishing the data to your own forum or CMS. This is the step that puts the data to use. You can publish online via the web, import into a database, or save as local files.

In practice the tool is flexible and can be adapted to the situation. For example, you can collect first and publish later, or publish while collecting; you can configure publishing before collecting, or add the publishing configuration after collection has finished. In short, the exact workflow is up to you. This flexibility is one of the train collector's strengths.
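Because collection and publishing are separate steps, they can be ordered in more than one way. The sketch below contrasts collecting everything before publishing with publishing each record as soon as it is collected. The `collect` and `publish` functions are hypothetical placeholders that only imitate the two steps.

```python
def collect(urls):
    """Hypothetical collection step: yields one record per URL."""
    for url in urls:
        yield {"url": url, "title": f"Title of {url}"}


def publish(record, sink):
    """Hypothetical publishing step: here it only appends to a list."""
    sink.append(record["title"])


urls = ["page-1", "page-2", "page-3"]

# Order A: collect everything first, publish afterwards.
batch_sink = []
collected = list(collect(urls))
for record in collected:
    publish(record, batch_sink)

# Order B: publish each record as soon as it is collected.
stream_sink = []
for record in collect(urls):
    publish(record, stream_sink)

print(batch_sink == stream_sink)
```

Both orders publish the same records; the difference is only when the data reaches the target and whether you get a chance to review it locally first.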
