Webpage Data Capture System Solution

1. Introduction
Project Background
In the Internet era, information is as boundless as the sea, and even the way we obtain it has changed: from looking things up in books and reference works to querying search engines. We have moved abruptly from an age of information scarcity to today's age of information abundance.
Today, the problem is not that there is too little information, but that there is too much, making it hard to sift and choose. A tool that can automatically capture data from the Internet and then automatically sort and analyze it is therefore extremely valuable.
 
The information we obtain through traditional search engines is usually presented as web pages. That form is friendly for human reading, but hard for computers to process and reuse. Moreover, the volume of results retrieved is so large that it is difficult to extract the information we need most from them.

The data aggregation system described in this solution was born out of this need. Based on configurable rules, the system crawls information from specified websites, analyzes and organizes the captured results, and stores them in a structured database, ready for the data to be reused.

chinacili.com is a well-known large-scale recruitment website. To give it a comprehensive and detailed picture of the overall capacity of the recruitment market, a full understanding of its competitors' situation, and a source of potential customer information for its marketing staff, we propose this solution.

Task and Purpose
Jesoft is cooperating with chinacili.com to develop an automatic data aggregation system that obtains open information resources from the Internet, then analyzes, processes, and reworks that information to provide accurate market information to chinacili.com's marketing department.

2. Design Principles
The following two principles are fully taken into account in this system design and run through the entire design and development process:

System accuracy
The system obtains its information from the vast ocean of data on the Internet, so the accuracy and validity of what it captures is the key factor in judging the value of the whole system. Therefore, besides sorting and analyzing the captured information, the system must be able to sense when the content or format of a target website changes; timely alerts and adjustments are likewise important measures for ensuring accuracy.

System flexibility
Although the system is an internal system that serves a small number of users and monitors a fixed set of sites, it still needs to be flexible and scalable.
The structure, layering, and format of the target sites change constantly, and the set of sites to be captured is continually adjusted. The system must adapt to this: when a capture target changes, the system should be able to continue its data aggregation tasks after simple configuration changes or adjustments.

3. Solution
1. Function Structure
[Figure: function structure diagram]
2. Define the Format and Compile the Script
First, we compile a capture script (format definition) based on the characteristics of each target website to be crawled, as sketched after this list. It includes:
The URL path of the target website;

How the data will be obtained: either by simulated query (manually inspect the parameters submitted on the query page and simulate the submission) or by traversing serial numbers from start to end (which requires finding the current maximum serial number);
Standards and scripts compiled for the characteristics of each website;
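Purely as an illustration, here is a minimal sketch of such a script definition, assuming an XML layout and a hypothetical target site; the element names are assumptions, not the system's actual format.

```python
# A sketch of a pre-defined capture script for one target site, assuming an
# XML layout; element names and the example site are illustrative only.
import xml.etree.ElementTree as ET

SCRIPT = """
<capture-script site="example-jobs.com">
  <url>http://example-jobs.com/job/{id}</url>   <!-- URL pattern of target pages -->
  <method>id-traversal</method>                 <!-- or: simulated-query -->
  <id-range start="1" end="250000"/>            <!-- end = current maximum serial number -->
</capture-script>
"""

script = ET.fromstring(SCRIPT)
print(script.get("site"), script.findtext("method"))
```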

3. Capture Data
The capture subprograms provided by the system perform the capture tasks according to the pre-defined XML format. To avoid discovery by the target website's detection mechanisms, we recommend saving each captured page directly and processing it afterwards, rather than processing the information the moment it is obtained; this is of great value for capture efficiency and for preserving the first-hand raw material.
Simulate logon using the defined script;
For query fields backed by a drop-down list, loop over every value in the list; on pages of query results, simulate page turning to obtain all the results;
If the job library or enterprise-name library uses an auto-incrementing integer as its unique ID, find a way to obtain its maximum value and then capture everything by traversal;
Perform capture runs on a regular schedule and save the captured data incrementally (see the sketch after this list);
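A minimal sketch of this capture step, reusing the hypothetical site and ID range from the script above; the login URL and form field names are assumptions.

```python
# Sketch of the capture step: log in once with a session, traverse IDs, and
# save each raw page untouched; parsing happens later, in a separate step.
import pathlib
import requests

session = requests.Session()
# Simulated logon; URL and form field names are assumptions for this sketch.
session.post("http://example-jobs.com/login",
             data={"user": "bot", "password": "secret"})

out_dir = pathlib.Path("raw_pages")
out_dir.mkdir(exist_ok=True)

for job_id in range(1, 250001):          # 250000 = current maximum serial number
    resp = session.get(f"http://example-jobs.com/job/{job_id}")
    if resp.status_code == 200:
        (out_dir / f"{job_id}.html").write_bytes(resp.content)
```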

4. Simple Analysis
On the Internet-facing server, perform simple analysis and processing on the collected data, including:

Structuring the data: converting the saved pages into structured data eases later transmission and prepares for the subsequent re-sorting and error-screening tasks.
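As a toy illustration of this step (the real system would follow the per-site script; the HTML layout and field names here are assumptions):

```python
# Toy structuring step: turn a saved raw page into a flat record.
# The HTML layout and field names are assumptions for illustration; a real
# system would parse according to the per-site capture script.
import re

def structure(page_html: str) -> dict:
    def field(name: str):
        m = re.search(rf'<span class="{name}">(.*?)</span>', page_html)
        return m.group(1).strip() if m else None
    return {
        "company": field("company"),
        "position": field("position"),
        "region": field("region"),
    }

print(structure('<span class="company">Acme</span>'
                '<span class="position">Engineer</span>'))
```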

Excluding duplicates: when the simulated-query method is used for traversal, some data is inevitably captured more than once. Duplicate data triggers repeated analysis and processing, which not only occupies system resources but also lowers the system's throughput and fills it with junk data, so duplicates must be removed first (one common approach is sketched below).
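The original does not name a specific de-duplication technique; hashing each record's identifying fields, as below, is one common choice.

```python
# Hash-based de-duplication: keep only the first record for each content key.
import hashlib

def dedupe(records):
    seen = set()
    unique = []
    for rec in records:
        # Key on fields that identify a posting; the chosen fields are assumptions.
        key = hashlib.sha1(
            f"{rec['company']}|{rec['position']}|{rec['region']}".encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```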

Changes in the content, structure, or format of a target site may cause the system to fail to capture, or to capture a large volume of erroneous records. By monitoring the data error rate, the system can judge whether a target site has changed and raise an alert in time.
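A minimal sketch of such monitoring; the 20% threshold is an arbitrary illustrative choice.

```python
# Alert when the share of failed or malformed captures crosses a threshold,
# which usually indicates that the target site changed its format.
def check_error_rate(total: int, errors: int, threshold: float = 0.2) -> bool:
    if total == 0:
        return False
    rate = errors / total
    if rate > threshold:
        print(f"ALERT: error rate {rate:.0%} exceeds {threshold:.0%}; "
              "the target site may have changed.")
        return True
    return False
```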

5. Transmit Data Back In-House
The system sends the processed data back to the enterprise through a web service. The main point to consider is how to implement incremental updates; otherwise, pushing the full data set to the local database every day would congest the network.
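A minimal sketch of incremental transfer, assuming records carry an auto-incrementing ID; the web-service call itself is a placeholder.

```python
# Incremental transfer: remember the highest ID already sent and ship only
# records newer than that, instead of the full data set.
import json
import pathlib

STATE = pathlib.Path("last_sent_id.json")

def send_increment(records, post_to_webservice):
    last_id = json.loads(STATE.read_text())["last_id"] if STATE.exists() else 0
    new = [r for r in records if r["id"] > last_id]
    if new:
        post_to_webservice(new)   # placeholder for the real web-service call
        STATE.write_text(json.dumps({"last_id": max(r["id"] for r in new)}))
    return len(new)
```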

6. Data Analysis
The data analysis here differs from the analysis described above on the remote server. That earlier step filters the data simply and effectively, preventing redundancy and thus problems such as slow processing and network congestion. The analysis here aims to ease later manual confirmation and to help marketing staff sort the data quickly. Specifically (a grouping sketch follows the list):
Differentiate by region;
Divide by accuracy, helping users prioritize the most reliable information;
Divide by the number of positions released;
Record how the positions published by each enterprise change over time;
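A sketch of this local grouping, using the assumed record fields from the earlier sketches:

```python
# Group structured records by region and count postings per enterprise.
# Record field names are assumptions carried over from the earlier sketches.
from collections import Counter, defaultdict

def analyze(records):
    by_region = defaultdict(list)
    postings_per_company = Counter()
    for rec in records:
        by_region[rec["region"]].append(rec)
        postings_per_company[rec["company"]] += 1
    return by_region, postings_per_company
```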

7. Manual Confirmation
This part focuses on two aspects:
1. Provide friendly man-machine interfaces for manually confirming this information;
2. Compare against the job library of the yingcai network and extract the differences for manual confirmation (a sketch of this comparison follows):
By communicating with the marketing staff, the team learns which information interests them, delivers data matching those expectations, and completes the manual confirmation.
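A sketch of the comparison in item 2 above, assuming each record in both libraries can be keyed by its company and position (an assumption for illustration):

```python
# Extract captured records that are absent from our own job library, so that
# only the differences are passed on for manual confirmation.
def diff_against_job_library(captured, job_library):
    known = {(r["company"], r["position"]) for r in job_library}
    return [r for r in captured
            if (r["company"], r["position"]) not in known]
```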

8. Statistical Summary
The statistical summary function is also an important part of the data aggregation system. The system provides the following kinds of statistics (a sketch of the first follows the list):

Per-site statistics of the enterprises, positions, and other information newly added on each website per day;
Tracking of large enterprises, with statistics on the posts they publish across the various websites;
Summaries of all kinds of information by day, week, and month;
Statistics by region, enterprise, and position;
Others.
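A sketch of the first summary, counting new postings per site per day; the record fields and example data are assumptions.

```python
# Count newly added postings per site per day.
# Assumes each record carries "site" and "date" (YYYY-MM-DD) fields.
from collections import Counter

def daily_new_postings(records):
    return Counter((r["site"], r["date"]) for r in records)

print(daily_new_postings([
    {"site": "example-jobs.com", "date": "2009-03-01"},
    {"site": "example-jobs.com", "date": "2009-03-01"},
]))   # Counter({('example-jobs.com', '2009-03-01'): 2})
```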

[Figure: simulated statistics summary page]

Reference: http://www.cbinews.com/solution/news/2528.html
