Capturing Web Page Data (Demo Available)

Source: Internet
Author: User

Background

I once worked on a price comparison system at my company. It captured the prices of products on other websites, matched them to our own products, and displayed the results so that product managers (PMs) had a reference when setting prices. Later, a colleague's friend was looking for a job, and a headhunter asked him to write a program that captures the lowest-priced tickets online, so I helped him through the whole process. This article provides the source code of that program and discusses the key points of capturing web page information. The Demo is written in C# and runs in Visual Studio 2012.

Project Structure Overview

The following is the Demo project structure:

Running Result

The following figure shows the running result of the Demo:

Train of Thought and Problem Analysis

  • I think capturing web page information falls into two phases: 1. Identify the target page and its parameters, and fetch the page's source. 2. Extract the information we need from that source and convert it into C# objects.
  • The class in the HttpHelper.cs file, which I found online, is responsible for setting the target page address and related parameters. It is said to let you ignore cookies, certificates, and other verification. It works well and I recommend it, so the first phase is the easier one to accomplish.
  • The difficulty lies in the second phase: how do we capture the useful information in the HTML source (or JSON data) and convert it into the C# objects we need? The Demo obtains JSON data, captures part of it with regular expressions, and converts it to a list of entity classes. AsyncRegexHelper in the Demo is a helper class for asynchronous regular expression matching. Regular expression matching often runs into seemingly endless backtracking; this helper class runs the match asynchronously with a timeout. The problem is that regular expression matching is unreliable, hard to write, and hard to extend. We plan to use Html Agility Pack for data matching instead. Any advice is welcome. Thank you.
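
The second phase described above (regular expression extraction with a timeout, then conversion to a list of entity objects) might look roughly like the following sketch. It is not the Demo's AsyncRegexHelper, whose source is not reproduced in this article. TicketPrice, PriceExtractor, ExtractAsync, and the JSON shape are hypothetical names chosen only for illustration. The sketch relies on the match timeout that the Regex constructor accepts since .NET Framework 4.5, and runs the match on a worker thread with Task.Run.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical entity class; the Demo's real entity classes are not shown in the article.
public class TicketPrice
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

public static class PriceExtractor
{
    // Assumed record shape: {"name":"A-101","price":560.5}
    private const string Pattern =
        @"\{\s*""name""\s*:\s*""(?<name>[^""]+)""\s*,\s*""price""\s*:\s*(?<price>\d+(?:\.\d+)?)\s*\}";

    // Runs the match on a worker thread. The Regex match timeout stops
    // runaway backtracking instead of letting the match hang.
    public static Task<List<TicketPrice>> ExtractAsync(string source, TimeSpan timeout)
    {
        return Task.Run(() =>
        {
            var results = new List<TicketPrice>();
            var regex = new Regex(Pattern, RegexOptions.None, timeout);
            try
            {
                // Matches() is evaluated lazily, so a timeout surfaces while enumerating.
                foreach (Match m in regex.Matches(source))
                {
                    results.Add(new TicketPrice
                    {
                        Name = m.Groups["name"].Value,
                        Price = decimal.Parse(m.Groups["price"].Value,
                                              CultureInfo.InvariantCulture)
                    });
                }
            }
            catch (RegexMatchTimeoutException)
            {
                // Give up on this input and return whatever was matched so far.
            }
            return results;
        });
    }
}
```

A caller could pass the fetched source and a budget, for example PriceExtractor.ExtractAsync(json, TimeSpan.FromSeconds(2)), and then await the task or read its Result. Returning the partial list on timeout is one possible design choice; rethrowing or logging the failure may suit other cases better.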

Summary

I am not a strong writer, so thank you for your support. The source code is provided. Please share your ideas with us: we hope to build a general-purpose capture system in which you only need to enter a website address and a few simple rules to obtain the information you need.
