C# + HtmlAgilityPack + XPath: a guide to collecting data (using weather data as an example)


Contents: 1. HtmlAgilityPack Introduction 2. XPath Introduction and Usage 3. Weather Collection Case 4. Resources

I first came across HtmlAgilityPack five years ago. Through some unexpected circumstances I was temporarily transferred from the technical department to the sales department, where I was responsible for setting up some processes and finding potential customers. I eventually found a lot of very comprehensive customer information on Alibaba. At first I copied it into Excel by hand, which was exhausting. My C# was still weak at the time, but I wanted a program that could fetch the data in bulk (it pays to keep asking how things could be done better). After many twists and turns I found HtmlAgilityPack, and it turned out to be exactly the right tool. Over the years I have used it to collect many kinds of data, especially football tournament data and weather data. This article sums up how I use it, so that more people can learn the library and make their own work easier.

This article covers a basic introduction to HtmlAgilityPack, how to use it, and some actual code. At the end, weather data serves as an example to walk through the analysis behind a real collection job, with simple code. The weather database and the C# code that operates on it will be published in the next article. Only the core of the collection process is introduced here. The core code is all present, so you can adapt it yourself, and the rest will be made freely available to anyone who needs it. For the details, see the next article.


1. HtmlAgilityPack Introduction

HtmlAgilityPack is an open-source class library for parsing HTML. Its biggest feature is that it uses XPath to query the HTML, so if you have used C# to work with XML before, HtmlAgilityPack will feel familiar. At the time of writing the current stable version is 1.4.6, available at http://htmlagilitypack.codeplex.com/. The last update was in 2012; the library is stable and its basic functionality is complete, so it has not needed further updates.
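As a minimal sketch of what this looks like in practice, the following loads an HTML string and queries it with XPath. The sample markup and the `HapBasics` class are made up for illustration; `HtmlDocument`, `LoadHtml`, and `SelectNodes` come from the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HapBasics
{
    // Returns the trimmed text of every <a> element that has an href attribute.
    public static string[] LinkTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return new string[0];

        return links.Select(a => a.InnerText.Trim()).ToArray();
    }

    public static void Main()
    {
        // Hypothetical markup, not taken from a real page.
        const string html =
            "<html><body><ul>" +
            "<li><a href='/a.htm'>First</a></li>" +
            "<li><a href='/b.htm'>Second</a></li>" +
            "<li><a>No link</a></li>" +
            "</ul></body></html>";

        foreach (var text in LinkTexts(html))
            Console.WriteLine(text);
    }
}
```

The null check matters: forgetting it is a common source of `NullReferenceException` when a page layout changes and an XPath stops matching.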

Any discussion of HtmlAgilityPack has to mention a companion tool. I do not know how other people analyze page structure when they use the library; I use a tool provided by the official project, called HAPExplorer, and it is very useful. Its use is described below at the point where it is needed.

2. XPath Introduction and Usage

2.1 XPath Introduction

XPath is the XML Path Language, a language for addressing parts of an XML document (XML being a subset of the Standard Generalized Markup Language). XPath is based on the tree structure of XML and provides the ability to find nodes in that tree. It was originally intended as a common syntax model shared by XPointer and XSL, but developers quickly adopted it as a small query language.

XPath is a W3C standard. Its main purpose is to locate nodes in the node tree of an XML 1.0 or XML 1.1 document. There are currently two versions, XPath 1.0 and XPath 2.0: XPath 1.0 became a W3C standard in 1999, and the XPath 2.0 standard was established in 2007. For more information about XPath, see: http://www.w3.org/TR/xpath20/

2.2 XPath Path Expressions

XPath is the query language of XML, and its role is similar to that of SQL. The XML below is used as an example to introduce XPath syntax. Some of the following material comes from notes I took a few years ago while learning XPath, gathered from various websites and blogs whose sources I can no longer find. The examples and text largely follow those references, and I thank their authors again. If you come across a similar article, send me the link and I will add the reference. The XPath expressions below are very basic, but they are enough for most purposes.

<?xml version="1.0" encoding="iso-8859-1"?>
<catalog>
  <cd country="USA">
    <title>Empire Burlesque</title>
    <artist>Bob Dylan</artist>
    <price>10.90</price>
  </cd>
</catalog>

Locating nodes: XML is a tree structure, much like the folder structure of a file system, and XPath resembles file system paths. Unlike a file path, however, an XPath expression is a pattern: it selects every node in the XML document whose path matches the pattern. For example, to select all price elements of the cd elements under catalog, use:

/catalog/cd/price

An XPath that begins with a single slash (/) is an absolute path. An XPath that begins with two slashes (//) selects every element in the document that matches the pattern, regardless of its level in the tree. The following syntax selects all elements named cd, at any level of the tree:

//cd
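The difference between the two forms can be checked with the XPath support built into .NET. This is a small sketch using `System.Xml`; the second cd, nested inside a made-up `box` element, is added only to show that `/` and `//` give different results.

```csharp
using System;
using System.Xml;

public static class XPathPaths
{
    // The catalog from the text, plus a second cd nested one level deeper
    // (an illustrative addition, not part of the original sample).
    public const string Xml =
        "<catalog>" +
        "<cd country=\"USA\"><title>Empire Burlesque</title><artist>Bob Dylan</artist><price>10.90</price></cd>" +
        "<box><cd country=\"UK\"><title>Sample Album</title><artist>Sample Artist</artist><price>9.90</price></cd></box>" +
        "</catalog>";

    // Returns how many nodes the XPath expression selects in the sample document.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("/catalog/cd/price")); // absolute path, direct cd child only: 1
        Console.WriteLine(Count("//cd"));              // cd at any depth: 2
        Console.WriteLine(Count("//price"));           // price at any depth: 2
    }
}
```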

Selecting unknown elements: use an asterisk (*) to select elements whose names are unknown. The following syntax selects all child elements of /catalog/cd:

/catalog/cd/*

The following syntax selects all price elements whose parent is any child element of catalog:

/catalog/*/price

The following syntax selects all price elements that have exactly two levels of ancestors:

/*/*/price

Note that to reach elements regardless of their level, the XPath must start with two slashes (//). The asterisk (*) stands for an element of unknown name only; it cannot stand for an unknown number of levels.
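The point that an asterisk replaces exactly one level can be seen in a short sketch, again with `System.Xml`. The nested `box` element is a made-up addition to the sample catalog.

```csharp
using System;
using System.Xml;

public static class XPathWildcards
{
    // One cd directly under catalog, and one nested inside a hypothetical box element.
    public const string Xml =
        "<catalog>" +
        "<cd><title>Empire Burlesque</title><artist>Bob Dylan</artist><price>10.90</price></cd>" +
        "<box><cd><title>Sample Album</title><artist>Sample Artist</artist><price>9.90</price></cd></box>" +
        "</catalog>";

    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("/catalog/cd/*"));    // title, artist, price: 3
        Console.WriteLine(Count("/catalog/*/price")); // * matches cd or box, but box has no price child: 1
        Console.WriteLine(Count("/*/*/price"));       // price exactly two levels below the root: 1
        Console.WriteLine(Count("//price"));          // any level: 2
    }
}
```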

Selecting a branch: use square brackets to select a particular branch. The following syntax takes the first cd element among the children of catalog. XPath positions start at 1; there is no element number 0.

/catalog/cd[1]

The following syntax selects the last cd element in catalog. (XPath does not define a first() function; use [1], as in the example above, to select the first element.)

/catalog/cd[last()]

The following syntax selects all /catalog/cd elements whose price element has a value equal to 10.90:

/catalog/cd[price=10.90]
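The three predicate forms can be tried out as follows. This is a sketch with `System.Xml`; the second and third cd elements are invented so that the predicates have something to choose between.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class XPathPredicates
{
    // Album B and Album C are illustrative additions to the sample catalog.
    public const string Xml =
        "<catalog>" +
        "<cd><title>Empire Burlesque</title><price>10.90</price></cd>" +
        "<cd><title>Album B</title><price>9.90</price></cd>" +
        "<cd><title>Album C</title><price>10.90</price></cd>" +
        "</catalog>";

    // Returns the titles of the cd elements selected by the given XPath.
    public static string[] Titles(string cdXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc.SelectNodes(cdXPath)
                  .Cast<XmlNode>()
                  .Select(cd => cd.SelectSingleNode("title").InnerText)
                  .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Titles("/catalog/cd[1]")));           // Empire Burlesque
        Console.WriteLine(string.Join(", ", Titles("/catalog/cd[last()]")));      // Album C
        Console.WriteLine(string.Join(", ", Titles("/catalog/cd[price=10.90]"))); // Empire Burlesque, Album C
    }
}
```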

Selecting attributes: besides elements, XPath can also select attributes. Attribute names are prefixed with @. For example, to select all attributes named country in a document:

//@country

The following syntax selects the cd elements whose country attribute has the value UK:

//cd[@country='UK']
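A short sketch of attribute selection with `System.Xml` follows. The UK cd is an invented entry, and the `TitlesByCountry` helper is a hypothetical name used only for this illustration.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class XPathAttributes
{
    // The UK entry is an illustrative addition to the sample catalog.
    public const string Xml =
        "<catalog>" +
        "<cd country=\"USA\"><title>Empire Burlesque</title></cd>" +
        "<cd country=\"UK\"><title>Sample Album</title></cd>" +
        "</catalog>";

    private static XmlDocument Load()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        return doc;
    }

    // //@country selects the attribute nodes themselves, in document order.
    public static string[] CountryValues()
    {
        return Load().SelectNodes("//@country")
                     .Cast<XmlNode>()
                     .Select(attr => attr.Value)
                     .ToArray();
    }

    // Filters cd elements by attribute value, then steps down to their titles.
    // The country value is concatenated directly, so pass trusted input only.
    public static string[] TitlesByCountry(string country)
    {
        return Load().SelectNodes("//cd[@country='" + country + "']/title")
                     .Cast<XmlNode>()
                     .Select(title => title.InnerText)
                     .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CountryValues()));       // USA, UK
        Console.WriteLine(string.Join(", ", TitlesByCountry("UK"))); // Sample Album
    }
}
```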

3. Weather Site Collection Case

3.1 Requirements Analysis

We want to collect weather information for cities across the whole country. The site is http://www.tianqihoubao.com/. Its data falls into two kinds: historical data, covering 2011 to the present, and weather forecast data. The historical data is the weather as actually recorded. The collection must cover the country's major cities, and preferably all cities. Analysis of the site's pages shows that it does meet these requirements. The weather information includes the actual weather conditions, the wind, and the temperature, the latter given as a range from lowest to highest.

With these basic requirements in mind, we go to the site and analyze its general features and the structure of its main pages.

3.2 Web Page Structure Analysis

To collect a large amount of information, the site's pages must be analyzed and summarized in detail. Machine collection is not manual work: the program has to construct URLs dynamically, request the page HTML, and then parse it. Analyzing the page structure is therefore the first step, and a very important one. We start with the history index page: http://www.tianqihoubao.com/lishi/

This index page is organized by province. The link for each province and prefecture-level city follows a fixed format; only the pinyin abbreviation differs. The first city listed in each province is the provincial capital. If needed, the program can distinguish provincial capitals from other cities, although this can also be skipped: there are only thirty-odd capitals, and tagging them by hand is quick. On this page we mainly collect the abbreviation of each province. We then pick a province and click through to see its cities, for example Liaoning: http://www.tianqihoubao.com/lishi/ln.htm
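Collecting the province abbreviations could be sketched as below. This is not the author's code: it assumes that province links end in `lishi/<abbreviation>.htm`, as the Liaoning URL above does, and the sample markup (including the `bj` abbreviation and the `.html` city link) is made up for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ProvinceLinks
{
    // Matches hrefs such as "/lishi/ln.htm" and captures the abbreviation ("ln").
    // Assumption: province pages end in ".htm"; anything else is ignored.
    private static readonly Regex ProvinceHref =
        new Regex(@"(?:^|/)lishi/([a-z]+)\.htm$", RegexOptions.IgnoreCase);

    // Returns a map from province abbreviation to the link text.
    public static Dictionary<string, string> Parse(string html)
    {
        var result = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return result; // no links on the page

        foreach (var a in links)
        {
            var match = ProvinceHref.Match(a.GetAttributeValue("href", ""));
            if (match.Success)
                result[match.Groups[1].Value.ToLowerInvariant()] = a.InnerText.Trim();
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for the real index page.
        const string html =
            "<div>" +
            "<a href='/lishi/ln.htm'>Liaoning</a>" +
            "<a href='/lishi/bj.htm'>Beijing</a>" +
            "<a href='/lishi/dalian.html'>Dalian</a>" +
            "</div>";

        foreach (var pair in Parse(html))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```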

Similarly, each area within a province has its own link, in a format similar to the one above and named after the city's pinyin. Each province contains large prefecture-level areas, and each of these is subdivided into smaller counties and cities. Clicking the Dalian link, for example, leads to its weather history.

This page lists historical data from January 2011 through 2015, separated by month. The links again follow a fixed pattern that contains the city's pinyin name and the year and month, so they are easy to construct.
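Constructing the monthly links can be sketched as follows. The exact URL pattern is not given in the text, so the format string here (`/lishi/<city>/month/<yyyyMM>.html`) is an assumption that should be checked against the real links before use; only the ingredients (city pinyin, year, and month) come from the description above.

```csharp
using System;
using System.Collections.Generic;

public static class MonthlyUrlBuilder
{
    // Assumed pattern; verify it against an actual monthly link on the site.
    private const string Pattern = "http://www.tianqihoubao.com/lishi/{0}/month/{1:0000}{2:00}.html";

    // Yields one URL per month for every year in [fromYear, toYear].
    public static IEnumerable<string> MonthlyUrls(string cityPinyin, int fromYear, int toYear)
    {
        for (int year = fromYear; year <= toYear; year++)
            for (int month = 1; month <= 12; month++)
                yield return string.Format(Pattern, cityPinyin, year, month);
    }

    public static void Main()
    {
        foreach (var url in MonthlyUrls("dalian", 2011, 2011))
            Console.WriteLine(url);
    }
}
```

Keeping the pattern in one constant means that a wrong guess about the URL layout is a one-line fix.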

I have blocked some of the ads and cleaned up the rest by hand. The monthly page for each city is simple: a plain table filled with the date, weather conditions, temperature, and wind. All of these steps follow the site's links from page to page, so both the process and the information to collect are clear.
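Reading such a table with HtmlAgilityPack might look like the sketch below. The sample HTML, the column order (date, weather, temperature, wind), and the assumption that the first row is a header are all illustrative; the real page should be inspected before relying on them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public sealed class DailyWeather
{
    public string Date { get; set; }
    public string Weather { get; set; }
    public string Temperature { get; set; }
    public string Wind { get; set; }
}

public static class MonthPageParser
{
    // Decodes HTML entities and collapses runs of whitespace inside a cell.
    private static string Clean(HtmlNode cell)
    {
        return Regex.Replace(HtmlEntity.DeEntitize(cell.InnerText), @"\s+", " ").Trim();
    }

    public static List<DailyWeather> Parse(string html)
    {
        var rows = new List<DailyWeather>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Skip the first row of the table, assumed to be the header.
        var trs = doc.DocumentNode.SelectNodes("//table//tr[position()>1]");
        if (trs == null) return rows;

        foreach (var tr in trs)
        {
            var cells = tr.SelectNodes("./td");
            if (cells == null || cells.Count < 4) continue; // not a data row

            rows.Add(new DailyWeather
            {
                Date = Clean(cells[0]),
                Weather = Clean(cells[1]),
                Temperature = Clean(cells[2]),
                Wind = Clean(cells[3])
            });
        }
        return rows;
    }

    public static void Main()
    {
        // Made-up sample data in the assumed layout.
        const string html =
            "<table>" +
            "<tr><td>Date</td><td>Weather</td><td>Temperature</td><td>Wind</td></tr>" +
            "<tr><td> 2011-01-01 </td><td>Sunny /\n Cloudy</td><td>-3 / -9</td><td>North 4-5</td></tr>" +
            "<tr><td>2011-01-02</td><td>Cloudy</td><td>-2 / -8</td><td>North 3-4</td></tr>" +
            "</table>";

        foreach (var day in Parse(html))
            Console.WriteLine(day.Date + " | " + day.Weather + " | " + day.Temperature + " | " + day.Wind);
    }
}
```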

The general idea is this: first collect the pinyin codes of all provinces; then, for each province, obtain every prefecture-level area together with the names and pinyin codes of its counties and cities; finally, loop over every county and city and fetch all of its historical data month by month. The following sections focus on the node structure of a few pages, that is, on how to use HtmlAgilityPack and XPath to get the data you want. As for saving to the database, everyone has their own way of doing it; I used the XCode component.

3.3 Analyzing the Province and County Structure Page

Take Liaoning province again as the example: http://www.tianqihoubao.com/lishi/ln.htm. Open the page, right-click to view the page source, and paste the source into HAPExplorer. You can also open the link directly in HAPExplorer.

HAPExplorer shows the XPath address of the selected node on the right. Below the end of the div come the dl tags, which are the rows we want to collect. The next step is to obtain this structure in code, starting with fetching the page source.
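The original listing is not available here, so the following is only a sketch of one way to do it, not the author's code. It assumes that each dl holds a prefecture-level name in its dt and the county and city links in its dd elements; that layout is inferred from the description above and should be confirmed in HAPExplorer. `HtmlWeb.Load` is the HtmlAgilityPack call that downloads and parses a page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ProvincePageParser
{
    // Downloads and parses a page, e.g. Fetch("http://www.tianqihoubao.com/lishi/ln.htm").
    public static HtmlDocument Fetch(string url)
    {
        var web = new HtmlWeb();
        return web.Load(url);
    }

    // Maps each prefecture-level name (dt) to the hrefs of the links listed under it (dd).
    // Assumed layout: <dl><dt>name</dt><dd><a href="...">county</a>...</dd></dl>
    public static Dictionary<string, List<string>> ParseRegions(HtmlDocument doc)
    {
        var result = new Dictionary<string, List<string>>();

        var dls = doc.DocumentNode.SelectNodes("//dl");
        if (dls == null) return result; // layout differs from the assumption

        foreach (var dl in dls)
        {
            var dt = dl.SelectSingleNode("./dt");
            if (dt == null) continue;

            var hrefs = new List<string>();
            var links = dl.SelectNodes("./dd//a[@href]");
            if (links != null)
                foreach (var a in links)
                    hrefs.Add(a.GetAttributeValue("href", ""));

            result[dt.InnerText.Trim()] = hrefs;
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical markup in the assumed layout; swap in Fetch(url) for a live page.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div><dl><dt>Dalian</dt>" +
            "<dd><a href='/lishi/dalian.html'>Dalian</a> <a href='/lishi/wafangdian.html'>Wafangdian</a></dd>" +
            "</dl></div>");

        foreach (var region in ParseRegions(doc))
            Console.WriteLine(region.Key + ": " + string.Join(", ", region.Value));
    }
}
```

If the downloaded text comes out garbled, the page encoding may need to be set explicitly before loading; `HtmlWeb` exposes an `OverrideEncoding` property for this.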
