There are many HTML parsers; the most commonly used are HtmlAgilityPack and SgmlReader (http://sourceforge.net/projects/dekiwiki/files/SgmlReader ).
Here we use HtmlAgilityPack:
http://htmlagilitypack.codeplex.com
The official website also provides a tool that automatically generates XPath paths.
For more information about XPath expressions and related tutorials, see: Selection of XPath expressions [updating...]
There are many ways to obtain HTML:
1. Simulate a login and fetch the page with the HttpWebRequest class.
2. Simulate a login with a third-party control. For details, refer to: migrating resumes.
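The first approach can be sketched roughly as follows. This is a minimal sketch, not a complete login flow: the names PageFetcher, CreateRequest and FetchHtml are hypothetical, the UTF-8 encoding is an assumption about the target page, and the login step itself (posting credentials with the same CookieContainer) is left out because it depends on the site. On recent .NET versions WebRequest.Create is marked obsolete but still works.

```csharp
using System.IO;
using System.Net;
using System.Text;

public static class PageFetcher
{
    // Builds a GET request that shares a cookie container, so cookies set by an
    // earlier login response are sent along with later requests.
    public static HttpWebRequest CreateRequest(string url, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.CookieContainer = cookies;
        request.UserAgent = "Mozilla/5.0"; // some sites reject requests without a user agent
        return request;
    }

    // Sends the request and returns the response body as a string.
    // Assumes the page is UTF-8 encoded; adjust the encoding for other pages.
    public static string FetchHtml(string url, CookieContainer cookies)
    {
        HttpWebRequest request = CreateRequest(url, cookies);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Reusing one CookieContainer across the login request and the later page requests is what keeps the session alive between calls.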
Usage:
First, add a reference to the HtmlAgilityPack DLL and import its namespace: using HtmlAgilityPack;
A function that extracts content based on XPath:
/// <summary>
/// Returns the text of all nodes that match an XPath expression.
/// </summary>
/// <param name="content">HTML content to extract from</param>
/// <param name="xpath">XPath expression</param>
/// <param name="separ">Delimiter appended after each node's text</param>
/// <returns>Extracted content</returns>
public static string GetStrByXPath(string content, string xpath, string separ)
{
    HtmlDocument doc1 = new HtmlDocument();
    doc1.LoadHtml(content);
    HtmlNodeCollection repeatNodes = doc1.DocumentNode.SelectNodes(xpath);
    string text = "";
    // SelectNodes returns null when nothing matches, so guard before looping
    if (repeatNodes == null)
    {
        return text;
    }
    // Loop over the matched nodes
    foreach (HtmlNode node in repeatNodes)
    {
        text += node.InnerText + separ;
    }
    return text;
}
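As an illustration of how such a function might be called, here is a small self-contained sketch. It is a hypothetical variant, not the function above: the names XPathDemo and JoinByXPath are made up for this example, and it joins the matched text with string.Join so that no delimiter trails the last item. It assumes the HtmlAgilityPack assembly is referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical variant: joins the matched nodes' text with the delimiter
    // instead of appending the delimiter after every node.
    public static string JoinByXPath(string content, string xpath, string separ)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(content);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return ""; // SelectNodes returns null when nothing matches
        }
        return string.Join(separ, nodes.Select(n => n.InnerText));
    }

    public static void Main()
    {
        // Inline sample HTML; in practice this would be the fetched page
        string html = "<ul><li>apple</li><li>pear</li></ul>";
        Console.WriteLine(JoinByXPath(html, "//li", ", "));
    }
}
```

Whether a trailing delimiter is wanted depends on how the result is consumed. Appending after every node makes it easy to concatenate several results, while joining gives a cleaner string for display.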