Htmlparser.net Reference

Source: Internet
Author: User
Tags: lexer

Example 1:

using System;
using System.IO;
using System.Text;
using System.Windows.Forms;
using Winista.Text.HtmlParser;
using Winista.Text.HtmlParser.Lex;
using Winista.Text.HtmlParser.Util;
using Winista.Text.HtmlParser.Tags;

private void Button1_Click(object sender, EventArgs e)
{
    // We could use a stream to load an HTML file from the local disk,
    // or a URI to load a web page from the Internet.
    byte[] htmlBytes = Encoding.UTF8.GetBytes(this.textBox1.Text);
    MemoryStream memStream = new MemoryStream(htmlBytes);
    InputStreamSource input = new InputStreamSource(memStream, "utf-8");
    Page page = new Page(input);
    Lexer lex = new Lexer(page);

    if (this.textBox1.Text.Length <= 0)
        return;

    // Here the HTML is read directly from the text box.
    Lexer lexer = new Lexer(this.textBox1.Text);
    Parser parser = new Parser(lexer);
    NodeList htmlNodes = parser.Parse(null);
    this.treeView1.Nodes.Clear();
    this.treeView1.Nodes.Add("root");
    TreeNode treeRoot = this.treeView1.Nodes[0];
    for (int i = 0; i < htmlNodes.Count; i++)
    {
        this.RecursionHtmlNode(treeRoot, htmlNodes[i], false);
    }
}

private void RecursionHtmlNode(TreeNode treeNode, INode htmlNode, bool siblingRequired)
{
    if (htmlNode == null || treeNode == null) return;

    TreeNode current = treeNode;

    // The current node
    if (htmlNode is ITag)
    {
        ITag tag = (htmlNode as ITag);
        if (!tag.IsEndTag())
        {
            string nodeString = tag.TagName;
            if (tag.Attributes != null && tag.Attributes.Count > 0)
            {
                if (tag.Attributes["ID"] != null)
                    nodeString = nodeString + " {id=\"" + tag.Attributes["ID"].ToString() + "\"}";
                if (tag.Attributes["CLASS"] != null)
                    nodeString = nodeString + " {class=\"" + tag.Attributes["CLASS"].ToString() + "\"}";
                if (tag.Attributes["STYLE"] != null)
                    nodeString = nodeString + " {style=\"" + tag.Attributes["STYLE"].ToString() + "\"}";
                if (tag.Attributes["HREF"] != null)
                    nodeString = nodeString + " {href=\"" + tag.Attributes["HREF"].ToString() + "\"}";
            }
            current = new TreeNode(nodeString);
            treeNode.Nodes.Add(current);
        }
    }

    // The child nodes
    if (htmlNode.Children != null && htmlNode.Children.Count > 0)
    {
        this.RecursionHtmlNode(current, htmlNode.FirstChild, true);
    }

    // The sibling nodes
    if (siblingRequired)
    {
        INode sibling = htmlNode.NextSibling;
        while (sibling != null)
        {
            this.RecursionHtmlNode(treeNode, sibling, false);
            sibling = sibling.NextSibling;
        }
    }
}


HtmlParser is an HTML parsing library written in pure Java that does not depend on any other Java libraries (the code above uses its C# port); it is used mainly to transform or extract HTML, and it parses HTML quickly and without errors.
It is no exaggeration to say that HtmlParser is currently the best tool available for parsing and processing HTML.

Whether you want to crawl web data or transform HTML content, HtmlParser will leave you full of praise.

Downloads are available for both the C# version and the Java version of HtmlParser.

A simple tutorial:

(1) Data organization analysis:

HtmlParser mainly relies on Node, AbstractNode, and Tag to represent HTML; Remark and Text are relatively simple and are not covered here.

Node is the basis of the tree structure that represents HTML, and every data representation implements the Node interface. Node ties the tree structure to a Page object and defines methods to get the parent, child, and sibling nodes, to convert a node back to its HTML text, to report the node's start and end positions in the source, to apply filters, and to accept visitors.
AbstractNode is the concrete implementation of Node that acts as the tree structure. Apart from the Accept method tied to each specific node and the ToString, ToHtml, and ToPlainTextString methods, AbstractNode implements most of the basic methods, so its subclasses do not have to care about the tree operations themselves.
Tag is where most of the concrete analysis happens. Tags fall into two classes: composite tags that can contain other tags, whose base class is CompositeTag and whose 27 subclasses include BodyTag, Div, FrameSetTag, OptionTag, and so on; and simple tags that cannot contain other tags, namely the eight classes BaseHrefTag, DoctypeTag, FrameTag, ImageTag, InputTag, JspTag, MetaTag, and ProcessingInstructionTag.
Node is divided into three categories (a short sketch that tells them apart follows this list):

RemarkNode: represents a comment in the HTML.
TagNode: a tag node; this is the most numerous kind of node, and the concrete tag classes are all TagNode implementations.
TextNode: a text node.
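
As an illustration, the three categories can be told apart at runtime roughly as in the minimal sketch below. It assumes that the RemarkNode class is reachable from the namespaces imported in Example 1 and that ToPlainTextString(), mentioned above for AbstractNode, is available through INode.

// Hedged sketch: classify a node into the three categories above.
static string Describe(INode node)
{
    if (node is ITag)                              // TagNode and its concrete tag subclasses
        return "tag <" + ((ITag)node).TagName + ">";
    if (node is RemarkNode)                        // comment node (namespace assumed)
        return "comment";
    return "text: " + node.ToPlainTextString();    // TextNode
}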
(2) Accessing HTML with a visitor:
1. The overall parsing process (a sketch of the whole flow follows this outline):
Construct a Parser from a URL or a page string.
Construct a visitor to use with this parser.
Call Parser.VisitAllNodesWith(visitor) to traverse the nodes.
Read the data the visitor collected during the traversal.
2. The visit process:
Before parsing starts: Visitor.BeginParsing() is called.
Each node that is read accepts the visitor.
After parsing finishes: Visitor.FinishedParsing() is called.
3. Getting the nodes: step through the HTML and parse out the nodes. This part is fairly complex and our applications do not need to understand it in depth, so it is skipped here.
4. Node access:
Node access follows the visitor pattern; the key pieces are the node's Accept method and the concrete visitor's Visit methods.
The three kinds of node accept a visitor differently:
All tag nodes share one Accept method, the one defined on TagNode. It first checks whether the tag is an end tag: if so it calls visitor.VisitEndTag(this), otherwise visitor.VisitTag(this).
A TextNode calls visitor.VisitStringNode(this).
A RemarkNode calls visitor.VisitRemarkNode(this).
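
Putting steps 1 through 4 together, a full traversal looks roughly like the sketch below. It reuses HtmlPage, the visitor from the code example at the end of this article; the HTML string and the encoding are placeholders, and BeginParsing/FinishedParsing are assumed to be invoked inside VisitAllNodesWith as in the Java library.

// Hedged sketch of the overall visitor flow; step numbers match the outline above.
string htmlCode = "<html><body><a href=\"http://example.com\">link</a></body></html>";
Parser parser = Parser.CreateParser(htmlCode, "utf-8");   // 1. build a parser from a page string
HtmlPage visitor = new HtmlPage(parser);                  // 2. build a visitor for this parser
parser.VisitAllNodesWith(visitor);                        // 3. traverse the nodes (BeginParsing and
                                                          //    FinishedParsing are called around the walk)
NodeList bodyNodes = visitor.Body;                        // 4. read what the visitor collected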

In fact, the four Visit methods on NodeVisitor are all empty, because different visitors treat the three kinds of node differently. For the nodes you care about, simply override the corresponding Visit method; nodes you do not handle are ignored. If you write your own visitor, you can therefore handle the different node types flexibly.
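
For example, a custom visitor that only cares about start tags could look like the sketch below. The override signature is assumed to mirror the Java NodeVisitor; the class and field names are made up for illustration.

// Hedged sketch: count every <A> start tag seen during the traversal.
class LinkCountingVisitor : NodeVisitor
{
    public int LinkCount;                        // result collected during the walk

    public override void VisitTag(ITag tag)      // assumed signature, mirroring visitTag(Tag) in Java
    {
        if (tag.TagName == "A")
            LinkCount++;
    }
    // VisitEndTag, VisitStringNode and VisitRemarkNode keep their empty defaults.
}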

The system ships with the eight visitors introduced below. They can be seen as examples of how to write visitors that access HTML in various ways, because in practice, to really use HtmlParser, you will usually need your own visitor; the simple visitors the system provides, even in combination, are rarely enough.
(3) Introduction to the built-in visitors:
ObjectFindingVisitor: finds all nodes of a specified type; use GetTags() to obtain the results (see the sketch after this list).
StringBean: fetches the HTML of a given URL with the code between <SCRIPT></SCRIPT> and <PRE></PRE> removed, or, used as a visitor, strips the code inside those two tags; use StringBean.GetStrings() to obtain the results.
HtmlPage: extracts the Title, the nodes inside Body, and the TableTag nodes of a page.
LinkFindingVisitor: counts how many links a node contains.
StringFindingVisitor: counts how many times a given string occurs in the traversed TextNodes.
TagFindingVisitor: finds all nodes of the specified tags; several tag types can be given at once.
TextExtractingVisitor: strips all tags from the page and extracts the text. This visitor is sometimes very useful for extracting plain text; just note that tag attributes are removed as well, so only the text between tags remains, and, for example, the link targets inside <a> tags are lost too.
UrlModifyingVisitor: modifies the links in a web page.
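
As a quick illustration of the first visitor in the list, the sketch below uses ObjectFindingVisitor to collect every LinkTag. The constructor is assumed to take a System.Type, mirroring the Class argument of the Java version; the HTML string is a placeholder.

// Hedged sketch: gather all LinkTag nodes with ObjectFindingVisitor.
Parser parser = Parser.CreateParser("<body><a href=\"a.html\">a</a></body>", "utf-8");
ObjectFindingVisitor visitor = new ObjectFindingVisitor(typeof(LinkTag));   // assumed constructor
parser.VisitAllNodesWith(visitor);
NodeList links = visitor.GetTags();              // GetTags() returns the nodes that were found
System.Console.WriteLine(links.Size());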
(4) Filter:
 
If a visitor extracts information while traversing, and that information may be whole nodes or something more refined derived from analysing them, depending on how our visitor is written, then a filter has a single clear goal: extracting nodes. So if you want to use HtmlParser, you should first familiarize yourself with the data organization described above.
 
The system defines 17 concrete filters, including filters based on a node's parent-child relationships, filters that combine other filters, filters that match on page content, and so on. We can also implement NodeFilter ourselves to build a custom filter that extracts exactly the nodes we want (a sketch of such a filter follows).
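
A hand-written filter might look like the sketch below. It assumes the .NET port declares NodeFilter with a single bool Accept(INode) member, mirroring the Java interface; the class name is made up for illustration.

// Hedged sketch: keep only <IMG> tags that carry an ALT attribute.
// If NodeFilter is an abstract base class rather than an interface in this port,
// the Accept method would need the override modifier.
class ImageWithAltFilter : NodeFilter
{
    public bool Accept(INode node)
    {
        ITag tag = node as ITag;
        return tag != null
            && tag.TagName == "IMG"
            && tag.Attributes != null
            && tag.Attributes["ALT"] != null;
    }
}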
Filters are invoked independently of visitors; nothing stops you from first filtering out a NodeList and then visiting it with a visitor. A filter is invoked like this:

NodeList nodeList = myParser.Parse(someFilter);

After parsing we can call:

INode[] nodes = nodeList.ToNodeArray();

to get an array of nodes, or access an element directly with:

INode node = nodeList.ElementAt(i);

In addition, once a filter has produced a NodeList, we can still call the NodeList's ExtractAllNodesThatMatch(someFilter) to filter it further, or call the NodeList's VisitAllNodesWith(someVisitor) to visit it further.
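
Chaining the two looks roughly like the sketch below. HasAttributeFilter is assumed to exist in the .NET port as it does in the Java library; LinkCountingVisitor is the custom visitor sketched in section (2), and myParser is the placeholder parser from above.

// Hedged sketch: filter during parsing, then narrow and visit the result again.
NodeList links = myParser.Parse(new TagNameFilter("A"));
NodeList withTarget = links.ExtractAllNodesThatMatch(new HasAttributeFilter("target"), true);
withTarget.VisitAllNodesWith(new LinkCountingVisitor());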
In this way, HtmlParser gives us a very convenient way to parse HTML: for different applications we can either traverse the HTML nodes with a visitor to pull out data, or filter the nodes first, extract only the ones we care about, and then process them. With these two mechanisms combined, we can certainly find the information we need.

Code example:

Using the C# version of HtmlParser to get all the links in a piece of HTML (the Java version is similar):
string htmlCode = "<HTML><HEAD><TITLE>AAA</TITLE></HEAD><BODY>" + ... + "</BODY>";
Parser parser = Parser.CreateParser(htmlCode, "GBK");
HtmlPage page = new HtmlPage(parser);
try
{
    parser.VisitAllNodesWith(page);
}
catch (ParserException e1)
{
    e1 = null;
}
NodeList nodeList = page.Body;
NodeFilter filter = new TagNameFilter("A");
nodeList = nodeList.ExtractAllNodesThatMatch(filter, true);
for (int i = 0; i < nodeList.Size(); i++)
{
    LinkTag link = (LinkTag)nodeList.ElementAt(i);
    System.Console.Write(link.GetAttribute("href") + "\n");
}

