Atitit. HTML parser selection: jsoup and NSoup (Java and C#/.NET versions)

Source: Internet
Author: User


 

 

1. Framework selection requirements

1.1. Plenty of documentation

1.2. Cross-platform

2. HTML parser features

2.1. jQuery-style CSS selector

2.2. Manipulating HTML documents

3. How the browser parses HTML and prevents garbled characters

4. Place the meta tag at the beginning of the head area

5. HTML parser

6. Reference

6.1.1. atitit. java parse jsoup into html table read and parsing Summary-  attilax column...

 

1. Framework selection requirements

1.1. Plenty of documentation

1.2. Cross-platform

2. HTML parser features

2.1. jQuery-style CSS selector

· Cleans HTML that comes from untrusted sources

2.2. Manipulating HTML documents

· jQuery-style CSS selector

// Parse an HTML string that is already in memory
NSoup.Nodes.Document doc = NSoup.NSoupClient.Parse(HtmlString);

// Fetch a URL and parse the response
NSoup.Nodes.Document doc = NSoup.NSoupClient.Connect("http://www.oschina.net/").Get();

// Download the bytes with WebClient, decode them as UTF-8, then parse
WebClient webClient = new WebClient();
String HtmlString = Encoding.GetEncoding("UTF-8").GetString(webClient.DownloadData("http://www.oschina.net/"));
NSoup.Nodes.Document doc = NSoup.NSoupClient.Parse(HtmlString);

// Parse a response stream directly, stating its encoding
WebRequest webRequest = WebRequest.Create("http://www.oschina.net/");
NSoup.Nodes.Document doc = NSoup.NSoupClient.Parse(webRequest.GetResponse().GetResponseStream(), "UTF-8");

 

Author: old wow's paw Attilax iron, EMAIL: 1466519819@qq.com

When reprinting, please cite the source: http://www.cnblogs.com/attilax/

 

3. How the browser parses html and prevents garbled characters

Details

HTML documents are transmitted over the Internet as a sequence of bytes accompanied by character encoding information. The encoding can be specified in the HTTP response header sent with the document or in the HTML markup of the document itself. The browser uses this information to convert the bytes into the characters it displays. Without knowing how the page's characters are encoded, the browser cannot render the page correctly. Most browsers buffer a certain number of bytes before executing any JavaScript or drawing the page, and while buffering they look for the character encoding declaration (IE6/7/8 are a notable exception).
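To make the garbling concrete, here is a minimal Java sketch. The class name `DecodeDemo` and the sample word are illustrative assumptions, not part of the original article. It decodes the same bytes once with the charset they were written in and once with a wrong one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Turn raw bytes into text using the given charset, as a browser must do.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an accented e; UTF-8 stores that letter as two bytes.
        byte[] wire = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(wire, StandardCharsets.UTF_8));      // correct charset
        System.out.println(decode(wire, StandardCharsets.ISO_8859_1)); // wrong charset: garbled
    }
}
```

With the wrong charset, the two UTF-8 bytes of the accented letter (C3 A9) are each mapped to a separate character, so the word comes out one character longer and unreadable. This is the effect a browser produces when it guesses the encoding incorrectly.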

The number of bytes buffered varies by browser, and so does the default encoding used when no declaration is found. In any browser, however, if a declaration that differs from the default turns up only after enough bytes have been buffered and rendering has begun, the document is re-parsed and the page repainted. If the encoding change affects external resources (such as CSS, JS, and media), the browser may even request those resources again.

To avoid these delays, specify the character encoding as early as possible for any HTML document larger than 1 KB (precisely 1024 bytes, which is the maximum buffer limit across all the browsers we tested).

Suggestions

Specify encoding through HTTP header information or meta tag

There are several ways to specify encoding for HTML documents:

Server: set the encoding in the web server configuration so that every text/html document is sent with a Content-Type header carrying the correct encoding. For example, Content-Type: text/html; charset=UTF-8

Client: include a meta tag with http-equiv="content-type" in the HTML code and specify the character encoding there. For example, <meta http-equiv="content-type" content="text/html; charset=UTF-8">

If possible, configure your web server to send the character encoding in the HTTP header. Some browsers (such as Firefox) use a shorter buffering delay than other browsers before executing JavaScript when the encoding is specified in the header. That is, they can skip the check of the HTML markup, which reduces both the number of bytes buffered and the delay.
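On the receiving side, reading the encoding from the header can be sketched roughly as follows. This is a simplified illustration, not how any particular browser or library does it. The class and method names (`ContentTypeCharset`, `charsetOf`) are made up for this example.

```java
import java.nio.charset.Charset;
import java.util.Optional;

class ContentTypeCharset {
    // Extracts the charset parameter from a Content-Type header value, if present.
    static Optional<Charset> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters follow the media type, separated by semicolons.
        for (String param : contentType.split(";")) {
            int eq = param.indexOf('=');
            if (eq < 0 || !param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                continue;
            }
            String name = param.substring(eq + 1).trim().replace("\"", "");
            try {
                return Optional.of(Charset.forName(name));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: treat as "not specified".
                return Optional.empty();
            }
        }
        return Optional.empty();
    }
}
```

When the result is empty, the caller has to fall back to inspecting the document itself, which is the slower path described above.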

 

4. Place the meta tag at the beginning of the head area.

If you cannot modify the web server configuration, you must use the meta tag to specify the encoding. Make sure that this meta tag is the first child element of the head tag in the document. The browser searches for the character encoding declaration in the first 1024 bytes of the document, so to avoid a performance penalty, the earlier the declaration appears in the document head, the better. In certain cases, if the meta tag is not the first child element of the head ...
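The following Java sketch shows why the position of the meta tag matters. It looks for a charset declaration only within the first 1024 bytes, as the text describes. The regular expression is a deliberate simplification and does not reproduce a browser's real prescan. The names `MetaCharsetSniffer` and `sniff` are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {
    // Size of the window searched for a declaration, taken from the text above.
    static final int SNIFF_LIMIT = 1024;

    // Simplified pattern: a charset=... value somewhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name if it appears within the first SNIFF_LIMIT bytes.
    static Optional<String> sniff(byte[] document) {
        int length = Math.min(document.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to one char, so ASCII markup stays readable
        // whatever the real encoding is (this does not cover UTF-16 documents).
        String head = new String(document, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(head);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

A declaration placed after a long comment or a large inline script falls outside the window and is not found by this check, which corresponds to the case where the browser starts with its default encoding and may have to re-parse later.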

5. HTML Parser

The HTML parser parses HTML markup into a parse tree.

5.0.1.1. HTML syntax definition

The vocabulary and syntax of HTML are defined in specifications created by the W3C. At the time of writing, the current version is HTML4, and work on HTML5 is in progress.

5.0.1.2. Not a context-free grammar

As the introduction to parsers showed, a grammar can be defined formally in a notation such as BNF. Unfortunately, none of the conventional parser topics apply to HTML (they were not brought up just for fun: they are used for parsing CSS and JavaScript). HTML cannot be defined by the context-free grammar that such parsers need. There is a formal format for defining HTML, the DTD (Document Type Definition), but it is not a context-free grammar.

HTML is quite similar to XML, and many XML parsers are available. There is also an XML variant of HTML, namely XHTML. So what is the main difference? HTML is more "forgiving": it allows you to omit certain start or end tags. It is a "soft" syntax, not as rigid as XML. This seemingly subtle difference creates two different worlds. On the one hand, it is the main reason HTML is so popular: it forgives your mistakes and makes life easy for web authors. On the other hand, it makes a formal grammar difficult to write. Therefore HTML parsing is not simple, and conventional context-free parsers, including XML parsers, are not feasible.
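The difference in strictness can be observed with the XML parser that ships with the JDK. This is a small sketch. The class name `StrictXmlCheck` and the sample markup are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictXmlCheck {
    // Returns true only if the markup is well-formed XML.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Every tag closed: acceptable to an XML parser.
        System.out.println(parsesAsXml("<html><body><p>one</p></body></html>"));
        // Unclosed <p> tags: tolerated by browsers, rejected by an XML parser.
        System.out.println(parsesAsXml("<html><body><p>one<p>two</body></html>"));
    }
}
```

A browser accepts both inputs and builds a tree for each, while the XML parser stops at the first mismatched tag in the second one.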

5.0.1.3. Parsing algorithm

As shown above, HTML cannot be parsed with a regular top-down or bottom-up parser.

The reasons are as follows:

1. The forgiving nature of the language.

2. Browsers must be error tolerant and support invalid HTML.

3. The parsing process is reentrant. Normally the source does not change during parsing, but in HTML a script tag containing "document.write" can add content, so the parsing process actually modifies its own input.

The browser creates its own parser to parse HTML documents.

The parsing algorithm is described in the HTML5 specification. It consists of two stages: tokenization and tree construction.

Tokenization is the lexical analysis stage. It parses the input into a sequence of tokens. In HTML, the tokens are start tags, end tags, attribute names, and attribute values.

The tokenizer recognizes each token and hands it to the tree constructor. It then goes on to recognize the next token, and so on until the end of the input.

5.0.1.4. Tokenization algorithm

The output of the algorithm is an HTML token. The algorithm is expressed as a state machine. Each state consumes one or more characters from the input stream and selects the next state according to them. The decision is influenced by the current tokenization state and by the tree construction state, which means that the same character may produce different results depending on the current state. The full algorithm is too complex to describe here, so a simple example illustrates the principle.
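As a rough illustration of the state-machine idea, the Java sketch below uses only three states, named after the "Data", "Tag open", and "Tag name" states that the walkthrough below traces. It is a heavy simplification and not the algorithm from the specification: attributes, comments, and character references are not handled, and the class name `MiniTokenizer` and the string form of the tokens are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

class MiniTokenizer {
    enum State { DATA, TAG_OPEN, TAG_NAME }

    // Emits tokens as strings: "StartTag(name)", "EndTag(name)" or "Char(c)".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DATA;
        boolean endTag = false;
        StringBuilder name = new StringBuilder();
        for (char c : input.toCharArray()) {
            switch (state) {
                case DATA:
                    if (c == '<') {
                        // A tag may be starting: reset the token under construction.
                        state = State.TAG_OPEN;
                        endTag = false;
                        name.setLength(0);
                    } else {
                        tokens.add("Char(" + c + ")");
                    }
                    break;
                case TAG_OPEN:
                    if (c == '/') {
                        endTag = true;               // this will be an end tag token
                        state = State.TAG_NAME;
                    } else if (Character.isLetter(c)) {
                        name.append(Character.toLowerCase(c));
                        state = State.TAG_NAME;      // first letter of a start tag name
                    } else {
                        // Not a tag after all: emit the characters as text.
                        tokens.add("Char(<)");
                        tokens.add("Char(" + c + ")");
                        state = State.DATA;
                    }
                    break;
                case TAG_NAME:
                    if (c == '>') {
                        // The token is complete: emit it and return to the data state.
                        tokens.add((endTag ? "EndTag(" : "StartTag(") + name + ")");
                        state = State.DATA;
                    } else {
                        name.append(Character.toLowerCase(c));
                    }
                    break;
            }
        }
        return tokens;
    }
}
```

The same character is treated differently depending on the state: a letter in the data state becomes a character token, while the same letter in the tag name state extends the name of the tag being built.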

Basic example: tokenizing the following HTML:

<html> <body> Hello world </body> </html>

The initial state is the "Data state". When "<" is encountered, the state changes to "Tag open state". Consuming a character in the range "a-z" creates a "Start tag token", and the state changes to "Tag name state". We stay in this state until ">" is consumed. Each character is appended to the name of the new token. In our example, the resulting token is "html".

When ">" is reached, the current token is emitted and the state changes back to "Data state". The "<body>" tag is handled in the same way. At this point the "html" and "body" tags have been emitted, and we are back in the "Data state". Consuming "H" (the first letter of "Hello world") creates and emits a character token, and this continues until the "<" of "</body>" is reached. A character token is emitted for each character of "Hello world".

Now we are back in the "Tag open state". Consuming the next input "/" creates an "end tag token" and moves to the "Tag name state". Again, we stay in this state until ">" is reached. The new tag token is then emitted and we return to the "Data state". The "</html>" input is handled in the same way.

5.0.1.5. Tree Construction Algorithm

When the parser is created, the Document object is created as well. During tree construction, the DOM tree rooted at the Document is modified and elements are added to it. Each token emitted by the tokenizer is processed by the tree constructor. For each token, the specification defines which DOM element corresponds to it. Besides being added to the DOM tree, the element is also pushed onto a stack of open elements. This stack is used to correct nesting errors and unclosed tags. The algorithm is also described as a state machine, whose states are called "insertion modes".

Let's take a look at the following tree construction process:

<html> <body> Hello world </body> </html>

The input to the tree construction stage is the sequence of tokens produced by the tokenization stage. The first mode is the "initial mode". Receiving the "html" token causes a move to the "before html" mode and a reprocessing of the token in that mode. This creates an HTMLHtmlElement, which is appended to the root Document object.

The mode then changes to "before head". When the "body" token is received, an HTMLHeadElement is created implicitly. It is created and added to the tree even though the input contains no "head" tag.

Now we move to the "in head" mode and then to "after head". The body token is reprocessed, an HTMLBodyElement is created and inserted, and the mode changes to "in body".

The character tokens of "Hello world" are received next. The first one creates and inserts a "Text" node, and the remaining characters are appended to that node one by one.

Receiving the body end tag causes a move to the "after body" mode. Receiving the html end tag then causes a move to the "after after body" mode. Parsing ends once all tokens have been processed.
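The steps above can be sketched in Java as follows. This is a toy model under several assumptions. Tokens are plain strings ("<name>", "</name>", or a run of text), head content such as title or meta is not handled, whitespace-only text is assumed to be absent, and the names `MiniTreeBuilder`, `Node`, and `Mode` are invented for this example. It only imitates the insertion modes named in the walkthrough and is not the algorithm from the specification.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class MiniTreeBuilder {
    // Simplified DOM node: an element, a "#text" node, or the "#document" root.
    static class Node {
        final String name;
        final StringBuilder text = new StringBuilder();
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
        Node append(Node child) { children.add(child); return child; }
        @Override public String toString() {
            if (name.equals("#text")) return "\"" + text + "\"";
            return children.isEmpty() ? name : name + children;
        }
    }

    enum Mode { INITIAL, BEFORE_HTML, BEFORE_HEAD, IN_HEAD, AFTER_HEAD, IN_BODY, AFTER_BODY, AFTER_AFTER_BODY }

    static Node build(List<String> tokens) {
        Node document = new Node("#document");
        Deque<Node> open = new ArrayDeque<>(); // stack of open elements
        open.push(document);
        Mode mode = Mode.INITIAL;
        for (String tok : tokens) {
            boolean reprocess = true; // some modes hand the same token on to the next mode
            while (reprocess) {
                reprocess = false;
                switch (mode) {
                    case INITIAL:
                        mode = Mode.BEFORE_HTML; reprocess = true; break;
                    case BEFORE_HTML:
                        open.push(open.peek().append(new Node("html")));
                        mode = Mode.BEFORE_HEAD; reprocess = !tok.equals("<html>"); break;
                    case BEFORE_HEAD: // head is created even if no <head> tag was seen
                        open.push(open.peek().append(new Node("head")));
                        mode = Mode.IN_HEAD; reprocess = !tok.equals("<head>"); break;
                    case IN_HEAD:
                        open.pop(); // close head, explicitly or implicitly
                        mode = Mode.AFTER_HEAD; reprocess = !tok.equals("</head>"); break;
                    case AFTER_HEAD:
                        open.push(open.peek().append(new Node("body")));
                        mode = Mode.IN_BODY; reprocess = !tok.equals("<body>"); break;
                    case IN_BODY:
                        inBody(tok, open);
                        if (tok.equals("</body>")) mode = Mode.AFTER_BODY;
                        break;
                    case AFTER_BODY:
                        if (tok.equals("</html>")) mode = Mode.AFTER_AFTER_BODY;
                        break;
                    default: // AFTER_AFTER_BODY: remaining tokens are ignored in this sketch
                        break;
                }
            }
        }
        return document;
    }

    private static void inBody(String tok, Deque<Node> open) {
        if (tok.startsWith("</")) {
            String name = tok.substring(2, tok.length() - 1);
            boolean isOpen = open.stream().anyMatch(n -> n.name.equals(name));
            // Popping up to the matching element also closes any unclosed children.
            while (isOpen && !open.pop().name.equals(name)) { /* keep popping */ }
        } else if (tok.startsWith("<")) {
            open.push(open.peek().append(new Node(tok.substring(1, tok.length() - 1))));
        } else {
            Node parent = open.peek();
            Node last = parent.children.isEmpty() ? null : parent.children.get(parent.children.size() - 1);
            if (last == null || !last.name.equals("#text")) last = parent.append(new Node("#text"));
            last.text.append(tok); // later text is appended to the existing Text node
        }
    }
}
```

For the token sequence of the example, build returns a document whose html element contains an implicitly created head followed by a body holding one Text node. The stack of open elements is what lets an end tag such as "</div>" also close a "p" that was left open inside it.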

 

6. Reference

Read and parse Summary of atitit. jsoup html table

6.0.1. Atitit. java parse jsoup parse html table read parsing Summary-atattilax column...

Useful collection tool _html htmlhtmlagilitypack_phoenixne _xinlang.html

Htmlhtml_it _xisai .html

How does a browser work: rendering engine, HTML parsing (serialization 2)-Ctrip design conference .html

HTML parsing tool HtmlAgilityPack-Zhou Gong (Zhou Jinqiao) column-51ctotechnical blog .html

 

 

 
