How to parse HTML with JavaScript?

Source: Internet
Author: User

 

The problem: I need to extract some content from another web page and integrate it into the page currently being processed. My first idea was to fetch the page with dojo.xhrGet and then parse it with a JavaScript library. I am familiar with dojo, and it has no such library. I checked, and jQuery does not have one either. Later I found that John Resig, the author of jQuery, has written a library that offers a SAX-style parser and a DOM tree builder:

Pure JavaScript HTML Parser

I tried it on the page I wanted to process. Unfortunately it crashed (the page is dozens of kilobytes in size and is not necessarily standards-compliant). This makes me miss Java, where the toolbox is full of such libraries, many of which even support CSS rendering and JavaScript execution.

What should I do? My requirements are not complex, and regular expressions might be enough. But I would lose CSS-style queries, and the code would have to keep up with future changes to the page, so that approach is not worth trying.

Then it struck me: the browser is the perfect HTML parser. If the browser loads the page in a new window, I can analyze it with the DOM API. The content I extract will be more accurate, the approach is more flexible, and the program will be easier to adjust when the page changes in the future. Of course, opening a new window is not very elegant (and may even confuse users), so we use a frame instead.

There are two kinds of frames: those created with the frameset and frame tags, and those created with the iframe tag. If an HTML file contains frameset, the file should only contain

Many textbooks, such as "Professional JavaScript for Web Developers" (Nicholas Zakas), describe frames created by <frameset> and the relationships between the windows and documents of those frames. For a frame created with iframe, the relationships are as follows:

Call the window object of the original document iWindow for short. It has the following properties:

iWindow.frames: the array of sub-frames. If frames.length == 0, the window has no frames.

iWindow == top

iWindow == iWindow.parent
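These relationships can be checked directly in code. The following is a minimal sketch, assuming plain DOM window objects; the helper names hasFrames and isTopLevel are made up for illustration and do not come from dojo or any other library.

```javascript
// Hypothetical helpers; `win` is any window-like object.
function hasFrames(win) {
  // window.frames is array-like; a length of 0 means no child frames.
  return win.frames.length > 0;
}

function isTopLevel(win) {
  // For a top-level window, both `top` and `parent` point back at itself.
  return win === win.top && win === win.parent;
}

// In a browser you would pass `window`, for example:
// console.log(hasFrames(window), isTopLevel(window));
```

Inside an iframe, isTopLevel(window) would be false, because parent and top refer to the embedding page.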

According to the DOM specification, the document object of the created frame is referenced by frameNode.contentDocument, and its window object by frameNode.contentWindow. Note, however, that browser implementations differ. This is exactly where dojo can help. Let's take a look at the implementation of dojo:

Create iframe

 

dojo.io.iframe.create(/* String */ fname, /* String */ onloadStr, /* String? */ uri)

The first parameter is the name to give the frame. dojo also uses this name as the id attribute of the iframe, so you cannot create more than one iframe with the same name. The second parameter specifies what should run when the iframe content has loaded; it can be a small piece of JavaScript. The third parameter is optional. If it is omitted, dojo loads a blank HTML file from a fixed location in the domain that hosts dojo.js. If you load dojo across domains (which becomes more common as CDNs are used more often), this causes a security error. In that case you can set this parameter to "about:blank"; both Firefox and IE support this syntax.
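For readers without dojo at hand, the same idea can be sketched with the plain DOM API. This is not dojo's implementation: the function createHiddenIframe, its parameter order, and the URL in the usage comment are all assumptions made for illustration. It takes a function instead of a string for the load handler.

```javascript
// Hypothetical plain-DOM helper, not dojo's implementation.
function createHiddenIframe(doc, name, onLoad, src) {
  if (doc.getElementById(name)) {
    // The name doubles as the id, so it must be unique in the document.
    throw new Error('An element with id "' + name + '" already exists');
  }
  const frame = doc.createElement("iframe");
  frame.id = name;
  frame.name = name;
  frame.style.display = "none"; // keep the helper frame out of sight
  frame.addEventListener("load", onLoad);
  frame.src = src || "about:blank"; // blank page when no URL is given
  doc.body.appendChild(frame);
  return frame;
}

// Browser usage (the URL is a placeholder):
// createHiddenIframe(document, "parserFrame", function () {
//   console.log("frame loaded");
// }, "/some/same-origin/page.html");
```

Passing the document in as a parameter keeps the helper independent of the global document, which also makes it easy to exercise with a stub.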

Manipulate iframe document objects

You should always use the methods provided by dojo to obtain references to iframe document objects, because the object hierarchies of different browsers are different.

 

dojo.io.iframe.doc(/* DOMNode */ myFrame);

Here, myFrame is a DOMNode previously created with the create function. Remember that an inline frame is a node in the current document. If you are interested in the dojo source code, this is how it selects the document in different browsers:

 

doc: function(/* DOMNode */ iframeNode){
    // summary: Returns the document object associated with the iframe DOM Node argument.
    var doc = iframeNode.contentDocument || // W3
        (
            (
                (iframeNode.name) && (iframeNode.document) &&
                (dojo.doc.getElementsByTagName("iframe")[iframeNode.name].contentWindow) &&
                (dojo.doc.getElementsByTagName("iframe")[iframeNode.name].contentWindow.document)
            )
        ) || // IE
        (
            (iframeNode.name) && (dojo.doc.frames[iframeNode.name]) &&
            (dojo.doc.frames[iframeNode.name].document)
        ) || null;
    return doc;
},
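In browsers that implement the standard contentDocument property, the lookup can be much shorter. The following is a condensed sketch of the same idea, not dojo's code; the name frameDocument is hypothetical.

```javascript
// Hypothetical helper: find the document that belongs to an iframe node.
function frameDocument(frameNode) {
  if (frameNode.contentDocument) {
    return frameNode.contentDocument; // standard DOM property
  }
  // Fallback: go through the frame's window object.
  const win = frameNode.contentWindow;
  return (win && win.document) || null;
}

// Browser usage, assuming an iframe with id "parserFrame" exists:
// const doc = frameDocument(document.getElementById("parserFrame"));
```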

How can we use DOM APIs and CSS selector syntax on the doc object we obtained?

The core of jQuery's syntax is $. When you select nodes with $, the default root is the current document object, but you can pass a different root as the second argument, $(selector, context), for example:

 

$('h1', parent.frames[0].document).remove()

It should be said that jQuery's syntax is quite concise at this point. In dojo, we need to do this:

 

dojo.withDoc(/* the doc object of the iframe */ iframeDoc, method, scope, args);

scope is the object on which dojo looks up the second parameter, method, and the context in which method runs. If method is 'byId', scope should be dojo. args holds the arguments to pass when method is called.
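To make the roles of the four parameters concrete, here is a simplified stand-in that mimics the pattern: swap the library's notion of the current document, call the method, then restore it. It is a sketch under assumptions, not dojo's source; the lib object, its count method, and this withDoc function are all hypothetical.

```javascript
// Simplified stand-in for the library; `lib.doc` plays the role of dojo.doc.
const lib = {
  doc: null,
  count(selector) {
    // A sample method that implicitly uses the "current" document.
    return lib.doc.querySelectorAll(selector).length;
  },
};

function withDoc(documentObject, method, scope, args) {
  const previous = lib.doc;
  lib.doc = documentObject;
  try {
    // A string method name is looked up on the scope object.
    const fn = typeof method === "string" ? scope[method] : method;
    return fn.apply(scope, args || []);
  } finally {
    lib.doc = previous; // always restore, even if the method throws
  }
}

// Browser usage, with iframeDoc being the document of an iframe:
// const headings = withDoc(iframeDoc, "count", lib, ["h1"]);
```

The try/finally matters here: if the method throws, later calls still see the original document.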

Iframe destruction

The dojo documentation does not mention how to destroy an iframe, and dojo's own code does not implement it either. Since an iframe is just a node of the current document, you should be able to destroy it with the DOM API. I have not tested this.
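A minimal sketch of that idea using only standard DOM calls follows. The helper name destroyIframe is hypothetical, and it has not been verified against frames created by dojo.

```javascript
// Hypothetical helper: detach an iframe node from its document.
function destroyIframe(frameNode) {
  const parent = frameNode.parentNode;
  if (!parent) {
    return false; // already detached, nothing to do
  }
  parent.removeChild(frameNode);
  return true;
}

// Browser usage, assuming an iframe with id "parserFrame" exists:
// destroyIframe(document.getElementById("parserFrame"));
```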

Replace the target Link

More often than destroying it, you may want to reuse an iframe to load a different HTML document. You can do this with dojo.io.iframe.setSrc:

 

dojo.io.iframe.setSrc(/* DOMNode */ iframe, /* String */ src, /* Boolean */ replace);

 

The third parameter is worth noting. In my experiments, if it is set to false, the iframe may not update its document in Firefox.

Some behaviors of this function deserve further study. We specified the onload handling code when creating the iframe, because there is no other good way to make sure a handler runs only after the iframe document has loaded. So what happens when setSrc points the iframe at a new location? In my Firefox experiments, surprisingly, the onload handler specified at creation time still fires, and dojo does nothing special to make that happen. This suggests that the onload event dojo registers when creating the iframe is a window-level event rather than a document event, or perhaps something even lower-level.
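The effect of the replace flag can be approximated in plain DOM terms. The sketch below is an assumption about the general technique, not dojo's code; the name setFrameSrc is hypothetical.

```javascript
// Hypothetical plain-DOM helper, not dojo's implementation.
function setFrameSrc(frameNode, src, replace) {
  const win = frameNode.contentWindow;
  if (replace && win) {
    // Replaces the frame's current history entry instead of adding one.
    win.location.replace(src);
  } else {
    // Plain navigation by changing the src attribute.
    frameNode.src = src;
  }
}

// Browser usage (the URL is a placeholder):
// setFrameSrc(document.getElementById("parserFrame"), "/other/page.html", true);
```

A load listener registered on the iframe element itself fires after each navigation of that frame, which is consistent with the Firefox observation described above.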

Conclusion

In the world of JavaScript, programmers sometimes dance in handcuffs, and art matters more than science. The iframe effectively solves the document parsing problem: programmers can still work on these documents with the APIs and methods they are familiar with, which is much better than hunting for an immature library or implementing the functionality yourself.

The iframe method still has its own shortcomings, and in some cases they may be fatal. Browsers increasingly restrict cross-origin operations. If an HTML document from another domain is loaded in the iframe, you can reach only a few properties of its window, such as location, and cannot read the document content. Because this is a browser restriction, neither dojo nor jQuery can get around it. The only option may be to load cross-origin content through a link or <script> tag, which means a JavaScript-based HTML parser is still needed.
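Code that relies on the iframe approach can at least detect this situation before trying to parse. The following is a minimal sketch, assuming that reading the document of a cross-origin frame either throws or yields nothing; the name readableFrameDocument is hypothetical.

```javascript
// Hypothetical helper: return the frame's document, or null if it is off limits.
function readableFrameDocument(frameNode) {
  try {
    const win = frameNode.contentWindow;
    // Touching `document` on a cross-origin window throws a security error.
    return (win && win.document) || null;
  } catch (err) {
    return null;
  }
}

// Browser usage, assuming an iframe with id "parserFrame" exists:
// const doc = readableFrameDocument(document.getElementById("parserFrame"));
// if (!doc) { /* fall back to another way of getting the content */ }
```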

From: midsummer lotus-cutting-edge web Technology
