PHP + Tidy: near-perfect XHTML correction and filtering

Source: Internet
Author: User
Tags: html, cleanup, tidy
Input and output are the basic functions of many websites: users submit data, and the website outputs that data for others to browse.

Take the ever-popular blog as an example. The input is the article the author writes in the editor, and the output is the article page generated from it for others to read.
The problem is that user input is usually uncontrolled: it may contain malformed markup or code that poses a security risk, while the final output of the website must be correct HTML. User input therefore has to be corrected and filtered.

Never trust user input
You may say that WYSIWYG editors such as FCKeditor and TinyMCE are everywhere these days. Yes, they can all generate standard XHTML automatically, but as a web developer you must have heard the rule "never trust data submitted by users".

Therefore, it is necessary to correct and filter user input data.

Better error correction and filtering
So far, I have not seen an implementation that satisfies me. What I usually see is inefficient and produces unsatisfactory results, with obvious defects. WordPress, for example, is a widely used blog system that is simple to operate and rich in plug-ins. However, the TinyMCE editor integrated into its backend comes with some overly clever error-correction and filtering code that is quite a headache: the forced replacement of half-width characters, overly conservative filtering, and so on. As a result, it is difficult to paste a piece of code and have it displayed correctly.

While I am at it, let me complain a little: this blog is built on WordPress. To get the code in these articles to display correctly, I searched the Internet extensively and tried several plug-ins. In the end, I commented out some of the filtering rules in the WordPress code so that the result looks barely decent -.-b

Of course, I do not want to blame WordPress too much. I just want to point out that it could do better.

What is Tidy and how does it work?
The description from the Tidy man page is as follows:

Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For HTML variants, it detects and corrects many common coding errors and strives to produce visually equivalent markup that is both W3C compliant and works on most browsers. A common use of Tidy is to convert plain HTML to XHTML. For generic XML files, Tidy is limited to correcting basic well-formedness errors and pretty printing.

In short, Tidy cleans up HTML code and generates markup that complies with W3C standards; it supports HTML, XHTML, and XML. Tidy also ships as a library, TidyLib, so that other applications can use its features. Fortunately, PHP has a corresponding tidy module.

Dude, why is it PHP again?
Well, that is a fair question... sorry, PHP is simply all I need. -v
Fortunately, this article is not just a code dump. It also includes some analysis, and sharing that is much more useful than posting code alone.

Use Tidy in PHP
To use Tidy in PHP, you need to install the Tidy module, that is, load the PHP extension tidy.so. The specific steps are omitted here, since they are purely mechanical work. When you are done, you should see "Tidy support enabled" in phpinfo.
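As a quick alternative to reading through phpinfo, the check can also be done in code. The following is a minimal sketch; the helper name tidyAvailable is hypothetical and not part of the article or of the tidy extension.

```php
<?php
// Hypothetical helper (not from the article): reports whether the tidy
// extension is usable in the current PHP process.
function tidyAvailable(): bool
{
    return extension_loaded('tidy') && function_exists('tidy_repair_string');
}

echo tidyAvailable()
    ? "tidy extension is loaded\n"
    : "tidy extension is missing; check the extension settings in php.ini\n";
```

Running the same check at the top of any function that calls Tidy lets the code degrade gracefully on servers where the extension is not installed.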

With this module loaded, almost all of Tidy's functionality is available in PHP. Ordinary HTML cleanup becomes extremely easy, and you can even generate a parse tree of the document and work with HTML nodes much as you would with the DOM on the client side. The specific code follows; you can also refer to the official PHP manual.
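To give a feel for the parse tree, here is a small sketch that walks the nodes below the body and returns an indented outline of element names. The function names outlineNode and outlineHtml are hypothetical; the tidyNode members used (name, child, isText(), hasChildren()) come from the PHP tidy extension.

```php
<?php
// Hypothetical helper: returns an indented outline of the element names
// below a tidyNode. Text nodes are shown as "#text".
function outlineNode($node, int $depth = 0): string
{
    $label = $node->isText() ? '#text' : $node->name;
    $out = str_repeat('  ', $depth) . $label . "\n";
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $out .= outlineNode($child, $depth + 1);
        }
    }
    return $out;
}

// Hypothetical wrapper: parses a string and outlines its body.
function outlineHtml(string $html): string
{
    if (!function_exists('tidy_parse_string')) {
        return ''; // tidy extension not loaded
    }
    $tidy = tidy_parse_string($html, array('output-xhtml' => true), 'utf8');
    if (!$tidy) {
        return '';
    }
    $body = tidy_get_body($tidy);
    return $body ? outlineNode($body) : '';
}

echo outlineHtml('<p>Hello <b>world</b></p>');
```

The same recursive pattern is what makes node-by-node filtering possible: each node can be inspected before it is written to the output.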

PHP + Tidy implementation for error correction and filtering
All the background above may seem long-winded; concrete code that solves the problem is the most direct.

1. Simple error correction

function HtmlFix($html)
{
    // Without the tidy extension, return the input unchanged
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }
    // Use tidy to repair the HTML code

    // Repair
    $str = tidy_repair_string($html,
        array('output-xhtml' => true),
        'utf8');
    // Parse
    $tidy = tidy_parse_string($str,
        array('output-xhtml' => true),
        'utf8');
    $s = '';

    // Children of the body node (null if there is no body or it is empty)
    $body = $tidy ? tidy_get_body($tidy) : null;
    $nodes = $body ? $body->child : null;

    if (!is_array($nodes)) {
        return $s;
    }

    // Concatenate the serialized markup of each top-level node
    foreach ($nodes as $n) {
        $s .= $n->value;
    }
    return $s;
}
The code above cleans up and corrects XHTML that may not be standards-compliant and outputs standard XHTML (both input and output are UTF-8 encoded). The implementation is not the most concise, because I wrote it out in as much detail as possible so that it lines up with the filtering function below.
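For comparison, a more concise variant is possible when no per-node processing is needed. This is a sketch under the assumption that Tidy's show-body-only option is acceptable for your use; the function name htmlFixBodyOnly is hypothetical.

```php
<?php
// Hypothetical concise variant: let Tidy return only the repaired body
// fragment instead of walking the parse tree by hand.
function htmlFixBodyOnly(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // tidy extension not loaded: return the input unchanged
    }
    $config = array(
        'output-xhtml'   => true,
        'show-body-only' => true,
    );
    $fixed = tidy_repair_string($html, $config, 'utf8');
    return is_string($fixed) ? $fixed : $html;
}

echo htmlFixBodyOnly('<p>unclosed <b>bold text');
```

The trade-off is that this version gives no opportunity to inspect or filter individual nodes, which is what the more detailed version is preparing for.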

2. Advanced implementation: error correction + filtering

Features:

XHTML error correction: the output is standard XHTML code.
Filtering of insecure code without affecting how the content is displayed: only the insecure code in style/javascript is removed.
Insertion of break markers into very long strings so that browsers can wrap them automatically. For more information, see the article on breaking very long lines of text on web pages.
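To make the filtering item more concrete, here is a rough sketch of one way to filter while walking the Tidy parse tree. It is not the author's implementation, and it is not a complete sanitizer: the function names, the blocked-element list, and the attribute rules are all assumptions chosen for illustration.

```php
<?php
// Sketch only, NOT the article's _dumpnode(); every name below is hypothetical.

// Build an attribute string, skipping event handlers (onclick, ...),
// javascript: URLs, and style values that can carry script.
function safeAttributes(array $attrs): string
{
    $out = '';
    foreach ($attrs as $name => $value) {
        $name  = strtolower((string) $name);
        $value = (string) $value;
        if (strpos($name, 'on') === 0) {
            continue; // event handler attribute
        }
        if (preg_match('/^\s*javascript:/i', $value)) {
            continue; // javascript: URL
        }
        if ($name === 'style' && preg_match('/expression\s*\(|javascript:/i', $value)) {
            continue; // script hidden in CSS
        }
        // double_encode = false: leave already-escaped entities alone
        $out .= ' ' . $name . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8', false) . '"';
    }
    return $out;
}

// Serialize a tidyNode and its children, dropping unsafe elements.
function dumpSafeNode($node): string
{
    static $blocked = array('script', 'style', 'iframe', 'object', 'embed');
    static $void    = array('br', 'hr', 'img', 'input');

    if ($node->isText()) {
        return $node->value;
    }
    if ($node->type !== TIDY_NODETYPE_START && $node->type !== TIDY_NODETYPE_STARTEND) {
        return ''; // comments, processing instructions, and so on
    }
    if (in_array($node->name, $blocked, true)) {
        return '';
    }
    $attrs = is_array($node->attribute) ? safeAttributes($node->attribute) : '';
    if (in_array($node->name, $void, true)) {
        return '<' . $node->name . $attrs . ' />';
    }
    $inner = '';
    foreach ($node->child ?? array() as $child) {
        $inner .= dumpSafeNode($child);
    }
    return '<' . $node->name . $attrs . '>' . $inner . '</' . $node->name . '>';
}

function htmlFilterSketch(string $html): string
{
    if (!function_exists('tidy_parse_string')) {
        // Without tidy, fail closed by escaping everything
        return htmlspecialchars($html, ENT_QUOTES, 'UTF-8');
    }
    $tidy = tidy_parse_string($html, array('output-xhtml' => true), 'utf8');
    $body = $tidy ? tidy_get_body($tidy) : null;
    $out  = '';
    foreach (($body ? $body->child : null) ?? array() as $child) {
        $out .= dumpSafeNode($child);
    }
    return $out;
}

echo htmlFilterSketch('<p onclick="x()">hi<script>alert(1)</script></p>');
```

Note that this sketch escapes the whole input when Tidy is unavailable, a deliberately stricter fallback than returning the input unchanged, because unfiltered markup would defeat the purpose of a security filter.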
function HtmlFixSafe($html)
{

    if (!function_exists('tidy_repair_string')) {
        return $html;
    }
    // Use tidy to repair the HTML code

    // Tidy parameter settings
    $conf = array(
        'output-xhtml' => true
        , 'drop-empty-paras' => false
        , 'join-classes' => true
        , 'show-body-only' => true
    );

    // Repair
    $str = tidy_repair_string($html, $conf, 'utf8');
    // Generate the parse tree
    $str = tidy_parse_string($str, $conf, 'utf8');

    $s = '';

    // Obtain the body node
    $body = @tidy_get_body($str);

    // Function _dumpnode: check each node, filter it, and append the output
    function _dumpnode($node, &$s) {

        // View the node name, for example
