PHP + tidy: perfect XHTML error correction + filtering

Input and output
Input and output are the basic functions of most websites: users submit data, and the site outputs data for other people to browse.

Take the currently popular blog as an example: the input is the article the author edits, and the output is the generated post page that others read.
The problem is that user input is usually uncontrolled. It may be badly formatted or contain code with security implications, while what the site finally outputs must be correct HTML. The user's input therefore has to be corrected and filtered.

Never trust a user's input
You may say: WYSIWYG editors are everywhere now, FCKeditor, TinyMCE ... you could name a whole bunch. Yes, they can all generate standard XHTML code automatically, but as a Web developer you must have heard "never trust data submitted by users."

Therefore, error correction and filtering of user input data are necessary.

Need for better error correction and filtering
So far I have not seen an implementation I am satisfied with; they are usually inefficient, not very effective, or obviously flawed. A well-known example: WordPress is a very widely used blog system, simple to operate, powerful, and with rich plug-in support, but the TinyMCE it integrates and the pile of too-clever error-correcting and filtering code in its back end are quite a headache: forced replacement of half-width characters, overly conservative substitution rules, and so on. It is hard to post a piece of code and have it displayed correctly.

A complaint in passing: this blog runs on WordPress. To get the code in these articles to display correctly, I searched online a lot and tried several plug-ins, and in the end I dug into its code and commented out some filter rules before it displayed decently -.-b

Of course, I don't mean to blame WordPress too much, only to show that it could do better.

What is tidy and how does it work?
Excerpt from the Tidy manpage description:

Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For HTML variants, it detects and corrects many common coding errors and strives to produce visually equivalent markup that is both W3C compliant and works on most browsers. A common use of Tidy is to convert plain HTML to XHTML. For generic XML files, Tidy is limited to correcting basic well-formedness errors and pretty printing.

Simply put, tidy cleans up HTML code and generates clean, standards-compliant markup; it supports HTML, XHTML, and XML. Tidy also provides a library, TidyLib, so that other applications can use tidy's powerful features. Luckily, PHP has a tidy module for this.
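
To make "cleaning up" concrete, here is a minimal sketch that feeds a deliberately broken fragment to tidy_repair_string(). The sample input and the variable names are made up for illustration, and the exact whitespace of the result depends on the tidy version, so no output is shown.

```php
<?php
// Sample input with unclosed <b> and <p> tags (made up for this sketch).
$dirty = '<p>unclosed <b>bold<p>next';

if (function_exists('tidy_repair_string')) {
    // show-body-only drops the <html>/<head> wrapper tidy would otherwise add.
    $clean = tidy_repair_string(
        $dirty,
        array('output-xhtml' => true, 'show-body-only' => true),
        'utf8'
    );
} else {
    // Without the extension there is nothing to repair with.
    $clean = $dirty;
}

echo $clean, "\n";
```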

Dude, why PHP again?
Uh, that question ... I'm ashamed to say it's because PHP is about all I know -.-V
Fortunately, what I present here is not pure code; there is at least some analysis of the process, which is more useful to share than pasted code.

Using Tidy in PHP
To use tidy in PHP you need to install the tidy module, that is, load the tidy.so PHP extension. I'll skip the specific steps; it is purely manual labor. When phpinfo() shows "Tidy support enabled", you're done.
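
The same check can be done from code instead of reading the phpinfo() page. A minimal sketch: check_tidy_support() is a hypothetical helper name introduced here, while extension_loaded() is a standard PHP function.

```php
<?php
// Hypothetical helper: reports whether the tidy extension is loaded.
function check_tidy_support()
{
    return extension_loaded('tidy')
        ? 'Tidy support enabled'
        : 'Tidy extension not loaded';
}

echo check_tidy_support(), "\n";
```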

With this module, PHP can use almost all of the features tidy provides. Ordinary HTML cleanup becomes exceptionally easy, and tidy can even generate a parse tree of a document, so that you can work with HTML nodes much as you would with the DOM on the client side. Specific code follows; you can also consult the official PHP manual.
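
As a small illustration of the parse tree, the sketch below walks the nodes under body and lists the element names, indented by depth. list_elements() is a hypothetical helper written for this example; the tidyNode members it uses (name, child, isText(), isComment()) come from the PHP tidy extension.

```php
<?php
// Hypothetical helper: element names in document order, indented by depth.
function list_elements($node, $depth = 0)
{
    $out = '';
    // Text and comment nodes have no element name worth listing.
    if (!$node->isText() && !$node->isComment()) {
        $out .= str_repeat('  ', $depth) . $node->name . "\n";
    }
    // child is an array of tidyNode objects, or null for a leaf.
    if ($node->child) {
        foreach ($node->child as $child) {
            $out .= list_elements($child, $depth + 1);
        }
    }
    return $out;
}

if (extension_loaded('tidy')) {
    $doc = tidy_parse_string('<p>Hello <b>world</b></p>', array(), 'utf8');
    echo list_elements(tidy_get_body($doc));
} else {
    echo "tidy extension not loaded\n";
}
```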

PHP + tidy implementation of error correction and filtering
Enough background material, which may already have been too wordy; code that solves the actual problem is the most direct.

1. Simple error-correcting implementation

function Htmlfix($html)
{
    // Without the tidy extension, return the input unchanged.
    if (!function_exists('tidy_repair_string'))
        return $html;

    // Use tidy to repair the HTML code.

    // Repair
    $str = tidy_repair_string($html,
        array('output-xhtml' => true),
        'utf8');
    // Parse
    $str = tidy_parse_string($str,
        array('output-xhtml' => true),
        'utf8');
    $s = '';

    // Child nodes of <body>; there are none if the document is empty.
    $body = tidy_get_body($str);
    $nodes = $body ? $body->child : null;

    if (!is_array($nodes)) {
        return $s;
    }

    // Each node's value is its markup, including its children.
    foreach ($nodes as $n) {
        $s .= $n->value;
    }
    return $s;
}
The code above cleans up potentially malformed XHTML code and outputs standard XHTML code (input and output are both UTF-8 encoded). The implementation is not the most streamlined, because I wrote it in as much detail as possible to match the filtering version below.

2. Advanced implementation: Error correction + filtering

Features:

XHTML error correction, outputting standard XHTML code.
Filters out unsafe code without affecting how content is displayed; only unsafe style/JavaScript code is removed.
Inserts <wbr> tags into very long strings so that they wrap automatically across browsers; see the related article on line breaks in long text on a page.
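
The attribute rule behind the second feature can be tried on its own. This is a minimal sketch, not the code from the function that follows: is_unsafe_attribute() is a hypothetical name, and the rule is limited to the two cases the article describes (event handlers and javascript: URLs).

```php
<?php
// Hypothetical helper: true if the attribute should be dropped.
function is_unsafe_attribute($name, $value)
{
    // Event handlers: the name starts with "on" (onclick, onmouseover, ...).
    // stripos() returns false when "on" is absent, so compare with ===;
    // a loose == 0 would also match false and drop every attribute.
    if (stripos($name, 'on') === 0) {
        return true;
    }
    // Script URLs such as href="javascript:..."
    return stripos(trim($value), 'javascript:') === 0;
}

var_dump(is_unsafe_attribute('onclick', 'alert(1)'));           // bool(true)
var_dump(is_unsafe_attribute('href', ' JavaScript:alert(1)'));  // bool(true)
var_dump(is_unsafe_attribute('href', 'http://example.com/'));   // bool(false)
var_dump(is_unsafe_attribute('class', 'note'));                 // bool(false)
```

The strict comparison matters: attributes such as href or class do not contain "on" at all, and they must be kept.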
function Htmlfixsafe($html)
{
    if (!function_exists('tidy_repair_string'))
        return $html;
    // Use tidy to repair the HTML code.

    // Parameter settings for tidy
    $conf = array(
        'output-xhtml' => true
        , 'drop-empty-paras' => false
        , 'join-classes' => true
        , 'show-body-only' => true
    );

    // Repair
    $str = tidy_repair_string($html, $conf, 'utf8');
    // Generate the parse tree
    $str = tidy_parse_string($str, $conf, 'utf8');

    $s = '';

    // Get the body node
    $body = @tidy_get_body($str);

    // Function _dumpnode: check each node and output it after filtering.
    // The function_exists() guard keeps a second call to Htmlfixsafe()
    // from redeclaring _dumpnode, which would be a fatal error.
    if (!function_exists('_dumpnode')) {
    function _dumpnode($node, &$s) {

        // Check the node name; <script> and <style> are dropped outright.
        switch ($node->name) {
            case 'script':
            case 'style':
                return;
            default:
        }

        if ($node->type == TIDY_NODETYPE_TEXT) {
            /*
               For a text node, do additional processing:
               automatic line wrapping of overly long text;
               automatic recognition of hyperlinks (not implemented).
            */
            // Insert <wbr>
            $s .= Htmlinsertwbrs($node->value, 30, '', '&?/\\');

            // Auto links??? TODO ***
            return;
        }

        // Not a text node: process the tag and its attributes.
        $s .= '<' . $node->name;

        // Check each attribute
        if ($node->attribute) {
            foreach ($node->attribute as $name => $value) {

                /*
                   Remove DOM event handlers, whose names usually start
                   with "on", such as onclick, onmouseover ...
                   Attribute values that start with "javascript:",
                   for example href="javascript:", are removed as well.
                */
                if (strpos($name, 'on') === 0
                    ||
                    stripos(trim($value), 'javascript:') === 0
                ) {
                    continue;
                }

                // Keep the safe attributes. Htmlescape() is the author's
                // escaping helper and is not shown in this article.
                $s .= ' ' . $name . '="' . Htmlescape($value) . '"';

            }
        }

        // Recursively check the child nodes of this node
        if ($node->child) {

            $s .= '>';

            foreach ($node->child as $child) {
                _dumpnode($child, $s);
            }

            // Child nodes processed; close the tag
            $s .= '</' . $node->name . '>';
        } else {

            /*
               No child nodes: close the tag.
               (Removing empty nodes entirely is another option.)
            */
            if ($node->type == TIDY_NODETYPE_START)
                $s .= '></' . $node->name . '>';
            else
                /*
                   Unpaired tags are closed directly with />.
                */
                $s .= '/>';
        }
    }
    }
    // End of function definition

    // Filter the children of the body node with the function above.
    if ($body && $body->child) {

        foreach ($body->child as $child)
            _dumpnode($child, $s);
    } else
        return '';

    return $s;
}
The comments in the code above should be fairly detailed; read them alongside the code to see how it works.
Stricter filtering is also easy to add, as is, for example, automatic recognition of links in the text.
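
As one possible starting point for the link recognition left as a TODO, here is a rough sketch that wraps http(s) URLs in already-escaped text with <a> tags. auto_link() is a hypothetical name and the pattern is deliberately simple: it does not trim trailing punctuation, and it would have to run before any <wbr> tags are inserted into the same text.

```php
<?php
// Hypothetical helper: turns http(s) URLs in already-escaped text into links.
function auto_link($escaped_text)
{
    return preg_replace(
        // A URL runs until whitespace, an angle bracket, or a quote.
        '~\bhttps?://[^\s<>"\']+~i',
        '<a href="$0">$0</a>',
        $escaped_text
    );
}

echo auto_link('see http://example.com/?a=1&amp;b=2 for details'), "\n";
```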


A small addendum

If you have read my earlier article on line breaks in long text on a page, you may notice that the function handling line wrapping in the code above is different:

The earlier article introduced Htmlescapeinsertwbrs(), whereas the code here uses Htmlinsertwbrs().

Let me explain:
Htmlescapeinsertwbrs() requires that special characters in the input string have not been escaped yet, that is, that htmlspecialchars() has not already turned <>& into &lt;&gt;&amp;, because the function handles that escaping internally.
The text nodes processed here come from tidy, which has already escaped characters such as <>& to the corresponding &lt;&gt;&amp;. To avoid escaping them twice, a dedicated function is needed. That function is Htmlinsertwbrs(); as its name says, it only inserts <wbr> tags and does no extra work.
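
The double-escaping problem is easy to reproduce with htmlspecialchars() alone. The sample string below is made up for illustration:

```php
<?php
$raw = '<b> & </b>';

// First pass: what tidy has effectively already done to a text node.
$once = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
// Second pass: what an escaping wrap function would do on top of that.
$twice = htmlspecialchars($once, ENT_QUOTES, 'UTF-8');

echo $once, "\n";   // &lt;b&gt; &amp; &lt;/b&gt;
echo $twice, "\n";  // &amp;lt;b&amp;gt; &amp;amp; &amp;lt;/b&amp;gt;
```

A browser would render the second string as the literal text "&lt;b&gt; ...", which is why already-escaped text must not be escaped again.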

Then you may have a question:
What if <wbr> is inserted into the middle of an HTML tag or an escaped character, for example in the middle of <div> or &gt;, producing <d<wbr>iv> or &<wbr>gt;? That would break the display of the original content.

Yes, this is a new problem, but a few tricks solve it effectively:

Because we only process the text nodes that tidy returns, the text cannot contain HTML tags, so <wbr> cannot land in the middle of a tag.
For the second case, escaped characters always take the form &xxxxx;, so it is enough to (1) insert a <wbr> tag before every & symbol (note the fourth argument in the call), because (2) the next <wbr> tag is then inserted only 30 characters later, which is far longer than xxxxx. This is exactly how the function is called in the code above. Together, points 1 and 2 ensure that <wbr> is never inserted in the middle of an escaped character.
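
The effect of breaking before every & can be seen in a stripped-down sketch. This is not the author's function: insert_wbr_sketch() is a hypothetical, byte-oriented (ASCII-only) simplification, and a limit of 10 is used instead of 30 to keep the sample short.

```php
<?php
// Hypothetical simplification: break after $n characters without a space,
// and before any character listed in $break_before.
function insert_wbr_sketch($str, $n, $break_before = '')
{
    $out = '';
    $run = 0; // characters since the last space or inserted break
    $len = strlen($str);
    for ($i = 0; $i < $len; ++$i) {
        $ch = $str[$i];
        if ($ch === ' ') {
            $run = 0;
        } else {
            $forced = $break_before !== ''
                && strpos($break_before, $ch) !== false;
            if ($i > 0 && ($run === $n || $forced)) {
                $out .= '<wbr>';
                $run = 0;
            }
            ++$run;
        }
        $out .= $ch;
    }
    return $out;
}

// Length rule only: the break lands inside the entity.
echo insert_wbr_sketch('aaaaaaaa&lt;bbbb', 10), "\n";      // aaaaaaaa&l<wbr>t;bbbb
// Breaking before "&" resets the counter, so the entity stays whole.
echo insert_wbr_sketch('aaaaaaaa&lt;bbbb', 10, '&'), "\n"; // aaaaaaaa<wbr>&lt;bbbb
```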
The following is a PHP implementation of Htmlinsertwbrs():

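// Stand-in for the author's _u_isspace() helper, which is not shown in
// this article; treating space, tab, CR and LF as whitespace is an assumption.
if (!function_exists('_u_isspace')) {
    function _u_isspace($ch)
    {
        return $ch === ' ' || $ch === "\t" || $ch === "\r" || $ch === "\n";
    }
}

function Htmlinsertwbrs($str, $n = 10,
    $chars_to_break_after = '', $chars_to_break_before = '')
{
    $out = '';
    $strpos = 0;  // start of the part of $str not yet copied to $out
    $spc = 0;     // position of the last whitespace or inserted break
    $len = mb_strlen($str, 'UTF-8');
    for ($i = 1; $i < $len; ++$i) {
        $prev_char = mb_substr($str, $i - 1, 1, 'UTF-8');
        $next_char = mb_substr($str, $i, 1, 'UTF-8');
        if (_u_isspace($next_char)) {
            $spc = $i;
        } else {
            // Break when $n characters have passed without whitespace,
            // after a break-after character, or before a break-before one.
            if ($i - $spc == $n
                ||
                mb_strpos($chars_to_break_after,
                    $prev_char, 0, 'UTF-8')
                !== FALSE
                ||
                mb_strpos($chars_to_break_before,
                    $next_char, 0, 'UTF-8')
                !== FALSE
            ) {
                $out .= mb_substr($str, $strpos,
                    $i - $strpos, 'UTF-8')
                    . '<wbr>';
                $strpos = $i;
                $spc = $i;
            }
        }
    }
    $out .= mb_substr($str, $strpos, $len - $strpos, 'UTF-8');
    return $out;
}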
...
OK, that's enough for now; the relevant references are linked in the text.
I'll add more when I think of it.
