PHP + Tidy: Near-Perfect XHTML Error Correction and Filtering

Source: Internet
Author: User
Tags html cleanup processing text tidy

Input and Output
Input and output are basic functions of many websites: users submit data, and the website outputs it for others to browse.

Take the popular blog as an example. Here the input is the article the author edits, and the output is the blog page generated for others to read.
The problem is that user input is usually uncontrolled: it may be badly formatted or contain code that poses a security risk, yet the content the website finally outputs must be correct HTML. User input therefore has to be corrected and filtered.

Never trust user input
You may say that WYSIWYG editors such as FCKeditor and TinyMCE are everywhere. True, they can all generate standard XHTML automatically, but as a web developer you must have heard the rule "Never trust data submitted by users".

Therefore, it is necessary to correct and filter user input data.

Better error correction and filtering
So far I have not seen an implementation that satisfies me. What I see is usually inefficient, produces mediocre results, and has obvious defects of one kind or another. WordPress, for example, is a widely used blog system that is simple to operate and rich in plug-ins, but the TinyMCE integrated in its back end and some of its too-clever error correction and filtering code are quite a headache: forced replacement of half-width characters, overly conservative filtering, and so on. As a result, it is hard to paste a piece of code and have it display correctly.

While I am at it, a complaint: this blog is built on WordPress. To get the code in these articles to display correctly, I searched the Internet at length and tried several plug-ins; in the end I commented out some of the filtering rules in the code, and the result is barely presentable.

Of course, I don't want to blame WordPress too much; I just want to point out that it could do better.

What is tidy and how does it work?
The description from the tidy man page is as follows:

Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For HTML variants, it detects and corrects many common coding errors and strives to produce visually equivalent markup that is both W3C compliant and works on most browsers. A common use of Tidy is to convert plain HTML to XHTML. For generic XML files, Tidy is limited to correcting basic well-formedness errors and pretty printing.

In short, tidy cleans up HTML code, generates HTML code that complies with W3C standards, and supports HTML, XHTML, and XML. Tidy provides a library tidylib to facilitate the use of the powerful functions of tidy in other applications. Fortunately, PHP has corresponding tidy modules available.
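
To make "cleaning up" concrete, here is a minimal sketch that feeds a fragment with unclosed tags to tidy_repair_string(). The wrapper name demo_repair() is hypothetical and exists only for this illustration; it returns null when the tidy extension is not loaded, so the snippet runs either way.

```php
<?php
// demo_repair() is a hypothetical wrapper used only for this illustration.
// It returns the repaired markup, or null when tidy is unavailable.
function demo_repair(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null;
    }
    $config = ['output-xhtml' => true, 'show-body-only' => true];
    $fixed = tidy_repair_string($html, $config, 'utf8');
    return $fixed === false ? null : trim($fixed);
}

// Both <p> and <b> are left open in the input.
$result = demo_repair('<p>Hello <b>world');
echo $result ?? 'tidy extension not loaded', PHP_EOL;
```

With tidy installed, the output should contain the closing tags that were missing from the input.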

Dude, why PHP again?
Well, that is a fair question... sorry, PHP is simply all I need.
Fortunately, this article is not just code; it also walks through the analysis, and sharing that is far more useful than just posting code.

Use tidy in PHP
To use tidy in PHP, you need to install the tidy module, that is, load the PHP extension tidy.so. The installation steps are omitted here; they are purely mechanical. When you are done, phpinfo() should report "Tidy support enabled".
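
Besides reading the phpinfo() page, the same check can be made from code. The sketch below is one possible way to do it; tidy_status() is a hypothetical helper name, not something from the tidy extension.

```php
<?php
// tidy_status() is a hypothetical helper name used for illustration.
// It reports whether the tidy extension is usable in this PHP build.
function tidy_status(): string
{
    if (extension_loaded('tidy') && function_exists('tidy_repair_string')) {
        return 'tidy available';
    }
    return 'tidy missing';
}

echo tidy_status(), PHP_EOL;
```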

With this module, almost all of tidy's functions can be used from PHP. Ordinary HTML cleanup becomes extremely easy, and you can even build a parse tree of the document and work with HTML nodes much as you would with the DOM on the client. The code below shows the details; the official PHP manual is also worth consulting.
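
As a rough illustration of the parse tree, the sketch below walks the nodes under the body and collects element names. The helper names collect_tag_names() and demo_tag_names() are hypothetical; the tidy calls themselves (tidy_parse_string(), tidy_get_body(), tidyNode::hasChildren()) come from the PHP tidy extension. Without the extension, the function simply returns an empty array.

```php
<?php
// collect_tag_names() is a hypothetical helper: it walks a tidyNode tree
// depth-first and records the name of every element node it meets.
function collect_tag_names($node, array &$names): void
{
    if ($node->type === TIDY_NODETYPE_START
        || $node->type === TIDY_NODETYPE_STARTEND) {
        $names[] = $node->name;
    }
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            collect_tag_names($child, $names);
        }
    }
}

// demo_tag_names() is also hypothetical; it returns [] when tidy is absent.
function demo_tag_names(string $html): array
{
    $names = [];
    if (!function_exists('tidy_parse_string')) {
        return $names;
    }
    $doc = tidy_parse_string($html, ['output-xhtml' => true], 'utf8');
    $body = $doc ? tidy_get_body($doc) : null;
    if ($body && $body->hasChildren()) {
        foreach ($body->child as $child) {
            collect_tag_names($child, $names);
        }
    }
    return $names;
}

print_r(demo_tag_names('<p>Hi <a href="#">link</a><br></p>'));
```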

PHP + tidy implementation for error correction and filtering
So much background may seem tedious; concrete code that solves the problem is the most direct.

1. Simple Error Correction

function htmlfix($html)
{
    // Without the tidy extension, return the input unchanged
    if (!function_exists('tidy_repair_string'))
        return $html;

    // Use tidy to repair the HTML code
    // Repair
    $str = tidy_repair_string($html,
        array('output-xhtml' => true),
        'utf8');
    // Parse
    $str = tidy_parse_string($str,
        array('output-xhtml' => true),
        'utf8');
    $s = '';

    // Child nodes of <body>, if there is a body at all
    $body = tidy_get_body($str);
    $nodes = $body ? $body->child : null;

    if (!is_array($nodes)) {
        return $s;
    }

    // Concatenate the markup of each top-level node
    foreach ($nodes as $n) {
        $s .= $n->value;
    }
    return $s;
}
The code above cleans up and corrects XHTML that may not be well formed and outputs standard XHTML (input and output are both UTF-8). It is not the most concise implementation; I wrote it out in detail so that it lines up with the filtering function that follows.

2. Advanced Implementation: Error Correction + Filtering

Features:

XHTML error correction: outputs standard XHTML.
Filtering of unsafe code without affecting how the content displays: <script> and <style> elements are removed, along with event-handler attributes and javascript: attribute values.
Insertion of <wbr> tags into very long strings so that browsers can wrap them automatically. For details, see my earlier article on breaking lines in very long text on web pages.
// _dumpnode(): check one node, filter it, and append the safe markup to $s.
// It is declared at top level: a named function declared inside htmlfixsafe()
// would be redeclared (a fatal error) the second time htmlfixsafe() is called.
function _dumpnode($node, &$s)
{
    // Check the node name. <script> and <style> are dropped outright.
    switch ($node->name) {
        case 'script':
        case 'style':
            return;
        default:
    }

    if ($node->type == TIDY_NODETYPE_TEXT) {
        /*
          Text nodes get extra processing:
          automatic line breaks in overlong text;
          automatic recognition of hyperlinks (not implemented)
        */
        // Insert <wbr>
        $s .= htmlinsertwbrs($node->value, 30, '', '&?/\\');

        // Auto links ??? *** TODO ***
        return;
    }

    // Not a text node: emit the tag and its attributes
    $s .= '<' . $node->name;

    // Check each attribute
    if ($node->attribute) {
        foreach ($node->attribute as $name => $value) {

            /*
              Drop DOM event handlers, which start with "on",
              for example onclick, onmouseover, ...
              Also drop attributes whose value starts with javascript:,
              for example href="javascript:".
              Note the strict === 0: strpos() returns false when nothing
              matches, and false == 0 is true in PHP.
            */
            if (strpos($name, 'on') === 0
                || stripos(trim($value), 'javascript:') === 0
            ) {
                continue;
            }

            // Keep safe attributes. The original called the author's
            // htmlescape() helper, which is not shown in this article;
            // htmlspecialchars() is used here as a stand-in.
            $s .= ' ' . $name . '="'
                . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
        }
    }

    // Recursively check the child nodes of this node
    if ($node->child) {

        $s .= '>';

        foreach ($node->child as $child) {
            _dumpnode($child, $s);
        }

        // Children are done; close the tag
        $s .= '</' . $node->name . '>';
    } else {

        /*
          No child nodes; close the tag.
          (Empty nodes could also simply be deleted.)
        */
        if ($node->type == TIDY_NODETYPE_START)
            $s .= '></' . $node->name . '>';
        else
            /*
              Unpaired tags such as <hr/> and <br/>
              are closed directly with />
            */
            $s .= '/>';
    }
}

function htmlfixsafe($html)
{
    // Without the tidy extension, return the input unchanged
    if (!function_exists('tidy_repair_string'))
        return $html;

    // Tidy parameter settings
    $conf = array(
        'output-xhtml' => true,
        'drop-empty-paras' => false,
        'join-classes' => true,
        'show-body-only' => true
    );

    // Repair
    $str = tidy_repair_string($html, $conf, 'utf8');
    // Build the parse tree
    $str = tidy_parse_string($str, $conf, 'utf8');

    $s = '';

    // Get the body node
    $body = tidy_get_body($str);

    // Filter each child of the body node with _dumpnode()
    if ($body && $body->child) {
        foreach ($body->child as $child)
            _dumpnode($child, $s);
    } else
        return '';

    return $s;
}
The comments in the code above should be detailed enough; read them alongside the code to see how it works.
Stricter filtering, or extensions such as automatic recognition of links in the text, are easy to add.
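
The attribute rule at the heart of the filter can be tried on its own, without tidy. The sketch below isolates it in a function with a hypothetical name, is_unsafe_attribute(). It compares with === 0 on purpose: strpos() and stripos() return false when nothing matches, and false == 0 is true in PHP, so a loose comparison would discard every attribute.

```php
<?php
// is_unsafe_attribute() is a hypothetical name for the attribute check
// performed inside the filter; it needs no tidy extension.
function is_unsafe_attribute(string $name, string $value): bool
{
    // Event handlers: the name begins with "on" (onclick, onmouseover, ...).
    if (stripos($name, 'on') === 0) {
        return true;
    }
    // Script URLs: the trimmed value begins with "javascript:".
    return stripos(trim($value), 'javascript:') === 0;
}

var_dump(is_unsafe_attribute('onclick', 'alert(1)'));           // bool(true)
var_dump(is_unsafe_attribute('href', ' JavaScript:void(0)'));   // bool(true)
var_dump(is_unsafe_attribute('href', 'https://example.com/'));  // bool(false)
var_dump(is_unsafe_attribute('action', 'on'));                  // bool(false)
```

This is a blacklist, so it only catches the patterns it names; treat it as a starting point rather than a complete defense.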

Some Supplements

If you have read my earlier article on breaking lines in long text on web pages, you may have noticed that the function used above for automatic line breaks is a different one:

That article introduced htmlescapeinsertwbrs(), while the code above uses htmlinsertwbrs().

Here is the explanation:
htmlescapeinsertwbrs() requires that special characters in the input string are not yet escaped, that is, that < > & have not been converted to &lt; &gt; &amp; by htmlspecialchars(), because the function handles escaping internally.
In text nodes produced by tidy, characters such as < > & have already been escaped to &lt; &gt; &amp; by tidy itself, so a separate function is needed to avoid escaping twice. That function is htmlinsertwbrs(); as the name suggests, it only inserts <wbr> tags and does nothing else.
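
The double-escaping problem is easy to reproduce with htmlspecialchars() alone. The sketch below escapes a string once, then escapes the result again, which is what would happen if already-escaped text from tidy were passed through an escaping function.

```php
<?php
$raw = '<b> & more';

// First pass: what tidy (or htmlspecialchars) produces for a text node.
$once = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// Second pass: escaping already-escaped text mangles every entity.
$twice = htmlspecialchars($once, ENT_QUOTES, 'UTF-8');

echo $once, PHP_EOL;   // &lt;b&gt; &amp; more
echo $twice, PHP_EOL;  // &amp;lt;b&amp;gt; &amp;amp; more
```

A browser would render the second string as the literal text "&lt;b&gt; &amp; more" instead of "<b> & more".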

Then you may ask:
What if <wbr> is inserted in the middle of an HTML tag or an escaped character, for example inside <div> or &gt;, producing <d<wbr>iv> or &<wbr>gt;?

Yes, that is a new problem, but a couple of tricks solve it effectively:

Because we only process text nodes obtained from tidy, we never touch an HTML tag, so <wbr> cannot end up in the middle of a tag.
As for the second case, escaped characters have the form &xxxxx;. It is enough that (1) a <wbr> is inserted before every & (note the fourth parameter in the call), and (2) the next <wbr> is inserted only 30 characters later (the second parameter in the call above), which is far more than the length of xxxxx. Together, points 1 and 2 ensure that nothing is inserted in the middle of an escaped character.
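
The claim that 30 characters is far longer than any escaped character can be checked against PHP's own entity table. The sketch below is only a sanity check under one assumption: that the entities appearing in the text are drawn from the HTML 4.01 set. longest_entity_length() is a hypothetical helper name; a different entity table would need to be rechecked.

```php
<?php
// longest_entity_length() is a hypothetical helper for this check only.
// It returns the length of the longest entity string (including & and ;)
// in PHP's HTML 4.01 translation table.
function longest_entity_length(): int
{
    $table = get_html_translation_table(
        HTML_ENTITIES,
        ENT_QUOTES | ENT_HTML401,
        'UTF-8'
    );
    $max = 0;
    foreach ($table as $entity) {
        $max = max($max, strlen($entity));
    }
    return $max;
}

$n = 30; // the interval passed as the second argument in the filter
$longest = longest_entity_length();
echo "longest entity: $longest characters, interval: $n", PHP_EOL;
var_dump($longest < $n);
```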
The PHP implementation of htmlinsertwbrs() follows:

function htmlinsertwbrs($str, $n = 10,
    $chars_to_break_after = '', $chars_to_break_before = '')
{
    $out = '';
    $strpos = 0;   // start of the segment not yet copied to $out
    $spc = 0;      // position of the last whitespace or inserted break
    $len = mb_strlen($str, 'UTF-8');
    for ($i = 1; $i < $len; ++$i) {
        $prev_char = mb_substr($str, $i - 1, 1, 'UTF-8');
        $next_char = mb_substr($str, $i, 1, 'UTF-8');
        // _u_isspace() is a whitespace test that is not defined in this article
        if (_u_isspace($next_char)) {
            $spc = $i;
        } else {
            // Break after $n characters without whitespace, after any
            // "break after" character, or before any "break before" character
            if ($i - $spc == $n
                || mb_strpos($chars_to_break_after,
                    $prev_char, 0, 'UTF-8')
                    !== false
                || mb_strpos($chars_to_break_before,
                    $next_char, 0, 'UTF-8')
                    !== false
            ) {
                $out .= mb_substr($str, $strpos,
                    $i - $strpos, 'UTF-8')
                    . '<wbr/>';
                $strpos = $i;
                $spc = $i;
            }
        }
    }
    // Copy whatever is left after the last break
    $out .= mb_substr($str, $strpos, $len - $strpos, 'UTF-8');
    return $out;
}
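
The listing calls _u_isspace(), which this article does not define. The sketch below is a hypothetical stand-in, assuming the helper is meant to return true when its argument is a single Unicode whitespace character; the author's own version may differ.

```php
<?php
// Hypothetical stand-in for the _u_isspace() helper, which the article
// does not show. Assumed behavior: true when $char is exactly one
// whitespace character, including Unicode separators such as U+3000.
if (!function_exists('_u_isspace')) {
    function _u_isspace($char)
    {
        // \s covers ASCII whitespace; \p{Z} adds Unicode separators.
        return preg_match('/^[\s\p{Z}]\z/u', $char) === 1;
    }
}

var_dump(_u_isspace(' '));          // bool(true)
var_dump(_u_isspace("\u{3000}"));   // bool(true), ideographic space
var_dump(_u_isspace('a'));          // bool(false)
```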
OK, that covers it for now. Links to the related materials will have to wait.
I will pick this up again next time.
