Truncating an HTML code string in PHP

Source: Internet
Author: User
Tags html tags tidy
Requirement: truncate a piece of text so that it fits a given display width. Note that the limit is a display width, not a number of bytes and not a number of characters. In UTF-8 a Chinese character takes 3 or 4 bytes, yet on screen it occupies the width of two English characters, while an English character occupies only one. Full-width and half-width characters differ in the same way.
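To make the difference between bytes, characters and display width concrete, here is a small sketch. It assumes the mbstring extension is loaded and the file is saved as UTF-8; measure() is a hypothetical helper name used only for illustration.

```php
<?php
// Returns [byte count, character count, display width] for a UTF-8 string.
function measure(string $s): array
{
    return [
        strlen($s),               // bytes
        mb_strlen($s, 'UTF-8'),   // characters
        mb_strwidth($s, 'UTF-8'), // display columns: wide characters count as 2
    ];
}

// ASCII letters, Chinese characters, and full-width Latin letters.
foreach (['abc', '中文', 'ＡＢ'] as $s) {
    vprintf("bytes=%d chars=%d width=%d\n", measure($s));
}
```

Only the third number matches what the requirement asks for, which is why neither strlen nor mb_strlen is enough here.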

And the data given is an HTML code string, like this:

<div class="AAA"><a href="/aaa.php?id=1">John</a> Reviews <a href="/aaa.php?id=444">Dick</a> share <a href="bbb.html">An article a long list of things</a></div>

The truncation applies to the content inside the div tag, and the HTML tags must be kept; only the text is processed. For example, the cut might fall in the middle of "Dick" and keep only its first character. If the string were simply chopped at that point, the a tag opened in front of "Dick" would be left unclosed, so the result must still be syntactically correct HTML after truncation.

This problem is not an easy one, and it frustrated me for two days. Note that this is just a string whose content happens to be HTML code; there is no DOM. On the front end it would be easy: get the DOM node directly, process the nodes inside it, and output the result through innerHTML or something similar. Here a different approach is needed. A colleague's idea is this:

Traverse each character of the string. Keep a flag: on hitting the < that opens a tag, set the flag to 1 and stop counting the characters that follow; on hitting >, start counting again. While processing the text between tags, also check whether the current character might be Chinese. Generally a Chinese character encoded in UTF-8 is 3 bytes long in PHP, so on meeting one you must skip two bytes in the count ... At this point my head was already spinning. Personally I find this method very awkward. First, such delicate logic is hard to get right. Second, a Chinese character in UTF-8 can be 3 or 4 bytes long, so the rigor of the code is questionable.

My own idea is to do it with Tidy (see the PHP Manual for specific usage). I studied Tidy yesterday and found it very useful. First, convert the string into a Tidy object, like this:

$tidy = tidy_parse_string($str, array(), 'utf8');

The last argument sets the encoding. Note that it is utf8 here, not utf-8; there is no hyphen in the middle.

Then get the body from $tidy (because Tidy automatically wraps the fragment in a complete document, including a body element):

$body = tidy_get_body($tidy);

At this point you can use var_dump to inspect the structure of $body. You will find that each tag has been turned into a corresponding object with corresponding properties. For example, a fragment such as <a href="#">sdf</a> corresponds to properties like these:

name => "a"
value => "<a href="#">sdf</a>"
child => array { [0] => a text node object whose value is sdf }
attribute => array { "href" => "#" }
... other properties
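As a lighter alternative to var_dump, the tree can be walked by hand. The sketch below assumes the tidy extension is installed; outline_node is a hypothetical helper, while tidy_parse_string, tidy_get_body, isText() and the name, value, attribute and child properties belong to the extension.

```php
<?php
// Builds an indented outline of a tidyNode subtree: element names with
// their attributes, and the text of text nodes.
function outline_node(tidyNode $node, int $depth = 0): string
{
    $indent = str_repeat('  ', $depth);
    if ($node->isText()) {
        return $indent . 'text: ' . trim($node->value) . "\n";
    }

    $line = $indent . $node->name;
    // attribute is null when the element has no attributes.
    foreach ($node->attribute ?? [] as $attrName => $attrValue) {
        $line .= sprintf(' %s="%s"', $attrName, $attrValue);
    }

    $out = $line . "\n";
    // child is null when the node has no children.
    foreach ($node->child ?? [] as $child) {
        $out .= outline_node($child, $depth + 1);
    }
    return $out;
}

// Only run the demonstration when the tidy extension is actually loaded.
if (extension_loaded('tidy')) {
    $tidy = tidy_parse_string('<a href="#">sdf</a>', [], 'utf8');
    echo outline_node(tidy_get_body($tidy));
}
```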

As you can see, we can work directly on the value of the text node beneath the a tag's node, which does not break the integrity of the HTML. Originally I thought that after changing the value of the text node inside the a tag, the value of the a tag itself would change as well, so I could simply return the value of the a tag's node. It turned out not to work that way, so after processing the text I still had to assemble the new HTML myself.

Once you know the structure of the Tidy object, everything is straightforward: just traverse all the nodes. For this requirement, that means finding the div tag and then processing the nodes inside it. The code is as follows:

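// $subchild is a text node inside the div; $len is the display width still available.
if (mb_strwidth($subchild->value, 'utf-8') >= $len)
{
    // This node uses up the remaining width: trim it, append '...', and stop.
    // The result is kept in a local variable, since writing it back to the
    // node does not change the HTML of the parent anyway.
    $text = mb_strimwidth($subchild->value, 0, $len, '...', 'utf-8');
    $trimed_str .= $text;
    break;
}
else
{
    // The whole node fits: keep it and reduce the remaining width.
    $trimed_str .= $subchild->value;
    $len = $len - mb_strwidth($subchild->value, 'utf-8');
}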

$subchild here is a child node. Note that mb_strwidth is used to get the width of the string. I strongly recommend mb_strwidth: it is very easy to use, and it counts a Chinese character as two columns wide, which is exactly what is needed here. The truncation itself uses mb_strimwidth, which likewise counts a Chinese character as two columns. The functions beginning with mb_ are really useful.

I will not post the complete code, because it was written for one specific need and is not in a general form. When I have time I will make a general version and release it.
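As a starting point, a more general version might look like the sketch below. This is not the author's code. It assumes the tidy and mbstring extensions are available, treats only a handful of void elements specially, and assumes the text node values returned by Tidy are already escaped for output. truncate_html and rebuild_node are hypothetical names. Rather than locating the div first, it processes every child of the body, which includes the div in the example.

```php
<?php
// Truncate the text of an HTML fragment to $width display columns by walking
// the Tidy node tree and re-assembling the markup by hand.
function truncate_html(string $html, int $width, string $marker = '...'): string
{
    // 'wrap' => 0 asks Tidy not to re-wrap long lines of text.
    $tidy = tidy_parse_string($html, ['wrap' => 0], 'utf8');
    $out = '';
    foreach (tidy_get_body($tidy)->child ?? [] as $child) {
        $out .= rebuild_node($child, $width, $marker);
    }
    return $out;
}

// $remaining is passed by reference so that all nodes share one width budget.
function rebuild_node(tidyNode $node, int &$remaining, string $marker): string
{
    if ($node->isText()) {
        if ($remaining <= 0) {
            return ''; // budget already spent: drop the text
        }
        // Drop any line break at the end of the serialized text so it is
        // not counted as a column.
        $text = rtrim($node->value, "\r\n");
        $w = mb_strwidth($text, 'UTF-8');
        if ($w >= $remaining) {
            // Same rule as the snippet above: trim, add the marker, stop counting.
            $text = mb_strimwidth($text, 0, $remaining, $marker, 'UTF-8');
            $remaining = 0;
        } else {
            $remaining -= $w;
        }
        return $text;
    }

    // Comments and other non-element nodes are copied through unchanged.
    if ($node->type !== TIDY_NODETYPE_START && $node->type !== TIDY_NODETYPE_STARTEND) {
        return $node->value;
    }

    $attrs = '';
    foreach ($node->attribute ?? [] as $name => $value) {
        // double_encode = false avoids escaping entities a second time.
        $attrs .= sprintf(' %s="%s"', $name, htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8', false));
    }

    $inner = '';
    foreach ($node->child ?? [] as $child) {
        $inner .= rebuild_node($child, $remaining, $marker);
    }

    // Void elements have no closing tag (short, incomplete list).
    if (in_array($node->name, ['br', 'img', 'hr', 'input'], true)) {
        return "<{$node->name}{$attrs}>";
    }
    return "<{$node->name}{$attrs}>{$inner}</{$node->name}>";
}

if (extension_loaded('tidy')) {
    $html = '<div class="AAA"><a href="/aaa.php?id=1">John</a> Reviews '
          . '<a href="/aaa.php?id=444">Dick</a> share '
          . '<a href="bbb.html">An article a long list of things</a></div>';
    echo truncate_html($html, 8), "\n";
}
```

Elements that come after the cut are still emitted, only without their text, so every tag that was opened is also closed. Whether those empty elements should be removed depends on the page, and the sketch leaves them in.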

In addition, it is a pity that Firefox, at the time of writing, does not support the text-overflow property; otherwise there would be no need to truncate so laboriously on the back end.


