Displaying Web pages correctly in any character set

Source: Internet
Author: User
Tags: character set, ord

A web page usually declares a character encoding, such as GB2312, UTF-8, or ISO-8859-1, so that the text on the page displays as intended. Sometimes, though, we want to show Chinese characters on an ISO-8859-1 page, or Korean on a GB2312 page. One solution is to drop ISO-8859-1 or GB2312 and use UTF-8 everywhere, since text from any language can be mixed within that one encoding. Many websites now take this approach.

That is not the method I want to describe here, because it only works when the character set really is UTF-8. If the user manually switches to another character set, or the character set declaration fails to take effect for some reason and the browser does not detect the encoding correctly, the page still shows up garbled. This happens especially with framed pages: if the declaration in one frame's page does not take effect, Firefox displays that frame garbled and offers no way to change it (unless the RightEncode plug-in is installed).

With the method I introduce here, Chinese, Japanese, and other characters display correctly even when the page is declared as ISO-8859-1. The principle is simple: every character other than the first 128 characters of ISO-8859-1 (the ASCII range) is written as an NCR (numeric character reference). For example, if we write the word "汉字" ("Chinese characters") as "&#27721;&#23383;", it displays correctly in any character set. Based on this principle, I wrote the program below, which turns an existing web page into one that displays correctly in any character set. You only need to specify the source page and its character set, then click the Submit button to get the target page. You can also convert just a piece of text: enter it in the text box, specify its original character set, and click Submit, and the encoded text appears on the page. I also wrote a WordPress plug-in, so my blog now displays correctly in any character set.
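As a quick illustration of the principle, the sketch below checks two things about the word "汉字" ("Chinese characters") written as NCRs: the encoded string contains only ASCII bytes, and decoding the references gives back the original characters. It relies on PHP's built-in `html_entity_decode`; the variable names are only for illustration.

```php
<?php
// "汉字" written as decimal numeric character references.
$ncr = '&#27721;&#23383;';

// Every byte is below 128, so the text does not depend on the page's charset label.
$isAscii = preg_match('/^[\x00-\x7F]*$/', $ncr) === 1;

// Decoding the references yields the original characters (here as UTF-8).
$decoded = html_entity_decode($ncr, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($isAscii);                         // bool(true)
var_dump($decoded === "\u{6C49}\u{5B57}");  // bool(true)
```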

Implementation method:

The first step is to convert the string from the source character set to UTF-16. In UTF-16 every character in the Basic Multilingual Plane is exactly two bytes long, which is easy to process, whereas working directly on the source character set is complicated. (Characters outside that plane take two 16-bit units, a surrogate pair.) The source character set can be read from the META tag of the original page, or it can be specified separately. My program lets the user specify it in the form, because I cannot guarantee that the submitted file is an HTML file (other files work too; for example, the source of the WordPress Chinese language pack is a PO file, and its contents can be handled the same way), and even an HTML file does not necessarily have a META tag that declares the character set. Specifying the character set separately in the form is therefore the safer choice.

Converting one character set to another may sound complicated, and it really is troublesome to implement yourself, but in PHP it is easy because the functionality is built in: the iconv function converts between a wide range of character sets. If the iconv extension is not installed on your machine, you can use the mb_convert_encoding function instead. If the Multibyte String (mbstring) extension is not installed either, there is little you can do, because implementing that many conversions yourself is impractical unless you are a real expert. I recommend iconv because it is efficient and supports more character sets.
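The conversion step, with the fallback from iconv to mbstring, might look like the following minimal sketch. The helper name `toUtf16be` is hypothetical and not part of the original program; it assumes at least one of the two extensions is loaded.

```php
<?php
// Hypothetical helper: convert $str from $encode to big-endian UTF-16.
function toUtf16be(string $encode, string $str): string
{
    if (function_exists('iconv')) {
        $converted = iconv($encode, 'UTF-16BE', $str);
    } elseif (function_exists('mb_convert_encoding')) {
        // Note the different argument order: string, target, source.
        $converted = mb_convert_encoding($str, 'UTF-16BE', $encode);
    } else {
        throw new RuntimeException('Neither iconv nor mbstring is available.');
    }
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $encode.");
    }
    return $converted;
}

// "é" is the single byte 0xE9 in ISO-8859-1 and 0x00 0xE9 in UTF-16BE.
echo bin2hex(toUtf16be('ISO-8859-1', "\xE9")), "\n"; // 00e9
```

Asking for the big-endian form explicitly ("UTF-16BE") fixes the byte order, so the later steps can always treat the first byte of each pair as the high byte.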

After this step, process the string two bytes at a time and combine each pair of bytes into a number. If the number is less than 128, output the corresponding character directly (note that it is a single byte); otherwise output it in the form &#xxxxx;, where xxxxx is the number in decimal. One thing to note: when the number is 65279 (hexadecimal 0xFEFF), ignore it. That is the byte order mark (BOM) in Unicode, and since the output string contains only the first 128 characters of ISO-8859-1, it is no longer needed.
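The rule for a single two-byte unit can be sketched on its own as follows. The helper name `encodeUnit` is hypothetical, and `unpack('n*', ...)` is used here only as a convenient way to read big-endian 16-bit numbers; the original program combines the bytes by hand.

```php
<?php
// Hypothetical helper: map one 16-bit code unit to its output text.
function encodeUnit(int $code): string
{
    if ($code < 128) {
        return chr($code);          // ASCII: emit the single byte as-is
    }
    if ($code === 0xFEFF) {
        return '';                  // byte order mark: drop it
    }
    return '&#' . $code . ';';      // everything else: decimal NCR
}

// unpack('n*') reads the string as big-endian 16-bit unsigned integers.
$units = unpack('n*', "\xFE\xFF\x00\x41\x6C\x49"); // BOM, "A", U+6C49
echo implode('', array_map('encodeUnit', $units)), "\n"; // A&#27721;
```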

That is the basic idea. Here is the implementation:

    <?php
    // Convert $str from the $encode character set into ASCII plus NCRs.
    function nochaoscode($encode, $str) {
        $str = iconv($encode, "UTF-16BE", $str);
        $output = "";
        // Walk the UTF-16BE string one 16-bit unit (two bytes) at a time.
        for ($i = 0; $i < strlen($str); $i += 2) {
            $code = ord($str[$i]) * 256 + ord($str[$i + 1]);
            if ($code < 128) {
                $output .= chr($code);
            } else if ($code != 65279) { // skip the byte order mark (0xFEFF)
                $output .= "&#" . $code . ";";
            }
        }
        return $output;
    }
    ?>

In the function's parameters, $encode is the source character set and $str is the string to convert. The return value is the converted string.
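One limitation worth noting: characters outside the Basic Multilingual Plane occupy two 16-bit units (a surrogate pair) in UTF-16, so a loop that reads two bytes at a time would emit two references for one such character. A possible variant, which is not part of the original program, converts to UTF-32BE instead so that each four-byte unit is one whole code point. The name `ncrEncode` is hypothetical, and the sketch assumes the iconv extension is available.

```php
<?php
// Hypothetical variant: one NCR per code point, via UTF-32BE.
function ncrEncode(string $encode, string $str): string
{
    $utf32 = iconv($encode, 'UTF-32BE', $str);
    if ($utf32 === false) {
        throw new RuntimeException("Could not convert from $encode.");
    }
    if ($utf32 === '') {
        return '';
    }
    $output = '';
    // unpack('N*') reads big-endian 32-bit unsigned integers (code points).
    foreach (unpack('N*', $utf32) as $code) {
        if ($code < 128) {
            $output .= chr($code);
        } elseif ($code !== 0xFEFF) {   // skip the byte order mark
            $output .= '&#' . $code . ';';
        }
    }
    return $output;
}

echo ncrEncode('UTF-8', "A\u{6C49}\u{1F600}"), "\n"; // A&#27721;&#128512;
```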


