How to Use UTF-8_with_BOM, XML and Java together

Source: Internet
Author: User

Utf_bom FAQ
WWW escapes
Wikipedia UTF-8.
Kuinka ääkköset toimimaan servletissä (in Finnish)

Use UTF-8 for your HTML files
You should use UTF-8 for all your HTML files; it just makes life easier. There are two things to keep in mind, both shown in the example HTML below. If you follow these simple rules, your site's readers should not have problems displaying text.

  • Save your .html files as UTF-8 encoded text files
  • Add a "meta http-equiv" tag to the head part of your HTML files

<html>
<head>
  <title>Page Title</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <meta name="keywords" content="some,fine,keywords" />
</head>
<body>
your html content goes here....
</body>
</html>

 

XML
You should put a BOM marker at the start of text files if possible. To be even safer, add the XML header row and specify the encoding you use within the document.
<?xml version="1.0" encoding="UTF-8"?>

Windows Notepad (Win2k, XP) can save files with a BOM marker. Switch to another text editor if your favorite one cannot work with the standard BOM markers.

  • UTF-8
  • UTF-16BE (big endian)
  • UTF-16LE (little endian)
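
Each of these markers is simply the character U+FEFF encoded in the charset in question. The short Java sketch below prints the resulting bytes; the class name BomBytes and the helper bomHex are made up for this illustration, and only the standard java.nio.charset classes are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of the article's sources. */
final class BomBytes {
    private BomBytes() { }

    /** Returns U+FEFF encoded in the given charset as a hex string. */
    static String bomHex(Charset cs) {
        byte[] bytes = "\uFEFF".getBytes(cs);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-8:    " + bomHex(StandardCharsets.UTF_8));    // EF BB BF
        System.out.println("UTF-16BE: " + bomHex(StandardCharsets.UTF_16BE)); // FE FF
        System.out.println("UTF-16LE: " + bomHex(StandardCharsets.UTF_16LE)); // FF FE
    }
}
```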

Windows WordPad (Win2k, XP) can't save files using UTF-8 charset.

Here is a small example XML document.

<?xml version="1.0" encoding="UTF-8"?>
<note>
   <body1>Jättiläinen meni keittiöön   ja kaatoi kaikki kattilat.    Hiiri            meni puutarhaan   ja söi kaikki puut.   </body1>
   <body2>char entities: &lt; &gt; &amp; &quot; &apos;</body2>
   <body3>safe xml chars: /O/</body3>
   <body4>Decimal Numeric Character Reference: &#196; &#8364;</body4>
   <body5>Hex Numeric Character Reference: &#xC4; &#x20AC;</body5>
</note>

You should see this after unescaping the document.

   Jättiläinen meni keittiöön   ja kaatoi kaikki kattilat.    Hiiri            meni puutarhaan   ja söi kaikki puut.
   char entities: < > & " '
   safe xml chars: /O/
   Decimal Numeric Character Reference: Ä €
   Hex Numeric Character Reference: Ä €
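
To see the unescaping happen in Java, the minimal sketch below parses a cut-down copy of the sample document, with the entities written in escaped form, using the JDK's built-in DOM parser. The class name UnescapeDemo, the constant SAMPLE and the helper textOf are made up for this illustration; only the javax.xml.parsers and org.w3c.dom calls come from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical demo class, not part of the article's sources. */
final class UnescapeDemo {
    private UnescapeDemo() { }

    /** Cut-down version of the sample document, entities still escaped. */
    static final String SAMPLE =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<note>"
        + "<body2>char entities: &lt; &gt; &amp; &quot; &apos;</body2>"
        + "<body4>Decimal Numeric Character Reference: &#196; &#8364;</body4>"
        + "<body5>Hex Numeric Character Reference: &#xC4; &#x20AC;</body5>"
        + "</note>";

    /** Parses the XML and returns the unescaped text of the first element named tag. */
    static String textOf(String xml, String tag) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName(tag).item(0).getTextContent();
        } catch (Exception ex) {
            // Sketch only: report parser setup, I/O and syntax errors the same way.
            throw new IllegalArgumentException("could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(SAMPLE, "body2"));
        System.out.println(textOf(SAMPLE, "body4"));
        System.out.println(textOf(SAMPLE, "body5"));
    }
}
```

Whether the Ä and € characters show up correctly on the console depends on the console's own encoding, which is a separate matter from the parsing.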

 

 

 

Java BOM Recognition
UnicodeReader class
JDK bug 4508058

Java's default I/O readers do not recognize all BOM markers; in particular, a UTF-8 BOM is passed through to the application as the character U+FEFF. A fix was reported for JDK 6, but I haven't tested it yet. You can use the UnicodeReader class to work around the problem and auto-recognize BOM markers. It works transparently on top of the underlying InputStream.
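
The general idea behind a BOM-aware reader can be sketched with a PushbackInputStream: peek at the first bytes, pick the charset from the marker, and push back whatever was not part of a BOM. The code below is an illustration under that assumption and is not the UnicodeReader source; the class name BomSkippingReaders and its methods open and decode are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; this is not the UnicodeReader class itself. */
final class BomSkippingReaders {
    private BomSkippingReaders() { }

    /** Wraps the stream in a Reader, consuming a leading BOM if one is present. */
    static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {               // read up to 3 bytes, stop at EOF
            int r = pb.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        Charset cs = fallback;                  // used when no BOM is found
        int bomLen = 0;
        if (n >= 3 && u(head[0]) == 0xEF && u(head[1]) == 0xBB && u(head[2]) == 0xBF) {
            cs = StandardCharsets.UTF_8;
            bomLen = 3;
        } else if (n >= 2 && u(head[0]) == 0xFE && u(head[1]) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            bomLen = 2;
        } else if (n >= 2 && u(head[0]) == 0xFF && u(head[1]) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;
            bomLen = 2;
        }
        if (n > bomLen) {
            pb.unread(head, bomLen, n - bomLen); // give back the non-BOM bytes
        }
        return new InputStreamReader(pb, cs);
    }

    private static int u(byte b) {
        return b & 0xFF;
    }

    /** Convenience for in-memory data: decodes the bytes without the BOM. */
    static String decode(byte[] data, Charset fallback) {
        StringBuilder out = new StringBuilder();
        try (Reader r = open(new ByteArrayInputStream(data), fallback)) {
            char[] buf = new char[1024];
            int read;
            while ((read = r.read(buf)) != -1) {
                out.append(buf, 0, read);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        System.out.println(decode(data, Charset.defaultCharset())); // prints hi
    }
}
```

The sketch only covers the three markers listed earlier; UTF-32 markers, for example, are deliberately left out.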

Example code using the UnicodeReader class
Here is an example method that reads a text file. It recognizes the BOM marker and skips it while reading.

   import java.io.BufferedReader;
   import java.io.CharArrayWriter;
   import java.io.FileInputStream;
   import java.io.IOException;

   // UnicodeReader is the class linked above; it is not part of the JDK.
   public static char[] loadFile(String file) throws IOException {
      // read text file, auto recognize bom marker or use
      // system default if markers not found.
      char[] buffer = new char[16 * 1024];   // 16k buffer
      int read;
      // try-with-resources closes the reader chain even if reading fails
      try (BufferedReader reader = new BufferedReader(
               new UnicodeReader(new FileInputStream(file), null))) {
         CharArrayWriter writer = new CharArrayWriter();
         while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
         }
         return writer.toCharArray();
      }
   }

Example code to write UTF-8 with a BOM marker
Write the BOM marker bytes at the start of an empty file, and all proper text editors will have no problem picking the correct charset when reading the file. Java's OutputStreamWriter does not write the UTF-8 BOM marker bytes.

   import java.io.BufferedWriter;
   import java.io.File;
   import java.io.FileOutputStream;
   import java.io.IOException;
   import java.io.OutputStreamWriter;

   public static void saveFile(String file, String data, boolean append) throws IOException {
      File f = new File(file);
      // try-with-resources closes the file even if writing fails
      try (FileOutputStream fos = new FileOutputStream(f, append)) {
         // write UTF8 BOM mark if file is empty
         if (f.length() < 1) {
            final byte[] bom = new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF };
            fos.write(bom);
         }
         BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fos, "UTF-8"));
         if (data != null) bw.write(data);
         bw.flush();   // push buffered text to the file before it is closed
      }
   }

 

XML test application, config test application
An example application using the UnicodeReader class, with full sources. It reads various Unicode XML text files and writes the values to a UTF-8_with_BOM text file. The application uses the UnicodeReader class to auto-recognize Unicode BOM markers.
Testxml = reads and writes an XML file
Testconfig = reads and writes a properties file
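
For the properties case, a detail worth knowing is that java.util.Properties.load(InputStream) assumes ISO-8859-1, while load(Reader) takes already decoded characters. The sketch below is not the Testconfig source; it is a hypothetical illustration (class Utf8PropertiesDemo, methods load and parse) of loading properties from a Reader and dropping a leading U+FEFF so that it does not end up in the first key.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical demo class, not part of the article's sources. */
final class Utf8PropertiesDemo {
    private Utf8PropertiesDemo() { }

    /** Loads properties from decoded text, skipping a leading U+FEFF if present. */
    static Properties load(Reader source) throws IOException {
        PushbackReader in = new PushbackReader(source, 1);
        int first = in.read();
        if (first != -1 && first != 0xFEFF) {
            in.unread(first);                // not a BOM, so give the character back
        }
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    /** Convenience wrapper for text that is already in memory. */
    static Properties parse(String text) {
        try {
            return load(new StringReader(text));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Properties p = parse("\uFEFFgiant=J\u00E4ttil\u00E4inen\n");
        System.out.println(p.getProperty("giant"));
    }
}
```

When reading from a file, the Reader passed to load would need to be created with the right charset, for example by a BOM-aware reader such as UnicodeReader.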

Javaxmltest.zip
Reference Image of XML file output
HTML test page

Run the test application, then open the data.txt.rtf file in WordPad or any text editor that can use Unicode TrueType/OpenType fonts. I have found the Arial Unicode MS font to be very good. The file is just a text file even though it has an .rtf suffix. You may open it in Notepad, but it might not show all characters properly by default. You can still use Notepad to edit and save the file, as long as you do not touch the letters shown as unknown black-box characters.

 

 

Note:

1. This article is reproduced from: http://koti.mbnet.fi/akini/java/java_utf8_xml/

2. UnicodeReader and UnicodeInputStream: http://koti.mbnet.fi/akini/java/unicodereader/

 

 
