BizTalk character encoding conversion for outbound/inbound messages
Most messages from Linux/Unix environments use UTF-8, while Windows mostly uses UTF-16 (which Windows tools usually label "Unicode"). It is therefore often necessary to convert the encoding of messages.
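Outside BizTalk, the conversion itself amounts to decoding the bytes from one encoding and re-encoding them in the other. The following is a minimal Python sketch of that idea, not BizTalk code; the transcode helper and the sample payload are hypothetical names chosen for illustration.

```python
def transcode(data: bytes, source: str = "utf-8", target: str = "utf-16-le") -> bytes:
    """Decode bytes from the source encoding, then re-encode them in the target encoding."""
    return data.decode(source).encode(target)


# Hypothetical message payload, as a Linux/Unix sender might produce it.
text = "<order>café</order>"
utf8_message = text.encode("utf-8")
utf16_message = transcode(utf8_message)

# The same text occupies a different number of bytes in each encoding.
print(len(utf8_message), len(utf16_message))
```

The character content is identical before and after; only the byte representation changes.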
Method 1
In BizTalk Server 2006, set the Target charset property of the XML transmit pipeline. For example, set it to Big-Endian-UTF16 if you want to use UTF-16 (Unicode).
Note: Method 1 may be affected by bugs, and the message may not be converted as expected.
Method 2
Use a custom pipeline. For example, set the Target charset property of the XML assembler component in the custom pipeline, as shown below:
• To use the UTF-8 encoding format, set the Target charset property to UTF-8 (65001).
• To use the UTF-16 encoding format, set the Target charset property to Big-Endian-UTF16 (1201) or Little-Endian-UTF16 (1200).
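The numbers in parentheses are Windows code page identifiers. To illustrate what each choice means for the bytes that are sent, here is a small Python sketch (not BizTalk code). The mapping table and the encode_for_code_page helper are assumptions made for this example, and a real pipeline may additionally write a byte order mark.

```python
# Windows code page numbers from the list above, mapped to Python codec names.
CODE_PAGE_TO_CODEC = {
    65001: "utf-8",
    1200: "utf-16-le",   # little-endian UTF-16
    1201: "utf-16-be",   # big-endian UTF-16
}


def encode_for_code_page(text: str, code_page: int) -> bytes:
    """Encode text with the codec that corresponds to the given code page."""
    return text.encode(CODE_PAGE_TO_CODEC[code_page])


# Show how the single character "A" is laid out under each setting.
for code_page in (65001, 1200, 1201):
    print(code_page, encode_for_code_page("A", code_page).hex())
```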
Method 3
Use a message assignment in an orchestration. For example, add a Message Assignment shape to a new orchestration. Next, double-click the Message Assignment shape. Then type the following code in the BizTalk Expression Editor:
<message_name>(XMLNORM.TargetCharset) = "Unicode";
What are the differences and relationships between Unicode, UTF-8, and UTF-16?
UNICODE:
The encoding standard developed by unicode.org, intended to cover the commonly used scripts of the whole world.
In version 1.0 it was a 16-bit code, from U+0000 to U+FFFF, and each 2-byte code corresponded to one character. Starting with version 2.0, the 16-bit limit was abandoned. The original 16-bit range is kept as the Basic Multilingual Plane, and 16 supplementary planes were added. The supplementary planes are equivalent to a 20-bit space, so the full code point range is 0 to 0x10FFFF (which needs 21 bits to represent).
UTF: Unicode/UCS Transformation Format
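The plane arithmetic can be checked with a few lines of Python. This is only an illustrative sketch; plane_of and the constant names are made up for the example.

```python
MAX_CODE_POINT = 0x10FFFF
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(ch: str) -> int:
    """Return the plane index of a single character (0 = Basic Multilingual Plane)."""
    return ord(ch) // PLANE_SIZE


# One basic plane plus 16 supplementary planes.
plane_count = (MAX_CODE_POINT + 1) // PLANE_SIZE

print(plane_count, plane_of("A"), plane_of("\U0001F600"))
```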
UTF-8 is an encoding built on 8-bit units. ASCII characters are unchanged, and all other characters use a variable-length encoding, so each character takes 1 to 4 bytes. It is usually used as an external (interchange) encoding and has the following advantages:
* It does not depend on the CPU byte order, so text can be exchanged between different platforms.
* High fault tolerance. If any one byte is damaged, at most one character is lost, and the error does not cascade (in some other encodings, one incorrect byte can garble the rest of the line).
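Both properties of UTF-8 can be observed directly. The sketch below, in Python, encodes a few sample characters to show the variable length, then damages one byte in the middle of a three-character string and decodes it leniently. The sample strings are arbitrary choices for illustration.

```python
# Variable length: ASCII, Latin, CJK, and a supplementary-plane character.
samples = ["A", "é", "中", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Fault tolerance: damage one byte of the middle character only.
original = "中文字".encode("utf-8")      # 3 characters, 3 bytes each
corrupted = bytearray(original)
corrupted[4] = 0x41                      # overwrite a byte inside the second character
damaged = bytes(corrupted).decode("utf-8", errors="replace")

print(lengths)
print(damaged)
```

The first and last characters survive because every UTF-8 lead byte is distinguishable from a continuation byte, so a decoder can resynchronize at the next character boundary.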
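UTF-16 is an encoding built on 16-bit units. It is a variable-length code: code points up to U+FFFF take one 16-bit unit, and code points from 0x10000 to 0x10FFFF take two units (a surrogate pair). For characters in the basic plane it is essentially the Unicode value written out directly. Unlike UTF-8, it depends on the CPU byte order, so the byte order must be agreed on or marked when text is exchanged. It is more compact than UTF-8 for text that is mostly CJK (2 bytes per character instead of 3), which is one reason it is sometimes used for storage and transmission.
UTF-16 is often treated as the default Unicode encoding, especially on Windows, where "Unicode" usually means little-endian UTF-16.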
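The byte-order dependence and the two-unit form for supplementary characters can be sketched in Python as follows. The to_surrogate_pair helper is a hypothetical name; it applies the standard UTF-16 surrogate arithmetic and is compared against Python's built-in codec.

```python
def to_surrogate_pair(code_point: int):
    """Split a code point in 0x10000..0x10FFFF into (high, low) 16-bit surrogates."""
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# Byte order: the same character, two different byte sequences.
big_endian = "A".encode("utf-16-be")
little_endian = "A".encode("utf-16-le")

# Variable length: U+1F600 needs two 16-bit units.
high, low = to_surrogate_pair(0x1F600)
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")

print(big_endian.hex(), little_endian.hex(), manual.hex())
```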
UTF-32 uses a fixed 32-bit unit for every code point and is restricted to the Unicode range (0 to 0x10FFFF), which makes it equivalent to a subset of UCS-4.
UTF and UNICODE:
Unicode is a character set and can be viewed as an internal code.
UTF is an encoding method, needed because raw Unicode values are not suitable for direct transmission and processing in some scenarios. For characters in the basic plane, UTF-16 writes the Unicode value directly with no transformation, but the encoded bytes then contain 0x00: for the first 256 code points, one of the two bytes is always 0x00 (the first byte in big-endian order). In the operating system and in C, a 0x00 byte has special significance as a string terminator, so this may cause problems. Converting Unicode to UTF-8 avoids this problem and brings some other advantages.
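The 0x00 problem is easy to reproduce. The Python sketch below imitates how a C-style string routine would measure the same text in both encodings; c_strlen is a hypothetical stand-in for C's strlen, written only for this illustration.

```python
def c_strlen(data: bytes) -> int:
    """Mimic C's strlen: count bytes up to the first 0x00 (or all bytes if none)."""
    end = data.find(b"\x00")
    return len(data) if end == -1 else end


utf16 = "Hi".encode("utf-16-le")   # every ASCII character is followed by a 0x00 byte
utf8 = "Hi".encode("utf-8")        # no 0x00 bytes

print(c_strlen(utf16), c_strlen(utf8))
```

A byte-oriented routine stops after the first character of the UTF-16 data, while the UTF-8 data is read in full.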
Software has three ways to determine the character set and encoding of a piece of text:
The most standard way is to check the first few bytes of the text (the byte order mark), as shown in the following table:
Leading bytes    Character set/Encoding
EF BB BF         UTF-8
FE FF            UTF-16/UCS-2, big endian
FF FE            UTF-16/UCS-2, little endian
FF FE 00 00      UTF-32/UCS-4, little endian
00 00 FE FF      UTF-32/UCS-4, big endian
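A byte-order-mark check of this kind can be sketched in Python using the BOM constants from the standard codecs module. The detect_bom function and the BOMS list are hypothetical names for this example. Note one ordering detail: the UTF-32 little-endian mark (FF FE 00 00) begins with the UTF-16 little-endian mark (FF FE), so the longer mark has to be tested first.

```python
import codecs

# Longer marks come before shorter marks that share the same prefix.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading byte order mark, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xff\xfeH\x00i\x00"))
print(detect_bom(b"plain ASCII, no mark"))
```

Text without a mark returns None, in which case the software has to fall back on another way of determining the encoding.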