First, allow me to reinvent the wheel.
GB2312 encodings for Silverlight have been available online for a long time, but using them reveals many problems:
- Stream operations are not supported
- There is no fallback strategy
- Only decoding is implemented; encoding is not
- Too many incorrect results at run time
- Too few characters are supported
To address these problems, a GB2312Encoding dedicated to Silverlight applications has been released. Along the way, let's talk about how to write an Encoding.
First, you need some background on character encoding in the .NET Framework, GB 2312, and EUC (Extended UNIX Code).
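GB2312 is an ASCII-compatible double-byte character encoding. A char occupies either one or two bytes. If it occupies two bytes, the first byte must be a lead byte (lead_byte), which is how you tell whether a char takes one byte or two. In the gb2312 encoding discussed here, the lead byte falls in the range 0x81 to 0xFE. (This is the range used by Windows code page 936; strict GB 2312 in EUC form uses a narrower lead-byte range.)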
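To make the lead-byte rule concrete, here is a minimal sketch that walks a byte array and counts characters. The class and method names (`DbcsScan`, `IsLeadByte`, `CountChars`) are made up for illustration and are not taken from the project; the 0x81 to 0xFE range is the one stated above, and trail bytes are not validated.

```csharp
static class DbcsScan
{
    // Assumption: lead bytes are 0x81..0xFE, as stated in the text.
    public static bool IsLeadByte(byte b)
    {
        return b >= 0x81 && b <= 0xFE;
    }

    // Counts characters in a complete buffer.
    // A dangling lead byte at the very end is counted as one character.
    public static int CountChars(byte[] bytes)
    {
        int count = 0;
        for (int i = 0; i < bytes.Length; i++)
        {
            if (IsLeadByte(bytes[i]) && i + 1 < bytes.Length)
            {
                i++; // skip the trail byte of a two-byte character
            }
            count++;
        }
        return count;
    }
}
```

For example, the four bytes 0x41, 0xD6, 0xD0, 0x42 contain one ASCII byte, one two-byte character, and another ASCII byte, so `CountChars` reports three characters.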
In practice, an Encoding is used in two ways: direct calls and indirect calls.
Direct call: the Encoding object is called directly to decode and encode, as in the following code:
    static string Decode(byte[] bytes)
    {
        Encoding encoding = Encoding.Default;
        return encoding.GetString(bytes);
    }

    static byte[] Encode(string str)
    {
        Encoding encoding = Encoding.Default;
        return encoding.GetBytes(str);
    }
Internally, Encoding first calls GetCharCount (when decoding) or GetByteCount (when encoding) to learn the length of the result, creates an array of the corresponding size and type, and passes it to GetChars or GetBytes, which produces the final result.
The overrides needed for this path are therefore:
    public override int GetCharCount(byte[] bytes, int index, int count)
    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    public override int GetByteCount(char[] chars, int index, int count)
    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
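The count-then-fill flow described above can be reproduced by hand against any existing Encoding. The sketch below is only an illustration of that two-pass pattern, not the framework's actual source; `TwoPass.Decode` is a hypothetical helper name.

```csharp
using System.Text;

static class TwoPass
{
    // Mirrors the two-pass pattern: ask for the size first, then fill the array.
    public static string Decode(Encoding encoding, byte[] bytes)
    {
        // Pass 1: how many chars will the bytes decode to?
        int charCount = encoding.GetCharCount(bytes, 0, bytes.Length);

        // Pass 2: allocate exactly that many chars and decode into them.
        char[] chars = new char[charCount];
        int written = encoding.GetChars(bytes, 0, bytes.Length, chars, 0);

        return new string(chars, 0, written);
    }
}
```

This is why GetCharCount and GetChars (and likewise GetByteCount and GetBytes) must agree with each other: if the count pass reports fewer items than the fill pass writes, the array will be too small.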
Indirect call: the Encoding is called indirectly through StreamReader and StreamWriter. This process is a little more complex than a direct call. Take a StreamReader reading a stream as an example (the StreamWriter process is similar):
1. Get the Decoder of the Encoding
2. Create a byteBuffer with a minimum length of bufferSize = 1024 bytes
3. Get _maxCharsPerBuffer = encoding.GetMaxCharCount(bufferSize)
4. Create a charBuffer with the length of _maxCharsPerBuffer
5. Read a segment into byteBuffer
6. Pass byteBuffer to decoder.GetChars and obtain the decoded result for that segment
7. If the stream has not been read to the end, go back to step 5; otherwise, proceed to the next step
8. Combine the decoded segments into the final result
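The steps above can be sketched as a small loop. This is a simplified illustration, not StreamReader's real code: `ChunkedDecode.ReadAll` is a hypothetical helper, and the buffer size is a parameter so that a tiny buffer can be used to force multi-byte characters to straddle two reads. It also does not flush a partial sequence left dangling at the end of the stream.

```csharp
using System.IO;
using System.Text;

static class ChunkedDecode
{
    public static string ReadAll(Stream stream, Encoding encoding, int bufferSize)
    {
        // One decoder per read operation, so it can carry state between segments.
        Decoder decoder = encoding.GetDecoder();
        byte[] byteBuffer = new byte[bufferSize];
        char[] charBuffer = new char[encoding.GetMaxCharCount(bufferSize)];
        StringBuilder result = new StringBuilder();

        int read;
        while ((read = stream.Read(byteBuffer, 0, byteBuffer.Length)) > 0)
        {
            // Decode only this segment; incomplete trailing bytes stay in the decoder.
            int chars = decoder.GetChars(byteBuffer, 0, read, charBuffer, 0);
            result.Append(charBuffer, 0, chars);
        }
        return result.ToString();
    }
}
```

Calling it with a built-in encoding such as `Encoding.UTF8` and a 2-byte buffer on text that contains 3-byte characters still yields the original string, because the decoder remembers the bytes it could not yet complete.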
The steps described for StreamReader can be verified in the following framework methods:
    private void System.IO.StreamReader.Init(Stream, Encoding, bool, int);
    private int System.IO.StreamReader.ReadBuffer(char[], int, int, out bool);
    public override string System.IO.StreamReader.ReadToEnd();
Two things stand out in this process:
- Problem 1: the data segments are decoded by a Decoder, not by the Encoding itself.
- Problem 2: if a lead byte happens to be the last byte of a segment, how is it carried over to the next call?
It turns out that problem 1 is exactly what solves problem 2.
A new Decoder instance is obtained when the StreamReader is constructed, and the fact that it is a new instance is very important. When a decoding pass finds that the last byte of the segment is a lead byte, that lead byte can be saved in the Decoder. In .NET, many callers can share the same Encoding instance, so the Encoding instance must not store any state related to an encoding or decoding operation in progress. Each StreamReader instance, on the other hand, has its own Decoder instance, so the saved lead-byte state is safe there.
This adds the following overrides to implement:
    public override int GetMaxByteCount(int charCount)
    public override int GetMaxCharCount(int byteCount)
    public override Decoder GetDecoder()
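The idea of keeping a pending lead byte inside the Decoder can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `DbcsDecoder` and its lookup delegate are hypothetical (the delegate stands in for the project's data file), lead bytes are assumed to be 0x81 to 0xFE, and every other byte is passed through as a single character.

```csharp
using System;
using System.Text;

sealed class DbcsDecoder : Decoder
{
    private readonly Func<byte, byte, char> _lookup; // hypothetical (lead, trail) -> char table
    private byte _pendingLead;                       // 0 means nothing carried over

    public DbcsDecoder(Func<byte, byte, char> lookup)
    {
        _lookup = lookup;
    }

    private static bool IsLeadByte(byte b)
    {
        return b >= 0x81 && b <= 0xFE;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        int chars = 0;
        bool pending = _pendingLead != 0; // do not modify real state while counting
        for (int i = index; i < index + count; i++)
        {
            if (pending) { pending = false; chars++; }
            else if (IsLeadByte(bytes[i])) { pending = true; }
            else { chars++; }
        }
        return chars;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        int start = charIndex;
        for (int i = byteIndex; i < byteIndex + byteCount; i++)
        {
            byte b = bytes[i];
            if (_pendingLead != 0)
            {
                // Complete the character started earlier, possibly in a previous call.
                chars[charIndex++] = _lookup(_pendingLead, b);
                _pendingLead = 0;
            }
            else if (IsLeadByte(b))
            {
                _pendingLead = b; // remember it; the trail byte may arrive in the next call
            }
            else
            {
                chars[charIndex++] = (char)b; // single-byte character
            }
        }
        return charIndex - start;
    }
}
```

An Encoding subclass would return a fresh instance of such a decoder from GetDecoder(), so that each reader gets its own carried-over state.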
Fallback strategy: not every character involved in encoding and decoding has a counterpart in both character sets. A character may exist in one character set but not in the other, so a proper fallback mechanism is needed to handle such cases. .NET Framework 4.0 supports three strategies: "best-fit fallback", "replacement fallback", and "exception fallback". However, this article aims to solve the problem of gb2312 support in Silverlight, and Silverlight exposes no public fallback mechanism. Therefore, this article adopts only a simplified "replacement fallback" strategy: everything that cannot be converted is replaced with "?" (0x3F).
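On the encoding side, the simplified replacement fallback amounts to emitting 0x3F whenever a character has no mapping. The sketch below illustrates that rule only; `ReplaceFallback.Encode` and the dictionary-based table are hypothetical stand-ins for the project's data file, with each table value holding the lead byte in the high 8 bits and the trail byte in the low 8 bits.

```csharp
using System.Collections.Generic;

static class ReplaceFallback
{
    public static byte[] Encode(string text, IDictionary<char, ushort> table)
    {
        List<byte> output = new List<byte>();
        foreach (char c in text)
        {
            ushort code;
            if (c < 0x80)
            {
                output.Add((byte)c); // ASCII passes through unchanged
            }
            else if (table.TryGetValue(c, out code))
            {
                output.Add((byte)(code >> 8));   // lead byte
                output.Add((byte)(code & 0xFF)); // trail byte
            }
            else
            {
                output.Add(0x3F); // replacement fallback: '?'
            }
        }
        return output.ToArray();
    }
}
```

The decoding side can apply the same rule in reverse by returning '?' for any byte pair that is missing from the table.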
Maximum supported characters
To be clear, although the encoding is called gb2312, the character set it actually uses goes far beyond the range of GB 2312. To support as many characters as possible, the gb2312 encoding of .NET Framework 4.0 was used to enumerate all of its results, which were then saved in a data file.
Usage instructions for this project are on its CodePlex wiki page.
Conclusion:
- Creating an Encoding from scratch is quite involved, and this article cannot cover every aspect. My apologies for that.
- This project was tested extensively before release. The test cases cover every scenario considered so far, and the results were compared against the gb2312 encoding of .NET Framework 4.0. You can use this project with confidence.
- Although this is a gb2312 encoding, it is implemented as a general double-byte character set (DBCS) encoding. So you only need to replace the corresponding data file to support other DBCS encodings.
- This article does not cover how the data files are generated.
- This project has been released on CodePlex: http://gb2312.codeplex.com/
- Latest source code: http://gb2312.codeplex.com/releases/view/75550#DownloadId=358387
- This project is licensed under the Microsoft Reciprocal License (Ms-RL).