This article shows how to use JavaScript to calculate how many bytes a string occupies in localStorage, covering both the UTF-8 and UTF-16 encodings in detail. Recently, a project required using JS to calculate the memory occupied by a string written to localStorage. As is well known, JS strings are Unicode. Unicode has several encoding forms, of which UTF-8 and UTF-16 are the most widely used, so this article discusses only these two.
The following definitions are taken from Wikipedia (http://zh.wikipedia.org/zh-cn/UTF-8) and abridged.
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that can represent any character in the Unicode standard. Its single-byte encodings are identical to ASCII, and each character is encoded in one to four bytes.
The encoding rules are as follows:
Code points 000000-00007F are encoded in one byte;
Code points 000080-0007FF are encoded in two bytes;
Code points 000800-00D7FF and 00E000-00FFFF are encoded in three bytes (note: Unicode assigns no characters in the range D800-DFFF);
Code points 010000-10FFFF are encoded in four bytes.
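The ranges above can be sketched as a small lookup. The function name `utf8BytesForCodePoint` is a hypothetical helper introduced here for illustration, and it takes a numeric code point rather than a string:

```javascript
// Hypothetical helper: number of UTF-8 bytes needed for one Unicode code point.
function utf8BytesForCodePoint(codePoint) {
  if (codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  if (codePoint <= 0x7f) return 1;   // 000000-00007F
  if (codePoint <= 0x7ff) return 2;  // 000080-0007FF
  if (codePoint <= 0xffff) return 3; // 000800-00FFFF
  return 4;                          // 010000-10FFFF
}

console.log(utf8BytesForCodePoint(0x41));    // 1 (Latin letter A)
console.log(utf8BytesForCodePoint(0xe9));    // 2 (e with acute accent)
console.log(utf8BytesForCodePoint(0x4e2d));  // 3 (a CJK character)
console.log(utf8BytesForCodePoint(0x1f600)); // 4 (an emoji)
```

Values in D800-DFFF fall into the three-byte branch here only because the sketch does not special-case them; they are not valid characters on their own.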
UTF-16 is also a variable-length encoding: most characters are encoded in two bytes, and characters whose code point exceeds 65535 are encoded in four bytes, as shown below:
000000-00FFFF two bytes;
010000-10FFFF four bytes.
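A minimal sketch of the UTF-16 rule, relying on the fact that a JavaScript string's `length` counts 16-bit code units, so a character above 00FFFF already counts as two units. The name `utf16ByteLength` is a hypothetical helper, not part of any library:

```javascript
// Hypothetical helper: bytes a string occupies when encoded as UTF-16.
// Each code unit is two bytes; a character above 0xFFFF is stored as a
// surrogate pair (two code units), which gives the four bytes in the table.
function utf16ByteLength(str) {
  return str.length * 2;
}

var bmpChar = '\u4e2d';       // one code unit
var astralChar = '\u{1F600}'; // two code units (surrogate pair)

console.log(bmpChar.length, utf16ByteLength(bmpChar));       // 1 2
console.log(astralChar.length, utf16ByteLength(astralChar)); // 2 4
```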
At first I assumed that, since the page uses UTF-8 encoding, strings stored in localStorage would also be encoded in UTF-8. Testing showed otherwise: a string whose size calculated as UTF-8 was under 5 MB still threw an exception when it was written to localStorage. On reflection this makes sense. The page encoding can be changed, so if localStorage stored strings according to the page encoding, the result would be a mess. Browsers should therefore all use UTF-16. Indeed, a string that measured 5 MB when calculated as UTF-16 was written successfully, and anything larger failed.
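The experiment described above can be approximated with the sketch below. Both helper names are hypothetical, and the 5 MB figure is taken from this article as an assumption; actual quotas differ between browsers and may also count the key.

```javascript
// Assumption: the quota is exactly 5 MB and only the value counts toward it.
var ASSUMED_QUOTA_BYTES = 5 * 1024 * 1024;

// Hypothetical helper: estimate whether a value fits, at two bytes per
// UTF-16 code unit.
function fitsAssumedQuota(value) {
  return value.length * 2 <= ASSUMED_QUOTA_BYTES;
}

// Hypothetical helper: attempt the write and report success instead of
// throwing. `storage` is any object with a Web Storage style setItem
// method, for example window.localStorage in a browser.
function tryStore(storage, key, value) {
  try {
    storage.setItem(key, value);
    return true;
  } catch (err) {
    // Browsers throw an exception when the quota is exceeded.
    return false;
  }
}
```

In a browser, `tryStore(window.localStorage, 'demo', text)` performs the real write, while `fitsAssumedQuota(text)` is only a rough estimate under the stated assumptions.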
The implementation follows. The calculation rules are those described above. For speed, the two for loops are written separately.
/**
 * Calculate the number of bytes a string occupies. UTF-8 is used by default;
 * UTF-16 can also be specified.
 * UTF-8 is a variable-length Unicode encoding that uses one to four bytes per character.
 *
 * 000000-00007F (128 code points)                  0zzzzzzz (00-7F)                             one byte
 * 000080-0007FF (1920 code points)                 110yyyyy (C0-DF) 10zzzzzz (80-BF)            two bytes
 * 000800-00D7FF, 00E000-00FFFF (61440 code points) 1110xxxx (E0-EF) 10yyyyyy 10zzzzzz           three bytes
 * 010000-10FFFF (1048576 code points)              11110www (F0-F7) 10xxxxxx 10yyyyyy 10zzzzzz  four bytes
 *
 * Note: Unicode assigns no characters in the range D800-DFFF.
 * {@link http://zh.wikipedia.org/wiki/UTF-8}
 *
 * UTF-16 encodes most characters in two bytes; code points above 65535 use four bytes.
 * 000000-00FFFF two bytes
 * 010000-10FFFF four bytes
 *
 * {@link http://zh.wikipedia.org/wiki/UTF-16}
 * @param {String} str
 * @param {String} charset utf-8, utf-16
 * @return {Number}
 */
var sizeof = function (str, charset) {
    var total = 0,
        charCode,
        next,
        i,
        len;

    charset = charset ? charset.toLowerCase() : '';
    if (charset === 'utf-16' || charset === 'utf16') {
        for (i = 0, len = str.length; i < len; i++) {
            // Each index is one UTF-16 code unit (charCodeAt never exceeds 0xffff),
            // which is always two bytes. A character above 0xffff is a surrogate
            // pair of two code units, so it contributes four bytes in total.
            total += 2;
        }
    } else {
        for (i = 0, len = str.length; i < len; i++) {
            charCode = str.charCodeAt(i);
            if (charCode <= 0x007f) {
                total += 1;
            } else if (charCode <= 0x07ff) {
                total += 2;
            } else if (charCode >= 0xd800 && charCode <= 0xdbff && i + 1 < len &&
                    (next = str.charCodeAt(i + 1)) >= 0xdc00 && next <= 0xdfff) {
                // A high surrogate followed by a low surrogate is one character
                // in 010000-10FFFF: four bytes in UTF-8. Skip the low surrogate.
                total += 4;
                i++;
            } else {
                total += 3;
            }
        }
    }
    return total;
};
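As a cross-check for the UTF-8 figure, environments that provide the standard `TextEncoder` API (modern browsers and recent Node.js) can encode the string and count the resulting bytes. This is an alternative technique, not the method used in this article, and the name `utf8ByteLength` is a hypothetical helper:

```javascript
// Hypothetical helper: UTF-8 byte length obtained by actually encoding the
// string. TextEncoder always produces UTF-8 and returns a Uint8Array.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

console.log(utf8ByteLength('a'));         // 1
console.log(utf8ByteLength('\u00e9'));    // 2
console.log(utf8ByteLength('\u4e2d'));    // 3
console.log(utf8ByteLength('\u{1F600}')); // 4
```

Encoding allocates a byte array the size of the result, so for very large strings a counting loop may be the lighter option.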