A recent project needed JavaScript to calculate how many bytes a string occupies when it is written to localStorage. As we all know, JavaScript strings are Unicode, and Unicode has several encoding forms, of which UTF-8 and UTF-16 are the most widely used. This article therefore discusses only those two encodings.
The following definitions are excerpted from Wikipedia (http://zh.wikipedia.org/zh-cn/UTF-8) and have been partially abridged.
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that can represent any character in the Unicode Standard. Its single-byte encodings are identical to ASCII, so it stays ASCII-compatible. Each character is encoded with one to four bytes.
The encoding rules are as follows:
Code points 000000–00007F are encoded in one byte;
code points 000080–0007FF in two bytes;
code points 000800–00D7FF and 00E000–00FFFF in three bytes (note: Unicode has no characters in the range D800–DFFF);
code points 010000–10FFFF in four bytes.
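The ranges above translate almost directly into a lookup on the code point. The following is only a minimal sketch; the name utf8BytesForCodePoint is made up for illustration and is not part of the original code.

```javascript
// Hypothetical helper: number of UTF-8 bytes needed for a single Unicode code point.
function utf8BytesForCodePoint(codePoint) {
  if (codePoint <= 0x7f) return 1;    // 000000-00007F
  if (codePoint <= 0x7ff) return 2;   // 000080-0007FF
  if (codePoint <= 0xffff) return 3;  // 000800-00FFFF (D800-DFFF holds no characters)
  return 4;                           // 010000-10FFFF
}

console.log(utf8BytesForCodePoint('A'.codePointAt(0)));         // 1
console.log(utf8BytesForCodePoint('\u00e9'.codePointAt(0)));    // 2 (e with acute accent)
console.log(utf8BytesForCodePoint('\u4e2d'.codePointAt(0)));    // 3 (a CJK character)
console.log(utf8BytesForCodePoint('\u{1F600}'.codePointAt(0))); // 4 (an emoji)
```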
UTF-16 is also a variable-length encoding: most characters are encoded with two bytes, and characters whose code point exceeds 65535 (0xFFFF) use four bytes, as follows:
000000–00FFFF two bytes;
010000–10FFFF four bytes.
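Because the length property of a JavaScript string counts 16-bit code units, and a four-byte character occupies two of them, the UTF-16 size can be sketched without looking at individual characters. The helper name utf16ByteLength below is an illustrative assumption, and no byte order mark is counted.

```javascript
// Hypothetical helper: UTF-16 byte count of a JavaScript string (no BOM).
// str.length is the number of 16-bit code units, so each unit contributes 2 bytes.
function utf16ByteLength(str) {
  return str.length * 2;
}

console.log(utf16ByteLength('abc'));        // 6
console.log(utf16ByteLength('\u4e2d'));     // 2 (code point below 0xFFFF)
console.log(utf16ByteLength('\u{1F600}'));  // 4 (code point above 0xFFFF, stored as two code units)
```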
At first I assumed that, since the page is UTF-8 encoded, strings in localStorage would also be stored as UTF-8. But testing showed that a string whose calculated size was under 5MB still threw an exception when written to localStorage. On reflection, the page encoding can be changed; if localStorage saved strings in the page's encoding, wouldn't that cause chaos? Browsers presumably store them all as UTF-16. When the size was calculated as UTF-16, a string of up to 5MB was written successfully, and anything larger failed.
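The experiment can be approximated with a small probe that reports whether a write succeeds. This is only a sketch: tryStore and makeFakeStorage are hypothetical names, and the fake storage assumes a 5MB limit counted in UTF-16 bytes over keys and values, which is a simplification so that the example runs outside a browser. Real quotas and accounting vary between browsers.

```javascript
// Hypothetical helper: true if the value was written, false if setItem threw.
function tryStore(storage, key, value) {
  try {
    storage.setItem(key, value);
    return true;
  } catch (err) {
    // Browsers report a full store by throwing (typically a QuotaExceededError).
    return false;
  }
}

// Stand-in for localStorage. ASSUMPTION: the limit is measured in UTF-16 bytes
// (2 bytes per code unit) across all keys and values.
function makeFakeStorage(limitBytes) {
  const data = new Map();
  return {
    setItem(key, value) {
      const k = String(key);
      const v = String(value);
      let used = (k.length + v.length) * 2;
      for (const [otherKey, otherValue] of data) {
        if (otherKey !== k) used += (otherKey.length + otherValue.length) * 2;
      }
      if (used > limitBytes) throw new Error('QuotaExceededError');
      data.set(k, v);
    },
  };
}

const fake = makeFakeStorage(5 * 1024 * 1024);
// 2M ASCII characters: 2MB as UTF-8, about 4MB as UTF-16, so it fits.
console.log(tryStore(fake, 'k', 'a'.repeat(2 * 1024 * 1024)));  // true
// 3M ASCII characters: only 3MB as UTF-8, but about 6MB as UTF-16, so it is rejected.
console.log(tryStore(fake, 'k', 'a'.repeat(3 * 1024 * 1024)));  // false
```

In a browser, passing window.localStorage as the first argument would run the same probe against the real store.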
OK, here is the implementation. The calculation rules are given above; for speed, the two for loops are written separately.
/**
 * Calculates the number of bytes a string occupies in memory.
 * UTF-8 is used by default; UTF-16 can be selected with the charset argument.
 * UTF-8 is a variable-length Unicode encoding that uses one to four bytes per character.
 *
 * 000000 - 00007F (128 code points)      0zzzzzzz (00-7F)                             one byte
 * 000080 - 0007FF (1920 code points)     110yyyyy (C0-DF) 10zzzzzz (80-BF)            two bytes
 * 000800 - 00D7FF
 * 00E000 - 00FFFF (61440 code points)    1110xxxx (E0-EF) 10yyyyyy 10zzzzzz           three bytes
 * 010000 - 10FFFF (1048576 code points)  11110www (F0-F7) 10xxxxxx 10yyyyyy 10zzzzzz  four bytes
 *
 * Note: Unicode has no characters in the range D800-DFFF.
 * {@link http://zh.wikipedia.org/wiki/UTF-8}
 *
 * UTF-16 mostly uses two bytes per character; code points above 65535 use four bytes.
 * 000000 - 00FFFF  two bytes
 * 010000 - 10FFFF  four bytes
 *
 * {@link http://zh.wikipedia.org/wiki/UTF-16}
 * @param  {string} str
 * @param  {string} charset utf-8, utf-16
 * @return {number}
 */
var sizeof = function (str, charset) {
    var total = 0,
        charCode,
        i,
        len;
    charset = charset ? charset.toLowerCase() : '';
    if (charset === 'utf-16' || charset === 'utf16') {
        for (i = 0, len = str.length; i < len; i++) {
            // codePointAt merges a surrogate pair into one value above 0xFFFF;
            // charCodeAt never returns more than 0xFFFF, so the four-byte branch would be dead code.
            charCode = str.codePointAt(i);
            if (charCode <= 0xffff) {
                total += 2;
            } else {
                total += 4;
                i++; // skip the low surrogate, it has already been counted
            }
        }
    } else {
        for (i = 0, len = str.length; i < len; i++) {
            charCode = str.codePointAt(i);
            if (charCode <= 0x007f) {
                total += 1;
            } else if (charCode <= 0x07ff) {
                total += 2;
            } else if (charCode <= 0xffff) {
                total += 3;
            } else {
                total += 4;
                i++; // skip the low surrogate, it has already been counted
            }
        }
    }
    return total;
};
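One way to sanity-check a hand-written UTF-8 counter like the one above is to compare it with the platform's own encoder. The sketch below uses the standard TextEncoder, which always produces UTF-8; the wrapper name utf8ByteLength is an illustrative assumption.

```javascript
// Hypothetical helper: UTF-8 byte count as produced by the built-in TextEncoder.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

console.log(utf8ByteLength('abc'));           // 3
console.log(utf8ByteLength('\u4e2d\u6587'));  // 6 (two three-byte characters)
console.log(utf8ByteLength('\u{1F600}'));     // 4 (one four-byte character)
```

TextEncoder replaces an unpaired surrogate with U+FFFD, which takes three bytes, so strings containing such broken pairs are still given a definite size.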