A recent project needed JavaScript to calculate how many bytes a string occupies before writing it to localStorage. As is well known, JavaScript strings are Unicode. Unicode has several encoding forms, the most common being UTF-8 and UTF-16, so this article discusses only those two.
The following definitions are excerpted from Wikipedia (http://zh.wikipedia.org/zh-cn/UTF-8) and have been partially abridged.
Originally from: http://www.alloyteam.com/2013/12/js-calculate-the-number-of-bytes-occupied-by-a-string/
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode that can represent any character in the Unicode Standard, and its single-byte range remains compatible with ASCII. Each character is encoded using one to four bytes.
The encoding rules are as follows:
Code points 000000–00007F are encoded with one byte;
Code points 000080–0007FF with two bytes;
Code points 000800–00D7FF and 00E000–00FFFF with three bytes (note: Unicode has no characters in the range D800–DFFF);
Code points 010000–10FFFF with four bytes.
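The ranges above can be sketched as a small lookup. This is only an illustration: `utf8BytesForCodePoint` is a hypothetical helper name, not something from the original article, and it assumes its argument is a valid Unicode code point.

```javascript
// Hypothetical helper: number of bytes UTF-8 needs for one Unicode code point.
// Assumes codePoint is a valid code point in the range 0..0x10FFFF.
function utf8BytesForCodePoint(codePoint) {
    if (codePoint <= 0x7f) return 1;    // 000000-00007F
    if (codePoint <= 0x7ff) return 2;   // 000080-0007FF
    if (codePoint <= 0xffff) return 3;  // 000800-00FFFF
    return 4;                           // 010000-10FFFF
}

console.log(utf8BytesForCodePoint('A'.codePointAt(0)));         // 1 (U+0041)
console.log(utf8BytesForCodePoint('\u00e9'.codePointAt(0)));    // 2 (U+00E9)
console.log(utf8BytesForCodePoint('\u4e2d'.codePointAt(0)));    // 3 (U+4E2D)
console.log(utf8BytesForCodePoint('\u{1F600}'.codePointAt(0))); // 4 (U+1F600)
```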
UTF-16 is also a variable-length encoding: most characters are encoded with two bytes, and code points above 65535 are encoded with four bytes (a surrogate pair), as follows:
Code points 000000–00FFFF with two bytes;
Code points 010000–10FFFF with four bytes.
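A short sketch of how this shows up in JavaScript: a string's `length` counts 16-bit code units, so a character above U+FFFF has length 2. The helper name `utf16BytesForCodePoint` is hypothetical and used only for illustration.

```javascript
// Hypothetical helper: number of bytes UTF-16 needs for one Unicode code point.
function utf16BytesForCodePoint(codePoint) {
    return codePoint <= 0xffff ? 2 : 4;
}

var han = '\u4e2d';      // U+4E2D, inside 000000-00FFFF
var smile = '\u{1F600}'; // U+1F600, inside 010000-10FFFF

console.log(han.length);                                   // 1 (one code unit)
console.log(smile.length);                                 // 2 (a surrogate pair)
console.log(utf16BytesForCodePoint(han.codePointAt(0)));   // 2
console.log(utf16BytesForCodePoint(smile.codePointAt(0))); // 4
```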
At first I assumed that, since the page is UTF-8 encoded, strings in localStorage would also be stored as UTF-8. But testing showed that a string whose calculated size was under 5MB still threw an exception when written to localStorage. On reflection, a page's encoding can be changed; if localStorage stored strings in the page's encoding, wouldn't that be a mess? Browsers most likely store them as UTF-16. Sure enough, a string calculated as 5MB under UTF-16 was written without trouble, and anything larger failed.
Here is the implementation. The calculation rules are given above; for speed, the two for loops are written separately.
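The observation above can be illustrated with a rough estimate. This is a sketch under stated assumptions: the 5MB quota is taken from the test described here and may differ between browsers, counting both key and value at two bytes per UTF-16 code unit is an assumption about how the browser accounts for storage, and `estimateEntryBytes` and `fitsAssumedQuota` are hypothetical names.

```javascript
// Assumed quota, based on the 5MB figure in the text; real limits vary by browser.
var ASSUMED_QUOTA_BYTES = 5 * 1024 * 1024;

// Hypothetical helper: estimated size of one entry, assuming the browser
// stores key and value as UTF-16 (two bytes per code unit).
function estimateEntryBytes(key, value) {
    return (key.length + value.length) * 2;
}

// Hypothetical helper: does the entry fit within the assumed quota?
function fitsAssumedQuota(key, value) {
    return estimateEntryBytes(key, value) <= ASSUMED_QUOTA_BYTES;
}

// 3 Mi ASCII characters: about 3MB as UTF-8, but about 6MB as UTF-16.
var value = 'a'.repeat(3 * 1024 * 1024);
console.log(estimateEntryBytes('k', value)); // 6291458
console.log(fitsAssumedQuota('k', value));   // false
```

Under a UTF-8 estimate the same value would appear to fit comfortably, which matches the surprise described above.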
/**
 * Calculates the number of bytes a string occupies in memory. UTF-8 is used by default; UTF-16 can also be specified.
 * UTF-8 is a variable-length Unicode encoding that uses one to four bytes per character.
 *
 * 000000 - 00007F (128 code points)      0zzzzzzz (00-7F)                             one byte
 * 000080 - 0007FF (1920 code points)     110yyyyy (C0-DF) 10zzzzzz (80-BF)            two bytes
 * 000800 - 00D7FF
 * 00E000 - 00FFFF (61440 code points)    1110xxxx (E0-EF) 10yyyyyy 10zzzzzz           three bytes
 * 010000 - 10FFFF (1048576 code points)  11110www (F0-F7) 10xxxxxx 10yyyyyy 10zzzzzz  four bytes
 *
 * Note: Unicode has no characters in the range D800-DFFF.
 *
 * {@link http://zh.wikipedia.org/wiki/UTF-8}
 *
 * UTF-16 uses two bytes for most characters; code points above 65535 use four bytes.
 * 000000 - 00FFFF  two bytes
 * 010000 - 10FFFF  four bytes
 *
 * {@link http://zh.wikipedia.org/wiki/UTF-16}
 * @param {String} str
 * @param {String} charset utf-8, utf-16
 * @return {Number}
 */
var sizeof = function (str, charset) {
    var total = 0,
        charCode,
        i,
        len;
    charset = charset ? charset.toLowerCase() : '';
    if (charset === 'utf-16' || charset === 'utf16') {
        for (i = 0, len = str.length; i < len; i++) {
            // codePointAt (ES2015) combines a surrogate pair into one code point;
            // charCodeAt never returns a value above 0xffff.
            charCode = str.codePointAt(i);
            if (charCode <= 0xffff) {
                total += 2;
            } else {
                total += 4;
                i++; // skip the low surrogate of the pair
            }
        }
    } else {
        for (i = 0, len = str.length; i < len; i++) {
            charCode = str.codePointAt(i);
            if (charCode <= 0x007f) {
                total += 1;
            } else if (charCode <= 0x07ff) {
                total += 2;
            } else if (charCode <= 0xffff) {
                total += 3;
            } else {
                total += 4;
                i++; // skip the low surrogate of the pair
            }
        }
    }
    return total;
};
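One way to sanity-check a UTF-8 byte count is to compare it with the length of the bytes produced by the standard `TextEncoder`, which always encodes to UTF-8. The sketch below is self-contained: `utf8Size` is a hypothetical, compact restatement of the UTF-8 counting rule, and it assumes an environment where `TextEncoder` and `codePointAt` are available (modern browsers and recent Node.js).

```javascript
// Hypothetical compact version of the UTF-8 counting rule, by code point.
function utf8Size(str) {
    var total = 0, i, codePoint;
    for (i = 0; i < str.length; i++) {
        codePoint = str.codePointAt(i);
        if (codePoint <= 0x7f) {
            total += 1;
        } else if (codePoint <= 0x7ff) {
            total += 2;
        } else if (codePoint <= 0xffff) {
            total += 3;
        } else {
            total += 4;
            i++; // the code point used two UTF-16 code units
        }
    }
    return total;
}

// One character from each UTF-8 length class: 1 + 2 + 3 + 4 bytes.
var sample = 'a\u00e9\u4e2d\u{1F600}';
var encodedLength = new TextEncoder().encode(sample).length;

console.log(utf8Size(sample)); // 10
console.log(encodedLength);    // 10
```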