Handling NSData that contains invalid UTF-8

Source: Internet
Author: User
Tags 0xc0

During development we often convert NSData to NSString, or parse JSON with NSJSONSerialization. If the NSData contains even one invalid UTF-8 sequence, the result is nil. That is rarely what we want: the problem is usually just an encoding error in a few bytes, and we would rather discard the bad bytes or replace them with a placeholder character.
A Google search turned up a few existing implementations, but they did not seem rigorous or fault-tolerant enough to me, so I wrote my own, following the RFC 3629 standard.

UTF-8 is a variable-length encoding, and each sequence length has a fixed bit format. Under RFC 3629 a sequence is at most four bytes long, and the range of encodable values is restricted. For more background, see the Wikipedia entry on UTF-8. The formats for one to four bytes are:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
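
The lead-byte patterns can be turned into a small classifier. The following is a minimal sketch in plain C (which also compiles inside an Objective-C file). The function name `utf8_expected_length` is a hypothetical helper for illustration, not part of the category described in this article. It also rejects 0xC0, 0xC1 and 0xF5 to 0xF7, which match the bit patterns but can never appear in valid UTF-8 under RFC 3629.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the sequence length (1 to 4) announced by a
 * UTF-8 lead byte, or 0 if the byte cannot start a sequence under RFC 3629. */
static size_t utf8_expected_length(uint8_t lead)
{
    if ((lead & 0x80) == 0x00)          /* 0xxxxxxx */
        return 1;
    if ((lead & 0xE0) == 0xC0)          /* 110xxxxx; C0 and C1 would be overlong */
        return lead >= 0xC2 ? 2 : 0;
    if ((lead & 0xF0) == 0xE0)          /* 1110xxxx */
        return 3;
    if ((lead & 0xF8) == 0xF0)          /* 11110xxx; F5 to F7 exceed U+10FFFF */
        return lead <= 0xF4 ? 4 : 0;
    return 0;                           /* 10xxxxxx (continuation) or F8 to FF */
}
```

A return value of 0 means the byte should be treated as invalid on its own; any other value says how many bytes the sequence needs, including the lead byte.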

Based on these rules, here is an NSData category method:

@implementation NSData (UTF8)

- (NSData *)UTF8Data
{
    // Buffer for the result
    NSMutableData *resData = [[NSMutableData alloc] initWithCapacity:self.length];

    // Replacement for invalid sequences ('?' here)
    NSData *replacement = [@"?" dataUsingEncoding:NSUTF8StringEncoding];

    uint64_t index = 0;
    const uint8_t *bytes = self.bytes;

    while (index < self.length)
    {
        uint8_t len = 0;
        uint8_t header = bytes[index];

        // Single byte
        if ((header & 0x80) == 0)
        {
            len = 1;
        }
        // 2 bytes (lead byte cannot be C0 or C1)
        else if ((header & 0xE0) == 0xC0)
        {
            if (header != 0xC0 && header != 0xC1)
            {
                len = 2;
            }
        }
        // 3 bytes
        else if ((header & 0xF0) == 0xE0)
        {
            len = 3;
        }
        // 4 bytes (lead byte cannot be F5, F6 or F7)
        else if ((header & 0xF8) == 0xF0)
        {
            if (header != 0xF5 && header != 0xF6 && header != 0xF7)
            {
                len = 4;
            }
        }

        // Not a recognized lead byte: replace it and move on
        if (len == 0)
        {
            [resData appendData:replacement];
            index++;
            continue;
        }

        // Count the valid bytes (the lead byte plus the following 10xxxxxx bytes)
        uint8_t validLen = 1;
        while (validLen < len && index + validLen < self.length)
        {
            if ((bytes[index + validLen] & 0xC0) != 0x80)
                break;
            validLen++;
        }

        // The sequence is legal only if the valid byte count equals the
        // length required by the lead byte; otherwise it is replaced
        if (validLen == len)
        {
            [resData appendBytes:bytes + index length:len];
        }
        else
        {
            [resData appendData:replacement];
        }

        // Advance the index
        index += validLen;
    }

    return resData;
}

@end
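
The category checks the lead byte and the 10xxxxxx pattern of the bytes that follow. RFC 3629 additionally narrows the second byte after a few lead bytes (E0, ED, F0 and F4), which excludes overlong three- and four-byte forms, surrogates, and values above U+10FFFF. The following is a minimal sketch of the same loop with those ranges added, in plain C so it can be dropped into an Objective-C file. The name `utf8_sanitize` and its signature are assumptions made for illustration; this is not the author's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies `in` to `out`, replacing each invalid sequence
 * with '?'. `out` must have room for at least `n` bytes. Returns the number
 * of bytes written. */
static size_t utf8_sanitize(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        uint8_t b = in[i];
        size_t len = 0;
        uint8_t lo = 0x80, hi = 0xBF;      /* allowed range for the second byte */

        if (b < 0x80) {
            len = 1;
        } else if (b >= 0xC2 && b <= 0xDF) {
            len = 2;
        } else if (b >= 0xE0 && b <= 0xEF) {
            len = 3;
            if (b == 0xE0) lo = 0xA0;      /* excludes overlong 3-byte forms */
            if (b == 0xED) hi = 0x9F;      /* excludes surrogates D800 to DFFF */
        } else if (b >= 0xF0 && b <= 0xF4) {
            len = 4;
            if (b == 0xF0) lo = 0x90;      /* excludes overlong 4-byte forms */
            if (b == 0xF4) hi = 0x8F;      /* excludes values above U+10FFFF */
        }

        if (len == 0) {                    /* byte cannot start a sequence */
            out[o++] = '?';
            i++;
            continue;
        }

        /* Count the lead byte plus the following bytes that are in range. */
        size_t valid = 1;
        while (valid < len && i + valid < n) {
            uint8_t c = in[i + valid];
            uint8_t min = (valid == 1) ? lo : 0x80;
            uint8_t max = (valid == 1) ? hi : 0xBF;
            if (c < min || c > max)
                break;
            valid++;
        }

        if (valid == len) {
            for (size_t k = 0; k < len; k++)
                out[o++] = in[i + k];
        } else {
            out[o++] = '?';
        }
        i += valid;                        /* skip what was examined and accepted */
    }
    return o;
}
```

Because every replacement consumes at least one input byte and writes exactly one output byte, the output never grows beyond the input length, so a buffer of `n` bytes is enough. If a multi-byte replacement such as U+FFFD were used instead of '?', the buffer would need to be sized accordingly.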

The code is available on GitHub: https://github.com/tanhaogg/THCategory
