During development we often convert NSData to NSString, or parse JSON with NSJSONSerialization. As soon as the NSData contains an invalid UTF-8 sequence, the result comes back as nil. That is rarely what we want: the problem is probably just a few mis-encoded bytes, and we would rather discard them or replace them with a placeholder character.
I searched Google and found that some people have already implemented such a method, but I felt those implementations were not rigorous and their fault tolerance was poor, so I simply wrote one myself, following the RFC 3629 standard strictly.
UTF-8 is a variable-length encoding in which each sequence length has a fixed byte format. Under RFC 3629 a sequence is at most four bytes long, and the encoded values are restricted to certain ranges. For more background, see the Wikipedia UTF-8 entry. The byte formats for one to four bytes are:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
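The lead byte alone determines how long a sequence is supposed to be. As a minimal sketch in plain C (which also compiles inside an Objective-C file), the classification could look like the following. The function name `utf8_expected_length` is hypothetical and is not part of the original category.

```c
#include <stdint.h>

/* Returns the sequence length (1 to 4) implied by a UTF-8 lead byte,
 * or 0 if the byte cannot start a sequence under RFC 3629.
 * Hypothetical helper, for illustration only. */
static int utf8_expected_length(uint8_t lead)
{
    if ((lead & 0x80) == 0x00) return 1;                     /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return (lead >= 0xC2) ? 2 : 0; /* 110xxxxx; C0 and C1 would be overlong */
    if ((lead & 0xF0) == 0xE0) return 3;                     /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return (lead <= 0xF4) ? 4 : 0; /* 11110xxx; F5 to F7 exceed U+10FFFF */
    return 0;                                                /* continuation byte (10xxxxxx) or F8 to FF */
}
```

A return value of 0 means the byte should be treated as invalid on its own, for example a stray continuation byte such as 0x80.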
Following these rules, I wrote an NSData category method. Here is the code:
@implementation NSData (UTF8)

- (NSData *)UTF8Data
{
    // Holds the result
    NSMutableData *resData = [[NSMutableData alloc] initWithCapacity:self.length];

    // Replacement for invalid sequences ("?" here; U+FFFD is the standard Unicode replacement character)
    NSData *replacement = [@"?" dataUsingEncoding:NSUTF8StringEncoding];

    uint64_t index = 0;
    const uint8_t *bytes = self.bytes;

    while (index < self.length)
    {
        uint8_t len = 0;
        uint8_t header = bytes[index];

        // Single byte
        if ((header & 0x80) == 0)
        {
            len = 1;
        }
        // 2 bytes (and the lead byte cannot be C0 or C1)
        else if ((header & 0xE0) == 0xC0)
        {
            if (header != 0xC0 && header != 0xC1)
            {
                len = 2;
            }
        }
        // 3 bytes
        else if ((header & 0xF0) == 0xE0)
        {
            len = 3;
        }
        // 4 bytes (and the lead byte cannot be F5, F6 or F7)
        else if ((header & 0xF8) == 0xF0)
        {
            if (header != 0xF5 && header != 0xF6 && header != 0xF7)
            {
                len = 4;
            }
        }

        // Not a recognizable lead byte
        if (len == 0)
        {
            [resData appendData:replacement];
            index++;
            continue;
        }

        // Count the valid length (how many 10xxxxxx bytes actually follow)
        uint8_t validLen = 1;
        while (validLen < len && index + validLen < self.length)
        {
            if ((bytes[index + validLen] & 0xC0) != 0x80)
                break;
            validLen++;
        }

        // The sequence is legal only if the valid length equals the length the lead byte requires
        if (validLen == len)
        {
            [resData appendBytes:bytes + index length:len];
        }
        else
        {
            [resData appendData:replacement];
        }

        // Advance the index
        index += validLen;
    }

    return resData;
}

@end
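To try the loop above outside Foundation, here is a rough plain-C port. It is only a sketch under a few assumptions: the name `utf8_sanitize` is hypothetical, the replacement is fixed to the single byte `?`, and the caller supplies an output buffer at least as large as the input.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies src to dst, replacing each invalid UTF-8 sequence with '?'.
 * dst must have room for at least src_len bytes; the output never grows
 * because every replacement consumes at least one input byte.
 * Returns the number of bytes written. Hypothetical port, for illustration. */
static size_t utf8_sanitize(const uint8_t *src, size_t src_len, uint8_t *dst)
{
    size_t in = 0, out = 0;

    while (in < src_len) {
        uint8_t header = src[in];
        size_t len = 0;

        /* Expected length from the lead byte; 0 means "not a lead byte". */
        if ((header & 0x80) == 0x00)      len = 1;
        else if ((header & 0xE0) == 0xC0) len = (header >= 0xC2) ? 2 : 0;
        else if ((header & 0xF0) == 0xE0) len = 3;
        else if ((header & 0xF8) == 0xF0) len = (header <= 0xF4) ? 4 : 0;

        if (len == 0) {
            dst[out++] = '?';
            in++;
            continue;
        }

        /* Count the lead byte plus the continuation bytes actually present. */
        size_t valid = 1;
        while (valid < len && in + valid < src_len &&
               (src[in + valid] & 0xC0) == 0x80)
            valid++;

        if (valid == len) {
            for (size_t i = 0; i < len; i++)
                dst[out++] = src[in + i];   /* complete sequence: copy as-is */
        } else {
            dst[out++] = '?';               /* truncated sequence: one replacement */
        }
        in += valid;                        /* skip only the bytes that were examined and accepted */
    }
    return out;
}
```

With a multi-byte replacement such as U+FFFD (three bytes in UTF-8) the output can be up to three times the input size, so the buffer assumption above would no longer hold.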
The code is available on GitHub: https://github.com/tanhaogg/THCategory
Resolving NSData that contains illegal UTF-8 encoding