UTF-8 file upload and the BOM header problem (the actual problem: wrong length of the first string when uploading a CSV file in PHP)

Source: Internet
Author: User
Tags: first string

While uploading CSV files in PHP over the past two days, I found that the first value in the first column always failed regular expression validation. For example, the first value in the first column is "test_test1" and the second value in the first column is "test_test2". The two values have no essential difference, yet the same regular expression gives two different results for them. Puzzled, I printed both values with var_dump. The output showed "test_test1" with a length of 13 and "test_test2" with a length of 10. Why the difference? After finding some material online, I understood that this is a BOM header problem.
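The symptom can be reproduced in a few lines. This is a minimal sketch that assumes the first field arrives with the three UTF-8 BOM bytes in front of it. The validation pattern is hypothetical, since the article does not show the real one.

```php
<?php
// First field as read from a CSV saved with a UTF-8 BOM (assumed input).
$first = "\xEF\xBB\xBF" . "test_test1";
// Second field, with nothing in front of it.
$second = "test_test2";

// Hypothetical validation pattern: letters, digits and underscores only.
$pattern = '/^[A-Za-z0-9_]+$/';

var_dump(strlen($first));                // int(13): 3 BOM bytes + 10 characters
var_dump(strlen($second));               // int(10)
var_dump(preg_match($pattern, $first));  // int(0): the BOM bytes do not match
var_dump(preg_match($pattern, $second)); // int(1)
```

The two values look identical when printed, which is why the length reported by var_dump is the first useful clue.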

 

BOM

BOM stands for byte order mark. Unicode defines a character named "ZERO WIDTH NO-BREAK SPACE" whose code point is U+FEFF. The byte-swapped value U+FFFE is not a character in the UCS, so it should never appear in actual transmission. The UCS specification therefore recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the receiver reads FE FF, the byte stream is big-endian; if it reads FF FE, the byte stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a receiver that sees a byte stream starting with EF BB BF knows that it is UTF-8. Windows uses the BOM in this way to mark the encoding of text files.

When U+FEFF appears at the beginning of a byte stream, it identifies the byte order of that stream (high byte first or low byte first). When it appears in the middle of a byte stream, it represents a zero-width no-break space, which takes up no visible width. Since Unicode 3.2, U+FEFF is meant to be used only at the beginning of a byte stream as a byte order mark, as its name suggests. Its other use is deprecated, and U+2060 (WORD JOINER) is used to express a zero-width no-break space instead.

Software such as Windows Notepad inserts three invisible bytes (0xEF 0xBB 0xBF, the BOM) at the beginning of a file when saving it as UTF-8. These hidden bytes let the editor recognize that the file is UTF-8 encoded. For ordinary files this causes no trouble.
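The byte values mentioned above can be checked directly. The sketch below assumes PHP 7.0 or later, where the "\u{...}" escape produces the UTF-8 encoding of a code point, and uses pack() to write U+FEFF as a 16-bit value in both byte orders.

```php
<?php
// UTF-8 encoding of U+FEFF (the BOM).
$utf8Bom = "\u{FEFF}";
echo bin2hex($utf8Bom), "\n";    // efbbbf
echo strlen($utf8Bom), "\n";     // 3

// U+FEFF as a 16-bit unit: 'n' is big-endian, 'v' is little-endian.
$bigEndian    = pack('n', 0xFEFF);
$littleEndian = pack('v', 0xFEFF);
echo bin2hex($bigEndian), "\n";    // feff
echo bin2hex($littleEndian), "\n"; // fffe

// U+2060 WORD JOINER, which replaces the zero-width no-break space use.
$wordJoiner = "\u{2060}";
echo bin2hex($wordJoiner), "\n";   // e281a0
```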
For PHP, however, the BOM is a big headache. PHP does not ignore the BOM, so when such a file is read, included, or referenced, the BOM is treated as part of the beginning of the file body. Because PHP is an embedded language, anything outside the PHP tags is sent straight to the output, so these three bytes are emitted as they are. As a result, even if the top padding of the page is set to 0, the page cannot sit flush against the top of the browser window, because there are three extra characters at the beginning of the HTML!
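The effect of including a file that starts with a BOM can be observed with output buffering. This is a sketch only: the temporary file is a throwaway created for the demonstration, and it assumes the default setting zend.multibyte=Off.

```php
<?php
// Create a throwaway PHP file whose first three bytes are a UTF-8 BOM.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBF" . "<?php echo 'hello';");

// Capture everything the included file sends to the output.
ob_start();
include $path;
$output = ob_get_clean();
unlink($path);

echo bin2hex(substr($output, 0, 3)), "\n"; // efbbbf: the BOM was emitted as output
echo strlen($output), "\n";                // 8: 3 BOM bytes + "hello"
```

Because the bytes count as output, they can also trigger "headers already sent" warnings when a script later calls header() or session_start().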

Byte order marks in different encodings

Encoding                                  Hexadecimal                                 Decimal
UTF-8                                     EF BB BF                                    239 187 191
UTF-16 (big-endian)                       FE FF                                       254 255
UTF-16 (little-endian)                    FF FE                                       255 254
UTF-32 (big-endian)                       00 00 FE FF                                 0 0 254 255
UTF-32 (little-endian)                    FF FE 00 00                                 255 254 0 0
UTF-7                                     2B 2F 76, then one of [38 | 39 | 2B | 2F]   43 47 118, then one of [56 | 57 | 43 | 47]
UTF-1                                     F7 64 4C                                    247 100 76
UTF-EBCDIC                                DD 73 66 73                                 221 115 102 115
Standard Compression Scheme for Unicode   0E FE FF                                    14 254 255
BOCU-1                                    FB EE 28, optionally followed by FF         251 238 40, optionally followed by 255
GB-18030                                  84 31 95 33                                 132 49 149 51
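The table can be turned into a simple lookup. The helper below is a hypothetical sketch (the name detectBom is not from the article) that covers only the UTF-8, UTF-16 and UTF-32 rows. Longer signatures are tested first, because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```php
<?php
/**
 * Return the encoding implied by a leading BOM, or null if none matches.
 * Hypothetical helper; only the UTF-8/16/32 rows of the table are covered.
 */
function detectBom(string $data): ?string
{
    // Order matters: check 4-byte signatures before 2-byte ones.
    $signatures = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($signatures as $encoding => $bom) {
        // strncmp() is binary safe, so NUL bytes in the signature are fine.
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

var_dump(detectBom("\xEF\xBB\xBF" . "test_test1")); // string(5) "UTF-8"
var_dump(detectBom("\xFF\xFE" . "t\x00"));          // string(8) "UTF-16LE"
var_dump(detectBom("test_test2"));                  // NULL
```

A BOM is only a hint: UTF-16 little-endian text whose first character is U+0000 would be reported as UTF-32LE by this check.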

 

In other words, three invisible bytes (0xEF 0xBB 0xBF, the BOM) are hidden at the beginning of the CSV file. PHP does not strip this BOM, so it becomes part of the first value, and the two strings do not have the same form.

Therefore, you must remove the BOM header when processing uploaded files.

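Solution: $s = trim($s, "\xEF\xBB\xBF\xFF\xFE");

The second argument, "\xEF\xBB\xBF\xFF\xFE", is the set of bytes to strip. It covers the bytes of the following BOM headers:

"\xEF\xBB\xBF"  UTF-8
"\xFF\xFE"      UTF-16LE (low byte first)
"\xFE\xFF"      UTF-16BE (high byte first)

Note that trim() treats this argument as a set of individual bytes and strips them from both ends of the string. It can therefore also remove the last byte of a legitimate multibyte character at the end of the value (for example "»", which is C2 BB in UTF-8). Removing only an exact BOM prefix from the start of the data is safer.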
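A stricter variant of the fix removes the BOM only when it appears as an exact prefix. This is a minimal sketch: stripUtf8Bom is a hypothetical helper name, the $raw string stands in for the uploaded file's contents, and only the UTF-8 BOM is handled.

```php
<?php
/**
 * Remove one leading UTF-8 BOM, if present, and leave the rest untouched.
 * Hypothetical helper, not part of PHP.
 */
function stripUtf8Bom(string $s): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($s, $bom, 3) === 0 ? substr($s, 3) : $s;
}

// Stand-in for the uploaded file contents (assumed sample data).
$raw = "\xEF\xBB\xBF" . "test_test1,1\ntest_test2,2\n";

$rows = [];
foreach (explode("\n", trim(stripUtf8Bom($raw))) as $line) {
    // Delimiter, enclosure and escape are passed explicitly.
    $rows[] = str_getcsv($line, ',', '"', '\\');
}

var_dump(strlen($rows[0][0])); // int(10)
var_dump($rows[0][0]);         // string(10) "test_test1"

// For comparison: trim() with a byte list damages a value ending in "»" (C2 BB).
$damaged = trim("a\xC2\xBB", "\xEF\xBB\xBF\xFF\xFE");
var_dump(bin2hex($damaged));   // string(4) "61c2"
```

With a real upload, the same helper could be applied to the first line read from the file before it is split into fields.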

 

I may not have explained this clearly, and I do not fully understand it myself. However, if you run into this problem, you can use this solution to fix it first and then look up more information online.
