Garbled file names when unzipping and UTF-8 BOM reading problems between Linux and Windows

Source: Internet
Author: User

Over the past few days I needed to run a CNN on data under Linux, and many garbled characters appeared when I uploaded the data and the data lists from Windows. This is a summary of the encoding problems I ran into over those two days.

1. Garbled file names after unzipping

I compressed the files into a zip archive on Windows and transferred the archive to a directory on the Linux server over FTP (using Xftp). Extracting it directly with unzip on Linux produced garbled file names.

The main reason is that file and folder names are usually written into the archive in the encoding of the current environment, which on Chinese Windows is GBK by default. Linux systems generally default to UTF-8, so when the archive is extracted there the GBK bytes are decoded with the wrong encoding and garbled characters appear. The solution is to pass the CP936 encoding with the -O option, if your unzip build supports it:

    unzip -O CP936 xxx.zip

Some people may not recognize the name CP936. It stands for code page 936: in the code page numbering scheme that originated with IBM, the Simplified Chinese multi-byte character set sits at number 936, and that is the code page GBK corresponds to on Windows. This is why CP936 is used in many places to mean GBK.

2. Garbled text when reading a file with a BOM

My txt file is an image list whose paths contain Chinese characters. A large number of folders are involved, so renaming everything to English is not practical, and a GBK-encoded txt file shows up garbled on Linux. To make processing easier I tried to standardize on UTF-8 everywhere, so I used Notepad on Windows to save the ANSI-encoded txt file as UTF-8. But I still ran into the most confusing problem of all.

This is where the concept of the BOM (byte order mark) comes in. Some programs write a specific marker at the very beginning of a file when they encode it. A program that opens the file can read the marker and decode the file accordingly, which avoids a lot of guessing. However, editors are not consistent about writing or expecting the marker, and that causes trouble. The common markers for UTF-8 and UTF-16 are:

    BOM_UTF8      '\xef\xbb\xbf'
    BOM_UTF16_LE  '\xff\xfe'
    BOM_UTF16_BE  '\xfe\xff'

I needed to upload the txt file to Linux, where UTF-8 is the default. A file saved as UTF-8 directly from Windows Notepad caused an error when I read it later: the bytes \xef\xbb\xbf turned up at the very start of the decoded text. This is probably because the BOM is not skipped by default when the file is read on Linux. So if a file is going to be transferred to Linux, it is best to save it without a BOM. I used Notepad++ and selected the "UTF-8 without BOM" encoding, and after that the Chinese characters read correctly on Linux.
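For the first problem, the same repair can be done in a script when an unzip build with -O is not available. The following is a minimal sketch, not the method used above. It relies on the fact that Python's zipfile module decodes member names as cp437 when the archive's UTF-8 flag is not set, so the original bytes can be recovered and decoded again as GBK. The function names recover_name and extract_with_encoding are made up for this illustration, and the sketch assumes a trusted archive because it does not guard against ".." in member names.

```python
import zipfile
from pathlib import Path

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def recover_name(info: zipfile.ZipInfo, encoding: str = "gbk") -> str:
    """Return the member name, re-decoded with `encoding` when needed."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # zipfile already decoded it as UTF-8
    # zipfile decoded the raw bytes as cp437; undo that, then decode as GBK.
    return info.filename.encode("cp437").decode(encoding)


def extract_with_encoding(zip_path: str, dest: str, encoding: str = "gbk") -> list[str]:
    """Extract zip_path into dest, restoring names stored in `encoding`.

    Hypothetical helper for trusted archives only (no path sanitizing).
    """
    names = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            name = recover_name(info, encoding)
            target = Path(dest) / name
            if info.is_dir():
                target.mkdir(parents=True, exist_ok=True)
            else:
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_bytes(zf.read(info))
            names.append(name)
    return names
```

A call such as extract_with_encoding("xxx.zip", "data") would then play the role of unzip -O CP936 xxx.zip. On Python 3.11 and later, zipfile.ZipFile also accepts a metadata_encoding argument for reading, which may make the manual cp437 round trip unnecessary.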
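For the BOM problem, the file can also be handled on the Linux side instead of being re-saved on Windows. The sketch below assumes the list is read with Python; the BOM constants come from the standard codecs module, while the helper names decode_without_bom, read_list and strip_utf8_bom are invented for this example. It only covers the UTF-8 and UTF-16 markers listed above.

```python
import codecs

# (BOM bytes, codec used to decode what follows the BOM)
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_without_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode bytes, dropping a leading UTF-8/UTF-16 BOM if present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)


def read_list(path: str) -> list[str]:
    """Read a text list; "utf-8-sig" skips a UTF-8 BOM if there is one."""
    with open(path, encoding="utf-8-sig") as f:
        return [line.rstrip("\n") for line in f]


def strip_utf8_bom(path: str) -> bool:
    """Rewrite the file without its UTF-8 BOM; return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Reading with the "utf-8-sig" codec is the least intrusive choice because it works whether or not the file has a BOM. Rewriting the file with strip_utf8_bom is useful when other tools that do not skip the marker will read the same list.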
