File character encoding detection and conversion under Linux

Source: Internet
Author: User

Most files containing non-English characters that we encounter are encoded in UTF-8, and we can usually view their contents without problems. Sometimes, however, we come across files in other encodings, such as GBK for Chinese or CP1251 for Russian. Text files generally carry no information about their own encoding, which makes them troublesome to handle. This article describes several Linux commands for detecting and converting the encoding of a text file.

Detect file encoding format

The enca command's name is an abbreviation of Extremely Naive Charset Analyser; as the name suggests, it is used to detect the encoding of a file.

Install enca

Under Ubuntu, you can install it with the following command:

apt-get install enca

How to use

The simplest way to use it is as follows:

# enca test.txt
Simplified Chinese National Standard; GB2312

The test.txt above is a text file containing Chinese text in GB2312 encoding. According to enca's documentation, if we are lucky, enca can detect the file's encoding without any additional parameters. In my experience, the system language setting is one factor that affects this luck. The run above was on a Linux system whose default language is Chinese, and enca correctly detected the file's Chinese encoding, as shown. When the system's default language is English, we are not so lucky:

# enca test.txt
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.

As the error message from enca says, we need to pass the -L parameter to specify the language of the file being detected.

First, we list the languages that enca supports on the current system, together with the encodings it knows for each:

# enca --list languages
belarusian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855 KOI8-U
 bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
     czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
  estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
  croatian: CP1250 ISO-8859-2 IBM852 macce CORK
 hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
   latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
    polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
   russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
    slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
   slovene: ISO-8859-2 CP1250 IBM852 macce CORK
 ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
   chinese: GBK BIG5 HZ
      none:

In my test environment, enca supports the languages and encodings listed above. The entry for Chinese is chinese: GBK BIG5 HZ, so we can try the following combination of parameters:

# enca -L chinese test.txt
Simplified Chinese National Standard; GB2312

This time, enca gives a definite answer.
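
Because UTF-8 is the common case, a quick complementary check is to ask whether a file already decodes cleanly as UTF-8 before turning to enca. The sketch below is only an illustration: is_utf8 is a hypothetical helper name, the sample files are throwaway files created for the demo, and the check relies on iconv rejecting invalid input when converting from UTF-8 to UTF-8. Pure ASCII files also pass, and a legacy-encoded file could in rare cases happen to be valid UTF-8, so treat the result as a hint rather than proof.

```shell
# is_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# (is_utf8 is a hypothetical helper name, not part of enca or iconv.)
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Demo: two throwaway files holding the same two Chinese characters.
demo_dir=$(mktemp -d)
printf '\344\270\255\346\226\207\n' > "$demo_dir/utf8.txt"   # UTF-8 bytes
printf '\326\320\316\304\n'         > "$demo_dir/gbk.txt"    # GBK bytes

for f in "$demo_dir/utf8.txt" "$demo_dir/gbk.txt"; do
    if is_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): not UTF-8, try enca"
    fi
done
```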

By default, enca prints a human-readable encoding name, such as Simplified Chinese National Standard; GB2312 above. Sometimes we want to pass the result to other commands or programs, for example to the iconv command to convert the file's encoding. In that case, add the -i parameter to make enca print an encoding name that iconv understands:

# enca -i -L chinese test.txt
GBK

Convert file encoding format

Once we know a file's actual encoding, we usually want to convert it to a common or system-supported encoding such as UTF-8 for further processing.

Use enca to convert

When the -x parameter is added, the enca command converts the file to the encoding specified by -x:

# enca -L chinese test.txt
Simplified Chinese National Standard; GB2312
# enca -x UTF8 -L chinese test.txt
# enca -L chinese test.txt
Universal transformation format 8 bits; UTF-8

As you can see, after enca -x UTF8 -L chinese test.txt is executed, the encoding of test.txt has been converted from GB2312 to UTF-8. Note that enca overwrites the source file, so back up the source file before using this command.
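
One way to make the backup step hard to forget is to wrap it in a small shell function. The sketch below is only an illustration: with_backup and the .bak suffix are hypothetical choices, not something enca provides. The function copies the file first and runs the given command only if the copy succeeded.

```shell
# with_backup FILE CMD [ARGS...]
# Copy FILE to FILE.bak, then run CMD. If the copy fails, CMD is not run.
# (with_backup and the .bak suffix are illustrative choices.)
with_backup() {
    target=$1
    shift
    cp -p -- "$target" "$target.bak" || return 1
    "$@"
}

# Example, using the command from this section:
#   with_backup test.txt enca -x UTF8 -L chinese test.txt
```

If the conversion turns out wrong, the original content is still available in test.txt.bak.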

Use iconv to convert

iconv is the standard command and API for converting character encodings on *nix systems. To convert a GBK-encoded file to UTF-8, use the iconv command as follows:

# iconv -f GBK -t UTF8 test.txt

Here test.txt is the file to be converted, -f GBK indicates that the source file is encoded as GBK, and -t UTF8 specifies the target encoding. After the command runs, iconv prints the converted content to standard output.

If you want to save the converted content to a file, add the -o parameter:

# iconv -f GBK -t UTF8 test.txt -o test_converted.txt

This command will automatically save the converted content to the test_converted.txt file.
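
To see the conversion at the byte level, the following sketch builds a tiny GBK sample, converts it, and dumps both files with od. The temporary directory and the sample content (the two characters of the Chinese word for "Chinese language") are assumptions made for the demo. It writes the result with shell redirection, which should be equivalent to -o here and also works with iconv builds that lack that option.

```shell
# Build a tiny GBK-encoded sample: two Chinese characters plus a newline.
work=$(mktemp -d)
printf '\326\320\316\304\n' > "$work/test.txt"

# Convert GBK -> UTF-8, saving the result through shell redirection.
iconv -f GBK -t UTF-8 "$work/test.txt" > "$work/test_converted.txt"

# Dump the bytes before and after conversion.
od -An -tx1 "$work/test.txt"
od -An -tx1 "$work/test_converted.txt"
```

The GBK file is 5 bytes long (two bytes per character plus the newline), while the UTF-8 result is 7 bytes (three bytes per character plus the newline).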

You can view all character set names that iconv supports by running iconv -l. As mentioned earlier, enca -i outputs encoding names that iconv can use.
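
The layout of the iconv -l listing differs between iconv implementations, so in scripts it can be simpler to probe for a name than to parse the list. The sketch below is one possible approach, assuming that iconv exits with a non-zero status when it does not know the source encoding; encoding_supported is a hypothetical helper name, and NO-SUCH-CHARSET is a deliberately bogus name used for the demo.

```shell
# encoding_supported NAME: succeed if the local iconv accepts NAME
# as a source encoding. Converts empty input, so no file is needed.
# (encoding_supported is a hypothetical helper name.)
encoding_supported() {
    iconv -f "$1" -t UTF-8 </dev/null >/dev/null 2>&1
}

for name in GBK UTF-8 NO-SUCH-CHARSET; do
    if encoding_supported "$name"; then
        echo "$name: supported"
    else
        echo "$name: not supported"
    fi
done
```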

Resources

    • enconv(1) - Linux man page
    • Wiki - iconv
    • libiconv
