Most non-English text files we encounter are encoded in UTF-8, and we can usually view their contents without problems. Sometimes, however, we run into files in other encodings, such as GBK for Chinese or CP1251 for Russian. Text files generally carry no information about their own encoding, which makes them troublesome to handle. This article describes several Linux commands for detecting and converting the encoding of a text file.
Detecting the file encoding format
The name enca is short for Extremely Naive Charset Analyser; as the name suggests, it is a tool for detecting the encoding of a file.
Installing enca
Under Ubuntu, you can install it with the following command:
# apt-get install enca
How to use
The simplest way to use it is as follows:
# enca test.txt
Simplified Chinese National Standard; GB2312
Here test.txt is a text file containing Chinese text encoded in GB2312. According to the enca documentation, when we are lucky, enca can detect the encoding without any additional parameters, as it did above. In my experience, the system language setting is one factor in that luck. The command above was run on a Linux system whose default language is Chinese, and enca correctly detected the file's Chinese encoding. When the default language is English, we are not so lucky:
# enca test.txt
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.
Following the error message, we need to pass the -L parameter to tell enca the language of the file.
First, let's look at the languages enca supports on the current system, along with the corresponding encodings:
# enca --list languages
belarusian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855 KOI8-U
bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
croatian: CP1250 ISO-8859-2 IBM852 macce CORK
hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
slovene: ISO-8859-2 CP1250 IBM852 macce CORK
ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
chinese: GBK BIG5 HZ
none:
In my test environment, enca supports the languages and encodings above. The entry for Chinese is chinese: GBK BIG5 HZ, so we can try the following combination of parameters:
# enca -L chinese test.txt
Simplified Chinese National Standard; GB2312
This time, enca gave a definite answer.
By default, enca prints a human-readable encoding name, such as Simplified Chinese National Standard; GB2312 above. Sometimes we want to pass the result to another command or program, for example to convert the file with iconv. In that case, add the -i parameter to make enca print an encoding name that iconv accepts:
# enca -i -L chinese test.txt
GBK
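The following is a minimal sketch of how the name printed by enca -i could be handed to iconv. The helper name to_utf8 and the file sample_gbk.txt are hypothetical, chosen only for illustration. The sample bytes are hard-coded and the enca call appears only in a comment, so the sketch runs even where enca is not installed.

```shell
#!/bin/sh
# to_utf8 ENCODING FILE: print FILE converted from ENCODING to UTF-8.
# (to_utf8 is a hypothetical helper, not part of enca or iconv.)
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}

# Build a tiny GBK sample: the bytes D6 D0 CE C4 (octal escapes below)
# spell the two characters "中文" in GBK.
printf '\326\320\316\304\n' > sample_gbk.txt

# For a real file, the encoding name could come straight from enca, e.g.:
#   to_utf8 "$(enca -i -L chinese test.txt)" test.txt
# Here the name is given by hand so that enca is not required.
to_utf8 GBK sample_gbk.txt
```

On a UTF-8 terminal this should display the two Chinese characters stored in the sample file.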
Converting the file encoding format
Once we know a file's encoding, we usually want to convert it to a common or system-supported encoding such as UTF-8 for further processing.
Using enca to convert
When the -x parameter is added, enca converts the file to the encoding named by that parameter:
# enca -L chinese test.txt
Simplified Chinese National Standard; GB2312
# enca -x UTF8 -L chinese test.txt
# enca -L chinese test.txt
Universal transformation format 8 bits; UTF-8
As you can see, after enca -x UTF8 -L chinese test.txt runs, the encoding of test.txt has changed from GB2312 to UTF-8. Note that enca overwrites the source file, so remember to back it up before running this command.
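One way to make the backup step hard to forget is a small wrapper that copies the file before running any in-place command. This is only a sketch: with_backup and the .bak suffix are hypothetical choices, and the demonstration uses a plain shell redirection as the overwriting command so that it runs without enca.

```shell
#!/bin/sh
# with_backup FILE COMMAND...: copy FILE to FILE.bak, then run COMMAND.
# COMMAND runs only if the copy succeeded.
# (with_backup and the .bak suffix are illustrative assumptions.)
with_backup() {
    file=$1
    shift
    cp -p -- "$file" "$file.bak" && "$@"
}

# Demonstration with a stand-in command that overwrites the file.
printf 'old\n' > wb_demo.txt
with_backup wb_demo.txt sh -c 'printf "new\n" > wb_demo.txt'

# With enca installed, the same wrapper could guard the conversion:
#   with_backup test.txt enca -x UTF8 -L chinese test.txt
```

After the demonstration, wb_demo.txt holds the new content and wb_demo.txt.bak keeps the original.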
Using iconv to convert
iconv is the standard command and API for converting character encodings on *nix systems. To convert a GBK-encoded file to UTF-8, use iconv as follows:
# iconv -f GBK -t UTF8 test.txt
Here test.txt is the file to convert, -f GBK says that the source file is encoded in GBK, and -t UTF8 names the target encoding. When run, this command prints the converted contents to standard output.
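Because iconv exits with a non-zero status when the input contains bytes that are invalid in the source encoding, it can also serve as a rough check of whether a guessed encoding is plausible. The sketch below assumes that behaviour; is_valid_as and gbk_sample.txt are hypothetical names. A clean decode does not prove the guess is right, since many byte sequences are valid in several encodings.

```shell
#!/bin/sh
# is_valid_as ENCODING FILE: succeed if FILE decodes cleanly as ENCODING.
# The converted text is discarded; only the exit status is used.
# (is_valid_as is a hypothetical helper for illustration.)
is_valid_as() {
    iconv -f "$1" -t UTF-8 "$2" > /dev/null 2>&1
}

# Sample file holding two Chinese characters encoded in GBK
# (bytes D6 D0 CE C4, written as octal escapes).
printf '\326\320\316\304\n' > gbk_sample.txt

for enc in UTF-8 GBK; do
    if is_valid_as "$enc" gbk_sample.txt; then
        echo "$enc: decodes cleanly"
    else
        echo "$enc: invalid byte sequence"
    fi
done
```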
If you want to save the converted content to a file, add the -o parameter:
# iconv -f GBK -t UTF8 test.txt -o test_converted.txt
This command saves the converted content to the file test_converted.txt.
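When several files share the same source encoding, the conversion can be wrapped in a loop. The sketch below is one possible approach under stated assumptions: convert_all and the batch_*.txt file names are hypothetical, the _converted suffix follows the naming used above, and shell redirection is used in place of an output option so the loop depends only on the -f and -t parameters.

```shell
#!/bin/sh
# convert_all SRC_ENCODING FILE...: for each NAME.txt, write a UTF-8 copy
# to NAME_converted.txt. Stops at the first file that fails to convert.
# (convert_all is a hypothetical helper for illustration.)
convert_all() {
    src=$1
    shift
    for f in "$@"; do
        out="${f%.txt}_converted.txt"
        iconv -f "$src" -t UTF-8 "$f" > "$out" || return 1
    done
}

# Two small GBK samples, one Chinese character each (octal byte escapes).
printf '\326\320\n' > batch_a.txt
printf '\316\304\n' > batch_b.txt

convert_all GBK batch_a.txt batch_b.txt
```

The original files are left untouched, so a failed run can simply be repeated.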
Running iconv -l lists all the character set names it supports. As mentioned earlier, enca -i prints the file's encoding under a name that iconv accepts.