On many UNIX-like platforms, the iconv tool converts text between character encodings, and for ordinary text files the file command can detect which encoding a file uses. Combining the two makes it easy to re-encode a text file of unknown encoding into an encoding of your choice. For example, some files in the Linux kernel source are not pure ASCII (apparently because of some hackers' "unusual" names):
$ cd /path/to/linux-2.6.17
$ file kernel/sys.c
kernel/sys.c: ISO-8859 C program text
So file reports the character encoding of this file as ISO-8859.
Let's find out which part is not ASCII. Ask iconv to convert the file from ASCII:
$ iconv -f ascii -t utf8 kernel/sys.c > /tmp/sys.c
iconv: illegal input sequence at position 29203
The conversion fails: the byte at offset 29203 is not valid ASCII. Use hexdump and cat to see what is at that location:
$ hexdump -C -n 10 -s 29203 kernel/sys.c
00007213  e5 20 73 76 65 6e 73 6b  61 2e                    |. svenska.|
0000721d
$ cat kernel/sys.c | grep svenska
* Samma p? svenska..
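At first glance this might be taken for an author's name, but the surrounding words suggest a comment written in Swedish; the non-ASCII byte 0xE5 is the letter "å" in ISO-8859-1.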
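The offending offset can also be located directly, without provoking an iconv error first. The following is a minimal sketch, assuming od and awk are available; first_non_ascii_offset is a hypothetical helper name, not a standard command.

```shell
#!/bin/bash
# first_non_ascii_offset FILE   (hypothetical helper)
# Print the 0-based offset of the first byte above 0x7F,
# or print nothing if the file is pure ASCII.
first_non_ascii_offset() {
    # od dumps every byte as an unsigned decimal number;
    # awk counts bytes until it meets one greater than 127.
    od -An -v -tu1 "$1" | awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i > 127) { print n + 0; exit }
                n++
            }
        }'
}
```

Called as first_non_ascii_offset kernel/sys.c, it should report the same position that iconv complained about, and the result can be passed to hexdump -s as shown above.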
Next, convert the file using the encoding that file reported, ISO-8859. First run iconv -l to check whether iconv supports ISO-8859:
$ iconv -l | grep ISO-8859
ISO-8859-1//
ISO-8859-2//
ISO-8859-3//
ISO-8859-4//
ISO-8859-5//
ISO-8859-6//
ISO-8859-7//
ISO-8859-8//
ISO-8859-9//
ISO-8859-9E//
ISO-8859-10//
ISO-8859-11//
ISO-8859-13//
ISO-8859-14//
ISO-8859-15//
ISO-8859-16//
It is supported, but not as plain ISO-8859: iconv only knows the numbered variants, so you have to pick one of them for the conversion.
$ iconv -f ISO-8859-1 -t utf8 kernel/sys.c > /tmp/sys.c
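Because file only names the ISO-8859 family, one way to choose a member is to decode a sample line under a few candidates and see which one reads correctly. This is a rough sketch, assuming a terminal that displays UTF-8; try_encodings is a hypothetical helper name.

```shell
#!/bin/bash
# try_encodings FILE PATTERN ENCODING...   (hypothetical helper)
# Show the first line matching PATTERN, decoded under each candidate encoding.
try_encodings() {
    local file=$1 pattern=$2 enc
    shift 2
    for enc in "$@"; do
        printf '%s: ' "$enc"
        # LC_ALL=C and -a make grep treat the file as raw bytes
        LC_ALL=C grep -a -- "$pattern" "$file" | head -n 1 |
            iconv -f "$enc" -t UTF-8
    done
}
```

For example, try_encodings kernel/sys.c svenska ISO-8859-1 ISO-8859-5 ISO-8859-15 prints the same line three times, and the variant whose output looks like real text is the likely candidate.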
Now look at the size of the converted file and at the content near byte 29203:
$ ls -l kernel/sys.c /tmp/sys.c
-rwxr-xr-x 1 falcon 50359 kernel/sys.c
-rw-r--r-- 1 falcon 50360 2008-06-29 /tmp/sys.c
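The one-byte difference in size (50359 versus 50360) is what you would expect if the file holds a single ISO-8859-1 byte above 0x7F, since each such byte becomes a two-byte sequence in UTF-8. A small sketch to check this for any file; utf8_growth is a hypothetical helper name.

```shell
#!/bin/bash
# utf8_growth FILE FROM_ENCODING   (hypothetical helper)
# Print how many bytes FILE gains when converted from FROM_ENCODING to UTF-8.
utf8_growth() {
    local before after
    before=$(wc -c < "$1")
    after=$(iconv -f "$2" -t UTF-8 "$1" | wc -c)
    echo $((after - before))
}
```

Running utf8_growth kernel/sys.c ISO-8859-1 should agree with the difference shown by ls -l.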
$ cat /tmp/sys.c | grep sven
* Samma på svenska..
To sum up, how do you re-encode a text file of unknown character encoding into a specified encoding?
1. Run the file command to see the character encoding of the file.
2. Run iconv -l to check whether iconv supports that encoding; if it is not listed under exactly that name, pick the closest match.
3. If a match is found, convert with iconv; otherwise, report an error.
Following these steps, you can write a script that performs the conversion automatically (it is incomplete; extend it as needed), for example:
#!/bin/bash
# encode.sh -- encode a file with an indicated encoding

# Make sure the user gives two arguments
[ "$#" -ne 2 ] && echo "Usage: $(basename "$0") to_encoding file" && exit 1

# Make sure the second argument is a regular file
[ ! -f "$2" ] && echo "the second argument should be a regular file" && exit 1
file=$2

# Make sure the first argument is an encoding supported by iconv
# (-i because iconv -l may list names in upper case, e.g. UTF8//)
iconv -l | grep -qi -- "$1"
[ $? -ne 0 ] && echo "iconv does not support such encoding: $1" && exit 1
to_encoding=$1

# Is it a text file?
file_type=$(file "$file" | grep "text$")
[ $? -ne 0 ] && echo "$file is not a text file" && exit 1

# Get the old encoding as reported by file, e.g. "ISO-8859"
from_encoding=$(echo "${file_type#*: }" | cut -d" " -f1)
# Take the first matching name known to iconv, e.g. "ISO-8859-1//"
from_encoding=$(iconv -l | grep -i -- "$from_encoding" | head -n 1)
[ -z "$from_encoding" ] && echo "iconv does not support the old encoding" && exit 1
from_encoding=$(echo "$from_encoding" | cut -d"/" -f1)

# Convert the file from from_encoding to to_encoding
iconv -f "$from_encoding" -t "$to_encoding" "$file"
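As a possible variation, some versions of file can print the encoding in a form that iconv accepts directly, which removes the need to search the iconv -l list. This is only a sketch, assuming your file supports the --mime-encoding option; to_utf8 is a hypothetical function name.

```shell
#!/bin/bash
# to_utf8 FILE   (hypothetical helper)
# Convert FILE to UTF-8 on standard output, using the encoding that
# file reports. Assumes file supports --mime-encoding.
to_utf8() {
    local enc
    enc=$(file -b --mime-encoding "$1") || return 1
    case $enc in
        binary|unknown-8bit)
            echo "cannot detect the encoding of $1" >&2
            return 1
            ;;
    esac
    iconv -f "$enc" -t UTF-8 "$1"
}
```

The detection is still a guess, so it is worth checking the result by eye, for instance with to_utf8 kernel/sys.c | grep sven.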
Save the encode.sh script, make it executable, and use it to convert the file:
$ chmod +x encode.sh
$ ./encode.sh utf8 kernel/sys.c
Charset-detector
Development usually starts from some motivation, and that is how this small program came about. I had been using [pcmanx] to connect to BBS sites on the other side of the strait and ran into an annoying problem: I had to specify the encoding by hand. A minor hand injury from a recent ride made the extra typing even more of a chore, so I decided to add automatic detection of the BBS encoding to [pcmanx].
Mozilla already implements an algorithm for automatically guessing the encoding of a document, and the Mozilla website provides the paper [A composite approach to language/encoding detection] as a reference; a Chinese translation of it is also available online. The implementation can be found in the Mozilla CVS tree under [extensions/universalchardet]. The blog post [Mozilla re-licensing end] also mentions that the Mozilla Foundation recently announced that the Mozilla codebase has moved from the original MPL (Mozilla Public License) alone to MPL, GPL, and LGPL. That makes the license compatible with [pcmanx], so the remaining question is how to integrate the code.
As a first step I removed the dependency on NSPR (the Mozilla runtime) and compiled with the g++ flags -fno-rtti, -fno-exceptions, and -nostdinc++. If -lstdc++ is replaced with -lsupc++, the code can gradually be turned into a C-only library. The goal is to build an add-on that [pcmanx] can load with dlopen. The first working version of this automatic encoding detection library is called [charset-detector] (bzip2 tarball). The program in the test directory shows how it works; initcall.txt is a Big5 text file:
charset-detector/test$ file initcall.txt
initcall.txt: ISO-8859 English text, with CRLF line terminators
charset-detector/test$ ./test-chardetect ./initcall.txt
File ./initcall.txt ...
Charset = Big5
The UNIX file tool guessed wrong. Fortunately, charset-detector identified the encoding correctly. The charset-detect library has only six APIs, so it is easy to use. The next step is to hack [pcmanx]: once the BBS connection is established, the incoming buffer will be passed to the charset-detect APIs to determine the encoding, and the screen will then be re-rendered accordingly.
Published by jserv on May 22, 2006