Convert the character encoding type of a file by using the file and iconv commands.

Source: Internet
Author: User

http://hi.baidu.com/netwrom/blog/item/8885f31ef0d09ae7e1fe0b1c.html
Many UNIX-like platforms ship the iconv tool, which converts text between character encodings. For ordinary text files, the file command can detect the character encoding. Combining the two makes it easy to re-encode a text file of unknown encoding into a specified encoding. For example, some files in the Linux kernel source are not pure ASCII (apparently because of contributors' non-English names or text):

$ cd /path/to/linux-2.6.17
$ file kernel/sys.c
kernel/sys.c: ISO-8859 C program text

The output shows that file identifies the character encoding of this file as ISO-8859.
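The description that file prints is meant for people and is awkward to parse in a script. As a hedged alternative that the original article does not use, recent builds of file accept --mime-encoding, which together with -b prints only a charset name. The helper name detect_encoding below is hypothetical, and the sketch assumes a file build with that option.

```shell
#!/bin/sh
# detect_encoding FILE
# Print only the charset guessed by file(1), for example "us-ascii",
# "utf-8", or "iso-8859-1". Assumes file(1) supports --mime-encoding.
detect_encoding() {
    file -b --mime-encoding "$1"
}

# Example: a short sample containing one byte (octal 345 = 0xE5) that is not ASCII.
sample=$(mktemp)
printf 'Samma p\345 svenska\n' > "$sample"
detect_encoding "$sample"
rm -f "$sample"
```

Unlike the bare "ISO-8859" in the human-readable output, the name printed this way is intended to be a concrete charset label.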

Let's find out which part is not ASCII. Ask iconv to convert the file as if it were ASCII:

$ iconv -f ascii -t utf8 kernel/sys.c > /tmp/sys.c
iconv: illegal input sequence at position 29203




The conversion fails: the byte at offset 29203 is not ASCII. Use hexdump and cat to see what is at that position:

$ hexdump -C -n 10 -s 29203 kernel/sys.c
00007213  e5 20 73 76 65 6e 73 6b  61 2e  |. svenska.|
0000721d
$ cat kernel/sys.c | grep svenska
 * Samma p? svenska..




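The first guess was that this is an author's name. The byte 0xE5 followed by "svenska" suggests Swedish text instead: in ISO-8859-1, 0xE5 is the letter "å", which the terminal cannot display here and shows as "?".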
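When a file may contain several such bytes, checking one iconv error position at a time is slow. The following is a minimal sketch, not part of the original article, that counts the bytes outside 7-bit ASCII and lists the lines containing them. The function names are hypothetical, and the -a option of grep is a GNU/BSD extension that is assumed to be available.

```shell
#!/bin/sh
# count_non_ascii FILE: number of bytes outside the 7-bit ASCII range.
# tr deletes every byte from octal 000 to 177, so only bytes >= 0x80 remain.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# non_ascii_lines FILE: print "LINE:TEXT" for each line that holds a byte
# which is neither printable ASCII nor whitespace. In the C locale this
# covers every byte >= 0x80 (and also ASCII control characters).
non_ascii_lines() {
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$1"
}
```

Running count_non_ascii kernel/sys.c followed by non_ascii_lines kernel/sys.c would then give an overview before any conversion is attempted.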

Next, convert the file using the encoding that file reported, ISO-8859. First use iconv -l to check whether iconv supports ISO-8859:

$ iconv -l | grep ISO-8859
ISO-8859-1//
ISO-8859-2//
ISO-8859-3//
ISO-8859-4//
ISO-8859-5//
ISO-8859-6//
ISO-8859-7//
ISO-8859-8//
ISO-8859-9//
ISO-8859-9E//
ISO-8859-10//
ISO-8859-11//
ISO-8859-13//
ISO-8859-14//
ISO-8859-15//
ISO-8859-16//




ISO-8859 is supported, but not under that bare name, so you have to choose one of the variants for the conversion:

$ iconv -f ISO-8859-1 -t utf8 kernel/sys.c > /tmp/sys.c
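Instead of searching the iconv -l listing, a script can ask iconv directly whether it accepts a name. The sketch below is an assumption-laden illustration, not the article's method: the helper name iconv_supports is hypothetical, and NO-SUCH-ENCODING is a made-up name used only to show the failure case.

```shell
#!/bin/sh
# iconv_supports NAME: succeed if iconv accepts NAME as an encoding.
# Converting empty input from NAME to itself makes iconv look the name up
# without needing any real data.
iconv_supports() {
    iconv -f "$1" -t "$1" < /dev/null > /dev/null 2>&1
}

for name in ISO-8859-1 UTF-8 NO-SUCH-ENCODING; do
    if iconv_supports "$name"; then
        echo "$name: supported"
    else
        echo "$name: not supported"
    fi
done
```

This relies on iconv returning a non-zero exit status for an unknown encoding name, which avoids depending on the exact layout of the iconv -l output.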




Now compare the file sizes and look at the converted text that was at byte 29203:

$ ls -l kernel/sys.c /tmp/sys.c
-rwxr-xr-x 1 falcon 50359 kernel/sys.c
-rw-r--r-- 1 falcon 50360 2008-06-29 /tmp/sys.c
$ cat /tmp/sys.c | grep sven
 * Samma på svenska..
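The converted file is one byte larger, which is consistent with a single "å": it takes one byte (0xE5) in ISO-8859-1 and two bytes (0xC3 0xA5) in UTF-8. The following sketch reproduces this on a small sample and checks that the result decodes as UTF-8. The helper name is_valid_utf8 is hypothetical; it relies on iconv failing when its input is not valid in the declared source encoding.

```shell
#!/bin/sh
# is_valid_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

src=$(mktemp)
dst=$(mktemp)
# Octal 345 is 0xE5, the letter "å" in ISO-8859-1.
printf ' * Samma p\345 svenska..\n' > "$src"
iconv -f ISO-8859-1 -t UTF-8 "$src" > "$dst"

is_valid_utf8 "$src" || echo "source is not valid UTF-8"
is_valid_utf8 "$dst" && echo "converted copy is valid UTF-8"
echo "bytes before: $(wc -c < "$src"), after: $(wc -c < "$dst")"
rm -f "$src" "$dst"
```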

To sum up, how do you re-encode a text file of unknown encoding into a specified encoding?


1. Run the file command to find the character encoding of the file.

2. Use iconv -l to check whether iconv supports that encoding. If there is no exact match, pick the closest supported name.

3. If the encoding is supported, run iconv to convert the file. Otherwise, report an error.


These steps can be automated in a script (the one here is incomplete; extend it as needed), for example:


Code:

#!/bin/bash
# encode.sh -- encode a file with an indicated encoding

# Make sure the user gives two arguments
[ "$#" != 2 ] && echo "Usage: $(basename "$0") [to_encoding] [file]" && exit 1

# Make sure the second argument is a regular file
[ ! -f "$2" ] && echo "the second argument should be a regular file" && exit 1
file=$2

# Make sure the first argument is an encoding supported by iconv
# (-i because iconv -l may list names in upper case, e.g. UTF8 for utf8)
iconv -l | grep -q -i "$1"
[ $? -ne 0 ] && echo "iconv does not support such encoding: $1" && exit 1
to_encoding=$1

# Is it a text file?
file_type=$(file "$file" | grep "text$")
[ $? -ne 0 ] && echo "$file is not a text file" && exit 1

# Get the old encoding: the second word of the file output, e.g. ISO-8859
from_encoding=$(echo $file_type | cut -d" " -f 2)
from_encoding=$(iconv -l | grep "$from_encoding")
[ $? -ne 0 ] && echo "iconv does not support the old encoding: $from_encoding" && exit 1
# Keep the first matching name and strip the trailing "//"
from_encoding=$(echo $from_encoding | cut -d"/" -f 1)

# Convert the file from from_encoding to to_encoding (result goes to stdout)
iconv -f "$from_encoding" -t "$to_encoding" "$file"





Save the script as encode.sh, make it executable, and convert the file:

$ chmod +x encode.sh
$ ./encode.sh utf8 kernel/sys.c
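The script above parses the human-readable output of file, which varies between versions. As a hedged variant, not the article's script, the same three steps can be sketched as a function built on file -b --mime-encoding. The function name recode_to is hypothetical, and the sketch assumes that file supports that option and that iconv accepts the charset names it prints.

```shell
#!/bin/sh
# recode_to TO_ENCODING FILE
# Detect FILE's charset with file(1) and write the converted text to stdout.
recode_to() {
    to=$1
    src=$2
    [ -f "$src" ] || { echo "not a regular file: $src" >&2; return 1; }

    # Step 1: ask file(1) for a charset name only.
    from=$(file -b --mime-encoding "$src")

    # file(1) reports "binary" or "unknown-8bit" when it cannot name a charset.
    case $from in
        binary|unknown-8bit|"")
            echo "cannot determine a text encoding for $src" >&2
            return 1
            ;;
    esac

    # Steps 2 and 3: iconv itself rejects unsupported names with an error.
    iconv -f "$from" -t "$to" "$src"
}
```

A call such as recode_to UTF-8 kernel/sys.c > /tmp/sys.c would then replace the manual sequence, with the caveat that file only guesses and may report the wrong encoding.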

Charset-detector

There is usually a motivation behind a new program. I wrote this small one because, while trying to reach BBS sites on the other side of the strait through [PCManX], I ran into an annoying problem: I had to specify the encoding myself. On top of that, I had slightly strained my hand by gripping too hard while riding, so I kept mistyping. Anyway, I decided to add automatic BBS encoding detection to [PCManX].

Mozilla already implements an algorithm for automatically guessing a document's encoding, and the Mozilla website provides the paper [A Composite Approach to Language/Encoding Detection] for reference. A Chinese translation of the paper is also available online. The implementation can be found in the Mozilla CVS tree under [extensions/universalchardet].
The blog post [Mozilla re-licensing end] also mentions that the Mozilla Foundation recently announced that the Mozilla codebase has moved from the original MPL (Mozilla Public License) to MPL, GPL, and LGPL licensing. That is compatible with the license of [PCManX], so the remaining question is how to integrate the code.

I first removed the dependency on NSPR (the Mozilla runtime) and built with the g++ flags -fno-rtti, -fno-exceptions, and -nostdinc++. Replacing -lstdc++ with -lsupc++ also makes it possible to move step by step toward a C-only library. The goal is to build a plug-in that [PCManX] can load with dlopen. A first version of the automatic encoding detection library is done and is called [charset-detector] (bzip2 tarball).

The following uses the test program (in the test directory) as an example to show how it works. initcall.txt is a Big5-encoded file:

charset-detector/test$ file initcall.txt
initcall.txt: ISO-8859 English text, with CRLF line terminators
charset-detector/test$ ./test-chardetect ./initcall.txt
File ./initcall.txt ...
Charset = Big5

The UNIX file tool guessed wrong. Fortunately, charset-detector identified the encoding correctly. The charset-detect library has only six APIs, so it is easy to use. The next step is to hack [PCManX]: after the BBS connection is established, the received buffer will be passed to the charset-detect APIs to determine the encoding, and the screen will then be redrawn accordingly.
Published by jserv on May 22, 2006
