cnbook/textpro6 Application 1: Decoding Garbled Emails of the "Character Entity" Type


Suppose you receive an email with the following content:

&Agrave;&acute;&ETH;&Aring;&Ecirc;&Otilde;&micro;&frac12;&pound;&not;&ETH;&raquo;&ETH;&raquo;

Can you tell that this email actually says "来信收到，谢谢" ("Letter received, thank you")? This article discusses where this kind of garbled text comes from and introduces a simple way to decode it.

1. Origin of the garbled text

Garbled text arises in many different ways, and each kind calls for a different treatment. The garbled text in this article consists of a large number of strings such as "&Agrave;" and "&acute;". Anyone familiar with HTML will recognize these strings as HTML character entities.

1.1 Character entities

HTML defines a special notation for some characters: the character entity. A character entity is either a named entity or a numeric entity. A named entity has the form:

&character name;

For example, "&amp;" denotes "&" and "&Agrave;" denotes "À". Entity names are case-sensitive: "&agrave;" denotes the different character "à". A numeric entity has the form:

&#decimal Unicode code point of the character;

For example, "&#038;" denotes "&" and "&#192;" denotes "À".
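As a quick check of the two notations, Python's standard html.unescape function resolves both named and numeric entities. This is only an illustration in a general-purpose language; it is not part of cnbook or codeview.

```python
import html

# Named entities: the name is case-sensitive.
print(html.unescape("&amp;"))     # the ampersand itself
print(html.unescape("&Agrave;"))  # capital A with grave, code point 192 (hex C0)
print(html.unescape("&agrave;"))  # small a with grave, code point 224 (hex E0)

# Numeric entities: the decimal code point follows "&#".
print(html.unescape("&#038;"))    # the ampersand again
print(html.unescape("&#192;"))    # capital A with grave again
```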

1.2 How the garbled text arises

We can use codeview to look at the encoding of "来信收到，谢谢" ("Letter received, thank you"). Its GBK encoding is:

C0 B4 D0 C5 CA D5 B5 BD A3 AC D0 BB D0 BB

Some software does not support Chinese characters and writes each non-ASCII byte (a byte whose highest bit is 1) as a named entity. For example, it writes C0 as "&Agrave;" and B4 as "&acute;". This produces the garbled text shown at the beginning of the article:

&Agrave;&acute;&ETH;&Aring;&Ecirc;&Otilde;&micro;&frac12;&pound;&not;&ETH;&raquo;&ETH;&raquo;
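The following Python sketch reproduces this effect. It is a guess at what such software does, not its actual code: the function name garble is hypothetical, and the fallback to a numeric entity for bytes without an HTML 4 name is an assumption.

```python
from html.entities import codepoint2name

def garble(text, encoding="gbk"):
    """Encode text to bytes, then write every non-ASCII byte as an entity."""
    out = []
    for b in text.encode(encoding):
        if b < 0x80:
            out.append(chr(b))                       # ASCII bytes pass through
        elif b in codepoint2name:
            out.append("&%s;" % codepoint2name[b])   # e.g. 0xC0 -> &Agrave;
        else:
            out.append("&#%d;" % b)                  # assumed fallback, no named entity
    return "".join(out)

print(garble("来信收到，谢谢"))
```

Because each Chinese character occupies two bytes in GBK, every character in the message turns into two unrelated Latin-1 entities.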

2. Parsing the garbled text

The way to parse this kind of garbled text follows directly from its origin:

  1. Map each named or numeric entity back to the corresponding byte value.
  2. Interpret the resulting bytes with common encodings until meaningful text appears.

A small program could implement this decoding directly, but I have not had time to write one. Instead, this article shows how to decode the text with two small tools I wrote earlier: codeview, which displays and converts text encodings, and cnbook, my (unfinished) notepad program.
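For readers who prefer a script, the small program mentioned above might look roughly like this in Python. It is a sketch, not the author's tool: decode_entity_mail is a hypothetical name, and it assumes every entity stands for exactly one byte in the range 00-FF. Note that html.unescape maps numeric references in the range 128-159 to Windows-1252 characters, so input containing those would need separate handling.

```python
import html

def decode_entity_mail(garbled, encoding="gbk"):
    # Step 1: map each entity to a character whose code point equals the byte.
    chars = html.unescape(garbled)
    # Latin-1 maps code points 0-255 one-to-one onto byte values 00-FF.
    raw = chars.encode("latin-1")
    # Step 2: interpret the recovered bytes with the chosen encoding.
    return raw.decode(encoding)

mail = ("&Agrave;&acute;&ETH;&Aring;&Ecirc;&Otilde;&micro;"
        "&frac12;&pound;&not;&ETH;&raquo;&ETH;&raquo;")
print(decode_entity_mail(mail))
```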

Suppose we already have the encoding of the garbled email as text, for example "C0 B4 D0 C5 CA D5 B5 BD A3 AC D0 BB D0 BB". In codeview we can convert this encoding back to text, trying common encodings such as GBK and Big5 until the result is meaningful. So how do we get from the garbled email to the encoding text that codeview expects? With cnbook's "Custom replace table" feature.
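The trial-and-error step performed in codeview can be imitated in a few lines of Python. This is an illustrative sketch only; try_encodings and the list of candidate encodings are assumptions, and a human still has to judge which result is meaningful.

```python
def try_encodings(hex_text, encodings=("gbk", "big5", "shift_jis", "utf-8")):
    """Decode space-separated hex bytes with several encodings."""
    raw = bytes.fromhex(hex_text)   # spaces between byte pairs are allowed
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass                    # the bytes are not valid in this encoding
    return results

candidates = try_encodings("C0 B4 D0 C5 CA D5 B5 BD A3 AC D0 BB D0 BB")
for name, text in candidates.items():
    print(name, text)
```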

2.1 Parsing process

cnbook lets you define a replacement table and then apply every replacement pair in it to the selected text or to all text. I have defined and preloaded a replacement table for converting character entities. Decoding then goes as follows:

  1. Copy the garbled text into cnbook.
  2. Right-click and choose "Custom replace -> character entity to encoding" to turn the garbled text into encoding text.
  3. Copy the encoding text into codeview and convert the encoding to text.

Section 3.3 describes the contents of this replacement table.

3. Replacement in cnbook/textpro6

This section briefly introduces the cnbook features used in this article. The description also applies to textpro6.

3.1 Custom replacement

To use cnbook's custom replacement feature, first prepare a replacement table. The replacement table is a text file in which each line has the form:

source string=target string

A comment line starts with "=". A "=" inside a string must be written as "/=", and a "/" inside a string must be written as "//". Once the table is ready, open "Settings > Custom replace table" in cnbook to configure it.
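To make the file format concrete, here is a minimal Python sketch of a parser for one table line. It is not cnbook's implementation: parse_line is a hypothetical name, and the handling of cases the text does not specify (no whitespace trimming, an unescaped "=" in the target kept literally, lines without a separator ignored) is assumed.

```python
def parse_line(line):
    """Return (source, target) for one table line, or None for a comment."""
    if line.startswith("="):
        return None                      # comment line
    parts = [[], []]                     # characters of source and target
    side = 0                             # 0 = source, 1 = target
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "/" and i + 1 < len(line) and line[i + 1] in "/=":
            parts[side].append(line[i + 1])   # "//" -> "/", "/=" -> "="
            i += 2
        elif ch == "=" and side == 0:
            side = 1                     # first unescaped "=" separates the pair
            i += 1
        else:
            parts[side].append(ch)
            i += 1
    if side == 0:
        return None                      # assumed: ignore lines with no separator
    return "".join(parts[0]), "".join(parts[1])

print(parse_line("&Agrave;=C0 "))
print(parse_line("a/=b=c//d"))
```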

cnbook supports up to 30 custom replacement tables. Select the sequence number of the table you want to configure, click "Set" to choose the table's source file, and click "Options" to set the replacement options. The options for the "character entity to encoding" replacement table are:

After configuration, you can run a custom replacement from the Edit menu or the right-click menu. cnbook stores the table's source file and replacement options in the file "tables/N.tab" under the cnbook directory, where "N" is the table's sequence number. A table's author only needs to publish the tab file for other users to share the table. It is good practice to publish the table's source file together with the tab file and to record the replacement options in a comment line at the top of the source file.

3.2 Replacement options

The "Use escape character" option controls whether escape sequences in the strings are interpreted. cnbook currently supports the following escape sequences:

//                     The character "/"
/n                     A carriage return (0D) followed by a line feed (0A)
/t                     Tab (09)
/x + 1-6 hex digits    The character with that hexadecimal Unicode code point.
                       For example, "/xa0" is the Unicode character encoded as 0xA0.

In a replacement table source file, each "/" must itself be written as "//".
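The escape sequences listed above can be interpreted with a single regular-expression pass, as in the following Python sketch. It is an approximation under stated assumptions, not cnbook's code: interpret_escapes is a hypothetical name, the letters are matched case-insensitively, and "/x" greedily consumes up to six hex digits.

```python
import re

# "/" followed by "/", "n", "t", or "x" plus 1-6 hex digits.
_ESCAPE = re.compile(r"/(/|n|t|x[0-9a-f]{1,6})", re.IGNORECASE)

def interpret_escapes(s):
    def expand(match):
        token = match.group(1).lower()
        if token == "/":
            return "/"
        if token == "n":
            return "\r\n"                  # carriage return + line feed
        if token == "t":
            return "\t"
        return chr(int(token[1:], 16))     # "/xa0" -> U+00A0
    return _ESCAPE.sub(expand, s)

print(repr(interpret_escapes("a/tb/n")))
print(repr(interpret_escapes("/xa0")))
print(repr(interpret_escapes("//n")))
```

Scanning left to right means "//n" yields a literal "/" followed by "n" rather than a line break.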

Another option that needs explanation is "Quick replace". It controls how the replacement is carried out:

  1. Without "Quick replace", the replacement pairs in the table are executed one after another. Each pair scans the entire text, and each later pair operates on the output of the earlier ones, so the order of the pairs can affect the result.
  2. With "Quick replace", the text is scanned only once. The program first computes the maximum length of the source strings in the table. At each scan position it then looks for a matching source string in the table, starting from the maximum length, so "Quick replace" always prefers the longest match at each position. "Quick replace" does not support regular expressions, because the length of the text matched by a regular expression is not fixed.
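The difference between the two modes can be seen in a short Python sketch. Both functions are simplified models of the behavior described above, with hypothetical names, and are not taken from cnbook.

```python
def sequential_replace(text, pairs):
    """Apply each pair in order; later pairs see the output of earlier ones."""
    for src, dst in pairs:
        text = text.replace(src, dst)
    return text

def quick_replace(text, pairs):
    """Scan once, preferring the longest matching source string at each position."""
    table = {src: dst for src, dst in pairs if src}
    max_len = max((len(src) for src in table), default=0)
    out = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in table:
                out.append(table[piece])
                i += n
                break
        else:
            out.append(text[i])          # no source string matches here
            i += 1
    return "".join(out)

pairs = [("&", "26 "), ("&Agrave;", "C0 ")]
print(sequential_replace("&Agrave;", pairs))   # the "&" pair runs first and breaks the entity
print(quick_replace("&Agrave;", pairs))        # the longer match wins
```

With this pair order, the sequential mode turns "&Agrave;" into "26 Agrave;" while the single-pass mode yields "C0 ", which is why a table that contains both entities and plain characters depends on longest-match behavior or on careful ordering.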

3.3 What does the "character entity to encoding" table replace?

In the "replace table" dialog box, click the Browse button to view the contents of the table. You can also read the source files in the src directory. This replacement table contains three groups of replacements:

  1. Replace the named entities for code points up to 255 (inclusive, likewise below) with the corresponding encoding text.
  2. Replace the 128 numeric entities for code points 128-255 with the corresponding encoding text.
  3. Replace 194 characters with code points up to 255 with the corresponding encoding text. These 194 characters comprise all 191 printable characters in that range plus 3 common control characters: 09, 0D, and 0A.

As the table's contents show, "character entity to encoding" replaces all source text with code points up to 255 with the corresponding encoding text.
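A table with these three groups could be generated along the following lines. This Python sketch is a reconstruction, not the author's table: build_table is a hypothetical name, the target format (two uppercase hex digits plus a space) is assumed from the encoding text shown in section 2, and the named entities are taken from Python's HTML 4 list in html.entities.

```python
from html.entities import name2codepoint

def build_table():
    pairs = []
    # Group 1: named entities whose code point is at most 255.
    for name, cp in sorted(name2codepoint.items()):
        if cp <= 255:
            pairs.append(("&%s;" % name, "%02X " % cp))
    # Group 2: the 128 numeric entities for code points 128-255.
    for cp in range(128, 256):
        pairs.append(("&#%d;" % cp, "%02X " % cp))
    # Group 3: 3 control characters plus the 191 printable characters
    # (20-7E and A0-FF), 194 characters in total.
    printable = list(range(0x20, 0x7F)) + list(range(0xA0, 0x100))
    for cp in [0x09, 0x0D, 0x0A] + printable:
        pairs.append((chr(cp), "%02X " % cp))
    return pairs

pairs = build_table()
print(len(pairs), pairs[:3])
```

Since group 3 includes the plain "&" character, the entities in groups 1 and 2 only survive intact if the longest source string is matched first at each position.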

Conclusion

This article described how to use cnbook's custom replacement. cnbook is designed to be a fast, compact, and flexible text processing tool. Unfortunately, I have not had time to finish it. I think at least "column mode", "list lines containing a string", and "input method awareness" need to be added before I could use it as my default text editor. Still, cnbook handles four-byte Chinese characters well, and it is somewhat more capable than Notepad.
