Exploring e-book formats

Source: Internet
Author: User

1. EPUB is the format most of us know best. An EPUB file is a ZIP archive (the "PK" format introduced by PKWARE), so it can be opened with any archive tool: right-click, choose "Open with", and pick your compression software (most tools behave alike). The compression used inside the archive is not analyzed here.
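The container claim is easy to check, because a ZIP archive begins with the two bytes "PK" and an EPUB archive carries an entry named mimetype. The following is a minimal Python sketch under those assumptions; the helper name looks_like_epub is hypothetical, and the in-memory archive only stands in for a real .epub file.

```python
import io
import zipfile


def looks_like_epub(data: bytes) -> bool:
    """Return True if data is a ZIP archive that contains a 'mimetype' entry.

    Hypothetical helper for illustration: it inspects the container only
    and says nothing about whether the book inside is valid.
    """
    if not data.startswith(b"PK"):  # ZIP signatures start with the bytes "PK"
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "mimetype" in archive.namelist()
    except zipfile.BadZipFile:      # starts with "PK" but is not a usable archive
        return False


# Build a tiny stand-in archive in memory instead of reading a real book.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("content.txt", "placeholder")
sample = buffer.getvalue()

print(looks_like_epub(sample))
print(looks_like_epub(b"not a zip"))
```

To try it on a real book, read the file in binary mode and pass the bytes to the helper.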
2. The MOBI format (considering only English, unencrypted documents) uses a comparatively simple text compression, and sometimes the original text can be guessed at a glance. The scheme works roughly as follows:
A. English text: The beginning of the text must be stored in full, because there is nothing earlier to refer back to. After that, when a run of 3 to 10 characters repeats earlier content, it is replaced by two bytes. The first byte lies in the range 80 to BF, that is, its top two bits are 10. The remaining 14 bits hold the backward distance and the length: the last 3 bits store the length minus 3, and the 11 bits before them (the value shifted right by 3) store how far back to go. For example, "This is a test." can be compressed to "This " + 80 18 + "a test.". In binary, 8018 is 1000 0000 0001 1000. The leading 10 marks a back-reference. The last three bits are 000, so the length is 0 + 3 = 3 (the maximum is 7 + 3 = 10). Shifting the remaining bits right by 3 gives 3, so the decoder goes back 3 characters from the position of 8018 (a space counts as a character) and copies 3 characters, which yields "is" plus a space.
B. Spaces: The space is very frequent in English. When there is no repeated run of 3 to 10 characters, a space is merged with the character that follows it by adding 80 (hexadecimal) to that character. For example, space + "He" is encoded as C8 65 (48 + 80 = C8), and space + "he" as E8 65. Conversely, F7 65 decompresses to " we".
C. Chinese characters and some special symbols: MOBI compression does not seem well suited to Chinese text. Possibly it builds an index dynamically, which makes analysis difficult. In general, Chinese characters and special symbols are preceded by one length byte whose high four bits are 0; its low bits give the number of following bytes that are stored uncompressed. For example, 06 means that the next six bytes are copied verbatim. In the commonly documented form of this scheme the length byte ranges from 01 to 08, while 00 and 09 to 7F stand for themselves.
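The bit arithmetic in cases A and B can be checked with a few lines of Python. This is only a sketch of the layout described above; the function names split_back_reference and split_space_byte are hypothetical and are not taken from any MOBI library.

```python
from typing import Tuple


def split_back_reference(first: int, second: int) -> Tuple[int, int]:
    """Split a two-byte pair (first byte in 0x80..0xBF) into (distance, length)."""
    if not 0x80 <= first <= 0xBF:
        raise ValueError("first byte must be in 0x80..0xBF")
    value = ((first << 8) | second) & 0x3FFF  # drop the two flag bits
    distance = value >> 3                     # upper 11 bits: how far back to go
    length = (value & 0x07) + 3               # lower 3 bits store length minus 3
    return distance, length


def split_space_byte(byte: int) -> str:
    """Expand one byte in 0xC0..0xFF into a space plus the merged character."""
    if not 0xC0 <= byte <= 0xFF:
        raise ValueError("byte must be in 0xC0..0xFF")
    return " " + chr(byte ^ 0x80)             # clearing the top bit undoes the +0x80


print(split_back_reference(0x80, 0x18))  # (3, 3)
print(repr(split_space_byte(0xC8)))      # ' H'
```

The pair 80 18 from the example therefore means "go back 3, copy 3", and C8 expands to a space followed by "H".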

The above covers compression and decompression for cases A, B and C of the MOBI format. The compression of images and of the Chinese index in real files could not be analyzed here.
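Putting the three cases together, a decoder for one compressed text record can be sketched in Python as follows. It assumes the input is an already extracted text record (a real .mobi file also has headers and a record index that are not handled here), that length bytes range from 01 to 08, and that all other bytes below 80 are literals. The name mobi_text_decompress is hypothetical.

```python
def mobi_text_decompress(data: bytes) -> bytes:
    """Decode one compressed text record using the three rules described above."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        if 0x01 <= byte <= 0x08:
            # Case C: length byte, the next `byte` bytes are copied verbatim.
            out += data[i:i + byte]
            i += byte
        elif byte <= 0x7F:
            # Plain literal (0x00 and 0x09..0x7F).
            out.append(byte)
        elif byte <= 0xBF:
            # Case A: two-byte back-reference.
            if i >= len(data):
                raise ValueError("truncated back-reference")
            value = ((byte << 8) | data[i]) & 0x3FFF
            i += 1
            distance = value >> 3
            length = (value & 0x07) + 3
            if distance == 0 or distance > len(out):
                raise ValueError("back-reference points outside the output")
            for _ in range(length):
                # Byte by byte, because source and target may overlap.
                out.append(out[-distance])
        else:
            # Case B: space plus the character with its top bit cleared.
            out.append(0x20)
            out.append(byte ^ 0x80)
    return bytes(out)


print(mobi_text_decompress(b"This \x80\x18a test."))  # b'This is a test.'
print(mobi_text_decompress(b"\xf7e"))                 # b' we'
```

The first call reproduces the "This is a test." example from case A, and the second the F7 65 example from case B.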

The decoder below is written in Delphi:

begin // Read the compressed MOBI text byte by byte. All variables are assumed to be declared already.
  counter := 0;      // number of bytes written to the output so far
  j := 0;            // number of input bytes consumed
  tm.Position := 0;  // tm: TMemoryStream that already holds the MOBI data read from the file
  while j < size - 2 do // size is the input length; adjust the start position according to the file index
  begin
    tm.Read(rd1, 1); // read one byte
    Inc(j);
    fm.Position := counter; // back-references move the output pointer, so reposition before every write
    case rd1 of // one branch per kind of byte; $ marks a hexadecimal number in Delphi

      $80..$BF: // back-reference: two bytes hold the distance and the length
        begin
          tm.Position := tm.Position - 1;
          tm.Read(bw, 2);            // bw: Word
          Inc(j);
          bw := Swap(bw) and $3FFF;  // big-endian to native order, then drop the two flag bits
          dist := bw shr 3;          // dist: Integer; upper 11 bits, how far back to go
          len := (bw and $07) + 3;   // len: Integer; lower 3 bits store the length minus 3
          for i := 1 to len do       // copy byte by byte because source and target may overlap
          begin
            fm.Position := counter - dist;
            fm.Read(rd2, 1);
            fm.Position := counter;
            fm.Write(rd2, 1);
            Inc(counter);
          end;
        end;

      $01..$08: // length byte: the next rd1 bytes are stored uncompressed
        for i := 1 to rd1 do
        begin
          tm.Read(rd2, 1);
          Inc(j);
          fm.Write(rd2, 1);
          Inc(counter);
        end;

      $C0..$FF: // a space merged with the following character
        begin
          c := $20;
          fm.Write(c, 1);
          Inc(counter);
          c := rd1 and $7F; // clear the top bit to recover the character
          fm.Write(c, 1);
          Inc(counter);
        end;

    else // $00 and $09..$7F are literal bytes
      fm.Write(rd1, 1);
      Inc(counter);
    end;
  end;
  fm.Position := 0;
  outstream.CopyFrom(fm, counter - 1); // copies counter - 1 bytes, that is, all but the last decoded byte
  fm.Free;
  tm.Free;
end;

The program above was compiled with Delphi 7.0.



