Analysis of common e-book formats and approaches to decompiling them

Source: Internet
Author: User
Disclaimer: This article contains a fair number of technical terms. If that bothers you, please do not read on.
This article only discusses ideas and does not provide source code; at most it links to sites that host source code. If that is not enough for you, please do not read on.
This article discusses e-book decompiling purely from a technical point of view. Do not use it to infringe copyright, for other illegal purposes, or to harm other people's interests. If that disappoints you, please do not read on.
The copyright of this article belongs to the author. Please obtain the author's written consent before reprinting.
1. Foreword
2. Common e-book formats and approaches to decompiling them
2.1 PDF format
2.2 E-books based on the IE kernel
2.2.1 CHM format
2.2.2 EXE format
2.2.2.1 Web Compiler 1.67
2.2.2.2 caislabs EBook Pack Express 1.6
2.2.2.3 General decompiling approach
2.3 HLP format
2.4 Novel web/Novel World (Ebx/xreader)
3. Conclusion
Appendix: How e-books based on the IE kernel are implemented

1. Foreword
The e-books discussed in this article are original, editable files (HTML, TXT, RTF, images and so on) that have been packaged into a stand-alone EXE, or into some other file that only a dedicated viewer can read. Packaged files usually cannot be edited or full-text searched with general-purpose tools.

Decompiling an e-book, as the term is used here, means extracting the content from the e-book and restoring or converting it to standard, editable HTML, TXT, RTF and image files.

Like everything else in the world, the appearance of e-book compilers and decompilers is no accident; it was inevitable.

On the compiler side, people have probably been thinking about packaging electronic documents since the day electronic documents appeared. I think the main motivations are the following:

Easy reading and management. Reading text files under DOS, Chinese files in particular, was troublesome, so DOS e-books appeared that carried their own Chinese fonts and basic browsing functions (paging, scrolling). The need for the same reading experience on different OS platforms produced the cross-platform PDF e-book. With the growth of the Internet, a great deal of material came in HTML format, but faced with a pile of HTML files not everyone knows to double-click index.htm or default.htm, and large numbers of files are hard to manage, so the CHM format and the various IE-kernel-based EXE e-books appeared.
Protecting intellectual property and trade secrets. The importance of this is easy to see today. Never mind core business secrets: even for a mere novel, some despicable people will take the original HTML or TXT files, add a logo, package them, claim the result as their own "painstakingly scanned and proofread work", and then openly charge a so-called "VIP fee". For this reason document security has always been one of PDF's selling points, and the various domestic e-book formats likewise treat preventing decompilation and preventing content copying as a primary goal.
Those who oppose packaging common formats into proprietary ones have their own reasons:

Full-text search. As mentioned earlier, e-books generally cannot be searched with ordinary search tools, which is an obstacle to using the material effectively. In my view, when a collection holds dozens or a few hundred books, maintaining summaries and an index by hand may be acceptable; beyond that, all I want is a fast full-text search tool, much as one relies on Google on the Internet.
Easy modification. As the saying goes, "no gold is pure and no one is perfect". E-books are made by people too, so mistakes are sometimes inevitable, and as information changes the original content may need correcting or extending. How would you feel if all you had at that point was a non-editable EXE?
Saving time and patience. To display a file list, Windows has to read file information, and for EXE files it also reads the icon. If anti-virus software is installed, it will usually scan the EXE files in a folder automatically whenever the folder is opened, and e-books are typically megabytes in size. So opening a folder that contains EXE-format e-books feels slow and annoying.
Saving space. The standard architecture of an EXE-format e-book is: executable body + content + TOC. The executable body is the code part of the e-book, including program code, plug-in code, interface resources and so on. The content is the text and images the e-book contains, usually processed with some compression or encryption algorithm. The TOC (table of contents) is the equivalent of a catalog index and speeds up access to the content. Compared with compressing the original content directly with WinZip or WinRAR, every EXE-format e-book therefore wastes some disk space on the executable body. The flashier the interface, the more it wastes; the most extreme e-book I have seen was more than 1 MB larger than its original content.
Avoiding garbage. Because of limitations in how they are implemented, some IE-kernel-based e-books may leave garbage behind in the registry and the system directory.
Safety. It may be an exaggeration to call today's online world a malicious environment without integrity, but it is true that some people have no idea what honesty means. Frankly, every time I get an EXE-format e-book of unknown origin, I wonder whether it contains a trojan or a virus.
Platform conversion, including moving to handheld devices. EXE-format e-books may look cool, but they run only under Windows. If you want to read them on another system, especially a handheld device, the only way is to decompile them.
Of course, after decompiling you still need suitable alternatives that meet the original needs:

Packaging tools. WinZip or WinRAR is recommended: they are easy to use, the packaged files are small, and directory access is fast.
Reading tools. Software that can read the contents of a ZIP/RAR file directly, without unpacking, is easy to find with a search. I wrote one myself, myreader, which not only reads directly from ZIP/RAR but also locates index.htm automatically and offers bookmarks, restoring the previous reading state, Windows Explorer right-click menu integration, automatic remembering of ZIP/RAR passwords and other functions.
Full-text search tools. There is also plenty of software that can search the full text inside ZIP/RAR files directly. I wrote one as well, findstr, which supports encrypted ZIP/RAR files. It integrates with myreader, so search results can be opened directly in myreader without unpacking. It also supports batch text replacement, so I often use it to tidy up downloaded or decompiled novels, for example removing advertising links and changing absolute URLs to relative ones.
Protecting the fruits of one's labor. ZIP/RAR password protection handles this directly.
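The reading and search ideas above can be sketched in a few lines of Python. This is only an illustration, not the code of myreader or findstr: the function names, the list of start-page names and the UTF-8 default are assumptions, and the standard library handles ZIP only, not RAR.

```python
import zipfile

# Assumed list of common start-page names; extend as needed.
INDEX_NAMES = ("index.htm", "index.html", "default.htm", "default.html")

def find_index_page(zf):
    """Return the archive member that looks like the start page, or None."""
    candidates = [n for n in zf.namelist()
                  if n.rsplit("/", 1)[-1].lower() in INDEX_NAMES]
    # Prefer the shallowest path, so a top-level index.htm wins.
    return min(candidates, key=lambda n: (n.count("/"), n), default=None)

def search_zip(zf, needle, encoding="utf-8"):
    """Yield (member, line_number, line) for each text line containing needle."""
    for name in zf.namelist():
        if not name.lower().endswith((".htm", ".html", ".txt")):
            continue
        # Read the member straight from the archive; nothing is unpacked to disk.
        text = zf.read(name).decode(encoding, errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            if needle in line:
                yield name, lineno, line.strip()
```

A caller would open the archive with `zipfile.ZipFile(path)` and, for a password-protected ZIP, call `setpassword()` with the password as bytes before reading.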
2. Common e-book formats and approaches to decompiling them
2.1 PDF format
PDF is a cross-platform electronic document format from Adobe. Adobe provides a dedicated document viewer so that users get the same reading experience on different platforms.

In fact Adobe's own PDF editing tool, Adobe Acrobat, already supports saving PDF files in Rich Text Format, so I have not studied PDF decompiling much. This feature does appear to be subject to the "document security" restrictions, but a Google search suggests there is plenty of software for removing PDF security. If you are really interested in batch conversion, there is also an article on CodeProject that provides source code for converting PDF to plain text.

From what I have seen, Adobe Acrobat's Rich Text output is mostly fine for English documents; at worst the formatting changes a little. With Chinese documents, however, the character set code is occasionally wrong, so the output file shows only garbled text when opened in Word or WordPad. In that case, manually replacing the character set code solves the problem.
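The article does not say exactly how the character set code is replaced by hand. As a rough illustration of the general idea only, the sketch below repairs plain text that was decoded with the wrong code page. It assumes the bytes were really GBK but were read as Latin-1; both encodings and the function name are assumptions, not details from the article.

```python
def repair_mojibake(garbled, wrong="latin-1", right="gbk"):
    """Undo a wrong decode: turn the text back into bytes, then decode properly."""
    return garbled.encode(wrong).decode(right)

original = "中文"
# Simulate the bad conversion: GBK bytes interpreted as Latin-1.
garbled = original.encode("gbk").decode("latin-1")
repaired = repair_mojibake(garbled)
```

This round trip only works when the wrong decoding was lossless, which is the case for Latin-1 because it maps every byte value to a character.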

Garbled output can also occur when the PDF uses a custom font, so that the converted file cannot display properly. This is more troublesome. A PDF can carry a font in two ways: including the complete font, called font embedding, or including only the characters actually used, called font subsetting. This has been discussed in the "Book production, reading Tool area" of the E-Class publication Forum; look there if you need the details.

I once tried a program called pdf2html. Its approach is to convert each page of the PDF into a JPG file, wrap the JPG files in HTML files, and add a table of contents, paging buttons and so on, so that the result can be browsed on the web without Acrobat Reader or font support on the client. Leaving aside the quality of the program's HTML templates, what puzzled me most was that the converted images could only be JPG, not PNG. For a page with a large white background, a PNG file is not only smaller than a JPG, it also avoids the many small fragments (high-frequency noise) that JPG produces around the edges of text and images.

2.2 E-books based on the IE kernel
With the growth of the Internet, more and more document content is provided in HTML format. Microsoft itself provides the core of Internet Explorer as a control that almost any Windows programming tool can call easily, so e-books based on the IE kernel currently seem to hold the mainstream position.

2.2.1 CHM format
CHM (pronounced "chum") stands for Compiled HTML Help. Microsoft introduced it as the replacement for the HLP format (the standard help file format under 16-bit Windows), so Microsoft not only ships a free viewer with IE 4.01 and later, but also provides a free authoring tool, Microsoft HTML Help Workshop.

Internally, CHM files use the ITS format, a very good compression format; its compression ratio feels higher than that of ZIP or RAR.
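When decompiling in batches it helps to confirm that a file really is a CHM before handing it to a decompiler. The sketch below checks for the "ITSF" signature at the start of the file. The layout assumed here (a 4-byte signature followed by two little-endian 32-bit fields for version and header length) comes from unofficial community descriptions of the format, so treat it as an assumption; the function name is made up for illustration.

```python
import struct

def read_itsf_header(path):
    """Return (version, header_length) if the file starts with an ITSF header, else None."""
    with open(path, "rb") as f:
        head = f.read(12)
    if len(head) < 12 or head[:4] != b"ITSF":
        return None
    # Assumed layout: signature, then version and header length as little-endian DWORDs.
    version, header_len = struct.unpack("<II", head[4:12])
    return version, header_len
```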

Because the format is open, people have written independent CHM compile and decompile tools and published all the source code, which can be found here:

http://bonedaddy.net/pabs3/hhm/

Besides the CHM compile and decompile tools and their source code, this site also provides a detailed description of the CHM format, in English of course. When I first wrote Unebook, it used the chmdeco source code to implement batch decompiling of CHM files. If the site is unreachable, search Google for chmdeco; there are many mirror sites. Internally, chmdeco uses the chmlib source code, which is well known; chmtools uses it as well.

After using it for a while, however, I found that the code hit an array error when decompiling certain CHM files. The error was rare, but annoying when it did occur, so in the end I gave up on this code.

The CHM decompile code that Unebook uses now is adapted from here:

http://www.codeproject.com/winhelp/htmlhelp.asp

This code uses Microsoft's undocumented ITS file access interface to manipulate the files directly. Because it relies on Microsoft's own components, the resulting binary is small and compatibility is much better: I have not found a CHM file that it cannot decompile (with one exception, a CHM file that would not open at all), nor any memory leaks or other problems. It seems Microsoft's things are best handled with Microsoft's own tools.

Some people, for convenience, make CHM e-books without an index.htm and rely solely on the contents tree on the left for navigation. After decompiling such an e-book, you generally need to generate an index page automatically from the extracted HHC file, so that the result is not awkward to read. The HHC file is structured as follows:

Multi-level directories are expressed with <UL>: a <UL> descends one level, and a </UL> returns to the parent level.
Each directory entry begins with <OBJECT type="text/sitemap"> and ends with </OBJECT>. The entry name is stored in <param name="Name" value="XXX"> and the link in <param name="Local" value="xxx.html">.
Some directory entries have only a name and no link.
Unebook can not only generate the index page automatically from the HHC file, but also generate a frameset page that embeds the index page and the display page in frames, to imitate the CHM contents pane as closely as possible. Fully imitating the dynamically expanding tree would require adding images, JS, CSS and other files, which is not worth the effort.
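The HHC structure described above is simple enough to convert with a small parser. The following is a minimal sketch, not Unebook's actual code: it uses Python's standard html.parser, and the class and function names are made up for illustration. It emits a nested HTML list, with a link for entries that have one and plain text for entries that have only a name.

```python
from html import escape
from html.parser import HTMLParser

class HhcToIndex(HTMLParser):
    """Turn an HHC table of contents into a nested HTML list."""

    def __init__(self):
        super().__init__()
        self.out = []      # generated HTML fragments
        self.entry = None  # params of the sitemap <object> being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":                       # one level down
            self.out.append("<ul>")
        elif tag == "object" and (attrs.get("type") or "").lower() == "text/sitemap":
            self.entry = {}
        elif tag == "param" and self.entry is not None:
            self.entry[(attrs.get("name") or "").lower()] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "ul":                       # back to the parent level
            self.out.append("</ul>")
        elif tag == "object" and self.entry is not None:
            name = escape(self.entry.get("name", ""))
            link = self.entry.get("local")
            if link:
                self.out.append(f'<li><a href="{escape(link)}">{name}</a></li>')
            else:                             # entry with a name but no link
                self.out.append(f"<li>{name}</li>")
            self.entry = None

def hhc_to_index(hhc_text):
    parser = HhcToIndex()
    parser.feed(hhc_text)
    parser.close()
    return "\n".join(parser.out)
```

The output keeps the HHC layout, in which a nested list follows its parent item rather than sitting inside it. That is not strictly valid HTML, but browsers display it as an indented tree, which is enough for a static index page.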

2.2.2 EXE format

Besides CHM, a large number of IE-kernel-based e-books come in EXE format. Tools for making EXE e-books now seem to be an industry that feeds a good many programmers. Many people think this format is cool: a single file that can be executed, a good-looking interface, and password protection as well. Personally, though, this is the e-book format I dislike most. Apart from the security, speed, space and search issues mentioned above, what annoys me most is that current EXE e-books lack a usable bookmark function, in particular one that can mark an arbitrary position within a page, which makes it painful to resume a long document interrupted halfway. So once myreader had a bookmark function, I made up my mind to solve the decompiling problem.