Disclaimer: This article may contain a lot of technical terms. If you are not comfortable with this, please do not continue reading.
This article only discusses ideas and does not provide source code. At most, it gives links to websites where source code can be found. If you are not satisfied with this, please do not continue reading.
This article only discusses the decompilation of e-books from a technical perspective. Do not use it for illegal purposes such as copyright infringement or harming the interests of others. If you are disappointed by this, please do not continue reading.
This article is copyrighted by the author. Please obtain the written consent of the author before reprinting.
1. Preface
2. Common e-book formats and decompilation ideas
2.1 PDF Format
2.2 E-books based on the IE kernel
2.2.1 CHM format
2.2.2 EXE format
2.2.2.1 web compiler 1.67
2.2.2.2 caislabs ebook pack express 1.6
2.2.2.3 General decompilation ideas
2.3 HLP format
2.4 novel/novel world (EBX/xreader)
3. Conclusion
Appendix: How e-books based on the IE kernel are implemented
1. Preface
In this article, an e-book means original, editable HTML, TXT, RTF, and image files that have been packaged into a standalone EXE, or into some other file that can only be read with a dedicated browser. The packaged files cannot be edited or searched in full text with conventional tools.
Decompiling an e-book, as used here, means extracting, restoring, or converting the content of the e-book back into standard, editable HTML, TXT, RTF, and image files.
Like most things in the world, e-book compilers and decompilers did not appear by accident; there were reasons that made them inevitable.
Take the e-book compiler first. People have been thinking about packaging electronic documents since the day electronic documents appeared. I personally think the main motivations are the following:
Easy to read and manage. Too many loose files are hard to manage, which led to the CHM format and the various e-books based on the IE kernel.
Protection of intellectual property and trade secrets. I believe everyone understands why this matters. Leaving aside material that contains core trade secrets, even for a mere novel, some unscrupulous people take the original HTML and TXT files, stamp their logo on them, package them, claim the result is "the product of painstaking scanning and proofreading", and then openly charge a so-called "VIP fee". This is why PDF has always treated document security as one of its selling points, and why the various proprietary e-book formats in China are also designed to prevent decompilation and copying of the content.
Those who oppose packaging general formats into a dedicated format also have their reasons:
Easy full-text retrieval. As mentioned above, e-books generally cannot be searched in full text with common retrieval tools, which is an obstacle to using the material effectively. With dozens or a few hundred files, manual summarizing and indexing may still be acceptable. Beyond that, what I want is a fast full-text retrieval tool, much as people depend on Google on the Internet.
Easy to modify. As the saying goes, "No gold is pure, and no one is perfect." E-books are made by people too. Errors are sometimes inevitable, and as information develops the original content may need to be corrected or supplemented. What can you do when you are facing an uneditable EXE?
Save time and patience. When Windows displays a file list it has to read file information, and for EXE files it also has to read the icons. If anti-virus software is installed, it usually checks the EXE files in a folder automatically when you enter that folder, and e-books are generally several MB each. As a result, opening a folder full of EXE e-books feels very slow, which is quite annoying.
Save space. The standard layout of an e-book in EXE format is: executable body + content + TOC. The executable body is the e-book's own code, including program code, plug-in code, interface resources, and so on. The content is the text and images the e-book actually contains, usually compressed and encrypted. The TOC (table of contents) is a directory index that speeds up access to the content. Compared with compressing the original content with WinZip or WinRAR, every EXE e-book therefore wastes some disk space on the executable part, and the fancier the interface, the greater the waste. The most extreme e-book I have seen was more than 1 MB larger than its original content.
Avoid junk. Some e-books based on the IE kernel may leave junk in the registry and the system directory because of technical limitations.
Security. It may be an exaggeration to call today's Internet a malicious and dishonest environment, but some people simply do not know what honesty means. Frankly, every time I get an EXE e-book of unknown origin, I wonder whether it contains a Trojan or a virus.
Easy platform conversion, including conversion to handheld devices. E-books in EXE format may look great, but they can only be viewed in Windows. If you want to read them on other systems, especially on handheld devices, the only way out is to decompile them.
Of course, after decompilation, you must also find a suitable alternative to continue to meet the original needs:
Packaging tool. I recommend WinZip or WinRAR. They are easy to use, and the packaged files are small, so entering the folder is faster.
Reading tool. There is now plenty of software that can read content directly from ZIP/RAR files, so this is not a concern. I have written one myself, myreader. Besides reading content directly from ZIP/RAR files, it can locate index.htm automatically, keep bookmarks, restore the previous reading state, add an entry to the Explorer context menu, and remember ZIP/RAR passwords.
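Reading a page straight out of an archive, with automatic location of index.htm, can be sketched roughly as follows. This is only an illustration using Python's standard zipfile module, not myreader's actual code; the function names and the rule "pick the shallowest index page" are assumptions made for the example, and RAR is left out because the standard library does not read it.

```python
import zipfile


def find_index_page(names):
    """Return the shallowest entry named index.htm or index.html, or None.

    Assumption for this sketch: the index page closest to the archive
    root is the one the reader should open first.
    """
    candidates = [
        n for n in names
        if n.rsplit("/", 1)[-1].lower() in ("index.htm", "index.html")
    ]
    return min(candidates, key=lambda n: (n.count("/"), n), default=None)


def read_index_page(archive):
    """Open a ZIP (path or file object) and return (name, bytes) of its index page."""
    with zipfile.ZipFile(archive) as zf:
        name = find_index_page(zf.namelist())
        if name is None:
            return None
        # The page is read from the archive without unpacking it to disk.
        return name, zf.read(name)
```

A reader built this way would then resolve the relative links on the page against the entry's directory inside the archive.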
Full-text retrieval tool. There is also plenty of software that can search the full text inside ZIP/RAR files directly. I have written one of these as well, findstr, which supports encrypted ZIP/RAR files. It can be integrated with myreader, so search results can be opened directly in myreader without unpacking. It also supports batch text replacement, so I often use it to tidy up downloaded or decompiled novels, for example removing ad links and changing absolute URLs to relative URLs.
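The kind of batch clean-up mentioned here (removing ad links and turning absolute URLs into relative ones) can be sketched with two regular expressions. This is a rough illustration, not findstr's implementation: the function names are hypothetical, and the URL rewrite is only correct for pages that sit directly in the base directory.

```python
import re


def make_links_relative(html, base_url):
    """Strip base_url from href/src values that start with it."""
    base = base_url.rstrip("/") + "/"
    pattern = re.compile(
        r'((?:href|src)\s*=\s*["\'])' + re.escape(base), re.IGNORECASE
    )
    return pattern.sub(r"\1", html)


def strip_links_to(html, ad_host):
    """Remove whole <a>...</a> elements whose href mentions ad_host."""
    pattern = re.compile(
        r'<a\b[^>]*href\s*=\s*["\'][^"\']*' + re.escape(ad_host)
        + r'[^"\']*["\'][^>]*>.*?</a>',
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.sub("", html)
```

Regular expressions are enough for uniform, machine-generated pages like these; irregular HTML would call for a real parser.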
Protecting the fruits of one's labor. A ZIP/RAR password is enough for this.
2. Common e-book formats and decompilation ideas
2.1 PDF Format
PDF is a cross-platform electronic document format released by Adobe. Adobe provides a dedicated document browser, allowing users to get the same reading effect on different platforms.
In fact, Adobe Acrobat, the PDF editing tool provided by Adobe, already supports saving PDF files in RTF format, so I have not studied PDF decompilation much. This feature does seem to be restricted by "document security", but a quick Google search suggests there is plenty of software for removing PDF security protection. If you are really interested in batch conversion, there is also an article on CodeProject that provides source code for converting PDF files into plain text.
In my experience, the RTF that Adobe Acrobat itself outputs is not much of a problem for English documents; at most the formatting changes somewhat. With Chinese documents, however, the character set code is occasionally wrong, and the output file shows only garbled characters when opened in Word or WordPad. In that case, correct the character set code by hand.
Another possible cause of garbled characters is a custom font used in the PDF file, which makes it troublesome to display the converted file properly. A PDF file can carry its own fonts in two ways: including the complete font, called font embedding, or including only the characters actually used, called font subsetting. I have discussed this in the "book production and reading Tools" area of the e-category publications Forum; if you need it, you can look it up there.
I did once try a program named pdf2html. Its idea is to convert every page of the PDF file into a JPG image, wrap the images in HTML files, and add a table of contents and page-flip buttons, so that the client needs neither Acrobat Reader nor the fonts when browsing over the network. Leaving the quality of its HTML templates aside, the strangest thing is that the converted images can only be JPG, not PNG. For a page with a large white background, a PNG is not only smaller than a JPG, but also free of the many small artifacts (high-frequency noise) that JPG produces at the edges of text and images.
2.2 E-books based on the IE kernel
With the development of the Internet, more and more documents are provided in HTML format, and Microsoft provides the kernel of the IE browser as a control that almost any programming tool on Windows can call easily. As a result, e-books based on the IE kernel seem to have become the mainstream.
2.2.1 CHM format
CHM (pronounced "chum") stands for Compiled HTML Help file. Microsoft proposed it as a replacement for the HLP format (the standard help file format in 16-bit Windows). Microsoft therefore not only ships a free browser with IE 4.01 and later, but also provides a free authoring tool, Microsoft HTML Help Workshop.
A CHM file uses the ITS format internally. This is a very good compression format, with a higher compression ratio than ZIP and RAR.
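As a small illustration of the container, a batch tool can recognize a CHM file by its leading signature before attempting to decompile it. The sketch below assumes the layout given in publicly circulated, unofficial descriptions of the format (the ASCII signature "ITSF", followed by a little-endian version number and header length); the function name is hypothetical, and the field offsets should be treated as assumptions rather than a specification.

```python
import struct


def read_itsf_header(data):
    """Parse the first 12 bytes of a CHM file.

    Returns a dict with the version and header length, or None if the
    data does not start with the "ITSF" signature.
    """
    if len(data) < 12 or data[:4] != b"ITSF":
        return None
    # Assumed layout: two little-endian 32-bit fields after the signature.
    version, header_length = struct.unpack_from("<II", data, 4)
    return {"version": version, "header_length": header_length}
```

In a batch job, files for which this returns None can be skipped or reported instead of being passed to the decompiler.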
Because the format is open, people abroad have already written independent CHM compilation and decompilation tools and published all the source code. You can refer to the following:
http://bonedaddy.net/pabs3/hhm/
Besides CHM compilation and decompilation tools and their source code, this website also provides detailed descriptions of the CHM format, in English of course. At first I used the chmdeco source code to implement batch decompilation of CHM files. If the website happens to be unreachable, search Google for chmdeco; there are many mirror sites. Internally, chmdeco uses the chmlib source code, which is very well known. Besides chmdeco, chmtools also uses it.
However, after using this code for a while, I found that it can hit an array out-of-bounds error when decompiling certain CHM files. The error is rare, but it is quite upsetting when it does occur, so in the end I gave up on this code.
The CHM decompilation code that unebook uses now is adapted from here:
http://www.codeproject.com/winhelp/htmlhelp.asp
This code operates on the file directly through the ITS file access interface, which Microsoft has not documented publicly. Because it relies on Microsoft's own components, the compiled code is relatively small and compatibility is much better. So far I have not met a CHM file it cannot decompile (the only exception being CHM files that cannot be opened at all), and I have not found any memory errors. It seems that Microsoft's things are still best handled with Microsoft's own tools.
Some people who make CHM books do not produce an index.htm and rely solely on the directory tree in the left pane for navigation. To avoid inconvenience after decompilation, an index page is generated automatically from the extracted HHC file. The structure of an HHC file is roughly as follows:
Directory levels are controlled by <ul>. A <ul> goes down one level, and a </ul> goes back up one level.
A directory item starts with <object type="text/sitemap"> and ends with </object>. Its name and link are stored in <param name="Name" value="XXX"> and <param name="Local" value="xxx.html">.
Some directory items may only have names and no links.
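The three rules above are enough to turn an HHC file into a flat index page. The sketch below is a minimal illustration using Python's standard html.parser, not unebook's code; the class and function names are hypothetical, and the output format (indentation with non-breaking spaces) is just one simple choice.

```python
from html import escape
from html.parser import HTMLParser


class HhcParser(HTMLParser):
    """Collect (depth, name, link-or-None) for every text/sitemap item."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.items = []
        self._current = None  # params of the item being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            self.depth += 1  # <ul> goes down one level
        elif tag == "object" and (attrs.get("type") or "").lower() == "text/sitemap":
            self._current = {}
        elif tag == "param" and self._current is not None:
            key = (attrs.get("name") or "").lower()
            self._current[key] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth = max(0, self.depth - 1)  # </ul> goes back up
        elif tag == "object" and self._current is not None:
            name = self._current.get("name")
            if name is not None:
                # "local" is missing for items that have a name but no link.
                self.items.append((self.depth, name, self._current.get("local")))
            self._current = None


def parse_hhc(text):
    parser = HhcParser()
    parser.feed(text)
    parser.close()
    return parser.items


def render_index(items):
    """Render the items as indented lines; items without a link stay plain text."""
    lines = []
    for depth, name, local in items:
        label = escape(name)
        if local:
            label = '<a href="%s">%s</a>' % (escape(local, quote=True), label)
        lines.append("%s%s<br>" % ("&nbsp;" * (4 * max(depth - 1, 0)), label))
    return "\n".join(lines)
```

Tag and attribute names are matched case-insensitively because html.parser lowercases them, which matters since HHC files in the wild mix upper and lower case.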
unebook can not only generate an index page automatically from the HHC file, but also generate a frame page that embeds the index page and the content page in frames, simulating the directory effect of CHM as closely as possible. Fully imitating a tree directory that expands and collapses dynamically would require adding images, JS, CSS, and other files, which is not worth the effort.
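A frame page of this kind is only a few lines of HTML. The following sketch shows one way to generate it; the function name, the frame names "toc" and "content", and the default width are assumptions for the example, not unebook's actual output.

```python
from html import escape


def build_frame_page(title, index_page, first_page, index_width=240):
    """Return a frameset page: index on the left, content on the right."""
    return "\n".join([
        "<html>",
        "<head><title>%s</title></head>" % escape(title),
        '<frameset cols="%d,*">' % index_width,
        '<frame name="toc" src="%s">' % escape(index_page, quote=True),
        '<frame name="content" src="%s">' % escape(first_page, quote=True),
        "</frameset>",
        "</html>",
    ])
```

For links clicked in the left frame to open on the right, the index page would also need something like <base target="content"> in its head.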
2.2.2 EXE format
Besides the CHM format, a large number of e-books based on the IE kernel are provided in EXE format. Tools for creating EXE e-books now seem to have become an industry that supports a good many programmers. Many people think this kind of e-book is cool: a single file can be executed, the interface can be very attractive, and it can carry password protection. Personally, though, this is the type of e-book I dislike most. Apart from the security, speed, space, and retrieval problems mentioned above, what bothers me most is that current EXE e-books have no good bookmark function, in particular one that can mark any position within a page, so finding the place where you stopped in a long document is very troublesome. So once myreader had a bookmark function, I made up my mind to solve the decompilation problem.