This is an original article, first published on my personal homepage, http://purplesword.info/pdf-to-epub. You are welcome to visit, and thank you for your support ~

This article summarizes my experience converting PDF files into EPUB e-books. My own knowledge is limited, so errors and omissions are inevitable; corrections are welcome. I assume the reader already knows the basic operations of the software mentioned here (for example, how to open an HTML file in Notepad++ and how to use PDF Password Remover) and is reasonably familiar with epubbuilder.

The article mainly covers non-scanned PDF documents that mix text and images. For scanned PDFs: if the file is a comic, you can use Adobe Acrobat to export the pages as images and build the EPUB from those; if it is a scanned text book, you can use OCR software to turn it into text (ABBYY FineReader gives relatively good results), but OCR accuracy is not high enough, and Chinese characters that are not recognized have to be typed in by hand. For PDFs that contain only plain text, save the content as TXT and convert that. If the text cannot be copied, decrypt the file with PDF Password Remover; see the discussion of PDF encryption later in this article.

PDF is short for Portable Document Format. It is an electronic file format that is independent of the operating system platform.
It was developed by Adobe. PDF files are based on the PostScript language imaging model, which gives accurate color and accurate printed output no matter which printer is used; that is, a PDF faithfully reproduces every character, color, and image of the original.

PDF is built mainly on three technologies:
· a format derived from PostScript, which can be described as a reduced version of PostScript;
· a font embedding system, which lets fonts travel with the file;
· a data compression and transmission system.

A PDF file can be divided into four parts:
1. header
2. body
3. cross-reference table
4. trailer

On the one hand, PDF is the industry standard for pre-press publishing. Because it supports complicated layouts, the file content is complex as well: a PDF can embed special fonts, and the absolute position of every image and text object is stored freely. On the other hand, EPUB uses XML-based standards similar to web pages. Even with CSS style sheets, its typographic results are still far from what PDF can do, so converting a well-made PDF e-book into EPUB is relatively complicated.

Some commercial reading apps do sell EPUBs with high typographic quality. I have tried them and the typography really is good, but that quality comes from careful manual preparation, which is unrealistic when we are converting PDF files ourselves. These e-books are also relatively expensive and restricted by licensing, so they cannot be shared for everyone to use (the files can only be opened in a specific application after logging in to a specific account, and a copy given to someone else will not open).
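The four-part file structure listed above can be spotted in the raw bytes of many PDFs. The following is a minimal sketch, not a real parser: the function name `pdf_structure_markers` is made up for illustration, and it only searches for the well-known keywords (`%PDF-`, `obj`/`endobj`, `xref`, `trailer`, `%%EOF`). Newer PDFs that store the cross-reference data in a compressed stream may lack the plain `xref` and `trailer` keywords, so a `False` here does not necessarily mean the file is damaged.

```python
import re


def pdf_structure_markers(data: bytes) -> dict[str, bool]:
    """Report which of the four structural parts can be spotted in raw PDF bytes."""
    return {
        # Header: a "%PDF-x.y" marker near the start of the file.
        "header": b"%PDF-" in data[:1024],
        # Body: at least one "N G obj ... endobj" indirect object.
        "body": bool(re.search(rb"\d+\s+\d+\s+obj\b", data)) and b"endobj" in data,
        # Cross-reference table: the "xref" keyword, but not the "startxref" pointer.
        "xref_table": bool(re.search(rb"(?<![a-z])xref", data)),
        # Trailer: the "trailer" keyword plus the end-of-file marker near the end.
        "trailer": b"trailer" in data and b"%%EOF" in data[-1024:],
    }
```

To try it on a real file, read the bytes first, for example with `pathlib.Path("book.pdf").read_bytes()` (the file name is a placeholder), and pass them to the function.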
Software that may be needed:
PDF Password Remover 3.0
Adobe Acrobat
Chrome browser (other browsers should work as well)
Notepad++
Microsoft Word
WPS
Digital Photo Compression Master
epubbuilder

The general idea is: remove the password restriction, export the PDF to HTML, strip irrelevant information and fix garbled characters, import the result into epubbuilder, complete the book information and divide it into chapters, check it in a reader for serious errors, and then publish it.

Why convert to HTML? The format is completely open, easy to process, and has a low error rate. It is also consistent with the format EPUB uses internally.

Before the specific steps, one point needs to be made clear. The goal of the steps below is that all images display normally, but they are left-aligned by default (some readers can force images to be centered), and the text is separated from the images. The text does not wrap around the images, as in a layout with the image on the left half and text on the right half. Wrapping looks better, but it is too difficult to achieve.

1. If the PDF is encrypted, use PDF Password Remover to remove the restriction. Encryption is discussed later.

2. Open the file in Acrobat, choose File, then Save As (or Export), and select the HTML 3.2 format (without CSS). The export can be slow, so wait patiently. Do not click around while it runs, because Acrobat crashes easily. Based on experience, do not choose HTML 4.0 (CSS 1.0) here. Although it supports CSS and its layout markup is more reasonable, in practice the error rate when importing into epubbuilder is much higher and the result is poor.

3. Open the HTML in a browser and check for obvious errors: the file does not open at all, everything is garbled, there are no Chinese characters, there are no images, and so on. Fully garbled text is probably an HTML encoding problem, missing Chinese characters probably a PDF font and encoding problem, and missing images probably a problem with the image links in the HTML. These are troublesome to fix and a fix is not guaranteed; if a problem this serious occurs, I have no remedy. Fortunately, it does not happen as long as the PDF file itself is normal.

A brief explanation: a saved HTML page usually consists of a source file and a data folder, for example photography.html and the corresponding folder "photography_files". The folder may also have another name, such as images. The source file and the data folder should normally sit in the same parent folder. The folder holds images and other multimedia files, and may include CSS style sheets and JavaScript scripts; for HTML exported from PDF it contains only images. The HTML source file is really a text file and can be opened in Notepad. Later we will use Notepad++ to edit the HTML source directly.

4. From this step on we correct various problems in the HTML, which may be hard to follow. Readers who know HTML and regular expressions should understand quickly. If you do not, just follow the instructions.
If the layout already looked good when you opened the HTML in the previous step and nothing extraneous is present, you can skip these HTML correction steps and import the file directly into epubbuilder to see the result.

5. Open the HTML file in Notepad++ to see its source code.

6. Delete the align attributes in the HTML source by replacement, so that images and text lose their specified alignment and fall back to the default left alignment. Press Ctrl+H, or choose Search, then Replace, from the menu. Set "Search Mode" to "Normal", set "Find what" to align="center", leave "Replace with" empty, check "Wrap around", and click "Replace All". If you are processing several files at once, open them all and click "Replace All in All Opened Documents". Repeat with "Find what" set to align="left", align="right", and align="justify". When you open the HTML again, you will find that the images whose positions were somewhat messy look much better. If some images should stay centered, replace selectively instead of using "Replace All", or adjust them later in Word. The original layout looked messy because some images were right-aligned, some were left-aligned, and some text was justified.

7. Delete headers and similar elements by replacement (using regular expressions). Books generally have headers and footers; in my example these are the parts marked with red boxes, plus the page numbers. This information is meaningless once the EPUB is generated, because an EPUB is paginated differently in different situations. Anyone familiar with Word knows that while a book is being edited, the headers can be edited in one batch. Once the PDF has been generated, however, each header and footer becomes an independent object, and they can no longer be deleted all at once.

If the header is text, it is handled in the next step. In the source code the text may be written with escape characters, which cannot be edited unless you can read them; Chinese characters expressed as escapes (such as &#20154;) are hard to understand.

If the header is an image (such as the "02" in my example), it has to be removed by replacement in the HTML source. The method is as follows. Open the HTML file in Chrome and in Notepad++ at the same time. In Chrome, right-click the header image and choose "Inspect element"; a pane showing the source code appears at the bottom, where you can read the image's width and height. Switch to Notepad++ and open the Replace dialog. Change "Search Mode" to "Regular expression", uncheck "Match case", check ". matches newline", and leave the other options unchanged. Set "Find what" to

<[^<>]*IMG[^<>]*width="39"[^<>]*Height="71"[^<>]*>

Note that the expression contains no spaces. The numbers after width and height are the ones you just saw in Chrome. Leave "Replace with" empty, click "Replace All", and save the file, but do not close Notepad++. All images with that width and height are now removed. Refresh the page in Chrome to check whether the change caused any problem. If it did, undo the change in Notepad++ and analyze the actual situation (omitted here). In Chrome you may see that some headers were not removed, because their width and height differ from the values used. In that case, simply repeat the procedure with the new values.

8. Editing the HTML further in Word brings it close to perfect, so this step is also critical. Open the HTML in Word. Other software is not recommended: an HTML file containing an entire book is usually very large, and many programs, such as WPS and Dreamweaver, crash easily on it. Word 2010 is well optimized in this respect; I do not know about version 2003. After opening the file you can select everything and change the font. Then use Replace to remove the small number of unprintable characters, which are displayed as question marks (take care not to replace genuine question marks in the text), as well as repeated website information, advertisements, and headers and footers in text form. To emphasize again: do not replace body text that happens to match the header. Word can restrict a replacement to a specified font, which makes this easier. Page numbers in a form such as "page X" can be left for epubbuilder to remove. Finally, use Word to adjust the layout of text and images as needed and remove any unnecessary table of contents; after that there should be no serious problems.
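For readers who prefer a script to the Notepad++ dialogs, the replacements in steps 6 and 7 can be sketched in Python. This is only an illustration under a few assumptions: the function names are made up, the width attribute is assumed to come before height (as in the expression above), and the 39 and 71 in the usage note are simply the example values from step 7. The last helper turns numeric escapes such as &#20154; back into readable characters, leaving ASCII escapes like &#60; alone so that the markup is not broken.

```python
import html
import re

# Step 6: an align attribute with any of the four values, quoted or unquoted.
ALIGN_ATTR = re.compile(
    r'\s+align\s*=\s*"?(?:center|left|right|justify)\b"?', re.IGNORECASE
)


def strip_align(source: str) -> str:
    """Remove align attributes so content falls back to default left alignment."""
    return ALIGN_ATTR.sub("", source)


def strip_images_by_size(source: str, width: int, height: int) -> str:
    """Step 7: delete every <img> tag with exactly this width and height."""
    pattern = re.compile(
        rf'<img\b[^<>]*\bwidth\s*=\s*"{width}"[^<>]*\bheight\s*=\s*"{height}"[^<>]*>',
        re.IGNORECASE,
    )
    return pattern.sub("", source)


def decode_numeric_refs(source: str) -> str:
    """Turn numeric escapes such as &#20154; into the characters they stand for."""

    def _decode(match: re.Match) -> str:
        char = html.unescape(match.group(0))
        # Keep ASCII escapes (for example &#60; for "<") so tags are not broken.
        return char if len(char) == 1 and ord(char) > 127 else match.group(0)

    return re.sub(r"&#(?:\d+|[xX][0-9a-fA-F]+);", _decode, source)
```

A typical call would be `strip_images_by_size(strip_align(text), 39, 71)` on the text of the exported HTML file. As with the manual method, work on a copy and check the result in a browser before going further.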
One thing to watch for when editing in Word: if the PDF is complete and has a table of contents, delete the page numbers in it. Take an entry like this:

Preface ........................................................................ 1

The page number 1 here is meaningless once the book is converted to HTML, and there is no need to keep it. Also watch for a common problem where some text is stored as images; this is covered in the discussion of common problems below. Word's powerful Replace function lets you specify the text format both for what is searched and for what replaces it.

9. If necessary, clean up the HTML file so that epubbuilder imports it correctly. Strictly speaking, this step is only needed because epubbuilder is imperfect. After a file has been edited and saved in Word, a lot of special information is added to the head of the HTML file: <meta…> tags, blocks of the form <!--..........--> shown in green in Notepad++ (in standard HTML, text in this form is a comment, and deleting it has no effect), and image links. These sometimes interfere with the epubbuilder import and cause errors. If an error occurs, open the file in a browser and save it again under a new name, then use Notepad++ to delete the green <!--..........--> sections. If that still does not work, create a new document in WPS (Word is not suitable here: WPS re-links the images when it generates the file, and Word does not), open the HTML in a browser, select all and copy the page content, paste it into WPS, and save it as HTML. The HTML file is then regenerated from scratch, but WPS may convert the images to PNG, which generally takes more space, so this route is not recommended unless needed.

10. If you re-saved the file with WPS, check the size of the HTML file's image folder. If it is too large, compress the images as follows: add the folder in Digital Photo Compression Master and save the resulting JPG files to another folder. Open the HTML source in Notepad++ and, for the <img ...> image tags, use Normal search mode to replace ".png" with ".jpg". Then delete the PNG images from the image folder and move the compressed JPGs into it. Finally, open the file in a browser to confirm.

11. Import the file into epubbuilder, edit the book information, divide the chapters, run the smart typesetting, and check for errors. If there are any, fix them. One issue was left over from earlier: footers of the form "page X" can be removed here with the feature-line deletion function. I also have to point out a shortcoming of epubbuilder. You may find that the original HTML is laid out well and carries font information, yet after the import that information is gone and some images have minor problems. With my HTML, for example, the font formatting disappeared after the import, the text and images ended up centered, and the red box to the left of "Master photography" was completely out of place. I have no good countermeasure yet and look forward to improvements in epubbuilder.

12. Export the EPUB and check it in the handheld bookstore reader or another viewer. Then publish it and wait for it to pass review ^_^. Quite a way to earn a few coins, isn't it?

I have analyzed the following problems.

First, encryption and copyright protection. Many PDF files are encrypted in some way for copyright reasons. The strictest form requires a password just to open the file. In that case the only option is brute-force cracking with software such as Advanced PDF Password Recovery, which is not described in detail here; the success rate is low and it takes a lot of time. A more common form lets the file be opened, but the text cannot be copied, or comes out garbled when copied. In that case PDF Password Remover can usually remove the restriction quickly, after which the text can be copied.

Second, fonts embedded in the PDF. Many PDF files embed fonts, which can lead to garbled characters. If the entire exported file is wrong, I do not know of a fix. If only a few characters are wrong, you can correct them by hand, and if the same mistakes recur, Replace can solve the problem. Of course, an automatic Replace All may also change text that was correct. For example, if every occurrence of one word has been turned into another word that is rare, such as "tokens", we can replace it with confidence. But if a very common word has been turned into "white", a blind replacement will also change the "white" that legitimately appears inside other words, such as the word for "understanding", and new errors will arise. There is no good solution to this situation, so pay special attention to it. In another case, characters that the font lacks (usually because the font used for the surrounding text does not support them, so they are set in another font) are converted into images. Fixing this requires patience and manual correction in Word.

Third, layout. Take a beautifully laid out photography e-book in which text is set between and around the images. What happens after it is converted to HTML? As you would expect, the result is rather confusing. This is not easy to solve. If you want to produce a high-quality book, fix the layout by hand in Word. (Unfortunately, I really do not have the patience.)

There is also a strange situation. In a PDF, text is an object too. Generally, a run of text in the same font is one object, and each image is one object. PDF editors, however, behave in an interesting way. For example, if two paragraphs of text are originally one object and you insert a blank line between them, they may be split into two objects.
Conversely, when two objects of the same kind (both text or both images) sit close together, they are automatically merged into one object. Then something remarkable happens. Suppose one image sits very close to the image below it and both have the same width: the two are merged into a single object, and when the HTML is exported they come out as "conjoined twins", one single image. After that there is no good way to arrange the descriptive text that belonged beside each of them (unless you split the image by hand). So I ask readers to bear with results like this. The helplessness of converting PDF to EPUB shows here as well.
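Returning to the garbled-character replacement discussed under the embedded-font problem: before running Replace All, it can help to list every occurrence of the suspect string together with a little surrounding text, so that legitimate uses stand out. The helper below is a minimal sketch of that idea; the name `occurrences_with_context` and the `width` parameter are invented for this illustration and are not part of any tool mentioned in the article.

```python
def occurrences_with_context(text: str, target: str, width: int = 10) -> list[str]:
    """Return each occurrence of target with up to `width` characters on each side."""
    if not target:
        raise ValueError("target must not be empty")
    hits = []
    start = text.find(target)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(text), start + len(target) + width)
        hits.append(text[lo:hi])
        # Continue searching after the current match.
        start = text.find(target, start + len(target))
    return hits
```

If every snippet shows the same mistake, Replace All is probably safe. If some snippets show the string inside a correct word, those places need to be handled by hand.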
All rights reserved. Source: purplesword » PDF-to-Epub-format e-book experience