Ma Jian
Email:[email protected]
Published: 2009.09.22
Update:
2012.06.11
The content has been updated to reflect new developments in Pdftoy.
1 Introduction
2 Theory
3 Implementation
3.1 Conversion of the MRC model
3.1.1 Single-layer DjVu
3.1.2 3-layer DjVu
3.1.3 2-layer DjVu (color text)
3.2 Conversion of images
3.2.1 JB2 to JBIG2
3.2.2 IW44 to JPEG 2000
3.2.3 Conversion of JPEG and CCITT G4
3.3 Conversion of hidden text
3.4 Conversion of the table of contents
3.5 Conversion of other parts
4 Conclusion
5 Extension
5.1 Making PDFs with DjVu technology
5.2 Reverse Conversion
5.3 PDF Browser Restrictions
1 Introduction
Among scanned electronic documents, PDF and DjVu each have a loyal following, so the internet is full of posts asking how to convert one format into the other: everyone wants to turn files into the format that they, or the people they share with, prefer. Plenty of solutions are available too, from the simplest virtual printers (for both PDF and DjVu) to specialized tools (single-step conversion) or tool sets (multi-step conversion).
I have recently been doing some technical exploration in this area myself. My focus is not on the result (I have never advocated needlessly shuffling files between formats) but on the conversion process: I want to compare the PDF and DjVu models and their internal compression algorithms from a technical point of view, in order to achieve lossless conversion while keeping the change in file length small.
This article is a record of that process.
2 Theory
As I understand it, DjVu's high compression ratio comes mainly from the following:
- A layered structure based on the MRC (Mixed Raster Content, see ISO/IEC 16485) model: the scanned image is decomposed into a foreground layer, a background layer, and a mask layer, and each layer is compressed with the algorithm that suits it best. For bitmaps that mix text and pictures, this is clearly better than traditional non-layered, one-size-fits-all still-image formats (JPEG, JPEG 2000, PNG, TIFF, GIF, and so on). ISO/IEC 16485 also allows the image to be divided into strips before layering, or an N-layer structure to be used, either of which can give higher compression. DjVu apparently decided that the extra gain was not worth the effort and has always stuck to the basic three-layer MRC model.
- On top of the layering, DjVu draws on the psychology of reading: readers pay more attention to the text on a scanned page than to the illustrations and background shading. The text part therefore keeps its full pixel size for maximum clarity and resolution, while illustrations and shading are usually downscaled first and then lossy-compressed. Typically the width and height are reduced to between 1/3 and 1/12 of the original, and the image is scaled back up for display. A simple calculation shows that even at 1/3 in each dimension the image area is only 1/9 of the original, so a 1:9 compression ratio is reached before any coding has been done, which naturally shrinks the final file a great deal. The price is that the illustrations in many DjVu files look blurry; lossy compression contributes, but the scaling matters more. The pixel size of each layer can be seen in the DjVu file information exported by Djvutoy, and is worth a look. People who keep asking "Why does my DjVu look so blurry?" or "Why are foreign DjVu files clearer than domestic ones?" should look at it carefully.
- For coding, DjVu's text layer uses the JB2 compression algorithm. Its core idea is to divide the text on a page into symbols (shapes) and never encode the same symbol twice, so that a whole page of text is represented by a set of distinct symbols (the "dictionary") plus a set of page descriptions. A single page description can be written as a triple (idx, x, y), where idx is the symbol's index in the dictionary and (x, y) is where the symbol is drawn; put plainly, each page description means "draw symbol idx at (x, y)". This way the blank parts of the page need no encoding at all, and for printed type (alphabetic scripts especially) symbols repeat a great deal on every page and the repeats need no encoding either, so the compression ratio is higher than that of ordinary still-image formats. The difficulty is deciding whether two symbols are the same. The image is scanned, and the edges of binarized characters are full of burrs, so two instances of a character are unlikely to match pixel for pixel. There has to be a tolerance: symbols that differ by more than the tolerance are treated as different, otherwise as the same. Conventional DjVu production software usually offers three choices, lossless, clean, and lossy, with tolerance going from small to large and final file length from large to small. DjVu has always advertised a high compression ratio, so the default in conventional production software is lossy, which makes it possible for one character to be mistaken for a similar one:
http://djvu.org/forum/phpbb/viewtopic.php?t=659
http://readfree.net/bbs/read.php?tid=277235
This problem is the original sin of lossy JB2 compression. In theory it is hard to avoid completely; in practice every DjVu encoder adds some internal checks to compensate, but nobody can say how well they work. So unless there is solid evidence that a DjVu encoder never confuses similar Chinese characters (and I doubt anyone can produce such evidence), I will never compress files I need to keep with lossy JB2. Files made for other people to read are another matter: if they come out wrong, that is not my problem.
This discrimination is also closely tied to the scanning dpi: misjudgment is more likely below 300 dpi than at 300 dpi and above. Following the practice of the foreign scanning community (see the well-known "The Scan and Share tutorial version 1.07"), pages should be scanned at one dpi level, then rescaled in software to the working dpi, and only then made into DjVu. That is why some people feel foreign DjVu files are clearer than domestic ones: the dpi is simply not the same!
- DjVu illustrations and shading usually use the IW44 compression algorithm. It is based on wavelet analysis, its principle is much the same as that of JPEG 2000, and it is generally used at a high compression ratio, at the price of image quality loss that is visible to the naked eye.
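The dictionary-plus-triples idea and the role of the tolerance can be shown with a toy model. This is only a sketch under loud assumptions: glyphs are tiny fixed-size bitmaps, "similarity" is a plain pixel-difference count, and the names `hamming` and `encode_page` are made up for the illustration. Real JB2 encoders match and code shapes in far more elaborate ways.

```python
from typing import List, Tuple

Glyph = Tuple[str, ...]           # rows of "0"/"1"; all glyphs share one size here
Placement = Tuple[int, int, int]  # the (idx, x, y) triple of a page description


def hamming(a: Glyph, b: Glyph) -> int:
    """Count the pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode_page(glyphs: List[Tuple[Glyph, int, int]], tolerance: int):
    """Return (dictionary, placements); tolerance 0 means exact matching."""
    dictionary: List[Glyph] = []
    placements: List[Placement] = []
    for bitmap, x, y in glyphs:
        for idx, known in enumerate(dictionary):
            if hamming(known, bitmap) <= tolerance:
                break                      # close enough: reuse the stored symbol
        else:
            dictionary.append(bitmap)      # first time this shape is seen
            idx = len(dictionary) - 1
        placements.append((idx, x, y))
    return dictionary, placements


A = ("010", "101", "111", "101")
A_NOISY = ("010", "101", "111", "100")    # same letter with one burr pixel
B = ("110", "101", "110", "101")          # a different but similar-looking letter
PAGE = [(A, 0, 0), (A_NOISY, 4, 0), (B, 8, 0), (A, 12, 0)]

for tolerance in (0, 1, 2):
    dictionary, placements = encode_page(PAGE, tolerance)
    print(tolerance, len(dictionary), placements)
```

With tolerance 0 the dictionary holds three symbols, with tolerance 1 the noisy copy is folded into `A`, and with tolerance 2 the letter `B` is also replaced by `A`, which is exactly the kind of substitution the forum posts above complain about.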
Checking against ISO 32000-1, PDF offers the same features:
- PDF's transparent imaging model supports multi-layer structures, including transparency and translucency, and is more complex than the DjVu model.
- JBIG2 compression is supported from PDF 1.4 (corresponding to Acrobat 5), and its core idea is identical to that of DjVu's JB2 (see ISO 32000-1:2008, 7.4.7, and ISO/IEC 14492:2001). JBIG2 covers a wider range, however: besides text and line art it also handles halftone images, so its definition is far more complex than JB2's. In other words, a JB2 data stream can always be converted into a JBIG2 data stream, but not necessarily the reverse, because JBIG2 has features with no counterpart in JB2.
- JPEG 2000 compression is supported from PDF 1.5 (corresponding to Acrobat 6). Its theoretical basis is the same as that of DjVu's IW44: wavelet analysis. In actual tests on the same continuous-tone image at the same compression ratio, the two algorithms show no obvious visual difference. In other words, for the same image the two can produce files of similar length with similar visual quality.
In theory, therefore, most DjVu files can be converted to PDF with only a small change in file length (some change is inevitable, since the file structures differ), with the data either lossless (JB2->JBIG2) or visually lossless (IW44->JPEG 2000).
Note that I say "most DjVu files", because there are always exceptions.
3 Implementation
Theory without a working implementation always feels a little empty. So I took the PDF generation engine of freepic2pdf, added DjVu support, and finally implemented DjVu-to-PDF conversion in Djvutoy. It converts a whole book at a time and carries over, besides the images, multi-level bookmarks and hidden text, but not annotations, thumbnails, and the like.
The following sections describe the principles and methods behind the key techniques of the implementation, and how the results were verified.
3.1 Conversion of the MRC model
As stated earlier, DjVu's basic image model is the three-layer MRC model of ISO/IEC 16485, but not every DjVu file has three layers; some have one or two.
- Single-layer DjVu: also called photo DjVu (color or grayscale) or bi-level DjVu (black and white). A page holds just one image layer. Color and grayscale images can use IW44 or JPEG compression; black-and-white images use JB2 or CCITT G4.
- 2-layer DjVu: also called color text DjVu. It has only a foreground layer (JB2 or CCITT G4 compression) and a background layer (IW44 or JPEG compression), but the foreground layer may carry color, in which case the JB2 is also called colorized JB2. This is an extension over the original JBIG2 standard and is unique to DjVu.
- 3-layer DjVu: contains a mask layer, a foreground layer, and a background layer. The mask layer is a black-and-white image and can be compressed with JB2 or CCITT G4; the foreground and background layers are grayscale or color images and can use IW44 or JPEG compression.
The conversion of DjVu's MRC model to the PDF image model is discussed below for each of these three cases.
3.1.1 Single-layer DjVu
A single-layer DjVu page is really just one image, which maps directly onto an image in PDF. Converting it involves little at the model level: the whole image is simply placed on the PDF page.
3.1.2 3-layer DjVu
3-layer DjVu is not complicated either. The PDF image model also allows masks, and even lets the opacity (weight) of the mask be specified, so there is no real problem at the model level; the only question is which mask representation to choose.
I finally chose SMask, for a simple reason: a PDF produced this way lets the user specify a background color when viewing it in Acrobat, which is what people usually call a "transparent background PDF".
The example here is a three-layer DjVu file together with the PDF that Djvutoy produced from it; if you are interested, compare how the two display. Comparing the internal data gives:
- DjVu: mask layer 2774x3543 pixels, data stream 26896 bytes; foreground layer 232x296 (width and height are only 1/12 of the mask layer's), data stream 5138 bytes; background layer 925x1181 (width and height are only 1/3 of the mask layer's), data stream 34334 bytes.
- PDF: the pixel size of each layer is the same as in the DjVu file, and the data stream lengths are 27424, 5083, and 34386 bytes respectively, so the difference is small.
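The figures in this comparison can be checked with a few lines of arithmetic. The sketch below assumes that reduced layer sizes are rounded up, which happens to reproduce the reported pixel sizes; the helper name `subsampled` is made up for the illustration.

```python
def subsampled(size, factor):
    """Pixel size of a layer reduced by `factor`, rounding up (an assumption)."""
    width, height = size
    return (-(-width // factor), -(-height // factor))


MASK_SIZE = (2774, 3543)
DJVU_BYTES = {"mask": 26896, "foreground": 5138, "background": 34334}
PDF_BYTES = {"mask": 27424, "foreground": 5083, "background": 34386}

foreground_size = subsampled(MASK_SIZE, 12)   # 1/12 in each dimension
background_size = subsampled(MASK_SIZE, 3)    # 1/3 in each dimension
djvu_total = sum(DJVU_BYTES.values())
pdf_total = sum(PDF_BYTES.values())
growth = (pdf_total - djvu_total) / djvu_total

print(foreground_size, background_size)
print(djvu_total, pdf_total, f"{growth:.2%}")
```

The three PDF streams together come to 66893 bytes against 66368 bytes for DjVu, a difference of 525 bytes, or under one percent.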
If you are interested, try saving this example DjVu as a single still image. The file length balloons, and the comparison helps explain what I said earlier about where DjVu's high compression ratio comes from.
The official DjVu-to-PDF conversion software, Caminova documentexpress Enterprise 7.5 (DEENT75), has a gimmick when converting multi-layer DjVu: the resulting PDF has layer control, so you can choose to display only the foreground or the background layer when browsing in Acrobat. In my view layer control increases the PDF file length, few PDF viewers support it, and few people use it, so I did not bother with it.
Djvutoy's own gimmick is that the converted PDF has a transparent background, whether the source is single-layer or multi-layer, so users can choose the background color when browsing.
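For readers who want to see what the three layers do at display time, here is a rough sketch of mask-based composition. It is not Djvutoy's code and it does not model PDF's SMask machinery; it only assumes that a reduced layer is enlarged by pixel repetition and that the mask picks foreground or background per pixel. The names `upscale` and `compose` are made up.

```python
def upscale(layer, factor, width, height):
    """Enlarge a reduced layer to width x height by repeating pixels."""
    return [[layer[y // factor][x // factor] for x in range(width)]
            for y in range(height)]


def compose(mask, fg, fg_factor, bg, bg_factor):
    """Where the mask is 1 show the foreground, elsewhere the background."""
    height, width = len(mask), len(mask[0])
    fg_full = upscale(fg, fg_factor, width, height)
    bg_full = upscale(bg, bg_factor, width, height)
    return [[fg_full[y][x] if mask[y][x] else bg_full[y][x]
             for x in range(width)]
            for y in range(height)]


# A 4x4 page: full-size mask, foreground reduced by 4, background reduced by 2.
MASK = [[0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
        [1, 0, 0, 1]]
FOREGROUND = [["K"]]                 # one "ink" color for the whole page
BACKGROUND = [["w", "g"],
              ["g", "w"]]            # a coarse paper texture

PAGE = compose(MASK, FOREGROUND, 4, BACKGROUND, 2)
for row in PAGE:
    print("".join(row))
```

The mask keeps its full resolution, so text edges stay sharp even though the two color layers are coarse. Real viewers interpolate when enlarging rather than repeating pixels.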
3.1.3 2-layer DjVu (color text)
"Color text" is a DjVu specialty. If a page contains colored text, DjVu has two ways to represent it (see "LizardTech DjVu Reference DjVu V3", LizardTech, 2005, 7.1.3.1 "Encoding"):
- General three-layer method: the text outlines are compressed with JB2 as the mask layer (Sjbz), and the color is compressed with IW44 as the foreground layer (FG44). The example above uses this technique. To get a high compression ratio the foreground layer is usually downscaled heavily (to 1/12 in the example above), so the text color may look a little odd when the page is displayed, because the downscaled foreground always differs somewhat from the original.
- Color text method: the text outlines are compressed with JB2 as the mask layer (Sjbz), and the color of each symbol is then encoded as the foreground color layer (FGbz).
Of the two, the second codes more efficiently and gives purer text color, but each symbol must be a single solid color, so effects such as gradient text are impossible. The first is clearly more widely applicable, and its compression ratio problem is usually solved by downscaling: at 1/12 in each dimension the area is only 1/144 of the original, so the ratio passes 1:100 before coding has even begun.
As I understand PDF, the most thorough way to convert a color text DjVu would probably be this: split the Sjbz data into its "dictionary" and "page description" parts, wrap the dictionary symbols into a bitmap font embedded in the PDF, turn the page descriptions into PDF text-showing operators, and turn the color descriptions in FGbz into PDF color-setting operators. On display, the characters are drawn in the specified colors, with the glyph bitmaps coming from the embedded font.
The method is good, but just thinking about its complexity was enough to drain my courage to try it. So I took a shortcut and convert the 2-layer structure into a regular 3-layer structure. The official conversion software DEENT75 does the same, but Djvutoy has one option more than DEENT75: you can choose the downscaling factor of the foreground layer.
To turn the 2-layer model into a 3-layer model, the color foreground layer has to be reconstructed and then downscaled, while the original mask and background layers stay unchanged; the 2 layers thus become 3. If the foreground layer is not downscaled, the converted PDF is visually identical to the original DjVu but the file grows: the foreground layer is grayscale or color, and whether it is compressed with JPEG or JPEG 2000, the file length will not come down unless the pixel size comes down.
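The reconstruct-then-downscale step can be sketched as follows. This is a guess at one workable approach, not the method used by Djvutoy or DEENT75: each symbol paints its color onto a full-size plane, and each block of the reduced foreground takes the average color of the text pixels it covers, with a hypothetical filler color for blocks that contain no text. The names `paint_symbols` and `reduce_plane` are made up.

```python
def paint_symbols(width, height, symbols):
    """symbols: iterable of (bitmap_rows, x, y, rgb). Returns a full-size plane
    holding an RGB tuple where text is and None elsewhere."""
    plane = [[None] * width for _ in range(height)]
    for rows, x0, y0, rgb in symbols:
        for dy, row in enumerate(rows):
            for dx, bit in enumerate(row):
                if bit == "1":
                    plane[y0 + dy][x0 + dx] = rgb
    return plane


def reduce_plane(plane, factor, filler=(255, 255, 255)):
    """Downscale by `factor`, averaging only the text pixels of each block."""
    height, width = len(plane), len(plane[0])
    reduced = []
    for by in range(0, height, factor):
        out_row = []
        for bx in range(0, width, factor):
            block = [plane[y][x]
                     for y in range(by, min(by + factor, height))
                     for x in range(bx, min(bx + factor, width))
                     if plane[y][x] is not None]
            if block:
                out_row.append(tuple(sum(c[i] for c in block) // len(block)
                                     for i in range(3)))
            else:
                out_row.append(filler)   # no text here; the mask hides it anyway
        reduced.append(out_row)
    return reduced


RED, BLUE = (200, 0, 0), (0, 0, 200)
SYMBOLS = [(("1",), 0, 0, RED), (("1",), 1, 0, BLUE)]   # two adjacent one-pixel "glyphs"
FOREGROUND = reduce_plane(paint_symbols(4, 2, SYMBOLS), 2)
print(FOREGROUND)
```

In this tiny case the red and the blue glyph fall into the same block and come out as one purple value, which illustrates why text colors can look slightly off once the foreground layer has been reduced.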
DEENT75 scales the foreground layer to 1/12 of the original pixel width and height. Djvutoy uses the same default, but if you care a great deal about quality and do not mind the file length, you can set the scale manually.
Generating the foreground image is also a delicate matter. I tried for a long time to imitate DEENT75's method without success; the current method is the result of many experiments and is no worse than DEENT75 in file length or image quality.
3.2 Conversion of images
3.2.1 JB2 to JBIG2
At first this part seems to hold no suspense: decode the dictionary and page descriptions from JB2, then re-encode and wrap them as JBIG2 requires. There is no need to decode the whole page into a bitmap and then segment and cluster it again.
Once you actually do it, though, you find there is more to it. If the dictionary is encoded and wrapped as-is, the result is about 20% longer than the original JB2 data stream. I only understood one of the reasons after reading Adam Langley's jbig2enc: symbols that appear repeatedly in the page descriptions can be put into one dictionary and symbols that appear only once into another, which reduces the number of index bits in the page descriptions and so shortens the whole data stream. I have not seen anyone give this technique a name, so I call it "two-pass dictionary coding". It matters for dictionaries shared across pages and also for dictionaries used by a single page.
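Why splitting the dictionary can save index bits is easiest to see in a toy cost model. The sketch below is an interpretation, not jbig2enc's or Djvutoy's algorithm: it assumes fixed-length symbol indices of ceil(log2(n)) bits, treats "shared" symbols as those used on more than one page, and ignores arithmetic coding and the cost of the dictionaries themselves. All function names are made up.

```python
from collections import Counter


def id_bits(symbol_count):
    """Bits for a fixed-length symbol index: ceil(log2(n)), at least 1."""
    return max(1, (symbol_count - 1).bit_length())


def single_dictionary_cost(pages):
    """Index bits when every page refers to one dictionary of all symbols."""
    all_symbols = {s for page in pages for s in page}
    return sum(len(page) for page in pages) * id_bits(len(all_symbols))


def split_dictionary_cost(pages):
    """Index bits when each page sees a shared dictionary plus its own one."""
    pages_using = Counter(s for page in pages for s in set(page))
    shared = {s for s, n in pages_using.items() if n > 1}
    cost = 0
    for page in pages:
        local = set(page) - shared
        cost += len(page) * id_bits(len(shared) + len(local))
    return cost


# Each inner list is the sequence of symbol numbers placed on one page.
PAGES = [[0, 1, 2, 0, 1, 3],
         [0, 1, 4, 5, 0],
         [1, 0, 6, 7, 1]]

print(single_dictionary_cost(PAGES), split_dictionary_cost(PAGES))
```

Here one dictionary of 8 symbols needs 3 index bits per placement (48 bits for 16 placements), while the split layout lets every page address only 4 symbols at 2 bits each (32 bits). This simple model does not explain the effect on single-page dictionaries that the text also mentions.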
Besides the dictionary technique described above, the efficiency of JBIG2's arithmetic coder also affects the final data stream length, but that part is too complex for most people to tackle.
Verifying the final coding result is simple:
- Djvutoy can export the structure of a DjVu file, and Pdftoy or the free, open-source Pdfview can export the structure of a PDF file. Comparing the lengths of the JB2 and JBIG2 data streams shows the difference in coding efficiency. In my tests there is some difference, but nowhere near as large as the DjVu promotional material commonly found online claims.
- Pdftoy, or Unicornviewer 0.17 or later, can convert the JBIG2 data in a PDF to JB2 and wrap it into a DjVu file. Djvutoy can then export the dictionary and page descriptions of the DjVu files before and after conversion. Finddupfile confirms that the dictionaries of the two files are identical, and the page descriptions can be checked in Excel to be exactly the same. The JB2-to-JBIG2 conversion and the reverse JBIG2-to-JB2 conversion can therefore be considered completely lossless.
What this verification shows is that a single-layer DjVu compressed with JB2 can be converted to PDF by Djvutoy losslessly, with a similar file length.
The similarity between JB2 and JBIG2 is no accident. The paper "A General Segmentation Scheme for DjVu Document Compression", co-authored by Patrick Haffner, Leon Bottou, and Yann LeCun of AT&T and Luc Vincent of LizardTech, describes the origin of the JB2 algorithm in its Chapter 2:
The mask image is encoded with a new bi-level image compression algorithm called JBZ or DjVuBitonal. It's a variation on AT&T's proposal to the emerging JBIG2 standard. The basic idea of JB2 is to locate individual shapes on the page (such as characters), and use a shape clustering algorithm to find similarities between shapes. Shapes that are representative of each cluster (or in a cluster by themselves) are coded as individual bitmaps with a method similar to JBIG1.
So the similarity goes beyond the names: traced back to their roots, JB2 and JBIG2 are blood relatives. JBIG2 went on to develop many more tricks, while JB2 stood still. Such is life!
3.2.2 IW44 to JPEG 2000
My mathematics is not strong and wavelet analysis is daunting, so I did not investigate whether a direct conversion without decoding to a bitmap is possible, as it is for JB2 to JBIG2. Instead I use a lazy, clumsy method: decode the IW44 data into a bitmap, calculate the compression ratio from the data lengths before and after decoding, and then compress the bitmap to JPEG 2000 at that ratio. The key is that JPEG 2000 compression lets you specify the compression ratio, which guarantees that the compressed data stream length stays within the specified range.
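The ratio calculation can be illustrated with the background layer from the earlier example (925x1181 pixels, 34334 bytes of BG44 data). The sketch assumes the decoded bitmap is 8-bit RGB, which the text does not state, and the function names are made up.

```python
def raw_bytes(width, height, channels=3, bits_per_channel=8):
    """Size of the decoded bitmap; 8-bit RGB is an assumption here."""
    return width * height * channels * bits_per_channel // 8


def compression_ratio(width, height, stream_bytes, channels=3):
    """Ratio between the decoded bitmap and the compressed IW44 stream."""
    return raw_bytes(width, height, channels) / stream_bytes


def target_bytes(width, height, ratio, channels=3):
    """Stream length a JPEG 2000 encoder should aim for at the same ratio."""
    return round(raw_bytes(width, height, channels) / ratio)


BG_WIDTH, BG_HEIGHT, BG44_BYTES = 925, 1181, 34334

ratio = compression_ratio(BG_WIDTH, BG_HEIGHT, BG44_BYTES)
print(f"compression ratio about {ratio:.1f}:1")
print("target JPEG 2000 length:", target_bytes(BG_WIDTH, BG_HEIGHT, ratio))
```

The ratio would then be handed to a JPEG 2000 encoder that accepts a target rate. Pillow's JPEG 2000 writer, for instance, documents a `quality_mode="rates"` setting with `quality_layers`; which encoder Djvutoy actually uses is not stated in this article.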
Verifying the final coding result is also simple:
- Export the DjVu file structure with Djvutoy and the PDF file structure with Pdftoy or Pdfview, then compare the lengths of the BG44 and FG44 data streams with those of the JPXDecode data streams to see the difference in coding efficiency. In my tests the difference is negligible.
- Pdftoy, or Unicornviewer 0.17 or later, can export the JPEG 2000 images in a PDF losslessly. The two versions can then be compared quantitatively with image comparison software, or simply by eye. To me they look much the same and can basically be considered "visually lossless", unless the compression ratio exceeds a certain limit.
If anyone is proficient in wavelets, an in-depth study of IW44 and JPEG 2000 would be worthwhile; I have always felt that the two could be converted directly. If the study bears fruit, do not forget to let me know.
The JB2 and IW44 verification above shows that, for a 3-layer DjVu converted to PDF with Djvutoy, the mask layer is strictly lossless, the foreground and background layers are visually lossless, and the file length changes little.
For a 2-layer DjVu, a foreground layer has to be added, so the file grows noticeably after conversion, and in some cases the effect of the foreground layer is visible.
3.2.3 Conversion of JPEG and CCITT G4
According to "LizardTech DjVu Reference DjVu V3", the mask layer in DjVu can use CCITT G4 compression as well as JB2, with chunk ID Smmr, and the foreground and background layers can use JPEG compression as well as IW44, with chunk IDs FGjp and BGjp respectively.
Because these two algorithms compress so much less efficiently than JB2 and IW44, DjVu files that use them are rarely seen in practice; even the files I tested with had to be made deliberately with software.
PDF itself supports CCITT G4 and JPEG compression, so images compressed this way convert to PDF losslessly: CCITT G4 data may need to be re-encoded, while JPEG images are embedded as they are.
3.3 Conversion of hidden text
DjVu is designed for scanned images, but it also provides hidden text so that document content can be searched, copied, and so on.
The hidden text in DjVu is usually obtained by OCR. A DjVu with hidden text is customarily called a "double-layer DjVu", a name inherited from "double-layer PDF": a PDF made from scanned images plus hidden text generated by OCR.
When converting DjVu to PDF, if the DjVu already has hidden text, we naturally want to carry it over directly rather than run OCR again. But this runs into an essential difference between DjVu and PDF.
DjVu's design purpose has never changed: it is for scanned images, and text is only auxiliary. The text in DjVu is therefore truly "hidden" text. It has only a text encoding (UTF-8) and positions, and carries no font information, so in theory it cannot be displayed unless a font is specified separately.
In PDF, text stands on an equal footing with images: displaying it is the normal case and hiding it is just a special case. Text in PDF therefore has font information in addition to encoding, position, and scale. So when the hidden text of a DjVu is converted to PDF, the trouble lies with the fonts.
Fonts in a PDF can be embedded or external. Opinions differ on which is better; I lean toward external fonts.
PDF has special provisions for external fonts. There are 14 standard fonts that every PDF viewer supports, mainly aimed at Western European Latin (Latin 1) text. Additional standard fonts are defined for CJK (Chinese, Japanese, and Korean), but supporting them is at the viewer's discretion. Acrobat supports Adobe's CJK standard fonts if the Asian language pack is installed. Unicornviewer was developed by Chinese speakers, so its CJK support goes without saying.
In other words, with external fonts only Latin 1 (11 Western European countries) and CJK (Simplified Chinese, Traditional Chinese, Japanese, and Korean) are guaranteed to work across platforms. For other languages such as Russian, you can in theory specify Windows TrueType fonts as external fonts, but cross-platform behavior is not guaranteed.
When DEENT75 converts DjVu to PDF, it converts hidden text with external fonts for Latin 1 and CJK only. Djvutoy's hidden text conversion is learned entirely from DEENT75, and the text positions differ from DEENT75's only beyond the fourth decimal place: I think four decimal places are enough, while DEENT75 evidently prefers to keep more.
On top of imitating DEENT75, Djvutoy also makes some improvements:
- Better support for vertical Chinese, Japanese, and Korean text. Surprisingly, DEENT75 has no concept of vertical text at all, even though the company is headquartered in Asia.
- Words can be merged into lines. The position of an individual word may shift after merging, but the data stream becomes much shorter and proofreading much simpler.
- The double-layer PDF produced by DEENT75 is "image over text": the hidden text is at the bottom and the image on top. This has some drawbacks, so Djvutoy follows Acrobat and uses "text over image": the image is at the bottom and the hidden text on top.
In short, some things you only learn by using them.
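As an illustration of what hidden text looks like on the PDF side, the sketch below builds the content-stream operators for one invisible word. The operators themselves (BT, Tf, Tr, Tm, Tj, ET, and text rendering mode 3 for invisible text) are standard PDF; everything else is an assumption made for the example: the input is taken to be pixel coordinates with a bottom-left origin, the word height is used as the font size, the font resource is called /F1, and only Latin text is handled. The function names are made up, and this is not Djvutoy's code.

```python
def pdf_escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def fmt(value):
    """Format a number with at most 4 decimal places, without trailing zeros."""
    return f"{value:.4f}".rstrip("0").rstrip(".")


def hidden_text_ops(text, x_px, y_px, height_px, dpi, font="/F1"):
    """Content-stream operators that place `text` invisibly (render mode 3)."""
    scale = 72.0 / dpi                      # pixels -> PDF points
    x, y, size = x_px * scale, y_px * scale, height_px * scale
    return (f"BT {font} {fmt(size)} Tf 3 Tr "
            f"1 0 0 1 {fmt(x)} {fmt(y)} Tm ({pdf_escape(text)}) Tj ET")


print(hidden_text_ops("DjVu (test)", x_px=300, y_px=600, height_px=50, dpi=300))
```

For a 300 dpi page this prints `BT /F1 12 Tf 3 Tr 1 0 0 1 72 144 Tm (DjVu \(test\)) Tj ET`. CJK text would need a composite font and a different string encoding, which this sketch leaves out.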
3.4 Conversion of the table of contents
The table of contents is called the outline in PDF and bookmarks or contents in DjVu. Either way, it is the outline shown at the left while browsing.
The table of contents in DjVu is much simpler than in PDF and gives no fine control over the jump target. In PDF, clicking an entry can jump to a page or to a position within a page; in DjVu it can only jump to a page, much like the PDG table of contents.
Djvutoy converts the DjVu table of contents straightforwardly: the UTF-8 text is converted to PDF Unicode and the page numbers are copied across. I was a bit lazy here too: DjVu allows an entry to jump to a file or a URL, and Djvutoy ignores such entries.
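The UTF-8 to "PDF Unicode" step can be sketched in a few lines. PDF text strings such as outline titles may be written as UTF-16BE with a leading byte order mark; the sketch emits that form as a hex string. The function name `pdf_text_string` is made up, and how Djvutoy writes its strings is not stated here.

```python
import codecs


def pdf_text_string(utf8_title: bytes) -> str:
    """Turn a UTF-8 bookmark title into a PDF hex string in UTF-16BE with BOM."""
    text = utf8_title.decode("utf-8")
    data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


title = "第1章".encode("utf-8")     # a bookmark title as stored in DjVu
print(pdf_text_string(title))
```

For the title "第1章" this gives `<FEFF7B2C00317AE0>`, which would go into the /Title entry of the outline item; the destination page is carried over separately.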
3.5 Conversion of other parts
DjVu also has annotations, thumbnails, and other content, all of which have counterparts in PDF, so in theory they could be converted as well. But the official DEENT75 ignores them, so I ignore them too. I never use them anyway, and they are not worth the time.
4 Conclusion
To sum up: given suitable methods and tools, most DjVu files can be converted to PDF with only a small change in file length, with the data lossless (JB2 to JBIG2) or visually lossless (IW44 to JPEG 2000), and with hidden text, the table of contents, and so on carried over.
From this point of view, the claim that "the DjVu format compresses better than the PDF format" is not really true. The PDF "format" can reach the same high compression ratios as DjVu, so the difference lies not in the "format" but in the tools and methods used to turn still images into the final "format".
5 Extension
5.1 Making PDFs with DjVu technology
Today's common PDF authoring tools, Acrobat included, tend to embed the whole still-image stream, or even the whole file, into the PDF without further processing (such as layering according to the MRC model) when converting a still image to PDF. The advantages are that the technique is simple and easy to implement and the image can be kept completely lossless; the disadvantage is the frequent complaint that PDF files are much larger than DjVu.
As described above, DjVu's high compression ratio comes directly from its "layered structure, encoding chosen per layer", and this can be carried over to PDF. So I think the way to improve the compression of scanned PDFs is to improve the authoring software: bring in the kernel or engine of commercial DjVu production software, layer the scanned images that are to be converted to PDF, and then choose the most effective image compression algorithm for each layer. In other words, simplify the "image->DjVu->PDF" process to "image->PDF", with the middle step done quietly inside the PDF authoring software.
Of course, if you do not mind the trouble, or have accumulated OCR expertise, you can develop the layering yourself; the end result is the same. In fact, when I first saw the highly compressed PDFs produced with Luratech's products, I suspected this was what they were doing, and that was one of the things that prompted me to write this article. The current DEENT75 also lets the user choose whether the output is DjVu or PDF; if PDF is chosen, it goes directly from image to layered PDF.
5.2 Reverse Conversion
After discussing DjVu to PDF, a natural question is: can PDF be converted back to DjVu?
My answer is: it depends on how you want to do it. The simplest way is of course to print to a DjVu virtual printer or to find a ready-made pdf2djvu program; if you enjoy the trouble, you can also convert the PDF to pictures and then the pictures to DjVu.
But since we have spent all this time on data format conversion, let us not stray too far and follow the same idea: can the image data streams and the layer description be extracted from the PDF and converted back to DjVu? My answer is: not necessarily, for the following reasons:
- A JBIG2 data stream in a PDF corresponds to a DjVu JB2 data stream as long as it contains no halftone images, and can then be converted back to JB2 losslessly. When I implemented this in Pdftoy and Unicornviewer, however, I met the same problem as in the original JB2-to-JBIG2 conversion: the file converted back was longer than the original DjVu file. Judging from the Djvulibre source code, this is again due to two-pass dictionary coding in JB2, but I did not have the patience to study it. So I took a lazy approach and added a "two-pass encoding" option to the "Export" interface. If the option is not selected, my own lazy method is used: the data is taken out of the JBIG2 stream and encoded directly into JB2 without decoding the whole page into a bitmap, a process that can be verified to be lossless. Otherwise the whole page is decoded into a bitmap and then re-segmented, clustered, and encoded into JB2 with lossless parameters using Minidjvu or Djvulibre's cjb2. The dictionary and page descriptions may change, but the full image is still lossless and the data stream can be shorter.
- I cannot convert JPEG 2000 data in a PDF directly to IW44, and since the IW44 compression interface in Djvulibre does not support specifying a compression ratio, even decoding to a bitmap and recompressing cannot guarantee that the file length stays the same.
- For color text, I cannot think of any way to convert it back without reprocessing.
So I have only implemented exporting the JBIG2 data in a PDF to DjVu, and have not dared attempt a full PDF->DjVu conversion. I also suggest you do not play with it just for something to do; otherwise one day you may regret it, and there is no medicine for regret.
Although my study of reverse conversion is far from thorough, it did yield by-products. During the study I came to feel that PDFs using JPEG 2000 compression will become more common, so I specifically strengthened Unicornviewer's support in this area, and all the PDG-related software under my name has begun to support files that are named PDG but are actually JPEG 2000: if the pictures in a PDF cannot go back to DjVu, they can simply be exported as pictures and viewed.
5.3 PDF Browser Restrictions
The PDFs described above use JBIG2 and JPEG 2000 compression. The former requires Acrobat 5 or later and the latter Acrobat 6 or later to display properly. Fortunately, mainstream Acrobat is now at least version 7. Among other common PDF viewers, PDF-XChange supports both formats without trouble, Foxit needs a special plug-in, and CAJViewer does not support them. My own Unicornviewer has no problem, and its JPEG 2000 support has been specifically enhanced, with better compatibility than Acrobat 8.
DjVu to PDF