Tests and notes on converting HTML to PDF, with examples
Due to work requirements, I recently spent some time studying how to convert HTML to PDF. The key technical problem is handling the complex CSS styles used on web pages. From material collected online, there are three kinds of solutions:
- Client mode: the front end or back end calls an installed client program and uses it to produce the PDF. Tools tested: wkhtmltopdf and PhantomJS.
- Java library mode: Java code parses the CSS styles and translates the HTML file into a PDF file. Libraries tested: iText, Flying Saucer, and PD4ML.
- JavaScript front-end mode: JavaScript running in the browser renders the HTML page for export. Library tested: html2canvas.
Each solution found online was tested against the requirements of an actual project. Performance and functionality are analyzed below.
1. Test Page Introduction
According to the conversion examples published online, all of the above solutions handle simple HTML styles and ordinary table styles. To reflect actual business needs, however, this test uses the Bootstrap (v3.3.6) CSS and adds new CSS3 features to the page. A static HTML page was written on this basis. The page displays in the browser as follows:
2. wkhtmltopdf Test
wkhtmltopdf is a tool built on the WebKit rendering engine that converts HTML to PDF. It can be integrated with multiple scripting languages. Official website: http://wkhtmltopdf.org/
Technical features: wkhtmltopdf converts a web page, as a browser would render it, directly into a PDF. It is a standalone program that must be installed on the server. Java code can invoke it through a command-line call to convert a web page into a PDF file.
Function test: enter the test command in cmd and watch the processing progress.
First parameter: path of wkhtmltow..exe
Second parameter: html page to be converted to pdf
Third parameter: PDF file path and file name
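The same three-parameter call can be issued from Java. The following is a minimal sketch, not the code used in the test: the class name `WkhtmltopdfRunner`, its method names, and the paths in `main` are hypothetical, and it assumes the executable takes the HTML source and the PDF destination as its two arguments, as listed above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Sketch: call wkhtmltopdf from Java with the three parameters listed above. */
final class WkhtmltopdfRunner {

    /** Builds the argument list: executable, HTML source, PDF destination. */
    static List<String> buildCommand(String exePath, String htmlSource, String pdfPath) {
        return Arrays.asList(exePath, htmlSource, pdfPath);
    }

    /** Runs the conversion and returns the process exit code. */
    static int convert(String exePath, String htmlSource, String pdfPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(exePath, htmlSource, pdfPath));
        pb.redirectErrorStream(true);                       // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.INHERIT); // show tool output in the console
        Process process = pb.start();
        return process.waitFor();                           // block until the tool exits
    }

    public static void main(String[] args) {
        // Hypothetical paths: replace with the real install location and files.
        List<String> cmd = buildCommand("C:/tools/wkhtmltopdf.exe",
                "http://localhost:8080/report.html", "D:/out/report.pdf");
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems when a path contains spaces.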
The page export result is as follows:
Test Description:
Test results show that wkhtmltopdf supports the overall Bootstrap CSS style. It does not support new CSS3 features such as circular image styles, so some page styles are lost. For charts, exporting an ECharts chart fails with an error and is not supported. However, ECharts provides an interface for converting a chart to an image, so you can obtain the image and export that to the PDF instead.
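One way to apply the chart-to-image workaround on the server is sketched below. It assumes the page obtains the chart as a base64 data URL (for example from the ECharts `getDataURL()` method) and posts that string to the back end. The class `ChartImageDecoder` and its methods are hypothetical names, not part of the original test.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

/** Sketch: turn a base64 data URL of a chart image into raw image bytes. */
final class ChartImageDecoder {

    /** Decodes a string of the form "data:image/png;base64,...." into bytes. */
    static byte[] decode(String dataUrl) {
        int comma = dataUrl.indexOf(',');
        if (!dataUrl.startsWith("data:") || comma < 0) {
            throw new IllegalArgumentException("Not a data URL");
        }
        String header = dataUrl.substring("data:".length(), comma);
        if (!header.endsWith(";base64")) {
            throw new IllegalArgumentException("Only base64 data URLs are handled");
        }
        return Base64.getDecoder().decode(dataUrl.substring(comma + 1));
    }

    /** Writes the decoded image to a file that the exported page can reference. */
    static void save(String dataUrl, Path target) throws IOException {
        Files.write(target, decode(dataUrl));
    }

    public static void main(String[] args) {
        // "aGVsbG8=" is a stand-in payload, not a real PNG.
        byte[] bytes = decode("data:image/png;base64,aGVsbG8=");
        System.out.println(bytes.length + " bytes decoded");
    }
}
```

The saved file can then be referenced by an ordinary `<img>` tag in the page passed to the converter, which avoids rendering the chart during conversion.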
3. PhantomJS Test
PhantomJS is a WebKit-based headless browser: a full browser with no user interface, in which clicks, page navigation, and other user operations must be scripted. It exposes a JavaScript API, so a JS program can interact with the WebKit core directly, and Java can in turn invoke those scripts. This removes the earlier limitation that high-quality WebKit-based collectors (crawlers) were best developed in C/C++. Installation packages are provided for Windows, Linux, and Mac, so it can be used for data collection projects or automated testing on different platforms. Official website: http://phantomjs.org/
PhantomJS has many functions for web page analysis. This test uses only its ability to render a web page to a file. The test in cmd is as follows:
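A comparable call can be sketched in Java. This assumes the `rasterize.js` example script that ships with PhantomJS, which takes a URL, an output file name, and an optional paper format. The class `PhantomPdfExporter`, the timeout handling, and the paths in `main` are hypothetical and not taken from the original test.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Sketch: export a page to PDF by running PhantomJS with its rasterize.js example. */
final class PhantomPdfExporter {

    /** Builds: phantomjs rasterize.js URL output.pdf [paperFormat]. */
    static List<String> buildCommand(String phantomExe, String scriptPath,
                                     String url, String pdfPath, String paperFormat) {
        List<String> cmd = new ArrayList<>();
        cmd.add(phantomExe);
        cmd.add(scriptPath);
        cmd.add(url);
        cmd.add(pdfPath);
        if (paperFormat != null && !paperFormat.isEmpty()) {
            cmd.add(paperFormat); // e.g. "A4"; omitted when not given
        }
        return cmd;
    }

    /** Runs the command; returns true only if it exits with code 0 before the timeout. */
    static boolean export(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // a page that never finishes loading must not hang the server
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) {
        // Hypothetical paths and URL, shown only to illustrate the argument order.
        List<String> cmd = buildCommand("phantomjs", "examples/rasterize.js",
                "http://localhost:8080/report.html", "report.pdf", "A4");
        System.out.println(String.join(" ", cmd));
    }
}
```

The timeout is a design choice for server use: a headless browser waits for page resources, so an unreachable script or image could otherwise block the calling thread indefinitely.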
The export results on the test page are as follows:
Test Description:
Tests show that PhantomJS supports the Bootstrap style better. It does not support new CSS3 features such as circular image styles, so some page styles are lost. It can export ECharts charts directly. The effect is as follows:
3. iText and Flying Saucer
iText implements HTML-to-PDF conversion quickly but has poor error-correction capability. It supports Chinese characters (the HTML must be Unicode-encoded), but only one Chinese font. It is open source. Flying Saucer also implements HTML-to-PDF conversion with poor error-correction capability. It supports multiple Chinese fonts (some styles are not recognized) and is open source.
Technical features: HTML and CSS styles are parsed in Java code. Currently only simple pages and styles are supported. Compatibility with CSS3 and with complex, interdependent CSS is very poor. Long pages are slow to process. Reference: https://code.google.com/archive/p/flying-saucer/
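The poor error tolerance can be anticipated before conversion. Flying Saucer expects well-formed XHTML as input, so markup that a browser silently repairs can fail outright. The sketch below is a pre-flight check using only the JDK's XML parser. The class `XhtmlPreflight` is a hypothetical helper, and passing this check does not guarantee that the page will render correctly.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: check whether markup is well-formed XML before handing it to a strict converter. */
final class XhtmlPreflight {

    /** Returns true if the markup parses as XML, false if the parser rejects it. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors without printing to stderr
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // e.g. an unclosed <br> or mismatched tags
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("XML parser unavailable or input unreadable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><p>ok</p></body></html>"));
        System.out.println(isWellFormed("<html><body><p>line<br></p></body></html>"));
    }
}
```

Pages that fail the check would need to be cleaned up into XHTML (for example `<br/>` instead of `<br>`) before conversion is attempted.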
Test results: the test page used in this experiment could not be rendered at all. A simpler test page gives the following result:
Test Description:
Tests show that the two open-source projects iText and Flying Saucer have almost no CSS3 compatibility. According to the material consulted, this technology is outdated and the projects are no longer updated or maintained. For exporting simple tables and statistical data, newer options include the export features of Bootstrap Table and the EasyUI datagrid. This solution is not recommended.
4. PD4ML Test
PD4ML is a pure-Java class library. It uses HTML and CSS as the page layout and content definition format for generating PDF documents, which simplifies PDF generation for end users. Reference: http://www.pd4ml.com
The software has the following advantages:
- It supports a comprehensive set of HTML tags and CSS attributes, with relatively little conversion distortion. HTML + CSS can be used for precise layout control.
- It tolerates errors in page markup and CSS syntax better.
- Images are converted and output without additional handling.
The software has the following disadvantages:
- It is not open source. The latest demo version, once downloaded and tested, turned out not to support Chinese conversion; a commercial version must be purchased. (This is a pitfall: the test could not get past the garbled-text problem, and the problem was not resolved afterwards.)
- Some cracked old versions avoid the garbled-text problem, but they support fewer CSS styles than the new versions.
Test results:
Test Description:
The new version renders Chinese characters as garbled text, although it supports some CSS styles. The cracked old version has poor style compatibility and weak Bootstrap support. It can basically generate a data page, and images display without problems. Because it is paid software and its results are imperfect, it is not recommended; for ordinary pages, use template-based export or other tools instead.
5. html2canvas Test
html2canvas is a very good JavaScript library. It uses some new HTML5 and CSS3 features to take screenshots of a web page on the client. It reads the DOM and element style information of the page and renders them onto a canvas image. No server-side rendering is needed; the whole image is created in the client browser. If the browser does not support canvas, FlashCanvas or ExplorerCanvas is used instead. The script works well in Firefox 3.5+, Google Chrome, recent versions of Opera, and IE9 or later. Because browsers render pages differently, the generated images are not identical across browsers. Although the library is still under development, it is promising. The plug-in depends on jQuery; using the latest version is recommended. Its limitations are:
- Cross-origin images are not supported
- Cannot be used in browser plug-ins
- Some browsers do not support SVG images
- Flash is not supported
- iframes are not supported (the original JS code can be modified to support them)
In this test, html2canvas rendered many project pages correctly, including ECharts charts. Only a small number of new CSS3 features were unsupported, so the overall result was good. However, a fatal problem appeared during application testing: after a page module called html2canvas, some CSS on the original page suddenly stopped working. Tracing showed that the html2canvas JS functions alter CSS styles that they do not recognize. Modules that are shown and hidden dynamically are handled especially poorly.
The page effect is as follows:
However, the CSS of the original page then breaks and the page displays abnormally: some hidden elements appear, and the displayed styles are scrambled.
Test Description:
The test shows that html2canvas supports the Bootstrap style better. It does not support new CSS3 features such as circular image styles. Its main advantage is that it is a lightweight front-end solution. To work around the damage to the original page's styles, export the image first and then refresh the page.
6. Summary
The tests above show that most of the HTML-to-PDF methods described online work only for simple HTML. In practical applications they still have many problems and are difficult to apply. Analysis of how these methods work leads to the following conclusions:
- No solution converts an arbitrary HTML page to PDF completely; every one falls short somewhere. If only some form pages are needed and their styles avoid CSS3 properties, client mode or html2canvas can be used.
- Front-end HTML styling is developing fast, and the new CSS3 features are very effective, bringing new rules and syntax. Java conversion libraries such as iText and Flying Saucer cannot keep up because matching conversion functions are not written in time. These open-source projects are older technologies, and their teams later stopped maintaining and updating them.
- PD4ML is, in essence, also Java code processing CSS styles. It is commercial software with a team behind its CSS3 compatibility, and it is stronger than iText and Flying Saucer in both performance and functionality. However, a small number of CSS styles are unsupported, and the garbled Chinese characters could not be resolved.
- In client (browser kernel) mode, PhantomJS is more powerful than wkhtmltopdf. PDF export is only a small part of what it does, and it can also be used for web analysis. PhantomJS is the recommended choice.
- html2canvas is a lightweight front-end tool with flexible usage. Some functions are still incomplete, but the overall effect is good. To work around its side effects on the original page, complete the export first and then refresh the page.
7. Reference Links
http://blog.csdn.net/ouyhong123/article/details/26401967
http://blog.csdn.net/tengdazhang770960436/article/details/41320079
http://www.cnblogs.com/jasondan/p/4108263.html
http://blog.csdn.net/accountwcx/article/details/46785437
http://blog.csdn.net/zdtwyjp/article/details/5769353