How to convert docx/odt to pdf/html with Java?






How do I convert docx/odt to pdf/html with Java? This question comes up all the time on forums like StackOverflow. So I decided to write an article about this topic to enumerate the open source Java frameworks which handle it.



Here are some paid products which manage docx/odt to pdf/html conversion:

Aspose.Words for Java, which manages only docx conversion.
Docmosis, which manages docx and odt conversion.
Muhimbi PDF Converter Services.



To be honest with you, I have not tried those solutions because they are paid products. I won't speak about them in this article.



Here are some open source products which manage docx/odt to pdf/html conversion:

JODConverter: JODConverter automates conversions between office document formats using OpenOffice.org or LibreOffice. Supported formats include OpenDocument, PDF, RTF, HTML, Word, Excel, PowerPoint, and Flash. It can be used as a Java library, a command line tool, or a web application.
docx4j: docx4j is a Java library for creating and manipulating Microsoft Open XML (Word docx, PowerPoint pptx, and Excel xlsx) files. It is similar to Microsoft's OpenXML SDK, but for Java. docx4j uses JAXB to create the in-memory object representation.
XDocReport, which provides docx converters which work with Apache POI XWPF and iText 2.1.7 for PDF, and odt converters which work with ODFDOM and iText 2.1.7 for PDF.



Here are the criteria that I think are important for converters:

Best renderer: the converter must not lose formatting information.
Fast: the converter must be as fast as possible.
Less memory intensive, to avoid OutOfMemory problems.
Streaming: use InputStream/OutputStream instead of File. Using streams instead of files avoids some problems (the hard disk isn't used, so there is no need for write permission on the hard disk).
Easy to install: no need to install OpenOffice/LibreOffice or MS Word on the server to run the converter.



In this article I will introduce those 3 Java converter frameworks and compare them to give the pros and cons of each one. I will try to be frank about it, because I'm one of the XDocReport developers.



If you want to quickly compare the conversion results, performance, etc. of docx4j and XDocReport, you can play with our live demo, which provides a JAX-RS REST converter service.



The goal of this article is to introduce those 3 converter frameworks and share my skills about converting odt and docx to PDF.


Download

You can download the samples of the docx/odt converters explained in this article:

org.samples.docxconverters.jodconverter.zip: samples of docx to pdf/html conversion with JODConverter.
org.samples.docxconverters.docx4j.zip: samples of docx to pdf/html conversion with docx4j.
org.samples.docxconverters.xdocreport.zip: samples of docx to pdf/html conversion with XDocReport (Apache POI XWPF).


How to manage PDFs with Java?

Here are the 3 best known Java PDF libraries:

Apache FOP: Apache FOP (Formatting Objects Processor) is a print formatter driven by XSL formatting objects (XSL-FO) and an output independent formatter. It is a Java application which reads a formatting object (FO) tree and renders the resulting pages to a specified output. Output formats currently supported include PDF, PS, PCL, AFP, XML (area tree representation), Print, AWT and PNG, and to a lesser extent, RTF and TXT. The primary output target is PDF.
Apache PDFBox: the Apache PDFBox library is an open source Java tool for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and extraction of content from documents. Apache PDFBox also includes several command line utilities. Apache PDFBox is published under the Apache License v2.0.
iText: iText is a library that allows you to create and manipulate PDF documents. It enables developers to enhance web and other applications with dynamic PDF document generation and/or manipulation.


With iText, there are 2 versions: 2.1.x, which is under the MPL license, and 5.x, which is under the AGPL license.


 How do I convert Docx/odt to pdf/html with Java?

Just for information, docx and odt files are zip archives which are composed of:

several XML entries, like word/document.xml (docx) or content.xml (odt), which describes the content of the document in XML, styles.xml, which describes the styles used, etc.
binary data for images.
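To see this structure for yourself, the following is a minimal sketch that lists the entries of a docx or odt file using only the JDK zip classes. The class name ZipEntryLister and the default path are assumptions made for illustration; they are not part of the downloadable samples.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of any of the converter frameworks.
final class ZipEntryLister {

    /** Reads a docx/odt stream as a zip archive and returns its entry names in archive order. */
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // The default path is an assumption: pass any docx or odt file as first argument.
        String path = args.length > 0 ? args[0] : "docx/helloworld.docx";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            for (String name : listEntries(in)) {
                System.out.println(name);
            }
        }
    }
}
```

Run against a docx, the listing should include word/document.xml; against an odt, content.xml.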



To compare performance between the JODConverter, docx4j and XDocReport converters, tests must follow 2 rules:

Logs must be disabled, to ignore the time spent generating logs (e.g. docx4j generates a lot of logs, which degrades performance).
Convert the docx/odt to html/pdf twice, to ignore the initialization time of the converter framework (e.g. ignore the time to connect to LibreOffice with JODConverter, ignore the time to load the JAXB classes of docx4j, etc).

To compare our converter frameworks, we will convert the docx twice and retain the last elapsed time.
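The "convert twice, keep the last time" rule can be sketched as a small harness. This is only an illustration under stated assumptions: ConversionTimer and lastElapsedMillis are hypothetical names, and the workload in main is a placeholder standing in for a real converter call.

```java
// Hypothetical timing helper, not part of the downloadable samples.
final class ConversionTimer {

    /** Runs the conversion `runs` times and returns the elapsed milliseconds of the last run only. */
    static long lastElapsedMillis(Runnable conversion, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        long elapsedMillis = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            conversion.run();
            // Overwritten on each pass, so warm-up runs are discarded.
            elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        }
        return elapsedMillis;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace the lambda body with a real converter call.
        Runnable conversion = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };
        System.out.println("Last run took " + lastElapsedMillis(conversion, 2) + " ms");
    }
}
```

The first run absorbs one-time costs such as class loading or connecting to an office process, so only the second measurement reflects the steady-state conversion time.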



To compare the quality of the conversion, each sample converter project includes several docx files which are designed with tables (borders, row/column spans), headers/footers, images, etc. In this article we will study only a simple docx, helloworld.docx:






But you can run the other docx files in each Java Eclipse project to see the result of the HTML and PDF conversion.


  JODConverter with docx

To test and use JODConverter, you need to install OpenOffice or LibreOffice. In my case I have installed LibreOffice 3.5 on Windows.



The org.samples.docxconverters.jodconverter Eclipse project that you can download is a sample docx converter with JODConverter. This project contains:

a docx folder which contains several docx files to convert. Those docx files come from the XDocReport Git repository, where we use them to test our converter.
pdf and html folders where the docx files will be converted.
a lib folder with the JODConverter and dependency JARs.



Download JARs


To download the JODConverter JARs, download the zip jodconverter-core-3.0-beta-4-dist.zip, unzip it and copy the lib folder of the zip to your Eclipse Java project. Add those JARs to your classpath.



My test is done with LibreOffice 3.5, and the official distribution doesn't work with LibreOffice 3.5 (issue 103).
To fix this problem, I have replaced the official JAR jodconverter-core-3.0-beta-4.jar with jodconverter-core-3.0-beta-4-jahia2.jar.


  HTML Converter

Here is the JODConverter Java code which converts «docx/helloworld.docx» to «html/helloworld.html» twice:


 
package org.samples.docxconverters.jodconverter.html;

import java.io.File;

import org.artofsolving.jodconverter.OfficeDocumentConverter;
import org.artofsolving.jodconverter.office.DefaultOfficeManagerConfiguration;
import org.artofsolving.jodconverter.office.OfficeManager;

public class HelloWorldToHTML {

	public static void main(String[] args) {
		// 1) Start LibreOffice in headless mode.
		OfficeManager officeManager = null;
		try {
			officeManager = new DefaultOfficeManagerConfiguration()
					.setOfficeHome(new File("C:/Program Files/LibreOffice 3.5"))
					.buildOfficeManager();
			officeManager.start();

			// 2) Create JODConverter converter
			OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);

			// 3) Create HTML twice, so the second timing excludes initialization
			createHTML(converter);
			createHTML(converter);
		} finally {
			// 4) Stop LibreOffice in headless mode.
			if (officeManager != null) {
				officeManager.stop();
			}
		}
	}

	private static void createHTML(OfficeDocumentConverter converter) {
		try {
			long start = System.currentTimeMillis();
			converter.convert(new File("docx/helloworld.docx"), new File("html/helloworld.html"));
			System.err.println("Generate html/helloworld.html with " + (System.currentTimeMillis() - start) + "ms");
		} catch (Throwable e) {
			e.printStackTrace();
		}
	}
}


You can notice that this code uses java.io.File for the docx input and the HTML output, because JODConverter cannot work with streams.
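When the calling code only has streams, one possible workaround for a file-only converter is to go through temporary files. The sketch below is an assumption-laden illustration, not JODConverter's own API: StreamToFileBridge and its FileConverter interface are hypothetical names, and the suffix parameters exist because a converter may infer the formats from the file extensions. This approach still needs write permission on the temp directory.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical adapter, not part of JODConverter.
final class StreamToFileBridge {

    /** Hypothetical stand-in for any converter that only accepts files. */
    interface FileConverter {
        void convert(File input, File output) throws IOException;
    }

    /** Copies the input stream to a temp file, converts it, then copies the result to the output stream. */
    static void convert(FileConverter converter, InputStream in, String inSuffix,
                        OutputStream out, String outSuffix) throws IOException {
        File input = File.createTempFile("convert-in", inSuffix);
        File output = File.createTempFile("convert-out", outSuffix);
        try {
            Files.copy(in, input.toPath(), StandardCopyOption.REPLACE_EXISTING);
            converter.convert(input, output);
            Files.copy(output.toPath(), out);
        } finally {
            // Best-effort cleanup of both temp files.
            input.delete();
            output.delete();
        }
    }
}
```

With JODConverter, the FileConverter argument could be a lambda that delegates to the converter's convert(File, File) method, with ".docx" and ".html" (or ".pdf") as suffixes.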



After running this class, you will see on the console a few JODConverter logs and the elapsed time:




Generate html/helloworld.html with 12109ms
Generate html/helloworld.html with 391ms


JODConverter converts a simple helloworld.docx to HTML in 391ms. The quality of the conversion is perfect.



Note that the connection to LibreOffice takes a long time (5219ms), and so does the disconnection.


  PDF Converter

Here is the JODConverter Java code which converts «docx/helloworld.docx» to «pdf/helloworld.pdf» twice:


package org.samples.docxconverters.jodconverter.pdf;

import java.io.File;

import org.artofsolving.jodconverter.OfficeDocumentConverter;
import org.artofsolving.jodconverter.office.DefaultOfficeManagerConfiguration;
import org.artofsolving.jodconverter.office.OfficeManager;

public class HelloWorldToPDF {

	public static void main(String[] args) {
		// 1) Start LibreOffice in headless mode.
		OfficeManager officeManager = null;
		try {
			officeManager = new DefaultOfficeManagerConfiguration()
					.setOfficeHome(new File("C:/Program Files/LibreOffice 3.5"))
					.buildOfficeManager();
			officeManager.start();

			// 2) Create JODConverter converter
			OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);

			// 3) Create PDF twice, so the second timing excludes initialization
			createPDF(converter);
			createPDF(converter);
		} finally {
			// 4) Stop LibreOffice in headless mode.
			if (officeManager != null) {
				officeManager.stop();
			}
		}
	}

	private static void createPDF(OfficeDocumentConverter converter) {
		try {
			long start = System.currentTimeMillis();
			converter.convert(new File("docx/helloworld.docx"), new File("pdf/helloworld.pdf"));
			System.err.println("Generate pdf/helloworld.pdf with " + (System.currentTimeMillis() - start) + "ms");
		} catch (Throwable e) {
			e.printStackTrace();
		}
	}
}


After running this class, you will see on the console a few JODConverter logs and the elapsed time:




Generate pdf/helloworld.pdf with 3172ms
Generate pdf/helloworld.pdf with 468ms


JODConverter converts a simple helloworld.docx to PDF in 468ms. The quality of the conversion is perfect.


docx4j

docx4j provides several docx converters:

a docx to HTML converter.
a docx to PDF converter based on XSL-FO and FOP.



The org.samples.docxconverters.docx4j Eclipse project that you can download is a sample docx converter with docx4j. This project contains:

a docx folder which contains several docx files to convert. Those docx files come from the XDocReport Git repository, where we use them to test our converter.
pdf and html folders where the docx files will be converted.
a lib folder with the docx4j and dependency JARs.






For docx4j, logs must be disabled because it generates a lot of logs, which degrades performance. To do that, create src/docx4j.properties like this:




docx4j.Log4j.Configurator.disabled=true

Create src/log4j.properties like this:


log4j.rootLogger=ERROR

Download with Maven



To download docx4j and its dependency JARs, the best way is to use Maven with this pom:




<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>org.samples.docxconverters.docx4j</groupId>
	...
