Tips for converting Word and .txt files to HTML and PDF using POI, jsoup, and iText

Source: Internet
Author: User
Tags tidy maven central

This is my first blog post. If anything is missing or wrong, please point it out so we can learn and discuss together.
I ran into this problem in a project. I found plenty of approaches online, but they were all much the same and each had some problem, so I have summarized here how to convert Word files to HTML and PDF.
POI is not especially powerful, but it does not depend on a local Office installation. Jacob can also convert Word to HTML, but it relies on a local Office installation and works only on Windows, not on UNIX systems.
Jacob is the easier of the two to use. If you need the Jacob approach, I can share it.
Converting a .txt file to HTML only needs I/O operations: read the .txt file and write its content out as HTML. No additional jar is required.
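The read-then-write idea for .txt files can be sketched with the standard library alone. This is a minimal sketch, not the code from the class below: the class name `TxtToHtmlSketch` and the helpers `escapeHtml` and `convert` are hypothetical names chosen for illustration, and escaping `<`, `>`, and `&` is an extra step I am assuming you want so that text content is not read as markup.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TxtToHtmlSketch {
    /** Escapes the characters that would otherwise be read as markup. */
    static String escapeHtml(String s) {
        // '&' must be replaced first so the entities added afterwards are not re-escaped.
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Reads txt in sourceCharset and writes a small UTF-8 HTML file, one <br/> per line. */
    static void convert(Path txt, Charset sourceCharset, Path html) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(txt, sourceCharset);
             BufferedWriter out = Files.newBufferedWriter(html, StandardCharsets.UTF_8)) {
            out.write("<html><head><meta charset=\"UTF-8\"/></head><body>");
            out.newLine();
            String line;
            while ((line = in.readLine()) != null) {
                out.write(escapeHtml(line));
                out.write("<br/>");
                out.newLine();
            }
            out.write("</body></html>");
            out.newLine();
        }
    }

    public static void main(String[] args) throws IOException {
        Path txt = Files.createTempFile("sample", ".txt");
        Path html = Files.createTempFile("sample", ".html");
        Files.write(txt, "a < b\nsecond line".getBytes(StandardCharsets.UTF_8));
        convert(txt, StandardCharsets.UTF_8, html);
        System.out.println(new String(Files.readAllBytes(html), StandardCharsets.UTF_8));
    }
}
```

For a GBK-encoded source file, as assumed in the class below, you would pass `Charset.forName("GBK")` as the source charset, provided your JDK ships that charset.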

Note: When using POI, watch out for the following issues. I was not aware of them when I built this feature and could not track down the cause for a long time. If any experts know why they happen, please explain.

1. Word documents in .doc and .docx format created with Office convert without problems. For documents created with WPS, only .doc files convert correctly: a converted .docx file has no pictures, and the output contains no img tags.
2. When converting a Word document to PDF, the resulting PDF does not display Chinese text.
3. When converting Word to PDF, the generated HTML file must first be converted into standard (well-formed) HTML. Otherwise parsing fails because tags such as <meta> or <img> are not closed.
4. The jars used below can be downloaded from the Maven Central repository.
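Item 3 can be illustrated without any of the third-party jars. The sketch below uses the JDK's own XML parser to show that an unclosed `<meta>` tag is rejected as soon as the markup is read as XML. I am assuming this is the kind of strict parsing the PDF step performs. The class name `WellFormedCheck` and the method `isWellFormed` are hypothetical names for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an element was never closed
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String unclosed = "<html><head><meta charset=\"UTF-8\"></head><body><p>hi</p></body></html>";
        String closed = "<html><head><meta charset=\"UTF-8\"/></head><body><p>hi</p></body></html>";
        System.out.println(isWellFormed(unclosed)); // false
        System.out.println(isWellFormed(closed));   // true
    }
}
```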
The code follows. If you have questions, leave them in the comments below so we can learn from each other.
Call the methods directly when you use them. If you find this useful, a like would be appreciated. Thank you.
package com.kqco.tools;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.PictureType;
import org.apache.poi.xwpf.converter.core.BasicURIResolver;
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.jsoup.Jsoup;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;
import org.xhtmlrenderer.pdf.ITextFontResolver;
import org.xhtmlrenderer.pdf.ITextRenderer;
import com.lowagie.text.pdf.BaseFont;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileConverter {

    /*
     * Word file to HTML file
     * sourceFilePath: source Word file path
     * targetFilePosition: path of the HTML file generated by the conversion
     */
    public void wordToHtml(String sourceFilePath, String targetFilePosition) throws Exception {
        // Extension including the dot, e.g. ".docx"
        String extension = sourceFilePath.substring(sourceFilePath.lastIndexOf("."));
        if (".docx".equals(extension)) {
            docxToHtml(sourceFilePath, targetFilePosition);
        } else if (".doc".equals(extension)) {
            docToHtml(sourceFilePath, targetFilePosition);
        } else {
            throw new RuntimeException("file format is incorrect");
        }
    }

    /*
     * .doc converted to HTML
     * sourceFilePath: source Word file path
     * targetFilePosition: path of the generated HTML file
     */
    private void docToHtml(String sourceFilePath, String targetFilePosition) throws Exception {
        // Pictures are written here; the directory must already exist.
        final Path imagePath = Paths.get(targetFilePosition).getParent().resolve("image");
        HWPFDocument wordDocument;
        try (FileInputStream in = new FileInputStream(sourceFilePath)) {
            wordDocument = new HWPFDocument(in);
        }
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(document);
        // Save each picture and return the relative path used for it in the HTML
        wordToHtmlConverter.setPicturesManager(new PicturesManager() {
            @Override
            public String savePicture(byte[] content, PictureType pictureType, String name,
                                      float width, float height) {
                try (FileOutputStream out = new FileOutputStream(imagePath.resolve(name).toString())) {
                    out.write(content);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return "./tmp/image/" + name;
            }
        });
        wordToHtmlConverter.processDocument(wordDocument);
        Document htmlDocument = wordToHtmlConverter.getDocument();
        DOMSource domSource = new DOMSource(htmlDocument);
        StreamResult streamResult = new StreamResult(new File(targetFilePosition));
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer serializer = tf.newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(domSource, streamResult);
    }

    /*
     * .docx converted to HTML
     * sourceFilePath: source Word file path
     * targetFileName: path of the generated HTML file
     */
    private void docxToHtml(String sourceFilePath, String targetFileName) throws Exception {
        String imagePathStr = Paths.get(targetFileName).getParent()
                .resolve("../tmp/image/word/media").toString();
        OutputStreamWriter outputStreamWriter = null;
        try {
            XWPFDocument document = new XWPFDocument(new FileInputStream(sourceFilePath));
            XHTMLOptions options = XHTMLOptions.create();
            // Folder where the pictures are stored
            options.setExtractor(new FileImageExtractor(new File(imagePathStr)));
            // Path to the pictures as written into the HTML
            options.URIResolver(new BasicURIResolver("./tmp/image/word/media"));
            outputStreamWriter = new OutputStreamWriter(new FileOutputStream(targetFileName), "UTF-8");
            XHTMLConverter xhtmlConverter = (XHTMLConverter) XHTMLConverter.getInstance();
            xhtmlConverter.convert(document, outputStreamWriter, options);
        } finally {
            if (outputStreamWriter != null) {
                outputStreamWriter.close();
            }
        }
    }

    /*
     * .txt document to HTML
     * filePath: path of the original .txt file
     * htmlPosition: path of the HTML file generated by the conversion
     */
    public void txtToHtml(String filePath, String htmlPosition) {
        try {
            String encoding = "GBK";
            File file = new File(filePath);
            if (file.isFile() && file.exists()) { // check that the file exists
                // Read with the source encoding and write the HTML as UTF-8
                try (BufferedReader bufferedReader = new BufferedReader(
                             new InputStreamReader(new FileInputStream(file), encoding));
                     BufferedWriter bw = new BufferedWriter(
                             new OutputStreamWriter(new FileOutputStream(new File(htmlPosition)), "UTF-8"))) {
                    String lineTxt;
                    while ((lineTxt = bufferedReader.readLine()) != null) {
                        // "<br/>" replaces the original "</br>", which is not a valid tag
                        bw.write(lineTxt + "<br/>");
                    }
                }
            } else {
                System.out.println("The specified file cannot be found");
            }
        } catch (Exception e) {
            System.out.println("Error reading file contents");
            e.printStackTrace();
        }
    }

    /*
     * Move a picture to the specified path (implemented as a byte-by-byte copy)
     * sourceFilePath: the original path
     * targetFilePosition: the path the picture is stored at after the move
     */
    public void changeImageUrl(String sourceFilePath, String targetFilePosition) throws IOException {
        try (BufferedInputStream bufis = new BufferedInputStream(new FileInputStream(sourceFilePath));
             BufferedOutputStream bufos = new BufferedOutputStream(new FileOutputStream(targetFilePosition))) {
            int len;
            while ((len = bufis.read()) != -1) {
                bufos.write(len);
            }
        }
    }

    /*
     * Parse an HTML file into XHTML so that it becomes a standard HTML file
     * f_in: source HTML file path
     * outfile: path of the XHTML file that is written
     */
    private boolean parseToXhtml(String f_in, String outfile) {
        boolean bo = false;
        ByteArrayOutputStream tidyOutStream = null; // output stream
        FileInputStream fis = null;
        ByteArrayOutputStream bos = null;
        ByteArrayInputStream stream = null;
        DataOutputStream to = null;
        try {
            fis = new FileInputStream(f_in);
            bos = new ByteArrayOutputStream();
            int ch;
            while ((ch = fis.read()) != -1) {
                bos.write(ch);
            }
            byte[] bs = bos.toByteArray();
            bos.close();
            // The input is treated as GB2312. The original called getBytes() without a
            // charset, which only round-trips on a platform whose default is GB2312/GBK;
            // the charset is named explicitly here so the result is the same everywhere.
            String hope_gb2312 = new String(bs, "gb2312");
            byte[] hope_b = hope_gb2312.getBytes("gb2312");
            String basil = new String(hope_b, "gb2312");
            stream = new ByteArrayInputStream(basil.getBytes("gb2312"));
            tidyOutStream = new ByteArrayOutputStream();
            Tidy tidy = new Tidy();
            tidy.setInputEncoding("gb2312");
            tidy.setQuiet(true);
            tidy.setOutputEncoding("UTF-8"); // Tidy performs the GB2312 to UTF-8 conversion
            tidy.setShowWarnings(true); // true shows warnings; pass false to suppress them
            tidy.setIndentContent(true);
            // tidy.setSmartIndent(true);
            tidy.setIndentAttributes(false);
            tidy.setWraplen(1024); // line length at which output is wrapped
            // Output is XHTML
            tidy.setXHTML(true);
            tidy.setErrout(new PrintWriter(System.out));
            tidy.parse(stream, tidyOutStream);
            // Write the generated XHTML to the output file
            to = new DataOutputStream(new FileOutputStream(outfile));
            tidyOutStream.writeTo(to);
            bo = true;
        } catch (Exception ex) {
            System.out.println(ex.toString());
            ex.printStackTrace();
            return bo;
        } finally {
            try {
                if (to != null) {
                    to.close();
                }
                if (stream != null) {
                    stream.close();
                }
                if (fis != null) {
                    fis.close();
                }
                if (bos != null) {
                    bos.close();
                }
                if (tidyOutStream != null) {
                    tidyOutStream.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.gc();
        }
        return bo;
    }

    /*
     * XHTML file to PDF file
     * inputFile: XHTML source file path
     * outputFile: output PDF file path
     * imagePath: image storage path, for example (file:/d:/test)
     */
    private boolean convertHtmlToPdf(String inputFile, String outputFile, String imagePath) throws Exception {
        // The original signature had no imagePath parameter and passed the literal string
        // "imagePath" as the base URL; the parameter documented above is used here instead.
        OutputStream os = new FileOutputStream(outputFile);
        ITextRenderer renderer = new ITextRenderer();
        String url = new File(inputFile).toURI().toURL().toString();
        renderer.setDocument(url);
        // Solve the Chinese display problem by registering a Chinese font
        ITextFontResolver fontResolver = renderer.getFontResolver();
        fontResolver.addFont("C:/WINDOWS/FONTS/SIMSUN.TTC", BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
        // Solve the relative path problem for pictures
        renderer.getSharedContext().setBaseURL(imagePath);
        renderer.layout();
        renderer.createPDF(os);
        os.flush();
        os.close();
        return true;
    }

    /*
     * XHTML to standard HTML file
     * targetHtml: path of the HTML file to be processed
     */
    private static void standardHtml(String targetHtml) throws IOException {
        File f = new File(targetHtml);
        org.jsoup.nodes.Document doc = Jsoup.parse(f, "UTF-8");
        doc.select("meta").removeAttr("name");
        doc.select("meta").attr("content", "text/html; charset=UTF-8");
        doc.select("meta").attr("http-equiv", "Content-Type");
        // Giving <meta> and <img> some inner content makes jsoup write a closing tag.
        // The placeholder strings "the" and "a" are reproduced from the source and may be garbled.
        doc.select("meta").html("the");
        doc.select("img").html("a");
        doc.select("style").attr("mce_bogus", "1");
        doc.select("body").attr("font-family", "SimSun");
        doc.select("html").before("<?xml version='1.0' encoding='UTF-8'?>");
        /* jsoup only parses; it does not save changes, so write them back to the file here. */
        FileOutputStream fos = new FileOutputStream(f, false);
        OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
        osw.write(doc.html());
        System.out.println(doc.html());
        osw.close();
    }
}
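The wordToHtml method chooses a converter from the text after the last dot, so a path with no dot at all fails inside `substring`, and an upper-case extension such as `.DOCX` is rejected. A more defensive check might look like the sketch below. The class name `ExtensionSketch` and the method `extensionOf` are hypothetical names, not part of the class above, and lower-casing the extension is my assumption about the desired behavior.

```java
import java.util.Locale;

class ExtensionSketch {
    /** Returns the lower-cased extension including the dot, or "" if the file name has none. */
    static String extensionOf(String path) {
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return ""; // no dot, or the last dot belongs to a directory name
        }
        return path.substring(dot).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("C:\\docs\\Report.DOCX")); // .docx
        System.out.println(extensionOf("/home/user/notes.doc"));  // .doc
        System.out.println(extensionOf("/home/user.v2/notes").isEmpty()); // true
    }
}
```

With a helper like this, the dispatch can compare against ".docx" and ".doc" and fall through to the "file format is incorrect" error for everything else, including paths without an extension.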

  
