A solution for generating a bookmarked PDF document from a batch of URLs

Last Update:2017-02-27 Source: Internet

Author: User

First, the origin

When we come across a good article or a great blog, the two earliest habits were to add it to Favorites or to use Save As. Later a new habit appeared: reposting it to our own blog or to a collection site (such as 360doc). More recently, generating PDF documents has become an option. Some sites produce the corresponding PDF when you submit a URL, JavaEye's e-book production is also very good, and one could even predict that the browser's "Save As" window may one day offer a *.pdf option, because PDF is a good format: compact and expressive.

However, when there are many good articles, such as a very good series (a tutorial or development notes written by an expert), all we can do is add them to Favorites or to the link list on our blog. Back when I had no money to buy books, I looked for tutorials on the internet. The better ones had a page listing all the links, and in that situation I would usually use Thunder to download all the links in bulk (then remove the unrelated ones) and do some simple processing to provide something like previous-article and next-article links. This is still the case today. (This is a question of ease of use, which can be seen as a big problem or a small one: at the large end it relates to the application model and the business model, at the small end it may just be a matter of being attractive and convenient. JavaEye's quick reading view shows the table of contents as a tree menu on the left and the content on the right, which is one form of interface design.)

This article describes a solution that generates a bookmarked PDF document from a batch of URLs: given the URLs of some good articles, it produces one merged PDF document that has bookmarks (that is, the tree menu on the left). The bookmarks are a must. Many people have surely read the book "Java and Mode", a very thick book. I had no money to buy it, so what I read was a downloaded PDF, and that PDF left a very bad impression: it had no bookmarks, so the only way to find anything was to drag the scroll bar. I still read it through, because the writing is good, which makes up for the person who made the PDF.

Second, the approach

The goal is to generate a bookmarked PDF document from a batch of URLs. This is done in two steps: first, generate a PDF document from each URL; then merge the PDFs and create the bookmarks.

(1) Generating a PDF document from a URL

Generating a PDF document from a URL looks easy, because open source frameworks such as iText and PDFBox exist. It is not, because the resulting PDF has to look the same as the page does in a browser. That amounts to writing a browser, and browsers themselves still have compatibility problems with one another, so writing my own HTML-to-PDF conversion is hard to justify.

Another idea is to use websites that offer this service. After trying several, I found that some sites require a URL and an email address and send the finished PDF to your mailbox; this cannot be driven from code, so it cannot be batched. Other sites take just a URL and return the generated PDF in the response, which a program can batch, but the resulting PDF looks too different from the browser rendering, and some sites do not support Chinese at all.

After some exploration, I finally found a website that provides a DLL written in C# that meets the requirement. With this DLL, a simple C# program can generate PDFs in batches, and the result is nearly perfect. The drawback is that the generated PDFs carry someone else's watermark.

(2) Merging multiple PDFs and generating bookmarks

Merging multiple PDFs and creating bookmarks is easy with iText. The merge happens in a definite order and the bookmarks form a tree, so the merge order and the bookmark hierarchy have to be decided in advance. The batch of URLs therefore needs some kind of description, and XML is the natural choice.
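One consequence of fixing the merge order in advance: a bookmark for a given source PDF presumably has to point at the page where that PDF begins in the merged document, so the order also determines every bookmark target. A minimal Python sketch of that bookkeeping, assuming the page count of each PDF is already known. The file names and page counts are made up for illustration, and no PDF library is involved:

```python
def start_pages(page_counts):
    """Return the 1-based page on which each PDF starts after merging in order."""
    starts = []
    next_page = 1
    for count in page_counts:
        starts.append(next_page)   # this PDF begins where the previous one ended
        next_page += count
    return starts

# Hypothetical merge order and page counts (not taken from the article).
merge_order = ["a.pdf", "b.pdf", "c.pdf"]
page_counts = [3, 5, 2]

targets = dict(zip(merge_order, start_pages(page_counts)))
print(targets)
```

Changing the order of the list changes every later target page, which is why the order has to be settled before any bookmark is written.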

Third, the implementation

I increasingly feel that anything that is not infrastructure is technically simple, and that what matters is whether you have the idea. This implementation begins with the XML description.

The XML description has two parts: the first describes the batch of URLs (called Href.h2p.xml), and the second describes the hierarchical relationship (called outline.h2p.xml). h2p stands for HTML to PDF.

First, look at Href.h2p.xml.

 
This XML is simple. A URL usually contains &, which cannot appear literally in XML, and an attribute value cannot be wrapped in <![CDATA[]]>, so the URL is stored as a node rather than as an attribute.
 
Each PDF file generated from this XML is named after the value of the id, with the suffix pdf.
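The listing of Href.h2p.xml is not shown here, so the following Python sketch assumes a layout: a root `hrefs` element holding `href` elements, each with an `id` attribute and the URL as CDATA content. The element names and the example.com URLs are assumptions made for illustration; only the ideas of an id, a URL stored as node content, and the pdf suffix come from the text.

```python
import xml.etree.ElementTree as ET

# Assumed layout (hypothetical element names, made-up URLs).
HREF_XML = """<hrefs>
  <href id="KxgYaRxG"><![CDATA[http://example.com/news?cat=1&page=2]]></href>
  <href id="53Bw5A32"><![CDATA[http://example.com/sport?cat=3&page=4]]></href>
</hrefs>"""

def read_hrefs(xml_text):
    """Map each id to (url, pdf file name); the PDF is named <id>.pdf."""
    root = ET.fromstring(xml_text)
    return {
        node.get("id"): (node.text.strip(), node.get("id") + ".pdf")
        for node in root.iter("href")
    }

def is_well_formed(xml_text):
    """Return True if the text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

hrefs = read_hrefs(HREF_XML)
for ident, (url, pdf_file) in hrefs.items():
    print(ident, url, pdf_file)

# A raw & inside an attribute value is rejected by the parser.
print(is_well_formed('<href url="http://example.com/?a=1&b=2" />'))
```

The CDATA section lets the URL keep its literal &, while the same URL placed unescaped in an attribute makes the file unparseable.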
 
The contents of outline.h2p.xml are as follows:
 
<book name="My PDF Book">
    <chapter name="163" href="KxgYaRxG">
        <chapter name="163 News" href="eyEis6ra" />
        <chapter name="163 Sports" href="DMQoSN2t" />
    </chapter>
    <chapter name="sohu" href="53Bw5A32">
        <chapter name="sohu News" href="5vaf3LN7" />
    </chapter>
</book>
 
This XML describes the order in which the PDFs are merged. The value of href corresponds to an id value in the previous XML, the nesting level of the chapter tags is the level of the bookmark, and the value of name is the name of the bookmark. iText merges the individual PDFs into one PDF according to this XML and generates the bookmarks.
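The reading of this file can be sketched in Python with the standard library alone. The XML below mirrors the outline above, written with plain ASCII spaces and English names. The sketch assumes that the merge order is the depth-first document order of the chapter tags; the function name `flatten_outline` is hypothetical, and the actual merging with iText is not shown.

```python
import xml.etree.ElementTree as ET

OUTLINE_XML = """<book name="My PDF Book">
  <chapter name="163" href="KxgYaRxG">
    <chapter name="163 News" href="eyEis6ra" />
    <chapter name="163 Sports" href="DMQoSN2t" />
  </chapter>
  <chapter name="sohu" href="53Bw5A32">
    <chapter name="sohu News" href="5vaf3LN7" />
  </chapter>
</book>"""

def flatten_outline(xml_text):
    """Walk the chapters depth-first.

    Returns (level, bookmark name, pdf file) tuples in the assumed merge order;
    level 1 is a top-level bookmark, level 2 is nested under it, and so on.
    """
    entries = []

    def visit(node, level):
        for chapter in node.findall("chapter"):
            entries.append((level, chapter.get("name"), chapter.get("href") + ".pdf"))
            visit(chapter, level + 1)

    visit(ET.fromstring(xml_text), 1)
    return entries

outline = flatten_outline(OUTLINE_XML)
for level, name, pdf_file in outline:
    print("  " * (level - 1) + name + " -> " + pdf_file)
```

The flat list carries everything a merge step needs: the order of the files, and for each file the bookmark title and its depth in the tree.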
 
I refer to these two XML files as h2p files.
 
Fourth, the h2p files
 
At this point the solution itself is complete, but to use it we first need the two XML files above. Editing them by hand is fine for a small number of URLs and inconvenient for many, so there should be a tool for editing h2p files.
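As a rough idea of what the simplest part of such a tool could do, the Python sketch below builds the URL description from a plain list of URLs. Everything about the layout is an assumption: the `hrefs` and `href` element names are hypothetical, and generating random 8-character ids is only a guess based on the look of the ids in the outline. The URLs are made up.

```python
import random
import string
from xml.dom.minidom import Document

ALPHABET = string.ascii_letters + string.digits

def new_id(rng, used, length=8):
    """Draw a random id that has not been handed out yet."""
    while True:
        ident = "".join(rng.choice(ALPHABET) for _ in range(length))
        if ident not in used:
            used.add(ident)
            return ident

def build_href_xml(urls, rng=None):
    """Return (ids, xml text) with one CDATA-wrapped URL node per input URL."""
    if rng is None:
        rng = random.Random()
    doc = Document()
    root = doc.createElement("hrefs")          # hypothetical root element
    doc.appendChild(root)
    used, ids = set(), []
    for url in urls:
        node = doc.createElement("href")       # hypothetical element name
        ident = new_id(rng, used)
        node.setAttribute("id", ident)
        # CDATA keeps the literal & in the URL without escaping it.
        node.appendChild(doc.createCDATASection(url))
        root.appendChild(node)
        ids.append(ident)
    return ids, doc.toxml()

ids, xml_text = build_href_xml(
    ["http://example.com/a?x=1&y=2", "http://example.com/b?x=3&y=4"],
    random.Random(0),
)
print(xml_text)
```

The returned ids are what the outline file would then reference in its href attributes.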

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
