Tuesday, February 17, 2015

iTextSharp hates your HTML5

When it comes to PDF generation in .NET, there aren't many open source projects to choose from, so by default many people choose the most popular community-driven project, iTextSharp. With a relatively gentle learning curve and a decent API, iTextSharp can address most simple scenarios for generating PDFs, and with a little more effort it can render more complex PDFs as well. iTextSharp was essentially free under the LGPL license, but with the introduction of version 5, the move to the AGPL license has muddied the waters a bit.

This post addresses a potential gotcha when using the iTextSharp XmlWorker to render HTML5 to PDF.

I was working on a solution that clipped HTML from the page on the client using jQuery's .html() method and posted it to an ASP.NET Web API controller, which would take the posted HTML and create a PDF using iTextSharp.

The issue I encountered is that iTextSharp's XmlWorker would not process the HTML from the client. At first I did not understand why the XmlWorker was failing. I could already see that the HTML was rendering just fine in the browser. But when I took a minute to examine the HTML in the Chrome Developer Tools, the problem hit me: I had unclosed tags in my markup, so in the eyes of the iTextSharp XmlWorker it was not well-formed XML.

I reviewed my code and noticed that I was properly closing all of my tags, so why was the browser showing unclosed ones? The page was using an HTML5 DOCTYPE, that's why. I was self-closing my tags XHTML-style, which HTML5 tolerates on void elements such as <br> and <img> but treats as meaningless (here is a great post on HTML5's treatment of self-closing tags). The browser parsed my markup into the DOM, and when jQuery's .html() serialized it back out, the trailing slashes were gone, which in turn caused the iTextSharp XmlWorker to fail.
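To see why an XML parser chokes on this, here is a toy tag-balance check in JavaScript. It is only an illustration: firstUnclosedTag is a made-up name, and this is not how the XmlWorker actually parses markup. It does show that the HTML5 serialization leaves <br> open, while the XHTML form balances.

```javascript
// Toy well-formedness check (illustration only, not a real XML parser).
// Returns the name of the first element left unclosed, or null if tags balance.
function firstUnclosedTag(markup) {
  // Groups: 1 = "/" on a closing tag, 2 = tag name, 3 = attributes, 4 = "/" on a self-closed tag.
  const tagPattern = /<(\/?)([A-Za-z][A-Za-z0-9]*)((?:"[^"]*"|'[^']*'|[^'">])*?)(\/?)>/g;
  const open = [];
  let m;
  while ((m = tagPattern.exec(markup)) !== null) {
    const [, closing, name, , selfClosing] = m;
    if (selfClosing) continue;        // <br /> opens and closes in one step
    if (!closing) {
      open.push(name);                // <p> waits for its </p>
      continue;
    }
    const expected = open.pop();
    if (expected !== name) return expected || name; // mismatched close
  }
  return open.length > 0 ? open[open.length - 1] : null;
}

// What the browser hands back from .html():
console.log(firstUnclosedTag("<p>Line one<br>Line two</p>"));
// What an XML parser needs to see:
console.log(firstUnclosedTag("<p>Line one<br />Line two</p>"));
```

The first call reports br as the element that never closes; the second finds nothing wrong.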

So, how to fix the situation? I came up with two solutions. First, I added the jQuery-clean plugin to my View so that after I grabbed the HTML from the page, I could "clean" it into XHTML-compatible markup. This was actually enough to solve the problem: once the HTML5 was converted to XHTML, iTextSharp began processing the input and creating PDFs from it. However, I took an additional step to ensure that any HTML entering the controller would arrive at the XmlWorker as valid XHTML. I installed the HTML Agility Pack and created a method that transforms any HTML input into valid XHTML output.
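As a rough sketch of what the client-side "clean" step has to accomplish, the function below rewrites HTML5 void elements into self-closed XHTML form. This is not the jQuery-clean plugin's code; closeVoidElements is a hypothetical helper, and a regex like this can be fooled by tags inside comments or script blocks, so treat it as an illustration rather than a substitute for a real HTML parser.

```javascript
// Hypothetical helper (not part of jQuery-clean): make void elements XML-friendly.
// Void elements as listed in the HTML5 spec; they never take a closing tag.
const VOID_ELEMENTS = [
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "param", "source", "track", "wbr"
];

// Matches a void element's start tag, capturing its name and attributes.
// Quoted attribute values are skipped as a unit so a ">" inside them is safe.
const VOID_TAG = new RegExp(
  "<(" + VOID_ELEMENTS.join("|") + ")(?=[\\s/>])" +
  "((?:\"[^\"]*\"|'[^']*'|[^'\">])*?)\\s*/?>",
  "gi"
);

function closeVoidElements(html) {
  // Re-emit each void tag as <name attrs />, lowercasing the name for XHTML.
  return html.replace(VOID_TAG, (match, name, attrs) =>
    "<" + name.toLowerCase() + attrs + " />"
  );
}

console.log(closeVoidElements('<p>One<br>Two<img src="a.png" alt="x > y"></p>'));
```

In the page this would sit between grabbing the markup and posting it, for example closeVoidElements($("#report").html()), where #report is a placeholder selector. Tags that are already self-closed pass through unchanged, so running it twice does no harm.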

My PDFs looked horrible, but that had more to do with iTextSharp's abysmal CSS support. But hey, at least PDFs were being generated! (I later completely restructured the View's HTML to use old-school table layouts, which helped the look and layout of the PDFs.)

Although I've often used iTextSharp in the past, I generally look for alternatives when they are available. For instance, if your organization is running SharePoint 2010 or higher, I recommend checking out SharePoint's Word Automation Services. If SharePoint is not an option, there's PDFsharp and my new favorite, Pechkin, which is a .NET wrapper for the wkhtmltopdf rendering engine. As a matter of fact, I initially used Pechkin in this project before going with iTextSharp, and I was able to create my PDF from the HTML clipped from the client in one line of code. The problem I encountered is that the site I was working on was hosted in Azure, and the wkhtmltopdf engine requires GDI+, which Azure currently does not support.

I hope this helps someone who may be struggling to get iTextSharp to play nicely with their markup. If you have any suggestions or alternative approaches for rendering PDFs using iTextSharp or any other project out there, I'd love to hear them!
