iTextSharp to the rescue: converting HTML to PDF


posted on Thursday, April 19, 2012

So, I was on this project and everything was going well until suddenly one of the end users suggested the following: "Hey, could we get this detail view in PDF-format? With the current layout?". So I googled "iTextSharp convert html pdf" and said "Yes, you can!". 

Initially I thought I would have to manually "translate" my HTML for iTextSharp to interpret it, but building a PDF like that can take days if the end user wants to hold on to their entire design. But here's the great part: iTextSharp can actually read HTML-objects and convert them to PDF-objects, which makes it incredibly easy to create a PDF-document based on an HTML-file. 


But, and there's always a but, there's one thing you need to keep in mind: iTextSharp can't convert all HTML tags! As an example, use the "<hr/>"-tag and the conversion process will fail! (Yes, I speak from experience on that one :)). 


So here goes! I created an HTML-file called "Template.html" and added it to my solution. Below you can find the contents of that HTML-file. Notice the "pStyle" css class in this example.


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>HTML template</title>
</head>
<body>
    <div>
<h1>Convert a HTML-file to a PDF-document!</h1>
<p class="pStyle">Using iTextSharp it's possible to convert HTML content to a PDF file.</p>
<h4>Just like this!!!</h4>
<br />
<table>
<tr>
<td>row 1 - cell 1</td>
<td>row 1 - cell 2</td>
</tr>
<tr>
<td>row 2 - cell 1</td>
<td>row 2 - cell 2</td>
</tr>
</table>
</div>
</body>
</html>


You start off as usual by creating a document and a writer. You might want to define some fonts and add some elements to your document. Then we'll go ahead and read the HTML-file; I just read the contents of the file into a string. Next, we use the (in)famous HTMLWorker-object to parse the string. Using the ParseToList-method you can parse the string to a list of IElements. These IElements can be added to a PDF-document. I guess this IElement can be seen as a wrapper for PDF-objects, since you can deduce the original object type from an IElement by using the GetType()-method (I'll get back to this in a little bit).

The next step is to loop over the list of IElements, cast them separately and add them to the document. That's all there is to it!

try
{
 // set the file name
 string file = "C:/pdf/MyPdf.pdf";

 // create a pdf document
 Document document = new Document();

 // set the page size, set the orientation
 document.SetPageSize(PageSize.A4);

 // create a writer instance
 PdfWriter pdfWriter = PdfWriter.GetInstance(document, new FileStream(file, FileMode.Create));

 // open the document
 document.Open();

 // define fonts
 Font titleFont = FontFactory.GetFont("Helvetica", 14, Font.BOLD, new iTextSharp.text.BaseColor(0, 179, 212));
 Font textFont = FontFactory.GetFont("Helvetica", 9, Font.NORMAL, BaseColor.BLACK);

 // create a paragraph and add it to the document
 // pass the font to the constructor so it applies to the title text itself
 Paragraph title = new Paragraph("This PDF-document was created using iTextSharp!", titleFont);
 title.IndentationLeft = 50f;
 document.Add(title);

 // add line below title
 LineSeparator line = new LineSeparator(1f, 100f, BaseColor.BLACK, Element.ALIGN_CENTER, -1);
 document.Add(new Chunk(line));

 // read the HTML page
 string htmlText = File.ReadAllText(Server.MapPath("~/template.html"));

 // parse the HTML page into a list of IElements
 List<IElement> htmlElements = HTMLWorker.ParseToList(new StringReader(htmlText), null);

 // add the IElements to the document
 for (int i = 0; i < htmlElements.Count; i++)
 {
  // cast the element
  IElement htmlElement = ((IElement)htmlElements[i]);
  document.Add(htmlElement);
 }

 // close the document
 document.Close();
 
 // open the pdf document
 Process.Start(file);
}
catch (Exception ex)
{
 lblError.Text = "An error occurred, the PDF-document could not be created. Exception: " + ex.Message;
}

But wait, there's more! How about styling? Embedded css - as in, adding the css within <style>-tags - is not supported. Inline css is supported in a limited way, so you'll have to use trial and error to figure out what works and what doesn't. The "%"-sign doesn't seem to be supported either.

There are two other ways of implementing styling. When you use the ParseToList-method, you can pass an iTextSharp StyleSheet-object as the second argument. In the first example I wasn't using styling, so I passed "null" as the second argument for this method.

You can use the two methods "LoadTagStyle" and "LoadStyle" to apply styles. The "LoadTagStyle"-method can be used to map an HTML-tag to a style as you would in a cascading style sheet. The "LoadStyle"-method can be used to map a css class to a style. Remember the "pStyle"-class in the code above? In the code below you can see examples of both LoadTagStyle and LoadStyle.

Another way to customize your PDF-objects is to find the specific object you want to customize after it has been cast but before it is added to the document. You will then be able to determine the real type of the object by using the "GetType"-method. You can then compare the object to known PDF-types such as Paragraph or PdfPTable.

If the types are equal, cast the object to the correct type and customize away! Don't forget to add the object to the document when you are ready. I use the boolean addedSeparately to verify if I already added the object to the document or not.

Truth of the matter is, implementing styling could be tricky and it's a matter of trial and error. Some things work, some just don't...

 // prepare the stylesheet
 StyleSheet css = new StyleSheet();
 css.LoadTagStyle("table", "border", "1");
 css.LoadStyle("pStyle", "style", "color:blue; font-style:italic;");

 // parse the HTML page into a list of IElements
 List<IElement> htmlElements = HTMLWorker.ParseToList(new StringReader(htmlText), css);
 
 // this boolean defines whether the element was already added or not
 Boolean addedSeparately;

 // add the IElements to the document
 for (int i = 0; i < htmlElements.Count; i++)
 {
  // initialize
  addedSeparately = false;

  // cast the element
  IElement htmlElement = ((IElement)htmlElements[i]);

  // check every element to see if is of the type Paragraph
  if (htmlElement.GetType().Name.Equals("Paragraph"))
  {
   // cast the element to a Paragraph
   Paragraph htmlPar = ((Paragraph)htmlElement);

   // look for a paragraph containing the text "Just like this!!!"
   if (htmlPar.Content.Trim().Equals("Just like this!!!"))
   {
    // get the content of the paragraph and underline it
    Chunk underlinedChunk = new Chunk(htmlPar.Content.Trim());
    underlinedChunk.SetUnderline(1f, -2f);

    // add the paragraph to the document
    document.Add(new Paragraph(underlinedChunk));
    addedSeparately = true;
   }
  }

  // if we didn't add the element yet, do it now
  if (!addedSeparately)
  {
   document.Add(htmlElement);
  }
 }

iTextSharp fans might protest to this post by pointing to the XMLWorker, which is supposed to replace the HTMLWorker. I tried this once, got horrible results and threw it out right away. But in the name of science, I will give it a new go in a little while to make the comparison.

To summarize, I know that in this version of iTextSharp the HTMLWorker does not support the <hr/>-tag and the %-sign. If you know any others that are not supported, feel free to comment and I'll add them to the post. I'll try to do some testing on this as well in a while.

So, that's all for today!  This helped me implement my PDF-document in a day, so I hope this can be useful to you as well and save you some time. :)


Update: I just remembered that "font-size: smaller" in your HTML will fail the conversion as well... "font-size" itself is no problem; it's the "smaller"-part that causes the problem. The upside is that this will actually throw a clear error: "Font size too small".

So for future reference, the following tags, signs and values will make the conversion fail:
  • <hr/>
  • %
  • (font-size:) smaller (same goes for small, large, ...) 
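
One way to live with this list is to clean the HTML string before handing it to the HTMLWorker. Below is a minimal sketch of that idea, not something I used in the project. The class name HtmlPreCleaner and the method Clean are made up for this example, it only uses plain .NET regular expressions (no iTextSharp calls), and it assumes that the problematic "%"-values sit in double-quoted style attributes or in width/height attributes. Anything it removes is simply dropped, so the layout can change.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of iTextSharp): strips the constructs
// listed above from an HTML string before it is parsed.
public static class HtmlPreCleaner
{
    // font-size keywords such as "smaller", "small", "large", ...
    private static readonly Regex FontSizeKeyword = new Regex(
        @"^\s*font-size\s*:\s*(xx-small|x-small|small|smaller|medium|large|larger|x-large|xx-large)\s*$",
        RegexOptions.IgnoreCase);

    public static string Clean(string html)
    {
        // 1. drop <hr>, <hr/> and <hr /> tags
        string result = Regex.Replace(html, @"<hr\b[^>]*>", string.Empty, RegexOptions.IgnoreCase);

        // 2. inside double-quoted style attributes, keep only the declarations
        //    that contain no "%" and no font-size keyword
        result = Regex.Replace(result, @"style\s*=\s*""([^""]*)""", m =>
        {
            var kept = m.Groups[1].Value
                .Split(';')
                .Select(d => d.Trim())
                .Where(d => d.Length > 0 && !d.Contains("%") && !FontSizeKeyword.IsMatch(d));
            return "style=\"" + string.Join("; ", kept) + "\"";
        }, RegexOptions.IgnoreCase);

        // 3. drop width/height attributes with a percentage value, e.g. width="100%"
        //    (assumption: these are the only attributes where "%" causes trouble)
        result = Regex.Replace(result, @"\s+(width|height)\s*=\s*""[^""]*%[^""]*""", string.Empty, RegexOptions.IgnoreCase);

        return result;
    }
}
```

You would call it right after reading the file, for example: string htmlText = HtmlPreCleaner.Clean(File.ReadAllText(Server.MapPath("~/template.html")));. With this sketch, an input like <p style="width:50%; color:red; font-size: smaller">Hi</p><hr /> comes out as <p style="color:red">Hi</p>.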

Could be useful, right?

21 comments:

  1. This is a great tool I've used before. Right now I'm working with a company that uses a very expensive product for this and I'm not sure why they didn't just go with iTextSharp. Great post.

  2. Seems easier than trying to do the same with PDFSharp

    1. I'm not familiar with PDFSharp, is it any good? And indeed, depending on the structure of the HTML-file this could be literally 10 minutes work :)

  3. Im having this code:

    Document document = new Document();
    document.SetPageSize(PageSize.A4.Rotate());

    string file = Server.MapPath("~/salem/pdf/flier_completed.pdf");

    PdfWriter.GetInstance(document, new FileStream(file, FileMode.Create));
    document.Open();

    StringWriter sw = new StringWriter();
    HtmlTextWriter w = new HtmlTextWriter(sw);
    getHTML.RenderControl(w);

    //this is only here to test if it works
    string s1 = "";

    string s = sw.GetStringBuilder().ToString();

    //this is only here to test if it works
    string s2 = "";

    s = s1 + s + s2;

    StyleSheet styles = new StyleSheet();

    styles.LoadTagStyle("table", "border", "1");

    styles.LoadStyle("productItem", "style", "width:100px; float:left; padding:5px; margin:5px; text-align:center;");

    styles.LoadStyle("groupSeparator", "style", "border-top: 1px dotted Gray; height:1px; clear:both;");

    styles.LoadStyle("itemSeparator", "style", "height:80px; width:1px; border-left:1px dotted Gray; margin-top:5px; margin-bottom:5px; float:left;");

    using (TextReader sReader = new StringReader(s.ToString()))
    {
    List list = HTMLWorker.ParseToList(sReader, styles);
    foreach (IElement ielm in list)
    {
    document.Add(ielm);
    }
    }
    document.Close();

    Process.Start(file);

    but the css styles are not applied. Could you help? :) Thank you in advance...

    1. Hi Simpa!

      Sorry for the late reply. I checked out your code and it seems like you have a little error on this line:

      List list = HTMLWorker.ParseToList(sReader, styles);

      You forgot the <IElement> here.

      Can you successfully generate a PDF-file? I was able to add your styles to my file, and it worked, but couldn't try your code since I don't have the code of the control you are trying to render.

      I suspect something is going wrong with the rendering of that control. If you still need help, post the design file code here as well.

  4. Yes, I generated the file successfully... and I gave up on iTextSharp. I felt it was a waste of time... I don't understand your explanation of what is wrong with the code. But now I'm only using iTextSharp to update PDF files.
    The other library I found which works perfectly was wkhtmltopdf with the gmanny/Pechkin wrapper... But if you explain what exactly I did wrong with iTextSharp I could give it a chance... Otherwise it's useless to me...

  5. Hi Simpa

    The reason why I can't immediately help you is because there is code missing (there is no code about where getHTML comes from, for example). Perhaps you can email me your code (info at thiscouldbeuseful dot com) and then I'll take a look at it. Maybe I can help you out so you don't give up on iTextSharp just yet :)

    Regards

  6. getHTML is a div which has runat="server"... In that div I have two user controls which are in divs with float left. I use this code to get the html

    //for test
    //StringWriter sw = new StringWriter();
    //HtmlTextWriter w = new HtmlTextWriter(sw);
    //getHTML.RenderControl(w);
    //string shtml = sw.GetStringBuilder().ToString();
    //for test

    I tried all the parameters:
    //List<Dictionary<string, string>> css = new List<Dictionary<string, string>>();

    ////css.Add(new Dictionary<string, string>() {
    ////{"width", "100px"},
    ////{"float","left"},
    ////{"padding","5px"},
    ////{"margin","5px"},
    ////{"text-align", "center"}
    ////});

    ////foreach (Dictionary<string, string> dic in css)
    ////{
    //// styles.LoadStyle("productItem", dic);
    ////}

    ////styles.LoadStyle("productItem", "style", "width:100px; float:left; padding:5px; margin:5px; text-align:center;");

    //styles.LoadStyle("productItem", "width", "100px");
    //styles.LoadStyle("productItem", "float", "left");
    //styles.LoadStyle("productItem", "padding", "5px");
    //styles.LoadStyle("productItem", "margin", "5px");
    //styles.LoadStyle("productItem", "text-align", "center");


    but if it is easier I can email a running project, then you can test it...

  7. Hi Simpa, I mentioned the e-mail address in my previous reply, but here it is again: info@thiscouldbeuseful.com. If you send me the code, I'll take a look at it :)

  8. Unfortunately I was not able to help Simpa out; for some reason, some styling tags are just being ignored. I will try to update this post with a bigger example with more styling tags and try to figure out which ones will work and which won't. I'll keep you all posted!

  9. What can be used instead of the <hr> tag in HTML to create a line in the PDF?

  10. Hi Ravideep Bansal!

    How to add a line without using the hr-tag?
    You can add a line using the following code:

    LineSeparator line = new LineSeparator(1f, 100f, BaseColor.BLACK, Element.ALIGN_CENTER, -1);
    document.Add(new Chunk(line));

    You could probably write some code to replace occurrences of that tag with an iTextSharp-generated line.

  11. Hi friend, I want to add a bottom border to a TD using iTextSharp, but it is not supported. How can I add a bottom border? Is there any other way?

    1. Hi Mohan Dhokare!

      Sorry for the late response, have been quite busy :)

      You could use this to get table borders:

      // styles
      StyleSheet styles = new StyleSheet();
      styles.LoadTagStyle("table", "border", "1");

      // parse html
      List<IElement> htmlElements = HTMLWorker.ParseToList(new StringReader(htmlText), styles);


      Good luck!

  12. I struggled for almost one day to apply styles using itextsharp.dll. Thanks for the code

    css.LoadStyle("pStyle", "style", "color:blue; font-style:italic;");

    thanks for saving my life.

  13. Hi, the post was useful... However, is there a way to set a background colour for certain text? My HTML tags have a background color property... but the generated PDF doesn't have them... Can anyone provide me with a solution?

  14. Then when I inserted null for the argument (Stylesheet style) then The same line List htmlElements = HTMLWorker.ParseToList(reader, null); returns an error Cannot implicitly convert type 'System.Collections.ArrayList' to 'System.Collections.Generic.List'

  15. A comment got lost. I was reporting that List htmlElements = HTMLWorker.ParseToList(reader); like you show on line 36, returns an error that no overload of HTMLWorker.ParseToList takes one argument

  16. Hi ThisCouldBeUseful
    The HTML tags fieldset->legend are not working with HTMLWorker. Why is iTextSharp not robust enough to convert basic HTML to PDF?

    Is there a way to bypass 'HTMLWorker.ParseToList' and reconstruct the PDF with all HTML formatting intact?

    Thanks in advance
    Lenin
    jlenin.edwin@gmail.com

  17. It seems like I'm behind on responding to the comments, sorry guys!

    @Guru: you're very welcome! :)
    @Divakar Viswanthan: what code did you try? Remember that embedded css will be ignored! You need to add classes or apply styles to specific tags.
    @Lenin Edwin: I'm afraid that these tags are just not supported. If you think about it, PDF is structured completely differently than HTML, which makes it hard to convert one to the other.
    As far as I know there is no way to use the HTMLWorker without the ParseToList method. In your case, where you use more than just basic HTML, you might benefit from giving the XMLWorker a spin.
    This is the new and improved version of the HTMLWorker and I've heard great things about it, so it might do the trick!

    Thanks for posting guys!

  18. @Joeller: You're right, there appears to be a problem with the comments. I've just changed the commenting mode, this should fix it.

    The error you have is most likely a result of using a different version of iTextSharp than I did, but passing a null for the StyleSheet seems like a good solution to me. Is this the exact code you're using, "List htmlElements = HTMLWorker.ParseToList(reader, null);"? Could you verify the return type of the ParseToList method? Maybe it has been changed?
