Reading PDF Form Data in C#

Logical Moon

2015-09-13

.net

Like many of you (yes, I know who you are), I’ve used Adobe’s Portable Document Format (PDF) many times. How can you not? Back when it was hard to share documents, PDF made things so much easier. PDFs are a great (free) way to distribute files that describe the layout, fonts, graphics and text of flat documents, but there are interactive versions too: in particular, so-called AcroForms, which allow users to enter form data and save it. That’s what this brief article is about: editable PDF files and, in particular, how to read them in C#.

If you do a quick Google of which libraries are available, you will come up with a few possibilities but in my opinion, it comes down to using iTextSharp. The two options are the freely usable version (4.0.3.0) and the one you are meant to pay for (5.5.6) which comes with lots of support, has fixed lots of bugs and has no further potential licensing issues. Clearly then, we’ll go for version 4! :-)

An Example Form

First things first - I need a form. I found one here (courtesy of Foersom Engineering Solutions - thank you) and filled it in.

Downloading and Installing

Getting hold of version 4.0.3.0 of iTextSharp is easy if you use the NuGet Package Manager in Visual Studio. Go to the menu: Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution and fill the fields in as in the image below (see yellow highlighting). You can see I have looked online for itextsharp and picked the one with the title: “iTextSharp, a .NET PDF library”.

Next, click Install then OK and Close.

The Using Directive

We’ve got the package DLLs as part of our project, but don’t forget to import the namespace you will need, as below.

using iTextSharp.text.pdf;

Traversing the Forms Data

This example is only interested in form data and, for illustration purposes, I am not going to retrieve it in any particular order or do anything useful with it.

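// Open the filled-in form; a verbatim string (@"...") needs only a single backslash.
var reader = new PdfReader(@"G:\OoPdfFormExampleFilled.pdf");

// AcroFields.Fields is keyed by field name.
var fields = reader.AcroFields.Fields;

foreach (var item in fields.Keys)
{
    Console.WriteLine("Key: \"{0,-25}\" Value: \"{1}\"", item, reader.AcroFields.GetField(item.ToString()));
}

reader.Close();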

As you can see, we simply open up the PDF file and then iterate over each of the keys, extracting the field data for each one using GetField(). Sadly, in this version the PdfReader class doesn’t implement System.IDisposable, so you can’t wrap everything in a using statement and must remember to close the file yourself.
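Since a using statement isn’t an option here, one way to make sure Close() still runs when something goes wrong is a try/finally block. What follows is only a minimal sketch, assuming the iTextSharp package is already referenced and the form lives at the same path as above; the class name Program is just a placeholder of mine.

```csharp
using System;
using iTextSharp.text.pdf;

class Program
{
    static void Main()
    {
        // Same example path as above; change it to wherever your filled-in form lives.
        var reader = new PdfReader(@"G:\OoPdfFormExampleFilled.pdf");
        try
        {
            var form = reader.AcroFields;

            // Walk every field name and print its current value.
            foreach (var key in form.Fields.Keys)
            {
                var name = key.ToString();
                Console.WriteLine("Key: \"{0,-25}\" Value: \"{1}\"", name, form.GetField(name));
            }
        }
        finally
        {
            // Runs whether or not the loop throws, so the reader is always closed.
            reader.Close();
        }
    }
}
```

It is a few more lines than the original, but it does by hand what a using statement would otherwise have done for us.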

The Output

Key: "Given Name Text Box      " Value: "Stephen"
Key: "Address 1 Text Box       " Value: "Whatsit Road"
Key: "Address 2 Text Box       " Value: "North Heath"
Key: "House nr Text Box        " Value: "11"
Key: "Gender List Box          " Value: "Woman"
Key: "Postcode Text Box        " Value: "DA7 6NU"
Key: "Family Name Text Box     " Value: "Moon"
Key: "Language 4 Check Box     " Value: "Off"
Key: "Favourite Colour List Box" Value: "Green"
Key: "Driving License Check Box" Value: "Yes"
Key: "Language 2 Check Box     " Value: "Yes"
Key: "Country Combo Box        " Value: "Britain"
Key: "City Text Box            " Value: "London"
Key: "Language 5 Check Box     " Value: "Off"
Key: "Height Formatted Field   " Value: "182"
Key: "Language 3 Check Box     " Value: "Yes"
Key: "Language 1 Check Box     " Value: "Off"

You will notice that checkboxes have values which are "Off" or "Yes" (Groan: I know, I know…) and all others can be treated as text. Pretty simple and a testament to how well this library handles things for you.
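If you would rather work with a bool than compare strings all over the place, a tiny helper can do the translation. This is only a sketch: IsChecked and CheckBoxDemo are names I have made up (they are not part of iTextSharp), and it assumes you only pass it values that came from checkbox fields, treating "Off" or an empty value as unticked and anything else (such as "Yes") as ticked.

```csharp
using System;

static class CheckBoxDemo
{
    // Hypothetical helper, not an iTextSharp API.
    // Assumption: "Off" (or no value at all) means unticked; any other value means ticked.
    public static bool IsChecked(string value)
    {
        return !string.IsNullOrEmpty(value)
            && !string.Equals(value, "Off", StringComparison.Ordinal);
    }

    static void Main()
    {
        Console.WriteLine(IsChecked("Yes")); // prints True
        Console.WriteLine(IsChecked("Off")); // prints False
    }
}
```

In the earlier loop you would pass it the string returned by GetField(), but only for the fields you know to be checkboxes; a text box containing "Off" would otherwise be misread.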

Final Thoughts

So far, in my limited use, I haven’t had any real problems or encountered bugs, but of course, they are there. Use this with some caution but if it isn’t mission critical, I don’t think you can go far wrong.

Hi! Did you find this useful or interesting? I have an email list coming soon, but in the meantime, if you read anything you fancy chatting about, I would love to hear from you. You can contact me here or at stephen ‘at’ logicalmoon.com