Reading PDF Form Data in C#
Like many of you (yes, I know who you are), I’ve used Adobe’s Acrobat Portable Document Format (PDF) many times. How can you not? Back when it was hard to share documents, PDF made things so much easier. PDFs are a great (free) way to distribute files which describe the layout, fonts, graphics and text of flat documents. There are interactive versions, too: in particular, so-called AcroForms, which allow users to enter form data and save it. That’s what this brief article is about: editable PDF files and, in particular, how to read them in C#.
If you do a quick Google of which libraries are available, you will come up with a few possibilities but in my opinion, it comes down to using iTextSharp. The two options are the freely usable version (4.0.3.0) and the one you are meant to pay for (5.5.6), which comes with lots of support, has fixed lots of bugs and has no further potential licensing issues. Clearly then, we’ll go for version 4! :-)
An Example Form
First things first - I need a form. I found one here (courtesy of Foersom Engineering Solutions - thank you) and filled it in.
Downloading and Installing
Getting hold of version 4.0.3.0 of iTextSharp is easy if you use the NuGet Package Manager in Visual Studio. Go to the menu: Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution and fill the fields in as in the image below (see yellow highlighting). You can see I have looked online for itextsharp and picked the one with the title: “iTextSharp, a .NET PDF library”.
Next, click Install, then OK and Close.
The Using Directive
We’ve got the package DLLs as part of our project, but don’t forget to import the namespace containing the classes you will need, as below.
using iTextSharp.text.pdf;
Traversing the Forms Data
This example is strictly only interested in form data and for illustration purposes, I am not going to get it in any particular order or do anything useful with it.
var reader = new PdfReader(@"G:\OoPdfFormExampleFilled.pdf");
The code simply opens the PDF file and then iterates over each of the field keys, extracting the data for each one using GetField(). Sadly, in this version the class PdfReader doesn’t implement System.IDisposable, so you must remember to close the file yourself and can’t wrap everything in a using statement.
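The steps described above can be sketched as a small console program. Treat it as a minimal sketch rather than the definitive listing: it assumes the AcroFields property on PdfReader and its Fields collection, and it iterates the keys as plain objects so it doesn't matter whether that collection is generic. The FormatField helper and the optional command-line path are my own additions for illustration.

```csharp
using System;
using iTextSharp.text.pdf;

public static class Program
{
    // Hypothetical helper: formats one field as Key: "..." Value: "...".
    public static string FormatField(string key, string value)
    {
        return string.Format("Key: \"{0}\" Value: \"{1}\"", key, value);
    }

    public static void Main(string[] args)
    {
        // Assumption: a path may be passed as the first argument;
        // otherwise fall back to the example file used in this article.
        string path = args.Length > 0 ? args[0] : @"G:\OoPdfFormExampleFilled.pdf";

        var reader = new PdfReader(path);
        try
        {
            AcroFields form = reader.AcroFields;

            // Each key is a field name; GetField returns its value as text.
            foreach (object key in form.Fields.Keys)
            {
                string name = key.ToString();
                Console.WriteLine(FormatField(name, form.GetField(name)));
            }
        }
        finally
        {
            // No using statement available here, so close explicitly,
            // even if reading a field throws.
            reader.Close();
        }
    }
}
```

The try/finally stands in for the using statement we can't have: the file is released whether or not the loop completes.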
The Output
Key: "Given Name Text Box " Value: "Stephen"
You will notice that checkboxes have values which are "Off" or "Yes" (Groan: I know, I know…) and all others can be treated as text. Pretty simple and a testament to how well this library handles things for you.
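If you would rather work with a bool than compare strings everywhere, a tiny helper can do the translation. This is only a sketch: CheckboxValue and IsChecked are hypothetical names of my own, and it assumes that "Off" (or an empty value) means unchecked while anything else is the checked state. That holds for this form, where the checked value is "Yes", but other PDFs may name their checked state differently, which is why the test is "not Off" rather than "equals Yes".

```csharp
using System;

public static class CheckboxValue
{
    // Assumption: "Off" or a missing/empty value means unchecked;
    // any other value is treated as the form's "on" state.
    public static bool IsChecked(string fieldValue)
    {
        return !string.IsNullOrEmpty(fieldValue)
            && !string.Equals(fieldValue, "Off", StringComparison.Ordinal);
    }
}
```

You would call it on whatever GetField() returns for a checkbox key, for example CheckboxValue.IsChecked(value).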
Final Thoughts
So far, in my limited use, I haven’t had any real problems or encountered bugs, but of course, they are there. Use this with some caution but if it isn’t mission-critical, I don’t think you can go far wrong.
Hi! Did you find this useful or interesting? I have an email list coming soon, but in the meantime, if you read anything you fancy chatting about, I would love to hear from you. You can contact me here or at stephen ‘at’ logicalmoon.com