Reading/Parsing a PDF File with C#

Last week, I had the pleasure of dealing with PDF files as a source of data to process. If you are (or potentially could be) doing EDI with a third party, please please PLEASE be able to provide something other than PDF. While PDF is a great format for reading or printing, using it for EDI is a no-no. If there is a standard format for the type of data you’re sending, use that. In this particular case, the client was only sending data in PDF format, so dealing with it was a necessity.

When I looked at some sample PDF files, the data was actually stored as text and not as images (thankfully). I found a (seemingly) quick and simple write-up on parsing/reading a PDF file with C# and converting it to text, which may be useful for many. In my particular case, though, our projects are strong-name signed, and a signed assembly can only reference other signed assemblies. So in order to use the Apache PDFBox library mentioned there, I would need to sign it.

The Apache PDFBox library is open source, but it is written in Java, and building it for .NET involves both Apache Ant (to automate the build process) and IKVM.NET (an implementation of Java for Mono and the Microsoft .NET Framework). So, I decided to first try just signing the assemblies without recompiling.

I found Signer from the Alois Kraus blog for signing third party assemblies and decided to give it a try. From the project page:

How does it work?

Signer does basically a full round trip by decompiling the assembly into IL code make the necessary modifications and compile it back to a valid assembly. The required modifications include

  • Update of all references
  • Change/Removal of InternalsVisibleToAttribute
  • Update of custom attributes with a type parameter
  • A little fix to work around an ILDASM problem

Although Signer reported success when I ran the assemblies through it, the assemblies were still unsigned (after a failed attempt at loading them in my project, I used the corflags tool to verify that the ‘Signed’ flag was still not set).
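If you would rather check from code than from the command line, something like the following might work as a rough sanity check. This is only a sketch: the class and method names (StrongNameCheck, HasPublicKeyToken) are made up for illustration, and it looks at the public key token in the assembly manifest, which is not the same thing as the ‘Signed’ flag that corflags reports. A delay-signed assembly, for instance, would have a token without a valid signature, so a tool like sn -v is still the authority.

```csharp
using System;
using System.Reflection;

public static class StrongNameCheck
{
    // Hypothetical helper: returns true when the assembly manifest
    // carries a public key token (i.e. it claims a strong name).
    // It does NOT validate the signature itself.
    public static bool HasPublicKeyToken(string assemblyPath)
    {
        // Reads the manifest without loading the assembly into the AppDomain.
        // Throws BadImageFormatException if the file is not a valid assembly.
        AssemblyName name = AssemblyName.GetAssemblyName(assemblyPath);
        byte[] token = name.GetPublicKeyToken();
        return token != null && token.Length > 0;
    }
}
```

Calling StrongNameCheck.HasPublicKeyToken on each assembly after a signing attempt gives a quick yes/no before you try referencing it from a signed project.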

I also tried following the instructions from a post on Ryan Farley’s blog covering how to Sign a .NET Assembly with a Strong Name Without Recompiling, but again without any luck (the resulting assembly appeared to have a corrupt header with an unrecognizable assembly name).

With some further digging, I found a ticket in the PDFBox issue tracking system about providing signed assemblies, but I ran into problems going that route as well. When a quick attempt at building the project manually by following the .NET version instructions also failed, I decided to take a different approach. More time probably would have produced a working result, but I didn’t feel it was worth the effort.

Using Process and the PDFBox Command-line Tools

In the end, I decided to use the Process class to send each file I was working with over to the ExtractText application included in the PDFBox Command Line Utilities. With a ProcessStartInfo tweaked to redirect the console output, and the process’s StandardOutput property, I was able to get what I needed:

// requires: using System; using System.Diagnostics;
public static string ReadPDF(string inputFilePath)
{
	try
	{
		// first, build our ProcessStartInfo so we can grab the console output
		string arguments = "-console \"" + inputFilePath + "\"";
		string pathToPDFBox = Properties.Settings.Default.PathToPDFBoxExtractText;
		if (string.IsNullOrEmpty(pathToPDFBox))
			throw new Exception("Unable to find PDFBox (the setting 'PathToPDFBoxExtractText' has not been set)");
		ProcessStartInfo startInfo = new ProcessStartInfo()
		{
			FileName = pathToPDFBox,
			Arguments = arguments,
			UseShellExecute = false,	// required in order to redirect output
			RedirectStandardOutput = true
		};

		// now launch the process and grab the result
		string output = string.Empty;
		using (Process pdfReaderProcess = new Process() { StartInfo = startInfo })
		{
			pdfReaderProcess.Start();
			// read everything before waiting, so a full output buffer can't block the child
			output = pdfReaderProcess.StandardOutput.ReadToEnd();
			pdfReaderProcess.WaitForExit();
		}

		return output;
	}
	catch (Exception ex)
	{
		string message = string.Format("There was a problem reading the PDF file '{0}': {1}",
			inputFilePath, ex.Message);
		throw new Exception(message, ex);
	}
}

I was only working with some smaller test files, so I just put something together quickly to test it out. Really, you’d want to use the GOCR.
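For anything beyond small test files, a slightly more defensive version of the Process approach might be worth considering. The sketch below is not what I actually ran. It assumes you also want to capture standard error, enforce a timeout, and check the exit code, and the names PdfTextExtractor, RunAndCapture and timeoutMs are invented for this example. It reads both streams asynchronously, since reading one redirected stream synchronously while the other fills up can deadlock.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class PdfTextExtractor
{
    // Hypothetical helper: runs a console program and returns its standard output.
    public static string RunAndCapture(string fileName, string arguments, int timeoutMs)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            UseShellExecute = false,        // required for redirection
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            CreateNoWindow = true
        };

        var stdout = new StringBuilder();
        var stderr = new StringBuilder();

        using (var process = new Process { StartInfo = startInfo })
        {
            // Collect both streams via events so neither pipe can fill up and block.
            process.OutputDataReceived += (s, e) => { if (e.Data != null) stdout.AppendLine(e.Data); };
            process.ErrorDataReceived += (s, e) => { if (e.Data != null) stderr.AppendLine(e.Data); };

            process.Start();
            process.BeginOutputReadLine();
            process.BeginErrorReadLine();

            if (!process.WaitForExit(timeoutMs))
            {
                process.Kill();
                throw new TimeoutException("The process did not finish within " + timeoutMs + " ms.");
            }

            // The parameterless overload also waits for the async handlers to drain.
            process.WaitForExit();

            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "The process exited with code " + process.ExitCode + ": " + stderr);
        }

        return stdout.ToString();
    }
}
```

ReadPDF could then pass along the same path and "-console" arguments it already builds, plus whatever timeout suits the size of your files.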

If any of you have had a related experience, I’d love to hear about it.
