PDF tools for web servers
This article lists some of the PDF tools we have used on our server projects. They all have one thing in common: they can be driven from the command line, and therefore can be run on servers lacking any kind of GUI.
Please note that I won’t be publishing every comment that starts “I work for XYZ and we have a PDF product…” (especially if I see the same posts listed on a hundred other sites with ‘PDF’ somewhere in the title). These are mainly the open source, free or cheap tools that I have personally found useful in projects. I still appreciate any tips or recommendations.
Most of these tools fall into the data conversion category: they convert to or from PDF.
pdf2svg
This tool converts a PDF document to an SVG document. The command line tool pdf2svg.exe is Windows-only, but is useful for off-line conversions. It is not free, but it is significantly cheaper than the Adobe CS3 suite that you would otherwise have to use.
Using the tool without a license inserts several watermark layers, and the thumbnails generated from each page contain the same watermarks. The resulting SVG can be manipulated easily using Inkscape (a free SVG editor that every developer should have in their toolbox).
This tool will also extract all images (as PNGs) from the PDF document, which is very handy.
iText
This free Java tool is platform-independent, and provides a few features that make it well suited to batch PDF manipulation. Features include the ability to:
- Split up a document and manipulate pages.
- Generate PDF content on-the-fly.
- Fill out PDF forms.
- Add digital signatures.
- Create PDFs from scratch, including barcodes.
iText can also output Rich Text Format (RTF) documents. I have used it in the past to extract text from PDFs for indexing on a website, though there are lighter tools for doing this.
pdftk (PDF Tool Kit, by AccessPDF)
This is surely the lightest Swiss Army Knife of PDF tools. Features include:
- Bursting into single-page PDFs and recombining.
- Inserting and extracting form data (for older-style forms; Adobe seems to have changed the way forms work in later versions of the PDF format, so extracting data from filled forms is no longer straightforward).
- Extracting and manipulating metadata.
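The burst, recombine and metadata features above can be sketched as a few small shell functions. This is only a minimal sketch, assuming pdftk is installed and on the PATH; the function names, the work directory and the page_NNNN.pdf naming pattern are my own choices for illustration, not anything pdftk requires.

```shell
#!/bin/sh
# Sketch only: function names and file layout are hypothetical.

# Burst a PDF into one file per page inside a work directory.
burst_pdf() {
    src=$1
    workdir=$2
    mkdir -p "$workdir" || return 1
    # pdftk replaces %04d with the zero-padded page number.
    pdftk "$src" burst output "$workdir/page_%04d.pdf"
}

# Recombine every page file in the work directory into one document.
join_pages() {
    workdir=$1
    dest=$2
    # The glob expands in sorted order, which matches page order
    # because the page numbers are zero-padded.
    pdftk "$workdir"/page_*.pdf cat output "$dest"
}

# Dump document metadata to a text file for inspection or editing.
dump_metadata() {
    pdftk "$1" dump_data output "$2"
}
```

A typical call would be something like `burst_pdf report.pdf /tmp/pages && join_pages /tmp/pages rebuilt.pdf`, with the page files reordered or removed in between.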
PDFBox
This is another free Java library, with features including:
- Extracting text from a PDF (for indexing, e.g. with Lucene or mnoGoSearch).
- Manipulating pages (inserting, extracting, reordering).
- Filling and extracting form data (PDF versions below 1.6) using FDF and XFDF data files.
- Creating images from PDF files – good for thumbnails and creating ‘page flipper’ applications.
To use the text extraction in a search engine, I use the following shell script to wrap it all up:
#!/bin/sh
# Convert a PDF document to text
# Requires the PDFBox jar to be on the Java classpath
# Usage: $0 [OPTIONS] <PDF file> [Text File]
#   -password <password>   Password to decrypt document
#   -encoding <encoding>   (ISO-8859-1,UTF-16BE,UTF-16LE,…)
#   -console               Send text to console instead of file
#   -html                  Output in HTML format instead of raw text
#   -sort                  Sort the text before writing
#   -startPage <number>    The first page to start extraction (1 based)
#   -endPage <number>      The last page to extract (inclusive)
#   <PDF file>             The PDF document to use
#   [Text File]            The file to write the text to
java org.pdfbox.ExtractText "$@"
This assumes that PDFBox has been installed under /usr/local/lib.
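For indexing a whole directory of documents, the wrapper can be driven from a small loop. The following is a rough sketch rather than the script I actually run: it assumes the wrapper above has been saved somewhere on the PATH under the hypothetical name pdf2text, and that each text file should sit next to its PDF with a .txt extension.

```shell
#!/bin/sh
# Sketch only: "pdf2text" is a hypothetical name for the wrapper script.
EXTRACT=${EXTRACT:-pdf2text}

# Print the text-file path that corresponds to a PDF path.
text_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Extract text from each *.pdf in a directory, one text file per PDF.
extract_all() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        # If there are no PDFs the glob stays literal; skip it.
        [ -e "$pdf" ] || continue
        txt=$(text_name "$pdf")
        # Report failures but carry on with the remaining files.
        "$EXTRACT" "$pdf" "$txt" || echo "failed: $pdf" >&2
    done
}
```

Calling `extract_all /var/www/docs` (a made-up path) would then leave a .txt file beside each PDF, ready to be picked up by the search engine's indexer.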
Although this toolkit is primarily about SWF files, it does have some neat PDF to SWF conversion scripts. Versions are available for Windows and Linux under an Open Source licence.
One of these tools also extracts images from PDFs, which can be very useful when converting PDF to HTML formats.
There are a number of PHP tools, other libraries (e.g. ImageMagick) and more heavyweight tools (e.g. Ghostscript) that I will cover later. Hopefully this selection will help in the meantime. If you have any further suggestions, I would love to hear them. Even if a tool duplicates much of what these do, it only needs to offer one extra feature that the others don’t cover to be worth using.