Easy Way to Append PDF Files Programmatically in Python

Recently I had a bunch of one-page PDF documents that I wanted to combine so I could read them all as a single document. Adobe Acrobat can do this (which makes sense), but it is a paid product, and I wanted to do it for free. I spent some time searching, but I could not find any application that would do this without much effort on my part. Since this is a task I might want to repeat in the future, I thought it would be worth the time to do it programmatically.

I decided to see how to do it in Python. It turned out to be extremely easy (with one caveat).

The first thing you need to do is install Python. I installed version 2.5.

The next thing you need to do is install pyPdf, an open-source library that allows you to manipulate PDF files. I really liked how simple the library is; the example they give is very straightforward.

Then let’s say you have a bunch of .pdf files in a directory and that the names of the files are in the order you want to append them: for example 1.pdf, 2.pdf, 3.pdf…

I created a simple script that would do this, but I ran into trouble because Python wasn’t sorting the file names in numeric order (so my PDF document had pages in the wrong order). So I used the code described in this post to sort the names properly (this code is contained in my SortHelper file referenced below). The following script does the job:

from pyPdf import PdfFileWriter, PdfFileReader
import os  # also provides os.path
import glob
from SortHelper import SortNumericStringList  # numeric-aware sort helper
 
outputFileName = "MergedDocument.pdf"
 
if os.path.isfile(outputFileName):
    os.unlink(outputFileName) # deletes the output file if it exists
 
inputFileNames = glob.glob(os.getcwd() + "/*.pdf") # searches for all files with .pdf extension
inputFileNames = SortNumericStringList(inputFileNames) # sorts the file names appropriately
 
output = PdfFileWriter()
 
for pdfFileName in inputFileNames:
    output.addPage(PdfFileReader(file(pdfFileName, "rb")).getPage(0)) # adds the first page of each document
 
outputStream = file(outputFileName, "wb")
output.write(outputStream)
outputStream.close()
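
The SortHelper module is not reproduced in this post, so here is a rough idea of what a numeric-aware sort could look like. This is only a minimal sketch, not the actual SortHelper code: the names natural_key and sort_numeric_string_list are made up for illustration, and the approach (splitting each name into digit and non-digit runs and comparing the digit runs as integers) is an assumption about how such a helper might work.

```python
import re


def natural_key(name):
    """Build a sort key that compares runs of digits as integers.

    Hypothetical helper, not the SortHelper code from the post.
    Each chunk becomes a (rank, value) tuple so that numbers and
    text are never compared directly with each other.
    """
    key = []
    for part in re.split(r"(\d+)", name):
        if not part:
            continue  # re.split can yield empty strings at the edges
        if part.isdigit():
            key.append((0, int(part)))
        else:
            key.append((1, part.lower()))
    return key


def sort_numeric_string_list(names):
    """Return a new list sorted so that 2.pdf comes before 10.pdf."""
    return sorted(names, key=natural_key)


if __name__ == "__main__":
    names = ["10.pdf", "2.pdf", "1.pdf"]
    print(sorted(names))                    # plain string sort: 1, 10, 2
    print(sort_numeric_string_list(names))  # numeric-aware sort: 1, 2, 10
```

A plain string sort compares names character by character, which is why 10.pdf ends up ahead of 2.pdf; turning each run of digits into an integer before comparing avoids that.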
