PDF manipulation with Python: Part 1

Published Jan 24, 2022

If you are comfortable with Python, there are many libraries that let you manipulate PDF files however you like. Here are some Python libraries that can be used to read, create, or edit PDF files:

PDFMiner – A tool for extracting text from a PDF file, along with its position and other layout information such as fonts and lines. It can convert PDFs to other formats such as HTML, and it comes with a command-line utility that is useful for non-programmers.
PyPDF4 – A pure-Python library capable of splitting, merging, cropping, and transforming PDF pages. It can retrieve metadata and text from a PDF and can also encrypt the file (add a password).
Slate – A wrapper around PDFMiner that exposes the PDFMiner tools through a single class.
pikepdf – An emerging library for processing PDF files, built on QPDF, a mature C++ library. It can create PDF files that pass PDF validation tests, and it can repair PDF files with internal errors. In this post, I use pikepdf to unlock PDF files. Unlocking (decrypting) a PDF is sometimes useful when you need to collaborate on it.

Part 1: Unlocking the PDF files

We will use pikepdf to open the password-protected PDF file and then simply save it back to disk without the password.

Don’t like coding? No problem: I have created a tool that unlocks a PDF file with a single click. Give it a try:
https://unlockpdf.deta.dev

Installation

pip install pikepdf
# OR
pip3 install pikepdf

Import the library

import pikepdf as pp

Open the file and save it back

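# Option 1: the with statement closes the file automatically
with pp.open('lockedfile.pdf', password='yourpassword') as pdf:
    pdf.save('unlockedfile.pdf')  # saved without encryption by default

# OR

# Option 2: open and close the file explicitly
pdf = pp.open('lockedfile.pdf', password='yourpassword')
pdf.save('unlockedfile.pdf')
pdf.close()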

If you use the “with” statement, the “pdf” object is closed automatically as soon as execution leaves the with block, so there is no need to close the PDF file explicitly.

Hope you enjoyed the post. In my next post, I will cover more PDF manipulation techniques such as splitting, merging, deleting pages, and extracting information, along with OCR and conversion.

Subhadip Mukherjee