PDF manipulation with Python: Part 1
If you are a coder and comfortable with Python, there are many libraries you can use to manipulate PDF files however you wish. Here are some Python libraries that can be used to read, create, or edit PDF files:
- PDFMiner – A tool for extracting text from a PDF file, along with its position and other information such as fonts and lines. It can convert a PDF to other formats such as HTML. It also comes with a command-line utility, which is useful for non-programmers.
- PyPDF4 – A pure-Python library capable of splitting, merging, cropping, and transforming PDF pages. It can retrieve metadata and text from a PDF, and it can encrypt the PDF file (add a password).
- Slate – A wrapper around PDFMiner that exposes the PDFMiner tools through a single class.
- pikepdf – An emerging library for processing PDF files, built on QPDF, a mature C++ library. It can create PDF files that pass PDF validation tests, and it can repair PDF files with internal errors. In this post, I am going to use this library to unlock PDF files. Unlocking (decryption) is sometimes useful when collaborating on PDF files.
Part 1: Unlocking the PDF files
We will use pikepdf to open the password-protected PDF file and then simply save it back to disk without the password.
Don’t like coding? No problem: I have created a tool that unlocks a PDF file with a single click. Give it a try:
https://unlockpdf.deta.dev
Installation
pip install pikepdf
# OR
pip3 install pikepdf
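To confirm that the installation worked, you can ask Python which version it sees. This is a small sketch using only the standard library (Python 3.8 or later); the helper name installed_version is my own and is not part of pikepdf.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if pikepdf is installed, otherwise None
print(installed_version("pikepdf"))
```

If this prints None, pip most likely installed the package for a different Python interpreter than the one you are running.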
Import the library
import pikepdf as pp
Open the file and save it back
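# Option 1: the with statement closes the file automatically
with pp.open('lockedfile.pdf', password='yourpassword') as pdf:
    pdf.save('unlockedfile.pdf')
# OR
# Option 2: open and close the file explicitly
pdf = pp.open('lockedfile.pdf', password='yourpassword')
pdf.save('unlockedfile.pdf')
pdf.close()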
If you use the “with” statement, the “pdf” object is closed automatically as soon as execution leaves the with block, so there is no need to close the PDF file explicitly.
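In practice the password may be wrong, so it can help to wrap the open-and-save step in a small function. The following is only a sketch under a few assumptions: the names unlock_pdf and default_output_path and the "_unlocked" file suffix are my own choices for illustration, and the code relies on pikepdf raising pikepdf.PasswordError when the supplied password does not open the file.

```python
from pathlib import Path


def default_output_path(src):
    """Build an output name next to the input, e.g. report.pdf -> report_unlocked.pdf.

    The "_unlocked" suffix is an arbitrary naming choice for this sketch.
    """
    src = Path(src)
    return src.with_name(f"{src.stem}_unlocked{src.suffix}")


def unlock_pdf(src, password, dst=None):
    """Save a copy of `src` without a password.

    Returns the output path, or None if the password was rejected.
    """
    # Imported here so the path helper above works even without pikepdf installed
    import pikepdf

    dst = Path(dst) if dst is not None else default_output_path(src)
    try:
        # The with statement closes the PDF even if save() fails
        with pikepdf.open(src, password=password) as pdf:
            # Saving without an encryption argument writes an unencrypted file
            pdf.save(dst)
    except pikepdf.PasswordError:
        return None
    return dst
```

Calling unlock_pdf('lockedfile.pdf', 'yourpassword') would write lockedfile_unlocked.pdf next to the original. Writing to a separate file is deliberate, since the input file is still open while the copy is being saved.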
Hope you enjoyed the post. In my next post, I will cover more PDF manipulation techniques such as splitting, merging, deleting pages, and extracting information, along with OCR and conversion.