How to split PDF files in multi-level directories with Python
Use os.walk() and PyPDF2 to automate splitting PDF files across multiple sub-directories with a Python script
If you are reading this, you already know that most office jobs involve repetitive tasks. This is where a bit of Python knowledge comes in handy. In fact, Al Sweigart has written a great book on the subject, Automate the Boring Stuff with Python. This post builds on some of the topics covered in that book and walks through a real script I use in my day-to-day work to automate an otherwise mundane task.
I have a root directory with multiple sub-directories, one for each letter from A to Z. When I scan in documents, I do so in alphabetical chunks, scanning each letter in its own batch. However, each batch leaves me with a single multi-page PDF when I need each scanned document as a separate file. I could scan the documents one at a time, but that would be time-consuming.
Here is a snippet of what my hypothetical directory looks like:
    |- a2_doc.pdf
|- B
The os module allows us to navigate directories, which comes in handy when writing code that has to work across different paths. In particular, the os.walk() function lets us step through a directory tree with multiple levels and work with each part of it separately.
import os
# Raw string keeps the backslash in the Windows path literal
for root, dirs, files in os.walk(r"C:\FY22"):
    print(root, dirs, files)
This would output the following:
C:\FY22 ['A', 'B', 'C'] []
C:\FY22\A [] ['a_doc.pdf', 'a2_doc.pdf']
On each iteration, os.walk() yields three values:
root is a string containing the path of the directory currently being visited, starting from the top-level directory passed to os.walk()
dirs is a list of strings, each naming a subfolder inside root
files is a list containing strings, where each…
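To make these three values concrete, here is a minimal sketch of how they might be combined to collect the full path of every PDF under a root folder. The helper name find_pdfs and the case-insensitive ".pdf" check are my own assumptions for illustration, not part of the script this post describes.

```python
import os


def find_pdfs(top):
    """Return the full path of every PDF under `top` (hypothetical helper)."""
    pdf_paths = []
    for root, dirs, files in os.walk(top):
        for name in files:
            # files holds bare file names, so join each one with root
            if name.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(root, name))
    # Sort so the result does not depend on the file system's listing order
    return sorted(pdf_paths)


if __name__ == "__main__":
    # r"C:\FY22" is the example root used in this post; swap in your own folder
    for path in find_pdfs(r"C:\FY22"):
        print(path)
```

Joining root with each file name is what turns a bare name into a path you can actually open, no matter how many levels deep the file sits.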