How to split PDF files in multi-level directories with Python
Use os.walk() and PyPDF2 to automate PDF splitting across multiple sub-directories with a Python script
Background
If you are reading this, you know that most office jobs involve repetitive tasks. This is where a bit of Python knowledge comes in handy. In fact, Al Sweigart wrote a great book on the subject, Automate the Boring Stuff with Python. This post builds on some topics covered in that book and walks through a real script I use in my day-to-day work to automate an otherwise mundane task.
Task
I have a root directory with multiple sub-directories spanning the alphabet from A to Z. When I scan documents, I do so in alphabetical chunks, scanning each letter as its own batch. However, this leaves me with a single multi-page PDF per batch when I need each document as its own file. I could scan the documents one at a time, but that would be time-consuming.
Here is a snippet of what my hypothetical directory looks like:
FY22
|- A
   |- a_doc.pdf
   |- a2_doc.pdf
|- B
   |- b_doc.pdf
|- C
   |- c_doc.pdf
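To follow along without touching real files, a layout like the one above can be recreated in a temporary folder. The sketch below is only an illustration: build_sample_tree and LAYOUT are hypothetical names, and the files it creates are empty placeholders rather than valid PDFs.

```python
import os
import tempfile

# Hypothetical layout mirroring the FY22 example; file names are placeholders.
LAYOUT = {
    "A": ["a_doc.pdf", "a2_doc.pdf"],
    "B": ["b_doc.pdf"],
    "C": ["c_doc.pdf"],
}


def build_sample_tree(base_dir):
    """Create FY22/<letter>/<files> under base_dir and return the FY22 path."""
    root = os.path.join(base_dir, "FY22")
    for letter, names in LAYOUT.items():
        folder = os.path.join(root, letter)
        os.makedirs(folder, exist_ok=True)
        for name in names:
            # Empty placeholder files: fine for practising directory walking,
            # but they are not valid PDFs.
            open(os.path.join(folder, name), "wb").close()
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample_root = build_sample_tree(tmp)
        print(sorted(os.listdir(sample_root)))
```

Because everything lives inside a TemporaryDirectory, the sample tree is deleted as soon as the with block exits.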
Packages
os
The os package lets us navigate directories, which comes in handy when writing code that can be applied to different paths. In particular, the os.walk() function visits every folder beneath a starting directory, letting us isolate each level of a multi-level directory tree.
For example:
import os

# Raw string so the backslash in the Windows path is taken literally
for root, dirs, files in os.walk(r"C:\FY22"):
    print(root, dirs, files)
This would output the following:
C:\FY22 ['A', 'B', 'C'] []
C:\FY22\A [] ['a_doc.pdf', 'a2_doc.pdf']
C:\FY22\B [] ['b_doc.pdf']
C:\FY22\C [] ['c_doc.pdf']
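Since os.walk() reports file names without their folder, a common next step is to join each name with the folder currently being visited to get a usable path. The following is a minimal sketch of that idea, not the script from this post; find_pdfs is a hypothetical helper name, and the case-insensitive ".pdf" filter is an assumption about how the files are named.

```python
import os


def find_pdfs(top):
    """Return the sorted full paths of every .pdf file found under top."""
    pdf_paths = []
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.lower().endswith(".pdf"):
                # root is the folder currently being visited, so joining it
                # with the bare file name gives a path we can open later.
                pdf_paths.append(os.path.join(root, name))
    return sorted(pdf_paths)


if __name__ == "__main__":
    # Same example path as above; adjust it to your own root directory.
    for path in find_pdfs(r"C:\FY22"):
        print(path)
```

Returning a list keeps the directory walking separate from whatever is done with each PDF afterwards.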
On each iteration, this function yields three values: root, dirs and files.
root is a string value referring to the file path starting from the "main" directory
dirs is a list containing strings, and each string refers to a subfolder inside root
files is a list containing strings, where each…