How to split PDF files in multi-level directories with Python

Use os.walk() and PyPDF2 to automate PDF file splitting across multiple sub-directories with a Python script

Justin Morgan Williams


Photo by Drew Beamer on Unsplash

Background

If you are reading this, you already know that most office jobs involve repetitive tasks. This is where a bit of Python knowledge comes in handy. In fact, Al Sweigart has written a great book on the subject, Automate the Boring Stuff with Python. This post builds on some of the topics covered in that book and walks through a real script I use in my day-to-day work to automate an otherwise mundane task.

Task

I have a root directory with multiple sub-directories, one for each letter from A to Z. When I scan documents, I do so in alphabetical chunks, scanning each letter as its own batch. However, each batch leaves me with a single multi-page PDF when I need the documents as individual files. I could scan the documents one at a time, but that would be time-consuming.

Here is a snippet of what my hypothetical directory looks like:

FY22
|- A
|  |- a_doc.pdf
|  |- a2_doc.pdf
|- B
|  |- b_doc.pdf
|- C
|  |- c_doc.pdf
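
The end goal of the task above is one file per document. A minimal sketch of that step for a single batch file, assuming each scanned document is exactly one page and assuming PyPDF2 2.0 or later (which provides PdfReader and PdfWriter). The function names split_pdf and page_output_path and the _page_N naming scheme are hypothetical choices for illustration, not necessarily what the finished script uses:

```python
import os


def page_output_path(pdf_path, page_number):
    """Build an output name such as a_doc_page_1.pdf next to the source file.

    The _page_N suffix is a hypothetical naming scheme.
    """
    base, ext = os.path.splitext(pdf_path)
    return f"{base}_page_{page_number}{ext}"


def split_pdf(pdf_path):
    """Write every page of pdf_path to its own single-page PDF."""
    # Imported here so the naming helper above works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter  # assumes PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    written = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()  # a fresh writer per page gives one-page files
        writer.add_page(page)
        out_path = page_output_path(pdf_path, number)
        with open(out_path, "wb") as out_file:
            writer.write(out_file)
        written.append(out_path)
    return written
```

Creating a new PdfWriter for every page keeps each output file independent of the others. The source batch file is left untouched.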

Packages

os

The os module lets us navigate directories, which comes in handy when writing code that has to work with different paths. In particular, the os.walk() function lets us isolate each part of a directory tree containing multiple levels.

For example:

import os

for root, dirs, files in os.walk(r"C:\FY22"):
    # root is the current folder; dirs and files list what it contains
    print(root, dirs, files)

This would output the following:

C:\FY22 ['A', 'B', 'C'] []
C:\FY22\A [] ['a_doc.pdf', 'a2_doc.pdf']
C:\FY22\B [] ['b_doc.pdf']
C:\FY22\C [] ['c_doc.pdf']

For each directory it visits, this function yields three values: root, dirs and files.

  • root is a string containing the path of the directory currently being visited, starting from the "main" directory
  • dirs is a list of strings, each naming a subfolder inside root
  • files is a list containing strings, where each…
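
Putting the three values together, each name in files can be joined onto root to get a full path. A minimal sketch that collects every PDF under a root folder. Here find_pdfs is a hypothetical helper name, and the sorting is added only to make the order predictable:

```python
import os


def find_pdfs(root_dir):
    """Return the full path of every .pdf file under root_dir, sorted."""
    pdf_paths = []
    for root, dirs, files in os.walk(root_dir):
        for name in files:
            # Compare in lower case so scans saved as .PDF are matched too.
            if name.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(root, name))
    return sorted(pdf_paths)
```

Using os.path.join() rather than gluing strings together with a backslash keeps the helper working on other operating systems as well.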
