Have you ever had a PDF file with images you wanted to extract, but didn’t know how? Whether it’s for presentations, reports, or personal projects, extracting images manually from a PDF can be tedious. Luckily, Python and a library called PyMuPDF can help you automate this process with just a few lines of code.
In this article, I’ll walk you through how to extract images from a PDF and save them in a well-organized folder on your computer.
What is PyMuPDF?
PyMuPDF (imported in Python under the name fitz) is a Python library that allows you to work with PDFs. It’s fast, lightweight, and well suited to tasks like extracting images directly from PDF files.
Let’s Get Started!
First things first, let’s break down the steps you’ll need to follow.
Step 1: Install PyMuPDF
To begin, you need to install the PyMuPDF library, along with Pillow, which the script in Step 2 imports as PIL. Open your command line (or terminal) and run this command:
pip install PyMuPDF Pillow
This installs the libraries that will let you work with PDF files and images in Python.
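If you want to confirm the install worked before going further, here is a small sketch that checks whether the import names used in this article (fitz for PyMuPDF and PIL for Pillow) can be found. The helper name check_installed is my own, not part of either library.

```python
from importlib.util import find_spec


def check_installed(module_names):
    """Return {module name: True/False} without actually importing anything."""
    return {name: find_spec(name) is not None for name in module_names}


# "fitz" is the import name for PyMuPDF; "PIL" is the import name for Pillow.
for name, found in check_installed(["fitz", "PIL"]).items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If either line says "missing", run the pip command again in the same environment you use to run Python.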
Step 2: Write the Python Script to Extract Images
Now let’s move on to the fun part: writing the script. Don’t worry if you’re new to coding; I’ll explain each step.
import fitz  # PyMuPDF
import os
import io
from PIL import Image

# Function to extract images from a PDF and save them to an output folder
def extract_images_from_pdf(pdf_path, output_folder):
    # Create the output folder if it doesn't exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    # Loop through each page
    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        images = page.get_images(full=True)  # Get all images from the page

        # Extract each image
        for img_index, img in enumerate(images):
            xref = img[0]
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image = Image.open(io.BytesIO(image_bytes))

            # Save image as pageX_Y.ext (X = page number, Y = image index)
            image_filename = f"page{page_number + 1}_{img_index + 1}.{image_ext}"
            image_path = os.path.join(output_folder, image_filename)
            image.save(image_path)
            print(f"Saved: {image_path}")

    print("Image extraction completed.")

# Example usage
pdf_path = 'my_note.pdf'
output_folder = 'output'  # Folder to save extracted images
extract_images_from_pdf(pdf_path, output_folder)
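One thing to be aware of: a PDF can reference the same image on several pages (a logo in the header, for example), and in that case the same xref shows up in more than one page's image list, so the script saves it once per page. If you would rather keep only the first copy, here is a minimal sketch of the idea. The helper unique_image_xrefs and the sample data are hypothetical; the only thing taken from the script above is that the first item of each image entry is its xref.

```python
def unique_image_xrefs(images_by_page):
    """Yield (page_number, img_index, xref) the first time each xref is seen.

    images_by_page: one list per page, shaped like the result of
    page.get_images(full=True), where entry[0] is the image's xref.
    """
    seen = set()
    for page_number, images in enumerate(images_by_page):
        for img_index, img in enumerate(images):
            xref = img[0]
            if xref in seen:
                continue  # already handled on an earlier page
            seen.add(xref)
            yield page_number, img_index, xref


# Made-up data: xref 7 (say, a logo) is referenced on both pages.
sample = [[(7,), (12,)], [(7,), (15,)]]
print(list(unique_image_xrefs(sample)))
# → [(0, 0, 7), (0, 1, 12), (1, 1, 15)]
```

In the real script, the same "seen" set could live inside extract_images_from_pdf, with the check placed right after xref = img[0].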
What’s Happening in the Code?
Let me break it down step by step, so it makes sense even if you’re seeing Python for the first time.
- Importing Libraries:
  - fitz is the main library for working with PDFs.
  - os helps us work with folders and files.
  - io allows us to handle in-memory files.
  - Pillow (PIL) is used to handle image processing.
- Creating an Output Folder:
  - The script checks if a folder called output exists. If not, it creates one. This is where your extracted images will be saved.
- Opening the PDF File:
  - fitz.open(pdf_path) opens the PDF so that we can access each page inside it.
- Looping Through Each Page:
  - for page_number in range(len(pdf_document)) means the script goes through each page in the PDF, one by one.
- Extracting Images:
  - get_images(full=True) finds all the images on a page. If there’s more than one image on a page, it handles that too.
- Saving Images:
  - For each image, the script extracts it and saves it with a file name based on the page and image number (e.g., page1_1.png if it’s the first image on page 1).
  - Images are saved in their original format (PNG, JPG, etc.).
- Output:
  - After processing each image, the script prints a message to let you know the image has been saved.
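A note on the saving step: image.save() asks Pillow to encode the image again, so the file on disk is not guaranteed to be byte-for-byte what was stored in the PDF. If you want the data exactly as extract_image returned it, one option is to write the bytes straight to disk. This is only a sketch: save_raw_image is a helper name I made up, and it assumes base_image is the dictionary from the script above, with the "image" and "ext" keys.

```python
import os


def save_raw_image(base_image, output_folder, page_number, img_index):
    """Write the extracted bytes unchanged and return the path of the new file.

    base_image: a dict with "image" (bytes) and "ext" (str), like the one
    the script gets from pdf_document.extract_image(xref).
    """
    os.makedirs(output_folder, exist_ok=True)
    # Same naming scheme as the script: pageX_Y.ext, counting from 1
    filename = f"page{page_number + 1}_{img_index + 1}.{base_image['ext']}"
    image_path = os.path.join(output_folder, filename)
    with open(image_path, "wb") as image_file:
        image_file.write(base_image["image"])
    return image_path
```

With this approach the io and Pillow imports are not needed for saving, at the cost of losing Pillow's ability to convert or inspect the image along the way.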
Step 3: Run the Script
Before running the script, replace 'my_note.pdf' with the path to your PDF file. Then save the script in a .py file (e.g., extract_images.py) and run it from your terminal or command line:
python extract_images.py
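If you would rather not edit the file every time you switch PDFs, the path can be passed on the command line instead. Here is a minimal sketch using Python's built-in argparse module. The function names and the -o/--output option are my own choices for illustration, not something the original script defines.

```python
import argparse


def build_parser():
    """Describe the command-line arguments for the extraction script."""
    parser = argparse.ArgumentParser(description="Extract images from a PDF.")
    parser.add_argument("pdf_path", help="path to the PDF file")
    parser.add_argument(
        "-o", "--output", default="output",
        help="folder to save extracted images (default: output)",
    )
    return parser


def parse_cli(argv=None):
    """Return (pdf_path, output_folder); argv=None reads the real command line."""
    args = build_parser().parse_args(argv)
    return args.pdf_path, args.output


# In extract_images.py, the "Example usage" lines could then become:
#   pdf_path, output_folder = parse_cli()
#   extract_images_from_pdf(pdf_path, output_folder)
```

You would then run it as python extract_images.py my_note.pdf -o output.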
Example:
Let’s say you have a PDF with two images on the first page and one image on the second page. After running the script, your output folder will look like this:
output/
    page1_1.png
    page1_2.png
    page2_1.png
Each image is saved in sequence based on the page and image number.
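To make the naming rule concrete, here is a tiny sketch that builds the file names for a given layout using the same pageX_Y.ext pattern as the script. The function expected_filenames is hypothetical, and the extensions are made up for the example; in practice each one depends on how the image is stored in the PDF.

```python
def expected_filenames(image_exts_by_page):
    """Build pageX_Y.ext names; the input holds one list of extensions per page."""
    names = []
    for page_number, exts in enumerate(image_exts_by_page):
        for img_index, ext in enumerate(exts):
            # Both numbers count from 1, exactly as in the script
            names.append(f"page{page_number + 1}_{img_index + 1}.{ext}")
    return names


# Two images on page 1, one image on page 2, all PNG
print(expected_filenames([["png", "png"], ["png"]]))
# → ['page1_1.png', 'page1_2.png', 'page2_1.png']
```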
Common Questions You Might Have
What if my PDF has text and images?
The script will only extract the images from the PDF and ignore the text. If a page contains both, only the images are saved.
What if there are no images on a page?
If a page doesn’t have any images, the script will just skip over that page and move on to the next one.
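The reason is simply that the inner loop has nothing to iterate over when a page's image list is empty. Here is a small sketch of that behavior using made-up data in place of a real PDF. split_pages_by_images is a hypothetical helper, and each inner list stands in for what get_images would return for one page.

```python
def split_pages_by_images(images_by_page):
    """Return (pages_with_images, skipped_pages), numbering pages from 1."""
    pages_with_images, skipped_pages = [], []
    for page_number, images in enumerate(images_by_page, start=1):
        if images:
            pages_with_images.append(page_number)
        else:
            # An empty list means the script's inner loop never runs here
            skipped_pages.append(page_number)
    return pages_with_images, skipped_pages


# Made-up layout: page 2 has no images
print(split_pages_by_images([[(5,)], [], [(9,), (11,)]]))
# → ([1, 3], [2])
```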
Conclusion
By following this simple guide, you can now easily extract images from any PDF file using Python and PyMuPDF. Whether you’re dealing with scanned documents, reports, or books, this approach saves you time and effort.
Not only do you get the images in their original format, but they’re also saved in a well-organized folder, making it easy to manage them.
That’s all it takes to extract images from a PDF using Python. I hope this article helped you learn something new today! Thank you for reading, and I’ll see you in the next article.