Extract Images from a PDF using Python

Have you ever had a PDF file with images you wanted to extract, but didn’t know how? Whether it’s for presentations, reports, or personal projects, extracting images manually from a PDF can be tedious. Luckily, Python and a library called PyMuPDF can help you automate this process with just a few lines of code.

In this article, I’ll walk you through how to extract images from a PDF and save them in a well-organized folder on your computer.

Table of Contents

1 What is PyMuPDF?

2 Let’s Get Started!

2.1 Step 1: Install PyMuPDF

2.2 Step 2: Write the Python Script to Extract Images

2.3 What’s Happening in the Code?

2.4 Step 3: Run the Script

2.5 Example:

3 Common Questions You Might Have

3.1 What if my PDF has text and images?

3.2 What if there are no images on a page?

4 Conclusion

What is PyMuPDF?

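PyMuPDF is a Python library that allows you to work with PDFs. It has traditionally been imported under the name fitz, which is why the script in this article starts with import fitz (recent releases also accept import pymupdf). It’s fast, lightweight, and well suited to tasks like extracting images directly from PDF files.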

Let’s Get Started!

First things first, let’s break down the steps you’ll need to follow.

Step 1: Install PyMuPDF

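To begin, you need to install the PyMuPDF library. The script in the next step also uses Pillow (imported as PIL) to save the images, so install both. Open your command line (or terminal) and run this command:

pip install PyMuPDF Pillow

This installs the library that will let you interact with PDF files in Python, along with the imaging library used to write the extracted images to disk.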
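
If you’d like to confirm that both packages are visible to your Python interpreter before going further, a small check along these lines may help. It is only a sketch: missing_packages is a hypothetical helper name (not part of PyMuPDF or Pillow), and it assumes the import names fitz and PIL used by the script in this article.

```python
import importlib.util


def missing_packages(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# "fitz" is PyMuPDF's import name; "PIL" is Pillow's import name.
missing = missing_packages(["fitz", "PIL"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("PyMuPDF and Pillow are both available.")
```

If anything is reported as not installed, re-run the pip command using the same Python environment you plan to run the script with.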

Step 2: Write the Python Script to Extract Images

Now let’s move on to the fun part: writing the script. Don’t worry if you’re new to coding. I’ll explain each step.

import fitz  # PyMuPDF
import os
import io
from PIL import Image

# Function to extract images from a PDF and save them to an output folder
def extract_images_from_pdf(pdf_path, output_folder):
    # Create the output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    # Loop through each page
    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        images = page.get_images(full=True)  # Get all images from the page

        # Extract each image
        for img_index, img in enumerate(images):
            xref = img[0]
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image = Image.open(io.BytesIO(image_bytes))

            # Save image as pageX_Y.ext (X = page number, Y = image index)
            image_filename = f"page{page_number + 1}_{img_index + 1}.{image_ext}"
            image_path = os.path.join(output_folder, image_filename)
            image.save(image_path)

            print(f"Saved: {image_path}")

    # Close the document to release the file handle
    pdf_document.close()

    print("Image extraction completed.")

# Example usage
pdf_path = 'my_note.pdf'
output_folder = 'output'  # Folder to save extracted images
extract_images_from_pdf(pdf_path, output_folder)

What’s Happening in the Code?

Let me break it down step by step, so it makes sense even if you’re seeing Python for the first time.

Importing Libraries:
- fitz is the import name for PyMuPDF, the main library for working with PDFs.
- os helps us work with folders and file paths.
- io lets us wrap the raw image bytes in an in-memory file object (io.BytesIO).
- Pillow (PIL) opens those bytes as an image and saves it to disk.

Creating an Output Folder:
- The script creates the output folder (called output in the example) if it doesn’t already exist. This is where your extracted images will be saved.

Opening the PDF File:
- fitz.open(pdf_path) opens the PDF so that we can access each page inside it.

Looping Through Each Page:
- for page_number in range(len(pdf_document)) means the script goes through each page in the PDF, one by one. Page numbers start at 0 in the code, which is why the file names use page_number + 1.

Extracting Images:
- get_images(full=True) lists the images a page uses, one entry per image, so pages with more than one image are handled too.
- The first item of each entry is the image’s cross-reference number (xref). extract_image(xref) uses it to return the image bytes and the file extension.

Saving Images:
- Each image is saved with a file name based on the page and image number (e.g., page1_1.png if it’s the first image on page 1).
- The file extension matches the format PyMuPDF reports for the embedded image (PNG, JPG, etc.). Because Pillow re-encodes the image when it saves, the file on disk is not guaranteed to be byte-for-byte identical to the data stored in the PDF.

Output:
- After saving each image, the script prints its path, followed by a final message once every page has been processed.
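
If you’d rather keep the embedded image data exactly as it is stored in the PDF, one possible variation is to skip Pillow and write the bytes returned by extract_image straight to disk. The sketch below is an alternative to the script above, not a replacement the article depends on. save_raw_images is a hypothetical name, and doc is assumed to be a document you have already opened with fitz.open.

```python
import os


def save_raw_images(doc, output_folder):
    """Write each embedded image's bytes to disk unchanged.

    `doc` is assumed to be an open PyMuPDF document.
    Returns the list of file paths that were written.
    """
    os.makedirs(output_folder, exist_ok=True)
    saved = []

    for page_number in range(len(doc)):
        page = doc.load_page(page_number)
        images = page.get_images(full=True)

        for img_index, img in enumerate(images, start=1):
            xref = img[0]  # first item of each entry is the xref
            base_image = doc.extract_image(xref)

            # Same naming scheme as the main script: pageX_Y.ext
            filename = f"page{page_number + 1}_{img_index}.{base_image['ext']}"
            path = os.path.join(output_folder, filename)

            # Write the bytes as-is, with no decoding or re-encoding
            with open(path, "wb") as f:
                f.write(base_image["image"])
            saved.append(path)

    return saved


# Example usage (assumes PyMuPDF is installed and my_note.pdf exists):
#   import fitz
#   doc = fitz.open("my_note.pdf")
#   save_raw_images(doc, "output")
#   doc.close()
```

The trade-off is that you lose Pillow’s ability to convert or post-process the image, but formats Pillow cannot write are no longer a concern.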

Step 3: Run the Script

To run the script, save it in a .py file (e.g., extract_images.py) and replace 'my_note.pdf' with the path to your PDF file. Then run it from your terminal or command line:

python extract_images.py
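
If you expect to run the script on many different PDFs, editing the file each time gets old quickly. One option is to read the PDF path from the command line using Python’s built-in argparse module. This is a sketch under assumptions: parse_args and the -o/--output-folder option are names made up for illustration, not something the original script defines.

```python
import argparse


def parse_args(argv=None):
    """Parse the PDF path and an optional output folder from the command line."""
    parser = argparse.ArgumentParser(
        description="Extract images from a PDF file."
    )
    parser.add_argument("pdf_path", help="path to the input PDF")
    parser.add_argument(
        "-o", "--output-folder",
        default="output",
        help="folder to save extracted images (default: output)",
    )
    return parser.parse_args(argv)


# Example usage, replacing the hard-coded values at the end of the script:
#   args = parse_args()
#   extract_images_from_pdf(args.pdf_path, args.output_folder)
```

With that in place, the command would look like python extract_images.py my_note.pdf -o output.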

Example:

Let’s say you have a PDF with two images on the first page and one image on the second page, all embedded as PNGs. After running the script, your output folder will look like this:

output/
    page1_1.png
    page1_2.png
    page2_1.png

Each image is saved in sequence based on the page and image number.
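
To make the naming rule concrete, here is a tiny stand-alone sketch that reproduces the file names from the number of images on each page. It does not touch any PDF. expected_filenames is a hypothetical helper used only for illustration, and it assumes every image shares the same extension.

```python
def expected_filenames(images_per_page, ext="png"):
    """Build pageX_Y.ext names from a list of per-page image counts."""
    names = []
    for page_number, count in enumerate(images_per_page, start=1):
        for img_index in range(1, count + 1):
            names.append(f"page{page_number}_{img_index}.{ext}")
    return names


# Two images on page 1, one image on page 2:
print(expected_filenames([2, 1]))
# prints ['page1_1.png', 'page1_2.png', 'page2_1.png']
```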

Common Questions You Might Have

What if my PDF has text and images?

The script will only extract the images from the PDF and ignore the text. If a page contains both, only the images are saved.

What if there are no images on a page?

If a page doesn’t have any images, the script will just skip over that page and move on to the next one.
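
If you’re curious which pages will be skipped, you could count the image entries on each page before extracting anything. The sketch below uses the same load_page and get_images calls as the main script. count_images_per_page is a hypothetical name, and doc is assumed to be a document already opened with fitz.open.

```python
def count_images_per_page(doc):
    """Map 1-based page numbers to the number of image entries on that page.

    `doc` is assumed to be an open PyMuPDF document.
    """
    counts = {}
    for page_number in range(len(doc)):
        page = doc.load_page(page_number)
        # An empty list here means the extraction loop has nothing to do
        counts[page_number + 1] = len(page.get_images(full=True))
    return counts


# Example usage (assumes PyMuPDF is installed and my_note.pdf exists):
#   import fitz
#   doc = fitz.open("my_note.pdf")
#   for page, count in count_images_per_page(doc).items():
#       print(f"Page {page}: {count} image(s)")
#   doc.close()
```

Pages that report zero images are exactly the ones the script passes over without saving anything.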

Conclusion

By following this guide, you can now extract the embedded images from a PDF file using Python and PyMuPDF. Whether you’re dealing with scanned documents, reports, or books, this approach saves you the time and effort of copying images out by hand.

The images are saved in the format they were embedded in and collected in a single, well-organized folder, making it easy to manage them.

That’s how you can extract images from a PDF using Python. I hope this article helped you learn something new today! Thank you for reading, and I’ll see you in the next article.