Convert PDF to HTML using Python

Prasanth Sai

June 23, 2024

PDF is one of the most common formats companies use to store and exchange information with stakeholders. Converting PDFs that contain images and tables into HTML while retaining their structure is a crucial task for many Generative AI applications. Generative AI models generally handle text-based formats such as plain text, HTML, JSON, and Markdown far better than a binary layout format like PDF, so converting PDF to HTML is a practical way to help models comprehend PDF content.

Challenges with Today’s Free PDF Tools

Does Not Retain Structure: Many free tools fail to keep the original structure of the PDF intact.
Poor Image and Table Retention: Converting images and tables from PDFs to HTML often results in a loss of formatting and accuracy.

Solution

Out of all the open-source solutions available, we have found the following method to be the best for converting PDFs to HTML. Although this method is not always 100% accurate, it is the most reliable among the solutions we tested, retaining structures, images, and tables effectively. Developers might still need to build some custom code on top of this based on specific PDF structures and business requirements to achieve 100% consistency.

We will use the pymupdf4llm package and the fitz module from PyMuPDF, plus the markdown package for the final HTML step.

Solution Steps

Step 1: Import Necessary Libraries

First, install the required libraries if you haven’t already:

pip install pymupdf pymupdf4llm markdown

Now, import the necessary libraries in your Python script:

import fitz
import os
import pymupdf4llm
import pathlib
import re

Step 2: Function to Extract All Images from the PDF

Define a function to extract images from the PDF:

def extract_images(pdf_path, output_folder):
    # Create the output folder if it does not exist yet
    os.makedirs(output_folder, exist_ok=True)
    pdf_document = fitz.open(pdf_path)
    for page_num in range(len(pdf_document)):
        page = pdf_document[page_num]
        image_list = page.get_images(full=True)
        for image_index, img in enumerate(image_list):
            xref = img[0]  # Cross-reference number of the image object
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            # File names use 1-based page and image numbers
            image_filename = f"image_{page_num + 1}_{image_index + 1}.{image_ext}"
            with open(os.path.join(output_folder, image_filename), "wb") as image_file:
                image_file.write(image_bytes)
    pdf_document.close()

Step 3: Function to Replace Images in the PDF with Image Path Placeholders

Create a function to replace images in the PDF with placeholders. It expects image_positions to map each zero-based page number to a list of (rect, image_filename) pairs, where rect is the image's bounding box on that page:

def replace_images_with_placeholders(pdf_path, output_pdf_path, image_positions):
    pdf_document = fitz.open(pdf_path)
    for page_number, images in image_positions.items():
        page = pdf_document.load_page(page_number)
        images.sort(key=lambda x: (x[0].y0, x[0].x0))  # Sort images top to bottom, then left to right
        for rect, image_filename in images:
            # Write the placeholder text into the image's bounding box
            placeholder_text = f"[{image_filename}]"
            page.insert_textbox(rect, placeholder_text, fontsize=12, color=(0, 0, 0))
    pdf_document.save(output_pdf_path)
    pdf_document.close()
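
Step 3 takes an image_positions argument, but the steps here do not show how to build it. The following is a minimal sketch, assuming the mapping has the shape {zero-based page number: [(rect, image_filename), ...]} and that file names follow the pattern used in Step 2. The helper name build_image_positions is hypothetical. It relies on PyMuPDF's Page.get_images and Page.get_image_rects to find where each image is drawn on the page.

```python
def build_image_positions(pdf_document):
    """Map zero-based page numbers to [(rect, image_filename), ...].

    pdf_document is an already opened PyMuPDF document, for example
    the object returned by fitz.open(pdf_path).
    """
    image_positions = {}
    for page_num in range(len(pdf_document)):
        page = pdf_document[page_num]
        for image_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            # Reuse the Step 2 naming scheme so placeholders match the saved files
            image_ext = pdf_document.extract_image(xref)["ext"]
            image_filename = f"image_{page_num + 1}_{image_index + 1}.{image_ext}"
            # An image can be drawn more than once on a page, so keep every rectangle
            for rect in page.get_image_rects(xref):
                image_positions.setdefault(page_num, []).append((rect, image_filename))
    return image_positions
```

The result can be passed directly as the third argument of replace_images_with_placeholders. Pages without images are simply absent from the mapping.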

Step 4: Convert the Modified PDF to Markdown with the Right Image Path Format

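Use pymupdf4llm to convert the modified PDF to Markdown:

def convert_pdf_to_markdown(pdf_path, image_folder):
    # image_folder is not needed here: the images were already saved in Step 2
    # and are referenced through the placeholders inserted in Step 3
    output_markdown_path = "output.md"
    markdown_text = pymupdf4llm.to_markdown(pdf_path)  # Returns the Markdown as a string
    pathlib.Path(output_markdown_path).write_text(markdown_text, encoding="utf-8")
    return output_markdown_path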

Step 5: Replace the Image Placeholders in the Markdown with the Right Markdown Image Format

Define a function to replace the image placeholders with Markdown image syntax:

def convert_into_markdownimages(markdown_path, output_folder):
    with open(markdown_path, "r", encoding="utf-8") as md_file:
        lines = md_file.readlines()

    new_lines = []
    # Match the placeholders written in Step 3, e.g. [image_1_2.png] or [image_3_1.jpeg]
    image_pattern = re.compile(r'\[(image_\d+_\d+\.\w+)\]')

    for line in lines:
        for image_filename in image_pattern.findall(line):
            image_path = os.path.join(output_folder, image_filename)
            # Only rewrite placeholders whose image file was actually extracted
            if os.path.exists(image_path):
                image_markdown = f"![image]({image_path})"
                line = line.replace(f'[{image_filename}]', image_markdown)
        new_lines.append(line)

    with open(markdown_path, "w", encoding="utf-8") as md_file:
        md_file.writelines(new_lines)
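
The placeholder substitution in this step can be tried on an in-memory string before touching any files. This is a minimal sketch, assuming placeholders follow the image_<page>_<index>.<ext> naming from Step 2. The helper name placeholders_to_markdown is hypothetical, and unlike the function above it does not check that the image file exists.

```python
import re

# Matches placeholders such as [image_1_2.png] or [image_3_1.jpeg]
PLACEHOLDER = re.compile(r"\[(image_\d+_\d+\.\w+)\]")


def placeholders_to_markdown(text, image_folder):
    """Replace [image_x_y.ext] placeholders with Markdown image syntax."""
    def to_image(match):
        filename = match.group(1)
        # Forward slashes keep the link valid in Markdown and HTML on every OS
        return f"![{filename}]({image_folder}/{filename})"
    return PLACEHOLDER.sub(to_image, text)


sample = "Intro text [image_1_1.png] and a [normal link](https://example.com)."
print(placeholders_to_markdown(sample, "images"))
```

Running it prints `Intro text ![image_1_1.png](images/image_1_1.png) and a [normal link](https://example.com).` The ordinary Markdown link is left alone because the pattern only accepts the generated file names.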

Step 6: Convert Markdown to HTML

Finally, convert the Markdown to HTML:

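from markdown import markdown

def convert_markdown_to_html(markdown_path, html_output_path):
    with open(markdown_path, 'r', encoding='utf-8', errors='ignore') as md_file:
        markdown_text = md_file.read()

    # The "tables" extension converts Markdown tables into HTML <table> elements
    html = markdown(markdown_text, extensions=['tables'])

    with open(html_output_path, 'w', encoding='utf-8') as html_file:
        html_file.write(html)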
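
The markdown function returns an HTML fragment, with no html, head, or body elements around it. If a standalone page is needed, the fragment can be wrapped before it is written to disk. This is a minimal sketch using only the standard library. The helper name wrap_html_fragment and the default title are assumptions for illustration.

```python
from html import escape


def wrap_html_fragment(fragment, title="Converted PDF"):
    """Wrap an HTML fragment in a minimal, UTF-8 declared HTML document."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        # Escape the title so characters like & or < cannot break the markup
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{fragment}\n"
        "</body>\n"
        "</html>\n"
    )


page = wrap_html_fragment("<p>Hello</p>", title="Report & Summary")
print(page)
```

Declaring the charset matters when the PDF contains non-ASCII text, since browsers may otherwise guess the encoding of the saved file.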

Conclusion

Converting PDFs to HTML preserves the document's structure and helps AI models understand the content better. You can build this pipeline yourself as demonstrated, but generative AI applications often chain several such tools, each of which needs its own adjustments, and writing all of that custom code is time-consuming. Instead, you can start from the pre-built blocks provided by Chatgen Automation workflows and modify them to suit your requirements.

With these tools and techniques, you can streamline the conversion process and improve the accuracy and efficiency of your generative AI projects.