Convert PDF to HTML using Python
PDF is one of the most common formats companies use to store and exchange information with stakeholders. Converting PDFs that contain images and tables into HTML while retaining their structure is an important task for many Generative AI applications. Generative AI models generally handle text, HTML, JSON, and Markdown better than binary document formats, which makes PDF-to-HTML conversion a practical way to help models comprehend PDFs more effectively.
Challenges with Today’s Free PDF Tools
- Does Not Retain Structure: Many free tools fail to keep the original structure of the PDF intact.
- Poor Image and Table Retention: Converting images and tables from PDFs to HTML often results in a loss of formatting and accuracy.
Solution
Of the open-source solutions we tested, the following method worked best for converting PDFs to HTML. It is not always 100% accurate, but it was the most reliable at retaining structure, images, and tables. Developers may still need to build some custom code on top of it, based on specific PDF structures and business requirements, to get consistent results.
We will use pymupdf4llm together with fitz, the import name of PyMuPDF (pymupdf).
Solution Steps
Step 1: Import Necessary Libraries
First, install the required libraries if you haven’t already:
pip install pymupdf pymupdf4llm markdown
Now, import the necessary libraries in your Python script:
import fitz
import os
import pymupdf4llm
import pathlib
import re
Step 2: Function to Extract All Images from the PDF
Define a function to extract images from the PDF:
def extract_images(pdf_path, output_folder):
    # Create the output folder up front so the writes below do not fail
    os.makedirs(output_folder, exist_ok=True)
    pdf_document = fitz.open(pdf_path)
    for page_num in range(len(pdf_document)):
        page = pdf_document[page_num]
        image_list = page.get_images(full=True)
        for image_index, img in enumerate(image_list):
            xref = img[0]  # Cross-reference number of the image object
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            # File names are 1-based: image_<page>_<index>.<ext>
            image_filename = f"image_{page_num + 1}_{image_index + 1}.{image_ext}"
            with open(os.path.join(output_folder, image_filename), "wb") as image_file:
                image_file.write(image_bytes)
    pdf_document.close()
Step 3: Function to Replace Images in the PDF with Image Path Placeholders
Create a function to replace images in the PDF with placeholders:
def replace_images_with_placeholders(pdf_path, output_pdf_path, image_positions):
    # image_positions maps a 0-based page number to a list of (fitz.Rect, image_filename) pairs
    pdf_document = fitz.open(pdf_path)
    for page_number, images in image_positions.items():
        page = pdf_document.load_page(page_number)
        images.sort(key=lambda x: (x[0].y0, x[0].x0))  # Sort images by their position on the page
        for rect, image_filename in images:
            placeholder_text = f"[{image_filename}]"
            page.insert_textbox(rect, placeholder_text, fontsize=12, color=(0, 0, 0))
    pdf_document.save(output_pdf_path)
    pdf_document.close()
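Step 3 expects an `image_positions` argument, but none of the steps show how to build it. The sketch below is one possible way to assemble it, assuming the file names follow the `image_<page>_<index>.<ext>` pattern from Step 2 and using PyMuPDF's `Page.get_image_rects` to find where each image is drawn. The helper name `collect_image_positions` is hypothetical and not part of the original workflow.

```python
from collections import defaultdict


def collect_image_positions(pdf_document):
    """Return {0-based page number: [(rect, image_filename), ...]}.

    `pdf_document` is an already opened PyMuPDF document (fitz.open(...)).
    File names mirror the naming scheme used when extracting the images.
    """
    image_positions = defaultdict(list)
    for page_num in range(len(pdf_document)):
        page = pdf_document[page_num]
        for image_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            image_ext = pdf_document.extract_image(xref)["ext"]
            image_filename = f"image_{page_num + 1}_{image_index + 1}.{image_ext}"
            # The same image can be drawn several times on a page,
            # so record one entry per placement rectangle.
            for rect in page.get_image_rects(xref):
                image_positions[page_num].append((rect, image_filename))
    return dict(image_positions)
```

The result can then be passed as the third argument to `replace_images_with_placeholders`, for example after opening the PDF with `fitz.open(pdf_path)` and calling `collect_image_positions` on it.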
Step 4: Convert the Modified PDF to Markdown with the Right Image Path Format
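Use pymupdf4llm to convert the modified PDF to Markdown:
def convert_pdf_to_markdown(pdf_path, image_folder):
    # image_folder is not used here: the images were already written in Step 2
    output_markdown_path = "output.md"
    # to_markdown returns the Markdown as a string; write it to disk ourselves
    md_text = pymupdf4llm.to_markdown(pdf_path)
    pathlib.Path(output_markdown_path).write_text(md_text, encoding="utf-8")
    return output_markdown_path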
Step 5: Replace the Image Placeholders in the Markdown with Markdown Image Syntax
Define a function that replaces the image placeholders with Markdown image syntax:
def convert_into_markdownimages(markdown_path, output_folder):
    with open(markdown_path, "r", encoding="utf-8") as md_file:
        lines = md_file.readlines()
    new_lines = []
    # Match any extension, since extracted images are not always PNG
    image_pattern = re.compile(r'\[(image_\d+_\d+\.\w+)\]')
    for line in lines:
        matches = image_pattern.findall(line)
        for image_filename in matches:
            image_path = os.path.join(output_folder, image_filename)
            # Only rewrite placeholders whose image file actually exists
            if os.path.exists(image_path):
                image_markdown = f"![image]({image_path})"
                line = line.replace(f'[{image_filename}]', image_markdown)
        new_lines.append(line)
    with open(markdown_path, "w", encoding="utf-8") as md_file:
        md_file.writelines(new_lines)
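To see what this step does to a single line of Markdown, here is a small string-based sketch. It is an illustration rather than the workflow's own code: the helper name `placeholders_to_markdown`, the `known_files` set (standing in for the check that the image file exists), and the sample text are all assumptions. The pattern accepts any extension that Step 2 might have written.

```python
import re

# Placeholders look like [image_<page>_<index>.<ext>]
PLACEHOLDER = re.compile(r"\[(image_\d+_\d+\.\w+)\]")


def placeholders_to_markdown(text, image_folder, known_files):
    """Rewrite known placeholders as Markdown images; leave the rest untouched."""
    def repl(match):
        name = match.group(1)
        if name not in known_files:
            return match.group(0)  # unknown file: keep the placeholder as-is
        return f"![image]({image_folder}/{name})"
    return PLACEHOLDER.sub(repl, text)


sample = "Revenue chart: [image_1_1.png] and [image_9_9.png]"
print(placeholders_to_markdown(sample, "images", {"image_1_1.png"}))
# Revenue chart: ![image](images/image_1_1.png) and [image_9_9.png]
```

Using a single `re.sub` pass with a callback means each placeholder is rewritten exactly once, even if the same file name appears several times on a line.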
Step 6: Convert Markdown to HTML
Finally, convert the Markdown to HTML:
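from markdown import markdown
def convert_markdown_to_html(markdown_path, html_output_path):
    with open(markdown_path, 'r', encoding='utf-8', errors='ignore') as md_file:
        md_text = md_file.read()
    # The "tables" extension is needed for Markdown tables to become <table> elements
    html = markdown(md_text, extensions=['tables'])
    with open(html_output_path, 'w', encoding='utf-8') as html_file:
        html_file.write(html)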
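Note that `markdown()` returns an HTML fragment, not a complete page, so the file written above has no `<html>`, `<head>`, or charset declaration. If a standalone page is needed, a wrapper along these lines may help. This is a minimal sketch using only the standard library; the function name `wrap_html_fragment` and the default title are assumptions.

```python
import html


def wrap_html_fragment(fragment, title="Converted PDF"):
    """Wrap an HTML fragment in a minimal standalone HTML5 document."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        # Escape the title so characters like < or & cannot break the markup
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{fragment}\n"
        "</body>\n"
        "</html>\n"
    )
```

For example, calling `wrap_html_fragment(html_text, title="Quarterly report")` before writing the output file would let browsers detect the UTF-8 encoding reliably.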
Conclusion
Converting PDFs to HTML helps preserve document structure and makes the content easier for AI models to interpret. While you can build your own code as demonstrated, generative AI applications often require several such tools, each needing modification, and writing all of it yourself can be time-consuming. Instead, you can start from the pre-built blocks provided by Chatgen Automation workflows and modify them as needed to suit your requirements.
By using these tools and techniques, you can streamline the conversion process and improve the accuracy and efficiency of your generative AI projects.