Author Topic: how to extract images embedded in a self-contained html page as files?  (Read 1134 times)


Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4017
  • Country: gb
If someone (insanely) gives you a 14 MB self-contained web page (a single .html file) with a table where the third column of every row contains an embedded image

Code: [Select]
  </v:shapetype><v:shape id="Picture_x0020_507" o:spid="_x0000_s1518" type="#_x0000_t75"
   height:184.5pt;z-index:494;visibility:visible' o:gfxdata="UEsDBBQABgAIAAAAIQDAV3P7DAEAA

How do I extract every picture as a file?

If there were just a few, I would do it manually, but there are thousands and thousands, so I need a script or something  :-//
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4017
  • Country: gb
(Can I seriously embed a .png image into an HTML page?  :o :o :o
Yup, see here ... )
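
For what it's worth, the usual way a .png ends up inside an HTML page is a `data:` URI with a base64 payload in an `src` attribute. Below is a minimal sketch of pulling such images back out with only the standard library. The function name `extract_data_uri_images` and the sample HTML are made up for illustration, and this is not necessarily how the 14 MB file above stores its pictures:

```python
import base64
import re

# Matches data URIs such as data:image/png;base64,iVBORw0... up to the closing quote.
DATA_URI_RE = re.compile(
    r'data:image/(?P<kind>[a-zA-Z0-9.+-]+);base64,(?P<payload>[A-Za-z0-9+/=\s]+)'
)

def extract_data_uri_images(html):
    """Return a list of (image_type, raw_bytes) for every base64 data URI found."""
    images = []
    for match in DATA_URI_RE.finditer(html):
        # b64decode ignores the whitespace that may be wrapped into the payload.
        raw = base64.b64decode(match.group('payload'))
        images.append((match.group('kind'), raw))
    return images

# Hypothetical sample: the payload is only the 8-byte PNG signature, not a real image.
png_signature = b'\x89PNG\r\n\x1a\n'
sample_html = '<img src="data:image/png;base64,{}">'.format(
    base64.b64encode(png_signature).decode('ascii')
)

for kind, raw in extract_data_uri_images(sample_html):
    print(kind, len(raw), raw[:4])
```

Each returned pair can then be written out with `open(name, 'wb')`, using the image type as the file extension.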

Offline Smokey

  • Super Contributor
  • ***
  • Posts: 2691
  • Country: us
  • Not An Expert
I pasted exactly what you posted into ChatGPT and added that I wanted a Python script.

Code: [Select]
import re
import os
from bs4 import BeautifulSoup
import base64

# Read the HTML content from the file
file_path = 'path/to/your/file.html'  # Update this with the path to your HTML file
with open(file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Find all elements that contain the image data
image_elements = soup.find_all('v:shape', {'o:gfxdata': True})

# Create a directory to save the extracted images
output_dir = 'extracted_images'
os.makedirs(output_dir, exist_ok=True)

# Loop through each image element and extract the image data
for index, element in enumerate(image_elements):
    # Extract the base64 image data
    gfxdata = element.get('o:gfxdata')
    # Decode the base64 data
    image_data = base64.b64decode(gfxdata)
    # Determine the image format (assuming PNG for this example)
    image_filename = os.path.join(output_dir, f'image_{index + 1}.png')
    # Save the image to a file
    with open(image_filename, 'wb') as image_file:
        image_file.write(image_data)
    print(f'Saved {image_filename}')
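
One caveat on the script above: the sample in the first post starts with `UEsDBBQA`, which is base64 for bytes beginning `PK\x03\x04`, the ZIP local-file signature. So the decoded `o:gfxdata` is likely a small ZIP package rather than a bare PNG. Here is a hedged sketch of checking for that and unpacking any image members with the standard `zipfile` module. The helper name `images_from_gfxdata`, the extension list, and the member names in the in-memory test archive are assumptions for illustration; the real layout inside the package may differ:

```python
import base64
import io
import zipfile

# Assumed list of extensions worth keeping; adjust to what the archive really holds.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif', '.bmp', '.emf', '.wmf')

def images_from_gfxdata(gfxdata):
    """Decode a base64 o:gfxdata string and return {member_name: bytes} for image files.

    Returns an empty dict if the decoded payload is not a ZIP archive.
    """
    raw = base64.b64decode(gfxdata)
    if not raw.startswith(b'PK\x03\x04'):
        return {}
    found = {}
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(IMAGE_EXTENSIONS):
                found[name] = archive.read(name)
    return found

# Hypothetical stand-in for a real attribute value: a tiny ZIP built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('media/image1.png', b'\x89PNG\r\n\x1a\n')  # made-up member
    archive.writestr('drs/shapexml.xml', '<xml/>')               # made-up member
fake_gfxdata = base64.b64encode(buffer.getvalue()).decode('ascii')

print(sorted(images_from_gfxdata(fake_gfxdata)))
```

In the loop of the script above, `images_from_gfxdata(gfxdata)` could replace the direct `b64decode`, writing each returned member to its own file.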

Offline ledtester

  • Super Contributor
  • ***
  • Posts: 3105
  • Country: us
One way is to use a library like puppeteer to control a headless version of Chrome.

Some links:

- headless Chrome:
- puppeteer:
- example puppeteer code to download the image data of an image ("Method 3"):

