Skip to main content

HTML Cleaning Script

Description

This is a short script written in Python that performs the following on HTML files:

  • Removes:
    • <head> and all its content
    • <span> (keeps content)
    • <html>
    • <body>
    • empty tags
    • all tag attributes (classes, etc.)
    • comments
    • trailing and leading spaces

I used the Beautiful Soup module to perform most of the operations. The script can be used to prepare raw HTML files for import into content management systems.

Technologies Used

  • Python
    • Beautiful Soup
    • re
    • argparse


  View code on GitHub