HTML Cleaning Script

Description

This short Python script cleans HTML files by removing the following:
- <head> and all its content
- <span> (keeps content)
- <html>
- <body>
- empty tags
- all tag attributes (classes, etc.)
- comments
- trailing and leading spaces

I used the Beautiful Soup module to perform most of the operations. The script can be used to prepare raw HTML files for import into content management systems.
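
The removal steps listed above could be sketched with Beautiful Soup roughly as follows. This is a minimal illustration, not the script's actual code. The function name clean_html, the VOID_TAGS list, the use of the html.parser backend, and the choice to unwrap <html> and <body> (keeping their content, as with <span>) are all assumptions.

```python
import re

from bs4 import BeautifulSoup, Comment

# Tags that are empty by design and should survive the empty-tag pass.
# This list is an assumption; the original script may handle them differently.
VOID_TAGS = ["br", "hr", "img", "input"]


def clean_html(raw: str) -> str:
    """Hypothetical helper that applies the cleaning steps in order."""
    soup = BeautifulSoup(raw, "html.parser")

    # Detach <head> together with everything inside it.
    for head in soup.find_all("head"):
        head.extract()

    # Remove comments.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Unwrap <span>, <html> and <body>: the tag goes, its children stay.
    for tag in soup.find_all(["span", "html", "body"]):
        tag.unwrap()

    # Strip every attribute (class, style, id, ...).
    for tag in soup.find_all(True):
        tag.attrs = {}

    # Remove empty tags. Walking in reverse document order visits children
    # before their parents, so a tag that only held empty tags goes too.
    for tag in reversed(soup.find_all(True)):
        if tag.name in VOID_TAGS:
            continue
        if not tag.get_text(strip=True) and tag.find(VOID_TAGS) is None:
            tag.extract()

    # Trim leading and trailing spaces on every line, then on the whole text.
    text = re.sub(r"^[ \t]+|[ \t]+$", "", str(soup), flags=re.MULTILINE)
    return text.strip()
```

Under these assumptions, an input such as `<body><p class="x">Hello <span>world</span></p><div></div></body>` would come out as a bare `<p>` element with its text intact.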

Technologies Used

Python, using these modules:
- Beautiful Soup
- re
- argparse
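
As an illustration of where argparse might fit, a command-line interface for such a script could look like the sketch below. The argument names (input, -o/--output) and the function name build_parser are hypothetical and are not taken from the original script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical CLI for the cleaning script."""
    parser = argparse.ArgumentParser(
        description="Clean a raw HTML file before importing it into a CMS."
    )
    # Positional argument: the file to clean (assumed name).
    parser.add_argument("input", help="path to the raw HTML file")
    # Optional argument: where to write the result (assumed name).
    parser.add_argument(
        "-o", "--output", default=None,
        help="path for the cleaned file; if omitted, the caller decides",
    )
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["page.html", "-o", "clean.html"])
```

Passing an explicit list to parse_args keeps the example runnable without real command-line input.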

View code on GitHub
