
HTML Cleaning Script
Description
This is a short script written in Python that performs the following on HTML files:
- Removes:
<head>and all its content<span>(keeps content)<html><body>- empty tags
- all tag attributes (classes, etc.)
- comments
- trailing and leading spaces
I used the Beautiful Soup module to perform most of the operations. The script can be used to prepare raw HTML files for import into content management systems.
Technologies Used
- Python
- Beautiful Soup
- re
- argparse
View code on GitHub