- Update base_scraper.py convert_to_markdown() to properly clean HTML
- Remove script/style blocks and their content before conversion
- Strip inline JavaScript event handlers
- Clean up br tags and excessive blank lines
- Fix malformed comparison operators that look like tags
- Add comprehensive HTML cleaning during content extraction (not after)
- Test confirms WordPress content now generates clean markdown without HTML

This ensures all future WordPress scraping produces specification-compliant markdown without any HTML/XML contamination.

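The cleaning steps listed in the commit message can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual code in base_scraper.py: the function name `clean_html_for_markdown`, the regular expressions, and the choice to escape a stray `<` as `&lt;` are all assumptions made for the example.

```python
import re

# Hypothetical helper; the real convert_to_markdown() may work differently.
SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)
STRAY_LT = re.compile(r"<(?![A-Za-z/!])")  # "<" that cannot start a tag
TAG = re.compile(r"<[A-Za-z/!][^<>]*>")
EVENT_HANDLER = re.compile(
    r"""\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.I
)
BR_TAG = re.compile(r"<br\s*/?>", re.I)


def clean_html_for_markdown(html: str) -> str:
    """Clean raw HTML before it is handed to a markdown converter."""
    # 1. Drop <script>/<style> blocks together with their content.
    html = SCRIPT_STYLE.sub("", html)
    # 2. Escape comparison operators such as "x <= 3" so they are not
    #    mistaken for the start of a tag (assumed interpretation).
    html = STRAY_LT.sub("&lt;", html)
    # 3. Strip inline event handlers (onclick=..., onload=...) inside tags only.
    html = TAG.sub(lambda m: EVENT_HANDLER.sub("", m.group(0)), html)
    # 4. Turn <br> variants into plain newlines.
    html = BR_TAG.sub("\n", html)
    # 5. Trim trailing spaces and collapse runs of blank lines.
    html = re.sub(r"[ \t]+\n", "\n", html)
    html = re.sub(r"\n{3,}", "\n\n", html)
    return html.strip()


if __name__ == "__main__":
    sample = (
        '<p onclick="track()">Set temp &lt; 5</p>'
        "<script>var a = 1 < 2;</script><style>p{color:red}</style>"
        "Line one<br/>Line two\n\n\n\nEnd if x <= 3"
    )
    print(clean_html_for_markdown(sample))
```

Running the cleanup before conversion, as the commit describes, means the converter never sees script bodies or handler attributes, so nothing has to be patched out of the generated markdown afterwards.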

| Name | | |
|---|---|---|
| hvacknowitall_podcast_backlog_20250818_221531.md | ||
| hvacknowitall_wordpress_backlog_20250818_215653.md | ||
| hvacknowitall_wordpress_backlog_20250818_215653.md.backup | ||
| hvacknowitall_wordpress_backlog_20250818_221159.md | ||
| hvacknowitall_wordpress_backlog_20250818_221159.md.backup | ||
| hvacknowitall_wordpress_backlog_20250818_221430.md | ||
| hvacknowitall_wordpress_backlog_20250818_221430.md.backup | ||
| hvacknowitall_youtube_backlog_20250818_221604.md | ||