hvac-kia-content/data_production_backlog/markdown_current
Ben Reed 8b83185130 Fix HTML/XML contamination in WordPress markdown extraction
- Update base_scraper.py convert_to_markdown() to properly clean HTML
- Remove script/style blocks and their content before conversion
- Strip inline JavaScript event handlers
- Clean up br tags and excessive blank lines
- Fix malformed comparison operators that look like tags
- Add comprehensive HTML cleaning during content extraction (not after)
- Test confirms WordPress content now generates clean markdown without HTML

This ensures all future WordPress scraping produces specification-compliant
markdown without any HTML/XML contamination.
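
The cleaning steps listed in this commit message can be sketched roughly as follows. This is not the actual convert_to_markdown() code from base_scraper.py, which is not shown here. The function name clean_html_fragment and every regular expression are hypothetical, and reading "malformed comparison operators" as a bare `<` that does not start a real tag is an assumption.

```python
import re

# Hypothetical patterns; the real base_scraper.py may clean HTML differently.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)
_TAG = re.compile(r"<[a-zA-Z][^>]*>")
_EVENT_ATTR = re.compile(
    r"\s+on[a-z]+\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\s>]+)", re.I
)
_BR = re.compile(r"<br\s*/?>", re.I)
_BARE_LT = re.compile(r"<(?![a-zA-Z/!])")
_BLANK_RUN = re.compile(r"\n{3,}")


def clean_html_fragment(text: str) -> str:
    """Clean an HTML fragment before it is handed to a markdown converter."""
    # 1. Drop script/style blocks together with their content.
    text = _SCRIPT_STYLE.sub("", text)
    # 2. Strip inline event handlers (onclick=, onload=, ...) inside tags only.
    text = _TAG.sub(lambda m: _EVENT_ATTR.sub("", m.group(0)), text)
    # 3. Turn <br> variants into plain newlines.
    text = _BR.sub("\n", text)
    # 4. Escape a '<' that does not open a tag (assumed meaning of
    #    "malformed comparison operators that look like tags").
    text = _BARE_LT.sub("&lt;", text)
    # 5. Collapse runs of three or more newlines into one blank line.
    text = _BLANK_RUN.sub("\n\n", text)
    return text


if __name__ == "__main__":
    sample = (
        '<p onclick="track()">Keep temp < 40F</p>'
        "<script>alert(1)</script><br/><br/><br/>Done"
    )
    print(clean_html_fragment(sample))
```

Running the cleanup before conversion, as the commit describes, means the markdown converter never sees script bodies or handler attributes. Regular expressions are used here only to keep the sketch dependency-free; a real scraper would more likely use an HTML parser.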
2025-08-18 23:11:08 -03:00
hvacknowitall_podcast_backlog_20250818_221531.md Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
hvacknowitall_wordpress_backlog_20250818_215653.md Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
hvacknowitall_wordpress_backlog_20250818_215653.md.backup Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
hvacknowitall_wordpress_backlog_20250818_221159.md Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
hvacknowitall_wordpress_backlog_20250818_221159.md.backup Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
hvacknowitall_wordpress_backlog_20250818_221430.md Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
hvacknowitall_wordpress_backlog_20250818_221430.md.backup Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
hvacknowitall_youtube_backlog_20250818_221604.md Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00