hvac-kia-content/data_production_backlog/.state/podcast_state.json at ccdb9366db350dfd40ab52e39f353fcfcefb7cd8

ben/hvac-kia-content

Ben Reed 8b83185130 Fix HTML/XML contamination in WordPress markdown extraction

- Update base_scraper.py convert_to_markdown() to properly clean HTML
- Remove script/style blocks and their content before conversion
- Strip inline JavaScript event handlers
- Clean up br tags and excessive blank lines
- Fix malformed comparison operators that look like tags
- Add comprehensive HTML cleaning during content extraction (not after)
- Test confirms WordPress content now generates clean markdown without HTML

This ensures all future WordPress scraping produces specification-compliant
markdown without any HTML/XML contamination.
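
A minimal sketch of the kind of pre-conversion cleanup the commit message describes, using only the Python standard library. The function name `clean_html_fragment` and the regular expressions are illustrative assumptions, not the actual contents of `convert_to_markdown()` in base_scraper.py. The comparison-operator fix is omitted because the commit message does not say how it works.

```python
import re

# Hypothetical helper; the real base_scraper.py code is not shown in the source.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.I | re.S)
_OPEN_TAG = re.compile(r"<[a-zA-Z][^<>]*>")
_EVENT_ATTR = re.compile(r"""\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.I)
_BR = re.compile(r"<br\s*/?>", re.I)
_BLANK_RUN = re.compile(r"\n{3,}")


def clean_html_fragment(html: str) -> str:
    """Clean an HTML fragment before it is handed to a markdown converter."""
    # Remove script/style blocks together with their content.
    text = _SCRIPT_STYLE.sub("", html)
    # Strip inline event handlers (onclick=..., onload=...) inside opening tags only,
    # so ordinary prose that happens to contain "on... =" is left alone.
    text = _OPEN_TAG.sub(lambda m: _EVENT_ATTR.sub("", m.group(0)), text)
    # Turn <br> tags into newlines.
    text = _BR.sub("\n", text)
    # Collapse runs of three or more newlines into one blank line.
    text = _BLANK_RUN.sub("\n\n", text)
    return text.strip()


sample = '<p onclick="alert(1)">Hi</p><script>var a = 1;</script><br/>\n\n\n\nBye'
cleaned = clean_html_fragment(sample)
```

Regular expressions are a rough tool for HTML. An attribute value that itself contains `>` would defeat the tag pattern here, so a real scraper would more likely lean on an HTML parser for this step.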

2025-08-18 23:11:08 -03:00

7 lines

No EOL

198 B

JSON

 {
   "last_update": "2025-08-18T22:15:31.540072",
   "last_item_count": 428,
   "backlog_captured": true,
   "backlog_timestamp": "20250818_221531",
   "last_id": "b6e505a9-6545-c858-e325-e43bbbcf7a45"
 }
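
A state file like this could be read into typed values as follows. This is a sketch under assumptions: the `PodcastState` class and `parse_state` function are hypothetical names, and the timestamp formats are inferred from the values shown above rather than from the scraper's code.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PodcastState:
    """Typed view of podcast_state.json (field names taken from the file)."""
    last_update: datetime
    last_item_count: int
    backlog_captured: bool
    backlog_timestamp: datetime
    last_id: str


def parse_state(raw: str) -> PodcastState:
    """Parse the JSON text of a state file into a PodcastState."""
    data = json.loads(raw)
    return PodcastState(
        # ISO 8601 timestamp with microseconds, no timezone offset.
        last_update=datetime.fromisoformat(data["last_update"]),
        last_item_count=int(data["last_item_count"]),
        backlog_captured=bool(data["backlog_captured"]),
        # Compact form, assumed to be YYYYMMDD_HHMMSS.
        backlog_timestamp=datetime.strptime(data["backlog_timestamp"], "%Y%m%d_%H%M%S"),
        last_id=data["last_id"],
    )


raw_state = """
{
  "last_update": "2025-08-18T22:15:31.540072",
  "last_item_count": 428,
  "backlog_captured": true,
  "backlog_timestamp": "20250818_221531",
  "last_id": "b6e505a9-6545-c858-e325-e43bbbcf7a45"
}
"""
state = parse_state(raw_state)
```

Parsing both timestamps into `datetime` objects makes it easy to check that `backlog_timestamp` and `last_update` refer to the same run, which the values in this file suggest.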