hvac-kia-content

Ben Reed 8b83185130 Fix HTML/XML contamination in WordPress markdown extraction	2025-08-18 23:11:08 -03:00

- Update base_scraper.py convert_to_markdown() to properly clean HTML
- Remove script/style blocks and their content before conversion
- Strip inline JavaScript event handlers
- Clean up br tags and excessive blank lines
- Fix malformed comparison operators that look like tags
- Add comprehensive HTML cleaning during content extraction (not after)
- Test confirms WordPress content now generates clean markdown without HTML

This ensures all future WordPress scraping produces specification-compliant markdown without any HTML/XML contamination.
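
The cleaning steps listed in the commit message above can be sketched roughly as follows. This is a minimal illustration only, not the actual `convert_to_markdown()` from `base_scraper.py`: the helper name `pre_clean_html`, the regular expressions, and the order of the steps are all assumptions, and the sketch uses only the standard library. It covers the pre-conversion cleanup; the HTML-to-markdown conversion itself is left out.

```python
import re

# Hypothetical helper names and patterns; the real base_scraper.py may differ.

# <script>...</script> and <style>...</style> blocks, including their content.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Any opening tag, used to limit attribute stripping to tag interiors.
_OPEN_TAG = re.compile(r"<[a-zA-Z][^>]*>")
# Inline event handlers such as onclick="..." or onload='...'.
_EVENT_ATTR = re.compile(
    r"""\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)
# <br>, <br/>, <br />.
_BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
# A "<" followed by whitespace, a digit, or "=" is treated as a comparison
# operator rather than the start of a tag (an assumption of this sketch).
_STRAY_LT = re.compile(r"<(?=\s|\d|=)")
# Three or more consecutive newlines.
_BLANK_RUN = re.compile(r"\n{3,}")


def pre_clean_html(html: str) -> str:
    """Clean an HTML fragment before it is handed to a markdown converter."""
    # 1. Drop script/style blocks together with their content.
    html = _SCRIPT_STYLE.sub("", html)
    # 2. Escape comparison operators that would otherwise look like tags.
    html = _STRAY_LT.sub("&lt;", html)
    # 3. Strip inline event handlers, but only inside opening tags.
    html = _OPEN_TAG.sub(lambda m: _EVENT_ATTR.sub("", m.group(0)), html)
    # 4. Turn <br> tags into newlines.
    html = _BR_TAG.sub("\n", html)
    # 5. Collapse runs of blank lines to a single blank line.
    html = _BLANK_RUN.sub("\n\n", html)
    return html.strip()


if __name__ == "__main__":
    sample = (
        "<style>p{color:red}</style>"
        '<p onclick="go()">Airflow < 400 CFM<br/>per ton</p>'
        "\n\n\n\n<script>alert(1)</script><p>Done</p>"
    )
    print(pre_clean_html(sample))
```

Running the cleanup before conversion, rather than scrubbing the generated markdown afterwards, matches the commit's note that cleaning happens "during content extraction (not after)". A regex-based pass like this one is fragile on attribute values that contain `>`, so a parser-based approach may be preferable for untrusted input.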
.cookies	Fix critical production issues and improve spec compliance	2025-08-18 20:07:55 -03:00
.sessions	Fix critical production issues and improve spec compliance	2025-08-18 20:07:55 -03:00
backlog	Fix critical production issues and improve spec compliance	2025-08-18 20:07:55 -03:00
debug/.sessions	Fix critical production issues and improve spec compliance	2025-08-18 20:07:55 -03:00
recent	Fix critical production issues and improve spec compliance	2025-08-18 20:07:55 -03:00
wordpress_clean	Fix HTML/XML contamination in WordPress markdown extraction	2025-08-18 23:11:08 -03:00
test_wordpress.md	Add Instagram scraper with instaloader and parallel processing orchestrator	2025-08-18 12:56:57 -03:00
test_youtube.md	Add Instagram scraper with instaloader and parallel processing orchestrator	2025-08-18 12:56:57 -03:00
tiktok_advanced_test.md	Fix critical production issues and improve spec compliance	2025-08-18 20:07:55 -03:00
wordpress_content.html	Fix critical production issues and improve spec compliance	2025-08-18 20:07:55 -03:00
wordpress_content.md	Fix critical production issues and improve spec compliance	2025-08-18 20:07:55 -03:00
wordpress_markdownify.md	Fix critical production issues and improve spec compliance	2025-08-18 20:07:55 -03:00
wordpress_post_raw.json	Fix critical production issues and improve spec compliance	2025-08-18 20:07:55 -03:00