- Update base_scraper.py convert_to_markdown() to properly clean HTML
- Remove script/style blocks and their content before conversion
- Strip inline JavaScript event handlers
- Clean up br tags and excessive blank lines
- Fix malformed comparison operators that look like tags
- Add comprehensive HTML cleaning during content extraction (not after)
- Test confirms WordPress content now generates clean markdown without HTML

This ensures all future WordPress scraping produces specification-compliant markdown without any HTML/XML contamination.
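The cleaning steps listed above could look roughly like the sketch below. It is not the actual code in base_scraper.py: the function name `clean_html`, the regular expressions, and the order of the steps are all assumptions made for illustration. It runs before any HTML-to-markdown conversion and uses only the standard library.

```python
import re

# Hypothetical helper: the name and the patterns are illustrative assumptions,
# not the real convert_to_markdown() implementation.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_TAG = re.compile(r"<[A-Za-z][^<>]*>")
_HANDLER = re.compile(
    r"""\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)
_BR = re.compile(r"<br\s*/?>", re.IGNORECASE)
_BARE_LT = re.compile(r"<(?![A-Za-z/!?])")
_BLANK_RUN = re.compile(r"\n[ \t]*(?:\n[ \t]*){2,}")


def clean_html(html: str) -> str:
    """Clean raw HTML before it is handed to a markdown converter."""
    # 1. Drop <script> and <style> blocks together with their content.
    html = _SCRIPT_STYLE.sub("", html)
    # 2. Strip inline event handlers (onclick=..., onload=...) inside opening tags only.
    html = _TAG.sub(lambda m: _HANDLER.sub("", m.group(0)), html)
    # 3. Turn <br>, <br/> and <br /> into plain newlines.
    html = _BR.sub("\n", html)
    # 4. Escape a bare "<" that does not start a tag, e.g. the comparison in "a < 5".
    html = _BARE_LT.sub("&lt;", html)
    # 5. Collapse runs of three or more newlines into a single blank line.
    html = _BLANK_RUN.sub("\n\n", html)
    return html.strip()


if __name__ == "__main__":
    sample = '<p onclick="x()">Hi<br/>there</p><script>alert(1)</script>\n\n\n\n<p>if a < 5</p>'
    print(clean_html(sample))
```

Regex-based cleaning like this is only a rough filter. For instance, an attribute value that itself contains `>` would defeat the tag pattern, so a real implementation may prefer an HTML parser for steps 1 and 2.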
7 lines
No EOL
198 B
JSON
{
  "last_update": "2025-08-18T22:15:31.540072",
  "last_item_count": 428,
  "backlog_captured": true,
  "backlog_timestamp": "20250818_221531",
  "last_id": "b6e505a9-6545-c858-e325-e43bbbcf7a45"
}