# HVAC Know It All Content Aggregation - Claude Assistant Notes

## Project Overview

This system aggregates content from 6 sources for the HVAC Know It All brand, converting everything to specification-compliant markdown for consistent formatting and searchability.

## Key Implementation Details

### 1. HTML/XML Cleaning (2025-08-18)

- **Issue**: WordPress content contained HTML tags (e.g. br tags) and JavaScript code in markdown output
- **Solution**: Enhanced `base_scraper.py::convert_to_markdown()` to:
  - Remove script/style blocks before conversion
  - Strip inline JavaScript event handlers
  - Clean up br tags and excessive blank lines
  - Fix malformed comparison operators that look like tags
- **Result**: All markdown now specification-compliant without HTML contamination

### 2. Instagram Rate Limiting (2025-08-18)

- **Issue**: Initial scraping at 100 posts/hour was too slow for 1000+ items
- **Solution**: Optimized `instagram_scraper.py`:
  - Increased rate to 200 posts/hour
  - Reduced delays from 15-30s to 10-20s
  - Extended breaks every 10 requests instead of 5
- **Result**: 100% speed improvement while maintaining stability

### 3. TikTok Caption Enhancement (2025-08-18)

- **Issue**: Profile page scraping missed video captions
- **Solution**: Implemented hybrid approach in `tiktok_scraper_advanced.py`:
  - Fetch video IDs from profile page (fast)
  - Optionally fetch captions from individual video pages
  - Configurable caption fetch limit for performance
- **Result**: Complete content capture with captions for key videos

### 4. NAS Synchronization (2025-08-18)

- **Issue**: Initial implementation synced logs instead of media files
- **Solution**: Updated `orchestrator.py` to sync:
  - `/markdown_current/` and `/markdown_archives/` directories
  - `/media/` directory with all downloaded assets
- **Result**: Proper backup of content and media to network storage

## Production Deployment Status

### Completed Backlogs (as of 2025-08-18 23:15 ADT)

- **WordPress**: 139 posts ✅
- **Podcast**: 428 episodes ✅
- **YouTube**: 200 videos ✅
- **MailChimp**: SSL error (provider issue, not code)
- **Instagram**: 50/1000 posts (in progress, ~200/hr)
- **TikTok**: Queued after Instagram

### System Configuration

- **Environment**: Ubuntu with display support for TikTok
- **Scheduling**: systemd timers at 8AM and 12PM ADT
- **Dependencies**: UV package manager
- **Monitoring**: Custom dashboard and alerts

## Specification Compliance

All content follows this markdown format:

```markdown
# ID: [unique_identifier]

## Title: [content_title]

## Type: [blog_post|podcast|video|post]

## Author: [author_name]

## Publish Date: [ISO_date]

## [Additional metadata fields]

## Description:
[Full content description]

--------------------------------------------------
```

## Testing Commands

```bash
# Quick test all sources
uv run python quick_backlog_test.py

# Test WordPress HTML cleaning
uv run python test_wordpress_clean.py

# Full production backlog capture
uv run python production_backlog_capture.py

# Resume Instagram/TikTok capture
uv run python resume_instagram_capture.py

# Validate production setup
./validate_production.sh
```

## Known Issues

1. **MailChimp SSL Error**: Provider's SSL certificate issue, not fixable in code
2. **Instagram Rate Limits**: Even at 200/hr, 1000 posts takes ~5 hours
3. **TikTok Display Requirement**: Must run with DISPLAY=:0 for headed browser

## Maintenance Notes

- Always check Instagram session validity before large captures
- Monitor rate limit effectiveness in logs
- Verify markdown formatting after WordPress updates
- Test TikTok with display before production runs

## File Structure

```
/home/ben/dev/hvac-kia-content/
├── src/                          # Scraper implementations
├── data_production_backlog/      # Production data
│   ├── markdown_current/         # Latest markdown files
│   ├── markdown_archives/        # Historical versions
│   └── media/                    # Downloaded media files
├── logs_production_backlog/      # Production logs
├── production_backlog_capture.py # Main capture script
├── resume_instagram_capture.py   # Resume interrupted captures
└── validate_production.sh        # Production validation
```

## Contact

For issues or questions about this implementation, refer to the project documentation or the git commit history for detailed change tracking.
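
For reference, the cleaning steps listed under HTML/XML Cleaning could look roughly like the sketch below. This is not the code in `base_scraper.py::convert_to_markdown()`; the function name `clean_html_for_markdown` and the regular expressions are assumptions made for illustration. The comparison-operator fix is left out because the notes do not describe it in enough detail.

```python
import re

# Matches inline event-handler attributes such as onclick="..." (assumed pattern).
_EVENT_ATTR = re.compile(r"""\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.I)


def clean_html_for_markdown(html: str) -> str:
    """Hypothetical pre-cleaning pass run before HTML-to-markdown conversion."""
    # 1. Remove <script> and <style> blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b[^>]*>.*?</\1\s*>", "", html)
    # 2. Strip inline event handlers, but only inside tags.
    #    Limitation: a handler whose value contains ">" is not handled.
    text = re.sub(r"<[^>]+>", lambda m: _EVENT_ATTR.sub("", m.group(0)), text)
    # 3. Convert <br>, <br/> and <br /> into newlines.
    text = re.sub(r"(?i)<br\s*/?>", "\n", text)
    # 4. Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

A regex pass like this is only a pre-filter; a real HTML parser is more robust for nested or malformed markup.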
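
The Instagram pacing described under Instagram Rate Limiting, and the ~5 hour estimate in Known Issues, can be illustrated with a small sketch. The delay range, break interval, and posts-per-hour figure come from the notes above; the extended break length (`EXTENDED_BREAK_S`) and both function names are assumptions, since `instagram_scraper.py` is not shown here.

```python
import random

POSTS_PER_HOUR = 200          # target rate from the notes
DELAY_RANGE_S = (10.0, 20.0)  # per-request delay range from the notes
BREAK_EVERY = 10              # extended break after every 10 requests (from the notes)
EXTENDED_BREAK_S = 30.0       # assumption: the break length is not stated in the notes


def delay_before_request(request_index: int, rng: random.Random) -> float:
    """Seconds to wait before the request with the given 1-based index."""
    delay = rng.uniform(*DELAY_RANGE_S)
    # Once a full batch of BREAK_EVERY requests has completed, add the longer break.
    if request_index > 1 and (request_index - 1) % BREAK_EVERY == 0:
        delay += EXTENDED_BREAK_S
    return delay


def estimated_hours(total_posts: int, posts_per_hour: int = POSTS_PER_HOUR) -> float:
    """Rough capture time, ignoring retries and failures."""
    return total_posts / posts_per_hour
```

With these numbers, `estimated_hours(1000)` gives 5.0, matching the estimate in Known Issues.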
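
As an illustration of the format in Specification Compliance, the following sketch renders one item into that layout. The function name `render_item`, the dictionary keys, and the blank-line spacing between fields are assumptions; the project's actual formatter may differ.

```python
# Separator line copied from the documented format.
SEPARATOR = "--------------------------------------------------"


def render_item(item: dict) -> str:
    """Render one content item (hypothetical dict keys) in the documented layout."""
    parts = [
        f"# ID: {item['id']}",
        f"## Title: {item['title']}",
        f"## Type: {item['type']}",
        f"## Author: {item['author']}",
        f"## Publish Date: {item['publish_date']}",
    ]
    # Optional additional metadata fields, one "## Key: value" heading each.
    parts += [f"## {key}: {value}" for key, value in item.get("extra", {}).items()]
    parts.append(f"## Description:\n{item['description']}")
    parts.append(SEPARATOR)
    return "\n\n".join(parts) + "\n"


# Made-up sample values, for illustration only.
example = render_item({
    "id": "sample_1",
    "title": "Sample Title",
    "type": "blog_post",
    "author": "Sample Author",
    "publish_date": "2025-08-18",
    "description": "Body text",
})
```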