- Update status.md with current production deployment status
- Document completed backlogs (WordPress: 139, Podcast: 428, YouTube: 200)
- Track Instagram progress (50/1000 @ 200/hr) and TikTok queue status
- Create claude.md with implementation notes and key solutions
- Document HTML cleaning fix, rate limit optimization, and NAS sync
- Add testing commands and maintenance notes for future reference
- Include known issues and file structure documentation
4.3 KiB
HVAC Know It All Content Aggregation - Claude Assistant Notes
Project Overview
This system aggregates content from 6 sources for the HVAC Know It All brand, converting everything to specification-compliant markdown for consistent formatting and searchability.
Key Implementation Details
1. HTML/XML Cleaning (2025-08-18)
- Issue: WordPress content contained HTML tags (<br />) and JavaScript code in markdown output
- Solution: Enhanced base_scraper.py::convert_to_markdown() to:
  - Remove script/style blocks before conversion
  - Strip inline JavaScript event handlers
  - Clean up br tags and excessive blank lines
  - Fix malformed comparison operators that look like tags
- Result: All markdown now specification-compliant without HTML contamination
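The first three cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual convert_to_markdown(); the function name clean_html_for_markdown and its regular expressions are assumptions, and the comparison-operator fix is left out because the source does not describe it in detail.

```python
import re


def clean_html_for_markdown(text: str) -> str:
    """Hypothetical pre-cleaning pass run before HTML-to-markdown conversion."""
    # Drop <script> and <style> blocks together with their contents.
    text = re.sub(
        r"<(script|style)\b[^>]*>.*?</\1\s*>",
        "",
        text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    # Strip quoted inline event handlers such as onclick="...".
    # Simplification: this may also match look-alike text outside of tags.
    text = re.sub(
        r"""\s+on\w+\s*=\s*("[^"]*"|'[^']*')""",
        "",
        text,
        flags=re.IGNORECASE,
    )
    # Turn <br>, <br/> and <br /> into plain newlines.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    # Collapse runs of three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Removing script and style blocks first matters: their contents could otherwise survive as plain text once the surrounding tags are converted.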
2. Instagram Rate Limiting (2025-08-18)
- Issue: Initial scraping at 100 posts/hour was too slow for 1000+ items
- Solution: Optimized instagram_scraper.py:
  - Increased rate to 200 posts/hour
  - Reduced delays from 15-30s to 10-20s
  - Extended breaks every 10 requests instead of 5
- Result: 100% speed improvement while maintaining stability
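The pacing described above (a random 10-20s delay between requests, with a longer break every 10 requests) could look roughly like the sketch below. The generator name paced and the break_seconds value are assumptions; the source does not state how long the extended break is or how instagram_scraper.py structures its loop.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(
    items: Iterable[T],
    min_delay: float = 10.0,
    max_delay: float = 20.0,
    break_every: int = 10,
    break_seconds: float = 60.0,  # assumed value, not taken from the source
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[float, float], float] = random.uniform,
) -> Iterator[T]:
    """Yield items one at a time, pausing after each one."""
    for count, item in enumerate(items, start=1):
        yield item
        if count % break_every == 0:
            # Extended break after every `break_every` requests.
            sleep(break_seconds)
        else:
            # Normal randomized delay between requests.
            sleep(rng(min_delay, max_delay))
```

Passing sleep and rng as parameters keeps the pacing logic testable without waiting in real time.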
3. TikTok Caption Enhancement (2025-08-18)
- Issue: Profile page scraping missed video captions
- Solution: Implemented hybrid approach in tiktok_scraper_advanced.py:
  - Fetch video IDs from profile page (fast)
  - Optionally fetch captions from individual video pages
  - Configurable caption fetch limit for performance
- Result: Complete content capture with captions for key videos
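As a rough illustration of the hybrid approach, the sketch below separates the fast ID listing from the slower per-video caption fetch and caps the latter. The names collect_videos, fetch_ids, fetch_caption, and caption_limit are hypothetical stand-ins, not the API of tiktok_scraper_advanced.py.

```python
from typing import Callable, Dict, List, Optional


def collect_videos(
    fetch_ids: Callable[[], List[str]],
    fetch_caption: Callable[[str], Optional[str]],
    caption_limit: int = 0,
) -> List[Dict[str, Optional[str]]]:
    """Fetch all video IDs, then captions for at most `caption_limit` of them."""
    videos: List[Dict[str, Optional[str]]] = []
    for index, video_id in enumerate(fetch_ids()):
        # Only the first `caption_limit` videos trigger the slower page fetch.
        caption = fetch_caption(video_id) if index < caption_limit else None
        videos.append({"id": video_id, "caption": caption})
    return videos
```

With caption_limit=0 the function degrades to the profile-only behavior, which is why the limit works as a performance knob.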
4. NAS Synchronization (2025-08-18)
- Issue: Initial implementation synced logs instead of media files
- Solution: Updated orchestrator.py to sync:
  - /markdown_current/ and /markdown_archives/ directories
  - /media/ directory with all downloaded assets
- Result: Proper backup of content and media to network storage
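A minimal sketch of syncing only the content and media directories, assuming a plain recursive copy is acceptable. The function sync_to_nas and the SYNC_DIRS constant are illustrative; the source does not say which tool orchestrator.py uses, and this version never deletes files on the destination.

```python
import shutil
from pathlib import Path
from typing import List

# Directories to back up; logs are deliberately not listed.
SYNC_DIRS = ("markdown_current", "markdown_archives", "media")


def sync_to_nas(data_root: Path, nas_root: Path) -> List[str]:
    """Copy each existing sync directory to the NAS and return the names copied."""
    synced: List[str] = []
    for name in SYNC_DIRS:
        source = data_root / name
        if not source.is_dir():
            # Skip directories that have not been created yet.
            continue
        # dirs_exist_ok (Python 3.8+) lets repeated runs overwrite in place.
        shutil.copytree(source, nas_root / name, dirs_exist_ok=True)
        synced.append(name)
    return synced
```

Using an explicit allow-list of directories makes it harder to repeat the original mistake of syncing logs instead of media.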
Production Deployment Status
Completed Backlogs (as of 2025-08-18 23:15 ADT)
- WordPress: 139 posts ✅
- Podcast: 428 episodes ✅
- YouTube: 200 videos ✅
- MailChimp: SSL error (provider issue, not code)
- Instagram: 50/1000 posts (in progress, ~200/hr)
- TikTok: Queued after Instagram
System Configuration
- Environment: Ubuntu with display support for TikTok
- Scheduling: systemd timers at 8AM and 12PM ADT
- Dependencies: UV package manager
- Monitoring: Custom dashboard and alerts
Specification Compliance
All content follows this markdown format:
# ID: [unique_identifier]
## Title: [content_title]
## Type: [blog_post|podcast|video|post]
## Author: [author_name]
## Publish Date: [ISO_date]
## [Additional metadata fields]
## Description:
[Full content description]
--------------------------------------------------
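To make the format concrete, the sketch below renders one item into this layout. The function render_item and its parameters are hypothetical, and both the separator length (50 hyphens) and the absence of blank lines between fields are assumptions read off the template rather than confirmed behavior of the scrapers.

```python
from typing import Dict, Optional

SEPARATOR = "-" * 50  # assumed length of the item separator


def render_item(
    item_id: str,
    title: str,
    content_type: str,
    author: str,
    publish_date: str,
    description: str,
    extra: Optional[Dict[str, str]] = None,
) -> str:
    """Render one content item in the specification's markdown layout."""
    lines = [
        f"# ID: {item_id}",
        f"## Title: {title}",
        f"## Type: {content_type}",
        f"## Author: {author}",
        f"## Publish Date: {publish_date}",
    ]
    # Additional metadata fields go between the fixed header and the description.
    for key, value in (extra or {}).items():
        lines.append(f"## {key}: {value}")
    lines += ["## Description:", description, SEPARATOR]
    return "\n".join(lines)
```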
Testing Commands
# Quick test all sources
uv run python quick_backlog_test.py
# Test WordPress HTML cleaning
uv run python test_wordpress_clean.py
# Full production backlog capture
uv run python production_backlog_capture.py
# Resume Instagram/TikTok capture
uv run python resume_instagram_capture.py
# Validate production setup
./validate_production.sh
Known Issues
- MailChimp SSL Error: Provider's SSL certificate issue, not fixable in code
- Instagram Rate Limits: Even at 200/hr, 1000 posts take ~5 hours
- TikTok Display Requirement: Must run with DISPLAY=:0 for headed browser
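The Instagram timing in the list above is simple division, and the same calculation gives the time left for a partially completed capture. The helper name hours_remaining is illustrative only.

```python
def hours_remaining(total: int, done: int, per_hour: float) -> float:
    """Estimate the hours left to capture `total` posts at `per_hour` posts/hour."""
    if per_hour <= 0:
        raise ValueError("per_hour must be positive")
    return max(total - done, 0) / per_hour


# Full backlog at 200 posts/hour: 1000 / 200 = 5.0 hours.
full_run = hours_remaining(1000, 0, 200)
# With 50 posts already captured: 950 / 200 = 4.75 hours.
after_fifty = hours_remaining(1000, 50, 200)
```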
Maintenance Notes
- Always check Instagram session validity before large captures
- Monitor rate limit effectiveness in logs
- Verify markdown formatting after WordPress updates
- Test TikTok with display before production runs
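One way to catch a missing display before a TikTok run is a small pre-flight check on the environment. This is a sketch: has_display is a hypothetical helper, and it only confirms that DISPLAY is set, not that an X server is actually reachable.

```python
import os
from typing import Mapping, Optional


def has_display(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if a non-empty DISPLAY variable is present."""
    env = os.environ if env is None else env
    return bool(env.get("DISPLAY", "").strip())
```

A production script could call has_display() at startup and abort early with a clear message instead of failing partway through a headed browser launch.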
File Structure
/home/ben/dev/hvac-kia-content/
├── src/ # Scraper implementations
├── data_production_backlog/ # Production data
│ ├── markdown_current/ # Latest markdown files
│ ├── markdown_archives/ # Historical versions
│ └── media/ # Downloaded media files
├── logs_production_backlog/ # Production logs
├── production_backlog_capture.py # Main capture script
├── resume_instagram_capture.py # Resume interrupted captures
└── validate_production.sh # Production validation
Contact
For issues or questions about this implementation, refer to the project documentation or the git commit history for detailed change tracking.