hvac-kia-content/docs/claude.md
Ben Reed 8a0b8b4d3f Update documentation with production deployment status
- Update status.md with current production deployment status
- Document completed backlogs (WordPress: 139, Podcast: 428, YouTube: 200)
- Track Instagram progress (50/1000 @ 200/hr) and TikTok queue status
- Create claude.md with implementation notes and key solutions
- Document HTML cleaning fix, rate limit optimization, and NAS sync
- Add testing commands and maintenance notes for future reference
- Include known issues and file structure documentation
2025-08-18 23:14:45 -03:00

HVAC Know It All Content Aggregation - Claude Assistant Notes

Project Overview

This system aggregates content from six sources (WordPress, podcast, YouTube, MailChimp, Instagram, and TikTok) for the HVAC Know It All brand and converts everything to specification-compliant markdown for consistent formatting and searchability.

Key Implementation Details

1. HTML/XML Cleaning (2025-08-18)

  • Issue: Markdown output for WordPress content still contained HTML tags (e.g. <br />) and JavaScript code
  • Solution: Enhanced base_scraper.py::convert_to_markdown() to:
    • Remove script/style blocks before conversion
    • Strip inline JavaScript event handlers
    • Clean up br tags and excessive blank lines
    • Fix malformed comparison operators that look like tags
  • Result: All markdown now specification-compliant without HTML contamination
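
The first three cleaning steps above can be sketched as below. This is a minimal illustration using only the standard `re` module, not the actual code in base_scraper.py; the function name `preclean_html` and the regular expressions are assumptions, and the comparison-operator fix is not covered.

```python
import re

# Hypothetical helper for illustration; the real logic lives in
# base_scraper.py::convert_to_markdown() and may differ.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_TAG = re.compile(r"<[^>]+>")
_EVENT_ATTR = re.compile(
    r"""\s+on\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)
_BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)
_BLANK_RUN = re.compile(r"\n{3,}")


def preclean_html(html: str) -> str:
    """Remove script/style blocks, inline event handlers, and <br> tags."""
    # 1. Drop <script> and <style> blocks together with their contents.
    text = _SCRIPT_STYLE.sub("", html)
    # 2. Strip on* attributes, but only inside tags so prose is untouched.
    text = _TAG.sub(lambda m: _EVENT_ATTR.sub("", m.group(0)), text)
    # 3. Turn <br> variants into newlines, then collapse blank-line runs.
    text = _BR_TAG.sub("\n", text)
    text = _BLANK_RUN.sub("\n\n", text)
    return text.strip()
```

Running the cleanup before markdown conversion keeps script bodies from ever reaching the converter, which is why the script/style removal comes first.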

2. Instagram Rate Limiting (2025-08-18)

  • Issue: Initial scraping at 100 posts/hour was too slow for 1000+ items
  • Solution: Optimized instagram_scraper.py:
    • Increased rate to 200 posts/hour
    • Reduced delays from 15-30s to 10-20s
    • Take an extended break every 10 requests instead of every 5
  • Result: Throughput doubled (100 to 200 posts/hour) while maintaining stability
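
A rough sketch of the delay schedule described above. The 10-20s delay range and the break interval come from these notes; the 30s break length is an assumption (the notes do not state it), as are the function names. With a 15s average delay, 30s per 10 requests works out to about 180s per 10 requests, or roughly 200 requests/hour if each request yields one post and request time is ignored.

```python
import random

MIN_DELAY_S = 10   # from the notes: delays reduced to 10-20s
MAX_DELAY_S = 20
BREAK_EVERY = 10   # from the notes: extended break every 10 requests
BREAK_S = 30       # assumption: break length is not given in the notes


def delay_after(request_number: int, rng: random.Random) -> float:
    """Seconds to wait after the given 1-based request number."""
    delay = rng.uniform(MIN_DELAY_S, MAX_DELAY_S)
    if request_number % BREAK_EVERY == 0:
        delay += BREAK_S  # extended break on every 10th request
    return delay


def planned_delays(total_requests: int, seed: int = 0) -> list[float]:
    """Delay plan for a run; seeded so a plan can be reproduced."""
    rng = random.Random(seed)
    return [delay_after(n, rng) for n in range(1, total_requests + 1)]
```

In a real scraper each value would be passed to `time.sleep()` after the corresponding request; returning the plan as a list just makes the schedule easy to inspect.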

3. TikTok Caption Enhancement (2025-08-18)

  • Issue: Profile page scraping missed video captions
  • Solution: Implemented hybrid approach in tiktok_scraper_advanced.py:
    • Fetch video IDs from profile page (fast)
    • Optionally fetch captions from individual video pages
    • Configurable caption fetch limit for performance
  • Result: Complete content capture with captions for key videos
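
The hybrid approach can be pictured roughly as follows. This is a sketch only: `collect_videos`, `fetch_caption`, and `caption_limit` are hypothetical names, and the real tiktok_scraper_advanced.py drives a browser rather than taking a plain callable.

```python
from typing import Callable, Iterable, Optional


def collect_videos(
    video_ids: Iterable[str],
    fetch_caption: Callable[[str], str],
    caption_limit: Optional[int] = None,
) -> list[dict]:
    """Pair profile-page video IDs with captions, up to caption_limit."""
    results = []
    for index, video_id in enumerate(video_ids):
        # Only the first caption_limit videos pay for a per-video page fetch;
        # None means "fetch captions for every video".
        wants_caption = caption_limit is None or index < caption_limit
        caption = fetch_caption(video_id) if wants_caption else None
        results.append({"id": video_id, "caption": caption})
    return results
```

Keeping the limit as a parameter lets a quick run skip the slow per-video fetches entirely (`caption_limit=0`) while a full backlog capture can fetch them all.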

4. NAS Synchronization (2025-08-18)

  • Issue: Initial implementation synced logs instead of media files
  • Solution: Updated orchestrator.py to sync:
    • /markdown_current/ and /markdown_archives/ directories
    • /media/ directory with all downloaded assets
  • Result: Proper backup of content and media to network storage
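
For illustration, the corrected sync selection might look like the sketch below. The three directory names come from the list above; the function name, the use of `shutil.copytree`, and the path arguments are assumptions, and orchestrator.py may well use a different mechanism (for example rsync).

```python
import shutil
from pathlib import Path

# Directory names are taken from the notes; everything else is illustrative.
SYNC_DIRS = ("markdown_current", "markdown_archives", "media")


def sync_to_nas(data_root: Path, nas_root: Path) -> list[Path]:
    """Copy content and media directories to nas_root, ignoring logs."""
    copied = []
    for name in SYNC_DIRS:
        source = data_root / name
        if not source.is_dir():
            continue  # nothing captured for this directory yet
        target = nas_root / name
        # dirs_exist_ok lets repeated runs update an existing backup.
        shutil.copytree(source, target, dirs_exist_ok=True)
        copied.append(target)
    return copied
```

An explicit allow-list of directories avoids the original mistake: anything not named in `SYNC_DIRS`, including logs, is never copied.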

Production Deployment Status

Backlog Status (as of 2025-08-18 23:15 ADT)

  • WordPress: 139 posts (complete)
  • Podcast: 428 episodes (complete)
  • YouTube: 200 videos (complete)
  • MailChimp: Blocked by SSL error (provider issue, not code)
  • Instagram: 50/1000 posts (in progress, ~200/hr)
  • TikTok: Queued after Instagram

System Configuration

  • Environment: Ubuntu with display support for TikTok
  • Scheduling: systemd timers at 8 AM and 12 PM ADT
  • Dependencies: Managed with the uv package manager
  • Monitoring: Custom dashboard and alerts

Specification Compliance

All content follows this markdown format:

# ID: [unique_identifier]
## Title: [content_title]
## Type: [blog_post|podcast|video|post]
## Author: [author_name]
## Publish Date: [ISO_date]
## [Additional metadata fields]
## Description:
[Full content description]
--------------------------------------------------
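
As a rough sketch of how one item could be rendered into this layout: the dictionary keys, the `extra` field for additional metadata, and the function name are assumptions for illustration, and the separator is assumed to be 50 hyphens.

```python
SEPARATOR = "-" * 50  # assumed length of the separator line


def render_item(item: dict) -> str:
    """Render one content item in the markdown layout shown above."""
    lines = [
        f"# ID: {item['id']}",
        f"## Title: {item['title']}",
        f"## Type: {item['type']}",
        f"## Author: {item['author']}",
        f"## Publish Date: {item['publish_date']}",
    ]
    # Additional metadata fields go between the fixed header and description.
    for key, value in item.get("extra", {}).items():
        lines.append(f"## {key}: {value}")
    lines.append("## Description:")
    lines.append(item["description"])
    lines.append(SEPARATOR)
    return "\n".join(lines)
```

Routing every scraper's output through one renderer like this is what keeps the fields in the same order across sources.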

Testing Commands

# Quick test all sources
uv run python quick_backlog_test.py

# Test WordPress HTML cleaning
uv run python test_wordpress_clean.py

# Full production backlog capture
uv run python production_backlog_capture.py

# Resume Instagram/TikTok capture
uv run python resume_instagram_capture.py

# Validate production setup
./validate_production.sh

Known Issues

  1. MailChimp SSL Error: Provider's SSL certificate issue, not fixable in code
  2. Instagram Rate Limits: Even at 200/hr, 1000 posts take ~5 hours
  3. TikTok Display Requirement: Must run with DISPLAY=:0 for headed browser

Maintenance Notes

  • Always check Instagram session validity before large captures
  • Monitor rate limit effectiveness in logs
  • Verify markdown formatting after WordPress updates
  • Test TikTok with display before production runs

File Structure

/home/ben/dev/hvac-kia-content/
├── src/                           # Scraper implementations
├── data_production_backlog/       # Production data
│   ├── markdown_current/          # Latest markdown files
│   ├── markdown_archives/         # Historical versions
│   └── media/                     # Downloaded media files
├── logs_production_backlog/       # Production logs
├── production_backlog_capture.py  # Main capture script
├── resume_instagram_capture.py    # Resume interrupted captures
└── validate_production.sh         # Production validation

Contact

For issues or questions about this implementation, refer to the project documentation or the git commit history for detailed change tracking.