hvac-kia-content/docs/claude.md
Ben Reed 8a0b8b4d3f Update documentation with production deployment status
- Update status.md with current production deployment status
- Document completed backlogs (WordPress: 139, Podcast: 428, YouTube: 200)
- Track Instagram progress (50/1000 @ 200/hr) and TikTok queue status
- Create claude.md with implementation notes and key solutions
- Document HTML cleaning fix, rate limit optimization, and NAS sync
- Add testing commands and maintenance notes for future reference
- Include known issues and file structure documentation
2025-08-18 23:14:45 -03:00


# HVAC Know It All Content Aggregation - Claude Assistant Notes
## Project Overview
This system aggregates content from six sources (WordPress, podcast, YouTube, MailChimp, Instagram, and TikTok) for the HVAC Know It All brand and converts everything to specification-compliant markdown for consistent formatting and searchability.
## Key Implementation Details
### 1. HTML/XML Cleaning (2025-08-18)
- **Issue**: Markdown generated from WordPress content still contained HTML tags (such as `<br />`) and JavaScript code
- **Solution**: Enhanced `base_scraper.py::convert_to_markdown()` to:
  - Remove script/style blocks before conversion
  - Strip inline JavaScript event handlers
  - Clean up `<br>` tags and excessive blank lines
  - Fix malformed comparison operators that look like tags
- **Result**: All markdown is now specification-compliant, with no HTML contamination
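
The first three cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library; `clean_html_for_markdown` is a hypothetical helper, not the actual body of `convert_to_markdown()`, and the comparison-operator fix is left out because these notes do not describe its rules.

```python
import re

# Matches inline event handlers such as onclick="..." (assumed quoting styles).
_HANDLER = re.compile(r"""\s+on\w+\s*=\s*("[^"]*"|'[^']*')""", re.IGNORECASE)


def clean_html_for_markdown(text: str) -> str:
    """Hypothetical pre-conversion cleanup; not the project's real code."""
    # 1. Drop <script> and <style> blocks together with their contents.
    text = re.sub(
        r"<(script|style)\b[^>]*>.*?</\1\s*>",
        "",
        text,
        flags=re.IGNORECASE | re.DOTALL,
    )
    # 2. Strip event handlers, but only inside tags so prose is untouched.
    text = re.sub(r"<[^<>]+>", lambda m: _HANDLER.sub("", m.group(0)), text)
    # 3. Turn <br>, <br/> and <br /> into newlines.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    # 4. Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Regular expressions are adequate for this kind of targeted cleanup, but they are not a general HTML parser; heavily malformed markup may need a real parser before conversion.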
### 2. Instagram Rate Limiting (2025-08-18)
- **Issue**: Initial scraping at 100 posts/hour was too slow for 1000+ items
- **Solution**: Optimized `instagram_scraper.py`:
  - Increased rate to 200 posts/hour
  - Reduced per-request delays from 15-30s to 10-20s
  - Take an extended break every 10 requests instead of every 5
- **Result**: Throughput doubled (100 to 200 posts/hour) while maintaining stability
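
The delay schedule described above can be sketched as a pure function that only plans the waits. The function name and the 30-second break length are assumptions: the notes do not state how long the extended break is. A 30-second break is merely consistent with the target, since ten delays averaging 15s plus one 30s break is 180s per 10 posts, or 200 posts/hour if request time is ignored.

```python
import random
from typing import List, Tuple


def plan_delays(
    num_requests: int,
    rng: random.Random,
    delay_range: Tuple[float, float] = (10.0, 20.0),  # from the notes: 10-20s
    break_every: int = 10,                            # from the notes
    break_seconds: float = 30.0,                      # ASSUMPTION, not in the notes
) -> List[float]:
    """Return the wait (in seconds) to apply after each request."""
    delays = []
    for i in range(1, num_requests + 1):
        delay = rng.uniform(*delay_range)
        # Every `break_every` requests, add the extended break on top.
        if i % break_every == 0:
            delay += break_seconds
        delays.append(delay)
    return delays
```

A scraper loop would call `time.sleep(delay)` after each request; keeping the planning separate from the sleeping makes the schedule easy to inspect and test.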
### 3. TikTok Caption Enhancement (2025-08-18)
- **Issue**: Profile page scraping missed video captions
- **Solution**: Implemented hybrid approach in `tiktok_scraper_advanced.py`:
  - Fetch video IDs from profile page (fast)
  - Optionally fetch captions from individual video pages
  - Configurable caption fetch limit for performance
- **Result**: Complete content capture with captions for key videos
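
The hybrid approach can be illustrated with a small sketch in which the slow per-video fetch is passed in as a callable. `collect_videos`, `fetch_caption`, and `caption_limit` are hypothetical names chosen for illustration; the real interface of `tiktok_scraper_advanced.py` may differ.

```python
from typing import Callable, Dict, Iterable, List, Optional


def collect_videos(
    video_ids: Iterable[str],
    fetch_caption: Callable[[str], str],
    caption_limit: int = 0,
) -> List[Dict[str, Optional[str]]]:
    """Pair each video ID with a caption, fetching at most `caption_limit`."""
    results: List[Dict[str, Optional[str]]] = []
    for index, video_id in enumerate(video_ids):
        # Only the first `caption_limit` videos pay for the slow page visit.
        caption = fetch_caption(video_id) if index < caption_limit else None
        results.append({"id": video_id, "caption": caption})
    return results
```

With `caption_limit=0` this degrades to the fast profile-only pass, so the trade-off between completeness and run time stays a single setting.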
### 4. NAS Synchronization (2025-08-18)
- **Issue**: Initial implementation synced logs instead of media files
- **Solution**: Updated `orchestrator.py` to sync:
  - `/markdown_current/` and `/markdown_archives/` directories
  - `/media/` directory with all downloaded assets
- **Result**: Proper backup of content and media to network storage
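
As a rough sketch of the directory selection above, the following builds one copy command per synced directory without running anything. It assumes `rsync` as the transfer tool and uses a hypothetical NAS mount point; the notes do not say which tool or destination `orchestrator.py` actually uses.

```python
from pathlib import PurePosixPath
from typing import List

# Directories named in the notes; logs are deliberately excluded.
SYNC_DIRS = ("markdown_current", "markdown_archives", "media")


def build_rsync_commands(
    data_root: PurePosixPath, nas_root: PurePosixPath
) -> List[List[str]]:
    """Return argv lists for syncing each content directory (rsync assumed)."""
    commands = []
    for name in SYNC_DIRS:
        # Trailing slashes make rsync copy directory contents, not the dir itself.
        src = f"{data_root / name}/"
        dest = f"{nas_root / name}/"
        commands.append(["rsync", "-a", src, dest])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; returning the commands instead of executing them keeps the selection logic testable without network storage.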
## Production Deployment Status
### Backlog Status (as of 2025-08-18 23:15 ADT)
- **WordPress**: 139 posts ✅
- **Podcast**: 428 episodes ✅
- **YouTube**: 200 videos ✅
- **MailChimp**: SSL error (provider issue, not code)
- **Instagram**: 50/1000 posts (in progress, ~200/hr)
- **TikTok**: Queued after Instagram
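
For the in-progress Instagram capture, the remaining time follows directly from the figures above. A minimal sketch, with `hours_remaining` as a hypothetical helper name and the rate assumed to hold steady:

```python
def hours_remaining(done: int, total: int, rate_per_hour: float) -> float:
    """Estimate hours left at a constant capture rate."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    # Never report negative time if the target has been passed.
    return max(total - done, 0) / rate_per_hour


# 950 posts left at 200 posts/hour
print(hours_remaining(50, 1000, 200))  # 4.75
```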
### System Configuration
- **Environment**: Ubuntu with display support for TikTok
- **Scheduling**: systemd timers at 8AM and 12PM ADT
- **Dependencies**: Managed with the `uv` package manager
- **Monitoring**: Custom dashboard and alerts
## Specification Compliance
All content follows this markdown format:
```markdown
# ID: [unique_identifier]
## Title: [content_title]
## Type: [blog_post|podcast|video|post]
## Author: [author_name]
## Publish Date: [ISO_date]
## [Additional metadata fields]
## Description:
[Full content description]
--------------------------------------------------
```
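
A record in this format could be rendered along the following lines. This is an illustrative sketch only: `render_item` and its parameters are hypothetical, and the project's scrapers may assemble the fields differently.

```python
from typing import Dict, Optional

# Separator line copied from the format above.
SEPARATOR = "--------------------------------------------------"


def render_item(
    item_id: str,
    title: str,
    content_type: str,
    author: str,
    publish_date: str,
    description: str,
    extra: Optional[Dict[str, str]] = None,
) -> str:
    """Render one content item in the specification's markdown layout."""
    lines = [
        f"# ID: {item_id}",
        f"## Title: {title}",
        f"## Type: {content_type}",
        f"## Author: {author}",
        f"## Publish Date: {publish_date}",
    ]
    # Additional metadata fields go between the fixed header and the body.
    for key, value in (extra or {}).items():
        lines.append(f"## {key}: {value}")
    lines += ["## Description:", description, SEPARATOR]
    return "\n".join(lines)
```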
## Testing Commands
```bash
# Quick test all sources
uv run python quick_backlog_test.py
# Test WordPress HTML cleaning
uv run python test_wordpress_clean.py
# Full production backlog capture
uv run python production_backlog_capture.py
# Resume Instagram/TikTok capture
uv run python resume_instagram_capture.py
# Validate production setup
./validate_production.sh
```
## Known Issues
1. **MailChimp SSL Error**: Caused by the provider's SSL certificate; not fixable in code
2. **Instagram Rate Limits**: Even at 200/hr, 1000 posts take ~5 hours
3. **TikTok Display Requirement**: Must run with `DISPLAY=:0` so the headed browser can start
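
A pre-flight check for the display requirement might look like the sketch below. `display_is_set` is a hypothetical helper; it only confirms that the variable is non-empty, not that an X server is actually reachable.

```python
import os
from typing import Mapping, Optional


def display_is_set(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when DISPLAY is present and non-empty."""
    # Accept an explicit mapping so the check can be tested without
    # touching the real environment.
    env = os.environ if env is None else env
    return bool(env.get("DISPLAY", "").strip())
```

Calling this before launching the TikTok scraper allows a clear early failure instead of a browser start-up error partway through a run.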
## Maintenance Notes
- Always check Instagram session validity before large captures
- Monitor rate limit effectiveness in logs
- Verify markdown formatting after WordPress updates
- Test TikTok with display before production runs
## File Structure
```
/home/ben/dev/hvac-kia-content/
├── src/ # Scraper implementations
├── data_production_backlog/ # Production data
│ ├── markdown_current/ # Latest markdown files
│ ├── markdown_archives/ # Historical versions
│ └── media/ # Downloaded media files
├── logs_production_backlog/ # Production logs
├── production_backlog_capture.py # Main capture script
├── resume_instagram_capture.py # Resume interrupted captures
└── validate_production.sh # Production validation
```
## Contact
For issues or questions about this implementation, refer to the project documentation or the git commit history for detailed change tracking.