# HVAC Know It All Content Aggregation - Claude Assistant Notes

## Project Overview
This system aggregates content from 6 sources for the HVAC Know It All brand, converting everything to specification-compliant markdown for consistent formatting and searchability.

## Key Implementation Details

### 1. HTML/XML Cleaning (2025-08-18)
- **Issue**: WordPress content contained HTML tags (`<br />`) and JavaScript code in markdown output
- **Solution**: Enhanced `base_scraper.py::convert_to_markdown()` to:
  - Remove script/style blocks before conversion
  - Strip inline JavaScript event handlers
  - Clean up br tags and excessive blank lines
  - Fix malformed comparison operators that look like tags
- **Result**: All markdown now specification-compliant without HTML contamination
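
The first three cleanup steps above can be sketched with the standard `re` module. This is a minimal illustration, not the actual body of `convert_to_markdown()`: the function name `clean_html_fragment` and the regular expressions are assumptions, and the comparison-operator fix is left out because these notes do not describe how it works.

```python
import re

# Matches an inline event handler attribute such as onclick="..." inside a tag.
_HANDLER = re.compile(r"""\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.I)


def clean_html_fragment(html: str) -> str:
    """Hypothetical pre-conversion cleanup (illustrative only)."""
    # 1. Drop <script> and <style> blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b[^>]*>.*?</\1\s*>", "", html)
    # 2. Strip event handlers, looking only inside opening tags.
    text = re.sub(r"<[a-zA-Z][^<>]*>", lambda m: _HANDLER.sub("", m.group(0)), text)
    # 3. Turn <br>, <br/> and <br /> into newlines.
    text = re.sub(r"(?i)<br\s*/?>", "\n", text)
    # 4. Collapse runs of three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Regex-based cleanup is fragile on unusual markup (for example, attribute values containing `>`), so a real implementation may prefer an HTML parser for steps 1 and 2.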

### 2. Instagram Rate Limiting (2025-08-18)
- **Issue**: Initial scraping at 100 posts/hour was too slow for 1000+ items
- **Solution**: Optimized `instagram_scraper.py`:
  - Increased rate to 200 posts/hour
  - Reduced delays from 15-30s to 10-20s
  - Take an extended break every 10 requests instead of every 5
- **Result**: Throughput doubled (a 100% speed improvement) while maintaining stability
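
The pacing described above can be sketched as a delay plan. The 10-20s delay range and the break every 10 requests come from these notes; the 30-second break length, the function names, and the parameters are assumptions chosen for illustration, and the time spent on the requests themselves is ignored. With a mean delay of 15s plus a 30s break spread over 10 requests, the average spacing is 18s per post, which works out to 200 posts/hour.

```python
import random


def plan_delays(n_requests, min_delay=10.0, max_delay=20.0,
                break_every=10, break_seconds=30.0, rng=None):
    """Return the pause in seconds to take after each request.

    break_seconds is an assumed value; the notes do not state it.
    """
    rng = rng or random.Random()
    delays = []
    for i in range(1, n_requests + 1):
        pause = rng.uniform(min_delay, max_delay)  # randomized per-request delay
        if i % break_every == 0:
            pause += break_seconds  # extended break after every Nth request
        delays.append(pause)
    return delays


def posts_per_hour(delays):
    """Average throughput implied by a delay plan, ignoring request time."""
    return 3600.0 * len(delays) / sum(delays)
```

In a scraper loop, each planned value would be passed to `time.sleep()` after the corresponding request.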

### 3. TikTok Caption Enhancement (2025-08-18)
- **Issue**: Profile page scraping missed video captions
- **Solution**: Implemented hybrid approach in `tiktok_scraper_advanced.py`:
  - Fetch video IDs from profile page (fast)
  - Optionally fetch captions from individual video pages
  - Configurable caption fetch limit for performance
- **Result**: Complete content capture with captions for key videos
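
The hybrid approach above can be sketched as follows. This is not the code in `tiktok_scraper_advanced.py`: `collect_videos`, `list_video_ids`, `fetch_caption`, and `caption_limit` are hypothetical names, and the two fetchers are passed in as callables so the sketch makes no claims about how the pages are actually scraped.

```python
from typing import Callable, Dict, Iterable, List, Optional


def collect_videos(
    list_video_ids: Callable[[], Iterable[str]],
    fetch_caption: Callable[[str], Optional[str]],
    caption_limit: int = 10,
) -> List[Dict[str, Optional[str]]]:
    """Combine a fast ID listing with a capped number of slow caption fetches."""
    videos = []
    for index, video_id in enumerate(list_video_ids()):
        # Only the first `caption_limit` videos pay for a per-video page load.
        caption = fetch_caption(video_id) if index < caption_limit else None
        videos.append({"id": video_id, "caption": caption})
    return videos
```

Capping the slow per-video fetch keeps run time bounded while still capturing captions for the videos listed first on the profile page.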

### 4. NAS Synchronization (2025-08-18)
- **Issue**: Initial implementation synced logs instead of media files
- **Solution**: Updated `orchestrator.py` to sync:
  - `/markdown_current/` and `/markdown_archives/` directories
  - `/media/` directory with all downloaded assets
- **Result**: Proper backup of content and media to network storage
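
The directory selection above can be sketched as a list of sync commands. These notes do not say which tool `orchestrator.py` uses, so `rsync -a` is an assumption here, as are the function name `build_rsync_commands` and both root paths in the usage note.

```python
from pathlib import PurePosixPath
from typing import List

# Directories named in the notes; logs are deliberately not included.
SYNC_DIRS = ("markdown_current", "markdown_archives", "media")


def build_rsync_commands(data_root: str, nas_root: str) -> List[List[str]]:
    """Build one rsync command per synced directory (tool choice is assumed)."""
    commands = []
    for name in SYNC_DIRS:
        src = PurePosixPath(data_root) / name
        dest = PurePosixPath(nas_root) / name
        # A trailing slash on the source copies the directory's contents
        # into the destination rather than nesting the directory inside it.
        commands.append(["rsync", "-a", f"{src}/", f"{dest}/"])
    return commands
```

Each command could then be run with `subprocess.run(cmd, check=True)`, for example over `build_rsync_commands("data_production_backlog", "/mnt/nas/hvac")`, where the NAS mount point is a made-up path.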

## Production Deployment Status

### Completed Backlogs (as of 2025-08-18 23:15 ADT)
- **WordPress**: 139 posts ✅
- **Podcast**: 428 episodes ✅
- **YouTube**: 200 videos ✅
- **MailChimp**: SSL error (provider issue, not code)
- **Instagram**: 50/1000 posts (in progress, ~200/hr)
- **TikTok**: Queued after Instagram

### System Configuration
- **Environment**: Ubuntu with display support for TikTok
- **Scheduling**: systemd timers at 8AM and 12PM ADT
- **Dependencies**: UV package manager
- **Monitoring**: Custom dashboard and alerts

## Specification Compliance

All content follows this markdown format:
```markdown
# ID: [unique_identifier]
## Title: [content_title]
## Type: [blog_post|podcast|video|post]
## Author: [author_name]
## Publish Date: [ISO_date]
## [Additional metadata fields]
## Description:
[Full content description]
--------------------------------------------------
```
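
As an illustration of the template above, the following sketch renders one item into that layout. The function name `render_item`, its parameters, and the `extra` mapping for additional metadata are hypothetical; the separator is assumed to be 50 dashes, and the allowed types are taken from the `## Type:` line of the template.

```python
from typing import Mapping, Optional

ALLOWED_TYPES = {"blog_post", "podcast", "video", "post"}
SEPARATOR = "-" * 50  # assumed length of the template's separator line


def render_item(item_id: str, title: str, content_type: str, author: str,
                publish_date: str, description: str,
                extra: Optional[Mapping[str, object]] = None) -> str:
    """Render one content item in the template's field order."""
    if content_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown content type: {content_type!r}")
    lines = [
        f"# ID: {item_id}",
        f"## Title: {title}",
        f"## Type: {content_type}",
        f"## Author: {author}",
        f"## Publish Date: {publish_date}",
    ]
    # Additional metadata fields go between the fixed header and the description.
    for key, value in (extra or {}).items():
        lines.append(f"## {key}: {value}")
    lines += ["## Description:", description, SEPARATOR]
    return "\n".join(lines) + "\n"
```

Rejecting unknown types early keeps malformed items out of the markdown files instead of surfacing later during a search or audit.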

## Testing Commands

```bash
# Quick test all sources
uv run python quick_backlog_test.py

# Test WordPress HTML cleaning
uv run python test_wordpress_clean.py

# Full production backlog capture
uv run python production_backlog_capture.py

# Resume Instagram/TikTok capture
uv run python resume_instagram_capture.py

# Validate production setup
./validate_production.sh
```

## Known Issues

1. **MailChimp SSL Error**: Provider's SSL certificate issue, not fixable in code
2. **Instagram Rate Limits**: Even at 200/hr, 1000 posts take ~5 hours
3. **TikTok Display Requirement**: Must run with `DISPLAY=:0` for headed browser
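
Items 2 and 3 lend themselves to a small pre-flight check before a long capture. The sketch below is illustrative only; `hours_remaining` and `display_is_set` are hypothetical helpers, not functions from this project.

```python
import os
from typing import Mapping, Optional


def hours_remaining(total: int, done: int, per_hour: float) -> float:
    """Estimate capture time left at a steady rate."""
    return max(total - done, 0) / per_hour


def display_is_set(env: Optional[Mapping[str, str]] = None,
                   expected: str = ":0") -> bool:
    """Check that DISPLAY has the value the headed browser needs."""
    env = os.environ if env is None else env
    return env.get("DISPLAY") == expected
```

For example, `hours_remaining(1000, 0, 200)` gives the 5 hours quoted above, and `hours_remaining(1000, 50, 200)` gives 4.75 hours for a capture that has already stored 50 posts.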

## Maintenance Notes

- Always check Instagram session validity before large captures
- Monitor rate limit effectiveness in logs
- Verify markdown formatting after WordPress updates
- Test TikTok with display before production runs

## File Structure

```
/home/ben/dev/hvac-kia-content/
├── src/                           # Scraper implementations
├── data_production_backlog/       # Production data
│   ├── markdown_current/          # Latest markdown files
│   ├── markdown_archives/         # Historical versions
│   └── media/                     # Downloaded media files
├── logs_production_backlog/       # Production logs
├── production_backlog_capture.py  # Main capture script
├── resume_instagram_capture.py    # Resume interrupted captures
└── validate_production.sh         # Production validation
```

## Contact
For issues or questions about this implementation, refer to the project documentation or the git commit history for detailed change tracking.