# Production Deployment Guide

## Overview

This guide covers the production deployment of the HVAC Know It All Content Aggregator system.

## System Architecture

### Components

1. **Core Scrapers** (6 sources)
   - YouTube: Video metadata and descriptions
   - WordPress: Blog posts with full content
   - Instagram: Posts with rate limiting protection
   - TikTok: Videos with optional caption fetching
   - MailChimp RSS: Newsletter updates (limited to 10 items)
   - Podcast RSS: Episode information with audio links

2. **Orchestrator**
   - Manages parallel execution (except TikTok/Instagram)
   - Handles incremental updates
   - Combines output from all sources

3. **Systemd Services**
   - Main aggregator (runs twice daily)
   - Optional TikTok caption fetcher (overnight job)

## Production Recommendations

### 1. Scheduling Strategy

**Regular Scraping (6 AM & 6 PM)**
- All sources except Instagram
- Fast execution (~2-3 minutes total)
- Incremental updates only
- Parallel processing for RSS/WordPress/YouTube

**Instagram (Once Daily at 7 AM)**
- Separate schedule due to aggressive rate limiting
- Maximum 10 posts to avoid detection
- Sequential processing with delays

**TikTok Captions (Optional, 2 AM)**
- Only if captions are critical
- Runs during low-traffic hours
- Fetches captions for top 20 videos
- Takes 30-60 minutes

### 2. Performance Optimization

**Parallel Processing**

```python
PARALLEL_PROCESSING = {
    "enabled": True,
    "max_workers": 3,
    "exclude": ["tiktok", "instagram"]  # Require sequential
}
```

**Rate Limiting**
- Instagram: 20 requests/hour (very conservative)
- TikTok: 100 requests/hour
- Others: 100-500 requests/hour

### 3. Error Handling

**Retry Strategy**
- 3 attempts with exponential backoff
- Initial delay: 5 seconds
- Max delay: 60 seconds

**Failure Isolation**
- Each source fails independently
- Partial results are still saved
- Failed sources logged for manual review

### 4. Resource Management

**Disk Space**
- Archive after 30 days
- Compress old files
- Typical usage: ~100MB/month

**Memory**
- Peak usage: ~500MB during TikTok browser automation
- Average: ~200MB for regular scraping

**CPU**
- Minimal usage except during browser automation
- TikTok/Instagram may spike to 50% for short periods

### 5. Security Considerations

**API Keys**
- Store in `.env` file (never commit)
- Restrict file permissions: `chmod 600 .env`
- Rotate keys quarterly

**Service Isolation**
- Run as non-root user
- Separate log directories
- No network exposure (local only)

### 6. Monitoring

**Health Checks**

```bash
# Check timer status
systemctl list-timers | grep hvac

# View recent runs
journalctl -u hvac-content-aggregator -n 50

# Check for errors
grep ERROR /var/log/hvac-content/aggregator.log
```

**Metrics to Monitor**
- Items fetched per source
- Execution time
- Error rate
- Disk usage

### 7. Backup Strategy

**What to Backup**
- `/opt/hvac-kia-content/state/` (incremental state)
- `.env` file (encrypted)
- `/opt/hvac-kia-content/data/` (optional, can regenerate)

**Backup Schedule**
- State files: Daily
- Environment: On change
- Data: Weekly (optional)

## Installation

### Prerequisites

```bash
# System requirements
# - Ubuntu 20.04+ or similar
# - Python 3.9+
# - 2GB RAM minimum
# - 10GB disk space
# - Display server (for TikTok)

# Required packages
sudo apt update
sudo apt install python3-pip python3-venv git chromium-browser
```

### Quick Start

```bash
# Clone repository
git clone https://git.tealmaker.com/ben/hvac-kia-content.git
cd hvac-kia-content

# Create and configure .env
cp .env.example .env
# Edit .env with your API keys

# Run installation
chmod +x install_production.sh
./install_production.sh

# Start services
sudo systemctl start hvac-content-aggregator.timer

# Verify
systemctl status hvac-content-aggregator.timer
```

## Troubleshooting

### Common Issues

**1. TikTok Browser Timeout**
- Symptom: TikTok scraper times out
- Solution: Check the `DISPLAY` variable; manual CAPTCHA solving may be needed
- Alternative: Disable caption fetching, use IDs only

**2. Instagram Rate Limiting**
- Symptom: 429 errors or account restrictions
- Solution: Reduce `max_posts`, increase delays
- Prevention: Never exceed 10 posts per run

**3. RSS Feed Empty**
- Symptom: MailChimp returns 0 items
- Solution: Verify RSS URL is correct
- Note: Feed limited to 10 items by provider

**4. Memory Issues**
- Symptom: OOM kills during TikTok scraping
- Solution: Reduce `max_posts` or disable browser features
- Prevention: Monitor memory usage, add swap if needed

### Debug Mode

```bash
# Dry run of the regular job
uv run python run_production.py --job regular --dry-run

# Run with debug logging
PYTHONPATH=. python -m src.orchestrator --debug

# Test individual scraper
python test_real_data.py --source youtube --items 3
```

## Maintenance

### Weekly Tasks
- Review error logs
- Check disk usage
- Verify all sources are updating

### Monthly Tasks
- Archive old data
- Review performance metrics
- Update dependencies (test first!)
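
The monthly "Archive old data" task, together with the 30-day retention and compression noted under Resource Management, could be scripted along the following lines. This is a minimal sketch rather than the project's own tooling: the function name `archive_old_files`, the use of file modification time as the age criterion, and the choice of gzip are all assumptions made for illustration.

```python
import gzip
import shutil
import time
from pathlib import Path

ARCHIVE_AFTER_DAYS = 30  # retention period stated in this guide


def archive_old_files(data_dir, archive_dir, max_age_days=ARCHIVE_AFTER_DAYS, now=None):
    """Gzip files older than max_age_days into archive_dir and remove the originals.

    Hypothetical helper: age is judged by modification time (an assumption).
    Returns the list of archive paths that were written.
    """
    data_dir, archive_dir = Path(data_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    cutoff = (time.time() if now is None else now) - max_age_days * 86400

    archived = []
    for path in sorted(data_dir.iterdir()):
        # Skip subdirectories and anything modified on or after the cutoff.
        if not path.is_file() or path.stat().st_mtime >= cutoff:
            continue
        target = archive_dir / (path.name + ".gz")
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # remove the original only after the compressed copy exists
        archived.append(target)
    return archived
```

A call such as `archive_old_files("/opt/hvac-kia-content/data", "/opt/hvac-kia-content/archive")` would cover the data directory named in the backup section; the `archive` destination is a hypothetical path, not one the guide defines.
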
### Quarterly Tasks
- Rotate API keys
- Review rate limits
- Full backup verification

## Performance Benchmarks

| Source | Items | Time | Memory |
|--------|-------|------|--------|
| YouTube | 20 | 15s | 50MB |
| WordPress | 20 | 10s | 30MB |
| Instagram | 10 | 120s | 100MB |
| TikTok (no captions) | 35 | 30s | 400MB |
| TikTok (with captions) | 10 | 300s | 500MB |
| MailChimp RSS | 10 | 2s | 20MB |
| Podcast RSS | 10 | 3s | 25MB |

**Total (typical run, excluding Instagram and TikTok captions)**: 95 items in ~3 minutes

## Cost Analysis

### Resource Costs
- VPS: ~$20/month (2GB RAM, 50GB disk)
- Bandwidth: Minimal (~1GB/month)
- Total: ~$20/month

### Time Savings
- Manual collection: ~2 hours/day
- Automated: ~5 minutes/day
- Savings: ~60 hours/month

## Support

### Logs Location
- Main: `/var/log/hvac-content/aggregator.log`
- Errors: `/var/log/hvac-content/aggregator-error.log`
- TikTok: `/var/log/hvac-content/tiktok-captions.log`
- Application: `/opt/hvac-kia-content/logs/`

### Contact
- Forgejo Issues: https://git.tealmaker.com/ben/hvac-kia-content/issues
- Email: [your-email]

## Version History
- v1.0.0 - Initial production release
- v1.1.0 - Added TikTok caption fetching
- v1.2.0 - Instagram rate limiting improvements
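
For reference, the retry strategy listed under Error Handling (3 attempts, exponential backoff, 5-second initial delay, 60-second cap) might look roughly like the sketch below. The guide does not state the backoff multiplier, so doubling is an assumption here, and `backoff_delays` and `run_with_retry` are hypothetical names rather than functions from the project.

```python
import time

MAX_ATTEMPTS = 3      # from the guide
INITIAL_DELAY = 5.0   # seconds, from the guide
MAX_DELAY = 60.0      # seconds, from the guide


def backoff_delays(attempts=MAX_ATTEMPTS, initial=INITIAL_DELAY, maximum=MAX_DELAY):
    """Waits between attempts: doubling from `initial` (assumed factor), capped at `maximum`."""
    return [min(initial * 2 ** i, maximum) for i in range(attempts - 1)]


def run_with_retry(fetch, attempts=MAX_ATTEMPTS, sleep=time.sleep):
    """Call fetch() up to `attempts` times, sleeping between failed attempts."""
    delays = backoff_delays(attempts)
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller log the failed source
            sleep(delays[attempt])
```

Wrapping each source's fetch call separately in `run_with_retry` keeps failures isolated, so one source exhausting its attempts does not stop the others. The `sleep` parameter exists only so the waits can be replaced in tests.
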