Production Deployment Guide
Overview
This guide covers the production deployment of the HVAC Know It All Content Aggregator system.
System Architecture
Components
- Core Scrapers (6 sources)
  - YouTube: Video metadata and descriptions
  - WordPress: Blog posts with full content
  - Instagram: Posts with rate limiting protection
  - TikTok: Videos with optional caption fetching
  - MailChimp RSS: Newsletter updates (limited to 10 items)
  - Podcast RSS: Episode information with audio links
- Orchestrator
  - Manages parallel execution (except TikTok/Instagram)
  - Handles incremental updates
  - Combines output from all sources
- Systemd Services
  - Main aggregator (runs twice daily)
  - Optional TikTok caption fetcher (overnight job)
Production Recommendations
1. Scheduling Strategy
Regular Scraping (6 AM & 6 PM)
- All sources except Instagram
- Fast execution (~2-3 minutes total)
- Incremental updates only
- Parallel processing for RSS/WordPress/YouTube
Instagram (Once Daily at 7 AM)
- Separate schedule due to aggressive rate limiting
- Maximum 10 posts to avoid detection
- Sequential processing with delays
TikTok Captions (Optional, 2 AM)
- Only if captions are critical
- Runs during low-traffic hours
- Fetches captions for top 20 videos
- Takes 30-60 minutes
2. Performance Optimization
Parallel Processing
PARALLEL_PROCESSING = {
    "enabled": True,
    "max_workers": 3,
    "exclude": ["tiktok", "instagram"],  # Require sequential execution
}
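One way the orchestrator could honor this configuration is sketched below. It is an illustration under stated assumptions, not the project's actual orchestrator: `run_sources` and the `scrapers` mapping (source name to a zero-argument callable) are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor

PARALLEL_PROCESSING = {
    "enabled": True,
    "max_workers": 3,
    "exclude": ["tiktok", "instagram"],  # Require sequential execution
}


def run_sources(scrapers, config=PARALLEL_PROCESSING):
    """Run scraper callables, parallelising all but the excluded sources.

    `scrapers` is a hypothetical mapping of source name -> zero-argument
    callable that returns the items fetched for that source.
    """
    excluded = set(config["exclude"])
    parallel = [n for n in scrapers if config["enabled"] and n not in excluded]
    sequential = [n for n in scrapers if n not in parallel]

    results = {}
    if parallel:
        # Non-excluded sources share a small thread pool.
        with ThreadPoolExecutor(max_workers=config["max_workers"]) as pool:
            futures = {name: pool.submit(scrapers[name]) for name in parallel}
            for name, future in futures.items():
                results[name] = future.result()
    # Excluded sources (or everything, if parallelism is disabled) run one by one.
    for name in sequential:
        results[name] = scrapers[name]()
    return results
```

With this shape, setting `"enabled": False` falls back to fully sequential execution without any other code change.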
Rate Limiting
- Instagram: 20 requests/hour (very conservative)
- TikTok: 100 requests/hour
- Others: 100-500 requests/hour
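The hourly budgets above translate into a minimum gap between requests: 20 requests/hour is one request every 180 seconds, and 100 requests/hour is one every 36 seconds. The sketch below shows one possible way to enforce that gap; the `RateLimiter` class is hypothetical, and the injectable `clock` and `sleep` arguments are assumptions added only to make it easy to test.

```python
import time

# Budgets taken from the list above (requests per hour).
REQUESTS_PER_HOUR = {"instagram": 20, "tiktok": 100}


class RateLimiter:
    """Space requests evenly so a source never exceeds its hourly budget."""

    def __init__(self, requests_per_hour, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / requests_per_hour
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the most recent permitted request

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

A scraper would call `limiter.wait()` immediately before each outgoing request, for example with `limiter = RateLimiter(REQUESTS_PER_HOUR["instagram"])`.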
3. Error Handling
Retry Strategy
- 3 attempts with exponential backoff
- Initial delay: 5 seconds
- Max delay: 60 seconds
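A minimal sketch of this retry policy follows. The guide gives the attempt count, initial delay, and cap, but not the growth factor, so doubling the delay after each failure is an assumption here; `backoff_delays` and `with_retries` are hypothetical helper names.

```python
import time


def backoff_delays(attempts=3, initial=5.0, maximum=60.0):
    """Delays (in seconds) to wait after each failed attempt except the last.

    Assumes the delay doubles each time and is capped at `maximum`.
    """
    return [min(initial * 2 ** i, maximum) for i in range(attempts - 1)]


def with_retries(func, attempts=3, initial=5.0, maximum=60.0, sleep=time.sleep):
    """Call `func` until it succeeds or `attempts` is exhausted."""
    delays = backoff_delays(attempts, initial, maximum)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

With the defaults, a source that keeps failing waits 5 seconds, then 10 seconds, and the third failure is raised to the caller. The 60-second cap only takes effect if the attempt count is raised.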
Failure Isolation
- Each source fails independently
- Partial results are still saved
- Failed sources logged for manual review
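Failure isolation of this kind can be as simple as wrapping each source in its own `try`/`except`. The sketch below is illustrative only: `run_isolated`, the `scrapers` mapping, and the logger name are hypothetical, not names taken from the project.

```python
import logging

logger = logging.getLogger("hvac.orchestrator")  # hypothetical logger name


def run_isolated(scrapers):
    """Run every source; one failure never prevents the others from saving.

    `scrapers` is a hypothetical mapping of source name -> zero-argument
    callable. Returns (results, failed) so partial results can still be written.
    """
    results, failed = {}, []
    for name, scrape in scrapers.items():
        try:
            results[name] = scrape()
        except Exception:
            # Log the traceback for manual review and move on.
            logger.exception("Source %s failed", name)
            failed.append(name)
    return results, failed
```

The caller can then save whatever is in `results` and report `failed` separately.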
4. Resource Management
Disk Space
- Archive after 30 days
- Compress old files
- Typical usage: ~100MB/month
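The archive-and-compress policy could be implemented along these lines. This is a sketch under assumptions: the guide does not say how archiving is done, so gzip compression, the flat directory layout, and the `archive_old_files` helper are all illustrative choices.

```python
import gzip
import shutil
import time
from pathlib import Path


def archive_old_files(data_dir, archive_dir, max_age_days=30, now=None):
    """Gzip files older than `max_age_days` into `archive_dir`, then delete them.

    Only regular files directly inside `data_dir` are considered.
    Returns the list of archive paths that were created.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    archived = []
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file() or path.stat().st_mtime >= cutoff:
            continue  # skip directories and files that are still recent
        target = archive_dir / (path.name + ".gz")
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()
        archived.append(target)
    return archived
```

A monthly maintenance job could call it with the data directory and a sibling archive directory of your choosing.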
Memory
- Peak usage: ~500MB during TikTok browser automation
- Average: ~200MB for regular scraping
CPU
- Minimal usage except during browser automation
- TikTok/Instagram may spike to 50% for short periods
5. Security Considerations
API Keys
- Store in .env file (never commit)
- Restrict file permissions: chmod 600 .env
- Rotate keys quarterly
Service Isolation
- Run as non-root user
- Separate log directories
- No network exposure (local only)
6. Monitoring
Health Checks
# Check timer status
systemctl list-timers | grep hvac
# View recent runs
journalctl -u hvac-content-aggregator -n 50
# Check for errors
grep ERROR /var/log/hvac-content/aggregator.log
Metrics to Monitor
- Items fetched per source
- Execution time
- Error rate
- Disk usage
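These four metrics can be gathered into one record per run. The sketch below assumes the run produces a mapping of per-source results and a list of failed source names; `run_metrics` and its field names are hypothetical, and the error rate is taken to mean the fraction of sources that failed.

```python
import shutil


def run_metrics(results, failed, duration_seconds, data_dir="."):
    """Summarise one aggregator run as a plain dictionary.

    `results` maps source name -> list of fetched items; `failed` lists the
    names of sources that raised. Both shapes are assumptions for illustration.
    """
    total_sources = len(results) + len(failed)
    usage = shutil.disk_usage(data_dir)
    return {
        "items_per_source": {name: len(items) for name, items in results.items()},
        "execution_seconds": round(duration_seconds, 1),
        "error_rate": len(failed) / total_sources if total_sources else 0.0,
        "disk_used_percent": round(100 * usage.used / usage.total, 1),
    }
```

Writing this dictionary to the log as one line per run makes trends easy to spot during the weekly review.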
7. Backup Strategy
What to Backup
- /opt/hvac-kia-content/state/ (incremental state)
- .env file (encrypted)
- /opt/hvac-kia-content/data/ (optional, can regenerate)
Backup Schedule
- State files: Daily
- Environment: On change
- Data: Weekly (optional)
Installation
Prerequisites
# System requirements
#   - Ubuntu 20.04+ or similar
#   - Python 3.9+
#   - 2GB RAM minimum
#   - 10GB disk space
#   - Display server (for TikTok)
# Required packages
sudo apt update
sudo apt install python3-pip python3-venv git chromium-browser
Quick Start
# Clone repository
git clone https://git.tealmaker.com/ben/hvac-kia-content.git
cd hvac-kia-content
# Create and configure .env
cp .env.example .env
# Edit .env with your API keys
# Run installation
chmod +x install_production.sh
./install_production.sh
# Start services
sudo systemctl start hvac-content-aggregator.timer
# Verify
systemctl status hvac-content-aggregator.timer
Troubleshooting
Common Issues
1. TikTok Browser Timeout
- Symptom: TikTok scraper times out
- Solution: Check that the DISPLAY variable is set; manual CAPTCHA solving may be needed
- Alternative: Disable caption fetching, use IDs only
2. Instagram Rate Limiting
- Symptom: 429 errors or account restrictions
- Solution: Reduce max_posts, increase delays
- Prevention: Never exceed 10 posts per run
3. RSS Feed Empty
- Symptom: MailChimp returns 0 items
- Solution: Verify RSS URL is correct
- Note: Feed limited to 10 items by provider
4. Memory Issues
- Symptom: OOM kills during TikTok scraping
- Solution: Reduce max_posts or disable browser features
- Prevention: Monitor memory usage, add swap if needed
Debug Mode
# Dry-run the regular scraping job
uv run python run_production.py --job regular --dry-run
# Run with debug logging
PYTHONPATH=. python -m src.orchestrator --debug
# Test individual scraper
python test_real_data.py --source youtube --items 3
Maintenance
Weekly Tasks
- Review error logs
- Check disk usage
- Verify all sources are updating
Monthly Tasks
- Archive old data
- Review performance metrics
- Update dependencies (test first!)
Quarterly Tasks
- Rotate API keys
- Review rate limits
- Full backup verification
Performance Benchmarks
| Source | Items | Time | Memory |
|---|---|---|---|
| YouTube | 20 | 15s | 50MB |
| WordPress | 20 | 10s | 30MB |
| Instagram | 10 | 120s | 100MB |
| TikTok (no captions) | 35 | 30s | 400MB |
| TikTok (with captions) | 10 | 300s | 500MB |
| MailChimp RSS | 10 | 2s | 20MB |
| Podcast RSS | 10 | 3s | 25MB |
Total (typical run, excluding Instagram, TikTok without captions): 95 items in ~3 minutes
Cost Analysis
Resource Costs
- VPS: ~$20/month (2GB RAM, 50GB disk)
- Bandwidth: Minimal (~1GB/month)
- Total: ~$20/month
Time Savings
- Manual collection: ~2 hours/day
- Automated: ~5 minutes/day
- Savings: ~60 hours/month
Support
Logs Location
- Main: /var/log/hvac-content/aggregator.log
- Errors: /var/log/hvac-content/aggregator-error.log
- TikTok: /var/log/hvac-content/tiktok-captions.log
- Application: /opt/hvac-kia-content/logs/
Contact
- Forgejo Issues: https://git.tealmaker.com/ben/hvac-kia-content/issues
- Email: [your-email]
Version History
- v1.0.0 - Initial production release
- v1.1.0 - Added TikTok caption fetching
- v1.2.0 - Instagram rate limiting improvements