# Production Deployment Guide

## Overview

This guide covers the production deployment of the HVAC Know It All Content Aggregator system.

## System Architecture

### Components

1. **Core Scrapers** (6 sources)
   - YouTube: Video metadata and descriptions
   - WordPress: Blog posts with full content
   - Instagram: Posts with rate limiting protection
   - TikTok: Videos with optional caption fetching
   - MailChimp RSS: Newsletter updates (limited to 10 items)
   - Podcast RSS: Episode information with audio links

2. **Orchestrator**
   - Manages parallel execution (except TikTok/Instagram)
   - Handles incremental updates
   - Combines output from all sources

3. **Systemd Services**
   - Main aggregator (runs twice daily)
   - Optional TikTok caption fetcher (overnight job)

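
The incremental updates handled by the orchestrator can be pictured with a minimal sketch like the one below. It assumes each item carries an `id` field and that seen IDs are kept in one JSON file per source; the function name `filter_new_items` and the file layout are hypothetical and are not taken from the project code.

```python
import json
from pathlib import Path


def filter_new_items(items, state_file):
    """Return items whose IDs are not yet in state_file, then record them.

    Hypothetical helper: assumes every item is a dict with an "id" key.
    """
    state_file = Path(state_file)
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    new_items = [item for item in items if item["id"] not in seen]
    seen.update(item["id"] for item in new_items)
    # Persist the updated ID set so the next run skips these items.
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(sorted(seen)))
    return new_items
```

Calling the function twice with overlapping items returns only the unseen ones on the second call, which is why the state files are worth backing up.
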
## Production Recommendations

### 1. Scheduling Strategy

**Regular Scraping (6 AM & 6 PM)**
- All sources except Instagram
- Fast execution (~2-3 minutes total)
- Incremental updates only
- Parallel processing for RSS/WordPress/YouTube

**Instagram (Once Daily at 7 AM)**
- Separate schedule due to aggressive rate limiting
- Maximum 10 posts to avoid detection
- Sequential processing with delays

**TikTok Captions (Optional, 2 AM)**
- Only if captions are critical
- Runs during low-traffic hours
- Fetches captions for top 20 videos
- Takes 30-60 minutes

### 2. Performance Optimization

**Parallel Processing**
```python
PARALLEL_PROCESSING = {
    "enabled": True,
    "max_workers": 3,
    "exclude": ["tiktok", "instagram"]  # Require sequential
}
```

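
As an illustration of how a configuration like this might be consumed, the sketch below runs non-excluded sources in a thread pool and the excluded ones one after another. The `run_sources` function and the idea of passing scrapers as zero-argument callables are assumptions for the example, not the orchestrator's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor

PARALLEL_PROCESSING = {
    "enabled": True,
    "max_workers": 3,
    "exclude": ["tiktok", "instagram"],  # Require sequential
}


def run_sources(scrapers, config=PARALLEL_PROCESSING):
    """Run scrapers (name -> zero-argument callable); return name -> result.

    Hypothetical helper: sources listed in config["exclude"] run
    sequentially, the rest share a thread pool.
    """
    excluded = set(config["exclude"])
    parallel = {
        name: fn for name, fn in scrapers.items()
        if config["enabled"] and name not in excluded
    }
    sequential = {name: fn for name, fn in scrapers.items() if name not in parallel}

    results = {}
    if parallel:
        with ThreadPoolExecutor(max_workers=config["max_workers"]) as pool:
            futures = {name: pool.submit(fn) for name, fn in parallel.items()}
            for name, future in futures.items():
                results[name] = future.result()
    # Browser-driven sources run one at a time after the pool finishes.
    for name, fn in sequential.items():
        results[name] = fn()
    return results
```

Threads suit this workload because the scrapers spend most of their time waiting on network I/O.
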
**Rate Limiting**
- Instagram: 20 requests/hour (very conservative)
- TikTok: 100 requests/hour
- Others: 100-500 requests/hour

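
To make the limits above concrete: 20 requests/hour means at most one request every 180 seconds, and 100 requests/hour means one every 36 seconds. The sketch below enforces such a spacing. The `RateLimiter` class is a hypothetical illustration that spaces requests evenly; the project may implement its limits differently.

```python
import time


def min_interval_seconds(requests_per_hour):
    """Minimum spacing between requests for a given hourly budget."""
    return 3600 / requests_per_hour


class RateLimiter:
    """Hypothetical limiter that spaces calls evenly across the hour."""

    def __init__(self, requests_per_hour, clock=time.monotonic, sleep=time.sleep):
        self.interval = min_interval_seconds(requests_per_hour)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record it."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would call `wait()` immediately before each outgoing request, for example with `RateLimiter(20)` for Instagram.
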
### 3. Error Handling

**Retry Strategy**
- 3 attempts with exponential backoff
- Initial delay: 5 seconds
- Max delay: 60 seconds

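
The retry parameters above can be sketched as follows. The doubling factor is an assumption (the guide only says "exponential"), and `with_retries` is a hypothetical helper, not the project's function.

```python
import time


def backoff_delays(attempts=3, initial=5.0, maximum=60.0, factor=2.0):
    """Delays to sleep after each failed attempt except the last.

    factor=2.0 is an assumed multiplier; the cap is the documented max delay.
    """
    return [min(initial * factor ** i, maximum) for i in range(attempts - 1)]


def with_retries(fn, attempts=3, initial=5.0, maximum=60.0, sleep=time.sleep):
    """Call fn until it succeeds or the attempts are exhausted."""
    delays = backoff_delays(attempts, initial, maximum)
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

With the documented defaults the waits are 5 and then 10 seconds; the 60 second cap only matters if the attempt count is raised to six or more.
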
**Failure Isolation**
- Each source fails independently
- Partial results are still saved
- Failed sources logged for manual review

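
One way to picture this isolation is a loop that catches errors per source, keeps whatever succeeded, and logs the rest. The function `run_isolated` and the logger name are hypothetical and serve only as a sketch.

```python
import logging

logger = logging.getLogger("hvac-content")  # logger name is an assumption


def run_isolated(scrapers):
    """Run each scraper (name -> zero-argument callable) independently.

    Returns (results, failed): results holds output from sources that
    succeeded, failed lists the names that raised.
    """
    results, failed = {}, []
    for name, fn in scrapers.items():
        try:
            results[name] = fn()
        except Exception:
            # Record the traceback for manual review and keep going.
            logger.exception("Source %s failed", name)
            failed.append(name)
    return results, failed
```

Because the exception is caught inside the loop, one failing source cannot prevent the results of the others from being saved.
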
### 4. Resource Management

**Disk Space**
- Archive after 30 days
- Compress old files
- Typical usage: ~100MB/month

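
The archive-and-compress policy could be automated along these lines. This is a sketch under assumptions: files are selected by modification time, compressed with gzip, and moved to a separate directory. The function name and directory arguments are hypothetical.

```python
import gzip
import shutil
import time
from pathlib import Path


def archive_old_files(data_dir, archive_dir, max_age_days=30, now=None):
    """Gzip files older than max_age_days into archive_dir; remove originals."""
    data_dir, archive_dir = Path(data_dir), Path(archive_dir)
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = []
    for path in sorted(data_dir.iterdir()):
        # Only plain files older than the cutoff are archived.
        if path.is_file() and path.stat().st_mtime < cutoff:
            target = archive_dir / (path.name + ".gz")
            with path.open("rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            path.unlink()
            archived.append(target)
    return archived
```

The function only looks at the top level of `data_dir`; nested directories would need a recursive walk.
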
**Memory**
- Peak usage: ~500MB during TikTok browser automation
- Average: ~200MB for regular scraping

**CPU**
- Minimal usage except during browser automation
- TikTok/Instagram may spike to 50% for short periods

### 5. Security Considerations

**API Keys**
- Store in `.env` file (never commit)
- Restrict file permissions: `chmod 600 .env`
- Rotate keys quarterly

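
A startup check can confirm that the `.env` permissions are as strict as `chmod 600` leaves them. The helper below is a hypothetical sketch for POSIX systems; it treats the file as private when no group or other permission bits are set.

```python
import os
import stat


def env_file_is_private(path):
    """True if the file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & 0o077) == 0
```

A service could refuse to start, or log a warning, when `env_file_is_private(".env")` returns `False`.
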
**Service Isolation**
- Run as non-root user
- Separate log directories
- No network exposure (local only)

### 6. Monitoring

**Health Checks**
```bash
# Check timer status
systemctl list-timers | grep hvac

# View recent runs
journalctl -u hvac-content-aggregator -n 50

# Check for errors
grep ERROR /var/log/hvac-content/aggregator.log
```

**Metrics to Monitor**
- Items fetched per source
- Execution time
- Error rate
- Disk usage

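
As a rough complement to the `grep ERROR` check, the sketch below counts log lines containing `ERROR` and reports their share of all non-empty lines. The log format is not specified in this guide, so matching on the substring `ERROR` is an assumption, and this share is only a proxy for a per-run error rate.

```python
def summarize_log(lines):
    """Count non-empty lines and those containing 'ERROR' (assumed marker)."""
    total = 0
    errors = 0
    for line in lines:
        if not line.strip():
            continue
        total += 1
        if "ERROR" in line:
            errors += 1
    rate = errors / total if total else 0.0
    return {"lines": total, "errors": errors, "error_rate": rate}
```

It can be fed an open file handle, for example `summarize_log(open("/var/log/hvac-content/aggregator.log"))`.
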
### 7. Backup Strategy

**What to Back Up**
- `/opt/hvac-kia-content/state/` (incremental state)
- `.env` file (encrypted)
- `/opt/hvac-kia-content/data/` (optional, can regenerate)

**Backup Schedule**
- State files: Daily
- Environment: On change
- Data: Weekly (optional)

## Installation

### Prerequisites

System requirements:

- Ubuntu 20.04+ or similar
- Python 3.9+
- 2GB RAM minimum
- 10GB disk space
- Display server (for TikTok)

```bash
# Required packages
sudo apt update
sudo apt install python3-pip python3-venv git chromium-browser
```

### Quick Start
```bash
# Clone repository
git clone https://git.tealmaker.com/ben/hvac-kia-content.git
cd hvac-kia-content

# Create and configure .env
cp .env.example .env
# Edit .env with your API keys

# Run installation
chmod +x install_production.sh
./install_production.sh

# Start services
sudo systemctl start hvac-content-aggregator.timer

# Verify
systemctl status hvac-content-aggregator.timer
```

## Troubleshooting

### Common Issues

**1. TikTok Browser Timeout**
- Symptom: TikTok scraper times out
- Solution: Check the `DISPLAY` variable; a CAPTCHA may need to be solved manually
- Alternative: Disable caption fetching, use IDs only

**2. Instagram Rate Limiting**
- Symptom: 429 errors or account restrictions
- Solution: Reduce `max_posts`, increase delays
- Prevention: Never exceed 10 posts per run

**3. RSS Feed Empty**
- Symptom: MailChimp returns 0 items
- Solution: Verify RSS URL is correct
- Note: Feed limited to 10 items by provider

**4. Memory Issues**
- Symptom: OOM kills during TikTok scraping
- Solution: Reduce `max_posts` or disable browser features
- Prevention: Monitor memory usage, add swap if needed

### Debug Mode

```bash
# Dry-run the regular job
uv run python run_production.py --job regular --dry-run

# Run with debug logging
PYTHONPATH=. python -m src.orchestrator --debug

# Test individual scraper
python test_real_data.py --source youtube --items 3
```

## Maintenance

### Weekly Tasks
- Review error logs
- Check disk usage
- Verify all sources are updating

### Monthly Tasks
- Archive old data
- Review performance metrics
- Update dependencies (test first!)

### Quarterly Tasks
- Rotate API keys
- Review rate limits
- Full backup verification

## Performance Benchmarks

| Source | Items | Time | Memory |
|--------|-------|------|--------|
| YouTube | 20 | 15s | 50MB |
| WordPress | 20 | 10s | 30MB |
| Instagram | 10 | 120s | 100MB |
| TikTok (no captions) | 35 | 30s | 400MB |
| TikTok (with captions) | 10 | 300s | 500MB |
| MailChimp RSS | 10 | 2s | 20MB |
| Podcast RSS | 10 | 3s | 25MB |

**Total (typical run: all sources except Instagram, TikTok without captions)**: 95 items in ~3 minutes

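
The 95-item total can be checked against the table. The short calculation below assumes a typical run is the regular job, that is, every source except Instagram, with TikTok captions disabled. The per-source times sum to only 60 seconds, so the ~3 minute figure presumably includes overhead that the table does not break out.

```python
# (items, seconds) per source, copied from the benchmark table above.
BENCHMARKS = {
    "YouTube": (20, 15),
    "WordPress": (20, 10),
    "Instagram": (10, 120),
    "TikTok (no captions)": (35, 30),
    "TikTok (with captions)": (10, 300),
    "MailChimp RSS": (10, 2),
    "Podcast RSS": (10, 3),
}

# Assumed composition of a typical (regular) run.
REGULAR_RUN = ["YouTube", "WordPress", "TikTok (no captions)", "MailChimp RSS", "Podcast RSS"]

total_items = sum(BENCHMARKS[name][0] for name in REGULAR_RUN)
total_seconds = sum(BENCHMARKS[name][1] for name in REGULAR_RUN)

print(total_items)    # 95
print(total_seconds)  # 60
```
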
## Cost Analysis

### Resource Costs
- VPS: ~$20/month (2GB RAM, 50GB disk)
- Bandwidth: Minimal (~1GB/month)
- Total: ~$20/month

### Time Savings
- Manual collection: ~2 hours/day
- Automated: ~5 minutes/day
- Savings: ~60 hours/month

## Support

### Logs Location
- Main: `/var/log/hvac-content/aggregator.log`
- Errors: `/var/log/hvac-content/aggregator-error.log`
- TikTok: `/var/log/hvac-content/tiktok-captions.log`
- Application: `/opt/hvac-kia-content/logs/`

### Contact
- Forgejo Issues: https://git.tealmaker.com/ben/hvac-kia-content/issues
- Email: [your-email]

## Version History
- v1.0.0 - Initial production release
- v1.1.0 - Added TikTok caption fetching
- v1.2.0 - Instagram rate limiting improvements