hvac-kia-content/docs/ROLLBACK_PROCEDURES.md
Ben Reed a80af693ba Add comprehensive production documentation and testing
Documentation Added:
- ARCHITECTURE_DECISIONS.md: Explains why systemd over k8s (TikTok display requirements)
- DEPLOYMENT_CHECKLIST.md: Step-by-step deployment procedures
- ROLLBACK_PROCEDURES.md: Emergency rollback and recovery procedures
- test_production_deployment.py: Automated deployment verification script

Key Documentation Highlights:
- Detailed explanation of containerization limitations with browser automation
- Complete deployment checklist with pre/post verification steps
- Rollback scenarios with recovery time objectives
- Emergency contact templates and backup procedures
- Automated test script for production readiness

17 of 25 tasks completed (68% done)
Remaining work focuses on spec compliance and testing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 20:20:52 -03:00

341 lines
No EOL
8.1 KiB
Markdown

# Rollback Procedures
## Overview
This document provides step-by-step procedures for rolling back the HVAC Know It All Content Aggregator in case of deployment issues or system failures.
## Risk Assessment
### Severity Levels
- **CRITICAL**: System completely non-functional, no data collection
- **HIGH**: Major features broken, partial data loss
- **MEDIUM**: Some scrapers failing, degraded performance
- **LOW**: Minor issues, cosmetic problems
## Pre-Rollback Checklist
### Before Rolling Back
1. **Document the Issue**
- [ ] Screenshot error messages
- [ ] Save relevant log files
- [ ] Note exact time of failure
- [ ] Record affected components
2. **Attempt Quick Fixes**
- [ ] Check environment variables
- [ ] Verify network connectivity
- [ ] Restart failed service
- [ ] Check disk space
3. **Backup Current State**
```bash
# Backup current state before rollback
sudo tar -czf /backup/emergency-$(date +%Y%m%d-%H%M%S).tar.gz \
/opt/hvac-kia-content/state/ \
/opt/hvac-kia-content/data/ \
/var/log/hvac-content/
```
## Rollback Scenarios
### Scenario 1: Service Won't Start
**Symptoms:** Systemd service fails to start after deployment
**Quick Fix:**
```bash
# Check service status
systemctl status hvac-content-aggregator.service
# Check journal for errors
journalctl -u hvac-content-aggregator.service -n 100
# Validate environment
cd /opt/hvac-kia-content
python3 -c "from run_production import validate_environment; validate_environment()"
```
**Rollback Steps:**
1. Stop the timer:
```bash
sudo systemctl stop hvac-content-aggregator.timer
```
2. Revert to previous version:
```bash
cd /opt/hvac-kia-content
git fetch --tags
git checkout v1.0.0 # Previous stable version
```
3. Reinstall dependencies:
```bash
pip install -r requirements.txt
```
4. Restart service:
```bash
sudo systemctl daemon-reload
sudo systemctl start hvac-content-aggregator.timer
```
### Scenario 2: Data Corruption
**Symptoms:** Malformed output, duplicate entries, missing data
**Quick Fix:**
```bash
# Check state files
ls -la /opt/hvac-kia-content/state/
# Validate JSON state files
python3 -c "import json; json.load(open('/opt/hvac-kia-content/state/youtube_state.json'))"
```
**Rollback Steps:**
1. Stop all services:
```bash
sudo systemctl stop hvac-content-aggregator.timer
sudo systemctl stop hvac-tiktok-captions.timer
```
2. Restore state from backup:
```bash
# Find latest backup
ls -lt /backup/hvac-state-*.tar.gz | head -1
# Restore state files
cd /
sudo tar -xzf /backup/hvac-state-20241217.tar.gz
```
3. Clear corrupted output:
```bash
# Move corrupted files to quarantine
mkdir -p /opt/hvac-kia-content/quarantine
mv /opt/hvac-kia-content/data/*_corrupted.md /opt/hvac-kia-content/quarantine/
```
4. Restart services:
```bash
sudo systemctl start hvac-content-aggregator.timer
```
### Scenario 3: Performance Degradation
**Symptoms:** Slow execution, timeouts, high CPU/memory usage
**Quick Fix:**
```bash
# Check resource usage
top -p $(pgrep -f run_production.py)
# Check disk space
df -h /opt/hvac-kia-content
# Clear old logs
find /var/log/hvac-content -name "*.log" -mtime +7 -delete
```
**Rollback Steps:**
1. Reduce scraper limits temporarily:
```bash
# Edit production config
nano /opt/hvac-kia-content/config/production.py
# Reduce max_posts, max_videos, etc.
```
2. Disable problematic scrapers:
```python
# In config/production.py
SCRAPERS_CONFIG = {
"instagram": {
"enabled": False, # Temporarily disable
...
}
}
```
3. Restart with reduced load:
```bash
sudo systemctl restart hvac-content-aggregator.service
```
### Scenario 4: Complete System Failure
**Symptoms:** Nothing works, multiple component failures
**Full System Rollback:**
1. **Stop Everything:**
```bash
# Stop all timers and services
sudo systemctl stop hvac-content-aggregator.timer
sudo systemctl stop hvac-tiktok-captions.timer
sudo systemctl disable hvac-content-aggregator.timer
sudo systemctl disable hvac-tiktok-captions.timer
```
2. **Backup Current State:**
```bash
# Full backup before rollback
sudo tar -czf /backup/full-backup-$(date +%Y%m%d-%H%M%S).tar.gz \
/opt/hvac-kia-content/ \
/etc/systemd/system/hvac-*.{service,timer} \
/var/log/hvac-content/
```
3. **Clean Installation:**
```bash
# Remove current installation
sudo rm -rf /opt/hvac-kia-content
sudo rm -f /etc/systemd/system/hvac-*
# Clone stable version
cd /opt
sudo git clone https://github.com/yourusername/hvac-kia-content.git
cd hvac-kia-content
sudo git checkout v1.0.0 # Last known stable
# Restore configuration
sudo cp /backup/.env /opt/hvac-kia-content/
# Set permissions
sudo chown -R $USER:$USER /opt/hvac-kia-content
```
4. **Reinstall Services:**
```bash
cd /opt/hvac-kia-content
./install_production.sh
```
5. **Restore State (Optional):**
```bash
# Only if state is not corrupted
sudo tar -xzf /backup/hvac-state-latest.tar.gz -C /
```
6. **Verify and Start:**
```bash
# Test first
python3 run_production.py --dry-run
# If successful, enable services
sudo systemctl enable hvac-content-aggregator.timer
sudo systemctl start hvac-content-aggregator.timer
```
## Post-Rollback Verification
### Immediate Checks
- [ ] Services are running:
```bash
systemctl status hvac-content-aggregator.timer
```
- [ ] No errors in logs:
```bash
tail -n 100 /var/log/hvac-content/aggregator.log | grep ERROR
```
- [ ] Test run successful:
```bash
cd /opt/hvac-kia-content
python3 test_real_data.py --source youtube --items 1
```
### 1-Hour Verification
- [ ] Timer fired as scheduled
- [ ] All scrapers executed
- [ ] Output files generated
- [ ] NAS sync completed
- [ ] No memory leaks
- [ ] CPU usage normal
### 24-Hour Verification
- [ ] System stable
- [ ] No missed schedules
- [ ] Data quality good
- [ ] No duplicate entries
- [ ] Incremental updates working
## Emergency Contacts
### Technical Support
- **Primary Contact:** [Name] - [Phone] - [Email]
- **Secondary Contact:** [Name] - [Phone] - [Email]
- **Escalation:** [Manager Name] - [Phone] - [Email]
### System Access
- **Server:** production-scraper.example.com
- **SSH:** `ssh user@production-scraper.example.com`
- **Logs:** `/var/log/hvac-content/`
- **Config:** `/opt/hvac-kia-content/.env`
## Recovery Time Objectives
| Scenario | Target Recovery Time | Maximum Data Loss |
|----------|---------------------|-------------------|
| Service Restart | 5 minutes | None |
| Version Rollback | 15 minutes | Since last backup |
| State Restoration | 30 minutes | 24 hours |
| Complete Rebuild | 1 hour | 48 hours |
## Lessons Learned Log
### Previous Incidents
Document any rollbacks performed and lessons learned:
| Date | Issue | Resolution | Prevention |
|------|-------|------------|------------|
| | | | |
## Backup Schedule
### Automated Backups
```bash
# Add to crontab
0 2 * * * /opt/hvac-kia-content/scripts/backup.sh
```
### Backup Script
```bash
#!/bin/bash
# /opt/hvac-kia-content/scripts/backup.sh
BACKUP_DIR="/backup/hvac-content"
DATE=$(date +%Y%m%d)
RETENTION_DAYS=30
# Create backup
tar -czf "$BACKUP_DIR/state-$DATE.tar.gz" /opt/hvac-kia-content/state/
tar -czf "$BACKUP_DIR/config-$DATE.tar.gz" /opt/hvac-kia-content/.env
# Clean old backups
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +$RETENTION_DAYS -delete
# Verify backup
tar -tzf "$BACKUP_DIR/state-$DATE.tar.gz" > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "Backup successful: $DATE"
else
echo "Backup failed: $DATE" | mail -s "HVAC Backup Failed" alerts@example.com
fi
```
## Testing Rollback Procedures
### Monthly Drill
1. Schedule maintenance window
2. Perform controlled rollback
3. Verify recovery procedures
4. Document any issues
5. Update procedures as needed
### Test Checklist
- [ ] Backup procedures work
- [ ] Rollback completes in target time
- [ ] Data integrity maintained
- [ ] Services restart properly
- [ ] Monitoring alerts fire
- [ ] Documentation is current
---
*Last Updated: 2024-12-18*
*Version: 1.0*
*Next Review: 2025-01-18*