# Rollback Procedures

## Overview

This document provides step-by-step procedures for rolling back the HVAC Know It All Content Aggregator in case of deployment issues or system failures.

## Risk Assessment

### Severity Levels

- **CRITICAL**: System completely non-functional, no data collection
- **HIGH**: Major features broken, partial data loss
- **MEDIUM**: Some scrapers failing, degraded performance
- **LOW**: Minor issues, cosmetic problems

## Pre-Rollback Checklist

### Before Rolling Back

1. **Document the Issue**
   - [ ] Screenshot error messages
   - [ ] Save relevant log files
   - [ ] Note exact time of failure
   - [ ] Record affected components

2. **Attempt Quick Fixes**
   - [ ] Check environment variables
   - [ ] Verify network connectivity
   - [ ] Restart failed service
   - [ ] Check disk space

3. **Backup Current State**
   ```bash
   # Backup current state before rollback
   sudo tar -czf /backup/emergency-$(date +%Y%m%d-%H%M%S).tar.gz \
       /opt/hvac-kia-content/state/ \
       /opt/hvac-kia-content/data/ \
       /var/log/hvac-content/
   ```

## Rollback Scenarios

### Scenario 1: Service Won't Start

**Symptoms:** Systemd service fails to start after deployment

**Quick Fix:**
```bash
# Check service status
systemctl status hvac-content-aggregator.service

# Check journal for errors
journalctl -u hvac-content-aggregator.service -n 100

# Validate environment
cd /opt/hvac-kia-content
python3 -c "from run_production import validate_environment; validate_environment()"
```

**Rollback Steps:**

1. Stop the timer:
   ```bash
   sudo systemctl stop hvac-content-aggregator.timer
   ```

2. Revert to previous version:
   ```bash
   cd /opt/hvac-kia-content
   git fetch --tags
   git checkout v1.0.0  # Previous stable version
   ```

3. Reinstall dependencies:
   ```bash
   pip install -r requirements.txt
   ```

4. Restart service:
   ```bash
   sudo systemctl daemon-reload
   sudo systemctl start hvac-content-aggregator.timer
   ```

### Scenario 2: Data Corruption

**Symptoms:** Malformed output, duplicate entries, missing data

**Quick Fix:**
```bash
# Check state files
ls -la /opt/hvac-kia-content/state/

# Validate JSON state files
python3 -c "import json; json.load(open('/opt/hvac-kia-content/state/youtube_state.json'))"
```

**Rollback Steps:**

1. Stop all services:
   ```bash
   sudo systemctl stop hvac-content-aggregator.timer
   sudo systemctl stop hvac-tiktok-captions.timer
   ```

2. Restore state from backup:
   ```bash
   # Find latest backup
   ls -lt /backup/hvac-state-*.tar.gz | head -1

   # Restore state files (substitute the archive name found above)
   cd /
   sudo tar -xzf /backup/hvac-state-20241217.tar.gz
   ```

3. Clear corrupted output:
   ```bash
   # Move corrupted files to quarantine
   mkdir -p /opt/hvac-kia-content/quarantine
   mv /opt/hvac-kia-content/data/*_corrupted.md /opt/hvac-kia-content/quarantine/
   ```

4. Restart services:
   ```bash
   # Start both timers that were stopped in step 1
   sudo systemctl start hvac-content-aggregator.timer
   sudo systemctl start hvac-tiktok-captions.timer
   ```

### Scenario 3: Performance Degradation

**Symptoms:** Slow execution, timeouts, high CPU/memory usage

**Quick Fix:**
```bash
# Check resource usage (-d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, -f run_production.py)"

# Check disk space
df -h /opt/hvac-kia-content

# Clear logs older than 7 days
find /var/log/hvac-content -name "*.log" -mtime +7 -delete
```

**Rollback Steps:**

1. Reduce scraper limits temporarily:
   ```bash
   # Edit production config
   nano /opt/hvac-kia-content/config/production.py
   # Reduce max_posts, max_videos, etc.
   ```

2. Disable problematic scrapers:
   ```python
   # In config/production.py
   SCRAPERS_CONFIG = {
       "instagram": {
           "enabled": False,  # Temporarily disable
           # ...
       }
   }
   ```

3. Restart with reduced load:
   ```bash
   sudo systemctl restart hvac-content-aggregator.service
   ```

### Scenario 4: Complete System Failure

**Symptoms:** Nothing works, multiple component failures

**Full System Rollback:**

1. **Stop Everything:**
   ```bash
   # Stop all timers and services
   sudo systemctl stop hvac-content-aggregator.timer
   sudo systemctl stop hvac-tiktok-captions.timer
   sudo systemctl disable hvac-content-aggregator.timer
   sudo systemctl disable hvac-tiktok-captions.timer
   ```

2. **Backup Current State:**
   ```bash
   # Full backup before rollback
   sudo tar -czf /backup/full-backup-$(date +%Y%m%d-%H%M%S).tar.gz \
       /opt/hvac-kia-content/ \
       /etc/systemd/system/hvac-*.{service,timer} \
       /var/log/hvac-content/
   ```

3. **Clean Installation:**
   ```bash
   # Remove current installation
   sudo rm -rf /opt/hvac-kia-content
   sudo rm -f /etc/systemd/system/hvac-*

   # Clone stable version
   cd /opt
   sudo git clone https://git.tealmaker.com/ben/hvac-kia-content.git
   cd hvac-kia-content
   sudo git checkout v1.0.0  # Last known stable

   # Restore configuration
   sudo cp /backup/.env /opt/hvac-kia-content/

   # Set permissions
   sudo chown -R "$USER:$USER" /opt/hvac-kia-content
   ```

4. **Reinstall Services:**
   ```bash
   cd /opt/hvac-kia-content
   ./install_production.sh
   ```

5. **Restore State (Optional):**
   ```bash
   # Only if state is not corrupted
   sudo tar -xzf /backup/hvac-state-latest.tar.gz -C /
   ```

6. **Verify and Start:**
   ```bash
   # Test first
   cd /opt/hvac-kia-content
   python3 run_production.py --dry-run

   # If successful, re-enable and start the timers disabled in step 1
   sudo systemctl enable hvac-content-aggregator.timer
   sudo systemctl start hvac-content-aggregator.timer
   sudo systemctl enable hvac-tiktok-captions.timer
   sudo systemctl start hvac-tiktok-captions.timer
   ```

## Post-Rollback Verification

### Immediate Checks

- [ ] Services are running:
  ```bash
  systemctl status hvac-content-aggregator.timer
  ```

- [ ] No errors in logs:
  ```bash
  tail -n 100 /var/log/hvac-content/aggregator.log | grep ERROR
  ```

- [ ] Test run successful:
  ```bash
  cd /opt/hvac-kia-content
  python3 test_real_data.py --source youtube --items 1
  ```

### 1-Hour Verification

- [ ] Timer fired as scheduled
- [ ] All scrapers executed
- [ ] Output files generated
- [ ] NAS sync completed
- [ ] No memory leaks
- [ ] CPU usage normal

### 24-Hour Verification

- [ ] System stable
- [ ] No missed schedules
- [ ] Data quality good
- [ ] No duplicate entries
- [ ] Incremental updates working

## Emergency Contacts

### Technical Support

- **Primary Contact:** [Name] - [Phone] - [Email]
- **Secondary Contact:** [Name] - [Phone] - [Email]
- **Escalation:** [Manager Name] - [Phone] - [Email]

### System Access

- **Server:** production-scraper.example.com
- **SSH:** `ssh user@production-scraper.example.com`
- **Logs:** `/var/log/hvac-content/`
- **Config:** `/opt/hvac-kia-content/.env`

## Recovery Time Objectives

| Scenario | Target Recovery Time | Maximum Data Loss |
|----------|---------------------|-------------------|
| Service Restart | 5 minutes | None |
| Version Rollback | 15 minutes | Since last backup |
| State Restoration | 30 minutes | 24 hours |
| Complete Rebuild | 1 hour | 48 hours |

## Lessons Learned Log

### Previous Incidents

Document any rollbacks performed and lessons learned:

| Date | Issue | Resolution | Prevention |
|------|-------|------------|------------|
| | | | |

## Backup Schedule

### Automated Backups

```bash
# Add to crontab (runs daily at 02:00)
0 2 * * * /opt/hvac-kia-content/scripts/backup.sh
```

### Backup Script

```bash
#!/bin/bash
# /opt/hvac-kia-content/scripts/backup.sh

BACKUP_DIR="/backup/hvac-content"
DATE=$(date +%Y%m%d)
RETENTION_DAYS=30

# Ensure the backup directory exists
mkdir -p "$BACKUP_DIR"

# Create backup
tar -czf "$BACKUP_DIR/state-$DATE.tar.gz" /opt/hvac-kia-content/state/
tar -czf "$BACKUP_DIR/config-$DATE.tar.gz" /opt/hvac-kia-content/.env

# Clean old backups
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +"$RETENTION_DAYS" -delete

# Verify backup by listing the archive contents
if tar -tzf "$BACKUP_DIR/state-$DATE.tar.gz" > /dev/null 2>&1; then
    echo "Backup successful: $DATE"
else
    echo "Backup failed: $DATE" | mail -s "HVAC Backup Failed" alerts@example.com
fi
```

## Testing Rollback Procedures

### Monthly Drill

1. Schedule maintenance window
2. Perform controlled rollback
3. Verify recovery procedures
4. Document any issues
5. Update procedures as needed

### Test Checklist

- [ ] Backup procedures work
- [ ] Rollback completes in target time
- [ ] Data integrity maintained
- [ ] Services restart properly
- [ ] Monitoring alerts fire
- [ ] Documentation is current

---

*Last Updated: 2024-12-18*
*Version: 1.0*
*Next Review: 2025-01-18*
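
## Appendix: Selecting a Backup to Restore

The state restore in Scenario 2 picks the newest archive by eye and then extracts it. The sketch below shows one way that choice might be scripted so that an unreadable archive is caught before anything is extracted. The function names `latest_backup` and `verify_backup` are hypothetical helpers introduced here for illustration; they are not part of the project. The directory and file pattern in the usage note are taken from Scenario 2 and may need adjusting to wherever your backups actually live.

```shell
#!/bin/bash
# Hypothetical helpers (illustrative names, not part of the project).

# Print the most recently modified file in directory $1 matching glob $2.
# Returns non-zero and prints nothing if no file matches.
latest_backup() {
    local dir="$1" pattern="$2" newest="" f
    for f in "$dir"/$pattern; do
        # Skip the literal pattern left behind when nothing matches
        [ -f "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Return 0 only if the gzip-compressed archive $1 can be listed end to end.
verify_backup() {
    tar -tzf "$1" > /dev/null 2>&1
}
```

A caller could run `archive=$(latest_backup /backup 'hvac-state-*.tar.gz') && verify_backup "$archive"` and proceed to `sudo tar -xzf "$archive" -C /` only when both succeed. Listing the archive confirms it is readable, not that the state files inside are valid, so the JSON check from the Scenario 2 quick fix still applies after restoring.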