Rollback Procedures
Overview
This document provides step-by-step procedures for rolling back the HVAC Know It All Content Aggregator in case of deployment issues or system failures.
Risk Assessment
Severity Levels
- CRITICAL: System completely non-functional, no data collection
- HIGH: Major features broken, partial data loss
- MEDIUM: Some scrapers failing, degraded performance
- LOW: Minor issues, cosmetic problems
Pre-Rollback Checklist
Before Rolling Back
- Document the Issue
    - Screenshot error messages
    - Save relevant log files
    - Note exact time of failure
    - Record affected components
- Attempt Quick Fixes
    - Check environment variables
    - Verify network connectivity
    - Restart failed service
    - Check disk space
- Backup Current State
    # Backup current state before rollback
    sudo tar -czf /backup/emergency-$(date +%Y%m%d-%H%M%S).tar.gz \
        /opt/hvac-kia-content/state/ \
        /opt/hvac-kia-content/data/ \
        /var/log/hvac-content/
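If the emergency backup needs to run from Python rather than the shell, a minimal sketch along these lines may help. It assumes the same three source paths as the `tar` command above; the function name `create_emergency_backup` and the `backup_dir` parameter are hypothetical and not part of the project.

```python
import tarfile
import time
from pathlib import Path

# Source paths copied from the tar command in this checklist.
DEFAULT_SOURCES = (
    "/opt/hvac-kia-content/state/",
    "/opt/hvac-kia-content/data/",
    "/var/log/hvac-content/",
)


def create_emergency_backup(backup_dir, sources=DEFAULT_SOURCES):
    """Write emergency-YYYYmmdd-HHMMSS.tar.gz; return (archive path, member names)."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = Path(backup_dir) / f"emergency-{stamp}.tar.gz"
    archive.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "w:gz") as tar:
        for src in sources:
            src_path = Path(src)
            if src_path.exists():  # skip missing paths instead of aborting
                tar.add(src_path, arcname=str(src_path).lstrip("/"))
    # Re-open the archive to confirm it is readable before rolling back.
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
    return archive, names
```

Calling `create_emergency_backup("/backup")` would mirror the shell command. Unlike `tar`, the sketch silently skips a source path that does not exist, so check the returned member names before relying on the archive.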
Rollback Scenarios
Scenario 1: Service Won't Start
Symptoms: Systemd service fails to start after deployment
Quick Fix:
# Check service status
systemctl status hvac-content-aggregator.service
# Check journal for errors
journalctl -u hvac-content-aggregator.service -n 100
# Validate environment
cd /opt/hvac-kia-content
python3 -c "from run_production import validate_environment; validate_environment()"
Rollback Steps:
- Stop the timer:
    sudo systemctl stop hvac-content-aggregator.timer
- Revert to previous version:
    cd /opt/hvac-kia-content
    git fetch --tags
    git checkout v1.0.0  # Previous stable version
- Reinstall dependencies:
    pip install -r requirements.txt
- Restart service:
    sudo systemctl daemon-reload
    sudo systemctl start hvac-content-aggregator.timer
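The version to check out is hard-coded as `v1.0.0` in the steps above. As a hedged sketch, the tag immediately below the deployed one could be picked from a list of tags (for example the output of `git tag --list`). This assumes every release tag follows the `vX.Y.Z` pattern, which the document does not state; `parse_version` and `previous_release` are hypothetical helper names. The highest lower tag is not necessarily a known-stable release, so confirm it before checking it out.

```python
import re


def parse_version(tag):
    """Turn 'v1.2.3' into (1, 2, 3); return None for tags that do not match."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", tag)
    return tuple(int(part) for part in match.groups()) if match else None


def previous_release(tags, current):
    """Return the highest vX.Y.Z tag strictly below `current`, or None."""
    cur = parse_version(current)
    if cur is None:
        raise ValueError(f"not a vX.Y.Z tag: {current!r}")
    # Compare numerically so that v1.0.10 sorts above v1.0.2.
    older = [t for t in tags if (v := parse_version(t)) is not None and v < cur]
    return max(older, key=parse_version, default=None)
```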
Scenario 2: Data Corruption
Symptoms: Malformed output, duplicate entries, missing data
Quick Fix:
# Check state files
ls -la /opt/hvac-kia-content/state/
# Validate JSON state files
python3 -c "import json; json.load(open('/opt/hvac-kia-content/state/youtube_state.json'))"
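The one-liner above only validates `youtube_state.json`. A rough sketch that checks every JSON file in the state directory, assuming all state files use the `.json` extension (the name `find_invalid_state_files` is hypothetical):

```python
import json
from pathlib import Path


def find_invalid_state_files(state_dir):
    """Return {filename: error message} for every *.json file that fails to parse."""
    problems = {}
    for path in sorted(Path(state_dir).glob("*.json")):
        try:
            with open(path, encoding="utf-8") as fh:
                json.load(fh)
        except (json.JSONDecodeError, UnicodeDecodeError, OSError) as exc:
            problems[path.name] = str(exc)
    return problems
```

An empty result from `find_invalid_state_files("/opt/hvac-kia-content/state")` means every file parsed as JSON. It does not prove the contents are semantically correct.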
Rollback Steps:
- Stop all services:
    sudo systemctl stop hvac-content-aggregator.timer
    sudo systemctl stop hvac-tiktok-captions.timer
- Restore state from backup:
    # Find latest backup
    ls -lt /backup/hvac-state-*.tar.gz | head -1
    # Restore state files (use the archive found above; the date here is an example)
    cd /
    sudo tar -xzf /backup/hvac-state-20241217.tar.gz
- Clear corrupted output:
    # Move corrupted files to quarantine
    mkdir -p /opt/hvac-kia-content/quarantine
    mv /opt/hvac-kia-content/data/*_corrupted.md /opt/hvac-kia-content/quarantine/
- Restart services:
    sudo systemctl start hvac-content-aggregator.timer
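The "find latest backup" step relies on `ls -lt | head -1`. A small Python equivalent is sketched here, assuming backups match the `hvac-state-*.tar.gz` pattern used above and that modification time reflects backup age; `latest_backup` is a hypothetical helper.

```python
from pathlib import Path


def latest_backup(backup_dir, pattern="hvac-state-*.tar.gz"):
    """Return the most recently modified archive matching pattern, or None."""
    candidates = list(Path(backup_dir).glob(pattern))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

A `None` result means there is nothing to restore from, in which case the state restoration step should be skipped rather than guessed at.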
Scenario 3: Performance Degradation
Symptoms: Slow execution, timeouts, high CPU/memory usage
Quick Fix:
# Check resource usage (-d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, -f run_production.py)"
# Check disk space
df -h /opt/hvac-kia-content
# Clear old logs
find /var/log/hvac-content -name "*.log" -mtime +7 -delete
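Before deleting logs, it can be useful to see which files would be removed. The sketch below lists them without deleting anything and reports disk usage as a percentage. It mirrors the `find -mtime +7` rule, where ages are rounded down to whole days, so only files at least 8 days old match. The function names are hypothetical.

```python
import shutil
import time
from pathlib import Path


def disk_usage_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def logs_older_than(log_dir, days=7, now=None):
    """List *.log files that `find -mtime +days` would match (nothing is deleted)."""
    now = time.time() if now is None else now
    matches = []
    for path in Path(log_dir).rglob("*.log"):
        if not path.is_file():
            continue
        # Age in whole 24-hour periods, fraction discarded, as find does.
        age_days = int((now - path.stat().st_mtime) // 86400)
        if age_days > days:
            matches.append(path)
    return sorted(matches)
```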
Rollback Steps:
- Reduce scraper limits temporarily:
    # Edit production config
    nano /opt/hvac-kia-content/config/production.py
    # Reduce max_posts, max_videos, etc.
- Disable problematic scrapers:
    # In config/production.py
    SCRAPERS_CONFIG = {
        "instagram": {
            "enabled": False,  # Temporarily disable
            ...
        }
    }
- Restart with reduced load:
    sudo systemctl restart hvac-content-aggregator.service
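The config edits above could also be expressed as a function that derives a reduced-load copy of the settings, which keeps the original values available for when the incident is over. This is only a sketch: the full layout of `SCRAPERS_CONFIG` is elided in this document, so it assumes each scraper entry is a dict with an `enabled` flag and integer limits named `max_posts` or `max_videos`. `reduce_load` and its parameters are hypothetical.

```python
import copy


def reduce_load(config, disable=(), limit_keys=("max_posts", "max_videos"), factor=0.5):
    """Return a copy of config with some scrapers disabled and limits scaled down."""
    reduced = copy.deepcopy(config)  # leave the original config untouched
    for name, settings in reduced.items():
        if name in disable:
            settings["enabled"] = False
        for key in limit_keys:
            if isinstance(settings.get(key), int):
                # Never scale a limit below 1 item.
                settings[key] = max(1, int(settings[key] * factor))
    return reduced
```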
Scenario 4: Complete System Failure
Symptoms: Nothing works, multiple component failures
Full System Rollback:
- Stop Everything:
    # Stop all timers and services
    sudo systemctl stop hvac-content-aggregator.timer
    sudo systemctl stop hvac-tiktok-captions.timer
    sudo systemctl disable hvac-content-aggregator.timer
    sudo systemctl disable hvac-tiktok-captions.timer
- Backup Current State:
    # Full backup before rollback
    sudo tar -czf /backup/full-backup-$(date +%Y%m%d-%H%M%S).tar.gz \
        /opt/hvac-kia-content/ \
        /etc/systemd/system/hvac-*.{service,timer} \
        /var/log/hvac-content/
- Clean Installation:
    # Remove current installation
    sudo rm -rf /opt/hvac-kia-content
    sudo rm -f /etc/systemd/system/hvac-*
    # Clone stable version
    cd /opt
    sudo git clone https://github.com/yourusername/hvac-kia-content.git
    cd hvac-kia-content
    sudo git checkout v1.0.0  # Last known stable
    # Restore configuration
    sudo cp /backup/.env /opt/hvac-kia-content/
    # Set permissions
    sudo chown -R $USER:$USER /opt/hvac-kia-content
- Reinstall Services:
    cd /opt/hvac-kia-content
    ./install_production.sh
- Restore State (Optional):
    # Only if state is not corrupted
    sudo tar -xzf /backup/hvac-state-latest.tar.gz -C /
- Verify and Start:
    # Test first
    python3 run_production.py --dry-run
    # If successful, enable services
    sudo systemctl enable hvac-content-aggregator.timer
    sudo systemctl start hvac-content-aggregator.timer
Post-Rollback Verification
Immediate Checks
- Services are running:
    systemctl status hvac-content-aggregator.timer
- No errors in logs:
    tail -n 100 /var/log/hvac-content/aggregator.log | grep ERROR
- Test run successful:
    cd /opt/hvac-kia-content
    python3 test_real_data.py --source youtube --items 1
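The log check above can be scripted so that a verification run fails loudly when errors appear. A minimal sketch, assuming errors are marked with the literal string `ERROR` as in the `grep` command; `recent_errors` is a hypothetical helper:

```python
from collections import deque


def recent_errors(log_path, last_n=100, marker="ERROR"):
    """Return the lines containing marker among the last last_n lines of the log."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        tail = deque(fh, maxlen=last_n)  # keeps only the final last_n lines
    return [line.rstrip("\n") for line in tail if marker in line]
```

An empty list from `recent_errors("/var/log/hvac-content/aggregator.log")` corresponds to `grep` printing nothing.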
1-Hour Verification
- Timer fired as scheduled
- All scrapers executed
- Output files generated
- NAS sync completed
- No memory leaks
- CPU usage normal
24-Hour Verification
- System stable
- No missed schedules
- Data quality good
- No duplicate entries
- Incremental updates working
Emergency Contacts
Technical Support
- Primary Contact: [Name] - [Phone] - [Email]
- Secondary Contact: [Name] - [Phone] - [Email]
- Escalation: [Manager Name] - [Phone] - [Email]
System Access
- Server: production-scraper.example.com
- SSH: ssh user@production-scraper.example.com
- Logs: /var/log/hvac-content/
- Config: /opt/hvac-kia-content/.env
Recovery Time Objectives
| Scenario | Target Recovery Time | Maximum Data Loss |
|---|---|---|
| Service Restart | 5 minutes | None |
| Version Rollback | 15 minutes | Since last backup |
| State Restoration | 30 minutes | 24 hours |
| Complete Rebuild | 1 hour | 48 hours |
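During a drill or a real incident, the targets in the table can be checked against measured times. The sketch below encodes the four target recovery times from the table; the names `RECOVERY_TARGETS` and `met_recovery_target` are hypothetical.

```python
from datetime import datetime, timedelta

# Target recovery times copied from the table above.
RECOVERY_TARGETS = {
    "Service Restart": timedelta(minutes=5),
    "Version Rollback": timedelta(minutes=15),
    "State Restoration": timedelta(minutes=30),
    "Complete Rebuild": timedelta(hours=1),
}


def met_recovery_target(scenario, started, recovered):
    """Return (met, elapsed) for a rollback that ran from started to recovered."""
    elapsed = recovered - started
    return elapsed <= RECOVERY_TARGETS[scenario], elapsed
```

For example, a version rollback that starts at 02:00 and finishes at 02:12 took 12 minutes and meets the 15-minute target.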
Lessons Learned Log
Previous Incidents
Document any rollbacks performed and lessons learned:
| Date | Issue | Resolution | Prevention |
|---|---|---|---|
Backup Schedule
Automated Backups
# Add to crontab
0 2 * * * /opt/hvac-kia-content/scripts/backup.sh
Backup Script
#!/bin/bash
# /opt/hvac-kia-content/scripts/backup.sh
BACKUP_DIR="/backup/hvac-content"
DATE=$(date +%Y%m%d)
RETENTION_DAYS=30

# Make sure the backup directory exists
mkdir -p "$BACKUP_DIR"

# Create backup
tar -czf "$BACKUP_DIR/state-$DATE.tar.gz" /opt/hvac-kia-content/state/
tar -czf "$BACKUP_DIR/config-$DATE.tar.gz" /opt/hvac-kia-content/.env

# Verify backup; clean old backups only once the new one is readable
if tar -tzf "$BACKUP_DIR/state-$DATE.tar.gz" > /dev/null 2>&1; then
    find "$BACKUP_DIR" -name "*.tar.gz" -mtime +"$RETENTION_DAYS" -delete
    echo "Backup successful: $DATE"
else
    echo "Backup failed: $DATE" | mail -s "HVAC Backup Failed" alerts@example.com
fi
Testing Rollback Procedures
Monthly Drill
- Schedule maintenance window
- Perform controlled rollback
- Verify recovery procedures
- Document any issues
- Update procedures as needed
Test Checklist
- Backup procedures work
- Rollback completes in target time
- Data integrity maintained
- Services restart properly
- Monitoring alerts fire
- Documentation is current
Last Updated: 2024-12-18
Version: 1.0
Next Review: 2025-01-18