hvac-kia-content/docs/ROLLBACK_PROCEDURES.md
Ben Reed a80af693ba Add comprehensive production documentation and testing
2025-08-18 20:20:52 -03:00


Rollback Procedures

Overview

This document provides step-by-step procedures for rolling back the HVAC Know It All Content Aggregator in case of deployment issues or system failures.

Risk Assessment

Severity Levels

  • CRITICAL: System completely non-functional, no data collection
  • HIGH: Major features broken, partial data loss
  • MEDIUM: Some scrapers failing, degraded performance
  • LOW: Minor issues, cosmetic problems

Pre-Rollback Checklist

Before Rolling Back

  1. Document the Issue

    • Screenshot error messages
    • Save relevant log files
    • Note exact time of failure
    • Record affected components
  2. Attempt Quick Fixes

    • Check environment variables
    • Verify network connectivity
    • Restart failed service
    • Check disk space
  3. Backup Current State

    # Backup current state before rollback
    sudo tar -czf /backup/emergency-$(date +%Y%m%d-%H%M%S).tar.gz \
      /opt/hvac-kia-content/state/ \
      /opt/hvac-kia-content/data/ \
      /var/log/hvac-content/
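
Two of the quick-fix checks in step 2 (environment variables and disk space) can be scripted so they are run the same way every time. The following is a minimal sketch, not part of the project: the function name `quick_checks`, the 1 GB threshold, and any variable names passed in are assumptions for illustration.

```python
import os
import shutil


def quick_checks(path, required_vars, min_free_gb=1.0, environ=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    environ = os.environ if environ is None else environ
    problems = []

    # Flag environment variables that are unset or empty
    for name in required_vars:
        if not environ.get(name):
            problems.append(f"missing environment variable: {name}")

    # Check free space on the filesystem that holds `path`
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"low disk space on {path}: {free_gb:.1f} GB free")

    return problems
```

For example, `quick_checks("/opt/hvac-kia-content", ["SOME_REQUIRED_VAR"])` returns an empty list when nothing is wrong; the variable name here is a placeholder for whatever the `.env` file actually requires.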
    

Rollback Scenarios

Scenario 1: Service Won't Start

Symptoms: Systemd service fails to start after deployment

Quick Fix:

# Check service status
systemctl status hvac-content-aggregator.service

# Check journal for errors
journalctl -u hvac-content-aggregator.service -n 100

# Validate environment
cd /opt/hvac-kia-content
python3 -c "from run_production import validate_environment; validate_environment()"

Rollback Steps:

  1. Stop the timer:

    sudo systemctl stop hvac-content-aggregator.timer
    
  2. Revert to previous version:

    cd /opt/hvac-kia-content
    git fetch --tags
    git checkout v1.0.0  # Previous stable version
    
  3. Reinstall dependencies:

    pip install -r requirements.txt
    
  4. Restart service:

    sudo systemctl daemon-reload
    sudo systemctl start hvac-content-aggregator.timer
    

Scenario 2: Data Corruption

Symptoms: Malformed output, duplicate entries, missing data

Quick Fix:

# Check state files
ls -la /opt/hvac-kia-content/state/

# Validate JSON state files
python3 -c "import json; json.load(open('/opt/hvac-kia-content/state/youtube_state.json'))"
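
The one-liner above validates a single file. A slightly broader sketch, assuming every state file in the directory is a `*.json` file (only `youtube_state.json` is named in this document), reports every file that fails to parse; the helper name `find_invalid_state_files` is hypothetical.

```python
import json
from pathlib import Path


def find_invalid_state_files(state_dir):
    """Return {filename: error message} for every *.json file that fails to parse."""
    invalid = {}
    for path in sorted(Path(state_dir).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                json.load(fh)
        except (json.JSONDecodeError, UnicodeDecodeError, OSError) as exc:
            # Record the parser or I/O error so it can be attached to the incident notes
            invalid[path.name] = str(exc)
    return invalid
```

An empty result from `find_invalid_state_files("/opt/hvac-kia-content/state")` suggests the state files are at least syntactically valid; it does not prove their contents are correct.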

Rollback Steps:

  1. Stop all services:

    sudo systemctl stop hvac-content-aggregator.timer
    sudo systemctl stop hvac-tiktok-captions.timer
    
  2. Restore state from backup:

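    # Find latest backup (path and naming follow the Backup Script section)
    LATEST=$(ls -t /backup/hvac-content/state-*.tar.gz | head -1)
    echo "Restoring from: $LATEST"

    # Restore state files (archive paths are relative to /)
    sudo tar -xzf "$LATEST" -C /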
    
  3. Clear corrupted output:

    # Move corrupted files to quarantine
    mkdir -p /opt/hvac-kia-content/quarantine
    mv /opt/hvac-kia-content/data/*_corrupted.md /opt/hvac-kia-content/quarantine/
    
  4. Restart services:

    sudo systemctl start hvac-content-aggregator.timer
    sudo systemctl start hvac-tiktok-captions.timer
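
Choosing the archive to restore in step 2 can also be done programmatically. This is a sketch under assumptions: the helper name `latest_backup` is hypothetical, and the default pattern is deliberately loose because this document uses both `hvac-state-*` and `state-*` archive names.

```python
from pathlib import Path


def latest_backup(backup_dir, pattern="*state-*.tar.gz"):
    """Return the most recently modified archive matching `pattern`, or None if there is none."""
    candidates = list(Path(backup_dir).glob(pattern))
    if not candidates:
        return None
    # Pick by modification time, mirroring `ls -t ... | head -1`
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Returning `None` instead of raising lets the caller stop the rollback with a clear message when no backup exists.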
    

Scenario 3: Performance Degradation

Symptoms: Slow execution, timeouts, high CPU/memory usage

Quick Fix:

# Check resource usage (-d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, -f run_production.py)"

# Check disk space
df -h /opt/hvac-kia-content

# Clear old logs
find /var/log/hvac-content -name "*.log" -mtime +7 -delete
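
Because `-delete` is irreversible, it can help to list what would be removed first. The sketch below is a rough Python equivalent of the `find` filter; the function name `logs_older_than` is hypothetical, and the age boundary differs slightly from `find -mtime +7`, which rounds file ages down to whole days.

```python
import time
from pathlib import Path


def logs_older_than(log_dir, days=7, now=None):
    """Return *.log files under `log_dir` (recursively) last modified more than `days` days ago."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    return sorted(
        p for p in Path(log_dir).rglob("*.log")
        if p.is_file() and p.stat().st_mtime < cutoff
    )
```

Reviewing the output of `logs_older_than("/var/log/hvac-content")` before running the delete avoids removing logs that are still needed for the incident report.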

Rollback Steps:

  1. Reduce scraper limits temporarily:

    # Edit production config
    nano /opt/hvac-kia-content/config/production.py
    # Reduce max_posts, max_videos, etc.
    
  2. Disable problematic scrapers:

    # In config/production.py
    SCRAPERS_CONFIG = {
        "instagram": {
            "enabled": False,  # Temporarily disable
            ...
        }
    }
    
  3. Restart with reduced load:

    sudo systemctl restart hvac-content-aggregator.service
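
After editing the config, it is worth confirming which scrapers will actually run. A minimal sketch, assuming `SCRAPERS_CONFIG` is a dict of per-scraper dicts with an `"enabled"` key as shown in step 2; treating a missing key as enabled is an assumption, and `enabled_scrapers` is a hypothetical helper.

```python
def enabled_scrapers(config):
    """Return the names of scrapers whose "enabled" flag is true.

    A scraper with no "enabled" key is treated as enabled (assumption).
    """
    return [name for name, opts in config.items() if opts.get("enabled", True)]


# Illustrative config only; the real keys live in config/production.py
example_config = {
    "instagram": {"enabled": False},  # temporarily disabled
    "youtube": {"enabled": True},
    "tiktok": {},
}
```

With the example above, `enabled_scrapers(example_config)` lists `youtube` and `tiktok` but not `instagram`.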
    

Scenario 4: Complete System Failure

Symptoms: Multiple components fail at once and the quick fixes from the earlier scenarios do not restore service

Full System Rollback:

  1. Stop Everything:

    # Stop all timers and services
    sudo systemctl stop hvac-content-aggregator.timer
    sudo systemctl stop hvac-tiktok-captions.timer
    sudo systemctl disable hvac-content-aggregator.timer
    sudo systemctl disable hvac-tiktok-captions.timer
    
  2. Backup Current State:

    # Full backup before rollback
    sudo tar -czf /backup/full-backup-$(date +%Y%m%d-%H%M%S).tar.gz \
      /opt/hvac-kia-content/ \
      /etc/systemd/system/hvac-*.{service,timer} \
      /var/log/hvac-content/
    
  3. Clean Installation:

    # Remove current installation
    sudo rm -rf /opt/hvac-kia-content
    sudo rm -f /etc/systemd/system/hvac-*
    
    # Clone stable version
    cd /opt
    sudo git clone https://github.com/yourusername/hvac-kia-content.git
    cd hvac-kia-content
    sudo git checkout v1.0.0  # Last known stable
    
    # Restore configuration from the latest config archive written by the backup script
    sudo tar -xzf "$(ls -t /backup/hvac-content/config-*.tar.gz | head -1)" -C /
    
    # Set permissions
    sudo chown -R $USER:$USER /opt/hvac-kia-content
    
  4. Reinstall Services:

    cd /opt/hvac-kia-content
    ./install_production.sh
    
  5. Restore State (Optional):

    # Only if state is not corrupted; uses the latest archive written by the backup script
    sudo tar -xzf "$(ls -t /backup/hvac-content/state-*.tar.gz | head -1)" -C /
    
  6. Verify and Start:

    # Test first
    python3 run_production.py --dry-run
    
    # If successful, enable services
    sudo systemctl enable hvac-content-aggregator.timer
    sudo systemctl start hvac-content-aggregator.timer
    

Post-Rollback Verification

Immediate Checks

  • Services are running:
    systemctl status hvac-content-aggregator.timer
    
  • No errors in logs:
    tail -n 100 /var/log/hvac-content/aggregator.log | grep ERROR
    
  • Test run successful:
    cd /opt/hvac-kia-content
    python3 test_real_data.py --source youtube --items 1
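
The log check above can be expressed as a small function that returns the matching lines, which makes it easy to reuse in a verification script. This is a sketch: `recent_errors` is a hypothetical name, and it assumes error lines contain the literal text `ERROR`, as the `grep` command does.

```python
from collections import deque


def recent_errors(log_path, tail=100, marker="ERROR"):
    """Return lines containing `marker` among the last `tail` lines of the log."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the final `tail` lines while streaming the file
        last_lines = deque(fh, maxlen=tail)
    return [line.rstrip("\n") for line in last_lines if marker in line]
```

An empty list from `recent_errors("/var/log/hvac-content/aggregator.log")` corresponds to the "No errors in logs" check passing.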
    

1-Hour Verification

  • Timer fired as scheduled
  • All scrapers executed
  • Output files generated
  • NAS sync completed
  • No memory leaks
  • CPU usage normal

24-Hour Verification

  • System stable
  • No missed schedules
  • Data quality good
  • No duplicate entries
  • Incremental updates working

Emergency Contacts

Technical Support

  • Primary Contact: [Name] - [Phone] - [Email]
  • Secondary Contact: [Name] - [Phone] - [Email]
  • Escalation: [Manager Name] - [Phone] - [Email]

System Access

  • Server: production-scraper.example.com
  • SSH: ssh user@production-scraper.example.com
  • Logs: /var/log/hvac-content/
  • Config: /opt/hvac-kia-content/.env

Recovery Time Objectives

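| Scenario          | Target Recovery Time | Maximum Data Loss |
|-------------------|----------------------|-------------------|
| Service Restart   | 5 minutes            | None              |
| Version Rollback  | 15 minutes           | Since last backup |
| State Restoration | 30 minutes           | 24 hours          |
| Complete Rebuild  | 1 hour               | 48 hours          |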
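
During a drill or a real incident, the targets above can be checked against the measured recovery time. The sketch below encodes the target recovery times from the table; the names `RTO_TARGETS` and `rto_exceeded` are hypothetical.

```python
from datetime import timedelta

# Target recovery times copied from the table above
RTO_TARGETS = {
    "Service Restart": timedelta(minutes=5),
    "Version Rollback": timedelta(minutes=15),
    "State Restoration": timedelta(minutes=30),
    "Complete Rebuild": timedelta(hours=1),
}


def rto_exceeded(scenario, started_at, finished_at):
    """Return True if the recovery took longer than the target for `scenario`."""
    return (finished_at - started_at) > RTO_TARGETS[scenario]
```

A recovery that overruns its target is a natural entry for the Lessons Learned Log.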

Lessons Learned Log

Previous Incidents

Document any rollbacks performed and lessons learned:

| Date | Issue | Resolution | Prevention |
|------|-------|------------|------------|

Backup Schedule

Automated Backups

# Add to crontab
0 2 * * * /opt/hvac-kia-content/scripts/backup.sh

Backup Script

#!/bin/bash
# /opt/hvac-kia-content/scripts/backup.sh

BACKUP_DIR="/backup/hvac-content"
DATE=$(date +%Y%m%d)
RETENTION_DAYS=30

# Make sure the backup directory exists
mkdir -p "$BACKUP_DIR"

# Create backup
tar -czf "$BACKUP_DIR/state-$DATE.tar.gz" /opt/hvac-kia-content/state/
tar -czf "$BACKUP_DIR/config-$DATE.tar.gz" /opt/hvac-kia-content/.env

# Verify both archives before pruning, so a failed run never deletes good backups
if tar -tzf "$BACKUP_DIR/state-$DATE.tar.gz" > /dev/null 2>&1 &&
   tar -tzf "$BACKUP_DIR/config-$DATE.tar.gz" > /dev/null 2>&1; then
    echo "Backup successful: $DATE"
    # Clean old backups
    find "$BACKUP_DIR" -name "*.tar.gz" -mtime +"$RETENTION_DAYS" -delete
else
    echo "Backup failed: $DATE" | mail -s "HVAC Backup Failed" alerts@example.com
    exit 1
fi
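
The verification step in the script lists the archive with `tar -tzf`. The same check can be sketched in Python with the standard `tarfile` module, for use in a drill or monitoring script; `archive_is_readable` is a hypothetical helper, not part of the project.

```python
import tarfile
import zlib


def archive_is_readable(path):
    """Return True if all member headers of a gzip-compressed tar archive can be listed."""
    try:
        with tarfile.open(path, "r:gz") as tar:
            # Walking the member list forces the whole compressed stream to be read
            tar.getmembers()
        return True
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
```

Like `tar -tzf`, this confirms that the archive can be read end to end; it does not confirm that the files inside hold valid state.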

Testing Rollback Procedures

Monthly Drill

  1. Schedule maintenance window
  2. Perform controlled rollback
  3. Verify recovery procedures
  4. Document any issues
  5. Update procedures as needed

Test Checklist

  • Backup procedures work
  • Rollback completes in target time
  • Data integrity maintained
  • Services restart properly
  • Monitoring alerts fire
  • Documentation is current

Last Updated: 2024-12-18
Version: 1.0
Next Review: 2025-01-18