hvac-kia-content/monitoring/README.md
Ben Reed dc57ce80d5 Add comprehensive monitoring and alerting system
2025-08-18 21:35:28 -03:00

HVAC Know It All - Monitoring System

This directory contains the monitoring and alerting system for the HVAC Know It All Content Aggregation System.

Components

1. Monitoring Script (setup_monitoring.py)

  • Collects system metrics (CPU, memory, disk, network)
  • Monitors application metrics (scraper status, data sizes, log files)
  • Checks for alert conditions
  • Generates health reports
  • Cleans up old metric files

2. Dashboard Generator (dashboard_generator.py)

  • Creates HTML dashboard with real-time system status
  • Shows resource usage trends with charts
  • Displays scraper performance metrics
  • Lists recent alerts and system health
  • Auto-refreshes every 5 minutes

3. Systemd Services

  • hvac-monitoring.service: Runs monitoring and dashboard generation
  • hvac-monitoring.timer: Executes monitoring every 15 minutes

Installation

  1. Install dependencies:

    sudo apt update
    sudo apt install python3-psutil
    
  2. Install systemd services:

    sudo cp systemd/hvac-monitoring.* /etc/systemd/system/
    sudo systemctl daemon-reload
    sudo systemctl enable hvac-monitoring.timer
    sudo systemctl start hvac-monitoring.timer
    
  3. Verify monitoring is running:

    sudo systemctl status hvac-monitoring.timer
    sudo journalctl -u hvac-monitoring -f
    

Directory Structure

monitoring/
├── setup_monitoring.py      # Main monitoring script
├── dashboard_generator.py    # HTML dashboard generator
├── README.md                # This file
├── metrics/                 # JSON metric files (auto-created)
│   ├── system_YYYYMMDD_HHMMSS.json
│   ├── application_YYYYMMDD_HHMMSS.json
│   └── health_report_YYYYMMDD_HHMMSS.json
├── alerts/                  # Alert files (auto-created)
│   └── alerts_YYYYMMDD_HHMMSS.json
└── dashboard/               # HTML dashboard files (auto-created)
    ├── index.html           # Current dashboard
    └── dashboard_YYYYMMDD_HHMMSS.html  # Timestamped backups
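
The timestamped naming scheme above can be reproduced with a few lines of Python. This is a minimal sketch, not the code in setup_monitoring.py: the function name write_metrics and its parameters are assumptions made for illustration.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def write_metrics(metrics: dict, kind: str, metrics_dir: Path,
                  now: Optional[datetime] = None) -> Path:
    """Write one snapshot to <kind>_YYYYMMDD_HHMMSS.json and return its path."""
    now = now or datetime.now()
    # Mirror the "auto-created" behaviour described in the tree above.
    metrics_dir.mkdir(parents=True, exist_ok=True)
    path = metrics_dir / f"{kind}_{now:%Y%m%d_%H%M%S}.json"
    path.write_text(json.dumps(metrics, indent=2))
    return path
```

Because the timestamp is zero-padded, sorting the file names alphabetically also sorts them chronologically.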

Monitoring Metrics

System Metrics

  • CPU Usage: Percentage utilization
  • Memory Usage: Percentage of RAM used
  • Disk Usage: Percentage of disk space used
  • Network I/O: Bytes sent/received, packets
  • System Uptime: Hours since last boot
  • Load Average: System load (Linux only)
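
For a quick check on a host where psutil is not yet installed, two of these metrics can be approximated with the standard library alone. The sketch below is not how the monitoring script collects them (it depends on psutil); the function name and the dictionary keys are assumptions.

```python
import os
import shutil


def basic_system_metrics(path: str = "/") -> dict:
    """Collect disk usage and load average without third-party packages."""
    usage = shutil.disk_usage(path)
    metrics = {
        "disk_percent": round(usage.used / usage.total * 100, 1),
        "load_average": None,
    }
    # os.getloadavg() exists only on Unix-like systems and may raise OSError.
    if hasattr(os, "getloadavg"):
        try:
            metrics["load_average"] = os.getloadavg()[0]  # 1-minute average
        except OSError:
            pass
    return metrics
```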

Application Metrics

  • Scraper Status: Last update time, item counts, state
  • Data Directory Sizes: Markdown, media, archives
  • Log File Status: Size, last modified time
  • State File Analysis: Last IDs, update timestamps
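
Directory sizes and log file status can be gathered with pathlib. The following is a sketch under assumed names (directory_size_bytes, log_file_status, and the returned keys are hypothetical), not the script's actual implementation.

```python
from datetime import datetime
from pathlib import Path


def directory_size_bytes(directory: Path) -> int:
    """Total size of all regular files under directory (0 if it is missing)."""
    if not directory.is_dir():
        return 0
    return sum(f.stat().st_size for f in directory.rglob("*") if f.is_file())


def log_file_status(log_path: Path) -> dict:
    """Size in MB and last-modified time for one log file."""
    stat = log_path.stat()
    return {
        "size_mb": round(stat.st_size / (1024 * 1024), 2),
        "last_modified": datetime.fromtimestamp(stat.st_mtime).isoformat(),
    }
```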

Alert Conditions

Critical Alerts

  • CPU usage > 80%
  • Memory usage > 85%
  • Disk usage > 90%

Warning Alerts

  • Scraper hasn't updated in > 24 hours
  • Log files > 100MB
  • Application errors detected

Error Alerts

  • Monitoring system failures
  • File access errors
  • Configuration issues
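
The critical thresholds and the 24-hour scraper rule above could be expressed roughly as follows. This is an illustrative sketch: only cpu_percent and the type/component/message alert shape appear elsewhere in this README, while memory_percent, disk_percent, and both function names are assumptions.

```python
from datetime import datetime, timedelta

# Thresholds from the "Critical Alerts" list; key names other than
# cpu_percent are assumed for this sketch.
CRITICAL_THRESHOLDS = {
    "cpu_percent": 80,
    "memory_percent": 85,
    "disk_percent": 90,
}


def check_resource_alerts(system_metrics: dict) -> list:
    """Return one CRITICAL alert per metric that exceeds its threshold."""
    alerts = []
    for key, limit in CRITICAL_THRESHOLDS.items():
        value = system_metrics.get(key, 0)
        if value > limit:
            alerts.append({
                "type": "CRITICAL",
                "component": "system",
                "message": f"{key} is {value}% (threshold {limit}%)",
            })
    return alerts


def scraper_is_stale(last_update: datetime, now: datetime,
                     max_age_hours: int = 24) -> bool:
    """True when the scraper's last update is older than max_age_hours."""
    return now - last_update > timedelta(hours=max_age_hours)
```

Keeping the thresholds in one dictionary means a change such as raising the CPU limit touches a single line.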

Dashboard Features

Health Overview

  • Overall system status (HEALTHY/WARNING/CRITICAL)
  • Resource usage gauges
  • Alert summary counts
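
One plausible way to derive the overall status from a list of alerts is shown below. It is a sketch, and treating ERROR alerts as CRITICAL is an assumption; the dashboard generator may map them differently.

```python
def overall_status(alerts: list) -> str:
    """Map a list of alert dicts to HEALTHY, WARNING, or CRITICAL."""
    types = {alert.get("type") for alert in alerts}
    # Assumption: ERROR alerts (monitoring failures) count as CRITICAL.
    if types & {"CRITICAL", "ERROR"}:
        return "CRITICAL"
    if "WARNING" in types:
        return "WARNING"
    return "HEALTHY"
```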

Trend Charts

  • CPU, memory, disk usage over time
  • Scraper item collection trends
  • Historical performance data

Real-time Status

  • Current scraper status table
  • Recent alert history
  • Last update timestamps

Auto-refresh

  • Dashboard updates every 5 minutes
  • Manual refresh available
  • Responsive design for mobile/desktop
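
A static HTML page can reload itself on a fixed interval with a meta refresh tag. The sketch below shows the idea with a 300-second interval. It assumes that this is the mechanism the dashboard uses, which the README does not state, and render_page is a hypothetical helper.

```python
from html import escape

REFRESH_SECONDS = 300  # 5 minutes


def render_page(title: str, body_html: str,
                refresh_seconds: int = REFRESH_SECONDS) -> str:
    """Wrap body_html in a minimal page that reloads itself periodically."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '  <meta charset="utf-8">\n'
        f'  <meta http-equiv="refresh" content="{refresh_seconds}">\n'
        '  <meta name="viewport" content="width=device-width, initial-scale=1">\n'
        f"  <title>{escape(title)}</title>\n"
        "</head>\n"
        f"<body>\n{body_html}\n</body>\n</html>\n"
    )
```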

Usage

Manual Monitoring

# Run monitoring check
python3 /opt/hvac-kia-content/monitoring/setup_monitoring.py

# Generate dashboard
python3 /opt/hvac-kia-content/monitoring/dashboard_generator.py

# View dashboard
firefox file:///opt/hvac-kia-content/monitoring/dashboard/index.html

Check Recent Metrics

# List the most recent health report (timestamped names sort chronologically)
ls -la /opt/hvac-kia-content/monitoring/metrics/health_report_*.json | tail -1

# List the five most recent alert files
ls -la /opt/hvac-kia-content/monitoring/alerts/alerts_*.json | tail -5
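
To read the newest report from Python rather than just listing it, a helper along these lines would work. It is a minimal sketch: latest_report is a hypothetical name, and it relies only on the file naming shown under Directory Structure.

```python
import json
from pathlib import Path
from typing import Optional


def latest_report(metrics_dir: Path,
                  prefix: str = "health_report_") -> Optional[dict]:
    """Load the newest <prefix>YYYYMMDD_HHMMSS.json file, or None if absent."""
    # Zero-padded timestamps sort chronologically as plain strings.
    candidates = sorted(metrics_dir.glob(f"{prefix}*.json"))
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text())
```

Passing prefix="alerts_" together with the alerts directory loads the newest alert file the same way.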

Monitor Logs

# Follow monitoring logs
sudo journalctl -u hvac-monitoring -f

# View timer status
sudo systemctl list-timers hvac-monitoring.timer

Troubleshooting

Common Issues

  1. Permission Errors

    sudo chown -R hvac:hvac /opt/hvac-kia-content/monitoring/
    sudo chmod +x /opt/hvac-kia-content/monitoring/*.py
    
  2. Missing Dependencies

    # json is part of Python's standard library, so only psutil needs installing
    sudo apt install python3-psutil
    
  3. Service Not Running

    sudo systemctl status hvac-monitoring.timer
    sudo systemctl restart hvac-monitoring.timer
    
  4. Dashboard Not Updating

    # Check if files are being generated
    ls -la /opt/hvac-kia-content/monitoring/metrics/
    
    # Manually run dashboard generator
    python3 /opt/hvac-kia-content/monitoring/dashboard_generator.py
    

Log Analysis

# Check for errors in monitoring
sudo journalctl -u hvac-monitoring --since "1 hour ago"

# Monitor system resources
htop

# Check disk space
df -h /opt/hvac-kia-content/

Integration

Web Server Setup (Optional)

To serve the dashboard via HTTP:

# Install nginx
sudo apt install nginx

# Create site config
sudo tee /etc/nginx/sites-available/hvac-monitoring << EOF
server {
    listen 8080;
    root /opt/hvac-kia-content/monitoring/dashboard;
    index index.html;
    
    location / {
        try_files \$uri \$uri/ =404;
    }
}
EOF

# Enable site
sudo ln -s /etc/nginx/sites-available/hvac-monitoring /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx

Access dashboard at: http://your-server:8080

Email Alerts (Optional)

To enable email alerts for critical issues:

# Install mail utilities
sudo apt install mailutils

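# Provide the recipient and SMTP server to the monitoring script.
# Shell exports do not reach a systemd service; for the timer-driven run,
# set the same variables with Environment= in hvac-monitoring.service.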
export ALERT_EMAIL="admin@yourdomain.com"
export SMTP_SERVER="smtp.yourdomain.com"
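
If the monitoring script does not already send mail, the two variables above could drive something like the following. This is a sketch only: both function names are hypothetical, the sender address simply reuses the recipient, and it assumes the SMTP server accepts unauthenticated mail on the default port.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert_email(alerts: list, recipient: str, sender: str) -> EmailMessage:
    """Build a plain-text message listing each alert on its own line."""
    msg = EmailMessage()
    msg["Subject"] = f"[HVAC monitoring] {len(alerts)} critical alert(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{a['type']}: {a['message']}" for a in alerts))
    return msg


def send_critical_alerts(alerts: list) -> bool:
    """Email CRITICAL alerts if ALERT_EMAIL and SMTP_SERVER are set."""
    critical = [a for a in alerts if a.get("type") == "CRITICAL"]
    recipient = os.environ.get("ALERT_EMAIL")
    server = os.environ.get("SMTP_SERVER")
    if not (critical and recipient and server):
        return False  # nothing to send, or email is not configured
    # Assumption: the relay accepts mail without authentication.
    with smtplib.SMTP(server) as smtp:
        smtp.send_message(build_alert_email(critical, recipient, sender=recipient))
    return True
```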

Customization

Adding New Metrics

Edit setup_monitoring.py and add to collect_application_metrics():

def collect_application_metrics(self):
    # ... existing code ...
    
    # Add custom metric
    metrics['custom'] = {
        'your_metric': calculate_your_metric(),
        'another_metric': get_another_value()
    }

Modifying Alert Thresholds

Edit alert conditions in check_alerts():

# Change CPU threshold
if sys.get('cpu_percent', 0) > 90:  # Raised from 80% to 90%
    ...  # existing alert-append code stays unchanged

# Add new alert
if custom_condition():
    alerts.append({
        'type': 'WARNING',
        'component': 'custom',
        'message': 'Custom alert condition met'
    })

Dashboard Styling

Modify the CSS in generate_html_dashboard() to customize appearance.

Security Considerations

  • Monitoring runs with limited user privileges
  • No network services exposed by default
  • File permissions restrict access to monitoring data
  • Systemd security features enabled (PrivateTmp, ProtectSystem, etc.)
  • Dashboard contains no sensitive information

Performance Impact

  • Monitoring runs every 15 minutes (configurable)
  • Low CPU/memory overhead (< 1% during execution)
  • Automatic cleanup of old metric files (7-day retention)
  • Dashboard generation is lightweight (< 1MB files)
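
The 7-day retention rule can be implemented by comparing each file's modification time against a cutoff. The sketch below illustrates the approach; cleanup_old_files and its parameters are assumptions, not the script's actual function.

```python
import time
from pathlib import Path
from typing import Optional

RETENTION_DAYS = 7


def cleanup_old_files(directory: Path, retention_days: int = RETENTION_DAYS,
                      now: Optional[float] = None) -> list:
    """Delete *.json files older than retention_days; return removed names."""
    current = now if now is not None else time.time()
    cutoff = current - retention_days * 86400  # seconds per day
    removed = []
    for path in directory.glob("*.json"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```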