hvac-kia-content/monitoring/README.md
Ben Reed dc57ce80d5 Add comprehensive monitoring and alerting system
- Created SystemMonitor class for health check monitoring
- Implemented system metrics collection (CPU, memory, disk, network)
- Added application metrics monitoring (scrapers, logs, data sizes)
- Built alert system with configurable thresholds
- Developed HTML dashboard generator with real-time charts
- Added systemd services for automated monitoring (15-min intervals)
- Created responsive web dashboard with Bootstrap and Chart.js
- Implemented automatic cleanup of old metric files
- Added comprehensive documentation and troubleshooting guide

Features:
- Real-time system resource monitoring
- Scraper performance tracking and alerts
- Interactive dashboard with trend charts
- Email-ready alert notifications
- Systemd integration for production deployment
- Security hardening with minimal privileges
- Auto-refresh dashboard every 5 minutes
- 7-day metric retention with automatic cleanup

Alert conditions:
- Critical: CPU >80%, Memory >85%, Disk >90%
- Warning: Scraper inactive >24h, Log files >100MB
- Error: Monitoring failures, configuration issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 21:35:28 -03:00


# HVAC Know It All - Monitoring System
This directory contains the monitoring and alerting system for the HVAC Know It All Content Aggregation System.
## Components
### 1. Monitoring Script (`setup_monitoring.py`)
- Collects system metrics (CPU, memory, disk, network)
- Monitors application metrics (scraper status, data sizes, log files)
- Checks for alert conditions
- Generates health reports
- Cleans up old metric files
### 2. Dashboard Generator (`dashboard_generator.py`)
- Creates HTML dashboard with real-time system status
- Shows resource usage trends with charts
- Displays scraper performance metrics
- Lists recent alerts and system health
- Auto-refreshes every 5 minutes
### 3. Systemd Services
- `hvac-monitoring.service`: Runs monitoring and dashboard generation
- `hvac-monitoring.timer`: Executes monitoring every 15 minutes
## Installation
1. **Install dependencies:**
```bash
sudo apt update
sudo apt install python3-psutil
```
2. **Install systemd services:**
```bash
sudo cp systemd/hvac-monitoring.* /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable hvac-monitoring.timer
sudo systemctl start hvac-monitoring.timer
```
3. **Verify monitoring is running:**
```bash
sudo systemctl status hvac-monitoring.timer
sudo journalctl -u hvac-monitoring -f
```
## Directory Structure
```
monitoring/
├── setup_monitoring.py              # Main monitoring script
├── dashboard_generator.py           # HTML dashboard generator
├── README.md                        # This file
├── metrics/                         # JSON metric files (auto-created)
│   ├── system_YYYYMMDD_HHMMSS.json
│   ├── application_YYYYMMDD_HHMMSS.json
│   └── health_report_YYYYMMDD_HHMMSS.json
├── alerts/                          # Alert files (auto-created)
│   └── alerts_YYYYMMDD_HHMMSS.json
└── dashboard/                       # HTML dashboard files (auto-created)
    ├── index.html                   # Current dashboard
    └── dashboard_YYYYMMDD_HHMMSS.html  # Timestamped backups
```
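The timestamped file names above can be produced with a small helper. This is a minimal sketch, not the code in `setup_monitoring.py`: the function names `metric_path` and `write_metric` are hypothetical, and only the `<kind>_YYYYMMDD_HHMMSS.json` naming pattern is taken from the layout shown here.

```python
import json
from datetime import datetime
from pathlib import Path


def metric_path(base_dir, kind, when=None):
    """Return base_dir/<kind>_YYYYMMDD_HHMMSS.json for the given timestamp."""
    when = when or datetime.now()
    return Path(base_dir) / f"{kind}_{when.strftime('%Y%m%d_%H%M%S')}.json"


def write_metric(base_dir, kind, payload, when=None):
    """Create the target directory if needed and write payload as JSON."""
    path = metric_path(base_dir, kind, when)
    path.parent.mkdir(parents=True, exist_ok=True)  # mirrors "auto-created"
    path.write_text(json.dumps(payload, indent=2))
    return path
```

Because the timestamp is zero-padded and ordered from year to second, sorting these file names alphabetically also sorts them chronologically.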
## Monitoring Metrics
### System Metrics
- **CPU Usage**: Percentage utilization
- **Memory Usage**: Percentage of RAM used
- **Disk Usage**: Percentage of disk space used
- **Network I/O**: Bytes sent/received, packets
- **System Uptime**: Hours since last boot
- **Load Average**: System load (Linux only)
### Application Metrics
- **Scraper Status**: Last update time, item counts, state
- **Data Directory Sizes**: Markdown, media, archives
- **Log File Status**: Size, last modified time
- **State File Analysis**: Last IDs, update timestamps
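As an illustration of the application metrics listed above, directory sizes and log file status can be gathered with the standard library alone. This is a sketch under assumptions: the helper names `directory_size_bytes` and `log_file_status` and the returned keys are hypothetical and may not match what `setup_monitoring.py` actually records.

```python
from datetime import datetime
from pathlib import Path


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path (0 if it is missing)."""
    root = Path(path)
    if not root.is_dir():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())


def log_file_status(path):
    """Return the size in MB and the last-modified time of one log file."""
    stat = Path(path).stat()
    return {
        "size_mb": round(stat.st_size / (1024 * 1024), 2),
        "last_modified": datetime.fromtimestamp(stat.st_mtime).isoformat(),
    }
```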
## Alert Conditions
### Critical Alerts
- CPU usage > 80%
- Memory usage > 85%
- Disk usage > 90%
### Warning Alerts
- Scraper hasn't updated in > 24 hours
- Log files > 100MB
- Application errors detected
### Error Alerts
- Monitoring system failures
- File access errors
- Configuration issues
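The thresholds above could be evaluated along the following lines. This is a hedged sketch rather than the actual `check_alerts()` implementation: the function name `evaluate_alerts`, its parameters, and the `memory_percent` and `disk_percent` keys are assumptions. The thresholds come from this README, and the alert dictionary shape (`type`, `component`, `message`) follows the customization example further down.

```python
# Thresholds taken from the alert conditions listed in this README.
CRITICAL_THRESHOLDS = {"cpu_percent": 80, "memory_percent": 85, "disk_percent": 90}
SCRAPER_STALE_HOURS = 24
LOG_SIZE_WARNING_MB = 100


def evaluate_alerts(system, scraper_age_hours, log_sizes_mb):
    """Return a list of alert dicts for the given metric snapshots.

    system:            e.g. {"cpu_percent": 42.0, "memory_percent": 61.5, ...}
    scraper_age_hours: hours since each scraper last updated, keyed by name
    log_sizes_mb:      log file sizes in MB, keyed by file name
    """
    alerts = []
    for key, limit in CRITICAL_THRESHOLDS.items():
        value = system.get(key, 0)
        if value > limit:
            alerts.append({
                "type": "CRITICAL",
                "component": "system",
                "message": f"{key} is {value}% (limit {limit}%)",
            })
    for name, hours in scraper_age_hours.items():
        if hours > SCRAPER_STALE_HOURS:
            alerts.append({
                "type": "WARNING",
                "component": name,
                "message": f"No update for {hours:.0f} hours",
            })
    for name, size in log_sizes_mb.items():
        if size > LOG_SIZE_WARNING_MB:
            alerts.append({
                "type": "WARNING",
                "component": name,
                "message": f"Log file is {size:.0f} MB",
            })
    return alerts
```

All comparisons are strict, so a disk at exactly 90% does not trigger a critical alert in this sketch.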
## Dashboard Features
### Health Overview
- Overall system status (HEALTHY/WARNING/CRITICAL)
- Resource usage gauges
- Alert summary counts
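One way to derive the overall status and the alert summary counts from a list of alerts is sketched below. The function names are hypothetical, and treating `ERROR` alerts as `CRITICAL` for the overall status is an assumption of this sketch, not documented behavior of the dashboard.

```python
from collections import Counter


def overall_status(alerts):
    """Collapse a list of alert dicts into HEALTHY, WARNING, or CRITICAL."""
    types = {a.get("type") for a in alerts}
    # Assumption: monitoring errors are as serious as critical alerts.
    if "CRITICAL" in types or "ERROR" in types:
        return "CRITICAL"
    if "WARNING" in types:
        return "WARNING"
    return "HEALTHY"


def alert_counts(alerts):
    """Count alerts per type, e.g. {"WARNING": 2, "CRITICAL": 1}."""
    return dict(Counter(a.get("type") for a in alerts))
```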
### Trend Charts
- CPU, memory, disk usage over time
- Scraper item collection trends
- Historical performance data
### Real-time Status
- Current scraper status table
- Recent alert history
- Last update timestamps
### Auto-refresh
- Dashboard page reloads every 5 minutes (the underlying data is regenerated on each 15-minute monitoring run)
- Manual refresh available
- Responsive design for mobile/desktop
## Usage
### Manual Monitoring
```bash
# Run monitoring check
python3 /opt/hvac-kia-content/monitoring/setup_monitoring.py
# Generate dashboard
python3 /opt/hvac-kia-content/monitoring/dashboard_generator.py
# View dashboard
firefox file:///opt/hvac-kia-content/monitoring/dashboard/index.html
```
### Check Recent Metrics
```bash
# View latest health report
ls -la /opt/hvac-kia-content/monitoring/metrics/health_report_*.json | tail -1
# View recent alerts
ls -la /opt/hvac-kia-content/monitoring/alerts/alerts_*.json | tail -5
```
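The `ls` commands above only list the files. To read the newest report from Python, a sketch like the following may help. The function name `latest_report` is hypothetical, and it relies on the timestamped file names sorting chronologically.

```python
import json
from pathlib import Path


def latest_report(metrics_dir, prefix="health_report_"):
    """Load the newest <prefix>*.json file in metrics_dir, or None if absent."""
    files = sorted(Path(metrics_dir).glob(f"{prefix}*.json"))
    if not files:
        return None
    # YYYYMMDD_HHMMSS names sort chronologically, so the last one is newest.
    return json.loads(files[-1].read_text())
```

For example, `latest_report("/opt/hvac-kia-content/monitoring/metrics")` would return the parsed contents of the most recent health report.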
### Monitor Logs
```bash
# Follow monitoring logs
sudo journalctl -u hvac-monitoring -f
# View timer status
sudo systemctl list-timers hvac-monitoring.timer
```
## Troubleshooting
### Common Issues
1. **Permission Errors**
```bash
sudo chown -R hvac:hvac /opt/hvac-kia-content/monitoring/
sudo chmod +x /opt/hvac-kia-content/monitoring/*.py
```
2. **Missing Dependencies**
```bash
# json is part of the Python standard library; only psutil needs installing
sudo apt install python3-psutil
```
3. **Service Not Running**
```bash
sudo systemctl status hvac-monitoring.timer
sudo systemctl restart hvac-monitoring.timer
```
4. **Dashboard Not Updating**
```bash
# Check if files are being generated
ls -la /opt/hvac-kia-content/monitoring/metrics/
# Manually run dashboard generator
python3 /opt/hvac-kia-content/monitoring/dashboard_generator.py
```
### Log Analysis
```bash
# Check for errors in monitoring
sudo journalctl -u hvac-monitoring --since "1 hour ago"
# Monitor system resources
htop
# Check disk space
df -h /opt/hvac-kia-content/
```
## Integration
### Web Server Setup (Optional)
To serve the dashboard via HTTP:
```bash
# Install nginx
sudo apt install nginx
# Create site config
sudo tee /etc/nginx/sites-available/hvac-monitoring << EOF
server {
    listen 8080;
    root /opt/hvac-kia-content/monitoring/dashboard;
    index index.html;

    location / {
        try_files \$uri \$uri/ =404;
    }
}
EOF
# Enable site
sudo ln -s /etc/nginx/sites-available/hvac-monitoring /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
```
Access dashboard at: `http://your-server:8080`
### Email Alerts (Optional)
To enable email alerts for critical issues:
```bash
# Install mail utilities
sudo apt install mailutils
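# Set the recipient and SMTP server in the environment the monitoring script runs in
# (for the systemd service, use Environment= lines in hvac-monitoring.service)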
export ALERT_EMAIL="admin@yourdomain.com"
export SMTP_SERVER="smtp.yourdomain.com"
```
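How the monitoring script consumes these variables is not shown in this README. As a rough sketch only, critical alerts could be turned into an email with the standard library as follows. The function names, the default sender address, and the subject format are all hypothetical. Only the `ALERT_EMAIL` and `SMTP_SERVER` variable names come from the snippet above.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert_email(alerts, recipient, sender="hvac-monitoring@localhost"):
    """Build an email listing the critical alerts, or None if there are none."""
    critical = [a for a in alerts if a.get("type") == "CRITICAL"]
    if not critical:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"[HVAC monitoring] {len(critical)} critical alert(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{a['component']}: {a['message']}" for a in critical))
    return msg


def send_alert_email(alerts):
    """Send critical alerts using ALERT_EMAIL and SMTP_SERVER from the environment."""
    msg = build_alert_email(alerts, os.environ["ALERT_EMAIL"])
    if msg is None:
        return False
    with smtplib.SMTP(os.environ["SMTP_SERVER"]) as smtp:
        smtp.send_message(msg)
    return True
```

Building the message separately from sending it keeps the formatting logic testable without a mail server.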
## Customization
### Adding New Metrics
Edit `setup_monitoring.py` and add to `collect_application_metrics()`:
```python
def collect_application_metrics(self):
    # ... existing code ...

    # Add custom metric (the two helper functions are placeholders)
    metrics['custom'] = {
        'your_metric': calculate_your_metric(),
        'another_metric': get_another_value()
    }
```
### Modifying Alert Thresholds
Edit alert conditions in `check_alerts()`:
```python
# Change CPU threshold
if sys.get('cpu_percent', 0) > 90:  # Changed from 80% to 90%
    ...  # keep the existing alert body unchanged

# Add new alert (custom_condition() is a placeholder)
if custom_condition():
    alerts.append({
        'type': 'WARNING',
        'component': 'custom',
        'message': 'Custom alert condition met'
    })
```
### Dashboard Styling
Modify the CSS in `generate_html_dashboard()` to customize appearance.
## Security Considerations
- Monitoring runs with limited user privileges
- No network services exposed by default
- File permissions restrict access to monitoring data
- Systemd security features enabled (PrivateTmp, ProtectSystem, etc.)
- Dashboard contains no sensitive information
## Performance Impact
- Monitoring runs every 15 minutes (configurable)
- Low CPU/memory overhead (< 1% during execution)
- Automatic cleanup of old metric files (7-day retention)
- Dashboard generation is lightweight (< 1MB files)
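
The 7-day retention mentioned above can be approximated by deleting files whose modification time is older than a cutoff. This is a minimal sketch, assuming cleanup is based on file modification time. The function name `cleanup_old_files` and its parameters are hypothetical and may differ from the cleanup logic in `setup_monitoring.py`.

```python
import time
from pathlib import Path


def cleanup_old_files(directory, max_age_days=7, pattern="*.json", now=None):
    """Delete files matching pattern that are older than max_age_days.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for f in Path(directory).glob(pattern):
        if f.is_file() and f.stat().st_mtime < cutoff:
            f.unlink()
            removed.append(f.name)
    return sorted(removed)
```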