- Created SystemMonitor class for health check monitoring
- Implemented system metrics collection (CPU, memory, disk, network)
- Added application metrics monitoring (scrapers, logs, data sizes)
- Built alert system with configurable thresholds
- Developed HTML dashboard generator with real-time charts
- Added systemd services for automated monitoring (15-min intervals)
- Created responsive web dashboard with Bootstrap and Chart.js
- Implemented automatic cleanup of old metric files
- Added comprehensive documentation and troubleshooting guide

Features:

- Real-time system resource monitoring
- Scraper performance tracking and alerts
- Interactive dashboard with trend charts
- Email-ready alert notifications
- Systemd integration for production deployment
- Security hardening with minimal privileges
- Auto-refresh dashboard every 5 minutes
- 7-day metric retention with automatic cleanup

Alert conditions:

- Critical: CPU >80%, Memory >85%, Disk >90%
- Warning: Scraper inactive >24h, Log files >100MB
- Error: Monitoring failures, configuration issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
# HVAC Know It All - Monitoring System

This directory contains the monitoring and alerting system for the HVAC Know It All Content Aggregation System.

## Components

### 1. Monitoring Script (`setup_monitoring.py`)
- Collects system metrics (CPU, memory, disk, network)
- Monitors application metrics (scraper status, data sizes, log files)
- Checks for alert conditions
- Generates health reports
- Cleans up old metric files

### 2. Dashboard Generator (`dashboard_generator.py`)
- Creates HTML dashboard with real-time system status
- Shows resource usage trends with charts
- Displays scraper performance metrics
- Lists recent alerts and system health
- Auto-refreshes every 5 minutes

### 3. Systemd Services
- `hvac-monitoring.service`: Runs monitoring and dashboard generation
- `hvac-monitoring.timer`: Executes monitoring every 15 minutes

## Installation

1. **Install dependencies:**
   ```bash
   sudo apt update
   sudo apt install python3-psutil
   ```

2. **Install systemd services:**
   ```bash
   sudo cp systemd/hvac-monitoring.* /etc/systemd/system/
   sudo systemctl daemon-reload
   sudo systemctl enable hvac-monitoring.timer
   sudo systemctl start hvac-monitoring.timer
   ```

3. **Verify monitoring is running:**
   ```bash
   sudo systemctl status hvac-monitoring.timer
   sudo journalctl -u hvac-monitoring -f
   ```

## Directory Structure

```
monitoring/
├── setup_monitoring.py          # Main monitoring script
├── dashboard_generator.py       # HTML dashboard generator
├── README.md                    # This file
├── metrics/                     # JSON metric files (auto-created)
│   ├── system_YYYYMMDD_HHMMSS.json
│   ├── application_YYYYMMDD_HHMMSS.json
│   └── health_report_YYYYMMDD_HHMMSS.json
├── alerts/                      # Alert files (auto-created)
│   └── alerts_YYYYMMDD_HHMMSS.json
└── dashboard/                   # HTML dashboard files (auto-created)
    ├── index.html               # Current dashboard
    └── dashboard_YYYYMMDD_HHMMSS.html  # Timestamped backups
```
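
The timestamped file names in the tree can be produced with a small helper. This is a minimal sketch, not the code in `setup_monitoring.py`: the function name `metric_path` and its parameters are hypothetical, and only the `<kind>_YYYYMMDD_HHMMSS.json` pattern and the `metrics/` and `alerts/` directories come from the layout shown here.

```python
from datetime import datetime
from pathlib import Path


def metric_path(base_dir, kind, when=None):
    """Build the path for a timestamped JSON file (hypothetical helper).

    kind is one of "system", "application", "health_report" or "alerts";
    alert files go to alerts/, everything else to metrics/.
    """
    when = when or datetime.now()
    subdir = "alerts" if kind == "alerts" else "metrics"
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(base_dir) / subdir / f"{kind}_{stamp}.json"
```

Because the timestamp is zero-padded and ordered from year down to second, sorting the file names alphabetically also sorts them chronologically.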

## Monitoring Metrics

### System Metrics
- **CPU Usage**: Percentage utilization
- **Memory Usage**: Percentage of RAM used
- **Disk Usage**: Percentage of disk space used
- **Network I/O**: Bytes sent/received, packets
- **System Uptime**: Hours since last boot
- **Load Average**: System load (Linux only)

### Application Metrics
- **Scraper Status**: Last update time, item counts, state
- **Data Directory Sizes**: Markdown, media, archives
- **Log File Status**: Size, last modified time
- **State File Analysis**: Last IDs, update timestamps
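
Two of the application metrics, directory sizes and file age, need nothing beyond the standard library. The sketch below shows one way to compute them; the function names and the returned dictionary keys are assumptions made for illustration and may not match what `setup_monitoring.py` records.

```python
import time
from pathlib import Path


def directory_size_bytes(path):
    """Sum the sizes of all regular files below path (recursive)."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def file_status(path, now=None):
    """Report size and age for one file, e.g. a log or a scraper state file."""
    now = time.time() if now is None else now
    st = Path(path).stat()
    return {
        "size_mb": st.st_size / (1024 * 1024),
        "hours_since_modified": (now - st.st_mtime) / 3600,
    }
```

The `hours_since_modified` value is what a "scraper inactive" check would compare against its limit, and `size_mb` is what a log size check would use.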

## Alert Conditions

### Critical Alerts
- CPU usage > 80%
- Memory usage > 85%
- Disk usage > 90%

### Warning Alerts
- Scraper hasn't updated in > 24 hours
- Log files > 100MB
- Application errors detected

### Error Alerts
- Monitoring system failures
- File access errors
- Configuration issues
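
The numeric thresholds above can be expressed as a pair of small checks. This is a sketch under assumptions rather than the project's `check_alerts()`: the metric keys `memory_percent` and `disk_percent`, the function names, and the input shapes are hypothetical. The alert dictionaries use the `type`, `component` and `message` keys that appear in the customization example in this README.

```python
# Limits copied from the lists above; the dict layout is an assumption.
CRITICAL_THRESHOLDS = {"cpu_percent": 80, "memory_percent": 85, "disk_percent": 90}


def check_system_alerts(system_metrics, thresholds=CRITICAL_THRESHOLDS):
    """Return one CRITICAL alert per metric that is strictly above its limit."""
    alerts = []
    for key, limit in thresholds.items():
        value = system_metrics.get(key, 0)
        if value > limit:
            alerts.append({
                "type": "CRITICAL",
                "component": "system",
                "message": f"{key} at {value:.1f}% exceeds {limit}%",
            })
    return alerts


def check_application_alerts(hours_since_update, log_sizes_mb):
    """Return WARNING alerts for stale scrapers (> 24 h) and large logs (> 100 MB).

    Both arguments are hypothetical {name: value} mappings.
    """
    alerts = []
    for name, hours in hours_since_update.items():
        if hours > 24:
            alerts.append({"type": "WARNING", "component": name,
                           "message": f"No update for {hours:.0f} hours"})
    for name, size in log_sizes_mb.items():
        if size > 100:
            alerts.append({"type": "WARNING", "component": name,
                           "message": f"Log file is {size:.0f} MB"})
    return alerts
```

The comparisons are strict, so a disk at exactly 90% does not raise an alert, which matches the `>` used in the lists.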

## Dashboard Features

### Health Overview
- Overall system status (HEALTHY/WARNING/CRITICAL)
- Resource usage gauges
- Alert summary counts
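
One plausible way to derive the overall status and the summary counts from a list of alerts is sketched below. The function names are hypothetical, and treating `ERROR` alerts as `CRITICAL` for the overall status is an assumption; this README does not say how the dashboard ranks them.

```python
from collections import Counter


def overall_status(alerts):
    """Collapse a list of alert dicts into HEALTHY, WARNING or CRITICAL."""
    types = {a.get("type") for a in alerts}
    # Assumption: an ERROR alert is as serious as a CRITICAL one.
    if types & {"CRITICAL", "ERROR"}:
        return "CRITICAL"
    if "WARNING" in types:
        return "WARNING"
    return "HEALTHY"


def alert_counts(alerts):
    """Count alerts per type for the summary section."""
    return dict(Counter(a.get("type", "UNKNOWN") for a in alerts))
```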

### Trend Charts
- CPU, memory, disk usage over time
- Scraper item collection trends
- Historical performance data

### Real-time Status
- Current scraper status table
- Recent alert history
- Last update timestamps

### Auto-refresh
- Dashboard updates every 5 minutes
- Manual refresh available
- Responsive design for mobile/desktop
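
A static HTML page can reload itself with a `meta` refresh tag, which is one simple way to get the 5-minute update. The sketch below assumes that approach; the actual `dashboard_generator.py` may use JavaScript instead, and `render_page` is a hypothetical name.

```python
from html import escape


def render_page(title, body_html, refresh_seconds=300):
    """Wrap body_html in a page that reloads itself every refresh_seconds."""
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        '<meta charset="utf-8">\n'
        # 300 seconds matches the 5-minute interval described above.
        f'<meta http-equiv="refresh" content="{int(refresh_seconds)}">\n'
        # The viewport tag lets responsive CSS adapt to mobile screens.
        '<meta name="viewport" content="width=device-width, initial-scale=1">\n'
        f"<title>{escape(title)}</title>\n"
        "</head><body>\n"
        f"{body_html}\n"
        "</body></html>\n"
    )
```

The title is escaped because it is plain text, while `body_html` is inserted as-is and is expected to be markup the generator built itself.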

## Usage

### Manual Monitoring
```bash
# Run monitoring check
python3 /opt/hvac-kia-content/monitoring/setup_monitoring.py

# Generate dashboard
python3 /opt/hvac-kia-content/monitoring/dashboard_generator.py

# View dashboard
firefox file:///opt/hvac-kia-content/monitoring/dashboard/index.html
```

### Check Recent Metrics
```bash
# Show the most recent health report file
ls -la /opt/hvac-kia-content/monitoring/metrics/health_report_*.json | tail -1

# Show the five most recent alert files
ls -la /opt/hvac-kia-content/monitoring/alerts/alerts_*.json | tail -5
```
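
The `ls` commands only list the files. To read the newest report's contents, a short Python helper along these lines could be used. It is a sketch: `latest_report` is a hypothetical name, and it relies only on the `health_report_YYYYMMDD_HHMMSS.json` naming shown in the directory structure, not on any particular JSON schema.

```python
import json
from pathlib import Path


def latest_report(metrics_dir, prefix="health_report_"):
    """Return (file name, parsed JSON) for the newest report, or None if absent."""
    # The timestamp in the name sorts chronologically when sorted as text.
    files = sorted(Path(metrics_dir).glob(f"{prefix}*.json"))
    if not files:
        return None
    newest = files[-1]
    with newest.open(encoding="utf-8") as fh:
        return newest.name, json.load(fh)
```

Calling `latest_report("/opt/hvac-kia-content/monitoring/alerts", prefix="alerts_")` would do the same for alert files.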

### Monitor Logs
```bash
# Follow monitoring logs
sudo journalctl -u hvac-monitoring -f

# View timer status
sudo systemctl list-timers hvac-monitoring.timer
```

## Troubleshooting

### Common Issues

1. **Permission Errors**
   ```bash
   sudo chown -R hvac:hvac /opt/hvac-kia-content/monitoring/
   sudo chmod +x /opt/hvac-kia-content/monitoring/*.py
   ```

2. **Missing Dependencies**
   ```bash
   # json ships with the Python standard library; only psutil needs installing
   sudo apt install python3-psutil
   ```

3. **Service Not Running**
   ```bash
   sudo systemctl status hvac-monitoring.timer
   sudo systemctl restart hvac-monitoring.timer
   ```

4. **Dashboard Not Updating**
   ```bash
   # Check if files are being generated
   ls -la /opt/hvac-kia-content/monitoring/metrics/

   # Manually run dashboard generator
   python3 /opt/hvac-kia-content/monitoring/dashboard_generator.py
   ```

### Log Analysis
```bash
# Check for errors in monitoring
sudo journalctl -u hvac-monitoring --since "1 hour ago"

# Monitor system resources
htop

# Check disk space
df -h /opt/hvac-kia-content/
```

## Integration

### Web Server Setup (Optional)
To serve the dashboard via HTTP:

```bash
# Install nginx
sudo apt install nginx

# Create site config
sudo tee /etc/nginx/sites-available/hvac-monitoring << EOF
server {
    listen 8080;
    root /opt/hvac-kia-content/monitoring/dashboard;
    index index.html;

    location / {
        try_files \$uri \$uri/ =404;
    }
}
EOF

# Enable site
sudo ln -s /etc/nginx/sites-available/hvac-monitoring /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
```

Access dashboard at: `http://your-server:8080`

### Email Alerts (Optional)
To enable email alerts for critical issues:

```bash
# Install mail utilities
sudo apt install mailutils

# Configure in monitoring script
export ALERT_EMAIL="admin@yourdomain.com"
export SMTP_SERVER="smtp.yourdomain.com"
```
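
If the monitoring script is extended to send mail itself, the two variables above could be consumed roughly as follows. This is a sketch under assumptions, not an existing feature: the function names, the sender address, and the use of unauthenticated SMTP on port 25 are all hypothetical, and a real mail server may require TLS and a login.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert_email(alerts, recipient, sender="hvac-monitoring@localhost"):
    """Build a message listing CRITICAL alerts, or return None if there are none."""
    critical = [a for a in alerts if a.get("type") == "CRITICAL"]
    if not critical:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"[HVAC monitoring] {len(critical)} critical alert(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{a['component']}: {a['message']}" for a in critical))
    return msg


def send_alert_email(msg):
    """Send msg through the server named in SMTP_SERVER (assumed port 25, no auth)."""
    with smtplib.SMTP(os.environ["SMTP_SERVER"], 25) as smtp:
        smtp.send_message(msg)
```

A caller would pass `os.environ["ALERT_EMAIL"]` as the recipient and skip sending when `build_alert_email` returns `None`.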

## Customization

### Adding New Metrics
Edit `setup_monitoring.py` and add to `collect_application_metrics()`:

```python
def collect_application_metrics(self):
    # ... existing code ...

    # Add custom metric
    metrics['custom'] = {
        'your_metric': calculate_your_metric(),
        'another_metric': get_another_value()
    }
```

### Modifying Alert Thresholds
Edit alert conditions in `check_alerts()`:

```python
# Change CPU threshold
if sys.get('cpu_percent', 0) > 90:  # Changed from 80% to 90%
    ...  # existing alert body stays unchanged

# Add new alert
if custom_condition():
    alerts.append({
        'type': 'WARNING',
        'component': 'custom',
        'message': 'Custom alert condition met'
    })
```

### Dashboard Styling
Modify the CSS in `generate_html_dashboard()` to customize appearance.

## Security Considerations

- Monitoring runs with limited user privileges
- No network services exposed by default
- File permissions restrict access to monitoring data
- Systemd security features enabled (PrivateTmp, ProtectSystem, etc.)
- Dashboard contains no sensitive information

## Performance Impact

- Monitoring runs every 15 minutes (configurable)
- Low CPU/memory overhead (< 1% during execution)
- Automatic cleanup of old metric files (7-day retention)
- Dashboard generation is lightweight (< 1MB files)
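
The 7-day retention can be implemented by comparing each file's modification time with a cutoff. The sketch below shows that idea; the function name and parameters are hypothetical, and the real cleanup in `setup_monitoring.py` may select files differently (for example by parsing the timestamp in the name).

```python
import time
from pathlib import Path


def cleanup_old_files(directory, retention_days=7, pattern="*.json", now=None):
    """Delete files matching pattern that are older than retention_days.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # 86400 seconds per day
    removed = []
    for path in Path(directory).glob(pattern):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Running it once per monitoring cycle on `metrics/` and `alerts/` keeps both directories bounded at about a week of files.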