- Created SystemMonitor class for health check monitoring
- Implemented system metrics collection (CPU, memory, disk, network)
- Added application metrics monitoring (scrapers, logs, data sizes)
- Built alert system with configurable thresholds
- Developed HTML dashboard generator with real-time charts
- Added systemd services for automated monitoring (15-min intervals)
- Created responsive web dashboard with Bootstrap and Chart.js
- Implemented automatic cleanup of old metric files
- Added comprehensive documentation and troubleshooting guide

Features:
- Real-time system resource monitoring
- Scraper performance tracking and alerts
- Interactive dashboard with trend charts
- Email-ready alert notifications
- Systemd integration for production deployment
- Security hardening with minimal privileges
- Auto-refresh dashboard every 5 minutes
- 7-day metric retention with automatic cleanup

Alert conditions:
- Critical: CPU >80%, Memory >85%, Disk >90%
- Warning: Scraper inactive >24h, Log files >100MB
- Error: Monitoring failures, configuration issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

# HVAC Know It All - Monitoring System

This directory contains the monitoring and alerting system for the HVAC Know It All Content Aggregation System.

## Components

### 1. Monitoring Script (`setup_monitoring.py`)
- Collects system metrics (CPU, memory, disk, network)
- Monitors application metrics (scraper status, data sizes, log files)
- Checks for alert conditions
- Generates health reports
- Cleans up old metric files

### 2. Dashboard Generator (`dashboard_generator.py`)
- Creates HTML dashboard with real-time system status
- Shows resource usage trends with charts
- Displays scraper performance metrics
- Lists recent alerts and system health
- Auto-refreshes every 5 minutes

### 3. Systemd Services
- `hvac-monitoring.service`: Runs monitoring and dashboard generation
- `hvac-monitoring.timer`: Executes monitoring every 15 minutes

## Installation

1. **Install dependencies:**
   ```bash
   sudo apt update
   sudo apt install python3-psutil
   ```

2. **Install systemd services:**
   ```bash
   sudo cp systemd/hvac-monitoring.* /etc/systemd/system/
   sudo systemctl daemon-reload
   sudo systemctl enable hvac-monitoring.timer
   sudo systemctl start hvac-monitoring.timer
   ```

3. **Verify monitoring is running:**
   ```bash
   sudo systemctl status hvac-monitoring.timer
   sudo journalctl -u hvac-monitoring -f
   ```

## Directory Structure

```
monitoring/
├── setup_monitoring.py      # Main monitoring script
├── dashboard_generator.py   # HTML dashboard generator
├── README.md                # This file
├── metrics/                 # JSON metric files (auto-created)
│   ├── system_YYYYMMDD_HHMMSS.json
│   ├── application_YYYYMMDD_HHMMSS.json
│   └── health_report_YYYYMMDD_HHMMSS.json
├── alerts/                  # Alert files (auto-created)
│   └── alerts_YYYYMMDD_HHMMSS.json
└── dashboard/               # HTML dashboard files (auto-created)
    ├── index.html           # Current dashboard
    └── dashboard_YYYYMMDD_HHMMSS.html  # Timestamped backups
```

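The metric files in the tree follow a `<kind>_YYYYMMDD_HHMMSS.json` naming pattern. A minimal sketch of writing one snapshot under that scheme might look like the following; the `save_metrics` helper and its parameters are hypothetical and not taken from `setup_monitoring.py`.

```python
import json
from datetime import datetime
from pathlib import Path


def save_metrics(metrics, kind, metrics_dir, now=None):
    """Write `metrics` to <metrics_dir>/<kind>_YYYYMMDD_HHMMSS.json.

    Hypothetical helper: `kind` would be "system", "application",
    or "health_report" to match the directory layout.
    """
    now = now or datetime.now()
    metrics_dir = Path(metrics_dir)
    # Create the directory on first use, mirroring the "auto-created" note.
    metrics_dir.mkdir(parents=True, exist_ok=True)
    path = metrics_dir / f"{kind}_{now.strftime('%Y%m%d_%H%M%S')}.json"
    path.write_text(json.dumps(metrics, indent=2))
    return path
```

Because the timestamp is zero-padded, sorting the file names alphabetically also sorts them chronologically, so the newest file is always last in a sorted listing.
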
## Monitoring Metrics

### System Metrics
- **CPU Usage**: Percentage utilization
- **Memory Usage**: Percentage of RAM used
- **Disk Usage**: Percentage of disk space used
- **Network I/O**: Bytes sent/received, packets
- **System Uptime**: Hours since last boot
- **Load Average**: System load (Linux only)

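As a rough illustration of how the system metrics listed above could be gathered, the sketch below uses `psutil` where it is installed and falls back to the standard library for disk usage and load average. The function name and the dictionary keys are assumptions for illustration; the actual script may structure its output differently.

```python
import os
import shutil
import time

try:
    import psutil  # provided by the python3-psutil package
except ImportError:  # keeps this sketch importable on machines without psutil
    psutil = None


def collect_system_metrics(path="/"):
    """Return a dict of system metrics (illustrative key names)."""
    disk = shutil.disk_usage(path)
    metrics = {
        "timestamp": time.time(),
        "disk_percent": round(disk.used / disk.total * 100, 1),
    }
    if hasattr(os, "getloadavg"):  # not available on every platform
        try:
            metrics["load_average"] = os.getloadavg()
        except OSError:
            pass  # load average could not be obtained
    if psutil is not None:
        metrics["cpu_percent"] = psutil.cpu_percent(interval=0.1)
        metrics["memory_percent"] = psutil.virtual_memory().percent
        metrics["uptime_hours"] = (time.time() - psutil.boot_time()) / 3600
        net = psutil.net_io_counters()
        if net is not None:  # None on machines without network interfaces
            metrics["network"] = {
                "bytes_sent": net.bytes_sent,
                "bytes_recv": net.bytes_recv,
                "packets_sent": net.packets_sent,
                "packets_recv": net.packets_recv,
            }
    return metrics
```
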
### Application Metrics
- **Scraper Status**: Last update time, item counts, state
- **Data Directory Sizes**: Markdown, media, archives
- **Log File Status**: Size, last modified time
- **State File Analysis**: Last IDs, update timestamps

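Two of the application metrics above, directory sizes and log file status, need only the standard library. The following is a sketch under that assumption; the helper names and returned keys are hypothetical.

```python
import time
from pathlib import Path


def directory_size_bytes(path):
    """Total size in bytes of all regular files under `path`."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def log_file_status(log_path, now=None):
    """Size and age of one log file (illustrative key names)."""
    now = now if now is not None else time.time()
    st = Path(log_path).stat()
    return {
        "size_mb": st.st_size / (1024 * 1024),
        "hours_since_modified": (now - st.st_mtime) / 3600,
    }
```
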
## Alert Conditions

### Critical Alerts
- CPU usage > 80%
- Memory usage > 85%
- Disk usage > 90%

### Warning Alerts
- Scraper hasn't updated in > 24 hours
- Log files > 100MB
- Application errors detected

### Error Alerts
- Monitoring system failures
- File access errors
- Configuration issues

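The threshold rules above can be expressed as a small function. This is a simplified sketch, not the script's `check_alerts()` implementation: the input shapes (`scrapers` as hours since last update, `logs` as sizes in MB) and the constant names are assumptions, while the alert dictionary keys follow the `type`/`component`/`message` shape used under Customization.

```python
THRESHOLDS = {"cpu_percent": 80, "memory_percent": 85, "disk_percent": 90}
MAX_SCRAPER_AGE_HOURS = 24
MAX_LOG_SIZE_MB = 100


def check_alerts(system, scrapers, logs):
    """Return a list of alert dicts.

    system   -- e.g. {"cpu_percent": 42.0, "memory_percent": 61.5, ...}
    scrapers -- {scraper name: hours since its last update}
    logs     -- {log file name: size in MB}
    """
    alerts = []
    for key, limit in THRESHOLDS.items():
        value = system.get(key, 0)
        if value > limit:
            alerts.append({
                "type": "CRITICAL",
                "component": "system",
                "message": f"{key} at {value}% exceeds {limit}%",
            })
    for name, age_hours in scrapers.items():
        if age_hours > MAX_SCRAPER_AGE_HOURS:
            alerts.append({
                "type": "WARNING",
                "component": name,
                "message": f"No update in {age_hours:.0f} hours",
            })
    for name, size_mb in logs.items():
        if size_mb > MAX_LOG_SIZE_MB:
            alerts.append({
                "type": "WARNING",
                "component": name,
                "message": f"Log file is {size_mb:.0f} MB",
            })
    return alerts
```

The comparisons are strict, so a disk at exactly 90% does not raise an alert.
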
## Dashboard Features

### Health Overview
- Overall system status (HEALTHY/WARNING/CRITICAL)
- Resource usage gauges
- Alert summary counts

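One plausible way to derive the overall status from a list of alerts is to take the most severe alert type present. The sketch below assumes alert dictionaries with a `type` key and treats `ERROR` alerts as `CRITICAL`; that mapping is an assumption, not something this README specifies.

```python
def overall_status(alerts):
    """Map a list of alert dicts to HEALTHY, WARNING, or CRITICAL."""
    types = {alert.get("type") for alert in alerts}
    # Assumption: monitoring errors are as serious as critical alerts.
    if "CRITICAL" in types or "ERROR" in types:
        return "CRITICAL"
    if "WARNING" in types:
        return "WARNING"
    return "HEALTHY"
```
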
### Trend Charts
- CPU, memory, disk usage over time
- Scraper item collection trends
- Historical performance data

### Real-time Status
- Current scraper status table
- Recent alert history
- Last update timestamps

### Auto-refresh
- Dashboard updates every 5 minutes
- Manual refresh available
- Responsive design for mobile/desktop

## Usage

### Manual Monitoring
```bash
# Run monitoring check
python3 /opt/hvac-kia-content/monitoring/setup_monitoring.py

# Generate dashboard
python3 /opt/hvac-kia-content/monitoring/dashboard_generator.py

# View dashboard
firefox file:///opt/hvac-kia-content/monitoring/dashboard/index.html
```

### Check Recent Metrics
```bash
# View latest health report
ls -la /opt/hvac-kia-content/monitoring/metrics/health_report_*.json | tail -1

# View recent alerts
ls -la /opt/hvac-kia-content/monitoring/alerts/alerts_*.json | tail -5
```

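The `ls` commands above only list the files. To inspect the contents of the newest report from Python, a sketch like the following could be used; `latest_report` is a hypothetical helper, and it relies on the timestamped file names sorting chronologically.

```python
import json
from pathlib import Path


def latest_report(metrics_dir, prefix="health_report"):
    """Return (path, parsed JSON) for the newest `<prefix>_*.json`, or None."""
    files = sorted(Path(metrics_dir).glob(f"{prefix}_*.json"))
    if not files:
        return None
    newest = files[-1]  # zero-padded timestamps sort chronologically
    return newest, json.loads(newest.read_text())
```

For example, `latest_report("/opt/hvac-kia-content/monitoring/metrics")` would return the most recent health report, or `None` if monitoring has not produced one yet.
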
### Monitor Logs
```bash
# Follow monitoring logs
sudo journalctl -u hvac-monitoring -f

# View timer status
sudo systemctl list-timers hvac-monitoring.timer
```

## Troubleshooting

### Common Issues

1. **Permission Errors**
   ```bash
   sudo chown -R hvac:hvac /opt/hvac-kia-content/monitoring/
   sudo chmod +x /opt/hvac-kia-content/monitoring/*.py
   ```

2. **Missing Dependencies**
   ```bash
   # The json module ships with Python's standard library, so only psutil needs installing
   sudo apt install python3-psutil
   ```

3. **Service Not Running**
   ```bash
   sudo systemctl status hvac-monitoring.timer
   sudo systemctl restart hvac-monitoring.timer
   ```

4. **Dashboard Not Updating**
   ```bash
   # Check if files are being generated
   ls -la /opt/hvac-kia-content/monitoring/metrics/

   # Manually run dashboard generator
   python3 /opt/hvac-kia-content/monitoring/dashboard_generator.py
   ```

### Log Analysis
```bash
# Check for errors in monitoring
sudo journalctl -u hvac-monitoring --since "1 hour ago"

# Monitor system resources
htop

# Check disk space
df -h /opt/hvac-kia-content/
```

## Integration

### Web Server Setup (Optional)
To serve the dashboard via HTTP:

```bash
# Install nginx
sudo apt install nginx

# Create site config
sudo tee /etc/nginx/sites-available/hvac-monitoring << EOF
server {
    listen 8080;
    root /opt/hvac-kia-content/monitoring/dashboard;
    index index.html;

    location / {
        try_files \$uri \$uri/ =404;
    }
}
EOF

# Enable site
sudo ln -s /etc/nginx/sites-available/hvac-monitoring /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
```

Access dashboard at: `http://your-server:8080`

### Email Alerts (Optional)
To enable email alerts for critical issues:

```bash
# Install mail utilities
sudo apt install mailutils

# Configure in monitoring script
export ALERT_EMAIL="admin@yourdomain.com"
export SMTP_SERVER="smtp.yourdomain.com"
```

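How the monitoring script consumes `ALERT_EMAIL` and `SMTP_SERVER` is not shown in this README. As one possible approach, the sketch below builds a plain-text message for critical alerts with the standard library and sends it through the configured server. The function names, the default sender address, and the use of unauthenticated SMTP on the default port are all assumptions.

```python
import os
import smtplib
from email.message import EmailMessage


def build_alert_email(alerts, sender="hvac-monitoring@localhost"):
    """Build a message listing CRITICAL alerts; return None if nothing to send."""
    recipient = os.environ.get("ALERT_EMAIL")
    critical = [a for a in alerts if a.get("type") == "CRITICAL"]
    if not recipient or not critical:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"[HVAC monitoring] {len(critical)} critical alert(s)"
    msg["From"] = sender  # hypothetical sender address
    msg["To"] = recipient
    msg.set_content("\n".join(f"{a['type']}: {a['message']}" for a in critical))
    return msg


def send_alert_email(msg):
    """Send via the host in SMTP_SERVER (assumes no authentication is needed)."""
    with smtplib.SMTP(os.environ["SMTP_SERVER"]) as smtp:
        smtp.send_message(msg)
```

Keeping message construction separate from sending makes it possible to check the message body without a reachable mail server.
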
## Customization

### Adding New Metrics
Edit `setup_monitoring.py` and add to `collect_application_metrics()`:

```python
def collect_application_metrics(self):
    # ... existing code ...

    # Add custom metric (the two functions below are placeholders you supply)
    metrics['custom'] = {
        'your_metric': calculate_your_metric(),
        'another_metric': get_another_value()
    }
```

### Modifying Alert Thresholds
Edit alert conditions in `check_alerts()`:

```python
# Change CPU threshold
if sys.get('cpu_percent', 0) > 90:  # Changed from 80% to 90%
    ...  # existing alert-handling code stays unchanged

# Add new alert
if custom_condition():
    alerts.append({
        'type': 'WARNING',
        'component': 'custom',
        'message': 'Custom alert condition met'
    })
```

### Dashboard Styling
Modify the CSS in `generate_html_dashboard()` to customize appearance.

## Security Considerations

- Monitoring runs with limited user privileges
- No network services exposed by default
- File permissions restrict access to monitoring data
- Systemd security features enabled (PrivateTmp, ProtectSystem, etc.)
- Dashboard contains no sensitive information

## Performance Impact

- Monitoring runs every 15 minutes (configurable)
- Low CPU/memory overhead (< 1% during execution)
- Automatic cleanup of old metric files (7-day retention)
- Dashboard generation is lightweight (< 1MB files)
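
The 7-day retention mentioned above could be implemented along these lines. This is a sketch that judges file age by modification time; the actual script might instead parse the timestamp in the file name, and the `cleanup_old_files` name and `RETENTION_DAYS` constant are hypothetical.

```python
import time
from pathlib import Path

RETENTION_DAYS = 7


def cleanup_old_files(directory, retention_days=RETENTION_DAYS, now=None):
    """Delete *.json files older than the retention window.

    Returns the sorted names of the files that were removed.
    """
    now = now if now is not None else time.time()
    cutoff = now - retention_days * 86400  # seconds per day
    removed = []
    for path in Path(directory).glob("*.json"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```
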