# Claude.md - AI Context and Implementation Notes

## Project Overview

A content aggregation system for HVAC Know It All that pulls from 5 sources (YouTube, Instagram, WordPress, Podcast RSS, MailChimp RSS), converts everything to Markdown, and syncs the output to the NAS. It runs as a containerized application in Kubernetes.

## Key Implementation Details

### Environment Variables

All credentials are stored in a `.env` file (not committed to git):

- `WORDPRESS_URL`: https://hvacknowitall.com/
- `WORDPRESS_USERNAME`: Email for the WordPress API
- `WORDPRESS_API_KEY`: WordPress application password
- `YOUTUBE_USERNAME`: YouTube login email
- `YOUTUBE_PASSWORD`: YouTube password
- `INSTAGRAM_USERNAME`: Instagram username
- `INSTAGRAM_PASSWORD`: Instagram password
- `MAILCHIMP_RSS_URL`: MailChimp RSS feed URL
- `PODCAST_RSS_URL`: Podcast RSS feed URL
- `NAS_PATH`: /mnt/nas/hvacknowitall/
- `TIMEZONE`: America/Halifax

### Architecture Decisions

1. **Abstract Base Class Pattern**: All scrapers inherit from `BaseScraper` for a consistent interface (see the sketches after this list)
2. **State Management**: JSON files track the last fetched ID per source for incremental updates
3. **Parallel Processing**: Use `multiprocessing.Pool` for concurrent scraping
4. **Error Handling**: Exponential backoff with a maximum of 3 retries per source
5. **Logging**: Separate rotating logs per source (max 10MB, keep 5 backups)
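A minimal sketch of the base-class pattern, assuming illustrative method names (`fetch_new_items`, `to_markdown`, `run`) rather than the project's actual API:

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Shared interface that every source scraper inherits from."""

    def __init__(self, source_name: str):
        self.source_name = source_name

    @abstractmethod
    def fetch_new_items(self) -> list[dict]:
        """Return items published since the last recorded state."""

    @abstractmethod
    def to_markdown(self, item: dict) -> str:
        """Convert a single fetched item to Markdown."""

    def run(self) -> list[str]:
        """Fetch new items and convert each one to Markdown."""
        return [self.to_markdown(item) for item in self.fetch_new_items()]
```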
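The incremental-update state could be as simple as one JSON file per source. The `last_id` key and `<source>_state.json` naming below are assumptions, not the project's actual layout:

```python
import json
from pathlib import Path


def load_last_id(state_dir: Path, source: str) -> str | None:
    """Read the last fetched ID for a source, or None on the first run."""
    state_file = state_dir / f"{source}_state.json"  # assumed layout
    if not state_file.exists():
        return None
    return json.loads(state_file.read_text()).get("last_id")


def save_last_id(state_dir: Path, source: str, last_id: str) -> None:
    """Persist the newest fetched ID so the next run stays incremental."""
    state_file = state_dir / f"{source}_state.json"
    state_file.write_text(json.dumps({"last_id": last_id}))
```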
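The error-handling and logging decisions might look like the sketch below; the `logs/` directory and both helper names are assumptions:

```python
import logging
import time
from logging.handlers import RotatingFileHandler


def get_source_logger(source: str) -> logging.Logger:
    """One rotating log per source: max 10MB, 5 backups kept."""
    logger = logging.getLogger(source)
    if not logger.handlers:  # avoid duplicate handlers on repeat calls
        handler = RotatingFileHandler(
            f"logs/{source}.log",  # assumes logs/ already exists
            maxBytes=10 * 1024 * 1024,
            backupCount=5,
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def with_backoff(func, retries: int = 3, base_delay: float = 1.0):
    """Call func; on failure, retry up to `retries` times, doubling the delay."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(base_delay * 2**attempt)
```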
### Testing Approach

- TDD: write tests first, then the implementation
- Mock external APIs to avoid rate limiting during tests
- Use pytest with fixtures for common test data
- Integration tests use docker-compose for isolated testing

### Rate Limiting Strategy

#### YouTube (yt-dlp)

- Random delay of 2-5 seconds between requests
- Use a saved cookies/session file to avoid repeated logins
- Rotate user agents
- Exponential backoff on 429 errors

#### Instagram (instaloader)

- Random delay of 5-10 seconds between requests
- Limit to 100 requests per hour
- Save the session to avoid re-authentication
- Human-like browsing patterns (view the profile, then posts)

### Markdown Conversion

- Use the MarkItDown library for HTML/XML-to-Markdown conversion
- Custom templates per source for a consistent format
- Preserve media references as Markdown links
- Strip unnecessary HTML attributes

### File Management

- Atomic writes (write to a temp file, then move it into place)
- Archive previous files before creating new ones
- Use file locks to prevent concurrent access
- Validate Markdown before saving

### Kubernetes Deployment

- CronJob runs at 8 AM and 12 PM ADT
- A node selector pins the job to the control plane
- Secrets are mounted as environment variables
- PVC for persistent data and logs
- Resource limits: 1 CPU, 2GB RAM

### Development Workflow

1. Make changes in a feature branch
2. Run tests locally with `uv run pytest`
3. Build the container with `docker build -t hvac-content:latest .`
4. Test the container locally before deploying
5. Deploy to k8s with `kubectl apply -f k8s/`
6. Monitor logs with `kubectl logs -f job/<job-name>` (`kubectl logs` reads from the Jobs a CronJob spawns, not from the CronJob itself)

### Common Commands

```bash
# Run tests
uv run pytest

# Run a specific scraper
uv run python src/main.py --source wordpress

# Build container
docker build -t hvac-content:latest .

# Deploy to Kubernetes
kubectl apply -f k8s/

# Check CronJob status
kubectl get cronjobs

# View logs
kubectl logs -f job/hvac-content-xxxxx
```

### Known Issues & Workarounds

- Instagram rate limiting: increase delays if you get 429 errors
- YouTube authentication: cookies may need to be refreshed periodically
- RSS feed changes: update the feed parsing if the feed structure changes

### Performance Considerations

- Per-source scraper timeout: 5 minutes
- Total job timeout: 30 minutes
- Parallel processing limited to 5 concurrent processes
- Memory usage peaks during media downloads

### Security Notes

- Never commit credentials to git
- Use Kubernetes secrets in production
- Rotate API keys regularly
- Monitor logs for unauthorized access

## TODO

- Implement a retry queue for failed sources (a possible starting point is sketched below)
- Add Prometheus metrics for monitoring
- Create an admin dashboard for manual triggers
- Add email notifications for failures
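As a possible starting point for the retry-queue TODO item, a minimal file-backed queue; the `state/retry_queue.json` path and function names are hypothetical:

```python
import json
from pathlib import Path

QUEUE_FILE = Path("state/retry_queue.json")  # hypothetical location


def enqueue_failed(source: str) -> None:
    """Record a failed source so a later run can retry it."""
    QUEUE_FILE.parent.mkdir(parents=True, exist_ok=True)
    queue = json.loads(QUEUE_FILE.read_text()) if QUEUE_FILE.exists() else []
    if source not in queue:
        queue.append(source)
    QUEUE_FILE.write_text(json.dumps(queue))


def drain_retries(run_source) -> None:
    """Retry every queued source; keep only the ones that fail again."""
    queue = json.loads(QUEUE_FILE.read_text()) if QUEUE_FILE.exists() else []
    still_failing = []
    for source in queue:
        try:
            run_source(source)  # caller supplies the per-source runner
        except Exception:
            still_failing.append(source)
    QUEUE_FILE.write_text(json.dumps(still_failing))
```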