- Set up UV environment with all required packages
- Created comprehensive project structure
- Implemented abstract BaseScraper class with TDD
- Added documentation (project spec, implementation plan, status)
- Configured .env for credentials (not committed)
- All base scraper tests passing (9/9)
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Claude.md - AI Context and Implementation Notes
Project Overview
HVAC Know It All content aggregation system that pulls from 5 sources (YouTube, Instagram, WordPress, Podcast RSS, MailChimp RSS), converts the content to markdown, and syncs it to a NAS. It runs as a containerized application in Kubernetes.
Key Implementation Details
Environment Variables
All credentials stored in .env file (not committed to git):
- WORDPRESS_URL: https://hvacknowitall.com/
- WORDPRESS_USERNAME: Email for WordPress API
- WORDPRESS_API_KEY: WordPress application password
- YOUTUBE_USERNAME: YouTube login email
- YOUTUBE_PASSWORD: YouTube password
- INSTAGRAM_USERNAME: Instagram username
- INSTAGRAM_PASSWORD: Instagram password
- MAILCHIMP_RSS_URL: MailChimp RSS feed URL
- PODCAST_RSS_URL: Podcast RSS feed URL
- NAS_PATH: /mnt/nas/hvacknowitall/
- TIMEZONE: America/Halifax
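A minimal sketch of how these variables might be read and checked at startup. The function name `load_settings` and the fail-fast behaviour are assumptions for illustration, not the project's actual loader.

```python
import os

# Variable names are taken from the list above.
REQUIRED_VARS = [
    "WORDPRESS_URL", "WORDPRESS_USERNAME", "WORDPRESS_API_KEY",
    "YOUTUBE_USERNAME", "YOUTUBE_PASSWORD",
    "INSTAGRAM_USERNAME", "INSTAGRAM_PASSWORD",
    "MAILCHIMP_RSS_URL", "PODCAST_RSS_URL",
    "NAS_PATH", "TIMEZONE",
]


def load_settings(environ=os.environ):
    """Return the required settings as a dict, or fail listing what is missing.

    Hypothetical helper: `environ` is injectable so tests can pass a plain dict.
    """
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Failing early with the full list of missing names tends to be easier to debug in a CronJob than a scraper failing halfway through a run.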
Architecture Decisions
- Abstract Base Class Pattern: All scrapers inherit from `BaseScraper` for consistent interface
- State Management: JSON files track last fetched IDs for incremental updates
- Parallel Processing: Use multiprocessing.Pool for concurrent scraping
- Error Handling: Exponential backoff with max 3 retries per source
- Logging: Separate rotating logs per source (max 10MB, keep 5 backups)
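The base class, state file, and retry decisions above could fit together roughly as follows. This is a sketch under assumptions, not the project's actual `BaseScraper`: the method names (`fetch`, `load_state`, `save_state`, `run`), the state file naming, and the 1-second starting delay are all hypothetical.

```python
import json
import time
from abc import ABC, abstractmethod
from pathlib import Path


class BaseScraper(ABC):
    """Common interface for source scrapers (illustrative sketch only)."""

    MAX_RETRIES = 3  # "max 3 retries per source"

    def __init__(self, name, state_dir, base_delay=1.0):
        self.name = name
        self.state_file = Path(state_dir) / f"{name}_state.json"  # assumed naming
        self.base_delay = base_delay  # assumed starting delay in seconds

    @abstractmethod
    def fetch(self, last_id):
        """Return (items, newest_id) for everything newer than last_id."""

    def load_state(self):
        """Read the last fetched ID from the JSON state file, if present."""
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {"last_id": None}

    def save_state(self, state):
        self.state_file.write_text(json.dumps(state))

    def run(self, sleep=time.sleep):
        """Fetch new items, retrying with delays of base_delay * 1, 2, 4."""
        state = self.load_state()
        for attempt in range(self.MAX_RETRIES + 1):  # first try + 3 retries
            try:
                items, newest_id = self.fetch(state["last_id"])
                break
            except Exception:  # a real scraper would catch narrower errors
                if attempt == self.MAX_RETRIES:
                    raise
                sleep(self.base_delay * 2 ** attempt)
        if newest_id is not None:
            self.save_state({"last_id": newest_id})
        return items
```

A concrete scraper then only implements `fetch`; state handling and retries stay in one place.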
Testing Approach
- TDD: Write tests first, then implementation
- Mock external APIs to avoid rate limiting during tests
- Use pytest with fixtures for common test data
- Integration tests use docker-compose for isolated testing
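As an illustration of mocking an external API, the sketch below tests a hypothetical function against a `unittest.mock.Mock` client, so no network request is made. The names `latest_post_titles` and `get_posts` are invented for the example; in the project, the mock client would likely come from a pytest fixture.

```python
from unittest.mock import Mock


def latest_post_titles(client, limit=5):
    """Hypothetical code under test: ask a client for posts, return their titles."""
    posts = client.get_posts(per_page=limit)
    return [post["title"] for post in posts]


def test_latest_post_titles_uses_mocked_client():
    # The mock stands in for the real API client, so no rate limit is touched.
    client = Mock()
    client.get_posts.return_value = [
        {"title": "Heat pump basics"},
        {"title": "Duct sizing"},
    ]

    assert latest_post_titles(client, limit=2) == ["Heat pump basics", "Duct sizing"]
    client.get_posts.assert_called_once_with(per_page=2)
```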
Rate Limiting Strategy
YouTube (yt-dlp)
- Random delay 2-5 seconds between requests
- Use cookies/session to avoid repeated login
- Rotate user agents
- Exponential backoff on 429 errors
Instagram (instaloader)
- Random delay 5-10 seconds between requests
- Limit to 100 requests per hour
- Save session to avoid re-authentication
- Human-like browsing patterns (view profile, then posts)
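The delay ranges and the hourly cap above can be expressed as one small throttle. This is a sketch only: the `Throttle` class is hypothetical, it does not use yt-dlp or instaloader, and it leaves out cookies, sessions, and user-agent rotation.

```python
import random
import time
from collections import deque


class Throttle:
    """Random delay between requests, plus an optional cap per rolling hour."""

    def __init__(self, min_delay, max_delay, max_per_hour=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.max_per_hour = max_per_hour
        self.clock = clock      # injectable for tests
        self.sleep = sleep      # injectable for tests
        self.sent = deque()     # timestamps of recent requests

    def wait(self):
        """Block until the next request is allowed, then record it."""
        now = self.clock()
        # Forget requests that are more than an hour old.
        while self.sent and now - self.sent[0] >= 3600:
            self.sent.popleft()
        if self.max_per_hour is not None and len(self.sent) >= self.max_per_hour:
            # Sleep until the oldest recorded request leaves the one-hour window.
            self.sleep(3600 - (now - self.sent[0]))
            self.sent.popleft()
        self.sleep(random.uniform(self.min_delay, self.max_delay))
        self.sent.append(self.clock())


# Values from the notes above: 2-5 s for YouTube, 5-10 s and 100/hour for Instagram.
youtube_throttle = Throttle(2, 5)
instagram_throttle = Throttle(5, 10, max_per_hour=100)
```

Each scraper would call `wait()` before every request; backoff on 429 responses would sit on top of this.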
Markdown Conversion
- Use MarkItDown library for HTML/XML to Markdown
- Custom templates per source for consistent format
- Preserve media references as markdown links
- Strip unnecessary HTML attributes
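A rough sketch of per-source templates that keep media references as markdown links. The template layouts and the `render` and `media_link` helpers are hypothetical, since the real formats are not given in these notes, and the HTML-to-markdown step handled by MarkItDown is not shown.

```python
from string import Template

# Hypothetical layouts; the project's real per-source templates may differ.
TEMPLATES = {
    "wordpress": Template("# $title\n\nSource: WordPress\nLink: $url\n\n$body\n"),
    "podcast": Template("# $title\n\nSource: Podcast RSS\nLink: $url\n\n$body\n"),
}


def media_link(alt_text, url):
    """Keep a media reference as a markdown image link."""
    return f"![{alt_text}]({url})"


def render(source, item):
    """Fill the template for `source` with the item's title, url, body, and media."""
    body = item["body"]
    media = "\n".join(media_link(m["alt"], m["url"]) for m in item.get("media", []))
    if media:
        body = f"{body}\n\n{media}"
    return TEMPLATES[source].substitute(title=item["title"], url=item["url"], body=body)
```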
File Management
- Atomic writes (write to temp, then move)
- Archive previous files before creating new ones
- Use file locks to prevent concurrent access
- Validate markdown before saving
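The archive-then-atomic-write sequence might look like the sketch below. The function name `save_markdown` and the non-empty check standing in for validation are assumptions, and file locking is left out to keep the example short.

```python
import os
import shutil
import tempfile
from pathlib import Path


def save_markdown(path, text, archive_dir):
    """Archive the previous file, then replace it atomically with `text`."""
    path = Path(path)
    if not text.strip():
        # Placeholder for real markdown validation.
        raise ValueError("refusing to save empty markdown")

    # Keep a copy of the previous version before it is replaced.
    if path.exists():
        archive_dir = Path(archive_dir)
        archive_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, archive_dir / path.name)

    # Write to a temp file in the same directory so the final rename stays
    # on one filesystem, which is what makes os.replace atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Readers of the target path see either the old file or the new one, never a partially written file.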
Kubernetes Deployment
- CronJob runs at 8AM and 12PM ADT
- Node selector ensures the job runs on the control plane
- Secrets mounted as environment variables
- PVC for persistent data and logs
- Resource limits: 1 CPU, 2GB RAM
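If the cluster evaluates CronJob schedules in UTC, the 8AM and 12PM ADT run times need converting. The arithmetic is sketched below, assuming a fixed UTC-3 offset for ADT; `utc_cron_for` is a hypothetical helper, not part of the project.

```python
from datetime import datetime, timedelta, timezone

ADT = timezone(timedelta(hours=-3), "ADT")  # Atlantic Daylight Time, UTC-3


def utc_cron_for(local_hours, tz=ADT):
    """Build a cron expression (minute 0) for the given local hours, in UTC."""
    utc_hours = sorted(
        # The date is arbitrary; only the hour conversion matters here.
        datetime(2024, 7, 1, hour, tzinfo=tz).astimezone(timezone.utc).hour
        for hour in local_hours
    )
    return f"0 {','.join(str(h) for h in utc_hours)} * * *"


print(utc_cron_for([8, 12]))  # prints "0 11,15 * * *"
```

A fixed offset does not follow the switch to AST (UTC-4) in winter. On clusters that support the CronJob `spec.timeZone` field, setting it to America/Halifax and scheduling in local hours is an alternative.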
Development Workflow
- Make changes in feature branch
- Run tests locally with `uv run pytest`
- Build container with `docker build -t hvac-content:latest .`
- Test container locally before deploying
- Deploy to k8s with `kubectl apply -f k8s/`
- Monitor logs with `kubectl logs -f cronjob/hvac-content`
Common Commands
# Run tests
uv run pytest
# Run specific scraper
uv run python src/main.py --source wordpress
# Build container
docker build -t hvac-content:latest .
# Deploy to Kubernetes
kubectl apply -f k8s/
# Check CronJob status
kubectl get cronjobs
# View logs
kubectl logs -f job/hvac-content-xxxxx
Known Issues & Workarounds
- Instagram rate limiting: Increase delays if getting 429 errors
- YouTube authentication: May need to update cookies periodically
- RSS feed changes: Update feed parsing if structure changes
Performance Considerations
- Each source scraper timeout: 5 minutes
- Total job timeout: 30 minutes
- Parallel processing limited to 5 concurrent processes
- Memory usage peaks during media download
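One way the process cap and the two timeouts could be combined is sketched below. The function `run_sources` and its `pool_factory` parameter are hypothetical. The per-source timeout here is counted from when the result is awaited, not from when the task starts, so it only approximates the limits above.

```python
import multiprocessing as mp
import time

SOURCE_TIMEOUT = 5 * 60   # seconds allowed per source
JOB_TIMEOUT = 30 * 60     # seconds allowed for the whole job
MAX_PROCESSES = 5         # concurrent scraper processes


def run_sources(scrape, sources, pool_factory=mp.Pool):
    """Run `scrape(source)` for each source in parallel; None marks a timeout.

    With the default mp.Pool, `scrape` must be a picklable top-level function.
    An exception raised inside `scrape` propagates to the caller.
    """
    if not sources:
        return {}
    deadline = time.monotonic() + JOB_TIMEOUT
    results = {}
    with pool_factory(processes=min(MAX_PROCESSES, len(sources))) as pool:
        pending = {s: pool.apply_async(scrape, (s,)) for s in sources}
        for source, handle in pending.items():
            # Never wait past the overall job deadline.
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[source] = handle.get(timeout=min(SOURCE_TIMEOUT, remaining))
            except mp.TimeoutError:
                results[source] = None
        # Leaving the with-block terminates the pool, stopping any stragglers.
    return results
```

Sources that come back as `None` are natural candidates for the retry queue listed under TODO.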
Security Notes
- Never commit credentials to git
- Use Kubernetes secrets for production
- Rotate API keys regularly
- Monitor for unauthorized access in logs
TODO
- Implement retry queue for failed sources
- Add Prometheus metrics for monitoring
- Create admin dashboard for manual triggers
- Add email notifications for failures