# Claude.md - AI Context and Implementation Notes
## Project Overview
HVAC Know It All content aggregation system that pulls content from 5 sources (YouTube, Instagram, WordPress, Podcast RSS, MailChimp RSS), converts it to Markdown, and syncs it to the NAS. It runs as a containerized application in Kubernetes.
## Key Implementation Details
### Environment Variables
All credentials are stored in a `.env` file (not committed to git); see the loading sketch after this list:
- `WORDPRESS_URL`: https://hvacknowitall.com/
- `WORDPRESS_USERNAME`: Email for WordPress API
- `WORDPRESS_API_KEY`: WordPress application password
- `YOUTUBE_USERNAME`: YouTube login email
- `YOUTUBE_PASSWORD`: YouTube password
- `INSTAGRAM_USERNAME`: Instagram username
- `INSTAGRAM_PASSWORD`: Instagram password
- `MAILCHIMP_RSS_URL`: MailChimp RSS feed URL
- `PODCAST_RSS_URL`: Podcast RSS feed URL
- `NAS_PATH`: /mnt/nas/hvacknowitall/
- `TIMEZONE`: America/Halifax
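A minimal sketch of loading these variables (assumes `python-dotenv`; the `Settings` dataclass and `load_settings()` helper are illustrative and cover only a subset of the variables above):
```python
# Illustrative config loader; not the project's actual config module.
import os
from dataclasses import dataclass

from dotenv import load_dotenv


@dataclass
class Settings:
    wordpress_url: str
    wordpress_username: str
    wordpress_api_key: str
    nas_path: str
    timezone: str


def load_settings() -> Settings:
    load_dotenv()  # reads .env from the working directory if present
    return Settings(
        wordpress_url=os.environ["WORDPRESS_URL"],
        wordpress_username=os.environ["WORDPRESS_USERNAME"],
        wordpress_api_key=os.environ["WORDPRESS_API_KEY"],
        nas_path=os.environ.get("NAS_PATH", "/mnt/nas/hvacknowitall/"),
        timezone=os.environ.get("TIMEZONE", "America/Halifax"),
    )
```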
### Architecture Decisions
1. **Abstract Base Class Pattern**: All scrapers inherit from `BaseScraper` for a consistent interface (sketched after this list)
2. **State Management**: JSON files track last fetched IDs for incremental updates
3. **Parallel Processing**: Use `multiprocessing.Pool` for concurrent scraping
4. **Error Handling**: Exponential backoff with max 3 retries per source
5. **Logging**: Separate rotating logs per source (max 10MB, keep 5 backups)
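A rough sketch of how decisions 1, 2, and 4 fit together (class layout and state-file format are illustrative, not the actual `src/` implementation):
```python
# Illustrative base-class, state-file, and retry sketch; names are hypothetical.
import json
import time
from abc import ABC, abstractmethod
from pathlib import Path


class BaseScraper(ABC):
    MAX_RETRIES = 3

    def __init__(self, name: str, state_dir: Path):
        self.name = name
        self.state_file = state_dir / f"{name}_state.json"

    @abstractmethod
    def fetch_items(self, last_id: str | None) -> list[dict]:
        """Fetch items newer than last_id from the source."""

    def load_state(self) -> dict:
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {}

    def save_state(self, state: dict) -> None:
        self.state_file.write_text(json.dumps(state, indent=2))

    def run(self) -> list[dict]:
        state = self.load_state()
        for attempt in range(self.MAX_RETRIES):
            try:
                items = self.fetch_items(state.get("last_id"))
                if items:
                    state["last_id"] = items[0]["id"]  # assumes newest-first ordering
                    self.save_state(state)
                return items
            except Exception:
                if attempt == self.MAX_RETRIES - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff: 1 s, then 2 s
        return []
```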
### Testing Approach
- TDD: Write tests first, then implementation
- Mock external APIs to avoid rate limiting during tests
- Use pytest with fixtures for common test data (example below)
- Integration tests use docker-compose for isolated testing
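A small example of the fixture-plus-mock style (hypothetical test; the `WordPressScraper` class, its `_request_posts` method, and the module path are assumptions):
```python
# tests/test_wordpress.py -- illustrative only; real tests and names may differ.
from unittest.mock import patch

import pytest


@pytest.fixture
def sample_posts():
    return [{"id": "101", "title": "Heat Pump Basics", "content": "<p>...</p>"}]


def test_fetch_returns_new_posts(sample_posts):
    from src.scrapers.wordpress import WordPressScraper  # hypothetical module path

    scraper = WordPressScraper()
    # Mock the HTTP layer so no real WordPress API call (and no rate limit) occurs.
    with patch.object(scraper, "_request_posts", return_value=sample_posts):
        items = scraper.fetch_items(last_id=None)

    assert items == sample_posts
```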
### Rate Limiting Strategy
#### YouTube (yt-dlp)
- Random delay 2-5 seconds between requests
- Use cookies/session to avoid repeated login
- Rotate user agents
- Exponential backoff on 429 errors
#### Instagram (instaloader)
- Random delay 5-10 seconds between requests
- Limit to 100 requests per hour
- Save session to avoid re-authentication
- Human-like browsing patterns (view profile, then posts)
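Both scrapers can share the same throttling helpers; a minimal sketch (function names are illustrative, and real code would catch each library's specific rate-limit exception):
```python
# Illustrative throttling helpers shared by the YouTube and Instagram scrapers.
import random
import time


def polite_sleep(low: float, high: float) -> None:
    """Sleep a random interval (2-5 s for YouTube, 5-10 s for Instagram)."""
    time.sleep(random.uniform(low, high))


def with_backoff(func, max_retries: int = 3, base_delay: float = 5.0):
    """Call func(); on a 429-style error, back off exponentially before retrying."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:  # in practice, catch the library's rate-limit error
            if "429" not in str(exc) or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 5 s, then 10 s
```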
### Markdown Conversion
- Use the MarkItDown library for HTML/XML-to-Markdown conversion (usage sketched below)
- Custom templates per source for consistent format
- Preserve media references as markdown links
- Strip unnecessary HTML attributes
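Conversion itself is a thin wrapper around MarkItDown; a minimal sketch, assuming MarkItDown's `convert()` / `text_content` API, with a simplified stand-in for the per-source template:
```python
# Minimal conversion sketch; the front-matter template here is illustrative,
# not the project's actual per-source template.
from markitdown import MarkItDown

converter = MarkItDown()


def html_to_markdown(html_path: str, title: str, source: str) -> str:
    result = converter.convert(html_path)  # convert the HTML file to Markdown
    return f"# {title}\n\nSource: {source}\n\n{result.text_content}\n"
```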
### File Management
- Atomic writes (write to a temp file, then move into place; sketched below)
- Archive previous files before creating new ones
- Use file locks to prevent concurrent access
- Validate markdown before saving
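The atomic-write pattern in a few lines (the helper name is illustrative):
```python
# Atomic write sketch: write to a temp file in the same directory, then
# os.replace() so readers never see a partially written file.
import os
import tempfile
from pathlib import Path


def atomic_write(target: Path, content: str) -> None:
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
        os.replace(tmp_path, target)  # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```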
### Kubernetes Deployment
- CronJob runs at 8 AM and 12 PM ADT (manifest sketch after this list)
- Node selector ensures the job runs on the control-plane node
- Secrets mounted as environment variables
- PVC for persistent data and logs
- Resource limits: 1 CPU, 2GB RAM
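A skeleton of what the CronJob manifest might look like (illustrative only; the actual manifests in `k8s/` are authoritative, and the secret, PVC, and image names here are assumptions):
```yaml
# Illustrative CronJob skeleton; not the actual k8s/ manifest.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hvac-content
spec:
  schedule: "0 8,12 * * *"      # 8 AM and 12 PM
  timeZone: "America/Halifax"   # stable in Kubernetes 1.27+
  jobTemplate:
    spec:
      template:
        spec:
          # Scheduling onto a control-plane node may also require a toleration
          # for node-role.kubernetes.io/control-plane.
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          containers:
            - name: hvac-content
              image: hvac-content:latest
              envFrom:
                - secretRef:
                    name: hvac-content-secrets   # assumed secret name
              resources:
                limits:
                  cpu: "1"
                  memory: 2Gi
              volumeMounts:
                - name: data
                  mountPath: /mnt/nas/hvacknowitall
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: hvac-content-pvc      # assumed PVC name
          restartPolicy: Never
```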
### Development Workflow
1. Make changes in feature branch
2. Run tests locally with `uv run pytest`
3. Build container with `docker build -t hvac-content:latest .`
4. Test container locally before deploying
5. Deploy to k8s with `kubectl apply -f k8s/`
6. Monitor logs with `kubectl logs -f cronjob/hvac-content`
### Common Commands
```bash
# Run tests
uv run pytest
# Run specific scraper
uv run python src/main.py --source wordpress
# Build container
docker build -t hvac-content:latest .
# Deploy to Kubernetes
kubectl apply -f k8s/
# Check CronJob status
kubectl get cronjobs
# View logs
kubectl logs -f job/hvac-content-xxxxx
```
### Known Issues & Workarounds
- Instagram rate limiting: Increase delays if getting 429 errors
- YouTube authentication: May need to update cookies periodically
- RSS feed changes: Update feed parsing if structure changes
### Performance Considerations
- Each source scraper timeout: 5 minutes
- Total job timeout: 30 minutes
- Parallel processing is limited to 5 concurrent processes (see the sketch after this list)
- Memory usage peaks during media downloads
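One way to express the concurrency and timeout limits (a sketch; `run_scraper` and the source dispatch are placeholders):
```python
# Illustrative orchestration sketch: at most 5 worker processes, 5-minute
# per-source timeout, 30-minute overall budget.
import multiprocessing as mp
import time

SOURCES = ["youtube", "instagram", "wordpress", "podcast", "mailchimp"]
PER_SOURCE_TIMEOUT = 300   # seconds
TOTAL_TIMEOUT = 1800       # seconds


def run_scraper(source: str) -> str:
    ...  # placeholder: dispatch to the real scraper for this source
    return source


def main() -> None:
    start = time.monotonic()
    with mp.Pool(processes=5) as pool:
        pending = {s: pool.apply_async(run_scraper, (s,)) for s in SOURCES}
        for source, result in pending.items():
            remaining = TOTAL_TIMEOUT - (time.monotonic() - start)
            try:
                result.get(timeout=min(PER_SOURCE_TIMEOUT, max(remaining, 0)))
            except mp.TimeoutError:
                print(f"{source} timed out")
        # Leaving the with-block terminates any workers that are still running.


if __name__ == "__main__":
    main()
```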
### Security Notes
- Never commit credentials to git
- Use Kubernetes secrets for production
- Rotate API keys regularly
- Monitor for unauthorized access in logs
## TODO
- Implement retry queue for failed sources
- Add Prometheus metrics for monitoring
- Create admin dashboard for manual triggers
- Add email notifications for failures