# Claude.md - AI Context and Implementation Notes

## Project Overview

HVAC Know It All content aggregation system that pulls from 5 sources (YouTube, Instagram, WordPress, Podcast RSS, MailChimp RSS), converts everything to Markdown, and syncs it to a NAS. Runs as a containerized application in Kubernetes.

## Key Implementation Details

### Environment Variables

All credentials are stored in a `.env` file (not committed to git):

- `WORDPRESS_URL`: https://hvacknowitall.com/
- `WORDPRESS_USERNAME`: Email for WordPress API
- `WORDPRESS_API_KEY`: WordPress application password
- `YOUTUBE_USERNAME`: YouTube login email
- `YOUTUBE_PASSWORD`: YouTube password
- `INSTAGRAM_USERNAME`: Instagram username
- `INSTAGRAM_PASSWORD`: Instagram password
- `MAILCHIMP_RSS_URL`: MailChimp RSS feed URL
- `PODCAST_RSS_URL`: Podcast RSS feed URL
- `NAS_PATH`: /mnt/nas/hvacknowitall/
- `TIMEZONE`: America/Halifax

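The scrapers read these values at startup. In practice `python-dotenv` is the usual loader, but the parsing it performs can be sketched in a few lines of stdlib Python (the helper name below is illustrative, not the project's actual code):

```python
from pathlib import Path

def load_env(path: str = ".env") -> dict[str, str]:
    """Minimal .env parser: KEY=VALUE lines; blank lines and '#' comments skipped."""
    env: dict[str, str] = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes, as dotenv-style files allow.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env
```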
### Architecture Decisions

1. **Abstract Base Class Pattern**: All scrapers inherit from `BaseScraper` for a consistent interface
2. **State Management**: JSON files track last-fetched IDs for incremental updates
3. **Parallel Processing**: Use `multiprocessing.Pool` for concurrent scraping
4. **Error Handling**: Exponential backoff with a maximum of 3 retries per source
5. **Logging**: Separate rotating logs per source (max 10 MB, keep 5 backups)

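Decisions 1 and 2 fit together naturally; a minimal sketch of what the `BaseScraper` contract plus JSON state tracking could look like (method and field names here are illustrative, not the repo's actual signatures):

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional

class BaseScraper(ABC):
    """Shared contract for all source scrapers (illustrative sketch)."""

    def __init__(self, name: str, state_dir: str = "state"):
        self.name = name
        self.state_file = Path(state_dir) / f"{name}_state.json"

    @abstractmethod
    def fetch(self, since_id: Optional[str]) -> list:
        """Fetch items newer than since_id, newest first."""

    @abstractmethod
    def to_markdown(self, item: dict) -> str:
        """Render one item as markdown."""

    def load_state(self) -> dict:
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {"last_id": None}

    def save_state(self, state: dict) -> None:
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        self.state_file.write_text(json.dumps(state))

    def run(self) -> list:
        """Incremental update: fetch only what is newer than the saved ID."""
        state = self.load_state()
        items = self.fetch(state.get("last_id"))
        docs = [self.to_markdown(item) for item in items]
        if items:
            state["last_id"] = items[0]["id"]  # newest-first convention
            self.save_state(state)
        return docs
```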
### Testing Approach

- TDD: write tests first, then the implementation
- Mock external APIs to avoid rate limiting during tests
- Use pytest with fixtures for common test data
- Integration tests use docker-compose for isolated testing

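A sketch of the mock-based style (file path, helper names, and the endpoint URL are invented for illustration; the real tests live under `tests/`). Injecting the HTTP callable keeps the test offline, so no rate limit is ever hit:

```python
# tests/test_wordpress_scraper.py -- illustrative; run with `uv run pytest`.
# Shared payloads like CANNED_POSTS would normally be pytest fixtures in conftest.py.
from unittest.mock import MagicMock

CANNED_POSTS = [{"id": 101, "title": {"rendered": "Heat Pump Basics"}}]

def fetch_titles(get) -> list:
    """Toy stand-in for the WordPress fetch step; `get` is injected so tests
    can pass a mock instead of a real HTTP client."""
    resp = get("https://hvacknowitall.com/wp-json/wp/v2/posts")
    resp.raise_for_status()
    return [post["title"]["rendered"] for post in resp.json()]

def test_fetch_titles_parses_payload():
    fake_get = MagicMock(return_value=MagicMock(json=lambda: CANNED_POSTS))
    assert fetch_titles(fake_get) == ["Heat Pump Basics"]
```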
### Rate Limiting Strategy

#### YouTube (yt-dlp)

- Random delay of 2-5 seconds between requests
- Use cookies/session to avoid repeated logins
- Rotate user agents
- Exponential backoff on 429 errors

#### Instagram (instaloader)

- Random delay of 5-10 seconds between requests
- Limit to 100 requests per hour
- Save the session to avoid re-authentication
- Human-like browsing patterns (view profile, then posts)

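Both policies reduce to the same two primitives: a random inter-request delay plus exponential backoff on 429s. A sketch with illustrative names (`sleep` is injectable so tests don't actually wait):

```python
import random
import time

def polite_get(fetch, max_retries: int = 3, base_delay: float = 2.0,
               max_delay: float = 5.0, sleep=time.sleep):
    """Random pacing delay, then exponential backoff while the source returns 429.

    `fetch` is any callable returning (status_code, payload).
    """
    # YouTube-style pacing: 2-5 s; Instagram would use base/max of 5/10.
    sleep(random.uniform(base_delay, max_delay))
    for attempt in range(max_retries):
        status, payload = fetch()
        if status != 429:
            return payload
        sleep((2 ** attempt) * base_delay)  # 2 s, 4 s, 8 s, ...
    raise RuntimeError("still rate limited after retries")
```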
### Markdown Conversion

- Use the MarkItDown library for HTML/XML-to-Markdown conversion
- Custom templates per source for a consistent format
- Preserve media references as markdown links
- Strip unnecessary HTML attributes

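MarkItDown handles the HTML-to-Markdown step itself; the per-source templates could be as simple as `string.Template` front-matter wrappers. The field names and template shapes below are assumptions, not the project's actual schema:

```python
from string import Template

# Hypothetical per-source front-matter templates; extend with one entry per source.
TEMPLATES = {
    "wordpress": Template(
        "---\nsource: wordpress\ntitle: $title\ndate: $date\n---\n\n$body\n"
    ),
    "podcast": Template(
        "---\nsource: podcast\ntitle: $title\ndate: $date\n---\n\n$body\n"
    ),
}

def render(source: str, item: dict) -> str:
    """Fill the source's template; MarkItDown (or similar) supplies `body`."""
    return TEMPLATES[source].substitute(item)
```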
### File Management

- Atomic writes (write to a temp file, then move)
- Archive previous files before creating new ones
- Use file locks to prevent concurrent access
- Validate markdown before saving

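The atomic-write rule can be implemented with `tempfile` plus `os.replace`, which is atomic on POSIX as long as the temp file and the destination are on the same filesystem; readers (for example, the NAS sync) then never see a half-written file. A sketch:

```python
import os
import tempfile
from pathlib import Path

def atomic_write(path: str, content: str) -> None:
    """Write `content` to a temp file beside `path`, then atomically replace."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Temp file must live in the same directory so os.replace() stays atomic.
    fd, tmp = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)
    finally:
        if os.path.exists(tmp):  # only true if the replace never happened
            os.unlink(tmp)
```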
### Kubernetes Deployment

- CronJob runs at 8 AM and 12 PM ADT
- Node selector pins the job to the control-plane node
- Secrets mounted as environment variables
- PVC for persistent data and logs
- Resource limits: 1 CPU, 2 GB RAM

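A sketch of what the manifest in `k8s/` might look like, wiring the points above together. Resource names, the control-plane label/toleration (which vary by cluster), and the `timeZone` field (which requires Kubernetes >= 1.27) are all assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hvac-content            # name is illustrative
spec:
  schedule: "0 8,12 * * *"
  timeZone: "America/Halifax"   # native cron time zones need Kubernetes >= 1.27
  jobTemplate:
    spec:
      activeDeadlineSeconds: 1800   # 30-minute total job timeout
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:            # control-plane nodes are usually tainted
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          containers:
            - name: scraper
              image: hvac-content:latest
              envFrom:
                - secretRef:
                    name: hvac-content-secrets   # holds the .env credentials
              resources:
                limits:
                  cpu: "1"
                  memory: 2Gi
              volumeMounts:
                - name: data
                  mountPath: /data
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: hvac-content-data
          restartPolicy: OnFailure
```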
### Development Workflow

1. Make changes in a feature branch
2. Run tests locally with `uv run pytest`
3. Build the container with `docker build -t hvac-content:latest .`
4. Test the container locally before deploying
5. Deploy to k8s with `kubectl apply -f k8s/`
6. Monitor logs with `kubectl logs -f job/hvac-content-xxxxx` (`kubectl logs` cannot target a CronJob directly; find the latest job with `kubectl get jobs`)

### Common Commands

```bash
# Run tests
uv run pytest

# Run a specific scraper
uv run python src/main.py --source wordpress

# Build container
docker build -t hvac-content:latest .

# Deploy to Kubernetes
kubectl apply -f k8s/

# Check CronJob status
kubectl get cronjobs

# View logs
kubectl logs -f job/hvac-content-xxxxx
```

### Known Issues & Workarounds

- Instagram rate limiting: increase delays if getting 429 errors
- YouTube authentication: cookies may need to be refreshed periodically
- RSS feed changes: update feed parsing if the structure changes

### Performance Considerations

- Each source scraper timeout: 5 minutes
- Total job timeout: 30 minutes
- Parallel processing limited to 5 concurrent processes
- Memory usage peaks during media download

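The 5-process cap and per-source timeout can be enforced with `multiprocessing.Pool` and `AsyncResult.get(timeout=...)`. A sketch (the pool factory is injectable so tests can run on threads instead of spawning processes; `run_source` is a placeholder for the real entry point in `src/main.py`):

```python
import multiprocessing as mp
from multiprocessing.pool import ThreadPool

SOURCES = ["youtube", "instagram", "wordpress", "podcast", "mailchimp"]

def run_source(name: str):
    """Placeholder for a single scraper run."""
    return name, "ok"

def run_all(pool_factory=mp.Pool, timeout_per_source: float = 300.0) -> dict:
    """Scrape all five sources with at most 5 concurrent workers, marking any
    source that exceeds its 5-minute budget as timed out."""
    results = {}
    with pool_factory(processes=5) as pool:
        pending = {s: pool.apply_async(run_source, (s,)) for s in SOURCES}
        for source, handle in pending.items():
            try:
                results[source] = handle.get(timeout=timeout_per_source)[1]
            except mp.TimeoutError:
                results[source] = "timeout"
    return results
```

Note the waits are sequential, so in the worst case the total wall time can exceed a single source's budget; the CronJob-level 30-minute deadline is the backstop.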
### Security Notes

- Never commit credentials to git
- Use Kubernetes secrets in production
- Rotate API keys regularly
- Monitor logs for unauthorized access

## TODO

- Implement a retry queue for failed sources
- Add Prometheus metrics for monitoring
- Create an admin dashboard for manual triggers
- Add email notifications for failures