# Claude.md - AI Context and Implementation Notes

## Project Overview

HVAC Know It All content aggregation system that pulls from 5 sources (YouTube, Instagram, WordPress, Podcast RSS, MailChimp RSS), converts everything to Markdown, and syncs it to a NAS. Runs as a containerized application in Kubernetes.

## Key Implementation Details

### Environment Variables

All credentials are stored in a `.env` file (not committed to git):

- `WORDPRESS_URL`: https://hvacknowitall.com/
- `WORDPRESS_USERNAME`: Email for WordPress API
- `WORDPRESS_API_KEY`: WordPress application password
- `YOUTUBE_USERNAME`: YouTube login email
- `YOUTUBE_PASSWORD`: YouTube password
- `INSTAGRAM_USERNAME`: Instagram username
- `INSTAGRAM_PASSWORD`: Instagram password
- `MAILCHIMP_RSS_URL`: MailChimp RSS feed URL
- `PODCAST_RSS_URL`: Podcast RSS feed URL
- `NAS_PATH`: /mnt/nas/hvacknowitall/
- `TIMEZONE`: America/Halifax

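The scrapers read these values at startup. In practice `python-dotenv` is the usual loader, but the parsing it performs can be sketched in a few lines of stdlib Python (the helper name below is illustrative, not the project's actual code):

```python
from pathlib import Path

def load_env(path: str = ".env") -> dict[str, str]:
    """Minimal .env parser: KEY=VALUE lines; blank lines and '#' comments skipped."""
    env: dict[str, str] = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes, as dotenv-style files allow.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env
```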
### Architecture Decisions

1. **Abstract Base Class Pattern**: All scrapers inherit from `BaseScraper` for a consistent interface
2. **State Management**: JSON files track last-fetched IDs for incremental updates
3. **Parallel Processing**: Use `multiprocessing.Pool` for concurrent scraping
4. **Error Handling**: Exponential backoff with a maximum of 3 retries per source
5. **Logging**: Separate rotating logs per source (max 10 MB, keep 5 backups)

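Decisions 1 and 2 fit together naturally; a minimal sketch of what the `BaseScraper` contract plus JSON state tracking could look like (method and field names here are illustrative, not the repo's actual signatures):

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional

class BaseScraper(ABC):
    """Shared contract for all source scrapers (illustrative sketch)."""

    def __init__(self, name: str, state_dir: str = "state"):
        self.name = name
        self.state_file = Path(state_dir) / f"{name}_state.json"

    @abstractmethod
    def fetch(self, since_id: Optional[str]) -> list:
        """Fetch items newer than since_id, newest first."""

    @abstractmethod
    def to_markdown(self, item: dict) -> str:
        """Render one item as markdown."""

    def load_state(self) -> dict:
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {"last_id": None}

    def save_state(self, state: dict) -> None:
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        self.state_file.write_text(json.dumps(state))

    def run(self) -> list:
        """Incremental update: fetch only what is newer than the saved ID."""
        state = self.load_state()
        items = self.fetch(state.get("last_id"))
        docs = [self.to_markdown(item) for item in items]
        if items:
            state["last_id"] = items[0]["id"]  # newest-first convention
            self.save_state(state)
        return docs
```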
### Testing Approach

- TDD: write tests first, then the implementation
- Mock external APIs to avoid rate limiting during tests
- Use pytest with fixtures for common test data
- Integration tests use docker-compose for isolated testing

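A sketch of the mock-based style (file path, helper names, and the endpoint URL are invented for illustration; the real tests live under `tests/`). Injecting the HTTP callable keeps the test offline, so no rate limit is ever hit:

```python
# tests/test_wordpress_scraper.py -- illustrative; run with `uv run pytest`.
# Shared payloads like CANNED_POSTS would normally be pytest fixtures in conftest.py.
from unittest.mock import MagicMock

CANNED_POSTS = [{"id": 101, "title": {"rendered": "Heat Pump Basics"}}]

def fetch_titles(get) -> list:
    """Toy stand-in for the WordPress fetch step; `get` is injected so tests
    can pass a mock instead of a real HTTP client."""
    resp = get("https://hvacknowitall.com/wp-json/wp/v2/posts")
    resp.raise_for_status()
    return [post["title"]["rendered"] for post in resp.json()]

def test_fetch_titles_parses_payload():
    fake_get = MagicMock(return_value=MagicMock(json=lambda: CANNED_POSTS))
    assert fetch_titles(fake_get) == ["Heat Pump Basics"]
```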
### Rate Limiting Strategy

#### YouTube (yt-dlp)

- Random delay of 2-5 seconds between requests
- Use cookies/session to avoid repeated logins
- Rotate user agents
- Exponential backoff on 429 errors

#### Instagram (instaloader)

- Random delay of 5-10 seconds between requests
- Limit to 100 requests per hour
- Save the session to avoid re-authentication
- Human-like browsing patterns (view profile, then posts)

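Both policies reduce to the same two primitives: a random inter-request delay plus exponential backoff on 429s. A sketch with illustrative names (`sleep` is injectable so tests don't actually wait):

```python
import random
import time

def polite_get(fetch, max_retries: int = 3, base_delay: float = 2.0,
               max_delay: float = 5.0, sleep=time.sleep):
    """Random pacing delay, then exponential backoff while the source returns 429.

    `fetch` is any callable returning (status_code, payload).
    """
    # YouTube-style pacing: 2-5 s; Instagram would use base/max of 5/10.
    sleep(random.uniform(base_delay, max_delay))
    for attempt in range(max_retries):
        status, payload = fetch()
        if status != 429:
            return payload
        sleep((2 ** attempt) * base_delay)  # 2 s, 4 s, 8 s, ...
    raise RuntimeError("still rate limited after retries")
```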
### Markdown Conversion

- Use the MarkItDown library for HTML/XML-to-Markdown conversion
- Custom templates per source for a consistent format
- Preserve media references as markdown links
- Strip unnecessary HTML attributes

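MarkItDown handles the HTML-to-Markdown step itself; the per-source templates could be as simple as `string.Template` front-matter wrappers. The field names and template shapes below are assumptions, not the project's actual schema:

```python
from string import Template

# Hypothetical per-source front-matter templates; extend with one entry per source.
TEMPLATES = {
    "wordpress": Template(
        "---\nsource: wordpress\ntitle: $title\ndate: $date\n---\n\n$body\n"
    ),
    "podcast": Template(
        "---\nsource: podcast\ntitle: $title\ndate: $date\n---\n\n$body\n"
    ),
}

def render(source: str, item: dict) -> str:
    """Fill the source's template; MarkItDown (or similar) supplies `body`."""
    return TEMPLATES[source].substitute(item)
```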
### File Management

- Atomic writes (write to a temp file, then move)
- Archive previous files before creating new ones
- Use file locks to prevent concurrent access
- Validate markdown before saving

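The atomic-write rule can be implemented with `tempfile` plus `os.replace`, which is atomic on POSIX as long as the temp file and the destination are on the same filesystem; readers (for example, the NAS sync) then never see a half-written file. A sketch:

```python
import os
import tempfile
from pathlib import Path

def atomic_write(path: str, content: str) -> None:
    """Write `content` to a temp file beside `path`, then atomically replace."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Temp file must live in the same directory so os.replace() stays atomic.
    fd, tmp = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)
    finally:
        if os.path.exists(tmp):  # only true if the replace never happened
            os.unlink(tmp)
```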
### Kubernetes Deployment

- CronJob runs at 8 AM and 12 PM ADT
- Node selector pins the job to the control-plane node
- Secrets mounted as environment variables
- PVC for persistent data and logs
- Resource limits: 1 CPU, 2 GB RAM

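A sketch of what the manifest in `k8s/` might look like, wiring the points above together. Resource names, the control-plane label/toleration (which vary by cluster), and the `timeZone` field (which requires Kubernetes >= 1.27) are all assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hvac-content            # name is illustrative
spec:
  schedule: "0 8,12 * * *"
  timeZone: "America/Halifax"   # native cron time zones need Kubernetes >= 1.27
  jobTemplate:
    spec:
      activeDeadlineSeconds: 1800   # 30-minute total job timeout
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:            # control-plane nodes are usually tainted
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          containers:
            - name: scraper
              image: hvac-content:latest
              envFrom:
                - secretRef:
                    name: hvac-content-secrets   # holds the .env credentials
              resources:
                limits:
                  cpu: "1"
                  memory: 2Gi
              volumeMounts:
                - name: data
                  mountPath: /data
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: hvac-content-data
          restartPolicy: OnFailure
```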
### Development Workflow

1. Make changes in a feature branch
2. Run tests locally with `uv run pytest`
3. Build the container with `docker build -t hvac-content:latest .`
4. Test the container locally before deploying
5. Deploy to k8s with `kubectl apply -f k8s/`
6. Monitor logs with `kubectl logs -f job/hvac-content-xxxxx` (`kubectl logs` cannot target a CronJob directly; find the latest job with `kubectl get jobs`)

### Common Commands

```bash
# Run tests
uv run pytest

# Run a specific scraper
uv run python src/main.py --source wordpress

# Build container
docker build -t hvac-content:latest .

# Deploy to Kubernetes
kubectl apply -f k8s/

# Check CronJob status
kubectl get cronjobs

# View logs
kubectl logs -f job/hvac-content-xxxxx
```

### Known Issues & Workarounds

- Instagram rate limiting: increase delays if getting 429 errors
- YouTube authentication: cookies may need to be refreshed periodically
- RSS feed changes: update feed parsing if the structure changes

### Performance Considerations

- Each source scraper timeout: 5 minutes
- Total job timeout: 30 minutes
- Parallel processing limited to 5 concurrent processes
- Memory usage peaks during media download

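The 5-process cap and per-source timeout can be enforced with `multiprocessing.Pool` and `AsyncResult.get(timeout=...)`. A sketch (the pool factory is injectable so tests can run on threads instead of spawning processes; `run_source` is a placeholder for the real entry point in `src/main.py`):

```python
import multiprocessing as mp
from multiprocessing.pool import ThreadPool

SOURCES = ["youtube", "instagram", "wordpress", "podcast", "mailchimp"]

def run_source(name: str):
    """Placeholder for a single scraper run."""
    return name, "ok"

def run_all(pool_factory=mp.Pool, timeout_per_source: float = 300.0) -> dict:
    """Scrape all five sources with at most 5 concurrent workers, marking any
    source that exceeds its 5-minute budget as timed out."""
    results = {}
    with pool_factory(processes=5) as pool:
        pending = {s: pool.apply_async(run_source, (s,)) for s in SOURCES}
        for source, handle in pending.items():
            try:
                results[source] = handle.get(timeout=timeout_per_source)[1]
            except mp.TimeoutError:
                results[source] = "timeout"
    return results
```

Note the waits are sequential, so in the worst case the total wall time can exceed a single source's budget; the CronJob-level 30-minute deadline is the backstop.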
### Security Notes

- Never commit credentials to git
- Use Kubernetes secrets in production
- Rotate API keys regularly
- Monitor logs for unauthorized access

## TODO

- Implement a retry queue for failed sources
- Add Prometheus metrics for monitoring
- Create an admin dashboard for manual triggers
- Add email notifications for failures