hvac-kia-content/claude.md
Ben Reed f9a8e719a7 Initial commit: Project foundation with base scraper and tests
- Set up UV environment with all required packages
- Created comprehensive project structure
- Implemented abstract BaseScraper class with TDD
- Added documentation (project spec, implementation plan, status)
- Configured .env for credentials (not committed)
- All base scraper tests passing (9/9)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 12:15:17 -03:00

Claude.md - AI Context and Implementation Notes

Project Overview

Content aggregation system for HVAC Know It All. It pulls from 5 sources (YouTube, Instagram, WordPress, Podcast RSS, MailChimp RSS), converts the content to Markdown, and syncs the result to a NAS. It runs as a containerized application in Kubernetes.

Key Implementation Details

Environment Variables

All credentials and related settings are stored in a .env file (not committed to git):

  • WORDPRESS_URL: https://hvacknowitall.com/
  • WORDPRESS_USERNAME: Email for WordPress API
  • WORDPRESS_API_KEY: WordPress application password
  • YOUTUBE_USERNAME: YouTube login email
  • YOUTUBE_PASSWORD: YouTube password
  • INSTAGRAM_USERNAME: Instagram username
  • INSTAGRAM_PASSWORD: Instagram password
  • MAILCHIMP_RSS_URL: MailChimp RSS feed URL
  • PODCAST_RSS_URL: Podcast RSS feed URL
  • NAS_PATH: /mnt/nas/hvacknowitall/
  • TIMEZONE: America/Halifax
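
A minimal sketch of how the variables listed above could be validated at startup. It assumes they are already present in the process environment (for example exported from .env by a loader, or injected by Kubernetes) and that all of them are required; `load_settings` and `REQUIRED_VARS` are hypothetical names, not part of the project.

```python
import os

# Names taken from the list above; treating every one as required is an assumption.
REQUIRED_VARS = (
    "WORDPRESS_URL", "WORDPRESS_USERNAME", "WORDPRESS_API_KEY",
    "YOUTUBE_USERNAME", "YOUTUBE_PASSWORD",
    "INSTAGRAM_USERNAME", "INSTAGRAM_PASSWORD",
    "MAILCHIMP_RSS_URL", "PODCAST_RSS_URL",
    "NAS_PATH", "TIMEZONE",
)


def load_settings(env=None):
    """Return the required settings as a dict, or fail fast listing what is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing at startup with the full list of missing names is usually easier to debug than a scraper failing halfway through a run.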

Architecture Decisions

  1. Abstract Base Class Pattern: All scrapers inherit from BaseScraper for a consistent interface
  2. State Management: JSON files track the last fetched IDs so each run only pulls new items
  3. Parallel Processing: Use multiprocessing.Pool for concurrent scraping
  4. Error Handling: Exponential backoff with a maximum of 3 retries per source
  5. Logging: Separate rotating logs per source (max 10 MB per file, keep 5 backups)
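
The decisions above could fit together roughly as follows. This is a sketch under assumptions, not the project's actual BaseScraper: the method names, the state file name, the `"id"` key, and the reading of "3 retries" as three attempts after the first are all hypothetical.

```python
import json
import logging
import time
from abc import ABC, abstractmethod
from logging.handlers import RotatingFileHandler
from pathlib import Path


def make_source_logger(name, log_dir):
    """One rotating log per source: 10 MB per file, 5 backups."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"scraper.{name}")
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        logger.addHandler(RotatingFileHandler(
            log_dir / f"{name}.log", maxBytes=10 * 1024 * 1024, backupCount=5))
        logger.setLevel(logging.INFO)
    return logger


class BaseScraper(ABC):
    """Hypothetical shape of the shared scraper interface."""

    max_retries = 3  # assumption: retries after the first attempt

    def __init__(self, name, state_dir, base_delay=1.0):
        self.name = name
        self.state_path = Path(state_dir) / f"{name}_state.json"
        self.base_delay = base_delay

    @abstractmethod
    def fetch(self, last_id):
        """Return new items (dicts with an "id" key), oldest first."""

    def load_state(self):
        if self.state_path.exists():
            return json.loads(self.state_path.read_text(encoding="utf-8"))
        return {}

    def save_state(self, state):
        self.state_path.parent.mkdir(parents=True, exist_ok=True)
        self.state_path.write_text(json.dumps(state), encoding="utf-8")

    def run(self):
        state = self.load_state()
        for attempt in range(self.max_retries + 1):
            try:
                items = self.fetch(state.get("last_id"))
                break
            except Exception:
                if attempt == self.max_retries:
                    raise
                # Exponential backoff: base, 2x base, 4x base
                time.sleep(self.base_delay * 2 ** attempt)
        if items:
            state["last_id"] = items[-1]["id"]
            self.save_state(state)
        return items
```

A concrete scraper then only implements `fetch`; state handling and retries stay in one place.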

Testing Approach

  • TDD: Write tests first, then implementation
  • Mock external APIs to avoid rate limiting during tests
  • Use pytest with fixtures for common test data
  • Integration tests use docker-compose for isolated testing
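
As an illustration of mocking an external API, the test below replaces the HTTP session with a `unittest.mock.Mock`, so no request leaves the machine. `fetch_post_titles` is a hypothetical helper written for this example, and the post title is made-up test data; only the `/wp-json/wp/v2/posts` route and the `title.rendered` field come from the WordPress REST API.

```python
from unittest.mock import Mock


def fetch_post_titles(session, base_url):
    """Hypothetical helper: list post titles via the WordPress REST API."""
    response = session.get(base_url.rstrip("/") + "/wp-json/wp/v2/posts")
    response.raise_for_status()
    return [post["title"]["rendered"] for post in response.json()]


def test_fetch_post_titles_uses_mocked_session():
    # The mock stands in for a requests-style session object.
    session = Mock()
    session.get.return_value.json.return_value = [
        {"title": {"rendered": "Sample post"}},
    ]

    titles = fetch_post_titles(session, "https://hvacknowitall.com/")

    assert titles == ["Sample post"]
    session.get.assert_called_once_with(
        "https://hvacknowitall.com/wp-json/wp/v2/posts")
```

pytest collects the `test_` function as-is; the mocked session could also be moved into a fixture if several tests need it.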

Rate Limiting Strategy

YouTube (yt-dlp)

  • Random delay 2-5 seconds between requests
  • Use cookies/session to avoid repeated login
  • Rotate user agents
  • Exponential backoff on 429 errors
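
One way these settings might be expressed is as an options dict for `yt_dlp.YoutubeDL`. The keys `cookiefile`, `sleep_interval`, `max_sleep_interval`, and `http_headers` are yt-dlp options; the helper names, the user-agent list, and the backoff base and cap are assumptions. Note that yt-dlp applies `sleep_interval` before each download rather than before every HTTP request.

```python
import random


def build_ydl_options(cookie_file, user_agents, rng=random):
    """Options dict intended to be passed to yt_dlp.YoutubeDL(...)."""
    return {
        "cookiefile": cookie_file,       # reuse a saved session instead of logging in
        "sleep_interval": 2,             # lower bound of the random delay, seconds
        "max_sleep_interval": 5,         # upper bound of the random delay, seconds
        "http_headers": {"User-Agent": rng.choice(user_agents)},
    }


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Seconds to wait after the Nth consecutive 429 (attempt starts at 0)."""
    return min(cap, base * 2 ** attempt)
```

With the defaults, three consecutive 429 responses would wait 2, 4, and 8 seconds before the next attempt.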

Instagram (instaloader)

  • Random delay 5-10 seconds between requests
  • Limit to 100 requests per hour
  • Save session to avoid re-authentication
  • Human-like browsing patterns (view profile, then posts)
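
The hourly cap and the random delay can be combined in a small sliding-window limiter. This sketch is independent of instaloader; the class name and the injectable `clock` and `sleep` parameters (there to make it testable) are assumptions.

```python
import random
import time
from collections import deque


class HourlyRateLimiter:
    """Allow at most max_requests per window, plus a random pause per request."""

    def __init__(self, max_requests=100, window=3600.0, delay_range=(5.0, 10.0),
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window
        self.delay_range = delay_range
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # times of requests inside the current window

    def _purge(self):
        now = self._clock()
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()

    def wait(self):
        """Block until the next request is allowed, then record it."""
        self._purge()
        while len(self._stamps) >= self.max_requests:
            # Sleep until the oldest recorded request leaves the window.
            self._sleep(self.window - (self._clock() - self._stamps[0]))
            self._purge()
        self._sleep(random.uniform(*self.delay_range))
        self._stamps.append(self._clock())
```

Calling `limiter.wait()` before every Instagram request keeps both limits in one place.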

Markdown Conversion

  • Use MarkItDown library for HTML/XML to Markdown
  • Custom templates per source for consistent format
  • Preserve media references as markdown links
  • Strip unnecessary HTML attributes
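
Per-source templates could be as simple as `string.Template` objects keyed by source name. The field names (`title`, `url`, `published`, `body`, `audio_url`) and the layouts are hypothetical; the body is assumed to have been converted to Markdown already.

```python
from string import Template

# Hypothetical layouts; the real templates may differ.
TEMPLATES = {
    "wordpress": Template(
        "# $title\n\n"
        "Source: WordPress\n"
        "URL: $url\n"
        "Published: $published\n\n"
        "$body\n"
    ),
    "podcast": Template(
        "# $title\n\n"
        "Source: Podcast\n"
        "Published: $published\n"
        "Audio: [Listen]($audio_url)\n\n"
        "$body\n"
    ),
}


def render_item(source, item):
    """Render one item; raises KeyError for an unknown source or missing field."""
    return TEMPLATES[source].substitute(item)
```

Using `substitute` rather than `safe_substitute` means a missing field raises an error instead of silently leaving a `$placeholder` in the output.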

File Management

  • Atomic writes (write to temp, then move)
  • Archive previous files before creating new ones
  • Use file locks to prevent concurrent access
  • Validate markdown before saving
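
The atomic-write step can be sketched with the standard library alone. `atomic_write_text` is a hypothetical name; the temp file is created in the destination directory so that `os.replace` stays on one filesystem. Whether the rename is truly atomic on a network-mounted NAS depends on the mount, so treat that as an assumption.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text):
    """Write text to a temp file next to path, then move it into place."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push data to disk before the rename
        os.replace(tmp_name, path)      # readers see the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)         # do not leave stray temp files behind
        raise
```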

Kubernetes Deployment

  • CronJob runs at 8:00 AM and 12:00 PM ADT (UTC-3)
  • Node selector ensures the job runs on a control-plane node
  • Secrets mounted as environment variables
  • PVC for persistent data and logs
  • Resource limits: 1 CPU, 2 GB RAM
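
If the CronJob schedule is evaluated in UTC (an assumption; without `.spec.timeZone` Kubernetes uses the controller manager's local time zone), the two run times need converting. The worked example below uses the fixed ADT offset of UTC-3; `utc_cron_for_local_hours` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

ADT = timezone(timedelta(hours=-3), "ADT")  # Atlantic Daylight Time


def utc_cron_for_local_hours(hours, tz=ADT):
    """Build a daily cron expression in UTC from local on-the-hour run times."""
    utc_hours = sorted(
        # The date is arbitrary because the offset is fixed.
        datetime(2025, 8, 18, hour, 0, tzinfo=tz).astimezone(timezone.utc).hour
        for hour in hours
    )
    return "0 " + ",".join(str(hour) for hour in utc_hours) + " * * *"


print(utc_cron_for_local_hours([8, 12]))  # prints: 0 11,15 * * *
```

During standard time Halifax is on AST (UTC-4), so a fixed UTC schedule drifts by an hour; setting `.spec.timeZone` to America/Halifax on the CronJob avoids that where the cluster supports it.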

Development Workflow

  1. Make changes in feature branch
  2. Run tests locally with uv run pytest
  3. Build container with docker build -t hvac-content:latest .
  4. Test container locally before deploying
  5. Deploy to k8s with kubectl apply -f k8s/
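  6. Monitor logs with kubectl logs -f job/hvac-content-xxxxx (kubectl logs reads from a Job or Pod, not from the CronJob itself; find the Job name with kubectl get jobs)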

Common Commands

# Run tests
uv run pytest

# Run specific scraper
uv run python src/main.py --source wordpress

# Build container
docker build -t hvac-content:latest .

# Deploy to Kubernetes
kubectl apply -f k8s/

# Check CronJob status
kubectl get cronjobs

# View logs
kubectl logs -f job/hvac-content-xxxxx

Known Issues & Workarounds

  • Instagram rate limiting: Increase the delays if requests start returning HTTP 429
  • YouTube authentication: Cookies expire and may need to be refreshed periodically
  • RSS feed changes: Update the feed parsing if the feed structure changes

Performance Considerations

  • Each source scraper timeout: 5 minutes
  • Total job timeout: 30 minutes
  • Parallel processing limited to 5 concurrent processes
  • Memory usage peaks during media download
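
The process cap and per-source timeout could be applied as in this sketch. `run_sources` and the `pool_factory` parameter are hypothetical; `worker` must be a picklable top-level function when the default `multiprocessing.Pool` is used.

```python
import multiprocessing

SOURCE_TIMEOUT_SECONDS = 5 * 60  # per-source limit from the notes above
MAX_PROCESSES = 5


def run_sources(sources, worker, timeout=SOURCE_TIMEOUT_SECONDS,
                pool_factory=multiprocessing.Pool):
    """Run worker(name) for each source in parallel and collect a status per source."""
    results = {}
    processes = max(1, min(MAX_PROCESSES, len(sources)))
    with pool_factory(processes=processes) as pool:
        pending = {name: pool.apply_async(worker, (name,)) for name in sources}
        for name, async_result in pending.items():
            try:
                # The timeout counts from when we start waiting on this result,
                # so sources waited on later may effectively get more time.
                results[name] = ("ok", async_result.get(timeout=timeout))
            except multiprocessing.TimeoutError:
                results[name] = ("timeout", None)
            except Exception as exc:
                results[name] = ("error", repr(exc))
    # Leaving the with-block terminates the pool, which stops timed-out workers.
    return results
```

Because one failed source only produces an `"error"` or `"timeout"` entry, the remaining sources still complete within the same run.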

Security Notes

  • Never commit credentials to git
  • Use Kubernetes secrets for production
  • Rotate API keys regularly
  • Monitor for unauthorized access in logs

TODO

  • Implement retry queue for failed sources
  • Add Prometheus metrics for monitoring
  • Create admin dashboard for manual triggers
  • Add email notifications for failures