# Claude.md - AI Context and Implementation Notes

## Project Overview

HVAC Know It All content aggregation system that pulls from 6 sources (YouTube, Instagram, WordPress, Podcast RSS, MailChimp RSS, TikTok), converts the content to markdown, and syncs it to a NAS. It runs as systemd services because the TikTok scraper needs a GUI (headed browser).

## Key Implementation Details

### Environment Variables

All credentials are stored in a `.env` file (not committed to git):

- `WORDPRESS_URL`: https://hvacknowitall.com/
- `WORDPRESS_USERNAME`: Email for WordPress API
- `WORDPRESS_API_KEY`: WordPress application password
- `YOUTUBE_USERNAME`: YouTube login email
- `YOUTUBE_PASSWORD`: YouTube password
- `INSTAGRAM_USERNAME`: Instagram username
- `INSTAGRAM_PASSWORD`: Instagram password
- `TIKTOK_USERNAME`: TikTok username
- `TIKTOK_PASSWORD`: TikTok password
- `MAILCHIMP_RSS_URL`: MailChimp RSS feed URL
- `PODCAST_RSS_URL`: https://feeds.libsyn.com/568690/spotify (Corrected URL)
- `NAS_PATH`: /mnt/nas/hvacknowitall/
- `TIMEZONE`: America/Halifax

### Architecture Decisions

1. **Abstract Base Class Pattern**: All scrapers inherit from `BaseScraper` for a consistent interface
2. **State Management**: JSON files track last fetched IDs for incremental updates
3. **Parallel Processing**: ThreadPoolExecutor for 5 of the 6 sources (TikTok runs separately because it needs a GUI)
4. **Error Handling**: Comprehensive exception handling with graceful degradation
5. **Logging**: Centralized logging with detailed error tracking
6. **TikTok Stealth**: Scrapling + Camoufox with a headed browser to avoid bot detection

### Testing Approach

- TDD: Write tests first, then implementation
- Mock external APIs to avoid rate limiting during tests
- Use pytest with fixtures for common test data
- Integration tests use docker-compose for isolated testing

### Rate Limiting Strategy

#### YouTube (yt-dlp)

- Random delay of 2-5 seconds between requests
- Use cookies/session to avoid repeated login
- Rotate user agents
- Exponential backoff on 429 errors

#### Instagram (instaloader)

- Random delay of 5-10 seconds between requests
- Aggressive rate limiting with session persistence
- Save session to avoid re-authentication
- Human-like browsing patterns (view profile, then posts)

#### TikTok (Scrapling + Camoufox)

- Headed browser with DISPLAY=:0 environment
- Stealth configuration with geolocation spoofing
- OS randomization and WebGL support
- Human-like interaction patterns

### Markdown Conversion

- Use the markdownify library for HTML/XML to Markdown (replaced MarkItDown due to Unicode issues)
- Custom templates per source for a consistent format
- Preserve media references as markdown links
- Strip unnecessary HTML attributes

### File Management

- Atomic writes (write to temp, then move)
- Archive previous files before creating new ones
- Use file locks to prevent concurrent access
- Validate markdown before saving

### systemd Deployment (Production)

- Services run at 8AM and 12PM ADT via systemd timers
- Deployed on control plane as user 'ben' for GUI access
- Environment variables loaded from the .env file
- Local file system for data and logs
- TikTok requires DISPLAY=:0 for the headed browser

### Kubernetes Deployment (Not Viable)

- ❌ Blocked by TikTok GUI requirements
- Cannot containerize headed browser applications
- DISPLAY forwarding adds complexity and unreliability
- systemd chosen as alternative deployment strategy

### Development Workflow

1. Make changes in a feature branch
2. Run tests locally with `uv run pytest`
3. Test individual scrapers with real data
4. Deploy to production with `sudo ./install.sh`
5. Monitor systemd services
6. Check logs with journalctl

### Common Commands

```bash
# Run tests
uv run pytest

# Test specific scrapers
python -m src.orchestrator --sources wordpress instagram

# Install to production
sudo ./install.sh

# Check service status
systemctl status hvac-scraper-*.timer

# Manual execution
sudo systemctl start hvac-scraper.service

# View logs
journalctl -u hvac-scraper.service -f

# Test TikTok with display
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python test_tiktok_advanced.py
```

### Known Issues & Workarounds

- Instagram rate limiting: Session persistence helps avoid re-authentication
- TikTok bot detection: Scrapling with stealth features overcomes detection
- Unicode conversion: markdownify replaced MarkItDown for better handling
- Podcast RSS: Corrected to use Libsyn URL (https://feeds.libsyn.com/568690/spotify)

### Performance Considerations

- TikTok requires a headed browser (cannot be containerized)
- Parallel processing: 5 of the 6 sources run concurrently, TikTok runs sequentially
- Memory usage: Minimal footprint with efficient processing
- Network efficiency: Incremental updates reduce API calls

### Security Notes

- Never commit credentials to git
- Use the .env file for local credential storage
- Rotate API keys regularly
- Monitor logs for unauthorized access
- TikTok stealth mode reduces the risk of account detection

## Current Status: COMPLETE ✅

- All 6 sources implemented and tested
- Production deployment ready via systemd
- Comprehensive testing completed with real data
- Documentation and deployment scripts finalized
- System ready for automated operation
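
## Illustrative Sketch: Incremental State with Atomic Writes

The notes above mention JSON state files that track last fetched IDs, and atomic writes (write to temp, then move). The following is a minimal sketch of how those two ideas could fit together, not the project's actual implementation. The function names (`load_state`, `save_state_atomic`, `filter_new_items`), the `last_id` key, and the assumption that items arrive newest-first are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty dict if no state file exists yet."""
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def save_state_atomic(path: Path, state: dict) -> None:
    """Write state to a temp file in the same directory, then move it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so the final move stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace swaps the file in one step, so readers never see a partial write.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the partial temp file if anything failed before the move.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def filter_new_items(items: list[dict], state: dict) -> list[dict]:
    """Keep items that precede the last fetched ID (assumes newest-first order)."""
    last_id = state.get("last_id")  # hypothetical key name
    new_items = []
    for item in items:
        if item["id"] == last_id:
            break
        new_items.append(item)
    return new_items
```

In this sketch a scraper would call `load_state`, pass the fetched items through `filter_new_items`, and then call `save_state_atomic` with the newest ID once the markdown has been written. Writing the temp file in the target directory matters because a move across filesystems (for example, from `/tmp` to the NAS mount) is not atomic.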