# HKIA Content Aggregation System

## Project Overview
Complete content aggregation system that scrapes 6 sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram, TikTok), converts the content to markdown, and runs twice daily with incremental updates.

## Architecture
- **Base Pattern**: Abstract scraper class with common interface
- **State Management**: JSON-based incremental update tracking
- **Parallel Processing**: 5 sources run in parallel, TikTok separate (GUI requirement)
- **Output Format**: `hkia_[source]_[timestamp].md`
- **Archive System**: Previous files archived to timestamped directories
- **NAS Sync**: Automated rsync to `/mnt/nas/hkia/`

## Key Implementation Details

### Instagram Scraper (`src/instagram_scraper.py`)
- Uses `instaloader` with session persistence
- Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
- Session file: `instagram_session_hkia1.session`
- Authentication: Username `hkia1`, password `I22W5YlbRl7x`

### TikTok Scraper (`src/tiktok_scraper_advanced.py`)
- Evades anti-bot detection using Scrapling + Camoufox
- **Requires headed browser with DISPLAY=:0**
- Stealth features: geolocation spoofing, OS randomization, WebGL support
- Cannot be containerized due to GUI requirements

### YouTube Scraper (`src/youtube_scraper.py`)
- Uses `yt-dlp` for metadata extraction
- Channel: `@HVACKnowItAll`
- Fetches video metadata without downloading videos

### RSS Scrapers
- **MailChimp**: `https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985`
- **Podcast**: `https://feeds.libsyn.com/568690/spotify`

### WordPress Scraper (`src/wordpress_scraper.py`)
- Direct API access to `hkia.com`
- Fetches blog posts with full content

## Technical Stack
- **Python**: 3.11+ with UV package manager
- **Key Dependencies**:
  - `instaloader` (Instagram)
  - `scrapling[all]` (TikTok anti-bot)
  - `yt-dlp` (YouTube)
  - `feedparser` (RSS)
  - `markdownify` (HTML conversion)
- **Testing**: pytest with comprehensive mocking

## Deployment Strategy

### ⚠️ IMPORTANT: systemd Services (Not Kubernetes)
Originally planned for Kubernetes deployment, but **TikTok requires a headed browser with DISPLAY=:0**, which makes containerization impossible.

### Production Setup
```bash
# Service files location
/etc/systemd/system/hvac-scraper.service
/etc/systemd/system/hvac-scraper.timer
/etc/systemd/system/hvac-scraper-nas.service
/etc/systemd/system/hvac-scraper-nas.timer

# Installation directory
/opt/hvac-kia-content/

# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```

### Schedule
- **Main Scraping**: 8 AM and 12 PM Atlantic Daylight Time
- **NAS Sync**: 30 minutes after each scraping run
- **User**: ben (requires GUI access for TikTok)

## Environment Variables
```bash
# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_PASSWORD=I22W5YlbRl7x
YOUTUBE_CHANNEL=@HVACKnowItAll
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```

## Commands

### Testing
```bash
# Test individual sources
uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mailchimp|podcast]

# Test backlog processing
uv run python test_real_data.py --type backlog --items 50

# Full test suite
uv run pytest tests/ -v
```

### Production Operations
```bash
# Run orchestrator manually
uv run python -m src.orchestrator

# Run specific sources
uv run python -m src.orchestrator --sources youtube instagram

# NAS sync only
uv run python -m src.orchestrator --nas-only

# Check service status
sudo systemctl status hvac-scraper.service
sudo journalctl -f -u hvac-scraper.service
```

## Critical Notes
1. **TikTok GUI Requirement**: Must run on a desktop environment with DISPLAY=:0
2. **Instagram Rate Limiting**: 100 requests/hour with exponential backoff
3. **State Files**: Located in the `state/` directory for incremental updates
4. **Archive Management**: Previous files are automatically moved to timestamped archives
5. **Error Recovery**: All scrapers handle rate limits and network failures gracefully

## Project Status: ✅ COMPLETE
- All 6 sources working and tested
- Production deployment ready via systemd
- Comprehensive testing completed (68+ tests passing)
- Real-world data validation completed
- Full backlog processing capability verified
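
The JSON-based incremental update tracking listed under Architecture could look roughly like the sketch below. It is not the project's actual code: the `seen_ids` key, the `id` field on each item, and the helper names `load_state`, `save_state`, and `filter_new` are all assumptions made for illustration.

```python
import json
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty state on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"seen_ids": []}  # hypothetical state layout


def save_state(path: Path, state: dict) -> None:
    """Write the state as JSON, creating the state directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def filter_new(items: list[dict], state: dict) -> list[dict]:
    """Return only items not seen before and record their ids in the state."""
    seen = set(state["seen_ids"])
    new_items = [item for item in items if item["id"] not in seen]
    state["seen_ids"] = sorted(seen | {item["id"] for item in new_items})
    return new_items
```

Under these assumptions, a scraper would call `load_state` before fetching, pass its results through `filter_new`, and call `save_state` only after the markdown file has been written, so a failed run does not mark items as processed.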
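
The split between five parallel sources and a separate TikTok run, described under Architecture, can be sketched with a thread pool. This is an illustrative sketch only: the `run_all` function, the two source lists, and the choice to run TikTok after the parallel batch are assumptions, and each scraper is modeled as a plain callable rather than the project's scraper class.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Assumed grouping, based on the "5 in parallel, TikTok separate" note.
PARALLEL_SOURCES = ["wordpress", "mailchimp", "podcast", "youtube", "instagram"]
SEQUENTIAL_SOURCES = ["tiktok"]


def run_all(scrapers: dict[str, Callable[[], object]]) -> dict[str, object]:
    """Run the parallel sources in a pool, then the sequential ones.

    A failing scraper does not stop the others; its exception is stored
    as the result for that source.
    """
    results: dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=len(PARALLEL_SOURCES)) as pool:
        futures = {
            name: pool.submit(scrapers[name])
            for name in PARALLEL_SOURCES
            if name in scrapers
        }
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = exc
    # TikTok needs the headed browser, so it runs on its own afterwards.
    for name in SEQUENTIAL_SOURCES:
        if name in scrapers:
            try:
                results[name] = scrapers[name]()
            except Exception as exc:
                results[name] = exc
    return results
```

Storing exceptions instead of raising them keeps one broken source from blocking the rest of a scheduled run; the caller can log or retry the failed entries.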
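
The Instagram pacing described above (15-30 second delays with an extended break every 5 requests) might be implemented along these lines. The `RequestPacer` class is hypothetical, and the 120-second `break_seconds` default is an assumed value, since the length of the extended break is not stated here.

```python
import random
import time
from typing import Callable


class RequestPacer:
    """Sleep a random interval before each request, longer every Nth one."""

    def __init__(
        self,
        min_delay: float = 15.0,
        max_delay: float = 30.0,
        break_every: int = 5,
        break_seconds: float = 120.0,  # assumed; not given in this document
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.break_every = break_every
        self.break_seconds = break_seconds
        self._sleep = sleep
        self.count = 0

    def wait(self) -> float:
        """Block for the next delay and return how long it was."""
        self.count += 1
        delay = random.uniform(self.min_delay, self.max_delay)
        if self.count % self.break_every == 0:
            delay += self.break_seconds  # extended break on every Nth request
        self._sleep(delay)
        return delay
```

Passing a custom `sleep` callable makes the pacer testable without real waiting, which fits a test suite built on mocking.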
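
The `hkia_[source]_[timestamp].md` naming and the archiving of previous files could be sketched as follows. The timestamp format `%Y%m%d_%H%M%S`, the `archive/` subdirectory name, and the helper names `output_name` and `archive_previous` are assumptions for illustration, not the project's confirmed layout.

```python
import shutil
from datetime import datetime
from pathlib import Path

STAMP_FORMAT = "%Y%m%d_%H%M%S"  # assumed timestamp format


def output_name(source: str, now: datetime) -> str:
    """Build a file name following the hkia_[source]_[timestamp].md pattern."""
    return f"hkia_{source}_{now.strftime(STAMP_FORMAT)}.md"


def archive_previous(output_dir: Path, source: str, now: datetime) -> list[Path]:
    """Move existing files for a source into a timestamped archive directory."""
    archive_dir = output_dir / "archive" / now.strftime(STAMP_FORMAT)
    moved: list[Path] = []
    # sorted() materializes the matches before any file is moved.
    for old in sorted(output_dir.glob(f"hkia_{source}_*.md")):
        archive_dir.mkdir(parents=True, exist_ok=True)
        target = archive_dir / old.name
        shutil.move(str(old), str(target))
        moved.append(target)
    return moved
```

In this sketch, `archive_previous` would run just before a new file is written, so the output directory holds only the latest file per source.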