# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

# HKIA Content Aggregation System

## Project Overview

Complete content aggregation system that scrapes 5 sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram), converts to markdown, and runs twice daily with incremental updates. TikTok scraper disabled due to technical issues.

## Architecture

- **Base Pattern**: Abstract scraper class with common interface
- **State Management**: JSON-based incremental update tracking
- **Parallel Processing**: All 5 active sources run in parallel
- **Output Format**: `hkia_[source]_[timestamp].md`
- **Archive System**: Previous files archived to timestamped directories
- **NAS Sync**: Automated rsync to `/mnt/nas/hkia/`

## Key Implementation Details

### Instagram Scraper (`src/instagram_scraper.py`)

- Uses `instaloader` with session persistence
- Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
- Session file: `instagram_session_hkia1.session`
- Authentication: Username `hkia1`, password `I22W5YlbRl7x`

### ~~TikTok Scraper~~ ❌ **DISABLED**

- **Status**: Disabled in orchestrator due to technical issues
- **Reason**: GUI requirements incompatible with automated deployment
- **Code**: Still available in `src/tiktok_scraper_advanced.py` but not active

### YouTube Scraper (`src/youtube_scraper.py`)

- Uses `yt-dlp` with authentication for metadata and transcript extraction
- Channel: `@hkia`
- **Authentication**: Firefox cookie extraction via `YouTubeAuthHandler`
- **Transcript Support**: Can extract transcripts when `fetch_transcripts=True`
- ⚠️ **Current Limitation**: YouTube's new PO token requirements (Aug 2025) block transcript extraction
  - Error: "The following content is not available on this app"
  - **179 videos identified** with captions available but currently inaccessible
  - Requires `yt-dlp` updates to handle new YouTube restrictions

### RSS Scrapers

- **MailChimp**:
  `https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985`
- **Podcast**: `https://feeds.libsyn.com/568690/spotify`

### WordPress Scraper (`src/wordpress_scraper.py`)

- Direct API access to `hkia.com`
- Fetches blog posts with full content

## Technical Stack

- **Python**: 3.11+ with UV package manager
- **Key Dependencies**:
  - `instaloader` (Instagram)
  - `scrapling[all]` (TikTok anti-bot)
  - `yt-dlp` (YouTube)
  - `feedparser` (RSS)
  - `markdownify` (HTML conversion)
- **Testing**: pytest with comprehensive mocking

## Deployment Strategy

### ✅ Production Setup - systemd Services

**TikTok disabled** - no longer requires GUI access or containerization restrictions.

```bash
# Service files location (✅ INSTALLED)
/etc/systemd/system/hkia-scraper.service
/etc/systemd/system/hkia-scraper.timer
/etc/systemd/system/hkia-scraper-nas.service
/etc/systemd/system/hkia-scraper-nas.timer

# Working directory
/home/ben/dev/hvac-kia-content/

# Installation script
./install-hkia-services.sh

# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```

### Schedule (✅ ACTIVE)

- **Main Scraping**: 8:00 AM and 12:00 PM Atlantic Daylight Time (5 sources)
- **NAS Sync**: 8:30 AM and 12:30 PM (30 minutes after scraping)
- **User**: ben (GUI environment available but not required)

## Environment Variables

```bash
# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_PASSWORD=I22W5YlbRl7x
YOUTUBE_CHANNEL=@hkia
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```

## Commands

### Testing

```bash
# Test individual sources
uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mailchimp|podcast]

# Test backlog processing
uv run python test_real_data.py --type backlog --items 50

# Test cumulative markdown system
uv run python test_cumulative_mode.py

# Full test suite
uv run pytest tests/ -v

# Test with specific GUI environment for TikTok
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python test_real_data.py --source tiktok

# Test YouTube transcript extraction (currently blocked by YouTube)
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python youtube_backlog_all_with_transcripts.py
```

### Production Operations

```bash
# Service management (✅ ACTIVE SERVICES)
sudo systemctl status hkia-scraper.timer
sudo systemctl status hkia-scraper-nas.timer
sudo journalctl -f -u hkia-scraper.service
sudo journalctl -f -u hkia-scraper-nas.service

# Manual runs (for testing)
uv run python run_production_with_images.py
uv run python -m src.orchestrator --sources youtube instagram
uv run python -m src.orchestrator --nas-only

# Legacy commands (still work)
uv run python -m src.orchestrator
uv run python run_production_cumulative.py
```

## Critical Notes

1. **✅ TikTok Scraper**: DISABLED - No longer blocks deployment or requires GUI access
2. **Instagram Rate Limiting**: 100 requests/hour with exponential backoff
3. **YouTube Transcript Limitations**: As of August 2025, YouTube blocks transcript extraction
   - PO token requirements prevent `yt-dlp` access to subtitle/caption data
   - 179 videos identified with captions but currently inaccessible
   - Authentication system works but content restricted at platform level
4. **State Files**: Located in `data/markdown_current/.state/` directory for incremental updates
5. **Archive Management**: Previous files automatically moved to timestamped archives
6. **Error Recovery**: All scrapers handle rate limits and network failures gracefully
7. **✅ Production Services**: Fully automated with systemd timers running twice daily

## YouTube Transcript Investigation (August 2025)

**Objective**: Extract transcripts for 179 YouTube videos identified as having captions available.
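
As a rough illustration of how videos with captions might be identified from existing metadata, the sketch below checks `yt-dlp`-style info dictionaries for caption tracks. The `subtitles` and `automatic_captions` keys are fields that `yt-dlp` populates in its info dict; the helper names `has_captions` and `videos_with_captions` and the sample records are hypothetical and are not taken from `YouTubeScraper`.

```python
from typing import Any, Iterable


def has_captions(info: dict[str, Any], lang: str = "en") -> bool:
    """Return True if the info dict lists manual or automatic captions for `lang`."""
    for key in ("subtitles", "automatic_captions"):
        tracks = info.get(key) or {}
        # Caption tracks are keyed by language code, e.g. "en" or "en-US".
        if any(code == lang or code.startswith(lang + "-") for code in tracks):
            return True
    return False


def videos_with_captions(infos: Iterable[dict[str, Any]], lang: str = "en") -> list[str]:
    """Collect the IDs of videos whose metadata advertises captions."""
    return [info["id"] for info in infos if has_captions(info, lang)]


# Hypothetical sample metadata, shaped like yt-dlp info dicts.
sample = [
    {"id": "vid_a", "subtitles": {"en": [{"ext": "vtt"}]}},
    {"id": "vid_b", "automatic_captions": {"en-US": [{"ext": "vtt"}]}},
    {"id": "vid_c", "subtitles": {}, "automatic_captions": None},
]
print(videos_with_captions(sample))
```

Metadata that advertises captions only shows that a track exists. It does not guarantee that the track can be downloaded, which matches the "available but currently inaccessible" state described in this file.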

**Investigation Findings**:

- ✅ **179 videos identified** with captions from existing YouTube data
- ✅ **Existing authentication system** (`YouTubeAuthHandler` + Firefox cookies) working
- ✅ **Transcript extraction code** properly implemented in `YouTubeScraper`
- ❌ **Platform restrictions** blocking all video access as of August 2025

**Technical Attempts**:

1. **YouTube Data API v3**: Requires OAuth2 for `captions.download` (not just API keys)
2. **youtube-transcript-api**: IP blocking after minimal requests
3. **yt-dlp with authentication**: All videos blocked with "not available on this app"

**Current Blocker**: YouTube's new PO token requirements prevent access to video content and transcripts, even with valid authentication. Error: "The following content is not available on this app.. Watch on the latest version of YouTube."

**Resolution**: Requires upstream `yt-dlp` updates to handle new YouTube platform restrictions.

## Project Status: ✅ COMPLETE & DEPLOYED

- **5 active sources** working and tested (TikTok disabled)
- **✅ Production deployment**: systemd services installed and running
- **✅ Automated scheduling**: 8 AM & 12 PM ADT with NAS sync
- **✅ Comprehensive testing**: 68+ tests passing
- **✅ Real-world data validation**: All sources producing content
- **✅ Full backlog processing**: Verified for all active sources
- **✅ Cumulative markdown system**: Operational
- **✅ Image downloading system**: 686 images synced daily
- **✅ NAS synchronization**: Automated twice-daily sync
- **YouTube transcript extraction**: Blocked by platform restrictions (not code issues)
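
The base pattern, JSON state tracking, and `hkia_[source]_[timestamp].md` naming listed under Architecture can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: `BaseScraper`, `fetch_items`, `DemoScraper`, the state-file layout, and the timestamp format are all assumptions.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path


class BaseScraper(ABC):
    """Hypothetical common interface shared by all source scrapers."""

    def __init__(self, source: str, state_dir: Path) -> None:
        self.source = source
        # Assumed layout: one JSON state file per source.
        self.state_file = state_dir / f"{source}_state.json"

    @abstractmethod
    def fetch_items(self) -> list[dict]:
        """Return all items currently visible at the source."""

    def load_seen(self) -> set[str]:
        """Read the IDs recorded by previous runs (empty on the first run)."""
        if not self.state_file.exists():
            return set()
        return set(json.loads(self.state_file.read_text())["seen_ids"])

    def save_seen(self, seen: set[str]) -> None:
        """Persist the set of processed IDs as JSON."""
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        self.state_file.write_text(json.dumps({"seen_ids": sorted(seen)}))

    def fetch_new(self) -> list[dict]:
        """Incremental update: return only unseen items, then record them."""
        seen = self.load_seen()
        new_items = [item for item in self.fetch_items() if item["id"] not in seen]
        self.save_seen(seen | {item["id"] for item in new_items})
        return new_items

    def output_name(self, now: datetime) -> str:
        # The timestamp format is an assumption; the source gives only the pattern.
        return f"hkia_{self.source}_{now:%Y%m%d_%H%M%S}.md"


class DemoScraper(BaseScraper):
    """Stand-in source returning fixed items, for illustration only."""

    def fetch_items(self) -> list[dict]:
        return [{"id": "post-1"}, {"id": "post-2"}]


with tempfile.TemporaryDirectory() as tmp:
    scraper = DemoScraper("wordpress", Path(tmp))
    first_run = scraper.fetch_new()   # both items are new
    second_run = scraper.fetch_new()  # nothing new since the first run
    name = scraper.output_name(datetime(2025, 8, 1, 8, 0, 0))
```

Keeping the state handling in the base class means each concrete scraper would only need to implement `fetch_items`, so the incremental behaviour stays identical across sources.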