CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
HKIA Content Aggregation System
Project Overview
A complete content aggregation system that scrapes 5 sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram), converts the content to markdown, and runs twice daily with incremental updates. The TikTok scraper is disabled because its GUI requirements are incompatible with automated deployment.
Architecture
- Base Pattern: Abstract scraper class with common interface
- State Management: JSON-based incremental update tracking
- Parallel Processing: All 5 active sources run in parallel
- Output Format: hkia_[source]_[timestamp].md
- Archive System: Previous files archived to timestamped directories
- NAS Sync: Automated rsync to /mnt/nas/hkia/
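The base pattern and JSON state tracking listed above could look roughly like the sketch below. The class name `BaseScraper`, its method names, the `seen_ids` state layout, and the timestamp format are all assumptions for illustration; the source only specifies an abstract scraper class, JSON-based state, and the `hkia_[source]_[timestamp].md` naming.

```python
import json
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path


class BaseScraper(ABC):
    """Hypothetical common interface; the project's real base class may differ."""

    def __init__(self, source: str, state_dir: Path) -> None:
        self.source = source
        self.state_path = state_dir / f"{source}_state.json"

    @abstractmethod
    def fetch_items(self) -> list:
        """Return every item currently visible at the source, each a dict with an 'id'."""

    def load_state(self) -> dict:
        # A missing state file means this is the first run for the source.
        if self.state_path.exists():
            return json.loads(self.state_path.read_text(encoding="utf-8"))
        return {"seen_ids": []}

    def save_state(self, state: dict) -> None:
        self.state_path.parent.mkdir(parents=True, exist_ok=True)
        self.state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")

    def fetch_new_items(self) -> list:
        """Incremental update: return only items not recorded in the state file."""
        state = self.load_state()
        seen = set(state["seen_ids"])
        new_items = [item for item in self.fetch_items() if item["id"] not in seen]
        state["seen_ids"] = sorted(seen | {item["id"] for item in new_items})
        self.save_state(state)
        return new_items

    def output_filename(self, now: datetime) -> str:
        # Assumed timestamp layout; only the hkia_[source]_[timestamp].md shape is given.
        return f"hkia_{self.source}_{now:%Y-%m-%dT%H%M%S}.md"
```

Each concrete scraper would then only implement `fetch_items`, while state handling and file naming stay in one place.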
Key Implementation Details
Instagram Scraper (src/instagram_scraper.py)
- Uses instaloader with session persistence
- Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
- Session file: instagram_session_hkia1.session
- Authentication: Username hkia1, password I22W5YlbRl7x
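The rate-limiting policy above (a random 15-30 second delay per request, plus an extended break every 5 requests) can be sketched as a small pacing helper. `RequestPacer` is a hypothetical name, and the 60-120 second length of the extended break is an assumption, since the source does not say how long those breaks are.

```python
import random
import time
from typing import Callable, Optional, Tuple


class RequestPacer:
    """Sleeps 15-30 s before each request and adds a longer break every 5th request."""

    def __init__(
        self,
        min_delay: float = 15.0,
        max_delay: float = 30.0,
        burst_size: int = 5,
        extended_break: Tuple[float, float] = (60.0, 120.0),  # assumed range
        sleep: Callable[[float], None] = time.sleep,
        rng: Optional[random.Random] = None,
    ) -> None:
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.burst_size = burst_size
        self.extended_break = extended_break
        self.sleep = sleep
        self.rng = rng or random.Random()
        self.count = 0

    def wait(self) -> float:
        """Block for the next delay and return how long it was, in seconds."""
        self.count += 1
        delay = self.rng.uniform(self.min_delay, self.max_delay)
        if self.count % self.burst_size == 0:
            # Every burst_size-th request also pays the extended break.
            delay += self.rng.uniform(*self.extended_break)
        self.sleep(delay)
        return delay
```

Calling `pacer.wait()` before each Instagram request keeps the pacing logic out of the scraping code, and injecting `sleep` makes the behavior testable without real waiting.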
TikTok Scraper ❌ DISABLED
- Status: Disabled in orchestrator due to technical issues
- Reason: GUI requirements incompatible with automated deployment
- Code: Still available in src/tiktok_scraper_advanced.py but not active
YouTube Scraper (src/youtube_scraper.py)
- Uses yt-dlp with authentication for metadata and transcript extraction
- Channel: @hkia
- Authentication: Firefox cookie extraction via YouTubeAuthHandler
- Transcript Support: Can extract transcripts when fetch_transcripts=True
- ⚠️ Current Limitation: YouTube's new PO token requirements (Aug 2025) block transcript extraction
  - Error: "The following content is not available on this app"
  - 179 videos identified with captions available but currently inaccessible
  - Requires yt-dlp updates to handle new YouTube restrictions
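For orientation, the sketch below shows one way to combine Firefox cookies and subtitle options through yt-dlp's Python API. The helper names `build_ydl_options` and `fetch_video_metadata` are hypothetical, and how `YouTubeAuthHandler` actually configures yt-dlp is not shown in the source, so treat this as an assumption-laden illustration. Given the PO token limitation described above, a call like this is currently expected to fail for this channel.

```python
from typing import Any, Dict


def build_ydl_options(fetch_transcripts: bool = False, browser: str = "firefox") -> Dict[str, Any]:
    """Build yt-dlp options for metadata-only extraction with browser cookies."""
    opts: Dict[str, Any] = {
        "quiet": True,
        "skip_download": True,             # metadata only, no media files
        "cookiesfrombrowser": (browser,),  # yt-dlp expects a tuple here
    }
    if fetch_transcripts:
        opts.update(
            {
                "writesubtitles": True,     # uploaded captions
                "writeautomaticsub": True,  # auto-generated captions
                "subtitleslangs": ["en"],   # assumed language
            }
        )
    return opts


def fetch_video_metadata(url: str, fetch_transcripts: bool = False) -> Dict[str, Any]:
    """Return yt-dlp's info dict for one video (requires network access and yt-dlp)."""
    import yt_dlp  # third-party; imported lazily so the options helper works without it

    with yt_dlp.YoutubeDL(build_ydl_options(fetch_transcripts)) as ydl:
        return ydl.extract_info(url, download=False)
```

With `download=False`, caption tracks are not written to disk; the returned info dict lists them under its `subtitles` and `automatic_captions` keys.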
RSS Scrapers
- MailChimp: https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985
- Podcast: https://feeds.libsyn.com/568690/spotify
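Both feeds are plain RSS, so turning an item into markdown is mostly field extraction. The project lists feedparser as its RSS dependency; the sketch below deliberately swaps in the standard library's `xml.etree.ElementTree` so it runs without third-party packages. The function names and the markdown layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_rss_items(xml_text: str) -> List[Dict[str, str]]:
    """Extract id, title, link, and publication date from each RSS <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        items.append(
            {
                # Fall back to the link when the feed provides no <guid>.
                "id": (item.findtext("guid") or link).strip(),
                "title": (item.findtext("title") or "").strip(),
                "link": link,
                "published": (item.findtext("pubDate") or "").strip(),
            }
        )
    return items


def to_markdown(entry: Dict[str, str]) -> str:
    """Render one entry in an assumed markdown layout."""
    return f"# {entry['title']}\n\n- Link: {entry['link']}\n- Published: {entry['published']}\n"
```

The `id` field is what an incremental state file would record to skip items seen in earlier runs.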
WordPress Scraper (src/wordpress_scraper.py)
- Direct API access to hkia.com
- Fetches blog posts with full content
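The source does not say which API the scraper calls. Assuming it is the standard WordPress REST API (an assumption, not confirmed here), paginated post retrieval could be sketched as follows. `posts_url` and `fetch_all_posts` are hypothetical helpers, and the User-Agent string is a placeholder.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def posts_url(base_url: str, page: int = 1, per_page: int = 100) -> str:
    """Build the URL for one page of posts on the standard WordPress REST route."""
    query = urlencode({"per_page": per_page, "page": page})
    return f"{base_url.rstrip('/')}/wp-json/wp/v2/posts?{query}"


def fetch_all_posts(base_url: str) -> List[Dict[str, Any]]:
    """Walk every page of posts (requires network access)."""
    posts: List[Dict[str, Any]] = []
    page = 1
    while True:
        request = Request(posts_url(base_url, page), headers={"User-Agent": "hkia-scraper-sketch"})
        with urlopen(request, timeout=30) as response:
            # WordPress reports the page count in this response header.
            total_pages = int(response.headers.get("X-WP-TotalPages", "1"))
            posts.extend(json.load(response))
        if page >= total_pages:
            return posts
        page += 1
```

Each returned post carries its HTML body under `content.rendered`, which is the part a markdown converter such as markdownify would process.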
Technical Stack
- Python: 3.11+ with UV package manager
- Key Dependencies:
  - instaloader (Instagram)
  - scrapling[all] (TikTok anti-bot)
  - yt-dlp (YouTube)
  - feedparser (RSS)
  - markdownify (HTML conversion)
- Testing: pytest with comprehensive mocking
Deployment Strategy
✅ Production Setup - systemd Services
With TikTok disabled, deployment no longer requires GUI access and is no longer subject to containerization restrictions.
# Service files location (✅ INSTALLED)
/etc/systemd/system/hkia-scraper.service
/etc/systemd/system/hkia-scraper.timer
/etc/systemd/system/hkia-scraper-nas.service
/etc/systemd/system/hkia-scraper-nas.timer
# Working directory
/home/ben/dev/hvac-kia-content/
# Installation script
./install-hkia-services.sh
# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
Schedule (✅ ACTIVE)
- Main Scraping: 8:00 AM and 12:00 PM Atlantic Daylight Time (5 sources)
- NAS Sync: 8:30 AM and 12:30 PM (30 minutes after scraping)
- User: ben (GUI environment available but not required)
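The NAS sync times follow from the scrape times plus a 30-minute offset. The sketch below works through that arithmetic with naive datetimes assumed to already be in Atlantic local time. `next_run` is a hypothetical helper for illustration and is not part of the systemd setup, which handles scheduling itself.

```python
from datetime import datetime, time, timedelta

SCRAPE_TIMES = (time(8, 0), time(12, 0))  # 8:00 AM and 12:00 PM local time
NAS_SYNC_OFFSET = timedelta(minutes=30)   # NAS sync trails each scrape by 30 minutes


def next_run(now: datetime, offset: timedelta = timedelta(0)) -> datetime:
    """Return the first scheduled time (scrape time + offset) strictly after `now`."""
    for day in (now.date(), now.date() + timedelta(days=1)):
        for scheduled in SCRAPE_TIMES:
            candidate = datetime.combine(day, scheduled) + offset
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: tomorrow's first slot is always in the future")
```

For example, at 9:15 AM the next scrape is 12:00 PM and the next NAS sync, via `next_run(now, NAS_SYNC_OFFSET)`, is 12:30 PM.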
Environment Variables
# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_PASSWORD=I22W5YlbRl7x
YOUTUBE_CHANNEL=@hkia
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
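A startup check that fails fast on a missing variable can save a confusing scraper error later. The sketch below parses a `.env` file with the standard library and reports which of the keys listed above are absent. How the project actually loads its environment is not shown in the source, so `parse_env` and `missing_keys` are hypothetical helpers.

```python
from typing import Dict, List

# Keys taken from the list above.
REQUIRED_KEYS = (
    "INSTAGRAM_USERNAME",
    "INSTAGRAM_PASSWORD",
    "YOUTUBE_CHANNEL",
    "TIKTOK_USERNAME",
    "NAS_PATH",
    "TIMEZONE",
    "DISPLAY",
    "XAUTHORITY",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments and trimming quotes."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return required keys that are absent or empty, in declaration order."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

A caller would read the file with `Path(...).read_text()`, pass the text to `parse_env`, and abort with a clear message if `missing_keys` returns anything.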
Commands
Testing
# Test individual sources
uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mailchimp|podcast]
# Test backlog processing
uv run python test_real_data.py --type backlog --items 50
# Test cumulative markdown system
uv run python test_cumulative_mode.py
# Full test suite
uv run pytest tests/ -v
# Test with specific GUI environment for TikTok
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python test_real_data.py --source tiktok
# Test YouTube transcript extraction (currently blocked by YouTube)
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python youtube_backlog_all_with_transcripts.py
Production Operations
# Service management (✅ ACTIVE SERVICES)
sudo systemctl status hkia-scraper.timer
sudo systemctl status hkia-scraper-nas.timer
sudo journalctl -f -u hkia-scraper.service
sudo journalctl -f -u hkia-scraper-nas.service
# Manual runs (for testing)
uv run python run_production_with_images.py
uv run python -m src.orchestrator --sources youtube instagram
uv run python -m src.orchestrator --nas-only
# Legacy commands (still work)
uv run python -m src.orchestrator
uv run python run_production_cumulative.py
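The `--sources` and `--nas-only` flags used above, together with the parallel execution of the active sources, suggest an entry point along the lines of the sketch below. Only the two flag names come from the source. The function names, the use of a thread pool, and the shape of the results are assumptions, and the real `src.orchestrator` may be organized differently.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

ACTIVE_SOURCES = ("wordpress", "mailchimp", "podcast", "youtube", "instagram")


def parse_args(argv: Sequence[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="orchestrator")
    parser.add_argument(
        "--sources",
        nargs="+",
        choices=ACTIVE_SOURCES,
        default=list(ACTIVE_SOURCES),
        help="subset of sources to scrape (default: all active sources)",
    )
    parser.add_argument("--nas-only", action="store_true", help="skip scraping, only sync to NAS")
    return parser.parse_args(argv)


def run_sources(sources: List[str], scrape: Callable[[str], int]) -> Dict[str, object]:
    """Run one scrape per source in parallel; a failure in one does not stop the others."""
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(scrape, name) for name in sources}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # record the failure and keep collecting results
                results[name] = f"failed: {exc}"
    return results
```

Threads suit this workload because each scraper spends most of its time waiting on network responses or rate-limit delays.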
Critical Notes
- ✅ TikTok Scraper: DISABLED - No longer blocks deployment or requires GUI access
- Instagram Rate Limiting: 100 requests/hour with exponential backoff
- YouTube Transcript Limitations: As of August 2025, YouTube blocks transcript extraction
  - PO token requirements prevent yt-dlp access to subtitle/caption data
  - 179 videos identified with captions but currently inaccessible
  - Authentication system works but content restricted at platform level
- State Files: Located in data/markdown_current/.state/ directory for incremental updates
- Archive Management: Previous files automatically moved to timestamped archives
- Error Recovery: All scrapers handle rate limits and network failures gracefully
- ✅ Production Services: Fully automated with systemd timers running twice daily
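The exponential backoff mentioned for Instagram, and the graceful handling of rate limits and network failures more generally, can be sketched as a generic retry helper. `retry_with_backoff` and its default delays are assumptions for illustration; the source does not give the actual backoff parameters.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 2.0,   # assumed starting delay in seconds
    max_delay: float = 300.0,  # assumed cap on a single wait
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, doubling the wait after each failure, up to `max_attempts` tries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # real code would catch only rate-limit and network errors
            if attempt == max_attempts - 1:
                raise
            sleep(min(base_delay * 2 ** attempt, max_delay))
    raise ValueError("max_attempts must be at least 1")
```

A budget of 100 requests per hour also implies an average spacing of 3600 / 100 = 36 seconds between requests, so backoff complements steady pacing and does not replace it.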
YouTube Transcript Investigation (August 2025)
Objective: Extract transcripts for 179 YouTube videos identified as having captions available.
Investigation Findings:
- ✅ 179 videos identified with captions from existing YouTube data
- ✅ Existing authentication system (YouTubeAuthHandler + Firefox cookies) working
- ✅ Transcript extraction code properly implemented in YouTubeScraper
- ❌ Platform restrictions blocking all video access as of August 2025
Technical Attempts:
- YouTube Data API v3: Requires OAuth2 for captions.download (not just API keys)
- youtube-transcript-api: IP blocking after minimal requests
- yt-dlp with authentication: All videos blocked with "not available on this app"
Current Blocker: YouTube's new PO token requirements prevent access to video content and transcripts, even with valid authentication. Error: "The following content is not available on this app.. Watch on the latest version of YouTube."
Resolution: Requires upstream yt-dlp updates to handle new YouTube platform restrictions.
Project Status: ✅ COMPLETE & DEPLOYED
- 5 active sources working and tested (TikTok disabled)
- ✅ Production deployment: systemd services installed and running
- ✅ Automated scheduling: 8 AM & 12 PM ADT with NAS sync
- ✅ Comprehensive testing: 68+ tests passing
- ✅ Real-world data validation: All sources producing content
- ✅ Full backlog processing: Verified for all active sources
- ✅ Cumulative markdown system: Operational
- ✅ Image downloading system: 686 images synced daily
- ✅ NAS synchronization: Automated twice-daily sync
- YouTube transcript extraction: Blocked by platform restrictions (not code issues)