hvac-kia-content/CLAUDE.md
Ben Reed 71ab1c2407 feat: Disable TikTok scraper and deploy production systemd services
MAJOR CHANGES:
- TikTok scraper disabled in orchestrator (GUI dependency issues)
- Created new hkia-scraper systemd services replacing hvac-content-*
- Added comprehensive installation script: install-hkia-services.sh
- Updated documentation to reflect 5 active sources (WordPress, MailChimp, Podcast, YouTube, Instagram)

PRODUCTION DEPLOYMENT:
- Services installed and active: hkia-scraper.timer, hkia-scraper-nas.timer
- Schedule: 8:00 AM & 12:00 PM ADT scraping + 30min NAS sync
- All sources now run in parallel (no TikTok GUI blocking)
- Automated twice-daily content aggregation with image downloads

TECHNICAL:
- Orchestrator simplified: removed TikTok special handling
- Service files: proper naming convention (hkia-scraper vs hvac-content)
- Documentation: marked TikTok as disabled, updated deployment status

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-21 10:40:48 -03:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

HKIA Content Aggregation System

Project Overview

A complete content aggregation system that scrapes 5 sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram), converts the content to markdown, and runs twice daily with incremental updates. The TikTok scraper is disabled because its GUI requirements are incompatible with automated deployment.

Architecture

  • Base Pattern: Abstract scraper class with common interface
  • State Management: JSON-based incremental update tracking
  • Parallel Processing: All 5 active sources run in parallel
  • Output Format: hkia_[source]_[timestamp].md
  • Archive System: Previous files archived to timestamped directories
  • NAS Sync: Automated rsync to /mnt/nas/hkia/
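
The architecture bullets above can be sketched as a small abstract base class. This is a minimal illustration, not the project's actual code: the names `BaseScraper`, `fetch_items`, `run`, the state layout `{"seen_ids": [...]}`, and the timestamp format are all assumptions made for the example.

```python
import json
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path


class BaseScraper(ABC):
    """Hypothetical common interface; the real class may differ."""

    source = "base"  # overridden by each concrete scraper

    def __init__(self, state_dir: Path):
        self.state_file = Path(state_dir) / f"{self.source}_state.json"

    @abstractmethod
    def fetch_items(self) -> list[dict]:
        """Return items as dicts that each carry a unique string 'id'."""

    def load_state(self) -> dict:
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {"seen_ids": []}

    def save_state(self, state: dict) -> None:
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        self.state_file.write_text(json.dumps(state, indent=2))

    def output_name(self, now: datetime) -> str:
        # Follows the hkia_[source]_[timestamp].md pattern; the exact
        # timestamp format is an assumption.
        return f"hkia_{self.source}_{now:%Y%m%d_%H%M%S}.md"

    def run(self) -> list[dict]:
        """Return only items not seen in earlier runs, then record them."""
        state = self.load_state()
        seen = set(state["seen_ids"])
        new_items = [i for i in self.fetch_items() if i["id"] not in seen]
        state["seen_ids"] = sorted(seen | {i["id"] for i in new_items})
        self.save_state(state)
        return new_items
```

A concrete scraper would only need to set `source` and implement `fetch_items`; running it twice against unchanged upstream content yields an empty list the second time, which is the incremental behaviour described above.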

Key Implementation Details

Instagram Scraper (src/instagram_scraper.py)

  • Uses instaloader with session persistence
  • Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
  • Session file: instagram_session_hkia1.session
  • Authentication: Username hkia1; password supplied via INSTAGRAM_PASSWORD in .env
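
The pacing described above (a random 15-30 second delay per request, plus an extended break every 5 requests) might look roughly like the sketch below. The class name `RequestPacer` and the 120-second break length are assumptions; the source does not state how long the extended break is. The `sleep` and `rng` parameters are injectable so the logic can be exercised without actually waiting.

```python
import random
import time


class RequestPacer:
    """Sleeps between requests; adds a longer break every N requests."""

    def __init__(self, min_delay=15.0, max_delay=30.0, break_every=5,
                 break_seconds=120.0, sleep=time.sleep, rng=random.uniform):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.break_every = break_every
        self.break_seconds = break_seconds  # assumed value, not from the source
        self._sleep = sleep
        self._rng = rng
        self._count = 0

    def wait(self) -> float:
        """Call before each request; returns the number of seconds slept."""
        self._count += 1
        delay = self._rng(self.min_delay, self.max_delay)
        if self._count % self.break_every == 0:
            delay += self.break_seconds  # extended break on every Nth request
        self._sleep(delay)
        return delay
```

Calling `pacer.wait()` immediately before each Instagram request keeps the pacing logic in one place and out of the scraping code.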

TikTok Scraper (DISABLED)

  • Status: Disabled in orchestrator due to technical issues
  • Reason: GUI requirements incompatible with automated deployment
  • Code: Still available in src/tiktok_scraper_advanced.py but not active

YouTube Scraper (src/youtube_scraper.py)

  • Uses yt-dlp with authentication for metadata and transcript extraction
  • Channel: @hkia
  • Authentication: Firefox cookie extraction via YouTubeAuthHandler
  • Transcript Support: Can extract transcripts when fetch_transcripts=True
  • ⚠️ Current Limitation: YouTube's new PO token requirements (Aug 2025) block transcript extraction
    • Error: "The following content is not available on this app"
    • 179 videos identified with captions available but currently inaccessible
    • Requires yt-dlp updates to handle new YouTube restrictions

RSS Scrapers

  • MailChimp: https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985
  • Podcast: https://feeds.libsyn.com/568690/spotify
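
As a rough illustration of turning feed entries into markdown, the sketch below parses an RSS 2.0 document and emits one section per item. It uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without extra dependencies; the project itself lists feedparser for this job. The function name and the markdown layout are assumptions.

```python
import xml.etree.ElementTree as ET


def rss_items_to_markdown(rss_xml: str) -> str:
    """Render each <item> of an RSS 2.0 document as a markdown section."""
    root = ET.fromstring(rss_xml)
    sections = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)").strip()
        link = item.findtext("link", default="").strip()
        published = item.findtext("pubDate", default="").strip()
        sections.append(
            f"# {title}\n\n- Link: {link}\n- Published: {published}\n"
        )
    return "\n".join(sections)
```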

WordPress Scraper (src/wordpress_scraper.py)

  • Direct API access to hkia.com
  • Fetches blog posts with full content
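
If "direct API access" means the standard WordPress REST API (an assumption; the source does not name the endpoint), paginated post URLs could be built as in this sketch. The helper name `posts_page_url` is hypothetical; the `/wp-json/wp/v2/posts` route and its `per_page` (maximum 100) and `page` parameters are part of the core WordPress REST API.

```python
from urllib.parse import urlencode


def posts_page_url(base_url: str, page: int, per_page: int = 100) -> str:
    """Build a WordPress REST API URL for one page of posts."""
    if not 1 <= per_page <= 100:  # the REST API caps per_page at 100
        raise ValueError("per_page must be between 1 and 100")
    query = urlencode({"per_page": per_page, "page": page})
    return f"{base_url.rstrip('/')}/wp-json/wp/v2/posts?{query}"
```

A scraper would request page 1, 2, 3, and so on until the API reports no further pages.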

Technical Stack

  • Python: 3.11+ with UV package manager
  • Key Dependencies:
    • instaloader (Instagram)
    • scrapling[all] (TikTok anti-bot; only used by the disabled TikTok scraper)
    • yt-dlp (YouTube)
    • feedparser (RSS)
    • markdownify (HTML conversion)
  • Testing: pytest with comprehensive mocking

Deployment Strategy

Production Setup - systemd Services

With TikTok disabled, deployment no longer requires GUI access and is no longer subject to containerization restrictions.

# Service files location (✅ INSTALLED)
/etc/systemd/system/hkia-scraper.service
/etc/systemd/system/hkia-scraper.timer
/etc/systemd/system/hkia-scraper-nas.service
/etc/systemd/system/hkia-scraper-nas.timer

# Working directory
/home/ben/dev/hvac-kia-content/

# Installation script
./install-hkia-services.sh

# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"

Schedule (✅ ACTIVE)

  • Main Scraping: 8:00 AM and 12:00 PM Atlantic Daylight Time (5 sources)
  • NAS Sync: 8:30 AM and 12:30 PM (30 minutes after scraping)
  • User: ben (GUI environment available but not required)
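
The timing relationship in the schedule above can be expressed in a few lines. This is only an illustration of the arithmetic; the real schedule lives in the systemd timer units, and the function names here are hypothetical. Datetimes are naive and assumed to be local Atlantic time.

```python
from datetime import datetime, time, timedelta

SCRAPE_TIMES = [time(8, 0), time(12, 0)]  # from the schedule above
NAS_DELAY = timedelta(minutes=30)


def next_scrape(now: datetime) -> datetime:
    """Return the next scheduled scrape at or after `now` (local time)."""
    for slot in SCRAPE_TIMES:
        candidate = datetime.combine(now.date(), slot)
        if candidate >= now:
            return candidate
    # Past the last slot today: roll over to tomorrow's first slot.
    return datetime.combine(now.date() + timedelta(days=1), SCRAPE_TIMES[0])


def nas_sync_after(scrape: datetime) -> datetime:
    """NAS sync is scheduled 30 minutes after the given scrape time."""
    return scrape + NAS_DELAY
```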

Environment Variables

# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_PASSWORD=<redacted; keep the real value only in .env>
YOUTUBE_CHANNEL=@hkia
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
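
A startup check for these variables could look like the sketch below, which parses `KEY=VALUE` text and reports any required keys that are missing. The function names and the choice of which keys count as required are assumptions for illustration; the project may load its environment differently.

```python
REQUIRED_KEYS = (
    "INSTAGRAM_USERNAME",
    "INSTAGRAM_PASSWORD",
    "YOUTUBE_CHANNEL",
    "NAS_PATH",
    "TIMEZONE",
)  # illustrative subset, not confirmed by the source


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as in the XAUTHORITY line.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Failing fast when `missing_keys` returns a non-empty list avoids a scheduled run that starts scraping and only later discovers that credentials or the NAS path are unset.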

Commands

Testing

# Test individual sources
uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mailchimp|podcast]

# Test backlog processing
uv run python test_real_data.py --type backlog --items 50

# Test cumulative markdown system
uv run python test_cumulative_mode.py

# Full test suite
uv run pytest tests/ -v

# Test with specific GUI environment for TikTok
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python test_real_data.py --source tiktok

# Test YouTube transcript extraction (currently blocked by YouTube)
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python youtube_backlog_all_with_transcripts.py

Production Operations

# Service management (✅ ACTIVE SERVICES)
sudo systemctl status hkia-scraper.timer
sudo systemctl status hkia-scraper-nas.timer
sudo journalctl -f -u hkia-scraper.service
sudo journalctl -f -u hkia-scraper-nas.service

# Manual runs (for testing)
uv run python run_production_with_images.py
uv run python -m src.orchestrator --sources youtube instagram
uv run python -m src.orchestrator --nas-only

# Legacy commands (still work)
uv run python -m src.orchestrator
uv run python run_production_cumulative.py

Critical Notes

  1. TikTok Scraper: DISABLED - No longer blocks deployment or requires GUI access
  2. Instagram Rate Limiting: 100 requests/hour with exponential backoff
  3. YouTube Transcript Limitations: As of August 2025, YouTube blocks transcript extraction
    • PO token requirements prevent yt-dlp access to subtitle/caption data
    • 179 videos identified with captions but currently inaccessible
    • Authentication system works but content restricted at platform level
  4. State Files: Located in data/markdown_current/.state/ directory for incremental updates
  5. Archive Management: Previous files automatically moved to timestamped archives
  6. Error Recovery: All scrapers handle rate limits and network failures gracefully
  7. Production Services: Fully automated with systemd timers running twice daily
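
The exponential backoff mentioned in note 2 and the error recovery in note 6 could be implemented along these lines. This is a generic sketch, not the project's code: the function name, the default delays, and the attempt limit are assumptions.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**n (capped) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

In practice the `except` clause would be narrowed to the specific rate-limit and network exceptions a scraper expects, so that programming errors are not retried.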

YouTube Transcript Investigation (August 2025)

Objective: Extract transcripts for 179 YouTube videos identified as having captions available.

Investigation Findings:

  • 179 videos identified with captions from existing YouTube data
  • Existing authentication system (YouTubeAuthHandler + Firefox cookies) working
  • Transcript extraction code properly implemented in YouTubeScraper
  • Platform restrictions blocking all video access as of August 2025

Technical Attempts:

  1. YouTube Data API v3: Requires OAuth2 for captions.download (not just API keys)
  2. youtube-transcript-api: IP blocking after minimal requests
  3. yt-dlp with authentication: All videos blocked with "not available on this app"

Current Blocker: YouTube's new PO token requirements prevent access to video content and transcripts, even with valid authentication. Error: "The following content is not available on this app. Watch on the latest version of YouTube."

Resolution: Requires upstream yt-dlp updates to handle new YouTube platform restrictions.

Project Status: COMPLETE & DEPLOYED

  • 5 active sources working and tested (TikTok disabled)
  • Production deployment: systemd services installed and running
  • Automated scheduling: 8 AM & 12 PM ADT with NAS sync
  • Comprehensive testing: 68+ tests passing
  • Real-world data validation: All sources producing content
  • Full backlog processing: Verified for all active sources
  • Cumulative markdown system: Operational
  • Image downloading system: 686 images synced daily
  • NAS synchronization: Automated twice-daily sync
  • YouTube transcript extraction: Blocked by platform restrictions (not code issues)