hvac-kia-content/CLAUDE.md
Ben Reed 7e5377e7b1 docs: Update all documentation to use hkia naming convention
Documentation Updates:
- Updated project specification with hkia naming and paths
- Modified all markdown documentation files (12 files updated)
- Changed service names from hvac-content-* to hkia-content-*
- Updated NAS paths from /mnt/nas/hvacknowitall to /mnt/nas/hkia
- Replaced all instances of "HVAC Know It All" with "HKIA"

Files Updated:
- README.md - Updated service names and commands
- CLAUDE.md - Updated environment variables and paths
- DEPLOY.md - Updated deployment instructions
- docs/project_specification.md - Updated naming convention specs
- docs/status.md - Updated project status with new naming
- docs/final_status.md - Updated completion status
- docs/deployment_strategy.md - Updated deployment paths
- docs/DEPLOYMENT_CHECKLIST.md - Updated checklist items
- docs/PRODUCTION_TODO.md - Updated production tasks
- BACKLOG_STATUS.md - Updated backlog references
- UPDATED_CAPTURE_STATUS.md - Updated capture status
- FINAL_TALLY_REPORT.md - Updated tally report

Notes:
- Repository name remains hvacknowitall-content (unchanged)
- Project directory remains hvac-kia-content (unchanged)
- All user-facing outputs now use clean "hkia" naming

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-19 13:40:27 -03:00


HKIA Content Aggregation System

Project Overview

A content aggregation system that scrapes six sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram, TikTok), converts the content to Markdown, and runs twice daily with incremental updates.

Architecture

  • Base Pattern: Abstract scraper class with common interface
  • State Management: JSON-based incremental update tracking
  • Parallel Processing: 5 sources run in parallel, TikTok separate (GUI requirement)
  • Output Format: hkia_[source]_[timestamp].md
  • Archive System: Previous files archived to timestamped directories
  • NAS Sync: Automated rsync to /mnt/nas/hkia/
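
The base pattern, JSON state tracking, and output naming listed above could fit together roughly as follows. This is a minimal sketch, not the project's actual code: the class name `BaseScraper`, the `fetch_items` method, the `seen_ids` state layout, the state file name, and the timestamp format are all assumptions made for illustration.

```python
import json
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path
from typing import Optional


class BaseScraper(ABC):
    """Hypothetical common interface; the real classes in src/ may differ."""

    def __init__(self, source: str, state_dir: Path, output_dir: Path):
        self.source = source
        # Assumed state file naming: one JSON file per source.
        self.state_file = state_dir / f"{source}_state.json"
        self.output_dir = output_dir

    @abstractmethod
    def fetch_items(self) -> list:
        """Return dicts with at least 'id' and 'markdown' keys (assumed shape)."""

    def load_seen_ids(self) -> set:
        if not self.state_file.exists():
            return set()
        return set(json.loads(self.state_file.read_text())["seen_ids"])

    def save_seen_ids(self, seen: set) -> None:
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        self.state_file.write_text(json.dumps({"seen_ids": sorted(seen)}))

    def run(self, now: Optional[datetime] = None) -> Optional[Path]:
        """Write only unseen items to hkia_[source]_[timestamp].md."""
        seen = self.load_seen_ids()
        new_items = [i for i in self.fetch_items() if i["id"] not in seen]
        if not new_items:
            return None  # nothing new since the last run
        # Assumed timestamp format; the notes only give the general pattern.
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        self.output_dir.mkdir(parents=True, exist_ok=True)
        out = self.output_dir / f"hkia_{self.source}_{stamp}.md"
        out.write_text("\n\n".join(i["markdown"] for i in new_items))
        self.save_seen_ids(seen | {i["id"] for i in new_items})
        return out
```

A concrete scraper would subclass this and implement only `fetch_items`; a second call to `run` with no new items writes nothing, which is the incremental behaviour the state file exists to provide.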

Key Implementation Details

Instagram Scraper (src/instagram_scraper.py)

  • Uses instaloader with session persistence
  • Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
  • Session file: instagram_session_hkia1.session
  • Authentication: Username hkia1, password I22W5YlbRl7x
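
The pacing described above (15-30 second delays, with an extended break every 5 requests) can be sketched independently of instaloader. The length of the extended break is not stated in these notes, so the 60-120 second range below is an assumption, as are the function names.

```python
import random
import time

BASE_DELAY_RANGE = (15.0, 30.0)        # seconds, from the notes above
BREAK_EVERY = 5                        # requests, from the notes above
EXTENDED_BREAK_RANGE = (60.0, 120.0)   # assumption: break length is undocumented


def delay_before_request(request_number: int, rng: random.Random) -> float:
    """Return the pause in seconds to take before a 1-based request number."""
    delay = rng.uniform(*BASE_DELAY_RANGE)
    # After every BREAK_EVERY completed requests, add an extended break.
    if request_number > 1 and (request_number - 1) % BREAK_EVERY == 0:
        delay += rng.uniform(*EXTENDED_BREAK_RANGE)
    return delay


def paced(items, rng=None, sleep=time.sleep):
    """Yield items one at a time, sleeping between consecutive items."""
    rng = rng or random.Random()
    for n, item in enumerate(items, start=1):
        if n > 1:
            sleep(delay_before_request(n, rng))
        yield item
```

Passing `sleep` as a parameter keeps the pacing logic testable without actually waiting.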

TikTok Scraper (src/tiktok_scraper_advanced.py)

  • Evades anti-bot detection using Scrapling + Camoufox
  • Requires headed browser with DISPLAY=:0
  • Stealth features: geolocation spoofing, OS randomization, WebGL support
  • Cannot be containerized due to GUI requirements
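
Because the TikTok scraper depends on a headed browser, it may help to fail fast when no display is configured. The check below is a sketch only; `require_display` and the exception name are hypothetical and not part of the project.

```python
import os


class HeadedBrowserUnavailable(RuntimeError):
    """Raised when no X display is configured for the headed browser."""


def require_display(env=None) -> str:
    """Return the DISPLAY value, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    display = env.get("DISPLAY", "").strip()
    if not display:
        raise HeadedBrowserUnavailable(
            "DISPLAY is not set; the TikTok scraper needs a desktop session"
        )
    return display
```

Calling this at the start of the TikTok run turns a confusing browser launch failure into an explicit error message.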

YouTube Scraper (src/youtube_scraper.py)

  • Uses yt-dlp for metadata extraction
  • Channel: @hkia
  • Fetches video metadata without downloading videos
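
Metadata-only extraction with yt-dlp might look like the sketch below. The option keys (`skip_download`, `extract_flat`, `quiet`) and `extract_info(..., download=False)` are standard yt-dlp usage; the helper names, the use of the channel's `/videos` tab, and the markdown layout are assumptions.

```python
def channel_videos_url(handle: str) -> str:
    """Build a channel 'videos' tab URL from a handle such as '@hkia'."""
    return f"https://www.youtube.com/{handle}/videos"


def build_ydl_options() -> dict:
    return {
        "skip_download": True,  # metadata only, never fetch media
        "extract_flat": True,   # list entries without resolving each video
        "quiet": True,
    }


def fetch_channel_entries(handle: str) -> list:
    """Return flat entry dicts for a channel (requires network and yt-dlp)."""
    import yt_dlp  # imported lazily so the pure helpers work without it

    with yt_dlp.YoutubeDL(build_ydl_options()) as ydl:
        info = ydl.extract_info(channel_videos_url(handle), download=False)
    return list(info.get("entries") or [])


def entry_to_markdown(entry: dict) -> str:
    """Render one entry as a small markdown section (assumed layout)."""
    title = entry.get("title") or "(untitled)"
    url = entry.get("url") or entry.get("webpage_url") or ""
    return f"## {title}\n\n{url}\n"
```

Flat extraction returns less detail per video than a full extraction, so the real scraper may well resolve each entry individually.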

RSS Scrapers

  • MailChimp: https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985
  • Podcast: https://feeds.libsyn.com/568690/spotify
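
Both feeds can be read the same way with feedparser. The sketch below assumes each entry exposes `title`, `link`, and `published`, which is typical but not guaranteed for every feed; the function names and markdown layout are illustrative.

```python
def parse_feed(url: str) -> list:
    """Fetch a feed and reduce each entry to a plain dict (requires network)."""
    import feedparser  # imported lazily so the formatter works without it

    feed = feedparser.parse(url)
    return [
        {
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", ""),
        }
        for entry in feed.entries
    ]


def items_to_markdown(items: list) -> str:
    """Render feed items as markdown sections (assumed layout)."""
    sections = []
    for item in items:
        lines = [f"## {item['title']}"]
        if item.get("published"):
            lines.append(f"Published: {item['published']}")
        if item.get("link"):
            lines.append(item["link"])
        sections.append("\n\n".join(lines))
    return "\n\n".join(sections)
```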

WordPress Scraper (src/wordpress_scraper.py)

  • Direct API access to hkia.com
  • Fetches blog posts with full content
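
The notes do not say which API route is used. Assuming the standard WordPress REST route `/wp-json/wp/v2/posts`, fetching posts and converting them to markdown could be sketched as follows; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def posts_url(base_url: str, page: int = 1, per_page: int = 20) -> str:
    """Build a paginated posts URL (assumes the standard WP REST route)."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{base_url.rstrip('/')}/wp-json/wp/v2/posts?{query}"


def fetch_posts(base_url: str, page: int = 1) -> list:
    """Fetch one page of posts as parsed JSON (requires network)."""
    with urlopen(posts_url(base_url, page), timeout=30) as resp:
        return json.load(resp)


def post_to_markdown(post: dict, html_to_md=None) -> str:
    """Convert one post to markdown; defaults to markdownify for the body."""
    if html_to_md is None:
        from markdownify import markdownify as html_to_md
    title = post["title"]["rendered"]
    body = html_to_md(post["content"]["rendered"]).strip()
    return f"# {title}\n\n{body}\n"
```

The `title.rendered` and `content.rendered` fields are part of the standard post schema; a site with a customised API may expose different fields.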

Technical Stack

  • Python: 3.11+ with UV package manager
  • Key Dependencies:
    • instaloader (Instagram)
    • scrapling[all] (TikTok anti-bot)
    • yt-dlp (YouTube)
    • feedparser (RSS)
    • markdownify (HTML conversion)
  • Testing: pytest with comprehensive mocking
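
As an illustration of the mocking approach, a test can replace the network-facing client with a `MagicMock` so that no real request is made. The function `collect_titles` and the client's `fetch` method are hypothetical stand-ins, not names from the project's test suite.

```python
from unittest.mock import MagicMock


def collect_titles(client) -> list:
    """Hypothetical function under test: pulls titles from a client object."""
    return [item["title"] for item in client.fetch()]


def test_collect_titles_uses_mocked_client():
    # The mock stands in for a scraper client that would hit the network.
    client = MagicMock()
    client.fetch.return_value = [{"title": "Heat pumps"}, {"title": "Ductwork"}]

    assert collect_titles(client) == ["Heat pumps", "Ductwork"]
    client.fetch.assert_called_once_with()
```

pytest collects any function named `test_*`, so this runs unchanged under `uv run pytest`.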

Deployment Strategy

⚠️ IMPORTANT: systemd Services (Not Kubernetes)

The project was originally planned for Kubernetes, but the TikTok scraper requires a headed browser with DISPLAY=:0, which rules out containerization.

Production Setup

# Service files location
/etc/systemd/system/hkia-scraper.service
/etc/systemd/system/hkia-scraper.timer
/etc/systemd/system/hkia-scraper-nas.service  
/etc/systemd/system/hkia-scraper-nas.timer

# Installation directory
/opt/hvac-kia-content/

# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"

Schedule

  • Main Scraping: 8:00 AM and 12:00 PM Atlantic time (America/Halifax)
  • NAS Sync: 30 minutes after each scraping run
  • User: ben (requires GUI access for TikTok)
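
The schedule above is implemented with systemd timers, but the arithmetic is easy to check in Python. This sketch only illustrates the timing rules; the function names are made up, and `now` is expected to be local Atlantic time (for example `datetime.now(ZoneInfo("America/Halifax"))`).

```python
from datetime import datetime, time, timedelta

RUN_TIMES = (time(8, 0), time(12, 0))      # main scraping runs
NAS_SYNC_OFFSET = timedelta(minutes=30)    # sync follows each scrape


def next_scrape(now: datetime) -> datetime:
    """Return the first scheduled run strictly after `now`."""
    for run_time in RUN_TIMES:
        candidate = datetime.combine(now.date(), run_time, tzinfo=now.tzinfo)
        if candidate > now:
            return candidate
    # Both runs have passed today, so roll over to tomorrow's first run.
    tomorrow = now.date() + timedelta(days=1)
    return datetime.combine(tomorrow, RUN_TIMES[0], tzinfo=now.tzinfo)


def nas_sync_for(scrape_time: datetime) -> datetime:
    """Return when the NAS sync for a given scrape run is due."""
    return scrape_time + NAS_SYNC_OFFSET
```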

Environment Variables

# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_PASSWORD=I22W5YlbRl7x
YOUTUBE_CHANNEL=@hkia
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
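
A startup check that the required variables are present can catch a misconfigured `.env` early. The parser below is a deliberately simple sketch that handles `KEY=value` lines, comments, and surrounding quotes; it is an assumption that the project does not already rely on a library such as python-dotenv for this.

```python
REQUIRED_KEYS = (
    "INSTAGRAM_USERNAME",
    "INSTAGRAM_PASSWORD",
    "YOUTUBE_CHANNEL",
    "TIKTOK_USERNAME",
    "NAS_PATH",
    "TIMEZONE",
    "DISPLAY",
    "XAUTHORITY",
)


def parse_env(text: str) -> dict:
    """Parse KEY=value lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes, as used for XAUTHORITY above.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict) -> list:
    """Return required keys that are absent or empty, in declaration order."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Reporting the names of missing keys, rather than their values, keeps credentials out of logs.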

Commands

Testing

# Test individual sources
uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mailchimp|podcast]

# Test backlog processing  
uv run python test_real_data.py --type backlog --items 50

# Full test suite
uv run pytest tests/ -v

Production Operations

# Run orchestrator manually
uv run python -m src.orchestrator

# Run specific sources
uv run python -m src.orchestrator --sources youtube instagram

# NAS sync only
uv run python -m src.orchestrator --nas-only

# Check service status
sudo systemctl status hkia-scraper.service
sudo journalctl -f -u hkia-scraper.service
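
The `--sources` and `--nas-only` flags shown above, together with the rule that TikTok runs apart from the five parallel sources, suggest an orchestrator shaped roughly like this. It is a sketch under assumptions: the real `src.orchestrator` may use processes rather than threads, and `run_one` is a placeholder for whatever runs a single scraper.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor

ALL_SOURCES = ("wordpress", "mailchimp", "podcast", "youtube", "instagram", "tiktok")
SEQUENTIAL_SOURCES = {"tiktok"}  # needs the GUI session, so it runs on its own


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="orchestrator")
    parser.add_argument("--sources", nargs="+", choices=ALL_SOURCES,
                        default=list(ALL_SOURCES))
    parser.add_argument("--nas-only", action="store_true")
    return parser


def run_sources(sources, run_one) -> dict:
    """Run non-GUI sources in parallel, then GUI-bound sources one by one."""
    parallel = [s for s in sources if s not in SEQUENTIAL_SOURCES]
    sequential = [s for s in sources if s in SEQUENTIAL_SOURCES]
    results = {}
    if parallel:
        with ThreadPoolExecutor(max_workers=len(parallel)) as pool:
            results.update(zip(parallel, pool.map(run_one, parallel)))
    for source in sequential:
        results[source] = run_one(source)
    return results
```

Threads suit this workload because the scrapers spend most of their time waiting on network responses.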

Critical Notes

  1. TikTok GUI Requirement: Must run on desktop environment with DISPLAY=:0
  2. Instagram Rate Limiting: 100 requests/hour with exponential backoff
  3. State Files: Located in state/ directory for incremental updates
  4. Archive Management: Previous files automatically moved to timestamped archives
  5. Error Recovery: All scrapers handle rate limits and network failures gracefully
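
Exponential backoff, mentioned in note 2, can be expressed as a small retry helper. The base delay, cap, retry count, and jitter below are illustrative assumptions; the notes do not give the values the scrapers actually use.

```python
import random
import time


def backoff_delays(max_retries: int, base: float = 2.0, cap: float = 300.0) -> list:
    """Return delays that double each attempt: base, 2*base, 4*base, up to cap."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def retry_with_backoff(func, max_retries=5, base=2.0, cap=300.0,
                       retry_on=(Exception,), sleep=time.sleep,
                       jitter=random.random):
    """Call func, retrying on failure with exponentially growing pauses."""
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            # A little random jitter avoids retrying in lockstep.
            sleep(delays[attempt] + jitter())
```

Narrowing `retry_on` to rate-limit and connection errors keeps genuine bugs from being retried silently.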

Project Status: COMPLETE

  • All 6 sources working and tested
  • Production deployment ready via systemd
  • Comprehensive testing completed (68+ tests passing)
  • Real-world data validation completed
  • Full backlog processing capability verified