# HKIA Content Aggregation System

## Project Overview
A content aggregation system that scrapes six sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram, TikTok), converts the content to Markdown, and runs twice daily with incremental updates.

## Architecture
- **Base Pattern**: Abstract scraper class with common interface
- **State Management**: JSON-based incremental update tracking
- **Parallel Processing**: 5 sources run in parallel, TikTok separate (GUI requirement)
- **Output Format**: `hkia_[source]_[timestamp].md`
- **Archive System**: Previous files archived to timestamped directories
- **NAS Sync**: Automated rsync to `/mnt/nas/hkia/`
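
The base pattern, JSON state tracking, and output naming described above could fit together roughly as follows. This is a minimal sketch, not the project's actual class: `BaseScraper`, `fetch`, `load_seen`, `save_seen`, the `seen_ids` key, and the `%Y%m%d_%H%M%S` timestamp format are all assumptions made for illustration.

```python
from __future__ import annotations

import json
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path


class BaseScraper(ABC):
    """Hypothetical common interface shared by all source scrapers."""

    source = "base"  # overridden per source, e.g. "youtube"

    def __init__(self, state_dir: Path, output_dir: Path):
        self.state_file = state_dir / f"{self.source}_state.json"
        self.output_dir = output_dir

    @abstractmethod
    def fetch(self) -> list[dict]:
        """Return raw items, each with at least an 'id' and a 'title'."""

    def load_seen(self) -> set[str]:
        # State is a small JSON file listing the IDs already exported.
        if self.state_file.exists():
            return set(json.loads(self.state_file.read_text())["seen_ids"])
        return set()

    def save_seen(self, seen: set[str]) -> None:
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        self.state_file.write_text(json.dumps({"seen_ids": sorted(seen)}))

    def output_path(self, now: datetime) -> Path:
        # Follows hkia_[source]_[timestamp].md; the timestamp layout is assumed.
        return self.output_dir / f"hkia_{self.source}_{now:%Y%m%d_%H%M%S}.md"

    def run(self, now: datetime) -> Path | None:
        """Write only items not seen before; return None if nothing is new."""
        seen = self.load_seen()
        new_items = [item for item in self.fetch() if item["id"] not in seen]
        if not new_items:
            return None
        path = self.output_path(now)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("\n\n".join(f"# {item['title']}" for item in new_items))
        self.save_seen(seen | {item["id"] for item in new_items})
        return path
```

A concrete scraper would only need to set `source` and implement `fetch`; the incremental behaviour comes from the shared `run` method.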

## Key Implementation Details

### Instagram Scraper (`src/instagram_scraper.py`)
- Uses `instaloader` with session persistence
- Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
- Session file: `instagram_session_hkia1.session`
- Authentication: username `hkia1`; password supplied via `INSTAGRAM_PASSWORD` in `/opt/hvac-kia-content/.env`
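
The pacing rule above (15-30 second delays, with an extended break every 5 requests) can be sketched independently of `instaloader`. The helper names `pause_seconds` and `paced` and the 60-120 second break length are assumptions; the source does not say how long the extended break is.

```python
import random
import time


def pause_seconds(request_number, rng, break_every=5, break_range=(60.0, 120.0)):
    """Delay to wait after the given 1-based request number."""
    delay = rng.uniform(15.0, 30.0)            # 15-30 s between requests
    if request_number % break_every == 0:      # extended break every 5 requests
        delay += rng.uniform(*break_range)     # break length is an assumption
    return delay


def paced(items, sleep=time.sleep, rng=None):
    """Yield items one at a time, sleeping after each one is consumed."""
    if rng is None:
        rng = random.Random()
    for number, item in enumerate(items, start=1):
        yield item
        sleep(pause_seconds(number, rng))
```

Wrapping the post iterator as `for post in paced(posts): ...` keeps the rate limiting in one place, and passing a fake `sleep` makes the schedule testable without waiting.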

### TikTok Scraper (`src/tiktok_scraper_advanced.py`)
- Evades anti-bot detection using Scrapling with the Camoufox browser
- **Requires headed browser with DISPLAY=:0**
- Stealth features: geolocation spoofing, OS randomization, WebGL support
- Cannot be containerized due to GUI requirements
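
Because the TikTok scraper depends on a headed browser, it can help to fail fast when no display is available. The guard below is a hypothetical sketch (`require_display` and `DisplayUnavailable` are not names from the project) that only checks the `DISPLAY` variable before any browser is launched.

```python
import os


class DisplayUnavailable(RuntimeError):
    """Raised when no X display is configured for the headed browser."""


def require_display(env=None):
    """Return the DISPLAY value, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    display = env.get("DISPLAY", "")
    if not display:
        raise DisplayUnavailable(
            "DISPLAY is not set; the TikTok scraper needs a headed browser "
            "session (expected DISPLAY=:0)"
        )
    return display
```

Calling this at the start of the TikTok run lets an orchestrator skip that source with a clear message instead of failing inside the browser launch.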

### YouTube Scraper (`src/youtube_scraper.py`)
- Uses `yt-dlp` for metadata extraction
- Channel: `@HVACKnowItAll`
- Fetches video metadata without downloading videos
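
Metadata-only extraction with `yt-dlp` might look like the sketch below. The option set, the `/videos` URL suffix, and both function names are assumptions rather than the project's code; `yt_dlp` is imported lazily so the formatting helper can be used without it.

```python
def fetch_channel_entries(channel="@HVACKnowItAll", limit=10):
    """List recent channel videos without downloading any media."""
    import yt_dlp  # lazy import: only needed for the network call

    options = {
        "quiet": True,
        "skip_download": True,   # metadata only
        "extract_flat": True,    # list entries without resolving each video
        "playlistend": limit,    # stop after the first `limit` entries
    }
    url = f"https://www.youtube.com/{channel}/videos"
    with yt_dlp.YoutubeDL(options) as ydl:
        info = ydl.extract_info(url, download=False)
    return list(info.get("entries") or [])


def entry_to_markdown(entry):
    """Render one metadata entry as a small markdown section."""
    lines = [
        f"# {entry.get('title', 'Untitled')}",
        "",
        f"- URL: https://www.youtube.com/watch?v={entry.get('id', '')}",
    ]
    if entry.get("duration") is not None:
        lines.append(f"- Duration: {int(entry['duration'])} s")
    if entry.get("view_count") is not None:
        lines.append(f"- Views: {entry['view_count']}")
    return "\n".join(lines)
```

Flat extraction returns less detail per video than a full lookup, so fields such as `duration` or `view_count` are treated as optional here.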

### RSS Scrapers
- **MailChimp**: `https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985`
- **Podcast**: `https://feeds.libsyn.com/568690/spotify`
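
Both feeds can be read with `feedparser` and filtered against previously seen IDs. This is a sketch under assumptions: `fetch_feed` and `new_entries` are hypothetical helpers, and falling back to the entry link when a feed omits an ID is an illustrative choice.

```python
def fetch_feed(url):
    """Download and parse a feed, returning its entries."""
    import feedparser  # lazy import: only needed for the network call

    return list(feedparser.parse(url).entries)


def new_entries(entries, seen_ids):
    """Keep entries whose ID (or link, if no ID) has not been seen yet."""
    fresh = []
    for entry in entries:
        entry_id = entry.get("id") or entry.get("link")
        if entry_id and entry_id not in seen_ids:
            fresh.append(entry)
    return fresh
```

For example, `new_entries(fetch_feed("https://feeds.libsyn.com/568690/spotify"), seen_ids)` would return only the podcast episodes added since the last run.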

### WordPress Scraper (`src/wordpress_scraper.py`)
- Direct API access to `hkia.com`
- Fetches blog posts with full content
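
The source does not say which API is used. Assuming it is the standard WordPress REST API at `/wp-json/wp/v2/posts`, fetching and converting posts could be sketched as below; the function names and default page size are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def posts_url(base="https://hkia.com", page=1, per_page=20, after=None):
    """Build a posts query; `after` is an ISO 8601 date for incremental runs."""
    params = {"page": page, "per_page": per_page, "orderby": "date", "order": "desc"}
    if after:
        params["after"] = after
    return f"{base}/wp-json/wp/v2/posts?{urlencode(params)}"


def fetch_posts(url):
    """Return the decoded JSON list of posts for one page."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def post_to_markdown(post):
    """Convert one post's rendered HTML to markdown."""
    from markdownify import markdownify  # lazy import of the HTML converter

    title = post["title"]["rendered"]
    body = markdownify(post["content"]["rendered"])
    return f"# {title}\n\n{body}"
```

Passing the timestamp of the last successful run as `after` is one way to keep updates incremental without re-reading the whole archive.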

## Technical Stack
- **Python**: 3.11+ with UV package manager
- **Key Dependencies**: 
  - `instaloader` (Instagram)
  - `scrapling[all]` (TikTok anti-bot)
  - `yt-dlp` (YouTube)
  - `feedparser` (RSS)
  - `markdownify` (HTML conversion)
- **Testing**: pytest with comprehensive mocking

## Deployment Strategy

### ⚠️ IMPORTANT: systemd Services (Not Kubernetes)
The system was originally planned for Kubernetes, but the **TikTok scraper requires a headed browser with DISPLAY=:0**, which rules out containerization.

### Production Setup
```bash
# Service files location
/etc/systemd/system/hvac-scraper.service
/etc/systemd/system/hvac-scraper.timer
/etc/systemd/system/hvac-scraper-nas.service  
/etc/systemd/system/hvac-scraper-nas.timer

# Installation directory
/opt/hvac-kia-content/

# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```

### Schedule
- **Main Scraping**: 8:00 AM and 12:00 PM Atlantic time (`America/Halifax`)
- **NAS Sync**: 30 minutes after each scraping run
- **User**: ben (requires GUI access for TikTok)

## Environment Variables
```bash
# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_PASSWORD=<redacted>
YOUTUBE_CHANNEL=@HVACKnowItAll
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```
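
A small loader can validate these variables once at startup. The sketch below assumes the values are already present in the process environment (how `.env` is loaded is not stated here); the `Settings` class and `REQUIRED` tuple are hypothetical names.

```python
import os
from dataclasses import dataclass, field

REQUIRED = (
    "INSTAGRAM_USERNAME", "INSTAGRAM_PASSWORD", "YOUTUBE_CHANNEL",
    "TIKTOK_USERNAME", "NAS_PATH", "TIMEZONE", "DISPLAY", "XAUTHORITY",
)


@dataclass(frozen=True)
class Settings:
    instagram_username: str
    instagram_password: str = field(repr=False)  # keep the secret out of logs
    youtube_channel: str
    tiktok_username: str
    nas_path: str
    timezone: str
    display: str
    xauthority: str

    @classmethod
    def from_env(cls, env=None):
        """Build settings from a mapping, failing loudly on missing names."""
        env = os.environ if env is None else env
        missing = [name for name in REQUIRED if not env.get(name)]
        if missing:
            raise KeyError(f"missing environment variables: {', '.join(missing)}")
        return cls(**{name.lower(): env[name] for name in REQUIRED})
```

Reporting every missing variable in one error avoids fixing the configuration one name at a time.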

## Commands

### Testing
```bash
# Test individual sources
uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mailchimp|podcast]

# Test backlog processing  
uv run python test_real_data.py --type backlog --items 50

# Full test suite
uv run pytest tests/ -v
```

### Production Operations
```bash
# Run orchestrator manually
uv run python -m src.orchestrator

# Run specific sources
uv run python -m src.orchestrator --sources youtube instagram

# NAS sync only
uv run python -m src.orchestrator --nas-only

# Check service status
sudo systemctl status hvac-scraper.service
sudo journalctl -f -u hvac-scraper.service
```
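
The `--sources` and `--nas-only` flags, together with the "five in parallel, TikTok separately" rule, suggest an orchestrator shaped roughly like this. It is a sketch under assumptions: the use of threads, the `run_one` callback, and the choice to return exceptions instead of raising them are illustrative, not taken from `src/orchestrator.py`.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor

ALL_SOURCES = ["wordpress", "mailchimp", "podcast", "youtube", "instagram", "tiktok"]


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="orchestrator")
    parser.add_argument("--sources", nargs="+", choices=ALL_SOURCES, default=ALL_SOURCES)
    parser.add_argument("--nas-only", action="store_true")
    return parser.parse_args(argv)


def _safe(run_one, source):
    """Run one source and capture its exception so others keep going."""
    try:
        return run_one(source)
    except Exception as exc:
        return exc


def run_sources(sources, run_one):
    """Run non-TikTok sources concurrently, then TikTok on its own."""
    parallel = [s for s in sources if s != "tiktok"]
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(parallel))) as pool:
        outcomes = pool.map(lambda s: _safe(run_one, s), parallel)
        results.update(zip(parallel, outcomes))
    if "tiktok" in sources:
        results["tiktok"] = _safe(run_one, "tiktok")  # needs the GUI session
    return results
```

Running TikTok after the pool has finished keeps the headed browser from competing with the other scrapers, at the cost of a slightly longer total run.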

## Critical Notes

1. **TikTok GUI Requirement**: Must run on desktop environment with DISPLAY=:0
2. **Instagram Rate Limiting**: 100 requests/hour with exponential backoff
3. **State Files**: Located in `state/` directory for incremental updates
4. **Archive Management**: Previous files automatically moved to timestamped archives
5. **Error Recovery**: All scrapers handle rate limits and network failures gracefully
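
Exponential backoff, as mentioned for Instagram rate limits, can be sketched as a retry helper. The base delay, growth factor, cap, and attempt count below are assumed values, and `call_with_backoff` is a hypothetical name.

```python
import time


def backoff_delays(base=60.0, factor=2.0, cap=3600.0, attempts=5):
    """Yield one delay per attempt, growing geometrically up to `cap`."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def call_with_backoff(func, retry_on=(Exception,), sleep=time.sleep, **schedule):
    """Call func, sleeping with growing delays between failed attempts."""
    delays = list(backoff_delays(**schedule))
    for attempt, delay in enumerate(delays, start=1):
        try:
            return func()
        except retry_on:
            if attempt == len(delays):
                raise  # out of attempts: surface the last error
            sleep(delay)
```

Narrowing `retry_on` to the rate-limit and network exception types of the library in use avoids retrying on errors that will never succeed, such as bad credentials.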

## Project Status: ✅ COMPLETE
- All 6 sources working and tested
- Production deployment ready via systemd
- Comprehensive testing completed (68+ tests passing)
- Real-world data validation completed
- Full backlog processing capability verified