# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

# HKIA Content Aggregation System

## Project Overview
A complete content aggregation system that scrapes 5 sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram), converts the content to markdown, and runs twice daily with incremental updates. The TikTok scraper is disabled because its GUI requirements are incompatible with automated deployment.

## Architecture
- **Base Pattern**: Abstract scraper class with common interface
- **State Management**: JSON-based incremental update tracking
- **Parallel Processing**: All 5 active sources run in parallel
- **Output Format**: `hkia_[source]_[timestamp].md`
- **Archive System**: Previous files archived to timestamped directories
- **NAS Sync**: Automated rsync to `/mnt/nas/hkia/`

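
The base pattern, JSON state tracking, and output naming listed above could fit together roughly as follows. This is a minimal sketch, not the project's actual code: `BaseScraper`, `fetch_items`, the `seen_ids` key, the `<source>_state.json` file name, and the timestamp format are all assumptions made for illustration.

```python
import json
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path


class BaseScraper(ABC):
    """Hypothetical common interface; the real base class may differ."""

    def __init__(self, source: str, state_dir: Path):
        self.source = source
        # Assumed naming: one JSON state file per source.
        self.state_file = state_dir / f"{source}_state.json"

    @abstractmethod
    def fetch_items(self) -> list[dict]:
        """Return scraped items as dicts that each carry an 'id' key."""

    def load_state(self) -> dict:
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {"seen_ids": []}

    def save_state(self, state: dict) -> None:
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        self.state_file.write_text(json.dumps(state, indent=2))

    def fetch_new_items(self) -> list[dict]:
        """Incremental update: keep only items not yet recorded in the state file."""
        state = self.load_state()
        seen = set(state["seen_ids"])
        new_items = [item for item in self.fetch_items() if item["id"] not in seen]
        state["seen_ids"] = sorted(seen | {item["id"] for item in new_items})
        self.save_state(state)
        return new_items

    def output_filename(self, now: datetime) -> str:
        """Build a name of the form hkia_[source]_[timestamp].md (timestamp format assumed)."""
        return f"hkia_{self.source}_{now.strftime('%Y%m%d_%H%M%S')}.md"
```

A concrete scraper would only need to implement `fetch_items`; calling `fetch_new_items` twice in a row returns nothing the second time, which is the behaviour incremental updates rely on.
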
## Key Implementation Details

### Instagram Scraper (`src/instagram_scraper.py`)
- Uses `instaloader` with session persistence
- Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
- Session file: `instagram_session_hkia1.session`
- Authentication: Username `hkia1`, password `I22W5YlbRl7x`
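
The Instagram rate limiting described above (15-30 second delays with an extended break every 5 requests) can be illustrated with a small helper. This is a sketch under assumptions: the `RateLimiter` class is hypothetical, and the 120-second break length is a placeholder because the length of the extended break is not stated here.

```python
import random
import time


class RateLimiter:
    """Sleep 15-30 s before each request, plus a longer break every 5th request."""

    def __init__(self, min_delay=15.0, max_delay=30.0, break_every=5,
                 break_seconds=120.0, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.break_every = break_every
        self.break_seconds = break_seconds  # assumed value, not from the project
        self._sleep = sleep                 # injectable so tests need not wait
        self.request_count = 0

    def wait(self) -> float:
        """Call before each request; returns the number of seconds slept."""
        self.request_count += 1
        delay = random.uniform(self.min_delay, self.max_delay)
        if self.request_count % self.break_every == 0:
            delay += self.break_seconds  # extended break on every 5th request
        self._sleep(delay)
        return delay
```

Randomising the delay inside the range, rather than sleeping a fixed interval, makes the request pattern less regular.
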
### ~~TikTok Scraper~~ ❌ **DISABLED**
- **Status**: Disabled in orchestrator due to technical issues
- **Reason**: GUI requirements incompatible with automated deployment
- **Code**: Still available in `src/tiktok_scraper_advanced.py` but not active

### YouTube Scraper (`src/youtube_scraper.py`)
- Uses `yt-dlp` with authentication for metadata and transcript extraction
- Channel: `@hkia`
- **Authentication**: Firefox cookie extraction via `YouTubeAuthHandler`
- **Transcript Support**: Can extract transcripts when `fetch_transcripts=True`
- ⚠️ **Current Limitation**: YouTube's new PO token requirements (Aug 2025) block transcript extraction
  - Error: "The following content is not available on this app"
  - **179 videos identified** with captions available but currently inaccessible
  - Requires `yt-dlp` updates to handle new YouTube restrictions

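
For orientation, the options below show one way `yt-dlp` can be configured for Firefox cookie authentication and optional caption extraction. It is a sketch only: `build_ydl_opts` is a hypothetical helper, the English subtitle language is an assumption, and the project's `YouTubeAuthHandler` may set things up differently. The option keys themselves are standard `yt_dlp.YoutubeDL` parameters.

```python
def build_ydl_opts(fetch_transcripts: bool = False) -> dict:
    """Build an options dict for yt_dlp.YoutubeDL (metadata only, no media download)."""
    opts = {
        "skip_download": True,               # metadata and captions only
        "quiet": True,
        "cookiesfrombrowser": ("firefox",),  # read cookies from the default Firefox profile
    }
    if fetch_transcripts:
        opts.update({
            "writesubtitles": True,      # uploaded captions
            "writeautomaticsub": True,   # auto-generated captions
            "subtitleslangs": ["en"],    # assumed language
        })
    return opts


# Usage (requires yt-dlp installed and network access):
#   import yt_dlp
#   with yt_dlp.YoutubeDL(build_ydl_opts(fetch_transcripts=True)) as ydl:
#       info = ydl.extract_info("https://www.youtube.com/@hkia", download=False)
```

While the PO token restriction is in place, a call like the one in the usage comment is expected to fail with the error quoted above regardless of these options.
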
### RSS Scrapers
- **MailChimp**: `https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985`
- **Podcast**: `https://feeds.libsyn.com/568690/spotify`

### WordPress Scraper (`src/wordpress_scraper.py`)
- Direct API access to `hkia.com`
- Fetches blog posts with full content

## Technical Stack
- **Python**: 3.11+ with UV package manager
- **Key Dependencies**:
  - `instaloader` (Instagram)
  - `scrapling[all]` (TikTok anti-bot)
  - `yt-dlp` (YouTube)
  - `feedparser` (RSS)
  - `markdownify` (HTML conversion)
- **Testing**: pytest with comprehensive mocking

## Deployment Strategy

### ✅ Production Setup - systemd Services
**TikTok disabled**: deployment no longer requires GUI access and is no longer limited by containerization restrictions.

```bash
# Service files location (✅ INSTALLED)
/etc/systemd/system/hkia-scraper.service
/etc/systemd/system/hkia-scraper.timer
/etc/systemd/system/hkia-scraper-nas.service
/etc/systemd/system/hkia-scraper-nas.timer

# Working directory
/home/ben/dev/hvac-kia-content/

# Installation script
./install-hkia-services.sh

# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```

### Schedule (✅ ACTIVE)
- **Main Scraping**: 8:00 AM and 12:00 PM Atlantic Daylight Time (5 sources)
- **NAS Sync**: 8:30 AM and 12:30 PM (30 minutes after scraping)
- **User**: ben (GUI environment available but not required)

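
The schedule itself is enforced by the systemd timers, but when debugging it can help to compute when the next run should fire. The helper below is an illustrative sketch, not part of the project: `next_runs`, `RUN_TIMES`, `NAS_OFFSET`, and `ADT` are hypothetical names, and a fixed UTC-3 offset stands in for Atlantic Daylight Time.

```python
from datetime import datetime, time, timedelta, timezone

RUN_TIMES = (time(8, 0), time(12, 0))  # main scraping runs
NAS_OFFSET = timedelta(minutes=30)     # NAS sync follows each run

# Fixed UTC-3 offset stands in for ADT; zoneinfo.ZoneInfo("America/Halifax")
# would also follow the winter switch to AST.
ADT = timezone(timedelta(hours=-3))


def next_runs(now: datetime) -> tuple[datetime, datetime]:
    """Return (next scrape, matching NAS sync) for a timezone-aware `now`."""
    for day_offset in (0, 1):
        day = (now + timedelta(days=day_offset)).date()
        for run_time in RUN_TIMES:
            candidate = datetime.combine(day, run_time, tzinfo=now.tzinfo)
            if candidate > now:
                return candidate, candidate + NAS_OFFSET
    raise AssertionError("unreachable: tomorrow always has a run")
```

For example, calling `next_runs` with 9:15 AM yields the 12:00 PM scrape and the 12:30 PM sync on the same day.
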
## Environment Variables
```bash
# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_PASSWORD=I22W5YlbRl7x
YOUTUBE_CHANNEL=@hkia
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```

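
A startup check along the following lines can catch a missing variable before any scraper runs. This is a hedged sketch: `load_config` and `REQUIRED_VARS` are hypothetical names, and treating only these five variables as mandatory is an assumption (the TikTok and GUI-related entries are left out because that scraper is disabled).

```python
import os

# Assumed subset of the variables above that the active sources need.
REQUIRED_VARS = (
    "INSTAGRAM_USERNAME",
    "INSTAGRAM_PASSWORD",
    "YOUTUBE_CHANNEL",
    "NAS_PATH",
    "TIMEZONE",
)


def load_config(env=os.environ) -> dict:
    """Return the required settings, raising if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary.
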
## Commands

### Testing
```bash
# Test individual sources
uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mailchimp|podcast]

# Test backlog processing
uv run python test_real_data.py --type backlog --items 50

# Test cumulative markdown system
uv run python test_cumulative_mode.py

# Full test suite
uv run pytest tests/ -v

# Test with specific GUI environment for TikTok
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python test_real_data.py --source tiktok

# Test YouTube transcript extraction (currently blocked by YouTube)
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python youtube_backlog_all_with_transcripts.py
```

### Production Operations
```bash
# Service management (✅ ACTIVE SERVICES)
sudo systemctl status hkia-scraper.timer
sudo systemctl status hkia-scraper-nas.timer
sudo journalctl -f -u hkia-scraper.service
sudo journalctl -f -u hkia-scraper-nas.service

# Manual runs (for testing)
uv run python run_production_with_images.py
uv run python -m src.orchestrator --sources youtube instagram
uv run python -m src.orchestrator --nas-only

# Legacy commands (still work)
uv run python -m src.orchestrator
uv run python run_production_cumulative.py
```

## Critical Notes

1. **✅ TikTok Scraper**: DISABLED - No longer blocks deployment or requires GUI access
2. **Instagram Rate Limiting**: 100 requests/hour with exponential backoff
3. **YouTube Transcript Limitations**: As of August 2025, YouTube blocks transcript extraction
   - PO token requirements prevent `yt-dlp` access to subtitle/caption data
   - 179 videos identified with captions but currently inaccessible
   - Authentication system works but content restricted at platform level
4. **State Files**: Located in `data/markdown_current/.state/` directory for incremental updates
5. **Archive Management**: Previous files automatically moved to timestamped archives
6. **Error Recovery**: All scrapers handle rate limits and network failures gracefully
7. **✅ Production Services**: Fully automated with systemd timers running twice daily

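
Notes 2 and 6 mention exponential backoff and graceful recovery from rate limits and network failures. A generic version of that idea is sketched below; `retry_with_backoff` and its default values are hypothetical and are not taken from the project's scrapers.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ... with the defaults
```

Doubling the wait after each failure gives a rate-limited service progressively more time to recover, while the attempt cap keeps a persistent outage from stalling a scheduled run indefinitely.
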
## YouTube Transcript Investigation (August 2025)

**Objective**: Extract transcripts for 179 YouTube videos identified as having captions available.

**Investigation Findings**:
- ✅ **179 videos identified** with captions from existing YouTube data
- ✅ **Existing authentication system** (`YouTubeAuthHandler` + Firefox cookies) working
- ✅ **Transcript extraction code** properly implemented in `YouTubeScraper`
- ❌ **Platform restrictions** blocking all video access as of August 2025

**Technical Attempts**:
1. **YouTube Data API v3**: Requires OAuth2 for `captions.download` (not just API keys)
2. **youtube-transcript-api**: IP blocking after minimal requests
3. **yt-dlp with authentication**: All videos blocked with "not available on this app"

**Current Blocker**:
YouTube's new PO token requirements prevent access to video content and transcripts, even with valid authentication. Error: "The following content is not available on this app.. Watch on the latest version of YouTube."

**Resolution**: Requires upstream `yt-dlp` updates to handle new YouTube platform restrictions.

## Project Status: ✅ COMPLETE & DEPLOYED
- **5 active sources** working and tested (TikTok disabled)
- **✅ Production deployment**: systemd services installed and running
- **✅ Automated scheduling**: 8 AM & 12 PM ADT with NAS sync
- **✅ Comprehensive testing**: 68+ tests passing
- **✅ Real-world data validation**: All sources producing content
- **✅ Full backlog processing**: Verified for all active sources
- **✅ Cumulative markdown system**: Operational
- **✅ Image downloading system**: 686 images synced daily
- **✅ NAS synchronization**: Automated twice-daily sync
- **YouTube transcript extraction**: Blocked by platform restrictions (not code issues)