Ben Reed 34fd853874 feat: Add HVACRSchool scraper and fix all source connectivity
- Add new HVACRSchool scraper for technical articles (6th source)
- Fix WordPress API connectivity (corrected URL to hvacknowitall.com)
- Fix MailChimp RSS processing after environment consolidation
- Implement YouTube hybrid scraper (API + yt-dlp) with PO token support
- Disable YouTube transcripts due to platform restrictions (Aug 2025)
- Update orchestrator to use all 6 active sources
- Consolidate environment variables into single .env file
- Full system sync completed with all sources updating successfully
- Update documentation with current system status and capabilities

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-27 18:11:00 -03:00


# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
# HKIA Content Aggregation System
## Project Overview
Complete content aggregation system that scrapes 6 sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram, HVACRSchool), converts everything to markdown, and runs twice daily with incremental updates. The TikTok scraper is disabled due to technical issues.
## Architecture
- **Base Pattern**: Abstract scraper class (`BaseScraper`) with common interface
- **State Management**: JSON-based incremental update tracking in `data/.state/`
- **Parallel Processing**: All 6 active sources run in parallel via `ContentOrchestrator`
- **Output Format**: `hkia_[source]_[timestamp].md`
- **Archive System**: Previous files archived to timestamped directories in `data/markdown_archives/`
- **Media Downloads**: Images/thumbnails saved to `data/media/[source]/`
- **NAS Sync**: Automated rsync to `/mnt/nas/hkia/`
## Key Implementation Details
### Instagram Scraper (`src/instagram_scraper.py`)
- Uses `instaloader` with session persistence
- Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
- Session file: `instagram_session_hkia1.session`
- Authentication: Username `hkia1`, password `I22W5YlbRl7x`
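In pseudocode, that throttling policy looks something like this (the 15-30 second window comes from the note above; the length of the extended break is an assumption, since the doc doesn't give a number):

```python
import random

def next_delay(requests_made: int,
               base_range: tuple[float, float] = (15.0, 30.0),
               batch_size: int = 5,
               extended_break: float = 120.0) -> float:
    """Seconds to sleep before the next Instagram request.

    Every request waits a random 15-30 s; every 5th request adds an
    extended break (120 s here is illustrative, not the project's value).
    """
    delay = random.uniform(*base_range)
    if requests_made > 0 and requests_made % batch_size == 0:
        delay += extended_break
    return delay
```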
### ~~TikTok Scraper~~ ❌ **DISABLED**
- **Status**: Disabled in orchestrator due to technical issues
- **Reason**: GUI requirements incompatible with automated deployment
- **Code**: Still available in `src/tiktok_scraper_advanced.py` but not active
### YouTube Scraper (`src/youtube_hybrid_scraper.py`)
- **Hybrid Approach**: YouTube Data API v3 for metadata + yt-dlp for transcripts
- Channel: `@HVACKnowItAll` (38,400+ subscribers, 447 videos)
- **API Integration**: Rich metadata extraction with efficient quota usage (3 units per video)
- **Authentication**: Firefox cookie extraction + PO token support via `YouTubePOTokenHandler`
- **Transcript Status**: DISABLED due to YouTube platform restrictions (Aug 2025)
- Error: "The following content is not available on this app"
- **PO Token Implementation**: Complete but blocked by YouTube platform restrictions
- **179 videos identified** with captions available but currently inaccessible
- Will automatically resume transcript extraction when platform restrictions are lifted
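The yt-dlp side of the hybrid can be sketched as an options builder, with transcripts off by default to match the current status. The PO-token plumbing below uses yt-dlp's extractor-args mechanism and is an approximation of what `YouTubePOTokenHandler` does, not its actual code:

```python
import os

def build_transcript_opts(enable_transcripts: bool = False) -> dict:
    """yt-dlp options for caption extraction with Firefox cookies and a
    PO token; transcripts stay disabled by default per the Aug 2025 block."""
    opts = {
        "skip_download": True,          # captions only, no media
        "writesubtitles": enable_transcripts,
        "writeautomaticsub": enable_transcripts,
        "subtitleslangs": ["en"],
        "cookiesfrombrowser": ("firefox",),
    }
    po_token = os.environ.get("YOUTUBE_PO_TOKEN_MWEB_GVS")
    if po_token:
        # yt-dlp accepts PO tokens via extractor arguments
        # (CLI form: --extractor-args "youtube:po_token=mweb.gvs+TOKEN").
        opts["extractor_args"] = {"youtube": {"po_token": [f"mweb.gvs+{po_token}"]}}
    return opts
```

Passing the resulting dict to `yt_dlp.YoutubeDL(opts)` is how the scraper would request captions once the platform restrictions lift.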
### RSS Scrapers
- **MailChimp**: `https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985`
- **Podcast**: `https://feeds.libsyn.com/568690/spotify`
### WordPress Scraper (`src/wordpress_scraper.py`)
- Direct API access to `hvacknowitall.com`
- Fetches blog posts with full content
### HVACRSchool Scraper (`src/hvacrschool_scraper.py`)
- Web scraping of technical articles from `hvacrschool.com`
- Enhanced content cleaning with duplicate removal
- Handles complex HTML structures and embedded media
## Technical Stack
- **Python**: 3.11+ with UV package manager
- **Key Dependencies**:
- `instaloader` (Instagram)
- `scrapling[all]` (TikTok anti-bot; scraper currently disabled)
- `yt-dlp` (YouTube)
- `feedparser` (RSS)
- `markdownify` (HTML conversion)
- **Testing**: pytest with comprehensive mocking
## Deployment Strategy
### ✅ Production Setup - systemd Services
**TikTok disabled** - deployment no longer requires GUI access or containerization workarounds.
```bash
# Service files location (✅ INSTALLED)
/etc/systemd/system/hkia-scraper.service
/etc/systemd/system/hkia-scraper.timer
/etc/systemd/system/hkia-scraper-nas.service
/etc/systemd/system/hkia-scraper-nas.timer
# Working directory
/home/ben/dev/hvac-kia-content/
# Installation script
./install-hkia-services.sh
# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```
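A timer unit along these lines would produce the twice-daily schedule (illustrative; the installed units may differ in detail):

```ini
# /etc/systemd/system/hkia-scraper.timer (illustrative)
[Unit]
Description=Run HKIA content scraping twice daily

[Timer]
OnCalendar=*-*-* 08:00:00 America/Halifax
OnCalendar=*-*-* 12:00:00 America/Halifax
Persistent=true

[Install]
WantedBy=timers.target
```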
### Schedule (✅ ACTIVE)
- **Main Scraping**: 8:00 AM and 12:00 PM Atlantic Daylight Time (all 6 active sources)
- **NAS Sync**: 8:30 AM and 12:30 PM (30 minutes after scraping)
- **User**: ben (GUI environment available but not required)
## Environment Variables
```bash
# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_PASSWORD=I22W5YlbRl7x
YOUTUBE_CHANNEL=@hkia
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```
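A minimal loader for that file (the project may rely on `python-dotenv` or plain shell sourcing instead; this sketch just shows the expected format, including the quoted `XAUTHORITY` value):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring comments and blanks; strip quotes
    so quoted values like XAUTHORITY read cleanly."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(env)
    return env
```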
## Commands
### Development Setup
```bash
# Install UV package manager (if not installed)
pip install uv
# Sync project dependencies (preferred)
uv sync
# Or install from requirements.txt directly
uv pip install -r requirements.txt
```
### Testing
```bash
# Test individual sources
uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mailchimp|podcast]
# Test backlog processing
uv run python test_real_data.py --type backlog --items 50
# Test cumulative markdown system
uv run python test_cumulative_mode.py
# Full test suite
uv run pytest tests/ -v
# Test specific scraper with detailed output
uv run pytest tests/test_[scraper_name].py -v -s
# Test with specific GUI environment for TikTok
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python test_real_data.py --source tiktok
# Test YouTube transcript extraction (currently blocked by YouTube)
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python youtube_backlog_all_with_transcripts.py
```
### Production Operations
```bash
# Service management (✅ ACTIVE SERVICES)
sudo systemctl status hkia-scraper.timer
sudo systemctl status hkia-scraper-nas.timer
sudo journalctl -f -u hkia-scraper.service
sudo journalctl -f -u hkia-scraper-nas.service
# Manual runs (for testing)
uv run python run_production_with_images.py
uv run python -m src.orchestrator --sources youtube instagram
uv run python -m src.orchestrator --nas-only
# Legacy commands (still work)
uv run python -m src.orchestrator
uv run python run_production_cumulative.py
# Debug and monitoring
tail -f logs/[source]/[source].log
ls -la data/markdown_current/
ls -la data/media/[source]/
```
## Critical Notes
1. **✅ TikTok Scraper**: DISABLED - No longer blocks deployment or requires GUI access
2. **Instagram Rate Limiting**: 100 requests/hour with exponential backoff
3. **YouTube Transcript Status**: DISABLED in production due to platform restrictions (Aug 2025)
- Complete PO token implementation but blocked by YouTube platform changes
- 179 videos identified with captions but currently inaccessible
- Hybrid scraper architecture ready to resume when restrictions are lifted
4. **State Files**: Located in `data/.state/` directory for incremental updates
5. **Archive Management**: Previous files automatically moved to timestamped archives in `data/markdown_archives/[source]/`
6. **Media Management**: Images/videos saved to `data/media/[source]/` with consistent naming
7. **Error Recovery**: All scrapers handle rate limits and network failures gracefully
8. **✅ Production Services**: Fully automated with systemd timers running twice daily
9. **Package Management**: Uses UV for fast Python package management (`uv run`, `uv sync`)
## YouTube Transcript Status (August 2025)
**Current Status**: ❌ **DISABLED** - Transcript extraction disabled in production
**Implementation Status**:
- **Hybrid Scraper**: Complete (`src/youtube_hybrid_scraper.py`)
- **PO Token Handler**: Full implementation with environment variable support
- **Firefox Integration**: Cookie extraction and profile detection working
- **API Integration**: YouTube Data API v3 for efficient metadata extraction
- **Transcript Extraction**: Disabled due to YouTube platform restrictions
**Technical Details**:
- **179 videos identified** with captions available but currently inaccessible
- **PO Token**: Extracted and configured (`YOUTUBE_PO_TOKEN_MWEB_GVS` in .env)
- **Authentication**: Firefox cookies (147 extracted) + PO token support
- **Platform Error**: "The following content is not available on this app"
**Architecture**: True hybrid approach maintains efficiency:
- **Metadata**: YouTube Data API v3 (cheap, reliable, rich data)
- **Transcripts**: yt-dlp with authentication (currently blocked)
- **Fallback**: Gracefully continues without transcripts
**Future**: Will automatically resume transcript extraction when platform restrictions are resolved.
## Project Status: ✅ COMPLETE & DEPLOYED
- **6 active sources** working and tested (TikTok disabled)
- **✅ Production deployment**: systemd services installed and running
- **✅ Automated scheduling**: 8 AM & 12 PM ADT with NAS sync
- **✅ Comprehensive testing**: 68+ tests passing
- **✅ Real-world data validation**: All 6 sources producing content (Aug 27, 2025)
- **✅ Full backlog processing**: Verified for all active sources including HVACRSchool
- **✅ System reliability**: WordPress/MailChimp issues resolved, all sources updating
- **✅ Cumulative markdown system**: Operational
- **✅ Image downloading system**: 686 images synced daily
- **✅ NAS synchronization**: Automated twice-daily sync
- **YouTube transcript extraction**: Blocked by platform restrictions (not code issues)