Documentation Updates:
- Updated project specification with hkia naming and paths
- Modified all markdown documentation files (12 files updated)
- Changed service names from hvac-content-* to hkia-content-*
- Updated NAS paths from /mnt/nas/hvacknowitall to /mnt/nas/hkia
- Replaced all instances of "HVAC Know It All" with "HKIA"

Files Updated:
- README.md - Updated service names and commands
- CLAUDE.md - Updated environment variables and paths
- DEPLOY.md - Updated deployment instructions
- docs/project_specification.md - Updated naming convention specs
- docs/status.md - Updated project status with new naming
- docs/final_status.md - Updated completion status
- docs/deployment_strategy.md - Updated deployment paths
- docs/DEPLOYMENT_CHECKLIST.md - Updated checklist items
- docs/PRODUCTION_TODO.md - Updated production tasks
- BACKLOG_STATUS.md - Updated backlog references
- UPDATED_CAPTURE_STATUS.md - Updated capture status
- FINAL_TALLY_REPORT.md - Updated tally report

Notes:
- Repository name remains hvacknowitall-content (unchanged)
- Project directory remains hvac-kia-content (unchanged)
- All user-facing outputs now use clean "hkia" naming

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

# HKIA Content Aggregation System

## Project Overview
A complete content aggregation system that scrapes six sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram, TikTok), converts their content to Markdown, and runs twice daily with incremental updates.

## Architecture
- **Base Pattern**: Abstract scraper class with common interface
- **State Management**: JSON-based incremental update tracking
- **Parallel Processing**: 5 sources run in parallel, TikTok separate (GUI requirement)
- **Output Format**: `hkia_[source]_[timestamp].md`
- **Archive System**: Previous files archived to timestamped directories
- **NAS Sync**: Automated rsync to `/mnt/nas/hkia/`

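The architecture points above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `BaseScraper`, `fetch`, `run_all`, the `seen_ids` state layout, and the timestamp format in `output_name` are all assumptions made for the example.

```python
import json
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from pathlib import Path
from typing import Optional


class BaseScraper(ABC):
    """Hypothetical common interface; the real base class may differ."""

    source = "base"  # short source name used in state and output file names

    def __init__(self, state_dir: Path):
        self.state_path = state_dir / f"{self.source}_state.json"

    @abstractmethod
    def fetch(self) -> list:
        """Return all currently available items, each a dict with an 'id'."""

    def load_seen(self) -> set:
        # A missing state file means this is the first run.
        if not self.state_path.exists():
            return set()
        return set(json.loads(self.state_path.read_text())["seen_ids"])

    def save_seen(self, seen: set) -> None:
        self.state_path.parent.mkdir(parents=True, exist_ok=True)
        self.state_path.write_text(json.dumps({"seen_ids": sorted(seen)}))

    def run(self) -> list:
        """Incremental update: return only items not seen on earlier runs."""
        seen = self.load_seen()
        new_items = [item for item in self.fetch() if item["id"] not in seen]
        self.save_seen(seen | {item["id"] for item in new_items})
        return new_items


def output_name(source: str, when: datetime) -> str:
    # Follows the hkia_[source]_[timestamp].md pattern; the exact
    # timestamp format used here is an assumption.
    return f"hkia_{source}_{when:%Y%m%d_%H%M%S}.md"


def run_all(parallel: list, tiktok: Optional[BaseScraper] = None) -> dict:
    """Run the parallel-safe scrapers in a thread pool, then TikTok alone."""
    with ThreadPoolExecutor(max_workers=5) as pool:
        names = [scraper.source for scraper in parallel]
        results = dict(zip(names, pool.map(lambda s: s.run(), parallel)))
    if tiktok is not None:
        results[tiktok.source] = tiktok.run()
    return results
```

Keeping state handling in the base class means each concrete scraper only needs to implement `fetch`, which is one plausible reading of "common interface" here.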
## Key Implementation Details

### Instagram Scraper (`src/instagram_scraper.py`)
- Uses `instaloader` with session persistence
- Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
- Session file: `instagram_session_hkia1.session`
- Authentication: Username `hkia1`, password `I22W5YlbRl7x`

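The rate-limiting behaviour described above could look something like the sketch below. The class name `RequestPacer` is hypothetical, and the length of the extended break (`break_seconds`) is an assumed value, since the notes do not state it.

```python
import random
import time


class RequestPacer:
    """Pause 15-30 s between requests, with a longer break every 5 requests."""

    def __init__(self, min_delay=15.0, max_delay=30.0, break_every=5,
                 break_seconds=120.0, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.break_every = break_every
        self.break_seconds = break_seconds  # assumed; not given in the notes
        self.sleep = sleep  # injectable so tests do not actually wait
        self.count = 0

    def next_delay(self) -> float:
        """Return the pause to take before the next request."""
        self.count += 1
        delay = random.uniform(self.min_delay, self.max_delay)
        if self.count % self.break_every == 0:
            delay += self.break_seconds  # extended break on every 5th request
        return delay

    def wait(self) -> float:
        """Sleep for the next pause and return its length in seconds."""
        delay = self.next_delay()
        self.sleep(delay)
        return delay
```

A scraper would call `pacer.wait()` before each request; randomizing the delay avoids a fixed, easily fingerprinted request rhythm.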
### TikTok Scraper (`src/tiktok_scraper_advanced.py`)
- Evades anti-bot detection using Scrapling + Camoufox
- **Requires headed browser with DISPLAY=:0**
- Stealth features: geolocation spoofing, OS randomization, WebGL support
- Cannot be containerized due to GUI requirements

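Because the TikTok scraper depends on a headed browser, a pre-flight check of the GUI environment can fail fast before any browser is launched. The helper below is a hypothetical sketch; `display_problems` is not a function from the project, and the `XAUTHORITY` check mirrors the variable listed in the environment setup.

```python
import os


def display_problems(env=None) -> list:
    """Return human-readable problems with the GUI environment, if any."""
    env = os.environ if env is None else env
    problems = []
    if env.get("DISPLAY") != ":0":
        problems.append("DISPLAY must be set to :0 for the headed browser")
    xauth = env.get("XAUTHORITY", "")
    if not xauth:
        problems.append("XAUTHORITY is not set")
    elif not os.path.exists(xauth):
        problems.append(f"XAUTHORITY file not found: {xauth}")
    return problems
```

An empty list means the environment looks usable; otherwise the scraper can log the problems and skip the TikTok run.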
### YouTube Scraper (`src/youtube_scraper.py`)
- Uses `yt-dlp` for metadata extraction
- Channel: `@hkia`
- Fetches video metadata without downloading videos

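Metadata-only extraction with `yt-dlp` can be sketched as below. The function names, the `limit` parameter, and the markdown layout are illustrative assumptions; the options shown (`skip_download`, `extract_flat`, `playlistend`) are standard `yt-dlp` options, but the project may use a different set.

```python
def channel_metadata(channel_url: str, limit: int = 10) -> list:
    """Fetch basic metadata for a channel's videos without downloading them."""
    import yt_dlp  # imported lazily so to_markdown works without yt-dlp

    options = {
        "quiet": True,
        "skip_download": True,  # never fetch media files
        "extract_flat": True,   # list entries without resolving each video
        "playlistend": limit,   # stop after the first `limit` entries
    }
    with yt_dlp.YoutubeDL(options) as ydl:
        info = ydl.extract_info(channel_url, download=False)
    return [entry for entry in (info.get("entries") or []) if entry]


def to_markdown(entry: dict) -> str:
    """Render one metadata entry as a small markdown section."""
    title = entry.get("title") or "Untitled"
    url = entry.get("url") or entry.get("webpage_url") or ""
    return f"## {title}\n\n- ID: {entry.get('id', '')}\n- URL: {url}\n"
```

A call might look like `channel_metadata("https://www.youtube.com/@hkia/videos")`, where the URL form is an assumption built from the `@hkia` handle.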
### RSS Scrapers
- **MailChimp**: `https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985`
- **Podcast**: `https://feeds.libsyn.com/568690/spotify`

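Both feeds can be read with `feedparser`, as listed in the technical stack. The sketch below assumes each entry is identified by its `id` (falling back to `link`); the function names and the returned dict layout are hypothetical.

```python
def fetch_entries(feed_url: str) -> list:
    """Parse an RSS feed and return its entries as plain dicts."""
    import feedparser  # imported lazily so new_entries works without it

    parsed = feedparser.parse(feed_url)
    return [
        {
            "id": entry.get("id") or entry.get("link", ""),
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "summary": entry.get("summary", ""),
        }
        for entry in parsed.entries
    ]


def new_entries(entries: list, seen_ids: set) -> list:
    """Keep only entries whose id has not been processed before."""
    return [entry for entry in entries if entry["id"] not in seen_ids]
```

Filtering against previously seen ids is one simple way to get the incremental behaviour described in the architecture notes.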
### WordPress Scraper (`src/wordpress_scraper.py`)
- Direct API access to `hkia.com`
- Fetches blog posts with full content

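Assuming "direct API access" refers to the standard WordPress REST API (`/wp-json/wp/v2/posts`), fetching full posts and converting them with `markdownify` might look like the sketch below. The helper names and default page size are hypothetical, and the project may use a different endpoint or authentication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def posts_url(site: str, page: int = 1, per_page: int = 20, after=None) -> str:
    """Build the REST URL for one page of posts."""
    params = {"page": page, "per_page": per_page}
    if after:
        params["after"] = after  # ISO 8601 date: only posts published later
    return f"https://{site}/wp-json/wp/v2/posts?{urlencode(params)}"


def fetch_posts(site: str, **kwargs) -> list:
    """Download one page of posts as a list of JSON objects."""
    with urlopen(posts_url(site, **kwargs), timeout=30) as response:
        return json.load(response)


def post_to_markdown(post: dict) -> str:
    """Convert a post's rendered HTML body to markdown."""
    from markdownify import markdownify  # imported lazily

    body = markdownify(post["content"]["rendered"])
    return f"# {post['title']['rendered']}\n\n{body}"
```

Passing the timestamp of the last processed post as `after` would limit each run to new posts.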
## Technical Stack
- **Python**: 3.11+ with UV package manager
- **Key Dependencies**:
  - `instaloader` (Instagram)
  - `scrapling[all]` (TikTok anti-bot)
  - `yt-dlp` (YouTube)
  - `feedparser` (RSS)
  - `markdownify` (HTML conversion)
- **Testing**: pytest with comprehensive mocking

## Deployment Strategy

### ⚠️ IMPORTANT: systemd Services (Not Kubernetes)
The system was originally planned for Kubernetes deployment, but **TikTok requires a headed browser with DISPLAY=:0**, which makes containerization impossible.

### Production Setup
```bash
# Service files location
/etc/systemd/system/hkia-scraper.service
/etc/systemd/system/hkia-scraper.timer
/etc/systemd/system/hkia-scraper-nas.service
/etc/systemd/system/hkia-scraper-nas.timer

# Installation directory
/opt/hvac-kia-content/

# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```

### Schedule
- **Main Scraping**: 8 AM and 12 PM Atlantic Daylight Time
- **NAS Sync**: 30 minutes after each scraping run
- **User**: ben (requires GUI access for TikTok)

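The schedule arithmetic can be checked with a few lines of Python. In production the systemd timers handle the scheduling; this sketch only illustrates how the sync times follow from the scrape times, and the names `SCRAPE_TIMES`, `sync_times`, and `next_run` are hypothetical.

```python
from datetime import date, datetime, time, timedelta

SCRAPE_TIMES = [time(8, 0), time(12, 0)]  # 8 AM and 12 PM local time
SYNC_OFFSET = timedelta(minutes=30)       # NAS sync follows each scrape


def sync_times(scrape_times=SCRAPE_TIMES, offset=SYNC_OFFSET) -> list:
    """Return the NAS sync times derived from the scrape times."""
    anchor = date(2000, 1, 1)  # arbitrary day, only needed for time arithmetic
    return [(datetime.combine(anchor, t) + offset).time() for t in scrape_times]


def next_run(now: datetime, run_times) -> datetime:
    """Return the first scheduled run strictly after `now`, in its timezone."""
    for t in sorted(run_times):
        candidate = datetime.combine(now.date(), t, tzinfo=now.tzinfo)
        if candidate > now:
            return candidate
    tomorrow = now.date() + timedelta(days=1)
    return datetime.combine(tomorrow, min(run_times), tzinfo=now.tzinfo)
```

With the times above, `sync_times()` yields 08:30 and 12:30.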
## Environment Variables
```bash
# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_PASSWORD=I22W5YlbRl7x
YOUTUBE_CHANNEL=@hkia
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```

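A startup check that all required variables are present can catch a misconfigured `.env` file early. The sketch below uses a deliberately small parser (plain `KEY=VALUE` lines, optional quotes, `#` comments); `parse_env` and `missing_keys` are hypothetical helpers, and a real deployment might rely on a dotenv library or systemd's `EnvironmentFile=` instead.

```python
REQUIRED_KEYS = (
    "INSTAGRAM_USERNAME", "INSTAGRAM_PASSWORD", "YOUTUBE_CHANNEL",
    "TIKTOK_USERNAME", "NAS_PATH", "TIMEZONE", "DISPLAY", "XAUTHORITY",
)


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding quotes such as those around the XAUTHORITY path.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict, required=REQUIRED_KEYS) -> list:
    """Return required keys that are absent or empty, in declaration order."""
    return [key for key in required if not values.get(key)]
```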
## Commands

### Testing
```bash
# Test individual sources
uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mailchimp|podcast]

# Test backlog processing
uv run python test_real_data.py --type backlog --items 50

# Full test suite
uv run pytest tests/ -v
```

### Production Operations
```bash
# Run orchestrator manually
uv run python -m src.orchestrator

# Run specific sources
uv run python -m src.orchestrator --sources youtube instagram

# NAS sync only
uv run python -m src.orchestrator --nas-only

# Check service status
sudo systemctl status hkia-scraper.service
sudo journalctl -f -u hkia-scraper.service
```

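The `--sources` and `--nas-only` flags used above could be parsed along these lines. This is a sketch of one possible `argparse` setup, not the orchestrator's actual code; `build_parser` and `ALL_SOURCES` are hypothetical names, and the source list is taken from the test command's options.

```python
import argparse

ALL_SOURCES = ["wordpress", "mailchimp", "podcast", "youtube", "instagram", "tiktok"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting the flags shown in the commands above."""
    parser = argparse.ArgumentParser(prog="src.orchestrator")
    parser.add_argument(
        "--sources", nargs="+", choices=ALL_SOURCES, default=ALL_SOURCES,
        help="scrape only the listed sources (default: all six)",
    )
    parser.add_argument(
        "--nas-only", action="store_true",
        help="skip scraping and only sync existing output to the NAS",
    )
    return parser
```

For example, `build_parser().parse_args(["--sources", "youtube", "instagram"])` yields `sources == ["youtube", "instagram"]` and `nas_only == False`.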
## Critical Notes

1. **TikTok GUI Requirement**: Must run in a desktop environment with DISPLAY=:0
2. **Instagram Rate Limiting**: 100 requests/hour with exponential backoff
3. **State Files**: Located in `state/` directory for incremental updates
4. **Archive Management**: Previous files are automatically moved to timestamped archives
5. **Error Recovery**: All scrapers handle rate limits and network failures gracefully

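Exponential backoff, as mentioned for Instagram rate limiting and error recovery, can be sketched as below. The base delay, factor, cap, and retry count are assumed values, and `call_with_backoff` is a hypothetical helper rather than the project's implementation. For reference, a budget of 100 requests/hour averages out to one request every 36 seconds.

```python
import time


def backoff_delays(base=1.0, factor=2.0, max_delay=300.0, retries=5) -> list:
    """Return the pause before each retry: base, base*factor, ... capped."""
    return [min(base * factor ** n, max_delay) for n in range(retries)]


def call_with_backoff(func, retries=5, base=1.0, sleep=time.sleep,
                      retry_on=(Exception,)):
    """Call func(), retrying failed attempts with exponentially growing pauses."""
    delays = backoff_delays(base=base, retries=retries)
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
```

Narrowing `retry_on` to rate-limit and network exceptions keeps programming errors from being retried silently.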
## Project Status: ✅ COMPLETE
- All 6 sources working and tested
- Production deployment ready via systemd
- Comprehensive testing completed (68+ tests passing)
- Real-world data validation completed
- Full backlog processing capability verified