Documentation Updates:
- Updated project specification with hkia naming and paths
- Modified all markdown documentation files (12 files updated)
- Changed service names from hvac-content-* to hkia-content-*
- Updated NAS paths from /mnt/nas/hvacknowitall to /mnt/nas/hkia
- Replaced all instances of "HVAC Know It All" with "HKIA"
Files Updated:
- README.md - Updated service names and commands
- CLAUDE.md - Updated environment variables and paths
- DEPLOY.md - Updated deployment instructions
- docs/project_specification.md - Updated naming convention specs
- docs/status.md - Updated project status with new naming
- docs/final_status.md - Updated completion status
- docs/deployment_strategy.md - Updated deployment paths
- docs/DEPLOYMENT_CHECKLIST.md - Updated checklist items
- docs/PRODUCTION_TODO.md - Updated production tasks
- BACKLOG_STATUS.md - Updated backlog references
- UPDATED_CAPTURE_STATUS.md - Updated capture status
- FINAL_TALLY_REPORT.md - Updated tally report
Notes:
- Repository name remains hvacknowitall-content (unchanged)
- Project directory remains hvac-kia-content (unchanged)
- All user-facing outputs now use clean "hkia" naming
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
HKIA Content Aggregation System
Project Overview
A complete content aggregation system that scrapes six sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram, TikTok), converts the content to Markdown, and runs twice daily with incremental updates.
Architecture
- Base Pattern: Abstract scraper class with common interface
- State Management: JSON-based incremental update tracking
- Parallel Processing: Five sources run in parallel; TikTok runs separately (GUI requirement)
- Output Format: hkia_[source]_[timestamp].md
- Archive System: Previous files archived to timestamped directories
- NAS Sync: Automated rsync to /mnt/nas/hkia/
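The base pattern, JSON state tracking, and output naming listed above could fit together roughly as follows. This is a minimal sketch, not the project's actual code: the class name `BaseScraper`, the `seen_ids` state layout, the `id`/`title` item keys, and the `YYYYMMDD_HHMMSS` timestamp format are all assumptions made for illustration.

```python
import json
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path
from typing import Optional


class BaseScraper(ABC):
    """Hypothetical base class; names here are illustrative only."""

    source = "base"  # short source name used in state and output file names

    def __init__(self, state_dir: Path, output_dir: Path) -> None:
        self.state_path = state_dir / f"{self.source}_state.json"
        self.output_dir = output_dir

    @abstractmethod
    def fetch_items(self) -> list:
        """Return every item currently visible at the source as dicts."""

    def load_seen_ids(self) -> set:
        # A missing state file means this is the first run.
        if not self.state_path.exists():
            return set()
        return set(json.loads(self.state_path.read_text())["seen_ids"])

    def save_seen_ids(self, seen: set) -> None:
        self.state_path.parent.mkdir(parents=True, exist_ok=True)
        self.state_path.write_text(json.dumps({"seen_ids": sorted(seen)}))

    def output_name(self, now: datetime) -> str:
        # Follows hkia_[source]_[timestamp].md; the timestamp format is assumed.
        return f"hkia_{self.source}_{now:%Y%m%d_%H%M%S}.md"

    def run(self, now: datetime) -> Optional[Path]:
        """Write only items not seen before; return the new file or None."""
        seen = self.load_seen_ids()
        new_items = [i for i in self.fetch_items() if i["id"] not in seen]
        if not new_items:
            return None
        self.output_dir.mkdir(parents=True, exist_ok=True)
        out = self.output_dir / self.output_name(now)
        out.write_text("\n\n".join(f"# {i['title']}" for i in new_items))
        self.save_seen_ids(seen | {i["id"] for i in new_items})
        return out
```

A concrete scraper would only need to set `source` and implement `fetch_items`; a second `run` with no new items returns `None` and writes nothing.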
Key Implementation Details
Instagram Scraper (src/instagram_scraper.py)
- Uses instaloader with session persistence
- Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
- Session file: instagram_session_hkia1.session
- Authentication: Username hkia1, password I22W5YlbRl7x
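The delay policy described above (15-30 second pauses, with a longer break after every fifth request) can be sketched as a pure planning function. The function name `plan_delays` and the 120-300 second range for the extended break are assumptions; the source does not say how long the extended breaks are. No instaloader calls are shown here.

```python
import random


def plan_delays(n_requests, rng, short=(15.0, 30.0),
                long_break=(120.0, 300.0), break_every=5):
    """Return the pause, in seconds, to take after each of n_requests requests.

    `long_break` is an assumed range; only the 15-30 s delay and the
    "every 5 requests" cadence come from the notes above.
    """
    delays = []
    for i in range(1, n_requests + 1):
        if i % break_every == 0:
            # Extended break after every `break_every`-th request.
            delays.append(rng.uniform(*long_break))
        else:
            delays.append(rng.uniform(*short))
    return delays
```

A caller would pass `random.Random()` and `time.sleep` for each planned delay between requests; keeping the planning separate from the sleeping makes the policy easy to test.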
TikTok Scraper (src/tiktok_scraper_advanced.py)
- Advanced anti-bot-detection evasion using Scrapling + Camoufox
- Requires headed browser with DISPLAY=:0
- Stealth features: geolocation spoofing, OS randomization, WebGL support
- Cannot be containerized due to GUI requirements
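Because the headed browser needs an X display, a scraper run could fail fast when `DISPLAY` is missing instead of failing inside the browser launch. The helper below is a hypothetical guard, not part of the project or of Scrapling.

```python
def require_display(env):
    """Return the DISPLAY value from `env`, or raise if it is unset or empty.

    `env` is any mapping, for example os.environ. Hypothetical helper.
    """
    display = env.get("DISPLAY", "").strip()
    if not display:
        raise RuntimeError(
            "DISPLAY is not set; the TikTok scraper needs a headed browser "
            "(expected DISPLAY=:0 on the desktop session)."
        )
    return display
```

Calling `require_display(os.environ)` before launching the browser turns a confusing launch failure into a clear configuration error.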
YouTube Scraper (src/youtube_scraper.py)
- Uses yt-dlp for metadata extraction
- Channel: @hkia
- Fetches video metadata without downloading videos
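Metadata-only extraction with yt-dlp's Python API might look like the sketch below. The helper names and the choice of the channel's `/videos` tab URL are assumptions; the options `skip_download`, `extract_flat`, and `quiet` and the call `extract_info(..., download=False)` are standard yt-dlp usage. The import is done lazily so the option helpers can be used without yt-dlp installed.

```python
def channel_videos_url(handle):
    """Build the videos-tab URL for a channel handle such as '@hkia' (assumed form)."""
    return f"https://www.youtube.com/{handle}/videos"


def build_ydl_opts():
    """Options that ask yt-dlp for metadata only."""
    return {
        "skip_download": True,          # never download media files
        "extract_flat": "in_playlist",  # list channel entries without resolving each video
        "quiet": True,
    }


def fetch_channel_entries(handle):
    """Return metadata dicts for the channel's videos (needs network and yt-dlp)."""
    import yt_dlp  # third-party; imported here so the helpers above stay dependency-free

    with yt_dlp.YoutubeDL(build_ydl_opts()) as ydl:
        info = ydl.extract_info(channel_videos_url(handle), download=False)
    return list(info.get("entries") or [])
```

With flat extraction each entry carries only basic fields such as the video ID and title; fetching full per-video metadata would require a second `extract_info` call per video.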
RSS Scrapers
- MailChimp: https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985
- Podcast: https://feeds.libsyn.com/568690/spotify
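Both feeds can be read the same way with feedparser. The sketch below separates fetching from formatting; `entries_to_markdown` and `fetch_feed_entries` are hypothetical names, and the one-line-per-entry Markdown layout is an assumption rather than the project's actual output format.

```python
def entries_to_markdown(entries):
    """Render feed entries (dict-like, with 'title' and 'link') as a Markdown list."""
    lines = []
    for entry in entries:
        title = entry.get("title", "(untitled)")
        link = entry.get("link", "")
        lines.append(f"- [{title}]({link})")
    return "\n".join(lines)


def fetch_feed_entries(url):
    """Parse an RSS/Atom feed and return its entries (needs network and feedparser)."""
    import feedparser  # third-party; imported lazily

    return feedparser.parse(url).entries
```

For example, `entries_to_markdown(fetch_feed_entries("https://feeds.libsyn.com/568690/spotify"))` would list the podcast episodes, since feedparser entries support dict-style `.get` access.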
WordPress Scraper (src/wordpress_scraper.py)
- Direct API access to hkia.com
- Fetches blog posts with full content
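If the "direct API access" refers to the standard WordPress REST API (an assumption; the source does not name the endpoint), paginated post requests could be built as shown below. The function name `posts_url` is hypothetical; `page`, `per_page`, `orderby`, `order`, and `after` are documented query parameters of the `/wp-json/wp/v2/posts` route.

```python
from urllib.parse import urlencode


def posts_url(base_url, page=1, per_page=100, after=None):
    """Build a WordPress REST API URL for one page of posts.

    `after` is an optional ISO 8601 timestamp; passing the time of the last
    successful run is one way to support incremental updates.
    """
    params = {"page": page, "per_page": per_page, "orderby": "date", "order": "desc"}
    if after:
        params["after"] = after
    return f"{base_url.rstrip('/')}/wp-json/wp/v2/posts?{urlencode(params)}"
```

Each post returned by that route includes a rendered HTML body under `content.rendered`, which is the kind of input an HTML-to-Markdown converter such as markdownify would take.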
Technical Stack
- Python: 3.11+ with UV package manager
- Key Dependencies:
  - instaloader (Instagram)
  - scrapling[all] (TikTok anti-bot)
  - yt-dlp (YouTube)
  - feedparser (RSS)
  - markdownify (HTML conversion)
- Testing: pytest with comprehensive mocking
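A mocked test in this style might look like the following. Both `collect_titles` and the test are hypothetical stand-ins, shown only to illustrate replacing a network-bound fetch function with `unittest.mock.Mock` so the test runs offline.

```python
from unittest.mock import Mock


def collect_titles(fetch):
    """Hypothetical unit under test: pull titles from whatever `fetch` returns."""
    return [item["title"] for item in fetch()]


def test_collect_titles_uses_mocked_fetch():
    # The mock stands in for a real scraper call, so no network is needed.
    fake_fetch = Mock(return_value=[{"title": "Post A"}, {"title": "Post B"}])
    assert collect_titles(fake_fetch) == ["Post A", "Post B"]
    fake_fetch.assert_called_once_with()
```

pytest collects any function named `test_*`, so this file would run as part of `uv run pytest tests/ -v` if placed under `tests/`.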
Deployment Strategy
⚠️ IMPORTANT: systemd Services (Not Kubernetes)
The system was originally planned for Kubernetes deployment, but TikTok requires a headed browser with DISPLAY=:0, which makes containerization impossible.
Production Setup
# Service files location
/etc/systemd/system/hkia-scraper.service
/etc/systemd/system/hkia-scraper.timer
/etc/systemd/system/hkia-scraper-nas.service  
/etc/systemd/system/hkia-scraper-nas.timer
# Installation directory
/opt/hvac-kia-content/
# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
Schedule
- Main Scraping: 8 AM and 12 PM Atlantic Daylight Time
- NAS Sync: 30 minutes after each scraping run
- User: ben (requires GUI access for TikTok)
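In production the schedule is driven by the systemd timers, but the timing rule itself is simple enough to sketch in Python for illustration. The function names are hypothetical, and `now` is assumed to already be in local Halifax time (for example, built with `zoneinfo.ZoneInfo("America/Halifax")`).

```python
from datetime import datetime, time, timedelta

SCRAPE_TIMES = (time(8, 0), time(12, 0))  # 8 AM and 12 PM local time
NAS_OFFSET = timedelta(minutes=30)        # NAS sync follows each scrape


def next_scrape(now):
    """Return the first scheduled scrape strictly after `now` (local time)."""
    for day_offset in (0, 1):
        day = now.date() + timedelta(days=day_offset)
        for t in SCRAPE_TIMES:
            candidate = datetime.combine(day, t, tzinfo=now.tzinfo)
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: tomorrow 8 AM is always after now")


def next_nas_sync(now):
    """Return the NAS sync time that follows the next scrape."""
    return next_scrape(now) + NAS_OFFSET
```

At 9:00 the next scrape is the same day's 12:00 run; after 12:00 it rolls over to 8:00 the next morning.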
Environment Variables
# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_PASSWORD=I22W5YlbRl7x
YOUTUBE_CHANNEL=@hkia
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
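A startup check that every required variable is present can catch a misconfigured `.env` before any scraper runs. The sketch below uses a deliberately small hand-rolled parser; the function names are hypothetical, and the project may well load the file differently (for example with a dotenv library).

```python
REQUIRED_VARS = (
    "INSTAGRAM_USERNAME", "INSTAGRAM_PASSWORD", "YOUTUBE_CHANNEL",
    "TIKTOK_USERNAME", "NAS_PATH", "TIMEZONE", "DISPLAY", "XAUTHORITY",
)


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding double quotes, as used for XAUTHORITY.
        values[key.strip()] = value.strip().strip('"')
    return values


def missing_vars(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]
```

An orchestrator could call `missing_vars(parse_env_text(Path(".env").read_text()))` and exit with a clear message if the list is non-empty.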
Commands
Testing
# Test individual sources
uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mailchimp|podcast]
# Test backlog processing  
uv run python test_real_data.py --type backlog --items 50
# Full test suite
uv run pytest tests/ -v
Production Operations
# Run orchestrator manually
uv run python -m src.orchestrator
# Run specific sources
uv run python -m src.orchestrator --sources youtube instagram
# NAS sync only
uv run python -m src.orchestrator --nas-only
# Check service status
sudo systemctl status hkia-scraper.service
sudo journalctl -f -u hkia-scraper.service
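The `--sources` and `--nas-only` flags used above, together with the "five in parallel, TikTok separately" rule, could be wired up along these lines. This is a sketch under assumptions: the real `src.orchestrator` may parse arguments and schedule work differently, and `run_one` is a hypothetical callable that scrapes a single named source.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor

ALL_SOURCES = ["wordpress", "mailchimp", "podcast", "youtube", "instagram", "tiktok"]


def parse_args(argv=None):
    """Parse the two flags shown in the commands above."""
    parser = argparse.ArgumentParser(prog="orchestrator")
    parser.add_argument("--sources", nargs="+", choices=ALL_SOURCES, default=ALL_SOURCES)
    parser.add_argument("--nas-only", action="store_true")
    return parser.parse_args(argv)


def run_sources(sources, run_one):
    """Run non-TikTok sources in a thread pool, then TikTok on its own."""
    parallel = [s for s in sources if s != "tiktok"]
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(parallel))) as pool:
        for name, result in zip(parallel, pool.map(run_one, parallel)):
            results[name] = result
    if "tiktok" in sources:
        # TikTok needs the headed browser, so it runs after the pool finishes.
        results["tiktok"] = run_one("tiktok")
    return results
```

Threads suit this workload because the scrapers spend most of their time waiting on network I/O rather than on CPU.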
Critical Notes
- TikTok GUI Requirement: Must run on desktop environment with DISPLAY=:0
- Instagram Rate Limiting: 100 requests/hour with exponential backoff
- State Files: Located in state/ directory for incremental updates
- Archive Management: Previous files automatically moved to timestamped archives
- Error Recovery: All scrapers handle rate limits and network failures gracefully
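The exponential backoff mentioned for rate limits and network failures can be sketched as a small retry wrapper. The name `retry_with_backoff` and the default retry count and delays are assumptions, not values taken from the project; `sleep` is injectable so the behavior can be tested without waiting.

```python
import time


def retry_with_backoff(func, retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call `func` until it succeeds, doubling the wait after each failure.

    Waits base_delay, 2*base_delay, 4*base_delay, ... capped at max_delay,
    and re-raises the last exception once `retries` attempts are used up.
    """
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

A real scraper would likely catch only the specific exceptions that signal a rate limit or transient network error, rather than every `Exception` as this sketch does.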
Project Status: ✅ COMPLETE
- All 6 sources working and tested
- Production deployment ready via systemd
- Comprehensive testing completed (68+ tests passing)
- Real-world data validation completed
- Full backlog processing capability verified