Major Changes:
- Updated all code references from hvacknowitall/hvacnkowitall to hkia
- Renamed all existing markdown files to use hkia_ prefix
- Updated configuration files, scrapers, and production scripts
- Modified systemd service descriptions to use HKIA
- Changed NAS sync path to /mnt/nas/hkia

Files Updated:
- 20+ source files updated with new naming convention
- 34 markdown files renamed to hkia_* format
- All ScraperConfig brand_name parameters now use 'hkia'
- Documentation updated to reflect new naming

Rationale:
- Shorter, cleaner filenames
- Consistent branding across all outputs
- Easier to type and reference
- Maintains same functionality with improved naming

Next Steps:
- Deploy updated services to production
- Update any external references to old naming
- Monitor scrapers to ensure proper operation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

# HKIA Content Aggregation System

A containerized Python application that aggregates content from multiple HKIA sources, converts it to Markdown, and syncs it to a NAS.

## Features

- **Multi-source content aggregation** from YouTube, Instagram, TikTok, MailChimp, WordPress, and Podcast RSS
- **Comprehensive image downloading** for all visual content (Instagram posts, YouTube thumbnails, Podcast artwork)
- **Cumulative markdown management** - Single source-of-truth files that grow with backlog and incremental updates
- **API integrations** for YouTube Data API v3 and MailChimp API
- **Intelligent content merging** with caption/transcript updates and metric tracking
- **Automated NAS synchronization** to `/mnt/nas/hkia/` for both markdown and media files
- **State management** for incremental updates
- **Parallel processing** for multiple sources
- **Atlantic timezone** (America/Halifax) timestamps

## Cumulative Markdown System

### Overview
The system maintains a single markdown file per source that combines:
- Initial backlog content (historical data)
- Daily incremental updates (new content)
- Content updates (new captions, updated metrics)

### How It Works

1. **Initial Backlog**: The first run creates a base file with all historical content
2. **Daily Incremental**: Subsequent runs merge new content into the existing file
3. **Smart Merging**: Existing entries are updated when better data is available (captions, transcripts, metrics)
4. **Archival**: Previous versions are archived with timestamps for history

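
The merging behaviour described above can be illustrated with a minimal sketch. It is not the project's `CumulativeMarkdownManager`; the entry shape (a dict with an `id` key) and the function name `merge_entries` are assumptions made for illustration only.

```python
from typing import Dict, List


def merge_entries(existing: List[dict], incoming: List[dict]) -> List[dict]:
    """Merge incoming items into existing ones, keyed by a hypothetical 'id' field."""
    merged: Dict[str, dict] = {item["id"]: dict(item) for item in existing}
    for item in incoming:
        current = merged.get(item["id"])
        if current is None:
            # New content: add it after the existing entries.
            merged[item["id"]] = dict(item)
            continue
        for key, value in item.items():
            # Update only when the incoming value is non-empty, so a missing
            # caption or transcript never erases data captured earlier.
            if value not in (None, "", []):
                current[key] = value
    return list(merged.values())
```

Under these assumptions, an entry that arrives later with a caption fills in the blank caption stored earlier, while an empty incoming caption leaves the stored one untouched.
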

### File Naming Convention
```
<brandName>_<source>_<dateTime>.md
Example: hkia_YouTube_2025-08-19T143045.md
```
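
As a rough sketch of how a name in this format could be produced, the helper below formats a timestamp the same way as the example. The function names `build_filename` and `atlantic_now` are hypothetical and not taken from the project; `atlantic_now` relies on the standard-library `zoneinfo` module and requires time zone data to be installed on the host.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def build_filename(brand_name: str, source: str, when: datetime) -> str:
    """Format <brandName>_<source>_<dateTime>.md from the wall-clock time of `when`."""
    return f"{brand_name}_{source}_{when.strftime('%Y-%m-%dT%H%M%S')}.md"


def atlantic_now() -> datetime:
    """Current time in the Atlantic timezone (needs system time zone data)."""
    return datetime.now(ZoneInfo("America/Halifax"))
```

A call such as `build_filename("hkia", "YouTube", atlantic_now())` would then yield a name stamped with the current Atlantic time.
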

## Quick Start

### Installation

```bash
# Install the uv package manager
pip install uv

# Install dependencies
uv pip install -r requirements.txt
```

### Configuration

Create a `.env` file with credentials:
```env
# YouTube
YOUTUBE_API_KEY=your_api_key

# MailChimp
MAILCHIMP_API_KEY=your_api_key
MAILCHIMP_SERVER_PREFIX=us10

# Instagram
INSTAGRAM_USERNAME=username
INSTAGRAM_PASSWORD=password

# WordPress
WORDPRESS_USERNAME=username
WORDPRESS_API_KEY=api_key
```
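
A quick pre-flight check can catch a missing credential before any scraper starts. The sketch below assumes the variables have already been loaded into the process environment and that all seven keys listed above are required; both points are assumptions, and `missing_credentials` is a hypothetical helper rather than part of the project.

```python
import os
from typing import List, Mapping

# Keys taken from the .env example above; treating all of them as
# mandatory is an assumption made for this sketch.
REQUIRED_KEYS = (
    "YOUTUBE_API_KEY",
    "MAILCHIMP_API_KEY",
    "MAILCHIMP_SERVER_PREFIX",
    "INSTAGRAM_USERNAME",
    "INSTAGRAM_PASSWORD",
    "WORDPRESS_USERNAME",
    "WORDPRESS_API_KEY",
)


def missing_credentials(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required keys that are unset or blank in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
```

Calling `missing_credentials()` at startup and aborting when the list is non-empty gives a clearer error than a failed login halfway through a run.
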

### Running

```bash
# Run all scrapers (parallel)
uv run python run_all_scrapers.py

# Run single source
uv run python -m src.youtube_api_scraper_v2

# Test cumulative mode
uv run python test_cumulative_mode.py

# Consolidate existing files
uv run python consolidate_current_files.py
```

## Architecture

### Core Components

- **BaseScraper**: Abstract base class for all scrapers
- **BaseScraperCumulative**: Enhanced base with cumulative support
- **CumulativeMarkdownManager**: Handles intelligent file merging
- **ContentOrchestrator**: Manages parallel scraper execution

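
To illustrate what parallel scraper execution can look like, here is a minimal sketch using a thread pool. It is not the actual `ContentOrchestrator`; the function name `run_in_parallel`, the idea that each scraper is a zero-argument callable, and the default of four workers are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict


def run_in_parallel(scrapers: Dict[str, Callable[[], object]],
                    max_workers: int = 4) -> Dict[str, object]:
    """Run each named scraper callable in a thread pool and collect the outcomes."""
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in scrapers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # Record the failure instead of raising, so one broken
                # source does not stop the others.
                results[name] = exc
    return results
```

Threads suit this workload because scraping is dominated by network waits; storing exceptions per source lets the caller log failures after every source has had its turn.
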

### Data Flow

```
1. Scraper fetches content (checks state for incremental)
2. CumulativeMarkdownManager loads existing file
3. Merges new content (adds new, updates existing)
4. Archives previous version
5. Saves updated file with current timestamp
6. Updates state for next run
```
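
Steps 4 and 5 of the flow above can be sketched as a single file operation. The helper below is an illustration only: the name `archive_and_save`, its parameters, and the glob pattern used to find the previous file are assumptions, not the project's implementation.

```python
import shutil
from pathlib import Path


def archive_and_save(current_dir: Path, archive_dir: Path, source: str,
                     new_name: str, content: str) -> Path:
    """Move the previous file for `source` into its archive folder, then write the new one."""
    current_dir.mkdir(parents=True, exist_ok=True)
    source_archive = archive_dir / source
    source_archive.mkdir(parents=True, exist_ok=True)
    # Step 4: archive any earlier file for this source.
    for old in current_dir.glob(f"*_{source}_*.md"):
        shutil.move(str(old), str(source_archive / old.name))
    # Step 5: save the merged content under the new timestamped name.
    target = current_dir / new_name
    target.write_text(content, encoding="utf-8")
    return target
```

With `data/markdown_current` and `data/markdown_archives` passed as the two directories, this keeps exactly one current file per source while older versions accumulate in the per-source archive folder.
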

### Directory Structure

```
data/
├── markdown_current/       # Current single-source-of-truth files
├── markdown_archives/      # Historical versions by source
│   ├── YouTube/
│   ├── Instagram/
│   └── ...
├── media/                  # Downloaded media files
│   ├── Instagram/          # Instagram images and video thumbnails
│   ├── YouTube/            # YouTube video thumbnails
│   ├── Podcast/            # Podcast episode artwork
│   └── ...
└── .state/                 # State files for incremental updates

logs/                       # Log files by source
src/                        # Source code
tests/                      # Test files
```

## API Quota Management

### YouTube Data API v3
- **Daily Limit**: 10,000 units
- **Usage Strategy**: Up to 95% of the daily quota is allocated to caption requests
- **Costs**:
  - videos.list: 1 unit
  - captions.list: 50 units
  - channels.list: 1 unit

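
The figures above imply a concrete daily budget: 95% of 10,000 units is 9,500 units, which at 50 units per `captions.list` call allows 190 caption requests and leaves 500 units for everything else. The short sketch below reproduces that arithmetic; the constant and function names are illustrative, not taken from the project.

```python
DAILY_LIMIT = 10_000      # units per day
CAPTION_PERCENT = 95      # share of the quota set aside for captions
COSTS = {"videos.list": 1, "captions.list": 50, "channels.list": 1}


def caption_budget(daily_limit: int = DAILY_LIMIT,
                   caption_percent: int = CAPTION_PERCENT) -> dict:
    """Split the daily quota into caption calls and units left for other requests."""
    caption_units = daily_limit * caption_percent // 100
    return {
        "caption_units": caption_units,
        "caption_calls": caption_units // COSTS["captions.list"],
        "remaining_units": daily_limit - caption_units,
    }
```

Integer arithmetic is used on purpose so the budget never depends on floating-point rounding.
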

### Rate Limiting
- Instagram: 200 posts/hour
- YouTube: Respects API quotas
- General: Exponential backoff with retry

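
Exponential backoff with retry can be sketched as follows. This is a generic illustration, not the project's retry code: the function name `retry_with_backoff`, the default of five attempts, the one-second base delay, the 60-second cap, and the 10% jitter are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                       base_delay: float = 1.0, max_delay: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, doubling the wait after each failure, until it succeeds or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter avoids retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.
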

## Production Deployment

### Systemd Services

Services are configured in `/etc/systemd/system/`:
- `hvac-content-images-8am.service` - Morning run with image downloads
- `hvac-content-images-12pm.service` - Noon run with image downloads
- `hvac-content-images-8am.timer` - Morning schedule (8 AM Atlantic)
- `hvac-content-images-12pm.timer` - Noon schedule (12 PM Atlantic)

### Manual Deployment

```bash
# Start services
sudo systemctl start hvac-content-8am.timer
sudo systemctl start hvac-content-12pm.timer

# Enable on boot
sudo systemctl enable hvac-content-8am.timer
sudo systemctl enable hvac-content-12pm.timer

# Check status
sudo systemctl status hvac-content-*.timer
```

## Monitoring

```bash
# View logs
journalctl -u hvac-content-8am -f

# Check file growth
ls -lh data/markdown_current/

# View statistics
uv run python -c "from src.cumulative_markdown_manager import CumulativeMarkdownManager; ..."
```

## Testing

```bash
# Run all tests
uv run pytest

# Test specific scraper
uv run pytest tests/test_youtube_scraper.py

# Test cumulative mode
uv run python test_cumulative_mode.py
```

## Troubleshooting

### Common Issues

1. **Instagram Rate Limiting**: The scraper implements humanized delays (18-22 seconds between requests)
2. **YouTube Quota Exceeded**: Wait until the next day; the quota resets at midnight Pacific Time
3. **NAS Permission Errors**: Warnings are expected; files still sync successfully
4. **Missing Captions**: Use the YouTube Data API instead of youtube-transcript-api

### Debug Commands

```bash
# Check scraper state
cat data/.state/*_state.json

# View recent logs
tail -f logs/YouTube/youtube_*.log

# Test single source
uv run python -m src.youtube_api_scraper_v2 --test
```

## Recent Updates (2025-08-19)

### Comprehensive Image Downloading
- Implemented full image download capability for all content sources
- Instagram: Downloads all post images, carousel images, and video thumbnails
- YouTube: Automatically fetches highest quality video thumbnails
- Podcasts: Downloads episode artwork and thumbnails
- Consistent naming: `{source}_{item_id}_{type}.{ext}`
- Media organized in `data/media/{source}/` directories

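
The media naming pattern and directory layout above can be combined into one path helper. This is a sketch under assumptions: the function name `media_path`, the example type value `image`, and the use of the source name unchanged in both the folder and the filename are illustrative choices, not confirmed project behaviour.

```python
from pathlib import Path


def media_path(source: str, item_id: str, kind: str, ext: str,
               root: Path = Path("data/media")) -> Path:
    """Build data/media/{source}/{source}_{item_id}_{type}.{ext}."""
    # Accept the extension with or without a leading dot.
    filename = f"{source}_{item_id}_{kind}.{ext.lstrip('.')}"
    return root / source / filename
```

For example, `media_path("Instagram", "abc123", "image", "jpg")` points at `data/media/Instagram/Instagram_abc123_image.jpg`.
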

### File Naming Standardization
- Migrated to naming that complies with the project specification
- Format: `<brandName>_<source>_<dateTime>.md`
- Example: `hkia_instagram_2025-08-19T100511.md`
- Archived legacy file structures to `markdown_archives/legacy_structure/`

### Instagram Backlog Expansion
- Completed initial capture of 1000 posts with images
- Currently capturing posts 1001-2000 with rate limiting
- Cumulative markdown updates every 100 posts
- Full image download for all historical content

### Production Automation
- Deployed systemd services for twice-daily runs (8 AM, 12 PM Atlantic)
- Automated NAS synchronization for markdown and media files
- Rate-limited scraping with humanized delays (10-20 seconds per Instagram post)

## License

Private repository - All rights reserved