# HVAC Know It All Content Aggregation System

A containerized Python application that aggregates content from multiple HVAC Know It All sources, converts it to markdown, and syncs it to a NAS.

## Features

- **Multi-source content aggregation** from YouTube, Instagram, TikTok, MailChimp, WordPress, and Podcast RSS
- **Comprehensive image downloading** for all visual content (Instagram posts, YouTube thumbnails, podcast artwork)
- **Cumulative markdown management** - single source-of-truth files that grow with backlog and incremental updates
- **API integrations** for the YouTube Data API v3 and the MailChimp API
- **Intelligent content merging** with caption/transcript updates and metric tracking
- **Automated NAS synchronization** to `/mnt/nas/hvacknowitall/` for both markdown and media files
- **State management** for incremental updates
- **Parallel processing** across sources
- **Atlantic timezone** (America/Halifax) timestamps

## Cumulative Markdown System

### Overview

The system maintains a single markdown file per source that combines:

- Initial backlog content (historical data)
- Daily incremental updates (new content)
- Content updates (new captions, updated metrics)

### How It Works

1. **Initial Backlog**: The first run creates a base file with all historical content
2. **Daily Incremental**: Subsequent runs merge new content into the existing file
3. **Smart Merging**: Existing entries are updated when better data becomes available (captions, transcripts, metrics)
4. **Archival**: Previous versions are archived with timestamps for history

### File Naming Convention

```
{brand}_{source}_{timestamp}.md

Example: hvacknowitall_YouTube_2025-08-19T143045.md
```

## Quick Start

### Installation

```bash
# Install UV package manager
pip install uv

# Install dependencies
uv pip install -r requirements.txt
```

### Configuration

Create a `.env` file with credentials:

```env
# YouTube
YOUTUBE_API_KEY=your_api_key

# MailChimp
MAILCHIMP_API_KEY=your_api_key
MAILCHIMP_SERVER_PREFIX=us10

# Instagram
INSTAGRAM_USERNAME=username
INSTAGRAM_PASSWORD=password

# WordPress
WORDPRESS_USERNAME=username
WORDPRESS_API_KEY=api_key
```

### Running

```bash
# Run all scrapers (parallel)
uv run python run_all_scrapers.py

# Run a single source
uv run python -m src.youtube_api_scraper_v2

# Test cumulative mode
uv run python test_cumulative_mode.py

# Consolidate existing files
uv run python consolidate_current_files.py
```

## Architecture

### Core Components

- **BaseScraper**: Abstract base class for all scrapers
- **BaseScraperCumulative**: Enhanced base class with cumulative support
- **CumulativeMarkdownManager**: Handles intelligent file merging
- **ContentOrchestrator**: Manages parallel scraper execution

### Data Flow

```
1. Scraper fetches content (checks state for incremental updates)
2. CumulativeMarkdownManager loads the existing file
3. Merges new content (adds new entries, updates existing ones)
4. Archives the previous version
5. Saves the updated file with the current timestamp
6. Updates state for the next run
```

### Directory Structure

```
data/
├── markdown_current/     # Current single-source-of-truth files
├── markdown_archives/    # Historical versions by source
│   ├── YouTube/
│   ├── Instagram/
│   └── ...
├── media/                # Downloaded media files
│   ├── Instagram/        # Instagram images and video thumbnails
│   ├── YouTube/          # YouTube video thumbnails
│   ├── Podcast/          # Podcast episode artwork
│   └── ...
└── .state/               # State files for incremental updates
logs/                     # Log files by source
src/                      # Source code
tests/                    # Test files
```

## API Quota Management

### YouTube Data API v3

- **Daily Limit**: 10,000 units
- **Usage Strategy**: 95% of the daily quota reserved for captions
- **Costs**:
  - `videos.list`: 1 unit
  - `captions.list`: 50 units
  - `channels.list`: 1 unit

### Rate Limiting

- Instagram: 200 posts/hour
- YouTube: respects API quotas
- General: exponential backoff with retry

## Production Deployment

### Systemd Services

Services are configured in `/etc/systemd/system/`:

- `hvac-content-images-8am.service` - Morning run with image downloads
- `hvac-content-images-12pm.service` - Noon run with image downloads
- `hvac-content-images-8am.timer` - Morning schedule (8 AM Atlantic)
- `hvac-content-images-12pm.timer` - Noon schedule (12 PM Atlantic)

### Manual Deployment

```bash
# Start services
sudo systemctl start hvac-content-images-8am.timer
sudo systemctl start hvac-content-images-12pm.timer

# Enable on boot
sudo systemctl enable hvac-content-images-8am.timer
sudo systemctl enable hvac-content-images-12pm.timer

# Check status
sudo systemctl status hvac-content-images-*.timer
```

## Monitoring

```bash
# View logs
journalctl -u hvac-content-images-8am -f

# Check file growth
ls -lh data/markdown_current/

# View statistics
uv run python -c "from src.cumulative_markdown_manager import CumulativeMarkdownManager; ..."
```

## Testing

```bash
# Run all tests
uv run pytest

# Test a specific scraper
uv run pytest tests/test_youtube_scraper.py

# Test cumulative mode
uv run python test_cumulative_mode.py
```

## Troubleshooting

### Common Issues

1. **Instagram Rate Limiting**: The scraper uses humanized delays (18-22 seconds between requests)
2. **YouTube Quota Exceeded**: Wait until the next day; the quota resets at midnight Pacific time
3. **NAS Permission Errors**: Warnings are normal; files still sync successfully
4. **Missing Captions**: Use the YouTube Data API instead of youtube-transcript-api

### Debug Commands

```bash
# Check scraper state
cat data/.state/*_state.json

# View recent logs
tail -f logs/YouTube/youtube_*.log

# Test a single source
uv run python -m src.youtube_api_scraper_v2 --test
```

## Recent Updates (2025-08-19)

### Comprehensive Image Downloading

- Implemented full image download capability for all content sources
- Instagram: downloads all post images, carousel images, and video thumbnails
- YouTube: automatically fetches the highest-quality video thumbnails
- Podcasts: downloads episode artwork and thumbnails
- Consistent naming: `{source}_{item_id}_{type}.{ext}`
- Media organized in `data/media/{source}/` directories

### File Naming Standardization

- Migrated to naming that complies with the project specification
- Format: `{brand}_{source}_{timestamp}.md`
- Example: `hvacknowitall_instagram_2025-08-19T100511.md`
- Archived legacy file structures to `markdown_archives/legacy_structure/`

### Instagram Backlog Expansion

- Completed the initial 1,000-post capture with images
- Currently capturing posts 1001-2000 with rate limiting
- Cumulative markdown updated every 100 posts
- Full image download for all historical content

### Production Automation

- Deployed systemd services for twice-daily runs (8 AM and 12 PM Atlantic)
- Automated NAS synchronization for markdown and media files
- Rate-limited scraping with humanized delays (10-20 seconds per Instagram post)

## License

Private repository - all rights reserved.
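## Appendix: Cumulative Merge Sketch

For illustration, the "smart merging" behaviour described in the Cumulative Markdown System section (add new entries; update existing entries only when better data arrives) can be sketched in Python. All names here (`Entry`, `merge_entries`) are illustrative, not the project's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One content item, keyed by a stable ID (video ID, post shortcode, ...)."""
    item_id: str
    title: str
    caption: str = ""
    metrics: dict = field(default_factory=dict)


def merge_entries(existing: dict[str, Entry], incoming: list[Entry]) -> dict[str, Entry]:
    """Merge newly scraped items into the cumulative set.

    - Unknown IDs are added as-is.
    - Known IDs keep their record but pick up better data: a non-empty
      caption fills in an empty one, and metrics are refreshed in place.
    """
    merged = dict(existing)
    for item in incoming:
        current = merged.get(item.item_id)
        if current is None:
            merged[item.item_id] = item            # brand-new content
            continue
        if item.caption and not current.caption:   # better caption available
            current.caption = item.caption
        current.metrics.update(item.metrics)       # refresh view/like counts
    return merged


if __name__ == "__main__":
    existing = {"a1": Entry("a1", "Heat pump basics", metrics={"views": 100})}
    incoming = [
        Entry("a1", "Heat pump basics", caption="Full transcript...", metrics={"views": 250}),
        Entry("b2", "Refrigerant recovery tips"),
    ]
    merged = merge_entries(existing, incoming)
    print(len(merged))                    # 2
    print(merged["a1"].caption)           # Full transcript...
    print(merged["a1"].metrics["views"])  # 250
```

In the real system the merged set would then be rendered to the single per-source markdown file, the previous version archived, and the scraper state updated for the next incremental run.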