Major Updates:
- Added image downloading for Instagram, YouTube, and Podcast scrapers
- Implemented cumulative markdown system for maintaining single source-of-truth files
- Deployed production services with automatic NAS sync for images
- Standardized file naming conventions per project specification
New Features:
- Instagram: Downloads all post images, carousel images, and video thumbnails
- YouTube: Downloads video thumbnails (highest quality available)
- Podcast: Downloads episode artwork/thumbnails
- Consistent image naming: {source}_{item_id}_{type}.{ext}
- Cumulative markdown updates to prevent file proliferation
- Automatic media sync to NAS at /mnt/nas/hvacknowitall/media/
Production Deployment:
- New systemd services: hvac-content-images-8am and hvac-content-images-12pm
- Runs twice daily at 8 AM and 12 PM Atlantic time
- Comprehensive rsync for both markdown and media files
File Structure Compliance:
- Renamed Instagram backlog to spec-compliant format
- Archived legacy directory structures
- Ensured all new files follow <brandName>_<source>_<dateTime>.md format
Testing:
- Successfully captured Instagram posts 1-1000 with images
- Launched the next batch (posts 1001-2000), currently in progress
- Verified thumbnail downloads for YouTube and Podcast content
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Image Download System
Overview
The HVAC Know It All content aggregation system now includes comprehensive image downloading capabilities for all supported sources. This system downloads thumbnails and images (but not videos) to provide visual context alongside the markdown content.
Supported Image Types
Instagram
- Post images: All images from single posts and carousel posts
- Video thumbnails: Thumbnail images for video posts (videos themselves are not downloaded)
- Story images: Images from stories (video stories get thumbnails only)
YouTube
- Video thumbnails: High-resolution thumbnails for each video
- Quality: Tries maxres first, then falls back to high, medium, and default
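The quality fallback described above can be sketched as a simple ordered lookup. This is a minimal illustration, not the scraper's actual code: the function name `pick_best_thumbnail` is hypothetical, and the input is assumed to be a dict shaped like the `thumbnails` object of a YouTube Data API video snippet (quality name mapped to a dict with a `url` key).

```python
from typing import Optional

# Preference order as stated in this document. The key names are assumed to
# match the "thumbnails" object returned for a video snippet.
QUALITY_ORDER = ("maxres", "high", "medium", "default")


def pick_best_thumbnail(thumbnails: dict) -> Optional[str]:
    """Return the URL of the best available thumbnail, or None if there is none."""
    for quality in QUALITY_ORDER:
        entry = thumbnails.get(quality)
        if entry and entry.get("url"):
            return entry["url"]
    return None
```

A video that only exposes `high` and `default` would resolve to the `high` URL, and a video with no thumbnails returns `None` so the caller can skip the download.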
Podcasts
- Episode thumbnails: iTunes artwork and media thumbnails for each episode
- Formats: PNG/JPEG episode artwork
File Naming Convention
All downloaded images follow a consistent naming pattern:
{source}_{item_id}_{type}_{optional_number}.{ext}
Examples:
instagram_Cm1wgRMr_mj_video_thumb.jpg
instagram_CpgiKyqPoX1_image_1.jpg
youtube_dQw4w9WgXcQ_thumbnail.jpg
podcast_episode123_thumbnail.png
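As an illustration of the pattern, a small helper could assemble these names. The function `build_image_filename` and its parameters are hypothetical and only mirror the fields in the pattern; the real scrapers may build names differently.

```python
from typing import Optional


def build_image_filename(source: str, item_id: str, image_type: str,
                         ext: str, number: Optional[int] = None) -> str:
    """Build {source}_{item_id}_{type}_{optional_number}.{ext}."""
    # The source prefix is lowercased to match the examples in this document.
    parts = [source.lower(), item_id, image_type]
    if number is not None:
        # The number is only present for items with several images (carousels).
        parts.append(str(number))
    return "_".join(parts) + "." + ext.lstrip(".")
```

For example, `build_image_filename("instagram", "CpgiKyqPoX1", "image", "jpg", 1)` yields `instagram_CpgiKyqPoX1_image_1.jpg`.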
Directory Structure
data/
├── media/
│   ├── Instagram/
│   │   ├── instagram_post1_image.jpg
│   │   └── instagram_post2_video_thumb.jpg
│   ├── YouTube/
│   │   ├── youtube_video1_thumbnail.jpg
│   │   └── youtube_video2_thumbnail.jpg
│   └── Podcast/
│       ├── podcast_ep1_thumbnail.png
│       └── podcast_ep2_thumbnail.jpg
└── markdown_current/
    ├── hvacnkowitall_instagram_*.md
    ├── hvacnkowitall_youtube_*.md
    └── hvacnkowitall_podcast_*.md
Enhanced Scrapers
InstagramScraperWithImages
- Extends InstagramScraper
- Downloads all non-video media
- Handles carousel posts with multiple images
- Stores local paths in the local_images field
YouTubeAPIScraperWithThumbnails
- Extends YouTubeAPIScraper
- Downloads video thumbnails
- Selects highest quality available
- Stores local path in the local_thumbnail field
RSSScraperPodcastWithImages
- Extends RSSScraperPodcast
- Downloads episode thumbnails
- Extracts from iTunes metadata
- Stores local path in the local_thumbnail field
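To make the "extracts from iTunes metadata" step concrete, here is a standard-library sketch that reads the `itunes:image` element of each feed item. It is an assumption about how the extraction could work, not the code in RSSScraperPodcast, which may rely on a feed-parsing library instead; the function name `episode_artwork_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Namespace used by iTunes podcast tags in RSS feeds.
ITUNES_NS = {"itunes": "http://www.itunes.com/dtds/podcast-1.0.dtd"}


def episode_artwork_urls(rss_xml: str) -> List[Optional[str]]:
    """Return one artwork URL (or None) for each <item> in the feed."""
    root = ET.fromstring(rss_xml)
    urls: List[Optional[str]] = []
    for item in root.iter("item"):
        image = item.find("itunes:image", ITUNES_NS)
        # Episode artwork is carried in the href attribute of <itunes:image>.
        urls.append(image.get("href") if image is not None else None)
    return urls
```

Episodes without their own artwork come back as `None`, so a caller can fall back to the channel image or skip the download.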
Production Scripts
run_production_with_images.py
Main production script that:
- Runs all enhanced scrapers
- Downloads images during content fetching
- Updates cumulative markdown files
- Syncs both markdown and images to NAS
Test Script
test_image_downloads.py - Tests image downloading with small batches:
- 3 YouTube videos
- 3 Instagram posts
- 3 Podcast episodes
NAS Synchronization
The rsync function has been enhanced to sync images:
# Sync markdown files (patterns are quoted so the shell does not expand them)
rsync -av --include='*.md' --exclude='*' data/markdown_current/ /mnt/nas/hvacknowitall/markdown_current/
# Sync image files, keeping the per-source subdirectories
rsync -av --include='*/' --include='*.jpg' --include='*.jpeg' --include='*.png' --include='*.gif' --exclude='*' data/media/ /mnt/nas/hvacknowitall/media/
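When the media sync is driven from Python, one option is to build the argument list and pass it to `subprocess.run` without a shell, so the include patterns reach rsync untouched. The sketch below assumes this approach; the helper names `build_media_rsync_cmd` and `sync_media` are hypothetical, and only the paths and rsync flags come from the commands above.

```python
import subprocess
from typing import List

# Image patterns taken from the rsync command shown in this document.
IMAGE_PATTERNS = ("*.jpg", "*.jpeg", "*.png", "*.gif")


def build_media_rsync_cmd(src: str, dest: str) -> List[str]:
    """Build the rsync argument list for syncing image files only."""
    cmd = ["rsync", "-av", "--include=*/"]  # descend into per-source folders
    cmd += [f"--include={pattern}" for pattern in IMAGE_PATTERNS]
    cmd += ["--exclude=*", src, dest]  # drop everything that is not an image
    return cmd


def sync_media(src: str = "data/media/",
               dest: str = "/mnt/nas/hvacknowitall/media/") -> None:
    """Run the sync; raises CalledProcessError if rsync exits non-zero."""
    # A list argument means no shell is involved, so globs are not expanded.
    subprocess.run(build_media_rsync_cmd(src, dest), check=True)
```

Keeping the command builder separate from the call makes it easy to log or inspect the exact command before anything touches the NAS.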
Markdown Integration
Downloaded images are referenced in markdown files:
## Thumbnail:

## Downloaded Images:
- [image1.jpg](media/Instagram/instagram_postId_image_1.jpg)
- [image2.jpg](media/Instagram/instagram_postId_image_2.jpg)
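A section like the one above could be generated from the stored local paths. This is a hedged sketch: `render_downloaded_images` is a hypothetical helper, and the link labels (`image1.jpg`, `image2.jpg`, ...) simply follow the example shown here rather than a confirmed format.

```python
import os
from typing import List


def render_downloaded_images(paths: List[str]) -> str:
    """Render a 'Downloaded Images' markdown section from local image paths."""
    lines = ["## Downloaded Images:"]
    for index, path in enumerate(paths, start=1):
        ext = os.path.splitext(path)[1]  # keep the real extension in the label
        lines.append(f"- [image{index}{ext}]({path})")
    return "\n".join(lines)
```

Called with the two Instagram paths from the example, it reproduces the two list entries shown above.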
Rate Limiting Considerations
- Instagram: Aggressive delays between image downloads (10-20 seconds)
- YouTube: Minimal delays, respects API quota
- Podcast: No rate limiting needed for RSS feeds
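The Instagram delay can be illustrated with a small loop that pauses between downloads. The randomized wait within the 10-20 second window is an assumption (the document only gives the range), and `download_with_delays` and its parameters are hypothetical names.

```python
import random
import time
from typing import Callable, Iterable, List


def download_with_delays(urls: Iterable[str],
                         download: Callable[[str], object],
                         min_delay: float = 10.0,
                         max_delay: float = 20.0,
                         sleep: Callable[[float], None] = time.sleep) -> List[object]:
    """Call download(url) for each URL, pausing between consecutive downloads."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait a random 10-20 s before every download except the first.
            sleep(random.uniform(min_delay, max_delay))
        results.append(download(url))
    return results
```

The `sleep` parameter is injectable so tests can record the delays instead of actually waiting.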
Storage Estimates
Based on testing:
- Instagram: ~70-100 KB per image
- YouTube: ~100-200 KB per thumbnail
- Podcast: ~3-4 MB per episode thumbnail (high quality artwork)
For 1000 items per source (using the upper end of each per-item range):
- Instagram: ~100 MB (assuming 1 image per post)
- YouTube: ~200 MB
- Podcast: ~4 GB (if all episodes have artwork)
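The totals follow directly from the per-item sizes. A quick check of the arithmetic, using decimal units (1 MB = 1000 KB) and a hypothetical helper name:

```python
from typing import Tuple

# Per-item size ranges in KB, taken from the estimates in this document.
PER_ITEM_KB = {
    "Instagram": (70, 100),
    "YouTube": (100, 200),
    "Podcast": (3000, 4000),  # 3-4 MB per episode artwork
}


def estimate_storage_mb(items: int, low_kb: float, high_kb: float) -> Tuple[float, float]:
    """Return the (low, high) storage estimate in MB for a number of items."""
    return (items * low_kb / 1000, items * high_kb / 1000)


for source, (low, high) in PER_ITEM_KB.items():
    low_mb, high_mb = estimate_storage_mb(1000, low, high)
    print(f"{source}: {low_mb:.0f}-{high_mb:.0f} MB")
```

For 1000 items this gives 70-100 MB for Instagram, 100-200 MB for YouTube, and 3000-4000 MB (3-4 GB) for Podcast, so the figures listed are the high end of each range.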
Usage
Test Image Downloads
python test_image_downloads.py
Production Run with Images
python run_production_with_images.py
Check Downloaded Images
# Count all downloaded images across sources
find data/media -type f \( -name "*.jpg" -o -name "*.jpeg" -o -name "*.png" -o -name "*.gif" \) | wc -l
# Check disk usage
du -sh data/media/*
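For a per-source breakdown rather than a single total, a short Python sketch can walk the media directory and group by top-level folder. The function name `count_images_per_source` is hypothetical; the directory layout and extensions are the ones described in this document.

```python
from collections import Counter
from pathlib import Path
from typing import Dict

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def count_images_per_source(media_root: str = "data/media") -> Dict[str, int]:
    """Count image files under each top-level source folder of media_root."""
    root = Path(media_root)
    counts: Counter = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            # The first path component below the root is the source folder.
            counts[path.relative_to(root).parts[0]] += 1
    return dict(counts)
```

Running `print(count_images_per_source())` from the project root would print a mapping such as source folder name to image count.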
Configuration
No additional configuration needed. The system uses existing environment variables:
- Instagram credentials for authenticated image access
- YouTube API key (thumbnails are public)
- Podcast RSS URL (thumbnails in feed metadata)
Future Enhancements
Potential improvements:
- Image optimization/compression to reduce storage
- Configurable image quality settings
- Option to download video files (currently excluded)
- Thumbnail generation for videos without thumbnails
- Image deduplication for repeated content
Troubleshooting
Images Not Downloading
- Check network connectivity
- Verify source credentials (Instagram)
- Check disk space
- Review logs for HTTP errors
Rate Limiting
- Instagram may block rapid downloads
- Use aggressive delays in scraper
- Consider batching downloads
Storage Issues
- Monitor disk usage
- Consider external storage for media
- Implement rotation/archiving strategy
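Disk usage can also be checked programmatically before a run. The sketch below uses `shutil.disk_usage` from the standard library; the helper name `free_space_ok` and the 5 GB threshold are illustrative assumptions, not values taken from the project.

```python
import shutil


def free_space_ok(path: str = ".", min_free_gb: float = 5.0) -> bool:
    """Return True if the filesystem containing path has enough free space."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 1e9 >= min_free_gb


if not free_space_ok("."):
    print("Low disk space: consider archiving or moving media to external storage")
```

A production script could call this against the media directory and skip image downloads when it returns `False`.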