Ben Reed 2edc359b5e feat: Implement comprehensive image downloading and cumulative markdown system

Major Updates:
- Added image downloading for Instagram, YouTube, and Podcast scrapers
- Implemented cumulative markdown system for maintaining single source-of-truth files
- Deployed production services with automatic NAS sync for images
- Standardized file naming conventions per project specification

New Features:
- Instagram: Downloads all post images, carousel images, and video thumbnails
- YouTube: Downloads video thumbnails (highest quality available)
- Podcast: Downloads episode artwork/thumbnails
- Consistent image naming: {source}_{item_id}_{type}.{ext}
- Cumulative markdown updates to prevent file proliferation
- Automatic media sync to NAS at /mnt/nas/hvacknowitall/media/

Production Deployment:
- New systemd services: hvac-content-images-8am and hvac-content-images-12pm
- Runs twice daily at 8 AM and 12 PM Atlantic time
- Comprehensive rsync for both markdown and media files

File Structure Compliance:
- Renamed Instagram backlog to spec-compliant format
- Archived legacy directory structures
- Ensured all new files follow <brandName>_<source>_<dateTime>.md format

Testing:
- Successfully captured Instagram posts 1-1000 with images
- Launched next batch (posts 1001-2000) currently in progress
- Verified thumbnail downloads for YouTube and Podcast content

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-08-19 12:54:21 -03:00

5.1 KiB


Image Download System

Overview

The HVAC Know It All content aggregation system downloads images for all supported sources: Instagram, YouTube, and podcasts. It saves thumbnails and still images, but not videos, to provide visual context alongside the markdown content.

Supported Image Types

Instagram

Post images: All images from single posts and carousel posts
Video thumbnails: Thumbnail images for video posts (videos themselves are not downloaded)
Story images: Images from stories (video stories get thumbnails only)

YouTube

Video thumbnails: High-resolution thumbnails for each video
Quality: Tries maxres first, then falls back to high, medium, and default
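
The fallback order can be sketched as a simple lookup. This is a minimal illustration, not the scraper's actual code: `pick_best_thumbnail` is a hypothetical helper, and the input is assumed to have the shape of the `thumbnails` mapping in a YouTube Data API v3 `snippet` (quality name mapped to an object with a `url`).

```python
from typing import Optional

# Preference order taken from the list above: best quality first.
QUALITY_ORDER = ("maxres", "high", "medium", "default")


def pick_best_thumbnail(thumbnails: dict) -> Optional[str]:
    """Return the URL of the best available thumbnail, or None if there is none."""
    for quality in QUALITY_ORDER:
        entry = thumbnails.get(quality)
        if entry and entry.get("url"):
            return entry["url"]
    return None


if __name__ == "__main__":
    # Hypothetical sample data; real entries also carry width and height.
    sample = {
        "default": {"url": "https://example.invalid/default.jpg"},
        "high": {"url": "https://example.invalid/high.jpg"},
    }
    print(pick_best_thumbnail(sample))
```

Keeping the order in one tuple makes it easy to add or reorder quality levels without touching the lookup logic.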

Podcasts

Episode thumbnails: iTunes artwork and media thumbnails for each episode
Formats: PNG/JPEG episode artwork

File Naming Convention

All downloaded images follow a consistent naming pattern:

{source}_{item_id}_{type}_{optional_number}.{ext}

Examples:

instagram_Cm1wgRMr_mj_video_thumb.jpg
instagram_CpgiKyqPoX1_image_1.jpg
youtube_dQw4w9WgXcQ_thumbnail.jpg
podcast_episode123_thumbnail.png
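
As an illustration of the pattern, the sketch below assembles a file name from its parts. `build_image_filename` is a hypothetical helper written for this example; the real scrapers may build names differently.

```python
from typing import Optional


def build_image_filename(source: str, item_id: str, image_type: str,
                         ext: str, number: Optional[int] = None) -> str:
    """Build a name of the form {source}_{item_id}_{type}_{optional_number}.{ext}."""
    parts = [source.lower(), item_id, image_type]
    if number is not None:
        # The number is only present for multi-image items such as carousels.
        parts.append(str(number))
    return "_".join(parts) + "." + ext.lstrip(".").lower()


if __name__ == "__main__":
    print(build_image_filename("instagram", "CpgiKyqPoX1", "image", "jpg", 1))
    print(build_image_filename("youtube", "dQw4w9WgXcQ", "thumbnail", "jpg"))
```

Note that item IDs may themselves contain underscores (as in `Cm1wgRMr_mj`), so a file name cannot be split back into its parts reliably by splitting on `_` alone.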

Directory Structure

data/
├── media/
│   ├── Instagram/
│   │   ├── instagram_post1_image.jpg
│   │   └── instagram_post2_video_thumb.jpg
│   ├── YouTube/
│   │   ├── youtube_video1_thumbnail.jpg
│   │   └── youtube_video2_thumbnail.jpg
│   └── Podcast/
│       ├── podcast_ep1_thumbnail.png
│       └── podcast_ep2_thumbnail.jpg
└── markdown_current/
    ├── hvacknowitall_instagram_*.md
    ├── hvacknowitall_youtube_*.md
    └── hvacknowitall_podcast_*.md

Enhanced Scrapers

InstagramScraperWithImages

Extends InstagramScraper
Downloads all non-video media
Handles carousel posts with multiple images
Stores local paths in local_images field

YouTubeAPIScraperWithThumbnails

Extends YouTubeAPIScraper
Downloads video thumbnails
Selects highest quality available
Stores local path in local_thumbnail field

RSSScraperPodcastWithImages

Extends RSSScraperPodcast
Downloads episode thumbnails
Extracts from iTunes metadata
Stores local path in local_thumbnail field
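
To show what extracting artwork from iTunes metadata can look like, here is a standard-library sketch that reads the `<itunes:image>` element of an RSS item. It is an assumption-laden illustration: `episode_artwork_url` and the sample feed are hypothetical, and the actual scraper may rely on a feed-parsing library instead.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace URI used by the iTunes podcast RSS extension.
ITUNES_NS = {"itunes": "http://www.itunes.com/dtds/podcast-1.0.dtd"}


def episode_artwork_url(item: ET.Element) -> Optional[str]:
    """Return the href of an <itunes:image> child of an RSS <item>, if present."""
    image = item.find("itunes:image", ITUNES_NS)
    if image is None:
        return None
    return image.get("href")


# Hypothetical two-episode feed: only the first episode has artwork.
SAMPLE_FEED = """<rss version="2.0"
     xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <item>
      <title>Episode 1</title>
      <itunes:image href="https://example.invalid/ep1.png"/>
    </item>
    <item>
      <title>Episode 2</title>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_FEED)
    for item in root.iter("item"):
        print(item.findtext("title"), episode_artwork_url(item))
```

Episodes without artwork return `None`, which lets the caller skip the download rather than fail.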

Production Scripts

run_production_with_images.py

Main production script that:

Runs all enhanced scrapers
Downloads images during content fetching
Updates cumulative markdown files
Syncs both markdown and images to NAS

Test Script

test_image_downloads.py - Tests image downloading with small batches:

3 YouTube videos
3 Instagram posts
3 Podcast episodes

NAS Synchronization

The rsync step syncs image files in addition to markdown. Patterns are quoted so that the shell does not expand them before rsync sees them:

# Sync markdown files
rsync -av --include='*.md' --exclude='*' data/markdown_current/ /mnt/nas/hvacknowitall/markdown_current/

# Sync image files
rsync -av --include='*/' --include='*.jpg' --include='*.jpeg' --include='*.png' --include='*.gif' --exclude='*' data/media/ /mnt/nas/hvacknowitall/media/
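
If the sync is driven from Python, the image command above could be wrapped roughly as follows. This is a sketch under assumptions: `build_media_rsync_cmd` and `sync_media` are hypothetical names, and the project's real sync function may differ.

```python
import subprocess
from typing import List

# Image extensions matching the --include patterns shown above.
IMAGE_PATTERNS = ("*.jpg", "*.jpeg", "*.png", "*.gif")


def build_media_rsync_cmd(src: str, dest: str) -> List[str]:
    """Build the rsync argument list for syncing image files only."""
    # "--include=*/" keeps directories so rsync can descend into them.
    cmd = ["rsync", "-av", "--include=*/"]
    cmd += [f"--include={pattern}" for pattern in IMAGE_PATTERNS]
    # Everything not matched by an include rule is excluded.
    cmd += ["--exclude=*", src, dest]
    return cmd


def sync_media(src: str = "data/media/",
               dest: str = "/mnt/nas/hvacknowitall/media/") -> None:
    """Run rsync; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_media_rsync_cmd(src, dest), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try anywhere.
    print(" ".join(build_media_rsync_cmd("data/media/", "/mnt/nas/hvacknowitall/media/")))
```

Passing the arguments as a list means no shell is involved, so the `*` patterns reach rsync unexpanded.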

Markdown Integration

Downloaded images are referenced in markdown files:

## Thumbnail:
![Thumbnail](media/YouTube/youtube_videoId_thumbnail.jpg)

## Downloaded Images:
- [image1.jpg](media/Instagram/instagram_postId_image_1.jpg)
- [image2.jpg](media/Instagram/instagram_postId_image_2.jpg)
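
A section like the one above could be generated with a few lines of Python. The helper below is a hypothetical sketch; in particular, it uses each file's name as the link text, which may not match the labels the real scrapers emit.

```python
from pathlib import PurePosixPath
from typing import List


def render_image_section(image_paths: List[str]) -> str:
    """Render a 'Downloaded Images' markdown section linking each local image."""
    lines = ["## Downloaded Images:"]
    for path in image_paths:
        # Use the file name as the visible link text.
        name = PurePosixPath(path).name
        lines.append(f"- [{name}]({path})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_image_section([
        "media/Instagram/instagram_postId_image_1.jpg",
        "media/Instagram/instagram_postId_image_2.jpg",
    ]))
```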

Rate Limiting Considerations

Instagram: Long delays between image downloads (10-20 seconds)
YouTube: Minimal delays, respects API quota
Podcast: No rate limiting needed for RSS feeds
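
The Instagram delay can be sketched as a pause of 10 to 20 seconds between consecutive downloads. The code below is an illustration only: `download_with_delays` is a hypothetical helper, and picking the delay at random within the range is an assumption, not something the scraper is documented to do.

```python
import random
import time
from typing import Callable, Iterable, List


def download_with_delays(urls: Iterable[str],
                         download: Callable[[str], None],
                         min_delay: float = 10.0,
                         max_delay: float = 20.0,
                         sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Call download(url) for each URL, pausing between consecutive calls.

    Returns the list of delays used, which makes the behaviour easy to inspect.
    """
    delays: List[float] = []
    for index, url in enumerate(urls):
        if index > 0:  # no pause before the first download
            delay = random.uniform(min_delay, max_delay)
            sleep(delay)
            delays.append(delay)
        download(url)
    return delays


if __name__ == "__main__":
    # Demo with a no-op sleep so it finishes instantly.
    used = download_with_delays(["img1", "img2", "img3"], print, sleep=lambda s: None)
    print(f"{len(used)} pauses between 3 downloads")
```

Injecting `sleep` as a parameter keeps the function testable without actually waiting.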

Storage Estimates

Based on testing:

Instagram: ~70-100 KB per image
YouTube: ~100-200 KB per thumbnail
Podcast: ~3-4 MB per episode thumbnail (high quality artwork)

For 1000 items per source:

Instagram: ~70-100 MB (assuming 1 image per post)
YouTube: ~100-200 MB
Podcast: ~3-4 GB (if all episodes have artwork)
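
The per-source totals follow from the per-image sizes by multiplication. The sketch below reproduces that arithmetic so it can be rerun for other item counts; the function name is hypothetical, and 1 MB is taken as 1000 KB for simplicity.

```python
from typing import Tuple

# Per-image size ranges in KB, from the estimates above
# (podcast artwork converted from 3-4 MB using 1 MB = 1000 KB).
SIZE_RANGE_KB = {
    "Instagram": (70, 100),
    "YouTube": (100, 200),
    "Podcast": (3000, 4000),
}


def estimate_storage_mb(source: str, items: int,
                        images_per_item: int = 1) -> Tuple[float, float]:
    """Return the (low, high) storage estimate in MB for the given item count."""
    low_kb, high_kb = SIZE_RANGE_KB[source]
    count = items * images_per_item
    return (low_kb * count / 1000, high_kb * count / 1000)


if __name__ == "__main__":
    for source in SIZE_RANGE_KB:
        low, high = estimate_storage_mb(source, 1000)
        print(f"{source}: {low:.0f}-{high:.0f} MB")
```

For Instagram carousels, raising `images_per_item` scales the estimate accordingly.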

Usage

Test Image Downloads

python test_image_downloads.py

Production Run with Images

python run_production_with_images.py

Check Downloaded Images

# Count all downloaded images across sources
find data/media -type f \( -name "*.jpg" -o -name "*.jpeg" -o -name "*.png" -o -name "*.gif" \) | wc -l

# Check disk usage
du -sh data/media/*
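
For a breakdown by source rather than a single total, a short Python sketch can walk the media directory. `count_images_per_source` is a hypothetical helper, and it assumes the `data/media/<Source>/` layout shown under Directory Structure.

```python
from collections import Counter
from pathlib import Path

# Extensions treated as images, matching the rsync include patterns.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def count_images_per_source(media_dir: str = "data/media") -> Counter:
    """Count image files under each first-level directory of media_dir."""
    counts: Counter = Counter()
    root = Path(media_dir)
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            # The first path component below media_dir is the source name.
            counts[path.relative_to(root).parts[0]] += 1
    return counts


if __name__ == "__main__":
    for source, total in sorted(count_images_per_source().items()):
        print(f"{source}: {total}")
```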

Configuration

No additional configuration is needed. The system uses the existing environment variables:

Instagram credentials for authenticated image access
YouTube API key (used for video metadata; the thumbnails themselves are public)
Podcast RSS URL (thumbnails are listed in the feed metadata)

Future Enhancements

Potential improvements:

Image optimization/compression to reduce storage
Configurable image quality settings
Option to download video files (currently excluded)
Thumbnail generation for videos without thumbnails
Image deduplication for repeated content

Troubleshooting

Images Not Downloading

Check network connectivity
Verify source credentials (Instagram)
Check disk space
Review logs for HTTP errors

Rate Limiting

Instagram may block rapid downloads
Use longer delays between downloads in the scraper
Consider downloading in smaller batches

Storage Issues

Monitor disk usage
Consider external storage for media
Implement rotation/archiving strategy
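
Disk usage can also be checked from Python before a run starts. The sketch below uses `shutil.disk_usage` from the standard library; the helper name and the 5 GB default threshold are assumptions chosen for illustration, not project settings.

```python
import shutil


def has_free_space(path: str = ".", min_free_gb: float = 5.0) -> bool:
    """Return True if the filesystem holding path has at least min_free_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= min_free_gb * 1024 ** 3


if __name__ == "__main__":
    if not has_free_space("."):
        print("Low disk space: consider archiving media before the next run")
```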
