# Image Download System

## Overview

The HKIA content aggregation system now downloads images for all supported sources: Instagram, YouTube, and podcasts. It saves thumbnails and still images (but not videos) to provide visual context alongside the markdown content.

## Supported Image Types

### Instagram
- **Post images**: All images from single posts and carousel posts
- **Video thumbnails**: Thumbnail images for video posts (videos themselves are not downloaded)
- **Story images**: Images from stories (video stories get thumbnails only)

### YouTube
- **Video thumbnails**: High-resolution thumbnails for each video
- **Quality preference**: Tries `maxres`, then `high`, `medium`, and `default`, and uses the first quality that is available

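The YouTube quality preference can be sketched as a lookup over the `thumbnails` object that the YouTube Data API returns in a video's `snippet`. This is a minimal illustration rather than the scraper's actual code; `pick_best_thumbnail` and `QUALITY_ORDER` are hypothetical names.

```python
from typing import Optional

# Preference order described above, best quality first.
QUALITY_ORDER = ("maxres", "high", "medium", "default")


def pick_best_thumbnail(thumbnails: dict) -> Optional[str]:
    """Return the URL of the best available thumbnail, or None if none is usable."""
    for quality in QUALITY_ORDER:
        entry = thumbnails.get(quality)
        if entry and entry.get("url"):
            return entry["url"]
    return None
```

Older or low-resolution videos often lack a `maxres` entry, which is why the fallback chain matters.
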
### Podcasts
- **Episode thumbnails**: iTunes artwork and media thumbnails for each episode
- **Formats**: PNG/JPEG episode artwork

## File Naming Convention

All downloaded images follow a consistent naming pattern:
```
{source}_{item_id}_{type}_{optional_number}.{ext}
```

Examples:
- `instagram_Cm1wgRMr_mj_video_thumb.jpg`
- `instagram_CpgiKyqPoX1_image_1.jpg`
- `youtube_dQw4w9WgXcQ_thumbnail.jpg`
- `podcast_episode123_thumbnail.png`

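As an illustration of the naming pattern, the sketch below assembles a filename from its parts. `build_image_filename` is a hypothetical helper, not a function confirmed to exist in the codebase; it only assumes that the optional number is omitted when an item has a single image.

```python
from typing import Optional


def build_image_filename(source: str, item_id: str, image_type: str,
                         ext: str, number: Optional[int] = None) -> str:
    """Build a name following {source}_{item_id}_{type}_{optional_number}.{ext}."""
    parts = [source.lower(), item_id, image_type]
    if number is not None:
        # The number is only appended when the caller supplies one.
        parts.append(str(number))
    return "_".join(parts) + "." + ext.lstrip(".").lower()
```

For example, `build_image_filename("instagram", "CpgiKyqPoX1", "image", "jpg", 1)` yields `instagram_CpgiKyqPoX1_image_1.jpg`, matching the second example in the list.
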
## Directory Structure

```
data/
├── media/
│   ├── Instagram/
│   │   ├── instagram_post1_image.jpg
│   │   └── instagram_post2_video_thumb.jpg
│   ├── YouTube/
│   │   ├── youtube_video1_thumbnail.jpg
│   │   └── youtube_video2_thumbnail.jpg
│   └── Podcast/
│       ├── podcast_ep1_thumbnail.png
│       └── podcast_ep2_thumbnail.jpg
└── markdown_current/
    ├── hkia_instagram_*.md
    ├── hkia_youtube_*.md
    └── hkia_podcast_*.md
```

## Enhanced Scrapers

### InstagramScraperWithImages
- Extends `InstagramScraper`
- Downloads all non-video media
- Handles carousel posts with multiple images
- Stores local paths in `local_images` field

### YouTubeAPIScraperWithThumbnails
- Extends `YouTubeAPIScraper`
- Downloads video thumbnails
- Selects highest quality available
- Stores local path in `local_thumbnail` field

### RSSScraperPodcastWithImages
- Extends `RSSScraperPodcast`
- Downloads episode thumbnails
- Extracts from iTunes metadata
- Stores local path in `local_thumbnail` field

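To illustrate what extracting a thumbnail from iTunes metadata can look like, the sketch below reads the `<itunes:image>` element of an RSS `<item>` using only the standard library. It is an assumption-laden example: the real `RSSScraperPodcastWithImages` may use a feed-parsing library instead, `episode_artwork_url` is a hypothetical helper, and the sample feed is invented for demonstration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace URI used by the iTunes podcast RSS extensions.
ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"


def episode_artwork_url(item: ET.Element) -> Optional[str]:
    """Return the href of an <itunes:image> child of an RSS <item>, if present."""
    image = item.find(f"{{{ITUNES_NS}}}image")
    if image is None:
        return None
    return image.get("href")


# Invented sample feed: one episode with artwork, one without.
SAMPLE_FEED = """<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
  <channel>
    <item>
      <title>Episode 1</title>
      <itunes:image href="https://example.com/ep1.png"/>
    </item>
    <item>
      <title>Episode 2</title>
    </item>
  </channel>
</rss>"""

items = ET.fromstring(SAMPLE_FEED).findall("./channel/item")
artwork = [episode_artwork_url(item) for item in items]
```

Episodes without artwork return `None`, so a caller can skip the download and leave `local_thumbnail` unset.
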
## Production Scripts

### run_production_with_images.py
Main production script that:
1. Runs all enhanced scrapers
2. Downloads images during content fetching
3. Updates cumulative markdown files
4. Syncs both markdown and images to NAS

### Test Script
`test_image_downloads.py` - Tests image downloading with small batches:
- 3 YouTube videos
- 3 Instagram posts
- 3 Podcast episodes

## NAS Synchronization

The rsync function has been enhanced to sync images in addition to markdown files. The corresponding rsync commands are:

```bash
# Sync markdown files (patterns are quoted so the shell does not expand them)
rsync -av --include='*.md' --exclude='*' data/markdown_current/ /mnt/nas/hkia/markdown_current/

# Sync image files, descending into the per-source subdirectories
rsync -av --include='*/' --include='*.jpg' --include='*.jpeg' --include='*.png' --include='*.gif' --exclude='*' data/media/ /mnt/nas/hkia/media/
```

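Because the production scripts are written in Python, the image sync is most likely invoked through `subprocess`. The sketch below shows one way to build and run that command; `build_media_rsync_command` and `sync_media` are hypothetical names, and the default paths are taken from the commands in this section.

```python
import subprocess
from typing import List

# Image extensions included in the sync, as listed in the rsync command above.
IMAGE_PATTERNS = ("*.jpg", "*.jpeg", "*.png", "*.gif")


def build_media_rsync_command(source_dir: str, dest_dir: str) -> List[str]:
    """Build the rsync argument list for syncing image files only."""
    cmd = ["rsync", "-av", "--include=*/"]
    cmd += [f"--include={pattern}" for pattern in IMAGE_PATTERNS]
    cmd += ["--exclude=*", source_dir, dest_dir]
    return cmd


def sync_media(source_dir: str = "data/media/",
               dest_dir: str = "/mnt/nas/hkia/media/") -> bool:
    """Run rsync and report success; arguments are passed as a list, so no shell quoting is needed."""
    result = subprocess.run(build_media_rsync_command(source_dir, dest_dir),
                            capture_output=True, text=True)
    return result.returncode == 0
```

Passing the arguments as a list avoids shell glob expansion entirely, which is why the patterns need no quotes here.
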
## Markdown Integration

Downloaded images are referenced in markdown files:

```markdown
## Thumbnail:


## Downloaded Images:
- [image1.jpg](media/Instagram/instagram_postId_image_1.jpg)
- [image2.jpg](media/Instagram/instagram_postId_image_2.jpg)
```

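The "Downloaded Images" list can be generated from the paths a scraper stores in `local_images`. The sketch below is one possible rendering, assuming (hypothetically) that those paths are stored relative to the project root under `data/`; `render_downloaded_images` is an illustrative name, not a confirmed function.

```python
from pathlib import PurePosixPath
from typing import List


def render_downloaded_images(local_images: List[str], data_root: str = "data") -> str:
    """Render a markdown list of image links with paths relative to data_root."""
    lines = ["## Downloaded Images:"]
    for index, path in enumerate(local_images, start=1):
        relative = PurePosixPath(path).relative_to(data_root)
        label = f"image{index}{relative.suffix}"
        lines.append(f"- [{label}]({relative})")
    return "\n".join(lines)
```

Keeping the links relative to `data/` means they stay valid after the markdown and media directories are synced to the NAS side by side.
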
## Rate Limiting Considerations

- **Instagram**: Aggressive delays between image downloads (10-20 seconds)
- **YouTube**: Minimal delays, respects API quota
- **Podcast**: No rate limiting needed for RSS feeds

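A randomized per-source delay is one simple way to implement these limits. In the sketch below only the Instagram range (10-20 seconds) comes from this document; the YouTube and podcast values, and the names `DELAY_RANGES` and `wait_between_downloads`, are illustrative assumptions.

```python
import random
import time
from typing import Callable, Dict, Tuple

# (min_seconds, max_seconds) per source. Only the Instagram range comes from
# the list above; the other values are assumed for illustration.
DELAY_RANGES: Dict[str, Tuple[float, float]] = {
    "instagram": (10.0, 20.0),
    "youtube": (0.0, 1.0),
    "podcast": (0.0, 0.0),
}


def wait_between_downloads(source: str,
                           sleep: Callable[[float], None] = time.sleep) -> float:
    """Sleep for a random delay within the source's range and return the delay used."""
    low, high = DELAY_RANGES[source]
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

The `sleep` parameter is injectable so that tests can record the delay instead of actually waiting.
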
## Storage Estimates

Based on testing:
- **Instagram**: ~70-100 KB per image
- **YouTube**: ~100-200 KB per thumbnail
- **Podcast**: ~3-4 MB per episode thumbnail (high quality artwork)

For 1000 items per source, using the upper end of each range:
- Instagram: ~100 MB (assuming 1 image per post)
- YouTube: ~200 MB
- Podcast: ~4 GB (if all episodes have artwork)

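The totals above follow directly from multiplying the upper per-image size by the item count. The sketch below makes that arithmetic reusable for other item counts; it uses decimal units (1 MB = 1000 KB) to match the rounded figures, and `estimate_storage_mb` is a hypothetical helper.

```python
# Upper-bound size per image in KB, taken from the per-item figures above.
MAX_KB_PER_IMAGE = {"instagram": 100, "youtube": 200, "podcast": 4000}


def estimate_storage_mb(source: str, item_count: int, images_per_item: int = 1) -> float:
    """Estimate worst-case storage in MB for a source (1 MB = 1000 KB)."""
    return MAX_KB_PER_IMAGE[source] * item_count * images_per_item / 1000
```

Carousel-heavy Instagram accounts can be modelled by raising `images_per_item`, since the ~100 MB figure assumes one image per post.
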
## Usage

### Test Image Downloads
```bash
python test_image_downloads.py
```

### Production Run with Images
```bash
python run_production_with_images.py
```

### Check Downloaded Images
```bash
# Count images per source
for dir in data/media/*/; do
  echo "$dir: $(find "$dir" -type f \( -name "*.jpg" -o -name "*.jpeg" -o -name "*.png" -o -name "*.gif" \) | wc -l)"
done

# Check disk usage
du -sh data/media/*
```

## Configuration

No additional configuration needed. The system uses existing environment variables:
- Instagram credentials for authenticated image access
- YouTube API key (thumbnails are public)
- Podcast RSS URL (thumbnails in feed metadata)

## Future Enhancements

Potential improvements:
1. Image optimization/compression to reduce storage
2. Configurable image quality settings
3. Option to download video files (currently excluded)
4. Thumbnail generation for videos without thumbnails
5. Image deduplication for repeated content

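As a rough idea of how the deduplication item could work, the sketch below groups files by the SHA-256 hash of their contents. It is not part of the current system; `find_duplicate_images` is a hypothetical name, and byte-identical matching is an assumed strategy that would not catch re-encoded copies of the same picture.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def find_duplicate_images(media_dir: str) -> Dict[str, List[Path]]:
    """Map each content hash to the files sharing it, keeping only hashes with duplicates."""
    by_hash: Dict[str, List[Path]] = {}
    for path in sorted(Path(media_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash.setdefault(digest, []).append(path)
    return {digest: paths for digest, paths in by_hash.items() if len(paths) > 1}
```

Reading whole files into memory is acceptable for thumbnails of a few megabytes; larger media would call for chunked hashing.
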
## Troubleshooting

### Images Not Downloading
- Check network connectivity
- Verify source credentials (Instagram)
- Check disk space
- Review logs for HTTP errors

### Rate Limiting
- Instagram may block rapid downloads
- Use aggressive delays in scraper
- Consider batching downloads

### Storage Issues
- Monitor disk usage
- Consider external storage for media
- Implement rotation/archiving strategy
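
The disk-space checks mentioned above can also be automated before a run. The sketch below uses the standard library's `shutil.disk_usage`; `has_free_space` is a hypothetical helper, and the threshold is left to the caller.

```python
import shutil


def has_free_space(path: str, required_mb: float) -> bool:
    """Return True if the filesystem holding `path` has at least required_mb free."""
    free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    return free_mb >= required_mb
```

A production script could call this on `data/media` with a threshold derived from the storage estimates and skip image downloads when it returns `False`.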