Documentation Updates:
- Updated project specification with hkia naming and paths
- Modified all markdown documentation files (12 files updated)
- Changed service names from hvac-content-* to hkia-content-*
- Updated NAS paths from /mnt/nas/hvacknowitall to /mnt/nas/hkia
- Replaced all instances of "HVAC Know It All" with "HKIA"

Files Updated:
- README.md - Updated service names and commands
- CLAUDE.md - Updated environment variables and paths
- DEPLOY.md - Updated deployment instructions
- docs/project_specification.md - Updated naming convention specs
- docs/status.md - Updated project status with new naming
- docs/final_status.md - Updated completion status
- docs/deployment_strategy.md - Updated deployment paths
- docs/DEPLOYMENT_CHECKLIST.md - Updated checklist items
- docs/PRODUCTION_TODO.md - Updated production tasks
- BACKLOG_STATUS.md - Updated backlog references
- UPDATED_CAPTURE_STATUS.md - Updated capture status
- FINAL_TALLY_REPORT.md - Updated tally report

Notes:
- Repository name remains hvacknowitall-content (unchanged)
- Project directory remains hvac-kia-content (unchanged)
- All user-facing outputs now use clean "hkia" naming

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
# HKIA Content Aggregation System

A containerized Python application that aggregates content from multiple HKIA sources, converts it to markdown format, and syncs the results to a NAS.

## Features

- **Multi-source content aggregation** from YouTube, Instagram, TikTok, MailChimp, WordPress, and Podcast RSS
- **Comprehensive image downloading** for all visual content (Instagram posts, YouTube thumbnails, Podcast artwork)
- **Cumulative markdown management** - Single source-of-truth files that grow with backlog and incremental updates
- **API integrations** for YouTube Data API v3 and MailChimp API
- **Intelligent content merging** with caption/transcript updates and metric tracking
- **Automated NAS synchronization** to `/mnt/nas/hkia/` for both markdown and media files
- **State management** for incremental updates
- **Parallel processing** for multiple sources
- **Atlantic timezone** (America/Halifax) timestamps

## Cumulative Markdown System

### Overview
The system maintains a single markdown file per source that combines:
- Initial backlog content (historical data)
- Daily incremental updates (new content)
- Content updates (new captions, updated metrics)

### How It Works

1. **Initial Backlog**: First run creates base file with all historical content
2. **Daily Incremental**: Subsequent runs merge new content into existing file
3. **Smart Merging**: Updates existing entries when better data is available (captions, transcripts, metrics)
4. **Archival**: Previous versions archived with timestamps for history
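
The merge behavior in steps 2 and 3 can be sketched roughly as follows. This is an illustration only, not the project's `CumulativeMarkdownManager`: the function name `merge_items` and the item shape (a dict with `id`, `caption`, and `views` keys) are assumptions made for the example.

```python
def merge_items(existing, incoming):
    """Merge incoming items into existing ones, keyed by a hypothetical 'id' field."""
    # Copy each item so the caller's lists are not mutated.
    merged = {item["id"]: dict(item) for item in existing}
    for item in incoming:
        current = merged.get(item["id"])
        if current is None:
            # New content: add it as-is.
            merged[item["id"]] = dict(item)
            continue
        # Existing content: fill in a caption only if none was stored before.
        if not current.get("caption") and item.get("caption"):
            current["caption"] = item["caption"]
        # Metrics change over time, so the most recent value wins.
        if "views" in item:
            current["views"] = item["views"]
    return list(merged.values())


backlog = [{"id": "a1", "caption": "", "views": 10}]
daily = [
    {"id": "a1", "caption": "New caption", "views": 25},
    {"id": "b2", "caption": "Second post", "views": 3},
]
result = merge_items(backlog, daily)
```

Here the entry `a1` gains its caption and updated view count, while `b2` is appended as new content.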

### File Naming Convention
```
<brandName>_<source>_<dateTime>.md
Example: hkia_YouTube_2025-08-19T143045.md
```
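
A filename in this format could be produced along the following lines. The helper name `build_filename` is hypothetical, and the timestamp layout is inferred from the example above rather than taken from the project's code.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def build_filename(brand, source, when=None):
    """Return '<brandName>_<source>_<dateTime>.md' for the given moment."""
    if when is None:
        # Default to the current time in the Atlantic timezone.
        when = datetime.now(ZoneInfo("America/Halifax"))
    # Date with dashes, then 'T', then HHMMSS with no separators.
    stamp = when.strftime("%Y-%m-%dT%H%M%S")
    return f"{brand}_{source}_{stamp}.md"


example = build_filename("hkia", "YouTube", datetime(2025, 8, 19, 14, 30, 45))
print(example)
```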

## Quick Start

### Installation

```bash
# Install UV package manager
pip install uv

# Install dependencies
uv pip install -r requirements.txt
```

### Configuration

Create `.env` file with credentials:
```env
# YouTube
YOUTUBE_API_KEY=your_api_key

# MailChimp
MAILCHIMP_API_KEY=your_api_key
MAILCHIMP_SERVER_PREFIX=us10

# Instagram
INSTAGRAM_USERNAME=username
INSTAGRAM_PASSWORD=password

# WordPress
WORDPRESS_USERNAME=username
WORDPRESS_API_KEY=api_key
```
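
Before a run, it can help to confirm that these variables are actually set. The check below is a sketch under two assumptions: that the values have already been loaded into the process environment, and that every variable listed above is required. The helper name `missing_vars` is hypothetical.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = [
    "YOUTUBE_API_KEY",
    "MAILCHIMP_API_KEY",
    "MAILCHIMP_SERVER_PREFIX",
    "INSTAGRAM_USERNAME",
    "INSTAGRAM_PASSWORD",
    "WORDPRESS_USERNAME",
    "WORDPRESS_API_KEY",
]


def missing_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
partial = {"YOUTUBE_API_KEY": "your_api_key"}
print(missing_vars(partial))
```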

### Running

```bash
# Run all scrapers (parallel)
uv run python run_all_scrapers.py

# Run single source
uv run python -m src.youtube_api_scraper_v2

# Test cumulative mode
uv run python test_cumulative_mode.py

# Consolidate existing files
uv run python consolidate_current_files.py
```

## Architecture

### Core Components

- **BaseScraper**: Abstract base class for all scrapers
- **BaseScraperCumulative**: Enhanced base with cumulative support
- **CumulativeMarkdownManager**: Handles intelligent file merging
- **ContentOrchestrator**: Manages parallel scraper execution
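
Parallel execution of independent scrapers can be sketched with a thread pool. This is not the actual `ContentOrchestrator`. The function `run_all`, the use of threads, and the idea that each scraper is a zero-argument callable are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(scrapers, max_workers=4):
    """Run each scraper (name -> zero-argument callable) in parallel.

    A failure in one scraper is recorded and does not stop the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in scrapers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = ("ok", future.result())
            except Exception as exc:
                results[name] = ("error", str(exc))
    return results


def failing_scraper():
    raise RuntimeError("rate limited")


outcome = run_all({"YouTube": lambda: 12, "Instagram": failing_scraper})
```

Isolating failures per source means one rate-limited platform does not block the rest of the run.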

### Data Flow

```
1. Scraper fetches content (checks state for incremental)
2. CumulativeMarkdownManager loads existing file
3. Merges new content (adds new, updates existing)
4. Archives previous version
5. Saves updated file with current timestamp
6. Updates state for next run
```
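
Steps 4 to 6 of this flow (archive, save, update state) might look like the sketch below. The function name `save_with_archive` and the `last_id` state key are hypothetical. The directory names follow the layout documented in this README, and the example runs in a temporary directory.

```python
import json
import shutil
import tempfile
from pathlib import Path


def save_with_archive(current_dir, archive_dir, state_file, source,
                      new_name, content, last_id):
    """Archive the previous file for a source, save the new one, update state."""
    current_dir.mkdir(parents=True, exist_ok=True)
    source_archive = archive_dir / source
    source_archive.mkdir(parents=True, exist_ok=True)
    # Step 4: move any previous file for this source into its archive folder.
    for old in list(current_dir.glob(f"*_{source}_*.md")):
        shutil.move(str(old), str(source_archive / old.name))
    # Step 5: write the merged content under the new timestamped name.
    (current_dir / new_name).write_text(content, encoding="utf-8")
    # Step 6: record where the next incremental run should resume.
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps({"last_id": last_id}), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    current = root / "markdown_current"
    archives = root / "markdown_archives"
    state = root / ".state" / "YouTube_state.json"
    save_with_archive(current, archives, state, "YouTube",
                      "hkia_YouTube_2025-08-18T080000.md", "# day 1\n", "v1")
    save_with_archive(current, archives, state, "YouTube",
                      "hkia_YouTube_2025-08-19T080000.md", "# day 1\n# day 2\n", "v2")
    current_files = sorted(p.name for p in current.iterdir())
    archived_files = sorted(p.name for p in (archives / "YouTube").iterdir())
    saved_state = json.loads(state.read_text(encoding="utf-8"))
```

After the second call, only the newest file remains in `markdown_current/` and the earlier one sits in the per-source archive.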

### Directory Structure

```
data/
├── markdown_current/      # Current single-source-of-truth files
├── markdown_archives/     # Historical versions by source
│   ├── YouTube/
│   ├── Instagram/
│   └── ...
├── media/                 # Downloaded media files
│   ├── Instagram/         # Instagram images and video thumbnails
│   ├── YouTube/           # YouTube video thumbnails
│   ├── Podcast/           # Podcast episode artwork
│   └── ...
└── .state/                # State files for incremental updates

logs/                      # Log files by source
src/                       # Source code
tests/                     # Test files
```

## API Quota Management

### YouTube Data API v3
- **Daily Limit**: 10,000 units
- **Usage Strategy**: 95% of the daily quota is reserved for captions
- **Costs**:
  - videos.list: 1 unit
  - captions.list: 50 units
  - channels.list: 1 unit
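
The figures above imply a concrete daily budget. The arithmetic below uses only the numbers listed in this section. The variable names are illustrative.

```python
DAILY_QUOTA = 10_000          # units per day
CAPTION_SHARE_PERCENT = 95    # share of the quota reserved for captions
COSTS = {"videos.list": 1, "captions.list": 50, "channels.list": 1}

# Units available for caption requests each day.
caption_budget = DAILY_QUOTA * CAPTION_SHARE_PERCENT // 100
# Number of captions.list calls that fit in that budget.
max_caption_calls = caption_budget // COSTS["captions.list"]
# Units left over for the cheaper videos.list and channels.list calls.
remaining_units = DAILY_QUOTA - caption_budget

print(caption_budget, max_caption_calls, remaining_units)
```

This works out to 9,500 units for captions, which covers 190 `captions.list` calls per day and leaves 500 units for everything else.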

### Rate Limiting
- Instagram: 200 posts/hour
- YouTube: Respects API quotas
- General: Exponential backoff with retry
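
Exponential backoff with retry generally follows the pattern below. This is a generic sketch, not the project's implementation. The function name `retry_with_backoff`, its defaults, and the 10% jitter are assumptions.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call func, doubling the wait after each failure up to max_delay."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel scrapers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


calls = {"count": 0}


def flaky():
    """Fail twice, then succeed, to exercise the retry loop."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []
result = retry_with_backoff(flaky, sleep=waits.append)
```

Passing `sleep=waits.append` records the delays instead of waiting, which keeps the example fast to run.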

## Production Deployment

### Systemd Services

Services are configured in `/etc/systemd/system/`:
- `hkia-content-images-8am.service` - Morning run with image downloads
- `hkia-content-images-12pm.service` - Noon run with image downloads
- `hkia-content-images-8am.timer` - Morning schedule (8 AM Atlantic)
- `hkia-content-images-12pm.timer` - Noon schedule (12 PM Atlantic)

### Manual Deployment

```bash
# Start services
sudo systemctl start hkia-content-8am.timer
sudo systemctl start hkia-content-12pm.timer

# Enable on boot
sudo systemctl enable hkia-content-8am.timer
sudo systemctl enable hkia-content-12pm.timer

# Check status (quote the pattern so the shell does not expand it)
sudo systemctl status 'hkia-content-*.timer'
```

## Monitoring

```bash
# View logs
journalctl -u hkia-content-8am -f

# Check file growth
ls -lh data/markdown_current/

# View statistics
uv run python -c "from src.cumulative_markdown_manager import CumulativeMarkdownManager; ..."
```

## Testing

```bash
# Run all tests
uv run pytest

# Test specific scraper
uv run pytest tests/test_youtube_scraper.py

# Test cumulative mode
uv run python test_cumulative_mode.py
```

## Troubleshooting

### Common Issues

1. **Instagram Rate Limiting**: The scraper implements humanized delays (18-22 seconds between requests)
2. **YouTube Quota Exceeded**: Wait until the next day; the quota resets at midnight Pacific Time
3. **NAS Permission Errors**: Permission warnings are normal, and the files still sync successfully
4. **Missing Captions**: Use the YouTube Data API instead of youtube-transcript-api
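
The humanized delay from item 1 amounts to sleeping for a random interval within a fixed range. The sketch below assumes a uniform distribution. The actual scraper may choose delays differently, and the function name `humanized_delay` is hypothetical.

```python
import random
import time


def humanized_delay(min_seconds=18.0, max_seconds=22.0, sleep=time.sleep):
    """Pause for a random interval so requests are not evenly spaced."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


# Record the delay instead of actually sleeping, to keep the example fast.
recorded = []
chosen = humanized_delay(sleep=recorded.append)
```

With an 18-second minimum, at most 3600 / 18 = 200 requests fit in an hour, which is consistent with the Instagram limit listed under Rate Limiting.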

### Debug Commands

```bash
# Check scraper state
cat data/.state/*_state.json

# View recent logs
tail -f logs/YouTube/youtube_*.log

# Test single source
uv run python -m src.youtube_api_scraper_v2 --test
```

## Recent Updates (2025-08-19)

### Comprehensive Image Downloading
- Implemented full image download capability for all content sources
- Instagram: Downloads all post images, carousel images, and video thumbnails
- YouTube: Automatically fetches highest quality video thumbnails
- Podcasts: Downloads episode artwork and thumbnails
- Consistent naming: `{source}_{item_id}_{type}.{ext}`
- Media organized in `data/media/{source}/` directories
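
The naming and directory conventions above can be combined into a single path helper. This is a sketch only: the function name `media_path` and the sample values (`abc123`, `thumbnail`) are hypothetical.

```python
from pathlib import Path


def media_path(source, item_id, kind, ext, root=Path("data/media")):
    """Build data/media/{source}/{source}_{item_id}_{type}.{ext}."""
    # Accept the extension with or without a leading dot.
    filename = f"{source}_{item_id}_{kind}.{ext.lstrip('.')}"
    return root / source / filename


thumb = media_path("YouTube", "abc123", "thumbnail", "jpg")
print(thumb.as_posix())
```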

### File Naming Standardization
- Migrated to project specification compliant naming
- Format: `<brandName>_<source>_<dateTime>.md`
- Example: `hkia_instagram_2025-08-19T100511.md`
- Archived legacy file structures to `markdown_archives/legacy_structure/`

### Instagram Backlog Expansion
- Completed initial 1000 posts capture with images
- Currently capturing posts 1001-2000 with rate limiting
- Cumulative markdown updates every 100 posts
- Full image download for all historical content

### Production Automation
- Deployed systemd services for twice-daily runs (8 AM, 12 PM Atlantic)
- Automated NAS synchronization for markdown and media files
- Rate-limited scraping with humanized delays (10-20 seconds per Instagram post)

## License

Private repository - All rights reserved