# HVAC Know It All Content Aggregation System
A containerized Python application that aggregates content from multiple HVAC Know It All sources, converts it to markdown, and syncs the results to a NAS.
## Features
- **Multi-source content aggregation** from YouTube, Instagram, TikTok, MailChimp, WordPress, and Podcast RSS
- **Cumulative markdown management** - single source-of-truth files that grow with backlog and incremental updates
- **API integrations** for the YouTube Data API v3 and the MailChimp API
- **Intelligent content merging** with caption/transcript updates and metric tracking
- **Automated NAS synchronization** to `/mnt/nas/hvacknowitall/`
- **State management** for incremental updates
- **Parallel processing** across multiple sources
- **Atlantic timezone** (America/Halifax) timestamps
## Cumulative Markdown System
### Overview
The system maintains a single markdown file per source that combines:

- Initial backlog content (historical data)
- Daily incremental updates (new content)
- Content updates (new captions, updated metrics)
### How It Works
1. **Initial Backlog**: the first run creates the base file with all historical content
2. **Daily Incremental**: subsequent runs merge new content into the existing file
3. **Smart Merging**: existing entries are updated when better data becomes available (captions, transcripts, metrics); see the sketch below
4. **Archival**: previous versions are archived with timestamps for history
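The merge itself is essentially a keyed upsert. The sketch below is illustrative only: the `Entry` fields and update rules are assumptions, not the actual `CumulativeMarkdownManager` API.

```python
# Illustrative sketch of the cumulative merge (not the real
# CumulativeMarkdownManager implementation; fields and rules are assumed).
from dataclasses import dataclass, field

@dataclass
class Entry:
    id: str                                      # stable key, e.g. a video ID
    title: str
    caption: str | None = None                   # transcript/caption, if available
    metrics: dict = field(default_factory=dict)  # e.g. views, likes

def merge(existing: dict[str, Entry], incoming: list[Entry]) -> dict[str, Entry]:
    """Add new entries; upgrade existing ones when better data arrives."""
    merged = dict(existing)
    for entry in incoming:
        current = merged.get(entry.id)
        if current is None:
            merged[entry.id] = entry             # brand-new content
            continue
        if entry.caption and not current.caption:
            current.caption = entry.caption      # caption fetched on a later run
        current.metrics.update(entry.metrics)    # refresh view/like counts
    return merged
```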
### File Naming Convention
```
<brandName>_<source>_<dateTime>.md

Example: hvacknowitall_YouTube_2025-08-19T143045.md
```
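A name in this format can be produced with the standard library; the helper below is illustrative, not a function from the codebase.

```python
# Illustrative helper: timestamped filename in America/Halifax time.
from datetime import datetime
from zoneinfo import ZoneInfo

def cumulative_filename(brand: str, source: str) -> str:
    now = datetime.now(ZoneInfo("America/Halifax"))
    return f"{brand}_{source}_{now:%Y-%m-%dT%H%M%S}.md"

print(cumulative_filename("hvacknowitall", "YouTube"))
# -> hvacknowitall_YouTube_2025-08-19T143045.md (timestamp varies)
```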
## Quick Start

### Installation
```bash
# Install the UV package manager
pip install uv

# Install dependencies
uv pip install -r requirements.txt
```
### Configuration

Create a `.env` file with the required credentials:
```env
# YouTube
YOUTUBE_API_KEY=your_api_key

# MailChimp
MAILCHIMP_API_KEY=your_api_key
MAILCHIMP_SERVER_PREFIX=us10

# Instagram
INSTAGRAM_USERNAME=username
INSTAGRAM_PASSWORD=password

# WordPress
WORDPRESS_USERNAME=username
WORDPRESS_API_KEY=api_key
```
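Each scraper needs its credentials present at startup; a small guard like the following (illustrative, not project code) makes missing values fail fast:

```python
# Illustrative startup check for the credentials defined in .env above.
import os

REQUIRED = [
    "YOUTUBE_API_KEY",
    "MAILCHIMP_API_KEY",
    "MAILCHIMP_SERVER_PREFIX",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing credentials: {', '.join(missing)}")
```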
### Running
```bash
# Run all scrapers (parallel)
uv run python run_all_scrapers.py

# Run a single source
uv run python -m src.youtube_api_scraper_v2

# Test cumulative mode
uv run python test_cumulative_mode.py

# Consolidate existing files
uv run python consolidate_current_files.py
```
## Architecture

### Core Components
- **BaseScraper**: abstract base class for all scrapers
- **BaseScraperCumulative**: extends the base class with cumulative support
- **CumulativeMarkdownManager**: handles intelligent file merging
- **ContentOrchestrator**: manages parallel scraper execution
### Data Flow
```
1. Scraper fetches content (checks state for incremental)
2. CumulativeMarkdownManager loads existing file
3. Merges new content (adds new, updates existing)
4. Archives previous version
5. Saves updated file with current timestamp
6. Updates state for next run
```
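Expressed as code, one run reduces to the six steps above; the function and method names in this sketch are illustrative, not the real module API.

```python
# Illustrative shape of a single incremental run (names are assumptions).
def run_source(scraper, manager, state):
    new_items = scraper.fetch_since(state.last_run)  # 1. fetch (incremental via state)
    existing = manager.load_current()                # 2. load existing file
    merged = manager.merge(existing, new_items)      # 3. add new, update existing
    manager.archive_previous()                       # 4. archive previous version
    manager.save(merged)                             # 5. save with current timestamp
    state.mark_completed()                           # 6. update state for next run
```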
### Directory Structure
```
data/
├── markdown_current/    # Current single-source-of-truth files
├── markdown_archives/   # Historical versions by source
│   ├── YouTube/
│   ├── Instagram/
│   └── ...
├── media/               # Downloaded media files
└── .state/              # State files for incremental updates

logs/                    # Log files by source
src/                     # Source code
tests/                   # Test files
```
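The `.state/` files are plain JSON, one per source (see the debug commands below); a minimal sketch of reading and writing them, with an assumed schema:

```python
# Sketch: read/update a per-source state file (the schema is an assumption).
import json
from pathlib import Path

STATE_DIR = Path("data/.state")

def load_state(source: str) -> dict:
    path = STATE_DIR / f"{source}_state.json"
    return json.loads(path.read_text()) if path.exists() else {}

def save_state(source: str, state: dict) -> None:
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    (STATE_DIR / f"{source}_state.json").write_text(json.dumps(state, indent=2))
```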
## API Quota Management
### YouTube Data API v3
- **Daily Limit**: 10,000 units
- **Usage Strategy**: 95% of the daily quota is spent on caption lookups (estimated below)
- **Costs**:
  - videos.list: 1 unit
  - captions.list: 50 units
  - channels.list: 1 unit
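As a sanity check, the 95% strategy allows roughly 190 caption lookups per day:

```python
# Daily caption budget under the 95% strategy (unit costs from the list above).
DAILY_QUOTA = 10_000
CAPTIONS_LIST_COST = 50

budget = int(DAILY_QUOTA * 0.95) // CAPTIONS_LIST_COST
print(budget)  # 190 caption lookups per day
```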
### Rate Limiting
- Instagram: 200 posts/hour
- YouTube: respects API quotas
- General: exponential backoff with retry (sketched below)
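The general retry policy is standard exponential backoff; the helper below is an illustrative sketch, not the project's actual retry code.

```python
# Sketch of exponential backoff with retry (illustrative).
import random
import time

def with_backoff(fetch, max_attempts=5, base_delay=1.0):
    """Call fetch(); on failure wait 1s, 2s, 4s, ... (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```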
## Production Deployment
### Systemd Services
Services are configured in `/etc/systemd/system/`:
- `hvac-content-8am.service` - Morning run
- `hvac-content-12pm.service` - Noon run
- `hvac-content-8am.timer` - Morning schedule
- `hvac-content-12pm.timer` - Noon schedule
### Manual Deployment
```bash
# Start services
sudo systemctl start hvac-content-8am.timer
sudo systemctl start hvac-content-12pm.timer

# Enable on boot
sudo systemctl enable hvac-content-8am.timer
sudo systemctl enable hvac-content-12pm.timer

# Check status
sudo systemctl status hvac-content-*.timer
```
## Monitoring
```bash
# View logs
journalctl -u hvac-content-8am -f

# Check file growth
ls -lh data/markdown_current/

# View statistics
uv run python -c "from src.cumulative_markdown_manager import CumulativeMarkdownManager; ..."
```
## Testing
```bash
# Run all tests
uv run pytest

# Test a specific scraper
uv run pytest tests/test_youtube_scraper.py

# Test cumulative mode
uv run python test_cumulative_mode.py
```
## Troubleshooting
### Common Issues
1. **Instagram Rate Limiting**: the scraper uses humanized delays (18-22 seconds between requests); see the sketch below
2. **YouTube Quota Exceeded**: wait until the next day; the quota resets at midnight Pacific time
3. **NAS Permission Errors**: permission warnings are expected; files still sync successfully
4. **Missing Captions**: use the YouTube Data API scraper instead of youtube-transcript-api
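The humanized delay from item 1 amounts to a randomized pause; a minimal sketch (illustrative, not the scraper's exact code):

```python
# Illustrative humanized delay between Instagram requests (18-22 s, per item 1).
import random
import time

def humanized_pause() -> None:
    time.sleep(random.uniform(18.0, 22.0))
```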
### Debug Commands
```bash
# Check scraper state
cat data/.state/*_state.json

# View recent logs
tail -f logs/YouTube/youtube_*.log

# Test a single source
uv run python -m src.youtube_api_scraper_v2 --test
```
## License
Private repository - All rights reserved