# HKIA Content Aggregation System

A containerized Python application that aggregates content from multiple HKIA sources, converts it to markdown, and syncs it to a NAS.
## Features

- Multi-source content aggregation from YouTube, Instagram, MailChimp, WordPress, and Podcast RSS (the TikTok scraper is currently disabled)
- Comprehensive image downloading for all visual content (Instagram posts, YouTube thumbnails, Podcast artwork)
- Cumulative markdown management - Single source-of-truth files that grow with backlog and incremental updates
- API integrations for YouTube Data API v3 and MailChimp API
- Intelligent content merging with caption/transcript updates and metric tracking
- Automated NAS synchronization to `/mnt/nas/hkia/` for both markdown and media files
- State management for incremental updates
- Parallel processing for multiple sources
- Atlantic timezone (America/Halifax) timestamps
## Cumulative Markdown System

### Overview
The system maintains a single markdown file per source that combines:
- Initial backlog content (historical data)
- Daily incremental updates (new content)
- Content updates (new captions, updated metrics)
### How It Works
- Initial Backlog: First run creates base file with all historical content
- Daily Incremental: Subsequent runs merge new content into existing file
- Smart Merging: Updates existing entries when better data is available (captions, transcripts, metrics)
- Archival: Previous versions archived with timestamps for history
### File Naming Convention

`<brandName>_<source>_<dateTime>.md`

Example: `hkia_YouTube_2025-08-19T143045.md`
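A minimal sketch of how such a name can be generated using the project's Atlantic (America/Halifax) timestamps; the helper name is illustrative, not an actual project function:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def markdown_filename(brand: str, source: str) -> str:
    """Build a <brandName>_<source>_<dateTime>.md filename (illustrative helper)."""
    now = datetime.now(ZoneInfo("America/Halifax"))
    return f"{brand}_{source}_{now.strftime('%Y-%m-%dT%H%M%S')}.md"

print(markdown_filename("hkia", "YouTube"))  # e.g. hkia_YouTube_2025-08-19T143045.md
```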
## Quick Start

### Installation

```bash
# Install the UV package manager
pip install uv

# Install dependencies
uv pip install -r requirements.txt
```
### Configuration

Create a `.env` file with credentials:

```bash
# YouTube
YOUTUBE_API_KEY=your_api_key

# MailChimp
MAILCHIMP_API_KEY=your_api_key
MAILCHIMP_SERVER_PREFIX=us10

# Instagram
INSTAGRAM_USERNAME=username
INSTAGRAM_PASSWORD=password

# WordPress
WORDPRESS_USERNAME=username
WORDPRESS_API_KEY=api_key
```
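These variables can be loaded at startup. A minimal sketch, assuming python-dotenv is the loader (this README does not name one):

```python
# Hypothetical loader sketch -- assumes the python-dotenv package;
# the project's actual configuration loading may differ.
import os
from dotenv import load_dotenv

load_dotenv(".env")  # reads KEY=value pairs into the process environment

YOUTUBE_API_KEY = os.environ["YOUTUBE_API_KEY"]              # fail fast if missing
MAILCHIMP_PREFIX = os.getenv("MAILCHIMP_SERVER_PREFIX", "us10")
```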
### Running

```bash
# Run all scrapers (parallel)
uv run python run_all_scrapers.py

# Run a single source
uv run python -m src.youtube_api_scraper_v2

# Test cumulative mode
uv run python test_cumulative_mode.py

# Consolidate existing files
uv run python consolidate_current_files.py
```
## Architecture

### Core Components

- `BaseScraper`: abstract base class for all scrapers
- `BaseScraperCumulative`: enhanced base class with cumulative support
- `CumulativeMarkdownManager`: handles intelligent file merging
- `ContentOrchestrator`: manages parallel scraper execution
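A sketch of what parallel execution can look like; `ContentOrchestrator`'s real interface may differ, and the `run()`/`name` attributes below are assumptions:

```python
# Illustrative sketch of parallel scraper execution, not the actual
# ContentOrchestrator implementation.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(scrapers):
    """Run each scraper in parallel and collect per-source results."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(scrapers) or 1) as pool:
        futures = {pool.submit(s.run): s for s in scrapers}
        for future in as_completed(futures):
            scraper = futures[future]
            try:
                results[scraper.name] = future.result()
            except Exception as exc:
                # one failing source should not block the others
                results[scraper.name] = exc
    return results
```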
### Data Flow

1. Scraper fetches content (checks state for incremental runs)
2. `CumulativeMarkdownManager` loads the existing file
3. Merges new content (adds new entries, updates existing ones)
4. Archives the previous version
5. Saves the updated file with the current timestamp
6. Updates state for the next run
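A compact sketch of the merge step (steps 2-5), assuming entries are keyed by a per-item ID; the real `CumulativeMarkdownManager` API may differ:

```python
def merge_entries(existing: dict, incoming: dict) -> dict:
    """Merge incoming items into existing entries, keyed by item ID.

    New IDs are added; known IDs are updated so later runs can fill in
    captions/transcripts and refresh metrics. Illustrative sketch only.
    """
    merged = dict(existing)
    for item_id, entry in incoming.items():
        old = merged.get(item_id, {})
        # keep old fields, let newer non-empty values win
        merged[item_id] = {**old, **{k: v for k, v in entry.items() if v}}
    return merged
```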
### Directory Structure

```
data/
├── markdown_current/    # Current single-source-of-truth files
├── markdown_archives/   # Historical versions by source
│   ├── YouTube/
│   ├── Instagram/
│   └── ...
├── media/               # Downloaded media files
│   ├── Instagram/       # Instagram images and video thumbnails
│   ├── YouTube/         # YouTube video thumbnails
│   ├── Podcast/         # Podcast episode artwork
│   └── ...
└── .state/              # State files for incremental updates
logs/                    # Log files by source
src/                     # Source code
tests/                   # Test files
```
## API Quota Management

### YouTube Data API v3

- Daily limit: 10,000 units
- Usage strategy: 95% of the daily quota reserved for captions
- Costs:
  - `videos.list`: 1 unit
  - `captions.list`: 50 units
  - `channels.list`: 1 unit
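Back-of-envelope arithmetic for the caption budget implied by the numbers above:

```python
DAILY_QUOTA = 10_000        # YouTube Data API v3 units per day
CAPTION_SHARE = 0.95        # strategy: reserve 95% of quota for captions
CAPTIONS_LIST_COST = 50     # units per captions.list call

caption_budget = int(DAILY_QUOTA * CAPTION_SHARE)   # 9500 units
print(caption_budget // CAPTIONS_LIST_COST)         # -> 190 captions.list calls/day
```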
### Rate Limiting

- Instagram: 200 posts/hour
- YouTube: respects API quotas
- General: exponential backoff with retry
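A minimal sketch of the backoff-with-retry behaviour; the project's actual retry helper may look different:

```python
import random
import time

def with_backoff(fn, retries: int = 5, base_delay: float = 1.0):
    """Call fn(), retrying on failure with exponentially growing delays."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```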
## Production Deployment

### Systemd Services

Services are configured in `/etc/systemd/system/`:

- `hkia-content-images-8am.service` - Morning run with image downloads
- `hkia-content-images-12pm.service` - Noon run with image downloads
- `hkia-content-images-8am.timer` - Morning schedule (8 AM Atlantic)
- `hkia-content-images-12pm.timer` - Noon schedule (12 PM Atlantic)
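For orientation, a timer with this schedule could look roughly like the sketch below (assuming systemd >= 242 for the timezone suffix); this is illustrative, not the shipped unit file:

```ini
# Illustrative sketch only -- not the project's actual unit file.
[Unit]
Description=HKIA content scrape, morning run with image downloads

[Timer]
# 8:00 AM Atlantic daily; Persistent=true catches up after downtime
OnCalendar=*-*-* 08:00:00 America/Halifax
Persistent=true

[Install]
WantedBy=timers.target
```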
### Manual Deployment

```bash
# Start services
sudo systemctl start hkia-content-8am.timer
sudo systemctl start hkia-content-12pm.timer

# Enable on boot
sudo systemctl enable hkia-content-8am.timer
sudo systemctl enable hkia-content-12pm.timer

# Check status
sudo systemctl status hkia-content-*.timer
```
### Monitoring

```bash
# View logs
journalctl -u hkia-content-8am -f

# Check file growth
ls -lh data/markdown_current/

# View statistics
uv run python -c "from src.cumulative_markdown_manager import CumulativeMarkdownManager; ..."
```
## Testing

```bash
# Run all tests
uv run pytest

# Test a specific scraper
uv run pytest tests/test_youtube_scraper.py

# Test cumulative mode
uv run python test_cumulative_mode.py
```
## Troubleshooting

### Common Issues

- Instagram rate limiting: the scraper uses humanized delays (18-22 seconds between requests); see the sketch after this list
- YouTube quota exceeded: wait until the next day; the quota resets at midnight Pacific
- NAS permission errors: warnings are normal; files still sync successfully
- Missing captions: use the YouTube Data API instead of youtube-transcript-api
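The humanized delay mentioned above amounts to a randomized pause; a minimal sketch:

```python
import random
import time

def humanized_pause(low: float = 18.0, high: float = 22.0) -> None:
    """Sleep a random 18-22 seconds between Instagram requests."""
    time.sleep(random.uniform(low, high))
```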
### Debug Commands

```bash
# Check scraper state
cat data/.state/*_state.json

# View recent logs
tail -f logs/YouTube/youtube_*.log

# Test a single source
uv run python -m src.youtube_api_scraper_v2 --test
```
## Recent Updates (2025-08-19)

### Comprehensive Image Downloading
- Implemented full image download capability for all content sources
- Instagram: Downloads all post images, carousel images, and video thumbnails
- YouTube: Automatically fetches highest quality video thumbnails
- Podcasts: Downloads episode artwork and thumbnails
- Consistent naming: `{source}_{item_id}_{type}.{ext}`
- Media organized in `data/media/{source}/` directories
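A small sketch of this convention in code; the helper name and signature are hypothetical:

```python
# Hypothetical helper for the {source}_{item_id}_{type}.{ext} convention.
from pathlib import Path

def media_path(source: str, item_id: str, media_type: str, ext: str) -> Path:
    return Path("data/media") / source / f"{source}_{item_id}_{media_type}.{ext}"

print(media_path("Instagram", "abc123", "image", "jpg"))
# data/media/Instagram/Instagram_abc123_image.jpg
```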
### File Naming Standardization

- Migrated to the project-specification-compliant naming scheme
- Format: `<brandName>_<source>_<dateTime>.md`
- Example: `hkia_instagram_2025-08-19T100511.md`
- Archived legacy file structures to `markdown_archives/legacy_structure/`
### Instagram Backlog Expansion

- Completed the initial 1000-post capture with images
- Currently capturing posts 1001-2000 with rate limiting
- Cumulative markdown updates every 100 posts
- Full image download for all historical content
### Production Automation
- Deployed systemd services for twice-daily runs (8 AM, 12 PM Atlantic)
- Automated NAS synchronization for markdown and media files
- Rate-limited scraping with humanized delays (10-20 seconds per Instagram post)
## License
Private repository - All rights reserved