HKIA Content Aggregation System - Complete content scraping and markdown generation for 5 sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram)
Ben Reed 8ceb858026 Implement cumulative markdown system and API integrations
Major improvements:
- Add CumulativeMarkdownManager for intelligent content merging
- Implement YouTube Data API v3 integration with caption support
- Add MailChimp API integration with content cleaning
- Create single source-of-truth files that grow with updates
- Smart merging: updates existing entries with better data
- Properly combines backlog + incremental daily updates

Features:
- 179/444 YouTube videos now have captions (40.3%)
- MailChimp content cleaned of headers/footers
- All sources consolidated to single files
- Archive management with timestamped versions
- Test suite and documentation included

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-19 10:53:40 -03:00
config Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
data_production_backlog Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
data_quick_test Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
docs Implement cumulative markdown system and API integrations 2025-08-19 10:53:40 -03:00
monitoring Add comprehensive monitoring and alerting system 2025-08-18 21:35:28 -03:00
src Implement cumulative markdown system and API integrations 2025-08-19 10:53:40 -03:00
systemd Add comprehensive monitoring and alerting system 2025-08-18 21:35:28 -03:00
test_data Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
tests Add comprehensive test infrastructure 2025-08-18 21:16:14 -03:00
.env.production Optimize Instagram scraper and increase capture targets to 1000 2025-08-18 22:59:11 -03:00
.gitignore Initial commit: Project foundation with base scraper and tests 2025-08-18 12:15:17 -03:00
.python-version Initial commit: Project foundation with base scraper and tests 2025-08-18 12:15:17 -03:00
automated_backlog_capture.py Optimize Instagram scraper and increase capture targets to 1000 2025-08-18 22:59:11 -03:00
BACKLOG_STATUS.md Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
capture_tiktok_backlog.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
claude.md Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
CLAUDE.md Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
clean_markdown.py Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
consolidate_current_files.py Implement cumulative markdown system and API integrations 2025-08-19 10:53:40 -03:00
continue_youtube_captions.py Implement cumulative markdown system and API integrations 2025-08-19 10:53:40 -03:00
debug_wordpress.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
debug_wordpress_raw.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
debug_youtube_detailed.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
debug_youtube_videos.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
deploy_production.sh Optimize Instagram scraper and increase capture targets to 1000 2025-08-18 22:59:11 -03:00
detailed_monitor.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
FINAL_TALLY_REPORT.md Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
install.sh Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
install_production.sh Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
main.py Initial commit: Project foundation with base scraper and tests 2025-08-18 12:15:17 -03:00
monitor_backlog.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
monitor_backlog_progress.sh Optimize Instagram scraper and increase capture targets to 1000 2025-08-18 22:59:11 -03:00
production_backlog_capture.py Optimize Instagram scraper and increase capture targets to 1000 2025-08-18 22:59:11 -03:00
pyproject.toml Add final dependencies for monitoring and testing 2025-08-18 21:49:43 -03:00
quick_backlog_test.py Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
README.md Implement cumulative markdown system and API integrations 2025-08-19 10:53:40 -03:00
requirements.txt Implement retry logic, connection pooling, and production hardening 2025-08-18 20:16:02 -03:00
requirements_new.txt Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
resume_instagram_capture.py Optimize Instagram scraper and increase capture targets to 1000 2025-08-18 22:59:11 -03:00
run_api_production_v2.py Implement cumulative markdown system and API integrations 2025-08-19 10:53:40 -03:00
run_production.py Implement retry logic, connection pooling, and production hardening 2025-08-18 20:16:02 -03:00
status.md Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
test_cumulative_mode.py Implement cumulative markdown system and API integrations 2025-08-19 10:53:40 -03:00
test_instagram_debug.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
test_instagram_fix.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
test_markitdown_fix.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
test_production_deployment.py Add comprehensive production documentation and testing 2025-08-18 20:20:52 -03:00
test_real_data.py feat: Enhance TikTok scraper with caption fetching and improved video discovery 2025-08-18 18:59:46 -03:00
test_sources_simple.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
test_tiktok_advanced.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
test_tiktok_scrapling.py Fix critical production issues and improve spec compliance 2025-08-18 20:07:55 -03:00
test_wordpress_clean.py Fix HTML/XML contamination in WordPress markdown extraction 2025-08-18 23:11:08 -03:00
UPDATED_CAPTURE_STATUS.md Optimize Instagram scraper and increase capture targets to 1000 2025-08-18 22:59:11 -03:00
uv.lock Add final dependencies for monitoring and testing 2025-08-18 21:49:43 -03:00
validate_production.sh Optimize Instagram scraper and increase capture targets to 1000 2025-08-18 22:59:11 -03:00

HVAC Know It All Content Aggregation System

A containerized Python application that aggregates content from multiple HVAC Know It All sources, converts it to markdown, and syncs the results to a NAS.

Features

  • Multi-source content aggregation from YouTube, Instagram, TikTok, MailChimp, WordPress, and Podcast RSS
  • Cumulative markdown management - Single source-of-truth files that grow with backlog and incremental updates
  • API integrations for YouTube Data API v3 and MailChimp API
  • Intelligent content merging with caption/transcript updates and metric tracking
  • Automated NAS synchronization to /mnt/nas/hvacknowitall/
  • State management for incremental updates
  • Parallel processing for multiple sources
  • Atlantic timezone (America/Halifax) timestamps

Cumulative Markdown System

Overview

The system maintains a single markdown file per source that combines:

  • Initial backlog content (historical data)
  • Daily incremental updates (new content)
  • Content updates (new captions, updated metrics)

How It Works

  1. Initial Backlog: First run creates base file with all historical content
  2. Daily Incremental: Subsequent runs merge new content into existing file
  3. Smart Merging: Updates existing entries when better data is available (captions, transcripts, metrics)
  4. Archival: Previous versions archived with timestamps for history
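
The smart-merging step can be sketched roughly as below. This is a minimal illustration, not the actual CumulativeMarkdownManager code: the function name `merge_entries`, the dictionary layout, and the field names `captions`, `transcript`, and `metrics` are all assumptions made for the example.

```python
from typing import Any, Dict

# Hypothetical fields where a non-empty value counts as "better data".
ENRICHABLE_FIELDS = ("captions", "transcript")

Entries = Dict[str, Dict[str, Any]]


def merge_entries(existing: Entries, incoming: Entries) -> Entries:
    """Merge incoming items into existing ones, keyed by item ID."""
    # Copy each entry so the caller's data is not modified.
    merged = {item_id: dict(item) for item_id, item in existing.items()}
    for item_id, new_item in incoming.items():
        if item_id not in merged:
            merged[item_id] = dict(new_item)  # brand-new content
            continue
        current = merged[item_id]
        for field in ENRICHABLE_FIELDS:
            # Assumption: fill in captions/transcripts only when missing.
            if new_item.get(field) and not current.get(field):
                current[field] = new_item[field]
        # Assumption: metrics change over time, so the latest values win.
        if "metrics" in new_item:
            current["metrics"] = {**current.get("metrics", {}), **new_item["metrics"]}
    return merged
```

With this shape, a daily run that finally obtains captions for an older video fills them in without touching the title or other fields already stored.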

File Naming Convention

<brandName>_<source>_<dateTime>.md
Example: hvacknowitall_YouTube_2025-08-19T143045.md
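
A file name in this convention could be built along the following lines. The helper name `build_filename` is hypothetical, and the timestamp format string is inferred from the example above; the real code may differ.

```python
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo

# The README states that timestamps use the Atlantic timezone.
ATLANTIC = ZoneInfo("America/Halifax")


def build_filename(brand: str, source: str, when: Optional[datetime] = None) -> str:
    """Return '<brandName>_<source>_<dateTime>.md' with an Atlantic timestamp."""
    if when is None:
        when = datetime.now(ATLANTIC)
    # Convert aware datetimes from other zones into Atlantic time.
    # Note: a naive datetime is interpreted as system-local time here.
    when = when.astimezone(ATLANTIC)
    return f"{brand}_{source}_{when.strftime('%Y-%m-%dT%H%M%S')}.md"
```

Omitting the colons from the time portion keeps the name valid on file systems and network shares that reject `:` in file names.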

Quick Start

Installation

# Install UV package manager
pip install uv

# Install dependencies
uv pip install -r requirements.txt

Configuration

Create .env file with credentials:

# YouTube
YOUTUBE_API_KEY=your_api_key

# MailChimp
MAILCHIMP_API_KEY=your_api_key
MAILCHIMP_SERVER_PREFIX=us10

# Instagram
INSTAGRAM_USERNAME=username
INSTAGRAM_PASSWORD=password

# WordPress
WORDPRESS_USERNAME=username
WORDPRESS_API_KEY=api_key
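
Before a run, it can help to confirm that every credential is present. The check below is a sketch only: it assumes the .env values have already been loaded into the process environment (the loading mechanism is not specified here), and `missing_credentials` is a hypothetical helper name.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "YOUTUBE_API_KEY",
    "MAILCHIMP_API_KEY",
    "MAILCHIMP_SERVER_PREFIX",
    "INSTAGRAM_USERNAME",
    "INSTAGRAM_PASSWORD",
    "WORDPRESS_USERNAME",
    "WORDPRESS_API_KEY",
)


def missing_credentials(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit(f"Missing credentials: {', '.join(missing)}")
    print("All credentials present")
```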

Running

# Run all scrapers (parallel)
uv run python run_all_scrapers.py

# Run single source
uv run python -m src.youtube_api_scraper_v2

# Test cumulative mode
uv run python test_cumulative_mode.py

# Consolidate existing files
uv run python consolidate_current_files.py

Architecture

Core Components

  • BaseScraper: Abstract base class for all scrapers
  • BaseScraperCumulative: Enhanced base with cumulative support
  • CumulativeMarkdownManager: Handles intelligent file merging
  • ContentOrchestrator: Manages parallel scraper execution
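
As a rough picture of how parallel scraper execution might be organized, the sketch below runs one callable per source in a thread pool and records failures without stopping the other sources. It is not the ContentOrchestrator implementation; `run_scrapers` and the callable-per-source interface are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict


def run_scrapers(scrapers: Dict[str, Callable[[], int]],
                 max_workers: int = 4) -> Dict[str, object]:
    """Run each scraper callable in parallel; map source name to result or error."""
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run): name for name, run in scrapers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing source should not abort the others.
                results[name] = exc
    return results
```

Threads suit this workload because scraping is dominated by network waits rather than CPU work.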

Data Flow

1. Scraper fetches content (checks state for incremental)
2. CumulativeMarkdownManager loads existing file
3. Merges new content (adds new, updates existing)
4. Archives previous version
5. Saves updated file with current timestamp
6. Updates state for next run
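
Steps 4 and 5 (archive the previous version, then save the updated file) could look something like this. The function `archive_and_save` and its parameters are hypothetical; the sketch assumes one current file per source and a per-source archive subfolder.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional
from zoneinfo import ZoneInfo


def archive_and_save(current_dir: Path, archive_dir: Path, brand: str,
                     source: str, content: str,
                     now: Optional[datetime] = None) -> Path:
    """Move the previous file for a source to the archive, then write the new one."""
    now = now or datetime.now(ZoneInfo("America/Halifax"))
    current_dir.mkdir(parents=True, exist_ok=True)
    source_archive = archive_dir / source
    source_archive.mkdir(parents=True, exist_ok=True)

    # Archive any earlier file for this source (its name keeps the old timestamp).
    for old in current_dir.glob(f"{brand}_{source}_*.md"):
        shutil.move(str(old), str(source_archive / old.name))

    # Save the merged content under a fresh timestamp.
    new_path = current_dir / f"{brand}_{source}_{now.strftime('%Y-%m-%dT%H%M%S')}.md"
    new_path.write_text(content, encoding="utf-8")
    return new_path
```

Because the archived copy keeps its original timestamped name, the archive folder doubles as a history of when each version was produced.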

Directory Structure

data/
├── markdown_current/       # Current single-source-of-truth files
├── markdown_archives/      # Historical versions by source
│   ├── YouTube/
│   ├── Instagram/
│   └── ...
├── media/                  # Downloaded media files
└── .state/                # State files for incremental updates

logs/                      # Log files by source
src/                       # Source code
tests/                     # Test files

API Quota Management

YouTube Data API v3

  • Daily Limit: 10,000 units
  • Usage Strategy: Reserve 95% of the daily quota for captions
  • Costs:
    • videos.list: 1 unit
    • captions.list: 50 units
    • channels.list: 1 unit
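
To make the budget concrete, the arithmetic can be worked through as follows. The figures come from the list above; the function name and the assumption of one captions.list call per video are illustrative only.

```python
DAILY_QUOTA_UNITS = 10_000   # daily limit
CAPTION_SHARE_PERCENT = 95   # share of the quota reserved for captions
CAPTIONS_LIST_COST = 50      # units per captions.list call


def max_caption_calls(quota: int = DAILY_QUOTA_UNITS,
                      share_percent: int = CAPTION_SHARE_PERCENT,
                      cost: int = CAPTIONS_LIST_COST) -> int:
    """Return how many captions.list calls fit in the reserved budget."""
    budget = quota * share_percent // 100  # 9,500 units with the defaults
    return budget // cost


print(max_caption_calls())  # 190
```

Under these assumptions roughly 190 videos can be checked for captions per day, which leaves 500 units for videos.list and channels.list calls.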

Rate Limiting

  • Instagram: 200 posts/hour
  • YouTube: Respects API quotas
  • General: Exponential backoff with retry
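
Exponential backoff with retry generally follows the pattern below. This is a generic sketch rather than the project's code: the function name, default delays, and jitter amount are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(func: Callable[[], T], max_attempts: int = 5,
                       base_delay: float = 1.0, max_delay: float = 60.0,
                       sleep: Callable[[float], object] = time.sleep) -> T:
    """Call func, doubling the wait after each failure, up to max_attempts tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% random jitter so parallel scrapers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so that tests can substitute a recorder for `time.sleep` and run instantly.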

Production Deployment

Systemd Services

Services are configured in /etc/systemd/system/:

  • hvac-content-8am.service - Morning run
  • hvac-content-12pm.service - Noon run
  • hvac-content-8am.timer - Morning schedule
  • hvac-content-12pm.timer - Noon schedule

Manual Deployment

# Start services
sudo systemctl start hvac-content-8am.timer
sudo systemctl start hvac-content-12pm.timer

# Enable on boot
sudo systemctl enable hvac-content-8am.timer
sudo systemctl enable hvac-content-12pm.timer

# Check status
sudo systemctl status 'hvac-content-*.timer'

Monitoring

# View logs
journalctl -u hvac-content-8am -f

# Check file growth
ls -lh data/markdown_current/

# View statistics
uv run python -c "from src.cumulative_markdown_manager import CumulativeMarkdownManager; ..."

Testing

# Run all tests
uv run pytest

# Test specific scraper
uv run pytest tests/test_youtube_scraper.py

# Test cumulative mode
uv run python test_cumulative_mode.py

Troubleshooting

Common Issues

  1. Instagram Rate Limiting: Scraper implements humanized delays (18-22 seconds between requests)
  2. YouTube Quota Exceeded: Wait until the next day; the quota resets at midnight Pacific Time
  3. NAS Permission Errors: These warnings are normal; files still sync successfully
  4. Missing Captions: Use YouTube Data API instead of youtube-transcript-api
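
For the quota case, the wait until the next reset can be computed as in the sketch below. It assumes that "midnight Pacific" corresponds to the `America/Los_Angeles` zone, and `next_quota_reset` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def next_quota_reset(now: datetime) -> datetime:
    """Return the next midnight Pacific after the given timezone-aware moment."""
    local = now.astimezone(PACIFIC)
    next_day = local.date() + timedelta(days=1)
    return datetime.combine(next_day, datetime.min.time(), tzinfo=PACIFIC)


if __name__ == "__main__":
    now = datetime.now(ZoneInfo("America/Halifax"))
    reset = next_quota_reset(now)
    wait_hours = (reset - now).total_seconds() / 3600
    print(f"Quota resets at {reset.astimezone(now.tzinfo):%Y-%m-%d %H:%M} Atlantic "
          f"({wait_hours:.1f} hours from now)")
```

During daylight saving time the reset falls at 04:00 Atlantic, before the 8am scheduled run.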

Debug Commands

# Check scraper state
cat data/.state/*_state.json

# View recent logs
tail -f logs/YouTube/youtube_*.log

# Test single source
uv run python -m src.youtube_api_scraper_v2 --test

License

Private repository - All rights reserved