# Cumulative Markdown System Documentation

## Overview

The cumulative markdown system maintains a single, continuously growing markdown file per content source that intelligently combines backlog data with incremental daily updates.

## Problem It Solves

Previously, each scraper run created entirely new files:

- Backlog runs created large initial files
- Incremental updates created small separate files
- No merging of content between files
- Multiple files per source made it hard to find the "current" state

## Solution Architecture

### CumulativeMarkdownManager

Core class that handles the following steps (a minimal sketch follows the list):

1. Loading existing markdown files
2. Parsing content into sections by unique ID
3. Merging new content with existing sections
4. Updating sections when better data is available
5. Archiving previous versions for history
6. Saving updated single-source-of-truth file
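
A minimal sketch of how such a class could be structured, assuming sections are delimited by a 50-dash rule and keyed by a `# ID:` line (see Section Structure below); the method names and the crude merge heuristic here are illustrative, not the actual implementation:

```python
from pathlib import Path


class CumulativeMarkdownManager:
    """Illustrative skeleton; the real class is richer."""

    SEPARATOR = "-" * 50  # horizontal rule between sections

    def parse_sections(self, text: str) -> dict[str, str]:
        """Steps 1-2: split markdown into sections keyed by their '# ID:' line."""
        sections = {}
        for block in text.split(self.SEPARATOR):
            block = block.strip()
            if block.startswith("# ID:"):
                section_id = block.splitlines()[0].removeprefix("# ID:").strip()
                sections[section_id] = block
        return sections

    def update(self, current_file: Path, new_content: str) -> dict[str, str]:
        """Steps 3-4: merge new sections into the existing file's sections.
        Archiving and saving (steps 5-6) are sketched in later sections."""
        merged = self.parse_sections(current_file.read_text(encoding="utf-8"))
        for section_id, section in self.parse_sections(new_content).items():
            # Naive "better data" check; see Merge Logic for the real rules
            if section_id not in merged or len(section) > len(merged[section_id]):
                merged[section_id] = section
        return merged
```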

### Merge Logic

The system uses intelligent merging based on content quality:

```python
def should_update_section(old_section: dict, new_section: dict) -> bool:
    # Section fields ('captions', 'description', 'views') are illustrative
    # Update if new has captions/transcripts that old doesn't
    if new_section.get("captions") and not old_section.get("captions"):
        return True

    # Update if new has significantly (20%+) more description content
    new_len = len(new_section.get("description", ""))
    old_len = len(old_section.get("description", ""))
    if new_len > old_len * 1.2:
        return True

    # Update if metrics have increased
    if new_section.get("views", 0) > old_section.get("views", 0):
        return True

    return False
```

## Usage Patterns

### Initial Backlog Capture

```python
# Day 1 - First run captures all historical content
scraper.fetch_content(max_posts=1000)
# Creates: hvacnkowitall_YouTube_20250819T080000.md (444 videos)
```

### Daily Incremental Update

```python
# Day 2 - Fetch only new content since last run
scraper.fetch_content()  # Uses state to get only new items
# Loads existing file, merges new content
# Updates: hvacnkowitall_YouTube_20250819T120000.md (449 videos)
```

### Caption/Transcript Enhancement

```python
# Day 3 - Fetch captions for existing videos
youtube_scraper.fetch_captions(video_ids)
# Loads existing file, updates videos with caption data
# Updates: hvacnkowitall_YouTube_20250819T143045.md (449 videos, 200 with captions)
```

## File Management

### Naming Convention

```
hvacnkowitall_<Source>_<YYYYMMDDTHHMMSS>.md
```

- Brand name is always lowercase
- Source name is TitleCase
- Timestamp in Atlantic timezone
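
As an illustration, a filename could be assembled as below; `build_filename` is a hypothetical helper, and `America/Halifax` is assumed here as the Atlantic timezone identifier:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib, Python 3.9+

def build_filename(source: str) -> str:
    """source is already TitleCase, e.g. 'YouTube'."""
    now = datetime.now(ZoneInfo("America/Halifax"))  # assumed Atlantic tz
    return f"hvacnkowitall_{source}_{now.strftime('%Y%m%dT%H%M%S')}.md"

build_filename("YouTube")  # e.g. 'hvacnkowitall_YouTube_20250819T143045.md'
```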

### Archive Strategy

```
Current:
  hvacnkowitall_YouTube_20250819T143045.md (latest)

Archives:
  YouTube/
    hvacnkowitall_YouTube_20250819T080000_archived_20250819_120000.md
    hvacnkowitall_YouTube_20250819T120000_archived_20250819_143045.md
```
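
A sketch of the archival step, assuming the per-source folder and `_archived_<timestamp>` suffix shown above (`archive_previous` is a hypothetical name):

```python
from datetime import datetime
from pathlib import Path

def archive_previous(current: Path, archive_root: Path, source: str) -> Path:
    """Move the superseded file into the per-source archive folder,
    appending an _archived_<timestamp> suffix per the layout above."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = archive_root / source / f"{current.stem}_archived_{stamp}.md"
    dest.parent.mkdir(parents=True, exist_ok=True)
    current.rename(dest)
    return dest
```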

## Implementation Details

### Section Structure

Each content item is a section with a unique ID:

```markdown
# ID: video_abc123

## Title: Video Title

## Views: 1,234

## Description:
Full description text...

## Caption Status:
Caption text if available...

## Publish Date: 2024-01-15

--------------------------------------------------
```

### Merge Process

1. Parse both existing and new content into sections
2. Index by unique ID (video ID, post ID, etc.)
3. Compare sections with same ID
4. Update if new version is better
5. Add new sections not in existing file
6. Sort by date (newest first) or maintain order
7. Save combined content with new timestamp
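
Put together, the merge step might look like the following sketch, reusing `should_update_section` from Merge Logic above and assuming each parsed section is a dict with a `publish_date` field (field names are illustrative):

```python
def merge_sections(old: dict[str, dict], new: dict[str, dict]) -> list[dict]:
    """Compare by ID, keep the better version, add new items, sort newest first."""
    merged = dict(old)
    for section_id, section in new.items():
        if section_id not in merged or should_update_section(merged[section_id], section):
            merged[section_id] = section
    # ISO-formatted dates sort correctly as strings
    return sorted(merged.values(), key=lambda s: s["publish_date"], reverse=True)
```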

### State Management

State files track last processed item for incremental updates:

```json
{
  "last_video_id": "abc123",
  "last_video_date": "2024-01-20",
  "last_sync": "2024-01-20T12:00:00",
  "total_processed": 449
}
```
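
A minimal sketch of reading and writing such a state file (helper names are illustrative):

```python
import json
from pathlib import Path

def load_state(path: Path) -> dict:
    """Read the per-source state file, or start fresh on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"last_video_id": None, "last_sync": None, "total_processed": 0}

def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
```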

## Benefits

1. Single Source of Truth: One file per source with all content
2. Automatic Updates: Existing entries enhanced with new data
3. Efficient Storage: No duplicate content across files
4. Complete History: Archives preserve all versions
5. Incremental Growth: Files grow naturally over time
6. Smart Merging: Best version of each entry is preserved

## Migration from Separate Files

Use the consolidation script to migrate existing separate files:

```bash
# Consolidate all existing files into cumulative format
uv run python consolidate_current_files.py
```

This will:

1. Find all files for each source
2. Parse and merge by content ID
3. Create single cumulative file
4. Archive old separate files

## Testing

Test the cumulative workflow:

```bash
uv run python test_cumulative_mode.py
```

This demonstrates:

- Initial backlog capture (5 items)
- First incremental update (+2 items = 7 total)
- Second incremental with updates (1 updated, +1 new = 8 total)
- Proper archival of previous versions

## Future Enhancements

Potential improvements:

1. Conflict resolution strategies (user choice on updates)
2. Differential backups (only store changes)
3. Compression of archived versions
4. Metrics tracking across versions
5. Automatic cleanup of old archives
6. API endpoint to query cumulative statistics