# Cumulative Markdown System Documentation

## Overview

The cumulative markdown system maintains a single, continuously growing markdown file per content source that intelligently combines backlog data with incremental daily updates.

## Problem It Solves

Previously, each scraper run created entirely new files:

- Backlog runs created large initial files
- Incremental updates created small separate files
- No merging of content between files
- Multiple files per source made it hard to find the "current" state

## Solution Architecture

### CumulativeMarkdownManager

Core class that handles the following steps (a minimal sketch follows the list):

1. Loading existing markdown files
2. Parsing content into sections by unique ID
3. Merging new content with existing sections
4. Updating sections when better data is available
5. Archiving previous versions for history
6. Saving updated single-source-of-truth file
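
A minimal sketch of how such a class could be structured, assuming sections are delimited by a 50-dash rule and keyed by a `# ID:` line (see Section Structure below); the method names and the crude merge heuristic here are illustrative, not the actual implementation:

```python
from pathlib import Path


class CumulativeMarkdownManager:
    """Illustrative skeleton; the real class is richer."""

    SEPARATOR = "-" * 50  # horizontal rule between sections

    def parse_sections(self, text: str) -> dict[str, str]:
        """Steps 1-2: split markdown into sections keyed by their '# ID:' line."""
        sections = {}
        for block in text.split(self.SEPARATOR):
            block = block.strip()
            if block.startswith("# ID:"):
                section_id = block.splitlines()[0].removeprefix("# ID:").strip()
                sections[section_id] = block
        return sections

    def update(self, current_file: Path, new_content: str) -> dict[str, str]:
        """Steps 3-4: merge new sections into the existing file's sections.
        Archiving and saving (steps 5-6) are sketched in later sections."""
        merged = self.parse_sections(current_file.read_text(encoding="utf-8"))
        for section_id, section in self.parse_sections(new_content).items():
            # Naive "better data" check; see Merge Logic for the real rules
            if section_id not in merged or len(section) > len(merged[section_id]):
                merged[section_id] = section
        return merged
```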

### Merge Logic

The system uses intelligent merging based on content quality:

```python
def should_update_section(old_section: dict, new_section: dict) -> bool:
    # Section fields ('captions', 'description', 'views') are illustrative
    # Update if new has captions/transcripts that old doesn't
    if new_section.get("captions") and not old_section.get("captions"):
        return True

    # Update if new has significantly (20%+) more description content
    new_len = len(new_section.get("description", ""))
    old_len = len(old_section.get("description", ""))
    if new_len > old_len * 1.2:
        return True

    # Update if metrics have increased
    if new_section.get("views", 0) > old_section.get("views", 0):
        return True

    return False
```

## Usage Patterns

### Initial Backlog Capture

```python
# Day 1 - First run captures all historical content
scraper.fetch_content(max_posts=1000)
# Creates: hvacnkowitall_YouTube_20250819T080000.md (444 videos)
```

### Daily Incremental Update

```python
# Day 2 - Fetch only new content since last run
scraper.fetch_content()  # Uses state to get only new items
# Loads existing file, merges new content
# Updates: hvacnkowitall_YouTube_20250819T120000.md (449 videos)
```

### Caption/Transcript Enhancement

```python
# Day 3 - Fetch captions for existing videos
youtube_scraper.fetch_captions(video_ids)
# Loads existing file, updates videos with caption data
# Updates: hvacnkowitall_YouTube_20250819T143045.md (449 videos, 200 with captions)
```

## File Management

### Naming Convention

```
hvacnkowitall_<Source>_<YYYYMMDDTHHMMSS>.md
```

- Brand name is always lowercase
- Source name is TitleCase
- Timestamp in Atlantic timezone
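
As an illustration, a filename could be assembled as below; `build_filename` is a hypothetical helper, and `America/Halifax` is assumed here as the Atlantic timezone identifier:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib, Python 3.9+

def build_filename(source: str) -> str:
    """source is already TitleCase, e.g. 'YouTube'."""
    now = datetime.now(ZoneInfo("America/Halifax"))  # assumed Atlantic tz
    return f"hvacnkowitall_{source}_{now.strftime('%Y%m%dT%H%M%S')}.md"

build_filename("YouTube")  # e.g. 'hvacnkowitall_YouTube_20250819T143045.md'
```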

### Archive Strategy

```
Current:
  hvacnkowitall_YouTube_20250819T143045.md (latest)

Archives:
  YouTube/
    hvacnkowitall_YouTube_20250819T080000_archived_20250819_120000.md
    hvacnkowitall_YouTube_20250819T120000_archived_20250819_143045.md
```
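
A sketch of the archival step, assuming the per-source folder and `_archived_<timestamp>` suffix shown above (`archive_previous` is a hypothetical name):

```python
from datetime import datetime
from pathlib import Path

def archive_previous(current: Path, archive_root: Path, source: str) -> Path:
    """Move the superseded file into the per-source archive folder,
    appending an _archived_<timestamp> suffix per the layout above."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = archive_root / source / f"{current.stem}_archived_{stamp}.md"
    dest.parent.mkdir(parents=True, exist_ok=True)
    current.rename(dest)
    return dest
```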

## Implementation Details

### Section Structure

Each content item is a section with a unique ID:

```markdown
# ID: video_abc123

## Title: Video Title

## Views: 1,234

## Description:
Full description text...

## Caption Status:
Caption text if available...

## Publish Date: 2024-01-15

--------------------------------------------------
```

### Merge Process

1. Parse both existing and new content into sections
2. Index by unique ID (video ID, post ID, etc.)
3. Compare sections with same ID
4. Update if new version is better
5. Add new sections not in existing file
6. Sort by date (newest first) or maintain order
7. Save combined content with new timestamp
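
Put together, the merge step might look like the following sketch, reusing `should_update_section` from Merge Logic above and assuming each parsed section is a dict with a `publish_date` field (field names are illustrative):

```python
def merge_sections(old: dict[str, dict], new: dict[str, dict]) -> list[dict]:
    """Compare by ID, keep the better version, add new items, sort newest first."""
    merged = dict(old)
    for section_id, section in new.items():
        if section_id not in merged or should_update_section(merged[section_id], section):
            merged[section_id] = section
    # ISO-formatted dates sort correctly as strings
    return sorted(merged.values(), key=lambda s: s["publish_date"], reverse=True)
```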

### State Management

State files track last processed item for incremental updates:

```json
{
  "last_video_id": "abc123",
  "last_video_date": "2024-01-20",
  "last_sync": "2024-01-20T12:00:00",
  "total_processed": 449
}
```
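
A minimal sketch of reading and writing such a state file (helper names are illustrative):

```python
import json
from pathlib import Path

def load_state(path: Path) -> dict:
    """Read the per-source state file, or start fresh on the first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"last_video_id": None, "last_sync": None, "total_processed": 0}

def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
```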

## Benefits

1. Single Source of Truth: One file per source with all content
2. Automatic Updates: Existing entries enhanced with new data
3. Efficient Storage: No duplicate content across files
4. Complete History: Archives preserve all versions
5. Incremental Growth: Files grow naturally over time
6. Smart Merging: Best version of each entry is preserved

## Migration from Separate Files

Use the consolidation script to migrate existing separate files:

```bash
# Consolidate all existing files into cumulative format
uv run python consolidate_current_files.py
```

This will:

1. Find all files for each source
2. Parse and merge by content ID
3. Create single cumulative file
4. Archive old separate files

## Testing

Test the cumulative workflow:

```bash
uv run python test_cumulative_mode.py
```

This demonstrates:

- Initial backlog capture (5 items)
- First incremental update (+2 items = 7 total)
- Second incremental with updates (1 updated, +1 new = 8 total)
- Proper archival of previous versions

## Future Enhancements

Potential improvements:

1. Conflict resolution strategies (user choice on updates)
2. Differential backups (only store changes)
3. Compression of archived versions
4. Metrics tracking across versions
5. Automatic cleanup of old archives
6. API endpoint to query cumulative statistics