# Cumulative Markdown System Documentation

## Overview

The cumulative markdown system maintains a single, continuously growing markdown file per content source that intelligently combines backlog data with incremental daily updates.

## Problem It Solves

Previously, each scraper run created entirely new files:

- Backlog runs created large initial files
- Incremental updates created small separate files
- No merging of content between files
- Multiple files per source made it hard to find the "current" state
## Solution Architecture

### CumulativeMarkdownManager

Core class that handles the following (a skeleton sketch appears after the list):

1. **Loading** existing markdown files
2. **Parsing** content into sections by unique ID
3. **Merging** new content with existing sections
4. **Updating** sections when better data is available
5. **Archiving** previous versions for history
6. **Saving** the updated single-source-of-truth file
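A minimal skeleton of how such a manager could be organized; the method names and signatures below are illustrative assumptions rather than the project's actual API:

```python
from pathlib import Path


class CumulativeMarkdownManager:
    """Sketch only: the real method names and signatures may differ."""

    def __init__(self, source: str, output_dir: Path, archive_dir: Path) -> None:
        self.source = source          # e.g. "YouTube"
        self.output_dir = output_dir  # where the current cumulative file lives
        self.archive_dir = archive_dir

    def load_existing(self) -> dict[str, str]:
        """Load the latest cumulative file and parse it into {id: section}."""
        raise NotImplementedError

    def merge(self, existing: dict[str, str], new: dict[str, str]) -> dict[str, str]:
        """Combine sections, keeping the better version of any duplicate ID."""
        raise NotImplementedError

    def archive_previous(self, current_file: Path) -> None:
        """Move the previous version into the per-source archive folder."""
        raise NotImplementedError

    def save(self, sections: dict[str, str]) -> Path:
        """Write the merged sections to a freshly timestamped markdown file."""
        raise NotImplementedError
```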
### Merge Logic

The system uses intelligent merging based on content quality. The decision rule looks roughly like this, with each parsed section shown as a plain dict (the field names are illustrative):
```python
def should_update_section(old_section: dict, new_section: dict) -> bool:
    """Return True when the new version of a section should replace the old one."""
    # Update if the new version has captions/transcripts that the old one lacks
    if new_section.get("captions") and not old_section.get("captions"):
        return True

    # Update if the new description is significantly (20%+) longer
    if len(new_section.get("description", "")) > len(old_section.get("description", "")) * 1.2:
        return True

    # Update if engagement metrics have increased
    if new_section.get("views", 0) > old_section.get("views", 0):
        return True

    return False
```
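For example, a video whose latest fetch gained a transcript would be flagged for replacement:

```python
old = {"captions": "", "description": "Same description", "views": 1200}
new = {"captions": "Full transcript ...", "description": "Same description", "views": 1234}

assert should_update_section(old, new)  # captions were added, so the new version wins
```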
## Usage Patterns

### Initial Backlog Capture

```python
# Day 1 - First run captures all historical content
scraper.fetch_content(max_posts=1000)
# Creates: hvacnkowitall_YouTube_20250819T080000.md (444 videos)
```

### Daily Incremental Update

```python
# Day 2 - Fetch only new content since last run
scraper.fetch_content()  # Uses state to get only new items
# Loads existing file, merges new content
# Updates: hvacnkowitall_YouTube_20250819T120000.md (449 videos)
```

### Caption/Transcript Enhancement

```python
# Day 3 - Fetch captions for existing videos
youtube_scraper.fetch_captions(video_ids)
# Loads existing file, updates videos with caption data
# Updates: hvacnkowitall_YouTube_20250819T143045.md (449 videos, 200 with captions)
```
## File Management

### Naming Convention

```
hvacnkowitall_<Source>_<YYYYMMDDTHHMMSS>.md
```

- Brand name is always lowercase
- Source name is TitleCase
- Timestamp is in the Atlantic timezone (see the sketch below)
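A small helper along these lines can produce conforming names; `America/Halifax` is used here as an assumption for "Atlantic timezone", and the function name is purely illustrative:

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def cumulative_filename(brand: str, source: str) -> str:
    """Build a name like hvacnkowitall_YouTube_20250819T143045.md."""
    # Assumes "Atlantic timezone" means America/Halifax; adjust if the project uses another zone
    stamp = datetime.now(ZoneInfo("America/Halifax")).strftime("%Y%m%dT%H%M%S")
    return f"{brand.lower()}_{source}_{stamp}.md"


print(cumulative_filename("hvacnkowitall", "YouTube"))
# hvacnkowitall_YouTube_20250819T143045.md (timestamp varies)
```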
### Archive Strategy

```
Current:
  hvacnkowitall_YouTube_20250819T143045.md (latest)

Archives:
  YouTube/
    hvacnkowitall_YouTube_20250819T080000_archived_20250819_120000.md
    hvacnkowitall_YouTube_20250819T120000_archived_20250819_143045.md
```
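Archiving the previous version can be as simple as moving it into the per-source folder with an `_archived_<timestamp>` suffix. This sketch assumes the same Atlantic-timezone convention; the helper name is hypothetical:

```python
import shutil
from datetime import datetime
from pathlib import Path
from zoneinfo import ZoneInfo


def archive_previous(current_file: Path, archive_root: Path, source: str) -> Path:
    """Move the outgoing cumulative file into the per-source archive folder."""
    stamp = datetime.now(ZoneInfo("America/Halifax")).strftime("%Y%m%d_%H%M%S")
    dest_dir = archive_root / source
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{current_file.stem}_archived_{stamp}{current_file.suffix}"
    shutil.move(str(current_file), str(dest))
    return dest
```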
## Implementation Details

### Section Structure

Each content item is a section with a unique ID:
```markdown
# ID: video_abc123

## Title: Video Title

## Views: 1,234

## Description:
Full description text...

## Caption Status:
Caption text if available...

## Publish Date: 2024-01-15

--------------------------------------------------
```
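Given that layout, splitting a cumulative file back into sections keyed by ID is straightforward. The exact parsing in the real manager may differ; the idea is roughly:

```python
import re

SEPARATOR = "-" * 50  # the dashed line that ends each section


def parse_sections(markdown_text: str) -> dict[str, str]:
    """Split a cumulative file into {section_id: section_text}."""
    sections: dict[str, str] = {}
    for chunk in markdown_text.split(SEPARATOR):
        chunk = chunk.strip()
        if not chunk:
            continue
        match = re.search(r"^# ID:\s*(\S+)", chunk, flags=re.MULTILINE)
        if match:
            sections[match.group(1)] = chunk
    return sections
```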
### Merge Process

1. **Parse** both existing and new content into sections
2. **Index** by unique ID (video ID, post ID, etc.)
3. **Compare** sections that share an ID
4. **Update** if the new version is better
5. **Add** new sections not in the existing file
6. **Sort** by date (newest first) or maintain existing order
7. **Save** the combined content with a new timestamp

Put together, the loop looks roughly like the sketch below.
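This reuses `should_update_section` from the Merge Logic section; treating each parsed section as a dict of fields is an assumption about the internal representation, not the project's actual structure:

```python
def merge_sections(existing: dict[str, dict], new: dict[str, dict]) -> dict[str, dict]:
    """Combine two parsed files keyed by unique ID, keeping the better duplicates."""
    merged = dict(existing)
    for section_id, section in new.items():
        if section_id not in merged:
            merged[section_id] = section      # step 5: brand-new item, just add it
        elif should_update_section(merged[section_id], section):
            merged[section_id] = section      # step 4: the new version is better
    return merged


# Steps 6-7: order newest first, then write out under a fresh timestamp
# ordered = sorted(merged.values(), key=lambda s: s["publish_date"], reverse=True)
```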
### State Management

State files track the last processed item for incremental updates:

```json
{
  "last_video_id": "abc123",
  "last_video_date": "2024-01-20",
  "last_sync": "2024-01-20T12:00:00",
  "total_processed": 449
}
```
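Reading and updating such a state file needs only the standard library; the path below is illustrative, and the field names simply mirror the example above:

```python
import json
from datetime import datetime
from pathlib import Path

STATE_FILE = Path("state/youtube_state.json")  # illustrative location


def load_state() -> dict:
    """Return the saved state, or an empty dict on the very first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}


def save_state(last_id: str, last_date: str, total: int) -> None:
    """Persist the markers that the next incremental run will start from."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    state = {
        "last_video_id": last_id,
        "last_video_date": last_date,
        "last_sync": datetime.now().isoformat(timespec="seconds"),
        "total_processed": total,
    }
    STATE_FILE.write_text(json.dumps(state, indent=2))
```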
## Benefits

1. **Single Source of Truth**: One file per source with all content
2. **Automatic Updates**: Existing entries enhanced with new data
3. **Efficient Storage**: No duplicate content across files
4. **Complete History**: Archives preserve all versions
5. **Incremental Growth**: Files grow naturally over time
6. **Smart Merging**: Best version of each entry is preserved
## Migration from Separate Files

Use the consolidation script to migrate existing separate files:

```bash
# Consolidate all existing files into cumulative format
uv run python consolidate_current_files.py
```

This will:

1. Find all files for each source
2. Parse and merge by content ID
3. Create a single cumulative file
4. Archive the old separate files
## Testing

Test the cumulative workflow:

```bash
uv run python test_cumulative_mode.py
```

This demonstrates:

- Initial backlog capture (5 items)
- First incremental update (+2 items = 7 total)
- Second incremental update with changes (1 updated, +1 new = 8 total)
- Proper archival of previous versions
## Future Enhancements

Potential improvements:

1. Conflict resolution strategies (user choice on updates)
2. Differential backups (only store changes)
3. Compression of archived versions
4. Metrics tracking across versions
5. Automatic cleanup of old archives
6. API endpoint to query cumulative statistics