# Cumulative Markdown System Documentation
## Overview
The cumulative markdown system maintains a single, continuously growing markdown file per content source that intelligently combines backlog data with incremental daily updates.
## Problem It Solves
Previously, each scraper run created entirely new files:
- Backlog runs created large initial files
- Incremental updates created small separate files
- No merging of content between files
- Multiple files per source made it hard to find the "current" state
## Solution Architecture
### CumulativeMarkdownManager
Core class that handles:
1. **Loading** existing markdown files
2. **Parsing** content into sections by unique ID
3. **Merging** new content with existing sections
4. **Updating** sections when better data is available
5. **Archiving** previous versions for history
6. **Saving** updated single-source-of-truth file
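As a rough sketch of how these steps could compose (the class shape and helper names here are illustrative assumptions, not the project's actual API; `merge`, `archive_previous`, and `cumulative_filename` are sketched in later sections of this document):
```python
from pathlib import Path

class CumulativeMarkdownManager:
    """Illustrative skeleton of the load -> parse -> merge -> archive -> save cycle."""

    def __init__(self, data_dir: Path, archive_dir: Path):
        self.data_dir = data_dir        # where cumulative files live
        self.archive_dir = archive_dir  # where superseded versions go

    def update_source(self, source: str, new_markdown: str) -> Path:
        current = self._latest_file(source)
        existing = current.read_text() if current else ""        # 1. load
        merged = merge(existing, new_markdown)                    # 2-4. parse/merge/update
        if current:
            archive_previous(current, self.archive_dir, source)   # 5. archive
        out = self.data_dir / cumulative_filename("hvacnkowitall", source)
        out.write_text(merged)                                    # 6. save
        return out

    def _latest_file(self, source: str):
        files = sorted(self.data_dir.glob(f"*_{source}_*.md"))
        return files[-1] if files else None
```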
### Merge Logic
The system uses intelligent merging based on content quality:
```python
def should_update_section(old_section: dict, new_section: dict) -> bool:
    # Update if the new version has captions/transcripts that the old one lacks
    if new_section.get("captions") and not old_section.get("captions"):
        return True
    # Update if the new version has significantly (>20%) more description text
    new_desc = new_section.get("description", "")
    old_desc = old_section.get("description", "")
    if len(new_desc) > len(old_desc) * 1.2:
        return True
    # Update if engagement metrics have increased
    if new_section.get("views", 0) > old_section.get("views", 0):
        return True
    return False
```
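The effect is that merges are biased toward richer entries: a section is only replaced when the incoming version adds captions, noticeably more description text, or fresher view counts, so a sparse re-fetch never overwrites a good entry.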
## Usage Patterns
### Initial Backlog Capture
```python
# Day 1 - First run captures all historical content
scraper.fetch_content(max_posts=1000)
# Creates: hvacnkowitall_YouTube_20250819T080000.md (444 videos)
```
### Daily Incremental Update
```python
# Day 2 - Fetch only new content since last run
scraper.fetch_content() # Uses state to get only new items
# Loads existing file, merges new content
# Updates: hvacnkowitall_YouTube_20250819T120000.md (449 videos)
```
### Caption/Transcript Enhancement
```python
# Day 3 - Fetch captions for existing videos
youtube_scraper.fetch_captions(video_ids)
# Loads existing file, updates videos with caption data
# Updates: hvacnkowitall_YouTube_20250819T160000.md (449 videos, 200 with captions)
```
## File Management
### Naming Convention
```
hvacnkowitall_<Source>_<YYYYMMDDTHHMMSS>.md
```
- Brand name is always lowercase
- Source name is TitleCase
- Timestamp in Atlantic timezone
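For illustration, such a name could be generated like this (using `America/Halifax` as an assumed zone name for Atlantic time):
```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib, Python 3.9+

def cumulative_filename(brand: str, source: str) -> str:
    # Timestamp in Atlantic time, formatted YYYYMMDDTHHMMSS per the convention above
    ts = datetime.now(ZoneInfo("America/Halifax")).strftime("%Y%m%dT%H%M%S")
    return f"{brand.lower()}_{source}_{ts}.md"

print(cumulative_filename("hvacnkowitall", "YouTube"))
# e.g. hvacnkowitall_YouTube_20250819T143045.md
```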
### Archive Strategy
```
Current:
  hvacnkowitall_YouTube_20250819T143045.md  (latest)

Archives:
  YouTube/
    hvacnkowitall_YouTube_20250819T080000_archived_20250819_120000.md
    hvacnkowitall_YouTube_20250819T120000_archived_20250819_143045.md
```
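One way to implement this rotation (the paths and the `_archived_` suffix format are assumptions inferred from the layout above):
```python
import shutil
from datetime import datetime
from pathlib import Path

def archive_previous(current: Path, archive_root: Path, source: str) -> Path:
    """Move a superseded cumulative file into the per-source archive folder."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = archive_root / source / f"{current.stem}_archived_{stamp}{current.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(current), str(dest))
    return dest
```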
## Implementation Details
### Section Structure
Each content item is stored as a section with a unique ID:
```markdown
# ID: video_abc123
## Title: Video Title
## Views: 1,234
## Description:
Full description text...
## Caption Status:
Caption text if available...
## Publish Date: 2024-01-15
--------------------------------------------------
```
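A minimal sketch of parsing this structure back into sections keyed by ID, assuming the 50-dash separator and `# ID:` header shown above:
```python
import re

SECTION_SEP = "-" * 50

def parse_sections(markdown: str) -> dict[str, str]:
    """Split a cumulative file into {unique_id: section_text}."""
    sections: dict[str, str] = {}
    for block in markdown.split(SECTION_SEP):
        match = re.search(r"^# ID:\s*(\S+)", block, flags=re.MULTILINE)
        if match:
            sections[match.group(1)] = block.strip()
    return sections
```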
### Merge Process
1. **Parse** both existing and new content into sections
2. **Index** by unique ID (video ID, post ID, etc.)
3. **Compare** sections with same ID
4. **Update** if new version is better
5. **Add** new sections not in existing file
6. **Sort** by date (newest first) or maintain order
7. **Save** combined content with new timestamp
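Putting the pieces together, the merge could look roughly like this. It reuses `parse_sections` and `should_update_section` from earlier; `section_fields` is a hypothetical helper, shown here only to bridge between raw section text and the fields being compared:
```python
import re

def section_fields(text: str) -> dict:
    """Hypothetical helper: extract the fields should_update_section compares."""
    views = re.search(r"^## Views:\s*([\d,]+)", text, re.MULTILINE)
    desc = re.search(r"^## Description:\n(.*?)(?=\n## |\Z)", text,
                     re.MULTILINE | re.DOTALL)
    return {
        "views": int(views.group(1).replace(",", "")) if views else 0,
        "description": desc.group(1).strip() if desc else "",
        "captions": "## Caption Status:" in text,
    }

def merge(existing_md: str, new_md: str) -> str:
    """Keep the better version of each section; append sections that are new."""
    old = parse_sections(existing_md)
    for section_id, text in parse_sections(new_md).items():
        if section_id not in old or should_update_section(
                section_fields(old[section_id]), section_fields(text)):
            old[section_id] = text
    return ("\n" + SECTION_SEP + "\n").join(old.values())
```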
### State Management
State files track the last processed item for incremental updates:
```json
{
  "last_video_id": "abc123",
  "last_video_date": "2024-01-20",
  "last_sync": "2024-01-20T12:00:00",
  "total_processed": 449
}
```
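Reading and writing that state is straightforward; the `state/youtube.json` path below is illustrative:
```python
import json
from datetime import datetime
from pathlib import Path

def load_state(path: Path) -> dict:
    # An empty dict on first run means "fetch the full backlog"
    return json.loads(path.read_text()) if path.exists() else {}

def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2))

state_path = Path("state/youtube.json")
state = load_state(state_path)
# ... fetch only items newer than state.get("last_video_date") ...
state["last_sync"] = datetime.now().isoformat()
save_state(state_path, state)
```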
## Benefits
1. **Single Source of Truth**: One file per source with all content
2. **Automatic Updates**: Existing entries enhanced with new data
3. **Efficient Storage**: No duplicate content across files
4. **Complete History**: Archives preserve all versions
5. **Incremental Growth**: Files grow naturally over time
6. **Smart Merging**: Best version of each entry is preserved
## Migration from Separate Files
Use the consolidation script to migrate existing separate files:
```bash
# Consolidate all existing files into cumulative format
uv run python consolidate_current_files.py
```
This will:
1. Find all files for each source
2. Parse and merge by content ID
3. Create single cumulative file
4. Archive old separate files
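In outline, the consolidation could work like the sketch below (the `data/` layout is an assumption; `merge`, `cumulative_filename`, and `archive_previous` are the sketches from earlier sections):
```python
from collections import defaultdict
from pathlib import Path

# Group every existing per-run file by its source name
by_source: dict[str, list[Path]] = defaultdict(list)
for f in sorted(Path("data").glob("hvacnkowitall_*_*.md")):
    by_source[f.stem.split("_")[1]].append(f)

# Merge each source's files oldest-to-newest, write one cumulative file,
# then archive the now-redundant originals
for source, files in by_source.items():
    combined = ""
    for f in files:
        combined = merge(combined, f.read_text())
    (Path("data") / cumulative_filename("hvacnkowitall", source)).write_text(combined)
    for f in files:
        archive_previous(f, Path("data/archives"), source)
```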
## Testing
Test the cumulative workflow:
```bash
uv run python test_cumulative_mode.py
```
This demonstrates:
- Initial backlog capture (5 items)
- First incremental update (+2 items = 7 total)
- Second incremental with updates (1 updated, +1 new = 8 total)
- Proper archival of previous versions
## Future Enhancements
Potential improvements:
1. Conflict resolution strategies (user choice on updates)
2. Differential backups (only store changes)
3. Compression of archived versions
4. Metrics tracking across versions
5. Automatic cleanup of old archives
6. API endpoint to query cumulative statistics