Cumulative Markdown System Documentation
Overview
The cumulative markdown system maintains a single, continuously growing markdown file per content source that intelligently combines backlog data with incremental daily updates.
Problem It Solves
Previously, each scraper run created entirely new files:
- Backlog runs created large initial files
- Incremental updates created small separate files
- No merging of content between files
- Multiple files per source made it hard to find the "current" state
Solution Architecture
CumulativeMarkdownManager
Core class that handles:
- Loading existing markdown files
- Parsing content into sections by unique ID
- Merging new content with existing sections
- Updating sections when better data is available
- Archiving previous versions for history
- Saving updated single-source-of-truth file
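A minimal interface sketch of the manager follows; the method names and signatures here are illustrative, not necessarily the actual API:

```python
from pathlib import Path

class CumulativeMarkdownManager:
    """Maintains one continuously growing markdown file per content source."""

    def __init__(self, source_dir: Path, archive_dir: Path):
        self.source_dir = source_dir
        self.archive_dir = archive_dir

    def load_existing(self, source: str) -> str:
        """Read the latest cumulative file for a source, or return '' on the first run."""
        ...

    def parse_sections(self, markdown_text: str) -> dict[str, str]:
        """Split file content into sections keyed by unique ID."""
        ...

    def merge(self, existing: dict[str, str], incoming: dict[str, str]) -> dict[str, str]:
        """Combine sections, keeping the better version of each ID."""
        ...

    def save(self, source: str, sections: dict[str, str]) -> Path:
        """Archive the previous file, then write the merged single source of truth."""
        ...
```

Two of these steps, parsing and merging, are sketched in more detail under Implementation Details below.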
Merge Logic
The system uses intelligent merging based on content quality. The decision looks roughly like the following sketch (the section fields are illustrative, not the exact implementation):
```python
def should_update_section(old_section: dict, new_section: dict) -> bool:
    """Return True when the new version of a section should replace the old one."""
    # Update if the new version has captions/transcripts that the old one lacks
    if new_section.get("captions") and not old_section.get("captions"):
        return True
    # Update if the new description is significantly (20%+) longer
    if len(new_section.get("description", "")) > len(old_section.get("description", "")) * 1.2:
        return True
    # Update if engagement metrics have increased
    if new_section.get("views", 0) > old_section.get("views", 0):
        return True
    return False
```
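The 20% length threshold is a deliberate design choice: it keeps minor description edits from churning the file on every run while still capturing substantive rewrites.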
Usage Patterns
Initial Backlog Capture
```python
# Day 1 - First run captures all historical content
scraper.fetch_content(max_posts=1000)
# Creates: hvacnkowitall_YouTube_20250819T080000.md (444 videos)
```
Daily Incremental Update
```python
# Day 2 - Fetch only new content since last run
scraper.fetch_content()  # Uses state to get only new items
# Loads existing file, merges new content
# Updates: hvacnkowitall_YouTube_20250819T120000.md (449 videos)
```
Caption/Transcript Enhancement
```python
# Day 3 - Fetch captions for existing videos
youtube_scraper.fetch_captions(video_ids)
# Loads existing file, updates videos with caption data
# Updates: hvacnkowitall_YouTube_20250819T160000.md (449 videos, 200 with captions)
```
File Management
Naming Convention
`hvacnkowitall_<Source>_<YYYYMMDD>T<HHMMSS>.md`
- Brand name is always lowercase
- Source name is TitleCase
- Timestamp in Atlantic timezone
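As a sketch, the convention could be produced with a small helper. The helper name is hypothetical, and `America/Halifax` is an assumed identifier for the Atlantic timezone:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def cumulative_filename(brand: str, source: str) -> str:
    """Build a filename like hvacnkowitall_YouTube_20250819T143045.md."""
    # "America/Halifax" is an assumed identifier for the Atlantic timezone
    stamp = datetime.now(ZoneInfo("America/Halifax")).strftime("%Y%m%dT%H%M%S")
    # source is expected to already be TitleCase, e.g. "YouTube"
    return f"{brand.lower()}_{source}_{stamp}.md"
```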
Archive Strategy
```
Current:
  hvacnkowitall_YouTube_20250819T143045.md   (latest)

Archives:
  YouTube/
    hvacnkowitall_YouTube_20250819T080000_archived_20250819_120000.md
    hvacnkowitall_YouTube_20250819T120000_archived_20250819_143045.md
```
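A hypothetical sketch of the archival step, assuming the superseded file is simply renamed into a per-source archive folder:

```python
from pathlib import Path

def archive_previous(current_file: Path, archive_dir: Path, archived_at: str) -> Path:
    """Move the superseded cumulative file into the per-source archive folder."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    # e.g. ..._20250819T080000.md -> ..._20250819T080000_archived_20250819_120000.md
    target = archive_dir / f"{current_file.stem}_archived_{archived_at}{current_file.suffix}"
    current_file.rename(target)
    return target
```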
Implementation Details
Section Structure
Each content item is stored as a section with a unique ID:
```markdown
# ID: video_abc123
## Title: Video Title
## Views: 1,234
## Description:
Full description text...
## Caption Status:
Caption text if available...
## Publish Date: 2024-01-15
--------------------------------------------------
```
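Parsing a file back into sections might look like this sketch, which assumes the dashed divider and `# ID:` header shown above:

```python
import re

def parse_sections(markdown_text: str) -> dict[str, str]:
    """Split a cumulative file into sections keyed by their unique ID."""
    sections: dict[str, str] = {}
    # Sections end with a long dashed divider line
    for block in re.split(r"\n-{10,}\s*\n?", markdown_text):
        match = re.search(r"^# ID:\s*(\S+)", block, flags=re.MULTILINE)
        if match:
            sections[match.group(1)] = block.strip()
    return sections
```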
Merge Process
- Parse both existing and new content into sections
- Index by unique ID (video ID, post ID, etc.)
- Compare sections with same ID
- Update if new version is better
- Add new sections not in existing file
- Sort by date (newest first) or maintain order
- Save combined content with new timestamp
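Put together, the core of the merge reduces to a short loop. This sketch reuses `should_update_section` from above and assumes sections have already been parsed into field dicts keyed by ID:

```python
def merge_sections(existing: dict[str, dict], incoming: dict[str, dict]) -> dict[str, dict]:
    """Combine existing and incoming sections, keeping the better version of each ID."""
    merged = dict(existing)
    for section_id, new_section in incoming.items():
        old_section = merged.get(section_id)
        # New IDs are added; known IDs are replaced only when the new data is better
        if old_section is None or should_update_section(old_section, new_section):
            merged[section_id] = new_section
    return merged
```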
State Management
State files track the last processed item so incremental updates know where to resume:
```json
{
  "last_video_id": "abc123",
  "last_video_date": "2024-01-20",
  "last_sync": "2024-01-20T12:00:00",
  "total_processed": 449
}
```
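Reading and writing that state is straightforward; a minimal sketch, assuming a hypothetical state-file path:

```python
import json
from pathlib import Path

STATE_FILE = Path("state/youtube_state.json")  # hypothetical path

def load_state() -> dict:
    """Return the last-run state, or an empty dict on the first run."""
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def save_state(state: dict) -> None:
    """Persist the state after a successful run."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2))
```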
Benefits
- Single Source of Truth: One file per source with all content
- Automatic Updates: Existing entries enhanced with new data
- Efficient Storage: No duplicate content across files
- Complete History: Archives preserve all versions
- Incremental Growth: Files grow naturally over time
- Smart Merging: Best version of each entry is preserved
Migration from Separate Files
Use the consolidation script to migrate existing separate files:
```bash
# Consolidate all existing files into cumulative format
uv run python consolidate_current_files.py
```
This will:
- Find all files for each source
- Parse and merge by content ID
- Create single cumulative file
- Archive old separate files
Testing
Test the cumulative workflow:
```bash
uv run python test_cumulative_mode.py
```
This demonstrates:
- Initial backlog capture (5 items)
- First incremental update (+2 items = 7 total)
- Second incremental with updates (1 updated, +1 new = 8 total)
- Proper archival of previous versions
Future Enhancements
Potential improvements:
- Conflict resolution strategies (user choice on updates)
- Differential backups (only store changes)
- Compression of archived versions
- Metrics tracking across versions
- Automatic cleanup of old archives
- API endpoint to query cumulative statistics