# Cumulative Markdown System Documentation

## Overview

The cumulative markdown system maintains a single, continuously growing markdown file per content source that intelligently combines backlog data with incremental daily updates.

## Problem It Solves

Previously, each scraper run created entirely new files:

- Backlog runs created large initial files
- Incremental updates created small separate files
- No merging of content between files
- Multiple files per source made it hard to find the "current" state
## Solution Architecture

### CumulativeMarkdownManager

Core class that handles the following (a skeleton sketch appears after the list):

1. **Loading** existing markdown files
2. **Parsing** content into sections by unique ID
3. **Merging** new content with existing sections
4. **Updating** sections when better data is available
5. **Archiving** previous versions for history
6. **Saving** the updated single-source-of-truth file
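A minimal skeleton of how such a manager could be organized; the method names and signatures below are illustrative assumptions rather than the project's actual API:

```python
from pathlib import Path


class CumulativeMarkdownManager:
    """Sketch only: the real method names and signatures may differ."""

    def __init__(self, source: str, output_dir: Path, archive_dir: Path) -> None:
        self.source = source          # e.g. "YouTube"
        self.output_dir = output_dir  # where the current cumulative file lives
        self.archive_dir = archive_dir

    def load_existing(self) -> dict[str, str]:
        """Load the latest cumulative file and parse it into {id: section}."""
        raise NotImplementedError

    def merge(self, existing: dict[str, str], new: dict[str, str]) -> dict[str, str]:
        """Combine sections, keeping the better version of any duplicate ID."""
        raise NotImplementedError

    def archive_previous(self, current_file: Path) -> None:
        """Move the previous version into the per-source archive folder."""
        raise NotImplementedError

    def save(self, sections: dict[str, str]) -> Path:
        """Write the merged sections to a freshly timestamped markdown file."""
        raise NotImplementedError
```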
### Merge Logic

The system uses intelligent merging based on content quality. The decision rule looks roughly like this, with each parsed section shown as a plain dict (the field names are illustrative):
```python
def should_update_section(old_section: dict, new_section: dict) -> bool:
    """Return True when the new version of a section should replace the old one."""
    # Update if the new version has captions/transcripts that the old one lacks
    if new_section.get("captions") and not old_section.get("captions"):
        return True

    # Update if the new description is significantly (20%+) longer
    if len(new_section.get("description", "")) > len(old_section.get("description", "")) * 1.2:
        return True

    # Update if engagement metrics have increased
    if new_section.get("views", 0) > old_section.get("views", 0):
        return True

    return False
```
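For example, a video whose latest fetch gained a transcript would be flagged for replacement:

```python
old = {"captions": "", "description": "Same description", "views": 1200}
new = {"captions": "Full transcript ...", "description": "Same description", "views": 1234}

assert should_update_section(old, new)  # captions were added, so the new version wins
```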
## Usage Patterns

### Initial Backlog Capture

```python
# Day 1 - First run captures all historical content
scraper.fetch_content(max_posts=1000)
# Creates: hvacnkowitall_YouTube_20250819T080000.md (444 videos)
```

### Daily Incremental Update

```python
# Day 2 - Fetch only new content since last run
scraper.fetch_content()  # Uses state to get only new items
# Loads existing file, merges new content
# Updates: hvacnkowitall_YouTube_20250819T120000.md (449 videos)
```

### Caption/Transcript Enhancement

```python
# Day 3 - Fetch captions for existing videos
youtube_scraper.fetch_captions(video_ids)
# Loads existing file, updates videos with caption data
# Updates: hvacnkowitall_YouTube_20250819T143045.md (449 videos, 200 with captions)
```
## File Management

### Naming Convention

```
hvacnkowitall_<Source>_<YYYYMMDDTHHMMSS>.md
```

- Brand name is always lowercase
- Source name is TitleCase
- Timestamp is in the Atlantic timezone (see the sketch below)
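A small helper along these lines can produce conforming names; `America/Halifax` is used here as an assumption for "Atlantic timezone", and the function name is purely illustrative:

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def cumulative_filename(brand: str, source: str) -> str:
    """Build a name like hvacnkowitall_YouTube_20250819T143045.md."""
    # Assumes "Atlantic timezone" means America/Halifax; adjust if the project uses another zone
    stamp = datetime.now(ZoneInfo("America/Halifax")).strftime("%Y%m%dT%H%M%S")
    return f"{brand.lower()}_{source}_{stamp}.md"


print(cumulative_filename("hvacnkowitall", "YouTube"))
# hvacnkowitall_YouTube_20250819T143045.md (timestamp varies)
```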
### Archive Strategy

```
Current:
  hvacnkowitall_YouTube_20250819T143045.md (latest)

Archives:
  YouTube/
    hvacnkowitall_YouTube_20250819T080000_archived_20250819_120000.md
    hvacnkowitall_YouTube_20250819T120000_archived_20250819_143045.md
```
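Archiving the previous version can be as simple as moving it into the per-source folder with an `_archived_<timestamp>` suffix. This sketch assumes the same Atlantic-timezone convention; the helper name is hypothetical:

```python
import shutil
from datetime import datetime
from pathlib import Path
from zoneinfo import ZoneInfo


def archive_previous(current_file: Path, archive_root: Path, source: str) -> Path:
    """Move the outgoing cumulative file into the per-source archive folder."""
    stamp = datetime.now(ZoneInfo("America/Halifax")).strftime("%Y%m%d_%H%M%S")
    dest_dir = archive_root / source
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{current_file.stem}_archived_{stamp}{current_file.suffix}"
    shutil.move(str(current_file), str(dest))
    return dest
```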
## Implementation Details

### Section Structure

Each content item is a section with a unique ID:
```markdown
# ID: video_abc123

## Title: Video Title

## Views: 1,234

## Description:
Full description text...

## Caption Status:
Caption text if available...

## Publish Date: 2024-01-15

--------------------------------------------------
```
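Given that layout, splitting a cumulative file back into sections keyed by ID is straightforward. The exact parsing in the real manager may differ; the idea is roughly:

```python
import re

SEPARATOR = "-" * 50  # the dashed line that ends each section


def parse_sections(markdown_text: str) -> dict[str, str]:
    """Split a cumulative file into {section_id: section_text}."""
    sections: dict[str, str] = {}
    for chunk in markdown_text.split(SEPARATOR):
        chunk = chunk.strip()
        if not chunk:
            continue
        match = re.search(r"^# ID:\s*(\S+)", chunk, flags=re.MULTILINE)
        if match:
            sections[match.group(1)] = chunk
    return sections
```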
### Merge Process

1. **Parse** both existing and new content into sections
2. **Index** by unique ID (video ID, post ID, etc.)
3. **Compare** sections that share an ID
4. **Update** if the new version is better
5. **Add** new sections not in the existing file
6. **Sort** by date (newest first) or maintain existing order
7. **Save** the combined content with a new timestamp

Put together, the loop looks roughly like the sketch below.
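This reuses `should_update_section` from the Merge Logic section; treating each parsed section as a dict of fields is an assumption about the internal representation, not the project's actual structure:

```python
def merge_sections(existing: dict[str, dict], new: dict[str, dict]) -> dict[str, dict]:
    """Combine two parsed files keyed by unique ID, keeping the better duplicates."""
    merged = dict(existing)
    for section_id, section in new.items():
        if section_id not in merged:
            merged[section_id] = section      # step 5: brand-new item, just add it
        elif should_update_section(merged[section_id], section):
            merged[section_id] = section      # step 4: the new version is better
    return merged


# Steps 6-7: order newest first, then write out under a fresh timestamp
# ordered = sorted(merged.values(), key=lambda s: s["publish_date"], reverse=True)
```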
### State Management

State files track the last processed item for incremental updates:

```json
{
  "last_video_id": "abc123",
  "last_video_date": "2024-01-20",
  "last_sync": "2024-01-20T12:00:00",
  "total_processed": 449
}
```
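Reading and updating such a state file needs only the standard library; the path below is illustrative, and the field names simply mirror the example above:

```python
import json
from datetime import datetime
from pathlib import Path

STATE_FILE = Path("state/youtube_state.json")  # illustrative location


def load_state() -> dict:
    """Return the saved state, or an empty dict on the very first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}


def save_state(last_id: str, last_date: str, total: int) -> None:
    """Persist the markers that the next incremental run will start from."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    state = {
        "last_video_id": last_id,
        "last_video_date": last_date,
        "last_sync": datetime.now().isoformat(timespec="seconds"),
        "total_processed": total,
    }
    STATE_FILE.write_text(json.dumps(state, indent=2))
```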
## Benefits

1. **Single Source of Truth**: One file per source with all content
2. **Automatic Updates**: Existing entries enhanced with new data
3. **Efficient Storage**: No duplicate content across files
4. **Complete History**: Archives preserve all versions
5. **Incremental Growth**: Files grow naturally over time
6. **Smart Merging**: Best version of each entry is preserved
## Migration from Separate Files

Use the consolidation script to migrate existing separate files:

```bash
# Consolidate all existing files into cumulative format
uv run python consolidate_current_files.py
```

This will:

1. Find all files for each source
2. Parse and merge by content ID
3. Create a single cumulative file
4. Archive the old separate files
## Testing

Test the cumulative workflow:

```bash
uv run python test_cumulative_mode.py
```

This demonstrates:

- Initial backlog capture (5 items)
- First incremental update (+2 items = 7 total)
- Second incremental update with changes (1 updated, +1 new = 8 total)
- Proper archival of previous versions
## Future Enhancements

Potential improvements:

1. Conflict resolution strategies (user choice on updates)
2. Differential backups (only store changes)
3. Compression of archived versions
4. Metrics tracking across versions
5. Automatic cleanup of old archives
6. API endpoint to query cumulative statistics