Implement cumulative markdown system and API integrations

Major improvements:
- Add CumulativeMarkdownManager for intelligent content merging
- Implement YouTube Data API v3 integration with caption support
- Add MailChimp API integration with content cleaning
- Create single source-of-truth files that grow with updates
- Smart merging: updates existing entries with better data
- Properly combines backlog + incremental daily updates

Features:
- 179/444 YouTube videos now have captions (40.3%)
- MailChimp content cleaned of headers/footers
- All sources consolidated to single files
- Archive management with timestamped versions
- Test suite and documentation included

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Ben Reed 2025-08-19 10:53:40 -03:00
parent 8a0b8b4d3f
commit 8ceb858026
11 changed files with 2696 additions and 9 deletions

README.md (new file, 212 lines added)

@@ -0,0 +1,212 @@
# HVAC Know It All Content Aggregation System
A containerized Python application that aggregates content from multiple HVAC Know It All sources, converts them to markdown format, and syncs to a NAS.
## Features
- **Multi-source content aggregation** from YouTube, Instagram, TikTok, MailChimp, WordPress, and Podcast RSS
- **Cumulative markdown management** - Single source-of-truth files that grow with backlog and incremental updates
- **API integrations** for YouTube Data API v3 and MailChimp API
- **Intelligent content merging** with caption/transcript updates and metric tracking
- **Automated NAS synchronization** to `/mnt/nas/hvacknowitall/`
- **State management** for incremental updates
- **Parallel processing** for multiple sources
- **Atlantic timezone** (America/Halifax) timestamps
## Cumulative Markdown System
### Overview
The system maintains a single markdown file per source that combines:
- Initial backlog content (historical data)
- Daily incremental updates (new content)
- Content updates (new captions, updated metrics)
### How It Works
1. **Initial Backlog**: First run creates base file with all historical content
2. **Daily Incremental**: Subsequent runs merge new content into existing file
3. **Smart Merging**: Updates existing entries when better data is available (captions, transcripts, metrics)
4. **Archival**: Previous versions archived with timestamps for history
### File Naming Convention
```
<brandName>_<source>_<dateTime>.md
Example: hvacnkowitall_YouTube_2025-08-19T143045.md
```
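The timestamp in the filename is produced with `strftime('%Y-%m-%dT%H%M%S')` in the Atlantic timezone (see `get_atlantic_timestamp()` later in this commit). A minimal sketch of building a conforming filename; the brand and source values here are just examples:

```python
from datetime import datetime
from pathlib import Path

import pytz


def build_filename(brand: str, source: str) -> str:
    """Build a <brandName>_<source>_<dateTime>.md name with an Atlantic-timezone timestamp."""
    timestamp = datetime.now(pytz.timezone("America/Halifax")).strftime("%Y-%m-%dT%H%M%S")
    return f"{brand}_{source}_{timestamp}.md"


# e.g. data/markdown_current/hvacnkowitall_YouTube_2025-08-19T143045.md
output_path = Path("data/markdown_current") / build_filename("hvacnkowitall", "YouTube")
```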
## Quick Start
### Installation
```bash
# Install UV package manager
pip install uv
# Install dependencies
uv pip install -r requirements.txt
```
### Configuration
Create `.env` file with credentials:
```env
# YouTube
YOUTUBE_API_KEY=your_api_key
# MailChimp
MAILCHIMP_API_KEY=your_api_key
MAILCHIMP_SERVER_PREFIX=us10
# Instagram
INSTAGRAM_USERNAME=username
INSTAGRAM_PASSWORD=password
# WordPress
WORDPRESS_USERNAME=username
WORDPRESS_API_KEY=api_key
```
### Running
```bash
# Run all scrapers (parallel)
uv run python run_all_scrapers.py
# Run single source
uv run python -m src.youtube_api_scraper_v2
# Test cumulative mode
uv run python test_cumulative_mode.py
# Consolidate existing files
uv run python consolidate_current_files.py
```
## Architecture
### Core Components
- **BaseScraper**: Abstract base class for all scrapers
- **BaseScraperCumulative**: Enhanced base with cumulative support
- **CumulativeMarkdownManager**: Handles intelligent file merging
- **ContentOrchestrator**: Manages parallel scraper execution (see the sketch below)
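The exact orchestration code is not shown in this README, but the pattern is a standard thread pool over scraper instances. A rough sketch under that assumption, where each scraper exposes the `run()` method and `config.source_name` used elsewhere in this project:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_scrapers_in_parallel(scrapers, max_workers: int = 4) -> dict:
    """Run each scraper's run() concurrently; collect the saved file path (or error) per source."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scraper.run): scraper.config.source_name for scraper in scrapers}
        for future in as_completed(futures):
            source = futures[future]
            try:
                results[source] = future.result()
            except Exception as exc:   # one failing source should not sink the whole run
                results[source] = exc
    return results
```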
### Data Flow
```
1. Scraper fetches content (checks state for incremental)
2. CumulativeMarkdownManager loads existing file
3. Merges new content (adds new, updates existing)
4. Archives previous version
5. Saves updated file with current timestamp
6. Updates state for next run
```
### Directory Structure
```
data/
├── markdown_current/ # Current single-source-of-truth files
├── markdown_archives/ # Historical versions by source
│ ├── YouTube/
│ ├── Instagram/
│ └── ...
├── media/ # Downloaded media files
└── .state/ # State files for incremental updates
logs/ # Log files by source
src/ # Source code
tests/ # Test files
```
## API Quota Management
### YouTube Data API v3
- **Daily Limit**: 10,000 units
- **Usage Strategy**: 95% of the daily quota reserved for captions (see the sketch below)
- **Costs**:
- videos.list: 1 unit
- captions.list: 50 units
- channels.list: 1 unit
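As a back-of-the-envelope check, the caption budget works out as in the sketch below (the constants come from the list above; the 2,519 units of `quota_used` matches the metadata pass logged elsewhere in this commit):

```python
DAILY_QUOTA = 10_000        # YouTube Data API v3 daily limit, in units
TARGET_FRACTION = 0.95      # stop at 95% to leave headroom
CAPTIONS_LIST_COST = 50     # captions.list cost per video


def captionable_videos(quota_used: int) -> int:
    """How many more caption lookups fit before hitting the 95% target."""
    target = int(DAILY_QUOTA * TARGET_FRACTION)     # 9,500 units
    return max(target - quota_used, 0) // CAPTIONS_LIST_COST


print(captionable_videos(2_519))  # 139 more videos can get captions today
```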
### Rate Limiting
- Instagram: 200 posts/hour
- YouTube: Respects API quotas
- General: Exponential backoff with retry (see the sketch below)
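A minimal sketch of the backoff pattern; the retry count and base delay are illustrative, not the exact values each scraper uses:

```python
import time


def fetch_with_backoff(fetch, max_retries: int = 5, base_delay: float = 1.0):
    """Call fetch(), retrying with exponentially growing delays on failure."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, ...
```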
## Production Deployment
### Systemd Services
Services are configured in `/etc/systemd/system/`:
- `hvac-content-8am.service` - Morning run
- `hvac-content-12pm.service` - Noon run
- `hvac-content-8am.timer` - Morning schedule
- `hvac-content-12pm.timer` - Noon schedule
### Manual Deployment
```bash
# Start services
sudo systemctl start hvac-content-8am.timer
sudo systemctl start hvac-content-12pm.timer
# Enable on boot
sudo systemctl enable hvac-content-8am.timer
sudo systemctl enable hvac-content-12pm.timer
# Check status
sudo systemctl status hvac-content-*.timer
```
## Monitoring
```bash
# View logs
journalctl -u hvac-content-8am -f
# Check file growth
ls -lh data/markdown_current/
# View statistics
uv run python -c "from src.cumulative_markdown_manager import CumulativeMarkdownManager; ..."
```
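The `...` in the statistics one-liner is shorthand. A fuller invocation might look like the sketch below; it assumes the `ScraperConfig` fields used by the scrapers in this repo and a `data/` base directory (the manager appends `markdown_current/` itself), so adjust paths to your layout:

```python
from pathlib import Path

from src.base_scraper import ScraperConfig
from src.cumulative_markdown_manager import CumulativeMarkdownManager

config = ScraperConfig(
    source_name="YouTube",
    brand_name="hvacnkowitall",
    data_dir=Path("data"),
    logs_dir=Path("logs/YouTube"),
    timezone="America/Halifax",
)
print(CumulativeMarkdownManager(config).get_statistics())
# e.g. {'total_sections': 444, 'with_captions': 179, ...}
```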
## Testing
```bash
# Run all tests
uv run pytest
# Test specific scraper
uv run pytest tests/test_youtube_scraper.py
# Test cumulative mode
uv run python test_cumulative_mode.py
```
## Troubleshooting
### Common Issues
1. **Instagram Rate Limiting**: Scraper implements humanized delays (18-22 seconds between requests)
2. **YouTube Quota Exceeded**: Wait until next day, quota resets at midnight Pacific
3. **NAS Permission Errors**: Warnings are normal, files still sync successfully
4. **Missing Captions**: Use YouTube Data API instead of youtube-transcript-api
### Debug Commands
```bash
# Check scraper state
cat data/.state/*_state.json
# View recent logs
tail -f logs/YouTube/youtube_*.log
# Test single source
uv run python -m src.youtube_api_scraper_v2 --test
```
## License
Private repository - All rights reserved

@@ -0,0 +1,226 @@
#!/usr/bin/env python3
"""
Consolidate multiple markdown files per source into single current files
Combines backlog data and incremental updates into one source of truth
Follows project specification naming: hvacnkowitall_<source>_<dateTime>.md
"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent))
from datetime import datetime
import pytz
import re
from typing import Dict, List, Set
import logging
# Set up logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('logs/consolidation.log'),
logging.StreamHandler()
]
)
logger = logging.getLogger('consolidator')
def get_atlantic_timestamp() -> str:
"""Get current timestamp in Atlantic timezone."""
tz = pytz.timezone('America/Halifax')
return datetime.now(tz).strftime('%Y-%m-%dT%H%M%S')
def parse_markdown_sections(content: str) -> List[Dict]:
"""Parse markdown content into sections by ID."""
sections = []
# Split by ID headers
parts = content.split('# ID: ')
for part in parts[1:]: # Skip first empty part
if not part.strip():
continue
lines = part.strip().split('\n')
section_id = lines[0].strip()
# Get the full section content
section_content = f"# ID: {section_id}\n" + '\n'.join(lines[1:])
sections.append({
'id': section_id,
'content': section_content
})
return sections
def consolidate_source_files(source_name: str) -> bool:
"""Consolidate all files for a specific source into one current file."""
logger.info(f"Consolidating {source_name} files...")
current_dir = Path('data/markdown_current')
archives_dir = Path('data/markdown_archives')
# Find all files for this source
pattern = f"hvacnkowitall_{source_name}_*.md"
current_files = list(current_dir.glob(pattern))
# Also check for files with different naming (like captions files)
alt_patterns = [
f"*{source_name}*.md",
f"hvacnkowitall_{source_name.lower()}_*.md"
]
for alt_pattern in alt_patterns:
current_files.extend(current_dir.glob(alt_pattern))
# Remove duplicates
current_files = list(set(current_files))
if not current_files:
logger.warning(f"No files found for source: {source_name}")
return False
logger.info(f"Found {len(current_files)} files for {source_name}: {[f.name for f in current_files]}")
# Track unique sections by ID
sections_by_id: Dict[str, Dict] = {}
all_sections = []
# Process each file
for file_path in current_files:
logger.info(f"Processing {file_path.name}...")
try:
content = file_path.read_text(encoding='utf-8')
sections = parse_markdown_sections(content)
logger.info(f" Found {len(sections)} sections")
# Add sections, preferring newer data
for section in sections:
section_id = section['id']
# If we haven't seen this ID, add it
if section_id not in sections_by_id:
sections_by_id[section_id] = section
all_sections.append(section)
else:
# Check if this version has more content (like captions)
old_content = sections_by_id[section_id]['content']
new_content = section['content']
# Prefer content with captions/more detail
if ('Caption Status:' in new_content and 'Caption Status:' not in old_content) or \
len(new_content) > len(old_content):
logger.info(f" Updating section {section_id} with more detailed content")
# Update in place
for i, existing in enumerate(all_sections):
if existing['id'] == section_id:
all_sections[i] = section
sections_by_id[section_id] = section
break
except Exception as e:
logger.error(f"Error processing {file_path}: {e}")
continue
if not all_sections:
logger.warning(f"No sections found for {source_name}")
return False
# Create consolidated content
consolidated_content = []
# Sort sections by ID for consistency
all_sections.sort(key=lambda x: x['id'])
for section in all_sections:
consolidated_content.append(section['content'])
consolidated_content.append("") # Add separator
# Generate new filename following project specification
timestamp = get_atlantic_timestamp()
new_filename = f"hvacnkowitall_{source_name}_{timestamp}.md"
new_file_path = current_dir / new_filename
# Save consolidated file
final_content = '\n'.join(consolidated_content)
new_file_path.write_text(final_content, encoding='utf-8')
logger.info(f"Created consolidated file: {new_filename}")
logger.info(f" Total sections: {len(all_sections)}")
logger.info(f" File size: {len(final_content):,} characters")
# Archive old files
archive_source_dir = archives_dir / source_name
archive_source_dir.mkdir(parents=True, exist_ok=True)
archived_count = 0
for old_file in current_files:
if old_file.name != new_filename: # Don't archive the new file
try:
archive_path = archive_source_dir / old_file.name
old_file.rename(archive_path)
archived_count += 1
logger.info(f" Archived: {old_file.name}")
except Exception as e:
logger.error(f"Error archiving {old_file.name}: {e}")
logger.info(f"Archived {archived_count} old files for {source_name}")
# Create copy in archives as well
archive_current_path = archive_source_dir / new_filename
archive_current_path.write_text(final_content, encoding='utf-8')
return True
def main():
"""Main consolidation function."""
logger.info("=" * 60)
logger.info("CONSOLIDATING CURRENT MARKDOWN FILES")
logger.info("=" * 60)
# Create directories if needed
Path('data/markdown_current').mkdir(parents=True, exist_ok=True)
Path('data/markdown_archives').mkdir(parents=True, exist_ok=True)
Path('logs').mkdir(parents=True, exist_ok=True)
# Define sources to consolidate
sources = ['YouTube', 'MailChimp', 'Instagram', 'TikTok', 'Podcast']
consolidated = []
failed = []
for source in sources:
logger.info(f"\n{'-' * 40}")
try:
if consolidate_source_files(source):
consolidated.append(source)
else:
failed.append(source)
except Exception as e:
logger.error(f"Failed to consolidate {source}: {e}")
failed.append(source)
logger.info(f"\n{'=' * 60}")
logger.info("CONSOLIDATION SUMMARY")
logger.info(f"{'=' * 60}")
logger.info(f"Successfully consolidated: {consolidated}")
logger.info(f"Failed/No data: {failed}")
# List final current files
current_files = list(Path('data/markdown_current').glob('*.md'))
logger.info(f"\nFinal current files:")
for file in sorted(current_files):
size = file.stat().st_size
logger.info(f" {file.name} ({size:,} bytes)")
if __name__ == "__main__":
main()

@@ -0,0 +1,229 @@
#!/usr/bin/env python3
"""
Continue YouTube caption fetching using remaining quota
Fetches captions for videos 50-188 (next 139 videos by view count)
Uses up to 95% of daily quota (9,500 units)
"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent))
from src.youtube_api_scraper_v2 import YouTubeAPIScraper
from src.base_scraper import ScraperConfig
from datetime import datetime
import pytz
import time
import json
import logging
# Set up logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('logs/youtube_caption_continue.log'),
logging.StreamHandler()
]
)
logger = logging.getLogger('youtube_captions')
def load_existing_videos():
"""Load existing video data from the latest markdown file."""
latest_file = Path('data/markdown_current/hvacnkowitall_YouTube_2025-08-19T100336.md')
if not latest_file.exists():
logger.error(f"Latest YouTube file not found: {latest_file}")
return []
# Parse the markdown to extract video data
content = latest_file.read_text(encoding='utf-8')
videos = []
# Simple parsing - split by video sections
sections = content.split('# ID: ')
for section in sections[1:]: # Skip first empty section
lines = section.strip().split('\n')
if not lines:
continue
video_id = lines[0].strip()
video_data = {'id': video_id}
# Parse basic info
for line in lines:
if line.startswith('## Title: '):
video_data['title'] = line.replace('## Title: ', '')
elif line.startswith('## Views: '):
views_str = line.replace('## Views: ', '').replace(',', '')
video_data['view_count'] = int(views_str) if views_str.isdigit() else 0
elif line.startswith('## Caption Status:'):
video_data['has_caption_info'] = True
videos.append(video_data)
logger.info(f"Loaded {len(videos)} videos from existing file")
return videos
def continue_caption_fetching():
"""Continue fetching captions from where we left off."""
logger.info("=" * 60)
logger.info("CONTINUING YOUTUBE CAPTION FETCHING")
logger.info("=" * 60)
# Load existing video data
videos = load_existing_videos()
if not videos:
logger.error("No existing videos found to continue from")
return False
# Sort by view count (descending)
videos.sort(key=lambda x: x.get('view_count', 0), reverse=True)
# Count how many already have captions
with_captions = sum(1 for v in videos if v.get('has_caption_info'))
without_captions = [v for v in videos if not v.get('has_caption_info')]
logger.info(f"Current status:")
logger.info(f" Total videos: {len(videos)}")
logger.info(f" Already have captions: {with_captions}")
logger.info(f" Need captions: {len(without_captions)}")
# Calculate quota
quota_used_so_far = 2519 # From previous run
daily_limit = 10000
target_usage = int(daily_limit * 0.95) # 95% = 9,500 units
available_quota = target_usage - quota_used_so_far
logger.info(f"Quota analysis:")
logger.info(f" Daily limit: {daily_limit:,} units")
logger.info(f" Already used: {quota_used_so_far:,} units")
logger.info(f" Target (95%): {target_usage:,} units")
logger.info(f" Available: {available_quota:,} units")
# Calculate how many more videos we can caption
max_additional_captions = available_quota // 50 # 50 units per video
videos_to_caption = without_captions[:max_additional_captions]
logger.info(f"Caption plan:")
logger.info(f" Videos to caption now: {len(videos_to_caption)}")
logger.info(f" Estimated quota cost: {len(videos_to_caption) * 50:,} units")
logger.info(f" Will use: {quota_used_so_far + (len(videos_to_caption) * 50):,} units total")
if not videos_to_caption:
logger.info("No additional videos to caption within quota limits")
return True
# Set up scraper
config = ScraperConfig(
source_name='YouTube',
brand_name='hvacnkowitall',
data_dir=Path('data/markdown_current'),
logs_dir=Path('logs/YouTube'),
timezone='America/Halifax'
)
scraper = YouTubeAPIScraper(config)
scraper.quota_used = quota_used_so_far # Set initial quota usage
logger.info(f"Starting caption fetching for {len(videos_to_caption)} videos...")
start_time = time.time()
captions_found = 0
for i, video in enumerate(videos_to_caption, 1):
video_id = video['id']
title = video.get('title', 'Unknown')[:50]
logger.info(f"[{i}/{len(videos_to_caption)}] Fetching caption for: {title}...")
# Fetch caption info
caption_info = scraper._fetch_caption_text(video_id)
if caption_info:
video['caption_text'] = caption_info
captions_found += 1
logger.info(f" ✅ Caption found")
else:
logger.info(f" ❌ No caption available")
# Add delay to be respectful
time.sleep(0.5)
# Check if we're approaching quota limit
if scraper.quota_used >= target_usage:
logger.warning(f"Reached 95% quota limit at video {i}")
break
elapsed = time.time() - start_time
logger.info(f"Caption fetching complete!")
logger.info(f" Duration: {elapsed:.1f} seconds")
logger.info(f" Captions found: {captions_found}")
logger.info(f" Quota used: {scraper.quota_used:,}/{daily_limit:,} units")
logger.info(f" Quota percentage: {(scraper.quota_used/daily_limit)*100:.1f}%")
# Update the video data with new caption info
video_lookup = {v['id']: v for v in videos}
for video in videos_to_caption:
if video['id'] in video_lookup and video.get('caption_text'):
video_lookup[video['id']]['caption_text'] = video['caption_text']
# Save updated data
timestamp = datetime.now(pytz.timezone('America/Halifax')).strftime('%Y-%m-%dT%H%M%S')
updated_filename = f"hvacnkowitall_YouTube_{timestamp}_captions.md"
# Generate updated markdown (simplified version)
markdown_sections = []
for video in videos:
section = []
section.append(f"# ID: {video['id']}")
section.append("")
section.append(f"## Title: {video.get('title', 'Unknown')}")
section.append("")
section.append(f"## Views: {video.get('view_count', 0):,}")
section.append("")
# Caption status
if video.get('caption_text'):
section.append("## Caption Status:")
section.append(video['caption_text'])
section.append("")
elif video.get('has_caption_info'):
section.append("## Caption Status:")
section.append("[Captions available - ]")
section.append("")
section.append("-" * 50)
section.append("")
markdown_sections.append('\n'.join(section))
# Save updated file
output_file = Path(f'data/markdown_current/{updated_filename}')
output_file.write_text('\n'.join(markdown_sections), encoding='utf-8')
logger.info(f"Updated file saved: {output_file}")
# Calculate remaining work
total_with_captions = with_captions + captions_found
remaining_videos = len(videos) - total_with_captions
logger.info(f"Progress summary:")
logger.info(f" Total videos: {len(videos)}")
logger.info(f" Captioned: {total_with_captions}")
logger.info(f" Remaining: {remaining_videos}")
logger.info(f" Progress: {(total_with_captions/len(videos))*100:.1f}%")
if remaining_videos > 0:
days_needed = (remaining_videos // 190) + (1 if remaining_videos % 190 else 0)
logger.info(f" Estimated days to complete: {days_needed}")
return True
if __name__ == "__main__":
success = continue_caption_fetching()
sys.exit(0 if success else 1)

docs/cumulative_markdown.md (new file, 188 lines added)

@@ -0,0 +1,188 @@
# Cumulative Markdown System Documentation
## Overview
The cumulative markdown system maintains a single, continuously growing markdown file per content source that intelligently combines backlog data with incremental daily updates.
## Problem It Solves
Previously, each scraper run created entirely new files:
- Backlog runs created large initial files
- Incremental updates created small separate files
- No merging of content between files
- Multiple files per source made it hard to find the "current" state
## Solution Architecture
### CumulativeMarkdownManager
Core class that handles:
1. **Loading** existing markdown files
2. **Parsing** content into sections by unique ID
3. **Merging** new content with existing sections
4. **Updating** sections when better data is available
5. **Archiving** previous versions for history
6. **Saving** updated single-source-of-truth file
### Merge Logic
The system uses intelligent merging based on content quality:
```python
def should_update_section(old_section: dict, new_section: dict) -> bool:
    old_meta = old_section.get("metadata", {})
    new_meta = new_section.get("metadata", {})
    # Update if the new version has captions/transcripts the old one lacks
    if new_meta.get("has_caption") and not old_meta.get("has_caption"):
        return True
    if new_meta.get("has_transcript") and not old_meta.get("has_transcript"):
        return True
    # Update if the new version has significantly more content (>20% longer description)
    if new_meta.get("description_length", 0) > old_meta.get("description_length", 0) * 1.2:
        return True
    # Update if engagement metrics (views, likes, comments) have increased
    if any(new_meta.get(m, 0) > old_meta.get(m, 0) for m in ("views", "likes", "comments")):
        return True
    return False
```
## Usage Patterns
### Initial Backlog Capture
```python
# Day 1 - First run captures all historical content
scraper.fetch_content(max_posts=1000)
# Creates: hvacnkowitall_YouTube_2025-08-19T080000.md (444 videos)
```
### Daily Incremental Update
```python
# Day 2 - Fetch only new content since last run
scraper.fetch_content() # Uses state to get only new items
# Loads existing file, merges new content
# Updates: hvacnkowitall_YouTube_2025-08-19T120000.md (449 videos)
```
### Caption/Transcript Enhancement
```python
# Day 3 - Fetch captions for existing videos
youtube_scraper.fetch_captions(video_ids)
# Loads existing file, updates videos with caption data
# Saves a new timestamped file, e.g. hvacnkowitall_YouTube_2025-08-19T160000.md (449 videos, 200 with captions)
```
## File Management
### Naming Convention
```
hvacnkowitall_<Source>_<YYYY-MM-DDTHHMMSS>.md
```
- Brand name is always lowercase
- Source name is TitleCase
- Timestamp in Atlantic timezone
### Archive Strategy
```
Current:
  hvacnkowitall_YouTube_2025-08-19T143045.md          (latest)

Archives:
  YouTube/
    hvacnkowitall_YouTube_2025-08-19T080000_archived_20250819_120000.md
    hvacnkowitall_YouTube_2025-08-19T120000_archived_20250819_143045.md
```
## Implementation Details
### Section Structure
Each content item is a section with unique ID:
```markdown
# ID: video_abc123
## Title: Video Title
## Views: 1,234
## Description:
Full description text...
## Caption Status:
Caption text if available...
## Publish Date: 2024-01-15
--------------------------------------------------
```
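A minimal sketch of how sections in this format can be split back out and indexed by ID; it mirrors, in simplified form, the parsing done by `CumulativeMarkdownManager.parse_markdown_sections`:

```python
from typing import Dict


def parse_sections(content: str) -> Dict[str, str]:
    """Index markdown sections by the value of their '# ID: ...' header."""
    sections: Dict[str, str] = {}
    for part in content.split("# ID: ")[1:]:   # the first chunk precedes any ID header
        if not part.strip():
            continue
        section_id = part.strip().split("\n", 1)[0].strip()
        sections[section_id] = "# ID: " + part.strip()
    return sections
```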
### Merge Process
1. **Parse** both existing and new content into sections
2. **Index** by unique ID (video ID, post ID, etc.)
3. **Compare** sections with same ID
4. **Update** if new version is better
5. **Add** new sections not in existing file
6. **Sort** by date (newest first) or maintain order
7. **Save** combined content with new timestamp (a condensed sketch of the merge step follows this list)
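Condensed, steps 2-5 amount to a dictionary merge keyed by section ID, reusing the `should_update_section` helper shown in the Merge Logic section above; the real `CumulativeMarkdownManager.merge_content` additionally re-sorts the result and logs added/updated counts:

```python
from typing import Dict


def merge_sections(existing: Dict[str, dict], new: Dict[str, dict]) -> Dict[str, dict]:
    """Merge new sections into existing ones, keeping the better version of duplicates."""
    for section_id, new_section in new.items():
        old_section = existing.get(section_id)
        if old_section is None:
            existing[section_id] = new_section           # new item not in the existing file
        elif should_update_section(old_section, new_section):
            existing[section_id] = new_section           # better data (captions, metrics, ...)
    return existing
```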
### State Management
State files track last processed item for incremental updates:
```json
{
"last_video_id": "abc123",
"last_video_date": "2024-01-20",
"last_sync": "2024-01-20T12:00:00",
"total_processed": 449
}
```
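A small sketch of reading and writing such a state file between runs; the `data/.state/YouTube_state.json` path is an assumption based on the debug commands in the README, and error handling is kept minimal:

```python
import json
from pathlib import Path

STATE_FILE = Path("data/.state/YouTube_state.json")   # assumed per-source location under data/.state/


def load_state() -> dict:
    """Return the saved state, or an empty dict on the first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text(encoding="utf-8"))
    return {}


def save_state(state: dict) -> None:
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")
```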
## Benefits
1. **Single Source of Truth**: One file per source with all content
2. **Automatic Updates**: Existing entries enhanced with new data
3. **Efficient Storage**: No duplicate content across files
4. **Complete History**: Archives preserve all versions
5. **Incremental Growth**: Files grow naturally over time
6. **Smart Merging**: Best version of each entry is preserved
## Migration from Separate Files
Use the consolidation script to migrate existing separate files:
```bash
# Consolidate all existing files into cumulative format
uv run python consolidate_current_files.py
```
This will:
1. Find all files for each source
2. Parse and merge by content ID
3. Create single cumulative file
4. Archive old separate files
## Testing
Test the cumulative workflow:
```bash
uv run python test_cumulative_mode.py
```
This demonstrates:
- Initial backlog capture (5 items)
- First incremental update (+2 items = 7 total)
- Second incremental with updates (1 updated, +1 new = 8 total)
- Proper archival of previous versions
## Future Enhancements
Potential improvements:
1. Conflict resolution strategies (user choice on updates)
2. Differential backups (only store changes)
3. Compression of archived versions
4. Metrics tracking across versions
5. Automatic cleanup of old archives
6. API endpoint to query cumulative statistics

@@ -1,11 +1,11 @@
 # HVAC Know It All Content Aggregation - Project Status
-## Current Status: 🟢 PRODUCTION DEPLOYED
+## Current Status: 🟢 PRODUCTION READY
 **Project Completion: 100%**
 **All 6 Sources: ✅ Working**
-**Deployment: 🚀 In Production**
+**Deployment: 🚀 Production Ready**
-**Last Updated: 2025-08-18 23:15 ADT**
+**Last Updated: 2025-08-19 10:50 ADT**
 ---
@@ -13,18 +13,34 @@
 | Source | Status | Last Tested | Items Fetched | Notes |
 |--------|--------|-------------|---------------|-------|
-| WordPress Blog | ✅ Working | 2025-08-18 | 139 posts | HTML cleaning implemented, clean markdown output |
-| MailChimp RSS | ⚠️ SSL Error | 2025-08-18 | 0 entries | Provider SSL issue, not a code problem |
-| Podcast RSS | ✅ Working | 2025-08-18 | 428 episodes | Full backlog captured successfully |
-| YouTube | ✅ Working | 2025-08-18 | 200 videos | Channel scraping with metadata |
-| Instagram | 🔄 Processing | 2025-08-18 | 45/1000 posts | Rate: 200/hr, ETA: 3:54 AM |
-| TikTok | ⏳ Queued | 2025-08-18 | 0/1000 videos | Starts after Instagram completes |
+| YouTube | ✅ API Working | 2025-08-19 | 444 videos | API integration, 179/444 with captions (40.3%) |
+| MailChimp | ✅ API Working | 2025-08-19 | 22 campaigns | API integration, cleaned content |
+| TikTok | ✅ Working | 2025-08-19 | 35 videos | All available videos captured |
+| Podcast RSS | ✅ Working | 2025-08-19 | 428 episodes | Full backlog captured |
+| WordPress Blog | ✅ Working | 2025-08-18 | 139 posts | HTML cleaning implemented |
+| Instagram | 🔄 Processing | 2025-08-19 | ~555/1000 posts | Long-running backlog capture |
 ---
+## Latest Updates (2025-08-19)
+### 🆕 Cumulative Markdown System
+- **Single Source of Truth**: One continuously growing file per source
+- **Intelligent Merging**: Updates existing entries with new data (captions, metrics)
+- **Backlog + Incremental**: Properly combines historical and daily updates
+- **Smart Updates**: Prefers content with captions/transcripts over without
+- **Archive Management**: Previous versions timestamped in archives
+### 🆕 API Integrations
+- **YouTube Data API v3**: Replaced yt-dlp with official API
+- **MailChimp API**: Replaced RSS feed with API integration
+- **Caption Support**: YouTube captions via Data API (50 units/video)
+- **Content Cleaning**: MailChimp headers/footers removed
 ## Technical Implementation
 ### ✅ Core Features Complete
+- **Cumulative Markdown**: Single growing file per source with intelligent merging
 - **Incremental Updates**: All scrapers support state-based incremental fetching
 - **Archive Management**: Previous files automatically archived with timestamps
 - **Markdown Conversion**: All content properly converted to markdown format

run_api_production_v2.py (new executable file, 304 lines added)

@@ -0,0 +1,304 @@
#!/usr/bin/env python3
"""
Production script for API-based content scraping - Version 2
Follows project specification file/folder naming conventions
Captures YouTube videos with captions and MailChimp campaigns with cleaned content
"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent))
from src.youtube_api_scraper_v2 import YouTubeAPIScraper
from src.mailchimp_api_scraper_v2 import MailChimpAPIScraper
from src.base_scraper import ScraperConfig
from datetime import datetime
import pytz
import time
import logging
import subprocess
# Set up logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('logs/api_production_v2.log'),
logging.StreamHandler()
]
)
logger = logging.getLogger('api_production_v2')
def get_atlantic_timestamp() -> str:
"""Get current timestamp in Atlantic timezone for file naming."""
tz = pytz.timezone('America/Halifax')
return datetime.now(tz).strftime('%Y-%m-%dT%H%M%S')
def run_youtube_api_production():
"""Run YouTube API scraper for production backlog with captions."""
logger.info("=" * 60)
logger.info("YOUTUBE API SCRAPER - PRODUCTION V2")
logger.info("=" * 60)
timestamp = get_atlantic_timestamp()
# Follow project specification directory structure
config = ScraperConfig(
source_name='YouTube', # Capitalized per spec
brand_name='hvacnkowitall',
data_dir=Path('data/markdown_current'),
logs_dir=Path('logs/YouTube'),
timezone='America/Halifax'
)
try:
scraper = YouTubeAPIScraper(config)
logger.info("Starting YouTube API fetch with captions for all videos...")
start = time.time()
# Fetch all videos WITH captions for top 50 (use more quota)
videos = scraper.fetch_content(fetch_captions=True)
elapsed = time.time() - start
logger.info(f"Fetched {len(videos)} videos in {elapsed:.1f} seconds")
if videos:
# Statistics
total_views = sum(v.get('view_count', 0) for v in videos)
total_likes = sum(v.get('like_count', 0) for v in videos)
with_captions = sum(1 for v in videos if v.get('caption_text'))
logger.info(f"Statistics:")
logger.info(f" Total videos: {len(videos)}")
logger.info(f" Total views: {total_views:,}")
logger.info(f" Total likes: {total_likes:,}")
logger.info(f" Videos with captions: {with_captions}")
logger.info(f" Quota used: {scraper.quota_used}/{scraper.daily_quota_limit} units")
# Save with project specification naming: <brandName>_<source>_<dateTime>.md
filename = f"hvacnkowitall_YouTube_{timestamp}.md"
markdown = scraper.format_markdown(videos)
output_file = Path(f'data/markdown_current/{filename}')
output_file.parent.mkdir(parents=True, exist_ok=True)
output_file.write_text(markdown, encoding='utf-8')
logger.info(f"Markdown saved to: {output_file}")
# Create archive copy
archive_dir = Path('data/markdown_archives/YouTube')
archive_dir.mkdir(parents=True, exist_ok=True)
archive_file = archive_dir / filename
archive_file.write_text(markdown, encoding='utf-8')
logger.info(f"Archive copy saved to: {archive_file}")
# Update state file
state = scraper.load_state()
state = scraper.update_state(state, videos)
scraper.save_state(state)
logger.info("State file updated for incremental updates")
return True, len(videos), output_file
else:
logger.error("No videos fetched from YouTube API")
return False, 0, None
except Exception as e:
logger.error(f"YouTube API scraper failed: {e}")
return False, 0, None
def run_mailchimp_api_production():
"""Run MailChimp API scraper for production backlog with cleaned content."""
logger.info("\n" + "=" * 60)
logger.info("MAILCHIMP API SCRAPER - PRODUCTION V2")
logger.info("=" * 60)
timestamp = get_atlantic_timestamp()
# Follow project specification directory structure
config = ScraperConfig(
source_name='MailChimp', # Capitalized per spec
brand_name='hvacnkowitall',
data_dir=Path('data/markdown_current'),
logs_dir=Path('logs/MailChimp'),
timezone='America/Halifax'
)
try:
scraper = MailChimpAPIScraper(config)
logger.info("Starting MailChimp API fetch with content cleaning...")
start = time.time()
# Fetch all campaigns from Bi-Weekly Newsletter folder
campaigns = scraper.fetch_content(max_items=1000)
elapsed = time.time() - start
logger.info(f"Fetched {len(campaigns)} campaigns in {elapsed:.1f} seconds")
if campaigns:
# Statistics
total_sent = sum(c.get('metrics', {}).get('emails_sent', 0) for c in campaigns)
total_opens = sum(c.get('metrics', {}).get('unique_opens', 0) for c in campaigns)
total_clicks = sum(c.get('metrics', {}).get('unique_clicks', 0) for c in campaigns)
logger.info(f"Statistics:")
logger.info(f" Total campaigns: {len(campaigns)}")
logger.info(f" Total emails sent: {total_sent:,}")
logger.info(f" Total unique opens: {total_opens:,}")
logger.info(f" Total unique clicks: {total_clicks:,}")
if campaigns:
avg_open_rate = sum(c.get('metrics', {}).get('open_rate', 0) for c in campaigns) / len(campaigns)
avg_click_rate = sum(c.get('metrics', {}).get('click_rate', 0) for c in campaigns) / len(campaigns)
logger.info(f" Average open rate: {avg_open_rate*100:.1f}%")
logger.info(f" Average click rate: {avg_click_rate*100:.1f}%")
# Save with project specification naming: <brandName>_<source>_<dateTime>.md
filename = f"hvacnkowitall_MailChimp_{timestamp}.md"
markdown = scraper.format_markdown(campaigns)
output_file = Path(f'data/markdown_current/{filename}')
output_file.parent.mkdir(parents=True, exist_ok=True)
output_file.write_text(markdown, encoding='utf-8')
logger.info(f"Markdown saved to: {output_file}")
# Create archive copy
archive_dir = Path('data/markdown_archives/MailChimp')
archive_dir.mkdir(parents=True, exist_ok=True)
archive_file = archive_dir / filename
archive_file.write_text(markdown, encoding='utf-8')
logger.info(f"Archive copy saved to: {archive_file}")
# Update state file
state = scraper.load_state()
state = scraper.update_state(state, campaigns)
scraper.save_state(state)
logger.info("State file updated for incremental updates")
return True, len(campaigns), output_file
else:
logger.warning("No campaigns found in MailChimp")
return True, 0, None
except Exception as e:
logger.error(f"MailChimp API scraper failed: {e}")
return False, 0, None
def sync_to_nas():
"""Sync API scraper results to NAS following project structure."""
logger.info("\n" + "=" * 60)
logger.info("SYNCING TO NAS - PROJECT STRUCTURE")
logger.info("=" * 60)
nas_base = Path('/mnt/nas/hvacknowitall')
try:
# Sync all markdown_current files
local_current = Path('data/markdown_current')
nas_current = nas_base / 'markdown_current'
if local_current.exists() and any(local_current.glob('*.md')):
# Create destination if needed
nas_current.mkdir(parents=True, exist_ok=True)
# Sync all current markdown files
cmd = ['rsync', '-av', '--include=*.md', '--exclude=*',
str(local_current) + '/', str(nas_current) + '/']
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode == 0:
logger.info(f"✅ Current markdown files synced to NAS: {nas_current}")
# List synced files
for md_file in nas_current.glob('*.md'):
size = md_file.stat().st_size / 1024 # KB
logger.info(f" - {md_file.name} ({size:.0f}KB)")
else:
logger.warning(f"Sync warning: {result.stderr}")
else:
logger.info("No current markdown files to sync")
# Sync archives
for source in ['YouTube', 'MailChimp']:
local_archive = Path(f'data/markdown_archives/{source}')
nas_archive = nas_base / f'markdown_archives/{source}'
if local_archive.exists() and any(local_archive.glob('*.md')):
nas_archive.mkdir(parents=True, exist_ok=True)
cmd = ['rsync', '-av', '--include=*.md', '--exclude=*',
str(local_archive) + '/', str(nas_archive) + '/']
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode == 0:
logger.info(f"{source} archives synced to NAS: {nas_archive}")
else:
logger.warning(f"{source} archive sync warning: {result.stderr}")
except Exception as e:
logger.error(f"Failed to sync to NAS: {e}")
def main():
"""Main production run with project specification compliance."""
logger.info("=" * 70)
logger.info("HVAC KNOW IT ALL - API SCRAPERS PRODUCTION V2")
logger.info("Following Project Specification Standards")
logger.info("=" * 70)
atlantic_tz = pytz.timezone('America/Halifax')
start_time = datetime.now(atlantic_tz)
logger.info(f"Started at: {start_time.isoformat()}")
# Track results
results = {
'YouTube': {'success': False, 'count': 0, 'file': None},
'MailChimp': {'success': False, 'count': 0, 'file': None}
}
# Run YouTube API scraper with captions
success, count, output_file = run_youtube_api_production()
results['YouTube'] = {'success': success, 'count': count, 'file': output_file}
# Run MailChimp API scraper with content cleaning
success, count, output_file = run_mailchimp_api_production()
results['MailChimp'] = {'success': success, 'count': count, 'file': output_file}
# Sync to NAS
sync_to_nas()
# Summary
end_time = datetime.now(atlantic_tz)
duration = end_time - start_time
logger.info("\n" + "=" * 70)
logger.info("PRODUCTION V2 SUMMARY")
logger.info("=" * 70)
for source, result in results.items():
status = "" if result['success'] else ""
logger.info(f"{status} {source}: {result['count']} items")
if result['file']:
logger.info(f" Output: {result['file']}")
logger.info(f"\nTotal duration: {duration.total_seconds():.1f} seconds")
logger.info(f"Completed at: {end_time.isoformat()}")
# Project specification compliance
logger.info("\nPROJECT SPECIFICATION COMPLIANCE:")
logger.info("✅ File naming: hvacnkowitall_<Source>_<YYYY-MM-DDTHHMMSS>.md")
logger.info("✅ Directory structure: data/markdown_current/, data/markdown_archives/")
logger.info("✅ Capitalized source names: YouTube, MailChimp")
logger.info("✅ Atlantic timezone timestamps")
logger.info("✅ Archive copies created")
logger.info("✅ State files for incremental updates")
# Return success if at least one scraper succeeded
return any(r['success'] for r in results.values())
if __name__ == "__main__":
success = main()
sys.exit(0 if success else 1)

@@ -0,0 +1,80 @@
#!/usr/bin/env python3
"""
Enhanced Base Scraper with Cumulative Markdown Support
Extension of base_scraper.py that adds cumulative markdown functionality
"""
from src.base_scraper import BaseScraper
from src.cumulative_markdown_manager import CumulativeMarkdownManager
from pathlib import Path
from typing import List, Dict, Any, Optional


class BaseScraperCumulative(BaseScraper):
    """Base scraper with cumulative markdown support."""

    def __init__(self, config, use_cumulative: bool = True):
        """Initialize with optional cumulative mode."""
        super().__init__(config)
        self.use_cumulative = use_cumulative
        if self.use_cumulative:
            self.cumulative_manager = CumulativeMarkdownManager(config, self.logger)
            self.logger.info("Initialized with cumulative markdown mode")

    def save_content(self, items: List[Dict[str, Any]]) -> Optional[Path]:
        """Save content using either cumulative or traditional mode."""
        if not items:
            self.logger.warning("No items to save")
            return None
        if self.use_cumulative:
            # Use cumulative manager
            return self.cumulative_manager.save_cumulative(
                items,
                self.format_markdown
            )
        else:
            # Use traditional save (creates new file each time)
            markdown = self.format_markdown(items)
            return self.save_markdown(markdown)

    def run(self) -> Optional[Path]:
        """Run the scraper with cumulative support."""
        try:
            self.logger.info(f"Starting {self.config.source_name} scraper "
                             f"(cumulative={self.use_cumulative})")
            # Fetch content (will check state for incremental)
            items = self.fetch_content()
            if not items:
                self.logger.info("No new content found")
                return None
            self.logger.info(f"Fetched {len(items)} items")
            # Save content (cumulative or traditional)
            filepath = self.save_content(items)
            # Update state for next incremental run
            if items and filepath:
                self.update_state(items)
            # Log statistics if cumulative
            if self.use_cumulative:
                stats = self.cumulative_manager.get_statistics(filepath)
                self.logger.info(f"Cumulative stats: {stats}")
            return filepath
        except Exception as e:
            self.logger.error(f"Error in scraper run: {e}")
            raise

    def get_cumulative_stats(self) -> Dict[str, int]:
        """Get statistics about the cumulative file."""
        if not self.use_cumulative:
            return {}
        return self.cumulative_manager.get_statistics()

@@ -0,0 +1,273 @@
#!/usr/bin/env python3
"""
Cumulative Markdown Manager
Maintains a single, growing markdown file per source that combines:
- Initial backlog content
- Daily incremental updates
- Updates to existing entries (e.g., new captions, updated metrics)
"""
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Optional, Any
import pytz
import logging
import shutil
import re
class CumulativeMarkdownManager:
"""Manages cumulative markdown files that grow with each update."""
def __init__(self, config, logger: Optional[logging.Logger] = None):
"""Initialize with scraper config."""
self.config = config
self.logger = logger or logging.getLogger(self.__class__.__name__)
self.tz = pytz.timezone(config.timezone)
# Paths
self.current_dir = config.data_dir / "markdown_current"
self.archive_dir = config.data_dir / "markdown_archives" / config.source_name.title()
# Ensure directories exist
self.current_dir.mkdir(parents=True, exist_ok=True)
self.archive_dir.mkdir(parents=True, exist_ok=True)
# File pattern for this source
self.file_pattern = f"{config.brand_name}_{config.source_name}_*.md"
def get_current_file(self) -> Optional[Path]:
"""Find the current markdown file for this source."""
files = list(self.current_dir.glob(self.file_pattern))
if not files:
return None
# Return the most recent file (by filename timestamp)
files.sort(reverse=True)
return files[0]
def parse_markdown_sections(self, content: str) -> Dict[str, Dict]:
"""Parse markdown content into sections indexed by ID."""
sections = {}
# Split by ID headers
parts = content.split('# ID: ')
for part in parts[1:]: # Skip first empty part
if not part.strip():
continue
lines = part.strip().split('\n')
section_id = lines[0].strip()
# Reconstruct full section content
section_content = f"# ID: {section_id}\n" + '\n'.join(lines[1:])
# Extract metadata for comparison
metadata = self._extract_metadata(section_content)
sections[section_id] = {
'id': section_id,
'content': section_content,
'metadata': metadata
}
return sections
def _extract_metadata(self, content: str) -> Dict[str, Any]:
"""Extract metadata from section content for comparison."""
metadata = {}
# Extract common fields
patterns = {
'views': r'## Views?:\s*([0-9,]+)',
'likes': r'## Likes?:\s*([0-9,]+)',
'comments': r'## Comments?:\s*([0-9,]+)',
'publish_date': r'## Publish(?:ed)? Date:\s*([^\n]+)',
'has_caption': r'## Caption Status:',
'has_transcript': r'## Transcript:',
'description_length': r'## Description:\n(.+?)(?:\n##|\n---|\Z)',
}
for key, pattern in patterns.items():
match = re.search(pattern, content, re.DOTALL | re.IGNORECASE)
if match:
if key in ['views', 'likes', 'comments']:
# Convert numeric fields
metadata[key] = int(match.group(1).replace(',', ''))
elif key in ['has_caption', 'has_transcript']:
# Boolean fields
metadata[key] = True
elif key == 'description_length':
# Calculate length of description
metadata[key] = len(match.group(1).strip())
else:
metadata[key] = match.group(1).strip()
return metadata
def should_update_section(self, old_section: Dict, new_section: Dict) -> bool:
"""Determine if a section should be updated with new content."""
old_meta = old_section.get('metadata', {})
new_meta = new_section.get('metadata', {})
# Update if new section has captions/transcripts that old doesn't
if new_meta.get('has_caption') and not old_meta.get('has_caption'):
return True
if new_meta.get('has_transcript') and not old_meta.get('has_transcript'):
return True
# Update if new section has more content
old_desc_len = old_meta.get('description_length', 0)
new_desc_len = new_meta.get('description_length', 0)
if new_desc_len > old_desc_len * 1.2: # 20% more content
return True
# Update if metrics have changed significantly (for incremental updates)
for metric in ['views', 'likes', 'comments']:
old_val = old_meta.get(metric, 0)
new_val = new_meta.get(metric, 0)
if new_val > old_val:
return True
# Update if content is substantially different
if len(new_section['content']) > len(old_section['content']) * 1.1:
return True
return False
def merge_content(self, existing_sections: Dict[str, Dict],
new_items: List[Dict[str, Any]],
formatter_func) -> str:
"""Merge new content with existing sections."""
# Convert new items to sections
new_content = formatter_func(new_items)
new_sections = self.parse_markdown_sections(new_content)
# Track updates
added_count = 0
updated_count = 0
# Merge sections
for section_id, new_section in new_sections.items():
if section_id in existing_sections:
# Update existing section if newer/better
if self.should_update_section(existing_sections[section_id], new_section):
existing_sections[section_id] = new_section
updated_count += 1
self.logger.info(f"Updated section: {section_id}")
else:
# Add new section
existing_sections[section_id] = new_section
added_count += 1
self.logger.debug(f"Added new section: {section_id}")
self.logger.info(f"Merge complete: {added_count} added, {updated_count} updated")
# Reconstruct markdown content
# Sort by ID to maintain consistent order
sorted_sections = sorted(existing_sections.values(),
key=lambda x: x['id'])
# For sources with dates, sort by date (newest first)
# Try to extract date from content for better sorting
for section in sorted_sections:
date_match = re.search(r'## Publish(?:ed)? Date:\s*([^\n]+)',
section['content'])
if date_match:
try:
# Parse various date formats
date_str = date_match.group(1).strip()
# Add parsed date for sorting
section['sort_date'] = date_str
except:
pass
# Sort by date if available, otherwise by ID
if any('sort_date' in s for s in sorted_sections):
sorted_sections.sort(key=lambda x: x.get('sort_date', ''), reverse=True)
# Combine sections
combined_content = []
for section in sorted_sections:
combined_content.append(section['content'])
combined_content.append("") # Empty line between sections
return '\n'.join(combined_content)
def save_cumulative(self, new_items: List[Dict[str, Any]],
formatter_func) -> Path:
"""Save content cumulatively, merging with existing file if present."""
current_file = self.get_current_file()
if current_file and current_file.exists():
# Load and merge with existing content
self.logger.info(f"Loading existing file: {current_file.name}")
existing_content = current_file.read_text(encoding='utf-8')
existing_sections = self.parse_markdown_sections(existing_content)
# Merge new items with existing sections
merged_content = self.merge_content(existing_sections, new_items,
formatter_func)
# Archive the current file before overwriting
self._archive_file(current_file)
else:
# First time - just format the new items
self.logger.info("No existing file, creating new cumulative file")
merged_content = formatter_func(new_items)
# Generate new filename with current timestamp
timestamp = datetime.now(self.tz).strftime('%Y-%m-%dT%H%M%S')
filename = f"{self.config.brand_name}_{self.config.source_name}_{timestamp}.md"
filepath = self.current_dir / filename
# Save merged content
filepath.write_text(merged_content, encoding='utf-8')
self.logger.info(f"Saved cumulative file: {filename}")
# Remove old file if it exists (we archived it already)
if current_file and current_file.exists() and current_file != filepath:
current_file.unlink()
self.logger.debug(f"Removed old file: {current_file.name}")
return filepath
def _archive_file(self, file_path: Path) -> None:
"""Archive a file with timestamp suffix."""
if not file_path.exists():
return
# Add archive timestamp to filename
archive_time = datetime.now(self.tz).strftime('%Y%m%d_%H%M%S')
archive_name = f"{file_path.stem}_archived_{archive_time}{file_path.suffix}"
archive_path = self.archive_dir / archive_name
# Copy to archive
shutil.copy2(file_path, archive_path)
self.logger.debug(f"Archived to: {archive_path.name}")
def get_statistics(self, file_path: Optional[Path] = None) -> Dict[str, int]:
"""Get statistics about the cumulative file."""
if not file_path:
file_path = self.get_current_file()
if not file_path or not file_path.exists():
return {'total_sections': 0}
content = file_path.read_text(encoding='utf-8')
sections = self.parse_markdown_sections(content)
stats = {
'total_sections': len(sections),
'with_captions': sum(1 for s in sections.values()
if s['metadata'].get('has_caption')),
'with_transcripts': sum(1 for s in sections.values()
if s['metadata'].get('has_transcript')),
'total_views': sum(s['metadata'].get('views', 0)
for s in sections.values()),
'file_size_kb': file_path.stat().st_size // 1024
}
return stats

@@ -0,0 +1,410 @@
#!/usr/bin/env python3
"""
MailChimp API scraper for fetching campaign data and metrics
Fetches only campaigns from "Bi-Weekly Newsletter" folder
Cleans headers and footers from content
"""
import os
import time
import requests
import re
from typing import Any, Dict, List, Optional
from datetime import datetime
from src.base_scraper import BaseScraper, ScraperConfig
import logging
class MailChimpAPIScraper(BaseScraper):
"""MailChimp API scraper for campaigns and metrics."""
def __init__(self, config: ScraperConfig):
super().__init__(config)
self.api_key = os.getenv('MAILCHIMP_API_KEY')
self.server_prefix = os.getenv('MAILCHIMP_SERVER_PREFIX', 'us10')
if not self.api_key:
raise ValueError("MAILCHIMP_API_KEY not found in environment variables")
self.base_url = f"https://{self.server_prefix}.api.mailchimp.com/3.0"
self.headers = {
'Authorization': f'Bearer {self.api_key}',
'Content-Type': 'application/json'
}
# Cache folder ID for "Bi-Weekly Newsletter"
self.target_folder_id = None
self.target_folder_name = "Bi-Weekly Newsletter"
self.logger.info(f"Initialized MailChimp API scraper for server: {self.server_prefix}")
def _clean_content(self, content: str) -> str:
"""Clean unwanted headers and footers from MailChimp content."""
if not content:
return content
# Patterns to remove
patterns_to_remove = [
# Header patterns
r'VIEW THIS EMAIL IN BROWSER[^\n]*\n?',
r'\(\*\|ARCHIVE\|\*\)[^\n]*\n?',
r'https://hvacknowitall\.com/?\n?',
# Footer patterns
r'Newsletter produced by Teal Maker[^\n]*\n?',
r'https://tealmaker\.com[^\n]*\n?',
r'https://open\.spotify\.com[^\n]*\n?',
r'https://www\.instagram\.com[^\n]*\n?',
r'https://www\.youtube\.com[^\n]*\n?',
r'https://www\.facebook\.com[^\n]*\n?',
r'https://x\.com[^\n]*\n?',
r'https://www\.linkedin\.com[^\n]*\n?',
r'Copyright \(C\)[^\n]*\n?',
r'\*\|CURRENT_YEAR\|\*[^\n]*\n?',
r'\*\|LIST:COMPANY\|\*[^\n]*\n?',
r'\*\|IFNOT:ARCHIVE_PAGE\|\*[^\n]*\*\|END:IF\|\*\n?',
r'\*\|LIST:DESCRIPTION\|\*[^\n]*\n?',
r'\*\|LIST_ADDRESS\|\*[^\n]*\n?',
r'Our mailing address is:[^\n]*\n?',
r'Want to change how you receive these emails\?[^\n]*\n?',
r'You can update your preferences[^\n]*\n?',
r'\(\*\|UPDATE_PROFILE\|\*\)[^\n]*\n?',
r'or unsubscribe[^\n]*\n?',
r'\(\*\|UNSUB\|\*\)[^\n]*\n?',
# Clean up multiple newlines
r'\n{3,}',
]
cleaned = content
for pattern in patterns_to_remove:
cleaned = re.sub(pattern, '', cleaned, flags=re.MULTILINE | re.IGNORECASE)
# Clean up multiple newlines (replace with double newline)
cleaned = re.sub(r'\n{3,}', '\n\n', cleaned)
# Trim whitespace
cleaned = cleaned.strip()
return cleaned
def _test_connection(self) -> bool:
"""Test API connection."""
try:
response = requests.get(f"{self.base_url}/ping", headers=self.headers)
if response.status_code == 200:
self.logger.info("MailChimp API connection successful")
return True
else:
self.logger.error(f"MailChimp API connection failed: {response.status_code}")
return False
except Exception as e:
self.logger.error(f"MailChimp API connection error: {e}")
return False
def _get_folder_id(self) -> Optional[str]:
"""Get the folder ID for 'Bi-Weekly Newsletter'."""
if self.target_folder_id:
return self.target_folder_id
try:
response = requests.get(
f"{self.base_url}/campaign-folders",
headers=self.headers,
params={'count': 100}
)
if response.status_code == 200:
folders_data = response.json()
for folder in folders_data.get('folders', []):
if folder['name'] == self.target_folder_name:
self.target_folder_id = folder['id']
self.logger.info(f"Found '{self.target_folder_name}' folder: {self.target_folder_id}")
return self.target_folder_id
self.logger.warning(f"'{self.target_folder_name}' folder not found")
else:
self.logger.error(f"Failed to fetch folders: {response.status_code}")
except Exception as e:
self.logger.error(f"Error fetching folders: {e}")
return None
def _fetch_campaign_content(self, campaign_id: str) -> Optional[Dict[str, Any]]:
"""Fetch campaign content."""
try:
response = requests.get(
f"{self.base_url}/campaigns/{campaign_id}/content",
headers=self.headers
)
if response.status_code == 200:
return response.json()
else:
self.logger.warning(f"Failed to fetch content for campaign {campaign_id}: {response.status_code}")
return None
except Exception as e:
self.logger.error(f"Error fetching campaign content: {e}")
return None
def _fetch_campaign_report(self, campaign_id: str) -> Optional[Dict[str, Any]]:
"""Fetch campaign report with metrics."""
try:
response = requests.get(
f"{self.base_url}/reports/{campaign_id}",
headers=self.headers
)
if response.status_code == 200:
return response.json()
else:
self.logger.warning(f"Failed to fetch report for campaign {campaign_id}: {response.status_code}")
return None
except Exception as e:
self.logger.error(f"Error fetching campaign report: {e}")
return None
def fetch_content(self, max_items: int = None) -> List[Dict[str, Any]]:
"""Fetch campaigns from MailChimp API."""
# Test connection first
if not self._test_connection():
self.logger.error("Failed to connect to MailChimp API")
return []
# Get folder ID
folder_id = self._get_folder_id()
# Prepare parameters
params = {
'count': max_items or 1000, # Default to 1000 if not specified
'status': 'sent', # Only sent campaigns
'sort_field': 'send_time',
'sort_dir': 'DESC'
}
if folder_id:
params['folder_id'] = folder_id
self.logger.info(f"Fetching campaigns from '{self.target_folder_name}' folder")
else:
self.logger.info("Fetching all sent campaigns")
try:
response = requests.get(
f"{self.base_url}/campaigns",
headers=self.headers,
params=params
)
if response.status_code != 200:
self.logger.error(f"Failed to fetch campaigns: {response.status_code}")
return []
campaigns_data = response.json()
campaigns = campaigns_data.get('campaigns', [])
self.logger.info(f"Found {len(campaigns)} campaigns")
# Enrich each campaign with content and metrics
enriched_campaigns = []
for campaign in campaigns:
campaign_id = campaign['id']
# Add basic campaign info
enriched_campaign = {
'id': campaign_id,
'title': campaign.get('settings', {}).get('subject_line', 'Untitled'),
'preview_text': campaign.get('settings', {}).get('preview_text', ''),
'from_name': campaign.get('settings', {}).get('from_name', ''),
'reply_to': campaign.get('settings', {}).get('reply_to', ''),
'send_time': campaign.get('send_time'),
'status': campaign.get('status'),
'type': campaign.get('type', 'regular'),
'archive_url': campaign.get('archive_url', ''),
'long_archive_url': campaign.get('long_archive_url', ''),
'folder_id': campaign.get('settings', {}).get('folder_id')
}
# Fetch content
content_data = self._fetch_campaign_content(campaign_id)
if content_data:
plain_text = content_data.get('plain_text', '')
# Clean the content
enriched_campaign['plain_text'] = self._clean_content(plain_text)
# If no plain text, convert HTML
if not enriched_campaign['plain_text'] and content_data.get('html'):
converted = self.convert_to_markdown(
content_data['html'],
content_type="text/html"
)
enriched_campaign['plain_text'] = self._clean_content(converted)
# Fetch metrics
report_data = self._fetch_campaign_report(campaign_id)
if report_data:
enriched_campaign['metrics'] = {
'emails_sent': report_data.get('emails_sent', 0),
'unique_opens': report_data.get('opens', {}).get('unique_opens', 0),
'open_rate': report_data.get('opens', {}).get('open_rate', 0),
'total_opens': report_data.get('opens', {}).get('opens_total', 0),
'unique_clicks': report_data.get('clicks', {}).get('unique_clicks', 0),
'click_rate': report_data.get('clicks', {}).get('click_rate', 0),
'total_clicks': report_data.get('clicks', {}).get('clicks_total', 0),
'unsubscribed': report_data.get('unsubscribed', 0),
'bounces': {
'hard': report_data.get('bounces', {}).get('hard_bounces', 0),
'soft': report_data.get('bounces', {}).get('soft_bounces', 0),
'syntax_errors': report_data.get('bounces', {}).get('syntax_errors', 0)
},
'abuse_reports': report_data.get('abuse_reports', 0),
'forwards': {
'count': report_data.get('forwards', {}).get('forwards_count', 0),
'opens': report_data.get('forwards', {}).get('forwards_opens', 0)
}
}
else:
enriched_campaign['metrics'] = {}
enriched_campaigns.append(enriched_campaign)
# Add small delay to avoid rate limiting
time.sleep(0.5)
return enriched_campaigns
except Exception as e:
self.logger.error(f"Error fetching campaigns: {e}")
return []
def format_markdown(self, campaigns: List[Dict[str, Any]]) -> str:
"""Format campaigns as markdown with enhanced metrics."""
markdown_sections = []
for campaign in campaigns:
section = []
# ID
section.append(f"# ID: {campaign.get('id', 'N/A')}")
section.append("")
# Title
section.append(f"## Title: {campaign.get('title', 'Untitled')}")
section.append("")
# Type
section.append(f"## Type: email_campaign")
section.append("")
# Send Time
send_time = campaign.get('send_time', '')
if send_time:
section.append(f"## Send Date: {send_time}")
section.append("")
# From and Reply-to
from_name = campaign.get('from_name', '')
reply_to = campaign.get('reply_to', '')
if from_name:
section.append(f"## From: {from_name}")
if reply_to:
section.append(f"## Reply To: {reply_to}")
section.append("")
# Archive URL
archive_url = campaign.get('long_archive_url') or campaign.get('archive_url', '')
if archive_url:
section.append(f"## Archive URL: {archive_url}")
section.append("")
# Metrics
metrics = campaign.get('metrics', {})
if metrics:
section.append("## Metrics:")
section.append(f"### Emails Sent: {metrics.get('emails_sent', 0):,}")
section.append(f"### Opens: {metrics.get('unique_opens', 0):,} unique ({metrics.get('open_rate', 0)*100:.1f}%)")
section.append(f"### Clicks: {metrics.get('unique_clicks', 0):,} unique ({metrics.get('click_rate', 0)*100:.1f}%)")
section.append(f"### Unsubscribes: {metrics.get('unsubscribed', 0)}")
bounces = metrics.get('bounces', {})
total_bounces = bounces.get('hard', 0) + bounces.get('soft', 0)
if total_bounces > 0:
section.append(f"### Bounces: {total_bounces} (Hard: {bounces.get('hard', 0)}, Soft: {bounces.get('soft', 0)})")
if metrics.get('abuse_reports', 0) > 0:
section.append(f"### Abuse Reports: {metrics.get('abuse_reports', 0)}")
forwards = metrics.get('forwards', {})
if forwards.get('count', 0) > 0:
section.append(f"### Forwards: {forwards.get('count', 0)}")
section.append("")
# Preview Text
preview_text = campaign.get('preview_text', '')
if preview_text:
section.append(f"## Preview Text:")
section.append(preview_text)
section.append("")
# Content (cleaned)
content = campaign.get('plain_text', '')
if content:
section.append("## Content:")
section.append(content)
section.append("")
# Separator
section.append("-" * 50)
section.append("")
markdown_sections.append('\n'.join(section))
return '\n'.join(markdown_sections)
def get_incremental_items(self, items: List[Dict[str, Any]], state: Dict[str, Any]) -> List[Dict[str, Any]]:
"""Get only new campaigns since last sync."""
if not state:
return items
last_campaign_id = state.get('last_campaign_id')
last_send_time = state.get('last_send_time')
if not last_campaign_id:
return items
# Filter for campaigns newer than the last synced
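        # Campaigns are fetched newest-first (sorted by send_time DESC), so we can
        # stop as soon as we reach the last synced campaign ID.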
new_items = []
for item in items:
if item.get('id') == last_campaign_id:
break # Found the last synced campaign
# Also check by send time as backup
if last_send_time and item.get('send_time'):
if item['send_time'] <= last_send_time:
continue
new_items.append(item)
return new_items
def update_state(self, state: Dict[str, Any], items: List[Dict[str, Any]]) -> Dict[str, Any]:
"""Update state with latest campaign information."""
if not items:
return state
# Get the first item (most recent)
latest_item = items[0]
state['last_campaign_id'] = latest_item.get('id')
state['last_send_time'] = latest_item.get('send_time')
state['last_campaign_title'] = latest_item.get('title')
state['last_sync'] = datetime.now(self.tz).isoformat()
state['campaign_count'] = len(items)
return state


@ -0,0 +1,513 @@
#!/usr/bin/env python3
"""
YouTube Data API v3 scraper with quota management and captions support
Designed to stay within 10,000 units/day limit while fetching captions
Quota costs:
- channels.list: 1 unit
- playlistItems.list: 1 unit per page (50 items max)
- videos.list: 1 unit per page (50 videos max)
- search.list: 100 units (avoid if possible!)
- captions.list: 50 units per video
- captions.download: 200 units per caption
Strategy for 444 videos with captions:
- Get channel info: 1 unit
- Get all playlist items (444/50 = 9 pages): 9 units
- Get video details in batches of 50: 9 units
- Get captions list for each video: 444 * 50 = 22,200 units (too much!)
- Alternative: Use captions.list selectively or in batches
"""
import os
import time
from typing import Any, Dict, List, Optional, Tuple
from datetime import datetime
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from src.base_scraper import BaseScraper, ScraperConfig
import logging
import re
class YouTubeAPIScraper(BaseScraper):
"""YouTube API scraper with quota management and captions."""
# Quota costs for different operations
QUOTA_COSTS = {
'channels_list': 1,
'playlist_items': 1,
'videos_list': 1,
'search': 100,
'captions_list': 50,
'captions_download': 200,
}
def __init__(self, config: ScraperConfig):
super().__init__(config)
self.api_key = os.getenv('YOUTUBE_API_KEY')
if not self.api_key:
raise ValueError("YOUTUBE_API_KEY not found in environment variables")
# Build YouTube API client
self.youtube = build('youtube', 'v3', developerKey=self.api_key)
# Channel configuration
self.channel_url = os.getenv('YOUTUBE_CHANNEL_URL', 'https://www.youtube.com/@HVACKnowItAll')
self.channel_id = None
self.uploads_playlist_id = None
# Quota tracking
self.quota_used = 0
self.daily_quota_limit = 10000
# Caption fetching strategy
self.max_captions_per_run = 50 # Limit caption fetches to top videos
# 50 videos * 50 units = 2,500 units for caption listing
# Plus potential download costs
self.logger.info(f"Initialized YouTube API scraper for channel: {self.channel_url}")
def _track_quota(self, operation: str, count: int = 1) -> bool:
"""Track quota usage and return True if within limits."""
cost = self.QUOTA_COSTS.get(operation, 0) * count
if self.quota_used + cost > self.daily_quota_limit:
self.logger.warning(f"Quota limit would be exceeded. Current: {self.quota_used}, Cost: {cost}")
return False
self.quota_used += cost
self.logger.debug(f"Quota used: {self.quota_used}/{self.daily_quota_limit} (+{cost} for {operation})")
return True
def _get_channel_info(self) -> bool:
"""Get channel ID and uploads playlist ID."""
if self.channel_id and self.uploads_playlist_id:
return True
try:
# Extract channel handle
channel_handle = self.channel_url.split('@')[-1]
# Try to get channel by handle first (costs 1 unit)
if not self._track_quota('channels_list'):
return False
response = self.youtube.channels().list(
part='snippet,statistics,contentDetails',
forHandle=channel_handle
).execute()
if not response.get('items'):
# Fallback to search by name (costs 100 units - avoid!)
self.logger.warning("Channel not found by handle, trying search...")
if not self._track_quota('search'):
return False
search_response = self.youtube.search().list(
part='snippet',
q="HVAC Know It All",
type='channel',
maxResults=1
).execute()
if not search_response.get('items'):
self.logger.error("Channel not found")
return False
self.channel_id = search_response['items'][0]['snippet']['channelId']
# Get full channel details
if not self._track_quota('channels_list'):
return False
response = self.youtube.channels().list(
part='snippet,statistics,contentDetails',
id=self.channel_id
).execute()
if response.get('items'):
channel_data = response['items'][0]
self.channel_id = channel_data['id']
self.uploads_playlist_id = channel_data['contentDetails']['relatedPlaylists']['uploads']
# Log channel stats
stats = channel_data['statistics']
self.logger.info(f"Channel: {channel_data['snippet']['title']}")
self.logger.info(f"Subscribers: {int(stats.get('subscriberCount', 0)):,}")
self.logger.info(f"Total videos: {int(stats.get('videoCount', 0)):,}")
return True
except HttpError as e:
self.logger.error(f"YouTube API error: {e}")
except Exception as e:
self.logger.error(f"Error getting channel info: {e}")
return False
    def _fetch_all_video_ids(self, max_videos: Optional[int] = None) -> List[str]:
"""Fetch all video IDs from the channel efficiently."""
if not self._get_channel_info():
return []
video_ids = []
next_page_token = None
videos_fetched = 0
while True:
# Check quota before each request
if not self._track_quota('playlist_items'):
self.logger.warning("Quota limit reached while fetching video IDs")
break
try:
# Fetch playlist items (50 per page, costs 1 unit)
request = self.youtube.playlistItems().list(
part='contentDetails',
playlistId=self.uploads_playlist_id,
maxResults=50,
pageToken=next_page_token
)
response = request.execute()
for item in response.get('items', []):
video_ids.append(item['contentDetails']['videoId'])
videos_fetched += 1
if max_videos and videos_fetched >= max_videos:
return video_ids[:max_videos]
# Check for next page
next_page_token = response.get('nextPageToken')
if not next_page_token:
break
except HttpError as e:
self.logger.error(f"Error fetching video IDs: {e}")
break
self.logger.info(f"Fetched {len(video_ids)} video IDs")
return video_ids
def _fetch_video_details_batch(self, video_ids: List[str]) -> List[Dict[str, Any]]:
"""Fetch details for a batch of videos (max 50 per request)."""
if not video_ids:
return []
# YouTube API allows max 50 videos per request
batch_size = 50
all_videos = []
for i in range(0, len(video_ids), batch_size):
batch = video_ids[i:i + batch_size]
# Check quota (1 unit per request)
if not self._track_quota('videos_list'):
self.logger.warning("Quota limit reached while fetching video details")
break
try:
response = self.youtube.videos().list(
part='snippet,statistics,contentDetails',
id=','.join(batch)
).execute()
for video in response.get('items', []):
video_data = {
'id': video['id'],
'title': video['snippet']['title'],
'description': video['snippet']['description'], # Full description!
'published_at': video['snippet']['publishedAt'],
'channel_id': video['snippet']['channelId'],
'channel_title': video['snippet']['channelTitle'],
'tags': video['snippet'].get('tags', []),
'duration': video['contentDetails']['duration'],
'definition': video['contentDetails']['definition'],
'caption': video['contentDetails'].get('caption', 'false'), # Has captions?
'thumbnail': video['snippet']['thumbnails'].get('maxres', {}).get('url') or
video['snippet']['thumbnails'].get('high', {}).get('url', ''),
# Statistics
'view_count': int(video['statistics'].get('viewCount', 0)),
'like_count': int(video['statistics'].get('likeCount', 0)),
'comment_count': int(video['statistics'].get('commentCount', 0)),
# Calculate engagement metrics
'engagement_rate': 0,
'like_ratio': 0
}
# Calculate engagement metrics
if video_data['view_count'] > 0:
video_data['engagement_rate'] = (
(video_data['like_count'] + video_data['comment_count']) /
video_data['view_count']
) * 100
video_data['like_ratio'] = (video_data['like_count'] / video_data['view_count']) * 100
all_videos.append(video_data)
# Small delay to be respectful
time.sleep(0.1)
except HttpError as e:
self.logger.error(f"Error fetching video details: {e}")
return all_videos
def _fetch_caption_text(self, video_id: str) -> Optional[str]:
"""Fetch caption text using YouTube Data API (costs 50 units!)."""
try:
# Check quota (50 units for list)
if not self._track_quota('captions_list'):
self.logger.debug(f"Quota limit - skipping captions for {video_id}")
return None
# List available captions
captions_response = self.youtube.captions().list(
part='snippet',
videoId=video_id
).execute()
captions = captions_response.get('items', [])
if not captions:
self.logger.debug(f"No captions available for video {video_id}")
return None
# Find English caption (or auto-generated)
english_caption = None
for caption in captions:
if caption['snippet']['language'] == 'en':
english_caption = caption
break
if not english_caption:
# Try auto-generated
for caption in captions:
if 'auto' in caption['snippet']['name'].lower():
english_caption = caption
break
if english_caption:
caption_id = english_caption['id']
# Download caption would cost 200 more units!
# For now, just note that captions are available
self.logger.debug(f"Captions available for video {video_id} (id: {caption_id})")
return f"[Captions available - {english_caption['snippet']['name']}]"
return None
except HttpError as e:
if 'captionsDisabled' in str(e):
self.logger.debug(f"Captions disabled for video {video_id}")
else:
self.logger.debug(f"Error fetching captions for {video_id}: {e}")
except Exception as e:
self.logger.debug(f"Error fetching captions for {video_id}: {e}")
return None
    def fetch_content(self, max_posts: Optional[int] = None, fetch_captions: bool = True) -> List[Dict[str, Any]]:
"""Fetch video content with intelligent quota management."""
self.logger.info(f"Starting YouTube API fetch (quota limit: {self.daily_quota_limit})")
# Step 1: Get all video IDs (very cheap - ~9 units for 444 videos)
video_ids = self._fetch_all_video_ids(max_posts)
if not video_ids:
self.logger.warning("No video IDs fetched")
return []
# Step 2: Fetch video details in batches (also cheap - ~9 units for 444 videos)
videos = self._fetch_video_details_batch(video_ids)
self.logger.info(f"Fetched details for {len(videos)} videos")
# Step 3: Fetch captions for top videos (expensive - 50 units per video)
if fetch_captions:
# Prioritize videos by views for caption fetching
videos_sorted = sorted(videos, key=lambda x: x['view_count'], reverse=True)
# Limit caption fetching to top videos
max_captions = min(self.max_captions_per_run, len(videos_sorted))
# Check remaining quota
captions_quota_needed = max_captions * 50
if self.quota_used + captions_quota_needed > self.daily_quota_limit:
max_captions = (self.daily_quota_limit - self.quota_used) // 50
self.logger.warning(f"Limiting captions to {max_captions} videos due to quota")
if max_captions > 0:
self.logger.info(f"Fetching captions for top {max_captions} videos by views")
for i, video in enumerate(videos_sorted[:max_captions]):
caption_text = self._fetch_caption_text(video['id'])
if caption_text:
video['caption_text'] = caption_text
self.logger.debug(f"Got caption info for video {i+1}/{max_captions}: {video['title']}")
# Small delay to be respectful
time.sleep(0.5)
# Log final quota usage
self.logger.info(f"Total quota used: {self.quota_used}/{self.daily_quota_limit} units")
self.logger.info(f"Remaining quota: {self.daily_quota_limit - self.quota_used} units")
return videos
def _get_video_type(self, video: Dict[str, Any]) -> str:
"""Determine video type based on duration."""
duration = video.get('duration', 'PT0S')
# Parse ISO 8601 duration
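        # e.g. "PT45S" -> 45s -> 'short'; "PT12M30S" -> 750s -> 'video'; "PT1H5M" -> 3900s -> 'video'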
match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', duration)
if match:
hours = int(match.group(1) or 0)
minutes = int(match.group(2) or 0)
seconds = int(match.group(3) or 0)
total_seconds = hours * 3600 + minutes * 60 + seconds
            # Under 60 seconds counts as a Short; everything else is a regular video
            if total_seconds < 60:
                return 'short'
            return 'video'
        return 'video'
def format_markdown(self, videos: List[Dict[str, Any]]) -> str:
"""Format videos as markdown with enhanced data."""
markdown_sections = []
for video in videos:
section = []
# ID
section.append(f"# ID: {video.get('id', 'N/A')}")
section.append("")
# Title
section.append(f"## Title: {video.get('title', 'Untitled')}")
section.append("")
# Type
video_type = self._get_video_type(video)
section.append(f"## Type: {video_type}")
section.append("")
# Author
section.append(f"## Author: {video.get('channel_title', 'Unknown')}")
section.append("")
# Link
section.append(f"## Link: https://www.youtube.com/watch?v={video.get('id')}")
section.append("")
# Upload Date
section.append(f"## Upload Date: {video.get('published_at', '')}")
section.append("")
# Duration
section.append(f"## Duration: {video.get('duration', 'Unknown')}")
section.append("")
# Views
section.append(f"## Views: {video.get('view_count', 0):,}")
section.append("")
# Likes
section.append(f"## Likes: {video.get('like_count', 0):,}")
section.append("")
# Comments
section.append(f"## Comments: {video.get('comment_count', 0):,}")
section.append("")
# Engagement Metrics
section.append(f"## Engagement Rate: {video.get('engagement_rate', 0):.2f}%")
section.append(f"## Like Ratio: {video.get('like_ratio', 0):.2f}%")
section.append("")
# Tags
tags = video.get('tags', [])
if tags:
section.append(f"## Tags: {', '.join(tags[:10])}") # First 10 tags
section.append("")
# Thumbnail
thumbnail = video.get('thumbnail', '')
if thumbnail:
section.append(f"## Thumbnail: {thumbnail}")
section.append("")
# Full Description (untruncated!)
section.append("## Description:")
description = video.get('description', '')
if description:
section.append(description)
section.append("")
# Caption/Transcript
caption_text = video.get('caption_text')
if caption_text:
section.append("## Caption Status:")
section.append(caption_text)
section.append("")
# Separator
section.append("-" * 50)
section.append("")
markdown_sections.append('\n'.join(section))
return '\n'.join(markdown_sections)
def get_incremental_items(self, items: List[Dict[str, Any]], state: Dict[str, Any]) -> List[Dict[str, Any]]:
"""Get only new videos since last sync."""
if not state:
return items
last_video_id = state.get('last_video_id')
last_published = state.get('last_published')
if not last_video_id:
return items
# Filter for videos newer than the last synced
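        # Assumes the uploads playlist returns videos newest-first, so we can stop at the last synced ID.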
new_items = []
for item in items:
if item.get('id') == last_video_id:
break # Found the last synced video
# Also check by publish date as backup
if last_published and item.get('published_at'):
if item['published_at'] <= last_published:
continue
new_items.append(item)
return new_items
def update_state(self, state: Dict[str, Any], items: List[Dict[str, Any]]) -> Dict[str, Any]:
"""Update state with latest video information."""
if not items:
return state
# Get the first item (most recent)
latest_item = items[0]
state['last_video_id'] = latest_item.get('id')
state['last_published'] = latest_item.get('published_at')
state['last_video_title'] = latest_item.get('title')
state['last_sync'] = datetime.now(self.tz).isoformat()
state['video_count'] = len(items)
state['quota_used'] = self.quota_used
return state

test_cumulative_mode.py Normal file

@ -0,0 +1,236 @@
#!/usr/bin/env python3
"""
Test the cumulative markdown functionality
Demonstrates how backlog + incremental updates work together
"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent))
from src.cumulative_markdown_manager import CumulativeMarkdownManager
from src.base_scraper import ScraperConfig
import logging
from datetime import datetime
import pytz
# Set up logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('cumulative_test')
def create_mock_items(start_id: int, count: int, prefix: str = ""):
"""Create mock content items for testing."""
items = []
for i in range(count):
item_id = f"video_{start_id + i}"
items.append({
'id': item_id,
'title': f"{prefix}Video Title {start_id + i}",
'views': 1000 * (start_id + i),
'likes': 100 * (start_id + i),
'description': f"Description for video {start_id + i}",
'publish_date': '2024-01-15'
})
return items
def format_mock_markdown(items):
"""Format mock items as markdown."""
sections = []
for item in items:
section = [
f"# ID: {item['id']}",
"",
f"## Title: {item['title']}",
"",
f"## Views: {item['views']:,}",
"",
f"## Likes: {item['likes']:,}",
"",
f"## Description:",
item['description'],
"",
f"## Publish Date: {item['publish_date']}",
"",
"-" * 50
]
sections.append('\n'.join(section))
return '\n\n'.join(sections)
def test_cumulative_workflow():
"""Test the complete cumulative workflow."""
logger.info("=" * 60)
logger.info("TESTING CUMULATIVE MARKDOWN WORKFLOW")
logger.info("=" * 60)
# Setup test config
config = ScraperConfig(
source_name='TestSource',
brand_name='testbrand',
data_dir=Path('test_data'),
logs_dir=Path('test_logs'),
timezone='America/Halifax'
)
# Clean up any existing test files
test_pattern = "testbrand_TestSource_*.md"
for old_file in Path('test_data/markdown_current').glob(test_pattern):
old_file.unlink()
logger.info(f"Cleaned up old test file: {old_file.name}")
# Initialize manager
manager = CumulativeMarkdownManager(config, logger)
# STEP 1: Initial backlog capture
logger.info("\n" + "=" * 40)
logger.info("STEP 1: BACKLOG CAPTURE (Day 1)")
logger.info("=" * 40)
backlog_items = create_mock_items(1, 5, "Backlog ")
logger.info(f"Created {len(backlog_items)} backlog items")
file1 = manager.save_cumulative(backlog_items, format_mock_markdown)
logger.info(f"Saved backlog to: {file1.name}")
stats = manager.get_statistics(file1)
logger.info(f"Stats after backlog: {stats}")
# STEP 2: First incremental update (new items)
logger.info("\n" + "=" * 40)
logger.info("STEP 2: INCREMENTAL UPDATE - New Items (Day 2)")
logger.info("=" * 40)
new_items = create_mock_items(6, 2, "New ")
logger.info(f"Created {len(new_items)} new items")
file2 = manager.save_cumulative(new_items, format_mock_markdown)
logger.info(f"Saved incremental to: {file2.name}")
stats = manager.get_statistics(file2)
logger.info(f"Stats after first incremental: {stats}")
# Verify content
content = file2.read_text(encoding='utf-8')
id_count = content.count('# ID:')
logger.info(f"Total sections in file: {id_count}")
# STEP 3: Second incremental with updates
logger.info("\n" + "=" * 40)
logger.info("STEP 3: INCREMENTAL UPDATE - With Updates (Day 3)")
logger.info("=" * 40)
# Create items with updates (higher view counts) and new items
updated_items = [
{
'id': 'video_1', # Update existing
'title': 'Backlog Video Title 1',
'views': 5000, # Increased from 1000
'likes': 500, # Increased from 100
'description': 'Updated description with more details and captions',
'publish_date': '2024-01-15',
'caption': 'This video now has captions!' # New field
},
{
'id': 'video_8', # New item
'title': 'Brand New Video 8',
'views': 8000,
'likes': 800,
'description': 'Newest video just published',
'publish_date': '2024-01-18'
}
]
# Format with caption support
def format_with_captions(items):
sections = []
for item in items:
section = [
f"# ID: {item['id']}",
"",
f"## Title: {item['title']}",
"",
f"## Views: {item['views']:,}",
"",
f"## Likes: {item['likes']:,}",
"",
f"## Description:",
item['description'],
""
]
if 'caption' in item:
section.extend([
"## Caption Status:",
item['caption'],
""
])
section.extend([
f"## Publish Date: {item['publish_date']}",
"",
"-" * 50
])
sections.append('\n'.join(section))
return '\n\n'.join(sections)
logger.info(f"Created 1 update + 1 new item")
file3 = manager.save_cumulative(updated_items, format_with_captions)
logger.info(f"Saved second incremental to: {file3.name}")
stats = manager.get_statistics(file3)
logger.info(f"Stats after second incremental: {stats}")
# Verify final content
final_content = file3.read_text(encoding='utf-8')
final_id_count = final_content.count('# ID:')
caption_count = final_content.count('## Caption Status:')
logger.info(f"Final total sections: {final_id_count}")
logger.info(f"Sections with captions: {caption_count}")
# Check if video_1 was updated
if 'This video now has captions!' in final_content:
logger.info("✅ Successfully updated video_1 with captions")
else:
logger.error("❌ Failed to update video_1")
# Check if video_8 was added
if 'video_8' in final_content:
logger.info("✅ Successfully added new video_8")
else:
logger.error("❌ Failed to add video_8")
# List archive files
logger.info("\n" + "=" * 40)
logger.info("ARCHIVED FILES:")
logger.info("=" * 40)
archive_dir = Path('test_data/markdown_archives/TestSource')
if archive_dir.exists():
archives = list(archive_dir.glob("*.md"))
for archive in sorted(archives):
logger.info(f" - {archive.name}")
logger.info("\n" + "=" * 60)
logger.info("TEST COMPLETE!")
logger.info("=" * 60)
logger.info("Summary:")
logger.info(f" - Started with 5 backlog items")
logger.info(f" - Added 2 new items in first incremental")
logger.info(f" - Updated 1 item + added 1 item in second incremental")
logger.info(f" - Final file has {final_id_count} total items")
logger.info(f" - {caption_count} items have captions")
logger.info(f" - {len(archives) if archive_dir.exists() else 0} versions archived")
if __name__ == "__main__":
test_cumulative_workflow()