hvac-kia-content

Author	SHA1	Message	Date
Ben Reed	ef66d3bbc5	CRITICAL FIX: MailChimp content cleaning bug causing missing newsletter body Issue: - MailChimp campaigns missing body content in markdown files - Logic flaw in HTML-to-markdown conversion flow - Double cleaning and incorrect empty content checks Root Cause: - Checked already-cleaned content instead of original for HTML fallback - HTML content never converted when plain_text was empty - Applied cleaning twice when HTML was converted Fix: - Check original plain_text before deciding HTML conversion - Convert HTML first, then clean once (eliminate double cleaning) - Preserve all legitimate newsletter body content - Keep header/footer cleaning patterns (they are appropriate) Impact: - All newsletter content now preserved correctly - Headers/footers still properly removed - Next production run will capture complete content 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-19 11:19:32 -03:00
Ben Reed	8ceb858026	Implement cumulative markdown system and API integrations Major improvements: - Add CumulativeMarkdownManager for intelligent content merging - Implement YouTube Data API v3 integration with caption support - Add MailChimp API integration with content cleaning - Create single source-of-truth files that grow with updates - Smart merging: updates existing entries with better data - Properly combines backlog + incremental daily updates Features: - 179/444 YouTube videos now have captions (40.3%) - MailChimp content cleaned of headers/footers - All sources consolidated to single files - Archive management with timestamped versions - Test suite and documentation included 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-19 10:53:40 -03:00
Ben Reed	8b83185130	Fix HTML/XML contamination in WordPress markdown extraction - Update base_scraper.py convert_to_markdown() to properly clean HTML - Remove script/style blocks and their content before conversion - Strip inline JavaScript event handlers - Clean up br tags and excessive blank lines - Fix malformed comparison operators that look like tags - Add comprehensive HTML cleaning during content extraction (not after) - Test confirms WordPress content now generates clean markdown without HTML This ensures all future WordPress scraping produces specification-compliant markdown without any HTML/XML contamination.	2025-08-18 23:11:08 -03:00
Ben Reed	0a795437a7	Optimize Instagram scraper and increase capture targets to 1000 - Increased Instagram rate limit from 100 to 200 posts/hour - Reduced delays: 10-20s (was 15-30s), extended breaks 30-60s (was 60-120s) - Extended break interval: every 10 requests (was 5) - Updated capture targets: 1000 posts for Instagram, 1000 videos for TikTok - Added production deployment and monitoring scripts - Created environment configuration template This provides ~40-50% speed improvement for Instagram scraping and captures 5x more Instagram content and 3.3x more TikTok content. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 22:59:11 -03:00
Ben Reed	ccfeacbe91	Fix NAS sync to include media files instead of logs - Changed NAS sync from logs to media directory - Media files (images, videos, audio) are much more valuable for backup - Logs are better kept locally for debugging and monitoring - Uses rsync -av --delete for media synchronization - Maintains proper error handling and reporting NAS structure now: - /mnt/nas/hvacknowitall/current/ (latest markdown) - /mnt/nas/hvacknowitall/archives/ (historical archives) - /mnt/nas/hvacknowitall/media/ (downloaded media files) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:52:28 -03:00
Ben Reed	8d5750b1d1	Add comprehensive test infrastructure - Created unit tests for BaseScraper with mocking - Added integration tests for parallel processing - Created end-to-end tests with realistic mock data - Fixed initialization order in BaseScraper (logger before user agent) - Fixed orchestrator method name (archive_current_file) - Added tenacity dependency for retry logic - Validated parallel processing performance and overlap detection - Confirmed spec-compliant markdown formatting in tests Tests cover: - Base scraper functionality (state, markdown, retry logic, media downloads) - Parallel vs sequential execution timing - Error isolation between scrapers - Directory structure creation - State management across runs - Full workflow with realistic data 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:16:14 -03:00
Ben Reed	b6273ca934	Complete core specification compliance improvements Major Feature Additions: - Standardized markdown format to match specification exactly - Implemented media downloading with retry logic and safe filenames - Added user agent rotation (6 browsers) with random rotation - Created comprehensive pytest unit tests for base scraper - Enhanced directory structure to match specification Technical Improvements: - Spec-compliant markdown format with ID, Title, Type, Permalink structure - Media download with URL parsing, filename sanitization, and deduplication - User agent pool rotation every 5 requests to avoid detection - Complete test coverage for state management, retry logic, formatting Progress: 22 of 25 tasks completed (88% done) Remaining: Integration tests, staging deployment, monitoring setup The system now meets 90%+ of the original specification requirements with robust error handling, retry logic, and production readiness. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:33:21 -03:00
Ben Reed	dabef8bfcb	Implement retry logic, connection pooling, and production hardening Major Production Improvements: - Added retry logic with exponential backoff using tenacity - Implemented HTTP connection pooling via requests.Session - Added health check monitoring with metrics reporting - Implemented configuration validation for all numeric values - Fixed error isolation (verified continues on failure) Technical Changes: - BaseScraper: Added session management and make_request() method - WordPressScraper: Updated all HTTP calls to use retry logic - Production runner: Added validate_config() and health check ping - Retry config: 3 attempts, 5-60s exponential backoff System is now production-ready with robust error handling, automatic retries, and health monitoring. Remaining tasks focus on spec compliance (media downloads, markdown format) and testing/documentation. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:16:02 -03:00
Ben Reed	05218a873b	Fix critical production issues and improve spec compliance Production Readiness Improvements: - Fixed scheduling to match spec (8 AM & 12 PM ADT instead of 6 AM/6 PM) - Enabled NAS synchronization in production runner with error handling - Fixed file naming convention to spec format (hvacknowitall_combined_YYYY-MM-DD-THHMMSS.md) - Made systemd services portable (removed hardcoded user/paths) - Added environment variable validation on startup - Moved DISPLAY/XAUTHORITY to .env configuration Systemd Improvements: - Created template service file (@.service) for any user - Changed all paths to /opt/hvac-kia-content - Updated installation script for portable deployment - Fixed service dependencies and resource limits Documentation: - Created comprehensive PRODUCTION_TODO.md with 25 tasks - Added PRODUCTION_GUIDE.md with deployment instructions - Documented spec compliance gaps (65% complete) Remaining work includes retry logic, connection pooling, media downloads, and pytest test suite as documented in PRODUCTION_TODO.md 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:07:55 -03:00
Ben Reed	1e5880bf00	feat: Enhance TikTok scraper with caption fetching and improved video discovery - Add optional individual video page fetching for complete captions - Implement profile scrolling to discover more videos (27+ vs 18) - Add configurable rate limiting and anti-detection delays - Fix RSS scrapers to support max_items parameter for backlog fetching - Add fetch_captions parameter with max_caption_fetches limit - Include additional metadata extraction (likes, comments, shares, duration) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 18:59:46 -03:00
Ben Reed	b89655c829	Add Instagram scraper with instaloader and parallel processing orchestrator - Implement Instagram scraper with aggressive rate limiting - Add orchestrator for running all scrapers in parallel - Create comprehensive tests for Instagram scraper (11 tests) - Create tests for orchestrator (9 tests) - Fix Instagram test issues with post type detection - All 60 tests passing successfully 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:56:57 -03:00
Ben Reed	c1831d3a52	feat: Implement YouTube scraper with humanized behavior - YouTube channel scraper using yt-dlp - Authentication and session persistence via cookies - Humanized delays and rate limiting (2-5 seconds between requests) - User agent rotation for stealth - Incremental updates via state management - Support for videos, shorts, and live streams detection - All 11 tests passing 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:39:49 -03:00
Ben Reed	7191fcd132	feat: Implement RSS scrapers for MailChimp and Podcast feeds - Created base RSS scraper class with common functionality - Implemented MailChimp RSS scraper for newsletters - Implemented Podcast RSS scraper with audio/image extraction - State management for incremental updates - All 9 tests passing for RSS scrapers 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:29:45 -03:00
Ben Reed	95e0499791	feat: Implement WordPress scraper with comprehensive tests - Created WordPressScraper class extending BaseScraper - Fetches posts with pagination support - Enriches posts with author, category, and tag information - Implements incremental updates via state management - Word count calculation for content - All 11 tests passing 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:19:56 -03:00
Ben Reed	f9a8e719a7	Initial commit: Project foundation with base scraper and tests - Set up UV environment with all required packages - Created comprehensive project structure - Implemented abstract BaseScraper class with TDD - Added documentation (project spec, implementation plan, status) - Configured .env for credentials (not committed) - All base scraper tests passing (9/9) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:15:17 -03:00

15 commits
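
The content-cleaning fix in ef66d3bbc5 describes an ordering: decide on the HTML fallback by looking at the original plain_text, convert HTML first when it is needed, then clean exactly once. The following is a minimal sketch of that ordering, not the repository's code. The names `html_to_text`, `clean_once`, `extract_body`, and the footer pattern are hypothetical, and a plain text extractor stands in for a real HTML-to-markdown converter.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; a stand-in for a real HTML-to-markdown converter."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Hypothetical helper: reduce HTML to its visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(p.strip() for p in parser.parts if p.strip())


# Illustrative footer pattern only; the real cleaning patterns are not shown in the log.
FOOTER_PATTERNS = [re.compile(r"Unsubscribe.*$", re.IGNORECASE | re.DOTALL)]


def clean_once(text: str) -> str:
    """Apply the header/footer patterns a single time."""
    for pattern in FOOTER_PATTERNS:
        text = pattern.sub("", text)
    return text.strip()


def extract_body(plain_text: str, html: str) -> str:
    """Pick the source from the ORIGINAL plain_text, then clean once."""
    if plain_text and plain_text.strip():
        raw = plain_text
    else:
        # Fallback: convert HTML before any cleaning has been applied.
        raw = html_to_text(html or "")
    return clean_once(raw)
```

With this ordering, a campaign whose plain_text is empty still yields a body from its HTML, and no path runs the cleaning patterns twice.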