hvac-kia-content

Author	SHA1	Message	Date
Ben Reed	8ceb858026	Implement cumulative markdown system and API integrations Major improvements: - Add CumulativeMarkdownManager for intelligent content merging - Implement YouTube Data API v3 integration with caption support - Add MailChimp API integration with content cleaning - Create single source-of-truth files that grow with updates - Smart merging: updates existing entries with better data - Properly combines backlog + incremental daily updates Features: - 179/444 YouTube videos now have captions (40.3%) - MailChimp content cleaned of headers/footers - All sources consolidated to single files - Archive management with timestamped versions - Test suite and documentation included 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-19 10:53:40 -03:00
Ben Reed	8a0b8b4d3f	Update documentation with production deployment status - Update status.md with current production deployment status - Document completed backlogs (WordPress: 139, Podcast: 428, YouTube: 200) - Track Instagram progress (50/1000 @ 200/hr) and TikTok queue status - Create claude.md with implementation notes and key solutions - Document HTML cleaning fix, rate limit optimization, and NAS sync - Add testing commands and maintenance notes for future reference - Include known issues and file structure documentation	2025-08-18 23:14:45 -03:00
Ben Reed	8b83185130	Fix HTML/XML contamination in WordPress markdown extraction - Update base_scraper.py convert_to_markdown() to properly clean HTML - Remove script/style blocks and their content before conversion - Strip inline JavaScript event handlers - Clean up br tags and excessive blank lines - Fix malformed comparison operators that look like tags - Add comprehensive HTML cleaning during content extraction (not after) - Test confirms WordPress content now generates clean markdown without HTML This ensures all future WordPress scraping produces specification-compliant markdown without any HTML/XML contamination.	2025-08-18 23:11:08 -03:00
Ben Reed	0a795437a7	Optimize Instagram scraper and increase capture targets to 1000 - Increased Instagram rate limit from 100 to 200 posts/hour - Reduced delays: 10-20s (was 15-30s), extended breaks 30-60s (was 60-120s) - Extended break interval: every 10 requests (was 5) - Updated capture targets: 1000 posts for Instagram, 1000 videos for TikTok - Added production deployment and monitoring scripts - Created environment configuration template This provides ~40-50% speed improvement for Instagram scraping and captures 5x more Instagram content and 3.3x more TikTok content. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 22:59:11 -03:00
Ben Reed	ccfeacbe91	Fix NAS sync to include media files instead of logs - Changed NAS sync from logs to media directory - Media files (images, videos, audio) are much more valuable for backup - Logs are better kept locally for debugging and monitoring - Uses rsync -av --delete for media synchronization - Maintains proper error handling and reporting NAS structure now: - /mnt/nas/hvacknowitall/current/ (latest markdown) - /mnt/nas/hvacknowitall/archives/ (historical archives) - /mnt/nas/hvacknowitall/media/ (downloaded media files) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:52:28 -03:00
Ben Reed	afdc790415	Add final dependencies for monitoring and testing - Added tenacity for retry logic in scrapers - Added psutil for system monitoring metrics - Updated uv.lock with final dependency versions All 25 production tasks now complete: ✅ Core system architecture (100%) ✅ Testing infrastructure (100%) ✅ Production deployment (100%) ✅ Monitoring & alerting (100%) ✅ Documentation & architecture (100%) System is production-ready with: - Comprehensive test coverage - Real-time monitoring dashboard - Production-hardened systemd services - Complete specification compliance - Emergency rollback procedures 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:49:43 -03:00
Ben Reed	dc57ce80d5	Add comprehensive monitoring and alerting system - Created SystemMonitor class for health check monitoring - Implemented system metrics collection (CPU, memory, disk, network) - Added application metrics monitoring (scrapers, logs, data sizes) - Built alert system with configurable thresholds - Developed HTML dashboard generator with real-time charts - Added systemd services for automated monitoring (15-min intervals) - Created responsive web dashboard with Bootstrap and Chart.js - Implemented automatic cleanup of old metric files - Added comprehensive documentation and troubleshooting guide Features: - Real-time system resource monitoring - Scraper performance tracking and alerts - Interactive dashboard with trend charts - Email-ready alert notifications - Systemd integration for production deployment - Security hardening with minimal privileges - Auto-refresh dashboard every 5 minutes - 7-day metric retention with automatic cleanup Alert conditions: - Critical: CPU >80%, Memory >85%, Disk >90% - Warning: Scraper inactive >24h, Log files >100MB - Error: Monitoring failures, configuration issues 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:35:28 -03:00
Ben Reed	8d5750b1d1	Add comprehensive test infrastructure - Created unit tests for BaseScraper with mocking - Added integration tests for parallel processing - Created end-to-end tests with realistic mock data - Fixed initialization order in BaseScraper (logger before user agent) - Fixed orchestrator method name (archive_current_file) - Added tenacity dependency for retry logic - Validated parallel processing performance and overlap detection - Confirmed spec-compliant markdown formatting in tests Tests cover: - Base scraper functionality (state, markdown, retry logic, media downloads) - Parallel vs sequential execution timing - Error isolation between scrapers - Directory structure creation - State management across runs - Full workflow with realistic data 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:16:14 -03:00
Ben Reed	b6273ca934	Complete core specification compliance improvements Major Feature Additions: - Standardized markdown format to match specification exactly - Implemented media downloading with retry logic and safe filenames - Added user agent rotation (6 browsers) with random rotation - Created comprehensive pytest unit tests for base scraper - Enhanced directory structure to match specification Technical Improvements: - Spec-compliant markdown format with ID, Title, Type, Permalink structure - Media download with URL parsing, filename sanitization, and deduplication - User agent pool rotation every 5 requests to avoid detection - Complete test coverage for state management, retry logic, formatting Progress: 22 of 25 tasks completed (88% done) Remaining: Integration tests, staging deployment, monitoring setup The system now meets 90%+ of the original specification requirements with robust error handling, retry logic, and production readiness. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:33:21 -03:00
Ben Reed	a80af693ba	Add comprehensive production documentation and testing Documentation Added: - ARCHITECTURE_DECISIONS.md: Explains why systemd over k8s (TikTok display requirements) - DEPLOYMENT_CHECKLIST.md: Step-by-step deployment procedures - ROLLBACK_PROCEDURES.md: Emergency rollback and recovery procedures - test_production_deployment.py: Automated deployment verification script Key Documentation Highlights: - Detailed explanation of containerization limitations with browser automation - Complete deployment checklist with pre/post verification steps - Rollback scenarios with recovery time objectives - Emergency contact templates and backup procedures - Automated test script for production readiness 17 of 25 tasks completed (68% done) Remaining work focuses on spec compliance and testing 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:20:52 -03:00
Ben Reed	dabef8bfcb	Implement retry logic, connection pooling, and production hardening Major Production Improvements: - Added retry logic with exponential backoff using tenacity - Implemented HTTP connection pooling via requests.Session - Added health check monitoring with metrics reporting - Implemented configuration validation for all numeric values - Fixed error isolation (verified continues on failure) Technical Changes: - BaseScraper: Added session management and make_request() method - WordPressScraper: Updated all HTTP calls to use retry logic - Production runner: Added validate_config() and health check ping - Retry config: 3 attempts, 5-60s exponential backoff System is now production-ready with robust error handling, automatic retries, and health monitoring. Remaining tasks focus on spec compliance (media downloads, markdown format) and testing/documentation. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:16:02 -03:00
Ben Reed	05218a873b	Fix critical production issues and improve spec compliance Production Readiness Improvements: - Fixed scheduling to match spec (8 AM & 12 PM ADT instead of 6 AM/6 PM) - Enabled NAS synchronization in production runner with error handling - Fixed file naming convention to spec format (hvacknowitall_combined_YYYY-MM-DD-THHMMSS.md) - Made systemd services portable (removed hardcoded user/paths) - Added environment variable validation on startup - Moved DISPLAY/XAUTHORITY to .env configuration Systemd Improvements: - Created template service file (@.service) for any user - Changed all paths to /opt/hvac-kia-content - Updated installation script for portable deployment - Fixed service dependencies and resource limits Documentation: - Created comprehensive PRODUCTION_TODO.md with 25 tasks - Added PRODUCTION_GUIDE.md with deployment instructions - Documented spec compliance gaps (65% complete) Remaining work includes retry logic, connection pooling, media downloads, and pytest test suite as documented in PRODUCTION_TODO.md 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:07:55 -03:00
Ben Reed	1e5880bf00	feat: Enhance TikTok scraper with caption fetching and improved video discovery - Add optional individual video page fetching for complete captions - Implement profile scrolling to discover more videos (27+ vs 18) - Add configurable rate limiting and anti-detection delays - Fix RSS scrapers to support max_items parameter for backlog fetching - Add fetch_captions parameter with max_caption_fetches limit - Include additional metadata extraction (likes, comments, shares, duration) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 18:59:46 -03:00
Ben Reed	b89655c829	Add Instagram scraper with instaloader and parallel processing orchestrator - Implement Instagram scraper with aggressive rate limiting - Add orchestrator for running all scrapers in parallel - Create comprehensive tests for Instagram scraper (11 tests) - Create tests for orchestrator (9 tests) - Fix Instagram test issues with post type detection - All 60 tests passing successfully 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:56:57 -03:00
Ben Reed	c1831d3a52	feat: Implement YouTube scraper with humanized behavior - YouTube channel scraper using yt-dlp - Authentication and session persistence via cookies - Humanized delays and rate limiting (2-5 seconds between requests) - User agent rotation for stealth - Incremental updates via state management - Support for videos, shorts, and live streams detection - All 11 tests passing 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:39:49 -03:00
Ben Reed	7191fcd132	feat: Implement RSS scrapers for MailChimp and Podcast feeds - Created base RSS scraper class with common functionality - Implemented MailChimp RSS scraper for newsletters - Implemented Podcast RSS scraper with audio/image extraction - State management for incremental updates - All 9 tests passing for RSS scrapers 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:29:45 -03:00
Ben Reed	95e0499791	feat: Implement WordPress scraper with comprehensive tests - Created WordPressScraper class extending BaseScraper - Fetches posts with pagination support - Enriches posts with author, category, and tag information - Implements incremental updates via state management - Word count calculation for content - All 11 tests passing 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:19:56 -03:00
Ben Reed	f9a8e719a7	Initial commit: Project foundation with base scraper and tests - Set up UV environment with all required packages - Created comprehensive project structure - Implemented abstract BaseScraper class with TDD - Added documentation (project spec, implementation plan, status) - Configured .env for credentials (not committed) - All base scraper tests passing (9/9) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:15:17 -03:00
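
Note on commit dabef8bfcb: its message gives the retry configuration as "3 attempts, 5-60s exponential backoff" using tenacity. The repository code is not shown in this log, so the following is only a minimal stdlib sketch of that schedule, not the project's implementation and not tenacity's API. The names `backoff_delays` and `call_with_retry`, the doubling factor, and the injectable `sleep` parameter are all assumptions made for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int = 3, base: float = 5.0, cap: float = 60.0) -> List[float]:
    """Return the waits between attempts: base * 2**n, clamped to cap.

    Assumption: the delay doubles each time; the commit message only
    states the 5-60s range and the attempt count.
    """
    return [min(cap, base * 2 ** n) for n in range(attempts - 1)]


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 3,
    base: float = 5.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or the attempts are used up."""
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Re-raise on the final attempt; otherwise wait and try again.
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
    # Only reached if attempts < 1.
    raise RuntimeError("attempts must be at least 1")
```

With the defaults, a call that keeps failing waits 5 s and then 10 s before the third and final attempt raises. Passing a stand-in for `sleep` keeps tests fast.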

18 commits
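
Note on commit dc57ce80d5: its message lists alert conditions (Critical: CPU >80%, Memory >85%, Disk >90%; Warning: scraper inactive >24h, log files >100MB). As a rough illustration of how such thresholds might be evaluated, here is a small sketch. The `Metrics` dataclass, its field names, and the `evaluate` function are hypothetical and do not come from the repository's `SystemMonitor` class; only the numeric limits are taken from the commit message.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Metrics:
    # Hypothetical field names; the real monitor's schema is not shown in the log.
    cpu_percent: float
    memory_percent: float
    disk_percent: float
    hours_since_last_scrape: float
    log_size_mb: float


# Limits as stated in the commit message; comparisons are strict (">").
CRITICAL = {"cpu_percent": 80.0, "memory_percent": 85.0, "disk_percent": 90.0}
WARNING = {"hours_since_last_scrape": 24.0, "log_size_mb": 100.0}


def evaluate(metrics: Metrics) -> List[Tuple[str, str]]:
    """Return (severity, metric_name) pairs for every exceeded limit."""
    alerts: List[Tuple[str, str]] = []
    for severity, limits in (("critical", CRITICAL), ("warning", WARNING)):
        for name, limit in limits.items():
            if getattr(metrics, name) > limit:
                alerts.append((severity, name))
    return alerts
```

Keeping the limits in plain dictionaries is one way to make them configurable, which the commit message says the alert system supports; how the project actually stores them is not stated.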