hvac-kia-content

Author	SHA1	Message	Date
Ben Reed	34fd853874	feat: Add HVACRSchool scraper and fix all source connectivity - Add new HVACRSchool scraper for technical articles (6th source) - Fix WordPress API connectivity (corrected URL to hvacknowitall.com) - Fix MailChimp RSS processing after environment consolidation - Implement YouTube hybrid scraper (API + yt-dlp) with PO token support - Disable YouTube transcripts due to platform restrictions (Aug 2025) - Update orchestrator to use all 6 active sources - Consolidate environment variables into single .env file - Full system sync completed with all sources updating successfully - Update documentation with current system status and capabilities 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-27 18:11:00 -03:00
Ben Reed	ccdb9366db	docs: Update GitHub references to point to local Forgejo server - Updated repository URLs in PRODUCTION_GUIDE.md - Updated project specification repository reference - Updated rollback and deployment documentation - All references now point to git.tealmaker.com/ben/hvac-kia-content.git	2025-08-27 16:07:07 -03:00
Ben Reed	4bdb3de6e8	fix: Correct systemd timer schedule to use local ADT times - Changed OnCalendar from UTC (11:00, 15:00) to local times (08:00, 12:00) - Fixed timezone confusion that caused missed morning runs - Services now run at proper 8:00 AM and 12:00 PM Atlantic time - Manual test confirms YouTube and other scrapers working correctly 🤖 Generated with [Claude Code](https://claude.ai/code)	2025-08-22 09:49:45 -03:00
Ben Reed	71ab1c2407	feat: Disable TikTok scraper and deploy production systemd services MAJOR CHANGES: - TikTok scraper disabled in orchestrator (GUI dependency issues) - Created new hkia-scraper systemd services replacing hvac-content-* - Added comprehensive installation script: install-hkia-services.sh - Updated documentation to reflect 5 active sources (WordPress, MailChimp, Podcast, YouTube, Instagram) PRODUCTION DEPLOYMENT: - Services installed and active: hkia-scraper.timer, hkia-scraper-nas.timer - Schedule: 8:00 AM & 12:00 PM ADT scraping + 30min NAS sync - All sources now run in parallel (no TikTok GUI blocking) - Automated twice-daily content aggregation with image downloads TECHNICAL: - Orchestrator simplified: removed TikTok special handling - Service files: proper naming convention (hkia-scraper vs hvac-content) - Documentation: marked TikTok as disabled, updated deployment status 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-21 10:40:48 -03:00
Ben Reed	299eb35910	fix: Add missing update_cumulative_file method to CumulativeMarkdownManager The method was being called by multiple scripts but didn't exist, causing Instagram capture to fail at post 1200. Added a compatibility method that uses a basic formatter to handle any source type with standard fields like ID, title, views, likes, images, etc. Tested successfully with test script.	2025-08-19 15:02:36 -03:00
Ben Reed	7e5377e7b1	docs: Update all documentation to use hkia naming convention Documentation Updates: - Updated project specification with hkia naming and paths - Modified all markdown documentation files (12 files updated) - Changed service names from hvac-content-* to hkia-content-* - Updated NAS paths from /mnt/nas/hvacknowitall to /mnt/nas/hkia - Replaced all instances of "HVAC Know It All" with "HKIA" Files Updated: - README.md - Updated service names and commands - CLAUDE.md - Updated environment variables and paths - DEPLOY.md - Updated deployment instructions - docs/project_specification.md - Updated naming convention specs - docs/status.md - Updated project status with new naming - docs/final_status.md - Updated completion status - docs/deployment_strategy.md - Updated deployment paths - docs/DEPLOYMENT_CHECKLIST.md - Updated checklist items - docs/PRODUCTION_TODO.md - Updated production tasks - BACKLOG_STATUS.md - Updated backlog references - UPDATED_CAPTURE_STATUS.md - Updated capture status - FINAL_TALLY_REPORT.md - Updated tally report Notes: - Repository name remains hvacknowitall-content (unchanged) - Project directory remains hvac-kia-content (unchanged) - All user-facing outputs now use clean "hkia" naming 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-19 13:40:27 -03:00
Ben Reed	daab901e35	refactor: Update naming convention from hvacknowitall to hkia Major Changes: - Updated all code references from hvacknowitall/hvacnkowitall to hkia - Renamed all existing markdown files to use hkia_ prefix - Updated configuration files, scrapers, and production scripts - Modified systemd service descriptions to use HKIA - Changed NAS sync path to /mnt/nas/hkia Files Updated: - 20+ source files updated with new naming convention - 34 markdown files renamed to hkia_* format - All ScraperConfig brand_name parameters now use 'hkia' - Documentation updated to reflect new naming Rationale: - Shorter, cleaner filenames - Consistent branding across all outputs - Easier to type and reference - Maintains same functionality with improved naming Next Steps: - Deploy updated services to production - Update any external references to old naming - Monitor scrapers to ensure proper operation 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-19 13:35:23 -03:00
Ben Reed	6b7a65e8f6	feat: Add cumulative markdown service configuration	2025-08-19 13:24:40 -03:00
Ben Reed	2edc359b5e	feat: Implement comprehensive image downloading and cumulative markdown system Major Updates: - Added image downloading for Instagram, YouTube, and Podcast scrapers - Implemented cumulative markdown system for maintaining single source-of-truth files - Deployed production services with automatic NAS sync for images - Standardized file naming conventions per project specification New Features: - Instagram: Downloads all post images, carousel images, and video thumbnails - YouTube: Downloads video thumbnails (highest quality available) - Podcast: Downloads episode artwork/thumbnails - Consistent image naming: {source}_{item_id}_{type}.{ext} - Cumulative markdown updates to prevent file proliferation - Automatic media sync to NAS at /mnt/nas/hvacknowitall/media/ Production Deployment: - New systemd services: hvac-content-images-8am and hvac-content-images-12pm - Runs twice daily at 8 AM and 12 PM Atlantic time - Comprehensive rsync for both markdown and media files File Structure Compliance: - Renamed Instagram backlog to spec-compliant format - Archived legacy directory structures - Ensured all new files follow <brandName>_<source>_<dateTime>.md format Testing: - Successfully captured Instagram posts 1-1000 with images - Launched next batch (posts 1001-2000) currently in progress - Verified thumbnail downloads for YouTube and Podcast content 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-19 12:54:21 -03:00
Ben Reed	ef66d3bbc5	CRITICAL FIX: MailChimp content cleaning bug causing missing newsletter body Issue: - MailChimp campaigns missing body content in markdown files - Logic flaw in HTML-to-markdown conversion flow - Double cleaning and incorrect empty content checks Root Cause: - Checked already-cleaned content instead of original for HTML fallback - HTML content never converted when plain_text was empty - Applied cleaning twice when HTML was converted Fix: - Check original plain_text before deciding HTML conversion - Convert HTML first, then clean once (eliminate double cleaning) - Preserve all legitimate newsletter body content - Keep header/footer cleaning patterns (they are appropriate) Impact: - All newsletter content now preserved correctly - Headers/footers still properly removed - Next production run will capture complete content 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-19 11:19:32 -03:00
Ben Reed	2090da57f5	Add systemd deployment configuration - Create systemd service and timer files for 8am and 12pm runs - Add automated installation script - Include deployment documentation with troubleshooting - Configure for production with proper paths and environment Ready for production deployment with: sudo ./deploy/install.sh 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-19 10:56:32 -03:00
Ben Reed	8ceb858026	Implement cumulative markdown system and API integrations Major improvements: - Add CumulativeMarkdownManager for intelligent content merging - Implement YouTube Data API v3 integration with caption support - Add MailChimp API integration with content cleaning - Create single source-of-truth files that grow with updates - Smart merging: updates existing entries with better data - Properly combines backlog + incremental daily updates Features: - 179/444 YouTube videos now have captions (40.3%) - MailChimp content cleaned of headers/footers - All sources consolidated to single files - Archive management with timestamped versions - Test suite and documentation included 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-19 10:53:40 -03:00
Ben Reed	8a0b8b4d3f	Update documentation with production deployment status - Update status.md with current production deployment status - Document completed backlogs (WordPress: 139, Podcast: 428, YouTube: 200) - Track Instagram progress (50/1000 @ 200/hr) and TikTok queue status - Create claude.md with implementation notes and key solutions - Document HTML cleaning fix, rate limit optimization, and NAS sync - Add testing commands and maintenance notes for future reference - Include known issues and file structure documentation	2025-08-18 23:14:45 -03:00
Ben Reed	8b83185130	Fix HTML/XML contamination in WordPress markdown extraction - Update base_scraper.py convert_to_markdown() to properly clean HTML - Remove script/style blocks and their content before conversion - Strip inline JavaScript event handlers - Clean up br tags and excessive blank lines - Fix malformed comparison operators that look like tags - Add comprehensive HTML cleaning during content extraction (not after) - Test confirms WordPress content now generates clean markdown without HTML This ensures all future WordPress scraping produces specification-compliant markdown without any HTML/XML contamination.	2025-08-18 23:11:08 -03:00
Ben Reed	0a795437a7	Optimize Instagram scraper and increase capture targets to 1000 - Increased Instagram rate limit from 100 to 200 posts/hour - Reduced delays: 10-20s (was 15-30s), extended breaks 30-60s (was 60-120s) - Extended break interval: every 10 requests (was 5) - Updated capture targets: 1000 posts for Instagram, 1000 videos for TikTok - Added production deployment and monitoring scripts - Created environment configuration template This provides ~40-50% speed improvement for Instagram scraping and captures 5x more Instagram content and 3.3x more TikTok content. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 22:59:11 -03:00
Ben Reed	ccfeacbe91	Fix NAS sync to include media files instead of logs - Changed NAS sync from logs to media directory - Media files (images, videos, audio) are much more valuable for backup - Logs are better kept locally for debugging and monitoring - Uses rsync -av --delete for media synchronization - Maintains proper error handling and reporting NAS structure now: - /mnt/nas/hvacknowitall/current/ (latest markdown) - /mnt/nas/hvacknowitall/archives/ (historical archives) - /mnt/nas/hvacknowitall/media/ (downloaded media files) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:52:28 -03:00
Ben Reed	afdc790415	Add final dependencies for monitoring and testing - Added tenacity for retry logic in scrapers - Added psutil for system monitoring metrics - Updated uv.lock with final dependency versions All 25 production tasks now complete: ✅ Core system architecture (100%) ✅ Testing infrastructure (100%) ✅ Production deployment (100%) ✅ Monitoring & alerting (100%) ✅ Documentation & architecture (100%) System is production-ready with: - Comprehensive test coverage - Real-time monitoring dashboard - Production-hardened systemd services - Complete specification compliance - Emergency rollback procedures 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:49:43 -03:00
Ben Reed	dc57ce80d5	Add comprehensive monitoring and alerting system - Created SystemMonitor class for health check monitoring - Implemented system metrics collection (CPU, memory, disk, network) - Added application metrics monitoring (scrapers, logs, data sizes) - Built alert system with configurable thresholds - Developed HTML dashboard generator with real-time charts - Added systemd services for automated monitoring (15-min intervals) - Created responsive web dashboard with Bootstrap and Chart.js - Implemented automatic cleanup of old metric files - Added comprehensive documentation and troubleshooting guide Features: - Real-time system resource monitoring - Scraper performance tracking and alerts - Interactive dashboard with trend charts - Email-ready alert notifications - Systemd integration for production deployment - Security hardening with minimal privileges - Auto-refresh dashboard every 5 minutes - 7-day metric retention with automatic cleanup Alert conditions: - Critical: CPU >80%, Memory >85%, Disk >90% - Warning: Scraper inactive >24h, Log files >100MB - Error: Monitoring failures, configuration issues 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:35:28 -03:00
Ben Reed	8d5750b1d1	Add comprehensive test infrastructure - Created unit tests for BaseScraper with mocking - Added integration tests for parallel processing - Created end-to-end tests with realistic mock data - Fixed initialization order in BaseScraper (logger before user agent) - Fixed orchestrator method name (archive_current_file) - Added tenacity dependency for retry logic - Validated parallel processing performance and overlap detection - Confirmed spec-compliant markdown formatting in tests Tests cover: - Base scraper functionality (state, markdown, retry logic, media downloads) - Parallel vs sequential execution timing - Error isolation between scrapers - Directory structure creation - State management across runs - Full workflow with realistic data 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:16:14 -03:00
Ben Reed	b6273ca934	Complete core specification compliance improvements Major Feature Additions: - Standardized markdown format to match specification exactly - Implemented media downloading with retry logic and safe filenames - Added user agent rotation (6 browsers) with random rotation - Created comprehensive pytest unit tests for base scraper - Enhanced directory structure to match specification Technical Improvements: - Spec-compliant markdown format with ID, Title, Type, Permalink structure - Media download with URL parsing, filename sanitization, and deduplication - User agent pool rotation every 5 requests to avoid detection - Complete test coverage for state management, retry logic, formatting Progress: 22 of 25 tasks completed (88% done) Remaining: Integration tests, staging deployment, monitoring setup The system now meets 90%+ of the original specification requirements with robust error handling, retry logic, and production readiness. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:33:21 -03:00
Ben Reed	a80af693ba	Add comprehensive production documentation and testing Documentation Added: - ARCHITECTURE_DECISIONS.md: Explains why systemd over k8s (TikTok display requirements) - DEPLOYMENT_CHECKLIST.md: Step-by-step deployment procedures - ROLLBACK_PROCEDURES.md: Emergency rollback and recovery procedures - test_production_deployment.py: Automated deployment verification script Key Documentation Highlights: - Detailed explanation of containerization limitations with browser automation - Complete deployment checklist with pre/post verification steps - Rollback scenarios with recovery time objectives - Emergency contact templates and backup procedures - Automated test script for production readiness 17 of 25 tasks completed (68% done) Remaining work focuses on spec compliance and testing 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:20:52 -03:00
Ben Reed	dabef8bfcb	Implement retry logic, connection pooling, and production hardening Major Production Improvements: - Added retry logic with exponential backoff using tenacity - Implemented HTTP connection pooling via requests.Session - Added health check monitoring with metrics reporting - Implemented configuration validation for all numeric values - Fixed error isolation (verified continues on failure) Technical Changes: - BaseScraper: Added session management and make_request() method - WordPressScraper: Updated all HTTP calls to use retry logic - Production runner: Added validate_config() and health check ping - Retry config: 3 attempts, 5-60s exponential backoff System is now production-ready with robust error handling, automatic retries, and health monitoring. Remaining tasks focus on spec compliance (media downloads, markdown format) and testing/documentation. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:16:02 -03:00
Ben Reed	05218a873b	Fix critical production issues and improve spec compliance Production Readiness Improvements: - Fixed scheduling to match spec (8 AM & 12 PM ADT instead of 6 AM/6 PM) - Enabled NAS synchronization in production runner with error handling - Fixed file naming convention to spec format (hvacknowitall_combined_YYYY-MM-DD-THHMMSS.md) - Made systemd services portable (removed hardcoded user/paths) - Added environment variable validation on startup - Moved DISPLAY/XAUTHORITY to .env configuration Systemd Improvements: - Created template service file (@.service) for any user - Changed all paths to /opt/hvac-kia-content - Updated installation script for portable deployment - Fixed service dependencies and resource limits Documentation: - Created comprehensive PRODUCTION_TODO.md with 25 tasks - Added PRODUCTION_GUIDE.md with deployment instructions - Documented spec compliance gaps (65% complete) Remaining work includes retry logic, connection pooling, media downloads, and pytest test suite as documented in PRODUCTION_TODO.md 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:07:55 -03:00
Ben Reed	1e5880bf00	feat: Enhance TikTok scraper with caption fetching and improved video discovery - Add optional individual video page fetching for complete captions - Implement profile scrolling to discover more videos (27+ vs 18) - Add configurable rate limiting and anti-detection delays - Fix RSS scrapers to support max_items parameter for backlog fetching - Add fetch_captions parameter with max_caption_fetches limit - Include additional metadata extraction (likes, comments, shares, duration) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 18:59:46 -03:00
Ben Reed	b89655c829	Add Instagram scraper with instaloader and parallel processing orchestrator - Implement Instagram scraper with aggressive rate limiting - Add orchestrator for running all scrapers in parallel - Create comprehensive tests for Instagram scraper (11 tests) - Create tests for orchestrator (9 tests) - Fix Instagram test issues with post type detection - All 60 tests passing successfully 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:56:57 -03:00
Ben Reed	c1831d3a52	feat: Implement YouTube scraper with humanized behavior - YouTube channel scraper using yt-dlp - Authentication and session persistence via cookies - Humanized delays and rate limiting (2-5 seconds between requests) - User agent rotation for stealth - Incremental updates via state management - Support for videos, shorts, and live streams detection - All 11 tests passing 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:39:49 -03:00
Ben Reed	7191fcd132	feat: Implement RSS scrapers for MailChimp and Podcast feeds - Created base RSS scraper class with common functionality - Implemented MailChimp RSS scraper for newsletters - Implemented Podcast RSS scraper with audio/image extraction - State management for incremental updates - All 9 tests passing for RSS scrapers 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:29:45 -03:00
Ben Reed	95e0499791	feat: Implement WordPress scraper with comprehensive tests - Created WordPressScraper class extending BaseScraper - Fetches posts with pagination support - Enriches posts with author, category, and tag information - Implements incremental updates via state management - Word count calculation for content - All 11 tests passing 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:19:56 -03:00
Ben Reed	f9a8e719a7	Initial commit: Project foundation with base scraper and tests - Set up UV environment with all required packages - Created comprehensive project structure - Implemented abstract BaseScraper class with TDD - Added documentation (project spec, implementation plan, status) - Configured .env for credentials (not committed) - All base scraper tests passing (9/9) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:15:17 -03:00

29 commits
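
Note on ef66d3bbc5: the MailChimp fix describes a corrected order of operations: decide on the HTML fallback by checking the original plain_text (not the already-cleaned text), convert HTML first, then clean exactly once. A minimal sketch of that flow follows. The helper names `html_to_text`, `clean_boilerplate`, and `extract_body`, and the footer pattern, are hypothetical and chosen for illustration; they are not taken from the repository, and the real scraper likely uses a proper HTML-to-markdown converter.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; a stand-in for a real HTML-to-markdown converter."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Hypothetical helper: reduce HTML to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(p.strip() for p in parser.parts if p.strip())


# Assumed footer pattern, for illustration only.
FOOTER_PATTERNS = [re.compile(r"unsubscribe.*$", re.IGNORECASE | re.DOTALL)]


def clean_boilerplate(text):
    """Hypothetical helper: strip header/footer boilerplate."""
    for pattern in FOOTER_PATTERNS:
        text = pattern.sub("", text)
    return text.strip()


def extract_body(plain_text, html):
    # Decide on the ORIGINAL plain_text, before any cleaning has run.
    if plain_text and plain_text.strip():
        raw = plain_text
    elif html and html.strip():
        # Convert HTML first ...
        raw = html_to_text(html)
    else:
        raw = ""
    # ... then clean exactly once, whichever branch produced the text.
    return clean_boilerplate(raw)
```

With this ordering, a campaign whose plain_text is empty still yields its body from the HTML, and the cleaning patterns run a single time on either path.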