Commit graph

11 commits

Author SHA1 Message Date
Ben Reed
ade81beea2 feat: Complete Phase 1 content analysis with engagement parsing fixes
Major enhancements to HKIA content analysis system:

CRITICAL FIXES:
• Fix engagement data parsing from markdown (Views/Likes/Comments now extracted correctly)
• YouTube: 18.75% engagement rate working (16 views, 2 likes, 1 comment)
• Instagram: 7.37% average engagement rate across 20 posts
• High performer detection operational (1 YouTube + 20 Instagram above thresholds)
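
The YouTube figure above is consistent with engagement rate = (likes + comments) / views, since (2 + 1) / 16 = 18.75%. A minimal sketch of that parsing and arithmetic follows. The markdown field layout (`## Views: 16`) and the names `parse_engagement` and `engagement_rate` are assumptions for illustration, not taken from the repository.

```python
import re

# Assumed layout: one "Label: number" field per line, optionally wrapped in
# markdown punctuation such as "## " or "**". The real format may differ.
FIELD_RE = re.compile(
    r"^\W*(views|likes|comments)\W*:\W*(\d[\d,]*)",
    re.IGNORECASE | re.MULTILINE,
)


def parse_engagement(markdown: str) -> dict[str, int]:
    """Extract view, like, and comment counts from a markdown snippet."""
    counts = {"views": 0, "likes": 0, "comments": 0}
    for name, value in FIELD_RE.findall(markdown):
        counts[name.lower()] = int(value.replace(",", ""))
    return counts


def engagement_rate(views: int, likes: int, comments: int) -> float:
    """Return (likes + comments) / views as a percentage; 0.0 without views."""
    if views <= 0:
        return 0.0
    return 100.0 * (likes + comments) / views


sample = "## Views: 16\n## Likes: 2\n## Comments: 1\n"
print(f"{engagement_rate(**parse_engagement(sample)):.2f}%")  # prints "18.75%"
```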

CONTENT ANALYSIS SYSTEM:
• Add Claude Haiku analyzer for HVAC content classification
• Add engagement analyzer with source-specific algorithms
• Add keyword extractor with 100+ HVAC-specific terms
• Add intelligence aggregator for daily JSON reports
• Add comprehensive unit test suite (73 tests, 90% coverage target)

ARCHITECTURE:
• Extend BaseScraper with optional AI analysis capabilities
• Add content analysis orchestrator with CLI interface
• Add competitive intelligence module structure
• Maintain backward compatibility with existing scrapers

INTELLIGENCE FEATURES:
• Daily intelligence reports with strategic insights
• Trending keyword analysis (813 refrigeration, 701 service mentions)
• Content opportunity identification
• Multi-source engagement benchmarking
• HVAC-specific topic and product categorization
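
Trending keyword counts like those listed above could be produced by a simple term tally. This is a sketch only: the three-term vocabulary and the function name `count_keywords` are hypothetical stand-ins for the 100+ term list the commit mentions.

```python
import re
from collections import Counter

# Hypothetical subset of an HVAC vocabulary, for illustration only.
HVAC_TERMS = frozenset({"refrigeration", "service", "compressor"})


def count_keywords(texts, terms=HVAC_TERMS) -> Counter:
    """Count case-insensitive whole-word occurrences of known terms."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in terms:
                counts[word] += 1
    return counts


docs = ["Refrigeration service call", "Service the compressor"]
print(count_keywords(docs).most_common(1))  # prints "[('service', 2)]"
```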

PRODUCTION READY:
• Claude Haiku API integration validated ($15-25/month estimated)
• Graceful degradation when API unavailable
• Comprehensive logging and error handling
• State management for analytics tracking
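
Graceful degradation when the API is unavailable might look like the sketch below. The classifier is passed in as a plain callable so no real client API is assumed; `classify_with_fallback` and the result shape are hypothetical.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("content_analysis")


def classify_with_fallback(
    text: str, ai_classify: Optional[Callable[[str], dict]] = None
) -> dict:
    """Try the AI classifier; fall back to an empty result if it fails."""
    if ai_classify is not None:
        try:
            result = dict(ai_classify(text))
            result["source"] = "ai"
            return result
        except Exception as exc:  # network errors, rate limits, bad responses
            logger.warning("AI classification unavailable: %s", exc)
    # Degraded mode: the pipeline continues without AI-derived topics.
    return {"topics": [], "source": "fallback"}
```

With this shape, a failed API call is logged and the caller still gets a well-formed result.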

Ready for Phase 2: Competitive Intelligence Infrastructure

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-28 16:40:19 -03:00
Ben Reed
34fd853874 feat: Add HVACRSchool scraper and fix all source connectivity
- Add new HVACRSchool scraper for technical articles (6th source)
- Fix WordPress API connectivity (corrected URL to hvacknowitall.com)
- Fix MailChimp RSS processing after environment consolidation
- Implement YouTube hybrid scraper (API + yt-dlp) with PO token support
- Disable YouTube transcripts due to platform restrictions (Aug 2025)
- Update orchestrator to use all 6 active sources
- Consolidate environment variables into single .env file
- Full system sync completed with all sources updating successfully
- Update documentation with current system status and capabilities

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-27 18:11:00 -03:00
Ben Reed
daab901e35 refactor: Update naming convention from hvacknowitall to hkia
Major Changes:
- Updated all code references from hvacknowitall/hvacnkowitall to hkia
- Renamed all existing markdown files to use hkia_ prefix
- Updated configuration files, scrapers, and production scripts
- Modified systemd service descriptions to use HKIA
- Changed NAS sync path to /mnt/nas/hkia

Files Updated:
- 20+ source files updated with new naming convention
- 34 markdown files renamed to hkia_* format
- All ScraperConfig brand_name parameters now use 'hkia'
- Documentation updated to reflect new naming

Rationale:
- Shorter, cleaner filenames
- Consistent branding across all outputs
- Easier to type and reference
- Maintains same functionality with improved naming

Next Steps:
- Deploy updated services to production
- Update any external references to old naming
- Monitor scrapers to ensure proper operation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-19 13:35:23 -03:00
Ben Reed
8d5750b1d1 Add comprehensive test infrastructure
- Created unit tests for BaseScraper with mocking
- Added integration tests for parallel processing
- Created end-to-end tests with realistic mock data
- Fixed initialization order in BaseScraper (logger before user agent)
- Fixed orchestrator method name (archive_current_file)
- Added tenacity dependency for retry logic
- Validated parallel processing performance and overlap detection
- Confirmed spec-compliant markdown formatting in tests
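
The commit adds tenacity for retry logic. To illustrate the idea without that dependency, here is a standard-library stand-in with exponential backoff. It is not the tenacity API, and the decorator name and parameters are assumptions.

```python
import time
from functools import wraps


def retry(attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, doubling the delay after each failure."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator
```

Passing `sleep` as a parameter keeps unit tests fast, since a test can record the delays instead of waiting.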

Tests cover:
- Base scraper functionality (state, markdown, retry logic, media downloads)
- Parallel vs sequential execution timing
- Error isolation between scrapers
- Directory structure creation
- State management across runs
- Full workflow with realistic data
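
Error isolation between scrapers running in parallel can be sketched with a thread pool that collects failures per scraper instead of aborting the run. The function name `run_all` and the dict-of-callables interface are assumptions, not the orchestrator's actual design.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(scrapers: dict):
    """Run each scraper callable in its own thread.

    Returns (results, errors), both keyed by scraper name, so one failing
    scraper does not prevent the others from finishing.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, len(scrapers))) as pool:
        futures = {pool.submit(fn): name for name, fn in scrapers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                errors[name] = str(exc)
    return results, errors
```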

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 21:16:14 -03:00
Ben Reed
b6273ca934 Complete core specification compliance improvements
Major Feature Additions:
- Standardized markdown format to match specification exactly
- Implemented media downloading with retry logic and safe filenames
- Added user agent rotation across a pool of 6 browsers with random selection
- Created comprehensive pytest unit tests for base scraper
- Enhanced directory structure to match specification
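
Safe filenames for downloaded media could be derived from the URL roughly as follows. The character whitelist, the `download` default, and the `_1`, `_2` suffix scheme for duplicates are assumptions chosen for the sketch.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def safe_filename(url: str, used: set) -> str:
    """Build a filesystem-safe, unique filename from a media URL."""
    # Take the last path segment; the query string is ignored by urlparse().path.
    name = PurePosixPath(unquote(urlparse(url).path)).name or "download"
    # Replace anything outside a conservative whitelist with underscores.
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("._") or "download"
    stem, dot, ext = name.rpartition(".")
    if not dot:
        stem, ext = name, ""
    candidate, n = name, 1
    while candidate in used:  # deduplicate by appending a counter
        candidate = f"{stem}_{n}.{ext}" if ext else f"{stem}_{n}"
        n += 1
    used.add(candidate)
    return candidate
```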

Technical Improvements:
- Spec-compliant markdown format with ID, Title, Type, Permalink structure
- Media download with URL parsing, filename sanitization, and deduplication
- User agent pool rotation every 5 requests to avoid detection
- Complete test coverage for state management, retry logic, formatting
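
Rotating the user agent every 5 requests can be sketched with a small counter-based class. The class name, the `next` method, and the placeholder agent strings are hypothetical; the real pool of 6 browser strings is not shown in this log.

```python
import random


class UserAgentRotator:
    """Hand out a user agent, picking a new random one every N requests."""

    def __init__(self, agents, rotate_every=5, rng=None):
        if not agents:
            raise ValueError("at least one user agent is required")
        self._agents = list(agents)
        self._rotate_every = rotate_every
        self._rng = rng or random.Random()
        self._count = 0
        self._current = self._rng.choice(self._agents)

    def next(self) -> str:
        # Re-pick after every full batch of `rotate_every` requests.
        if self._count and self._count % self._rotate_every == 0:
            self._current = self._rng.choice(self._agents)
        self._count += 1
        return self._current


rotator = UserAgentRotator(["ExampleBrowser/1.0", "ExampleBrowser/2.0"])
headers = {"User-Agent": rotator.next()}
```

Because the new agent is chosen at random, consecutive batches may occasionally reuse the same string.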

Progress: 22 of 25 tasks completed (88% done)
Remaining: Integration tests, staging deployment, monitoring setup

The system now meets 90%+ of the original specification requirements
with robust error handling, retry logic, and production readiness.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 20:33:21 -03:00
Ben Reed
05218a873b Fix critical production issues and improve spec compliance
Production Readiness Improvements:
- Fixed scheduling to match spec (8 AM & 12 PM ADT instead of 6 AM/6 PM)
- Enabled NAS synchronization in production runner with error handling
- Fixed file naming convention to spec format (hvacknowitall_combined_YYYY-MM-DD-THHMMSS.md)
- Made systemd services portable (removed hardcoded user/paths)
- Added environment variable validation on startup
- Moved DISPLAY/XAUTHORITY to .env configuration
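
The file naming format above maps directly onto a `strftime` pattern. A minimal sketch, where the function name `combined_filename` is an assumption:

```python
from datetime import datetime


def combined_filename(brand: str, when: datetime) -> str:
    """Format <brand>_combined_YYYY-MM-DD-THHMMSS.md for a given timestamp."""
    return f"{brand}_combined_{when.strftime('%Y-%m-%d-T%H%M%S')}.md"


print(combined_filename("hvacknowitall", datetime(2025, 8, 18, 8, 0, 0)))
# prints "hvacknowitall_combined_2025-08-18-T080000.md"
```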

Systemd Improvements:
- Created template service file (@.service) for any user
- Changed all paths to /opt/hvac-kia-content
- Updated installation script for portable deployment
- Fixed service dependencies and resource limits

Documentation:
- Created comprehensive PRODUCTION_TODO.md with 25 tasks
- Added PRODUCTION_GUIDE.md with deployment instructions
- Documented spec compliance gaps (65% complete)

Remaining work includes retry logic, connection pooling, media downloads,
and a pytest test suite, as documented in PRODUCTION_TODO.md.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 20:07:55 -03:00
Ben Reed
b89655c829 Add Instagram scraper with instaloader and parallel processing orchestrator
- Implement Instagram scraper with aggressive rate limiting
- Add orchestrator for running all scrapers in parallel
- Create comprehensive tests for Instagram scraper (11 tests)
- Create tests for orchestrator (9 tests)
- Fix Instagram test issues with post type detection
- All 60 tests passing

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 12:56:57 -03:00
Ben Reed
c1831d3a52 feat: Implement YouTube scraper with humanized behavior
- YouTube channel scraper using yt-dlp
- Authentication and session persistence via cookies
- Humanized delays and rate limiting (2-5 seconds between requests)
- User agent rotation for stealth
- Incremental updates via state management
- Detection of videos, shorts, and live streams
- All 11 tests passing
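
A humanized delay in the 2-5 second range could be implemented as a uniformly random pause between requests. This is a sketch; the function name and the injectable `sleep` parameter are assumptions made so the helper can be tested without waiting.

```python
import random
import time


def humanized_delay(min_s=2.0, max_s=5.0, sleep=time.sleep, rng=random):
    """Sleep for a random duration in [min_s, max_s] and return it."""
    delay = rng.uniform(min_s, max_s)
    sleep(delay)
    return delay
```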

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 12:39:49 -03:00
Ben Reed
7191fcd132 feat: Implement RSS scrapers for MailChimp and Podcast feeds
- Created base RSS scraper class with common functionality
- Implemented MailChimp RSS scraper for newsletters
- Implemented Podcast RSS scraper with audio/image extraction
- State management for incremental updates
- All 9 tests passing for RSS scrapers
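
Audio and image extraction from a podcast feed typically reads the RSS `<enclosure>` element and the `itunes:image` element. The sketch below uses only the standard library; the function name, the result keys, and the sample feed are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

ITUNES_NS = {"itunes": "http://www.itunes.com/dtds/podcast-1.0.dtd"}

SAMPLE_FEED = (
    '<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">'
    "<channel><item>"
    "<title>Episode 1</title>"
    '<enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="123"/>'
    '<itunes:image href="https://example.com/ep1.jpg"/>'
    "</item></channel></rss>"
)


def extract_media(rss_xml: str) -> list:
    """Return title, audio URL, and image URL for each item in the feed."""
    root = ET.fromstring(rss_xml)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        image = item.find("itunes:image", ITUNES_NS)
        episodes.append(
            {
                "title": item.findtext("title", default=""),
                "audio_url": enclosure.get("url") if enclosure is not None else None,
                "image_url": image.get("href") if image is not None else None,
            }
        )
    return episodes
```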

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 12:29:45 -03:00
Ben Reed
95e0499791 feat: Implement WordPress scraper with comprehensive tests
- Created WordPressScraper class extending BaseScraper
- Fetches posts with pagination support
- Enriches posts with author, category, and tag information
- Implements incremental updates via state management
- Word count calculation for content
- All 11 tests passing
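
Incremental updates via state management can be sketched as a JSON file of already-seen post IDs that filters each new fetch. The file layout (`seen_ids`) and the function names are assumptions; the repository's state format is not shown here.

```python
import json
from pathlib import Path


def load_seen_ids(state_path: Path) -> set:
    """Read previously seen item IDs, or an empty set on the first run."""
    if not state_path.exists():
        return set()
    data = json.loads(state_path.read_text(encoding="utf-8"))
    return set(data.get("seen_ids", []))


def filter_new(items: list, state_path: Path) -> list:
    """Return only unseen items and record their IDs for the next run."""
    seen = load_seen_ids(state_path)
    fresh = [item for item in items if item["id"] not in seen]
    seen.update(item["id"] for item in fresh)
    state_path.write_text(json.dumps({"seen_ids": sorted(seen)}), encoding="utf-8")
    return fresh
```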

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 12:19:56 -03:00
Ben Reed
f9a8e719a7 Initial commit: Project foundation with base scraper and tests
- Set up UV environment with all required packages
- Created comprehensive project structure
- Implemented abstract BaseScraper class with TDD
- Added documentation (project spec, implementation plan, status)
- Configured .env for credentials (not committed)
- All base scraper tests passing (9/9)
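
An abstract base scraper of the kind described above might be structured as follows. This is a sketch under assumptions: the method names `fetch_content`, `format_markdown`, and `run`, and the markdown field labels, are hypothetical and not confirmed by this log.

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Shared workflow; subclasses supply the source-specific fetch step."""

    def __init__(self, source_name: str):
        self.source_name = source_name

    @abstractmethod
    def fetch_content(self) -> list:
        """Return a list of dicts with at least 'id' and 'title' keys."""

    def format_markdown(self, items: list) -> str:
        sections = [f"# ID: {item['id']}\n\n## Title: {item['title']}" for item in items]
        return "\n\n".join(sections)

    def run(self) -> str:
        return self.format_markdown(self.fetch_content())


class StaticScraper(BaseScraper):
    """Toy subclass used only to show the contract."""

    def fetch_content(self) -> list:
        return [{"id": 1, "title": "Hello"}]


print(StaticScraper("demo").run())
```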

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 12:15:17 -03:00