hvac-kia-content

Author	SHA1	Message	Date
Ben Reed	8b83185130	Fix HTML/XML contamination in WordPress markdown extraction - Update base_scraper.py convert_to_markdown() to properly clean HTML - Remove script/style blocks and their content before conversion - Strip inline JavaScript event handlers - Clean up br tags and excessive blank lines - Fix malformed comparison operators that look like tags - Add comprehensive HTML cleaning during content extraction (not after) - Test confirms WordPress content now generates clean markdown without HTML This ensures all future WordPress scraping produces specification-compliant markdown without any HTML/XML contamination.	2025-08-18 23:11:08 -03:00
Ben Reed	8d5750b1d1	Add comprehensive test infrastructure - Created unit tests for BaseScraper with mocking - Added integration tests for parallel processing - Created end-to-end tests with realistic mock data - Fixed initialization order in BaseScraper (logger before user agent) - Fixed orchestrator method name (archive_current_file) - Added tenacity dependency for retry logic - Validated parallel processing performance and overlap detection - Confirmed spec-compliant markdown formatting in tests Tests cover: - Base scraper functionality (state, markdown, retry logic, media downloads) - Parallel vs sequential execution timing - Error isolation between scrapers - Directory structure creation - State management across runs - Full workflow with realistic data 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 21:16:14 -03:00
Ben Reed	b6273ca934	Complete core specification compliance improvements Major Feature Additions: - Standardized markdown format to match specification exactly - Implemented media downloading with retry logic and safe filenames - Added user agent rotation (6 browsers) with random rotation - Created comprehensive pytest unit tests for base scraper - Enhanced directory structure to match specification Technical Improvements: - Spec-compliant markdown format with ID, Title, Type, Permalink structure - Media download with URL parsing, filename sanitization, and deduplication - User agent pool rotation every 5 requests to avoid detection - Complete test coverage for state management, retry logic, formatting Progress: 22 of 25 tasks completed (88% done) Remaining: Integration tests, staging deployment, monitoring setup The system now meets 90%+ of the original specification requirements with robust error handling, retry logic, and production readiness. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:33:21 -03:00
Ben Reed	dabef8bfcb	Implement retry logic, connection pooling, and production hardening Major Production Improvements: - Added retry logic with exponential backoff using tenacity - Implemented HTTP connection pooling via requests.Session - Added health check monitoring with metrics reporting - Implemented configuration validation for all numeric values - Fixed error isolation (verified continues on failure) Technical Changes: - BaseScraper: Added session management and make_request() method - WordPressScraper: Updated all HTTP calls to use retry logic - Production runner: Added validate_config() and health check ping - Retry config: 3 attempts, 5-60s exponential backoff System is now production-ready with robust error handling, automatic retries, and health monitoring. Remaining tasks focus on spec compliance (media downloads, markdown format) and testing/documentation. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:16:02 -03:00
Ben Reed	05218a873b	Fix critical production issues and improve spec compliance Production Readiness Improvements: - Fixed scheduling to match spec (8 AM & 12 PM ADT instead of 6 AM/6 PM) - Enabled NAS synchronization in production runner with error handling - Fixed file naming convention to spec format (hvacknowitall_combined_YYYY-MM-DD-THHMMSS.md) - Made systemd services portable (removed hardcoded user/paths) - Added environment variable validation on startup - Moved DISPLAY/XAUTHORITY to .env configuration Systemd Improvements: - Created template service file (@.service) for any user - Changed all paths to /opt/hvac-kia-content - Updated installation script for portable deployment - Fixed service dependencies and resource limits Documentation: - Created comprehensive PRODUCTION_TODO.md with 25 tasks - Added PRODUCTION_GUIDE.md with deployment instructions - Documented spec compliance gaps (65% complete) Remaining work includes retry logic, connection pooling, media downloads, and pytest test suite as documented in PRODUCTION_TODO.md 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 20:07:55 -03:00
Ben Reed	f9a8e719a7	Initial commit: Project foundation with base scraper and tests - Set up UV environment with all required packages - Created comprehensive project structure - Implemented abstract BaseScraper class with TDD - Added documentation (project spec, implementation plan, status) - Configured .env for credentials (not committed) - All base scraper tests passing (9/9) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-08-18 12:15:17 -03:00
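
Commit b6273ca934 mentions a user agent pool with "rotation every 5 requests". The repository code is not shown in this log, so the following is only a minimal sketch of how such a rotation could look. The class name `UserAgentRotator`, its parameters, and the placeholder agent strings are hypothetical and are not taken from the project.

```python
import random


class UserAgentRotator:
    """Hypothetical helper: pick a random user agent and keep it for N requests."""

    def __init__(self, agents, rotate_every=5, rng=None):
        if not agents:
            raise ValueError("agents must not be empty")
        if rotate_every < 1:
            raise ValueError("rotate_every must be at least 1")
        self._agents = list(agents)
        self._rotate_every = rotate_every
        # An injectable RNG keeps the behaviour reproducible in tests.
        self._rng = rng or random.Random()
        self._count = 0
        self._current = self._rng.choice(self._agents)

    def next_agent(self):
        """Return the agent for the next request, re-drawing every N calls."""
        if self._count and self._count % self._rotate_every == 0:
            self._current = self._rng.choice(self._agents)
        self._count += 1
        return self._current


# Placeholder strings; a real pool would hold full browser user agent values.
rotator = UserAgentRotator(["agent-a", "agent-b", "agent-c"], rotate_every=5)
headers = {"User-Agent": rotator.next_agent()}
```

Because the draw is random, two consecutive windows of five requests may happen to reuse the same agent; the sketch only guarantees that the agent stays fixed within a window.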

6 commits
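
Commit 8b83185130 lists the HTML cleanup steps applied before markdown conversion: removing script/style blocks, stripping inline event handlers, cleaning up br tags, and collapsing excessive blank lines. The actual `convert_to_markdown()` in base_scraper.py is not shown in this log, so the sketch below is an assumption about how those steps might be expressed with the standard `re` module. The function name `clean_html` is hypothetical.

```python
import re


def clean_html(html: str) -> str:
    """Hypothetical pre-conversion cleanup; not the project's actual code."""
    # Drop <script> and <style> blocks together with their content.
    html = re.sub(
        r"<(script|style)\b[^>]*>.*?</\1\s*>",
        "",
        html,
        flags=re.IGNORECASE | re.DOTALL,
    )
    # Strip inline event handlers such as onclick="...".
    html = re.sub(
        r"""\s+on\w+\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""",
        "",
        html,
        flags=re.IGNORECASE,
    )
    # Turn <br>, <br/> and <br /> into newlines.
    html = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # Collapse runs of three or more newlines into a single blank line.
    html = re.sub(r"\n{3,}", "\n\n", html)
    return html.strip()


sample = '<p onclick="go()">Hi</p><script>alert(1)</script><br><br/><BR /><br>End'
cleaned = clean_html(sample)
```

The event handler pattern is deliberately simple and would also match plain text of the form ` onfoo=bar` outside a tag, so a parser-based approach may be the safer choice for untrusted input.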