HKIA Content Aggregation System - Project Specification
Overview
A containerized Python application that aggregates content from multiple HKIA sources, converts it to markdown, and syncs it to a NAS. The system runs on the control plane node of a Kubernetes cluster.
Content Sources
1. YouTube Channel
- Fields: ID, type (video/short/live), link, author, description, likes, comments, views, shares
- Authentication: Credentials stored in .env (YOUTUBE_USERNAME, YOUTUBE_PASSWORD)
- Tool: yt-dlp
- Special Requirements: Humanized behavior, rate limiting
2. MailChimp RSS
- Fields: ID, title, link, publish date, content
- URL: https://hkia.com/feed/
- Tool: feedparser
3. Podcast RSS
- Fields: ID, audio link, author, title, subtitle, pubDate, duration, description, image, episode link
- URL: https://hkia.com/podcast/feed/
- Tool: feedparser
4. WordPress Blog Posts
- Fields: ID, title, author, publish date, word count, tags, categories
- API: REST API at https://hkia.com/
- Credentials: Stored in .env (WORDPRESS_USERNAME, WORDPRESS_API_KEY)
5. Instagram
- Fields: ID, type (post/story/reel/highlights), publish date, link, author, description, likes, comments, views, shares
- Authentication: Credentials stored in .env (INSTAGRAM_USERNAME, INSTAGRAM_PASSWORD)
- Tool: instaloader
- Special Requirements: Humanized behavior, aggressive rate limiting
System Requirements
Scheduling
- Run twice daily: 8:00 AM ADT and 12:00 PM ADT
- Use Atlantic timezone (America/Halifax)
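The twice-daily Atlantic-time cadence can be computed with the standard library alone; `next_run` below is a hypothetical helper sketching the idea, not part of the spec:

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

ATLANTIC = ZoneInfo("America/Halifax")
RUN_TIMES = (time(8, 0), time(12, 0))  # 8:00 AM and 12:00 PM Atlantic

def next_run(now: datetime) -> datetime:
    """Return the next scheduled run strictly after `now`, in Atlantic time."""
    local = now.astimezone(ATLANTIC)
    for day_offset in (0, 1):
        day = local.date() + timedelta(days=day_offset)
        for t in RUN_TIMES:
            candidate = datetime.combine(day, t, tzinfo=ATLANTIC)
            if candidate > local:
                return candidate
    raise RuntimeError("unreachable")
```

A scheduler loop would sleep until `next_run(datetime.now(ATLANTIC))` and then dispatch the scrapers.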
Data Processing
- Check for new content (incremental updates preferred)
- Spawn parallel processes for each source
- Convert all content to markdown using MarkItDown
- Download associated media files
- Archive previous markdown files
- Rsync to NAS at /mnt/nas/hkia/
File Naming Convention
<brandName>_<source>_<dateTime in Atlantic Timezone>.md
Example: hkia_blog_2024-01-15T143045.md
Directory Structure
.env
data/
├── markdown_current/ # Current markdown files
├── markdown_archives/ # Archived markdown files by source
│ ├── WordPress/
│ ├── Instagram/
│ ├── YouTube/
│ ├── Podcast/
│ └── MailChimp/
├── media/ # Downloaded media files by source
│ ├── WordPress/
│ ├── Instagram/
│ ├── YouTube/
│ ├── Podcast/
│ └── MailChimp/
└── .state/ # State files for incremental updates
docs/ # Documentation
logs/ # Log files by source with rotation
├── WordPress/
├── Instagram/
├── YouTube/
├── Podcast/
└── MailChimp/
src/ # Source code
tests/ # Test files
k8s/ # Kubernetes manifests
Markdown File Format
# ID: [unique_identifier]
## Title: [content_title]
## Type: [content_type]
## Permalink: [url]
## Description:
[content_description]
## Metadata:
### Comments: [count]
### Likes: [count]
### Tags:
- tag1
- tag2
--------------
# ID: [next_item]
...
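The block format above can be serialized from a plain dict; the field names mirror the template, but the dict shape itself is an assumption about the internal data model:

```python
def render_item(item: dict) -> str:
    """Serialize one content item into the spec's markdown block format."""
    lines = [
        f"# ID: {item['id']}",
        f"## Title: {item['title']}",
        f"## Type: {item['type']}",
        f"## Permalink: {item['permalink']}",
        "## Description:",
        item.get("description", ""),
        "## Metadata:",
        f"### Comments: {item.get('comments', 0)}",
        f"### Likes: {item.get('likes', 0)}",
        "### Tags:",
    ]
    lines += [f"- {tag}" for tag in item.get("tags", [])]
    lines.append("-" * 14)  # item separator per the template
    return "\n".join(lines)
```

Multiple items in one file are just consecutive blocks joined by the separator line.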
Technical Implementation
Development Approach
- Test-Driven Development (TDD)
- Python with UV package manager
- Abstract base class for content sources
- Parallel processing using multiprocessing
- State management with JSON files
- Comprehensive error handling with exponential backoff
Key Python Packages
- requests: API calls
- feedparser: RSS parsing
- yt-dlp: YouTube content
- instaloader: Instagram content
- markitdown: Markdown conversion
- python-dotenv: Environment management
- schedule: Task scheduling
- pytest: Testing framework
- pytz: Timezone handling
Security & Rate Limiting
- Credentials stored in .env file
- Humanized behavior for YouTube/Instagram:
- Random delays between requests (2-10 seconds)
- Exponential backoff on errors
- User-agent rotation
- Session management
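The humanized-behavior requirements above could be sketched as small helpers; the user-agent strings and bounds are illustrative defaults, with 2-10 s taken from the spec:

```python
import random
import time

# A few desktop user-agent strings; the rotation set is an assumption.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def humanized_delay(minimum: float = 2.0, maximum: float = 10.0) -> float:
    """Sleep a random 2-10 s between requests; returns the delay chosen."""
    delay = random.uniform(minimum, maximum)
    time.sleep(delay)
    return delay

def pick_user_agent() -> str:
    """Rotate user agents across requests."""
    return random.choice(USER_AGENTS)
```

A scraper would call `humanized_delay()` before each request and set `pick_user_agent()` on its session headers.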
Logging
- Separate log files per source
- Rotating file handler (max 10MB, keep 5 backups)
- Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Structured logging with timestamps
Error Handling
- Graceful degradation if source fails
- Retry logic with exponential backoff
- Maximum 3 retries per source
- Continue with other sources on failure
- Alert logging for critical failures
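The retry policy above (exponential backoff, maximum 3 retries) can be sketched as a small wrapper; the injectable `sleep` parameter is an assumption added for testability:

```python
import time

def with_retries(fn, max_retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying up to max_retries times with exponential backoff.

    Delays double each attempt (1 s, 2 s, 4 s by default); the final
    failure re-raises so the runner can log it and continue with other sources.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```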
Containerization & Kubernetes
Docker Requirements
- Multi-stage build for smaller image
- Non-root user execution
- Health checks
- Volume mounts for data persistence
Kubernetes Deployment
- Run on control plane node (node selector)
- CronJob for scheduled execution
- ConfigMap for non-sensitive config
- Secret for credentials
- PersistentVolume for data/logs
- Service account with appropriate permissions
- Resource limits and requests
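A minimal CronJob manifest covering the points above might look like this; the image name, secret name, and exact schedule wiring are assumptions, and the native `timeZone` field requires Kubernetes 1.27 or later:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hkia-content-aggregator   # hypothetical resource name
spec:
  schedule: "0 8,12 * * *"        # 8:00 AM and 12:00 PM
  timeZone: "America/Halifax"     # requires Kubernetes >= 1.27
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          containers:
            - name: aggregator
              image: hkia-content:latest   # hypothetical image name
              envFrom:
                - secretRef:
                    name: hkia-credentials # hypothetical Secret name
          restartPolicy: OnFailure
```

Volume mounts for the data/logs PVCs and resource limits would be added under the container spec.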
Persistent Storage
- PVC for /data directory
- PVC for /logs directory
- HostPath or NFS for NAS access
Testing Strategy
Unit Tests
- Test each scraper independently
- Mock external API calls
- Test state management
- Test markdown conversion
- Test error handling
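A unit test mocking an external call might look like this; `new_entries` is a hypothetical helper with the parser injected (e.g. `feedparser.parse`), which keeps the test free of network access:

```python
from unittest import mock

def new_entries(parse, url: str, seen_ids: set) -> list:
    """Return feed entries not yet seen; `parse` is the injected RSS parser."""
    feed = parse(url)
    return [e for e in feed["entries"] if e["id"] not in seen_ids]

def test_new_entries_skips_seen():
    # Mock stands in for feedparser.parse; no network call is made.
    fake_parse = mock.Mock(return_value={"entries": [
        {"id": "1", "title": "old"},
        {"id": "2", "title": "new"},
    ]})
    result = new_entries(fake_parse, "https://hkia.com/feed/", {"1"})
    assert [e["id"] for e in result] == ["2"]
    fake_parse.assert_called_once_with("https://hkia.com/feed/")
```

State-management and markdown-conversion tests would follow the same dependency-injection pattern.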
Integration Tests
- Test parallel processing
- Test file archiving
- Test rsync functionality
- Test scheduling
End-to-End Tests
- Full workflow with mock data
- Verify markdown output format
- Verify file naming and placement
Monitoring & Maintenance
Health Checks
- Verify each source accessibility
- Check disk space
- Monitor API rate limits
- Log file rotation status
Metrics to Track
- Content items processed per source
- API call counts
- Error rates
- Processing time per source
- Storage usage
Version Control
- Private GitHub repository: https://github.com/bengizmo/hvacknowitall-content.git
- Commit after major milestones
- Semantic versioning
- Comprehensive commit messages
Documentation Files
- README.md: Setup and usage instructions
- claude.md: AI context and implementation notes
- status.md: Current project status and progress
- docs/project_specification.md: This file
- docs/api_documentation.md: API endpoints and responses
- docs/troubleshooting.md: Common issues and solutions