Documentation Updates:
- Updated project specification with hkia naming and paths
- Modified all markdown documentation files (12 files updated)
- Changed service names from hvac-content-* to hkia-content-*
- Updated NAS paths from /mnt/nas/hvacknowitall to /mnt/nas/hkia
- Replaced all instances of "HVAC Know It All" with "HKIA"

Files Updated:
- README.md - Updated service names and commands
- CLAUDE.md - Updated environment variables and paths
- DEPLOY.md - Updated deployment instructions
- docs/project_specification.md - Updated naming convention specs
- docs/status.md - Updated project status with new naming
- docs/final_status.md - Updated completion status
- docs/deployment_strategy.md - Updated deployment paths
- docs/DEPLOYMENT_CHECKLIST.md - Updated checklist items
- docs/PRODUCTION_TODO.md - Updated production tasks
- BACKLOG_STATUS.md - Updated backlog references
- UPDATED_CAPTURE_STATUS.md - Updated capture status
- FINAL_TALLY_REPORT.md - Updated tally report

Notes:
- Repository name remains hvacknowitall-content (unchanged)
- Project directory remains hvac-kia-content (unchanged)
- All user-facing outputs now use clean "hkia" naming

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

# HKIA Content Aggregation System - Project Specification

## Overview
A containerized Python application that aggregates content from multiple HKIA sources, converts it to markdown, and syncs the results to a NAS. The system runs on the control plane node of a Kubernetes cluster.

## Content Sources

### 1. YouTube Channel
- **Fields**: ID, type (video/short/live), link, author, description, likes, comments, views, shares
- **Authentication**: Credentials stored in .env (YOUTUBE_USERNAME, YOUTUBE_PASSWORD)
- **Tool**: yt-dlp
- **Special Requirements**: Humanized behavior, rate limiting

### 2. MailChimp RSS
- **Fields**: ID, title, link, publish date, content
- **URL**: https://hkia.com/feed/
- **Tool**: feedparser

### 3. Podcast RSS
- **Fields**: ID, audio link, author, title, subtitle, pubDate, duration, description, image, episode link
- **URL**: https://hkia.com/podcast/feed/
- **Tool**: feedparser

### 4. WordPress Blog Posts
- **Fields**: ID, title, author, publish date, word count, tags, categories
- **API**: REST API at https://hkia.com/
- **Credentials**: Stored in .env (WORDPRESS_USERNAME, WORDPRESS_API_KEY)

### 5. Instagram
- **Fields**: ID, type (post/story/reel/highlights), publish date, link, author, description, likes, comments, views, shares
- **Authentication**: Credentials stored in .env (INSTAGRAM_USERNAME, INSTAGRAM_PASSWORD)
- **Tool**: instaloader
- **Special Requirements**: Humanized behavior, aggressive rate limiting

## System Requirements

### Scheduling
- Run twice daily: 8:00 AM ADT and 12:00 PM ADT
- Use Atlantic timezone (America/Halifax)
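
The two daily run times can be illustrated with a small helper that finds the next scheduled run. This is a minimal sketch, not the project's scheduler: `next_run`, `RUN_TIMES`, and `ADT` are hypothetical names, and a fixed UTC-4 offset stands in for America/Halifax so the example runs without timezone data.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: a fixed UTC-4 offset stands in for America/Halifax during
# daylight time; real code would use a tz database zone (e.g. via pytz).
ADT = timezone(timedelta(hours=-4), "ADT")

# Run times taken from the schedule above.
RUN_TIMES = (time(8, 0), time(12, 0))


def next_run(now: datetime) -> datetime:
    """Return the first scheduled run strictly after `now` (timezone-aware)."""
    local_now = now.astimezone(ADT)
    for day_offset in (0, 1):
        day = (local_now + timedelta(days=day_offset)).date()
        for run_time in RUN_TIMES:
            candidate = datetime.combine(day, run_time, tzinfo=ADT)
            if candidate > local_now:
                return candidate
    raise RuntimeError("unreachable: a run always exists within two days")
```

In the Kubernetes deployment the CronJob schedule would normally make this calculation unnecessary; the helper is mainly useful for logging or for a local, non-cluster run.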

### Data Processing
1. Check for new content (incremental updates preferred)
2. Spawn parallel processes for each source
3. Convert all content to markdown using MarkItDown
4. Download associated media files
5. Archive previous markdown files
6. Rsync to NAS at /mnt/nas/hkia/
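
Step 5 (archiving previous markdown files) could look roughly like the sketch below. It assumes the `data/markdown_current/` and `data/markdown_archives/<Source>/` layout from the directory structure in this specification; the function name `archive_previous` and the pairing of a file tag such as `blog` with an archive folder such as `WordPress` are illustrative assumptions.

```python
import shutil
from pathlib import Path


def archive_previous(data_dir: Path, source_dir_name: str, file_tag: str) -> list[Path]:
    """Move current markdown files for one source into its archive folder.

    `file_tag` is the <source> part of the file name (e.g. "blog");
    `source_dir_name` is the archive subfolder (e.g. "WordPress").
    Both pairings are assumptions made for this sketch.
    """
    current = data_dir / "markdown_current"
    archive = data_dir / "markdown_archives" / source_dir_name
    archive.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(current.glob(f"*_{file_tag}_*.md")):
        target = archive / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

Running the archive step before writing the new file keeps `markdown_current/` limited to the latest output per source.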

### File Naming Convention
`<brandName>_<source>_<dateTime in Atlantic Timezone>.md`

Example: `hkia_blog_2024-15-01-T143045.md`
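
A file name in this convention could be produced as sketched below. The timestamp layout is inferred from the example only (read here as year-day-month, then `-T` and `HHMMSS`); that reading, the `build_filename` name, and the fixed UTC-4 stand-in for Atlantic time are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the example's "2024-15-01-T143045" is read as
# year-day-month, then "-T" and HHMMSS. Adjust if another order is intended.
TIMESTAMP_FORMAT = "%Y-%d-%m-T%H%M%S"


def build_filename(brand: str, source: str, when: datetime) -> str:
    """Build '<brand>_<source>_<timestamp>.md' from a timezone-aware datetime."""
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime in Atlantic time")
    return f"{brand}_{source}_{when.strftime(TIMESTAMP_FORMAT)}.md"


# Stand-in for America/Halifax during daylight time (assumption).
atlantic = timezone(timedelta(hours=-4))
print(build_filename("hkia", "blog", datetime(2024, 1, 15, 14, 30, 45, tzinfo=atlantic)))
# prints: hkia_blog_2024-15-01-T143045.md
```

Rejecting naive datetimes is a deliberate choice in this sketch: it forces the caller to convert to Atlantic time before the name is built.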

### Directory Structure
```
.env
data/
├── markdown_current/   # Current markdown files
├── markdown_archives/  # Archived markdown files by source
│   ├── WordPress/
│   ├── Instagram/
│   ├── YouTube/
│   ├── Podcast/
│   └── MailChimp/
├── media/              # Downloaded media files by source
│   ├── WordPress/
│   ├── Instagram/
│   ├── YouTube/
│   ├── Podcast/
│   └── MailChimp/
└── .state/             # State files for incremental updates
docs/                   # Documentation
logs/                   # Log files by source with rotation
├── WordPress/
├── Instagram/
├── YouTube/
├── Podcast/
└── MailChimp/
src/                    # Source code
tests/                  # Test files
k8s/                    # Kubernetes manifests
```

### Markdown File Format
```markdown
# ID: [unique_identifier]

## Title: [content_title]

## Type: [content_type]

## Permalink: [url]

## Description:
[content_description]

## Metadata:

### Comments: [count]

### Likes: [count]

### Tags:
- tag1
- tag2

--------------

# ID: [next_item]
...
```
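
The layout above can be generated from a plain dictionary per item. The following is a minimal sketch: the dictionary keys (`id`, `title`, `type`, `permalink`, `description`, `comments`, `likes`, `tags`), the function names, and the default of `0` for missing counts are assumptions, not the project's actual data model.

```python
def render_item(item: dict) -> str:
    """Render one content item using the heading layout shown above."""
    lines = [
        f"# ID: {item['id']}",
        "",
        f"## Title: {item['title']}",
        "",
        f"## Type: {item['type']}",
        "",
        f"## Permalink: {item['permalink']}",
        "",
        "## Description:",
        item.get("description", ""),
        "",
        "## Metadata:",
        "",
        f"### Comments: {item.get('comments', 0)}",  # assumed default
        "",
        f"### Likes: {item.get('likes', 0)}",  # assumed default
        "",
        "### Tags:",
    ]
    lines += [f"- {tag}" for tag in item.get("tags", [])]
    return "\n".join(lines)


# Separator line copied from the format above, padded with blank lines.
SEPARATOR = "\n\n--------------\n\n"


def render_items(items: list[dict]) -> str:
    """Join rendered items with the separator line from the format above."""
    return SEPARATOR.join(render_item(item) for item in items) + "\n"
```

Keeping rendering separate from fetching means the same function can serve every source, with source-specific fields mapped into the dictionary beforehand.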

## Technical Implementation

### Development Approach
- Test-Driven Development (TDD)
- Python with UV package manager
- Abstract base class for content sources
- Parallel processing using multiprocessing
- State management with JSON files
- Comprehensive error handling with exponential backoff
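
The abstract base class and JSON state management listed above might fit together as in the sketch below. Every name here (`ContentSource`, `fetch`, `fetch_new`, the `seen_ids` key, one state file per source) is hypothetical and chosen for illustration; the specification does not define the interface.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class ContentSource(ABC):
    """Hypothetical base class; names here are illustrative, not the project's."""

    name: str = "base"

    def __init__(self, state_dir: Path) -> None:
        # Assumption: one JSON state file per source under data/.state/.
        self.state_file = state_dir / f"{self.name}.json"

    def load_state(self) -> dict:
        """Return saved state, or an empty dict on the first run."""
        if self.state_file.exists():
            return json.loads(self.state_file.read_text(encoding="utf-8"))
        return {}

    def save_state(self, state: dict) -> None:
        """Write state as JSON, creating the state directory if needed."""
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        self.state_file.write_text(json.dumps(state, indent=2), encoding="utf-8")

    @abstractmethod
    def fetch(self) -> list[dict]:
        """Return all items currently visible at the source."""

    def fetch_new(self) -> list[dict]:
        """Return only unseen items and record their IDs for the next run."""
        state = self.load_state()
        seen = set(state.get("seen_ids", []))
        new_items = [item for item in self.fetch() if item["id"] not in seen]
        state["seen_ids"] = sorted(seen | {item["id"] for item in new_items})
        self.save_state(state)
        return new_items
```

Each concrete source would subclass this and implement only `fetch`, which keeps the incremental-update logic in one place and makes it easy to test with a fake source.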

### Key Python Packages
- requests: API calls
- feedparser: RSS parsing
- yt-dlp: YouTube content
- instaloader: Instagram content
- markitdown: Markdown conversion
- python-dotenv: Environment management
- schedule: Task scheduling
- pytest: Testing framework
- pytz: Timezone handling

### Security & Rate Limiting
- Credentials stored in .env file
- Humanized behavior for YouTube/Instagram:
  - Random delays between requests (2-10 seconds)
  - Exponential backoff on errors
  - User-agent rotation
  - Session management
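
The random 2-10 second delay between requests could be implemented as a small helper like the one below. This is a sketch under assumptions: `humanized_pause` is a hypothetical name, and the injectable `sleep` and `rng` parameters exist only so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable, Optional


def humanized_pause(
    min_seconds: float = 2.0,
    max_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> float:
    """Sleep for a random interval in [min_seconds, max_seconds] and return it."""
    rng = rng or random.Random()
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

A scraper would call `humanized_pause()` between consecutive requests to the same service; a wider range could be passed for Instagram, where the specification asks for more aggressive rate limiting.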

### Logging
- Separate log files per source
- Rotating file handler (max 10MB, keep 5 backups)
- Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Structured logging with timestamps
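
The per-source rotating log setup described above maps directly onto the standard library's `RotatingFileHandler`. In this sketch the function name `get_source_logger`, the `hkia.<source>` logger name, the `<source>.log` file name, and the log line format are assumptions; the 10 MB limit and 5 backups come from the list above.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_source_logger(source: str, log_root: Path, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes to <log_root>/<source>/<source>.log with rotation."""
    log_dir = log_root / source
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"hkia.{source}")
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            log_dir / f"{source}.log",
            maxBytes=10 * 1024 * 1024,  # 10 MB per file
            backupCount=5,              # keep 5 rotated backups
            encoding="utf-8",
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

For example, `get_source_logger("YouTube", Path("logs"))` would write to `logs/YouTube/YouTube.log`, matching the per-source folders in the directory structure.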

### Error Handling
- Graceful degradation if source fails
- Retry logic with exponential backoff
- Maximum 3 retries per source
- Continue with other sources on failure
- Alert logging for critical failures
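
The retry rules above can be sketched as a wrapper around each source's task. Several details are assumptions: "maximum 3 retries" is read as one initial attempt plus up to three retries, the 1-second base delay is not specified anywhere, and the names `run_with_retries` and `run_all` are hypothetical.

```python
import logging
import time
from typing import Callable, Optional

MAX_RETRIES = 3  # from the list above


def run_with_retries(
    name: str,
    task: Callable[[], list],
    base_delay: float = 1.0,  # assumption: the base interval is not specified
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[list]:
    """Run `task`, retrying up to MAX_RETRIES times with exponential backoff.

    Returns the task result, or None if every attempt failed.
    """
    log = logging.getLogger(name)
    for attempt in range(MAX_RETRIES + 1):  # one initial try plus the retries
        try:
            return task()
        except Exception as exc:
            if attempt == MAX_RETRIES:
                log.critical("%s failed after %d retries: %s", name, MAX_RETRIES, exc)
                return None
            delay = base_delay * 2 ** attempt  # 1x, 2x, 4x the base delay
            log.warning("%s attempt %d failed, retrying in %.1fs", name, attempt + 1, delay)
            sleep(delay)
    return None


def run_all(sources: dict, sleep: Callable[[float], None] = time.sleep) -> dict:
    """Run every source; a failing source yields None and does not stop the rest."""
    return {name: run_with_retries(name, task, sleep=sleep) for name, task in sources.items()}
```

Returning `None` for a failed source, rather than raising, is what lets the remaining sources continue; the critical log entry is where an alert would hook in.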

## Containerization & Kubernetes

### Docker Requirements
- Multi-stage build for smaller image
- Non-root user execution
- Health checks
- Volume mounts for data persistence

### Kubernetes Deployment
- Run on control plane node (node selector)
- CronJob for scheduled execution
- ConfigMap for non-sensitive config
- Secret for credentials
- PersistentVolume for data/logs
- Service account with appropriate permissions
- Resource limits and requests

### Persistent Storage
- PVC for /data directory
- PVC for /logs directory
- HostPath or NFS for NAS access

## Testing Strategy

### Unit Tests
- Test each scraper independently
- Mock external API calls
- Test state management
- Test markdown conversion
- Test error handling

### Integration Tests
- Test parallel processing
- Test file archiving
- Test rsync functionality
- Test scheduling

### End-to-End Tests
- Full workflow with mock data
- Verify markdown output format
- Verify file naming and placement

## Monitoring & Maintenance

### Health Checks
- Verify each source accessibility
- Check disk space
- Monitor API rate limits
- Log file rotation status
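
Of the checks above, the disk space check is the simplest to show with the standard library alone. This is a minimal sketch: `disk_space_ok` is a hypothetical name, and the specification does not state a threshold, so the caller supplies one.

```python
import shutil
from pathlib import Path


def disk_space_ok(path: Path, min_free_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `min_free_bytes` free."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

A health check might call `disk_space_ok(Path("/data"), 1024 ** 3)` to require 1 GiB free on the data volume; that figure is only an example.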

### Metrics to Track
- Content items processed per source
- API call counts
- Error rates
- Processing time per source
- Storage usage

## Version Control
- Private GitHub repository: https://github.com/bengizmo/hvacknowitall-content.git
- Commit after major milestones
- Semantic versioning
- Comprehensive commit messages

## Documentation Files
- `README.md`: Setup and usage instructions
- `claude.md`: AI context and implementation notes
- `status.md`: Current project status and progress
- `docs/project_specification.md`: This file
- `docs/api_documentation.md`: API endpoints and responses
- `docs/troubleshooting.md`: Common issues and solutions