# HKIA Content Aggregation System - Project Specification

## Overview
A containerized Python application that aggregates content from multiple HKIA sources, converts it to Markdown, and syncs the results to a NAS. The system runs as a scheduled job on the control plane node of a Kubernetes cluster.

## Content Sources

### 1. YouTube Channel
- **Fields**: ID, type (video/short/live), link, author, description, likes, comments, views, shares
- **Authentication**: Credentials stored in .env (YOUTUBE_USERNAME, YOUTUBE_PASSWORD)
- **Tool**: yt-dlp
- **Special Requirements**: Humanized behavior, rate limiting

### 2. MailChimp RSS
- **Fields**: ID, title, link, publish date, content
- **URL**: https://hkia.com/feed/
- **Tool**: feedparser

### 3. Podcast RSS
- **Fields**: ID, audio link, author, title, subtitle, pubDate, duration, description, image, episode link
- **URL**: https://hkia.com/podcast/feed/
- **Tool**: feedparser

### 4. WordPress Blog Posts
- **Fields**: ID, title, author, publish date, word count, tags, categories
- **API**: REST API at https://hkia.com/
- **Credentials**: Stored in .env (WORDPRESS_USERNAME, WORDPRESS_API_KEY)

### 5. Instagram
- **Fields**: ID, type (post/story/reel/highlights), publish date, link, author, description, likes, comments, views, shares
- **Authentication**: Credentials stored in .env (INSTAGRAM_USERNAME, INSTAGRAM_PASSWORD)
- **Tool**: instaloader
- **Special Requirements**: Humanized behavior, aggressive rate limiting

## System Requirements

### Scheduling
- Run twice daily at 8:00 AM and 12:00 PM Atlantic time (ADT in summer, AST in winter)
- Use Atlantic timezone (America/Halifax)
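
As a rough illustration of the schedule above, the sketch below computes the next run from a timezone-aware "now". It is only a sketch under assumptions: `next_run` and `halifax_now` are hypothetical helper names, and it uses the standard-library `zoneinfo` module rather than the `pytz` package listed later in this specification.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

# Run times from the spec, interpreted as local wall-clock times.
RUN_TIMES = (time(8, 0), time(12, 0))


def next_run(now: datetime) -> datetime:
    """Return the first scheduled run strictly after `now` (must be tz-aware)."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    for day_offset in (0, 1):
        day = (now + timedelta(days=day_offset)).date()
        for run_time in RUN_TIMES:
            candidate = datetime.combine(day, run_time, tzinfo=now.tzinfo)
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: tomorrow's 08:00 is always later")


def halifax_now() -> datetime:
    """Current time in the Atlantic timezone named in the spec."""
    return datetime.now(ZoneInfo("America/Halifax"))
```

Calling `next_run(halifax_now())` would give the next 8:00 AM or 12:00 PM slot in Atlantic time, whichever comes first.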

### Data Processing
1. Check for new content (incremental updates preferred)
2. Spawn parallel processes for each source
3. Convert all content to markdown using MarkItDown
4. Download associated media files
5. Archive previous markdown files
6. Rsync to NAS at /mnt/nas/hkia/
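
Steps 5 and 6 could look roughly like the following. This is a minimal sketch, not the project's actual code: `archive_current` and `rsync_command` are hypothetical names, the filename prefix used to select a source's files is an assumption, and the `-a` (archive mode) flag is one plausible choice since the spec does not name rsync options.

```python
import shutil
from pathlib import Path


def archive_current(data_dir: Path, source: str, prefix: str) -> list[Path]:
    """Move a source's files from markdown_current/ to markdown_archives/<source>/."""
    current = data_dir / "markdown_current"
    archive = data_dir / "markdown_archives" / source
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for md_file in sorted(current.glob(f"{prefix}*.md")):
        target = archive / md_file.name
        shutil.move(str(md_file), str(target))
        moved.append(target)
    return moved


def rsync_command(data_dir: Path, nas_dir: str = "/mnt/nas/hkia/") -> list[str]:
    """Build the rsync argument list; pass it to subprocess.run to execute."""
    # The trailing slash on the source copies the directory's contents,
    # not the directory itself.
    return ["rsync", "-a", f"{data_dir}/", nas_dir]
```

Returning the command as a list keeps the sync step easy to unit test without a NAS attached.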

### File Naming Convention
`<brandName>_<source>_<dateTime in Atlantic Timezone>.md`
Example: `hkia_blog_2024-15-01-T143045.md`
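
A small sketch of the naming convention follows. One assumption is worth flagging: the date portion of the example (`2024-15-01`) only parses as year-day-month, so the sketch reproduces that ordering. If ISO order was intended, the format string would be `%Y-%m-%d-T%H%M%S` instead. `build_filename` is a hypothetical helper name.

```python
from datetime import datetime

# Assumption: the spec's example date "2024-15-01" is year-day-month.
TIMESTAMP_FORMAT = "%Y-%d-%m-T%H%M%S"


def build_filename(brand: str, source: str, when: datetime) -> str:
    """Build `<brandName>_<source>_<dateTime>.md`.

    `when` is expected to already be expressed in Atlantic time.
    """
    return f"{brand}_{source}_{when.strftime(TIMESTAMP_FORMAT)}.md"
```

With these assumptions, `build_filename("hkia", "blog", datetime(2024, 1, 15, 14, 30, 45))` yields the example filename shown above.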

### Directory Structure
```
.env
data/
├── markdown_current/       # Current markdown files
├── markdown_archives/      # Archived markdown files by source
│   ├── WordPress/
│   ├── Instagram/
│   ├── YouTube/
│   ├── Podcast/
│   └── MailChimp/
├── media/                  # Downloaded media files by source
│   ├── WordPress/
│   ├── Instagram/
│   ├── YouTube/
│   ├── Podcast/
│   └── MailChimp/
└── .state/                # State files for incremental updates
docs/                      # Documentation
logs/                      # Log files by source with rotation
├── WordPress/
├── Instagram/
├── YouTube/
├── Podcast/
└── MailChimp/
src/                       # Source code
tests/                     # Test files
k8s/                       # Kubernetes manifests
```

### Markdown File Format
```markdown
# ID: [unique_identifier]

## Title: [content_title]

## Type: [content_type]

## Permalink: [url]

## Description:
[content_description]

## Metadata:

### Comments: [count]

### Likes: [count]

### Tags:
- tag1
- tag2

--------------

# ID: [next_item]
...
```
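
The template above could be rendered with something like the sketch below. `ContentItem`, `render_item`, and `render_document` are illustrative names, not part of the specification, and only the fields shown in the template are covered.

```python
from dataclasses import dataclass, field

SEPARATOR = "-" * 14  # same length as the separator line in the template


@dataclass
class ContentItem:
    """Hypothetical container for one aggregated item."""
    id: str
    title: str
    type: str
    permalink: str
    description: str = ""
    comments: int = 0
    likes: int = 0
    tags: list[str] = field(default_factory=list)


def render_item(item: ContentItem) -> str:
    """Render a single item in the spec's heading layout."""
    lines = [
        f"# ID: {item.id}",
        "",
        f"## Title: {item.title}",
        "",
        f"## Type: {item.type}",
        "",
        f"## Permalink: {item.permalink}",
        "",
        "## Description:",
        item.description,
        "",
        "## Metadata:",
        "",
        f"### Comments: {item.comments}",
        "",
        f"### Likes: {item.likes}",
        "",
        "### Tags:",
    ]
    lines.extend(f"- {tag}" for tag in item.tags)
    return "\n".join(lines)


def render_document(items: list[ContentItem]) -> str:
    """Join rendered items with the separator line between them."""
    return f"\n\n{SEPARATOR}\n\n".join(render_item(i) for i in items) + "\n"
```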

## Technical Implementation

### Development Approach
- Test-Driven Development (TDD)
- Python with UV package manager
- Abstract base class for content sources
- Parallel processing using multiprocessing
- State management with JSON files
- Comprehensive error handling with exponential backoff
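
To make the base-class and JSON-state ideas concrete, here is one possible shape. Everything named here (`ContentSource`, `fetch`, `fetch_new`, the `seen_ids` key, and the `StaticSource` stand-in) is hypothetical; the specification only calls for an abstract base class and JSON state files.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class ContentSource(ABC):
    """Hypothetical base class; names and methods are illustrative only."""

    name: str = "base"

    def __init__(self, state_dir: Path) -> None:
        self.state_path = state_dir / f"{self.name}.json"

    def load_state(self) -> dict:
        """Read the JSON state file, or start fresh if none exists."""
        if self.state_path.exists():
            return json.loads(self.state_path.read_text())
        return {"seen_ids": []}

    def save_state(self, state: dict) -> None:
        self.state_path.parent.mkdir(parents=True, exist_ok=True)
        self.state_path.write_text(json.dumps(state, indent=2))

    @abstractmethod
    def fetch(self) -> list[dict]:
        """Return every item currently visible at the source."""

    def fetch_new(self) -> list[dict]:
        """Incremental update: return only items not recorded in state."""
        state = self.load_state()
        seen = set(state["seen_ids"])
        new_items = [item for item in self.fetch() if item["id"] not in seen]
        state["seen_ids"] = sorted(seen | {item["id"] for item in new_items})
        self.save_state(state)
        return new_items


class StaticSource(ContentSource):
    """Stand-in for demonstration; a real source would call an API or feed."""

    name = "static"

    def __init__(self, state_dir: Path, items: list[dict]) -> None:
        super().__init__(state_dir)
        self.items = items

    def fetch(self) -> list[dict]:
        return list(self.items)
```

Keeping the state logic in the base class means each concrete scraper only has to implement `fetch`, which also makes it straightforward to mock in unit tests.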

### Key Python Packages
- requests: API calls
- feedparser: RSS parsing
- yt-dlp: YouTube content
- instaloader: Instagram content
- markitdown: Markdown conversion
- python-dotenv: Environment management
- schedule: Task scheduling
- pytest: Testing framework
- pytz: Timezone handling

### Security & Rate Limiting
- Credentials stored in .env file
- Humanized behavior for YouTube/Instagram:
  - Random delays between requests (2-10 seconds)
  - Exponential backoff on errors
  - User-agent rotation
  - Session management
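
The random-delay requirement might be implemented along these lines. This is a sketch only: `HumanizedPacer` is a hypothetical name, and the injectable `sleep` and `rng` parameters are a design assumption made so the behavior can be tested without real waiting.

```python
import random
import time
from typing import Callable, Optional


class HumanizedPacer:
    """Pause for a random 2-10 second interval between requests."""

    def __init__(
        self,
        low: float = 2.0,
        high: float = 10.0,
        rng: Optional[random.Random] = None,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.low = low
        self.high = high
        self.rng = rng or random.Random()
        self.sleep = sleep

    def wait(self) -> float:
        """Sleep for a random delay and return the number of seconds used."""
        delay = self.rng.uniform(self.low, self.high)
        self.sleep(delay)
        return delay
```

A scraper would call `pacer.wait()` before each request to YouTube or Instagram; Instagram's "aggressive" limiting could use a wider window by passing larger `low` and `high` values.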

### Logging
- Separate log files per source
- Rotating file handler (max 10MB, keep 5 backups)
- Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Structured logging with timestamps
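
A per-source logger matching these settings can be sketched with the standard library's `RotatingFileHandler`. The function name `get_source_logger`, the `hkia.<source>` logger naming, the log filename, and the format string are assumptions for illustration.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_source_logger(source: str, log_root: Path = Path("logs")) -> logging.Logger:
    """Return a logger writing to logs/<source>/ with 10 MB rotation, 5 backups."""
    log_dir = log_root / source
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"hkia.{source}")
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            log_dir / f"{source.lower()}.log",
            maxBytes=10 * 1024 * 1024,  # max 10MB per file
            backupCount=5,              # keep 5 rotated backups
            encoding="utf-8",
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```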

### Error Handling
- Graceful degradation: if one source fails, continue with the others
- Retry logic with exponential backoff
- Maximum 3 retries per source
- Alert logging for critical failures
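
One way to read these rules in code is shown below. It assumes "maximum 3 retries" means one initial attempt plus up to three retries, and that the backoff starts at 1 second and doubles each time; neither detail is stated in the spec. `run_with_retries` and `run_all` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `task`, retrying up to `max_retries` times with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted; let the caller record the failure
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


def run_all(
    sources: dict[str, Callable[[], T]], **retry_kwargs
) -> tuple[dict[str, T], dict[str, Exception]]:
    """Run every source; a failing source does not stop the others."""
    results: dict[str, T] = {}
    failures: dict[str, Exception] = {}
    for name, task in sources.items():
        try:
            results[name] = run_with_retries(task, **retry_kwargs)
        except Exception as exc:
            failures[name] = exc
    return results, failures
```

The `failures` mapping gives the caller a single place to emit alert-level log entries once all sources have been attempted.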

## Containerization & Kubernetes

### Docker Requirements
- Multi-stage build for smaller image
- Non-root user execution
- Health checks
- Volume mounts for data persistence

### Kubernetes Deployment
- Run on control plane node (node selector)
- CronJob for scheduled execution
- ConfigMap for non-sensitive config
- Secret for credentials
- PersistentVolume for data/logs
- Service account with appropriate permissions
- Resource limits and requests

### Persistent Storage
- PVC for /data directory
- PVC for /logs directory
- HostPath or NFS for NAS access

## Testing Strategy

### Unit Tests
- Test each scraper independently
- Mock external API calls
- Test state management
- Test markdown conversion
- Test error handling

### Integration Tests
- Test parallel processing
- Test file archiving
- Test rsync functionality
- Test scheduling

### End-to-End Tests
- Full workflow with mock data
- Verify markdown output format
- Verify file naming and placement

## Monitoring & Maintenance

### Health Checks
- Verify each source accessibility
- Check disk space
- Monitor API rate limits
- Log file rotation status
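
The disk space check, for instance, could be as small as the sketch below. The 10% free-space threshold and the name `disk_space_ok` are assumptions; the spec does not set a limit.

```python
import shutil


def disk_space_ok(path: str, min_free_fraction: float = 0.10) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False  # cannot compute a ratio; treat as unhealthy
    return usage.free / usage.total >= min_free_fraction
```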

### Metrics to Track
- Content items processed per source
- API call counts
- Error rates
- Processing time per source
- Storage usage

## Version Control
- Private GitHub repository: https://github.com/bengizmo/hvacknowitall-content.git
- Commit after major milestones
- Semantic versioning
- Comprehensive commit messages

## Documentation Files
- `README.md`: Setup and usage instructions
- `claude.md`: AI context and implementation notes
- `status.md`: Current project status and progress
- `docs/project_specification.md`: This file
- `docs/api_documentation.md`: API endpoints and responses
- `docs/troubleshooting.md`: Common issues and solutions