HKIA Content Aggregation System - Project Specification
Overview
A containerized Python application that aggregates content from multiple HKIA sources, converts it to markdown, and syncs it to a NAS. The system runs on the control plane node of a Kubernetes cluster.
Content Sources
1. YouTube Channel
- Fields: ID, type (video/short/live), link, author, description, likes, comments, views, shares
- Authentication: Credentials stored in .env (YOUTUBE_USERNAME, YOUTUBE_PASSWORD)
- Tool: yt-dlp
- Special Requirements: Humanized behavior, rate limiting
2. MailChimp RSS
- Fields: ID, title, link, publish date, content
- URL: https://hkia.com/feed/
- Tool: feedparser
3. Podcast RSS
- Fields: ID, audio link, author, title, subtitle, pubDate, duration, description, image, episode link
- URL: https://hkia.com/podcast/feed/
- Tool: feedparser
4. WordPress Blog Posts
- Fields: ID, title, author, publish date, word count, tags, categories
- API: REST API at https://hkia.com/
- Credentials: Stored in .env (WORDPRESS_USERNAME, WORDPRESS_API_KEY)
5. Instagram
- Fields: ID, type (post/story/reel/highlights), publish date, link, author, description, likes, comments, views, shares
- Authentication: Credentials stored in .env (INSTAGRAM_USERNAME, INSTAGRAM_PASSWORD)
- Tool: instaloader
- Special Requirements: Humanized behavior, aggressive rate limiting
System Requirements
Scheduling
- Run twice daily: 8:00 AM ADT and 12:00 PM ADT
- Use Atlantic timezone (America/Halifax)
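The twice-daily Atlantic-time cadence can be computed with the standard library alone; `next_run` below is a hypothetical helper sketching the idea, not part of the spec:

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

ATLANTIC = ZoneInfo("America/Halifax")
RUN_TIMES = (time(8, 0), time(12, 0))  # 8:00 AM and 12:00 PM Atlantic

def next_run(now: datetime) -> datetime:
    """Return the next scheduled run strictly after `now`, in Atlantic time."""
    local = now.astimezone(ATLANTIC)
    for day_offset in (0, 1):
        day = local.date() + timedelta(days=day_offset)
        for t in RUN_TIMES:
            candidate = datetime.combine(day, t, tzinfo=ATLANTIC)
            if candidate > local:
                return candidate
    raise RuntimeError("unreachable")
```

A scheduler loop would sleep until `next_run(datetime.now(ATLANTIC))` and then dispatch the scrapers.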
Data Processing
- Check for new content (incremental updates preferred)
- Spawn parallel processes for each source
- Convert all content to markdown using MarkItDown
- Download associated media files
- Archive previous markdown files
- Rsync to NAS at /mnt/nas/hkia/
File Naming Convention
<brandName>_<source>_<dateTime in Atlantic Timezone>.md
Example: hkia_blog_2024-01-15T143045.md
Directory Structure
.env
data/
├── markdown_current/ # Current markdown files
├── markdown_archives/ # Archived markdown files by source
│ ├── WordPress/
│ ├── Instagram/
│ ├── YouTube/
│ ├── Podcast/
│ └── MailChimp/
├── media/ # Downloaded media files by source
│ ├── WordPress/
│ ├── Instagram/
│ ├── YouTube/
│ ├── Podcast/
│ └── MailChimp/
└── .state/ # State files for incremental updates
docs/ # Documentation
logs/ # Log files by source with rotation
├── WordPress/
├── Instagram/
├── YouTube/
├── Podcast/
└── MailChimp/
src/ # Source code
tests/ # Test files
k8s/ # Kubernetes manifests
Markdown File Format
# ID: [unique_identifier]
## Title: [content_title]
## Type: [content_type]
## Permalink: [url]
## Description:
[content_description]
## Metadata:
### Comments: [count]
### Likes: [count]
### Tags:
- tag1
- tag2
--------------
# ID: [next_item]
...
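The block format above can be serialized from a plain dict; the field names mirror the template, but the dict shape itself is an assumption about the internal data model:

```python
def render_item(item: dict) -> str:
    """Serialize one content item into the spec's markdown block format."""
    lines = [
        f"# ID: {item['id']}",
        f"## Title: {item['title']}",
        f"## Type: {item['type']}",
        f"## Permalink: {item['permalink']}",
        "## Description:",
        item.get("description", ""),
        "## Metadata:",
        f"### Comments: {item.get('comments', 0)}",
        f"### Likes: {item.get('likes', 0)}",
        "### Tags:",
    ]
    lines += [f"- {tag}" for tag in item.get("tags", [])]
    lines.append("-" * 14)  # item separator per the template
    return "\n".join(lines)
```

Multiple items in one file are just consecutive blocks joined by the separator line.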
Technical Implementation
Development Approach
- Test-Driven Development (TDD)
- Python with UV package manager
- Abstract base class for content sources
- Parallel processing using multiprocessing
- State management with JSON files
- Comprehensive error handling with exponential backoff
Key Python Packages
- requests: API calls
- feedparser: RSS parsing
- yt-dlp: YouTube content
- instaloader: Instagram content
- markitdown: Markdown conversion
- python-dotenv: Environment management
- schedule: Task scheduling
- pytest: Testing framework
- pytz: Timezone handling
Security & Rate Limiting
- Credentials stored in .env file
- Humanized behavior for YouTube/Instagram:
- Random delays between requests (2-10 seconds)
- Exponential backoff on errors
- User-agent rotation
- Session management
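The humanized-behavior requirements above could be sketched as small helpers; the user-agent strings and bounds are illustrative defaults, with 2-10 s taken from the spec:

```python
import random
import time

# A few desktop user-agent strings; the rotation set is an assumption.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def humanized_delay(minimum: float = 2.0, maximum: float = 10.0) -> float:
    """Sleep a random 2-10 s between requests; returns the delay chosen."""
    delay = random.uniform(minimum, maximum)
    time.sleep(delay)
    return delay

def pick_user_agent() -> str:
    """Rotate user agents across requests."""
    return random.choice(USER_AGENTS)
```

A scraper would call `humanized_delay()` before each request and set `pick_user_agent()` on its session headers.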
Logging
- Separate log files per source
- Rotating file handler (max 10MB, keep 5 backups)
- Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
- Structured logging with timestamps
Error Handling
- Graceful degradation if source fails
- Retry logic with exponential backoff
- Maximum 3 retries per source
- Continue with other sources on failure
- Alert logging for critical failures
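The retry policy above (exponential backoff, maximum 3 retries) can be sketched as a small wrapper; the injectable `sleep` parameter is an assumption added for testability:

```python
import time

def with_retries(fn, max_retries: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying up to max_retries times with exponential backoff.

    Delays double each attempt (1 s, 2 s, 4 s by default); the final
    failure re-raises so the runner can log it and continue with other sources.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```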
Containerization & Kubernetes
Docker Requirements
- Multi-stage build for smaller image
- Non-root user execution
- Health checks
- Volume mounts for data persistence
Kubernetes Deployment
- Run on control plane node (node selector)
- CronJob for scheduled execution
- ConfigMap for non-sensitive config
- Secret for credentials
- PersistentVolume for data/logs
- Service account with appropriate permissions
- Resource limits and requests
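A minimal CronJob manifest covering the points above might look like this; the image name, secret name, and exact schedule wiring are assumptions, and the native `timeZone` field requires Kubernetes 1.27 or later:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hkia-content-aggregator   # hypothetical resource name
spec:
  schedule: "0 8,12 * * *"        # 8:00 AM and 12:00 PM
  timeZone: "America/Halifax"     # requires Kubernetes >= 1.27
  jobTemplate:
    spec:
      template:
        spec:
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          containers:
            - name: aggregator
              image: hkia-content:latest   # hypothetical image name
              envFrom:
                - secretRef:
                    name: hkia-credentials # hypothetical Secret name
          restartPolicy: OnFailure
```

Volume mounts for the data/logs PVCs and resource limits would be added under the container spec.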
Persistent Storage
- PVC for /data directory
- PVC for /logs directory
- HostPath or NFS for NAS access
Testing Strategy
Unit Tests
- Test each scraper independently
- Mock external API calls
- Test state management
- Test markdown conversion
- Test error handling
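A unit test mocking an external call might look like this; `new_entries` is a hypothetical helper with the parser injected (e.g. `feedparser.parse`), which keeps the test free of network access:

```python
from unittest import mock

def new_entries(parse, url: str, seen_ids: set) -> list:
    """Return feed entries not yet seen; `parse` is the injected RSS parser."""
    feed = parse(url)
    return [e for e in feed["entries"] if e["id"] not in seen_ids]

def test_new_entries_skips_seen():
    # Mock stands in for feedparser.parse; no network call is made.
    fake_parse = mock.Mock(return_value={"entries": [
        {"id": "1", "title": "old"},
        {"id": "2", "title": "new"},
    ]})
    result = new_entries(fake_parse, "https://hkia.com/feed/", {"1"})
    assert [e["id"] for e in result] == ["2"]
    fake_parse.assert_called_once_with("https://hkia.com/feed/")
```

State-management and markdown-conversion tests would follow the same dependency-injection pattern.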
Integration Tests
- Test parallel processing
- Test file archiving
- Test rsync functionality
- Test scheduling
End-to-End Tests
- Full workflow with mock data
- Verify markdown output format
- Verify file naming and placement
Monitoring & Maintenance
Health Checks
- Verify each source accessibility
- Check disk space
- Monitor API rate limits
- Log file rotation status
Metrics to Track
- Content items processed per source
- API call counts
- Error rates
- Processing time per source
- Storage usage
Version Control
- Private GitHub repository: https://github.com/bengizmo/hvacknowitall-content.git
- Commit after major milestones
- Semantic versioning
- Comprehensive commit messages
Documentation Files
- README.md: Setup and usage instructions
- claude.md: AI context and implementation notes
- status.md: Current project status and progress
- docs/project_specification.md: This file
- docs/api_documentation.md: API endpoints and responses
- docs/troubleshooting.md: Common issues and solutions