hvac-kia-content/docs/project_specification.md
Ben Reed ccdb9366db docs: Update GitHub references to point to local Forgejo server
- Updated repository URLs in PRODUCTION_GUIDE.md
- Updated project specification repository reference
- Updated rollback and deployment documentation
- All references now point to git.tealmaker.com/ben/hvac-kia-content.git
2025-08-27 16:07:07 -03:00

HKIA Content Aggregation System - Project Specification

Overview

A containerized Python application that aggregates content from multiple HKIA sources, converts it to Markdown, and syncs the results to a NAS. The system runs on the control plane node of a Kubernetes cluster.

Content Sources

1. YouTube Channel

  • Fields: ID, type (video/short/live), link, author, description, likes, comments, views, shares
  • Authentication: Credentials stored in .env (YOUTUBE_USERNAME, YOUTUBE_PASSWORD)
  • Tool: yt-dlp
  • Special Requirements: Humanized behavior, rate limiting

2. MailChimp RSS

3. Podcast RSS

  • Fields: ID, audio link, author, title, subtitle, pubDate, duration, description, image, episode link
  • URL: https://hkia.com/podcast/feed/
  • Tool: feedparser

4. WordPress Blog Posts

  • Fields: ID, title, author, publish date, word count, tags, categories
  • API: REST API at https://hkia.com/
  • Credentials: Stored in .env (WORDPRESS_USERNAME, WORDPRESS_API_KEY)

5. Instagram

  • Fields: ID, type (post/story/reel/highlights), publish date, link, author, description, likes, comments, views, shares
  • Authentication: Credentials stored in .env (INSTAGRAM_USERNAME, INSTAGRAM_PASSWORD)
  • Tool: instaloader
  • Special Requirements: Humanized behavior, aggressive rate limiting

System Requirements

Scheduling

  • Run twice daily: 8:00 AM and 12:00 PM Atlantic time
  • Use the Atlantic timezone (America/Halifax) so run times follow local time in both ADT and AST
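
As a rough illustration of the schedule above, the sketch below computes the next run time in America/Halifax. It uses the standard-library zoneinfo module as a stand-in for pytz, and the names ATLANTIC, RUN_TIMES, and next_run are hypothetical, not part of the project.

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

# Assumption: both daily runs are expressed in local Atlantic time.
ATLANTIC = ZoneInfo("America/Halifax")
RUN_TIMES = (time(8, 0), time(12, 0))


def next_run(now: datetime) -> datetime:
    """Return the first scheduled run strictly after `now` (must be timezone-aware)."""
    local_now = now.astimezone(ATLANTIC)
    # Today may still have a run left; otherwise tomorrow's first run applies.
    for day_offset in (0, 1):
        day = local_now.date() + timedelta(days=day_offset)
        for run_time in RUN_TIMES:
            candidate = datetime.combine(day, run_time, tzinfo=ATLANTIC)
            if candidate > local_now:
                return candidate
    raise RuntimeError("unreachable: a run always exists within two days")
```

Because the zone is named rather than given as a fixed offset, the same code yields 8:00 AM local time whether daylight saving is in effect or not.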

Data Processing

  1. Check for new content (incremental updates preferred)
  2. Spawn parallel processes for each source
  3. Convert all content to markdown using MarkItDown
  4. Download associated media files
  5. Archive previous markdown files
  6. Rsync to NAS at /mnt/nas/hkia/
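
Step 2 above could be sketched roughly as follows, using concurrent.futures.ProcessPoolExecutor (which is built on multiprocessing). The function name run_sources and the shape of the sources mapping are assumptions made for illustration; each value is presumed to be a picklable, module-level callable that returns the number of items it fetched.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, as_completed
from typing import Callable, Dict, Type


def run_sources(
    sources: Dict[str, Callable[[], int]],
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> Dict[str, object]:
    """Run every source concurrently.

    Returns a mapping of source name to either the item count or the
    exception that source raised.
    """
    results: Dict[str, object] = {}
    with executor_cls(max_workers=max(1, len(sources))) as pool:
        futures = {pool.submit(fetch): name for name, fetch in sources.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # A failing source is recorded, not re-raised, so the
                # remaining sources still complete.
                results[name] = exc
    return results
```

The executor class is a parameter so that tests can substitute ThreadPoolExecutor and avoid spawning real processes.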

File Naming Convention

<brandName>_<source>_<dateTime in Atlantic Timezone>.md

Example: hkia_blog_2024-01-15-T143045.md
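
A minimal sketch of a filename builder for this convention follows. It assumes the timestamp is year-month-day followed by -T and a 24-hour time, which is one reading of the example; build_filename is a hypothetical helper name.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

ATLANTIC = ZoneInfo("America/Halifax")


def build_filename(brand: str, source: str, when: datetime) -> str:
    """Build <brand>_<source>_<timestamp>.md with the timestamp in Atlantic time.

    `when` must be timezone-aware; it is converted to America/Halifax first.
    """
    local = when.astimezone(ATLANTIC)
    # Assumed pattern: YYYY-MM-DD-THHMMSS
    return f"{brand}_{source}_{local.strftime('%Y-%m-%d-T%H%M%S')}.md"
```

For instance, a UTC timestamp of 18:30:45 on 15 January 2024 falls at 14:30:45 Atlantic Standard Time, so the time portion of the filename becomes T143045.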

Directory Structure

.env
data/
├── markdown_current/       # Current markdown files
├── markdown_archives/      # Archived markdown files by source
│   ├── WordPress/
│   ├── Instagram/
│   ├── YouTube/
│   ├── Podcast/
│   └── MailChimp/
├── media/                  # Downloaded media files by source
│   ├── WordPress/
│   ├── Instagram/
│   ├── YouTube/
│   ├── Podcast/
│   └── MailChimp/
└── .state/                # State files for incremental updates
docs/                      # Documentation
logs/                      # Log files by source with rotation
├── WordPress/
├── Instagram/
├── YouTube/
├── Podcast/
└── MailChimp/
src/                       # Source code
tests/                     # Test files
k8s/                       # Kubernetes manifests
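
The layout above, together with the archiving step, might be bootstrapped along these lines. This is a sketch only: ensure_layout and archive_file are hypothetical helpers, and the folder names are copied from the tree.

```python
import shutil
from pathlib import Path

# Per-source folder names, as shown in the directory tree.
SOURCES = ("WordPress", "Instagram", "YouTube", "Podcast", "MailChimp")


def ensure_layout(root: Path) -> None:
    """Create the data/ and logs/ directories if they do not exist yet."""
    data = root / "data"
    (data / "markdown_current").mkdir(parents=True, exist_ok=True)
    (data / ".state").mkdir(parents=True, exist_ok=True)
    for source in SOURCES:
        for parent in (data / "markdown_archives", data / "media", root / "logs"):
            (parent / source).mkdir(parents=True, exist_ok=True)


def archive_file(root: Path, source: str, filename: str) -> Path:
    """Move one file from markdown_current/ into markdown_archives/<source>/."""
    src = root / "data" / "markdown_current" / filename
    dest = root / "data" / "markdown_archives" / source / filename
    shutil.move(str(src), str(dest))
    return dest
```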

Markdown File Format

# ID: [unique_identifier]

## Title: [content_title]

## Type: [content_type]

## Permalink: [url]

## Description:
[content_description]

## Metadata:

### Comments: [count]

### Likes: [count]

### Tags:
- tag1
- tag2

--------------

# ID: [next_item]
...
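
To make the template concrete, here is one way a renderer for it could look. The dictionary keys (id, title, type, permalink, description, comments, likes, tags) and the function names are assumptions chosen to mirror the headings above.

```python
from typing import Iterable, Mapping

SEPARATOR = "-" * 14  # the rule placed between items in the template


def render_item(item: Mapping[str, object]) -> str:
    """Render one content item using the heading layout of the template."""
    tags = "\n".join(f"- {tag}" for tag in item.get("tags", []))
    parts = [
        f"# ID: {item['id']}",
        f"## Title: {item['title']}",
        f"## Type: {item['type']}",
        f"## Permalink: {item['permalink']}",
        f"## Description:\n{item['description']}",
        "## Metadata:",
        f"### Comments: {item['comments']}",
        f"### Likes: {item['likes']}",
        f"### Tags:\n{tags}".rstrip(),  # no trailing newline when there are no tags
    ]
    return "\n\n".join(parts)


def render_document(items: Iterable[Mapping[str, object]]) -> str:
    """Join rendered items with the separator line between them."""
    return f"\n\n{SEPARATOR}\n\n".join(render_item(i) for i in items) + "\n"
```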

Technical Implementation

Development Approach

  • Test-Driven Development (TDD)
  • Python with UV package manager
  • Abstract base class for content sources
  • Parallel processing using multiprocessing
  • State management with JSON files
  • Comprehensive error handling with exponential backoff
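
The abstract base class and JSON state file could fit together roughly as sketched below. ContentSource, fetch_all, fetch_new, and the seen_ids key are all hypothetical names; the real class may track state differently (for example by last-seen timestamp).

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List


class ContentSource(ABC):
    """Hypothetical base class: subclasses implement fetch_all() only."""

    name: str = "base"

    def __init__(self, state_dir: Path) -> None:
        self.state_path = state_dir / f"{self.name}.json"

    def load_state(self) -> Dict:
        """Read the JSON state file, or start fresh if it does not exist."""
        if self.state_path.exists():
            return json.loads(self.state_path.read_text(encoding="utf-8"))
        return {"seen_ids": []}

    def save_state(self, state: Dict) -> None:
        self.state_path.parent.mkdir(parents=True, exist_ok=True)
        self.state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")

    @abstractmethod
    def fetch_all(self) -> List[Dict]:
        """Return every item currently visible at the source."""

    def fetch_new(self) -> List[Dict]:
        """Incremental update: return only items whose IDs were not seen before."""
        state = self.load_state()
        seen = set(state["seen_ids"])
        new_items = [item for item in self.fetch_all() if item["id"] not in seen]
        state["seen_ids"] = sorted(seen | {item["id"] for item in new_items})
        self.save_state(state)
        return new_items
```

Keeping the state handling in the base class means each scraper only has to describe how to list its items, which also makes the scrapers easy to mock in unit tests.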

Key Python Packages

  • requests: API calls
  • feedparser: RSS parsing
  • yt-dlp: YouTube content
  • instaloader: Instagram content
  • markitdown: Markdown conversion
  • python-dotenv: Environment management
  • schedule: Task scheduling
  • pytest: Testing framework
  • pytz: Timezone handling

Security & Rate Limiting

  • Credentials stored in .env file
  • Humanized behavior for YouTube/Instagram:
    • Random delays between requests (2-10 seconds)
    • Exponential backoff on errors
    • User-agent rotation
    • Session management
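
The random delays and user-agent rotation listed above might be sketched as follows. The user-agent strings are placeholders, not real browser identifiers, and all function names are hypothetical.

```python
import random
import time
from typing import Callable, Dict, Sequence

# Placeholder values; a real deployment would supply genuine user-agent strings.
USER_AGENTS = ("ExampleBrowser/1.0", "ExampleBrowser/2.0", "ExampleBrowser/3.0")


def humanized_delay(rng: random.Random, low: float = 2.0, high: float = 10.0) -> float:
    """Pick a random delay in seconds within the 2-10 second window."""
    return rng.uniform(low, high)


def next_request_headers(rng: random.Random, agents: Sequence[str] = USER_AGENTS) -> Dict[str, str]:
    """Rotate the user agent by choosing one at random for each request."""
    return {"User-Agent": rng.choice(agents)}


def pause_between_requests(rng: random.Random, sleep: Callable[[float], None] = time.sleep) -> float:
    """Sleep for a humanized delay and return the delay that was used."""
    delay = humanized_delay(rng)
    sleep(delay)
    return delay
```

Passing the random generator and the sleep function as arguments keeps the behavior deterministic and fast under test.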

Logging

  • Separate log files per source
  • Rotating file handler (max 10MB, keep 5 backups)
  • Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
  • Structured logging with timestamps
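
A per-source logger matching these settings could be configured with the standard library's RotatingFileHandler, as in the sketch below. The logger name prefix hkia., the log file name, and the format string are assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_source_logger(source: str, log_root: Path) -> logging.Logger:
    """Return a logger that writes to <log_root>/<source>/<source>.log."""
    log_dir = log_root / source
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"hkia.{source}")
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = RotatingFileHandler(
            log_dir / f"{source}.log",
            maxBytes=10 * 1024 * 1024,  # rotate at 10 MB
            backupCount=5,              # keep 5 rotated backups
            encoding="utf-8",
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```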

Error Handling

  • Graceful degradation if a source fails
  • Retry logic with exponential backoff
  • Maximum 3 retries per source
  • Continue with other sources on failure
  • Alert logging for critical failures
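
The retry policy above could be sketched as a small helper. Here "maximum 3 retries" is read as one initial attempt plus up to three retries; the 1-second base delay and the name with_retries are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying with exponential backoff on any exception.

    Waits base_delay, then 2x, then 4x between attempts. After the final
    retry fails, the last exception propagates to the caller.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
```

A caller that wraps each source in with_retries and catches the final exception can log the failure and move on to the next source.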

Containerization & Kubernetes

Docker Requirements

  • Multi-stage build for smaller image
  • Non-root user execution
  • Health checks
  • Volume mounts for data persistence

Kubernetes Deployment

  • Run on control plane node (node selector)
  • CronJob for scheduled execution
  • ConfigMap for non-sensitive config
  • Secret for credentials
  • PersistentVolume for data/logs
  • Service account with appropriate permissions
  • Resource limits and requests

Persistent Storage

  • PVC for /data directory
  • PVC for /logs directory
  • HostPath or NFS for NAS access

Testing Strategy

Unit Tests

  • Test each scraper independently
  • Mock external API calls
  • Test state management
  • Test markdown conversion
  • Test error handling

Integration Tests

  • Test parallel processing
  • Test file archiving
  • Test rsync functionality
  • Test scheduling

End-to-End Tests

  • Full workflow with mock data
  • Verify markdown output format
  • Verify file naming and placement

Monitoring & Maintenance

Health Checks

  • Verify that each source is accessible
  • Check disk space
  • Monitor API rate limits
  • Check log file rotation status
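
As one example of these checks, the disk-space check could be as small as the sketch below. The 10% free-space threshold and the function names are assumptions, not values from this specification.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_space_ok(path: str, min_free_fraction: float = 0.10) -> bool:
    """True when at least min_free_fraction of the disk is free (assumed threshold)."""
    return free_fraction(path) >= min_free_fraction
```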

Metrics to Track

  • Content items processed per source
  • API call counts
  • Error rates
  • Processing time per source
  • Storage usage
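
These metrics could be collected in a simple per-source record, sketched here. The field names are hypothetical, and defining the error rate as errors per API call is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SourceMetrics:
    """Hypothetical per-run counters for one content source."""

    source: str
    items_processed: int = 0
    api_calls: int = 0
    errors: int = 0
    duration_seconds: float = 0.0
    storage_bytes: int = 0

    @property
    def error_rate(self) -> float:
        """Errors per API call; 0.0 when no calls were made."""
        return self.errors / self.api_calls if self.api_calls else 0.0
```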

Version Control

Documentation Files

  • README.md: Setup and usage instructions
  • claude.md: AI context and implementation notes
  • status.md: Current project status and progress
  • docs/project_specification.md: This file
  • docs/api_documentation.md: API endpoints and responses
  • docs/troubleshooting.md: Common issues and solutions