
HVAC Content Classification System

Overview

The Content Classification System uses Claude Haiku AI to analyze and structure HVAC content from multiple sources into concise JSON files. These files provide structured metadata, summaries, and classifications for use in content creation projects.

Features

Structured Classification

Each content item is analyzed and classified with the following fields (an example record is sketched after this list):

  • URL: Original content location
  • Date Published: Publication date
  • Author: Content creator
  • Word Count: Content length
  • Summary: 1-3 sentence summary of main points
  • Key Learnings: 3-10 bullet point takeaways
  • Content Type: technical/business/educational/marketing/troubleshooting/installation/maintenance
  • Application: Residential/Commercial/Industrial/Automotive/Marine
  • Categories: Technical categories and tags
  • Brands Mentioned: HVAC brands, manufacturers, tools referenced
  • Tools Mentioned: Specific HVAC equipment and software
  • Topics: Technical topics (refrigeration, heat pumps, ductwork, etc.)
  • Meta Information:
    • Difficulty level (beginner/intermediate/advanced)
    • Target audience (homeowner/technician/contractor/engineer)
    • Actionable content flag
    • Troubleshooting focus flag
  • Classification Confidence: AI confidence score
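
To make the output shape concrete, here is a hypothetical classified record. The field names follow the list above, but the exact key spellings and value formats are illustrative assumptions, not a schema guarantee:

```json
{
  "url": "https://www.youtube.com/watch?v=EXAMPLE",
  "date_published": "2024-06-15",
  "author": "HVAC Know It All",
  "word_count": 1850,
  "summary": "Walks through diagnosing low suction pressure on a residential heat pump.",
  "key_learnings": [
    "Verify airflow before condemning the metering device",
    "Compare measured superheat against manufacturer targets",
    "Low charge and restricted filters can present similar symptoms"
  ],
  "content_type": "troubleshooting",
  "application": "Residential",
  "categories": ["refrigeration cycle", "diagnostics"],
  "brands_mentioned": ["Fieldpiece"],
  "tools_mentioned": ["digital manifold gauge"],
  "topics": ["heat pumps", "superheat", "airflow"],
  "meta": {
    "difficulty": "intermediate",
    "target_audience": "technician",
    "actionable": true,
    "troubleshooting_focus": true
  },
  "classification_confidence": 0.92
}
```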

Architecture

Core Components

1. Content Parser (src/content_analysis/content_parser.py)

  • Extracts individual content items from aggregated markdown files
  • Handles all content sources: WordPress, YouTube, Instagram, Podcast, MailChimp
  • Validates content structure and extracts metadata
  • Returns structured ContentItem objects
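
A minimal sketch of what a parsed item might look like. The real ContentItem definition lives in src/content_analysis/content_parser.py; the field names here are assumptions for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """One piece of content extracted from a consolidated markdown file.

    Hypothetical shape for illustration; see content_parser.py for the
    actual definition.
    """

    item_id: str           # unique ID used later for deduplication
    source: str            # wordpress / youtube / instagram / podcast / mailchimp
    url: str
    title: str
    body: str              # raw text sent to the classifier
    date_published: str = ""
    author: str = ""
    metadata: dict = field(default_factory=dict)
```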

2. Content Classifier (src/content_analysis/content_classifier.py)

  • Uses Claude Haiku API for cost-effective AI classification
  • Processes content with structured JSON prompts
  • Implements rate limiting and retry logic:
    • 1 second delay between requests
    • Exponential backoff on failures
    • 5 retry attempts per item
  • Returns ClassifiedContent objects with all metadata
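
A minimal sketch of that retry-with-backoff loop, assuming the model returns raw JSON text; the real implementation in content_classifier.py may structure this differently:

```python
import json
import time

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

BASE_DELAY = 1.0   # seconds between requests
MAX_RETRIES = 5


def classify(prompt: str) -> dict:
    """Send one classification prompt, retrying with exponential backoff."""
    delay = BASE_DELAY
    for _ in range(MAX_RETRIES):
        try:
            response = client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=1000,
                temperature=0.1,
                messages=[{"role": "user", "content": prompt}],
            )
            return json.loads(response.content[0].text)
        except (anthropic.APIError, json.JSONDecodeError):
            time.sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError("classification failed after retries")
```

Between successful requests, the runner additionally sleeps BASE_DELAY seconds so the sequential loop stays under the rate limit.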

3. Markdown Consolidator (consolidate_markdown_sources.py)

  • Deduplicates content across multiple markdown files
  • Keeps the most recent version of each content item, keyed by ID (see the sketch below)
  • Consolidates 53,000+ raw items down to ~3,000 unique items
  • Handles case variations in source names
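
A minimal sketch of that keep-most-recent deduplication, assuming each item dict carries 'id', 'source', and an ISO-format 'date' (so string comparison orders correctly); the script itself may differ:

```python
def deduplicate(items: list[dict]) -> list[dict]:
    """Keep only the most recent version of each item, keyed by source + ID."""
    latest: dict[str, dict] = {}
    for item in items:
        # Normalize source casing so "YouTube" and "youtube" collapse together
        key = f"{item['source'].lower()}:{item['id']}"
        if key not in latest or item["date"] > latest[key]["date"]:
            latest[key] = item
    return list(latest.values())
```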

4. Classification Runner (classify_youtube_podcast_only.py)

  • Focused script for classifying specific sources
  • Sequential processing to avoid rate limit conflicts
  • Progress tracking and error handling
  • Saves results as clean JSON files

Data Flow

1. Raw Markdown Files (multiple versions per source)
   ↓
2. Consolidation & Deduplication
   ↓
3. Consolidated Markdown (5 files: blog, podcast, youtube, instagram, mailchimp)
   ↓
4. Content Parsing & Validation
   ↓
5. Claude Haiku Classification
   ↓
6. Structured JSON Output
   ↓
7. NAS Storage for Distribution

File Structure

Input Files

  • data/consolidated/hkia_blog_consolidated.md - WordPress blog posts
  • data/consolidated/hkia_podcast_consolidated.md - Podcast episodes (431 items)
  • data/consolidated/hkia_youtube_consolidated.md - YouTube videos (447 items)
  • data/consolidated/hkia_instagram_consolidated.md - Instagram posts
  • data/consolidated/hkia_mailchimp_consolidated.md - Newsletter content

Output Files

  • data/clean_classified/blog.json - Classified blog content
  • data/clean_classified/podcast.json - Classified podcast episodes
  • data/clean_classified/youtube.json - Classified YouTube videos
  • data/clean_classified/instagram.json - Classified Instagram posts
  • data/clean_classified/mailchimp.json - Classified newsletter content

NAS Sync

  • Files automatically synced to: /mnt/nas/hkia/clean_classified/

Usage

Full Consolidation and Classification

# Step 1: Consolidate markdown files with deduplication
uv run python consolidate_markdown_sources.py

# Step 2: Classify specific sources (YouTube & Podcast)
export ANTHROPIC_API_KEY="your-api-key"
uv run python classify_youtube_podcast_only.py

# Step 3: Sync to NAS
rsync -av data/clean_classified/ /mnt/nas/hkia/clean_classified/

Classification Only (if consolidated files exist)

# Run focused classification
export ANTHROPIC_API_KEY="your-api-key"
uv run python classify_youtube_podcast_only.py

API Configuration

Claude Haiku Settings

  • Model: claude-3-haiku-20240307
  • Max Tokens: 1000 per request
  • Temperature: 0.1 (low for consistent classification)
  • Rate Limiting:
    • 80,000 output tokens per minute limit
    • ~80 requests per minute maximum (the 80,000-token budget ÷ 1,000 max output tokens per request)
    • 1 second delay between requests

Cost Estimation

  • Input: $0.25 per million tokens
  • Output: $1.25 per million tokens
  • Typical Cost: ~$1.30 for 878 items (447 YouTube + 431 Podcast)
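
The ~$1.30 figure can be reproduced with rough per-item token counts. The ~2,000 input and ~800 output tokens per item below are illustrative assumptions, not measured values:

```python
items = 878                    # 447 YouTube videos + 431 podcast episodes
input_tokens = items * 2_000   # assumed average prompt size per item
output_tokens = items * 800    # assumed average classification size

cost = (input_tokens / 1e6) * 0.25 + (output_tokens / 1e6) * 1.25
print(f"${cost:.2f}")          # ≈ $1.32, in line with the ~$1.30 estimate
```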

Performance

Processing Times

  • With 1-second rate limiting: ~3 seconds per item (the delay plus API round-trip latency)
  • YouTube (447 videos): ~22 minutes
  • Podcast (431 episodes): ~22 minutes
  • Total for all sources: ~45 minutes

Success Rates

  • Typical success rate: >99%
  • Automatic retry on JSON parsing errors
  • Exponential backoff on API rate limits

Error Handling

Rate Limiting

  • Base delay: 1 second between requests
  • Exponential backoff: 2x multiplier on retry
  • Maximum retries: 5 attempts per item

JSON Parsing Errors

  • Automatic retry with backoff
  • Fallback JSON extraction from response text
  • Logged errors for debugging
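
The fallback extraction can be as simple as locating the outermost braces in a response that isn't pure JSON (e.g. the model wrapped it in prose). A sketch of that idea; the real implementation may differ:

```python
import json
import re


def extract_json(text: str) -> dict | None:
    """Fallback: pull the first {...} span out of a non-JSON response."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass  # leave it to the retry/backoff logic
    return None
```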

Monitoring

Progress Tracking

  • Console output every 10 items
  • Shows current item ID and number
  • Success/failure counts
  • Estimated time remaining

Log Files

  • Detailed logging with timestamps
  • Error messages and stack traces
  • API response debugging

Integration

Claude Desktop Projects

The classified JSON files are optimized for use in Claude Desktop projects:

  • Massively reduced file sizes (KB instead of MB)
  • Structured data for easy parsing
  • Rich metadata for content filtering
  • Summaries and key learnings for quick reference

Use Cases

  • Content gap analysis
  • Topic research and planning
  • Content repurposing
  • Competitive analysis
  • Training material development
  • SEO optimization

Maintenance

Updating Classifications

  1. Re-run consolidation if new markdown files are added
  2. Re-classify specific sources as needed
  3. Sync to NAS for distribution

Adding New Sources

  1. Add source pattern to consolidate_markdown_sources.py
  2. Update content parser if needed
  3. Run consolidation and classification

API Key Management

  • Store in .env file as ANTHROPIC_API_KEY
  • Never commit API keys to repository
  • Use environment variables in production
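
If the scripts load the key via python-dotenv (an assumption; they may read the environment directly), the pattern looks like this:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the working directory
api_key = os.environ["ANTHROPIC_API_KEY"]  # raises KeyError if unset
```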

Troubleshooting

Common Issues

Rate Limit Errors (429)

  • Solution: Increase delay between requests
  • Current setting: 1 second (optimal for 80k tokens/min)

JSON Parsing Errors

  • Usually caused by malformed API responses
  • Automatic retry handles most cases
  • Check logs for persistent failures

Missing Content

  • Verify markdown consolidation captured all files
  • Check case sensitivity in source patterns
  • Ensure NAS sync completed successfully

Future Enhancements

Planned Features

  • Batch processing optimization
  • Parallel classification with rate limit management
  • Incremental updates for new content
  • Custom classification templates per source
  • Advanced deduplication strategies

Potential Improvements

  • Switch to newer Claude models when available
  • Implement caching for unchanged content
  • Add quality scoring metrics
  • Create summary reports and analytics