HVAC Content Classification System
Overview
The Content Classification System uses Claude Haiku AI to analyze and structure HVAC content from multiple sources into concise JSON files. These files provide structured metadata, summaries, and classifications for use in content creation projects.
Features
Structured Classification
Each content item is analyzed and classified with the following fields (a sample record follows the list):
- URL: Original content location
- Date Published: Publication date
- Author: Content creator
- Word Count: Content length
- Summary: 1-3 sentence summary of main points
- Key Learnings: 3-10 bullet point takeaways
- Content Type: technical/business/educational/marketing/troubleshooting/installation/maintenance
- Application: Residential/Commercial/Industrial/Automotive/Marine
- Categories: Technical categories and tags
- Brands Mentioned: HVAC brands, manufacturers, tools referenced
- Tools Mentioned: Specific HVAC equipment and software
- Topics: Technical topics (refrigeration, heat pumps, ductwork, etc.)
- Meta Information:
  - Difficulty level (beginner/intermediate/advanced)
  - Target audience (homeowner/technician/contractor/engineer)
  - Actionable content flag
  - Troubleshooting focus flag
- Classification Confidence: AI confidence score
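The exact schema is defined by the classifier prompt; the record below is an invented illustration of the fields listed above, not actual output:

```json
{
  "url": "https://www.youtube.com/watch?v=EXAMPLE",
  "date_published": "2024-03-15",
  "author": "Example Channel",
  "word_count": 2150,
  "summary": "Walks through diagnosing low suction pressure on a residential heat pump.",
  "key_learnings": [
    "Check airflow before condemning the metering device",
    "Compare subcooling and superheat against manufacturer targets",
    "Low airflow and low charge can present similar symptoms"
  ],
  "content_type": "troubleshooting",
  "application": "Residential",
  "categories": ["diagnostics", "refrigerant circuit"],
  "brands_mentioned": ["Example Brand"],
  "tools_mentioned": ["manifold gauges", "thermometer clamps"],
  "topics": ["heat pumps", "refrigeration"],
  "meta": {
    "difficulty": "intermediate",
    "audience": "technician",
    "actionable": true,
    "troubleshooting_focus": true
  },
  "classification_confidence": 0.92
}
```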
Architecture
Core Components
1. Content Parser (`src/content_analysis/content_parser.py`)
- Extracts individual content items from aggregated markdown files
- Handles all content sources: WordPress, YouTube, Instagram, Podcast, MailChimp
- Validates content structure and extracts metadata
- Returns structured `ContentItem` objects (see the sketch below)
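A minimal sketch of what a parsed item might look like; the field names here are assumptions for illustration, not the repository's actual definition:

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    """One parsed item from a consolidated markdown file (hypothetical shape)."""
    item_id: str          # stable ID used for deduplication
    source: str           # "youtube", "podcast", "blog", "instagram", "mailchimp"
    url: str
    title: str
    body: str             # raw text content sent to the classifier
    date_published: str = ""   # ISO date string, e.g. "2024-03-15"
    author: str = ""
    metadata: dict = field(default_factory=dict)
```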
2. Content Classifier (`src/content_analysis/content_classifier.py`)
- Uses Claude Haiku API for cost-effective AI classification
- Processes content with structured JSON prompts
- Implements rate limiting and retry logic:
  - 1-second delay between requests
  - Exponential backoff on failures
  - 5 retry attempts per item
- Returns `ClassifiedContent` objects with all metadata (sketched below)
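The retry behavior described above might look roughly like this: a minimal sketch using the anthropic Python SDK, with `classify` and its parameters invented for illustration rather than taken from the actual module:

```python
import json
import time

from anthropic import Anthropic  # pip install anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def classify(prompt: str, max_retries: int = 5, base_delay: float = 1.0) -> dict:
    """Call Claude Haiku and parse its JSON reply, retrying with backoff."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=1000,
                temperature=0.1,
                messages=[{"role": "user", "content": prompt}],
            )
            return json.loads(response.content[0].text)
        except Exception:
            # Covers 429 rate limits, transient network errors, and bad JSON.
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, 8s, ...
    raise RuntimeError("unreachable")
```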
3. Markdown Consolidator (`consolidate_markdown_sources.py`)
- Deduplicates content across multiple markdown files
- Keeps the most recent version of each content item by ID (see the sketch below)
- Consolidates from 53,000+ items to ~3,000 unique items
- Handles case variations in source names
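Keeping the newest version of each ID can be done in a single pass; a sketch, assuming the hypothetical `ContentItem` fields from the parser section (ISO date strings compare correctly with `>`):

```python
def deduplicate(items: list) -> list:
    """Keep only the most recently published version of each content ID."""
    latest = {}
    for item in items:
        existing = latest.get(item.item_id)
        if existing is None or item.date_published > existing.date_published:
            latest[item.item_id] = item
    return list(latest.values())
```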
4. Classification Runner (`classify_youtube_podcast_only.py`)
- Focused script for classifying specific sources
- Sequential processing to avoid rate limit conflicts
- Progress tracking and error handling
- Saves results as clean JSON files (a runner sketch follows)
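The sequential runner might tie these pieces together as below; this sketch reuses the imports and `classify()` from the classifier sketch above, and `build_prompt` is a hypothetical helper:

```python
def run(items: list, out_path: str) -> None:
    """Classify items sequentially, reporting progress every 10 items."""
    results, failures = [], 0
    for i, item in enumerate(items, start=1):
        try:
            results.append(classify(build_prompt(item)))  # build_prompt is hypothetical
        except Exception:
            failures += 1  # the real script also logs the error
        if i % 10 == 0:
            print(f"{i}/{len(items)} processed ({failures} failures)")
        time.sleep(1.0)  # 1-second base delay between requests
    with open(out_path, "w") as fh:
        json.dump(results, fh, indent=2)
```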
Data Flow
```
1. Raw Markdown Files (multiple versions per source)
        ↓
2. Consolidation & Deduplication
        ↓
3. Consolidated Markdown (5 files: blog, podcast, youtube, instagram, mailchimp)
        ↓
4. Content Parsing & Validation
        ↓
5. Claude Haiku Classification
        ↓
6. Structured JSON Output
        ↓
7. NAS Storage for Distribution
```
File Structure
Input Files
- `data/consolidated/hkia_blog_consolidated.md` - WordPress blog posts
- `data/consolidated/hkia_podcast_consolidated.md` - Podcast episodes (431 items)
- `data/consolidated/hkia_youtube_consolidated.md` - YouTube videos (447 items)
- `data/consolidated/hkia_instagram_consolidated.md` - Instagram posts
- `data/consolidated/hkia_mailchimp_consolidated.md` - Newsletter content
Output Files
- `data/clean_classified/blog.json` - Classified blog content
- `data/clean_classified/podcast.json` - Classified podcast episodes
- `data/clean_classified/youtube.json` - Classified YouTube videos
- `data/clean_classified/instagram.json` - Classified Instagram posts
- `data/clean_classified/mailchimp.json` - Classified newsletter content
NAS Sync
- Files are automatically synced to `/mnt/nas/hkia/clean_classified/`
Usage
Full Consolidation and Classification
```bash
# Step 1: Consolidate markdown files with deduplication
uv run python consolidate_markdown_sources.py

# Step 2: Classify specific sources (YouTube & Podcast)
export ANTHROPIC_API_KEY="your-api-key"
uv run python classify_youtube_podcast_only.py

# Step 3: Sync to NAS
rsync -av data/clean_classified/ /mnt/nas/hkia/clean_classified/
```
Classification Only (if consolidated files exist)
```bash
# Run focused classification
export ANTHROPIC_API_KEY="your-api-key"
uv run python classify_youtube_podcast_only.py
```
API Configuration
Claude Haiku Settings
- Model: `claude-3-haiku-20240307`
- Max Tokens: 1000 per request
- Temperature: 0.1 (low for consistent classification)
- Rate Limiting:
  - 80,000 output tokens per minute limit
  - ~80 requests per minute maximum (80,000 tokens/min ÷ 1,000 max output tokens per request)
  - 1-second delay between requests
Cost Estimation
- Input: $0.25 per million tokens
- Output: $1.25 per million tokens
- Typical Cost: ~$1.30 for 878 items (447 YouTube + 431 Podcast)
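As a rough sanity check, assuming hypothetical averages of ~3,500 input tokens and ~500 output tokens per item: 878 × 3,500 ≈ 3.07M input tokens × $0.25/M ≈ $0.77, plus 878 × 500 ≈ 0.44M output tokens × $1.25/M ≈ $0.55, for roughly $1.32 total.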
Performance
Processing Times
- With 1-second rate limiting: ~3 seconds per item
- YouTube (447 videos): ~22 minutes
- Podcast (431 episodes): ~22 minutes
- Total for all sources: ~45 minutes
Success Rates
- Typical success rate: >99%
- Automatic retry on JSON parsing errors
- Exponential backoff on API rate limits
Error Handling
Rate Limiting
- Base delay: 1 second between requests
- Exponential backoff: 2x multiplier on retry
- Maximum retries: 5 attempts per item
JSON Parsing Errors
- Automatic retry with backoff
- Fallback JSON extraction from response text (sketched below)
- Logged errors for debugging
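The fallback extraction mentioned above could be a simple regex scan for the outermost JSON object; a sketch, with `extract_json` invented for illustration (real responses may need more careful handling):

```python
import json
import re


def extract_json(text: str):
    """Pull the first {...} block out of a reply that isn't pure JSON."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```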
Monitoring
Progress Tracking
- Console output every 10 items
- Shows current item ID and number
- Success/failure counts
- Estimated time remaining
Log Files
- Detailed logging with timestamps
- Error messages and stack traces
- API response debugging
Integration
Claude Desktop Projects
The classified JSON files are optimized for use in Claude Desktop projects:
- Massively reduced file sizes (KB instead of MB)
- Structured data for easy parsing
- Rich metadata for content filtering
- Summaries and key learnings for quick reference
Use Cases
- Content gap analysis
- Topic research and planning
- Content repurposing
- Competitive analysis
- Training material development
- SEO optimization
Maintenance
Updating Classifications
- Re-run consolidation when new markdown files are added
- Re-classify specific sources as needed
- Sync to NAS for distribution
Adding New Sources
- Add source pattern to `consolidate_markdown_sources.py`
- Update content parser if needed
- Run consolidation and classification
API Key Management
- Store in `.env` file as `ANTHROPIC_API_KEY`
- Never commit API keys to repository
- Use environment variables in production
Troubleshooting
Common Issues
Rate Limit Errors (429)
- Solution: Increase delay between requests
- Current setting: 1 second (optimal for 80k tokens/min)
JSON Parsing Errors
- Usually caused by malformed API responses
- Automatic retry handles most cases
- Check logs for persistent failures
Missing Content
- Verify markdown consolidation captured all files
- Check case sensitivity in source patterns
- Ensure NAS sync completed successfully
Future Enhancements
Planned Features
- Batch processing optimization
- Parallel classification with rate limit management
- Incremental updates for new content
- Custom classification templates per source
- Advanced deduplication strategies
Potential Improvements
- Switch to newer Claude models when available
- Implement caching for unchanged content
- Add quality scoring metrics
- Create summary reports and analytics