# HVAC Content Classification System
## Overview
The Content Classification System uses Claude Haiku AI to analyze and structure HVAC content from multiple sources into concise JSON files. These files provide structured metadata, summaries, and classifications for use in content creation projects.
## Features
### Structured Classification
Each content item is analyzed and classified with the following fields (a sample record follows this list):
- **URL**: Original content location
- **Date Published**: Publication date
- **Author**: Content creator
- **Word Count**: Content length
- **Summary**: 1-3 sentence summary of main points
- **Key Learnings**: 3-10 bullet point takeaways
- **Content Type**: technical/business/educational/marketing/troubleshooting/installation/maintenance
- **Application**: Residential/Commercial/Industrial/Automotive/Marine
- **Categories**: Technical categories and tags
- **Brands Mentioned**: HVAC brands, manufacturers, tools referenced
- **Tools Mentioned**: Specific HVAC equipment and software
- **Topics**: Technical topics (refrigeration, heat pumps, ductwork, etc.)
- **Meta Information**:
  - Difficulty level (beginner/intermediate/advanced)
  - Target audience (homeowner/technician/contractor/engineer)
  - Actionable content flag
  - Troubleshooting focus flag
- **Classification Confidence**: AI confidence score
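For orientation, a hypothetical record in this shape might look as follows. The field names and values here are illustrative only, not drawn from the actual dataset or schema:
```json
{
  "url": "https://www.youtube.com/watch?v=EXAMPLE",
  "date_published": "2024-06-12",
  "author": "HVAC Know It All",
  "word_count": 1850,
  "summary": "Walks through diagnosing a low-charge heat pump using superheat and subcooling readings.",
  "key_learnings": [
    "Measure superheat at the suction line before adjusting charge",
    "Low subcooling with high superheat suggests undercharge",
    "Verify airflow before condemning the metering device"
  ],
  "content_type": "troubleshooting",
  "application": "Residential",
  "categories": ["refrigeration", "diagnostics"],
  "brands_mentioned": ["Fieldpiece"],
  "tools_mentioned": ["digital manifold"],
  "topics": ["heat pumps", "refrigerant charge"],
  "meta": {
    "difficulty_level": "intermediate",
    "target_audience": "technician",
    "actionable_content": true,
    "troubleshooting_focus": true
  },
  "classification_confidence": 0.92
}
```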
## Architecture
### Core Components
#### 1. Content Parser (`src/content_analysis/content_parser.py`)
- Extracts individual content items from aggregated markdown files
- Handles all content sources: WordPress, YouTube, Instagram, Podcast, MailChimp
- Validates content structure and extracts metadata
- Returns structured `ContentItem` objects
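The authoritative definition lives in `content_parser.py`; as a rough sketch, the returned object might look like this (field names are assumptions for illustration):
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContentItem:
    """One parsed item from a consolidated markdown file (fields assumed)."""
    item_id: str                         # stable ID used later for deduplication
    source: str                          # "youtube", "podcast", "blog", ...
    url: str
    title: str
    body: str                            # raw text sent to the classifier
    author: Optional[str] = None
    date_published: Optional[str] = None

    @property
    def word_count(self) -> int:
        return len(self.body.split())
```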
#### 2. Content Classifier (`src/content_analysis/content_classifier.py`)
- Uses Claude Haiku API for cost-effective AI classification
- Processes content with structured JSON prompts
- Implements rate limiting and retry logic (see the sketch after this list):
  - 1 second delay between requests
  - Exponential backoff on failures
  - 5 retry attempts per item
- Returns `ClassifiedContent` objects with all metadata
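The retry loop can be condensed to a sketch like the following, assuming the standard `anthropic` Python SDK; the real implementation in `content_classifier.py` may differ in its details:
```python
import json
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def classify(prompt: str, retries: int = 5, base_delay: float = 1.0) -> dict:
    """Send one classification prompt, retrying with exponential backoff."""
    for attempt in range(retries):
        try:
            response = client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=1000,
                temperature=0.1,
                messages=[{"role": "user", "content": prompt}],
            )
            return json.loads(response.content[0].text)
        except (anthropic.APIStatusError, json.JSONDecodeError):
            # Backoff doubles each attempt: 1s, 2s, 4s, 8s, 16s
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("classification failed after all retries")
```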
#### 3. Markdown Consolidator (`consolidate_markdown_sources.py`)
- Deduplicates content across multiple markdown files
- Keeps the most recent version of each content item by ID (see the sketch below)
- Consolidates from 53,000+ items to ~3,000 unique items
- Handles case variations in source names
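The keep-most-recent rule reduces to a single dictionary pass; this sketch assumes each raw item carries an `id`, a `source`, and a sortable `scraped_at` timestamp (names assumed):
```python
def deduplicate(items: list[dict]) -> list[dict]:
    """Keep the most recently scraped version of each item, keyed by normalized ID."""
    latest: dict[str, dict] = {}
    for item in items:
        # Lower-case the source so "YouTube" and "youtube" collapse to one key
        key = f"{item['source'].lower()}:{item['id']}"
        current = latest.get(key)
        if current is None or item["scraped_at"] > current["scraped_at"]:
            latest[key] = item
    return list(latest.values())
```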
#### 4. Classification Runner (`classify_youtube_podcast_only.py`)
- Focused script for classifying specific sources
- Sequential processing to avoid rate limit conflicts
- Progress tracking and error handling
- Saves results as clean JSON files
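Stripped of logging and argument parsing, the runner reduces to a loop of roughly this shape (helper names assumed; `classify` is the retry wrapper sketched above):
```python
import json

def run(items, classify, out_path: str) -> None:
    """Classify items sequentially, reporting progress every 10 items."""
    results, failures = [], 0
    for i, item in enumerate(items, start=1):
        try:
            results.append(classify(item.body))
        except RuntimeError:
            failures += 1
        if i % 10 == 0:
            print(f"[{i}/{len(items)}] ok={len(results)} failed={failures}")
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```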
## Data Flow
```
1. Raw Markdown Files (multiple versions per source)
2. Consolidation & Deduplication
3. Consolidated Markdown (5 files: blog, podcast, youtube, instagram, mailchimp)
4. Content Parsing & Validation
5. Claude Haiku Classification
6. Structured JSON Output
7. NAS Storage for Distribution
```
## File Structure
### Input Files
- `data/consolidated/hkia_blog_consolidated.md` - WordPress blog posts
- `data/consolidated/hkia_podcast_consolidated.md` - Podcast episodes (431 items)
- `data/consolidated/hkia_youtube_consolidated.md` - YouTube videos (447 items)
- `data/consolidated/hkia_instagram_consolidated.md` - Instagram posts
- `data/consolidated/hkia_mailchimp_consolidated.md` - Newsletter content
### Output Files
- `data/clean_classified/blog.json` - Classified blog content
- `data/clean_classified/podcast.json` - Classified podcast episodes
- `data/clean_classified/youtube.json` - Classified YouTube videos
- `data/clean_classified/instagram.json` - Classified Instagram posts
- `data/clean_classified/mailchimp.json` - Classified newsletter content
### NAS Sync
- Files are automatically synced to: `/mnt/nas/hkia/clean_classified/`
## Usage
### Full Consolidation and Classification
```bash
# Step 1: Consolidate markdown files with deduplication
uv run python consolidate_markdown_sources.py
# Step 2: Classify specific sources (YouTube & Podcast)
export ANTHROPIC_API_KEY="your-api-key"
uv run python classify_youtube_podcast_only.py
# Step 3: Sync to NAS
rsync -av data/clean_classified/ /mnt/nas/hkia/clean_classified/
```
### Classification Only (if consolidated files exist)
```bash
# Run focused classification
export ANTHROPIC_API_KEY="your-api-key"
uv run python classify_youtube_podcast_only.py
```
## API Configuration
### Claude Haiku Settings
- **Model**: `claude-3-haiku-20240307`
- **Max Tokens**: 1000 per request
- **Temperature**: 0.1 (low for consistent classification)
- **Rate Limiting**:
  - 80,000 output tokens per minute limit
  - ~80 requests per minute maximum
  - 1 second delay between requests
### Cost Estimation
- **Input**: $0.25 per million tokens
- **Output**: $1.25 per million tokens
- **Typical Cost**: ~$1.30 for 878 items (447 YouTube + 431 Podcast)
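The figure follows directly from the rates if each item uses roughly 1,000 input and 1,000 output tokens: 878 × 1,000 × $0.25/M ≈ $0.22 for input plus 878 × 1,000 × $1.25/M ≈ $1.10 for output, or about $1.32 in total.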
## Performance
### Processing Times
- **With 1-second rate limiting**: ~3 seconds per item (the 1-second delay plus API response time)
- **YouTube (447 videos)**: ~22 minutes
- **Podcast (431 episodes)**: ~22 minutes
- **Total for all sources**: ~45 minutes
### Success Rates
- Typical success rate: >99%
- Automatic retry on JSON parsing errors
- Exponential backoff on API rate limits
## Error Handling
### Rate Limiting
- Base delay: 1 second between requests
- Exponential backoff: 2x multiplier on retry
- Maximum retries: 5 attempts per item
### JSON Parsing Errors
- Automatic retry with backoff
- Fallback JSON extraction from response text
- Logged errors for debugging
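A minimal fallback extractor might simply take the outermost brace-delimited span, assuming the response contains a single JSON object surrounded by extra prose:
```python
import json

def extract_json(text: str) -> dict:
    """Fallback: pull the outermost {...} span out of a noisy response."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise json.JSONDecodeError("no JSON object found", text, 0)
    return json.loads(text[start : end + 1])
```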
## Monitoring
### Progress Tracking
- Console output every 10 items
- Shows current item ID and number
- Success/failure counts
- Estimated time remaining
### Log Files
- Detailed logging with timestamps
- Error messages and stack traces
- API response debugging
## Integration
### Claude Desktop Projects
The classified JSON files are optimized for use in Claude Desktop projects:
- Massively reduced file sizes (KB instead of MB)
- Structured data for easy parsing
- Rich metadata for content filtering
- Summaries and key learnings for quick reference
### Use Cases
- Content gap analysis
- Topic research and planning
- Content repurposing
- Competitive analysis
- Training material development
- SEO optimization
## Maintenance
### Updating Classifications
1. Re-run consolidation when new markdown files are added
2. Re-classify specific sources as needed
3. Sync to NAS for distribution
### Adding New Sources
1. Add source pattern to `consolidate_markdown_sources.py`
2. Update content parser if needed
3. Run consolidation and classification
### API Key Management
- Store in `.env` file as `ANTHROPIC_API_KEY`
- Never commit API keys to repository
- Use environment variables in production
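With `python-dotenv` (one common approach; the scripts may load the key differently), that looks like:
```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
api_key = os.environ["ANTHROPIC_API_KEY"]  # raises KeyError if the key is missing
```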
## Troubleshooting
### Common Issues
#### Rate Limit Errors (429)
- Solution: Increase delay between requests
- Current setting: 1 second (optimal for 80k tokens/min)
#### JSON Parsing Errors
- Usually caused by malformed API responses
- Automatic retry handles most cases
- Check logs for persistent failures
#### Missing Content
- Verify markdown consolidation captured all files
- Check case sensitivity in source patterns
- Ensure NAS sync completed successfully
## Future Enhancements
### Planned Features
- Batch processing optimization
- Parallel classification with rate limit management
- Incremental updates for new content
- Custom classification templates per source
- Advanced deduplication strategies
### Potential Improvements
- Switch to newer Claude models when available
- Implement caching for unchanged content
- Add quality scoring metrics
- Create summary reports and analytics