# HKIA Content Analysis & Competitive Intelligence Implementation Plan

## Project Overview

Add comprehensive content analysis and competitive intelligence capabilities to the existing HKIA content aggregation system. This will provide daily insights on content performance, trending topics, competitor analysis, and strategic content opportunities.

## Architecture Summary

### Current System Integration

- **Base**: Extend the existing `BaseScraper` architecture and `ContentOrchestrator`
- **LLM**: Claude Haiku for cost-effective content classification
- **APIs**: Jina.ai (existing credits), Oxylabs (existing credits), Anthropic API
- **Competitors**: HVACR School (blog), AC Service Tech, Refrigeration Mentor, Love2HVAC, HVAC TV (social)
- **Strategy**: One-time backlog capture + daily incremental scraping + weekly metadata refresh

## Implementation Phases

### Phase 1: Foundation (Weeks 1-2)

**Goal**: Set up the content analysis framework for existing HKIA content

**Tasks**:
1. Create the `src/content_analysis/` module structure
2. Implement `ClaudeHaikuAnalyzer` for content classification
3. Extend `BaseScraper` with analysis capabilities
4. Add analysis to existing scrapers (YouTube, Instagram, WordPress, etc.)
5. Create the daily intelligence JSON output structure

**Deliverables**:
- Content classification for all existing HKIA sources
- Daily intelligence reports for HKIA content only
- Enhanced metadata in existing markdown files

### Phase 2: Competitor Infrastructure (Weeks 3-4)

**Goal**: Build competitor scraping and state management infrastructure

**Tasks**:
1. Create the `src/competitive_intelligence/` module structure
2. Implement Oxylabs proxy integration
3. Build competitor scraper base classes
4. Create state management for incremental updates
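The incremental-update state in task 4 can be kept as one small JSON file per competitor (the `data/.state/competitor_*_state.json` files described later). A minimal sketch; the field names (`seen_ids`, `last_run`) and helper names are assumptions, not taken from the existing codebase:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_state(state_file: Path) -> dict:
    """Load a competitor's scrape state, or return a fresh one."""
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {"seen_ids": [], "last_run": None}


def record_run(state_file: Path, new_ids: list[str]) -> dict:
    """Merge newly scraped content IDs into the state and persist it."""
    state = load_state(state_file)
    state["seen_ids"] = sorted(set(state["seen_ids"]) | set(new_ids))
    state["last_run"] = datetime.now(timezone.utc).isoformat()
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps(state, indent=2))
    return state


def unseen(state: dict, candidate_ids: list[str]) -> list[str]:
    """Filter a scrape result down to IDs not captured before."""
    seen = set(state["seen_ids"])
    return [c for c in candidate_ids if c not in seen]
```

An incremental scraper would call `unseen()` before fetching full content, then `record_run()` after a successful pass, so re-runs only touch new posts.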
5. Implement the HVACR School blog scraper (backlog + incremental)

**Deliverables**:
- Competitor scraping framework
- HVACR School full backlog capture
- HVACR School daily incremental scraping
- Competitor state management system

### Phase 3: Social Media Competitor Scrapers (Weeks 5-6)

**Goal**: Implement social media competitor tracking

**Tasks**:
1. Build YouTube competitor scrapers (4 channels)
2. Build Instagram competitor scrapers (3 accounts)
3. Implement backlog capture commands
4. Create the weekly metadata refresh system
5. Add competitor content to the intelligence analysis

**Deliverables**:
- Complete competitor social media backlog
- Daily incremental social media scraping
- Weekly engagement metrics updates
- Unified competitor intelligence reports

### Phase 4: Advanced Analytics (Weeks 7-8)

**Goal**: Add trend detection and strategic insights

**Tasks**:
1. Implement trend detection algorithms
2. Build content gap analysis
3. Create competitive positioning analysis
4. Add SEO opportunity identification (using Jina.ai)
5. Generate weekly/monthly intelligence summaries

**Deliverables**:
- Advanced trend detection
- Content gap identification
- Strategic content recommendations
- Comprehensive intelligence dashboard data

### Phase 5: Production Deployment (Weeks 9-10)

**Goal**: Deploy to production with monitoring

**Tasks**:
1. Set up production environment variables
2. Create systemd services and timers
3. Integrate with the existing NAS sync
4. Add monitoring and error handling
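The error handling in task 4 and the exponential backoff planned under Risk Mitigation can share one small retry helper used by every scraper. A sketch under assumed defaults (retry counts and delay bounds are placeholders to tune in production):

```python
import random
import time
from functools import wraps


def with_backoff(max_retries: int = 5, base_delay: float = 1.0, max_delay: float = 60.0):
    """Retry a flaky call with exponential backoff plus random jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of retries: surface the real error
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    time.sleep(delay + random.uniform(0, delay / 2))
        return wrapper
    return decorator
```

Jitter spreads retries out so multiple scrapers failing at once do not hammer a competitor's site in lockstep, which also supports the rate-limiting mitigation later in this plan.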
5. Create operational documentation

**Deliverables**:
- Production-ready deployment
- Automated daily/weekly schedules
- Monitoring and alerting
- Operational runbooks

## Technical Architecture

### Module Structure

```
src/
├── content_analysis/
│   ├── __init__.py
│   ├── claude_analyzer.py           # Haiku-based content classification
│   ├── engagement_analyzer.py       # Metrics and trending analysis
│   ├── keyword_extractor.py         # SEO keyword identification
│   └── intelligence_aggregator.py   # Daily intelligence JSON generation
├── competitive_intelligence/
│   ├── __init__.py
│   ├── backlog_capture/
│   │   ├── __init__.py
│   │   ├── hvacrschool_backlog.py
│   │   ├── youtube_competitor_backlog.py
│   │   └── instagram_competitor_backlog.py
│   ├── incremental_scrapers/
│   │   ├── __init__.py
│   │   ├── hvacrschool_incremental.py
│   │   ├── youtube_competitor_daily.py
│   │   └── instagram_competitor_daily.py
│   ├── metadata_refreshers/
│   │   ├── __init__.py
│   │   ├── youtube_engagement_updater.py
│   │   └── instagram_engagement_updater.py
│   └── analysis/
│       ├── __init__.py
│       ├── competitive_gap_analyzer.py
│       ├── trend_analyzer.py
│       └── strategic_insights.py
└── orchestrators/
    ├── __init__.py
    ├── content_analysis_orchestrator.py
    └── competitive_intelligence_orchestrator.py
```

### Data Structure

```
data/
├── intelligence/
│   ├── daily/
│   │   └── hkia_intelligence_YYYY-MM-DD.json
│   ├── weekly/
│   │   └── hkia_weekly_intelligence_YYYY-MM-DD.json
│   └── monthly/
│       └── hkia_monthly_intelligence_YYYY-MM.json
├── competitor_content/
│   ├── hvacrschool/
│   │   ├── markdown_current/
│   │   ├── markdown_archives/
│   │   └── .state/
│   ├── acservicetech/
│   ├── refrigerationmentor/
│   ├── love2hvac/
│   └── hvactv/
└── .state/
    ├── competitor_hvacrschool_state.json
    ├── competitor_acservicetech_youtube_state.json
    └── ...
```

### Environment Variables

```bash
# Content Analysis
ANTHROPIC_API_KEY=your_claude_key
JINA_AI_API_KEY=your_existing_jina_key

# Competitor Scraping
OXYLABS_RESIDENTIAL_PROXY_ENDPOINT=your_endpoint
OXYLABS_USERNAME=your_username
OXYLABS_PASSWORD=your_password

# Competitor Targets
COMPETITOR_YOUTUBE_CHANNELS=acservicetech,refrigerationmentor,love2hvac,hvactv
COMPETITOR_INSTAGRAM_ACCOUNTS=acservicetech,love2hvac
COMPETITOR_BLOGS=hvacrschool.com
```

### Production Schedule

```
Daily:
- 8:00 AM:  HKIA content scraping (existing)
- 12:00 PM: HKIA content scraping (existing)
- 6:00 PM:  Competitor incremental scraping
- 7:00 PM:  Daily content analysis & intelligence generation

Weekly:
- Sunday 6:00 AM: Competitor metadata refresh

On-demand:
- Competitor backlog capture commands
- Force refresh commands
```

### systemd Services

```bash
# Daily content analysis
/etc/systemd/system/hkia-content-analysis.service
/etc/systemd/system/hkia-content-analysis.timer

# Daily competitor incremental
/etc/systemd/system/hkia-competitor-incremental.service
/etc/systemd/system/hkia-competitor-incremental.timer

# Weekly competitor metadata refresh
/etc/systemd/system/hkia-competitor-metadata-refresh.service
/etc/systemd/system/hkia-competitor-metadata-refresh.timer

# On-demand backlog capture
/etc/systemd/system/hkia-competitor-backlog.service
```

## Cost Estimates

**Monthly Operational Costs:**
- Claude Haiku API: $15-25/month (content classification)
- Jina.ai: $0 (existing credits)
- Oxylabs: $0 (existing credits)
- **Total: $15-25/month**

## Success Metrics

1. **Content Intelligence**: Daily classification of 100% of HKIA content
2. **Competitive Coverage**: Track 100% of new competitor content within 24 hours
3. **Strategic Insights**: Generate 3-5 actionable content opportunities daily
4. **Performance**: All analysis completed within the 2-hour daily window
5. **Cost Efficiency**: Stay under $30/month in operational costs

## Risk Mitigation
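Two of the mitigations below, API cost control and batching, shape how the Claude Haiku calls should be written: classify several items per request and parse a structured reply defensively. A hedged sketch of what `claude_analyzer.py` might contain; the label taxonomy, prompt wording, and model name are assumptions to refine in Phase 1 (the live call requires `ANTHROPIC_API_KEY`):

```python
import json

# Assumed label set; the real taxonomy would come from the HKIA content model.
LABELS = ["installation", "troubleshooting", "business", "product_review", "training"]


def build_prompt(items: list[dict]) -> str:
    """One batched prompt for several items keeps per-item token cost low."""
    lines = [
        "Classify each HVAC content item into exactly one label from: "
        + ", ".join(LABELS) + ".",
        'Reply with JSON: {"labels": ["<label per item, in order>"]}.',
    ]
    for i, item in enumerate(items, 1):
        lines.append(f"{i}. {item['title']}: {item.get('summary', '')[:300]}")
    return "\n".join(lines)


def parse_labels(reply: str, n_items: int) -> list[str]:
    """Parse the model's JSON reply, falling back to 'unknown' on bad output."""
    try:
        labels = json.loads(reply)["labels"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return ["unknown"] * n_items
    labels = [l if l in LABELS else "unknown" for l in labels]
    return (labels + ["unknown"] * n_items)[:n_items]


def classify_batch(items: list[dict]) -> list[str]:
    """Send one Haiku request per batch of items (needs ANTHROPIC_API_KEY)."""
    import anthropic  # deferred so the helpers above work without the SDK

    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user", "content": build_prompt(items)}],
    )
    return parse_labels(msg.content[0].text, len(items))
```

Batching 10-20 items per call, rather than one call per item, is what keeps the classification cost in the $15-25/month range estimated above.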
1. **Rate Limiting**: Implement exponential backoff and respect competitor ToS
2. **API Costs**: Monitor Claude Haiku usage; implement batching for efficiency
3. **Proxy Reliability**: Failover logic for Oxylabs proxy issues
4. **Data Storage**: Automated cleanup of old intelligence data
5. **System Load**: Schedule analysis during low-traffic periods

## Commands for Implementation

### Development Setup

```bash
# Add new dependencies
uv add anthropic jina-ai requests-oauthlib

# Create module structure
mkdir -p src/content_analysis src/competitive_intelligence/{backlog_capture,incremental_scrapers,metadata_refreshers,analysis} src/orchestrators

# Test content analysis on existing data
uv run python test_content_analysis.py

# Test competitor scraping
uv run python test_competitor_scraping.py
```

### Backlog Capture (One-time)

```bash
# Capture the full HVACR School blog
uv run python -m src.competitive_intelligence.backlog_capture --competitor hvacrschool

# Capture competitor social media backlogs
uv run python -m src.competitive_intelligence.backlog_capture --competitor acservicetech --platforms youtube,instagram

# Force re-capture if needed
uv run python -m src.competitive_intelligence.backlog_capture --force
```

### Production Operations

```bash
# Manual intelligence generation
uv run python -m src.orchestrators.content_analysis_orchestrator

# Manual competitor incremental scraping
uv run python -m src.orchestrators.competitive_intelligence_orchestrator --mode incremental

# Weekly metadata refresh
uv run python -m src.orchestrators.competitive_intelligence_orchestrator --mode metadata-refresh

# View the latest intelligence
cat data/intelligence/daily/hkia_intelligence_$(date +%Y-%m-%d).json | jq
```

## Next Steps

1. **Immediate**: Begin Phase 1 implementation with the content analysis framework
2. **Week 1**: Set up the Claude Haiku integration and test it on existing HKIA content
3. **Week 2**: Complete content classification for all current sources
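As an aside, the flags used in the commands above (`--competitor`, `--platforms`, `--mode`, `--force`) suggest one small shared CLI for the orchestrators. A sketch with `argparse`; the flag names come from the examples, but the choices and defaults are assumptions:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI surface matching the orchestrator invocations shown above."""
    parser = argparse.ArgumentParser(prog="competitive_intelligence_orchestrator")
    parser.add_argument(
        "--mode",
        choices=["incremental", "metadata-refresh", "backlog"],
        default="incremental",
        help="which scraping pass to run",
    )
    parser.add_argument("--competitor", help="limit the run to one competitor slug")
    parser.add_argument(
        "--platforms",
        type=lambda s: s.split(","),
        default=None,
        help="comma-separated platform list, e.g. youtube,instagram",
    )
    parser.add_argument("--force", action="store_true", help="ignore saved state")
    return parser
```

Keeping the flag parsing in one place means the backlog, incremental, and refresh entry points stay consistent with each other and with the runbooks.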
4. **Week 3**: Begin competitor infrastructure development
5. **Week 4**: Deploy HVACR School competitor tracking

This plan provides a structured approach to implementing comprehensive content analysis and competitive intelligence while leveraging existing infrastructure and maintaining cost efficiency.
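Before Phase 1 work starts, it helps to pin down a concrete shape for the daily intelligence file that the `jq` command above reads. A hypothetical skeleton writer; every field name here is an assumption to be settled during Phase 1:

```python
import json
from datetime import date
from pathlib import Path


def write_daily_intelligence(day: date, items: list[dict], out_dir: Path) -> Path:
    """Write data/intelligence/daily/hkia_intelligence_YYYY-MM-DD.json."""
    report = {
        "date": day.isoformat(),
        "items_analyzed": len(items),
        "topics": sorted({i["topic"] for i in items}),
        "top_performers": sorted(
            items, key=lambda i: i.get("engagement", 0), reverse=True
        )[:5],
        "opportunities": [],  # filled in later by the gap/trend analyzers
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"hkia_intelligence_{day.isoformat()}.json"
    path.write_text(json.dumps(report, indent=2))
    return path
```

Fixing the schema early lets the weekly and monthly aggregators in Phase 4 consume daily files without migrations.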