hvac-kia-content/UPDATED_CAPTURE_STATUS.md
Ben Reed 0a795437a7 Optimize Instagram scraper and increase capture targets to 1000
- Increased Instagram rate limit from 100 to 200 posts/hour
- Reduced delays: 10-20s (was 15-30s), extended breaks 30-60s (was 60-120s)
- Extended break interval: every 10 requests (was 5)
- Updated capture targets: 1000 posts for Instagram, 1000 videos for TikTok
- Added production deployment and monitoring scripts
- Created environment configuration template

This provides ~40-50% speed improvement for Instagram scraping and
captures 5x more Instagram content and 3.3x more TikTok content.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-18 22:59:11 -03:00


HVAC Know It All - Updated Production Backlog Capture

🚀 Updated Configuration

Started: August 18, 2025 @ 10:54 PM ADT

📈 New Rate Limits & Targets

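| Source | Previous Target | New Target | Rate Limit | Estimated Time |
|-----------|-----------------|-------------|---------------|----------------|
| Instagram | 200 posts | 1000 posts | 200/hour | ~5 hours |
| TikTok | 300 videos | 1000 videos | Browser-based | ~2-3 hours |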

Instagram Optimization Changes

  • Rate limit: Increased from 100 to 200 posts/hour
  • Delays between requests: Reduced from 15-30 seconds to 10-20 seconds
  • Extended breaks: Every 10 requests (was 5)
  • Break duration: 30-60 seconds (was 60-120s)
  • Speed improvement: ~40-50% faster
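
As a rough check on the speed figure, the expected wait per request can be worked out from the delay and break settings listed above. The sketch below is only an estimate: it assumes each delay and break is drawn uniformly from its range and that one extended break follows every Nth request. The function name `avg_seconds_per_request` is hypothetical and is not taken from the scraper's code.

```python
def avg_seconds_per_request(delay_range, break_range, break_every):
    """Expected waiting time per request, assuming uniformly random waits."""
    mean_delay = sum(delay_range) / 2
    mean_break = sum(break_range) / 2
    # One extended break is amortized across `break_every` requests.
    return mean_delay + mean_break / break_every


# Previous settings: 15-30s delays, 60-120s break every 5 requests.
old = avg_seconds_per_request((15, 30), (60, 120), break_every=5)
# New settings: 10-20s delays, 30-60s break every 10 requests.
new = avg_seconds_per_request((10, 20), (30, 60), break_every=10)

reduction = 1 - new / old
print(f"old: {old:.1f}s, new: {new:.1f}s, waiting time cut by {reduction:.0%}")
# prints "old: 40.5s, new: 19.5s, waiting time cut by 52%"
```

Under these assumptions the scheduled waiting time per request drops by about half. Network and processing time per request are not modelled, so the end-to-end gain would be somewhat smaller, which is consistent with the ~40-50% figure.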

🎯 TikTok Enhancements

  • Total videos: 1000 (if available)
  • Videos with captions: 100 (increased from 50)
  • Caption fetching: Individual page visits for detailed content

📊 Already Completed Sources

| Source | Items Captured | File Size | Status |
|-----------|----------------|-----------|----------|
| WordPress | 139 posts | 1.5 MB | Complete |
| Podcast | 428 episodes | 727 KB | Complete |
| YouTube | 200 videos | 107 KB | Complete |

🔄 Currently Processing

  • Instagram: Fetching 1000 posts with optimized rate limiting
  • Next: TikTok, with a target of 1000 videos

📁 Output Location

/home/ben/dev/hvac-kia-content/data_production_backlog/markdown_current/
├── hvacknowitall_wordpress_backlog_[timestamp].md
├── hvacknowitall_podcast_backlog_[timestamp].md
├── hvacknowitall_youtube_backlog_[timestamp].md
├── hvacknowitall_instagram_backlog_[timestamp].md (pending)
└── hvacknowitall_tiktok_backlog_[timestamp].md (pending)
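
To see which of the files above have been written so far, a short script can match the naming pattern in the output directory. This is a minimal sketch: the `[timestamp]` format is not specified here, so it is matched with a wildcard, and the helper name `backlog_status` is hypothetical. A file being present does not by itself prove the capture finished.

```python
from pathlib import Path

SOURCES = ["wordpress", "podcast", "youtube", "instagram", "tiktok"]


def backlog_status(output_dir):
    """Map each source to the sorted names of its backlog files, if any."""
    base = Path(output_dir)
    return {
        # The timestamp part of the name is unknown, so match it with "*".
        source: sorted(
            p.name for p in base.glob(f"hvacknowitall_{source}_backlog_*.md")
        )
        for source in SOURCES
    }


if __name__ == "__main__":
    out = "/home/ben/dev/hvac-kia-content/data_production_backlog/markdown_current"
    for source, files in backlog_status(out).items():
        print(f"{source}: {'found' if files else 'missing'}")
```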

📈 Progress Monitoring

To monitor real-time progress:

# Watch Instagram progress
tail -f instagram_1000.log

# Check overall status
./monitor_backlog_progress.sh --live

⏱️ Time Estimates

  • Instagram: ~5 hours for 1000 posts at 200/hour
  • TikTok: ~2-3 hours for 1000 videos (depends on caption fetching)
  • Total remaining: ~7-8 hours
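
The estimates above follow from simple arithmetic, sketched here for reference. The Instagram figure is the target divided by the rate limit; no rate is given for TikTok, so its quoted 2-3 hour range is used as-is. The function name `hours_to_capture` is illustrative only.

```python
def hours_to_capture(target_items, items_per_hour):
    """Hours needed to reach a target at a steady capture rate."""
    return target_items / items_per_hour


instagram_hours = hours_to_capture(1000, 200)
tiktok_hours = (2, 3)  # range quoted in this document; no rate limit is stated

# Instagram and TikTok run one after the other, so the ranges add.
total_hours = (instagram_hours + tiktok_hours[0], instagram_hours + tiktok_hours[1])
print(f"Instagram: {instagram_hours:.0f}h, total: {total_hours[0]:.0f}-{total_hours[1]:.0f}h")
# prints "Instagram: 5h, total: 7-8h"
```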

🎯 Final Deliverables

  • ~2,767 total items (767 already captured + up to 2,000 new, if available)
  • Specification-compliant markdown for all sources
  • Media files downloaded and organized
  • NAS synchronization upon completion
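
The item total can be reproduced from the per-source counts reported in this document. The sketch below assumes both new targets are reached in full, which may not hold if fewer TikTok videos are available.

```python
# Counts from the completed sources.
completed = {"WordPress": 139, "Podcast": 428, "YouTube": 200}
# Targets for the sources still in progress (upper bounds, not results).
pending_targets = {"Instagram": 1000, "TikTok": 1000}

already_captured = sum(completed.values())
total_items = already_captured + sum(pending_targets.values())
print(already_captured, total_items)  # prints "767 2767"
```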

📝 Notes

The increased targets will provide a much more comprehensive historical dataset:

  • Instagram: 5x the originally planned content (1000 posts vs. 200)
  • TikTok: 3.3x the originally planned content (1000 videos vs. 300)
  • This will capture a significant portion of the brand's social media history
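
For reference, the multipliers quoted in these notes are the new targets divided by the previous ones, as this short sketch shows. The dictionary names are illustrative.

```python
previous_targets = {"Instagram": 200, "TikTok": 300}
new_targets = {"Instagram": 1000, "TikTok": 1000}

# Ratio of new target to previous target, rounded to one decimal place.
growth = {
    source: round(new_targets[source] / previous_targets[source], 1)
    for source in previous_targets
}
print(growth)  # prints "{'Instagram': 5.0, 'TikTok': 3.3}"
```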