Compare commits
No commits in common. "4bdb3de6e8eb9d7bc1223dfc6211c5468b132712" and "8a0b8b4d3f1ffd31655baa756d24952ca08ae47f" have entirely different histories.
4bdb3de6e8...8a0b8b4d3f
122 changed files with 254 additions and 87784 deletions
@@ -1,10 +1,10 @@
# HKIA - Production Environment Variables
# HVAC Know It All - Production Environment Variables
# Copy to /opt/hvac-kia-content/.env and update with actual values

# WordPress Configuration
WORDPRESS_USERNAME=your_wordpress_username
WORDPRESS_API_KEY=your_wordpress_api_key
WORDPRESS_BASE_URL=https://hkia.com
WORDPRESS_BASE_URL=https://hvacknowitall.com

# YouTube Configuration
YOUTUBE_CHANNEL_URL=https://www.youtube.com/@HVACKnowItAll
@@ -15,16 +15,16 @@ INSTAGRAM_USERNAME=your_instagram_username
INSTAGRAM_PASSWORD=your_instagram_password

# TikTok Configuration
TIKTOK_TARGET=@hkia
TIKTOK_TARGET=@hvacknowitall

# MailChimp RSS Configuration
MAILCHIMP_RSS_URL=https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985

# Podcast RSS Configuration
PODCAST_RSS_URL=https://hkia.com/podcast/feed/
PODCAST_RSS_URL=https://hvacknowitall.com/podcast/feed/

# NAS and Storage Configuration
NAS_PATH=/mnt/nas/hkia
NAS_PATH=/mnt/nas/hvacknowitall
DATA_DIR=/opt/hvac-kia-content/data
LOGS_DIR=/opt/hvac-kia-content/logs

@@ -41,7 +41,7 @@ SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USERNAME=your_email@gmail.com
SMTP_PASSWORD=your_app_password
ALERT_EMAIL=alerts@hkia.com
ALERT_EMAIL=alerts@hvacknowitall.com

# Production Settings
ENVIRONMENT=production
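The environment file above is consumed at runtime from `/opt/hvac-kia-content/.env`. A minimal sketch of loading it with python-dotenv (a package the repository's own scripts already import); the fallback values simply mirror the example file and are not authoritative:

```python
import os
from pathlib import Path

from dotenv import load_dotenv  # python-dotenv

# Load the production env file; fall back to a local .env during development.
env_file = Path("/opt/hvac-kia-content/.env")
load_dotenv(env_file if env_file.exists() else Path(".env"))

wordpress_base_url = os.getenv("WORDPRESS_BASE_URL", "https://hvacknowitall.com")
nas_path = Path(os.getenv("NAS_PATH", "/mnt/nas/hvacknowitall"))
data_dir = Path(os.getenv("DATA_DIR", "/opt/hvac-kia-content/data"))

print(f"WordPress: {wordpress_base_url}")
print(f"NAS path:  {nas_path} (mounted: {nas_path.exists()})")
print(f"Data dir:  {data_dir}")
```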
@@ -1,4 +1,4 @@
# HKIA - Production Backlog Capture Status
# HVAC Know It All - Production Backlog Capture Status

## 📊 Current Progress Report
**Last Updated**: August 18, 2025 @ 10:23 PM ADT
@@ -30,9 +30,9 @@ All markdown files are being created in specification-compliant format:

```
/home/ben/dev/hvac-kia-content/data_production_backlog/markdown_current/
├── hkia_wordpress_backlog_20250818_221430.md (1.5M)
├── hkia_podcast_backlog_20250818_221531.md (727K)
└── hkia_youtube_backlog_20250818_221604.md (107K)
├── hvacknowitall_wordpress_backlog_20250818_221430.md (1.5M)
├── hvacknowitall_podcast_backlog_20250818_221531.md (727K)
└── hvacknowitall_youtube_backlog_20250818_221604.md (107K)
```

### ✅ Format Verification
@@ -40,7 +40,7 @@ All markdown files are being created in specification-compliant format:
- Correct markdown structure with `##` headers
- Full content including descriptions and metadata
- Item separators (`--------------------------------------------------`)
- Timestamped filenames: `hkia_[source]_backlog_[timestamp].md`
- Timestamped filenames: `hvacknowitall_[source]_backlog_[timestamp].md`

## 📊 Statistics
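The backlog files above follow the `[brand]_[source]_backlog_[timestamp].md` pattern with a `YYYYMMDD_HHMMSS` timestamp in Atlantic time. A small illustrative sketch of generating such a name; the helper function is hypothetical, not repository code:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib in Python 3.9+


def backlog_filename(brand: str, source: str) -> str:
    """Build a name like hvacknowitall_wordpress_backlog_20250818_221430.md."""
    now = datetime.now(ZoneInfo("America/Halifax"))
    return f"{brand}_{source}_backlog_{now.strftime('%Y%m%d_%H%M%S')}.md"


print(backlog_filename("hvacknowitall", "wordpress"))
```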
155 CLAUDE.md
@@ -1,49 +1,41 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

# HKIA Content Aggregation System
# HVAC Know It All Content Aggregation System

## Project Overview
Complete content aggregation system that scrapes 5 sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram), converts to markdown, and runs twice daily with incremental updates. TikTok scraper disabled due to technical issues.
Complete content aggregation system that scrapes 6 sources (WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram, TikTok), converts to markdown, and runs twice daily with incremental updates.

## Architecture
- **Base Pattern**: Abstract scraper class with common interface
- **State Management**: JSON-based incremental update tracking
- **Parallel Processing**: All 5 active sources run in parallel
- **Output Format**: `hkia_[source]_[timestamp].md`
- **Parallel Processing**: 5 sources run in parallel, TikTok separate (GUI requirement)
- **Output Format**: `hvacknowitall_[source]_[timestamp].md`
- **Archive System**: Previous files archived to timestamped directories
- **NAS Sync**: Automated rsync to `/mnt/nas/hkia/`
- **NAS Sync**: Automated rsync to `/mnt/nas/hvacknowitall/`

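As an illustration of the architecture bullets above (abstract base class, JSON state tracking, parallel execution), here is a minimal, self-contained sketch; the class and file names are hypothetical simplifications, not the repository's actual `BaseScraper` or orchestrator code:

```python
import json
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class SimpleScraper(ABC):
    """Hypothetical reduction of the abstract scraper pattern."""

    def __init__(self, source: str, state_dir: Path = Path("data/.state")):
        self.source = source
        self.state_file = state_dir / f"{source}_state.json"

    def load_state(self) -> dict:
        # JSON-based incremental tracking: remember the last item seen.
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {"last_seen_id": None}

    def save_state(self, state: dict) -> None:
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        self.state_file.write_text(json.dumps(state, indent=2))

    @abstractmethod
    def fetch_content(self, state: dict) -> list[dict]:
        """Return only items newer than state['last_seen_id']."""

    def run(self) -> int:
        state = self.load_state()
        items = self.fetch_content(state)
        if items:
            state["last_seen_id"] = items[0]["id"]
            self.save_state(state)
        return len(items)


class DummyScraper(SimpleScraper):
    def fetch_content(self, state: dict) -> list[dict]:
        return [{"id": "demo-1", "title": f"{self.source} item"}]


if __name__ == "__main__":
    scrapers = [DummyScraper(s) for s in ["wordpress", "podcast", "youtube", "instagram", "mailchimp"]]
    # Parallel processing: each active source runs in its own worker.
    with ThreadPoolExecutor(max_workers=len(scrapers)) as pool:
        counts = list(pool.map(lambda s: s.run(), scrapers))
    print(dict(zip([s.source for s in scrapers], counts)))
```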
## Key Implementation Details

### Instagram Scraper (`src/instagram_scraper.py`)
- Uses `instaloader` with session persistence
- Aggressive rate limiting: 15-30 second delays, extended breaks every 5 requests
- Session file: `instagram_session_hkia1.session`
- Authentication: Username `hkia1`, password `I22W5YlbRl7x`
- Session file: `instagram_session_hvacknowitall1.session`
- Authentication: Username `hvacknowitall1`, password `I22W5YlbRl7x`

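A sketch of the kind of humanized rate limiting described above (15-30 second delays with an extended break every fifth request). The interval values mirror the bullet points; the length of the extended break and the function itself are assumptions for illustration, not the scraper's actual implementation:

```python
import random
import time


def polite_pause(request_count: int) -> float:
    """Sleep 15-30 s between requests, with a longer break every 5th request."""
    delay = random.uniform(15, 30)
    if request_count and request_count % 5 == 0:
        delay += random.uniform(60, 120)  # extended break; exact length is an assumption
    time.sleep(delay)
    return delay


for i in range(1, 4):  # demo loop only
    waited = polite_pause(i)
    print(f"request {i}: waited {waited:.1f}s")
```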
### ~~TikTok Scraper~~ ❌ **DISABLED**
- **Status**: Disabled in orchestrator due to technical issues
- **Reason**: GUI requirements incompatible with automated deployment
- **Code**: Still available in `src/tiktok_scraper_advanced.py` but not active
### TikTok Scraper (`src/tiktok_scraper_advanced.py`)
- Advanced anti-bot detection using Scrapling + Camofaux
- **Requires headed browser with DISPLAY=:0**
- Stealth features: geolocation spoofing, OS randomization, WebGL support
- Cannot be containerized due to GUI requirements

### YouTube Scraper (`src/youtube_scraper.py`)
- Uses `yt-dlp` with authentication for metadata and transcript extraction
- Channel: `@hkia`
- **Authentication**: Firefox cookie extraction via `YouTubeAuthHandler`
- **Transcript Support**: Can extract transcripts when `fetch_transcripts=True`
- ⚠️ **Current Limitation**: YouTube's new PO token requirements (Aug 2025) block transcript extraction
  - Error: "The following content is not available on this app"
  - **179 videos identified** with captions available but currently inaccessible
  - Requires `yt-dlp` updates to handle new YouTube restrictions
- Uses `yt-dlp` for metadata extraction
- Channel: `@HVACKnowItAll`
- Fetches video metadata without downloading videos

### RSS Scrapers
- **MailChimp**: `https://us10.campaign-archive.com/feed?u=d1a98c3e62003104038942e21&id=2205dbf985`
- **Podcast**: `https://feeds.libsyn.com/568690/spotify`

### WordPress Scraper (`src/wordpress_scraper.py`)
- Direct API access to `hkia.com`
- Direct API access to `hvacknowitall.com`
- Fetches blog posts with full content

## Technical Stack
@@ -58,40 +50,38 @@ Complete content aggregation system that scrapes 5 sources (WordPress, MailChimp

## Deployment Strategy

### ✅ Production Setup - systemd Services
**TikTok disabled** - no longer requires GUI access or containerization restrictions.
### ⚠️ IMPORTANT: systemd Services (Not Kubernetes)
Originally planned for Kubernetes deployment but **TikTok requires headed browser with DISPLAY=:0**, making containerization impossible.

### Production Setup
```bash
# Service files location (✅ INSTALLED)
/etc/systemd/system/hkia-scraper.service
/etc/systemd/system/hkia-scraper.timer
/etc/systemd/system/hkia-scraper-nas.service
/etc/systemd/system/hkia-scraper-nas.timer
# Service files location
/etc/systemd/system/hvac-scraper.service
/etc/systemd/system/hvac-scraper.timer
/etc/systemd/system/hvac-scraper-nas.service
/etc/systemd/system/hvac-scraper-nas.timer

# Working directory
/home/ben/dev/hvac-kia-content/

# Installation script
./install-hkia-services.sh
# Installation directory
/opt/hvac-kia-content/

# Environment setup
export DISPLAY=:0
export XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
```

### Schedule (✅ ACTIVE)
- **Main Scraping**: 8:00 AM and 12:00 PM Atlantic Daylight Time (5 sources)
- **NAS Sync**: 8:30 AM and 12:30 PM (30 minutes after scraping)
- **User**: ben (GUI environment available but not required)
### Schedule
- **Main Scraping**: 8AM and 12PM Atlantic Daylight Time
- **NAS Sync**: 30 minutes after each scraping run
- **User**: ben (requires GUI access for TikTok)

## Environment Variables
```bash
# Required in /opt/hvac-kia-content/.env
INSTAGRAM_USERNAME=hkia1
INSTAGRAM_USERNAME=hvacknowitall1
INSTAGRAM_PASSWORD=I22W5YlbRl7x
YOUTUBE_CHANNEL=@hkia
TIKTOK_USERNAME=hkia
NAS_PATH=/mnt/nas/hkia
YOUTUBE_CHANNEL=@HVACKnowItAll
TIKTOK_USERNAME=hvacknowitall
NAS_PATH=/mnt/nas/hvacknowitall
TIMEZONE=America/Halifax
DISPLAY=:0
XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3"
@@ -107,78 +97,37 @@ uv run python test_real_data.py --source [youtube|instagram|tiktok|wordpress|mai
# Test backlog processing
uv run python test_real_data.py --type backlog --items 50

# Test cumulative markdown system
uv run python test_cumulative_mode.py

# Full test suite
uv run pytest tests/ -v

# Test with specific GUI environment for TikTok
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python test_real_data.py --source tiktok

# Test YouTube transcript extraction (currently blocked by YouTube)
DISPLAY=:0 XAUTHORITY="/run/user/1000/.mutter-Xwaylandauth.90WDB3" uv run python youtube_backlog_all_with_transcripts.py
```

### Production Operations
```bash
# Service management (✅ ACTIVE SERVICES)
sudo systemctl status hkia-scraper.timer
sudo systemctl status hkia-scraper-nas.timer
sudo journalctl -f -u hkia-scraper.service
sudo journalctl -f -u hkia-scraper-nas.service
# Run orchestrator manually
uv run python -m src.orchestrator

# Manual runs (for testing)
uv run python run_production_with_images.py
# Run specific sources
uv run python -m src.orchestrator --sources youtube instagram

# NAS sync only
uv run python -m src.orchestrator --nas-only

# Legacy commands (still work)
uv run python -m src.orchestrator
uv run python run_production_cumulative.py
# Check service status
sudo systemctl status hvac-scraper.service
sudo journalctl -f -u hvac-scraper.service
```

## Critical Notes

1. **✅ TikTok Scraper**: DISABLED - No longer blocks deployment or requires GUI access
1. **TikTok GUI Requirement**: Must run on desktop environment with DISPLAY=:0
2. **Instagram Rate Limiting**: 100 requests/hour with exponential backoff
3. **YouTube Transcript Limitations**: As of August 2025, YouTube blocks transcript extraction
   - PO token requirements prevent `yt-dlp` access to subtitle/caption data
   - 179 videos identified with captions but currently inaccessible
   - Authentication system works but content restricted at platform level
4. **State Files**: Located in `data/markdown_current/.state/` directory for incremental updates
5. **Archive Management**: Previous files automatically moved to timestamped archives
6. **Error Recovery**: All scrapers handle rate limits and network failures gracefully
7. **✅ Production Services**: Fully automated with systemd timers running twice daily
3. **State Files**: Located in `state/` directory for incremental updates
4. **Archive Management**: Previous files automatically moved to timestamped archives
5. **Error Recovery**: All scrapers handle rate limits and network failures gracefully

## YouTube Transcript Investigation (August 2025)

**Objective**: Extract transcripts for 179 YouTube videos identified as having captions available.

**Investigation Findings**:
- ✅ **179 videos identified** with captions from existing YouTube data
- ✅ **Existing authentication system** (`YouTubeAuthHandler` + Firefox cookies) working
- ✅ **Transcript extraction code** properly implemented in `YouTubeScraper`
- ❌ **Platform restrictions** blocking all video access as of August 2025

**Technical Attempts**:
1. **YouTube Data API v3**: Requires OAuth2 for `captions.download` (not just API keys)
2. **youtube-transcript-api**: IP blocking after minimal requests
3. **yt-dlp with authentication**: All videos blocked with "not available on this app"

**Current Blocker**:
YouTube's new PO token requirements prevent access to video content and transcripts, even with valid authentication. Error: "The following content is not available on this app.. Watch on the latest version of YouTube."

**Resolution**: Requires upstream `yt-dlp` updates to handle new YouTube platform restrictions.

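For reference, the kind of probe described under "Technical Attempts" can be reproduced with yt-dlp's Python API. This is a hedged sketch rather than the repository's scraper code; the video ID is a placeholder, and per the findings above the request is expected to fail with the PO-token error on blocked videos:

```python
from yt_dlp import YoutubeDL
from yt_dlp.utils import DownloadError

VIDEO_URL = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder ID

opts = {
    "skip_download": True,               # metadata only, never the video file
    "cookiesfrombrowser": ("firefox",),  # reuse Firefox cookies, as the scraper does
    "quiet": True,
}

try:
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(VIDEO_URL, download=False)
    print(f"manual subtitle tracks: {list(info.get('subtitles', {}))}")
    print(f"auto caption tracks:    {list(info.get('automatic_captions', {}))}")
except DownloadError as exc:
    # As of August 2025 this is where the "not available on this app" error surfaces.
    print(f"blocked: {exc}")
```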
## Project Status: ✅ COMPLETE & DEPLOYED
- **5 active sources** working and tested (TikTok disabled)
- **✅ Production deployment**: systemd services installed and running
- **✅ Automated scheduling**: 8 AM & 12 PM ADT with NAS sync
- **✅ Comprehensive testing**: 68+ tests passing
- **✅ Real-world data validation**: All sources producing content
- **✅ Full backlog processing**: Verified for all active sources
- **✅ Cumulative markdown system**: Operational
- **✅ Image downloading system**: 686 images synced daily
- **✅ NAS synchronization**: Automated twice-daily sync
- **YouTube transcript extraction**: Blocked by platform restrictions (not code issues)
## Project Status: ✅ COMPLETE
- All 6 sources working and tested
- Production deployment ready via systemd
- Comprehensive testing completed (68+ tests passing)
- Real-world data validation completed
- Full backlog processing capability verified
124 DEPLOY.md
@@ -1,124 +0,0 @@
# Deployment Instructions

## Prerequisites

Ensure the following are completed:
1. Python environment is set up with `uv`
2. All dependencies installed: `uv pip install -r requirements.txt`
3. `.env` file configured with API credentials
4. Test run successful: `uv run python run_api_production_v2.py`

## Deploy to Production

### Option 1: Automated Installation (Recommended)

```bash
cd /home/ben/dev/hvac-kia-content/deploy
sudo ./install.sh
```

This will:
- Copy systemd service files to `/etc/systemd/system/`
- Enable and start the timers
- Show service status

### Option 2: Manual Installation

```bash
# Copy service files
sudo cp deploy/*.service /etc/systemd/system/
sudo cp deploy/*.timer /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Enable timers (start on boot)
sudo systemctl enable hkia-content-8am.timer
sudo systemctl enable hkia-content-12pm.timer

# Start timers immediately
sudo systemctl start hkia-content-8am.timer
sudo systemctl start hkia-content-12pm.timer
```

## Verify Deployment

Check timer status:
```bash
systemctl list-timers | grep hvac
```

Expected output:
```
NEXT LEFT LAST PASSED UNIT ACTIVATES
Mon 2025-08-20 08:00:00 ADT 21h left n/a n/a hkia-content-8am.timer hkia-content-8am.service
Mon 2025-08-19 12:00:00 ADT 1h 9min left n/a n/a hkia-content-12pm.timer hkia-content-12pm.service
```

## Monitor Services

View logs in real-time:
```bash
# Morning run logs
journalctl -u hkia-content-8am -f

# Noon run logs
journalctl -u hkia-content-12pm -f

# All logs
journalctl -u hkia-content-* -f
```

## Manual Testing

Run the service manually:
```bash
# Test morning run
sudo systemctl start hkia-content-8am.service

# Check status
sudo systemctl status hkia-content-8am.service
```

## Stop/Disable Services

If needed:
```bash
# Stop timers
sudo systemctl stop hkia-content-8am.timer
sudo systemctl stop hkia-content-12pm.timer

# Disable from starting on boot
sudo systemctl disable hkia-content-8am.timer
sudo systemctl disable hkia-content-12pm.timer
```

## Troubleshooting

### Service Fails to Start
1. Check logs: `journalctl -u hkia-content-8am -n 50`
2. Verify paths in service files
3. Check Python environment: `source .venv/bin/activate && python --version`
4. Test manual run: `cd /home/ben/dev/hvac-kia-content && uv run python run_api_production_v2.py`

### Permission Issues
- Ensure user `ben` has read/write access to data directories
- Check NAS mount permissions: `ls -la /mnt/nas/hkia/`

### Timer Not Triggering
- Check timer status: `systemctl status hkia-content-8am.timer`
- Verify system time: `timedatectl`
- Check timer schedule: `systemctl cat hkia-content-8am.timer`

## Schedule

The system runs automatically at:
- **8:00 AM ADT** - Morning content aggregation
- **12:00 PM ADT** - Noon content aggregation

Both runs will:
1. Fetch new content from all sources
2. Merge with existing cumulative files
3. Update metrics and add captions where available
4. Archive previous versions
5. Sync to NAS at `/mnt/nas/hkia/`
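The DEPLOY.md instructions above schedule runs at 8:00 AM and 12:00 PM Atlantic time. A small illustrative sketch (not part of the repository) that computes the next scheduled run in America/Halifax, useful for sanity-checking `systemctl list-timers` output:

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

HALIFAX = ZoneInfo("America/Halifax")
RUN_TIMES = [time(8, 0), time(12, 0)]  # 8:00 AM and 12:00 PM Atlantic


def next_run(now: datetime | None = None) -> datetime:
    """Return the next scheduled aggregation run after `now`."""
    now = now or datetime.now(HALIFAX)
    candidates = [
        datetime.combine(now.date() + timedelta(days=offset), t, tzinfo=HALIFAX)
        for offset in (0, 1)
        for t in RUN_TIMES
    ]
    return min(c for c in candidates if c > now)


print(next_run().isoformat())
```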
@@ -1,4 +1,4 @@
# HKIA - Production Backlog Capture Tally Report
# HVAC Know It All - Production Backlog Capture Tally Report
**Generated**: August 18, 2025 @ 11:00 PM ADT

## ✅ Markdown Creation Verification
@@ -7,9 +7,9 @@ All completed sources have been successfully saved to specification-compliant ma

| Source | Status | Markdown File | Items | File Size | Verification |
|--------|--------|---------------|-------|-----------|--------------|
| **WordPress** | ✅ Complete | hkia_wordpress_backlog_20250818_221430.md | 139 posts | 1.5 MB | ✅ Verified |
| **Podcast** | ✅ Complete | hkia_podcast_backlog_20250818_221531.md | 428 episodes | 727 KB | ✅ Verified |
| **YouTube** | ✅ Complete | hkia_youtube_backlog_20250818_221604.md | 200 videos | 107 KB | ✅ Verified |
| **WordPress** | ✅ Complete | hvacknowitall_wordpress_backlog_20250818_221430.md | 139 posts | 1.5 MB | ✅ Verified |
| **Podcast** | ✅ Complete | hvacknowitall_podcast_backlog_20250818_221531.md | 428 episodes | 727 KB | ✅ Verified |
| **YouTube** | ✅ Complete | hvacknowitall_youtube_backlog_20250818_221604.md | 200 videos | 107 KB | ✅ Verified |
| **MailChimp** | ⚠️ SSL Error | N/A | 0 | N/A | Known Issue |
| **Instagram** | 🔄 In Progress | Pending completion | 15/1000 | TBD | Processing |
| **TikTok** | ⏳ Queued | Pending | 0/1000 | TBD | Waiting |
244 README.md
@@ -1,244 +0,0 @@
# HKIA Content Aggregation System

A containerized Python application that aggregates content from multiple HKIA sources, converts them to markdown format, and syncs to a NAS.

## Features

- **Multi-source content aggregation** from YouTube, Instagram, TikTok, MailChimp, WordPress, and Podcast RSS
- **Comprehensive image downloading** for all visual content (Instagram posts, YouTube thumbnails, Podcast artwork)
- **Cumulative markdown management** - Single source-of-truth files that grow with backlog and incremental updates
- **API integrations** for YouTube Data API v3 and MailChimp API
- **Intelligent content merging** with caption/transcript updates and metric tracking
- **Automated NAS synchronization** to `/mnt/nas/hkia/` for both markdown and media files
- **State management** for incremental updates
- **Parallel processing** for multiple sources
- **Atlantic timezone** (America/Halifax) timestamps

## Cumulative Markdown System

### Overview
The system maintains a single markdown file per source that combines:
- Initial backlog content (historical data)
- Daily incremental updates (new content)
- Content updates (new captions, updated metrics)

### How It Works

1. **Initial Backlog**: First run creates base file with all historical content
2. **Daily Incremental**: Subsequent runs merge new content into existing file
3. **Smart Merging**: Updates existing entries when better data is available (captions, transcripts, metrics)
4. **Archival**: Previous versions archived with timestamps for history

### File Naming Convention
```
<brandName>_<source>_<dateTime>.md
Example: hkia_YouTube_2025-08-19T143045.md
```

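The `<brandName>_<source>_<dateTime>.md` convention above uses an Atlantic-time timestamp of the form `2025-08-19T143045`. A minimal sketch of producing it, mirroring the `get_atlantic_timestamp()` helper that appears in the repository scripts further down this diff; the wrapper function itself is illustrative:

```python
from datetime import datetime

import pytz  # the repository's scripts use pytz for the Atlantic timezone


def current_filename(brand: str, source: str) -> str:
    """Build a spec-compliant name such as hkia_YouTube_2025-08-19T143045.md."""
    stamp = datetime.now(pytz.timezone("America/Halifax")).strftime("%Y-%m-%dT%H%M%S")
    return f"{brand}_{source}_{stamp}.md"


print(current_filename("hkia", "YouTube"))
```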
## Quick Start

### Installation

```bash
# Install UV package manager
pip install uv

# Install dependencies
uv pip install -r requirements.txt
```

### Configuration

Create `.env` file with credentials:
```env
# YouTube
YOUTUBE_API_KEY=your_api_key

# MailChimp
MAILCHIMP_API_KEY=your_api_key
MAILCHIMP_SERVER_PREFIX=us10

# Instagram
INSTAGRAM_USERNAME=username
INSTAGRAM_PASSWORD=password

# WordPress
WORDPRESS_USERNAME=username
WORDPRESS_API_KEY=api_key
```

### Running

```bash
# Run all scrapers (parallel)
uv run python run_all_scrapers.py

# Run single source
uv run python -m src.youtube_api_scraper_v2

# Test cumulative mode
uv run python test_cumulative_mode.py

# Consolidate existing files
uv run python consolidate_current_files.py
```

## Architecture

### Core Components

- **BaseScraper**: Abstract base class for all scrapers
- **BaseScraperCumulative**: Enhanced base with cumulative support
- **CumulativeMarkdownManager**: Handles intelligent file merging
- **ContentOrchestrator**: Manages parallel scraper execution

### Data Flow

```
1. Scraper fetches content (checks state for incremental)
2. CumulativeMarkdownManager loads existing file
3. Merges new content (adds new, updates existing)
4. Archives previous version
5. Saves updated file with current timestamp
6. Updates state for next run
```

### Directory Structure

```
data/
├── markdown_current/     # Current single-source-of-truth files
├── markdown_archives/    # Historical versions by source
│   ├── YouTube/
│   ├── Instagram/
│   └── ...
├── media/                # Downloaded media files
│   ├── Instagram/        # Instagram images and video thumbnails
│   ├── YouTube/          # YouTube video thumbnails
│   ├── Podcast/          # Podcast episode artwork
│   └── ...
└── .state/               # State files for incremental updates

logs/                     # Log files by source
src/                      # Source code
tests/                    # Test files
```

## API Quota Management

### YouTube Data API v3
- **Daily Limit**: 10,000 units
- **Usage Strategy**: 95% daily quota for captions
- **Costs**:
  - videos.list: 1 unit
  - captions.list: 50 units
  - channels.list: 1 unit

### Rate Limiting
- Instagram: 200 posts/hour
- YouTube: Respects API quotas
- General: Exponential backoff with retry
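To make the quota numbers above concrete, here is a small worked sketch of the daily caption budget (10,000 units/day, a 95% target, 50 units per `captions.list` call); the helper is illustrative, not repository code:

```python
DAILY_LIMIT = 10_000      # YouTube Data API v3 units per day
TARGET_FRACTION = 0.95    # the project budgets 95% of quota for captions
CAPTIONS_LIST_COST = 50   # units per captions.list call


def caption_budget(quota_already_used: int = 0) -> int:
    """How many more captions.list calls fit inside today's 95% target."""
    available = int(DAILY_LIMIT * TARGET_FRACTION) - quota_already_used
    return max(available, 0) // CAPTIONS_LIST_COST


print(caption_budget())      # 190 caption checks on a fresh day
print(caption_budget(2519))  # 139, the scenario in the caption-continuation script below
```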
## Production Deployment

### Systemd Services

Services are configured in `/etc/systemd/system/`:
- `hkia-content-images-8am.service` - Morning run with image downloads
- `hkia-content-images-12pm.service` - Noon run with image downloads
- `hkia-content-images-8am.timer` - Morning schedule (8 AM Atlantic)
- `hkia-content-images-12pm.timer` - Noon schedule (12 PM Atlantic)

### Manual Deployment

```bash
# Start services
sudo systemctl start hkia-content-8am.timer
sudo systemctl start hkia-content-12pm.timer

# Enable on boot
sudo systemctl enable hkia-content-8am.timer
sudo systemctl enable hkia-content-12pm.timer

# Check status
sudo systemctl status hkia-content-*.timer
```

## Monitoring

```bash
# View logs
journalctl -u hkia-content-8am -f

# Check file growth
ls -lh data/markdown_current/

# View statistics
uv run python -c "from src.cumulative_markdown_manager import CumulativeMarkdownManager; ..."
```

## Testing

```bash
# Run all tests
uv run pytest

# Test specific scraper
uv run pytest tests/test_youtube_scraper.py

# Test cumulative mode
uv run python test_cumulative_mode.py
```

## Troubleshooting

### Common Issues

1. **Instagram Rate Limiting**: Scraper implements humanized delays (18-22 seconds between requests)
2. **YouTube Quota Exceeded**: Wait until next day, quota resets at midnight Pacific
3. **NAS Permission Errors**: Warnings are normal, files still sync successfully
4. **Missing Captions**: Use YouTube Data API instead of youtube-transcript-api

### Debug Commands

```bash
# Check scraper state
cat data/.state/*_state.json

# View recent logs
tail -f logs/YouTube/youtube_*.log

# Test single source
uv run python -m src.youtube_api_scraper_v2 --test
```

## Recent Updates (2025-08-19)

### Comprehensive Image Downloading
- Implemented full image download capability for all content sources
- Instagram: Downloads all post images, carousel images, and video thumbnails
- YouTube: Automatically fetches highest quality video thumbnails
- Podcasts: Downloads episode artwork and thumbnails
- Consistent naming: `{source}_{item_id}_{type}.{ext}`
- Media organized in `data/media/{source}/` directories

### File Naming Standardization
- Migrated to project specification compliant naming
- Format: `<brandName>_<source>_<dateTime>.md`
- Example: `hkia_instagram_2025-08-19T100511.md`
- Archived legacy file structures to `markdown_archives/legacy_structure/`

### Instagram Backlog Expansion
- Completed initial 1000 posts capture with images
- Currently capturing posts 1001-2000 with rate limiting
- Cumulative markdown updates every 100 posts
- Full image download for all historical content

### Production Automation
- Deployed systemd services for twice-daily runs (8 AM, 12 PM Atlantic)
- Automated NAS synchronization for markdown and media files
- Rate-limited scraping with humanized delays (10-20 seconds per Instagram post)

## License

Private repository - All rights reserved
@@ -1,4 +1,4 @@
# HKIA - Updated Production Backlog Capture
# HVAC Know It All - Updated Production Backlog Capture

## 🚀 Updated Configuration
**Started**: August 18, 2025 @ 10:54 PM ADT
@@ -37,11 +37,11 @@
## 📁 Output Location
```
/home/ben/dev/hvac-kia-content/data_production_backlog/markdown_current/
├── hkia_wordpress_backlog_[timestamp].md
├── hkia_podcast_backlog_[timestamp].md
├── hkia_youtube_backlog_[timestamp].md
├── hkia_instagram_backlog_[timestamp].md (pending)
└── hkia_tiktok_backlog_[timestamp].md (pending)
├── hvacknowitall_wordpress_backlog_[timestamp].md
├── hvacknowitall_podcast_backlog_[timestamp].md
├── hvacknowitall_youtube_backlog_[timestamp].md
├── hvacknowitall_instagram_backlog_[timestamp].md (pending)
└── hvacknowitall_tiktok_backlog_[timestamp].md (pending)
```

## 📈 Progress Monitoring
@@ -1,226 +0,0 @@
#!/usr/bin/env python3
"""
Consolidate multiple markdown files per source into single current files
Combines backlog data and incremental updates into one source of truth
Follows project specification naming: hvacnkowitall_<source>_<dateTime>.md
"""

import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent))

from datetime import datetime
import pytz
import re
from typing import Dict, List, Set
import logging

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('logs/consolidation.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger('consolidator')


def get_atlantic_timestamp() -> str:
    """Get current timestamp in Atlantic timezone."""
    tz = pytz.timezone('America/Halifax')
    return datetime.now(tz).strftime('%Y-%m-%dT%H%M%S')


def parse_markdown_sections(content: str) -> List[Dict]:
    """Parse markdown content into sections by ID."""
    sections = []

    # Split by ID headers
    parts = content.split('# ID: ')

    for part in parts[1:]:  # Skip first empty part
        if not part.strip():
            continue

        lines = part.strip().split('\n')
        section_id = lines[0].strip()

        # Get the full section content
        section_content = f"# ID: {section_id}\n" + '\n'.join(lines[1:])

        sections.append({
            'id': section_id,
            'content': section_content
        })

    return sections


def consolidate_source_files(source_name: str) -> bool:
    """Consolidate all files for a specific source into one current file."""
    logger.info(f"Consolidating {source_name} files...")

    current_dir = Path('data/markdown_current')
    archives_dir = Path('data/markdown_archives')

    # Find all files for this source
    pattern = f"hvacnkowitall_{source_name}_*.md"
    current_files = list(current_dir.glob(pattern))

    # Also check for files with different naming (like captions files)
    alt_patterns = [
        f"*{source_name}*.md",
        f"hvacnkowitall_{source_name.lower()}_*.md"
    ]

    for alt_pattern in alt_patterns:
        current_files.extend(current_dir.glob(alt_pattern))

    # Remove duplicates
    current_files = list(set(current_files))

    if not current_files:
        logger.warning(f"No files found for source: {source_name}")
        return False

    logger.info(f"Found {len(current_files)} files for {source_name}: {[f.name for f in current_files]}")

    # Track unique sections by ID
    sections_by_id: Dict[str, Dict] = {}
    all_sections = []

    # Process each file
    for file_path in current_files:
        logger.info(f"Processing {file_path.name}...")

        try:
            content = file_path.read_text(encoding='utf-8')
            sections = parse_markdown_sections(content)

            logger.info(f"  Found {len(sections)} sections")

            # Add sections, preferring newer data
            for section in sections:
                section_id = section['id']

                # If we haven't seen this ID, add it
                if section_id not in sections_by_id:
                    sections_by_id[section_id] = section
                    all_sections.append(section)
                else:
                    # Check if this version has more content (like captions)
                    old_content = sections_by_id[section_id]['content']
                    new_content = section['content']

                    # Prefer content with captions/more detail
                    if ('Caption Status:' in new_content and 'Caption Status:' not in old_content) or \
                       len(new_content) > len(old_content):
                        logger.info(f"  Updating section {section_id} with more detailed content")
                        # Update in place
                        for i, existing in enumerate(all_sections):
                            if existing['id'] == section_id:
                                all_sections[i] = section
                                sections_by_id[section_id] = section
                                break

        except Exception as e:
            logger.error(f"Error processing {file_path}: {e}")
            continue

    if not all_sections:
        logger.warning(f"No sections found for {source_name}")
        return False

    # Create consolidated content
    consolidated_content = []

    # Sort sections by ID for consistency
    all_sections.sort(key=lambda x: x['id'])

    for section in all_sections:
        consolidated_content.append(section['content'])
        consolidated_content.append("")  # Add separator

    # Generate new filename following project specification
    timestamp = get_atlantic_timestamp()
    new_filename = f"hvacnkowitall_{source_name}_{timestamp}.md"
    new_file_path = current_dir / new_filename

    # Save consolidated file
    final_content = '\n'.join(consolidated_content)
    new_file_path.write_text(final_content, encoding='utf-8')

    logger.info(f"Created consolidated file: {new_filename}")
    logger.info(f"  Total sections: {len(all_sections)}")
    logger.info(f"  File size: {len(final_content):,} characters")

    # Archive old files
    archive_source_dir = archives_dir / source_name
    archive_source_dir.mkdir(parents=True, exist_ok=True)

    archived_count = 0
    for old_file in current_files:
        if old_file.name != new_filename:  # Don't archive the new file
            try:
                archive_path = archive_source_dir / old_file.name
                old_file.rename(archive_path)
                archived_count += 1
                logger.info(f"  Archived: {old_file.name}")
            except Exception as e:
                logger.error(f"Error archiving {old_file.name}: {e}")

    logger.info(f"Archived {archived_count} old files for {source_name}")

    # Create copy in archives as well
    archive_current_path = archive_source_dir / new_filename
    archive_current_path.write_text(final_content, encoding='utf-8')

    return True


def main():
    """Main consolidation function."""
    logger.info("=" * 60)
    logger.info("CONSOLIDATING CURRENT MARKDOWN FILES")
    logger.info("=" * 60)

    # Create directories if needed
    Path('data/markdown_current').mkdir(parents=True, exist_ok=True)
    Path('data/markdown_archives').mkdir(parents=True, exist_ok=True)
    Path('logs').mkdir(parents=True, exist_ok=True)

    # Define sources to consolidate
    sources = ['YouTube', 'MailChimp', 'Instagram', 'TikTok', 'Podcast']

    consolidated = []
    failed = []

    for source in sources:
        logger.info(f"\n{'-' * 40}")
        try:
            if consolidate_source_files(source):
                consolidated.append(source)
            else:
                failed.append(source)
        except Exception as e:
            logger.error(f"Failed to consolidate {source}: {e}")
            failed.append(source)

    logger.info(f"\n{'=' * 60}")
    logger.info("CONSOLIDATION SUMMARY")
    logger.info(f"{'=' * 60}")
    logger.info(f"Successfully consolidated: {consolidated}")
    logger.info(f"Failed/No data: {failed}")

    # List final current files
    current_files = list(Path('data/markdown_current').glob('*.md'))
    logger.info(f"\nFinal current files:")
    for file in sorted(current_files):
        size = file.stat().st_size
        logger.info(f"  {file.name} ({size:,} bytes)")


if __name__ == "__main__":
    main()
@@ -1,229 +0,0 @@
#!/usr/bin/env python3
"""
Continue YouTube caption fetching using remaining quota
Fetches captions for videos 50-188 (next 139 videos by view count)
Uses up to 95% of daily quota (9,500 units)
"""

import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent))

from src.youtube_api_scraper_v2 import YouTubeAPIScraper
from src.base_scraper import ScraperConfig
from datetime import datetime
import pytz
import time
import json
import logging

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('logs/youtube_caption_continue.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger('youtube_captions')


def load_existing_videos():
    """Load existing video data from the latest markdown file."""
    latest_file = Path('data/markdown_current/hvacnkowitall_YouTube_2025-08-19T100336.md')

    if not latest_file.exists():
        logger.error(f"Latest YouTube file not found: {latest_file}")
        return []

    # Parse the markdown to extract video data
    content = latest_file.read_text(encoding='utf-8')
    videos = []

    # Simple parsing - split by video sections
    sections = content.split('# ID: ')

    for section in sections[1:]:  # Skip first empty section
        lines = section.strip().split('\n')
        if not lines:
            continue

        video_id = lines[0].strip()
        video_data = {'id': video_id}

        # Parse basic info
        for line in lines:
            if line.startswith('## Title: '):
                video_data['title'] = line.replace('## Title: ', '')
            elif line.startswith('## Views: '):
                views_str = line.replace('## Views: ', '').replace(',', '')
                video_data['view_count'] = int(views_str) if views_str.isdigit() else 0
            elif line.startswith('## Caption Status:'):
                video_data['has_caption_info'] = True

        videos.append(video_data)

    logger.info(f"Loaded {len(videos)} videos from existing file")
    return videos


def continue_caption_fetching():
    """Continue fetching captions from where we left off."""
    logger.info("=" * 60)
    logger.info("CONTINUING YOUTUBE CAPTION FETCHING")
    logger.info("=" * 60)

    # Load existing video data
    videos = load_existing_videos()

    if not videos:
        logger.error("No existing videos found to continue from")
        return False

    # Sort by view count (descending)
    videos.sort(key=lambda x: x.get('view_count', 0), reverse=True)

    # Count how many already have captions
    with_captions = sum(1 for v in videos if v.get('has_caption_info'))
    without_captions = [v for v in videos if not v.get('has_caption_info')]

    logger.info(f"Current status:")
    logger.info(f"  Total videos: {len(videos)}")
    logger.info(f"  Already have captions: {with_captions}")
    logger.info(f"  Need captions: {len(without_captions)}")

    # Calculate quota
    quota_used_so_far = 2519  # From previous run
    daily_limit = 10000
    target_usage = int(daily_limit * 0.95)  # 95% = 9,500 units
    available_quota = target_usage - quota_used_so_far

    logger.info(f"Quota analysis:")
    logger.info(f"  Daily limit: {daily_limit:,} units")
    logger.info(f"  Already used: {quota_used_so_far:,} units")
    logger.info(f"  Target (95%): {target_usage:,} units")
    logger.info(f"  Available: {available_quota:,} units")

    # Calculate how many more videos we can caption
    max_additional_captions = available_quota // 50  # 50 units per video
    videos_to_caption = without_captions[:max_additional_captions]

    logger.info(f"Caption plan:")
    logger.info(f"  Videos to caption now: {len(videos_to_caption)}")
    logger.info(f"  Estimated quota cost: {len(videos_to_caption) * 50:,} units")
    logger.info(f"  Will use: {quota_used_so_far + (len(videos_to_caption) * 50):,} units total")

    if not videos_to_caption:
        logger.info("No additional videos to caption within quota limits")
        return True

    # Set up scraper
    config = ScraperConfig(
        source_name='YouTube',
        brand_name='hvacnkowitall',
        data_dir=Path('data/markdown_current'),
        logs_dir=Path('logs/YouTube'),
        timezone='America/Halifax'
    )

    scraper = YouTubeAPIScraper(config)
    scraper.quota_used = quota_used_so_far  # Set initial quota usage

    logger.info(f"Starting caption fetching for {len(videos_to_caption)} videos...")
    start_time = time.time()

    captions_found = 0
    for i, video in enumerate(videos_to_caption, 1):
        video_id = video['id']
        title = video.get('title', 'Unknown')[:50]

        logger.info(f"[{i}/{len(videos_to_caption)}] Fetching caption for: {title}...")

        # Fetch caption info
        caption_info = scraper._fetch_caption_text(video_id)

        if caption_info:
            video['caption_text'] = caption_info
            captions_found += 1
            logger.info(f"  ✅ Caption found")
        else:
            logger.info(f"  ❌ No caption available")

        # Add delay to be respectful
        time.sleep(0.5)

        # Check if we're approaching quota limit
        if scraper.quota_used >= target_usage:
            logger.warning(f"Reached 95% quota limit at video {i}")
            break

    elapsed = time.time() - start_time

    logger.info(f"Caption fetching complete!")
    logger.info(f"  Duration: {elapsed:.1f} seconds")
    logger.info(f"  Captions found: {captions_found}")
    logger.info(f"  Quota used: {scraper.quota_used:,}/{daily_limit:,} units")
    logger.info(f"  Quota percentage: {(scraper.quota_used/daily_limit)*100:.1f}%")

    # Update the video data with new caption info
    video_lookup = {v['id']: v for v in videos}
    for video in videos_to_caption:
        if video['id'] in video_lookup and video.get('caption_text'):
            video_lookup[video['id']]['caption_text'] = video['caption_text']

    # Save updated data
    timestamp = datetime.now(pytz.timezone('America/Halifax')).strftime('%Y-%m-%dT%H%M%S')
    updated_filename = f"hvacnkowitall_YouTube_{timestamp}_captions.md"

    # Generate updated markdown (simplified version)
    markdown_sections = []
    for video in videos:
        section = []
        section.append(f"# ID: {video['id']}")
        section.append("")
        section.append(f"## Title: {video.get('title', 'Unknown')}")
        section.append("")
        section.append(f"## Views: {video.get('view_count', 0):,}")
        section.append("")

        # Caption status
        if video.get('caption_text'):
            section.append("## Caption Status:")
            section.append(video['caption_text'])
            section.append("")
        elif video.get('has_caption_info'):
            section.append("## Caption Status:")
            section.append("[Captions available - ]")
            section.append("")

        section.append("-" * 50)
        section.append("")
        markdown_sections.append('\n'.join(section))

    # Save updated file
    output_file = Path(f'data/markdown_current/{updated_filename}')
    output_file.write_text('\n'.join(markdown_sections), encoding='utf-8')

    logger.info(f"Updated file saved: {output_file}")

    # Calculate remaining work
    total_with_captions = with_captions + captions_found
    remaining_videos = len(videos) - total_with_captions

    logger.info(f"Progress summary:")
    logger.info(f"  Total videos: {len(videos)}")
    logger.info(f"  Captioned: {total_with_captions}")
    logger.info(f"  Remaining: {remaining_videos}")
    logger.info(f"  Progress: {(total_with_captions/len(videos))*100:.1f}%")

    if remaining_videos > 0:
        days_needed = (remaining_videos // 190) + (1 if remaining_videos % 190 else 0)
        logger.info(f"  Estimated days to complete: {days_needed}")

    return True


if __name__ == "__main__":
    success = continue_caption_fetching()
    sys.exit(0 if success else 1)
@@ -1,122 +0,0 @@
#!/usr/bin/env python3
"""
Create incremental Instagram markdown file from running process without losing progress.
This script safely generates output from whatever the running Instagram scraper has collected so far.
"""

import os
import sys
import time
from pathlib import Path
from datetime import datetime
import pytz
from dotenv import load_dotenv

# Add src to path
sys.path.insert(0, str(Path(__file__).parent / 'src'))

from base_scraper import ScraperConfig
from instagram_scraper import InstagramScraper


def create_incremental_output():
    """Create incremental output without interfering with running process."""

    print("=== INSTAGRAM INCREMENTAL OUTPUT ===")
    print("Safely creating incremental markdown without stopping running process")
    print()

    # Load environment
    load_dotenv()

    # Check if Instagram scraper is running
    import subprocess
    result = subprocess.run(
        ["ps", "aux"],
        capture_output=True,
        text=True
    )

    instagram_running = False
    for line in result.stdout.split('\n'):
        if 'instagram_scraper' in line.lower() and 'python' in line and 'grep' not in line:
            instagram_running = True
            print(f"✓ Found running Instagram scraper: {line.strip()}")
            break

    if not instagram_running:
        print("⚠️ No running Instagram scraper detected")
        print("  This script is designed to work with a running scraper process")
        return

    # Get Atlantic timezone timestamp
    tz = pytz.timezone('America/Halifax')
    now = datetime.now(tz)
    timestamp = now.strftime('%Y-%m-%dT%H%M%S')

    print(f"Creating incremental output at: {now.strftime('%Y-%m-%d %H:%M:%S %Z')}")
    print()

    # Setup config - use temporary session to avoid conflicts
    config = ScraperConfig(
        source_name='instagram_incremental',
        brand_name='hvacnkowitall',
        data_dir=Path('data'),
        logs_dir=Path('logs'),
        timezone='America/Halifax'
    )

    try:
        # Create a separate scraper instance with different session
        scraper = InstagramScraper(config)

        # Override session file to avoid conflicts with running process
        scraper.session_file = scraper.session_file.parent / f'{scraper.username}_incremental.session'

        print("Initializing separate Instagram connection for incremental output...")

        # Try to create incremental output with limited posts to avoid rate limiting conflicts
        print("Fetching recent posts for incremental output (max 20 to avoid conflicts)...")

        # Fetch a small number of recent posts
        items = scraper.fetch_content(max_posts=20)

        if items:
            # Format as markdown
            markdown_content = scraper.format_markdown(items)

            # Save with incremental naming
            output_file = Path('data/markdown_current') / f'hvacnkowitall_instagram_incremental_{timestamp}.md'
            output_file.parent.mkdir(parents=True, exist_ok=True)
            output_file.write_text(markdown_content, encoding='utf-8')

            print()
            print("=" * 60)
            print("INSTAGRAM INCREMENTAL OUTPUT CREATED")
            print("=" * 60)
            print(f"Posts captured: {len(items)}")
            print(f"Output file: {output_file}")
            print("=" * 60)
            print()
            print("NOTE: This is a sample of recent posts.")
            print("The main backlog process is still running and will create")
            print("a complete file with all 1000 posts when finished.")

        else:
            print("❌ No Instagram posts captured for incremental output")
            print("  This may be due to rate limiting or session conflicts")
            print("  The main backlog process should continue normally")

    except Exception as e:
        print(f"❌ Error creating incremental output: {e}")
        print()
        print("This is expected if the main Instagram process is using")
        print("all available API quota. The main process will continue")
        print("and create the complete output when finished.")
        print()
        print("To check progress of the main process:")
        print("  tail -f logs/instagram.log")


if __name__ == "__main__":
    create_incremental_output()
Two file diffs suppressed because they are too large.
|
|
@ -1,101 +0,0 @@
|
|||
# Netscape HTTP Cookie File
|
||||
# This file is generated by yt-dlp. Do not edit.
|
||||
|
||||
.lastpass.com TRUE / TRUE 1786408717 lp_anonymousid 7995af49-838a-4b73-8d94-b7430e2c329e.v2
|
||||
.lastpass.com TRUE / TRUE 1787056237 lang en_US
|
||||
.lastpass.com TRUE / TRUE 1755580871 ak_bmsc 462AA4AE6A145B526C84F4A66A49960B~000000000000000000000000000000~YAAQxwDeF8PKIaeYAQAALlZYwByftxc5ukLlUAGCGXrxm6zV5KNDctdN2HuUjFLljsUhsX5hTI2Enk9E/uGCZ0eGrfc2Qdlv1soFQlEp5ujcrpJERlQEVTuOGQfjaHBzQqG/kPsbLQHIIJoIvA8gE7C/04exZ0LnAulwkmOQqAvQixUoPpO6ASII09O6r14thdpKlaCMsCfF1O8AG6yGtwq268rthix4L6HkDdQcFF3FVk/pg6jWXO3F6OYRnTnD7z4Hvi6g90N/BzejpvMGhTQbCCXJz1ig+tVg9lxA5A9nq45ZwvkUxZwM8RQLU46+OxgWswnH4bR+nhIlCmWdAC7hpxV0z3+5/JUTBCUQkTp4GZQs+3RA9dGz9sJ+PCJLpyRD1tVx4/ehcdMApkQ=
|
||||
.lastpass.com TRUE / TRUE 1755580876 bm_sv B07AB04C1CF0CD6287547B92994D7119~YAAQxwDeF03LIaeYAQAAUn5YwBxsrazGozgkHVCj2owby39f97b1/hex0fML3VuvAOx0KitBSV7eL4HHonlHaclAs7CoFFjwSNyHjOk0yb33U2G4rjl/MWhvQByl91kMUc24ptY7rWtsoaKRBeveWOXsIXUzoWS/SOx4qLumybL6RLdxfkBoNLGfcXvJLJZ8j4bwBCN2V+mpRSfy0tDHWtxRh/Gcv6TlRAHRf0yxrHViChdkPxNTNLCN8iXcicR/e60=~1
|
||||
.lastpass.com TRUE / TRUE 1787109681 _abck 78CB65894B61AE35DE200E4176B46B22~-1~YAAQxwDeF0zLIaeYAQAAUn5YwA4EDrzJmgJhTC365aMZ7ugfdVHjRQ87RNrRviafhGet7wwcLIF8JYdWecoEj80P3Zwima6w2qi9sHjYi3nBtcV+vZXRy0ybwpHLcHRc6dttxCrlD2FEarNLggeusDY6Gg6cO82uRWIm8xjLDzte0ls8Bmnn8wlaOg2+XCfNaXAmYHmLXfhTrBEiXEvTYUjRNtj7R+kXKDIE9rd0VXnYpM+gqIb3BvftUdCrA8DK5vl/urtaigggV0zb7sSwYikiZB6so9IqekIrIzKbQ3pz0HxR9PCTDhhzx6CC39glmHjS/lGwtrmlhWHU0MsXR3NQUJSLNM447GhtH9PuYZJQ2yTLYDjUYcWhgR33mECusBr9lSWY/h1kFjwKj9lP8BMrfb+puI7PJROneR1uBroNu+cp8wR+U9CKVPsRqiIyR6IUXIMqCvRJCR2ZjJUW6VjKnOi4aHXZyOI/ziOB4BuzwnbnqQuOTMFcg0HTfpwkip+NNoamzdqykDbvVOJb4Wga1SJDTjD6J2qgxnDrEy+WHpcGtRIZ/+O7B9FsSdG3Ga9hXtPAhog5eRNrC4PnpsIZF3d7UETlNi4NKUTxXr00jOaqrV3vQzQd5BUALmAALt9insA53LPdOx0SSfmWK9Xw1eRj~-1~-1~-1
|
||||
pollserver.lastpass.com FALSE / TRUE 1755996002 PHPSESSID q31ed413isulb7oio48bp42u712
|
||||
ogs.google.com FALSE / TRUE 1757464814 OTZ 8209480_68_72_104040_68_446700
|
||||
accounts.google.com FALSE / TRUE 1757464816 OTZ 8209480_68_72_104040_68_446700
|
||||
accounts.google.com FALSE / TRUE 1790133697 __Host-GAPS 1:4a1DbADXrmiwrkKhB9hmZ6pXeH-F6JfxSo-IlrRlzzrsR03oYpfdhiuwfKK5qpwNzkSC0zuVayzRTDoA-uv9O8x1mSy9QQ:KBG9eDx22YWYXmTz
|
||||
accounts.google.com FALSE / TRUE 1790133697 LSID s.CA|s.youtube:g.a0000QhGLPPQzzgZXFpvWoobsAQN8gT8puTlvUnz1hM9I0tjYz70Y44dElY9CS3yOPpIYrOrMAACgYKAQESARYSFQHGX2MiU2OrTar21aQT3EyM3DkwsxoVAUF8yKpQTGz3wsgXHZdiB7ye8VTi0076
|
||||
accounts.google.com FALSE / TRUE 1790133697 __Host-1PLSID s.CA|s.youtube:g.a0000QhGLPPQzzgZXFpvWoobsAQN8gT8puTlvUnz1hM9I0tjYz70pCoJ5-AC8-HZ-atxM4otEwACgYKAVcSARYSFQHGX2Mi0nHXWGmWn2oqsMZNd0oAxBoVAUF8yKrfIkwN9Myxu0Jv_tzrcBux0076
|
||||
accounts.google.com FALSE / TRUE 1790133697 __Host-3PLSID s.CA|s.youtube:g.a0000QhGLPPQzzgZXFpvWoobsAQN8gT8puTlvUnz1hM9I0tjYz70GZCVgTUBk8ObTi1lLtqjDAACgYKAXISARYSFQHGX2MikDM5ymqnmf6mfiUxhTKpIRoVAUF8yKqpaZQixsKWZ-dM6IX9Defp0076
|
||||
accounts.google.com FALSE / TRUE 1790133697 ACCOUNT_CHOOSER AFx_qI6bdD5ej4NqXbmGeXLgDsPC4p-_oVHct5U4CD6V-ZYvA06MYHt7W4gGxOOZVKnKnS1FocEvC1plJjzkHWnK3W7SV1B9BVeTBsJIyv2Nng_0rAbcvDHUEmat6rDd2g7r6cTiIK2-LfbPklIyv1UIRUUYUxRbf4_b9YgQV0c7XFOhU223qxx_Ba5VkPSyvauqnMf9Zkp4ezJi9UpluBb89LFA_yl5TA
|
||||
accounts.google.com FALSE / TRUE 1790133697 SMSV ADHTe-D0vylhvQNbG3d75HEVyXkefEJlRA5u2oKZvMMGtkTOTcovwpJV7WZ6G8A0yFerhGA3zIet28KUaHkL2Pro0QKBMYal9p1Puk-gsaMLx9IShPoiL6lXucaH0aR8roZaiwH4OxsTazdA6ddfVbvs-j2aqvSPP3To1oM26-95NbYXx_WA3uo
|
||||
.google.com TRUE / FALSE 1770424842 SEARCH_SAMESITE CgQI154B
|
||||
.google.com TRUE / TRUE 1770424812 AEC AVh_V2gZSSEc8GGMXOIkIAhmr4RlRooQvJoBoPGM_SLieN8Bedu4SOCqXA
|
||||
.google.com TRUE / TRUE 1787077007 __Secure-1PSIDTS sidts-CjIB5H03P3MMLcAWqR_DvX-PtO5PgrZAIVE9msth5frWlgq9rBmdlsg45uQcZ4Ba3fHFABAA
|
||||
.google.com TRUE / TRUE 1787077007 __Secure-3PSIDTS sidts-CjIB5H03P3MMLcAWqR_DvX-PtO5PgrZAIVE9msth5frWlgq9rBmdlsg45uQcZ4Ba3fHFABAA
|
||||
.google.com TRUE / TRUE 1771331445 NID 525=a0aiEKGd29ts21Tsh0NraPqmh8P82eEMtsrLW85Sftvo6JlhzV9TqDVo-RWnul5CL2tNifBpBeMiaSTweTNqju9-JEvfTn_HIr8PrBov5yPNK80k7OYeyygFh7PaDrEGW6J3bUWCoFtK6El7YSY3DTZZyW5cmdG1B5_dMF3DYGj2jzc3vLFnlEfEQK4_SUa8iqAIo-YD-q3hs-YEVX-hg6SzUUHA0sx-DkYG69Iz4tZwXHI3P0T6SdVPG5fwYvjLTdBkaBNvoPivCg1OA2aXZU7Mmy14Tn1H0cHbxWR-A4RxI5_LkmE2uWktcDn-3C7fMXWRN_GN-0fjghXANVa299Yd-ii5_Ne4iexvNr7oe3CMRTVQk9DMgNs7dNBSjYlwDLJpSJ2huI-8rSDtMDk1gPgYk_Nj8ELrvaVKUQbTjAkly0oFDZDvw9YWSh8blN6dNfIo-yee3Mqxqb5vbySWj8vH3W2m7awRcZ5jYDni_BZdX5ZEy53LzMO2fvgYrEjv2xPQ0yaTu6XQgmNvDUaRacHIbFH-7y6Ht_lRKIF_8524dYCTWR6wZ2g7hsvBlmlo7fM9GOdYPOPkfXMbzzrLdJzsScr5BzHsDBRV6TWgC1MTlG9FFhD1Mv9GToskEKCetLPcD7-7u-fLeo_OhGDlKGKvBKvyaPOYDsjGE2EsYDAYnhmtAm_jIGfuf8cWqa_tElLEy6jCIPWINPQ7wkp16c_WW-GXASBAZ2t82GrlkqMCkUzAjtSCxdZXlWbMxMx05S22d7IvKm7FMPU867NXp2lJ-x31R-2ly6g4Nsfmb0pT3eyXlOVYPs_VX9bkYHUwcxK-K9xBhsA4soIJJmOpX9UDYRqdWyFVO4fKxkrh6thLZMnElA2EbnUhN_72JykxXScjyG4oDswJ9_XTEXQoowTICPPBIXEBa0nCOrfUKdIJgYNVsyjdvH_hz-OYesmbPnEv5H8VaXhnSZcbVVuMhUM_ftN7UiRGPde3L3fuyfkpC-pGI-DeXOMQSaPAY1_mt_crETU
|
||||
.google.com TRUE / FALSE 1790133697 SID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9Hpk9a9TjV9am48yv0RK3iCNhQACgYKAc0SARYSFQHGX2MiJeNE2HIkzn_49iX78ChKhBoVAUF8yKo3p3fs2tHxkqOxFkGDdHTU0076
|
||||
.google.com TRUE / TRUE 1790133697 __Secure-1PSID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9HpkO0CWMyWE4HaRJuE3tskapQACgYKAXESARYSFQHGX2MiTAbZKBEoofchgy1ks-EkcBoVAUF8yKqQOkgMAgTnBxQL-IM-JycN0076
|
||||
.google.com TRUE / TRUE 1790133697 __Secure-3PSID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9HpkZwwWz1fd49vzS89GJCYQVAACgYKAXoSARYSFQHGX2Mi-8cU0UwHLnCSb9W6zdTzbxoVAUF8yKpwmrGoh1Urt98WPMbekjCP0076
|
||||
.google.com TRUE / FALSE 1790133697 HSID AdoyyKyDBJf7xBKFq
|
||||
.google.com TRUE / TRUE 1790133697 SSID A09yvy8kjVqjkIhBT
|
||||
.google.com TRUE / FALSE 1790133697 APISID n_DaYMZo2PQuVj2F/AQvEKcrZGxFMXynXs
|
||||
.google.com TRUE / TRUE 1790133697 SAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.google.com TRUE / TRUE 1790133697 __Secure-1PAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.google.com TRUE / TRUE 1790133697 __Secure-3PAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.google.com TRUE / FALSE 1787109698 SIDCC AKEyXzXNlnGmhnuWIAmfiiDwchuDX8ynutXjIZ_XDJqXx3BY_IVQRB4EHgXwoPoiVjywSoVS_Q
|
||||
.google.com TRUE / TRUE 1787109698 __Secure-1PSIDCC AKEyXzW32q6JpXor-F569XoN3AAniaJFeoCzTv0H-oLz3gtPK0qHjt3SqIKRQJdjvxcIkJbQ
|
||||
.google.com TRUE / TRUE 1787109698 __Secure-3PSIDCC AKEyXzV_IYCMMw7uM400s2bHOEg8GO04enqESX6Qq9fys5SwD9AcCuc7WCZGw_wBkGLJF81w
|
||||
.google.com TRUE /verify TRUE 1771331445 SNID ABablnfcoW10Ir9MN5SbO-BlxkApjD9UG_P68uc5YfkpmcCTITB21LLVeATVjllb6RwnhvhDvrtbu0t7bdXnF9jg79i6OPoh7Q
|
||||
.anthropic.com TRUE / TRUE 1786408856 ph_phc_TXdpocbGVeZVm5VJmAsHTMrCofBQu3e0kN8HGMNGTVW_posthog %7B%22distinct_id%22%3A%2201989692-b189-797a-a748-b9d2479dfb6f%22%2C%22%24sesid%22%3A%5B1754872856169%2C%2201989692-b188-78cf-8d3e-3029f9e5433a%22%2C1754872852872%5D%7D
|
||||
.anthropic.com TRUE / FALSE 1789432901 __ssid fc76574f6762c9814f5cf4045432b39
|
||||
.anthropic.com TRUE / TRUE 1787056847 CH-prefers-color-scheme dark
|
||||
.anthropic.com TRUE / TRUE 1787056849 anthropic-consent-preferences %7B%22analytics%22%3Atrue%2C%22marketing%22%3Atrue%7D
|
||||
.anthropic.com TRUE / TRUE 1787056849 ajs_anonymous_id 70552e7a-dbbe-41e4-9754-c862eefe16d8
|
||||
.anthropic.com TRUE / TRUE 1787056856 lastActiveOrg b75b0db6-c17e-43b0-b3f6-c0c618b3924f
|
||||
.anthropic.com TRUE / TRUE 1756125683 intercom-session-lupk8zyo NnB4MjRrREk2WlYxam55WFg2WVpNUTFkRWdsZzZQbWYrRzBCVXdkWUovV1JnaUwrNmFuU2c1a1dUSjRvNmROMkV5LzdGbWRQUlZiZFIxOEt6U2FZV0E3OGJZam1Na2lGQkZmczMyTFRHZWM9LS1JdVdkci9ETXZMRE5yLytXYi9xN2JnPT0=--756e6f4a69975fc77fd820510ee194e78f22d548
|
||||
.anthropic.com TRUE / TRUE 1778850883 intercom-device-id-lupk8zyo 217abe7e-660a-4789-9ffd-067138b60ad7
|
||||
docs.anthropic.com FALSE / FALSE 1786408853 inkeepUsagePreferences_userId 2z3qr2o1i4o3t1ewj4g6j
|
||||
claude.ai FALSE / TRUE 1789432887 _fbp fb.1.1754872887342.19212444962728464
|
||||
claude.ai FALSE / FALSE 1770424896 g_state {"i_l":0}
|
||||
claude.ai FALSE / TRUE 1780792898 anthropic-device-id 24b0aa8f-9e84-44aa-8d5a-378386a03571
|
||||
.claude.ai TRUE / TRUE 1786408888 CH-prefers-color-scheme light
|
||||
.claude.ai TRUE / TRUE 1786408888 cf_clearance K_Avr.k9lXyYlfP5buJsTimVZlc8X4KkLuEklcxQXzA-1754872888-1.2.1.1-qHvDq4dpIKudM7jhfIUQBm6.i4IMBvl_kXadZD1h75BGYgCDRkMK.CSlna94HOg3ijpl.1sZlpPQwfhDbM7xn.Trekt.9MJrA1rat4LMvhf2CyR_u6P_ID2Gs20HCz1hNn8fLbThZSHmqe9vkqhScGBaGvC86XLPDkHGqGYZ70mGep6T2ml_kWe3Br6MR_llfPNeo8LDNDk0rlWgsLNEaYfmrfExFn3JkXKT7qLA8iI
|
||||
.claude.ai TRUE / FALSE 1789432888 __ssid 73f3e3efafe14323e4eb6f8682c665d
|
||||
.claude.ai TRUE / TRUE 1786408897 lastActiveOrg cc7654cf-09ff-41e7-b623-0d859ab783e3
|
||||
.claude.ai TRUE / TRUE 1757292096 sessionKey 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
|
||||
.claude.ai TRUE / TRUE 1778202898 intercom-device-id-lupk8zyo 65e2f09c-f6d8-4fe2-8cec-f9a73f58336a
|
||||
.claude.ai TRUE / FALSE 1786408898 ajs_user_id d01d4960-bee2-45f3-a228-6dc10137a91e
|
||||
.claude.ai TRUE / FALSE 1786408898 ajs_anonymous_id ecb93856-d8cb-41eb-ae3c-c401857c8ffe
|
||||
.claude.ai TRUE / TRUE 1786408898 anthropic-consent-preferences %7B%22analytics%22%3Afalse%2C%22marketing%22%3Afalse%7D
|
||||
.claude.ai TRUE /fc TRUE 1757464896 ARID kLjYk67/ok33yQWlZLYFpqFqWNz12rqAyy5mdo6ZrBy+sL7pstI3b42uoKS1alz6OovPWBOjmx1wbHkrEvAjcvbyLw47v2ubB5w9MlEcrtvFLpdPBPZRagHdbzg8AhAoJjOUKHC1CPemoqbTbXn1g1mNYXAliuE=**utCOoa+Th7H1kuHH
|
||||
lastpass.com FALSE / TRUE 1787056237 sessonly 0
|
||||
lastpass.com FALSE / TRUE 1756645600 PHPSESSID q31ed413isulb7oio48bp42u712
|
||||
.screamingfrog.co.uk TRUE / FALSE 1790080249 _ga GA1.1.1162743860.1755520249
|
||||
.screamingfrog.co.uk TRUE / FALSE 1790080815 _ga_ED162H365P GS2.1.s1755520249$o1$g0$t1755520815$j60$l0$h0
|
||||
developers.google.com FALSE / FALSE 1771072764 django_language en
|
||||
.developers.google.com TRUE / FALSE 1790080764 _ga GA1.1.1123076598.1755520765
|
||||
.developers.google.com TRUE / FALSE 1790082401 _ga_64EQFFKSHW GS2.1.s1755520764$o1$g1$t1755522401$j60$l0$h0
|
||||
.developers.google.com TRUE / FALSE 1790082401 _ga_272J68FCRF GS2.1.s1755520764$o1$g1$t1755522401$j60$l0$h0
|
||||
.console.anthropic.com TRUE / TRUE 1756125656 sessionKey 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
|
||||
.mozilla.org TRUE / FALSE 1755608799 _gid GA1.2.355179243.1755522400
|
||||
.mozilla.org TRUE / FALSE 1790082399 _ga GA1.1.157627023.1754872679
|
||||
.mozilla.org TRUE / FALSE 1790082399 _ga_B9CY1C9VBC GS2.1.s1755522399$o2$g0$t1755522399$j60$l0$h0
|
||||
console.anthropic.com FALSE / TRUE 1781443010 anthropic-device-id 8f03c23c-3d9f-404c-9f90-d09a37c2dcad
|
||||
.tiktok.com TRUE / TRUE 1771093007 tt_chain_token CwQ2wR8CfOG0FC+BkuzPyw==
|
||||
.tiktok.com TRUE / TRUE 1787077008 ttwid 1%7CvQuucbrpIVNAleLjylqryuAwIP-GvumfRPJFmJepcjQ%7C1755541008%7Caf2a58ac78f5a1f87fd6e8950ee70614ca5c887534a1cab6193416f2fe04664b
|
||||
.tiktok.com TRUE / TRUE 1756405021 msToken 3L5kUsiNayJ-UwvG2qEpAVYz2QMULS6SAr0pbzxU2tOd_7hynoXEpcLXsA-mZz9F69_DQRKmbwW8vzeJMooCt_3ctUnwQlyKR_HIfOrEYPPkUjoH9MQrqKLN2ED3GG78CcoRDXOV7p8=
|
||||
.www.tiktok.com TRUE / TRUE 1781461009 tiktok_webapp_theme_source auto
|
||||
.www.tiktok.com TRUE / TRUE 1781461009 tiktok_webapp_theme dark
|
||||
.www.tiktok.com TRUE / TRUE 1781461010 delay_guest_mode_vid 5
|
||||
www.tiktok.com FALSE / FALSE 1763317021 msToken 3L5kUsiNayJ-UwvG2qEpAVYz2QMULS6SAr0pbzxU2tOd_7hynoXEpcLXsA-mZz9F69_DQRKmbwW8vzeJMooCt_3ctUnwQlyKR_HIfOrEYPPkUjoH9MQrqKLN2ED3GG78CcoRDXOV7p8=
|
||||
.youtube.com TRUE / TRUE 1771125646 __Secure-ROLLOUT_TOKEN CLDT1IrIhZWDFxCtuZO89ZWPAxjD0-C89ZWPAw%3D%3D
|
||||
.youtube.com TRUE / TRUE 1790133697 __Secure-3PSID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9HpkZwwWz1fd49vzS89GJCYQVAACgYKAXoSARYSFQHGX2Mi-8cU0UwHLnCSb9W6zdTzbxoVAUF8yKpwmrGoh1Urt98WPMbekjCP0076
|
||||
.youtube.com TRUE / TRUE 1790133697 __Secure-3PAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.youtube.com TRUE / TRUE 1771127671 VISITOR_INFO1_LIVE 6THBtqhe0l8
|
||||
.youtube.com TRUE / TRUE 1771127671 VISITOR_PRIVACY_METADATA CgJDQRIEGgAgOw%3D%3D
|
||||
.youtube.com TRUE / TRUE 1776613650 PREF f6=40000000&hl=en&tz=UTC
|
||||
.youtube.com TRUE / TRUE 1787109697 __Secure-1PSIDTS sidts-CjUB5H03P6LZHz8-meWERM1pqje95dyDN68EeWo4naQ9KjgcU0UOZtEltRSTN8NVFEI8XhD8IhAA
|
||||
.youtube.com TRUE / TRUE 1787109697 __Secure-3PSIDTS sidts-CjUB5H03P6LZHz8-meWERM1pqje95dyDN68EeWo4naQ9KjgcU0UOZtEltRSTN8NVFEI8XhD8IhAA
|
||||
.youtube.com TRUE / TRUE 1787109733 __Secure-3PSIDCC AKEyXzXZgJoZXDWa_mmgaCLTSjYYxY6nhvVHKqHCEJSWZyfmjOJ5IMiOX4tliaVvJjeo-0mZhQ
|
||||
.youtube.com TRUE / TRUE 1818647671 __Secure-YT_TVFAS t=487659&s=2
|
||||
.youtube.com TRUE / TRUE 1771127671 DEVICE_INFO ChxOelUwTURFek1UYzJPVFF4TlRNNE5EZzNOZz09EPfqj8UGGOXbj8UG
|
||||
.youtube.com TRUE / TRUE 1755577470 GPS 1
|
||||
.youtube.com TRUE / TRUE 0 YSC 6KpsQNw8n6w
|
||||
.youtube.com TRUE /tv TRUE 1788407671 __Secure-YT_DERP CNmPp7lk
|
||||
.google.ca TRUE / TRUE 1771384897 NID 525=OGuhjgB3NP4xSGoiioAF9nJBSgyhfUvqaBZN4QrY5yNFHfeocb1aE829PIzEEC6Qyo9LVK910s_WiTcrYtqsVpYUjg3H3s_mK_ffyytVDxHNKiKRKYWd4vBEzqeOxEHcdoMBQwY20W9svBCX-cc_YQXl5zpiAepPDVGQcth5rZ7kebYv5jYmH8BEQOQcE7HVyP6PcAI9yds
|
||||
.google.ca TRUE / FALSE 1790133697 SID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9Hpk9a9TjV9am48yv0RK3iCNhQACgYKAc0SARYSFQHGX2MiJeNE2HIkzn_49iX78ChKhBoVAUF8yKo3p3fs2tHxkqOxFkGDdHTU0076
|
||||
.google.ca TRUE / TRUE 1790133697 __Secure-1PSID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9HpkO0CWMyWE4HaRJuE3tskapQACgYKAXESARYSFQHGX2MiTAbZKBEoofchgy1ks-EkcBoVAUF8yKqQOkgMAgTnBxQL-IM-JycN0076
|
||||
.google.ca TRUE / TRUE 1790133697 __Secure-3PSID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9HpkZwwWz1fd49vzS89GJCYQVAACgYKAXoSARYSFQHGX2Mi-8cU0UwHLnCSb9W6zdTzbxoVAUF8yKpwmrGoh1Urt98WPMbekjCP0076
|
||||
.google.ca TRUE / FALSE 1790133697 HSID AiRg2EkM6heMohMPn
|
||||
.google.ca TRUE / TRUE 1790133697 SSID AJP9S08XSagldlZjA
|
||||
.google.ca TRUE / FALSE 1790133697 APISID n_DaYMZo2PQuVj2F/AQvEKcrZGxFMXynXs
|
||||
.google.ca TRUE / TRUE 1790133697 SAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.google.ca TRUE / TRUE 1790133697 __Secure-1PAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.google.ca TRUE / TRUE 1790133697 __Secure-3PAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
|
|
@@ -1,13 +0,0 @@
# Netscape HTTP Cookie File
# This file is generated by yt-dlp. Do not edit.

.youtube.com TRUE / FALSE 0 PREF hl=en&tz=UTC
.youtube.com TRUE / TRUE 0 SOCS CAI
.youtube.com TRUE / TRUE 0 YSC 7cc8-LrPd_Q
.youtube.com TRUE / TRUE 1771125725 VISITOR_INFO1_LIVE za_nyLN37wM
.youtube.com TRUE / TRUE 1771125725 VISITOR_PRIVACY_METADATA CgJDQRIEGgAgNQ%3D%3D
.youtube.com TRUE / TRUE 1771123579 __Secure-ROLLOUT_TOKEN CM7Wy8jf2ozaPxDbhefL2ZWPAxjni_zi7ZWPAw%3D%3D
.youtube.com TRUE / TRUE 1818645725 __Secure-YT_TVFAS t=487657&s=2
.youtube.com TRUE / TRUE 1771125725 DEVICE_INFO ChxOelUwTURFeU16YzJNRGMyTkRVNE1UYzVOUT09EN3bj8UGGJzNj8UG
.youtube.com TRUE / TRUE 1755575296 GPS 1
.youtube.com TRUE /tv TRUE 1788405725 __Secure-YT_DERP CJny7bdk

@ -1,101 +1,10 @@
|
|||
# Netscape HTTP Cookie File
|
||||
# This file is generated by yt-dlp. Do not edit.
|
||||
|
||||
.lastpass.com TRUE / TRUE 1786408717 lp_anonymousid 7995af49-838a-4b73-8d94-b7430e2c329e.v2
|
||||
.lastpass.com TRUE / TRUE 1787056237 lang en_US
|
||||
.lastpass.com TRUE / TRUE 1755580871 ak_bmsc 462AA4AE6A145B526C84F4A66A49960B~000000000000000000000000000000~YAAQxwDeF8PKIaeYAQAALlZYwByftxc5ukLlUAGCGXrxm6zV5KNDctdN2HuUjFLljsUhsX5hTI2Enk9E/uGCZ0eGrfc2Qdlv1soFQlEp5ujcrpJERlQEVTuOGQfjaHBzQqG/kPsbLQHIIJoIvA8gE7C/04exZ0LnAulwkmOQqAvQixUoPpO6ASII09O6r14thdpKlaCMsCfF1O8AG6yGtwq268rthix4L6HkDdQcFF3FVk/pg6jWXO3F6OYRnTnD7z4Hvi6g90N/BzejpvMGhTQbCCXJz1ig+tVg9lxA5A9nq45ZwvkUxZwM8RQLU46+OxgWswnH4bR+nhIlCmWdAC7hpxV0z3+5/JUTBCUQkTp4GZQs+3RA9dGz9sJ+PCJLpyRD1tVx4/ehcdMApkQ=
|
||||
.lastpass.com TRUE / TRUE 1787109681 _abck 78CB65894B61AE35DE200E4176B46B22~-1~YAAQxwDeF0zLIaeYAQAAUn5YwA4EDrzJmgJhTC365aMZ7ugfdVHjRQ87RNrRviafhGet7wwcLIF8JYdWecoEj80P3Zwima6w2qi9sHjYi3nBtcV+vZXRy0ybwpHLcHRc6dttxCrlD2FEarNLggeusDY6Gg6cO82uRWIm8xjLDzte0ls8Bmnn8wlaOg2+XCfNaXAmYHmLXfhTrBEiXEvTYUjRNtj7R+kXKDIE9rd0VXnYpM+gqIb3BvftUdCrA8DK5vl/urtaigggV0zb7sSwYikiZB6so9IqekIrIzKbQ3pz0HxR9PCTDhhzx6CC39glmHjS/lGwtrmlhWHU0MsXR3NQUJSLNM447GhtH9PuYZJQ2yTLYDjUYcWhgR33mECusBr9lSWY/h1kFjwKj9lP8BMrfb+puI7PJROneR1uBroNu+cp8wR+U9CKVPsRqiIyR6IUXIMqCvRJCR2ZjJUW6VjKnOi4aHXZyOI/ziOB4BuzwnbnqQuOTMFcg0HTfpwkip+NNoamzdqykDbvVOJb4Wga1SJDTjD6J2qgxnDrEy+WHpcGtRIZ/+O7B9FsSdG3Ga9hXtPAhog5eRNrC4PnpsIZF3d7UETlNi4NKUTxXr00jOaqrV3vQzQd5BUALmAALt9insA53LPdOx0SSfmWK9Xw1eRj~-1~-1~-1
|
||||
.lastpass.com TRUE / TRUE 1755580876 bm_sv B07AB04C1CF0CD6287547B92994D7119~YAAQxwDeF03LIaeYAQAAUn5YwBxsrazGozgkHVCj2owby39f97b1/hex0fML3VuvAOx0KitBSV7eL4HHonlHaclAs7CoFFjwSNyHjOk0yb33U2G4rjl/MWhvQByl91kMUc24ptY7rWtsoaKRBeveWOXsIXUzoWS/SOx4qLumybL6RLdxfkBoNLGfcXvJLJZ8j4bwBCN2V+mpRSfy0tDHWtxRh/Gcv6TlRAHRf0yxrHViChdkPxNTNLCN8iXcicR/e60=~1
|
||||
pollserver.lastpass.com FALSE / TRUE 1755996002 PHPSESSID q31ed413isulb7oio48bp42u712
|
||||
ogs.google.com FALSE / TRUE 1757464814 OTZ 8209480_68_72_104040_68_446700
|
||||
accounts.google.com FALSE / TRUE 1757464816 OTZ 8209480_68_72_104040_68_446700
|
||||
accounts.google.com FALSE / TRUE 1790133697 __Host-GAPS 1:4a1DbADXrmiwrkKhB9hmZ6pXeH-F6JfxSo-IlrRlzzrsR03oYpfdhiuwfKK5qpwNzkSC0zuVayzRTDoA-uv9O8x1mSy9QQ:KBG9eDx22YWYXmTz
|
||||
accounts.google.com FALSE / TRUE 1790133697 LSID s.CA|s.youtube:g.a0000QhGLPPQzzgZXFpvWoobsAQN8gT8puTlvUnz1hM9I0tjYz70Y44dElY9CS3yOPpIYrOrMAACgYKAQESARYSFQHGX2MiU2OrTar21aQT3EyM3DkwsxoVAUF8yKpQTGz3wsgXHZdiB7ye8VTi0076
|
||||
accounts.google.com FALSE / TRUE 1790133697 __Host-1PLSID s.CA|s.youtube:g.a0000QhGLPPQzzgZXFpvWoobsAQN8gT8puTlvUnz1hM9I0tjYz70pCoJ5-AC8-HZ-atxM4otEwACgYKAVcSARYSFQHGX2Mi0nHXWGmWn2oqsMZNd0oAxBoVAUF8yKrfIkwN9Myxu0Jv_tzrcBux0076
|
||||
accounts.google.com FALSE / TRUE 1790133697 __Host-3PLSID s.CA|s.youtube:g.a0000QhGLPPQzzgZXFpvWoobsAQN8gT8puTlvUnz1hM9I0tjYz70GZCVgTUBk8ObTi1lLtqjDAACgYKAXISARYSFQHGX2MikDM5ymqnmf6mfiUxhTKpIRoVAUF8yKqpaZQixsKWZ-dM6IX9Defp0076
|
||||
accounts.google.com FALSE / TRUE 1790133697 ACCOUNT_CHOOSER AFx_qI6bdD5ej4NqXbmGeXLgDsPC4p-_oVHct5U4CD6V-ZYvA06MYHt7W4gGxOOZVKnKnS1FocEvC1plJjzkHWnK3W7SV1B9BVeTBsJIyv2Nng_0rAbcvDHUEmat6rDd2g7r6cTiIK2-LfbPklIyv1UIRUUYUxRbf4_b9YgQV0c7XFOhU223qxx_Ba5VkPSyvauqnMf9Zkp4ezJi9UpluBb89LFA_yl5TA
|
||||
accounts.google.com FALSE / TRUE 1790133697 SMSV ADHTe-D0vylhvQNbG3d75HEVyXkefEJlRA5u2oKZvMMGtkTOTcovwpJV7WZ6G8A0yFerhGA3zIet28KUaHkL2Pro0QKBMYal9p1Puk-gsaMLx9IShPoiL6lXucaH0aR8roZaiwH4OxsTazdA6ddfVbvs-j2aqvSPP3To1oM26-95NbYXx_WA3uo
|
||||
.google.com TRUE / FALSE 1770424842 SEARCH_SAMESITE CgQI154B
|
||||
.google.com TRUE / TRUE 1770424812 AEC AVh_V2gZSSEc8GGMXOIkIAhmr4RlRooQvJoBoPGM_SLieN8Bedu4SOCqXA
|
||||
.google.com TRUE / TRUE 1787077007 __Secure-1PSIDTS sidts-CjIB5H03P3MMLcAWqR_DvX-PtO5PgrZAIVE9msth5frWlgq9rBmdlsg45uQcZ4Ba3fHFABAA
|
||||
.google.com TRUE / TRUE 1787077007 __Secure-3PSIDTS sidts-CjIB5H03P3MMLcAWqR_DvX-PtO5PgrZAIVE9msth5frWlgq9rBmdlsg45uQcZ4Ba3fHFABAA
|
||||
.google.com TRUE / TRUE 1771331445 NID 525=a0aiEKGd29ts21Tsh0NraPqmh8P82eEMtsrLW85Sftvo6JlhzV9TqDVo-RWnul5CL2tNifBpBeMiaSTweTNqju9-JEvfTn_HIr8PrBov5yPNK80k7OYeyygFh7PaDrEGW6J3bUWCoFtK6El7YSY3DTZZyW5cmdG1B5_dMF3DYGj2jzc3vLFnlEfEQK4_SUa8iqAIo-YD-q3hs-YEVX-hg6SzUUHA0sx-DkYG69Iz4tZwXHI3P0T6SdVPG5fwYvjLTdBkaBNvoPivCg1OA2aXZU7Mmy14Tn1H0cHbxWR-A4RxI5_LkmE2uWktcDn-3C7fMXWRN_GN-0fjghXANVa299Yd-ii5_Ne4iexvNr7oe3CMRTVQk9DMgNs7dNBSjYlwDLJpSJ2huI-8rSDtMDk1gPgYk_Nj8ELrvaVKUQbTjAkly0oFDZDvw9YWSh8blN6dNfIo-yee3Mqxqb5vbySWj8vH3W2m7awRcZ5jYDni_BZdX5ZEy53LzMO2fvgYrEjv2xPQ0yaTu6XQgmNvDUaRacHIbFH-7y6Ht_lRKIF_8524dYCTWR6wZ2g7hsvBlmlo7fM9GOdYPOPkfXMbzzrLdJzsScr5BzHsDBRV6TWgC1MTlG9FFhD1Mv9GToskEKCetLPcD7-7u-fLeo_OhGDlKGKvBKvyaPOYDsjGE2EsYDAYnhmtAm_jIGfuf8cWqa_tElLEy6jCIPWINPQ7wkp16c_WW-GXASBAZ2t82GrlkqMCkUzAjtSCxdZXlWbMxMx05S22d7IvKm7FMPU867NXp2lJ-x31R-2ly6g4Nsfmb0pT3eyXlOVYPs_VX9bkYHUwcxK-K9xBhsA4soIJJmOpX9UDYRqdWyFVO4fKxkrh6thLZMnElA2EbnUhN_72JykxXScjyG4oDswJ9_XTEXQoowTICPPBIXEBa0nCOrfUKdIJgYNVsyjdvH_hz-OYesmbPnEv5H8VaXhnSZcbVVuMhUM_ftN7UiRGPde3L3fuyfkpC-pGI-DeXOMQSaPAY1_mt_crETU
|
||||
.google.com TRUE / FALSE 1790133697 SID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9Hpk9a9TjV9am48yv0RK3iCNhQACgYKAc0SARYSFQHGX2MiJeNE2HIkzn_49iX78ChKhBoVAUF8yKo3p3fs2tHxkqOxFkGDdHTU0076
|
||||
.google.com TRUE / TRUE 1790133697 __Secure-1PSID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9HpkO0CWMyWE4HaRJuE3tskapQACgYKAXESARYSFQHGX2MiTAbZKBEoofchgy1ks-EkcBoVAUF8yKqQOkgMAgTnBxQL-IM-JycN0076
|
||||
.google.com TRUE / TRUE 1790133697 __Secure-3PSID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9HpkZwwWz1fd49vzS89GJCYQVAACgYKAXoSARYSFQHGX2Mi-8cU0UwHLnCSb9W6zdTzbxoVAUF8yKpwmrGoh1Urt98WPMbekjCP0076
|
||||
.google.com TRUE / FALSE 1790133697 HSID AdoyyKyDBJf7xBKFq
|
||||
.google.com TRUE / TRUE 1790133697 SSID A09yvy8kjVqjkIhBT
|
||||
.google.com TRUE / FALSE 1790133697 APISID n_DaYMZo2PQuVj2F/AQvEKcrZGxFMXynXs
|
||||
.google.com TRUE / TRUE 1790133697 SAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.google.com TRUE / TRUE 1790133697 __Secure-1PAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.google.com TRUE / TRUE 1790133697 __Secure-3PAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.google.com TRUE / FALSE 1787109698 SIDCC AKEyXzXNlnGmhnuWIAmfiiDwchuDX8ynutXjIZ_XDJqXx3BY_IVQRB4EHgXwoPoiVjywSoVS_Q
|
||||
.google.com TRUE / TRUE 1787109698 __Secure-1PSIDCC AKEyXzW32q6JpXor-F569XoN3AAniaJFeoCzTv0H-oLz3gtPK0qHjt3SqIKRQJdjvxcIkJbQ
|
||||
.google.com TRUE / TRUE 1787109698 __Secure-3PSIDCC AKEyXzV_IYCMMw7uM400s2bHOEg8GO04enqESX6Qq9fys5SwD9AcCuc7WCZGw_wBkGLJF81w
|
||||
.google.com TRUE /verify TRUE 1771331445 SNID ABablnfcoW10Ir9MN5SbO-BlxkApjD9UG_P68uc5YfkpmcCTITB21LLVeATVjllb6RwnhvhDvrtbu0t7bdXnF9jg79i6OPoh7Q
|
||||
.anthropic.com TRUE / TRUE 1786408856 ph_phc_TXdpocbGVeZVm5VJmAsHTMrCofBQu3e0kN8HGMNGTVW_posthog %7B%22distinct_id%22%3A%2201989692-b189-797a-a748-b9d2479dfb6f%22%2C%22%24sesid%22%3A%5B1754872856169%2C%2201989692-b188-78cf-8d3e-3029f9e5433a%22%2C1754872852872%5D%7D
|
||||
.anthropic.com TRUE / FALSE 1789432901 __ssid fc76574f6762c9814f5cf4045432b39
|
||||
.anthropic.com TRUE / TRUE 1787056847 CH-prefers-color-scheme dark
|
||||
.anthropic.com TRUE / TRUE 1787056849 anthropic-consent-preferences %7B%22analytics%22%3Atrue%2C%22marketing%22%3Atrue%7D
|
||||
.anthropic.com TRUE / TRUE 1787056849 ajs_anonymous_id 70552e7a-dbbe-41e4-9754-c862eefe16d8
|
||||
.anthropic.com TRUE / TRUE 1787056856 lastActiveOrg b75b0db6-c17e-43b0-b3f6-c0c618b3924f
|
||||
.anthropic.com TRUE / TRUE 1756125683 intercom-session-lupk8zyo NnB4MjRrREk2WlYxam55WFg2WVpNUTFkRWdsZzZQbWYrRzBCVXdkWUovV1JnaUwrNmFuU2c1a1dUSjRvNmROMkV5LzdGbWRQUlZiZFIxOEt6U2FZV0E3OGJZam1Na2lGQkZmczMyTFRHZWM9LS1JdVdkci9ETXZMRE5yLytXYi9xN2JnPT0=--756e6f4a69975fc77fd820510ee194e78f22d548
|
||||
.anthropic.com TRUE / TRUE 1778850883 intercom-device-id-lupk8zyo 217abe7e-660a-4789-9ffd-067138b60ad7
|
||||
docs.anthropic.com FALSE / FALSE 1786408853 inkeepUsagePreferences_userId 2z3qr2o1i4o3t1ewj4g6j
|
||||
claude.ai FALSE / TRUE 1789432887 _fbp fb.1.1754872887342.19212444962728464
|
||||
claude.ai FALSE / FALSE 1770424896 g_state {"i_l":0}
|
||||
claude.ai FALSE / TRUE 1780792898 anthropic-device-id 24b0aa8f-9e84-44aa-8d5a-378386a03571
|
||||
.claude.ai TRUE / TRUE 1786408888 CH-prefers-color-scheme light
|
||||
.claude.ai TRUE / TRUE 1786408888 cf_clearance K_Avr.k9lXyYlfP5buJsTimVZlc8X4KkLuEklcxQXzA-1754872888-1.2.1.1-qHvDq4dpIKudM7jhfIUQBm6.i4IMBvl_kXadZD1h75BGYgCDRkMK.CSlna94HOg3ijpl.1sZlpPQwfhDbM7xn.Trekt.9MJrA1rat4LMvhf2CyR_u6P_ID2Gs20HCz1hNn8fLbThZSHmqe9vkqhScGBaGvC86XLPDkHGqGYZ70mGep6T2ml_kWe3Br6MR_llfPNeo8LDNDk0rlWgsLNEaYfmrfExFn3JkXKT7qLA8iI
|
||||
.claude.ai TRUE / FALSE 1789432888 __ssid 73f3e3efafe14323e4eb6f8682c665d
|
||||
.claude.ai TRUE / TRUE 1786408897 lastActiveOrg cc7654cf-09ff-41e7-b623-0d859ab783e3
|
||||
.claude.ai TRUE / TRUE 1757292096 sessionKey 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
|
||||
.claude.ai TRUE / TRUE 1778202898 intercom-device-id-lupk8zyo 65e2f09c-f6d8-4fe2-8cec-f9a73f58336a
|
||||
.claude.ai TRUE / FALSE 1786408898 ajs_user_id d01d4960-bee2-45f3-a228-6dc10137a91e
|
||||
.claude.ai TRUE / FALSE 1786408898 ajs_anonymous_id ecb93856-d8cb-41eb-ae3c-c401857c8ffe
|
||||
.claude.ai TRUE / TRUE 1786408898 anthropic-consent-preferences %7B%22analytics%22%3Afalse%2C%22marketing%22%3Afalse%7D
|
||||
.claude.ai TRUE /fc TRUE 1757464896 ARID kLjYk67/ok33yQWlZLYFpqFqWNz12rqAyy5mdo6ZrBy+sL7pstI3b42uoKS1alz6OovPWBOjmx1wbHkrEvAjcvbyLw47v2ubB5w9MlEcrtvFLpdPBPZRagHdbzg8AhAoJjOUKHC1CPemoqbTbXn1g1mNYXAliuE=**utCOoa+Th7H1kuHH
|
||||
lastpass.com FALSE / TRUE 1787056237 sessonly 0
|
||||
lastpass.com FALSE / TRUE 1756645600 PHPSESSID q31ed413isulb7oio48bp42u712
|
||||
.screamingfrog.co.uk TRUE / FALSE 1790080249 _ga GA1.1.1162743860.1755520249
|
||||
.screamingfrog.co.uk TRUE / FALSE 1790080815 _ga_ED162H365P GS2.1.s1755520249$o1$g0$t1755520815$j60$l0$h0
|
||||
developers.google.com FALSE / FALSE 1771072764 django_language en
|
||||
.developers.google.com TRUE / FALSE 1790080764 _ga GA1.1.1123076598.1755520765
|
||||
.developers.google.com TRUE / FALSE 1790082401 _ga_64EQFFKSHW GS2.1.s1755520764$o1$g1$t1755522401$j60$l0$h0
|
||||
.developers.google.com TRUE / FALSE 1790082401 _ga_272J68FCRF GS2.1.s1755520764$o1$g1$t1755522401$j60$l0$h0
|
||||
.console.anthropic.com TRUE / TRUE 1756125656 sessionKey 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
|
||||
.mozilla.org TRUE / FALSE 1755608799 _gid GA1.2.355179243.1755522400
|
||||
.mozilla.org TRUE / FALSE 1790082399 _ga GA1.1.157627023.1754872679
|
||||
.mozilla.org TRUE / FALSE 1790082399 _ga_B9CY1C9VBC GS2.1.s1755522399$o2$g0$t1755522399$j60$l0$h0
|
||||
console.anthropic.com FALSE / TRUE 1781443010 anthropic-device-id 8f03c23c-3d9f-404c-9f90-d09a37c2dcad
|
||||
.tiktok.com TRUE / TRUE 1771093007 tt_chain_token CwQ2wR8CfOG0FC+BkuzPyw==
|
||||
.tiktok.com TRUE / TRUE 1787077008 ttwid 1%7CvQuucbrpIVNAleLjylqryuAwIP-GvumfRPJFmJepcjQ%7C1755541008%7Caf2a58ac78f5a1f87fd6e8950ee70614ca5c887534a1cab6193416f2fe04664b
|
||||
.tiktok.com TRUE / TRUE 1756405021 msToken 3L5kUsiNayJ-UwvG2qEpAVYz2QMULS6SAr0pbzxU2tOd_7hynoXEpcLXsA-mZz9F69_DQRKmbwW8vzeJMooCt_3ctUnwQlyKR_HIfOrEYPPkUjoH9MQrqKLN2ED3GG78CcoRDXOV7p8=
|
||||
.www.tiktok.com TRUE / TRUE 1781461009 tiktok_webapp_theme_source auto
|
||||
.www.tiktok.com TRUE / TRUE 1781461009 tiktok_webapp_theme dark
|
||||
.www.tiktok.com TRUE / TRUE 1781461010 delay_guest_mode_vid 5
|
||||
www.tiktok.com FALSE / FALSE 1763317021 msToken 3L5kUsiNayJ-UwvG2qEpAVYz2QMULS6SAr0pbzxU2tOd_7hynoXEpcLXsA-mZz9F69_DQRKmbwW8vzeJMooCt_3ctUnwQlyKR_HIfOrEYPPkUjoH9MQrqKLN2ED3GG78CcoRDXOV7p8=
|
||||
.youtube.com TRUE / TRUE 1771125646 __Secure-ROLLOUT_TOKEN CLDT1IrIhZWDFxCtuZO89ZWPAxjD0-C89ZWPAw%3D%3D
|
||||
.youtube.com TRUE / TRUE 1787109697 __Secure-1PSIDTS sidts-CjUB5H03P6LZHz8-meWERM1pqje95dyDN68EeWo4naQ9KjgcU0UOZtEltRSTN8NVFEI8XhD8IhAA
|
||||
.youtube.com TRUE / TRUE 1787109697 __Secure-3PSIDTS sidts-CjUB5H03P6LZHz8-meWERM1pqje95dyDN68EeWo4naQ9KjgcU0UOZtEltRSTN8NVFEI8XhD8IhAA
|
||||
.youtube.com TRUE / TRUE 1790133697 __Secure-3PSID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9HpkZwwWz1fd49vzS89GJCYQVAACgYKAXoSARYSFQHGX2Mi-8cU0UwHLnCSb9W6zdTzbxoVAUF8yKpwmrGoh1Urt98WPMbekjCP0076
|
||||
.youtube.com TRUE / TRUE 1790133697 __Secure-3PAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.youtube.com TRUE / TRUE 1771130640 VISITOR_INFO1_LIVE 6THBtqhe0l8
|
||||
.youtube.com TRUE / TRUE 1771130640 VISITOR_PRIVACY_METADATA CgJDQRIEGgAgOw%3D%3D
|
||||
.youtube.com TRUE / FALSE 0 PREF f6=40000000&hl=en&tz=UTC
|
||||
.youtube.com TRUE / TRUE 1787110442 __Secure-3PSIDCC AKEyXzUcQYeh1zkf7LcFC1wB3xjB6vmXF6oMo_a9AnSMMBezZ_M4AyjGOSn5lPMDwImX7d3sgg
|
||||
.youtube.com TRUE / TRUE 1818650640 __Secure-YT_TVFAS t=487659&s=2
|
||||
.youtube.com TRUE / TRUE 1771130640 DEVICE_INFO ChxOelUwTURFek1UYzJPVFF4TlRNNE5EZzNOZz09EJCCkMUGGOXbj8UG
|
||||
.youtube.com TRUE / FALSE 0 PREF hl=en&tz=UTC
|
||||
.youtube.com TRUE / TRUE 0 SOCS CAI
|
||||
.youtube.com TRUE / TRUE 1755579805 GPS 1
|
||||
.youtube.com TRUE /tv TRUE 1788410640 __Secure-YT_DERP CNmPp7lk
|
||||
.google.ca TRUE / TRUE 1771384897 NID 525=OGuhjgB3NP4xSGoiioAF9nJBSgyhfUvqaBZN4QrY5yNFHfeocb1aE829PIzEEC6Qyo9LVK910s_WiTcrYtqsVpYUjg3H3s_mK_ffyytVDxHNKiKRKYWd4vBEzqeOxEHcdoMBQwY20W9svBCX-cc_YQXl5zpiAepPDVGQcth5rZ7kebYv5jYmH8BEQOQcE7HVyP6PcAI9yds
|
||||
.google.ca TRUE / FALSE 1790133697 SID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9Hpk9a9TjV9am48yv0RK3iCNhQACgYKAc0SARYSFQHGX2MiJeNE2HIkzn_49iX78ChKhBoVAUF8yKo3p3fs2tHxkqOxFkGDdHTU0076
|
||||
.google.ca TRUE / TRUE 1790133697 __Secure-1PSID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9HpkO0CWMyWE4HaRJuE3tskapQACgYKAXESARYSFQHGX2MiTAbZKBEoofchgy1ks-EkcBoVAUF8yKqQOkgMAgTnBxQL-IM-JycN0076
|
||||
.google.ca TRUE / TRUE 1790133697 __Secure-3PSID g.a0000QhGLBj4KedqRk4XrxKQje3x7ise1rNeLo8EkhDBU19L9HpkZwwWz1fd49vzS89GJCYQVAACgYKAXoSARYSFQHGX2Mi-8cU0UwHLnCSb9W6zdTzbxoVAUF8yKpwmrGoh1Urt98WPMbekjCP0076
|
||||
.google.ca TRUE / FALSE 1790133697 HSID AiRg2EkM6heMohMPn
|
||||
.google.ca TRUE / TRUE 1790133697 SSID AJP9S08XSagldlZjA
|
||||
.google.ca TRUE / FALSE 1790133697 APISID n_DaYMZo2PQuVj2F/AQvEKcrZGxFMXynXs
|
||||
.google.ca TRUE / TRUE 1790133697 SAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.google.ca TRUE / TRUE 1790133697 __Secure-1PAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.google.ca TRUE / TRUE 1790133697 __Secure-3PAPISID alNMn9wd6z1SwzvT/AaIKOjwUDoz8fghga
|
||||
.youtube.com TRUE / TRUE 1755567962 GPS 1
|
||||
.youtube.com TRUE / TRUE 0 YSC 7cc8-LrPd_Q
|
||||
.youtube.com TRUE / TRUE 1771118162 VISITOR_INFO1_LIVE za_nyLN37wM
|
||||
.youtube.com TRUE / TRUE 1771118162 VISITOR_PRIVACY_METADATA CgJDQRIEGgAgNQ%3D%3D
|
||||
.youtube.com TRUE / TRUE 1771118162 __Secure-ROLLOUT_TOKEN CM7Wy8jf2ozaPxDbhefL2ZWPAxjbhefL2ZWPAw%3D%3D
|
||||
|
|
|
|||
|
|
@@ -1,13 +0,0 @@
# Netscape HTTP Cookie File
# This file is generated by yt-dlp. Do not edit.

.youtube.com TRUE / FALSE 0 PREF hl=en&tz=UTC
.youtube.com TRUE / TRUE 0 SOCS CAI
.youtube.com TRUE / TRUE 1755574691 GPS 1
.youtube.com TRUE / TRUE 0 YSC g8_QSnzawNg
.youtube.com TRUE / TRUE 1771124892 __Secure-ROLLOUT_TOKEN CKrui7OciK6LRxDLkM_U8pWPAxjDrorV8pWPAw%3D%3D
.youtube.com TRUE / TRUE 1771124892 VISITOR_INFO1_LIVE KdsXshgK67Q
.youtube.com TRUE / TRUE 1771124892 VISITOR_PRIVACY_METADATA CgJDQRIEGgAgQQ%3D%3D
.youtube.com TRUE / TRUE 1818644892 __Secure-YT_TVFAS t=487659&s=2
.youtube.com TRUE / TRUE 1771124892 DEVICE_INFO ChxOelUwTURFeU9ERTFOemMwTXpZNE1qTXpOUT09EJzVj8UGGJzNj8UG
.youtube.com TRUE /tv TRUE 1788404892 __Secure-YT_DERP CPSU_MFq

@@ -1,13 +0,0 @@
# Netscape HTTP Cookie File
# This file is generated by yt-dlp. Do not edit.

.youtube.com TRUE / FALSE 0 PREF hl=en&tz=UTC
.youtube.com TRUE / TRUE 0 SOCS CAI
.youtube.com TRUE / TRUE 1755577534 GPS 1
.youtube.com TRUE / TRUE 0 YSC 50hWpo_LZdA
.youtube.com TRUE / TRUE 1771127734 __Secure-ROLLOUT_TOKEN CNbHwaqU0bS7hAEQ-6GloP2VjwMY-o22oP2VjwM%3D
.youtube.com TRUE / TRUE 1771127738 VISITOR_INFO1_LIVE 7IRfROHo8b8
.youtube.com TRUE / TRUE 1771127738 VISITOR_PRIVACY_METADATA CgJDQRIEGgAgRw%3D%3D
.youtube.com TRUE / TRUE 1818647738 __Secure-YT_TVFAS t=487659&s=2
.youtube.com TRUE / TRUE 1771127738 DEVICE_INFO ChxOelUwTURFME1ETTRNVFF6TnpBNE16QXlOQT09ELrrj8UGGLrrj8UG
.youtube.com TRUE /tv TRUE 1788407738 __Secure-YT_DERP CJq0-8Jq
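
The deleted cookie exports above all use the Netscape cookies.txt layout: domain, include-subdomains flag, path, secure flag, Unix expiry (0 for session cookies), name, value. A minimal stdlib sketch for inspecting such a file follows; the path is hypothetical and this is not part of the repository's code.

```python
from http.cookiejar import MozillaCookieJar

# Hypothetical path; yt-dlp consumes files in this format via --cookies.
jar = MozillaCookieJar("cookies/youtube_cookies.txt")
jar.load(ignore_discard=True, ignore_expires=True)  # keep session cookies (expiry 0)

for cookie in jar:
    # Attributes map directly onto the columns shown in the diff above.
    print(cookie.domain, cookie.path, cookie.secure, cookie.expires, cookie.name)
```
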
@@ -1,7 +0,0 @@
{
  "last_update": "2025-08-19T10:05:11.847635",
  "last_item_count": 1000,
  "backlog_captured": true,
  "backlog_timestamp": "20250819_100511",
  "last_id": "CzPvL-HLAoI"
}

@@ -1,7 +0,0 @@
{
  "last_update": "2025-08-19T10:34:23.578337",
  "last_item_count": 35,
  "backlog_captured": true,
  "backlog_timestamp": "20250819_103423",
  "last_id": "7512609729022070024"
}

7 data_production_backlog/.state/youtube_state.json Normal file
@@ -0,0 +1,7 @@
{
  "last_update": "2025-08-18T22:16:04.345767",
  "last_item_count": 200,
  "backlog_captured": true,
  "backlog_timestamp": "20250818_221604",
  "last_id": "Zn4kcNFO1I4"
}
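
These .state files are what gate incremental runs: each source records when it last ran, how many items it saw, and the newest item ID it captured. An illustrative sketch of how such a file can be read and written is below; the field names come from the JSON above, but the functions are assumptions, not the repository's actual implementation.

```python
import json
from datetime import datetime
from pathlib import Path

STATE_FILE = Path("data_production_backlog/.state/youtube_state.json")

def load_state() -> dict:
    """Return the stored state, or defaults when no backlog has been captured yet."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"backlog_captured": False, "last_id": None, "last_item_count": 0}

def save_state(last_id: str, item_count: int) -> None:
    """Persist the newest item ID so the next run can stop when it reaches it."""
    now = datetime.now()
    state = {
        "last_update": now.isoformat(),
        "last_item_count": item_count,
        "backlog_captured": True,
        "backlog_timestamp": now.strftime("%Y%m%d_%H%M%S"),
        "last_id": last_id,
    }
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2))
```

An incremental scraper would walk items newest-first, stop as soon as it reaches the stored last_id, and then call save_state with the ID of the newest item just fetched.
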
File diff suppressed because it is too large

@ -1,774 +0,0 @@
|
|||
# ID: 7099516072725908741
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.636383-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7099516072725908741
|
||||
|
||||
## Views: 126,400
|
||||
|
||||
## Likes: 3,119
|
||||
|
||||
## Comments: 150
|
||||
|
||||
## Shares: 245
|
||||
|
||||
## Caption:
|
||||
Start planning now for 2023!
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7189380105762786566
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.636530-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7189380105762786566
|
||||
|
||||
## Views: 93,900
|
||||
|
||||
## Likes: 1,807
|
||||
|
||||
## Comments: 46
|
||||
|
||||
## Shares: 450
|
||||
|
||||
## Caption:
|
||||
Finally here... Launch date of the @navac_inc NTB7L. If you're heading down to @ahrexpo you'll get a chance to check it out in action.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7124848964452617477
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.636641-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7124848964452617477
|
||||
|
||||
## Views: 229,800
|
||||
|
||||
## Likes: 5,960
|
||||
|
||||
## Comments: 50
|
||||
|
||||
## Shares: 274
|
||||
|
||||
## Caption:
|
||||
SkillMill bringing the fire!
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7540016568957226261
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.636789-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7540016568957226261
|
||||
|
||||
## Views: 6,926
|
||||
|
||||
## Likes: 174
|
||||
|
||||
## Comments: 2
|
||||
|
||||
## Shares: 21
|
||||
|
||||
## Caption:
|
||||
This tool is legit... I cleaned this coil last week but it was still running hot. I've had the SHAECO fin straightener from in my possession now for a while and finally had a chance to use it today, it simply attaches to an oscillating tool. They recommended using some soap bubbles then a comb after to straighten them out. BigBlu was what was used. I used the new 860i to perform a before and after on the coil and it dropped approximately 6⁰F.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7538196385712115000
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.636892-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7538196385712115000
|
||||
|
||||
## Views: 4,523
|
||||
|
||||
## Likes: 132
|
||||
|
||||
## Comments: 3
|
||||
|
||||
## Shares: 2
|
||||
|
||||
## Caption:
|
||||
Some troubleshooting... Sometimes you need a few fuses and use the process of elimination.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7538097200132295941
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.636988-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7538097200132295941
|
||||
|
||||
## Views: 1,293
|
||||
|
||||
## Likes: 39
|
||||
|
||||
## Comments: 2
|
||||
|
||||
## Shares: 7
|
||||
|
||||
## Caption:
|
||||
3 in 1 Filter Rack... The Midea RAC EVOX G³ filter rack can be utilized as a 4", 2" or 1". I would always suggest a 4" filter, it will capture more particulate and also provide more air flow.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7537732064779537720
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.637267-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7537732064779537720
|
||||
|
||||
## Views: 22,500
|
||||
|
||||
## Likes: 791
|
||||
|
||||
## Comments: 33
|
||||
|
||||
## Shares: 144
|
||||
|
||||
## Caption:
|
||||
Vacuum Y and Core Tool... This device has a patent pending. It's the @ritchieyellowjacket Vacuum Y with RealTorque Core removal Tool. Its design allows for Schrader valves to be torqued to spec. with a pre-set in the handle. The Y allows for attachment of 3/8" vacuum hoses to double the flow from a single service valve.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7535113073150020920
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.637368-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7535113073150020920
|
||||
|
||||
## Views: 5,378
|
||||
|
||||
## Likes: 93
|
||||
|
||||
## Comments: 6
|
||||
|
||||
## Shares: 2
|
||||
|
||||
## Caption:
|
||||
Pump replacement... I was invited onto a site by Armstrong Fluid Technology to record a pump re and re. The old single speed pump was removed for a gen 5 Design Envelope pump. Pump manager was also installed to monitor the pump's performance. Pump manager is able to track and record pump data to track energy usage and predict maintenance issues.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7534847716896083256
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.637460-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7534847716896083256
|
||||
|
||||
## Views: 4,620
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7534027218721197318
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.637563-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7534027218721197318
|
||||
|
||||
## Views: 3,881
|
||||
|
||||
## Likes: 47
|
||||
|
||||
## Comments: 7
|
||||
|
||||
## Shares: 0
|
||||
|
||||
## Caption:
|
||||
Full Heat Pump Install Vid... To watch the entire video with the heat pump install tips go to our YouTube channel and search for "heat pump install". Or click the link in the story. The Rectorseal bracket used on this install is adjustable and can handle 500 lbs. It is shipped with isolation pads as well.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7532664694616755512
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.637662-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7532664694616755512
|
||||
|
||||
## Views: 11,200
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7530798356034080056
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.637906-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7530798356034080056
|
||||
|
||||
## Views: 8,665
|
||||
|
||||
## Likes: 183
|
||||
|
||||
## Comments: 6
|
||||
|
||||
## Shares: 45
|
||||
|
||||
## Caption:
|
||||
SureSwtich over view... Through my testing of this device, it has proven valuable. When I installed mine 5 years ago, I put my contactor in a drawer just in case. It's still there. The Copeland SureSwitch is a solid state contactor with sealed contacts, it provides additional compressor protection from brownouts. My favourite feature of the SureSwitch is that it is designed to prevent pitting and arcing through its control function.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7530310420045761797
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.638005-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7530310420045761797
|
||||
|
||||
## Views: 7,859
|
||||
|
||||
## Likes: 296
|
||||
|
||||
## Comments: 6
|
||||
|
||||
## Shares: 8
|
||||
|
||||
## Caption:
|
||||
Heat pump TXV... We hooked up with Jamie Kitchen from Danfoss to discuss heat pump TXVs and the TR6 valve. We will have more videos to come on this subject.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7529941807065500984
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.638330-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7529941807065500984
|
||||
|
||||
## Views: 9,532
|
||||
|
||||
## Likes: 288
|
||||
|
||||
## Comments: 14
|
||||
|
||||
## Shares: 8
|
||||
|
||||
## Caption:
|
||||
Old school will tell you to run it for an hour... But when you truly pay attention, time is not the indicator of a complete evacuation. This 20 ton system was pulled down in 20 minutes by pulling the cores and using 3/4" hoses. This allowed me to use a battery powered vac pump and avoided running cords on a commercial roof. I used the NP6DLM pump and NH35AB 3/4" hoses and NVR2 core removal tool.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7528820889589206328
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.638444-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7528820889589206328
|
||||
|
||||
## Views: 15,800
|
||||
|
||||
## Likes: 529
|
||||
|
||||
## Comments: 15
|
||||
|
||||
## Shares: 200
|
||||
|
||||
## Caption:
|
||||
6 different builds... The Midea RAC Evox G³ was designed with latches so the filter, coil and air handling portion can be built 6 different ways depending on the application.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7527709142165933317
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.638748-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7527709142165933317
|
||||
|
||||
## Views: 2,563
|
||||
|
||||
## Likes: 62
|
||||
|
||||
## Comments: 1
|
||||
|
||||
## Shares: 0
|
||||
|
||||
## Caption:
|
||||
Two leak locations... The first leak is on the body of the pressure switch, anything pressurized can leak, remember this. The second leak isn't actually on that coil, that corroded coil is hydronic. The leak is buried in behind the hydronic coil on the reheat coil. What would your recommendation be here moving forward? Using the Sauermann Si-RD3
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7524443251642813701
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.638919-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7524443251642813701
|
||||
|
||||
## Views: 1,998
|
||||
|
||||
## Likes: 62
|
||||
|
||||
## Comments: 3
|
||||
|
||||
## Shares: 0
|
||||
|
||||
## Caption:
|
||||
Thermistor troubleshooting... We're using the ICM Controls UDefrost control to show a little thermistor troubleshooting. The UDefrost is a heat pump defrost control that has a customized set up through the ICM OMNI app. A thermistor is a resistor that changes resistance due to a change in temperature. In the video we are using an NTC (negative temperature coefficient). This means the resistance will drop on a rise in temperature. PTC (positive temperature coefficient) has a rise in resistance with a rise in temperature.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7522648911681457464
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.639026-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7522648911681457464
|
||||
|
||||
## Views: 10,700
|
||||
|
||||
## Likes: 222
|
||||
|
||||
## Comments: 13
|
||||
|
||||
## Shares: 9
|
||||
|
||||
## Caption:
|
||||
A perfect flare... I spent a day with Joe with Nottawasaga Mechanical and he was on board to give the NEF6LM a go. This was a 2.5 ton Moovair heat pump, which is becoming the heat pump of choice in the area to install. Thanks to for their dedication to excellent tubing tools and to Master for their heat pump product. Always Nylog on the flare seat!
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7520750214311988485
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.639134-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7520750214311988485
|
||||
|
||||
## Views: 159,400
|
||||
|
||||
## Likes: 2,366
|
||||
|
||||
## Comments: 97
|
||||
|
||||
## Shares: 368
|
||||
|
||||
## Caption:
|
||||
Packaged Window Heat Pump... Midea RAC designed this Window Package Heat Pump for high rise buildings in New York City. Word on the street is tenant spaces in some areas will have a max temp they can be at, just like they have a min temp they must maintain. Essentially, some rented spaces will be forced to provide air conditioning if they don't already. I think the atmomized condensate is a cool feature.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7520734215592365368
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.639390-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7520734215592365368
|
||||
|
||||
## Views: 4,482
|
||||
|
||||
## Likes: 105
|
||||
|
||||
## Comments: 3
|
||||
|
||||
## Shares: 1
|
||||
|
||||
## Caption:
|
||||
Check it out... is running a promotion, check out below for more info... Buy an Oxyset or Precision Torch or Nitrogen Kit from any supply store PLUS either the new Power Torch or 1.9L Oxygen Cylinder Scan the QR code or visit ambrocontrols.com/powerup Fill out the redemption form and upload proof of purchase We’ll ship your FREE Backpack direct to you The new power torch can braze up to 3" pipe diameter and is meant to be paired with the larger oxygen cylinder.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7520290054502190342
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.639485-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7520290054502190342
|
||||
|
||||
## Views: 5,202
|
||||
|
||||
## Likes: 123
|
||||
|
||||
## Comments: 3
|
||||
|
||||
## Shares: 4
|
||||
|
||||
## Caption:
|
||||
It builds a barrier to moisture... There's a few manufacturers that do this, York also but it's a one piece harness. From time to time, I see the terminal box melted from moisture penetration. What has really helped is silicone grease, it prevents moisture from getting inside the connection. I'm using silicone grease on this Lennox unit. It's dielectric and won't pass current.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7519663363446590726
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.639573-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7519663363446590726
|
||||
|
||||
## Views: 4,250
|
||||
|
||||
## Likes: 45
|
||||
|
||||
## Comments: 1
|
||||
|
||||
## Shares: 6
|
||||
|
||||
## Caption:
|
||||
Only a few days left to qualify... The ServiceTitan HVAC National Championship Powered by Trane is coming this fall, to qualify for the next round go to hvacnationals.com and take the quiz. US Citizens Only!
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7519143575838264581
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.639663-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7519143575838264581
|
||||
|
||||
## Views: 73,500
|
||||
|
||||
## Likes: 2,335
|
||||
|
||||
## Comments: 20
|
||||
|
||||
## Shares: 371
|
||||
|
||||
## Caption:
|
||||
Reversing valve tutorial part 1... takes us through the operation of a reversing valve. We will have part 2 soon on how the valve switches to cooling mode. Thanks Matt!
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7518919306252471608
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.639753-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7518919306252471608
|
||||
|
||||
## Views: 35,600
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7517701341196586245
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.640092-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7517701341196586245
|
||||
|
||||
## Views: 4,237
|
||||
|
||||
## Likes: 73
|
||||
|
||||
## Comments: 0
|
||||
|
||||
## Shares: 2
|
||||
|
||||
## Caption:
|
||||
Visual inspection first... Carrier rooftop that needs to be chucked off the roof needs to last for "one more summer" 😂. R22 pretty much all gone. Easy repair to be honest. New piece of pipe, evacuate and charge with an R22 drop in. I'm using the Sauermann Si 3DR on this job. Yes it can detect A2L refrigerants.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7516930528050826502
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.640203-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7516930528050826502
|
||||
|
||||
## Views: 7,869
|
||||
|
||||
## Likes: 215
|
||||
|
||||
## Comments: 5
|
||||
|
||||
## Shares: 28
|
||||
|
||||
## Caption:
|
||||
CO2 is not something I've worked on but it's definitely interesting to learn about. Ben Reed had the opportunity to speak with Danfoss Climate Solutions down at AHR about their transcritcal CO2 condensing unit that is capable of handling 115⁰F ambient temperature.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7516268018662493496
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.640314-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7516268018662493496
|
||||
|
||||
## Views: 3,706
|
||||
|
||||
## Likes: 112
|
||||
|
||||
## Comments: 3
|
||||
|
||||
## Shares: 23
|
||||
|
||||
## Caption:
|
||||
Who wants to win??? The HVAC Nationals are being held this fall in Florida. To qualify for this, take the quiz before June 30th. You can find the quiz at hvacnationals.com.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7516262642558799109
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.640419-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7516262642558799109
|
||||
|
||||
## Views: 2,741
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7515566208591088902
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.640711-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7515566208591088902
|
||||
|
||||
## Views: 8,737
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7515071260376845624
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.640821-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7515071260376845624
|
||||
|
||||
## Views: 4,930
|
||||
|
||||
## Likes: 95
|
||||
|
||||
## Comments: 5
|
||||
|
||||
## Shares: 0
|
||||
|
||||
## Caption:
|
||||
On site... I was invited onto a site by to cover the install of a central Moovair heat pump. Joe is choosing to install brackets over a pad or stand due to space and grading restrictions. These units are super quiet. The outdoor unit has flare connections and you know my man is going to use a dab iykyk!
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7514797712802417928
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.640931-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7514797712802417928
|
||||
|
||||
## Views: 10,500
|
||||
|
||||
## Likes: 169
|
||||
|
||||
## Comments: 18
|
||||
|
||||
## Shares: 56
|
||||
|
||||
## Caption:
|
||||
Another brazless connection... This is the Smartlock Fitting 3/8" Swage Coupling. It connects pipe to the swage without pulling out torches. Yes we know, braze4life but sometimes it's good to have options.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7514713297292201224
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.641044-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7514713297292201224
|
||||
|
||||
## Views: 3,057
|
||||
|
||||
## Likes: 72
|
||||
|
||||
## Comments: 2
|
||||
|
||||
## Shares: 5
|
||||
|
||||
## Caption:
|
||||
Drop down filter... This single deflection cassette from Midea RAC has a remote filter drop down to remove and clean it. It's designed to fit in between a joist space also. This head is currently part of a multi zone system but will soon be compatible with a single zone outdoor unit. Thanks to Ascend Group for the tour of the show room yesterday.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7514708767557160200
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.641144-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7514708767557160200
|
||||
|
||||
## Views: 1,807
|
||||
|
||||
## Likes: 40
|
||||
|
||||
## Comments: 1
|
||||
|
||||
## Shares: 0
|
||||
|
||||
## Caption:
|
||||
Our mini series with Michael Cyr wraps up with him explaining some contractor benefits when using Senville products. Tech support Parts support
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7512963405142101266
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.641415-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7512963405142101266
|
||||
|
||||
## Views: 16,100
|
||||
|
||||
## Likes: 565
|
||||
|
||||
## Comments: 5
|
||||
|
||||
## Shares: 30
|
||||
|
||||
## Caption:
|
||||
Thermistor troubleshooting... Using the ICM Controls UDefrost board (universal heat pump defrost board). We will look at how to troubleshoot the thermistor by cross referencing a chart that indicates resistance at a given temperature.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7512609729022070024
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T10:05:50.641525-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7512609729022070024
|
||||
|
||||
## Views: 3,177
|
||||
|
||||
## Likes: 102
|
||||
|
||||
## Comments: 0
|
||||
|
||||
## Shares: 15
|
||||
|
||||
## Caption:
|
||||
Great opportunity for the HVAC elite... You'll need to take the quiz by June 30th to be considered. The link is hvacnationals.com - easy enough to retype or click on it my story. HVAC Nationals are held in Florida and there's 100k in cash prizes up for grabs.
|
||||
|
||||
--------------------------------------------------
|
||||
File diff suppressed because it is too large

@@ -1,124 +0,0 @@
# ID: TpdYT_itu9U

## Title: How HVAC Design & Redundancy Protect Cannabis Grow Rooms & Boost Yields with John Zimmerman Part 1

## Type: video

## Author: None

## Link: https://www.youtube.com/watch?v=TpdYT_itu9U

## Upload Date:

## Views: 266

## Likes: 0

## Comments: 0

## Duration: 1194.0 seconds

## Description:
In this episode of the HVAC Know It All Podcast, host Gary McCreadie chats with John Zimmerman, Founder & CEO of Harvest Integrated, to kick off a two-part conversation about the unique challenges...

--------------------------------------------------

# ID: 1kEjVqBwluU

## Title: HVAC Rental Trap for Homeowners to Avoid Long-Term Losses and Bad Installs with Scott Pierson Part 2

## Type: video

## Author: None

## Link: https://www.youtube.com/watch?v=1kEjVqBwluU

## Upload Date:

## Views: 378

## Likes: 0

## Comments: 0

## Duration: 1015.0 seconds

## Description:
In part 2 of this episode of the HVAC Know It All Podcast, host Gary McCreadie, Director of Player Development and Head Coach at Shelburne Soccer Club, and President of McCreadie HVAC & Refrigerati...

--------------------------------------------------

# ID: 3CuCBsWOPA0

## Title: The Generational Divide in HVAC for Leaders to Retain & Train Young Techs with Scott Pierson Part 1

## Type: video

## Author: None

## Link: https://www.youtube.com/watch?v=3CuCBsWOPA0

## Upload Date:

## Views: 1061

## Likes: 0

## Comments: 0

## Duration: 1348.0 seconds

## Description:
In this special episode of the HVAC Know It All Podcast, the usual host, Gary McCreadie, Director of Player Development and Head Coach at Shelburne Soccer Club, and President of McCreadie HVAC...

--------------------------------------------------

# ID: _wXqg5EXIzA

## Title: How Broken Communication and Bad Leadership in the Trades Cause Burnout with Ben Dryer Part 2

## Type: video

## Author: None

## Link: https://www.youtube.com/watch?v=_wXqg5EXIzA

## Upload Date:

## Views: 338

## Likes: 0

## Comments: 0

## Duration: 1373.0 seconds

## Description:
In Part 2 of this episode of the HVAC Know It All Podcast, host Gary McCreadie is joined by Benjamin Dryer, a Culture Consultant, Culture Pyramid Implementation, Public Speaker at Align & Elevate...

--------------------------------------------------

# ID: 70hcZ1wB7RA

## Title: How the Man Up Culture in HVAC Fuels Burnout and Blocks Progress for Workers with Ben Dryer Part 1

## Type: video

## Author: None

## Link: https://www.youtube.com/watch?v=70hcZ1wB7RA

## Upload Date:

## Views: 987

## Likes: 0

## Comments: 0

## Duration: 1197.0 seconds

## Description:
In this episode of the HVAC Know It All Podcast, host Gary McCreadie speaks with Benjamin Dryer, a Culture Consultant, Culture Pyramid Implementation, Public Speaker at Align & Elevate Consulting,...

--------------------------------------------------

@@ -1,85 +0,0 @@
#!/usr/bin/env python3
"""
Debug MailChimp content structure
"""

import os
import requests
from dotenv import load_dotenv
import json

load_dotenv()

def debug_campaign_content():
    """Debug MailChimp campaign content structure"""

    api_key = os.getenv('MAILCHIMP_API_KEY')
    server = os.getenv('MAILCHIMP_SERVER_PREFIX', 'us10')

    if not api_key:
        print("❌ No MailChimp API key found in .env")
        return

    base_url = f"https://{server}.api.mailchimp.com/3.0"
    headers = {
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }

    # Get campaigns
    params = {
        'count': 5,
        'status': 'sent',
        'folder_id': '6a0d1e2621',  # Bi-Weekly Newsletter folder
        'sort_field': 'send_time',
        'sort_dir': 'DESC'
    }

    response = requests.get(f"{base_url}/campaigns", headers=headers, params=params)
    if response.status_code != 200:
        print(f"Failed to fetch campaigns: {response.status_code}")
        return

    campaigns = response.json().get('campaigns', [])

    for i, campaign in enumerate(campaigns):
        campaign_id = campaign['id']
        subject = campaign.get('settings', {}).get('subject_line', 'N/A')

        print(f"\n{'='*80}")
        print(f"CAMPAIGN {i+1}: {subject}")
        print(f"ID: {campaign_id}")
        print(f"{'='*80}")

        # Get content
        content_response = requests.get(f"{base_url}/campaigns/{campaign_id}/content", headers=headers)

        if content_response.status_code == 200:
            content_data = content_response.json()

            plain_text = content_data.get('plain_text', '')
            html = content_data.get('html', '')

            print(f"PLAIN_TEXT LENGTH: {len(plain_text)}")
            print(f"HTML LENGTH: {len(html)}")

            if plain_text:
                print(f"\nPLAIN_TEXT (first 500 chars):")
                print("-" * 40)
                print(plain_text[:500])
                print("-" * 40)
            else:
                print("\nNO PLAIN_TEXT CONTENT")

            if html:
                print(f"\nHTML (first 500 chars):")
                print("-" * 40)
                print(html[:500])
                print("-" * 40)
            else:
                print("\nNO HTML CONTENT")
        else:
            print(f"Failed to fetch content: {content_response.status_code}")

if __name__ == "__main__":
    debug_campaign_content()

@ -1,18 +0,0 @@
|
|||
[Unit]
|
||||
Description=HVAC Content Aggregation - 12 PM Run
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
User=ben
|
||||
Group=ben
|
||||
WorkingDirectory=/home/ben/dev/hvac-kia-content
|
||||
Environment="PATH=/home/ben/.local/bin:/usr/local/bin:/usr/bin:/bin"
|
||||
Environment="DISPLAY=:0"
|
||||
Environment="XAUTHORITY=/run/user/1000/.mutter-Xwaylandauth.90WDB3"
|
||||
ExecStart=/usr/bin/bash -c 'source /home/ben/dev/hvac-kia-content/.venv/bin/activate && python run_api_production_v2.py'
|
||||
StandardOutput=journal
|
||||
StandardError=journal
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
|
|
@ -1,10 +0,0 @@
|
|||
[Unit]
|
||||
Description=HVAC Content Aggregation - 12 PM Timer
|
||||
Requires=hvac-content-12pm.service
|
||||
|
||||
[Timer]
|
||||
OnCalendar=*-*-* 12:00:00
|
||||
Persistent=true
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
|
|
@ -1,18 +0,0 @@
|
|||
[Unit]
|
||||
Description=HVAC Content Aggregation - 8 AM Run
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
User=ben
|
||||
Group=ben
|
||||
WorkingDirectory=/home/ben/dev/hvac-kia-content
|
||||
Environment="PATH=/home/ben/.local/bin:/usr/local/bin:/usr/bin:/bin"
|
||||
Environment="DISPLAY=:0"
|
||||
Environment="XAUTHORITY=/run/user/1000/.mutter-Xwaylandauth.90WDB3"
|
||||
ExecStart=/usr/bin/bash -c 'source /home/ben/dev/hvac-kia-content/.venv/bin/activate && python run_api_production_v2.py'
|
||||
StandardOutput=journal
|
||||
StandardError=journal
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
|
|
@ -1,10 +0,0 @@
|
|||
[Unit]
|
||||
Description=HVAC Content Aggregation - 8 AM Timer
|
||||
Requires=hvac-content-8am.service
|
||||
|
||||
[Timer]
|
||||
OnCalendar=*-*-* 08:00:00
|
||||
Persistent=true
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
|
|
@ -1,18 +0,0 @@
|
|||
[Unit]
|
||||
Description=HVAC Content Cumulative with Images - 8 AM Run
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
User=ben
|
||||
Group=ben
|
||||
WorkingDirectory=/home/ben/dev/hvac-kia-content
|
||||
Environment="PATH=/home/ben/.local/bin:/usr/local/bin:/usr/bin:/bin"
|
||||
Environment="DISPLAY=:0"
|
||||
Environment="XAUTHORITY=/run/user/1000/.mutter-Xwaylandauth.90WDB3"
|
||||
ExecStart=/usr/bin/bash -c 'source /home/ben/dev/hvac-kia-content/.venv/bin/activate && python run_production_cumulative.py'
|
||||
StandardOutput=journal
|
||||
StandardError=journal
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
|
|
@ -1,18 +0,0 @@
|
|||
[Unit]
|
||||
Description=HKIA Content Aggregation with Images - 12 PM Run
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
User=ben
|
||||
Group=ben
|
||||
WorkingDirectory=/home/ben/dev/hvac-kia-content
|
||||
Environment="PATH=/home/ben/.local/bin:/usr/local/bin:/usr/bin:/bin"
|
||||
Environment="DISPLAY=:0"
|
||||
Environment="XAUTHORITY=/run/user/1000/.mutter-Xwaylandauth.90WDB3"
|
||||
ExecStart=/usr/bin/bash -c 'source /home/ben/dev/hvac-kia-content/.venv/bin/activate && python run_production_with_images.py'
|
||||
StandardOutput=journal
|
||||
StandardError=journal
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
|
|
@ -1,18 +0,0 @@
|
|||
[Unit]
|
||||
Description=HKIA Content Aggregation with Images - 8 AM Run
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
User=ben
|
||||
Group=ben
|
||||
WorkingDirectory=/home/ben/dev/hvac-kia-content
|
||||
Environment="PATH=/home/ben/.local/bin:/usr/local/bin:/usr/bin:/bin"
|
||||
Environment="DISPLAY=:0"
|
||||
Environment="XAUTHORITY=/run/user/1000/.mutter-Xwaylandauth.90WDB3"
|
||||
ExecStart=/usr/bin/bash -c 'source /home/ben/dev/hvac-kia-content/.venv/bin/activate && python run_production_with_images.py'
|
||||
StandardOutput=journal
|
||||
StandardError=journal
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
|
|
@ -1,36 +0,0 @@
|
|||
#!/bin/bash
|
||||
# Installation script for HVAC Content Aggregation systemd services
|
||||
|
||||
echo "Installing HVAC Content Aggregation systemd services..."
|
||||
|
||||
# Copy service files
|
||||
sudo cp hvac-content-8am.service /etc/systemd/system/
|
||||
sudo cp hvac-content-8am.timer /etc/systemd/system/
|
||||
sudo cp hvac-content-12pm.service /etc/systemd/system/
|
||||
sudo cp hvac-content-12pm.timer /etc/systemd/system/
|
||||
|
||||
# Reload systemd
|
||||
sudo systemctl daemon-reload
|
||||
|
||||
# Enable timers
|
||||
sudo systemctl enable hvac-content-8am.timer
|
||||
sudo systemctl enable hvac-content-12pm.timer
|
||||
|
||||
# Start timers
|
||||
sudo systemctl start hvac-content-8am.timer
|
||||
sudo systemctl start hvac-content-12pm.timer
|
||||
|
||||
# Show status
|
||||
echo ""
|
||||
echo "Service status:"
|
||||
sudo systemctl status hvac-content-8am.timer --no-pager
|
||||
echo ""
|
||||
sudo systemctl status hvac-content-12pm.timer --no-pager
|
||||
|
||||
echo ""
|
||||
echo "Installation complete!"
|
||||
echo ""
|
||||
echo "Useful commands:"
|
||||
echo " View logs: journalctl -u hvac-content-8am -f"
|
||||
echo " Check timer: systemctl list-timers | grep hvac"
|
||||
echo " Manual run: sudo systemctl start hvac-content-8am.service"
|
||||
|
|
@ -1,74 +0,0 @@
|
|||
#!/bin/bash
|
||||
# Update script to enable image downloading in production
|
||||
|
||||
echo "Updating HVAC Content Aggregation to include image downloads..."
|
||||
echo
|
||||
|
||||
# Stop and disable old services
|
||||
echo "Stopping old services..."
|
||||
sudo systemctl stop hvac-content-8am.timer hvac-content-12pm.timer
|
||||
sudo systemctl disable hvac-content-8am.service hvac-content-12pm.service
|
||||
sudo systemctl disable hvac-content-8am.timer hvac-content-12pm.timer
|
||||
|
||||
# Copy new service files
|
||||
echo "Installing new services with image downloads..."
|
||||
sudo cp hvac-content-images-8am.service /etc/systemd/system/
|
||||
sudo cp hvac-content-images-12pm.service /etc/systemd/system/
|
||||
|
||||
# Create new timer files (reuse existing timers with new names)
|
||||
sudo tee /etc/systemd/system/hvac-content-images-8am.timer > /dev/null <<EOF
|
||||
[Unit]
|
||||
Description=Run HVAC Content with Images at 8 AM daily
|
||||
|
||||
[Timer]
|
||||
OnCalendar=*-*-* 08:00:00
|
||||
Persistent=true
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
EOF
|
||||
|
||||
sudo tee /etc/systemd/system/hvac-content-images-12pm.timer > /dev/null <<EOF
|
||||
[Unit]
|
||||
Description=Run HVAC Content with Images at 12 PM daily
|
||||
|
||||
[Timer]
|
||||
OnCalendar=*-*-* 12:00:00
|
||||
Persistent=true
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
EOF
|
||||
|
||||
# Reload systemd
|
||||
echo "Reloading systemd..."
|
||||
sudo systemctl daemon-reload
|
||||
|
||||
# Enable new services
|
||||
echo "Enabling new services..."
|
||||
sudo systemctl enable hvac-content-images-8am.timer
|
||||
sudo systemctl enable hvac-content-images-12pm.timer
|
||||
|
||||
# Start timers
|
||||
echo "Starting timers..."
|
||||
sudo systemctl start hvac-content-images-8am.timer
|
||||
sudo systemctl start hvac-content-images-12pm.timer
|
||||
|
||||
# Show status
|
||||
echo
|
||||
echo "Service status:"
|
||||
sudo systemctl status hvac-content-images-8am.timer --no-pager
|
||||
echo
|
||||
sudo systemctl status hvac-content-images-12pm.timer --no-pager
|
||||
echo
|
||||
echo "Next scheduled runs:"
|
||||
sudo systemctl list-timers hvac-content-images-* --no-pager
|
||||
|
||||
echo
|
||||
echo "✅ Update complete! Image downloading is now enabled in production."
|
||||
echo "The scrapers will now download:"
|
||||
echo " - Instagram post images and video thumbnails"
|
||||
echo " - YouTube video thumbnails"
|
||||
echo " - Podcast episode thumbnails"
|
||||
echo
|
||||
echo "Images will be synced to: /mnt/nas/hkia/media/"
|
||||
|
|
@ -1,6 +1,6 @@
#!/bin/bash
#
# HKIA - Production Deployment Script
# HVAC Know It All - Production Deployment Script
# Sets up systemd services, directories, and configuration
#

@ -67,7 +67,7 @@ setup_directories() {
mkdir -p "$PROD_DIR/venv"

# Create NAS mount point (if doesn't exist)
mkdir -p "/mnt/nas/hkia"
mkdir -p "/mnt/nas/hvacknowitall"

# Copy application files
cp -r "$REPO_DIR/src" "$PROD_DIR/"

@ -222,7 +222,7 @@ verify_installation() {

# Main deployment function
main() {
print_status "Starting HKIA production deployment..."
print_status "Starting HVAC Know It All production deployment..."
echo

check_root
@ -59,7 +59,7 @@
- [ ] NAS mount point exists and is accessible
- [ ] Write permissions verified:
```bash
touch /mnt/nas/hkia/test.txt && rm /mnt/nas/hkia/test.txt
touch /mnt/nas/hvacknowitall/test.txt && rm /mnt/nas/hvacknowitall/test.txt
```
- [ ] Sufficient space available on NAS

@ -136,15 +136,15 @@
### 6. Enable Services
- [ ] Enable main timer:
```bash
sudo systemctl enable hkia-content-aggregator.timer
sudo systemctl enable hvac-content-aggregator.timer
```
- [ ] Start timer:
```bash
sudo systemctl start hkia-content-aggregator.timer
sudo systemctl start hvac-content-aggregator.timer
```
- [ ] Verify timer is active:
```bash
systemctl status hkia-content-aggregator.timer
systemctl status hvac-content-aggregator.timer
```

### 7. Optional: TikTok Captions

@ -163,7 +163,7 @@
```
- [ ] No errors in service status:
```bash
systemctl status hkia-content-aggregator.service
systemctl status hvac-content-aggregator.service
```
- [ ] Log files being created:
```bash

@ -173,7 +173,7 @@
### First Run Verification
- [ ] Manually trigger first run:
```bash
sudo systemctl start hkia-content-aggregator.service
sudo systemctl start hvac-content-aggregator.service
```
- [ ] Monitor logs in real-time:
```bash

@ -241,7 +241,7 @@
- [ ] Check systemd timer status
- [ ] Review journal logs:
```bash
journalctl -u hkia-content-aggregator.timer
journalctl -u hvac-content-aggregator.timer
```

### If NAS Sync Fails

@ -255,7 +255,7 @@
### Quick Rollback
1. [ ] Stop services:
```bash
sudo systemctl stop hkia-content-aggregator.timer
sudo systemctl stop hvac-content-aggregator.timer
```
2. [ ] Restore previous version:
```bash

@ -264,7 +264,7 @@
```
3. [ ] Restart services:
```bash
sudo systemctl start hkia-content-aggregator.timer
sudo systemctl start hvac-content-aggregator.timer
```

### Full Rollback
@ -1,7 +1,7 @@
# Production Readiness Todo List

## Overview
This document outlines all tasks required to meet the original specification and prepare the HKIA Content Aggregator for production deployment. Tasks are organized by priority and phase.
This document outlines all tasks required to meet the original specification and prepare the HVAC Know It All Content Aggregator for production deployment. Tasks are organized by priority and phase.

**Note:** Docker/Kubernetes deployment is not feasible due to TikTok scraping requiring display server access. The system uses systemd for service management instead.

@ -26,7 +26,7 @@ This document outlines all tasks required to meet the original specification and
### File Organization
- [ ] Fix file naming convention to match spec format
  - Change from: `update_20241218_060000.md`
  - To: `hkia_<source>_2024-12-18-T060000.md`
  - To: `hvacknowitall_<source>_2024-12-18-T060000.md`

- [ ] Create proper directory structure
```

@ -306,7 +306,7 @@ sed -i 's/18:00:00/12:00:00/g' systemd/*.timer

# Phase 4: Test Deployment
./install_production.sh
systemctl status hkia-content-aggregator.timer
systemctl status hvac-content-aggregator.timer
```

---
@ -1,188 +0,0 @@
# Cumulative Markdown System Documentation

## Overview

The cumulative markdown system maintains a single, continuously growing markdown file per content source that intelligently combines backlog data with incremental daily updates.

## Problem It Solves

Previously, each scraper run created entirely new files:
- Backlog runs created large initial files
- Incremental updates created small separate files
- No merging of content between files
- Multiple files per source made it hard to find the "current" state

## Solution Architecture

### CumulativeMarkdownManager

Core class that handles:
1. **Loading** existing markdown files
2. **Parsing** content into sections by unique ID
3. **Merging** new content with existing sections
4. **Updating** sections when better data is available
5. **Archiving** previous versions for history
6. **Saving** updated single-source-of-truth file

### Merge Logic

The system uses intelligent merging based on content quality:

```python
def should_update_section(old_section, new_section):
    # Update if new has captions/transcripts that old doesn't
    if new_has_captions and not old_has_captions:
        return True

    # Update if new has significantly more content
    if new_description_length > old_description_length * 1.2:
        return True

    # Update if metrics have increased
    if new_views > old_views:
        return True

    return False
```
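The snippet above is pseudocode (the `new_has_captions`-style names are never defined). A minimal runnable sketch of the same decision rule, assuming each parsed section is a plain dict with `captions`, `description`, and `views` keys (field names are illustrative, not the repository's actual schema):

```python
# Sketch only: section dicts and key names are illustrative assumptions.
def should_update_section(old: dict, new: dict) -> bool:
    # Prefer the version that carries captions/transcripts.
    if new.get("captions") and not old.get("captions"):
        return True

    # Prefer a significantly longer description (20% or more).
    if len(new.get("description", "")) > len(old.get("description", "")) * 1.2:
        return True

    # Prefer higher engagement metrics.
    if new.get("views", 0) > old.get("views", 0):
        return True

    return False
```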

## Usage Patterns

### Initial Backlog Capture

```python
# Day 1 - First run captures all historical content
scraper.fetch_content(max_posts=1000)
# Creates: hvacnkowitall_YouTube_20250819T080000.md (444 videos)
```

### Daily Incremental Update

```python
# Day 2 - Fetch only new content since last run
scraper.fetch_content()  # Uses state to get only new items
# Loads existing file, merges new content
# Updates: hvacnkowitall_YouTube_20250819T120000.md (449 videos)
```

### Caption/Transcript Enhancement

```python
# Day 3 - Fetch captions for existing videos
youtube_scraper.fetch_captions(video_ids)
# Loads existing file, updates videos with caption data
# Updates: hvacnkowitall_YouTube_20250819T080000.md (449 videos, 200 with captions)
```

## File Management

### Naming Convention
```
hvacnkowitall_<Source>_<YYYY-MM-DDTHHMMSS>.md
```
- Brand name is always lowercase
- Source name is TitleCase
- Timestamp in Atlantic timezone

### Archive Strategy
```
Current:
  hvacnkowitall_YouTube_20250819T143045.md  (latest)

Archives:
  YouTube/
    hvacnkowitall_YouTube_20250819T080000_archived_20250819_120000.md
    hvacnkowitall_YouTube_20250819T120000_archived_20250819_143045.md
```
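A small sketch of the archive step implied above, assuming the manager moves the previous cumulative file into a per-source archive folder and appends an `_archived_<timestamp>` suffix (function and argument names are hypothetical, not the actual `CumulativeMarkdownManager` API):

```python
from datetime import datetime
from pathlib import Path

def archive_previous(current: Path, archive_root: Path, source: str) -> Path:
    """Move the old cumulative file aside, tagging it with the archive time."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = archive_root / source / f"{current.stem}_archived_{stamp}{current.suffix}"
    target.parent.mkdir(parents=True, exist_ok=True)
    return current.rename(target)
```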

## Implementation Details

### Section Structure

Each content item is a section with unique ID:
```markdown
# ID: video_abc123

## Title: Video Title

## Views: 1,234

## Description:
Full description text...

## Caption Status:
Caption text if available...

## Publish Date: 2024-01-15

--------------------------------------------------
```

### Merge Process

1. **Parse** both existing and new content into sections
2. **Index** by unique ID (video ID, post ID, etc.)
3. **Compare** sections with same ID
4. **Update** if new version is better
5. **Add** new sections not in existing file
6. **Sort** by date (newest first) or maintain order
7. **Save** combined content with new timestamp
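A compact sketch of steps 1 to 5, assuming sections are delimited by the `# ID:` heading and the separator line shown under Section Structure (all names below are illustrative, not the actual `CumulativeMarkdownManager` API):

```python
import re

SEPARATOR = "-" * 50  # separator width is illustrative


def parse_sections(markdown: str) -> dict:
    """Split a cumulative markdown file into {section_id: section_text}."""
    sections = {}
    for block in markdown.split(SEPARATOR):
        match = re.search(r"^# ID:\s*(\S+)", block, flags=re.MULTILINE)
        if match:
            sections[match.group(1)] = block.strip()
    return sections


def should_update(old_text: str, new_text: str) -> bool:
    # Stand-in for the quality rule sketched under "Merge Logic".
    return len(new_text) > len(old_text)


def merge_files(existing_md: str, new_md: str) -> dict:
    """Index both files by ID, keep existing entries, add or upgrade as needed."""
    merged = parse_sections(existing_md)
    for section_id, new_text in parse_sections(new_md).items():
        old_text = merged.get(section_id)
        if old_text is None or should_update(old_text, new_text):
            merged[section_id] = new_text
    return merged
```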

### State Management

State files track last processed item for incremental updates:
```json
{
  "last_video_id": "abc123",
  "last_video_date": "2024-01-20",
  "last_sync": "2024-01-20T12:00:00",
  "total_processed": 449
}
```
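A standalone sketch of how a run might read and advance this state file between incremental fetches. The path and helper names are illustrative; the project's scrapers carry their own `state_file` and `save_state()` handling:

```python
import json
from datetime import datetime
from pathlib import Path

STATE_FILE = Path("data/state/youtube_state.json")  # illustrative path


def load_state() -> dict:
    """Return the last-run state, or an empty dict on the first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}


def save_state(last_id: str, last_date: str, total: int) -> None:
    """Persist the markers the next incremental run will start from."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    state = {
        "last_video_id": last_id,
        "last_video_date": last_date,
        "last_sync": datetime.now().isoformat(),
        "total_processed": total,
    }
    STATE_FILE.write_text(json.dumps(state, indent=2))
```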

## Benefits

1. **Single Source of Truth**: One file per source with all content
2. **Automatic Updates**: Existing entries enhanced with new data
3. **Efficient Storage**: No duplicate content across files
4. **Complete History**: Archives preserve all versions
5. **Incremental Growth**: Files grow naturally over time
6. **Smart Merging**: Best version of each entry is preserved

## Migration from Separate Files

Use the consolidation script to migrate existing separate files:

```bash
# Consolidate all existing files into cumulative format
uv run python consolidate_current_files.py
```

This will:
1. Find all files for each source
2. Parse and merge by content ID
3. Create single cumulative file
4. Archive old separate files

## Testing

Test the cumulative workflow:

```bash
uv run python test_cumulative_mode.py
```

This demonstrates:
- Initial backlog capture (5 items)
- First incremental update (+2 items = 7 total)
- Second incremental with updates (1 updated, +1 new = 8 total)
- Proper archival of previous versions

## Future Enhancements

Potential improvements:
1. Conflict resolution strategies (user choice on updates)
2. Differential backups (only store changes)
3. Compression of archived versions
4. Metrics tracking across versions
5. Automatic cleanup of old archives
6. API endpoint to query cumulative statistics
@ -1,4 +1,4 @@
# HKIA - Deployment Strategy
# HVAC Know It All - Deployment Strategy

## Summary

@ -76,20 +76,20 @@ After thorough testing and implementation, the content aggregation system has be
├── .env                 # Environment configuration
├── requirements.txt     # Python dependencies
└── systemd/             # Service configuration
    ├── hkia-scraper.service
    ├── hkia-scraper-morning.timer
    └── hkia-scraper-afternoon.timer
    ├── hvac-scraper.service
    ├── hvac-scraper-morning.timer
    └── hvac-scraper-afternoon.timer
```

## NAS Integration

**Sync to**: `/mnt/nas/hkia/`
**Sync to**: `/mnt/nas/hvacknowitall/`
- Markdown files with timestamped archives
- Organized by source and date
- Incremental sync to minimize bandwidth

## Conclusion

While the original containerized approach is not viable due to TikTok's GUI requirements, the direct deployment approach provides a robust and maintainable solution for the HKIA content aggregation system.
While the original containerized approach is not viable due to TikTok's GUI requirements, the direct deployment approach provides a robust and maintainable solution for the HVAC Know It All content aggregation system.

The system successfully aggregates content from 5 major sources with the option to include TikTok when needed, providing comprehensive coverage of the HKIA brand across digital platforms.
The system successfully aggregates content from 5 major sources with the option to include TikTok when needed, providing comprehensive coverage of the HVAC Know It All brand across digital platforms.
@ -1,8 +1,8 @@
# HKIA Content Aggregation System - Final Status
# HVAC Know It All Content Aggregation System - Final Status

## 🎉 Project Complete!

The HKIA content aggregation system has been successfully implemented and tested. All 6 content sources are working, with deployment-ready infrastructure.
The HVAC Know It All content aggregation system has been successfully implemented and tested. All 6 content sources are working, with deployment-ready infrastructure.

## ✅ **All Sources Working (6/6)**

@ -20,7 +20,7 @@ The HKIA content aggregation system has been successfully implemented and tested
### ✅ Content Aggregation
- **Incremental Updates**: Only fetches new content since last run
- **State Management**: JSON state files track last sync timestamps
- **Markdown Generation**: Standardized format `hkia_{source}_{timestamp}.md`
- **Markdown Generation**: Standardized format `hvacknowitall_{source}_{timestamp}.md`
- **Archive Management**: Automatic archiving of previous content

### ✅ Technical Infrastructure

@ -30,7 +30,7 @@ The HKIA content aggregation system has been successfully implemented and tested
- **Session Persistence**: Instagram login session reuse

### ✅ Data Management
- **NAS Synchronization**: rsync to `/mnt/nas/hkia/`
- **NAS Synchronization**: rsync to `/mnt/nas/hvacknowitall/`
- **File Organization**: Current and archived content separation
- **Log Management**: Rotating logs with configurable retention

@ -87,9 +87,9 @@ Total: 6/6 passed
│   ├── tiktok_scraper_advanced.py  # TikTok Scrapling
│   └── orchestrator.py             # Main coordinator
├── systemd/                        # Service configuration
│   ├── hkia-scraper.service
│   ├── hkia-scraper-morning.timer
│   └── hkia-scraper-afternoon.timer
│   ├── hvac-scraper.service
│   ├── hvac-scraper-morning.timer
│   └── hvac-scraper-afternoon.timer
├── test_data/                      # Test results
│   ├── recent/                     # Recent content tests
│   └── backlog/                    # Backlog tests

@ -115,14 +115,14 @@ sudo ./install.sh
### **Manual Commands**
```bash
# Check service status
systemctl status hkia-scraper-morning.timer
systemctl status hkia-scraper-afternoon.timer
systemctl status hvac-scraper-morning.timer
systemctl status hvac-scraper-afternoon.timer

# Manual execution
sudo systemctl start hkia-scraper.service
sudo systemctl start hvac-scraper.service

# View logs
journalctl -u hkia-scraper.service -f
journalctl -u hvac-scraper.service -f

# Test individual sources
python -m src.orchestrator --sources wordpress instagram

@ -204,7 +204,7 @@ python -m src.orchestrator --sources wordpress instagram

## 🏆 **Conclusion**

The HKIA content aggregation system successfully delivers on all requirements:
The HVAC Know It All content aggregation system successfully delivers on all requirements:

- **Complete Coverage**: All 6 major content sources working
- **Production Ready**: Robust error handling and deployment infrastructure

@ -212,6 +212,6 @@ The HKIA content aggregation system successfully delivers on all requirements:
- **Reliable**: Comprehensive testing and proven real-world performance
- **Maintainable**: Clean architecture with extensive documentation

The system is ready for production deployment and will provide automated, comprehensive content aggregation for the HKIA brand across all digital platforms.
The system is ready for production deployment and will provide automated, comprehensive content aggregation for the HVAC Know It All brand across all digital platforms.

**Project Status: ✅ COMPLETE AND PRODUCTION READY**
@ -1,186 +0,0 @@
# Image Download System

## Overview

The HKIA content aggregation system now includes comprehensive image downloading capabilities for all supported sources. This system downloads thumbnails and images (but not videos) to provide visual context alongside the markdown content.

## Supported Image Types

### Instagram
- **Post images**: All images from single posts and carousel posts
- **Video thumbnails**: Thumbnail images for video posts (videos themselves are not downloaded)
- **Story images**: Images from stories (video stories get thumbnails only)

### YouTube
- **Video thumbnails**: High-resolution thumbnails for each video
- **Formats**: Attempts to get maxres > high > medium > default quality

### Podcasts
- **Episode thumbnails**: iTunes artwork and media thumbnails for each episode
- **Formats**: PNG/JPEG episode artwork

## File Naming Convention

All downloaded images follow a consistent naming pattern:
```
{source}_{item_id}_{type}_{optional_number}.{ext}
```

Examples:
- `instagram_Cm1wgRMr_mj_video_thumb.jpg`
- `instagram_CpgiKyqPoX1_image_1.jpg`
- `youtube_dQw4w9WgXcQ_thumbnail.jpg`
- `podcast_episode123_thumbnail.png`
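A tiny helper that produces names in this pattern (a sketch only; the shipped scrapers construct their own paths):

```python
from pathlib import Path
from typing import Optional


def media_filename(source: str, item_id: str, kind: str,
                   number: Optional[int] = None, ext: str = "jpg") -> str:
    """Build names like instagram_CpgiKyqPoX1_image_1.jpg (illustrative helper)."""
    suffix = f"_{number}" if number is not None else ""
    return f"{source}_{item_id}_{kind}{suffix}.{ext}"


# e.g. data/media/Instagram/instagram_CpgiKyqPoX1_image_1.jpg
path = Path("data/media/Instagram") / media_filename("instagram", "CpgiKyqPoX1", "image", 1)
```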

## Directory Structure

```
data/
├── media/
│   ├── Instagram/
│   │   ├── instagram_post1_image.jpg
│   │   └── instagram_post2_video_thumb.jpg
│   ├── YouTube/
│   │   ├── youtube_video1_thumbnail.jpg
│   │   └── youtube_video2_thumbnail.jpg
│   └── Podcast/
│       ├── podcast_ep1_thumbnail.png
│       └── podcast_ep2_thumbnail.jpg
└── markdown_current/
    ├── hkia_instagram_*.md
    ├── hkia_youtube_*.md
    └── hkia_podcast_*.md
```

## Enhanced Scrapers

### InstagramScraperWithImages
- Extends `InstagramScraper`
- Downloads all non-video media
- Handles carousel posts with multiple images
- Stores local paths in `local_images` field

### YouTubeAPIScraperWithThumbnails
- Extends `YouTubeAPIScraper`
- Downloads video thumbnails
- Selects highest quality available
- Stores local path in `local_thumbnail` field

### RSSScraperPodcastWithImages
- Extends `RSSScraperPodcast`
- Downloads episode thumbnails
- Extracts from iTunes metadata
- Stores local path in `local_thumbnail` field

## Production Scripts

### run_production_with_images.py
Main production script that:
1. Runs all enhanced scrapers
2. Downloads images during content fetching
3. Updates cumulative markdown files
4. Syncs both markdown and images to NAS

### Test Script
`test_image_downloads.py` - Tests image downloading with small batches:
- 3 YouTube videos
- 3 Instagram posts
- 3 Podcast episodes

## NAS Synchronization

The rsync function has been enhanced to sync images:

```python
# Sync markdown files
rsync -av --include=*.md --exclude=* data/markdown_current/ /mnt/nas/hkia/markdown_current/

# Sync image files
rsync -av --include=*/ --include=*.jpg --include=*.jpeg --include=*.png --include=*.gif --exclude=* data/media/ /mnt/nas/hkia/media/
```
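The two commands above are shell invocations despite the `python` fence; inside a Python runner they would typically be executed through `subprocess`, roughly as below. This is a sketch, not the project's actual sync function; the destination path simply follows the example above:

```python
import subprocess


def sync_media_to_nas(local_dir: str = "data/media/",
                      nas_dir: str = "/mnt/nas/hkia/media/") -> None:
    """Mirror image files to the NAS; --include=*/ keeps the per-source subfolders."""
    subprocess.run(
        [
            "rsync", "-av",
            "--include=*/",
            "--include=*.jpg", "--include=*.jpeg",
            "--include=*.png", "--include=*.gif",
            "--exclude=*",
            local_dir, nas_dir,
        ],
        check=True,
    )
```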

## Markdown Integration

Downloaded images are referenced in markdown files:

```markdown
## Thumbnail:


## Downloaded Images:
- [image1.jpg](media/Instagram/instagram_postId_image_1.jpg)
- [image2.jpg](media/Instagram/instagram_postId_image_2.jpg)
```

## Rate Limiting Considerations

- **Instagram**: Aggressive delays between image downloads (10-20 seconds)
- **YouTube**: Minimal delays, respects API quota
- **Podcast**: No rate limiting needed for RSS feeds
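A randomized sleep between Instagram downloads is the simplest way to get the 10-20 second spacing described above. This is illustrative only; the repository's scrapers carry their own delay helper:

```python
import random
import time


def humanized_delay(min_seconds: float = 10.0, max_seconds: float = 20.0) -> None:
    """Sleep a random interval so download timing looks less mechanical."""
    time.sleep(random.uniform(min_seconds, max_seconds))


# Example: call humanized_delay() between consecutive Instagram image downloads;
# YouTube thumbnails need little or no delay since quota, not rate, is the limit.
```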

## Storage Estimates

Based on testing:
- **Instagram**: ~70-100 KB per image
- **YouTube**: ~100-200 KB per thumbnail
- **Podcast**: ~3-4 MB per episode thumbnail (high quality artwork)

For 1000 items per source:
- Instagram: ~100 MB (assuming 1 image per post)
- YouTube: ~200 MB
- Podcast: ~4 GB (if all episodes have artwork)

## Usage

### Test Image Downloads
```bash
python test_image_downloads.py
```

### Production Run with Images
```bash
python run_production_with_images.py
```

### Check Downloaded Images
```bash
# Count images per source
find data/media -name "*.jpg" -o -name "*.png" | wc -l

# Check disk usage
du -sh data/media/*
```

## Configuration

No additional configuration needed. The system uses existing environment variables:
- Instagram credentials for authenticated image access
- YouTube API key (thumbnails are public)
- Podcast RSS URL (thumbnails in feed metadata)

## Future Enhancements

Potential improvements:
1. Image optimization/compression to reduce storage
2. Configurable image quality settings
3. Option to download video files (currently excluded)
4. Thumbnail generation for videos without thumbnails
5. Image deduplication for repeated content

## Troubleshooting

### Images Not Downloading
- Check network connectivity
- Verify source credentials (Instagram)
- Check disk space
- Review logs for HTTP errors

### Rate Limiting
- Instagram may block rapid downloads
- Use aggressive delays in scraper
- Consider batching downloads

### Storage Issues
- Monitor disk usage
- Consider external storage for media
- Implement rotation/archiving strategy
@ -1,7 +1,7 @@
# HKIA Content Aggregation System - Project Specification
# HVAC Know It All Content Aggregation System - Project Specification

## Overview
A containerized Python application that aggregates content from multiple HKIA sources, converts them to markdown format, and syncs to a NAS. The system runs on a Kubernetes cluster on the control plane node.
A containerized Python application that aggregates content from multiple HVAC Know It All sources, converts them to markdown format, and syncs to a NAS. The system runs on a Kubernetes cluster on the control plane node.

## Content Sources

@ -13,17 +13,17 @@ A containerized Python application that aggregates content from multiple HKIA so

### 2. MailChimp RSS
- **Fields**: ID, title, link, publish date, content
- **URL**: https://hkia.com/feed/
- **URL**: https://hvacknowitall.com/feed/
- **Tool**: feedparser

### 3. Podcast RSS
- **Fields**: ID, audio link, author, title, subtitle, pubDate, duration, description, image, episode link
- **URL**: https://hkia.com/podcast/feed/
- **URL**: https://hvacknowitall.com/podcast/feed/
- **Tool**: feedparser

### 4. WordPress Blog Posts
- **Fields**: ID, title, author, publish date, word count, tags, categories
- **API**: REST API at https://hkia.com/
- **API**: REST API at https://hvacknowitall.com/
- **Credentials**: Stored in .env (WORDPRESS_USERNAME, WORDPRESS_API_KEY)

### 5. Instagram

@ -44,11 +44,11 @@ A containerized Python application that aggregates content from multiple HKIA so
3. Convert all content to markdown using MarkItDown
4. Download associated media files
5. Archive previous markdown files
6. Rsync to NAS at /mnt/nas/hkia/
6. Rsync to NAS at /mnt/nas/hvacknowitall/

### File Naming Convention
`<brandName>_<source>_<dateTime in Atlantic Timezone>.md`
Example: `hkia_blog_2024-15-01-T143045.md`
Example: `hvacnkowitall_blog_2024-15-01-T143045.md`
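For illustration, a helper that renders this convention with Python's `zoneinfo` (a sketch only; the example strings above appear to transpose month and day, so this emits a standard year-month-day stamp):

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def markdown_filename(brand: str, source: str) -> str:
    """e.g. hvacknowitall_blog_2024-01-15-T143045.md, stamped in Atlantic time."""
    now = datetime.now(ZoneInfo("America/Halifax"))
    return f"{brand}_{source}_{now.strftime('%Y-%m-%d-T%H%M%S')}.md"
```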

### Directory Structure
```
@ -1,11 +1,11 @@
# HKIA Content Aggregation - Project Status
# HVAC Know It All Content Aggregation - Project Status

## Current Status: 🟢 PRODUCTION READY
## Current Status: 🟢 PRODUCTION DEPLOYED

**Project Completion: 100%**
**All 6 Sources: ✅ Working**
**Deployment: 🚀 Production Ready**
**Last Updated: 2025-08-19 10:50 ADT**
**Deployment: 🚀 In Production**
**Last Updated: 2025-08-18 23:15 ADT**

---

@ -13,34 +13,18 @@

| Source | Status | Last Tested | Items Fetched | Notes |
|--------|--------|-------------|---------------|-------|
| YouTube | ✅ API Working | 2025-08-19 | 444 videos | API integration, 179/444 with captions (40.3%) |
| MailChimp | ✅ API Working | 2025-08-19 | 22 campaigns | API integration, cleaned content |
| TikTok | ✅ Working | 2025-08-19 | 35 videos | All available videos captured |
| Podcast RSS | ✅ Working | 2025-08-19 | 428 episodes | Full backlog captured |
| WordPress Blog | ✅ Working | 2025-08-18 | 139 posts | HTML cleaning implemented |
| Instagram | 🔄 Processing | 2025-08-19 | ~555/1000 posts | Long-running backlog capture |
| WordPress Blog | ✅ Working | 2025-08-18 | 139 posts | HTML cleaning implemented, clean markdown output |
| MailChimp RSS | ⚠️ SSL Error | 2025-08-18 | 0 entries | Provider SSL issue, not a code problem |
| Podcast RSS | ✅ Working | 2025-08-18 | 428 episodes | Full backlog captured successfully |
| YouTube | ✅ Working | 2025-08-18 | 200 videos | Channel scraping with metadata |
| Instagram | 🔄 Processing | 2025-08-18 | 45/1000 posts | Rate: 200/hr, ETA: 3:54 AM |
| TikTok | ⏳ Queued | 2025-08-18 | 0/1000 videos | Starts after Instagram completes |

---

## Latest Updates (2025-08-19)

### 🆕 Cumulative Markdown System
- **Single Source of Truth**: One continuously growing file per source
- **Intelligent Merging**: Updates existing entries with new data (captions, metrics)
- **Backlog + Incremental**: Properly combines historical and daily updates
- **Smart Updates**: Prefers content with captions/transcripts over without
- **Archive Management**: Previous versions timestamped in archives

### 🆕 API Integrations
- **YouTube Data API v3**: Replaced yt-dlp with official API
- **MailChimp API**: Replaced RSS feed with API integration
- **Caption Support**: YouTube captions via Data API (50 units/video)
- **Content Cleaning**: MailChimp headers/footers removed

## Technical Implementation

### ✅ Core Features Complete
- **Cumulative Markdown**: Single growing file per source with intelligent merging
- **Incremental Updates**: All scrapers support state-based incremental fetching
- **Archive Management**: Previous files automatically archived with timestamps
- **Markdown Conversion**: All content properly converted to markdown format

@ -69,10 +53,10 @@
- **Service Files**: Complete systemd configuration provided

### Configuration Files
- `systemd/hkia-scraper.service` - Main service definition
- `systemd/hkia-scraper.timer` - Scheduled execution
- `systemd/hkia-scraper-nas.service` - NAS sync service
- `systemd/hkia-scraper-nas.timer` - NAS sync schedule
- `systemd/hvac-scraper.service` - Main service definition
- `systemd/hvac-scraper.timer` - Scheduled execution
- `systemd/hvac-scraper-nas.service` - NAS sync service
- `systemd/hvac-scraper-nas.timer` - NAS sync schedule

---

@ -110,9 +94,9 @@

## Next Steps for Production

1. Install systemd services: `sudo systemctl enable hkia-scraper.timer`
1. Install systemd services: `sudo systemctl enable hvac-scraper.timer`
2. Configure environment variables in `/opt/hvac-kia-content/.env`
3. Set up NAS mount point at `/mnt/nas/hkia/`
3. Set up NAS mount point at `/mnt/nas/hvacknowitall/`
4. Monitor via systemd logs: `journalctl -f -u hkia-scraper.service`
4. Monitor via systemd logs: `journalctl -f -u hvac-scraper.service`

**Project Status: ✅ READY FOR PRODUCTION DEPLOYMENT**
@ -1,127 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Fetch additional YouTube videos to reach 1000 total
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.base_scraper import ScraperConfig
|
||||
from src.youtube_scraper import YouTubeScraper
|
||||
from datetime import datetime
|
||||
import logging
|
||||
import time
|
||||
|
||||
# Set up logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.FileHandler('youtube_1000.log'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def main():
|
||||
"""Fetch additional YouTube videos"""
|
||||
logger.info("🎥 Fetching additional YouTube videos to reach 1000 total")
|
||||
logger.info("Already have 200 videos, fetching 800 more...")
|
||||
logger.info("=" * 60)
|
||||
|
||||
# Create config for backlog
|
||||
config = ScraperConfig(
|
||||
source_name="youtube",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=Path("data_production_backlog"),
|
||||
logs_dir=Path("logs_production_backlog"),
|
||||
timezone="America/Halifax"
|
||||
)
|
||||
|
||||
# Initialize scraper
|
||||
scraper = YouTubeScraper(config)
|
||||
|
||||
# Clear state to fetch all videos from beginning
|
||||
if scraper.state_file.exists():
|
||||
scraper.state_file.unlink()
|
||||
logger.info("Cleared state for full backlog capture")
|
||||
|
||||
# Fetch 1000 videos (or all available if less)
|
||||
logger.info("Starting YouTube fetch - targeting 1000 videos total...")
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
videos = scraper.fetch_channel_videos(max_videos=1000)
|
||||
|
||||
if not videos:
|
||||
logger.error("No videos fetched")
|
||||
return False
|
||||
|
||||
logger.info(f"✅ Fetched {len(videos)} videos")
|
||||
|
||||
# Generate markdown
|
||||
markdown = scraper.format_markdown(videos)
|
||||
|
||||
# Save with new timestamp
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"hvacknowitall_youtube_1000_backlog_{timestamp}.md"
|
||||
|
||||
# Save to markdown directory
|
||||
output_dir = config.data_dir / "markdown_current"
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
output_file = output_dir / filename
|
||||
|
||||
output_file.write_text(markdown, encoding='utf-8')
|
||||
logger.info(f"📄 Saved to: {output_file}")
|
||||
|
||||
# Update state
|
||||
new_state = {
|
||||
'last_update': datetime.now().isoformat(),
|
||||
'last_item_count': len(videos),
|
||||
'backlog_captured': True,
|
||||
'total_videos': len(videos)
|
||||
}
|
||||
|
||||
if videos:
|
||||
new_state['last_video_id'] = videos[-1].get('id')
|
||||
new_state['oldest_video_date'] = videos[-1].get('upload_date', '')
|
||||
|
||||
scraper.save_state(new_state)
|
||||
|
||||
# Statistics
|
||||
duration = time.time() - start_time
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("📊 YOUTUBE CAPTURE COMPLETE")
|
||||
logger.info(f"Total videos: {len(videos)}")
|
||||
logger.info(f"Duration: {duration:.1f} seconds")
|
||||
logger.info(f"Rate: {len(videos)/duration:.1f} videos/second")
|
||||
|
||||
# Show date range
|
||||
if videos:
|
||||
newest_date = videos[0].get('upload_date', 'Unknown')
|
||||
oldest_date = videos[-1].get('upload_date', 'Unknown')
|
||||
logger.info(f"Date range: {oldest_date} to {newest_date}")
|
||||
|
||||
# Check if we got all available videos
|
||||
if len(videos) < 1000:
|
||||
logger.info(f"⚠️ Channel has {len(videos)} total videos (less than 1000 requested)")
|
||||
else:
|
||||
logger.info("✅ Successfully fetched 1000 videos!")
|
||||
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error fetching videos: {e}")
|
||||
return False
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
success = main()
|
||||
sys.exit(0 if success else 1)
|
||||
except KeyboardInterrupt:
|
||||
logger.info("\nCapture interrupted by user")
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
logger.critical(f"Capture failed: {e}")
|
||||
sys.exit(2)
|
||||
|
|
@ -1,144 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Fetch 100 YouTube videos with transcripts for backlog processing
|
||||
This will capture the first 100 videos with full transcript extraction
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.base_scraper import ScraperConfig
|
||||
from src.youtube_scraper import YouTubeScraper
|
||||
from datetime import datetime
|
||||
import logging
|
||||
import time
|
||||
|
||||
# Set up logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.FileHandler('youtube_100_transcripts.log'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def fetch_100_with_transcripts():
|
||||
"""Fetch 100 YouTube videos with transcripts for backlog"""
|
||||
logger.info("🎥 YOUTUBE BACKLOG: Fetching 100 videos WITH TRANSCRIPTS")
|
||||
logger.info("This will take approximately 5-8 minutes (3-5 seconds per video)")
|
||||
logger.info("=" * 70)
|
||||
|
||||
# Create config for backlog processing
|
||||
config = ScraperConfig(
|
||||
source_name="youtube",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=Path("data_production_backlog"),
|
||||
logs_dir=Path("logs_production_backlog"),
|
||||
timezone="America/Halifax"
|
||||
)
|
||||
|
||||
# Initialize scraper
|
||||
scraper = YouTubeScraper(config)
|
||||
|
||||
# Test authentication first
|
||||
auth_status = scraper.auth_handler.get_status()
|
||||
if not auth_status['has_valid_cookies']:
|
||||
logger.error("❌ No valid YouTube authentication found")
|
||||
logger.error("Please ensure you're logged into YouTube in Firefox")
|
||||
return False
|
||||
|
||||
logger.info(f"✅ Authentication validated: {auth_status['cookie_path']}")
|
||||
|
||||
# Fetch 100 videos with transcripts using the enhanced method
|
||||
logger.info("Fetching 100 videos with transcripts...")
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
videos = scraper.fetch_content(max_posts=100, fetch_transcripts=True)
|
||||
|
||||
if not videos:
|
||||
logger.error("❌ No videos fetched")
|
||||
return False
|
||||
|
||||
# Count videos with transcripts
|
||||
transcript_count = sum(1 for video in videos if video.get('transcript'))
|
||||
total_transcript_chars = sum(len(video.get('transcript', '')) for video in videos)
|
||||
|
||||
# Generate markdown
|
||||
logger.info("\nGenerating markdown with transcripts...")
|
||||
markdown = scraper.format_markdown(videos)
|
||||
|
||||
# Save with timestamp
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"hvacknowitall_youtube_backlog_100_transcripts_{timestamp}.md"
|
||||
|
||||
output_dir = config.data_dir / "markdown_current"
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
output_file = output_dir / filename
|
||||
|
||||
output_file.write_text(markdown, encoding='utf-8')
|
||||
|
||||
# Calculate duration
|
||||
duration = time.time() - start_time
|
||||
|
||||
# Final statistics
|
||||
logger.info("\n" + "=" * 70)
|
||||
logger.info("🎉 YOUTUBE BACKLOG CAPTURE COMPLETE")
|
||||
logger.info(f"📊 STATISTICS:")
|
||||
logger.info(f" Total videos fetched: {len(videos)}")
|
||||
logger.info(f" Videos with transcripts: {transcript_count}")
|
||||
logger.info(f" Transcript success rate: {transcript_count/len(videos)*100:.1f}%")
|
||||
logger.info(f" Total transcript characters: {total_transcript_chars:,}")
|
||||
logger.info(f" Average transcript length: {total_transcript_chars/transcript_count if transcript_count > 0 else 0:,.0f} chars")
|
||||
logger.info(f" Processing time: {duration/60:.1f} minutes")
|
||||
logger.info(f" Average time per video: {duration/len(videos):.1f} seconds")
|
||||
logger.info(f"📄 Saved to: {output_file}")
|
||||
|
||||
# Show sample transcript info
|
||||
logger.info(f"\n📝 SAMPLE TRANSCRIPT DATA:")
|
||||
for i, video in enumerate(videos[:3]):
|
||||
title = video.get('title', 'Unknown')[:50] + "..."
|
||||
transcript = video.get('transcript', '')
|
||||
if transcript:
|
||||
logger.info(f" {i+1}. {title} - {len(transcript):,} chars")
|
||||
preview = transcript[:100] + "..." if len(transcript) > 100 else transcript
|
||||
logger.info(f" Preview: {preview}")
|
||||
else:
|
||||
logger.info(f" {i+1}. {title} - No transcript")
|
||||
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"❌ Failed to fetch videos: {e}")
|
||||
return False
|
||||
|
||||
def main():
|
||||
"""Main execution"""
|
||||
print("\n🎥 YouTube Backlog Capture with Transcripts")
|
||||
print("=" * 50)
|
||||
print("This will fetch 100 YouTube videos with full transcripts")
|
||||
print("Estimated time: 5-8 minutes")
|
||||
print("Output: Markdown file with videos and complete transcripts")
|
||||
print("\nPress Enter to continue or Ctrl+C to cancel...")
|
||||
|
||||
try:
|
||||
input()
|
||||
except KeyboardInterrupt:
|
||||
print("\nCancelled by user")
|
||||
return False
|
||||
|
||||
return fetch_100_with_transcripts()
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
success = main()
|
||||
sys.exit(0 if success else 1)
|
||||
except KeyboardInterrupt:
|
||||
logger.info("\nCapture interrupted by user")
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
logger.critical(f"Capture failed: {e}")
|
||||
sys.exit(2)
|
||||
|
|
@ -1,152 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Fetch YouTube videos with transcripts
|
||||
This will take longer as it needs to fetch each video individually
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.base_scraper import ScraperConfig
|
||||
from src.youtube_scraper import YouTubeScraper
|
||||
from datetime import datetime
|
||||
import logging
|
||||
import time
|
||||
|
||||
# Set up logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.FileHandler('youtube_transcripts.log'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
def fetch_with_transcripts(max_videos: int = 10):
|
||||
"""Fetch YouTube videos with transcripts"""
|
||||
logger.info("🎥 Fetching YouTube videos WITH TRANSCRIPTS")
|
||||
logger.info(f"This will fetch detailed info and transcripts for {max_videos} videos")
|
||||
logger.info("Note: This is slower as each video requires individual API calls")
|
||||
logger.info("=" * 60)
|
||||
|
||||
# Create config
|
||||
config = ScraperConfig(
|
||||
source_name="youtube",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=Path("data_production_backlog"),
|
||||
logs_dir=Path("logs_production_backlog"),
|
||||
timezone="America/Halifax"
|
||||
)
|
||||
|
||||
# Initialize scraper
|
||||
scraper = YouTubeScraper(config)
|
||||
|
||||
# First get video list (fast)
|
||||
logger.info(f"Step 1: Fetching video list from channel...")
|
||||
videos = scraper.fetch_channel_videos(max_videos=max_videos)
|
||||
|
||||
if not videos:
|
||||
logger.error("No videos found")
|
||||
return False
|
||||
|
||||
logger.info(f"Found {len(videos)} videos")
|
||||
|
||||
# Now fetch detailed info with transcripts for each video
|
||||
logger.info("\nStep 2: Fetching transcripts for each video...")
|
||||
logger.info("This will take approximately 3-5 seconds per video")
|
||||
|
||||
videos_with_transcripts = []
|
||||
transcript_count = 0
|
||||
|
||||
for i, video in enumerate(videos):
|
||||
video_id = video.get('id')
|
||||
if not video_id:
|
||||
continue
|
||||
|
||||
logger.info(f"\n[{i+1}/{len(videos)}] Processing: {video.get('title', 'Unknown')[:60]}...")
|
||||
|
||||
# Add delay to avoid rate limiting
|
||||
if i > 0:
|
||||
scraper._humanized_delay(2, 4)
|
||||
|
||||
# Fetch with transcript
|
||||
detailed_info = scraper.fetch_video_details(video_id, fetch_transcript=True)
|
||||
|
||||
if detailed_info:
|
||||
if detailed_info.get('transcript'):
|
||||
transcript_count += 1
|
||||
logger.info(f" ✅ Transcript found!")
|
||||
else:
|
||||
logger.info(f" ⚠️ No transcript available")
|
||||
|
||||
videos_with_transcripts.append(detailed_info)
|
||||
else:
|
||||
logger.warning(f" ❌ Failed to fetch details")
|
||||
# Use basic info if detailed fetch fails
|
||||
videos_with_transcripts.append(video)
|
||||
|
||||
# Extra delay every 10 videos
|
||||
if (i + 1) % 10 == 0:
|
||||
logger.info("Taking extended break after 10 videos...")
|
||||
time.sleep(10)
|
||||
|
||||
# Generate markdown
|
||||
logger.info("\nStep 3: Generating markdown...")
|
||||
markdown = scraper.format_markdown(videos_with_transcripts)
|
||||
|
||||
# Save with timestamp
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"hvacknowitall_youtube_transcripts_{timestamp}.md"
|
||||
|
||||
output_dir = config.data_dir / "markdown_current"
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
output_file = output_dir / filename
|
||||
|
||||
output_file.write_text(markdown, encoding='utf-8')
|
||||
logger.info(f"📄 Saved to: {output_file}")
|
||||
|
||||
# Statistics
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("📊 YOUTUBE TRANSCRIPT CAPTURE COMPLETE")
|
||||
logger.info(f"Total videos: {len(videos_with_transcripts)}")
|
||||
logger.info(f"Videos with transcripts: {transcript_count}")
|
||||
logger.info(f"Success rate: {transcript_count/len(videos_with_transcripts)*100:.1f}%")
|
||||
|
||||
return True
|
||||
|
||||
def main():
|
||||
"""Main execution"""
|
||||
print("\n⚠️ WARNING: Fetching transcripts requires individual API calls for each video")
|
||||
print("This will take approximately 3-5 seconds per video")
|
||||
print(f"Estimated time for 370 videos: 20-30 minutes")
|
||||
print("\nOptions:")
|
||||
print("1. Test with 5 videos first")
|
||||
print("2. Fetch first 50 videos with transcripts")
|
||||
print("3. Fetch all 370 videos with transcripts (20-30 mins)")
|
||||
print("4. Cancel")
|
||||
|
||||
choice = input("\nEnter choice (1-4): ")
|
||||
|
||||
if choice == "1":
|
||||
return fetch_with_transcripts(5)
|
||||
elif choice == "2":
|
||||
return fetch_with_transcripts(50)
|
||||
elif choice == "3":
|
||||
return fetch_with_transcripts(370)
|
||||
else:
|
||||
print("Cancelled")
|
||||
return False
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
success = main()
|
||||
sys.exit(0 if success else 1)
|
||||
except KeyboardInterrupt:
|
||||
logger.info("\nCapture interrupted by user")
|
||||
sys.exit(1)
|
||||
except Exception as e:
|
||||
logger.critical(f"Capture failed: {e}")
|
||||
sys.exit(2)
|
||||
|
|
@ -1,94 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Final verification of the complete MailChimp processing flow
|
||||
"""
|
||||
|
||||
import os
|
||||
import requests
|
||||
from dotenv import load_dotenv
|
||||
import re
|
||||
from markdownify import markdownify as md
|
||||
|
||||
load_dotenv()
|
||||
|
||||
def clean_content(content):
|
||||
"""Replicate the exact _clean_content logic"""
|
||||
if not content:
|
||||
return content
|
||||
|
||||
patterns_to_remove = [
|
||||
r'VIEW THIS EMAIL IN BROWSER[^\n]*\n?',
|
||||
r'\(\*\|ARCHIVE\|\*\)[^\n]*\n?',
|
||||
r'https://hvacknowitall\.com/?\n?',
|
||||
r'Newsletter produced by Teal Maker[^\n]*\n?',
|
||||
r'https://tealmaker\.com[^\n]*\n?',
|
||||
r'Copyright \(C\)[^\n]*\n?',
|
||||
r'\n{3,}',
|
||||
]
|
||||
|
||||
cleaned = content
|
||||
for pattern in patterns_to_remove:
|
||||
cleaned = re.sub(pattern, '', cleaned, flags=re.MULTILINE | re.IGNORECASE)
|
||||
|
||||
cleaned = re.sub(r'\n{3,}', '\n\n', cleaned)
|
||||
cleaned = cleaned.strip()
|
||||
return cleaned
|
||||
|
||||
def test_complete_flow():
|
||||
"""Test the complete processing flow for both working and empty campaigns"""
|
||||
|
||||
api_key = os.getenv('MAILCHIMP_API_KEY')
|
||||
server = os.getenv('MAILCHIMP_SERVER_PREFIX', 'us10')
|
||||
|
||||
base_url = f"https://{server}.api.mailchimp.com/3.0"
|
||||
headers = {'Authorization': f'Bearer {api_key}', 'Content-Type': 'application/json'}
|
||||
|
||||
# Test specific campaigns: one with content, one without
|
||||
test_campaigns = [
|
||||
{"id": "b2d24e152c", "name": "Has Content"},
|
||||
{"id": "00ffe573c4", "name": "No Content"}
|
||||
]
|
||||
|
||||
for campaign in test_campaigns:
|
||||
campaign_id = campaign["id"]
|
||||
campaign_name = campaign["name"]
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print(f"TESTING CAMPAIGN: {campaign_name} ({campaign_id})")
|
||||
print(f"{'='*60}")
|
||||
|
||||
# Step 1: Get content from API
|
||||
response = requests.get(f"{base_url}/campaigns/{campaign_id}/content", headers=headers)
|
||||
if response.status_code != 200:
|
||||
print(f"API Error: {response.status_code}")
|
||||
continue
|
||||
|
||||
content_data = response.json()
|
||||
plain_text = content_data.get('plain_text', '')
|
||||
html = content_data.get('html', '')
|
||||
|
||||
print(f"1. API Response:")
|
||||
print(f" Plain Text Length: {len(plain_text)}")
|
||||
print(f" HTML Length: {len(html)}")
|
||||
|
||||
# Step 2: Apply our processing logic (lines 236-246)
|
||||
if not plain_text and html:
|
||||
print(f"2. Converting HTML to Markdown...")
|
||||
plain_text = md(html, heading_style="ATX", bullets="-")
|
||||
print(f" Converted Length: {len(plain_text)}")
|
||||
else:
|
||||
print(f"2. Using Plain Text (no conversion needed)")
|
||||
|
||||
# Step 3: Clean content
|
||||
cleaned_text = clean_content(plain_text)
|
||||
print(f"3. After Cleaning:")
|
||||
print(f" Final Length: {len(cleaned_text)}")
|
||||
|
||||
if cleaned_text:
|
||||
preview = cleaned_text[:200].replace('\n', ' ')
|
||||
print(f" Preview: {preview}...")
|
||||
else:
|
||||
print(f" Result: EMPTY (no content to display)")
|
||||
|
||||
if __name__ == "__main__":
|
||||
test_complete_flow()
|
||||
|
|
@ -1,198 +0,0 @@
|
|||
#!/bin/bash
|
||||
set -e
|
||||
|
||||
# HKIA Scraper Services Installation Script
|
||||
# This script replaces old hvac-content services with new hkia-scraper services
|
||||
|
||||
echo "============================================================"
|
||||
echo "HKIA Content Scraper Services Installation"
|
||||
echo "============================================================"
|
||||
|
||||
# Colors for output
|
||||
RED='\033[0;31m'
|
||||
GREEN='\033[0;32m'
|
||||
YELLOW='\033[1;33m'
|
||||
BLUE='\033[0;34m'
|
||||
NC='\033[0m' # No Color
|
||||
|
||||
# Function to print colored output
|
||||
print_status() {
|
||||
echo -e "${GREEN}✅${NC} $1"
|
||||
}
|
||||
|
||||
print_warning() {
|
||||
echo -e "${YELLOW}⚠️${NC} $1"
|
||||
}
|
||||
|
||||
print_error() {
|
||||
echo -e "${RED}❌${NC} $1"
|
||||
}
|
||||
|
||||
print_info() {
|
||||
echo -e "${BLUE}ℹ️${NC} $1"
|
||||
}
|
||||
|
||||
# Check if running as root
|
||||
if [[ $EUID -eq 0 ]]; then
|
||||
print_error "This script should not be run as root. Run it as the user 'ben' and it will use sudo when needed."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check if we're in the right directory
|
||||
if [[ ! -f "CLAUDE.md" ]] || [[ ! -d "systemd" ]]; then
|
||||
print_error "Please run this script from the hvac-kia-content project root directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check if systemd files exist
|
||||
required_files=(
|
||||
"systemd/hkia-scraper.service"
|
||||
"systemd/hkia-scraper.timer"
|
||||
"systemd/hkia-scraper-nas.service"
|
||||
"systemd/hkia-scraper-nas.timer"
|
||||
)
|
||||
|
||||
for file in "${required_files[@]}"; do
|
||||
if [[ ! -f "$file" ]]; then
|
||||
print_error "Required file not found: $file"
|
||||
exit 1
|
||||
fi
|
||||
done
|
||||
|
||||
print_info "All required service files found"
|
||||
|
||||
echo ""
|
||||
echo "============================================================"
|
||||
echo "STEP 1: Stopping and Disabling Old Services"
|
||||
echo "============================================================"
|
||||
|
||||
# List of old services to stop and disable
|
||||
old_services=(
|
||||
"hvac-content-images-8am.timer"
|
||||
"hvac-content-images-12pm.timer"
|
||||
"hvac-content-8am.timer"
|
||||
"hvac-content-12pm.timer"
|
||||
"hvac-content-images-8am.service"
|
||||
"hvac-content-images-12pm.service"
|
||||
"hvac-content-8am.service"
|
||||
"hvac-content-12pm.service"
|
||||
)
|
||||
|
||||
for service in "${old_services[@]}"; do
|
||||
if systemctl is-active --quiet "$service" 2>/dev/null; then
|
||||
print_info "Stopping $service..."
|
||||
sudo systemctl stop "$service"
|
||||
print_status "Stopped $service"
|
||||
else
|
||||
print_info "$service is not running"
|
||||
fi
|
||||
|
||||
if systemctl is-enabled --quiet "$service" 2>/dev/null; then
|
||||
print_info "Disabling $service..."
|
||||
sudo systemctl disable "$service"
|
||||
print_status "Disabled $service"
|
||||
else
|
||||
print_info "$service is not enabled"
|
||||
fi
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "============================================================"
|
||||
echo "STEP 2: Installing New HKIA Services"
|
||||
echo "============================================================"
|
||||
|
||||
# Copy service files to systemd directory
|
||||
print_info "Copying service files to /etc/systemd/system/..."
|
||||
sudo cp systemd/hkia-scraper.service /etc/systemd/system/
|
||||
sudo cp systemd/hkia-scraper.timer /etc/systemd/system/
|
||||
sudo cp systemd/hkia-scraper-nas.service /etc/systemd/system/
|
||||
sudo cp systemd/hkia-scraper-nas.timer /etc/systemd/system/
|
||||
|
||||
print_status "Service files copied successfully"
|
||||
|
||||
# Reload systemd daemon
|
||||
print_info "Reloading systemd daemon..."
|
||||
sudo systemctl daemon-reload
|
||||
print_status "Systemd daemon reloaded"
|
||||
|
||||
echo ""
|
||||
echo "============================================================"
|
||||
echo "STEP 3: Enabling New Services"
|
||||
echo "============================================================"
|
||||
|
||||
# New services to enable
|
||||
new_services=(
|
||||
"hkia-scraper.service"
|
||||
"hkia-scraper.timer"
|
||||
"hkia-scraper-nas.service"
|
||||
"hkia-scraper-nas.timer"
|
||||
)
|
||||
|
||||
for service in "${new_services[@]}"; do
|
||||
print_info "Enabling $service..."
|
||||
sudo systemctl enable "$service"
|
||||
print_status "Enabled $service"
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "============================================================"
|
||||
echo "STEP 4: Starting Timers"
|
||||
echo "============================================================"
|
||||
|
||||
# Start the timers (services will be triggered by timers)
|
||||
timers=("hkia-scraper.timer" "hkia-scraper-nas.timer")
|
||||
|
||||
for timer in "${timers[@]}"; do
|
||||
print_info "Starting $timer..."
|
||||
sudo systemctl start "$timer"
|
||||
print_status "Started $timer"
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "============================================================"
|
||||
echo "STEP 5: Verification"
|
||||
echo "============================================================"
|
||||
|
||||
# Check status of new services
|
||||
print_info "Checking status of new services..."
|
||||
|
||||
for timer in "${timers[@]}"; do
|
||||
echo ""
|
||||
print_info "Status of $timer:"
|
||||
sudo systemctl status "$timer" --no-pager -l
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "============================================================"
|
||||
echo "STEP 6: Schedule Summary"
|
||||
echo "============================================================"
|
||||
|
||||
print_info "New HKIA Services Schedule (Atlantic Daylight Time):"
|
||||
echo " 📅 Main Scraping: 8:00 AM and 12:00 PM"
|
||||
echo " 📁 NAS Sync: 8:30 AM and 12:30 PM (30min after scraping)"
|
||||
echo ""
|
||||
print_info "Active Sources: WordPress, MailChimp RSS, Podcast RSS, YouTube, Instagram"
|
||||
print_warning "TikTok scraper is disabled (not working as designed)"
|
||||
|
||||
echo ""
|
||||
echo "============================================================"
|
||||
echo "INSTALLATION COMPLETE"
|
||||
echo "============================================================"
|
||||
|
||||
print_status "HKIA scraper services have been successfully installed and started!"
|
||||
print_info "Next scheduled run will be at the next 8:00 AM or 12:00 PM ADT"
|
||||
|
||||
echo ""
|
||||
print_info "Useful commands:"
|
||||
echo " sudo systemctl status hkia-scraper.timer"
|
||||
echo " sudo systemctl status hkia-scraper-nas.timer"
|
||||
echo " sudo journalctl -f -u hkia-scraper.service"
|
||||
echo " sudo journalctl -f -u hkia-scraper-nas.service"
|
||||
|
||||
# Show next scheduled runs
|
||||
echo ""
|
||||
print_info "Next scheduled runs:"
|
||||
sudo systemctl list-timers | grep hkia || print_warning "No upcoming runs shown (timers may need a moment to register)"
|
||||
|
||||
echo ""
|
||||
print_status "Installation script completed successfully!"
|
||||
|
|
@@ -136,7 +136,7 @@ class ProductionBacklogCapture:
# Generate and save markdown
|
||||
markdown = scraper.format_markdown(items)
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"hkia_{source_name}_backlog_{timestamp}.md"
|
||||
filename = f"hvacknowitall_{source_name}_backlog_{timestamp}.md"
|
||||
|
||||
# Save to current directory
|
||||
current_dir = scraper.config.data_dir / "markdown_current"
|
||||
|
|
@@ -265,7 +265,7 @@ class ProductionBacklogCapture:
|
||||
def main():
|
||||
"""Main execution function"""
|
||||
print("🚀 HKIA - Production Backlog Capture")
|
||||
print("🚀 HVAC Know It All - Production Backlog Capture")
|
||||
print("=" * 60)
|
||||
print("This will download complete historical content from ALL sources")
|
||||
print("Including all available media files (images, videos, audio)")
|
||||
|
|
|
|||
|
|
@@ -5,7 +5,6 @@ description = "Add your description here"
requires-python = ">=3.12"
|
||||
dependencies = [
|
||||
"feedparser>=6.0.11",
|
||||
"google-api-python-client>=2.179.0",
|
||||
"instaloader>=4.14.2",
|
||||
"markitdown>=0.1.2",
|
||||
"playwright>=1.54.0",
|
||||
|
|
@@ -21,6 +20,5 @@ dependencies = [
"scrapling>=0.2.99",
|
||||
"tenacity>=9.1.2",
|
||||
"tiktokapi>=7.1.0",
|
||||
"youtube-transcript-api>=1.2.2",
|
||||
"yt-dlp>=2025.8.11",
|
||||
]
|
||||
|
|
|
|||
|
|
@@ -1,304 +0,0 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Production script for API-based content scraping - Version 2
|
||||
Follows project specification file/folder naming conventions
|
||||
Captures YouTube videos with captions and MailChimp campaigns with cleaned content
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.youtube_api_scraper_v2 import YouTubeAPIScraper
|
||||
from src.mailchimp_api_scraper_v2 import MailChimpAPIScraper
|
||||
from src.base_scraper import ScraperConfig
|
||||
from datetime import datetime
|
||||
import pytz
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
|
||||
# Set up logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.FileHandler('logs/api_production_v2.log'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
logger = logging.getLogger('api_production_v2')
|
||||
|
||||
|
||||
def get_atlantic_timestamp() -> str:
|
||||
"""Get current timestamp in Atlantic timezone for file naming."""
|
||||
tz = pytz.timezone('America/Halifax')
|
||||
return datetime.now(tz).strftime('%Y-%m-%dT%H%M%S')
|
||||
|
||||
|
||||
def run_youtube_api_production():
|
||||
"""Run YouTube API scraper for production backlog with captions."""
|
||||
logger.info("=" * 60)
|
||||
logger.info("YOUTUBE API SCRAPER - PRODUCTION V2")
|
||||
logger.info("=" * 60)
|
||||
|
||||
timestamp = get_atlantic_timestamp()
|
||||
|
||||
# Follow project specification directory structure
|
||||
config = ScraperConfig(
|
||||
source_name='YouTube', # Capitalized per spec
|
||||
brand_name='hvacnkowitall',
|
||||
data_dir=Path('data/markdown_current'),
|
||||
logs_dir=Path('logs/YouTube'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
try:
|
||||
scraper = YouTubeAPIScraper(config)
|
||||
|
||||
logger.info("Starting YouTube API fetch with captions for all videos...")
|
||||
start = time.time()
|
||||
|
||||
# Fetch all videos WITH captions for top 50 (use more quota)
|
||||
videos = scraper.fetch_content(fetch_captions=True)
|
||||
|
||||
elapsed = time.time() - start
|
||||
logger.info(f"Fetched {len(videos)} videos in {elapsed:.1f} seconds")
|
||||
|
||||
if videos:
|
||||
# Statistics
|
||||
total_views = sum(v.get('view_count', 0) for v in videos)
|
||||
total_likes = sum(v.get('like_count', 0) for v in videos)
|
||||
with_captions = sum(1 for v in videos if v.get('caption_text'))
|
||||
|
||||
logger.info(f"Statistics:")
|
||||
logger.info(f" Total videos: {len(videos)}")
|
||||
logger.info(f" Total views: {total_views:,}")
|
||||
logger.info(f" Total likes: {total_likes:,}")
|
||||
logger.info(f" Videos with captions: {with_captions}")
|
||||
logger.info(f" Quota used: {scraper.quota_used}/{scraper.daily_quota_limit} units")
|
||||
|
||||
# Save with project specification naming: <brandName>_<source>_<dateTime>.md
|
||||
filename = f"hvacnkowitall_YouTube_{timestamp}.md"
|
||||
markdown = scraper.format_markdown(videos)
|
||||
output_file = Path(f'data/markdown_current/{filename}')
|
||||
output_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
output_file.write_text(markdown, encoding='utf-8')
|
||||
logger.info(f"Markdown saved to: {output_file}")
|
||||
|
||||
# Create archive copy
|
||||
archive_dir = Path('data/markdown_archives/YouTube')
|
||||
archive_dir.mkdir(parents=True, exist_ok=True)
|
||||
archive_file = archive_dir / filename
|
||||
archive_file.write_text(markdown, encoding='utf-8')
|
||||
logger.info(f"Archive copy saved to: {archive_file}")
|
||||
|
||||
# Update state file
|
||||
state = scraper.load_state()
|
||||
state = scraper.update_state(state, videos)
|
||||
scraper.save_state(state)
|
||||
logger.info("State file updated for incremental updates")
|
||||
|
||||
return True, len(videos), output_file
|
||||
else:
|
||||
logger.error("No videos fetched from YouTube API")
|
||||
return False, 0, None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"YouTube API scraper failed: {e}")
|
||||
return False, 0, None
|
||||
|
||||
|
||||
def run_mailchimp_api_production():
|
||||
"""Run MailChimp API scraper for production backlog with cleaned content."""
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("MAILCHIMP API SCRAPER - PRODUCTION V2")
|
||||
logger.info("=" * 60)
|
||||
|
||||
timestamp = get_atlantic_timestamp()
|
||||
|
||||
# Follow project specification directory structure
|
||||
config = ScraperConfig(
|
||||
source_name='MailChimp', # Capitalized per spec
|
||||
brand_name='hvacnkowitall',
|
||||
data_dir=Path('data/markdown_current'),
|
||||
logs_dir=Path('logs/MailChimp'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
try:
|
||||
scraper = MailChimpAPIScraper(config)
|
||||
|
||||
logger.info("Starting MailChimp API fetch with content cleaning...")
|
||||
start = time.time()
|
||||
|
||||
# Fetch all campaigns from Bi-Weekly Newsletter folder
|
||||
campaigns = scraper.fetch_content(max_items=1000)
|
||||
|
||||
elapsed = time.time() - start
|
||||
logger.info(f"Fetched {len(campaigns)} campaigns in {elapsed:.1f} seconds")
|
||||
|
||||
if campaigns:
|
||||
# Statistics
|
||||
total_sent = sum(c.get('metrics', {}).get('emails_sent', 0) for c in campaigns)
|
||||
total_opens = sum(c.get('metrics', {}).get('unique_opens', 0) for c in campaigns)
|
||||
total_clicks = sum(c.get('metrics', {}).get('unique_clicks', 0) for c in campaigns)
|
||||
|
||||
logger.info(f"Statistics:")
|
||||
logger.info(f" Total campaigns: {len(campaigns)}")
|
||||
logger.info(f" Total emails sent: {total_sent:,}")
|
||||
logger.info(f" Total unique opens: {total_opens:,}")
|
||||
logger.info(f" Total unique clicks: {total_clicks:,}")
|
||||
|
||||
if campaigns:
|
||||
avg_open_rate = sum(c.get('metrics', {}).get('open_rate', 0) for c in campaigns) / len(campaigns)
|
||||
avg_click_rate = sum(c.get('metrics', {}).get('click_rate', 0) for c in campaigns) / len(campaigns)
|
||||
logger.info(f" Average open rate: {avg_open_rate*100:.1f}%")
|
||||
logger.info(f" Average click rate: {avg_click_rate*100:.1f}%")
|
||||
|
||||
# Save with project specification naming: <brandName>_<source>_<dateTime>.md
|
||||
filename = f"hvacnkowitall_MailChimp_{timestamp}.md"
|
||||
markdown = scraper.format_markdown(campaigns)
|
||||
output_file = Path(f'data/markdown_current/{filename}')
|
||||
output_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
output_file.write_text(markdown, encoding='utf-8')
|
||||
logger.info(f"Markdown saved to: {output_file}")
|
||||
|
||||
# Create archive copy
|
||||
archive_dir = Path('data/markdown_archives/MailChimp')
|
||||
archive_dir.mkdir(parents=True, exist_ok=True)
|
||||
archive_file = archive_dir / filename
|
||||
archive_file.write_text(markdown, encoding='utf-8')
|
||||
logger.info(f"Archive copy saved to: {archive_file}")
|
||||
|
||||
# Update state file
|
||||
state = scraper.load_state()
|
||||
state = scraper.update_state(state, campaigns)
|
||||
scraper.save_state(state)
|
||||
logger.info("State file updated for incremental updates")
|
||||
|
||||
return True, len(campaigns), output_file
|
||||
else:
|
||||
logger.warning("No campaigns found in MailChimp")
|
||||
return True, 0, None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"MailChimp API scraper failed: {e}")
|
||||
return False, 0, None
|
||||
|
||||
|
||||
def sync_to_nas():
|
||||
"""Sync API scraper results to NAS following project structure."""
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("SYNCING TO NAS - PROJECT STRUCTURE")
|
||||
logger.info("=" * 60)
|
||||
|
||||
nas_base = Path('/mnt/nas/hvacknowitall')
|
||||
|
||||
try:
|
||||
# Sync all markdown_current files
|
||||
local_current = Path('data/markdown_current')
|
||||
nas_current = nas_base / 'markdown_current'
|
||||
|
||||
if local_current.exists() and any(local_current.glob('*.md')):
|
||||
# Create destination if needed
|
||||
nas_current.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Sync all current markdown files
|
||||
cmd = ['rsync', '-av', '--include=*.md', '--exclude=*',
|
||||
str(local_current) + '/', str(nas_current) + '/']
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
logger.info(f"✅ Current markdown files synced to NAS: {nas_current}")
|
||||
# List synced files
|
||||
for md_file in nas_current.glob('*.md'):
|
||||
size = md_file.stat().st_size / 1024 # KB
|
||||
logger.info(f" - {md_file.name} ({size:.0f}KB)")
|
||||
else:
|
||||
logger.warning(f"Sync warning: {result.stderr}")
|
||||
else:
|
||||
logger.info("No current markdown files to sync")
|
||||
|
||||
# Sync archives
|
||||
for source in ['YouTube', 'MailChimp']:
|
||||
local_archive = Path(f'data/markdown_archives/{source}')
|
||||
nas_archive = nas_base / f'markdown_archives/{source}'
|
||||
|
||||
if local_archive.exists() and any(local_archive.glob('*.md')):
|
||||
nas_archive.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
cmd = ['rsync', '-av', '--include=*.md', '--exclude=*',
|
||||
str(local_archive) + '/', str(nas_archive) + '/']
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
logger.info(f"✅ {source} archives synced to NAS: {nas_archive}")
|
||||
else:
|
||||
logger.warning(f"{source} archive sync warning: {result.stderr}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to sync to NAS: {e}")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main production run with project specification compliance."""
|
||||
logger.info("=" * 70)
|
||||
logger.info("HVAC KNOW IT ALL - API SCRAPERS PRODUCTION V2")
|
||||
logger.info("Following Project Specification Standards")
|
||||
logger.info("=" * 70)
|
||||
|
||||
atlantic_tz = pytz.timezone('America/Halifax')
|
||||
start_time = datetime.now(atlantic_tz)
|
||||
logger.info(f"Started at: {start_time.isoformat()}")
|
||||
|
||||
# Track results
|
||||
results = {
|
||||
'YouTube': {'success': False, 'count': 0, 'file': None},
|
||||
'MailChimp': {'success': False, 'count': 0, 'file': None}
|
||||
}
|
||||
|
||||
# Run YouTube API scraper with captions
|
||||
success, count, output_file = run_youtube_api_production()
|
||||
results['YouTube'] = {'success': success, 'count': count, 'file': output_file}
|
||||
|
||||
# Run MailChimp API scraper with content cleaning
|
||||
success, count, output_file = run_mailchimp_api_production()
|
||||
results['MailChimp'] = {'success': success, 'count': count, 'file': output_file}
|
||||
|
||||
# Sync to NAS
|
||||
sync_to_nas()
|
||||
|
||||
# Summary
|
||||
end_time = datetime.now(atlantic_tz)
|
||||
duration = end_time - start_time
|
||||
|
||||
logger.info("\n" + "=" * 70)
|
||||
logger.info("PRODUCTION V2 SUMMARY")
|
||||
logger.info("=" * 70)
|
||||
|
||||
for source, result in results.items():
|
||||
status = "✅" if result['success'] else "❌"
|
||||
logger.info(f"{status} {source}: {result['count']} items")
|
||||
if result['file']:
|
||||
logger.info(f" Output: {result['file']}")
|
||||
|
||||
logger.info(f"\nTotal duration: {duration.total_seconds():.1f} seconds")
|
||||
logger.info(f"Completed at: {end_time.isoformat()}")
|
||||
|
||||
# Project specification compliance
|
||||
logger.info("\nPROJECT SPECIFICATION COMPLIANCE:")
|
||||
logger.info("✅ File naming: hvacnkowitall_<Source>_<YYYY-MM-DDTHHMMSS>.md")
|
||||
logger.info("✅ Directory structure: data/markdown_current/, data/markdown_archives/")
|
||||
logger.info("✅ Capitalized source names: YouTube, MailChimp")
|
||||
logger.info("✅ Atlantic timezone timestamps")
|
||||
logger.info("✅ Archive copies created")
|
||||
logger.info("✅ State files for incremental updates")
|
||||
|
||||
# Return success if at least one scraper succeeded
|
||||
return any(r['success'] for r in results.values())
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
success = main()
|
||||
sys.exit(0 if success else 1)
|
||||
|
|
@@ -1,278 +0,0 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Production script for API-based content scraping
|
||||
Captures YouTube videos and MailChimp campaigns using official APIs
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.youtube_api_scraper import YouTubeAPIScraper
|
||||
from src.mailchimp_api_scraper import MailChimpAPIScraper
|
||||
from src.base_scraper import ScraperConfig
|
||||
from datetime import datetime
|
||||
import pytz
|
||||
import time
|
||||
import logging
|
||||
|
||||
# Set up logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.FileHandler('logs/api_scrapers_production.log'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
logger = logging.getLogger('api_production')
|
||||
|
||||
|
||||
def run_youtube_api_production():
|
||||
"""Run YouTube API scraper for production backlog"""
|
||||
logger.info("=" * 60)
|
||||
logger.info("YOUTUBE API SCRAPER - PRODUCTION RUN")
|
||||
logger.info("=" * 60)
|
||||
|
||||
tz = pytz.timezone('America/Halifax')
|
||||
timestamp = datetime.now(tz).strftime('%Y-%m-%dT%H%M%S')
|
||||
|
||||
config = ScraperConfig(
|
||||
source_name='youtube',
|
||||
brand_name='hvacknowitall',
|
||||
data_dir=Path('data/youtube'),
|
||||
logs_dir=Path('logs/youtube'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
try:
|
||||
scraper = YouTubeAPIScraper(config)
|
||||
|
||||
logger.info("Starting YouTube API fetch for full channel...")
|
||||
start = time.time()
|
||||
|
||||
# Fetch all videos with transcripts for top 50
|
||||
videos = scraper.fetch_content(fetch_transcripts=True)
|
||||
|
||||
elapsed = time.time() - start
|
||||
logger.info(f"Fetched {len(videos)} videos in {elapsed:.1f} seconds")
|
||||
|
||||
if videos:
|
||||
# Statistics
|
||||
total_views = sum(v.get('view_count', 0) for v in videos)
|
||||
total_likes = sum(v.get('like_count', 0) for v in videos)
|
||||
with_transcripts = sum(1 for v in videos if v.get('transcript'))
|
||||
|
||||
logger.info(f"Statistics:")
|
||||
logger.info(f" Total videos: {len(videos)}")
|
||||
logger.info(f" Total views: {total_views:,}")
|
||||
logger.info(f" Total likes: {total_likes:,}")
|
||||
logger.info(f" Videos with transcripts: {with_transcripts}")
|
||||
logger.info(f" Quota used: {scraper.quota_used}/{scraper.daily_quota_limit} units")
|
||||
|
||||
# Save markdown with timestamp
|
||||
markdown = scraper.format_markdown(videos)
|
||||
output_file = Path(f'data/youtube/hvacknowitall_youtube_{timestamp}.md')
|
||||
output_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
output_file.write_text(markdown, encoding='utf-8')
|
||||
logger.info(f"Markdown saved to: {output_file}")
|
||||
|
||||
# Also save as "latest" for easy access
|
||||
latest_file = Path('data/youtube/hvacknowitall_youtube_latest.md')
|
||||
latest_file.write_text(markdown, encoding='utf-8')
|
||||
logger.info(f"Latest file updated: {latest_file}")
|
||||
|
||||
# Update state file
|
||||
state = scraper.load_state()
|
||||
state = scraper.update_state(state, videos)
|
||||
scraper.save_state(state)
|
||||
logger.info("State file updated for incremental updates")
|
||||
|
||||
return True, len(videos), output_file
|
||||
else:
|
||||
logger.error("No videos fetched from YouTube API")
|
||||
return False, 0, None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"YouTube API scraper failed: {e}")
|
||||
return False, 0, None
|
||||
|
||||
|
||||
def run_mailchimp_api_production():
|
||||
"""Run MailChimp API scraper for production backlog"""
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("MAILCHIMP API SCRAPER - PRODUCTION RUN")
|
||||
logger.info("=" * 60)
|
||||
|
||||
tz = pytz.timezone('America/Halifax')
|
||||
timestamp = datetime.now(tz).strftime('%Y-%m-%dT%H%M%S')
|
||||
|
||||
config = ScraperConfig(
|
||||
source_name='mailchimp',
|
||||
brand_name='hvacknowitall',
|
||||
data_dir=Path('data/mailchimp'),
|
||||
logs_dir=Path('logs/mailchimp'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
try:
|
||||
scraper = MailChimpAPIScraper(config)
|
||||
|
||||
logger.info("Starting MailChimp API fetch for all campaigns...")
|
||||
start = time.time()
|
||||
|
||||
# Fetch all campaigns from Bi-Weekly Newsletter folder
|
||||
campaigns = scraper.fetch_content(max_items=1000) # Get all available
|
||||
|
||||
elapsed = time.time() - start
|
||||
logger.info(f"Fetched {len(campaigns)} campaigns in {elapsed:.1f} seconds")
|
||||
|
||||
if campaigns:
|
||||
# Statistics
|
||||
total_sent = sum(c.get('metrics', {}).get('emails_sent', 0) for c in campaigns)
|
||||
total_opens = sum(c.get('metrics', {}).get('unique_opens', 0) for c in campaigns)
|
||||
total_clicks = sum(c.get('metrics', {}).get('unique_clicks', 0) for c in campaigns)
|
||||
|
||||
logger.info(f"Statistics:")
|
||||
logger.info(f" Total campaigns: {len(campaigns)}")
|
||||
logger.info(f" Total emails sent: {total_sent:,}")
|
||||
logger.info(f" Total unique opens: {total_opens:,}")
|
||||
logger.info(f" Total unique clicks: {total_clicks:,}")
|
||||
|
||||
if campaigns:
|
||||
avg_open_rate = sum(c.get('metrics', {}).get('open_rate', 0) for c in campaigns) / len(campaigns)
|
||||
avg_click_rate = sum(c.get('metrics', {}).get('click_rate', 0) for c in campaigns) / len(campaigns)
|
||||
logger.info(f" Average open rate: {avg_open_rate*100:.1f}%")
|
||||
logger.info(f" Average click rate: {avg_click_rate*100:.1f}%")
|
||||
|
||||
# Save markdown with timestamp
|
||||
markdown = scraper.format_markdown(campaigns)
|
||||
output_file = Path(f'data/mailchimp/hvacknowitall_mailchimp_{timestamp}.md')
|
||||
output_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
output_file.write_text(markdown, encoding='utf-8')
|
||||
logger.info(f"Markdown saved to: {output_file}")
|
||||
|
||||
# Also save as "latest" for easy access
|
||||
latest_file = Path('data/mailchimp/hvacknowitall_mailchimp_latest.md')
|
||||
latest_file.write_text(markdown, encoding='utf-8')
|
||||
logger.info(f"Latest file updated: {latest_file}")
|
||||
|
||||
# Update state file
|
||||
state = scraper.load_state()
|
||||
state = scraper.update_state(state, campaigns)
|
||||
scraper.save_state(state)
|
||||
logger.info("State file updated for incremental updates")
|
||||
|
||||
return True, len(campaigns), output_file
|
||||
else:
|
||||
logger.warning("No campaigns found in MailChimp")
|
||||
return True, 0, None # Not an error if no campaigns
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"MailChimp API scraper failed: {e}")
|
||||
return False, 0, None
|
||||
|
||||
|
||||
def sync_to_nas():
|
||||
"""Sync API scraper results to NAS"""
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("SYNCING TO NAS")
|
||||
logger.info("=" * 60)
|
||||
|
||||
import subprocess
|
||||
|
||||
nas_base = Path('/mnt/nas/hvacknowitall')
|
||||
|
||||
# Sync YouTube
|
||||
try:
|
||||
youtube_src = Path('data/youtube')
|
||||
youtube_dest = nas_base / 'markdown_current/youtube'
|
||||
|
||||
if youtube_src.exists() and any(youtube_src.glob('*.md')):
|
||||
# Create destination if needed
|
||||
youtube_dest.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Sync markdown files
|
||||
cmd = ['rsync', '-av', '--include=*.md', '--exclude=*',
|
||||
str(youtube_src) + '/', str(youtube_dest) + '/']
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
logger.info(f"✅ YouTube data synced to NAS: {youtube_dest}")
|
||||
else:
|
||||
logger.warning(f"YouTube sync warning: {result.stderr}")
|
||||
else:
|
||||
logger.info("No YouTube data to sync")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to sync YouTube data: {e}")
|
||||
|
||||
# Sync MailChimp
|
||||
try:
|
||||
mailchimp_src = Path('data/mailchimp')
|
||||
mailchimp_dest = nas_base / 'markdown_current/mailchimp'
|
||||
|
||||
if mailchimp_src.exists() and any(mailchimp_src.glob('*.md')):
|
||||
# Create destination if needed
|
||||
mailchimp_dest.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Sync markdown files
|
||||
cmd = ['rsync', '-av', '--include=*.md', '--exclude=*',
|
||||
str(mailchimp_src) + '/', str(mailchimp_dest) + '/']
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
logger.info(f"✅ MailChimp data synced to NAS: {mailchimp_dest}")
|
||||
else:
|
||||
logger.warning(f"MailChimp sync warning: {result.stderr}")
|
||||
else:
|
||||
logger.info("No MailChimp data to sync")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to sync MailChimp data: {e}")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main production run"""
|
||||
logger.info("=" * 60)
|
||||
logger.info("HVAC KNOW IT ALL - API SCRAPERS PRODUCTION RUN")
|
||||
logger.info("=" * 60)
|
||||
logger.info(f"Started at: {datetime.now(pytz.timezone('America/Halifax')).isoformat()}")
|
||||
|
||||
# Track results
|
||||
results = {
|
||||
'youtube': {'success': False, 'count': 0, 'file': None},
|
||||
'mailchimp': {'success': False, 'count': 0, 'file': None}
|
||||
}
|
||||
|
||||
# Run YouTube API scraper
|
||||
success, count, output_file = run_youtube_api_production()
|
||||
results['youtube'] = {'success': success, 'count': count, 'file': output_file}
|
||||
|
||||
# Run MailChimp API scraper
|
||||
success, count, output_file = run_mailchimp_api_production()
|
||||
results['mailchimp'] = {'success': success, 'count': count, 'file': output_file}
|
||||
|
||||
# Sync to NAS
|
||||
sync_to_nas()
|
||||
|
||||
# Summary
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("PRODUCTION RUN SUMMARY")
|
||||
logger.info("=" * 60)
|
||||
|
||||
for source, result in results.items():
|
||||
status = "✅" if result['success'] else "❌"
|
||||
logger.info(f"{status} {source.upper()}: {result['count']} items")
|
||||
if result['file']:
|
||||
logger.info(f" Output: {result['file']}")
|
||||
|
||||
logger.info(f"\nCompleted at: {datetime.now(pytz.timezone('America/Halifax')).isoformat()}")
|
||||
|
||||
# Return success if at least one scraper succeeded
|
||||
return any(r['success'] for r in results.values())
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
success = main()
|
||||
sys.exit(0 if success else 1)
|
||||
|
|
@@ -1,166 +0,0 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Fetch the next 1000 Instagram posts (1001-2000) and update cumulative file.
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.instagram_scraper_with_images import InstagramScraperWithImages
|
||||
from src.base_scraper import ScraperConfig
|
||||
from src.cumulative_markdown_manager import CumulativeMarkdownManager
|
||||
from datetime import datetime
|
||||
import pytz
|
||||
import time
|
||||
import logging
|
||||
import instaloader
|
||||
|
||||
# Set up logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.FileHandler('logs/instagram_next_1000.log'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
logger = logging.getLogger('instagram_next_1000')
|
||||
|
||||
|
||||
def fetch_next_1000_posts():
|
||||
"""Fetch Instagram posts 1001-2000 and update cumulative file."""
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("INSTAGRAM NEXT 1000 POSTS (1001-2000)")
|
||||
logger.info("=" * 60)
|
||||
|
||||
# Get Atlantic timezone timestamp
|
||||
tz = pytz.timezone('America/Halifax')
|
||||
now = datetime.now(tz)
|
||||
timestamp = now.strftime('%Y-%m-%dT%H%M%S')
|
||||
|
||||
logger.info(f"Started at: {now.strftime('%Y-%m-%d %H:%M:%S %Z')}")
|
||||
|
||||
# Setup config
|
||||
config = ScraperConfig(
|
||||
source_name='Instagram',
|
||||
brand_name='hkia',
|
||||
data_dir=Path('data'),
|
||||
logs_dir=Path('logs'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
# Initialize scraper
|
||||
scraper = InstagramScraperWithImages(config)
|
||||
cumulative_manager = CumulativeMarkdownManager(config)
|
||||
|
||||
logger.info("Fetching posts 1001-2000 from Instagram...")
|
||||
logger.info("This will take several hours due to rate limiting")
|
||||
|
||||
all_items = []
|
||||
posts_to_skip = 1000 # We already have the first 1000
|
||||
max_posts = 1000 # We want the next 1000
|
||||
|
||||
try:
|
||||
# Ensure we have a valid context
|
||||
if not scraper.loader.context:
|
||||
logger.error("Failed to initialize Instagram context")
|
||||
return False
|
||||
|
||||
# Get profile
|
||||
profile = instaloader.Profile.from_username(scraper.loader.context, scraper.target_account)
|
||||
scraper._check_rate_limit()
|
||||
|
||||
# Get posts
|
||||
posts = profile.get_posts()
|
||||
|
||||
post_count = 0
|
||||
skipped = 0
|
||||
|
||||
for post in posts:
|
||||
# Skip first 1000 posts
|
||||
if skipped < posts_to_skip:
|
||||
skipped += 1
|
||||
if skipped % 100 == 0:
|
||||
logger.info(f"Skipping post {skipped}/{posts_to_skip}...")
|
||||
continue
|
||||
|
||||
# Stop after next 1000
|
||||
if post_count >= max_posts:
|
||||
break
|
||||
|
||||
try:
|
||||
# Download images for this post
|
||||
image_paths = scraper._download_post_images(post, post.shortcode)
|
||||
|
||||
# Extract post data
|
||||
post_data = {
|
||||
'id': post.shortcode,
|
||||
'type': scraper._get_post_type(post),
|
||||
'caption': post.caption if post.caption else '',
|
||||
'author': post.owner_username,
|
||||
'publish_date': post.date_utc.isoformat(),
|
||||
'link': f'https://www.instagram.com/p/{post.shortcode}/',
|
||||
'likes': post.likes,
|
||||
'comments': post.comments,
|
||||
'views': post.video_view_count if hasattr(post, 'video_view_count') else None,
|
||||
'media_count': post.mediacount if hasattr(post, 'mediacount') else 1,
|
||||
'hashtags': list(post.caption_hashtags) if post.caption else [],
|
||||
'mentions': list(post.caption_mentions) if post.caption else [],
|
||||
'is_video': getattr(post, 'is_video', False),
|
||||
'local_images': image_paths
|
||||
}
|
||||
|
||||
all_items.append(post_data)
|
||||
post_count += 1
|
||||
|
||||
# Aggressive rate limiting
|
||||
scraper._aggressive_delay()
|
||||
scraper._check_rate_limit()
|
||||
|
||||
# Progress updates
|
||||
if post_count % 10 == 0:
|
||||
logger.info(f"Fetched post {posts_to_skip + post_count} (#{post_count}/1000 in this batch)")
|
||||
|
||||
# Save incremental updates every 100 posts
|
||||
if post_count % 100 == 0:
|
||||
logger.info(f"Saving incremental update at {post_count} posts...")
|
||||
output_file = cumulative_manager.update_cumulative_file(all_items, 'Instagram')
|
||||
logger.info(f"Saved to: {output_file}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing post: {e}")
|
||||
continue
|
||||
|
||||
# Final save
|
||||
if all_items:
|
||||
output_file = cumulative_manager.update_cumulative_file(all_items, 'Instagram')
|
||||
|
||||
# Calculate statistics
|
||||
img_count = sum(len(item.get('local_images', [])) for item in all_items)
|
||||
|
||||
logger.info("=" * 60)
|
||||
logger.info("INSTAGRAM NEXT 1000 COMPLETED")
|
||||
logger.info("=" * 60)
|
||||
logger.info(f"Posts fetched: {len(all_items)}")
|
||||
logger.info(f"Post range: 1001-{1000 + len(all_items)}")
|
||||
logger.info(f"Images downloaded: {img_count}")
|
||||
logger.info(f"Output file: {output_file}")
|
||||
logger.info("=" * 60)
|
||||
|
||||
return True
|
||||
else:
|
||||
logger.warning("No posts fetched")
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Fatal error: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
success = fetch_next_1000_posts()
|
||||
sys.exit(0 if success else 1)
|
||||
|
|
@@ -1,6 +1,6 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Production runner for HKIA Content Aggregator
|
||||
Production runner for HVAC Know It All Content Aggregator
|
||||
Handles both regular scraping and special TikTok caption jobs
|
||||
"""
|
||||
import sys
|
||||
|
|
@@ -125,7 +125,7 @@ def run_regular_scraping():
# Create orchestrator config
|
||||
config = ScraperConfig(
|
||||
source_name="production",
|
||||
brand_name="hkia",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=DATA_DIR,
|
||||
logs_dir=LOGS_DIR,
|
||||
timezone="America/Halifax"
|
||||
|
|
@@ -197,7 +197,7 @@ def run_regular_scraping():
# Combine and save results
|
||||
if OUTPUT_CONFIG.get("combine_sources", True):
|
||||
combined_markdown = []
|
||||
combined_markdown.append(f"# HKIA Content Update")
|
||||
combined_markdown.append(f"# HVAC Know It All Content Update")
|
||||
combined_markdown.append(f"Generated: {datetime.now():%Y-%m-%d %H:%M:%S}")
|
||||
combined_markdown.append("")
|
||||
|
||||
|
|
@@ -213,8 +213,8 @@ def run_regular_scraping():
combined_markdown.append(markdown)
|
||||
|
||||
# Save combined output with spec-compliant naming
|
||||
# Format: hkia_combined_YYYY-MM-DD-THHMMSS.md
|
||||
output_file = DATA_DIR / f"hkia_combined_{datetime.now():%Y-%m-%d-T%H%M%S}.md"
|
||||
# Format: hvacknowitall_combined_YYYY-MM-DD-THHMMSS.md
|
||||
output_file = DATA_DIR / f"hvacknowitall_combined_{datetime.now():%Y-%m-%d-T%H%M%S}.md"
|
||||
output_file.write_text("\n".join(combined_markdown), encoding="utf-8")
|
||||
logger.info(f"Saved combined output to {output_file}")
|
||||
|
||||
|
|
@@ -284,7 +284,7 @@ def run_tiktok_caption_job():
|
|||
|
||||
config = ScraperConfig(
|
||||
source_name="tiktok_captions",
|
||||
brand_name="hkia",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=DATA_DIR / "tiktok_captions",
|
||||
logs_dir=LOGS_DIR / "tiktok_captions",
|
||||
timezone="America/Halifax"
|
||||
|
|
|
|||
|
|
@@ -1,238 +0,0 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Production script with cumulative markdown and image downloads.
|
||||
Uses cumulative updates for all sources.
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.youtube_api_scraper_with_thumbnails import YouTubeAPIScraperWithThumbnails
|
||||
from src.mailchimp_api_scraper_v2 import MailChimpAPIScraper
|
||||
from src.instagram_scraper_cumulative import InstagramScraperCumulative
|
||||
from src.rss_scraper_with_images import RSSScraperPodcastWithImages
|
||||
from src.wordpress_scraper import WordPressScraper
|
||||
from src.tiktok_scraper_advanced import TikTokScraperAdvanced
|
||||
from src.base_scraper import ScraperConfig
|
||||
from src.cumulative_markdown_manager import CumulativeMarkdownManager
|
||||
from datetime import datetime
|
||||
import pytz
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import os
|
||||
|
||||
# Set up logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.FileHandler('logs/production_cumulative.log'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
logger = logging.getLogger('production_cumulative')
|
||||
|
||||
|
||||
def get_atlantic_timestamp() -> str:
|
||||
"""Get current timestamp in Atlantic timezone for file naming."""
|
||||
tz = pytz.timezone('America/Halifax')
|
||||
return datetime.now(tz).strftime('%Y-%m-%dT%H%M%S')
|
||||
|
||||
|
||||
def run_instagram_incremental():
|
||||
"""Run Instagram incremental update with cumulative markdown."""
|
||||
logger.info("=" * 60)
|
||||
logger.info("INSTAGRAM INCREMENTAL UPDATE (CUMULATIVE)")
|
||||
logger.info("=" * 60)
|
||||
|
||||
if not os.getenv('INSTAGRAM_USERNAME'):
|
||||
logger.warning("Instagram not configured")
|
||||
return False, 0, None
|
||||
|
||||
config = ScraperConfig(
|
||||
source_name='Instagram',
|
||||
brand_name='hkia',
|
||||
data_dir=Path('data'),
|
||||
logs_dir=Path('logs'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
try:
|
||||
scraper = InstagramScraperCumulative(config)
|
||||
return scraper.run_incremental(max_posts=50) # Check for 50 new posts
|
||||
except Exception as e:
|
||||
logger.error(f"Instagram error: {e}")
|
||||
return False, 0, None
|
||||
|
||||
|
||||
def run_youtube_incremental():
|
||||
"""Run YouTube incremental update with thumbnails."""
|
||||
logger.info("=" * 60)
|
||||
logger.info("YOUTUBE INCREMENTAL UPDATE")
|
||||
logger.info("=" * 60)
|
||||
|
||||
config = ScraperConfig(
|
||||
source_name='YouTube',
|
||||
brand_name='hkia',
|
||||
data_dir=Path('data'),
|
||||
logs_dir=Path('logs'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
try:
|
||||
scraper = YouTubeAPIScraperWithThumbnails(config)
|
||||
videos = scraper.fetch_content(max_posts=20) # Check for 20 new videos
|
||||
|
||||
if videos:
|
||||
manager = CumulativeMarkdownManager(config)
|
||||
output_file = manager.update_cumulative_file(videos, 'YouTube')
|
||||
|
||||
thumb_count = sum(1 for v in videos if v.get('local_thumbnail'))
|
||||
logger.info(f"✅ YouTube: {len(videos)} videos, {thumb_count} thumbnails")
|
||||
return True, len(videos), output_file
|
||||
else:
|
||||
logger.info("No new YouTube videos")
|
||||
return False, 0, None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"YouTube error: {e}")
|
||||
return False, 0, None
|
||||
|
||||
|
||||
def run_podcast_incremental():
|
||||
"""Run Podcast incremental update with thumbnails."""
|
||||
logger.info("=" * 60)
|
||||
logger.info("PODCAST INCREMENTAL UPDATE")
|
||||
logger.info("=" * 60)
|
||||
|
||||
if not os.getenv('PODCAST_RSS_URL'):
|
||||
logger.warning("Podcast not configured")
|
||||
return False, 0, None
|
||||
|
||||
config = ScraperConfig(
|
||||
source_name='Podcast',
|
||||
brand_name='hkia',
|
||||
data_dir=Path('data'),
|
||||
logs_dir=Path('logs'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
try:
|
||||
scraper = RSSScraperPodcastWithImages(config)
|
||||
items = scraper.fetch_content(max_items=10) # Check for 10 new episodes
|
||||
|
||||
if items:
|
||||
manager = CumulativeMarkdownManager(config)
|
||||
output_file = manager.update_cumulative_file(items, 'Podcast')
|
||||
|
||||
thumb_count = sum(1 for item in items if item.get('local_thumbnail'))
|
||||
logger.info(f"✅ Podcast: {len(items)} episodes, {thumb_count} thumbnails")
|
||||
return True, len(items), output_file
|
||||
else:
|
||||
logger.info("No new podcast episodes")
|
||||
return False, 0, None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Podcast error: {e}")
|
||||
return False, 0, None
|
||||
|
||||
|
||||
def sync_to_nas_with_images():
|
||||
"""Sync markdown files AND images to NAS."""
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("SYNCING TO NAS - MARKDOWN AND IMAGES")
|
||||
logger.info("=" * 60)
|
||||
|
||||
nas_base = Path('/mnt/nas/hkia')
|
||||
|
||||
try:
|
||||
# Sync markdown files
|
||||
local_current = Path('data/markdown_current')
|
||||
nas_current = nas_base / 'markdown_current'
|
||||
|
||||
if local_current.exists() and any(local_current.glob('*.md')):
|
||||
nas_current.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
cmd = ['rsync', '-av', '--include=*.md', '--exclude=*',
|
||||
str(local_current) + '/', str(nas_current) + '/']
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
logger.info(f"✅ Markdown files synced to NAS")
|
||||
else:
|
||||
logger.warning(f"Markdown sync warning: {result.stderr}")
|
||||
|
||||
# Sync media files
|
||||
local_media = Path('data/media')
|
||||
nas_media = nas_base / 'media'
|
||||
|
||||
if local_media.exists():
|
||||
nas_media.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
cmd = ['rsync', '-av',
|
||||
'--include=*/',
|
||||
'--include=*.jpg', '--include=*.jpeg',
|
||||
'--include=*.png', '--include=*.gif',
|
||||
'--exclude=*',
|
||||
str(local_media) + '/', str(nas_media) + '/']
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
logger.info(f"✅ Media files synced to NAS")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to sync to NAS: {e}")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main production run with cumulative updates and images."""
|
||||
logger.info("=" * 70)
|
||||
logger.info("HKIA - CUMULATIVE PRODUCTION")
|
||||
logger.info("With Image Downloads and Cumulative Markdown")
|
||||
logger.info("=" * 70)
|
||||
|
||||
atlantic_tz = pytz.timezone('America/Halifax')
|
||||
start_time = datetime.now(atlantic_tz)
|
||||
logger.info(f"Started at: {start_time.isoformat()}")
|
||||
|
||||
# Track results
|
||||
results = {}
|
||||
|
||||
# Run incremental updates
|
||||
success, count, file = run_instagram_incremental()
|
||||
results['Instagram'] = {'success': success, 'count': count, 'file': file}
|
||||
|
||||
time.sleep(2)
|
||||
|
||||
success, count, file = run_youtube_incremental()
|
||||
results['YouTube'] = {'success': success, 'count': count, 'file': file}
|
||||
|
||||
time.sleep(2)
|
||||
|
||||
success, count, file = run_podcast_incremental()
|
||||
results['Podcast'] = {'success': success, 'count': count, 'file': file}
|
||||
|
||||
# Also run MailChimp (already has cumulative support)
|
||||
# ... (add MailChimp, WordPress, TikTok as needed)
|
||||
|
||||
# Sync to NAS
|
||||
sync_to_nas_with_images()
|
||||
|
||||
# Summary
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("PRODUCTION SUMMARY")
|
||||
logger.info("=" * 60)
|
||||
|
||||
for source, result in results.items():
|
||||
if result['success']:
|
||||
logger.info(f"✅ {source}: {result['count']} items")
|
||||
else:
|
||||
logger.info(f"ℹ️ {source}: No new items")
|
||||
|
||||
logger.info("=" * 60)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@@ -1,344 +0,0 @@
#!/usr/bin/env python3
|
||||
"""
|
||||
Production script with comprehensive image downloading for all sources.
|
||||
Downloads thumbnails and images from Instagram, YouTube, and Podcasts.
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.youtube_api_scraper_with_thumbnails import YouTubeAPIScraperWithThumbnails
|
||||
from src.mailchimp_api_scraper_v2 import MailChimpAPIScraper
|
||||
from src.instagram_scraper_with_images import InstagramScraperWithImages
|
||||
from src.rss_scraper_with_images import RSSScraperPodcastWithImages
|
||||
from src.wordpress_scraper import WordPressScraper
|
||||
from src.tiktok_scraper_advanced import TikTokScraperAdvanced
|
||||
from src.base_scraper import ScraperConfig
|
||||
from src.cumulative_markdown_manager import CumulativeMarkdownManager
|
||||
from datetime import datetime
|
||||
import pytz
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
import os
|
||||
|
||||
# Set up logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
handlers=[
|
||||
logging.FileHandler('logs/production_with_images.log'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
logger = logging.getLogger('production_with_images')
|
||||
|
||||
|
||||
def get_atlantic_timestamp() -> str:
|
||||
"""Get current timestamp in Atlantic timezone for file naming."""
|
||||
tz = pytz.timezone('America/Halifax')
|
||||
return datetime.now(tz).strftime('%Y-%m-%dT%H%M%S')
|
||||
|
||||
|
||||
def run_youtube_with_thumbnails():
|
||||
"""Run YouTube API scraper with thumbnail downloads."""
|
||||
logger.info("=" * 60)
|
||||
logger.info("YOUTUBE API SCRAPER WITH THUMBNAILS")
|
||||
logger.info("=" * 60)
|
||||
|
||||
timestamp = get_atlantic_timestamp()
|
||||
|
||||
config = ScraperConfig(
|
||||
source_name='YouTube',
|
||||
brand_name='hkia',
|
||||
data_dir=Path('data'),
|
||||
logs_dir=Path('logs'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
try:
|
||||
scraper = YouTubeAPIScraperWithThumbnails(config)
|
||||
|
||||
# Fetch videos with thumbnails
|
||||
logger.info("Fetching YouTube videos and downloading thumbnails...")
|
||||
videos = scraper.fetch_content(max_posts=100) # Limit for testing
|
||||
|
||||
if videos:
|
||||
# Process cumulative markdown
|
||||
manager = CumulativeMarkdownManager(config)
|
||||
output_file = manager.update_cumulative_file(videos, 'YouTube')
|
||||
|
||||
logger.info(f"✅ YouTube completed: {len(videos)} videos")
|
||||
logger.info(f" Output: {output_file}")
|
||||
|
||||
# Count downloaded thumbnails
|
||||
thumb_count = sum(1 for v in videos if v.get('local_thumbnail'))
|
||||
logger.info(f" Thumbnails downloaded: {thumb_count}")
|
||||
|
||||
return True, len(videos), output_file
|
||||
else:
|
||||
logger.warning("No YouTube videos fetched")
|
||||
return False, 0, None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"YouTube scraper error: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False, 0, None
|
||||
|
||||
|
||||
def run_instagram_with_images():
|
||||
"""Run Instagram scraper with image downloads."""
|
||||
logger.info("=" * 60)
|
||||
logger.info("INSTAGRAM SCRAPER WITH IMAGES")
|
||||
logger.info("=" * 60)
|
||||
|
||||
if not os.getenv('INSTAGRAM_USERNAME'):
|
||||
logger.warning("Instagram not configured (INSTAGRAM_USERNAME missing)")
|
||||
return False, 0, None
|
||||
|
||||
timestamp = get_atlantic_timestamp()
|
||||
|
||||
config = ScraperConfig(
|
||||
source_name='Instagram',
|
||||
brand_name='hkia',
|
||||
data_dir=Path('data'),
|
||||
logs_dir=Path('logs'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
try:
|
||||
scraper = InstagramScraperWithImages(config)
|
||||
|
||||
# Fetch posts with images (limited for testing)
|
||||
logger.info("Fetching Instagram posts and downloading images...")
|
||||
items = scraper.fetch_content(max_posts=20) # Start with 20 for testing
|
||||
|
||||
if items:
|
||||
# Process cumulative markdown
|
||||
manager = CumulativeMarkdownManager(config)
|
||||
output_file = manager.update_cumulative_file(items, 'Instagram')
|
||||
|
||||
logger.info(f"✅ Instagram completed: {len(items)} posts")
|
||||
logger.info(f" Output: {output_file}")
|
||||
|
||||
# Count downloaded images
|
||||
img_count = sum(len(item.get('local_images', [])) for item in items)
|
||||
logger.info(f" Images downloaded: {img_count}")
|
||||
|
||||
return True, len(items), output_file
|
||||
else:
|
||||
logger.warning("No Instagram posts fetched")
|
||||
return False, 0, None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Instagram scraper error: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False, 0, None
|
||||
|
||||
|
||||
def run_podcast_with_thumbnails():
|
||||
"""Run Podcast RSS scraper with thumbnail downloads."""
|
||||
logger.info("=" * 60)
|
||||
logger.info("PODCAST RSS SCRAPER WITH THUMBNAILS")
|
||||
logger.info("=" * 60)
|
||||
|
||||
if not os.getenv('PODCAST_RSS_URL'):
|
||||
logger.warning("Podcast not configured (PODCAST_RSS_URL missing)")
|
||||
return False, 0, None
|
||||
|
||||
timestamp = get_atlantic_timestamp()
|
||||
|
||||
config = ScraperConfig(
|
||||
source_name='Podcast',
|
||||
brand_name='hkia',
|
||||
data_dir=Path('data'),
|
||||
logs_dir=Path('logs'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
try:
|
||||
scraper = RSSScraperPodcastWithImages(config)
|
||||
|
||||
# Fetch episodes with thumbnails
|
||||
logger.info("Fetching podcast episodes and downloading thumbnails...")
|
||||
items = scraper.fetch_content(max_items=50) # Limit for testing
|
||||
|
||||
if items:
|
||||
# Process cumulative markdown
|
||||
manager = CumulativeMarkdownManager(config)
|
||||
output_file = manager.update_cumulative_file(items, 'Podcast')
|
||||
|
||||
logger.info(f"✅ Podcast completed: {len(items)} episodes")
|
||||
logger.info(f" Output: {output_file}")
|
||||
|
||||
# Count downloaded thumbnails
|
||||
thumb_count = sum(1 for item in items if item.get('local_thumbnail'))
|
||||
logger.info(f" Thumbnails downloaded: {thumb_count}")
|
||||
|
||||
return True, len(items), output_file
|
||||
else:
|
||||
logger.warning("No podcast episodes fetched")
|
||||
return False, 0, None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Podcast scraper error: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return False, 0, None
|
||||
|
||||
|
||||
def sync_to_nas_with_images():
|
||||
"""Sync markdown files AND images to NAS."""
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("SYNCING TO NAS - MARKDOWN AND IMAGES")
|
||||
logger.info("=" * 60)
|
||||
|
||||
nas_base = Path('/mnt/nas/hkia')
|
||||
|
||||
try:
|
||||
# Sync markdown files
|
||||
local_current = Path('data/markdown_current')
|
||||
nas_current = nas_base / 'markdown_current'
|
||||
|
||||
if local_current.exists() and any(local_current.glob('*.md')):
|
||||
nas_current.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Sync markdown files
|
||||
cmd = ['rsync', '-av', '--include=*.md', '--exclude=*',
|
||||
str(local_current) + '/', str(nas_current) + '/']
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
logger.info(f"✅ Markdown files synced to NAS: {nas_current}")
|
||||
md_count = len(list(nas_current.glob('*.md')))
|
||||
logger.info(f" Total markdown files: {md_count}")
|
||||
else:
|
||||
logger.warning(f"Markdown sync warning: {result.stderr}")
|
||||
|
||||
# Sync media files
|
||||
local_media = Path('data/media')
|
||||
nas_media = nas_base / 'media'
|
||||
|
||||
if local_media.exists():
|
||||
nas_media.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Sync all image files (jpg, jpeg, png, gif)
|
||||
cmd = ['rsync', '-av',
|
||||
'--include=*/',
|
||||
'--include=*.jpg', '--include=*.jpeg',
|
||||
'--include=*.png', '--include=*.gif',
|
||||
'--exclude=*',
|
||||
str(local_media) + '/', str(nas_media) + '/']
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
logger.info(f"✅ Media files synced to NAS: {nas_media}")
|
||||
|
||||
# Count images per source
|
||||
for source_dir in nas_media.glob('*'):
|
||||
if source_dir.is_dir():
|
||||
img_count = len(list(source_dir.glob('*.jpg'))) + \
|
||||
len(list(source_dir.glob('*.jpeg'))) + \
|
||||
len(list(source_dir.glob('*.png'))) + \
|
||||
len(list(source_dir.glob('*.gif')))
|
||||
if img_count > 0:
|
||||
logger.info(f" {source_dir.name}: {img_count} images")
|
||||
else:
|
||||
logger.warning(f"Media sync warning: {result.stderr}")
|
||||
|
||||
# Sync archives
|
||||
for source in ['YouTube', 'MailChimp', 'Instagram', 'Podcast', 'WordPress', 'TikTok']:
|
||||
local_archive = Path(f'data/markdown_archives/{source}')
|
||||
nas_archive = nas_base / f'markdown_archives/{source}'
|
||||
|
||||
if local_archive.exists() and any(local_archive.glob('*.md')):
|
||||
nas_archive.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
cmd = ['rsync', '-av', '--include=*.md', '--exclude=*',
|
||||
str(local_archive) + '/', str(nas_archive) + '/']
|
||||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||||
|
||||
if result.returncode == 0:
|
||||
logger.info(f"✅ {source} archives synced to NAS")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to sync to NAS: {e}")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main production run with image downloads."""
|
||||
logger.info("=" * 70)
|
||||
logger.info("HKIA - PRODUCTION WITH IMAGE DOWNLOADS")
|
||||
logger.info("Downloads all thumbnails and images (no videos)")
|
||||
logger.info("=" * 70)
|
||||
|
||||
atlantic_tz = pytz.timezone('America/Halifax')
|
||||
start_time = datetime.now(atlantic_tz)
|
||||
logger.info(f"Started at: {start_time.isoformat()}")
|
||||
|
||||
# Track results
|
||||
results = {
|
||||
'YouTube': {'success': False, 'count': 0, 'file': None},
|
||||
'Instagram': {'success': False, 'count': 0, 'file': None},
|
||||
'Podcast': {'success': False, 'count': 0, 'file': None}
|
||||
}
|
||||
|
||||
# Run YouTube with thumbnails
|
||||
success, count, output_file = run_youtube_with_thumbnails()
|
||||
results['YouTube'] = {'success': success, 'count': count, 'file': output_file}
|
||||
|
||||
# Wait a bit between scrapers
|
||||
time.sleep(2)
|
||||
|
||||
# Run Instagram with images
|
||||
success, count, output_file = run_instagram_with_images()
|
||||
results['Instagram'] = {'success': success, 'count': count, 'file': output_file}
|
||||
|
||||
# Wait a bit between scrapers
|
||||
time.sleep(2)
|
||||
|
||||
# Run Podcast with thumbnails
|
||||
success, count, output_file = run_podcast_with_thumbnails()
|
||||
results['Podcast'] = {'success': success, 'count': count, 'file': output_file}
|
||||
|
||||
# Sync to NAS including images
|
||||
sync_to_nas_with_images()
|
||||
|
||||
# Summary
|
||||
end_time = datetime.now(atlantic_tz)
|
||||
duration = (end_time - start_time).total_seconds()
|
||||
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("PRODUCTION RUN SUMMARY")
|
||||
logger.info("=" * 60)
|
||||
|
||||
for source, result in results.items():
|
||||
if result['success']:
|
||||
logger.info(f"✅ {source}: {result['count']} items")
|
||||
if result['file']:
|
||||
logger.info(f" File: {result['file']}")
|
||||
else:
|
||||
logger.info(f"❌ {source}: Failed")
|
||||
|
||||
# Count total images downloaded
|
||||
media_dir = Path('data/media')
|
||||
total_images = 0
|
||||
if media_dir.exists():
|
||||
for source_dir in media_dir.glob('*'):
|
||||
if source_dir.is_dir():
|
||||
img_count = len(list(source_dir.glob('*.jpg'))) + \
|
||||
len(list(source_dir.glob('*.jpeg'))) + \
|
||||
len(list(source_dir.glob('*.png'))) + \
|
||||
len(list(source_dir.glob('*.gif')))
|
||||
total_images += img_count
|
||||
|
||||
logger.info(f"\nTotal images downloaded: {total_images}")
|
||||
logger.info(f"Duration: {duration:.1f} seconds")
|
||||
logger.info("=" * 60)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
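The NAS sync above leans on rsync's ordered include/exclude rules to copy only image files while still walking every subdirectory. A minimal standalone sketch of that filter pattern, with placeholder paths:

```
import subprocess
from pathlib import Path

def sync_images_only(src: Path, dst: Path) -> None:
    """Copy only jpg/jpeg/png/gif files from src to dst, keeping the directory tree."""
    dst.mkdir(parents=True, exist_ok=True)
    cmd = [
        'rsync', '-av',
        '--include=*/',                      # allow rsync to descend into every directory
        '--include=*.jpg', '--include=*.jpeg',
        '--include=*.png', '--include=*.gif',
        '--exclude=*',                       # everything not matched above is skipped
        f'{src}/', f'{dst}/',
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'rsync failed: {result.stderr}')

# Example call with placeholder paths:
# sync_images_only(Path('data/media'), Path('/mnt/nas/hvacknowitall/media'))
```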
@ -42,7 +42,7 @@ class BaseScraper(ABC):
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
            'HVAC-KnowItAll-Bot/1.0 (+https://hkia.com)' # Fallback bot UA
            'HVAC-KnowItAll-Bot/1.0 (+https://hvacknowitall.com)' # Fallback bot UA
        ]
        self.current_ua_index = 0
@ -1,80 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Enhanced Base Scraper with Cumulative Markdown Support
|
||||
Extension of base_scraper.py that adds cumulative markdown functionality
|
||||
"""
|
||||
|
||||
from src.base_scraper import BaseScraper
|
||||
from src.cumulative_markdown_manager import CumulativeMarkdownManager
|
||||
from pathlib import Path
|
||||
from typing import List, Dict, Any, Optional
|
||||
|
||||
|
||||
class BaseScraperCumulative(BaseScraper):
|
||||
"""Base scraper with cumulative markdown support."""
|
||||
|
||||
def __init__(self, config, use_cumulative: bool = True):
|
||||
"""Initialize with optional cumulative mode."""
|
||||
super().__init__(config)
|
||||
self.use_cumulative = use_cumulative
|
||||
|
||||
if self.use_cumulative:
|
||||
self.cumulative_manager = CumulativeMarkdownManager(config, self.logger)
|
||||
self.logger.info("Initialized with cumulative markdown mode")
|
||||
|
||||
def save_content(self, items: List[Dict[str, Any]]) -> Optional[Path]:
|
||||
"""Save content using either cumulative or traditional mode."""
|
||||
if not items:
|
||||
self.logger.warning("No items to save")
|
||||
return None
|
||||
|
||||
if self.use_cumulative:
|
||||
# Use cumulative manager
|
||||
return self.cumulative_manager.save_cumulative(
|
||||
items,
|
||||
self.format_markdown
|
||||
)
|
||||
else:
|
||||
# Use traditional save (creates new file each time)
|
||||
markdown = self.format_markdown(items)
|
||||
return self.save_markdown(markdown)
|
||||
|
||||
def run(self) -> Optional[Path]:
|
||||
"""Run the scraper with cumulative support."""
|
||||
try:
|
||||
self.logger.info(f"Starting {self.config.source_name} scraper "
|
||||
f"(cumulative={self.use_cumulative})")
|
||||
|
||||
# Fetch content (will check state for incremental)
|
||||
items = self.fetch_content()
|
||||
|
||||
if not items:
|
||||
self.logger.info("No new content found")
|
||||
return None
|
||||
|
||||
self.logger.info(f"Fetched {len(items)} items")
|
||||
|
||||
# Save content (cumulative or traditional)
|
||||
filepath = self.save_content(items)
|
||||
|
||||
# Update state for next incremental run
|
||||
if items and filepath:
|
||||
self.update_state(items)
|
||||
|
||||
# Log statistics if cumulative
|
||||
if self.use_cumulative:
|
||||
stats = self.cumulative_manager.get_statistics(filepath)
|
||||
self.logger.info(f"Cumulative stats: {stats}")
|
||||
|
||||
return filepath
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error in scraper run: {e}")
|
||||
raise
|
||||
|
||||
def get_cumulative_stats(self) -> Dict[str, int]:
|
||||
"""Get statistics about the cumulative file."""
|
||||
if not self.use_cumulative:
|
||||
return {}
|
||||
|
||||
return self.cumulative_manager.get_statistics()
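The run() flow above only needs a subclass to provide fetch_content() and format_markdown(); merging, archiving, and state updates are inherited. A minimal sketch of such a subclass, with the class name and sample data as assumptions:

```
from typing import Any, Dict, List
from src.base_scraper_cumulative import BaseScraperCumulative

class PodcastScraperCumulative(BaseScraperCumulative):
    """Hypothetical subclass: only the source-specific hooks are filled in."""

    def fetch_content(self) -> List[Dict[str, Any]]:
        # The real scrapers consult saved state here so incremental runs return only new items.
        return [{'id': 'ep-001', 'title': 'Example episode'}]

    def format_markdown(self, items: List[Dict[str, Any]]) -> str:
        return '\n'.join(f"# ID: {i['id']}\n## Title: {i['title']}\n" for i in items)

# scraper = PodcastScraperCumulative(config, use_cumulative=True)  # config built as for the other scrapers
# filepath = scraper.run()  # fetch, merge into the cumulative file, update state
```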
@ -1,294 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Unified cookie management system for YouTube authentication
|
||||
Based on compendium project's successful implementation
|
||||
"""
|
||||
|
||||
import os
|
||||
import time
|
||||
import fcntl
|
||||
import shutil
|
||||
from pathlib import Path
|
||||
from typing import Optional, List, Dict, Any
|
||||
from datetime import datetime, timedelta
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class CookieManager:
|
||||
"""Unified cookie discovery and validation system"""
|
||||
|
||||
def __init__(self):
|
||||
self.priority_paths = self._get_priority_paths()
|
||||
self.max_age_days = 90
|
||||
self.min_size = 50
|
||||
self.max_size = 50 * 1024 * 1024 # 50MB
|
||||
|
||||
def _get_priority_paths(self) -> List[Path]:
|
||||
"""Get cookie paths in priority order"""
|
||||
paths = []
|
||||
|
||||
# 1. Environment variable (highest priority)
|
||||
env_path = os.getenv('YOUTUBE_COOKIES_PATH')
|
||||
if env_path:
|
||||
paths.append(Path(env_path))
|
||||
|
||||
# 2. Container paths
|
||||
paths.extend([
|
||||
Path('/app/youtube_cookies.txt'),
|
||||
Path('/app/cookies.txt'),
|
||||
])
|
||||
|
||||
# 3. NAS production paths
|
||||
nas_base = Path('/mnt/nas/app_data')
|
||||
if nas_base.exists():
|
||||
paths.extend([
|
||||
nas_base / 'cookies' / 'youtube_cookies.txt',
|
||||
nas_base / 'cookies' / 'cookies.txt',
|
||||
])
|
||||
|
||||
# 4. Local development paths
|
||||
project_root = Path(__file__).parent.parent
|
||||
paths.extend([
|
||||
project_root / 'data_production_backlog' / '.cookies' / 'youtube_cookies.txt',
|
||||
project_root / 'data_production_backlog' / '.cookies' / 'cookies.txt',
|
||||
project_root / '.cookies' / 'youtube_cookies.txt',
|
||||
project_root / '.cookies' / 'cookies.txt',
|
||||
])
|
||||
|
||||
return paths
|
||||
|
||||
def find_valid_cookies(self) -> Optional[Path]:
|
||||
"""Find the first valid cookie file in priority order"""
|
||||
|
||||
for cookie_path in self.priority_paths:
|
||||
if self._validate_cookie_file(cookie_path):
|
||||
logger.info(f"Found valid cookies: {cookie_path}")
|
||||
return cookie_path
|
||||
|
||||
logger.warning("No valid cookie files found")
|
||||
return None
|
||||
|
||||
def _validate_cookie_file(self, cookie_path: Path) -> bool:
|
||||
"""Validate a cookie file"""
|
||||
|
||||
try:
|
||||
# Check existence and accessibility
|
||||
if not cookie_path.exists():
|
||||
return False
|
||||
|
||||
if not cookie_path.is_file():
|
||||
return False
|
||||
|
||||
if not os.access(cookie_path, os.R_OK):
|
||||
logger.warning(f"Cookie file not readable: {cookie_path}")
|
||||
return False
|
||||
|
||||
# Check file size
|
||||
file_size = cookie_path.stat().st_size
|
||||
if file_size < self.min_size:
|
||||
logger.warning(f"Cookie file too small ({file_size} bytes): {cookie_path}")
|
||||
return False
|
||||
|
||||
if file_size > self.max_size:
|
||||
logger.warning(f"Cookie file too large ({file_size} bytes): {cookie_path}")
|
||||
return False
|
||||
|
||||
# Check file age
|
||||
mtime = datetime.fromtimestamp(cookie_path.stat().st_mtime)
|
||||
age = datetime.now() - mtime
|
||||
if age > timedelta(days=self.max_age_days):
|
||||
logger.warning(f"Cookie file too old ({age.days} days): {cookie_path}")
|
||||
return False
|
||||
|
||||
# Validate Netscape format
|
||||
if not self._validate_netscape_format(cookie_path):
|
||||
return False
|
||||
|
||||
logger.debug(f"Cookie file validated: {cookie_path} ({file_size} bytes, {age.days} days old)")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Error validating cookie file {cookie_path}: {e}")
|
||||
return False
|
||||
|
||||
def _validate_netscape_format(self, cookie_path: Path) -> bool:
|
||||
"""Validate cookie file is in proper Netscape format"""
|
||||
|
||||
try:
|
||||
content = cookie_path.read_text(encoding='utf-8', errors='ignore')
|
||||
lines = content.strip().split('\n')
|
||||
|
||||
# Should have header
|
||||
if not any('Netscape HTTP Cookie File' in line for line in lines[:5]):
|
||||
logger.warning(f"Missing Netscape header: {cookie_path}")
|
||||
return False
|
||||
|
||||
# Count valid cookie lines (non-comment, non-empty)
|
||||
cookie_count = 0
|
||||
for line in lines:
|
||||
line = line.strip()
|
||||
if line and not line.startswith('#'):
|
||||
# Basic tab-separated format check
|
||||
parts = line.split('\t')
|
||||
if len(parts) >= 6: # domain, flag, path, secure, expiration, name, [value]
|
||||
cookie_count += 1
|
||||
|
||||
if cookie_count < 3: # Need at least a few cookies
|
||||
logger.warning(f"Too few valid cookies ({cookie_count}): {cookie_path}")
|
||||
return False
|
||||
|
||||
logger.debug(f"Found {cookie_count} valid cookies in {cookie_path}")
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Error reading cookie file {cookie_path}: {e}")
|
||||
return False
|
||||
|
||||
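# For reference, the Netscape format validated above is one cookie per line with at
# least six tab-separated fields (domain, include-subdomains flag, path, secure,
# expiry, name, and usually a value), preceded by the standard header. An
# illustrative example with made-up values (<TAB> marks a tab character):
#
#   # Netscape HTTP Cookie File
#   .youtube.com<TAB>TRUE<TAB>/<TAB>TRUE<TAB>1767225600<TAB>PREF<TAB>f6=40000000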
def backup_cookies(self, cookie_path: Path) -> Optional[Path]:
|
||||
"""Create backup of cookie file"""
|
||||
|
||||
try:
|
||||
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
||||
backup_path = cookie_path.with_suffix(f'.backup_{timestamp}')
|
||||
|
||||
shutil.copy2(cookie_path, backup_path)
|
||||
logger.info(f"Backed up cookies to: {backup_path}")
|
||||
return backup_path
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to backup cookies {cookie_path}: {e}")
|
||||
return None
|
||||
|
||||
def update_cookies(self, new_cookie_path: Path, target_path: Optional[Path] = None) -> bool:
|
||||
"""Atomically update cookie file with new cookies"""
|
||||
|
||||
if target_path is None:
|
||||
target_path = self.find_valid_cookies()
|
||||
if target_path is None:
|
||||
# Use first priority path as default
|
||||
target_path = self.priority_paths[0]
|
||||
target_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
try:
|
||||
# Validate new cookies first
|
||||
if not self._validate_cookie_file(new_cookie_path):
|
||||
logger.error(f"New cookie file failed validation: {new_cookie_path}")
|
||||
return False
|
||||
|
||||
# Backup existing cookies
|
||||
if target_path.exists():
|
||||
backup_path = self.backup_cookies(target_path)
|
||||
if backup_path is None:
|
||||
logger.warning("Failed to backup existing cookies, proceeding anyway")
|
||||
|
||||
# Atomic replacement using file locking
|
||||
temp_path = target_path.with_suffix('.tmp')
|
||||
|
||||
try:
|
||||
# Copy new cookies to temp file
|
||||
shutil.copy2(new_cookie_path, temp_path)
|
||||
|
||||
# Lock and replace atomically
|
||||
with open(temp_path, 'r+b') as f:
|
||||
fcntl.flock(f.fileno(), fcntl.LOCK_EX)
|
||||
temp_path.replace(target_path)
|
||||
|
||||
logger.info(f"Successfully updated cookies: {target_path}")
|
||||
return True
|
||||
|
||||
finally:
|
||||
if temp_path.exists():
|
||||
temp_path.unlink()
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to update cookies: {e}")
|
||||
return False
|
||||
|
||||
def get_cookie_stats(self) -> Dict[str, Any]:
|
||||
"""Get statistics about available cookie files"""
|
||||
|
||||
stats = {
|
||||
'valid_files': [],
|
||||
'invalid_files': [],
|
||||
'total_cookies': 0,
|
||||
'newest_file': None,
|
||||
'oldest_file': None,
|
||||
}
|
||||
|
||||
for cookie_path in self.priority_paths:
|
||||
if cookie_path.exists():
|
||||
if self._validate_cookie_file(cookie_path):
|
||||
file_info = {
|
||||
'path': str(cookie_path),
|
||||
'size': cookie_path.stat().st_size,
|
||||
'mtime': datetime.fromtimestamp(cookie_path.stat().st_mtime),
|
||||
'cookie_count': self._count_cookies(cookie_path),
|
||||
}
|
||||
stats['valid_files'].append(file_info)
|
||||
stats['total_cookies'] += file_info['cookie_count']
|
||||
|
||||
if stats['newest_file'] is None or file_info['mtime'] > stats['newest_file']['mtime']:
|
||||
stats['newest_file'] = file_info
|
||||
if stats['oldest_file'] is None or file_info['mtime'] < stats['oldest_file']['mtime']:
|
||||
stats['oldest_file'] = file_info
|
||||
else:
|
||||
stats['invalid_files'].append(str(cookie_path))
|
||||
|
||||
return stats
|
||||
|
||||
def _count_cookies(self, cookie_path: Path) -> int:
|
||||
"""Count valid cookies in file"""
|
||||
|
||||
try:
|
||||
content = cookie_path.read_text(encoding='utf-8', errors='ignore')
|
||||
lines = content.strip().split('\n')
|
||||
|
||||
count = 0
|
||||
for line in lines:
|
||||
line = line.strip()
|
||||
if line and not line.startswith('#'):
|
||||
parts = line.split('\t')
|
||||
if len(parts) >= 6:
|
||||
count += 1
|
||||
|
||||
return count
|
||||
|
||||
except Exception:
|
||||
return 0
|
||||
|
||||
def cleanup_old_backups(self, keep_count: int = 5):
|
||||
"""Clean up old backup files, keeping only the most recent"""
|
||||
|
||||
for cookie_path in self.priority_paths:
|
||||
if cookie_path.exists():
|
||||
backup_pattern = f"{cookie_path.stem}.backup_*"
|
||||
backup_files = list(cookie_path.parent.glob(backup_pattern))
|
||||
|
||||
if len(backup_files) > keep_count:
|
||||
# Sort by modification time (newest first)
|
||||
backup_files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
|
||||
|
||||
# Remove old backups
|
||||
for old_backup in backup_files[keep_count:]:
|
||||
try:
|
||||
old_backup.unlink()
|
||||
logger.debug(f"Removed old backup: {old_backup}")
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to remove backup {old_backup}: {e}")
|
||||
|
||||
# Convenience functions
def get_youtube_cookies() -> Optional[Path]:
    """Get valid YouTube cookies file"""
    manager = CookieManager()
    return manager.find_valid_cookies()

def update_youtube_cookies(new_cookie_path: Path) -> bool:
    """Update YouTube cookies"""
    manager = CookieManager()
    return manager.update_cookies(new_cookie_path)

def get_cookie_stats() -> Dict[str, Any]:
    """Get cookie file statistics"""
    manager = CookieManager()
    return manager.get_cookie_stats()
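In a scraper, the convenience helpers above would typically be used like this (a short sketch; the printouts are only for illustration):

```
from src.cookie_manager import get_youtube_cookies, get_cookie_stats

cookies = get_youtube_cookies()          # first valid file in priority order, or None
if cookies is None:
    print('No usable YouTube cookies found, falling back to unauthenticated requests')
else:
    print(f'Using cookies from {cookies}')

stats = get_cookie_stats()
print(f"{len(stats['valid_files'])} valid cookie file(s), {stats['total_cookies']} cookies total")
```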
@ -1,374 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Cumulative Markdown Manager
|
||||
Maintains a single, growing markdown file per source that combines:
|
||||
- Initial backlog content
|
||||
- Daily incremental updates
|
||||
- Updates to existing entries (e.g., new captions, updated metrics)
|
||||
"""
|
||||
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
from typing import Dict, List, Optional, Any
|
||||
import pytz
|
||||
import logging
|
||||
import shutil
|
||||
import re
|
||||
|
||||
|
||||
class CumulativeMarkdownManager:
|
||||
"""Manages cumulative markdown files that grow with each update."""
|
||||
|
||||
def __init__(self, config, logger: Optional[logging.Logger] = None):
|
||||
"""Initialize with scraper config."""
|
||||
self.config = config
|
||||
self.logger = logger or logging.getLogger(self.__class__.__name__)
|
||||
self.tz = pytz.timezone(config.timezone)
|
||||
|
||||
# Paths
|
||||
self.current_dir = config.data_dir / "markdown_current"
|
||||
self.archive_dir = config.data_dir / "markdown_archives" / config.source_name.title()
|
||||
|
||||
# Ensure directories exist
|
||||
self.current_dir.mkdir(parents=True, exist_ok=True)
|
||||
self.archive_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# File pattern for this source
|
||||
self.file_pattern = f"{config.brand_name}_{config.source_name}_*.md"
|
||||
|
||||
def get_current_file(self) -> Optional[Path]:
|
||||
"""Find the current markdown file for this source."""
|
||||
files = list(self.current_dir.glob(self.file_pattern))
|
||||
if not files:
|
||||
return None
|
||||
|
||||
# Return the most recent file (by filename timestamp)
|
||||
files.sort(reverse=True)
|
||||
return files[0]
|
||||
|
||||
def parse_markdown_sections(self, content: str) -> Dict[str, Dict]:
|
||||
"""Parse markdown content into sections indexed by ID."""
|
||||
sections = {}
|
||||
|
||||
# Split by ID headers
|
||||
parts = content.split('# ID: ')
|
||||
|
||||
for part in parts[1:]: # Skip first empty part
|
||||
if not part.strip():
|
||||
continue
|
||||
|
||||
lines = part.strip().split('\n')
|
||||
section_id = lines[0].strip()
|
||||
|
||||
# Reconstruct full section content
|
||||
section_content = f"# ID: {section_id}\n" + '\n'.join(lines[1:])
|
||||
|
||||
# Extract metadata for comparison
|
||||
metadata = self._extract_metadata(section_content)
|
||||
|
||||
sections[section_id] = {
|
||||
'id': section_id,
|
||||
'content': section_content,
|
||||
'metadata': metadata
|
||||
}
|
||||
|
||||
return sections
|
||||
|
||||
def _extract_metadata(self, content: str) -> Dict[str, Any]:
|
||||
"""Extract metadata from section content for comparison."""
|
||||
metadata = {}
|
||||
|
||||
# Extract common fields
|
||||
patterns = {
|
||||
'views': r'## Views?:\s*([0-9,]+)',
|
||||
'likes': r'## Likes?:\s*([0-9,]+)',
|
||||
'comments': r'## Comments?:\s*([0-9,]+)',
|
||||
'publish_date': r'## Publish(?:ed)? Date:\s*([^\n]+)',
|
||||
'has_caption': r'## Caption Status:',
|
||||
'has_transcript': r'## Transcript:',
|
||||
'description_length': r'## Description:\n(.+?)(?:\n##|\n---|\Z)',
|
||||
}
|
||||
|
||||
for key, pattern in patterns.items():
|
||||
match = re.search(pattern, content, re.DOTALL | re.IGNORECASE)
|
||||
if match:
|
||||
if key in ['views', 'likes', 'comments']:
|
||||
# Convert numeric fields
|
||||
metadata[key] = int(match.group(1).replace(',', ''))
|
||||
elif key in ['has_caption', 'has_transcript']:
|
||||
# Boolean fields
|
||||
metadata[key] = True
|
||||
elif key == 'description_length':
|
||||
# Calculate length of description
|
||||
metadata[key] = len(match.group(1).strip())
|
||||
else:
|
||||
metadata[key] = match.group(1).strip()
|
||||
|
||||
return metadata
|
||||
|
||||
def should_update_section(self, old_section: Dict, new_section: Dict) -> bool:
|
||||
"""Determine if a section should be updated with new content."""
|
||||
old_meta = old_section.get('metadata', {})
|
||||
new_meta = new_section.get('metadata', {})
|
||||
|
||||
# Update if new section has captions/transcripts that old doesn't
|
||||
if new_meta.get('has_caption') and not old_meta.get('has_caption'):
|
||||
return True
|
||||
if new_meta.get('has_transcript') and not old_meta.get('has_transcript'):
|
||||
return True
|
||||
|
||||
# Update if new section has more content
|
||||
old_desc_len = old_meta.get('description_length', 0)
|
||||
new_desc_len = new_meta.get('description_length', 0)
|
||||
if new_desc_len > old_desc_len * 1.2: # 20% more content
|
||||
return True
|
||||
|
||||
# Update if metrics have changed significantly (for incremental updates)
|
||||
for metric in ['views', 'likes', 'comments']:
|
||||
old_val = old_meta.get(metric, 0)
|
||||
new_val = new_meta.get(metric, 0)
|
||||
if new_val > old_val:
|
||||
return True
|
||||
|
||||
# Update if content is substantially different
|
||||
if len(new_section['content']) > len(old_section['content']) * 1.1:
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def merge_content(self, existing_sections: Dict[str, Dict],
|
||||
new_items: List[Dict[str, Any]],
|
||||
formatter_func) -> str:
|
||||
"""Merge new content with existing sections."""
|
||||
# Convert new items to sections
|
||||
new_content = formatter_func(new_items)
|
||||
new_sections = self.parse_markdown_sections(new_content)
|
||||
|
||||
# Track updates
|
||||
added_count = 0
|
||||
updated_count = 0
|
||||
|
||||
# Merge sections
|
||||
for section_id, new_section in new_sections.items():
|
||||
if section_id in existing_sections:
|
||||
# Update existing section if newer/better
|
||||
if self.should_update_section(existing_sections[section_id], new_section):
|
||||
existing_sections[section_id] = new_section
|
||||
updated_count += 1
|
||||
self.logger.info(f"Updated section: {section_id}")
|
||||
else:
|
||||
# Add new section
|
||||
existing_sections[section_id] = new_section
|
||||
added_count += 1
|
||||
self.logger.debug(f"Added new section: {section_id}")
|
||||
|
||||
self.logger.info(f"Merge complete: {added_count} added, {updated_count} updated")
|
||||
|
||||
# Reconstruct markdown content
|
||||
# Sort by ID to maintain consistent order
|
||||
sorted_sections = sorted(existing_sections.values(),
|
||||
key=lambda x: x['id'])
|
||||
|
||||
# For sources with dates, sort by date (newest first)
|
||||
# Try to extract date from content for better sorting
|
||||
for section in sorted_sections:
|
||||
date_match = re.search(r'## Publish(?:ed)? Date:\s*([^\n]+)',
|
||||
section['content'])
|
||||
if date_match:
|
||||
try:
|
||||
# Parse various date formats
|
||||
date_str = date_match.group(1).strip()
|
||||
# Add parsed date for sorting
|
||||
section['sort_date'] = date_str
|
||||
except:
|
||||
pass
|
||||
|
||||
# Sort by date if available, otherwise by ID
|
||||
if any('sort_date' in s for s in sorted_sections):
|
||||
sorted_sections.sort(key=lambda x: x.get('sort_date', ''), reverse=True)
|
||||
|
||||
# Combine sections
|
||||
combined_content = []
|
||||
for section in sorted_sections:
|
||||
combined_content.append(section['content'])
|
||||
combined_content.append("") # Empty line between sections
|
||||
|
||||
return '\n'.join(combined_content)
|
||||
|
||||
def save_cumulative(self, new_items: List[Dict[str, Any]],
|
||||
formatter_func) -> Path:
|
||||
"""Save content cumulatively, merging with existing file if present."""
|
||||
current_file = self.get_current_file()
|
||||
|
||||
if current_file and current_file.exists():
|
||||
# Load and merge with existing content
|
||||
self.logger.info(f"Loading existing file: {current_file.name}")
|
||||
existing_content = current_file.read_text(encoding='utf-8')
|
||||
existing_sections = self.parse_markdown_sections(existing_content)
|
||||
|
||||
# Merge new items with existing sections
|
||||
merged_content = self.merge_content(existing_sections, new_items,
|
||||
formatter_func)
|
||||
|
||||
# Archive the current file before overwriting
|
||||
self._archive_file(current_file)
|
||||
|
||||
else:
|
||||
# First time - just format the new items
|
||||
self.logger.info("No existing file, creating new cumulative file")
|
||||
merged_content = formatter_func(new_items)
|
||||
|
||||
# Generate new filename with current timestamp
|
||||
timestamp = datetime.now(self.tz).strftime('%Y-%m-%dT%H%M%S')
|
||||
filename = f"{self.config.brand_name}_{self.config.source_name}_{timestamp}.md"
|
||||
filepath = self.current_dir / filename
|
||||
|
||||
# Save merged content
|
||||
filepath.write_text(merged_content, encoding='utf-8')
|
||||
self.logger.info(f"Saved cumulative file: {filename}")
|
||||
|
||||
# Remove old file if it exists (we archived it already)
|
||||
if current_file and current_file.exists() and current_file != filepath:
|
||||
current_file.unlink()
|
||||
self.logger.debug(f"Removed old file: {current_file.name}")
|
||||
|
||||
return filepath
|
||||
|
||||
def _archive_file(self, file_path: Path) -> None:
|
||||
"""Archive a file with timestamp suffix."""
|
||||
if not file_path.exists():
|
||||
return
|
||||
|
||||
# Add archive timestamp to filename
|
||||
archive_time = datetime.now(self.tz).strftime('%Y%m%d_%H%M%S')
|
||||
archive_name = f"{file_path.stem}_archived_{archive_time}{file_path.suffix}"
|
||||
archive_path = self.archive_dir / archive_name
|
||||
|
||||
# Copy to archive
|
||||
shutil.copy2(file_path, archive_path)
|
||||
self.logger.debug(f"Archived to: {archive_path.name}")
|
||||
|
||||
def get_statistics(self, file_path: Optional[Path] = None) -> Dict[str, int]:
|
||||
"""Get statistics about the cumulative file."""
|
||||
if not file_path:
|
||||
file_path = self.get_current_file()
|
||||
|
||||
if not file_path or not file_path.exists():
|
||||
return {'total_sections': 0}
|
||||
|
||||
content = file_path.read_text(encoding='utf-8')
|
||||
sections = self.parse_markdown_sections(content)
|
||||
|
||||
stats = {
|
||||
'total_sections': len(sections),
|
||||
'with_captions': sum(1 for s in sections.values()
|
||||
if s['metadata'].get('has_caption')),
|
||||
'with_transcripts': sum(1 for s in sections.values()
|
||||
if s['metadata'].get('has_transcript')),
|
||||
'total_views': sum(s['metadata'].get('views', 0)
|
||||
for s in sections.values()),
|
||||
'file_size_kb': file_path.stat().st_size // 1024
|
||||
}
|
||||
|
||||
return stats
|
||||
|
||||
def update_cumulative_file(self, items: List[Dict[str, Any]], source_name: str) -> Path:
|
||||
"""
|
||||
Update cumulative file for a source using a basic formatter.
|
||||
This is a compatibility method for scripts that expect this interface.
|
||||
"""
|
||||
def basic_formatter(items: List[Dict[str, Any]]) -> str:
|
||||
"""Basic markdown formatter for any source."""
|
||||
sections = []
|
||||
|
||||
for item in items:
|
||||
section = []
|
||||
|
||||
# ID
|
||||
item_id = item.get('id', 'Unknown')
|
||||
section.append(f"# ID: {item_id}")
|
||||
section.append("")
|
||||
|
||||
# Title
|
||||
title = item.get('title', item.get('caption', 'Untitled'))
|
||||
if title:
|
||||
# Truncate very long titles/captions
|
||||
if len(title) > 100:
|
||||
title = title[:97] + "..."
|
||||
section.append(f"## Title: {title}")
|
||||
section.append("")
|
||||
|
||||
# Type
|
||||
item_type = item.get('type', source_name.lower())
|
||||
section.append(f"## Type: {item_type}")
|
||||
section.append("")
|
||||
|
||||
# Link
|
||||
link = item.get('link', item.get('url', ''))
|
||||
if link:
|
||||
section.append(f"## Link: {link}")
|
||||
section.append("")
|
||||
|
||||
# Author/Channel
|
||||
author = item.get('author', item.get('channel', ''))
|
||||
if author:
|
||||
section.append(f"## Author: {author}")
|
||||
section.append("")
|
||||
|
||||
# Publish Date
|
||||
pub_date = item.get('publish_date', item.get('published', ''))
|
||||
if pub_date:
|
||||
section.append(f"## Publish Date: {pub_date}")
|
||||
section.append("")
|
||||
|
||||
# Views
|
||||
views = item.get('views')
|
||||
if views is not None:
|
||||
section.append(f"## Views: {views:,}")
|
||||
section.append("")
|
||||
|
||||
# Likes
|
||||
likes = item.get('likes')
|
||||
if likes is not None:
|
||||
section.append(f"## Likes: {likes:,}")
|
||||
section.append("")
|
||||
|
||||
# Comments
|
||||
comments = item.get('comments')
|
||||
if comments is not None:
|
||||
section.append(f"## Comments: {comments:,}")
|
||||
section.append("")
|
||||
|
||||
# Local images
|
||||
local_images = item.get('local_images', [])
|
||||
if local_images:
|
||||
section.append(f"## Images Downloaded: {len(local_images)}")
|
||||
for i, img_path in enumerate(local_images, 1):
|
||||
rel_path = Path(img_path).relative_to(self.config.data_dir)
|
||||
section.append(f"![Image {i}]({rel_path})")  # assumed image embed; original string body not recoverable
|
||||
section.append("")
|
||||
|
||||
# Local thumbnail
|
||||
local_thumbnail = item.get('local_thumbnail')
|
||||
if local_thumbnail:
|
||||
section.append("## Thumbnail:")
|
||||
rel_path = Path(local_thumbnail).relative_to(self.config.data_dir)
|
||||
section.append(f"![Thumbnail]({rel_path})")  # assumed thumbnail embed; original string body not recoverable
|
||||
section.append("")
|
||||
|
||||
# Description/Caption
|
||||
description = item.get('description', item.get('caption', ''))
|
||||
if description:
|
||||
section.append("## Description:")
|
||||
section.append(description)
|
||||
section.append("")
|
||||
|
||||
# Separator
|
||||
section.append("-" * 50)
|
||||
section.append("")
|
||||
|
||||
sections.append('\n'.join(section))
|
||||
|
||||
return '\n'.join(sections)
|
||||
|
||||
return self.save_cumulative(items, basic_formatter)
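Everything above funnels through save_cumulative(items, formatter_func): the first call creates the file, later calls merge, archive the previous copy, and only replace a section when should_update_section() judges the new version richer. A small sketch of driving the manager directly, with the config object assumed to carry data_dir, timezone, brand_name and source_name as used above:

```
from src.cumulative_markdown_manager import CumulativeMarkdownManager

def formatter(batch):
    # Sections must start with '# ID: ...' so parse_markdown_sections() can index them.
    return '\n'.join(f"# ID: {it['id']}\n## Title: {it['title']}\n" + '-' * 50 for it in batch)

items = [{'id': 'abc123', 'title': 'Example post'}]

# manager = CumulativeMarkdownManager(config)   # config assumed, as described above
# path = manager.save_cumulative(items, formatter)
# print(manager.get_statistics(path))
```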
@ -15,7 +15,7 @@ class InstagramScraper(BaseScraper):
        super().__init__(config)
        self.username = os.getenv('INSTAGRAM_USERNAME')
        self.password = os.getenv('INSTAGRAM_PASSWORD')
        self.target_account = os.getenv('INSTAGRAM_TARGET', 'hkia')
        self.target_account = os.getenv('INSTAGRAM_TARGET', 'hvacknowitall')

        # Session file for persistence (needs .session extension)
        self.session_file = self.config.data_dir / '.sessions' / f'{self.username}.session'
@ -1,116 +0,0 @@
|
|||
"""
|
||||
Instagram scraper with cumulative markdown support and image downloads.
|
||||
"""
|
||||
|
||||
from typing import List, Dict, Any
|
||||
from pathlib import Path
|
||||
from src.instagram_scraper_with_images import InstagramScraperWithImages
|
||||
from src.cumulative_markdown_manager import CumulativeMarkdownManager
import instaloader  # required by run_backlog(), which calls instaloader.Profile.from_username
|
||||
|
||||
|
||||
class InstagramScraperCumulative(InstagramScraperWithImages):
|
||||
"""Instagram scraper that uses cumulative markdown management."""
|
||||
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
self.cumulative_manager = CumulativeMarkdownManager(config)
|
||||
|
||||
def run_incremental(self, max_posts: int = 50) -> tuple:
|
||||
"""Run incremental update with cumulative markdown."""
|
||||
self.logger.info(f"Running Instagram incremental update (max {max_posts} posts)")
|
||||
|
||||
# Fetch new content
|
||||
items = self.fetch_content(max_posts=max_posts)
|
||||
|
||||
if items:
|
||||
# Update cumulative file
|
||||
output_file = self.cumulative_manager.update_cumulative_file(items, 'Instagram')
|
||||
|
||||
self.logger.info(f"✅ Instagram incremental: {len(items)} posts")
|
||||
self.logger.info(f" Updated: {output_file}")
|
||||
|
||||
# Count images
|
||||
img_count = sum(len(item.get('local_images', [])) for item in items)
|
||||
if img_count > 0:
|
||||
self.logger.info(f" Images downloaded: {img_count}")
|
||||
|
||||
return True, len(items), output_file
|
||||
else:
|
||||
self.logger.warning("No new Instagram posts found")
|
||||
return False, 0, None
|
||||
|
||||
def run_backlog(self, start_from: int = 0, max_posts: int = 1000) -> tuple:
|
||||
"""Run backlog capture starting from a specific post number."""
|
||||
self.logger.info(f"Running Instagram backlog (posts {start_from} to {start_from + max_posts})")
|
||||
|
||||
# For backlog, we need to skip already captured posts
|
||||
# This is a simplified approach - in production you'd track exact post IDs
|
||||
all_items = []
|
||||
|
||||
try:
|
||||
# Get profile
|
||||
profile = instaloader.Profile.from_username(self.loader.context, self.target_account)
|
||||
self._check_rate_limit()
|
||||
|
||||
# Get posts
|
||||
posts = profile.get_posts()
|
||||
|
||||
# Skip to start position
|
||||
for i, post in enumerate(posts):
|
||||
if i < start_from:
|
||||
continue
|
||||
if i >= start_from + max_posts:
|
||||
break
|
||||
|
||||
try:
|
||||
# Download images for this post
|
||||
image_paths = self._download_post_images(post, post.shortcode)
|
||||
|
||||
# Extract post data
|
||||
post_data = {
|
||||
'id': post.shortcode,
|
||||
'type': self._get_post_type(post),
|
||||
'caption': post.caption if post.caption else '',
|
||||
'author': post.owner_username,
|
||||
'publish_date': post.date_utc.isoformat(),
|
||||
'link': f'https://www.instagram.com/p/{post.shortcode}/',
|
||||
'likes': post.likes,
|
||||
'comments': post.comments,
|
||||
'views': post.video_view_count if hasattr(post, 'video_view_count') else None,
|
||||
'media_count': post.mediacount if hasattr(post, 'mediacount') else 1,
|
||||
'hashtags': list(post.caption_hashtags) if post.caption else [],
|
||||
'mentions': list(post.caption_mentions) if post.caption else [],
|
||||
'is_video': getattr(post, 'is_video', False),
|
||||
'local_images': image_paths
|
||||
}
|
||||
|
||||
all_items.append(post_data)
|
||||
|
||||
# Rate limiting
|
||||
self._aggressive_delay()
|
||||
self._check_rate_limit()
|
||||
|
||||
# Progress
|
||||
if len(all_items) % 10 == 0:
|
||||
self.logger.info(f"Fetched {len(all_items)}/{max_posts} posts (starting from {start_from})")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error processing post: {e}")
|
||||
continue
|
||||
|
||||
if all_items:
|
||||
# Update cumulative file
|
||||
output_file = self.cumulative_manager.update_cumulative_file(all_items, 'Instagram')
|
||||
|
||||
self.logger.info(f"✅ Instagram backlog: {len(all_items)} posts")
|
||||
self.logger.info(f" Posts {start_from} to {start_from + len(all_items)}")
|
||||
self.logger.info(f" Updated: {output_file}")
|
||||
|
||||
return True, len(all_items), output_file
|
||||
else:
|
||||
self.logger.warning(f"No posts fetched in range {start_from} to {start_from + max_posts}")
|
||||
return False, 0, None
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Backlog error: {e}")
|
||||
return False, 0, None
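Together, the two entry points above cover both the twice-daily runs and the one-time historical capture; the intended call pattern looks like this (config construction is assumed, and the backlog offset is only a positional skip, as noted in the code):

```
from src.instagram_scraper_cumulative import InstagramScraperCumulative

# scraper = InstagramScraperCumulative(config)

# Daily run: newest posts only, merged into the cumulative markdown file.
# ok, count, path = scraper.run_incremental(max_posts=50)

# Backlog: resume deeper in the profile, e.g. posts 500-999.
# ok, count, path = scraper.run_backlog(start_from=500, max_posts=500)

# Both return (success, item_count, output_file); output_file is None on failure.
```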
@ -1,300 +0,0 @@
|
|||
"""
|
||||
Enhanced Instagram scraper that downloads all images (but not videos).
|
||||
"""
|
||||
|
||||
import os
|
||||
import time
|
||||
import random
|
||||
from typing import Any, Dict, List, Optional
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
import instaloader
|
||||
from src.instagram_scraper import InstagramScraper
|
||||
|
||||
|
||||
class InstagramScraperWithImages(InstagramScraper):
|
||||
"""Instagram scraper that downloads all post images."""
|
||||
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
# Create media directory for Instagram
|
||||
self.media_dir = self.config.data_dir / "media" / "Instagram"
|
||||
self.media_dir.mkdir(parents=True, exist_ok=True)
|
||||
self.logger.info(f"Instagram media directory: {self.media_dir}")
|
||||
|
||||
def _download_post_images(self, post, post_id: str) -> List[str]:
|
||||
"""Download all images from a post (skip videos)."""
|
||||
image_paths = []
|
||||
|
||||
try:
|
||||
# Check if it's a video post - skip downloading video
|
||||
if getattr(post, 'is_video', False):
|
||||
# Videos might have a thumbnail we can grab
|
||||
if hasattr(post, 'url'):
|
||||
# This is usually the video thumbnail
|
||||
thumbnail_url = post.url
|
||||
local_path = self.download_media(
|
||||
thumbnail_url,
|
||||
f"instagram_{post_id}_video_thumb",
|
||||
"image"
|
||||
)
|
||||
if local_path:
|
||||
image_paths.append(local_path)
|
||||
self.logger.info(f"Downloaded video thumbnail for {post_id}")
|
||||
else:
|
||||
# Single image or carousel
|
||||
if hasattr(post, 'mediacount') and post.mediacount > 1:
|
||||
# Carousel post with multiple images
|
||||
image_num = 1
|
||||
for node in post.get_sidecar_nodes():
|
||||
# Skip video nodes in carousel
|
||||
if not node.is_video:
|
||||
image_url = node.display_url
|
||||
local_path = self.download_media(
|
||||
image_url,
|
||||
f"instagram_{post_id}_image_{image_num}",
|
||||
"image"
|
||||
)
|
||||
if local_path:
|
||||
image_paths.append(local_path)
|
||||
self.logger.info(f"Downloaded carousel image {image_num} for {post_id}")
|
||||
image_num += 1
|
||||
else:
|
||||
# Single image post
|
||||
if hasattr(post, 'url'):
|
||||
image_url = post.url
|
||||
local_path = self.download_media(
|
||||
image_url,
|
||||
f"instagram_{post_id}_image",
|
||||
"image"
|
||||
)
|
||||
if local_path:
|
||||
image_paths.append(local_path)
|
||||
self.logger.info(f"Downloaded image for {post_id}")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error downloading images for post {post_id}: {e}")
|
||||
|
||||
return image_paths
|
||||
|
||||
def fetch_posts(self, max_posts: int = 20) -> List[Dict[str, Any]]:
|
||||
"""Fetch posts from Instagram profile with image downloads."""
|
||||
posts_data = []
|
||||
|
||||
try:
|
||||
# Ensure we have a valid context
|
||||
if not self.loader.context:
|
||||
self.logger.warning("Instagram context not initialized, attempting re-login")
|
||||
self._login()
|
||||
|
||||
if not self.loader.context:
|
||||
self.logger.error("Failed to initialize Instagram context")
|
||||
return posts_data
|
||||
|
||||
self.logger.info(f"Fetching posts with images from @{self.target_account}")
|
||||
|
||||
# Get profile
|
||||
profile = instaloader.Profile.from_username(self.loader.context, self.target_account)
|
||||
self._check_rate_limit()
|
||||
|
||||
# Get posts
|
||||
posts = profile.get_posts()
|
||||
|
||||
count = 0
|
||||
for post in posts:
|
||||
if count >= max_posts:
|
||||
break
|
||||
|
||||
try:
|
||||
# Download images for this post
|
||||
image_paths = self._download_post_images(post, post.shortcode)
|
||||
|
||||
# Extract post data
|
||||
post_data = {
|
||||
'id': post.shortcode,
|
||||
'type': self._get_post_type(post),
|
||||
'caption': post.caption if post.caption else '',
|
||||
'author': post.owner_username,
|
||||
'publish_date': post.date_utc.isoformat(),
|
||||
'link': f'https://www.instagram.com/p/{post.shortcode}/',
|
||||
'likes': post.likes,
|
||||
'comments': post.comments,
|
||||
'views': post.video_view_count if hasattr(post, 'video_view_count') else None,
|
||||
'media_count': post.mediacount if hasattr(post, 'mediacount') else 1,
|
||||
'hashtags': list(post.caption_hashtags) if post.caption else [],
|
||||
'mentions': list(post.caption_mentions) if post.caption else [],
|
||||
'is_video': getattr(post, 'is_video', False),
|
||||
'local_images': image_paths # Add downloaded image paths
|
||||
}
|
||||
|
||||
posts_data.append(post_data)
|
||||
count += 1
|
||||
|
||||
# Aggressive rate limiting between posts
|
||||
self._aggressive_delay()
|
||||
self._check_rate_limit()
|
||||
|
||||
# Log progress
|
||||
if count % 5 == 0:
|
||||
self.logger.info(f"Fetched {count}/{max_posts} posts with images")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error processing post: {e}")
|
||||
continue
|
||||
|
||||
self.logger.info(f"Successfully fetched {len(posts_data)} posts with images")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching posts: {e}")
|
||||
|
||||
return posts_data
|
||||
|
||||
def fetch_stories(self) -> List[Dict[str, Any]]:
|
||||
"""Fetch stories from Instagram profile with image downloads."""
|
||||
stories_data = []
|
||||
|
||||
try:
|
||||
# Ensure we have a valid context
|
||||
if not self.loader.context:
|
||||
self.logger.warning("Instagram context not initialized, attempting re-login")
|
||||
self._login()
|
||||
|
||||
if not self.loader.context:
|
||||
self.logger.error("Failed to initialize Instagram context")
|
||||
return stories_data
|
||||
|
||||
self.logger.info(f"Fetching stories with images from @{self.target_account}")
|
||||
|
||||
# Get profile
|
||||
profile = instaloader.Profile.from_username(self.loader.context, self.target_account)
|
||||
self._check_rate_limit()
|
||||
|
||||
# Get user ID for stories
|
||||
userid = profile.userid
|
||||
|
||||
# Get stories
|
||||
for story in self.loader.get_stories(userids=[userid]):
|
||||
for item in story:
|
||||
try:
|
||||
# Download story image (skip video stories)
|
||||
image_paths = []
|
||||
if not item.is_video and hasattr(item, 'url'):
|
||||
local_path = self.download_media(
|
||||
item.url,
|
||||
f"instagram_{item.mediaid}_story",
|
||||
"image"
|
||||
)
|
||||
if local_path:
|
||||
image_paths.append(local_path)
|
||||
self.logger.info(f"Downloaded story image {item.mediaid}")
|
||||
|
||||
story_data = {
|
||||
'id': item.mediaid,
|
||||
'type': 'story',
|
||||
'caption': '', # Stories usually don't have captions
|
||||
'author': item.owner_username,
|
||||
'publish_date': item.date_utc.isoformat(),
|
||||
'link': f'https://www.instagram.com/stories/{item.owner_username}/{item.mediaid}/',
|
||||
'is_video': item.is_video if hasattr(item, 'is_video') else False,
|
||||
'local_images': image_paths # Add downloaded image paths
|
||||
}
|
||||
|
||||
stories_data.append(story_data)
|
||||
|
||||
# Rate limiting
|
||||
self._aggressive_delay()
|
||||
self._check_rate_limit()
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error processing story: {e}")
|
||||
continue
|
||||
|
||||
self.logger.info(f"Successfully fetched {len(stories_data)} stories with images")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching stories: {e}")
|
||||
|
||||
return stories_data
|
||||
|
||||
def format_markdown(self, items: List[Dict[str, Any]]) -> str:
|
||||
"""Format Instagram content as markdown with image references."""
|
||||
markdown_sections = []
|
||||
|
||||
for item in items:
|
||||
section = []
|
||||
|
||||
# ID
|
||||
section.append(f"# ID: {item.get('id', 'N/A')}")
|
||||
section.append("")
|
||||
|
||||
# Type
|
||||
section.append(f"## Type: {item.get('type', 'post')}")
|
||||
section.append("")
|
||||
|
||||
# Link
|
||||
section.append(f"## Link: {item.get('link', '')}")
|
||||
section.append("")
|
||||
|
||||
# Author
|
||||
section.append(f"## Author: {item.get('author', 'N/A')}")
|
||||
section.append("")
|
||||
|
||||
# Publish Date
|
||||
section.append(f"## Publish Date: {item.get('publish_date', 'N/A')}")
|
||||
section.append("")
|
||||
|
||||
# Caption
|
||||
if item.get('caption'):
|
||||
section.append("## Caption:")
|
||||
section.append(item['caption'])
|
||||
section.append("")
|
||||
|
||||
# Engagement metrics
|
||||
if item.get('likes') is not None:
|
||||
section.append(f"## Likes: {item.get('likes', 0)}")
|
||||
section.append("")
|
||||
|
||||
if item.get('comments') is not None:
|
||||
section.append(f"## Comments: {item.get('comments', 0)}")
|
||||
section.append("")
|
||||
|
||||
if item.get('views') is not None:
|
||||
section.append(f"## Views: {item.get('views', 0)}")
|
||||
section.append("")
|
||||
|
||||
# Local images
|
||||
if item.get('local_images'):
|
||||
section.append("## Downloaded Images:")
|
||||
for img_path in item['local_images']:
|
||||
# Convert to relative path for markdown
|
||||
rel_path = Path(img_path).relative_to(self.config.data_dir)
|
||||
section.append(f"- [{rel_path.name}]({rel_path})")
|
||||
section.append("")
|
||||
|
||||
# Hashtags
|
||||
if item.get('hashtags'):
|
||||
section.append(f"## Hashtags: {' '.join(['#' + tag for tag in item['hashtags']])}")
|
||||
section.append("")
|
||||
|
||||
# Mentions
|
||||
if item.get('mentions'):
|
||||
section.append(f"## Mentions: {' '.join(['@' + mention for mention in item['mentions']])}")
|
||||
section.append("")
|
||||
|
||||
# Media count
|
||||
if item.get('media_count') and item['media_count'] > 1:
|
||||
section.append(f"## Media Count: {item['media_count']}")
|
||||
section.append("")
|
||||
|
||||
# Is video
|
||||
if item.get('is_video'):
|
||||
section.append("## Media Type: Video (thumbnail downloaded)")
|
||||
section.append("")
|
||||
|
||||
# Separator
|
||||
section.append("-" * 50)
|
||||
section.append("")
|
||||
|
||||
markdown_sections.append('\n'.join(section))
|
||||
|
||||
return '\n'.join(markdown_sections)
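The "Downloaded Images" links above are built relative to the configured data directory, so the markdown stays portable when the tree is synced to the NAS. A tiny runnable illustration with placeholder paths:

```
from pathlib import Path

data_dir = Path('data')                       # stands in for ScraperConfig.data_dir
img = data_dir / 'media' / 'Instagram' / 'instagram_Cxyz_image_1.jpg'

rel = img.relative_to(data_dir)
print(f"- [{rel.name}]({rel})")
# -> - [instagram_Cxyz_image_1.jpg](media/Instagram/instagram_Cxyz_image_1.jpg)
```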
@ -1,355 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
MailChimp API scraper for fetching campaign data and metrics
|
||||
Fetches only campaigns from "Bi-Weekly Newsletter" folder
|
||||
"""
|
||||
|
||||
import os
|
||||
import time
|
||||
import requests
|
||||
from typing import Any, Dict, List, Optional
|
||||
from datetime import datetime
|
||||
from src.base_scraper import BaseScraper, ScraperConfig
|
||||
import logging
|
||||
|
||||
|
||||
class MailChimpAPIScraper(BaseScraper):
|
||||
"""MailChimp API scraper for campaigns and metrics."""
|
||||
|
||||
def __init__(self, config: ScraperConfig):
|
||||
super().__init__(config)
|
||||
|
||||
self.api_key = os.getenv('MAILCHIMP_API_KEY')
|
||||
self.server_prefix = os.getenv('MAILCHIMP_SERVER_PREFIX', 'us10')
|
||||
|
||||
if not self.api_key:
|
||||
raise ValueError("MAILCHIMP_API_KEY not found in environment variables")
|
||||
|
||||
self.base_url = f"https://{self.server_prefix}.api.mailchimp.com/3.0"
|
||||
self.headers = {
|
||||
'Authorization': f'Bearer {self.api_key}',
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
# Cache folder ID for "Bi-Weekly Newsletter"
|
||||
self.target_folder_id = None
|
||||
self.target_folder_name = "Bi-Weekly Newsletter"
|
||||
|
||||
self.logger.info(f"Initialized MailChimp API scraper for server: {self.server_prefix}")
|
||||
|
||||
def _test_connection(self) -> bool:
|
||||
"""Test API connection."""
|
||||
try:
|
||||
response = requests.get(f"{self.base_url}/ping", headers=self.headers)
|
||||
if response.status_code == 200:
|
||||
self.logger.info("MailChimp API connection successful")
|
||||
return True
|
||||
else:
|
||||
self.logger.error(f"MailChimp API connection failed: {response.status_code}")
|
||||
return False
|
||||
except Exception as e:
|
||||
self.logger.error(f"MailChimp API connection error: {e}")
|
||||
return False
|
||||
|
||||
def _get_folder_id(self) -> Optional[str]:
|
||||
"""Get the folder ID for 'Bi-Weekly Newsletter'."""
|
||||
if self.target_folder_id:
|
||||
return self.target_folder_id
|
||||
|
||||
try:
|
||||
response = requests.get(
|
||||
f"{self.base_url}/campaign-folders",
|
||||
headers=self.headers,
|
||||
params={'count': 100}
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
folders_data = response.json()
|
||||
for folder in folders_data.get('folders', []):
|
||||
if folder['name'] == self.target_folder_name:
|
||||
self.target_folder_id = folder['id']
|
||||
self.logger.info(f"Found '{self.target_folder_name}' folder: {self.target_folder_id}")
|
||||
return self.target_folder_id
|
||||
|
||||
self.logger.warning(f"'{self.target_folder_name}' folder not found")
|
||||
else:
|
||||
self.logger.error(f"Failed to fetch folders: {response.status_code}")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching folders: {e}")
|
||||
|
||||
return None
|
||||
|
||||
def _fetch_campaign_content(self, campaign_id: str) -> Optional[Dict[str, Any]]:
|
||||
"""Fetch campaign content."""
|
||||
try:
|
||||
response = requests.get(
|
||||
f"{self.base_url}/campaigns/{campaign_id}/content",
|
||||
headers=self.headers
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
return response.json()
|
||||
else:
|
||||
self.logger.warning(f"Failed to fetch content for campaign {campaign_id}: {response.status_code}")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching campaign content: {e}")
|
||||
return None
|
||||
|
||||
def _fetch_campaign_report(self, campaign_id: str) -> Optional[Dict[str, Any]]:
|
||||
"""Fetch campaign report with metrics."""
|
||||
try:
|
||||
response = requests.get(
|
||||
f"{self.base_url}/reports/{campaign_id}",
|
||||
headers=self.headers
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
return response.json()
|
||||
else:
|
||||
self.logger.warning(f"Failed to fetch report for campaign {campaign_id}: {response.status_code}")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching campaign report: {e}")
|
||||
return None
|
||||
|
||||
def fetch_content(self, max_items: int = None) -> List[Dict[str, Any]]:
|
||||
"""Fetch campaigns from MailChimp API."""
|
||||
|
||||
# Test connection first
|
||||
if not self._test_connection():
|
||||
self.logger.error("Failed to connect to MailChimp API")
|
||||
return []
|
||||
|
||||
# Get folder ID
|
||||
folder_id = self._get_folder_id()
|
||||
|
||||
# Prepare parameters
|
||||
params = {
|
||||
'count': max_items or 1000, # Default to 1000 if not specified
|
||||
'status': 'sent', # Only sent campaigns
|
||||
'sort_field': 'send_time',
|
||||
'sort_dir': 'DESC'
|
||||
}
|
||||
|
||||
if folder_id:
|
||||
params['folder_id'] = folder_id
|
||||
self.logger.info(f"Fetching campaigns from '{self.target_folder_name}' folder")
|
||||
else:
|
||||
self.logger.info("Fetching all sent campaigns")
|
||||
|
||||
try:
|
||||
response = requests.get(
|
||||
f"{self.base_url}/campaigns",
|
||||
headers=self.headers,
|
||||
params=params
|
||||
)
|
||||
|
||||
if response.status_code != 200:
|
||||
self.logger.error(f"Failed to fetch campaigns: {response.status_code}")
|
||||
return []
|
||||
|
||||
campaigns_data = response.json()
|
||||
campaigns = campaigns_data.get('campaigns', [])
|
||||
|
||||
self.logger.info(f"Found {len(campaigns)} campaigns")
|
||||
|
||||
# Enrich each campaign with content and metrics
|
||||
enriched_campaigns = []
|
||||
|
||||
for campaign in campaigns:
|
||||
campaign_id = campaign['id']
|
||||
|
||||
# Add basic campaign info
|
||||
enriched_campaign = {
|
||||
'id': campaign_id,
|
||||
'title': campaign.get('settings', {}).get('subject_line', 'Untitled'),
|
||||
'preview_text': campaign.get('settings', {}).get('preview_text', ''),
|
||||
'from_name': campaign.get('settings', {}).get('from_name', ''),
|
||||
'reply_to': campaign.get('settings', {}).get('reply_to', ''),
|
||||
'send_time': campaign.get('send_time'),
|
||||
'status': campaign.get('status'),
|
||||
'type': campaign.get('type', 'regular'),
|
||||
'archive_url': campaign.get('archive_url', ''),
|
||||
'long_archive_url': campaign.get('long_archive_url', ''),
|
||||
'folder_id': campaign.get('settings', {}).get('folder_id')
|
||||
}
|
||||
|
||||
# Fetch content
|
||||
content_data = self._fetch_campaign_content(campaign_id)
|
||||
if content_data:
|
||||
enriched_campaign['plain_text'] = content_data.get('plain_text', '')
|
||||
enriched_campaign['html'] = content_data.get('html', '')
|
||||
# Convert HTML to markdown if needed
|
||||
if enriched_campaign['html'] and not enriched_campaign['plain_text']:
|
||||
enriched_campaign['plain_text'] = self.convert_to_markdown(
|
||||
enriched_campaign['html'],
|
||||
content_type="text/html"
|
||||
)
|
||||
|
||||
# Fetch metrics
|
||||
report_data = self._fetch_campaign_report(campaign_id)
|
||||
if report_data:
|
||||
enriched_campaign['metrics'] = {
|
||||
'emails_sent': report_data.get('emails_sent', 0),
|
||||
'unique_opens': report_data.get('opens', {}).get('unique_opens', 0),
|
||||
'open_rate': report_data.get('opens', {}).get('open_rate', 0),
|
||||
'total_opens': report_data.get('opens', {}).get('opens_total', 0),
|
||||
'unique_clicks': report_data.get('clicks', {}).get('unique_clicks', 0),
|
||||
'click_rate': report_data.get('clicks', {}).get('click_rate', 0),
|
||||
'total_clicks': report_data.get('clicks', {}).get('clicks_total', 0),
|
||||
'unsubscribed': report_data.get('unsubscribed', 0),
|
||||
'bounces': {
|
||||
'hard': report_data.get('bounces', {}).get('hard_bounces', 0),
|
||||
'soft': report_data.get('bounces', {}).get('soft_bounces', 0),
|
||||
'syntax_errors': report_data.get('bounces', {}).get('syntax_errors', 0)
|
||||
},
|
||||
'abuse_reports': report_data.get('abuse_reports', 0),
|
||||
'forwards': {
|
||||
'count': report_data.get('forwards', {}).get('forwards_count', 0),
|
||||
'opens': report_data.get('forwards', {}).get('forwards_opens', 0)
|
||||
}
|
||||
}
|
||||
else:
|
||||
enriched_campaign['metrics'] = {}
|
||||
|
||||
enriched_campaigns.append(enriched_campaign)
|
||||
|
||||
# Add small delay to avoid rate limiting
|
||||
time.sleep(0.5)
|
||||
|
||||
return enriched_campaigns
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching campaigns: {e}")
|
||||
return []
|
||||
|
||||
def format_markdown(self, campaigns: List[Dict[str, Any]]) -> str:
|
||||
"""Format campaigns as markdown with enhanced metrics."""
|
||||
markdown_sections = []
|
||||
|
||||
for campaign in campaigns:
|
||||
section = []
|
||||
|
||||
# ID
|
||||
section.append(f"# ID: {campaign.get('id', 'N/A')}")
|
||||
section.append("")
|
||||
|
||||
# Title
|
||||
section.append(f"## Title: {campaign.get('title', 'Untitled')}")
|
||||
section.append("")
|
||||
|
||||
# Type
|
||||
section.append(f"## Type: email_campaign")
|
||||
section.append("")
|
||||
|
||||
# Send Time
|
||||
send_time = campaign.get('send_time', '')
|
||||
if send_time:
|
||||
section.append(f"## Send Date: {send_time}")
|
||||
section.append("")
|
||||
|
||||
# From and Reply-to
|
||||
from_name = campaign.get('from_name', '')
|
||||
reply_to = campaign.get('reply_to', '')
|
||||
if from_name:
|
||||
section.append(f"## From: {from_name}")
|
||||
if reply_to:
|
||||
section.append(f"## Reply To: {reply_to}")
|
||||
section.append("")
|
||||
|
||||
# Archive URL
|
||||
archive_url = campaign.get('long_archive_url') or campaign.get('archive_url', '')
|
||||
if archive_url:
|
||||
section.append(f"## Archive URL: {archive_url}")
|
||||
section.append("")
|
||||
|
||||
# Metrics
|
||||
metrics = campaign.get('metrics', {})
|
||||
if metrics:
|
||||
section.append("## Metrics:")
|
||||
section.append(f"### Emails Sent: {metrics.get('emails_sent', 0)}")
|
||||
section.append(f"### Opens: {metrics.get('unique_opens', 0)} unique ({metrics.get('open_rate', 0)*100:.1f}%)")
|
||||
section.append(f"### Clicks: {metrics.get('unique_clicks', 0)} unique ({metrics.get('click_rate', 0)*100:.1f}%)")
|
||||
section.append(f"### Unsubscribes: {metrics.get('unsubscribed', 0)}")
|
||||
|
||||
bounces = metrics.get('bounces', {})
|
||||
total_bounces = bounces.get('hard', 0) + bounces.get('soft', 0)
|
||||
if total_bounces > 0:
|
||||
section.append(f"### Bounces: {total_bounces} (Hard: {bounces.get('hard', 0)}, Soft: {bounces.get('soft', 0)})")
|
||||
|
||||
if metrics.get('abuse_reports', 0) > 0:
|
||||
section.append(f"### Abuse Reports: {metrics.get('abuse_reports', 0)}")
|
||||
|
||||
forwards = metrics.get('forwards', {})
|
||||
if forwards.get('count', 0) > 0:
|
||||
section.append(f"### Forwards: {forwards.get('count', 0)}")
|
||||
|
||||
section.append("")
|
||||
|
||||
# Preview Text
|
||||
preview_text = campaign.get('preview_text', '')
|
||||
if preview_text:
|
||||
section.append(f"## Preview Text:")
|
||||
section.append(preview_text)
|
||||
section.append("")
|
||||
|
||||
# Content
|
||||
content = campaign.get('plain_text', '')
|
||||
if content:
|
||||
section.append("## Content:")
|
||||
section.append(content)
|
||||
section.append("")
|
||||
|
||||
# Separator
|
||||
section.append("-" * 50)
|
||||
section.append("")
|
||||
|
||||
markdown_sections.append('\n'.join(section))
|
||||
|
||||
return '\n'.join(markdown_sections)
|
||||
|
||||
def get_incremental_items(self, items: List[Dict[str, Any]], state: Dict[str, Any]) -> List[Dict[str, Any]]:
|
||||
"""Get only new campaigns since last sync."""
|
||||
if not state:
|
||||
return items
|
||||
|
||||
last_campaign_id = state.get('last_campaign_id')
|
||||
last_send_time = state.get('last_send_time')
|
||||
|
||||
if not last_campaign_id:
|
||||
return items
|
||||
|
||||
# Filter for campaigns newer than the last synced
|
||||
new_items = []
|
||||
for item in items:
|
||||
if item.get('id') == last_campaign_id:
|
||||
break # Found the last synced campaign
|
||||
|
||||
# Also check by send time as backup
|
||||
if last_send_time and item.get('send_time'):
|
||||
if item['send_time'] <= last_send_time:
|
||||
continue
|
||||
|
||||
new_items.append(item)
|
||||
|
||||
return new_items
|
||||
|
||||
def update_state(self, state: Dict[str, Any], items: List[Dict[str, Any]]) -> Dict[str, Any]:
|
||||
"""Update state with latest campaign information."""
|
||||
if not items:
|
||||
return state
|
||||
|
||||
# Get the first item (most recent)
|
||||
latest_item = items[0]
|
||||
|
||||
state['last_campaign_id'] = latest_item.get('id')
|
||||
state['last_send_time'] = latest_item.get('send_time')
|
||||
state['last_campaign_title'] = latest_item.get('title')
|
||||
state['last_sync'] = datetime.now(self.tz).isoformat()
|
||||
state['campaign_count'] = len(items)
|
||||
|
||||
return state
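The incremental logic above walks the newest-first campaign list, stops at the last synced campaign ID, and uses send_time as a secondary guard. A self-contained sketch of that filtering with simplified campaign dicts:

```
def incremental(campaigns, state):
    """Return campaigns newer than the one recorded in state (input is newest first)."""
    last_id = state.get('last_campaign_id')
    last_sent = state.get('last_send_time')
    if not last_id:
        return campaigns
    fresh = []
    for c in campaigns:
        if c['id'] == last_id:
            break                             # everything from here on was already synced
        if last_sent and c.get('send_time') and c['send_time'] <= last_sent:
            continue
        fresh.append(c)
    return fresh

state = {'last_campaign_id': 'b2', 'last_send_time': '2025-08-01T12:00:00+00:00'}
campaigns = [
    {'id': 'c3', 'send_time': '2025-08-15T12:00:00+00:00'},
    {'id': 'b2', 'send_time': '2025-08-01T12:00:00+00:00'},
    {'id': 'a1', 'send_time': '2025-07-15T12:00:00+00:00'},
]
print([c['id'] for c in incremental(campaigns, state)])   # -> ['c3']
```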
@ -1,410 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
MailChimp API scraper for fetching campaign data and metrics
|
||||
Fetches only campaigns from "Bi-Weekly Newsletter" folder
|
||||
Cleans headers and footers from content
|
||||
"""
|
||||
|
||||
import os
|
||||
import time
|
||||
import requests
|
||||
import re
|
||||
from typing import Any, Dict, List, Optional
|
||||
from datetime import datetime
|
||||
from src.base_scraper import BaseScraper, ScraperConfig
|
||||
import logging
|
||||
|
||||
|
||||
class MailChimpAPIScraper(BaseScraper):
|
||||
"""MailChimp API scraper for campaigns and metrics."""
|
||||
|
||||
def __init__(self, config: ScraperConfig):
|
||||
super().__init__(config)
|
||||
|
||||
self.api_key = os.getenv('MAILCHIMP_API_KEY')
|
||||
self.server_prefix = os.getenv('MAILCHIMP_SERVER_PREFIX', 'us10')
|
||||
|
||||
if not self.api_key:
|
||||
raise ValueError("MAILCHIMP_API_KEY not found in environment variables")
|
||||
|
||||
self.base_url = f"https://{self.server_prefix}.api.mailchimp.com/3.0"
|
||||
self.headers = {
|
||||
'Authorization': f'Bearer {self.api_key}',
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
# Cache folder ID for "Bi-Weekly Newsletter"
|
||||
self.target_folder_id = None
|
||||
self.target_folder_name = "Bi-Weekly Newsletter"
|
||||
|
||||
self.logger.info(f"Initialized MailChimp API scraper for server: {self.server_prefix}")
|
||||
|
||||
def _clean_content(self, content: str) -> str:
|
||||
"""Clean unwanted headers and footers from MailChimp content."""
|
||||
if not content:
|
||||
return content
|
||||
|
||||
# Patterns to remove
|
||||
patterns_to_remove = [
|
||||
# Header patterns
|
||||
r'VIEW THIS EMAIL IN BROWSER[^\n]*\n?',
|
||||
r'\(\*\|ARCHIVE\|\*\)[^\n]*\n?',
|
||||
r'https://hkia\.com/?\n?',
|
||||
|
||||
# Footer patterns
|
||||
r'Newsletter produced by Teal Maker[^\n]*\n?',
|
||||
r'https://tealmaker\.com[^\n]*\n?',
|
||||
r'https://open\.spotify\.com[^\n]*\n?',
|
||||
r'https://www\.instagram\.com[^\n]*\n?',
|
||||
r'https://www\.youtube\.com[^\n]*\n?',
|
||||
r'https://www\.facebook\.com[^\n]*\n?',
|
||||
r'https://x\.com[^\n]*\n?',
|
||||
r'https://www\.linkedin\.com[^\n]*\n?',
|
||||
r'Copyright \(C\)[^\n]*\n?',
|
||||
r'\*\|CURRENT_YEAR\|\*[^\n]*\n?',
|
||||
r'\*\|LIST:COMPANY\|\*[^\n]*\n?',
|
||||
r'\*\|IFNOT:ARCHIVE_PAGE\|\*[^\n]*\*\|END:IF\|\*\n?',
|
||||
r'\*\|LIST:DESCRIPTION\|\*[^\n]*\n?',
|
||||
r'\*\|LIST_ADDRESS\|\*[^\n]*\n?',
|
||||
r'Our mailing address is:[^\n]*\n?',
|
||||
r'Want to change how you receive these emails\?[^\n]*\n?',
|
||||
r'You can update your preferences[^\n]*\n?',
|
||||
r'\(\*\|UPDATE_PROFILE\|\*\)[^\n]*\n?',
|
||||
r'or unsubscribe[^\n]*\n?',
|
||||
r'\(\*\|UNSUB\|\*\)[^\n]*\n?',
|
||||
|
||||
# Note: runs of 3+ newlines are collapsed to a blank line after this loop
# (see below); removing them here would glue adjacent paragraphs together.
|
||||
]
|
||||
|
||||
cleaned = content
|
||||
for pattern in patterns_to_remove:
|
||||
cleaned = re.sub(pattern, '', cleaned, flags=re.MULTILINE | re.IGNORECASE)
|
||||
|
||||
# Clean up multiple newlines (replace with double newline)
|
||||
cleaned = re.sub(r'\n{3,}', '\n\n', cleaned)
|
||||
|
||||
# Trim whitespace
|
||||
cleaned = cleaned.strip()
|
||||
|
||||
return cleaned
|
||||
|
||||
def _test_connection(self) -> bool:
|
||||
"""Test API connection."""
|
||||
try:
|
||||
response = requests.get(f"{self.base_url}/ping", headers=self.headers)
|
||||
if response.status_code == 200:
|
||||
self.logger.info("MailChimp API connection successful")
|
||||
return True
|
||||
else:
|
||||
self.logger.error(f"MailChimp API connection failed: {response.status_code}")
|
||||
return False
|
||||
except Exception as e:
|
||||
self.logger.error(f"MailChimp API connection error: {e}")
|
||||
return False
|
||||
|
||||
def _get_folder_id(self) -> Optional[str]:
|
||||
"""Get the folder ID for 'Bi-Weekly Newsletter'."""
|
||||
if self.target_folder_id:
|
||||
return self.target_folder_id
|
||||
|
||||
try:
|
||||
response = requests.get(
|
||||
f"{self.base_url}/campaign-folders",
|
||||
headers=self.headers,
|
||||
params={'count': 100}
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
folders_data = response.json()
|
||||
for folder in folders_data.get('folders', []):
|
||||
if folder['name'] == self.target_folder_name:
|
||||
self.target_folder_id = folder['id']
|
||||
self.logger.info(f"Found '{self.target_folder_name}' folder: {self.target_folder_id}")
|
||||
return self.target_folder_id
|
||||
|
||||
self.logger.warning(f"'{self.target_folder_name}' folder not found")
|
||||
else:
|
||||
self.logger.error(f"Failed to fetch folders: {response.status_code}")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching folders: {e}")
|
||||
|
||||
return None
|
||||
|
||||
def _fetch_campaign_content(self, campaign_id: str) -> Optional[Dict[str, Any]]:
|
||||
"""Fetch campaign content."""
|
||||
try:
|
||||
response = requests.get(
|
||||
f"{self.base_url}/campaigns/{campaign_id}/content",
|
||||
headers=self.headers
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
return response.json()
|
||||
else:
|
||||
self.logger.warning(f"Failed to fetch content for campaign {campaign_id}: {response.status_code}")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching campaign content: {e}")
|
||||
return None
|
||||
|
||||
def _fetch_campaign_report(self, campaign_id: str) -> Optional[Dict[str, Any]]:
|
||||
"""Fetch campaign report with metrics."""
|
||||
try:
|
||||
response = requests.get(
|
||||
f"{self.base_url}/reports/{campaign_id}",
|
||||
headers=self.headers
|
||||
)
|
||||
|
||||
if response.status_code == 200:
|
||||
return response.json()
|
||||
else:
|
||||
self.logger.warning(f"Failed to fetch report for campaign {campaign_id}: {response.status_code}")
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching campaign report: {e}")
|
||||
return None
|
||||
|
||||
def fetch_content(self, max_items: int = None) -> List[Dict[str, Any]]:
|
||||
"""Fetch campaigns from MailChimp API."""
|
||||
|
||||
# Test connection first
|
||||
if not self._test_connection():
|
||||
self.logger.error("Failed to connect to MailChimp API")
|
||||
return []
|
||||
|
||||
# Get folder ID
|
||||
folder_id = self._get_folder_id()
|
||||
|
||||
# Prepare parameters
|
||||
params = {
|
||||
'count': max_items or 1000, # Default to 1000 if not specified
|
||||
'status': 'sent', # Only sent campaigns
|
||||
'sort_field': 'send_time',
|
||||
'sort_dir': 'DESC'
|
||||
}
|
||||
|
||||
if folder_id:
|
||||
params['folder_id'] = folder_id
|
||||
self.logger.info(f"Fetching campaigns from '{self.target_folder_name}' folder")
|
||||
else:
|
||||
self.logger.info("Fetching all sent campaigns")
|
||||
|
||||
try:
|
||||
response = requests.get(
|
||||
f"{self.base_url}/campaigns",
|
||||
headers=self.headers,
|
||||
params=params
|
||||
)
|
||||
|
||||
if response.status_code != 200:
|
||||
self.logger.error(f"Failed to fetch campaigns: {response.status_code}")
|
||||
return []
|
||||
|
||||
campaigns_data = response.json()
|
||||
campaigns = campaigns_data.get('campaigns', [])
|
||||
|
||||
self.logger.info(f"Found {len(campaigns)} campaigns")
|
||||
|
||||
# Enrich each campaign with content and metrics
|
||||
enriched_campaigns = []
|
||||
|
||||
for campaign in campaigns:
|
||||
campaign_id = campaign['id']
|
||||
|
||||
# Add basic campaign info
|
||||
enriched_campaign = {
|
||||
'id': campaign_id,
|
||||
'title': campaign.get('settings', {}).get('subject_line', 'Untitled'),
|
||||
'preview_text': campaign.get('settings', {}).get('preview_text', ''),
|
||||
'from_name': campaign.get('settings', {}).get('from_name', ''),
|
||||
'reply_to': campaign.get('settings', {}).get('reply_to', ''),
|
||||
'send_time': campaign.get('send_time'),
|
||||
'status': campaign.get('status'),
|
||||
'type': campaign.get('type', 'regular'),
|
||||
'archive_url': campaign.get('archive_url', ''),
|
||||
'long_archive_url': campaign.get('long_archive_url', ''),
|
||||
'folder_id': campaign.get('settings', {}).get('folder_id')
|
||||
}
|
||||
|
||||
# Fetch content
|
||||
content_data = self._fetch_campaign_content(campaign_id)
|
||||
if content_data:
|
||||
plain_text = content_data.get('plain_text', '')
|
||||
|
||||
# If no plain text, convert HTML first
|
||||
if not plain_text and content_data.get('html'):
|
||||
plain_text = self.convert_to_markdown(
|
||||
content_data['html'],
|
||||
content_type="text/html"
|
||||
)
|
||||
|
||||
# Clean the content (only once, after deciding on source)
|
||||
enriched_campaign['plain_text'] = self._clean_content(plain_text)
|
||||
|
||||
# Fetch metrics
|
||||
report_data = self._fetch_campaign_report(campaign_id)
|
||||
if report_data:
|
||||
enriched_campaign['metrics'] = {
|
||||
'emails_sent': report_data.get('emails_sent', 0),
|
||||
'unique_opens': report_data.get('opens', {}).get('unique_opens', 0),
|
||||
'open_rate': report_data.get('opens', {}).get('open_rate', 0),
|
||||
'total_opens': report_data.get('opens', {}).get('opens_total', 0),
|
||||
'unique_clicks': report_data.get('clicks', {}).get('unique_clicks', 0),
|
||||
'click_rate': report_data.get('clicks', {}).get('click_rate', 0),
|
||||
'total_clicks': report_data.get('clicks', {}).get('clicks_total', 0),
|
||||
'unsubscribed': report_data.get('unsubscribed', 0),
|
||||
'bounces': {
|
||||
'hard': report_data.get('bounces', {}).get('hard_bounces', 0),
|
||||
'soft': report_data.get('bounces', {}).get('soft_bounces', 0),
|
||||
'syntax_errors': report_data.get('bounces', {}).get('syntax_errors', 0)
|
||||
},
|
||||
'abuse_reports': report_data.get('abuse_reports', 0),
|
||||
'forwards': {
|
||||
'count': report_data.get('forwards', {}).get('forwards_count', 0),
|
||||
'opens': report_data.get('forwards', {}).get('forwards_opens', 0)
|
||||
}
|
||||
}
|
||||
else:
|
||||
enriched_campaign['metrics'] = {}
|
||||
|
||||
enriched_campaigns.append(enriched_campaign)
|
||||
|
||||
# Add small delay to avoid rate limiting
|
||||
time.sleep(0.5)
|
||||
|
||||
return enriched_campaigns
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching campaigns: {e}")
|
||||
return []
|
||||
|
||||
def format_markdown(self, campaigns: List[Dict[str, Any]]) -> str:
|
||||
"""Format campaigns as markdown with enhanced metrics."""
|
||||
markdown_sections = []
|
||||
|
||||
for campaign in campaigns:
|
||||
section = []
|
||||
|
||||
# ID
|
||||
section.append(f"# ID: {campaign.get('id', 'N/A')}")
|
||||
section.append("")
|
||||
|
||||
# Title
|
||||
section.append(f"## Title: {campaign.get('title', 'Untitled')}")
|
||||
section.append("")
|
||||
|
||||
# Type
|
||||
section.append(f"## Type: email_campaign")
|
||||
section.append("")
|
||||
|
||||
# Send Time
|
||||
send_time = campaign.get('send_time', '')
|
||||
if send_time:
|
||||
section.append(f"## Send Date: {send_time}")
|
||||
section.append("")
|
||||
|
||||
# From and Reply-to
|
||||
from_name = campaign.get('from_name', '')
|
||||
reply_to = campaign.get('reply_to', '')
|
||||
if from_name:
|
||||
section.append(f"## From: {from_name}")
|
||||
if reply_to:
|
||||
section.append(f"## Reply To: {reply_to}")
|
||||
section.append("")
|
||||
|
||||
# Archive URL
|
||||
archive_url = campaign.get('long_archive_url') or campaign.get('archive_url', '')
|
||||
if archive_url:
|
||||
section.append(f"## Archive URL: {archive_url}")
|
||||
section.append("")
|
||||
|
||||
# Metrics
|
||||
metrics = campaign.get('metrics', {})
|
||||
if metrics:
|
||||
section.append("## Metrics:")
|
||||
section.append(f"### Emails Sent: {metrics.get('emails_sent', 0):,}")
|
||||
section.append(f"### Opens: {metrics.get('unique_opens', 0):,} unique ({metrics.get('open_rate', 0)*100:.1f}%)")
|
||||
section.append(f"### Clicks: {metrics.get('unique_clicks', 0):,} unique ({metrics.get('click_rate', 0)*100:.1f}%)")
|
||||
section.append(f"### Unsubscribes: {metrics.get('unsubscribed', 0)}")
|
||||
|
||||
bounces = metrics.get('bounces', {})
|
||||
total_bounces = bounces.get('hard', 0) + bounces.get('soft', 0)
|
||||
if total_bounces > 0:
|
||||
section.append(f"### Bounces: {total_bounces} (Hard: {bounces.get('hard', 0)}, Soft: {bounces.get('soft', 0)})")
|
||||
|
||||
if metrics.get('abuse_reports', 0) > 0:
|
||||
section.append(f"### Abuse Reports: {metrics.get('abuse_reports', 0)}")
|
||||
|
||||
forwards = metrics.get('forwards', {})
|
||||
if forwards.get('count', 0) > 0:
|
||||
section.append(f"### Forwards: {forwards.get('count', 0)}")
|
||||
|
||||
section.append("")
|
||||
|
||||
# Preview Text
|
||||
preview_text = campaign.get('preview_text', '')
|
||||
if preview_text:
|
||||
section.append(f"## Preview Text:")
|
||||
section.append(preview_text)
|
||||
section.append("")
|
||||
|
||||
# Content (cleaned)
|
||||
content = campaign.get('plain_text', '')
|
||||
if content:
|
||||
section.append("## Content:")
|
||||
section.append(content)
|
||||
section.append("")
|
||||
|
||||
# Separator
|
||||
section.append("-" * 50)
|
||||
section.append("")
|
||||
|
||||
markdown_sections.append('\n'.join(section))
|
||||
|
||||
return '\n'.join(markdown_sections)
|
||||
|
||||
def get_incremental_items(self, items: List[Dict[str, Any]], state: Dict[str, Any]) -> List[Dict[str, Any]]:
|
||||
"""Get only new campaigns since last sync."""
|
||||
if not state:
|
||||
return items
|
||||
|
||||
last_campaign_id = state.get('last_campaign_id')
|
||||
last_send_time = state.get('last_send_time')
|
||||
|
||||
if not last_campaign_id:
|
||||
return items
|
||||
|
||||
# Filter for campaigns newer than the last synced
|
||||
new_items = []
|
||||
for item in items:
|
||||
if item.get('id') == last_campaign_id:
|
||||
break # Found the last synced campaign
|
||||
|
||||
# Also check by send time as backup
|
||||
if last_send_time and item.get('send_time'):
|
||||
if item['send_time'] <= last_send_time:
|
||||
continue
|
||||
|
||||
new_items.append(item)
|
||||
|
||||
return new_items
|
||||
|
||||
def update_state(self, state: Dict[str, Any], items: List[Dict[str, Any]]) -> Dict[str, Any]:
|
||||
"""Update state with latest campaign information."""
|
||||
if not items:
|
||||
return state
|
||||
|
||||
# Get the first item (most recent)
|
||||
latest_item = items[0]
|
||||
|
||||
state['last_campaign_id'] = latest_item.get('id')
|
||||
state['last_send_time'] = latest_item.get('send_time')
|
||||
state['last_campaign_title'] = latest_item.get('title')
|
||||
state['last_sync'] = datetime.now(self.tz).isoformat()
|
||||
state['campaign_count'] = len(items)
|
||||
|
||||
return state
|
||||
|
|
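The `_clean_content` approach above (a list of regexes swept over the plain text, then a newline collapse) can be exercised in isolation. A tiny standalone example using two of the documented patterns; the sample newsletter text is invented:

```python
import re

# Two of the patterns listed in _clean_content above; the sample text is made up.
patterns = [
    r'VIEW THIS EMAIL IN BROWSER[^\n]*\n?',
    r'Copyright \(C\)[^\n]*\n?',
]

sample = (
    "VIEW THIS EMAIL IN BROWSER (*|ARCHIVE|*)\n"
    "Actual newsletter body.\n\n\n\n"
    "Copyright (C) 2025 HVAC Know It All\n"
)

cleaned = sample
for pattern in patterns:
    cleaned = re.sub(pattern, '', cleaned, flags=re.MULTILINE | re.IGNORECASE)
cleaned = re.sub(r'\n{3,}', '\n\n', cleaned).strip()   # collapse leftover blank runs
print(cleaned)   # -> Actual newsletter body.
```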
@ -1,6 +1,6 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
HKIA Content Orchestrator
|
||||
HVAC Know It All Content Orchestrator
|
||||
Coordinates all scrapers and handles NAS synchronization.
|
||||
"""
|
||||
|
||||
|
|
@ -23,7 +23,6 @@ from src.rss_scraper import RSSScraperMailChimp, RSSScraperPodcast
|
|||
from src.youtube_scraper import YouTubeScraper
|
||||
from src.instagram_scraper import InstagramScraper
|
||||
from src.tiktok_scraper_advanced import TikTokScraperAdvanced
|
||||
from src.hvacrschool_scraper import HVACRSchoolScraper
|
||||
|
||||
# Load environment variables
|
||||
load_dotenv()
|
||||
|
|
@ -36,7 +35,7 @@ class ContentOrchestrator:
|
|||
"""Initialize the orchestrator."""
|
||||
self.data_dir = data_dir or Path("/opt/hvac-kia-content/data")
|
||||
self.logs_dir = logs_dir or Path("/opt/hvac-kia-content/logs")
|
||||
self.nas_path = Path(os.getenv('NAS_PATH', '/mnt/nas/hkia'))
|
||||
self.nas_path = Path(os.getenv('NAS_PATH', '/mnt/nas/hvacknowitall'))
|
||||
self.timezone = os.getenv('TIMEZONE', 'America/Halifax')
|
||||
self.tz = pytz.timezone(self.timezone)
|
||||
|
||||
|
|
@ -58,7 +57,7 @@ class ContentOrchestrator:
|
|||
# WordPress scraper
|
||||
config = ScraperConfig(
|
||||
source_name="wordpress",
|
||||
brand_name="hkia",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=self.data_dir,
|
||||
logs_dir=self.logs_dir,
|
||||
timezone=self.timezone
|
||||
|
|
@ -68,7 +67,7 @@ class ContentOrchestrator:
|
|||
# MailChimp RSS scraper
|
||||
config = ScraperConfig(
|
||||
source_name="mailchimp",
|
||||
brand_name="hkia",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=self.data_dir,
|
||||
logs_dir=self.logs_dir,
|
||||
timezone=self.timezone
|
||||
|
|
@ -78,7 +77,7 @@ class ContentOrchestrator:
|
|||
# Podcast RSS scraper
|
||||
config = ScraperConfig(
|
||||
source_name="podcast",
|
||||
brand_name="hkia",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=self.data_dir,
|
||||
logs_dir=self.logs_dir,
|
||||
timezone=self.timezone
|
||||
|
|
@ -88,7 +87,7 @@ class ContentOrchestrator:
|
|||
# YouTube scraper
|
||||
config = ScraperConfig(
|
||||
source_name="youtube",
|
||||
brand_name="hkia",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=self.data_dir,
|
||||
logs_dir=self.logs_dir,
|
||||
timezone=self.timezone
|
||||
|
|
@ -98,32 +97,22 @@ class ContentOrchestrator:
|
|||
# Instagram scraper
|
||||
config = ScraperConfig(
|
||||
source_name="instagram",
|
||||
brand_name="hkia",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=self.data_dir,
|
||||
logs_dir=self.logs_dir,
|
||||
timezone=self.timezone
|
||||
)
|
||||
scrapers['instagram'] = InstagramScraper(config)
|
||||
|
||||
# TikTok scraper - DISABLED (not working as designed)
|
||||
# config = ScraperConfig(
|
||||
# source_name="tiktok",
|
||||
# brand_name="hkia",
|
||||
# data_dir=self.data_dir,
|
||||
# logs_dir=self.logs_dir,
|
||||
# timezone=self.timezone
|
||||
# )
|
||||
# scrapers['tiktok'] = TikTokScraperAdvanced(config)
|
||||
|
||||
# HVACR School scraper
|
||||
# TikTok scraper (advanced with headed browser)
|
||||
config = ScraperConfig(
|
||||
source_name="hvacrschool",
|
||||
brand_name="hkia",
|
||||
source_name="tiktok",
|
||||
brand_name="hvacknowitall",
|
||||
data_dir=self.data_dir,
|
||||
logs_dir=self.logs_dir,
|
||||
timezone=self.timezone
|
||||
)
|
||||
scrapers['hvacrschool'] = HVACRSchoolScraper(config)
|
||||
scrapers['tiktok'] = TikTokScraperAdvanced(config)
|
||||
|
||||
return scrapers
|
||||
|
||||
|
|
@ -169,7 +158,7 @@ class ContentOrchestrator:
|
|||
# Generate and save markdown
|
||||
markdown = scraper.format_markdown(new_items)
|
||||
timestamp = datetime.now(scraper.tz).strftime("%Y%m%d_%H%M%S")
|
||||
filename = f"hkia_{name}_{timestamp}.md"
|
||||
filename = f"hvacknowitall_{name}_{timestamp}.md"
|
||||
|
||||
# Save to current markdown directory
|
||||
current_dir = scraper.config.data_dir / "markdown_current"
|
||||
|
|
@ -210,18 +199,26 @@ class ContentOrchestrator:
|
|||
results = []
|
||||
|
||||
if parallel:
|
||||
# Run all scrapers in parallel (TikTok disabled)
|
||||
# Run scrapers in parallel (except TikTok which needs DISPLAY)
|
||||
non_gui_scrapers = {k: v for k, v in self.scrapers.items() if k != 'tiktok'}
|
||||
|
||||
with ThreadPoolExecutor(max_workers=max_workers) as executor:
|
||||
# Submit all active scrapers
|
||||
# Submit non-GUI scrapers
|
||||
future_to_name = {
|
||||
executor.submit(self.run_scraper, name, scraper): name
|
||||
for name, scraper in self.scrapers.items()
|
||||
for name, scraper in non_gui_scrapers.items()
|
||||
}
|
||||
|
||||
# Collect results
|
||||
for future in as_completed(future_to_name):
|
||||
result = future.result()
|
||||
results.append(result)
|
||||
|
||||
# Run TikTok separately (requires DISPLAY)
|
||||
if 'tiktok' in self.scrapers:
|
||||
print("Running TikTok scraper separately (requires GUI)...")
|
||||
tiktok_result = self.run_scraper('tiktok', self.scrapers['tiktok'])
|
||||
results.append(tiktok_result)
|
||||
|
||||
else:
|
||||
# Run scrapers sequentially
|
||||
|
|
@ -325,7 +322,7 @@ class ContentOrchestrator:
|
|||
|
||||
def main():
|
||||
"""Main entry point."""
|
||||
parser = argparse.ArgumentParser(description='HKIA Content Orchestrator')
|
||||
parser = argparse.ArgumentParser(description='HVAC Know It All Content Orchestrator')
|
||||
parser.add_argument('--data-dir', type=Path, help='Data directory path')
|
||||
parser.add_argument('--sync-nas', action='store_true', help='Sync to NAS after scraping')
|
||||
parser.add_argument('--nas-only', action='store_true', help='Only sync to NAS (no scraping)')
|
||||
|
|
|
|||
|
|
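The run_all change above splits the scrapers into a parallel pool plus a GUI-bound TikTok run. A minimal standalone sketch of that pattern; `scrapers` and `run_scraper` stand in for the orchestrator's attributes and method:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(scrapers, run_scraper, max_workers=4):
    """Run every scraper except TikTok in a thread pool, then TikTok alone."""
    results = []
    non_gui = {name: s for name, s in scrapers.items() if name != 'tiktok'}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(run_scraper, name, s): name
                   for name, s in non_gui.items()}
        for future in as_completed(futures):
            results.append(future.result())
    if 'tiktok' in scrapers:
        # Needs a DISPLAY for the headed browser, so it runs outside the pool.
        results.append(run_scraper('tiktok', scrapers['tiktok']))
    return results
```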
@ -1,152 +0,0 @@
|
|||
"""
|
||||
Enhanced RSS scrapers that download podcast episode thumbnails.
|
||||
"""
|
||||
|
||||
from typing import Dict, List, Any, Optional
|
||||
from pathlib import Path
|
||||
from src.rss_scraper import RSSScraperPodcast, RSSScraperMailChimp
|
||||
|
||||
|
||||
class RSSScraperPodcastWithImages(RSSScraperPodcast):
|
||||
"""Podcast RSS scraper that downloads episode thumbnails."""
|
||||
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
# Create media directory for Podcast
|
||||
self.media_dir = self.config.data_dir / "media" / "Podcast"
|
||||
self.media_dir.mkdir(parents=True, exist_ok=True)
|
||||
self.logger.info(f"Podcast media directory: {self.media_dir}")
|
||||
|
||||
def _download_episode_thumbnail(self, episode_id: str, image_url: str) -> Optional[str]:
|
||||
"""Download podcast episode thumbnail."""
|
||||
if not image_url:
|
||||
return None
|
||||
|
||||
try:
|
||||
# Clean episode ID for filename
|
||||
safe_id = episode_id.replace('/', '_').replace('\\', '_')[:50]
|
||||
|
||||
local_path = self.download_media(
|
||||
image_url,
|
||||
f"podcast_{safe_id}_thumbnail",
|
||||
"image"
|
||||
)
|
||||
if local_path:
|
||||
self.logger.info(f"Downloaded thumbnail for episode {safe_id}")
|
||||
return local_path
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error downloading thumbnail for {episode_id}: {e}")
|
||||
return None
|
||||
|
||||
def fetch_content(self, max_items: Optional[int] = None) -> List[Dict[str, Any]]:
|
||||
"""Fetch RSS feed content with thumbnail downloads."""
|
||||
items = super().fetch_content(max_items)
|
||||
|
||||
# Download thumbnails for each episode
|
||||
for item in items:
|
||||
image_url = self.extract_image_link(item)
|
||||
if image_url:
|
||||
episode_id = item.get('id') or item.get('guid', 'unknown')
|
||||
local_thumbnail = self._download_episode_thumbnail(episode_id, image_url)
|
||||
item['local_thumbnail'] = local_thumbnail
|
||||
item['thumbnail_url'] = image_url
|
||||
|
||||
# Also store audio link for reference (but don't download)
|
||||
audio_link = self.extract_audio_link(item)
|
||||
if audio_link:
|
||||
item['audio_url'] = audio_link
|
||||
|
||||
return items
|
||||
|
||||
def format_markdown(self, items: List[Dict[str, Any]]) -> str:
|
||||
"""Format podcast items as markdown with thumbnail references."""
|
||||
markdown_sections = []
|
||||
|
||||
for item in items:
|
||||
section = []
|
||||
|
||||
# ID
|
||||
item_id = item.get('id') or item.get('guid', 'N/A')
|
||||
section.append(f"# ID: {item_id}")
|
||||
section.append("")
|
||||
|
||||
# Title
|
||||
title = item.get('title', 'Untitled')
|
||||
section.append(f"## Title: {title}")
|
||||
section.append("")
|
||||
|
||||
# Type
|
||||
section.append("## Type: podcast")
|
||||
section.append("")
|
||||
|
||||
# Link
|
||||
link = item.get('link', '')
|
||||
section.append(f"## Link: {link}")
|
||||
section.append("")
|
||||
|
||||
# Audio URL
|
||||
if item.get('audio_url'):
|
||||
section.append(f"## Audio: {item['audio_url']}")
|
||||
section.append("")
|
||||
|
||||
# Publish Date
|
||||
pub_date = item.get('published') or item.get('pubDate', '')
|
||||
section.append(f"## Publish Date: {pub_date}")
|
||||
section.append("")
|
||||
|
||||
# Duration
|
||||
duration = item.get('itunes_duration', '')
|
||||
if duration:
|
||||
section.append(f"## Duration: {duration}")
|
||||
section.append("")
|
||||
|
||||
# Thumbnail
|
||||
if item.get('local_thumbnail'):
|
||||
section.append("## Thumbnail:")
|
||||
# Convert to relative path for markdown
|
||||
rel_path = Path(item['local_thumbnail']).relative_to(self.config.data_dir)
|
||||
section.append(f"")
|
||||
section.append("")
|
||||
elif item.get('thumbnail_url'):
|
||||
section.append(f"## Thumbnail URL: {item['thumbnail_url']}")
|
||||
section.append("")
|
||||
|
||||
# Description
|
||||
section.append("## Description:")
|
||||
|
||||
# Try to get full content first, then summary, then description
|
||||
content = item.get('content')
|
||||
if content and isinstance(content, list) and len(content) > 0:
|
||||
content_html = content[0].get('value', '')
|
||||
if content_html:
|
||||
content_md = self.convert_to_markdown(content_html)
|
||||
section.append(content_md)
|
||||
elif item.get('summary'):
|
||||
summary_md = self.convert_to_markdown(item.get('summary'))
|
||||
section.append(summary_md)
|
||||
elif item.get('description'):
|
||||
desc_md = self.convert_to_markdown(item.get('description'))
|
||||
section.append(desc_md)
|
||||
|
||||
section.append("")
|
||||
|
||||
# iTunes metadata if available
|
||||
if item.get('itunes_author'):
|
||||
section.append(f"## Author: {item['itunes_author']}")
|
||||
section.append("")
|
||||
|
||||
if item.get('itunes_episode'):
|
||||
section.append(f"## Episode Number: {item['itunes_episode']}")
|
||||
section.append("")
|
||||
|
||||
if item.get('itunes_season'):
|
||||
section.append(f"## Season: {item['itunes_season']}")
|
||||
section.append("")
|
||||
|
||||
# Separator
|
||||
section.append("-" * 50)
|
||||
section.append("")
|
||||
|
||||
markdown_sections.append('\n'.join(section))
|
||||
|
||||
return '\n'.join(markdown_sections)
|
||||
|
|
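`format_markdown` above embeds each downloaded thumbnail through a path relative to the data directory. A two-line illustration of the `Path.relative_to` call, with made-up paths:

```python
from pathlib import Path

data_dir = Path("/opt/hvac-kia-content/data")                          # example value
thumb = data_dir / "media" / "Podcast" / "podcast_ep42_thumbnail.jpg"  # example file
rel_path = thumb.relative_to(data_dir)
print(f"![Thumbnail]({rel_path})")   # ![Thumbnail](media/Podcast/podcast_ep42_thumbnail.jpg)
```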
@ -21,7 +21,7 @@ class TikTokScraper(BaseScraper):
|
|||
super().__init__(config)
|
||||
self.username = os.getenv('TIKTOK_USERNAME')
|
||||
self.password = os.getenv('TIKTOK_PASSWORD')
|
||||
self.target_account = os.getenv('TIKTOK_TARGET', 'hkia')
|
||||
self.target_account = os.getenv('TIKTOK_TARGET', 'hvacknowitall')
|
||||
|
||||
# Session directory for persistence
|
||||
self.session_dir = self.config.data_dir / '.sessions' / 'tiktok'
|
||||
|
|
|
|||
|
|
@ -15,7 +15,7 @@ class TikTokScraperAdvanced(BaseScraper):
|
|||
|
||||
def __init__(self, config: ScraperConfig):
|
||||
super().__init__(config)
|
||||
self.target_username = os.getenv('TIKTOK_TARGET', 'hkia')
|
||||
self.target_username = os.getenv('TIKTOK_TARGET', 'hvacknowitall')
|
||||
self.base_url = f"https://www.tiktok.com/@{self.target_username}"
|
||||
|
||||
# Configure global StealthyFetcher settings
|
||||
|
|
|
|||
|
|
@ -9,7 +9,7 @@ from src.base_scraper import BaseScraper, ScraperConfig
|
|||
class WordPressScraper(BaseScraper):
|
||||
def __init__(self, config: ScraperConfig):
|
||||
super().__init__(config)
|
||||
self.base_url = os.getenv('WORDPRESS_URL', 'https://hkia.com/')
|
||||
self.base_url = os.getenv('WORDPRESS_URL', 'https://hvacknowitall.com/')
|
||||
self.username = os.getenv('WORDPRESS_USERNAME')
|
||||
self.api_key = os.getenv('WORDPRESS_API_KEY')
|
||||
self.auth = (self.username, self.api_key)
|
||||
|
|
|
|||
|
|
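The WordPress scraper above builds a `(username, application password)` tuple from the environment. A hedged sketch of how that pair is typically used against the WordPress REST API; the `/wp-json/wp/v2/posts` route is the standard WordPress endpoint, not a value quoted from this repository:

```python
import os
import requests

base_url = os.getenv('WORDPRESS_URL', 'https://hvacknowitall.com/').rstrip('/')
auth = (os.getenv('WORDPRESS_USERNAME', ''), os.getenv('WORDPRESS_API_KEY', ''))

# Standard WordPress REST endpoint; per_page/page drive pagination.
resp = requests.get(f"{base_url}/wp-json/wp/v2/posts",
                    auth=auth, params={'per_page': 50, 'page': 1}, timeout=30)
resp.raise_for_status()
print(len(resp.json()), "posts fetched")
```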
@ -1,470 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
YouTube Data API v3 scraper with quota management
|
||||
Designed to stay within 10,000 units/day limit
|
||||
|
||||
Quota costs:
|
||||
- channels.list: 1 unit
|
||||
- playlistItems.list: 1 unit per page (50 items max)
|
||||
- videos.list: 1 unit per page (50 videos max)
|
||||
- search.list: 100 units (avoid if possible!)
|
||||
- captions.list: 50 units
|
||||
- captions.download: 200 units
|
||||
|
||||
Strategy for 370 videos:
|
||||
- Get channel info: 1 unit
|
||||
- Get all playlist items (370/50 = 8 pages): 8 units
|
||||
- Get video details in batches of 50: 8 units
|
||||
- Total for full channel: ~17 units (very efficient!)
|
||||
- We can afford transcripts for select videos only
|
||||
"""
|
||||
|
||||
import os
|
||||
import time
|
||||
from typing import Any, Dict, List, Optional, Tuple
|
||||
from datetime import datetime
|
||||
from googleapiclient.discovery import build
|
||||
from googleapiclient.errors import HttpError
|
||||
from youtube_transcript_api import YouTubeTranscriptApi
|
||||
from src.base_scraper import BaseScraper, ScraperConfig
|
||||
import logging
|
||||
|
||||
|
||||
class YouTubeAPIScraper(BaseScraper):
|
||||
"""YouTube API scraper with quota management."""
|
||||
|
||||
# Quota costs for different operations
|
||||
QUOTA_COSTS = {
|
||||
'channels_list': 1,
|
||||
'playlist_items': 1,
|
||||
'videos_list': 1,
|
||||
'search': 100,
|
||||
'captions_list': 50,
|
||||
'captions_download': 200,
|
||||
'transcript_api': 0 # Using youtube-transcript-api doesn't cost quota
|
||||
}
|
||||
|
||||
def __init__(self, config: ScraperConfig):
|
||||
super().__init__(config)
|
||||
|
||||
self.api_key = os.getenv('YOUTUBE_API_KEY')
|
||||
if not self.api_key:
|
||||
raise ValueError("YOUTUBE_API_KEY not found in environment variables")
|
||||
|
||||
# Build YouTube API client
|
||||
self.youtube = build('youtube', 'v3', developerKey=self.api_key)
|
||||
|
||||
# Channel configuration
|
||||
self.channel_url = os.getenv('YOUTUBE_CHANNEL_URL', 'https://www.youtube.com/@HVACKnowItAll')
|
||||
self.channel_id = None
|
||||
self.uploads_playlist_id = None
|
||||
|
||||
# Quota tracking
|
||||
self.quota_used = 0
|
||||
self.daily_quota_limit = 10000
|
||||
|
||||
# Transcript fetching strategy
|
||||
self.max_transcripts_per_run = 50 # Limit transcripts to save quota
|
||||
|
||||
self.logger.info(f"Initialized YouTube API scraper for channel: {self.channel_url}")
|
||||
|
||||
def _track_quota(self, operation: str, count: int = 1) -> bool:
|
||||
"""Track quota usage and return True if within limits."""
|
||||
cost = self.QUOTA_COSTS.get(operation, 0) * count
|
||||
|
||||
if self.quota_used + cost > self.daily_quota_limit:
|
||||
self.logger.warning(f"Quota limit would be exceeded. Current: {self.quota_used}, Cost: {cost}")
|
||||
return False
|
||||
|
||||
self.quota_used += cost
|
||||
self.logger.debug(f"Quota used: {self.quota_used}/{self.daily_quota_limit} (+{cost} for {operation})")
|
||||
return True
|
||||
|
||||
def _get_channel_info(self) -> bool:
|
||||
"""Get channel ID and uploads playlist ID."""
|
||||
if self.channel_id and self.uploads_playlist_id:
|
||||
return True
|
||||
|
||||
try:
|
||||
# Extract channel handle
|
||||
channel_handle = self.channel_url.split('@')[-1]
|
||||
|
||||
# Try to get channel by handle first (costs 1 unit)
|
||||
if not self._track_quota('channels_list'):
|
||||
return False
|
||||
|
||||
response = self.youtube.channels().list(
|
||||
part='snippet,statistics,contentDetails',
|
||||
forHandle=channel_handle
|
||||
).execute()
|
||||
|
||||
if not response.get('items'):
|
||||
# Fallback to search by name (costs 100 units - avoid!)
|
||||
self.logger.warning("Channel not found by handle, trying search...")
|
||||
|
||||
if not self._track_quota('search'):
|
||||
return False
|
||||
|
||||
search_response = self.youtube.search().list(
|
||||
part='snippet',
|
||||
q="HKIA",
|
||||
type='channel',
|
||||
maxResults=1
|
||||
).execute()
|
||||
|
||||
if not search_response.get('items'):
|
||||
self.logger.error("Channel not found")
|
||||
return False
|
||||
|
||||
self.channel_id = search_response['items'][0]['snippet']['channelId']
|
||||
|
||||
# Get full channel details
|
||||
if not self._track_quota('channels_list'):
|
||||
return False
|
||||
|
||||
response = self.youtube.channels().list(
|
||||
part='snippet,statistics,contentDetails',
|
||||
id=self.channel_id
|
||||
).execute()
|
||||
|
||||
if response.get('items'):
|
||||
channel_data = response['items'][0]
|
||||
self.channel_id = channel_data['id']
|
||||
self.uploads_playlist_id = channel_data['contentDetails']['relatedPlaylists']['uploads']
|
||||
|
||||
# Log channel stats
|
||||
stats = channel_data['statistics']
|
||||
self.logger.info(f"Channel: {channel_data['snippet']['title']}")
|
||||
self.logger.info(f"Subscribers: {int(stats.get('subscriberCount', 0)):,}")
|
||||
self.logger.info(f"Total videos: {int(stats.get('videoCount', 0)):,}")
|
||||
|
||||
return True
|
||||
|
||||
except HttpError as e:
|
||||
self.logger.error(f"YouTube API error: {e}")
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error getting channel info: {e}")
|
||||
|
||||
return False
|
||||
|
||||
def _fetch_all_video_ids(self, max_videos: int = None) -> List[str]:
|
||||
"""Fetch all video IDs from the channel efficiently."""
|
||||
if not self._get_channel_info():
|
||||
return []
|
||||
|
||||
video_ids = []
|
||||
next_page_token = None
|
||||
videos_fetched = 0
|
||||
|
||||
while True:
|
||||
# Check quota before each request
|
||||
if not self._track_quota('playlist_items'):
|
||||
self.logger.warning("Quota limit reached while fetching video IDs")
|
||||
break
|
||||
|
||||
try:
|
||||
# Fetch playlist items (50 per page, costs 1 unit)
|
||||
request = self.youtube.playlistItems().list(
|
||||
part='contentDetails',
|
||||
playlistId=self.uploads_playlist_id,
|
||||
maxResults=50,
|
||||
pageToken=next_page_token
|
||||
)
|
||||
|
||||
response = request.execute()
|
||||
|
||||
for item in response.get('items', []):
|
||||
video_ids.append(item['contentDetails']['videoId'])
|
||||
videos_fetched += 1
|
||||
|
||||
if max_videos and videos_fetched >= max_videos:
|
||||
return video_ids[:max_videos]
|
||||
|
||||
# Check for next page
|
||||
next_page_token = response.get('nextPageToken')
|
||||
if not next_page_token:
|
||||
break
|
||||
|
||||
except HttpError as e:
|
||||
self.logger.error(f"Error fetching video IDs: {e}")
|
||||
break
|
||||
|
||||
self.logger.info(f"Fetched {len(video_ids)} video IDs")
|
||||
return video_ids
|
||||
|
||||
def _fetch_video_details_batch(self, video_ids: List[str]) -> List[Dict[str, Any]]:
|
||||
"""Fetch details for a batch of videos (max 50 per request)."""
|
||||
if not video_ids:
|
||||
return []
|
||||
|
||||
# YouTube API allows max 50 videos per request
|
||||
batch_size = 50
|
||||
all_videos = []
|
||||
|
||||
for i in range(0, len(video_ids), batch_size):
|
||||
batch = video_ids[i:i + batch_size]
|
||||
|
||||
# Check quota (1 unit per request)
|
||||
if not self._track_quota('videos_list'):
|
||||
self.logger.warning("Quota limit reached while fetching video details")
|
||||
break
|
||||
|
||||
try:
|
||||
response = self.youtube.videos().list(
|
||||
part='snippet,statistics,contentDetails',
|
||||
id=','.join(batch)
|
||||
).execute()
|
||||
|
||||
for video in response.get('items', []):
|
||||
video_data = {
|
||||
'id': video['id'],
|
||||
'title': video['snippet']['title'],
|
||||
'description': video['snippet']['description'], # Full description!
|
||||
'published_at': video['snippet']['publishedAt'],
|
||||
'channel_id': video['snippet']['channelId'],
|
||||
'channel_title': video['snippet']['channelTitle'],
|
||||
'tags': video['snippet'].get('tags', []),
|
||||
'duration': video['contentDetails']['duration'],
|
||||
'definition': video['contentDetails']['definition'],
|
||||
'thumbnail': video['snippet']['thumbnails'].get('maxres', {}).get('url') or
|
||||
video['snippet']['thumbnails'].get('high', {}).get('url', ''),
|
||||
|
||||
# Statistics
|
||||
'view_count': int(video['statistics'].get('viewCount', 0)),
|
||||
'like_count': int(video['statistics'].get('likeCount', 0)),
|
||||
'comment_count': int(video['statistics'].get('commentCount', 0)),
|
||||
|
||||
# Calculate engagement metrics
|
||||
'engagement_rate': 0,
|
||||
'like_ratio': 0
|
||||
}
|
||||
|
||||
# Calculate engagement metrics
|
||||
if video_data['view_count'] > 0:
|
||||
video_data['engagement_rate'] = (
|
||||
(video_data['like_count'] + video_data['comment_count']) /
|
||||
video_data['view_count']
|
||||
) * 100
|
||||
video_data['like_ratio'] = (video_data['like_count'] / video_data['view_count']) * 100
|
||||
|
||||
all_videos.append(video_data)
|
||||
|
||||
# Small delay to be respectful
|
||||
time.sleep(0.1)
|
||||
|
||||
except HttpError as e:
|
||||
self.logger.error(f"Error fetching video details: {e}")
|
||||
|
||||
return all_videos
|
||||
|
||||
def _fetch_transcript(self, video_id: str) -> Optional[str]:
|
||||
"""Fetch transcript using youtube-transcript-api (no quota cost!)."""
|
||||
try:
|
||||
# This uses youtube-transcript-api which doesn't consume API quota
|
||||
# Create instance and use fetch method
|
||||
api = YouTubeTranscriptApi()
|
||||
transcript_segments = api.fetch(video_id)
|
||||
|
||||
if transcript_segments:
|
||||
# Combine all segments into full text
|
||||
full_text = ' '.join(seg.text for seg in transcript_segments)  # fetched snippets expose .text attributes
|
||||
return full_text
|
||||
|
||||
except Exception as e:
|
||||
self.logger.debug(f"No transcript available for video {video_id}: {e}")
|
||||
|
||||
return None
|
||||
|
||||
def fetch_content(self, max_posts: int = None, fetch_transcripts: bool = True) -> List[Dict[str, Any]]:
|
||||
"""Fetch video content with intelligent quota management."""
|
||||
|
||||
self.logger.info(f"Starting YouTube API fetch (quota limit: {self.daily_quota_limit})")
|
||||
|
||||
# Step 1: Get all video IDs (very cheap - ~8 units for 370 videos)
|
||||
video_ids = self._fetch_all_video_ids(max_posts)
|
||||
|
||||
if not video_ids:
|
||||
self.logger.warning("No video IDs fetched")
|
||||
return []
|
||||
|
||||
# Step 2: Fetch video details in batches (also cheap - ~8 units for 370 videos)
|
||||
videos = self._fetch_video_details_batch(video_ids)
|
||||
|
||||
self.logger.info(f"Fetched details for {len(videos)} videos")
|
||||
|
||||
# Step 3: Fetch transcripts for top videos (no quota cost!)
|
||||
if fetch_transcripts:
|
||||
# Prioritize videos by views for transcript fetching
|
||||
videos_sorted = sorted(videos, key=lambda x: x['view_count'], reverse=True)
|
||||
|
||||
# Limit transcript fetching to top videos
|
||||
max_transcripts = min(self.max_transcripts_per_run, len(videos_sorted))
|
||||
|
||||
self.logger.info(f"Fetching transcripts for top {max_transcripts} videos by views")
|
||||
|
||||
for i, video in enumerate(videos_sorted[:max_transcripts]):
|
||||
transcript = self._fetch_transcript(video['id'])
|
||||
if transcript:
|
||||
video['transcript'] = transcript
|
||||
self.logger.debug(f"Got transcript for video {i+1}/{max_transcripts}: {video['title']}")
|
||||
|
||||
# Small delay to be respectful
|
||||
time.sleep(0.5)
|
||||
|
||||
# Log final quota usage
|
||||
self.logger.info(f"Total quota used: {self.quota_used}/{self.daily_quota_limit} units")
|
||||
self.logger.info(f"Remaining quota: {self.daily_quota_limit - self.quota_used} units")
|
||||
|
||||
return videos
|
||||
|
||||
def _get_video_type(self, video: Dict[str, Any]) -> str:
|
||||
"""Determine video type based on duration."""
|
||||
duration = video.get('duration', 'PT0S')
|
||||
|
||||
# Parse ISO 8601 duration
|
||||
import re
|
||||
match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', duration)
|
||||
if match:
|
||||
hours = int(match.group(1) or 0)
|
||||
minutes = int(match.group(2) or 0)
|
||||
seconds = int(match.group(3) or 0)
|
||||
total_seconds = hours * 3600 + minutes * 60 + seconds
|
||||
|
||||
if total_seconds < 60:
|
||||
return 'short'
|
||||
elif total_seconds > 600: # > 10 minutes
|
||||
return 'video'
|
||||
else:
|
||||
return 'video'
|
||||
|
||||
return 'video'
|
||||
|
||||
def format_markdown(self, videos: List[Dict[str, Any]]) -> str:
|
||||
"""Format videos as markdown with enhanced data."""
|
||||
markdown_sections = []
|
||||
|
||||
for video in videos:
|
||||
section = []
|
||||
|
||||
# ID
|
||||
section.append(f"# ID: {video.get('id', 'N/A')}")
|
||||
section.append("")
|
||||
|
||||
# Title
|
||||
section.append(f"## Title: {video.get('title', 'Untitled')}")
|
||||
section.append("")
|
||||
|
||||
# Type
|
||||
video_type = self._get_video_type(video)
|
||||
section.append(f"## Type: {video_type}")
|
||||
section.append("")
|
||||
|
||||
# Author
|
||||
section.append(f"## Author: {video.get('channel_title', 'Unknown')}")
|
||||
section.append("")
|
||||
|
||||
# Link
|
||||
section.append(f"## Link: https://www.youtube.com/watch?v={video.get('id')}")
|
||||
section.append("")
|
||||
|
||||
# Upload Date
|
||||
section.append(f"## Upload Date: {video.get('published_at', '')}")
|
||||
section.append("")
|
||||
|
||||
# Duration
|
||||
section.append(f"## Duration: {video.get('duration', 'Unknown')}")
|
||||
section.append("")
|
||||
|
||||
# Views
|
||||
section.append(f"## Views: {video.get('view_count', 0):,}")
|
||||
section.append("")
|
||||
|
||||
# Likes
|
||||
section.append(f"## Likes: {video.get('like_count', 0):,}")
|
||||
section.append("")
|
||||
|
||||
# Comments
|
||||
section.append(f"## Comments: {video.get('comment_count', 0):,}")
|
||||
section.append("")
|
||||
|
||||
# Engagement Metrics
|
||||
section.append(f"## Engagement Rate: {video.get('engagement_rate', 0):.2f}%")
|
||||
section.append(f"## Like Ratio: {video.get('like_ratio', 0):.2f}%")
|
||||
section.append("")
|
||||
|
||||
# Tags
|
||||
tags = video.get('tags', [])
|
||||
if tags:
|
||||
section.append(f"## Tags: {', '.join(tags[:10])}") # First 10 tags
|
||||
section.append("")
|
||||
|
||||
# Thumbnail
|
||||
thumbnail = video.get('thumbnail', '')
|
||||
if thumbnail:
|
||||
section.append(f"## Thumbnail: {thumbnail}")
|
||||
section.append("")
|
||||
|
||||
# Full Description (untruncated!)
|
||||
section.append("## Description:")
|
||||
description = video.get('description', '')
|
||||
if description:
|
||||
section.append(description)
|
||||
section.append("")
|
||||
|
||||
# Transcript
|
||||
transcript = video.get('transcript')
|
||||
if transcript:
|
||||
section.append("## Transcript:")
|
||||
section.append(transcript)
|
||||
section.append("")
|
||||
|
||||
# Separator
|
||||
section.append("-" * 50)
|
||||
section.append("")
|
||||
|
||||
markdown_sections.append('\n'.join(section))
|
||||
|
||||
return '\n'.join(markdown_sections)
|
||||
|
||||
def get_incremental_items(self, items: List[Dict[str, Any]], state: Dict[str, Any]) -> List[Dict[str, Any]]:
|
||||
"""Get only new videos since last sync."""
|
||||
if not state:
|
||||
return items
|
||||
|
||||
last_video_id = state.get('last_video_id')
|
||||
last_published = state.get('last_published')
|
||||
|
||||
if not last_video_id:
|
||||
return items
|
||||
|
||||
# Filter for videos newer than the last synced
|
||||
new_items = []
|
||||
for item in items:
|
||||
if item.get('id') == last_video_id:
|
||||
break # Found the last synced video
|
||||
|
||||
# Also check by publish date as backup
|
||||
if last_published and item.get('published_at'):
|
||||
if item['published_at'] <= last_published:
|
||||
continue
|
||||
|
||||
new_items.append(item)
|
||||
|
||||
return new_items
|
||||
|
||||
def update_state(self, state: Dict[str, Any], items: List[Dict[str, Any]]) -> Dict[str, Any]:
|
||||
"""Update state with latest video information."""
|
||||
if not items:
|
||||
return state
|
||||
|
||||
# Get the first item (most recent)
|
||||
latest_item = items[0]
|
||||
|
||||
state['last_video_id'] = latest_item.get('id')
|
||||
state['last_published'] = latest_item.get('published_at')
|
||||
state['last_video_title'] = latest_item.get('title')
|
||||
state['last_sync'] = datetime.now(self.tz).isoformat()
|
||||
state['video_count'] = len(items)
|
||||
state['quota_used'] = self.quota_used
|
||||
|
||||
return state
|
||||
|
|
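The quota arithmetic in the module docstring above is easy to sanity-check. A small sketch reusing the documented per-call costs; the 370-video figure comes from that docstring:

```python
import math

QUOTA_COSTS = {'channels_list': 1, 'playlist_items': 1, 'videos_list': 1}

def full_channel_cost(video_count: int, page_size: int = 50) -> int:
    """Cost of one metadata pass: 1 channel lookup plus paged playlist/video calls."""
    pages = math.ceil(video_count / page_size)
    return (QUOTA_COSTS['channels_list']
            + pages * QUOTA_COSTS['playlist_items']
            + pages * QUOTA_COSTS['videos_list'])

print(full_channel_cost(370))   # 17 units, matching the "~17 units" estimate above
```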
@ -1,513 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
YouTube Data API v3 scraper with quota management and captions support
|
||||
Designed to stay within 10,000 units/day limit while fetching captions
|
||||
|
||||
Quota costs:
|
||||
- channels.list: 1 unit
|
||||
- playlistItems.list: 1 unit per page (50 items max)
|
||||
- videos.list: 1 unit per page (50 videos max)
|
||||
- search.list: 100 units (avoid if possible!)
|
||||
- captions.list: 50 units per video
|
||||
- captions.download: 200 units per caption
|
||||
|
||||
Strategy for 444 videos with captions:
|
||||
- Get channel info: 1 unit
|
||||
- Get all playlist items (444/50 = 9 pages): 9 units
|
||||
- Get video details in batches of 50: 9 units
|
||||
- Get captions list for each video: 444 * 50 = 22,200 units (too much!)
|
||||
- Alternative: Use captions.list selectively or in batches
|
||||
"""
|
||||
|
||||
import os
|
||||
import time
|
||||
from typing import Any, Dict, List, Optional, Tuple
|
||||
from datetime import datetime
|
||||
from googleapiclient.discovery import build
|
||||
from googleapiclient.errors import HttpError
|
||||
from src.base_scraper import BaseScraper, ScraperConfig
|
||||
import logging
|
||||
import re
|
||||
|
||||
|
||||
class YouTubeAPIScraper(BaseScraper):
|
||||
"""YouTube API scraper with quota management and captions."""
|
||||
|
||||
# Quota costs for different operations
|
||||
QUOTA_COSTS = {
|
||||
'channels_list': 1,
|
||||
'playlist_items': 1,
|
||||
'videos_list': 1,
|
||||
'search': 100,
|
||||
'captions_list': 50,
|
||||
'captions_download': 200,
|
||||
}
|
||||
|
||||
def __init__(self, config: ScraperConfig):
|
||||
super().__init__(config)
|
||||
|
||||
self.api_key = os.getenv('YOUTUBE_API_KEY')
|
||||
if not self.api_key:
|
||||
raise ValueError("YOUTUBE_API_KEY not found in environment variables")
|
||||
|
||||
# Build YouTube API client
|
||||
self.youtube = build('youtube', 'v3', developerKey=self.api_key)
|
||||
|
||||
# Channel configuration
|
||||
self.channel_url = os.getenv('YOUTUBE_CHANNEL_URL', 'https://www.youtube.com/@HVACKnowItAll')
|
||||
self.channel_id = None
|
||||
self.uploads_playlist_id = None
|
||||
|
||||
# Quota tracking
|
||||
self.quota_used = 0
|
||||
self.daily_quota_limit = 10000
|
||||
|
||||
# Caption fetching strategy
|
||||
self.max_captions_per_run = 50 # Limit caption fetches to top videos
|
||||
# 50 videos * 50 units = 2,500 units for caption listing
|
||||
# Plus potential download costs
|
||||
|
||||
self.logger.info(f"Initialized YouTube API scraper for channel: {self.channel_url}")
|
||||
|
||||
def _track_quota(self, operation: str, count: int = 1) -> bool:
|
||||
"""Track quota usage and return True if within limits."""
|
||||
cost = self.QUOTA_COSTS.get(operation, 0) * count
|
||||
|
||||
if self.quota_used + cost > self.daily_quota_limit:
|
||||
self.logger.warning(f"Quota limit would be exceeded. Current: {self.quota_used}, Cost: {cost}")
|
||||
return False
|
||||
|
||||
self.quota_used += cost
|
||||
self.logger.debug(f"Quota used: {self.quota_used}/{self.daily_quota_limit} (+{cost} for {operation})")
|
||||
return True
|
||||
|
||||
def _get_channel_info(self) -> bool:
|
||||
"""Get channel ID and uploads playlist ID."""
|
||||
if self.channel_id and self.uploads_playlist_id:
|
||||
return True
|
||||
|
||||
try:
|
||||
# Extract channel handle
|
||||
channel_handle = self.channel_url.split('@')[-1]
|
||||
|
||||
# Try to get channel by handle first (costs 1 unit)
|
||||
if not self._track_quota('channels_list'):
|
||||
return False
|
||||
|
||||
response = self.youtube.channels().list(
|
||||
part='snippet,statistics,contentDetails',
|
||||
forHandle=channel_handle
|
||||
).execute()
|
||||
|
||||
if not response.get('items'):
|
||||
# Fallback to search by name (costs 100 units - avoid!)
|
||||
self.logger.warning("Channel not found by handle, trying search...")
|
||||
|
||||
if not self._track_quota('search'):
|
||||
return False
|
||||
|
||||
search_response = self.youtube.search().list(
|
||||
part='snippet',
|
||||
q="HVAC Know It All",
|
||||
type='channel',
|
||||
maxResults=1
|
||||
).execute()
|
||||
|
||||
if not search_response.get('items'):
|
||||
self.logger.error("Channel not found")
|
||||
return False
|
||||
|
||||
self.channel_id = search_response['items'][0]['snippet']['channelId']
|
||||
|
||||
# Get full channel details
|
||||
if not self._track_quota('channels_list'):
|
||||
return False
|
||||
|
||||
response = self.youtube.channels().list(
|
||||
part='snippet,statistics,contentDetails',
|
||||
id=self.channel_id
|
||||
).execute()
|
||||
|
||||
if response.get('items'):
|
||||
channel_data = response['items'][0]
|
||||
self.channel_id = channel_data['id']
|
||||
self.uploads_playlist_id = channel_data['contentDetails']['relatedPlaylists']['uploads']
|
||||
|
||||
# Log channel stats
|
||||
stats = channel_data['statistics']
|
||||
self.logger.info(f"Channel: {channel_data['snippet']['title']}")
|
||||
self.logger.info(f"Subscribers: {int(stats.get('subscriberCount', 0)):,}")
|
||||
self.logger.info(f"Total videos: {int(stats.get('videoCount', 0)):,}")
|
||||
|
||||
return True
|
||||
|
||||
except HttpError as e:
|
||||
self.logger.error(f"YouTube API error: {e}")
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error getting channel info: {e}")
|
||||
|
||||
return False
|
||||
|
||||
def _fetch_all_video_ids(self, max_videos: int = None) -> List[str]:
|
||||
"""Fetch all video IDs from the channel efficiently."""
|
||||
if not self._get_channel_info():
|
||||
return []
|
||||
|
||||
video_ids = []
|
||||
next_page_token = None
|
||||
videos_fetched = 0
|
||||
|
||||
while True:
|
||||
# Check quota before each request
|
||||
if not self._track_quota('playlist_items'):
|
||||
self.logger.warning("Quota limit reached while fetching video IDs")
|
||||
break
|
||||
|
||||
try:
|
||||
# Fetch playlist items (50 per page, costs 1 unit)
|
||||
request = self.youtube.playlistItems().list(
|
||||
part='contentDetails',
|
||||
playlistId=self.uploads_playlist_id,
|
||||
maxResults=50,
|
||||
pageToken=next_page_token
|
||||
)
|
||||
|
||||
response = request.execute()
|
||||
|
||||
for item in response.get('items', []):
|
||||
video_ids.append(item['contentDetails']['videoId'])
|
||||
videos_fetched += 1
|
||||
|
||||
if max_videos and videos_fetched >= max_videos:
|
||||
return video_ids[:max_videos]
|
||||
|
||||
# Check for next page
|
||||
next_page_token = response.get('nextPageToken')
|
||||
if not next_page_token:
|
||||
break
|
||||
|
||||
except HttpError as e:
|
||||
self.logger.error(f"Error fetching video IDs: {e}")
|
||||
break
|
||||
|
||||
self.logger.info(f"Fetched {len(video_ids)} video IDs")
|
||||
return video_ids
|
||||
|
||||
def _fetch_video_details_batch(self, video_ids: List[str]) -> List[Dict[str, Any]]:
|
||||
"""Fetch details for a batch of videos (max 50 per request)."""
|
||||
if not video_ids:
|
||||
return []
|
||||
|
||||
# YouTube API allows max 50 videos per request
|
||||
batch_size = 50
|
||||
all_videos = []
|
||||
|
||||
for i in range(0, len(video_ids), batch_size):
|
||||
batch = video_ids[i:i + batch_size]
|
||||
|
||||
# Check quota (1 unit per request)
|
||||
if not self._track_quota('videos_list'):
|
||||
self.logger.warning("Quota limit reached while fetching video details")
|
||||
break
|
||||
|
||||
try:
|
||||
response = self.youtube.videos().list(
|
||||
part='snippet,statistics,contentDetails',
|
||||
id=','.join(batch)
|
||||
).execute()
|
||||
|
||||
for video in response.get('items', []):
|
||||
video_data = {
|
||||
'id': video['id'],
|
||||
'title': video['snippet']['title'],
|
||||
'description': video['snippet']['description'], # Full description!
|
||||
'published_at': video['snippet']['publishedAt'],
|
||||
'channel_id': video['snippet']['channelId'],
|
||||
'channel_title': video['snippet']['channelTitle'],
|
||||
'tags': video['snippet'].get('tags', []),
|
||||
'duration': video['contentDetails']['duration'],
|
||||
'definition': video['contentDetails']['definition'],
|
||||
'caption': video['contentDetails'].get('caption', 'false'), # Has captions?
|
||||
'thumbnail': video['snippet']['thumbnails'].get('maxres', {}).get('url') or
|
||||
video['snippet']['thumbnails'].get('high', {}).get('url', ''),
|
||||
|
||||
# Statistics
|
||||
'view_count': int(video['statistics'].get('viewCount', 0)),
|
||||
'like_count': int(video['statistics'].get('likeCount', 0)),
|
||||
'comment_count': int(video['statistics'].get('commentCount', 0)),
|
||||
|
||||
# Calculate engagement metrics
|
||||
'engagement_rate': 0,
|
||||
'like_ratio': 0
|
||||
}
|
||||
|
||||
# Calculate engagement metrics
|
||||
if video_data['view_count'] > 0:
|
||||
video_data['engagement_rate'] = (
|
||||
(video_data['like_count'] + video_data['comment_count']) /
|
||||
video_data['view_count']
|
||||
) * 100
|
||||
video_data['like_ratio'] = (video_data['like_count'] / video_data['view_count']) * 100
|
||||
|
||||
all_videos.append(video_data)
|
||||
|
||||
# Small delay to be respectful
|
||||
time.sleep(0.1)
|
||||
|
||||
except HttpError as e:
|
||||
self.logger.error(f"Error fetching video details: {e}")
|
||||
|
||||
return all_videos
|
||||
|
||||
def _fetch_caption_text(self, video_id: str) -> Optional[str]:
|
||||
"""Fetch caption text using YouTube Data API (costs 50 units!)."""
|
||||
try:
|
||||
# Check quota (50 units for list)
|
||||
if not self._track_quota('captions_list'):
|
||||
self.logger.debug(f"Quota limit - skipping captions for {video_id}")
|
||||
return None
|
||||
|
||||
# List available captions
|
||||
captions_response = self.youtube.captions().list(
|
||||
part='snippet',
|
||||
videoId=video_id
|
||||
).execute()
|
||||
|
||||
captions = captions_response.get('items', [])
|
||||
if not captions:
|
||||
self.logger.debug(f"No captions available for video {video_id}")
|
||||
return None
|
||||
|
||||
# Find English caption (or auto-generated)
|
||||
english_caption = None
|
||||
for caption in captions:
|
||||
if caption['snippet']['language'] == 'en':
|
||||
english_caption = caption
|
||||
break
|
||||
|
||||
if not english_caption:
|
||||
# Try auto-generated
|
||||
for caption in captions:
|
||||
if 'auto' in caption['snippet']['name'].lower():
|
||||
english_caption = caption
|
||||
break
|
||||
|
||||
if english_caption:
|
||||
caption_id = english_caption['id']
|
||||
|
||||
# Download caption would cost 200 more units!
|
||||
# For now, just note that captions are available
|
||||
self.logger.debug(f"Captions available for video {video_id} (id: {caption_id})")
|
||||
return f"[Captions available - {english_caption['snippet']['name']}]"
|
||||
|
||||
return None
|
||||
|
||||
except HttpError as e:
|
||||
if 'captionsDisabled' in str(e):
|
||||
self.logger.debug(f"Captions disabled for video {video_id}")
|
||||
else:
|
||||
self.logger.debug(f"Error fetching captions for {video_id}: {e}")
|
||||
except Exception as e:
|
||||
self.logger.debug(f"Error fetching captions for {video_id}: {e}")
|
||||
|
||||
return None
|
||||
|
||||
def fetch_content(self, max_posts: int = None, fetch_captions: bool = True) -> List[Dict[str, Any]]:
|
||||
"""Fetch video content with intelligent quota management."""
|
||||
|
||||
self.logger.info(f"Starting YouTube API fetch (quota limit: {self.daily_quota_limit})")
|
||||
|
||||
# Step 1: Get all video IDs (very cheap - ~9 units for 444 videos)
|
||||
video_ids = self._fetch_all_video_ids(max_posts)
|
||||
|
||||
if not video_ids:
|
||||
self.logger.warning("No video IDs fetched")
|
||||
return []
|
||||
|
||||
# Step 2: Fetch video details in batches (also cheap - ~9 units for 444 videos)
|
||||
videos = self._fetch_video_details_batch(video_ids)
|
||||
|
||||
self.logger.info(f"Fetched details for {len(videos)} videos")
|
||||
|
||||
# Step 3: Fetch captions for top videos (expensive - 50 units per video)
|
||||
if fetch_captions:
|
||||
# Prioritize videos by views for caption fetching
|
||||
videos_sorted = sorted(videos, key=lambda x: x['view_count'], reverse=True)
|
||||
|
||||
# Limit caption fetching to top videos
|
||||
max_captions = min(self.max_captions_per_run, len(videos_sorted))
|
||||
|
||||
# Check remaining quota
|
||||
captions_quota_needed = max_captions * 50
|
||||
if self.quota_used + captions_quota_needed > self.daily_quota_limit:
|
||||
max_captions = (self.daily_quota_limit - self.quota_used) // 50
|
||||
self.logger.warning(f"Limiting captions to {max_captions} videos due to quota")
|
||||
|
||||
if max_captions > 0:
|
||||
self.logger.info(f"Fetching captions for top {max_captions} videos by views")
|
||||
|
||||
for i, video in enumerate(videos_sorted[:max_captions]):
|
||||
caption_text = self._fetch_caption_text(video['id'])
|
||||
if caption_text:
|
||||
video['caption_text'] = caption_text
|
||||
self.logger.debug(f"Got caption info for video {i+1}/{max_captions}: {video['title']}")
|
||||
|
||||
# Small delay to be respectful
|
||||
time.sleep(0.5)
|
||||
|
||||
# Log final quota usage
|
||||
self.logger.info(f"Total quota used: {self.quota_used}/{self.daily_quota_limit} units")
|
||||
self.logger.info(f"Remaining quota: {self.daily_quota_limit - self.quota_used} units")
|
||||
|
||||
return videos
|
||||
|
||||
def _get_video_type(self, video: Dict[str, Any]) -> str:
|
||||
"""Determine video type based on duration."""
|
||||
duration = video.get('duration', 'PT0S')
|
||||
|
||||
# Parse ISO 8601 duration
|
||||
match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', duration)
|
||||
if match:
|
||||
hours = int(match.group(1) or 0)
|
||||
minutes = int(match.group(2) or 0)
|
||||
seconds = int(match.group(3) or 0)
|
||||
total_seconds = hours * 3600 + minutes * 60 + seconds
|
||||
|
||||
if total_seconds < 60:
|
||||
return 'short'
|
||||
elif total_seconds > 600: # > 10 minutes
|
||||
return 'video'
|
||||
else:
|
||||
return 'video'
|
||||
|
||||
return 'video'
|
||||
|
||||
def format_markdown(self, videos: List[Dict[str, Any]]) -> str:
|
||||
"""Format videos as markdown with enhanced data."""
|
||||
markdown_sections = []
|
||||
|
||||
for video in videos:
|
||||
section = []
|
||||
|
||||
# ID
|
||||
section.append(f"# ID: {video.get('id', 'N/A')}")
|
||||
section.append("")
|
||||
|
||||
# Title
|
||||
section.append(f"## Title: {video.get('title', 'Untitled')}")
|
||||
section.append("")
|
||||
|
||||
# Type
|
||||
video_type = self._get_video_type(video)
|
||||
section.append(f"## Type: {video_type}")
|
||||
section.append("")
|
||||
|
||||
# Author
|
||||
section.append(f"## Author: {video.get('channel_title', 'Unknown')}")
|
||||
section.append("")
|
||||
|
||||
# Link
|
||||
section.append(f"## Link: https://www.youtube.com/watch?v={video.get('id')}")
|
||||
section.append("")
|
||||
|
||||
# Upload Date
|
||||
section.append(f"## Upload Date: {video.get('published_at', '')}")
|
||||
section.append("")
|
||||
|
||||
# Duration
|
||||
section.append(f"## Duration: {video.get('duration', 'Unknown')}")
|
||||
section.append("")
|
||||
|
||||
# Views
|
||||
section.append(f"## Views: {video.get('view_count', 0):,}")
|
||||
section.append("")
|
||||
|
||||
# Likes
|
||||
section.append(f"## Likes: {video.get('like_count', 0):,}")
|
||||
section.append("")
|
||||
|
||||
# Comments
|
||||
section.append(f"## Comments: {video.get('comment_count', 0):,}")
|
||||
section.append("")
|
||||
|
||||
# Engagement Metrics
|
||||
section.append(f"## Engagement Rate: {video.get('engagement_rate', 0):.2f}%")
|
||||
section.append(f"## Like Ratio: {video.get('like_ratio', 0):.2f}%")
|
||||
section.append("")
|
||||
|
||||
# Tags
|
||||
tags = video.get('tags', [])
|
||||
if tags:
|
||||
section.append(f"## Tags: {', '.join(tags[:10])}") # First 10 tags
|
||||
section.append("")
|
||||
|
||||
# Thumbnail
|
||||
thumbnail = video.get('thumbnail', '')
|
||||
if thumbnail:
|
||||
section.append(f"## Thumbnail: {thumbnail}")
|
||||
section.append("")
|
||||
|
||||
# Full Description (untruncated!)
|
||||
section.append("## Description:")
|
||||
description = video.get('description', '')
|
||||
if description:
|
||||
section.append(description)
|
||||
section.append("")
|
||||
|
||||
# Caption/Transcript
|
||||
caption_text = video.get('caption_text')
|
||||
if caption_text:
|
||||
section.append("## Caption Status:")
|
||||
section.append(caption_text)
|
||||
section.append("")
|
||||
|
||||
# Separator
|
||||
section.append("-" * 50)
|
||||
section.append("")
|
||||
|
||||
markdown_sections.append('\n'.join(section))
|
||||
|
||||
return '\n'.join(markdown_sections)
|
||||
|
||||
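# Shape of each section emitted by format_markdown above (values are
# placeholders, not real data):
#   # ID: <video-id>
#
#   ## Title: <title>
#
#   ## Type: video | short
#   ...
#   ## Description:
#   <full, untruncated description>
#
#   --------------------------------------------------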
def get_incremental_items(self, items: List[Dict[str, Any]], state: Dict[str, Any]) -> List[Dict[str, Any]]:
|
||||
"""Get only new videos since last sync."""
|
||||
if not state:
|
||||
return items
|
||||
|
||||
last_video_id = state.get('last_video_id')
|
||||
last_published = state.get('last_published')
|
||||
|
||||
if not last_video_id:
|
||||
return items
|
||||
|
||||
# Filter for videos newer than the last synced
|
||||
new_items = []
|
||||
for item in items:
|
||||
if item.get('id') == last_video_id:
|
||||
break # Found the last synced video
|
||||
|
||||
# Also check by publish date as backup
|
||||
if last_published and item.get('published_at'):
|
||||
if item['published_at'] <= last_published:
|
||||
continue
|
||||
|
||||
new_items.append(item)
|
||||
|
||||
return new_items
|
||||
|
||||
def update_state(self, state: Dict[str, Any], items: List[Dict[str, Any]]) -> Dict[str, Any]:
|
||||
"""Update state with latest video information."""
|
||||
if not items:
|
||||
return state
|
||||
|
||||
# Get the first item (most recent)
|
||||
latest_item = items[0]
|
||||
|
||||
state['last_video_id'] = latest_item.get('id')
|
||||
state['last_published'] = latest_item.get('published_at')
|
||||
state['last_video_title'] = latest_item.get('title')
|
||||
state['last_sync'] = datetime.now(self.tz).isoformat()
|
||||
state['video_count'] = len(items)
|
||||
state['quota_used'] = self.quota_used
|
||||
|
||||
return state
|
||||
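# Sketch of the state dict round-tripped by get_incremental_items/update_state
# (keys mirror the code above; values are placeholders):
#   state = {
#       'last_video_id': '<newest synced video id>',
#       'last_published': '2025-08-18T12:00:00Z',
#       'last_video_title': '<title>',
#       'last_sync': '<ISO timestamp in self.tz>',
#       'video_count': 444,
#       'quota_used': 18,
#   }
# On the next run, items newer than last_video_id/last_published are kept and
# the same keys are rewritten from the newest item.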
|
|
@@ -1,222 +0,0 @@
|
|||
"""
|
||||
Enhanced YouTube API scraper that downloads video thumbnails.
|
||||
"""
|
||||
|
||||
from typing import List, Dict, Any, Optional
|
||||
from pathlib import Path
|
||||
from src.youtube_api_scraper_v2 import YouTubeAPIScraper
|
||||
|
||||
|
||||
class YouTubeAPIScraperWithThumbnails(YouTubeAPIScraper):
|
||||
"""YouTube API scraper that downloads video thumbnails."""
|
||||
|
||||
def __init__(self, config):
|
||||
super().__init__(config)
|
||||
# Create media directory for YouTube
|
||||
self.media_dir = self.config.data_dir / "media" / "YouTube"
|
||||
self.media_dir.mkdir(parents=True, exist_ok=True)
|
||||
self.logger.info(f"YouTube media directory: {self.media_dir}")
|
||||
|
||||
def _download_thumbnail(self, video_id: str, thumbnail_url: str) -> Optional[str]:
|
||||
"""Download video thumbnail."""
|
||||
if not thumbnail_url:
|
||||
return None
|
||||
|
||||
try:
|
||||
local_path = self.download_media(
|
||||
thumbnail_url,
|
||||
f"youtube_{video_id}_thumbnail",
|
||||
"image"
|
||||
)
|
||||
if local_path:
|
||||
self.logger.info(f"Downloaded thumbnail for video {video_id}")
|
||||
return local_path
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error downloading thumbnail for {video_id}: {e}")
|
||||
return None
|
||||
|
||||
def fetch_content(self, max_posts: int = None, fetch_captions: bool = True) -> List[Dict[str, Any]]:
|
||||
"""Fetch YouTube videos with thumbnail downloads."""
|
||||
# Call parent method to get videos
|
||||
videos = super().fetch_content(max_posts, fetch_captions)
|
||||
|
||||
# Download thumbnails for each video
|
||||
for video in videos:
|
||||
if video.get('thumbnail'):
|
||||
local_thumbnail = self._download_thumbnail(video['id'], video['thumbnail'])
|
||||
video['local_thumbnail'] = local_thumbnail
|
||||
|
||||
return videos
|
||||
|
||||
def fetch_video_details(self, video_ids: List[str]) -> List[Dict[str, Any]]:
|
||||
"""Fetch detailed video information with thumbnail downloads."""
|
||||
if not video_ids:
|
||||
return []
|
||||
|
||||
# YouTube API allows max 50 videos per request
|
||||
batch_size = 50
|
||||
all_videos = []
|
||||
|
||||
for i in range(0, len(video_ids), batch_size):
|
||||
batch = video_ids[i:i + batch_size]
|
||||
|
||||
# Check quota (1 unit per request)
|
||||
if not self._track_quota('videos_list'):
|
||||
self.logger.warning("Quota limit reached while fetching video details")
|
||||
break
|
||||
|
||||
try:
|
||||
response = self.youtube.videos().list(
|
||||
part='snippet,statistics,contentDetails',
|
||||
id=','.join(batch)
|
||||
).execute()
|
||||
|
||||
for video in response.get('items', []):
|
||||
# Get thumbnail URL (highest quality available)
|
||||
thumbnail_url = (
|
||||
video['snippet']['thumbnails'].get('maxres', {}).get('url') or
|
||||
video['snippet']['thumbnails'].get('high', {}).get('url') or
|
||||
video['snippet']['thumbnails'].get('medium', {}).get('url') or
|
||||
video['snippet']['thumbnails'].get('default', {}).get('url', '')
|
||||
)
|
||||
|
||||
# Download thumbnail
|
||||
local_thumbnail = self._download_thumbnail(video['id'], thumbnail_url)
|
||||
|
||||
video_data = {
|
||||
'id': video['id'],
|
||||
'title': video['snippet']['title'],
|
||||
'description': video['snippet']['description'],
|
||||
'published_at': video['snippet']['publishedAt'],
|
||||
'channel_id': video['snippet']['channelId'],
|
||||
'channel_title': video['snippet']['channelTitle'],
|
||||
'tags': video['snippet'].get('tags', []),
|
||||
'duration': video['contentDetails']['duration'],
|
||||
'definition': video['contentDetails']['definition'],
|
||||
'caption': video['contentDetails'].get('caption', 'false'),
|
||||
'thumbnail': thumbnail_url,
|
||||
'local_thumbnail': local_thumbnail, # Add local thumbnail path
|
||||
|
||||
# Statistics
|
||||
'view_count': int(video['statistics'].get('viewCount', 0)),
|
||||
'like_count': int(video['statistics'].get('likeCount', 0)),
|
||||
'comment_count': int(video['statistics'].get('commentCount', 0)),
|
||||
|
||||
# Calculate engagement metrics
|
||||
'engagement_rate': 0,
|
||||
'like_ratio': 0
|
||||
}
|
||||
|
||||
# Calculate engagement metrics
|
||||
if video_data['view_count'] > 0:
|
||||
video_data['engagement_rate'] = (
|
||||
(video_data['like_count'] + video_data['comment_count']) /
|
||||
video_data['view_count']
|
||||
) * 100
|
||||
video_data['like_ratio'] = (video_data['like_count'] / video_data['view_count']) * 100
|
||||
|
||||
all_videos.append(video_data)
|
||||
|
||||
# Small delay to be respectful
|
||||
import time
|
||||
time.sleep(0.1)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching video details: {e}")
|
||||
|
||||
return all_videos
|
||||
|
||||
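# Worked example for the engagement math above (numbers are made up):
#   view_count=10,000, like_count=450, comment_count=50
#   engagement_rate = (450 + 50) / 10,000 * 100 = 5.00%
#   like_ratio      = 450 / 10,000 * 100        = 4.50%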
def format_markdown(self, videos: List[Dict[str, Any]]) -> str:
|
||||
"""Format videos as markdown with thumbnail references."""
|
||||
markdown_sections = []
|
||||
|
||||
for video in videos:
|
||||
section = []
|
||||
|
||||
# ID
|
||||
section.append(f"# ID: {video.get('id', 'N/A')}")
|
||||
section.append("")
|
||||
|
||||
# Title
|
||||
section.append(f"## Title: {video.get('title', 'Untitled')}")
|
||||
section.append("")
|
||||
|
||||
# Type
|
||||
section.append("## Type: video")
|
||||
section.append("")
|
||||
|
||||
# Link
|
||||
section.append(f"## Link: https://www.youtube.com/watch?v={video.get('id', '')}")
|
||||
section.append("")
|
||||
|
||||
# Channel
|
||||
section.append(f"## Channel: {video.get('channel_title', 'N/A')}")
|
||||
section.append("")
|
||||
|
||||
# Published Date
|
||||
section.append(f"## Published: {video.get('published_at', 'N/A')}")
|
||||
section.append("")
|
||||
|
||||
# Duration
|
||||
if video.get('duration'):
|
||||
section.append(f"## Duration: {video['duration']}")
|
||||
section.append("")
|
||||
|
||||
# Description
|
||||
if video.get('description'):
|
||||
section.append("## Description:")
|
||||
section.append(video['description'][:1000]) # Limit description length
|
||||
if len(video.get('description', '')) > 1000:
|
||||
section.append("... [truncated]")
|
||||
section.append("")
|
||||
|
||||
# Statistics
|
||||
section.append("## Statistics:")
|
||||
section.append(f"- Views: {video.get('view_count', 0):,}")
|
||||
section.append(f"- Likes: {video.get('like_count', 0):,}")
|
||||
section.append(f"- Comments: {video.get('comment_count', 0):,}")
|
||||
section.append(f"- Engagement Rate: {video.get('engagement_rate', 0):.2f}%")
|
||||
section.append(f"- Like Ratio: {video.get('like_ratio', 0):.2f}%")
|
||||
section.append("")
|
||||
|
||||
# Caption/Transcript
|
||||
if video.get('caption_text'):
|
||||
section.append("## Transcript:")
|
||||
# Show first 500 chars of transcript
|
||||
transcript_preview = video['caption_text'][:500]
|
||||
section.append(transcript_preview)
|
||||
if len(video.get('caption_text', '')) > 500:
|
||||
section.append("... [See full transcript below]")
|
||||
section.append("")
|
||||
|
||||
# Add full transcript at the end
|
||||
section.append("### Full Transcript:")
|
||||
section.append(video['caption_text'])
|
||||
section.append("")
|
||||
elif video.get('caption') == 'true':
|
||||
section.append("## Captions: Available (not fetched)")
|
||||
section.append("")
|
||||
|
||||
# Thumbnail
|
||||
if video.get('local_thumbnail'):
|
||||
section.append("## Thumbnail:")
|
||||
# Convert to relative path for markdown
|
||||
rel_path = Path(video['local_thumbnail']).relative_to(self.config.data_dir)
|
||||
section.append(f"")
|
||||
section.append("")
|
||||
elif video.get('thumbnail'):
|
||||
section.append(f"## Thumbnail URL: {video['thumbnail']}")
|
||||
section.append("")
|
||||
|
||||
# Tags
|
||||
if video.get('tags'):
|
||||
section.append(f"## Tags: {', '.join(video['tags'][:10])}") # Limit to 10 tags
|
||||
section.append("")
|
||||
|
||||
# Separator
|
||||
section.append("-" * 50)
|
||||
section.append("")
|
||||
|
||||
markdown_sections.append('\n'.join(section))
|
||||
|
||||
return '\n'.join(markdown_sections)
|
||||
|
|
@@ -1,353 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Intelligent YouTube authentication handler with bot detection
|
||||
Based on compendium project's successful implementation
|
||||
"""
|
||||
|
||||
import re
|
||||
import time
|
||||
import logging
|
||||
from typing import Dict, Any, Optional, List
|
||||
from pathlib import Path
|
||||
from datetime import datetime, timedelta
|
||||
import yt_dlp
|
||||
from .cookie_manager import CookieManager
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
class YouTubeAuthHandler:
|
||||
"""Handle YouTube authentication with bot detection and recovery"""
|
||||
|
||||
# Bot detection patterns from compendium
|
||||
BOT_DETECTION_PATTERNS = [
|
||||
r"sign in to confirm you're not a bot",
|
||||
r"this helps protect our community",
|
||||
r"unusual traffic",
|
||||
r"automated requests",
|
||||
r"rate.*limit",
|
||||
r"HTTP Error 403",
|
||||
r"429 Too Many Requests",
|
||||
r"quota exceeded",
|
||||
r"temporarily blocked",
|
||||
r"suspicious activity",
|
||||
r"verify.*human",
|
||||
r"captcha",
|
||||
r"robot",
|
||||
r"please try again later",
|
||||
r"slow down",
|
||||
r"access denied",
|
||||
r"service unavailable"
|
||||
]
|
||||
|
||||
def __init__(self):
|
||||
self.cookie_manager = CookieManager()
|
||||
self.failure_count = 0
|
||||
self.last_failure_time = None
|
||||
self.cooldown_duration = 5 * 60 # 5 minutes
|
||||
self.mass_failure_threshold = 10 # Trigger recovery after 10 failures
|
||||
self.authenticated = False
|
||||
|
||||
def is_bot_detection_error(self, error_message: str) -> bool:
|
||||
"""Check if error message indicates bot detection"""
|
||||
|
||||
error_lower = error_message.lower()
|
||||
for pattern in self.BOT_DETECTION_PATTERNS:
|
||||
if re.search(pattern, error_lower):
|
||||
logger.warning(f"Bot detection pattern matched: {pattern}")
|
||||
return True
|
||||
return False
|
||||
|
||||
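# Hypothetical messages run through is_bot_detection_error (the incoming
# message is lowercased, so patterns must be lowercase too):
#   "ERROR: HTTP Error 403: Forbidden"    -> True  (matches "http error 403")
#   "Sign in to confirm you're not a bot" -> True
#   "Video unavailable"                   -> False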
def is_in_cooldown(self) -> bool:
|
||||
"""Check if we're in cooldown period"""
|
||||
|
||||
if self.last_failure_time is None:
|
||||
return False
|
||||
|
||||
elapsed = time.time() - self.last_failure_time
|
||||
return elapsed < self.cooldown_duration
|
||||
|
||||
def record_failure(self, error_message: str):
|
||||
"""Record authentication failure"""
|
||||
|
||||
self.failure_count += 1
|
||||
self.last_failure_time = time.time()
|
||||
self.authenticated = False
|
||||
|
||||
logger.error(f"Authentication failure #{self.failure_count}: {error_message}")
|
||||
|
||||
if self.failure_count >= self.mass_failure_threshold:
|
||||
logger.critical(f"Mass failure detected ({self.failure_count} failures)")
|
||||
self._trigger_recovery()
|
||||
|
||||
def record_success(self):
|
||||
"""Record successful authentication"""
|
||||
|
||||
self.failure_count = 0
|
||||
self.last_failure_time = None
|
||||
self.authenticated = True
|
||||
logger.info("Authentication successful - failure count reset")
|
||||
|
||||
def _trigger_recovery(self):
|
||||
"""Trigger recovery procedures after mass failures"""
|
||||
|
||||
logger.info("Triggering authentication recovery procedures...")
|
||||
|
||||
# Clean up old cookies
|
||||
self.cookie_manager.cleanup_old_backups(keep_count=3)
|
||||
|
||||
# Force cooldown
|
||||
self.last_failure_time = time.time()
|
||||
|
||||
logger.info(f"Recovery complete - entering {self.cooldown_duration}s cooldown")
|
||||
|
||||
def get_ytdlp_options(self, include_auth: bool = True, use_browser_cookies: bool = True) -> Dict[str, Any]:
|
||||
"""Get optimized yt-dlp options with 2025 authentication methods"""
|
||||
|
||||
base_opts = {
|
||||
'quiet': True,
|
||||
'no_warnings': True,
|
||||
'writesubtitles': True,
|
||||
'writeautomaticsub': True,
|
||||
'subtitleslangs': ['en'],
|
||||
'socket_timeout': 30,
|
||||
'extractor_retries': 3,
|
||||
'fragment_retries': 10,
|
||||
'retry_sleep_functions': {'http': lambda n: min(10 * n, 60)},
|
||||
'skip_download': True,
|
||||
# Critical: Add sleep intervals as per compendium
|
||||
'sleep_interval_requests': 15, # 15 seconds between requests (compendium uses 10+)
|
||||
'sleep_interval': 5, # 5 seconds between downloads
|
||||
'max_sleep_interval': 30, # Max sleep interval
|
||||
# Add rate limiting
|
||||
'ratelimit': 50000, # 50KB/s to be more conservative
|
||||
'ignoreerrors': True, # Continue on errors
|
||||
# 2025 User-Agent (latest Chrome)
|
||||
'user_agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
|
||||
'referer': 'https://www.youtube.com/',
|
||||
'http_headers': {
|
||||
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
|
||||
'Accept-Language': 'en-us,en;q=0.5',
|
||||
'Accept-Encoding': 'gzip,deflate',
|
||||
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
|
||||
'Keep-Alive': '300',
|
||||
'Connection': 'keep-alive',
|
||||
}
|
||||
}
|
||||
|
||||
if include_auth:
|
||||
# Prioritize browser cookies as per yt-dlp 2025 recommendations
|
||||
if use_browser_cookies:
|
||||
try:
|
||||
# Use Firefox browser cookies directly (2025 recommended method)
|
||||
base_opts['cookiesfrombrowser'] = ('firefox', '/home/ben/snap/firefox/common/.mozilla/firefox/7a3tcyzf.default')
|
||||
logger.debug("Using direct Firefox browser cookies (2025 method)")
|
||||
except Exception as e:
|
||||
logger.warning(f"Browser cookie error: {e}")
|
||||
# Fallback to auto-discovery
|
||||
base_opts['cookiesfrombrowser'] = ('firefox',)
|
||||
logger.debug("Using Firefox browser cookies with auto-discovery")
|
||||
else:
|
||||
# Fallback to cookie file method
|
||||
try:
|
||||
cookie_path = self.cookie_manager.find_valid_cookies()
|
||||
if cookie_path:
|
||||
base_opts['cookiefile'] = str(cookie_path)
|
||||
logger.debug(f"Using cookie file: {cookie_path}")
|
||||
else:
|
||||
logger.warning("No valid cookies found")
|
||||
except Exception as e:
|
||||
logger.warning(f"Cookie management error: {e}")
|
||||
|
||||
return base_opts
|
||||
|
||||
def extract_video_info(self, video_url: str, max_retries: int = 3) -> Optional[Dict[str, Any]]:
|
||||
"""Extract video info with 2025 authentication and retry logic"""
|
||||
|
||||
if self.is_in_cooldown():
|
||||
remaining = self.cooldown_duration - (time.time() - self.last_failure_time)
|
||||
logger.warning(f"In cooldown - {remaining:.0f}s remaining")
|
||||
return None
|
||||
|
||||
# Try both browser cookies and file cookies
|
||||
auth_methods = [
|
||||
("browser_cookies", True), # 2025 recommended method
|
||||
("file_cookies", False) # Fallback method
|
||||
]
|
||||
|
||||
for method_name, use_browser in auth_methods:
|
||||
logger.info(f"Trying authentication method: {method_name}")
|
||||
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
ydl_opts = self.get_ytdlp_options(use_browser_cookies=use_browser)
|
||||
|
||||
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
|
||||
logger.debug(f"Extracting video info ({method_name}, attempt {attempt + 1}/{max_retries}): {video_url}")
|
||||
info = ydl.extract_info(video_url, download=False)
|
||||
|
||||
if info:
|
||||
logger.info(f"✅ Success with {method_name}")
|
||||
self.record_success()
|
||||
return info
|
||||
|
||||
except Exception as e:
|
||||
error_msg = str(e)
|
||||
logger.error(f"{method_name} attempt {attempt + 1} failed: {error_msg}")
|
||||
|
||||
if self.is_bot_detection_error(error_msg):
|
||||
self.record_failure(error_msg)
|
||||
|
||||
# If bot detection with browser cookies, try longer delay
|
||||
if use_browser and attempt < max_retries - 1:
|
||||
delay = (attempt + 1) * 60 # 60s, 120s, 180s for browser method
|
||||
logger.info(f"Bot detection with browser cookies - waiting {delay}s before retry")
|
||||
time.sleep(delay)
|
||||
elif attempt < max_retries - 1:
|
||||
delay = (attempt + 1) * 30 # 30s, 60s, 90s for file method
|
||||
logger.info(f"Bot detection - waiting {delay}s before retry")
|
||||
time.sleep(delay)
|
||||
else:
|
||||
# Non-bot error, shorter delay
|
||||
if attempt < max_retries - 1:
|
||||
time.sleep(10)
|
||||
|
||||
# If this method failed completely, try next method
|
||||
logger.warning(f"Method {method_name} failed after {max_retries} attempts")
|
||||
|
||||
logger.error(f"All authentication methods failed after {max_retries} attempts each")
|
||||
return None
|
||||
|
||||
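# Retry/backoff schedule implied by extract_video_info above (max_retries=3):
#   browser-cookie method, bot detection: wait 60s, then 120s between attempts
#   file-cookie method,    bot detection: wait 30s, then 60s between attempts
#   any other error:                      wait 10s between attempts
# Once both methods are exhausted the caller gets None, and the failure counter
# plus the 5-minute cooldown decide when the handler will try again.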
def test_authentication(self) -> bool:
|
||||
"""Test authentication with a known video"""
|
||||
|
||||
test_video = "https://www.youtube.com/watch?v=dQw4w9WgXcQ" # Rick Roll - always available
|
||||
|
||||
logger.info("Testing YouTube authentication...")
|
||||
info = self.extract_video_info(test_video, max_retries=1)
|
||||
|
||||
if info:
|
||||
logger.info("✅ Authentication test successful")
|
||||
return True
|
||||
else:
|
||||
logger.error("❌ Authentication test failed")
|
||||
return False
|
||||
|
||||
def get_status(self) -> Dict[str, Any]:
|
||||
"""Get current authentication status"""
|
||||
|
||||
cookie_path = self.cookie_manager.find_valid_cookies()
|
||||
|
||||
status = {
|
||||
'authenticated': self.authenticated,
|
||||
'failure_count': self.failure_count,
|
||||
'in_cooldown': self.is_in_cooldown(),
|
||||
'cooldown_remaining': 0,
|
||||
'has_valid_cookies': cookie_path is not None,
|
||||
'cookie_path': str(cookie_path) if cookie_path else None,
|
||||
}
|
||||
|
||||
if self.is_in_cooldown() and self.last_failure_time:
|
||||
status['cooldown_remaining'] = max(0, self.cooldown_duration - (time.time() - self.last_failure_time))
|
||||
|
||||
return status
|
||||
|
||||
def force_reauthentication(self):
|
||||
"""Force re-authentication on next request"""
|
||||
|
||||
logger.info("Forcing re-authentication...")
|
||||
self.authenticated = False
|
||||
self.failure_count = 0
|
||||
self.last_failure_time = None
|
||||
|
||||
def update_cookies_from_browser(self) -> bool:
|
||||
"""Update cookies from browser session - Compendium method"""
|
||||
|
||||
logger.info("Attempting to update cookies from browser using compendium method...")
|
||||
|
||||
# Snap Firefox path for this system
|
||||
browser_profiles = [
|
||||
('firefox', '/home/ben/snap/firefox/common/.mozilla/firefox/7a3tcyzf.default'),
|
||||
('firefox', None), # Let yt-dlp auto-discover
|
||||
('chrome', None),
|
||||
('chromium', None)
|
||||
]
|
||||
|
||||
for browser, profile_path in browser_profiles:
|
||||
try:
|
||||
logger.info(f"Trying to extract cookies from {browser}" + (f" (profile: {profile_path})" if profile_path else ""))
|
||||
|
||||
# Use yt-dlp to extract cookies from browser
|
||||
if profile_path:
|
||||
temp_opts = {
|
||||
'cookiesfrombrowser': (browser, profile_path),
|
||||
'quiet': False, # Enable output to see what's happening
|
||||
'skip_download': True,
|
||||
'no_warnings': False,
|
||||
}
|
||||
else:
|
||||
temp_opts = {
|
||||
'cookiesfrombrowser': (browser,),
|
||||
'quiet': False,
|
||||
'skip_download': True,
|
||||
'no_warnings': False,
|
||||
}
|
||||
|
||||
# Test with a simple video first
|
||||
test_video = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
|
||||
|
||||
logger.info(f"Testing {browser} cookies with test video...")
|
||||
with yt_dlp.YoutubeDL(temp_opts) as ydl:
|
||||
info = ydl.extract_info(test_video, download=False)
|
||||
|
||||
if info and not self.is_bot_detection_error(str(info)):
|
||||
logger.info(f"✅ Successfully authenticated with {browser} cookies!")
|
||||
|
||||
# Now save the working cookies
|
||||
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
||||
cookie_path = Path(f"data_production_backlog/.cookies/youtube_cookies_{browser}_{timestamp}.txt")
|
||||
cookie_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
save_opts = temp_opts.copy()
|
||||
save_opts['cookiefile'] = str(cookie_path)
|
||||
|
||||
logger.info(f"Saving working {browser} cookies to {cookie_path}")
|
||||
with yt_dlp.YoutubeDL(save_opts) as ydl2:
|
||||
# Save cookies by doing another extraction
|
||||
ydl2.extract_info(test_video, download=False)
|
||||
|
||||
if cookie_path.exists() and cookie_path.stat().st_size > 100:
|
||||
# Update main cookie file using compendium atomic method
|
||||
success = self.cookie_manager.update_cookies(cookie_path)
|
||||
if success:
|
||||
logger.info(f"✅ Cookies successfully updated from {browser}")
|
||||
self.record_success()
|
||||
return True
|
||||
else:
|
||||
logger.warning(f"Cookie file was not created or is too small: {cookie_path}")
|
||||
|
||||
except Exception as e:
|
||||
error_msg = str(e)
|
||||
logger.warning(f"Failed to extract cookies from {browser}: {error_msg}")
|
||||
|
||||
# Check if this is a bot detection error
|
||||
if self.is_bot_detection_error(error_msg):
|
||||
logger.error(f"Bot detection error with {browser} - this browser session may be flagged")
|
||||
continue
|
||||
|
||||
logger.error("Failed to extract working cookies from any browser")
|
||||
return False
|
||||
|
||||
# Convenience functions
|
||||
def get_auth_handler() -> YouTubeAuthHandler:
|
||||
"""Get YouTube authentication handler"""
|
||||
return YouTubeAuthHandler()
|
||||
|
||||
def test_youtube_access() -> bool:
|
||||
"""Test YouTube access"""
|
||||
handler = YouTubeAuthHandler()
|
||||
return handler.test_authentication()
|
||||
|
||||
def extract_youtube_video(video_url: str) -> Optional[Dict[str, Any]]:
|
||||
"""Extract YouTube video with authentication"""
|
||||
handler = YouTubeAuthHandler()
|
||||
return handler.extract_video_info(video_url)
|
||||
|
|
@@ -2,14 +2,11 @@ import os
|
|||
import time
|
||||
import random
|
||||
import json
|
||||
import urllib.request
|
||||
import urllib.parse
|
||||
from typing import Any, Dict, List, Optional
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
import yt_dlp
|
||||
from src.base_scraper import BaseScraper, ScraperConfig
|
||||
from src.youtube_auth_handler import YouTubeAuthHandler
|
||||
|
||||
|
||||
class YouTubeScraper(BaseScraper):
|
||||
|
|
@@ -17,45 +14,41 @@ class YouTubeScraper(BaseScraper):
|
|||
|
||||
def __init__(self, config: ScraperConfig):
|
||||
super().__init__(config)
|
||||
self.username = os.getenv('YOUTUBE_USERNAME')
|
||||
self.password = os.getenv('YOUTUBE_PASSWORD')
|
||||
self.channel_url = os.getenv('YOUTUBE_CHANNEL_URL', 'https://www.youtube.com/@HVACKnowItAll')
|
||||
# Use videos tab URL to get individual videos instead of playlists
|
||||
self.videos_url = self.channel_url.rstrip('/') + '/videos'
|
||||
|
||||
# Initialize authentication handler
|
||||
self.auth_handler = YouTubeAuthHandler()
|
||||
|
||||
# Setup cookies_file attribute for compatibility
|
||||
self.cookies_file = Path(config.data_dir) / '.cookies' / 'youtube_cookies.txt'
|
||||
# Cookies file for session persistence
|
||||
self.cookies_file = self.config.data_dir / '.cookies' / 'youtube_cookies.txt'
|
||||
self.cookies_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Test authentication on startup
|
||||
auth_status = self.auth_handler.get_status()
|
||||
if not auth_status['has_valid_cookies']:
|
||||
self.logger.warning("No valid YouTube cookies found")
|
||||
# Try to extract from browser
|
||||
if self.auth_handler.update_cookies_from_browser():
|
||||
self.logger.info("Successfully extracted cookies from browser")
|
||||
else:
|
||||
self.logger.error("Failed to get YouTube authentication")
|
||||
# User agents for rotation
|
||||
self.user_agents = [
|
||||
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
|
||||
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
|
||||
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
|
||||
]
|
||||
|
||||
def _get_ydl_options(self, include_transcripts: bool = False) -> Dict[str, Any]:
|
||||
def _get_ydl_options(self) -> Dict[str, Any]:
|
||||
"""Get yt-dlp options with authentication and rate limiting."""
|
||||
# Use the auth handler's optimized options
|
||||
options = self.auth_handler.get_ytdlp_options(include_auth=True)
|
||||
|
||||
# Add transcript options if requested
|
||||
if include_transcripts:
|
||||
options.update({
|
||||
'writesubtitles': True,
|
||||
'writeautomaticsub': True,
|
||||
'subtitleslangs': ['en'],
|
||||
})
|
||||
|
||||
# Override with more conservative settings for channel scraping
|
||||
options.update({
|
||||
options = {
|
||||
'quiet': True,
|
||||
'no_warnings': True,
|
||||
'extract_flat': False, # Get full video info
|
||||
'sleep_interval_requests': 20, # Even more conservative for channel scraping
|
||||
})
|
||||
'ignoreerrors': True, # Continue on error
|
||||
'cookiefile': str(self.cookies_file),
|
||||
'cookiesfrombrowser': None, # Don't use browser cookies
|
||||
'username': self.username,
|
||||
'password': self.password,
|
||||
'ratelimit': 100000, # 100KB/s rate limit
|
||||
'sleep_interval': 1, # Sleep between downloads
|
||||
'max_sleep_interval': 3,
|
||||
'user_agent': random.choice(self.user_agents),
|
||||
'referer': 'https://www.youtube.com/',
|
||||
'add_header': ['Accept-Language:en-US,en;q=0.9'],
|
||||
}
|
||||
|
||||
# Add proxy if configured
|
||||
proxy = os.getenv('YOUTUBE_PROXY')
|
||||
|
|
@@ -69,37 +62,17 @@ class YouTubeScraper(BaseScraper):
|
|||
delay = random.uniform(min_seconds, max_seconds)
|
||||
self.logger.debug(f"Waiting {delay:.2f} seconds...")
|
||||
time.sleep(delay)
|
||||
|
||||
def _backlog_delay(self, transcript_mode: bool = False) -> None:
|
||||
"""Minimal delay for backlog processing - yt-dlp handles most rate limiting."""
|
||||
if transcript_mode:
|
||||
# Minimal delay for transcript fetching - let yt-dlp handle it
|
||||
base_delay = random.uniform(2, 5)
|
||||
else:
|
||||
# Minimal delay for basic video info
|
||||
base_delay = random.uniform(1, 3)
|
||||
|
||||
# Add some randomization to appear more human
|
||||
jitter = random.uniform(0.8, 1.2)
|
||||
final_delay = base_delay * jitter
|
||||
|
||||
self.logger.debug(f"Minimal backlog delay: {final_delay:.1f} seconds...")
|
||||
time.sleep(final_delay)
|
||||
|
||||
def fetch_channel_videos(self, max_videos: int = 50) -> List[Dict[str, Any]]:
|
||||
"""Fetch video list from YouTube channel using auth handler."""
|
||||
"""Fetch video list from YouTube channel."""
|
||||
videos = []
|
||||
|
||||
try:
|
||||
self.logger.info(f"Fetching videos from channel: {self.videos_url}")
|
||||
|
||||
# Use auth handler's optimized extraction with proper cookie management
|
||||
ydl_opts = self.auth_handler.get_ytdlp_options(include_auth=True)
|
||||
ydl_opts.update({
|
||||
'extract_flat': True, # Just get video list, not full info
|
||||
'playlistend': max_videos,
|
||||
'sleep_interval_requests': 10, # Conservative for channel listing
|
||||
})
|
||||
ydl_opts = self._get_ydl_options()
|
||||
ydl_opts['extract_flat'] = True # Just get video list, not full info
|
||||
ydl_opts['playlistend'] = max_videos
|
||||
|
||||
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
|
||||
channel_info = ydl.extract_info(self.videos_url, download=False)
|
||||
|
|
@@ -110,230 +83,30 @@ class YouTubeScraper(BaseScraper):
|
|||
self.logger.info(f"Found {len(videos)} videos in channel")
|
||||
else:
|
||||
self.logger.warning("No entries found in channel info")
|
||||
|
||||
# Save cookies for next session
|
||||
if self.cookies_file.exists():
|
||||
self.logger.debug("Cookies saved for next session")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching channel videos: {e}")
|
||||
# Check for bot detection and try recovery
|
||||
if self.auth_handler.is_bot_detection_error(str(e)):
|
||||
self.logger.warning("Bot detection in channel fetch - attempting recovery")
|
||||
self.auth_handler.record_failure(str(e))
|
||||
# Try browser cookie update
|
||||
if self.auth_handler.update_cookies_from_browser():
|
||||
self.logger.info("Cookie update successful - could retry channel fetch")
|
||||
|
||||
return videos
|
||||
|
||||
def fetch_video_details(self, video_id: str, fetch_transcript: bool = False) -> Optional[Dict[str, Any]]:
|
||||
"""Fetch detailed information for a specific video, optionally including transcript."""
|
||||
def fetch_video_details(self, video_id: str) -> Optional[Dict[str, Any]]:
|
||||
"""Fetch detailed information for a specific video."""
|
||||
try:
|
||||
video_url = f"https://www.youtube.com/watch?v={video_id}"
|
||||
|
||||
# Use auth handler for authenticated extraction with compendium retry logic
|
||||
video_info = self.auth_handler.extract_video_info(video_url, max_retries=3)
|
||||
ydl_opts = self._get_ydl_options()
|
||||
ydl_opts['extract_flat'] = False # Get full video info
|
||||
|
||||
if not video_info:
|
||||
self.logger.error(f"Failed to extract video info for {video_id}")
|
||||
|
||||
# If extraction failed, try to update cookies from browser (compendium approach)
|
||||
if self.auth_handler.failure_count >= 3:
|
||||
self.logger.warning("Multiple failures detected - attempting browser cookie extraction")
|
||||
if self.auth_handler.update_cookies_from_browser():
|
||||
self.logger.info("Cookie update successful - retrying video extraction")
|
||||
video_info = self.auth_handler.extract_video_info(video_url, max_retries=1)
|
||||
|
||||
if not video_info:
|
||||
return None
|
||||
|
||||
# Extract transcript if requested and available
|
||||
if fetch_transcript:
|
||||
transcript = self._extract_transcript(video_info)
|
||||
if transcript:
|
||||
video_info['transcript'] = transcript
|
||||
self.logger.info(f"Extracted transcript for video {video_id} ({len(transcript)} chars)")
|
||||
else:
|
||||
video_info['transcript'] = None
|
||||
self.logger.warning(f"No transcript available for video {video_id}")
|
||||
|
||||
return video_info
|
||||
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
|
||||
video_info = ydl.extract_info(video_url, download=False)
|
||||
return video_info
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching video {video_id}: {e}")
|
||||
# Check if this is a bot detection error and handle accordingly
|
||||
if self.auth_handler.is_bot_detection_error(str(e)):
|
||||
self.logger.warning("Bot detection error - triggering enhanced recovery")
|
||||
self.auth_handler.record_failure(str(e))
|
||||
|
||||
# Try browser cookie extraction immediately for bot detection
|
||||
if self.auth_handler.update_cookies_from_browser():
|
||||
self.logger.info("Emergency cookie update successful - attempting retry")
|
||||
try:
|
||||
video_info = self.auth_handler.extract_video_info(video_url, max_retries=1)
|
||||
if video_info:
|
||||
if fetch_transcript:
|
||||
transcript = self._extract_transcript(video_info)
|
||||
if transcript:
|
||||
video_info['transcript'] = transcript
|
||||
return video_info
|
||||
except Exception as retry_error:
|
||||
self.logger.error(f"Retry after cookie update failed: {retry_error}")
|
||||
|
||||
return None
|
||||
|
||||
def _extract_transcript(self, video_info: Dict[str, Any]) -> Optional[str]:
|
||||
"""Extract transcript text from video info."""
|
||||
try:
|
||||
# Try to get subtitles or automatic captions
|
||||
subtitles = video_info.get('subtitles', {})
|
||||
auto_captions = video_info.get('automatic_captions', {})
|
||||
|
||||
# Prefer English subtitles/captions
|
||||
transcript_data = None
|
||||
transcript_source = None
|
||||
|
||||
if 'en' in subtitles:
|
||||
transcript_data = subtitles['en']
|
||||
transcript_source = "manual subtitles"
|
||||
elif 'en' in auto_captions:
|
||||
transcript_data = auto_captions['en']
|
||||
transcript_source = "auto-generated captions"
|
||||
|
||||
if not transcript_data:
|
||||
return None
|
||||
|
||||
self.logger.debug(f"Using {transcript_source} for video {video_info.get('id')}")
|
||||
|
||||
# Find the best format (prefer json3, then srv1, then vtt)
|
||||
caption_url = None
|
||||
format_preference = ['json3', 'srv1', 'vtt', 'ttml']
|
||||
|
||||
for preferred_format in format_preference:
|
||||
for caption in transcript_data:
|
||||
if caption.get('ext') == preferred_format:
|
||||
caption_url = caption.get('url')
|
||||
break
|
||||
if caption_url:
|
||||
break
|
||||
|
||||
if not caption_url:
|
||||
# Fallback to first available format
|
||||
if transcript_data:
|
||||
caption_url = transcript_data[0].get('url')
|
||||
|
||||
if not caption_url:
|
||||
return None
|
||||
|
||||
# Fetch and parse the transcript
|
||||
return self._fetch_and_parse_transcript(caption_url, video_info.get('id'))
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error extracting transcript: {e}")
|
||||
return None
|
||||
|
||||
def _fetch_and_parse_transcript(self, caption_url: str, video_id: str) -> Optional[str]:
|
||||
"""Fetch and parse transcript from caption URL."""
|
||||
try:
|
||||
# Fetch the caption content
|
||||
with urllib.request.urlopen(caption_url) as response:
|
||||
content = response.read().decode('utf-8')
|
||||
|
||||
# Parse based on format
|
||||
if 'json3' in caption_url or caption_url.endswith('.json'):
|
||||
return self._parse_json_transcript(content)
|
||||
elif 'srv1' in caption_url or 'srv2' in caption_url:
|
||||
return self._parse_srv_transcript(content)
|
||||
elif caption_url.endswith('.vtt'):
|
||||
return self._parse_vtt_transcript(content)
|
||||
else:
|
||||
# Try to auto-detect format
|
||||
content_lower = content.lower().strip()
|
||||
if content_lower.startswith('{') or 'wiremag' in content_lower:  # json3 starts with '{' and carries a "wireMagic" field
|
||||
return self._parse_json_transcript(content)
|
||||
elif 'webvtt' in content_lower:
|
||||
return self._parse_vtt_transcript(content)
|
||||
elif '<transcript>' in content_lower or '<text>' in content_lower:
|
||||
return self._parse_srv_transcript(content)
|
||||
else:
|
||||
# Last resort - return raw content
|
||||
self.logger.warning(f"Unknown transcript format for {video_id}, returning raw content")
|
||||
return content
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error fetching transcript for video {video_id}: {e}")
|
||||
return None
|
||||
|
||||
def _parse_json_transcript(self, content: str) -> Optional[str]:
|
||||
"""Parse JSON3 format transcript."""
|
||||
try:
|
||||
data = json.loads(content)
|
||||
transcript_parts = []
|
||||
|
||||
# Handle YouTube's JSON3 format
|
||||
if 'events' in data:
|
||||
for event in data['events']:
|
||||
if 'segs' in event:
|
||||
for seg in event['segs']:
|
||||
if 'utf8' in seg:
|
||||
text = seg['utf8'].strip()
|
||||
if text and text not in ['♪', '[Music]', '[Applause]']:
|
||||
transcript_parts.append(text)
|
||||
|
||||
return ' '.join(transcript_parts) if transcript_parts else None
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error parsing JSON transcript: {e}")
|
||||
return None
|
||||
|
||||
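# A json3 caption payload, as consumed by _parse_json_transcript above,
# typically looks roughly like (abridged):
#   {"wireMagic": "pb3",
#    "events": [{"tStartMs": 0, "segs": [{"utf8": "hello "}, {"utf8": "world"}]}]}
# The parser walks events -> segs -> utf8, joins the text, and drops
# music/applause markers; the "wireMagic" field is also what the 'wiremag'
# auto-detection check looks for.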
def _parse_srv_transcript(self, content: str) -> Optional[str]:
|
||||
"""Parse SRV format transcript (XML-like)."""
|
||||
try:
|
||||
import xml.etree.ElementTree as ET
|
||||
|
||||
# Parse XML content
|
||||
root = ET.fromstring(content)
|
||||
transcript_parts = []
|
||||
|
||||
# Extract text from <text> elements
|
||||
for text_elem in root.findall('.//text'):
|
||||
text = text_elem.text
|
||||
if text and text.strip():
|
||||
clean_text = text.strip()
|
||||
if clean_text not in ['♪', '[Music]', '[Applause]']:
|
||||
transcript_parts.append(clean_text)
|
||||
|
||||
return ' '.join(transcript_parts) if transcript_parts else None
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error parsing SRV transcript: {e}")
|
||||
return None
|
||||
|
||||
def _parse_vtt_transcript(self, content: str) -> Optional[str]:
|
||||
"""Parse VTT format transcript."""
|
||||
try:
|
||||
lines = content.split('\n')
|
||||
transcript_parts = []
|
||||
|
||||
for line in lines:
|
||||
line = line.strip()
|
||||
# Skip VTT headers, timestamps, and empty lines
|
||||
if (not line or
|
||||
line.startswith('WEBVTT') or
|
||||
line.startswith('NOTE') or
|
||||
'-->' in line or
|
||||
line.isdigit()):
|
||||
continue
|
||||
|
||||
# Clean up common caption artifacts
|
||||
if line not in ['♪', '[Music]', '[Applause]', ' ']:
|
||||
# Remove HTML tags if present
|
||||
import re
|
||||
clean_line = re.sub(r'<[^>]+>', '', line)
|
||||
if clean_line.strip():
|
||||
transcript_parts.append(clean_line.strip())
|
||||
|
||||
return ' '.join(transcript_parts) if transcript_parts else None
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error parsing VTT transcript: {e}")
|
||||
return None
|
||||
|
||||
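# Minimal WebVTT input handled by _parse_vtt_transcript above (headers,
# timestamps, and cue numbers are skipped; only cue text is kept):
#   WEBVTT
#
#   00:00:01.000 --> 00:00:03.000
#   so today we're talking about <b>heat pumps</b>
# -> "so today we're talking about heat pumps"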
def _get_video_type(self, video: Dict[str, Any]) -> str:
|
||||
|
|
@@ -348,7 +121,7 @@ class YouTubeScraper(BaseScraper):
|
|||
else:
|
||||
return 'video'
|
||||
|
||||
def fetch_content(self, max_posts: int = None, fetch_transcripts: bool = False) -> List[Dict[str, Any]]:
|
||||
def fetch_content(self) -> List[Dict[str, Any]]:
|
||||
"""Fetch and enrich video content with rate limiting."""
|
||||
# First get list of videos
|
||||
videos = self.fetch_channel_videos()
|
||||
|
|
@@ -356,10 +129,6 @@ class YouTubeScraper(BaseScraper):
|
|||
if not videos:
|
||||
return []
|
||||
|
||||
# Limit videos if max_posts specified
|
||||
if max_posts:
|
||||
videos = videos[:max_posts]
|
||||
|
||||
# Enrich each video with detailed information
|
||||
enriched_videos = []
|
||||
|
||||
|
|
@@ -369,44 +138,24 @@ class YouTubeScraper(BaseScraper):
|
|||
if not video_id:
|
||||
continue
|
||||
|
||||
transcript_note = " (with transcripts)" if fetch_transcripts else ""
|
||||
self.logger.info(f"Fetching details for video {i+1}/{len(videos)}: {video_id}{transcript_note}")
|
||||
self.logger.info(f"Fetching details for video {i+1}/{len(videos)}: {video_id}")
|
||||
|
||||
# Determine if this is backlog processing (no max_posts = full backlog)
|
||||
is_backlog = max_posts is None
|
||||
|
||||
# Add appropriate delay between requests
|
||||
# Add humanized delay between requests
|
||||
if i > 0:
|
||||
if is_backlog:
|
||||
# Use extended backlog delays (30-90 seconds for transcripts)
|
||||
self._backlog_delay(transcript_mode=fetch_transcripts)
|
||||
else:
|
||||
# Use normal delays for limited fetching
|
||||
self._humanized_delay()
|
||||
self._humanized_delay()
|
||||
|
||||
# Fetch full video details with optional transcripts
|
||||
detailed_info = self.fetch_video_details(video_id, fetch_transcript=fetch_transcripts)
|
||||
# Fetch full video details
|
||||
detailed_info = self.fetch_video_details(video_id)
|
||||
|
||||
if detailed_info:
|
||||
# Add video type
|
||||
detailed_info['type'] = self._get_video_type(detailed_info)
|
||||
enriched_videos.append(detailed_info)
|
||||
|
||||
# Extra delay after every 5 videos for backlog processing
|
||||
if is_backlog and (i + 1) % 5 == 0:
|
||||
self.logger.info("Taking extended break after 5 videos (backlog mode)...")
|
||||
# Even longer break every 5 videos for backlog (2-5 minutes)
|
||||
extra_delay = random.uniform(120, 300) # 2-5 minutes
|
||||
self.logger.info(f"Extended break: {extra_delay/60:.1f} minutes...")
|
||||
time.sleep(extra_delay)
|
||||
else:
|
||||
# If video details failed and we're doing transcripts, check for rate limiting
|
||||
if fetch_transcripts and is_backlog:
|
||||
self.logger.warning(f"Failed to get details for video {video_id} - may be rate limited")
|
||||
# Add emergency rate limiting delay
|
||||
emergency_delay = random.uniform(180, 300) # 3-5 minutes
|
||||
self.logger.info(f"Emergency rate limit delay: {emergency_delay/60:.1f} minutes...")
|
||||
time.sleep(emergency_delay)
|
||||
# Extra delay after every 5 videos
|
||||
if (i + 1) % 5 == 0:
|
||||
self.logger.info("Taking longer break after 5 videos...")
|
||||
self._humanized_delay(5, 10)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error enriching video {video.get('id')}: {e}")
|
||||
|
|
@@ -499,13 +248,6 @@ class YouTubeScraper(BaseScraper):
|
|||
section.append(description)
|
||||
section.append("")
|
||||
|
||||
# Transcript
|
||||
transcript = video.get('transcript')
|
||||
if transcript:
|
||||
section.append("## Transcript:")
|
||||
section.append(transcript)
|
||||
section.append("")
|
||||
|
||||
# Separator
|
||||
section.append("-" * 50)
|
||||
section.append("")
|
||||
|
|
|
|||
|
|
@@ -1,16 +0,0 @@
|
|||
[Unit]
|
||||
Description=HKIA Content NAS Sync
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
User=ben
|
||||
Group=ben
|
||||
WorkingDirectory=/home/ben/dev/hvac-kia-content
|
||||
Environment="PATH=/home/ben/.local/bin:/usr/local/bin:/usr/bin:/bin"
|
||||
ExecStart=/usr/bin/bash -c 'source /home/ben/dev/hvac-kia-content/.venv/bin/activate && python -m src.orchestrator --nas-only'
|
||||
StandardOutput=journal
|
||||
StandardError=journal
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
|
|
@@ -1,13 +0,0 @@
|
|||
[Unit]
|
||||
Description=HKIA NAS Sync Timer - Runs 30min after scraper runs
|
||||
Requires=hkia-scraper-nas.service
|
||||
|
||||
[Timer]
|
||||
# 8:30 AM Atlantic Daylight Time (local time)
|
||||
OnCalendar=*-*-* 08:30:00
|
||||
# 12:30 PM Atlantic Daylight Time (local time)
|
||||
OnCalendar=*-*-* 12:30:00
|
||||
Persistent=true
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
|
|
@@ -1,18 +0,0 @@
|
|||
[Unit]
|
||||
Description=HKIA Content Scraper - Main Run
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
User=ben
|
||||
Group=ben
|
||||
WorkingDirectory=/home/ben/dev/hvac-kia-content
|
||||
Environment="PATH=/home/ben/.local/bin:/usr/local/bin:/usr/bin:/bin"
|
||||
Environment="DISPLAY=:0"
|
||||
Environment="XAUTHORITY=/run/user/1000/.mutter-Xwaylandauth.90WDB3"
|
||||
ExecStart=/usr/bin/bash -c 'source /home/ben/dev/hvac-kia-content/.venv/bin/activate && python run_production_with_images.py'
|
||||
StandardOutput=journal
|
||||
StandardError=journal
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
|
|
@@ -1,13 +0,0 @@
|
|||
[Unit]
|
||||
Description=HKIA Content Scraper Timer - Runs at 8AM and 12PM ADT
|
||||
Requires=hkia-scraper.service
|
||||
|
||||
[Timer]
|
||||
# 8:00 AM Atlantic Daylight Time (local time)
|
||||
OnCalendar=*-*-* 08:00:00
|
||||
# 12:00 PM Atlantic Daylight Time (local time)
|
||||
OnCalendar=*-*-* 12:00:00
|
||||
Persistent=true
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
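# To activate this schedule (assuming the units above are installed in
# /etc/systemd/system as hkia-scraper.timer / hkia-scraper-nas.timer alongside
# their matching services):
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now hkia-scraper.timer hkia-scraper-nas.timer
#   systemctl list-timers 'hkia-*'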
|
|
@@ -1,162 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Test full backlog capture with new API scrapers
|
||||
This will fetch all YouTube videos and MailChimp campaigns using APIs
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.youtube_api_scraper import YouTubeAPIScraper
|
||||
from src.mailchimp_api_scraper import MailChimpAPIScraper
|
||||
from src.base_scraper import ScraperConfig
|
||||
import time
|
||||
|
||||
def test_youtube_api_full():
|
||||
"""Test YouTube API scraper with full channel fetch"""
|
||||
print("=" * 60)
|
||||
print("TESTING YOUTUBE API SCRAPER - FULL CHANNEL")
|
||||
print("=" * 60)
|
||||
|
||||
config = ScraperConfig(
|
||||
source_name='youtube_api',
|
||||
brand_name='hvacknowitall',
|
||||
data_dir=Path('data_api_test/youtube'),
|
||||
logs_dir=Path('logs_api_test/youtube'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
scraper = YouTubeAPIScraper(config)
|
||||
|
||||
print(f"Fetching all videos from channel...")
|
||||
start = time.time()
|
||||
|
||||
# Fetch all videos (should be ~370)
|
||||
# With transcripts for top 50 by views
|
||||
videos = scraper.fetch_content(fetch_transcripts=True)
|
||||
|
||||
elapsed = time.time() - start
|
||||
print(f"\n✅ Fetched {len(videos)} videos in {elapsed:.1f} seconds")
|
||||
|
||||
# Show statistics
|
||||
total_views = sum(v.get('view_count', 0) for v in videos)
|
||||
total_likes = sum(v.get('like_count', 0) for v in videos)
|
||||
with_transcripts = sum(1 for v in videos if v.get('transcript'))
|
||||
|
||||
print(f"\nStatistics:")
|
||||
print(f" Total videos: {len(videos)}")
|
||||
print(f" Total views: {total_views:,}")
|
||||
print(f" Total likes: {total_likes:,}")
|
||||
print(f" Videos with transcripts: {with_transcripts}")
|
||||
print(f" Quota used: {scraper.quota_used}/{scraper.daily_quota_limit} units")
|
||||
|
||||
# Show top 5 videos by views
|
||||
print(f"\nTop 5 videos by views:")
|
||||
top_videos = sorted(videos, key=lambda x: x.get('view_count', 0), reverse=True)[:5]
|
||||
for i, video in enumerate(top_videos, 1):
|
||||
views = video.get('view_count', 0)
|
||||
title = video.get('title', 'Unknown')[:60]
|
||||
has_transcript = '✓' if video.get('transcript') else '✗'
|
||||
print(f" {i}. {views:,} views | {title}... | Transcript: {has_transcript}")
|
||||
|
||||
# Save markdown
|
||||
markdown = scraper.format_markdown(videos)
|
||||
output_file = Path('data_api_test/youtube/youtube_api_full.md')
|
||||
output_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
output_file.write_text(markdown, encoding='utf-8')
|
||||
print(f"\nMarkdown saved to: {output_file}")
|
||||
|
||||
return videos
|
||||
|
||||
|
||||
def test_mailchimp_api_full():
|
||||
"""Test MailChimp API scraper with full campaign fetch"""
|
||||
print("\n" + "=" * 60)
|
||||
print("TESTING MAILCHIMP API SCRAPER - ALL CAMPAIGNS")
|
||||
print("=" * 60)
|
||||
|
||||
config = ScraperConfig(
|
||||
source_name='mailchimp_api',
|
||||
brand_name='hvacknowitall',
|
||||
data_dir=Path('data_api_test/mailchimp'),
|
||||
logs_dir=Path('logs_api_test/mailchimp'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
scraper = MailChimpAPIScraper(config)
|
||||
|
||||
print(f"Fetching all campaigns from 'Bi-Weekly Newsletter' folder...")
|
||||
start = time.time()
|
||||
|
||||
# Fetch all campaigns (up to 100)
|
||||
campaigns = scraper.fetch_content(max_items=100)
|
||||
|
||||
elapsed = time.time() - start
|
||||
print(f"\n✅ Fetched {len(campaigns)} campaigns in {elapsed:.1f} seconds")
|
||||
|
||||
if campaigns:
|
||||
# Show statistics
|
||||
total_sent = sum(c.get('metrics', {}).get('emails_sent', 0) for c in campaigns)
|
||||
total_opens = sum(c.get('metrics', {}).get('unique_opens', 0) for c in campaigns)
|
||||
total_clicks = sum(c.get('metrics', {}).get('unique_clicks', 0) for c in campaigns)
|
||||
|
||||
print(f"\nStatistics:")
|
||||
print(f" Total campaigns: {len(campaigns)}")
|
||||
print(f" Total emails sent: {total_sent:,}")
|
||||
print(f" Total unique opens: {total_opens:,}")
|
||||
print(f" Total unique clicks: {total_clicks:,}")
|
||||
|
||||
# Calculate average rates
|
||||
if campaigns:
|
||||
avg_open_rate = sum(c.get('metrics', {}).get('open_rate', 0) for c in campaigns) / len(campaigns)
|
||||
avg_click_rate = sum(c.get('metrics', {}).get('click_rate', 0) for c in campaigns) / len(campaigns)
|
||||
print(f" Average open rate: {avg_open_rate*100:.1f}%")
|
||||
print(f" Average click rate: {avg_click_rate*100:.1f}%")
|
||||
|
||||
# Show recent campaigns
|
||||
print(f"\n5 Most Recent Campaigns:")
|
||||
for i, campaign in enumerate(campaigns[:5], 1):
|
||||
title = campaign.get('title', 'Unknown')[:50]
|
||||
send_time = campaign.get('send_time', 'Unknown')[:10]
|
||||
metrics = campaign.get('metrics', {})
|
||||
opens = metrics.get('unique_opens', 0)
|
||||
open_rate = metrics.get('open_rate', 0) * 100
|
||||
print(f" {i}. {send_time} | {title}... | Opens: {opens} ({open_rate:.1f}%)")
|
||||
|
||||
# Save markdown
|
||||
markdown = scraper.format_markdown(campaigns)
|
||||
output_file = Path('data_api_test/mailchimp/mailchimp_api_full.md')
|
||||
output_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
output_file.write_text(markdown, encoding='utf-8')
|
||||
print(f"\nMarkdown saved to: {output_file}")
|
||||
else:
|
||||
print("\n⚠️ No campaigns found!")
|
||||
|
||||
return campaigns
|
||||
|
||||
|
||||
def main():
|
||||
"""Run full API scraper tests"""
|
||||
print("HVAC Know It All - API Scraper Full Test")
|
||||
print("This will fetch all content using the new API scrapers")
|
||||
print("-" * 60)
|
||||
|
||||
# Test YouTube API
|
||||
youtube_videos = test_youtube_api_full()
|
||||
|
||||
# Test MailChimp API
|
||||
mailchimp_campaigns = test_mailchimp_api_full()
|
||||
|
||||
# Summary
|
||||
print("\n" + "=" * 60)
|
||||
print("SUMMARY")
|
||||
print("=" * 60)
|
||||
print(f"✅ YouTube API: {len(youtube_videos)} videos fetched")
|
||||
print(f"✅ MailChimp API: {len(mailchimp_campaigns)} campaigns fetched")
|
||||
print("\nAPI scrapers are working successfully!")
|
||||
print("Ready for production deployment.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@@ -1,67 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Test the CumulativeMarkdownManager fix.
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.cumulative_markdown_manager import CumulativeMarkdownManager
|
||||
from src.base_scraper import ScraperConfig
|
||||
|
||||
def test_cumulative_manager():
|
||||
"""Test that the update_cumulative_file method works."""
|
||||
print("Testing CumulativeMarkdownManager fix...")
|
||||
|
||||
# Create test config
|
||||
config = ScraperConfig(
|
||||
source_name='TestSource',
|
||||
brand_name='hkia',
|
||||
data_dir=Path('test_data'),
|
||||
logs_dir=Path('test_logs'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
# Create manager
|
||||
manager = CumulativeMarkdownManager(config)
|
||||
|
||||
# Test data
|
||||
test_items = [
|
||||
{
|
||||
'id': 'test123',
|
||||
'title': 'Test Post',
|
||||
'type': 'test',
|
||||
'link': 'https://example.com/test123',
|
||||
'author': 'test_user',
|
||||
'publish_date': '2025-08-19',
|
||||
'views': 1000,
|
||||
'likes': 50,
|
||||
'comments': 10,
|
||||
'local_images': ['test_data/media/test_image.jpg'],
|
||||
'description': 'This is a test post'
|
||||
}
|
||||
]
|
||||
|
||||
try:
|
||||
# This should work now
|
||||
output_file = manager.update_cumulative_file(test_items, 'TestSource')
|
||||
print(f"✅ Success! Created file: {output_file}")
|
||||
|
||||
# Check that the file exists and has content
|
||||
if output_file.exists():
|
||||
content = output_file.read_text()
|
||||
print(f"✅ File has {len(content)} characters")
|
||||
print(f"✅ Contains ID section: {'# ID: test123' in content}")
|
||||
return True
|
||||
else:
|
||||
print("❌ File was not created")
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error: {e}")
|
||||
return False
|
||||
|
||||
if __name__ == "__main__":
|
||||
success = test_cumulative_manager()
|
||||
sys.exit(0 if success else 1)
|
||||
|
|
@@ -1,236 +0,0 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Test the cumulative markdown functionality
|
||||
Demonstrates how backlog + incremental updates work together
|
||||
"""
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
|
||||
from src.cumulative_markdown_manager import CumulativeMarkdownManager
|
||||
from src.base_scraper import ScraperConfig
|
||||
import logging
|
||||
from datetime import datetime
|
||||
import pytz
|
||||
|
||||
# Set up logging
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
|
||||
)
|
||||
logger = logging.getLogger('cumulative_test')
|
||||
|
||||
|
||||
def create_mock_items(start_id: int, count: int, prefix: str = ""):
|
||||
"""Create mock content items for testing."""
|
||||
items = []
|
||||
for i in range(count):
|
||||
item_id = f"video_{start_id + i}"
|
||||
items.append({
|
||||
'id': item_id,
|
||||
'title': f"{prefix}Video Title {start_id + i}",
|
||||
'views': 1000 * (start_id + i),
|
||||
'likes': 100 * (start_id + i),
|
||||
'description': f"Description for video {start_id + i}",
|
||||
'publish_date': '2024-01-15'
|
||||
})
|
||||
return items
|
||||
|
||||
|
||||
def format_mock_markdown(items):
|
||||
"""Format mock items as markdown."""
|
||||
sections = []
|
||||
for item in items:
|
||||
section = [
|
||||
f"# ID: {item['id']}",
|
||||
"",
|
||||
f"## Title: {item['title']}",
|
||||
"",
|
||||
f"## Views: {item['views']:,}",
|
||||
"",
|
||||
f"## Likes: {item['likes']:,}",
|
||||
"",
|
||||
f"## Description:",
|
||||
item['description'],
|
||||
"",
|
||||
f"## Publish Date: {item['publish_date']}",
|
||||
"",
|
||||
"-" * 50
|
||||
]
|
||||
sections.append('\n'.join(section))
|
||||
|
||||
return '\n\n'.join(sections)
|
||||
|
||||
|
||||
def test_cumulative_workflow():
|
||||
"""Test the complete cumulative workflow."""
|
||||
logger.info("=" * 60)
|
||||
logger.info("TESTING CUMULATIVE MARKDOWN WORKFLOW")
|
||||
logger.info("=" * 60)
|
||||
|
||||
# Setup test config
|
||||
config = ScraperConfig(
|
||||
source_name='TestSource',
|
||||
brand_name='testbrand',
|
||||
data_dir=Path('test_data'),
|
||||
logs_dir=Path('test_logs'),
|
||||
timezone='America/Halifax'
|
||||
)
|
||||
|
||||
# Clean up any existing test files
|
||||
test_pattern = "testbrand_TestSource_*.md"
|
||||
for old_file in Path('test_data/markdown_current').glob(test_pattern):
|
||||
old_file.unlink()
|
||||
logger.info(f"Cleaned up old test file: {old_file.name}")
|
||||
|
||||
# Initialize manager
|
||||
manager = CumulativeMarkdownManager(config, logger)
|
||||
|
||||
# STEP 1: Initial backlog capture
|
||||
logger.info("\n" + "=" * 40)
|
||||
logger.info("STEP 1: BACKLOG CAPTURE (Day 1)")
|
||||
logger.info("=" * 40)
|
||||
|
||||
backlog_items = create_mock_items(1, 5, "Backlog ")
|
||||
logger.info(f"Created {len(backlog_items)} backlog items")
|
||||
|
||||
file1 = manager.save_cumulative(backlog_items, format_mock_markdown)
|
||||
logger.info(f"Saved backlog to: {file1.name}")
|
||||
|
||||
stats = manager.get_statistics(file1)
|
||||
logger.info(f"Stats after backlog: {stats}")
|
||||
|
||||
# STEP 2: First incremental update (new items)
|
||||
logger.info("\n" + "=" * 40)
|
||||
logger.info("STEP 2: INCREMENTAL UPDATE - New Items (Day 2)")
|
||||
logger.info("=" * 40)
|
||||
|
||||
new_items = create_mock_items(6, 2, "New ")
|
||||
logger.info(f"Created {len(new_items)} new items")
|
||||
|
||||
file2 = manager.save_cumulative(new_items, format_mock_markdown)
|
||||
logger.info(f"Saved incremental to: {file2.name}")
|
||||
|
||||
stats = manager.get_statistics(file2)
|
||||
logger.info(f"Stats after first incremental: {stats}")
|
||||
|
||||
# Verify content
|
||||
content = file2.read_text(encoding='utf-8')
|
||||
id_count = content.count('# ID:')
|
||||
logger.info(f"Total sections in file: {id_count}")
|
||||
|
||||
# STEP 3: Second incremental with updates
|
||||
logger.info("\n" + "=" * 40)
|
||||
logger.info("STEP 3: INCREMENTAL UPDATE - With Updates (Day 3)")
|
||||
logger.info("=" * 40)
|
||||
|
||||
# Create items with updates (higher view counts) and new items
|
||||
updated_items = [
|
||||
{
|
||||
'id': 'video_1', # Update existing
|
||||
'title': 'Backlog Video Title 1',
|
||||
'views': 5000, # Increased from 1000
|
||||
'likes': 500, # Increased from 100
|
||||
'description': 'Updated description with more details and captions',
|
||||
'publish_date': '2024-01-15',
|
||||
'caption': 'This video now has captions!' # New field
|
||||
},
|
||||
{
|
||||
'id': 'video_8', # New item
|
||||
'title': 'Brand New Video 8',
|
||||
'views': 8000,
|
||||
'likes': 800,
|
||||
'description': 'Newest video just published',
|
||||
'publish_date': '2024-01-18'
|
||||
}
|
||||
]
|
||||
|
||||
# Format with caption support
|
||||
def format_with_captions(items):
|
||||
sections = []
|
||||
for item in items:
|
||||
section = [
|
||||
f"# ID: {item['id']}",
|
||||
"",
|
||||
f"## Title: {item['title']}",
|
||||
"",
|
||||
f"## Views: {item['views']:,}",
|
||||
"",
|
||||
f"## Likes: {item['likes']:,}",
|
||||
"",
|
||||
f"## Description:",
|
||||
item['description'],
|
||||
""
|
||||
]
|
||||
|
||||
if 'caption' in item:
|
||||
section.extend([
|
||||
"## Caption Status:",
|
||||
item['caption'],
|
||||
""
|
||||
])
|
||||
|
||||
section.extend([
|
||||
f"## Publish Date: {item['publish_date']}",
|
||||
"",
|
||||
"-" * 50
|
||||
])
|
||||
|
||||
sections.append('\n'.join(section))
|
||||
|
||||
return '\n\n'.join(sections)
|
||||
|
||||
logger.info(f"Created 1 update + 1 new item")
|
||||
|
||||
file3 = manager.save_cumulative(updated_items, format_with_captions)
|
||||
logger.info(f"Saved second incremental to: {file3.name}")
|
||||
|
||||
stats = manager.get_statistics(file3)
|
||||
logger.info(f"Stats after second incremental: {stats}")
|
||||
|
||||
# Verify final content
|
||||
final_content = file3.read_text(encoding='utf-8')
|
||||
final_id_count = final_content.count('# ID:')
|
||||
caption_count = final_content.count('## Caption Status:')
|
||||
|
||||
logger.info(f"Final total sections: {final_id_count}")
|
||||
logger.info(f"Sections with captions: {caption_count}")
|
||||
|
||||
# Check if video_1 was updated
|
||||
if 'This video now has captions!' in final_content:
|
||||
logger.info("✅ Successfully updated video_1 with captions")
|
||||
else:
|
||||
logger.error("❌ Failed to update video_1")
|
||||
|
||||
# Check if video_8 was added
|
||||
if 'video_8' in final_content:
|
||||
logger.info("✅ Successfully added new video_8")
|
||||
else:
|
||||
logger.error("❌ Failed to add video_8")
|
||||
|
||||
# List archive files
|
||||
logger.info("\n" + "=" * 40)
|
||||
logger.info("ARCHIVED FILES:")
|
||||
logger.info("=" * 40)
|
||||
|
||||
archive_dir = Path('test_data/markdown_archives/TestSource')
|
||||
if archive_dir.exists():
|
||||
archives = list(archive_dir.glob("*.md"))
|
||||
for archive in sorted(archives):
|
||||
logger.info(f" - {archive.name}")
|
||||
|
||||
logger.info("\n" + "=" * 60)
|
||||
logger.info("TEST COMPLETE!")
|
||||
logger.info("=" * 60)
|
||||
logger.info("Summary:")
|
||||
logger.info(f" - Started with 5 backlog items")
|
||||
logger.info(f" - Added 2 new items in first incremental")
|
||||
logger.info(f" - Updated 1 item + added 1 item in second incremental")
|
||||
logger.info(f" - Final file has {final_id_count} total items")
|
||||
logger.info(f" - {caption_count} items have captions")
|
||||
logger.info(f" - {len(archives) if archive_dir.exists() else 0} versions archived")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
test_cumulative_workflow()
|
||||
|
|
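The removed workflow test checks `test_data/markdown_current/` for the live cumulative file and `test_data/markdown_archives/TestSource/` for prior versions. The archiving code itself is not shown in this diff, so the sketch below is a hedged guess at an archive-then-write step that would produce that layout; the directory names and the `brand_source_timestamp.md` pattern follow the test, while the timestamp format and copy strategy are assumptions.

```python
# Sketch only: inferred from the directories and glob pattern the removed test
# uses; the real manager may differ in naming and timestamp format.
import shutil
from datetime import datetime
from pathlib import Path


def archive_then_write(data_dir: Path, source: str, brand: str, content: str) -> Path:
    """Move the previous cumulative file into the archive, then write the new one."""
    current_dir = data_dir / "markdown_current"
    archive_dir = data_dir / "markdown_archives" / source
    current_dir.mkdir(parents=True, exist_ok=True)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # Archive any existing cumulative file for this source before replacing it.
    for old_file in current_dir.glob(f"{brand}_{source}_*.md"):
        shutil.copy2(old_file, archive_dir / old_file.name)
        old_file.unlink()

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    new_file = current_dir / f"{brand}_{source}_{timestamp}.md"
    new_file.write_text(content, encoding="utf-8")
    return new_file
```

Under these assumptions each run leaves exactly one timestamped file per source in `markdown_current/` and accumulates earlier snapshots under `markdown_archives/<source>/`, which is what the test enumerates at the end.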
@@ -4,14 +4,20 @@
|
|||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.452004-03:00
|
||||
## Publish Date: 2025-08-18T19:40:36.783410-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7099516072725908741
|
||||
|
||||
## Views: 126,400
|
||||
|
||||
## Likes: 3,119
|
||||
|
||||
## Comments: 150
|
||||
|
||||
## Shares: 245
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
Start planning now for 2023!
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
|
|
@@ -21,14 +27,20 @@
|
|||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.452152-03:00
|
||||
## Publish Date: 2025-08-18T19:40:36.783580-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7189380105762786566
|
||||
|
||||
## Views: 93,900
|
||||
|
||||
## Likes: 1,807
|
||||
|
||||
## Comments: 46
|
||||
|
||||
## Shares: 450
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
Finally here... Launch date of the @navac_inc NTB7L. If you're heading down to @ahrexpo you'll get a chance to check it out in action.
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
|
|
@@ -38,557 +50,19 @@
|
|||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.452251-03:00
|
||||
## Publish Date: 2025-08-18T19:40:36.783708-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7124848964452617477
|
||||
|
||||
## Views: 229,800
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
## Likes: 5,960
|
||||
|
||||
--------------------------------------------------
|
||||
## Comments: 50
|
||||
|
||||
# ID: 7540016568957226261
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.452379-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7540016568957226261
|
||||
|
||||
## Views: 6,277
|
||||
## Shares: 274
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7538196385712115000
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.452472-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7538196385712115000
|
||||
|
||||
## Views: 4,521
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7538097200132295941
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.452567-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7538097200132295941
|
||||
|
||||
## Views: 1,291
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7537732064779537720
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.452792-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7537732064779537720
|
||||
|
||||
## Views: 22,400
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7535113073150020920
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.452888-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7535113073150020920
|
||||
|
||||
## Views: 5,374
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7534847716896083256
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.452975-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7534847716896083256
|
||||
|
||||
## Views: 4,596
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7534027218721197318
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.453068-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7534027218721197318
|
||||
|
||||
## Views: 3,873
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7532664694616755512
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.453149-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7532664694616755512
|
||||
|
||||
## Views: 11,200
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7530798356034080056
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.453331-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7530798356034080056
|
||||
|
||||
## Views: 8,652
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7530310420045761797
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.453421-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7530310420045761797
|
||||
|
||||
## Views: 7,847
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7529941807065500984
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.453663-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7529941807065500984
|
||||
|
||||
## Views: 9,518
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7528820889589206328
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.453753-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7528820889589206328
|
||||
|
||||
## Views: 15,800
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7527709142165933317
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.453935-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7527709142165933317
|
||||
|
||||
## Views: 2,562
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7524443251642813701
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.454089-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7524443251642813701
|
||||
|
||||
## Views: 1,996
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7522648911681457464
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.454175-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7522648911681457464
|
||||
|
||||
## Views: 10,700
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7520750214311988485
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.454258-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7520750214311988485
|
||||
|
||||
## Views: 159,400
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7520734215592365368
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.454460-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7520734215592365368
|
||||
|
||||
## Views: 4,481
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7520290054502190342
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.454549-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7520290054502190342
|
||||
|
||||
## Views: 5,201
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7519663363446590726
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.454631-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7519663363446590726
|
||||
|
||||
## Views: 4,249
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7519143575838264581
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.454714-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7519143575838264581
|
||||
|
||||
## Views: 73,400
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7518919306252471608
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.454796-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7518919306252471608
|
||||
|
||||
## Views: 35,600
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7517701341196586245
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.455050-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7517701341196586245
|
||||
|
||||
## Views: 4,236
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7516930528050826502
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.455138-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7516930528050826502
|
||||
|
||||
## Views: 7,868
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7516268018662493496
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.455219-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7516268018662493496
|
||||
|
||||
## Views: 3,705
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7516262642558799109
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.455301-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7516262642558799109
|
||||
|
||||
## Views: 2,740
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7515566208591088902
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.455485-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7515566208591088902
|
||||
|
||||
## Views: 8,736
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7515071260376845624
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.455578-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7515071260376845624
|
||||
|
||||
## Views: 4,929
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7514797712802417928
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.455668-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7514797712802417928
|
||||
|
||||
## Views: 10,500
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7514713297292201224
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.455764-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7514713297292201224
|
||||
|
||||
## Views: 3,056
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7514708767557160200
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.455856-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7514708767557160200
|
||||
|
||||
## Views: 1,806
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7512963405142101266
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.456054-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7512963405142101266
|
||||
|
||||
## Views: 16,100
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 7512609729022070024
|
||||
|
||||
## Type: video
|
||||
|
||||
## Author: @hvacknowitall
|
||||
|
||||
## Publish Date: 2025-08-19T07:27:36.456140-03:00
|
||||
|
||||
## Link: https://www.tiktok.com/@hvacknowitall/video/7512609729022070024
|
||||
|
||||
## Views: 3,176
|
||||
|
||||
## Caption:
|
||||
(No caption available - fetch individual video for details)
|
||||
SkillMill bringing the fire!
|
||||
|
||||
--------------------------------------------------
|
||||
|
|
|
|||
Binary file not shown.
|
|
@@ -1,106 +0,0 @@
|
|||
# ID: Cm1wgRMr_mj
|
||||
|
||||
## Type: reel
|
||||
|
||||
## Link: https://www.instagram.com/p/Cm1wgRMr_mj/
|
||||
|
||||
## Author: hvacknowitall1
|
||||
|
||||
## Publish Date: 2022-12-31T17:04:53
|
||||
|
||||
## Caption:
|
||||
Full video link on my story!
|
||||
|
||||
Schrader cores alone should not be responsible for keeping refrigerant inside a system. Caps with an O-ring and a tab of Nylog have never done me wrong.
|
||||
|
||||
#hvac #hvacr #hvactech #hvaclife #hvacknowledge #hvacrtroubleshooting #refrigerantleak #hvacsystem #refrigerantleakdetection @refrigerationtechnologies @testonorthamerica
|
||||
|
||||
## Likes: 1721
|
||||
|
||||
## Comments: 130
|
||||
|
||||
## Views: 35609
|
||||
|
||||
## Downloaded Images:
|
||||
- [instagram_Cm1wgRMr_mj_video_thumb_500092098_1651754822171979_6746252523565085629_n.jpg](media/Instagram_Test/instagram_Cm1wgRMr_mj_video_thumb_500092098_1651754822171979_6746252523565085629_n.jpg)
|
||||
|
||||
## Hashtags: #hvac #hvacr #hvactech #hvaclife #hvacknowledge #hvacrtroubleshooting #refrigerantleak #hvacsystem #refrigerantleakdetection
|
||||
|
||||
## Mentions: @refrigerationtechnologies @testonorthamerica
|
||||
|
||||
## Media Type: Video (thumbnail downloaded)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: CpgiKyqPoX1
|
||||
|
||||
## Type: reel
|
||||
|
||||
## Link: https://www.instagram.com/p/CpgiKyqPoX1/
|
||||
|
||||
## Author: hvacknowitall1
|
||||
|
||||
## Publish Date: 2023-03-08T00:50:48
|
||||
|
||||
## Caption:
|
||||
Bend a little press a little...
|
||||
|
||||
It's nice to not have to pull out the torches and N2 rig sometimes. Bending where possible also cuts down on fittings.
|
||||
|
||||
First time using @rectorseal
|
||||
Slim duct, nice product!
|
||||
|
||||
Forgot I was wearing my ring!
|
||||
|
||||
#hvac #hvacr #pressgang #hvaclife #heatpump #hvacsystem #heatpumplife #hvacaf #hvacinstall #hvactools @navac_inc @rapidlockingsystem
|
||||
|
||||
## Likes: 2030
|
||||
|
||||
## Comments: 84
|
||||
|
||||
## Views: 34384
|
||||
|
||||
## Downloaded Images:
|
||||
- [instagram_CpgiKyqPoX1_video_thumb_499054454_1230012498832653_5784531596244021913_n.jpg](media/Instagram_Test/instagram_CpgiKyqPoX1_video_thumb_499054454_1230012498832653_5784531596244021913_n.jpg)
|
||||
|
||||
## Hashtags: #hvac #hvacr #pressgang #hvaclife #heatpump #hvacsystem #heatpumplife #hvacaf #hvacinstall #hvactools
|
||||
|
||||
## Mentions: @rectorseal @navac_inc @rapidlockingsystem
|
||||
|
||||
## Media Type: Video (thumbnail downloaded)
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: Cqlsju_vey6
|
||||
|
||||
## Type: reel
|
||||
|
||||
## Link: https://www.instagram.com/p/Cqlsju_vey6/
|
||||
|
||||
## Author: hvacknowitall1
|
||||
|
||||
## Publish Date: 2023-04-03T21:25:49
|
||||
|
||||
## Caption:
|
||||
For the last 8-9 months...
|
||||
|
||||
This tool has been one of my most valuable!
|
||||
|
||||
@navac_inc NEF6LM
|
||||
|
||||
#hvac #hvacr #hvacjourneyman #hvacapprentice #hvactools #refrigeration #copperflare #ductlessairconditioner #heatpump #vrf #hvacaf
|
||||
|
||||
## Likes: 2574
|
||||
|
||||
## Comments: 93
|
||||
|
||||
## Views: 47266
|
||||
|
||||
## Downloaded Images:
|
||||
- [instagram_Cqlsju_vey6_video_thumb_502969627_2823555661180034_9127260342398152415_n.jpg](media/Instagram_Test/instagram_Cqlsju_vey6_video_thumb_502969627_2823555661180034_9127260342398152415_n.jpg)
|
||||
|
||||
## Hashtags: #hvac #hvacr #hvacjourneyman #hvacapprentice #hvactools #refrigeration #copperflare #ductlessairconditioner #heatpump #vrf #hvacaf
|
||||
|
||||
## Media Type: Video (thumbnail downloaded)
|
||||
|
||||
--------------------------------------------------
|
||||
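Each Instagram section above ends with `## Hashtags:` and `## Mentions:` fields that mirror the tags embedded in the caption. The scraper code that derives them is not included in this diff; the snippet below is one plausible way to extract them with regular expressions, and the patterns are assumptions rather than the project's actual implementation.

```python
# Sketch only: the actual Instagram scraper is not shown in this diff; this is
# one way the Hashtags/Mentions fields could be derived from a caption.
import re

HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@[\w.]+")


def extract_tags(caption: str) -> dict:
    """Pull hashtags and @mentions out of a caption, preserving order."""
    return {
        "hashtags": HASHTAG_RE.findall(caption),
        "mentions": MENTION_RE.findall(caption),
    }


if __name__ == "__main__":
    caption = "Bend a little press a little... #hvac #pressgang @navac_inc @rapidlockingsystem"
    tags = extract_tags(caption)
    print("## Hashtags: " + " ".join(tags["hashtags"]))
    print("## Mentions: " + " ".join(tags["mentions"]))
```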
Binary file not shown.
Before Width: | Height: | Size: 70 KiB |
Binary file not shown.
Before Width: | Height: | Size: 107 KiB |
Binary file not shown.
Before Width: | Height: | Size: 70 KiB |
Binary file not shown.
Before Width: | Height: | Size: 3.7 MiB |
Binary file not shown.
Before Width: | Height: | Size: 3.7 MiB |
Binary file not shown.
Before Width: | Height: | Size: 3.6 MiB |
|
@@ -1,244 +0,0 @@
|
|||
# ID: 0161281b-002a-4e9d-b491-3b386404edaa
|
||||
|
||||
## Title: HVAC-as-a-Service Approach for Cannabis Retrofits to Solve Capital Barriers - John Zimmerman Part 2
|
||||
|
||||
## Type: podcast
|
||||
|
||||
## Link: http://sites.libsyn.com/568690/hvac-as-a-service-approach-for-cannabis-retrofits-to-solve-capital-barriers-john-zimmerman-part-2
|
||||
|
||||
## Publish Date: Mon, 18 Aug 2025 09:00:00 +0000
|
||||
|
||||
## Duration: 21:18
|
||||
|
||||
## Thumbnail:
|
||||

|
||||
|
||||
## Description:
|
||||
In this episode of the HVAC Know It All Podcast, host [Gary McCreadie](https://www.linkedin.com/in/gary-mccreadie-38217a77/) continues his conversation with [John Zimmerman](https://www.linkedin.com/in/john-zimmerman-p-e-3161216/), Founder & CEO of [Harvest Integrated](https://www.linkedin.com/company/harvestintegrated/), about HVAC solutions for the cannabis industry. John explains how his company approaches retrofit applications by offering full solutions, including ductwork, electrical services, and equipment installation. He emphasizes the importance of designing scalable, efficient systems without burdening growers with unnecessary upfront costs, providing them with long-term solutions for their HVAC needs.
|
||||
|
||||
The discussion also focuses on the best types of equipment for grow operations. John shares why packaged DX units with variable speed compressors are the ideal choice, offering flexibility as plants grow and the environment changes. He also discusses how 24/7 monitoring and service calls are handled, and how they’re leveraging technology to streamline maintenance. The conversation wraps up by exploring the growing trend of “HVAC as a service” and its impact on businesses, especially those in the cannabis industry that may not have the capital for large upfront investments.
|
||||
|
||||
John also touches on the future of HVAC service models, comparing them to data centers and explaining how the shift from large capital expenditures to manageable monthly expenses can help businesses grow more efficiently. This episode offers valuable insights for anyone in the HVAC field, particularly those working with or interested in the cannabis industry.
|
||||
|
||||
**Expect to Learn:**
|
||||
|
||||
- How Harvest Integrated handles retrofit applications and provides full HVAC solutions.
|
||||
- Why packaged DX units with variable speed compressors are best for grow operations.
|
||||
- How 24/7 monitoring and streamlined service improve system reliability.
|
||||
- The advantages of "HVAC as a service" for growers and businesses.
|
||||
- Why shifting from capital expenses to operating expenses can help businesses scale effectively.
|
||||
|
||||
**Episode Highlights:**
|
||||
|
||||
[00:33] - Introduction Part 2 with John Zimmerman
|
||||
|
||||
[02:48] - Full HVAC Solutions: Design, Ductwork, and Electrical Services
|
||||
|
||||
[04:12] - Subcontracting Work vs. In-House Installers and Service
|
||||
|
||||
[05:48] - Best HVAC Equipment for Grow Rooms: Packaged DX Units vs. Four-Pipe Systems
|
||||
|
||||
[08:50] - Variable Speed Compressors and Scalability for Grow Operations
|
||||
|
||||
[10:33] - Managing Evaporator Coils and Filters in Humid Environments
|
||||
|
||||
[13:08] - Pricing and Business Model: HVAC as a Service for Growers
|
||||
|
||||
[16:05] - Expanding HVAC as a Service Beyond the Cannabis Industry
|
||||
|
||||
[20:18] - The Future of HVAC Service Models
|
||||
|
||||
**This Episode is Kindly Sponsored by:**
|
||||
|
||||
Master: <https://www.master.ca/>
|
||||
|
||||
Cintas: <https://www.cintas.com/>
|
||||
|
||||
Cool Air Products: <https://www.coolairproducts.net/>
|
||||
|
||||
property.com: <https://mccreadie.property.com>
|
||||
|
||||
SupplyHouse: <https://www.supplyhouse.com/tm>
|
||||
Use promo code HKIA5 to get 5% off your first order at Supplyhouse!
|
||||
|
||||
**Follow the Guest John Zimmerman on:**
|
||||
|
||||
LinkedIn: <https://www.linkedin.com/in/john-zimmerman-p-e-3161216/>
|
||||
|
||||
Harvest Integrated: <https://www.linkedin.com/company/harvestintegrated/>
|
||||
|
||||
**Follow the Host:**
|
||||
|
||||
LinkedIn: <https://www.linkedin.com/in/gary-mccreadie-38217a77/>
|
||||
|
||||
Website: <https://www.hvacknowitall.com>
|
||||
|
||||
Facebook: <https://www.facebook.com/people/HVAC-Know-It-All-2/61569643061429/>
|
||||
|
||||
Instagram: <https://www.instagram.com/hvacknowitall1/>
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: 74b0a060-e128-4890-99e6-dabe1032f63d
|
||||
|
||||
## Title: How HVAC Design & Redundancy Protect Cannabis Grow Rooms & Boost Yields with John Zimmerman Part 1
|
||||
|
||||
## Type: podcast
|
||||
|
||||
## Link: http://sites.libsyn.com/568690/how-hvac-design-redundancy-protect-cannabis-grow-rooms-boost-yields-with-john-zimmerman-part-1
|
||||
|
||||
## Publish Date: Thu, 14 Aug 2025 05:00:00 +0000
|
||||
|
||||
## Duration: 20:18
|
||||
|
||||
## Thumbnail:
|
||||

|
||||
|
||||
## Description:
|
||||
In this episode of the HVAC Know It All Podcast, host [Gary McCreadie](https://www.linkedin.com/in/gary-mccreadie-38217a77/) chats with [John Zimmerman](https://www.linkedin.com/in/john-zimmerman-p-e-3161216/), Founder & CEO of [Harvest Integrated](https://www.linkedin.com/company/harvestintegrated/), to kick off a two-part conversation about the unique challenges of HVAC systems in the cannabis industry. John, who has a strong background in data center cooling, brings valuable expertise to the table, now applied to creating optimal environments for indoor grow operations. At Harvest Integrated, John and his team provide “climate as a service,” helping cannabis growers with reliable and efficient HVAC systems, tailored to their specific needs.
|
||||
|
||||
The discussion in part one focuses on the complexities of maintaining the perfect environment for plant growth. John explains how HVAC requirements for grow rooms are similar to those in data centers but with added challenges, like the high humidity produced by the plants. He walks Gary through the different stages of plant growth, including vegetative, flowering, and drying, and how each requires specific adjustments to temperature and humidity control. He also highlights the importance of redundancy in these systems to prevent costly downtime and potential crop loss.
|
||||
|
||||
John shares how Harvest Integrated’s business model offers a comprehensive service to growers, from designing and installing systems to maintaining and repairing them over time. The company’s unique approach ensures that growers have the support they need without the typical issues of system failures and lack of proper service. Tune in for part one of this insightful conversation, and stay tuned for the second part where John talks about the real-world applications and challenges in the cannabis HVAC space.
|
||||
|
||||
**Expect to Learn:**
|
||||
|
||||
- The unique HVAC challenges of cannabis grow rooms and how they differ from other industries.
|
||||
- Why humidity control is key in maintaining a healthy environment for plants.
|
||||
- How each stage of plant growth requires specific temperature and humidity adjustments.
|
||||
- Why redundancy in HVAC systems is critical to prevent costly downtime.
|
||||
- How Harvest Integrated’s "climate as a service" model supports growers with ongoing system management.
|
||||
|
||||
**Episode Highlights:**
|
||||
|
||||
[00:00] - Introduction to John Zimmerman and Harvest Integrated
|
||||
|
||||
[03:35] - HVAC Challenges in Cannabis Grow Rooms
|
||||
|
||||
[04:09] - Comparing Grow Room HVAC to Data Centers
|
||||
|
||||
[05:32] - The Importance of Humidity Control in Growing Plants
|
||||
|
||||
[08:33] - The Role of Redundancy in HVAC Systems
|
||||
|
||||
[11:37] - Different Stages of Plant Growth and HVAC Needs
|
||||
|
||||
[16:57] - How Harvest Integrated’s "Climate as a Service" Model Works
|
||||
|
||||
[19:17] - The Process of Designing and Maintaining Grow Room HVAC Systems
|
||||
|
||||
**This Episode is Kindly Sponsored by:**
|
||||
|
||||
Master: <https://www.master.ca/>
|
||||
|
||||
Cintas: <https://www.cintas.com/>
|
||||
|
||||
SupplyHouse: <https://www.supplyhouse.com/>
|
||||
|
||||
Cool Air Products: <https://www.coolairproducts.net/>
|
||||
|
||||
property.com: <https://mccreadie.property.com>
|
||||
|
||||
**Follow the Guest John Zimmerman on:**
|
||||
|
||||
LinkedIn: <https://www.linkedin.com/in/john-zimmerman-p-e-3161216/>
|
||||
|
||||
Harvest Integrated: <https://www.linkedin.com/company/harvestintegrated/>
|
||||
|
||||
**Follow the Host:**
|
||||
|
||||
LinkedIn: <https://www.linkedin.com/in/gary-mccreadie-38217a77/>
|
||||
|
||||
Website: <https://www.hvacknowitall.com>
|
||||
|
||||
Facebook: <https://www.facebook.com/people/HVAC-Know-It-All-2/61569643061429/>
|
||||
|
||||
Instagram: <https://www.instagram.com/hvacknowitall1/>
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: c3fd8863-be09-404b-af8b-8414da9de923
|
||||
|
||||
## Title: HVAC Rental Trap for Homeowners to Avoid Long-Term Losses and Bad Installs with Scott Pierson Part 2
|
||||
|
||||
## Type: podcast
|
||||
|
||||
## Link: http://sites.libsyn.com/568690/hvac-rental-trap-for-homeowners-to-avoid-long-term-losses-and-bad-installs-with-scott-pierson-part-2
|
||||
|
||||
## Publish Date: Mon, 11 Aug 2025 08:30:00 +0000
|
||||
|
||||
## Duration: 19:00
|
||||
|
||||
## Thumbnail:
|
||||

|
||||
|
||||
## Description:
|
||||
In part 2 of this episode of the HVAC Know It All Podcast, host [Gary McCreadie](https://www.linkedin.com/in/gary-mccreadie-38217a77/), Director of Player Development and Head Coach at [Shelburne Soccer Club](https://shelburnesoccerclub.sportngin.com/), and President of [McCreadie HVAC & Refrigeration Services and HVAC Know It All Inc](https://www.linkedin.com/company/mccreadie-hvac-refrigeration-services/), switches roles again to be interviewed by [Scott Pierson](https://www.linkedin.com/in/scott-pierson-15121a79/), Vice President of HVAC & Market Strategy at [Encompass Supply Chain Solutions](https://www.linkedin.com/company/encompass-supply-chain-solutions-inc-/). They talk about how much today’s customers really know about HVAC, why correct load calculations matter, and the risks of oversizing or undersizing systems. Gary shares tips for new business owners on choosing the right CRM tools, and they discuss helpful tech like remote support apps for younger technicians. The conversation also looks at how private equity ownership can push sales over service quality, and why doing the job right builds both trust and comfort for customers.
|
||||
|
||||
Gary McCreadie joins Scott Pierson to talk about how customer knowledge, technology, and business practices are shaping the HVAC industry today. Gary explains why proper load calculations are key to avoiding problems from oversized or undersized systems. They discuss tools like CRM software and remote support apps that help small businesses and newer techs work smarter. Gary also shares concerns about private equity companies focusing more on sales than service quality. It’s a real conversation on doing quality work, using the right tools, and keeping customers comfortable.
|
||||
|
||||
Gary talks about how some customers know more about HVAC than before, but many still misunderstand system needs. He explains why proper sizing through load calculations is so important to avoid comfort and equipment issues. Gary and Scott discuss useful tools like CRM software and remote support apps that help small companies and younger techs work better. They also look at how private equity ownership can push sales over quality service, and why doing the job right matters. It’s a clear, practical talk on using the right tools, making smart choices, and keeping customers happy.
|
||||
|
||||
**Expect to Learn:**
|
||||
|
||||
- Why proper load calculations are key to avoiding comfort and equipment problems.
|
||||
- How CRM software and remote support apps help small businesses and new techs work smarter.
|
||||
- The risks that come from oversizing or undersizing HVAC systems.
|
||||
- How private equity ownership can shift focus from quality service to sales.
|
||||
- Why doing the job right builds trust, comfort, and long-term customer satisfaction.
|
||||
|
||||
**Episode Highlights:**
|
||||
|
||||
[00:00] - Introduction to Gary McCreadie in Part 02
|
||||
|
||||
[00:37] - Are Customers More HVAC-Savvy Today?
|
||||
|
||||
[03:04] - Why Load Calculations Prevent System Problems
|
||||
|
||||
[03:50] - Risks of Oversizing and Undersizing Equipment
|
||||
|
||||
[05:58] - Choosing the Right CRM Tools for Your Business
|
||||
|
||||
[08:52] - Remote Support Apps Helping Young Technicians
|
||||
|
||||
[10:03] - Private Equity’s Impact on Service vs. Sales
|
||||
|
||||
[15:17] - Correct Sizing for Better Comfort and Efficiency
|
||||
|
||||
[16:24] - Balancing Profit with Quality HVAC Work
|
||||
|
||||
**This Episode is Kindly Sponsored by:**
|
||||
|
||||
Master: <https://www.master.ca/>
|
||||
|
||||
Cintas: <https://www.cintas.com/>
|
||||
|
||||
Supply House: <https://www.supplyhouse.com/>
|
||||
|
||||
Cool Air Products: <https://www.coolairproducts.net/>
|
||||
|
||||
property.com: <https://mccreadie.property.com>
|
||||
|
||||
**Follow Scott Pierson on:**
|
||||
|
||||
LinkedIn: <https://www.linkedin.com/in/scott-pierson-15121a79/>
|
||||
|
||||
Encompass Supply Chain Solutions: <https://www.linkedin.com/company/encompass-supply-chain-solutions-inc-/>
|
||||
|
||||
**Follow Gary McCreadie on:**
|
||||
|
||||
LinkedIn: <https://www.linkedin.com/in/gary-mccreadie-38217a77/>
|
||||
|
||||
McCreadie HVAC & Refrigeration Services: <https://www.linkedin.com/company/mccreadie-hvac-refrigeration-services/>
|
||||
|
||||
HVAC Know It All Inc: <https://www.linkedin.com/company/hvac-know-it-all-inc/>
|
||||
|
||||
Shelburne Soccer Club: <https://shelburnesoccerclub.sportngin.com/>
|
||||
|
||||
Website: <https://www.hvacknowitall.com>
|
||||
|
||||
Facebook: <https://www.facebook.com/people/HVAC-Know-It-All-2/61569643061429/>
|
||||
|
||||
Instagram: <https://www.instagram.com/hvacknowitall1/>
|
||||
|
||||
--------------------------------------------------
|
||||
|
|
@@ -1,104 +0,0 @@
|
|||
# ID: video_1
|
||||
|
||||
## Title: Backlog Video Title 1
|
||||
|
||||
## Views: 1,000
|
||||
|
||||
## Likes: 100
|
||||
|
||||
## Description:
|
||||
Description for video 1
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_2
|
||||
|
||||
## Title: Backlog Video Title 2
|
||||
|
||||
## Views: 2,000
|
||||
|
||||
## Likes: 200
|
||||
|
||||
## Description:
|
||||
Description for video 2
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_3
|
||||
|
||||
## Title: Backlog Video Title 3
|
||||
|
||||
## Views: 3,000
|
||||
|
||||
## Likes: 300
|
||||
|
||||
## Description:
|
||||
Description for video 3
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_4
|
||||
|
||||
## Title: Backlog Video Title 4
|
||||
|
||||
## Views: 4,000
|
||||
|
||||
## Likes: 400
|
||||
|
||||
## Description:
|
||||
Description for video 4
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_5
|
||||
|
||||
## Title: Backlog Video Title 5
|
||||
|
||||
## Views: 5,000
|
||||
|
||||
## Likes: 500
|
||||
|
||||
## Description:
|
||||
Description for video 5
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_6
|
||||
|
||||
## Title: New Video Title 6
|
||||
|
||||
## Views: 6,000
|
||||
|
||||
## Likes: 600
|
||||
|
||||
## Description:
|
||||
Description for video 6
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_7
|
||||
|
||||
## Title: New Video Title 7
|
||||
|
||||
## Views: 7,000
|
||||
|
||||
## Likes: 700
|
||||
|
||||
## Description:
|
||||
Description for video 7
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
|
@@ -1,122 +0,0 @@
|
|||
# ID: video_8
|
||||
|
||||
## Title: Brand New Video 8
|
||||
|
||||
## Views: 8,000
|
||||
|
||||
## Likes: 800
|
||||
|
||||
## Description:
|
||||
Newest video just published
|
||||
|
||||
## Publish Date: 2024-01-18
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_1
|
||||
|
||||
## Title: Backlog Video Title 1
|
||||
|
||||
## Views: 5,000
|
||||
|
||||
## Likes: 500
|
||||
|
||||
## Description:
|
||||
Updated description with more details and captions
|
||||
|
||||
## Caption Status:
|
||||
This video now has captions!
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_2
|
||||
|
||||
## Title: Backlog Video Title 2
|
||||
|
||||
## Views: 2,000
|
||||
|
||||
## Likes: 200
|
||||
|
||||
## Description:
|
||||
Description for video 2
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_3
|
||||
|
||||
## Title: Backlog Video Title 3
|
||||
|
||||
## Views: 3,000
|
||||
|
||||
## Likes: 300
|
||||
|
||||
## Description:
|
||||
Description for video 3
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_4
|
||||
|
||||
## Title: Backlog Video Title 4
|
||||
|
||||
## Views: 4,000
|
||||
|
||||
## Likes: 400
|
||||
|
||||
## Description:
|
||||
Description for video 4
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_5
|
||||
|
||||
## Title: Backlog Video Title 5
|
||||
|
||||
## Views: 5,000
|
||||
|
||||
## Likes: 500
|
||||
|
||||
## Description:
|
||||
Description for video 5
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_6
|
||||
|
||||
## Title: New Video Title 6
|
||||
|
||||
## Views: 6,000
|
||||
|
||||
## Likes: 600
|
||||
|
||||
## Description:
|
||||
Description for video 6
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
|
||||
# ID: video_7
|
||||
|
||||
## Title: New Video Title 7
|
||||
|
||||
## Views: 7,000
|
||||
|
||||
## Likes: 700
|
||||
|
||||
## Description:
|
||||
Description for video 7
|
||||
|
||||
## Publish Date: 2024-01-15
|
||||
|
||||
--------------------------------------------------
|
||||
File diff suppressed because one or more lines are too long
Some files were not shown because too many files have changed in this diff