diff --git a/skills/ad-creative/SKILL.md b/skills/ad-creative/SKILL.md
index 31d37b5..6a7fcbd 100644
--- a/skills/ad-creative/SKILL.md
+++ b/skills/ad-creative/SKILL.md
@@ -122,7 +122,8 @@ For detailed specs and format variations, see [references/platform-specs.md](ref
 For image and video ad creative, use generative AI tools and code-based video rendering. See [references/generative-tools.md](references/generative-tools.md) for the complete guide covering:
 
 - **Image generation** — Nano Banana Pro (Gemini), Flux, Ideogram for static ad images
-- **Video generation** — Veo, Kling, Runway, Sora, Higgsfield for video ads
+- **Video generation** — Veo, Kling, Runway, Sora, Seedance, Higgsfield for video ads
+- **Voice & audio** — ElevenLabs, OpenAI TTS, Cartesia for voiceovers, voice cloning, and multilingual versions
 - **Code-based video** — Remotion for templated, data-driven video at scale
 - **Platform image specs** — Correct dimensions for every ad placement
 - **Cost comparison** — Pricing for 100+ ad variations across tools
diff --git a/skills/ad-creative/references/generative-tools.md b/skills/ad-creative/references/generative-tools.md
index 6cb86c2..4a76942 100644
--- a/skills/ad-creative/references/generative-tools.md
+++ b/skills/ad-creative/references/generative-tools.md
@@ -11,6 +11,10 @@ Reference for using AI image generators, video generators, and code-based video
 | Static ad images (banners, social) | Image generation | Nano Banana Pro, Flux, Ideogram |
 | Ad images with text overlays | Image generation (text-capable) | Ideogram, Nano Banana Pro |
 | Short video ads (6-30 sec) | Video generation | Veo, Kling, Runway, Sora, Seedance |
+| Video ads with voiceover | Video gen + voice | Veo/Sora (native), or Runway + ElevenLabs |
+| Voiceover tracks for ads | Voice generation | ElevenLabs, OpenAI TTS, Cartesia |
+| Multi-language ad versions | Voice generation | ElevenLabs, PlayHT |
+| Brand voice cloning | Voice generation | ElevenLabs, Resemble AI |
 | Product mockups and variations | Image generation + references | Flux (multi-image reference) |
 | Templated video ads at scale | Code-based video | Remotion |
 | Personalized video (name, data) | Code-based video | Remotion |
@@ -276,6 +280,166 @@ Full-stack video creation platform with cinematic camera controls.
 
 ---
 
+## Voice & Audio Generation
+
+For layering realistic voiceovers onto video ads, adding narration to product demos, or generating audio for Remotion-rendered videos. These tools turn ad scripts into natural-sounding voice tracks.
+
+### When to Use Voice Tools
+
+Many video generators (Veo, Kling, Sora, Seedance) now include native audio. Use standalone voice tools when you need:
+
+- **Voiceover on silent video** — Runway Gen-4 and Remotion produce silent output
+- **Brand voice consistency** — Clone a specific voice for all ads
+- **Multi-language versions** — Same ad script in 20+ languages
+- **Script iteration** — Re-record voiceover without reshooting video
+- **Precise control** — Exact timing, emotion, and pacing
+
+---
+
+### ElevenLabs
+
+The market leader in realistic voice generation and voice cloning.
+
+**Best for:** Most natural-sounding voiceovers, brand voice cloning, multilingual ads
+**API:** REST API with streaming support
+**Pricing:** ~$0.12-0.30 per 1,000 characters depending on plan; starts at $5/month
+
+**Capabilities:**
+- 29+ languages with natural accent and intonation
+- Voice cloning from short audio clips (instant) or longer recordings (professional)
+- Emotion and style control
+- Streaming for real-time generation
+- Voice library with hundreds of pre-built voices
+
+**Ad creative use cases:**
+- Generate voiceover tracks for video ads
+- Clone your brand spokesperson's voice for all ad variations
+- Produce the same ad in 10+ languages from one script
+- A/B test different voice styles (authoritative vs. friendly vs. urgent)
+
+**API example:**
+```bash
+curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}" \
+  -H "xi-api-key: $ELEVENLABS_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "text": "Stop wasting hours on manual reporting. Try DataFlow free for 14 days.",
+    "model_id": "eleven_multilingual_v2",
+    "voice_settings": {"stability": 0.5, "similarity_boost": 0.75}
+  }' --output voiceover.mp3
+```
+
+**Docs:** [ElevenLabs API](https://elevenlabs.io/docs/api-reference/text-to-speech)
+
+---
+
+### OpenAI TTS
+
+Simple, affordable text-to-speech built into the OpenAI API.
+
+**Best for:** Quick voiceovers, cost-effective at scale, simple integration
+**API:** OpenAI API (same SDK as GPT/DALL-E)
+**Pricing:** $15/million chars (standard), $30/million chars (HD); ~$0.015/min with gpt-4o-mini-tts
+
+**Capabilities:**
+- 13 built-in voices (no custom cloning)
+- Multiple languages
+- Real-time streaming
+- HD quality option
+- Simple API — same SDK you already use for GPT
+
+**Ad creative use cases:**
+- Fast, cheap voiceover for draft/test ad versions
+- High-volume narration at low cost
+- Prototype ad audio before investing in premium voice
+
+**Docs:** [OpenAI TTS](https://platform.openai.com/docs/guides/text-to-speech)
+
+---
+
+### Cartesia Sonic
+
+Ultra-low latency voice generation built for real-time applications.
+
+**Best for:** Real-time voice, lowest latency, emotional expressiveness
+**API:** REST + WebSocket streaming
+**Pricing:** Pay-as-you-go from $0.0085/sec; starts at $5/month
+
+**Capabilities:**
+- 40ms time-to-first-audio (fastest in class)
+- 15+ languages
+- Nonverbal expressiveness: laughter, breathing, emotional inflections
+- Sonic Turbo for even lower latency
+- Streaming API for real-time generation
+
+**Ad creative use cases:**
+- Real-time ad preview during creative iteration
+- Interactive demo videos with dynamic narration
+- Ads requiring natural laughter, sighs, or emotional reactions
+
+**Docs:** [Cartesia Sonic](https://docs.cartesia.ai/build-with-cartesia/tts-models/latest)
+
+---
+
+### Other Voice Tools
+
+| Tool | Best For | Differentiator | API |
+|------|----------|---------------|-----|
+| **PlayHT** | Large voice library, low latency | 900+ voices, <300ms latency, ultra-realistic | [play.ht](https://play.ht/) |
+| **Resemble AI** | Enterprise voice cloning | On-premise deployment, real-time speech-to-speech | [resemble.ai](https://www.resemble.ai/) |
+| **WellSaid Labs** | Ethical, commercial-safe voices | Voices from compensated actors, safe for commercial use | [wellsaid.io](https://www.wellsaid.io/) |
+| **Fish Audio** | Budget-friendly, emotion control | ~50-70% cheaper than ElevenLabs, emotion tags | [fish.audio](https://fish.audio/) |
+| **Murf AI** | Non-technical teams | Browser-based studio, 200+ voices | [murf.ai](https://murf.ai/) |
+| **Google Cloud TTS** | Google ecosystem, scale | 220+ voices, 40+ languages, enterprise SLAs | [Google TTS](https://cloud.google.com/text-to-speech) |
+| **Amazon Polly** | AWS ecosystem, cost | Neural voices, SSML control, cheap at volume | [Amazon Polly](https://aws.amazon.com/polly/) |
+
+---
+
+### Voice Tool Comparison
+
+| Tool | Quality | Cloning | Languages | Latency | Price |
+|------|---------|---------|-----------|---------|-------|
+| **ElevenLabs** | Best | Yes (instant + pro) | 29+ | ~200ms | $0.12-0.30/1K chars |
+| **OpenAI TTS** | Good | No | 13+ | ~300ms | $0.015-0.030/1K chars |
+| **Cartesia Sonic** | Very good | No | 15+ | ~40ms | from $0.0085/sec |
+| **PlayHT** | Very good | Yes | 140+ | <300ms | ~$0.10-0.20/1K chars |
+| **Fish Audio** | Good | Yes | 13+ | ~200ms | ~$0.05-0.10/1K chars |
+| **WellSaid** | Very good | No (actor voices) | English | ~300ms | Custom pricing |
+
+### Choosing a Voice Tool
+
+```
+Need voiceover for ads?
+├── Need to clone a specific brand voice?
+│   ├── Best quality → ElevenLabs
+│   ├── Enterprise/on-premise → Resemble AI
+│   └── Budget-friendly → Fish Audio, PlayHT
+├── Need multilingual (same ad, many languages)?
+│   ├── Most languages → PlayHT (140+)
+│   └── Best quality → ElevenLabs (29+)
+├── Need cheap, fast, good-enough?
+│   └── OpenAI TTS ($0.015/min)
+├── Need commercially-safe licensing?
+│   └── WellSaid Labs (actor-compensated voices)
+└── Need real-time/interactive?
+    └── Cartesia Sonic (40ms TTFA)
+```
+
+### Workflow: Voice + Video
+
+```
+1. Write ad script (use ad-creative skill for copy)
+2. Generate voiceover with ElevenLabs/OpenAI TTS
+3. Generate or render video:
+   a. Silent video from Runway/Remotion → layer voice track
+   b. Or use Veo/Sora/Seedance with native audio (skip separate VO)
+4. Combine with ffmpeg if layering separately:
+   ffmpeg -i video.mp4 -i voiceover.mp3 -c:v copy -c:a aac output.mp4
+5. Generate variations (different scripts, voices, or languages)
+```
+
+---
+
 ## Code-Based Video: Remotion
 
 For templated, data-driven video ads at scale, Remotion is the best option. Unlike AI video generators that produce unique video from prompts, Remotion uses React code to render deterministic, brand-perfect video from templates and data.
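Because Remotion renders are silent, the ffmpeg step from the Voice + Video workflow applies directly to templated videos. As a rough sketch, the Python below builds one ffmpeg invocation per language variant, using only the flags shown in that workflow. The helper names (`build_mux_command`, `build_variant_commands`), the file names, and the `<stem>_<lang>.mp4` output scheme are all hypothetical, chosen for illustration. The code only constructs argument lists; it does not call ffmpeg itself.

```python
from pathlib import Path


def build_mux_command(video: str, voiceover: str, output: str) -> list[str]:
    """Return the ffmpeg argument list that layers a voice track onto a video.

    Mirrors the workflow's command: copy the video stream unchanged and
    encode the voiceover as AAC.
    """
    return [
        "ffmpeg",
        "-i", video,
        "-i", voiceover,
        "-c:v", "copy",
        "-c:a", "aac",
        output,
    ]


def build_variant_commands(video: str, voiceovers: dict[str, str]) -> dict[str, list[str]]:
    """Build one mux command per language code (hypothetical naming scheme)."""
    stem = Path(video).stem
    return {
        lang: build_mux_command(video, track, f"{stem}_{lang}.mp4")
        for lang, track in voiceovers.items()
    }


if __name__ == "__main__":
    # Hypothetical inputs: one silent render and two voiceover tracks.
    commands = build_variant_commands(
        "ad.mp4", {"en": "vo_en.mp3", "es": "vo_es.mp3"}
    )
    for lang, cmd in commands.items():
        print(lang, " ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. Keeping command construction separate from execution makes the variant logic easy to check without rendering anything.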