- Forked from coreyhaines31/marketingskills v1.1.0 (MIT license)
- Removed 4 SaaS-only skills (churn-prevention, paywall-upgrade-cro, onboarding-cro, signup-flow-cro)
- Reworked 2 skills (popup-cro → hvac-estimate-popups, revops → hvac-lead-ops)
- Adapted all 28 retained skills with HVAC industry context and Compendium integration
- Created 10 new HVAC-specific skills:
  - hvac-content-from-data (flagship DB integration)
  - hvac-seasonal-campaign (demand cycle marketing)
  - hvac-review-management (GBP review strategy)
  - hvac-video-repurpose (long-form → social)
  - hvac-technical-content (audience-calibrated writing)
  - hvac-brand-voice (trade authenticity guide)
  - hvac-contractor-website-audit (discovery & analysis)
  - hvac-contractor-website-package (marketing package assembly)
  - hvac-compliance-claims (EPA/rebate/safety claim checking)
  - hvac-content-qc (fact-check & citation gate)
- Renamed product-marketing-context → hvac-marketing-context (global)
- Created COMPENDIUM_INTEGRATION.md (shared integration contract)
- Added Compendium wrapper tools (search, scrape, classify)
- Added compendium capability tags to YAML frontmatter
- Updated README, AGENTS.md, CLAUDE.md, VERSIONS.md, marketplace.json
- All 38 skills pass validate-skills.sh
- Zero dangling references to removed/renamed skills

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---
name: ab-test-setup
description: When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," or "how long should I run this test." Use this whenever someone is comparing two approaches and wants to measure which performs better. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.
metadata:
  version: 2.0.0
  compendium:
    mode: standalone
    tools: []
---

# A/B Test Setup

You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.

## Initial Assessment

**Check for product marketing context first:**
If `.agents/hvac-marketing-context.md` exists (or `.claude/hvac-marketing-context.md` in older setups), read it before asking questions. Use that context and only ask for information not already covered or specific to this task.

Before designing a test, understand:

1. **Test Context** - What are you trying to improve? What change are you considering?
2. **Current State** - Baseline conversion rate? Current traffic volume?
3. **Constraints** - Technical complexity? Timeline? Tools available?

---

## Core Principles

### 1. Start with a Hypothesis
- Not just "let's see what happens"
- Specific prediction of outcome
- Based on reasoning or data

### 2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked

### 3. Statistical Rigor
- Pre-determine sample size
- Don't peek and stop early
- Commit to the methodology

### 4. Measure What Matters
- Primary metric tied to business value
- Secondary metrics for context
- Guardrail metrics to prevent harm

---

## Hypothesis Framework

### Structure

```
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
```

### Example

**Weak**: "Changing the button color might increase clicks."

**Strong**: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and more visible will increase quote request submissions by 15%+ for new visitors. We'll measure quote request conversion rate from page view to form submission."

---

## Test Types

| Type | Description | Traffic Needed |
|------|-------------|----------------|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |

---

## Sample Size

### Quick Reference

| Baseline | 10% Lift | 20% Lift | 50% Lift |
|----------|----------|----------|----------|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |

**Calculators:**
- [Evan Miller's](https://www.evanmiller.org/ab-testing/sample-size.html)
- [Optimizely's](https://www.optimizely.com/sample-size-calculator/)

**For HVAC example:** If your current quote request rate is 3% and you want to detect a 20% lift, you need 12,000 visitors per variant, or 24,000 total.
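
The quick-reference numbers above can be approximated with the standard two-proportion sample-size formula. A minimal Python sketch, assuming a two-sided test at 95% confidence and 80% power; the function name is illustrative, and the exact figures will differ somewhat from the table depending on the power and formula a given calculator uses:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 3% baseline quote-request rate, aiming to detect a 20% relative lift
print(sample_size_per_variant(0.03, 0.20))  # on the order of 12k-14k per variant
```

Note how sensitive the requirement is to the lift you want to detect: halving the detectable lift roughly quadruples the traffic you need.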

---

## Metrics Selection

### Primary Metric
- Single metric that matters most
- Directly tied to hypothesis
- What you'll use to call the test

### Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked

### Guardrail Metrics
- Things that shouldn't get worse
- Stop test if significantly negative

### Example: Quote Request CTA Test
- **Primary**: Quote request submission rate
- **Secondary**: Click-through rate on CTA, form abandonment rate
- **Guardrail**: Page bounce rate (shouldn't go up), phone calls (shouldn't decrease)

---

## Designing Variants

### What to Vary

| Category | Examples |
|----------|----------|
| Headlines/Copy | Message angle, value prop, urgency, tone |
| CTA | Button copy, size, placement, color |
| Visual Design | Layout, hierarchy, images |
| Form | Fields required, button placement, form length |
| Timing | When CTA appears (immediately vs. after scroll) |

### Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis

### HVAC-Specific Test Ideas

**Example 1: CTA Copy Test**
- **Control**: "Get Free Quote"
- **Variant**: "Schedule Service Today"
- **Hypothesis**: Specific action language increases urgency and form completion

**Example 2: Urgency Test**
- **Control**: Standard headline
- **Variant**: "Emergency AC Service Available Now"
- **Hypothesis**: Urgency language increases quote requests for emergency services

**Example 3: Form Length Test**
- **Control**: 5-field quote form (Name, Phone, Service type, Issue, Address)
- **Variant**: 3-field form (Name, Phone, Service type)
- **Hypothesis**: Fewer required fields increase form completion rate

---

## Traffic Allocation

| Approach | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
| Ramping | Start small, increase | Technical risk mitigation |

**Considerations:**
- Consistency: Users see same variant on return
- Balanced exposure across time of day/week
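
Consistency is usually achieved with deterministic bucketing: hash the visitor ID together with the experiment name, so the same visitor always lands in the same variant without storing any state. A minimal sketch (the function name and IDs are hypothetical, and this is one common approach rather than any particular tool's implementation):

```python
import hashlib

def assign_variant(user_id, experiment, weights):
    """Deterministically bucket a user: the same id + experiment name
    always maps to the same variant, so returning visitors see the
    same version of the page without any server-side state."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform float in [0, 1]
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if bucket <= cumulative:
            return variant
    return variant  # guard against float rounding on the final bucket

# Conservative 90/10 split for a new CTA variant
print(assign_variant("visitor-1138", "cta-copy-test",
                     {"control": 0.9, "variant": 0.1}))
```

Because the hash mixes in the experiment name, the same visitor can land in different buckets across different experiments, which keeps tests independent.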

---

## Implementation

### Client-Side
- JavaScript modifies page after load
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO

### Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split

---

## Running the Test

### Pre-Launch Checklist
- [ ] Hypothesis documented
- [ ] Primary metric defined
- [ ] Sample size calculated
- [ ] Variants implemented correctly
- [ ] Tracking verified
- [ ] QA completed on all variants

### During the Test

**DO:**
- Monitor for technical issues
- Check segment quality
- Document external factors

**Avoid:**
- Peeking at results and stopping early
- Making changes to variants
- Adding traffic from new sources

### The Peeking Problem
Looking at results before reaching sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.
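
A quick simulation makes the inflation concrete. The sketch below runs A/A tests, where no real difference exists, and compares a single fixed-horizon check against peeking at eight interim checkpoints and stopping at the first "significant" result. All parameters (rate, sample size, checkpoint spacing) are illustrative:

```python
import random
from math import erf, sqrt

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 1 - erf(z / sqrt(2))  # equals 2 * (1 - Phi(z))

random.seed(42)
RATE, N = 0.03, 4000                   # identical 3% rate in both arms
CHECKPOINTS = range(500, N + 1, 500)   # "peek" after every 500 visitors
fixed_hits = peeking_hits = 0
for _ in range(400):                   # 400 simulated A/A tests
    a = [random.random() < RATE for _ in range(N)]
    b = [random.random() < RATE for _ in range(N)]
    if p_value(sum(a), N, sum(b), N) < 0.05:
        fixed_hits += 1                # false positive at the fixed horizon
    if any(p_value(sum(a[:k]), k, sum(b[:k]), k) < 0.05 for k in CHECKPOINTS):
        peeking_hits += 1              # stopped at the first "significant" peek
print(f"fixed-horizon FPR: {fixed_hits/400:.1%}, "
      f"peeking FPR: {peeking_hits/400:.1%}")
```

The fixed-horizon false-positive rate hovers near the nominal 5%, while the peeking rate lands well above it, even though neither variant is actually better.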

---

## Analyzing Results

### Statistical Significance
- 95% confidence = p-value < 0.05
- Means there's less than a 5% chance of seeing a result this extreme if the change had no real effect
- Not a guarantee—just a threshold
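
To make the threshold concrete, significance can be computed directly with a two-proportion z-test, along with a confidence interval on the observed lift. A minimal sketch (the function name is illustrative), run on the counts from the documentation example later in this skill:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test p-value plus a CI for the difference in rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the CI on the observed difference
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    return p_value, (p_b - p_a - margin, p_b - p_a + margin)

# Control 485/15,000 vs variant 570/15,000 quote submissions
p, (lo, hi) = two_proportion_test(485, 15_000, 570, 15_000)
print(f"p = {p:.4f}, 95% CI for the absolute lift: [{lo:.4%}, {hi:.4%}]")
```

If the entire confidence interval sits above zero, the variant is a significant winner at that confidence level; an interval straddling zero means the test is inconclusive.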

### Analysis Checklist

1. **Reached sample size?** If not, the result is preliminary
2. **Statistically significant?** Check confidence intervals
3. **Effect size meaningful?** Compare to MDE, project impact
4. **Secondary metrics consistent?** Do they support the primary?
5. **Guardrail concerns?** Did anything get worse?
6. **Segment differences?** Mobile vs. desktop? New vs. returning?
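
For step 3, projecting business impact is simple arithmetic once you have the rates. A sketch with hypothetical traffic, close-rate, and job-value figures (every number below is an assumption for illustration, not a benchmark):

```python
# All figures are hypothetical, for illustration only
monthly_visitors = 20_000   # assumed page traffic
baseline_rate = 0.032       # control quote-request rate
variant_rate = 0.038        # winning variant's quote-request rate
close_rate = 0.30           # assumed share of quotes that become booked jobs
avg_job_value = 450         # assumed average revenue per booked job ($)

extra_quotes = monthly_visitors * (variant_rate - baseline_rate)
extra_revenue = extra_quotes * close_rate * avg_job_value
print(f"~{extra_quotes:.0f} extra quotes/month, "
      f"~${extra_revenue:,.0f}/month in revenue")
```

If the projected impact is smaller than the cost of implementing and maintaining the change, a statistically significant win may still not be worth shipping.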

### Interpreting Results

| Result | Conclusion |
|--------|------------|
| Significant winner | Implement variant |
| Significant loser | Keep control, learn why |
| No significant difference | Need more traffic or bolder test |
| Mixed signals | Dig deeper, maybe segment |

---

## Documentation

Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance, confidence intervals)
- Decision and learnings
- Next steps

**Example:**
```markdown
## Test: CTA Copy for Quote Requests

**Hypothesis:** "Schedule Service Today" (action-specific) will increase quote form
submissions more than "Get Free Quote" (generic).

**Duration:** Jan 15-29, 2024
**Sample Size:** 15,000 per variant

**Results:**
- Control: 3.2% conversion rate (485/15,000)
- Variant: 3.8% conversion rate (570/15,000)
- Lift: +18.8%
- Confidence: 99.2% (p=0.008, two-sided)

**Decision:** Implement variant. Action-specific CTA performed significantly better.

**Learning:** Specificity drives urgency. Test "Call Now" vs "Schedule Today" next.
```

---

## Common Mistakes

### Test Design
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis

### Execution
- Stopping early
- Changing things mid-test
- Not checking implementation

### Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results

---

## Task-Specific Questions

1. What's your current conversion rate?
2. How much traffic does this page get?
3. What change are you considering and why?
4. What's the smallest improvement worth detecting?
5. What tools do you have for testing?
6. Have you tested this area before?

---

## Related Skills

- **page-cro**: For generating test ideas based on CRO principles
- **analytics-tracking**: For setting up test measurement
- **hvac-copywriting**: For creating variant copy