fix: align eval assertions with SKILL.md content per Codex review

Fixes 5 issues identified by independent Codex review:
- product-marketing-context: match auto-draft workflow, section flexibility
- marketing-psychology: replace phantom models with actual SKILL.md models
- ad-creative: correct RSA pinning guidance to match skill
- free-tool-strategy: boundary test now defers to related skill (page-cro)
- paywall-upgrade-cro: boundary test references only related skills

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 14:07:38 -08:00
commit 7e7e7a09d8
parent 11e9ea811f
5 changed files with 20 additions and 19 deletions
--- a/skills/ad-creative/evals/evals.json
+++ b/skills/ad-creative/evals/evals.json
@@ -20,13 +20,13 @@
    {
      "id": 2,
      "prompt": "I need Google Ads copy for our CRM product. We're targeting the keyword 'best CRM for small business'. Need responsive search ads.",
-      "expected_output": "Should generate Google RSA creative respecting character limits: headlines (≤30 chars each, need 10-15 variations) and descriptions (≤90 chars each, need 4+ variations). Should pin key headlines to positions 1-2 for message consistency. Should include the target keyword in headlines. Should provide multiple angle-based variations. Should suggest ad extensions (sitelinks, callouts, structured snippets). Should follow Google Ads best practices for RSA.",
+      "expected_output": "Should generate Google RSA creative respecting character limits: headlines (≤30 chars each, need 10-15 variations) and descriptions (≤90 chars each, need 4+ variations). Should note that pinning should be used sparingly as it reduces optimization. Should include the target keyword in headlines. Should provide multiple angle-based variations. Should suggest ad extensions (sitelinks, callouts, structured snippets). Should follow Google Ads best practices for RSA.",
      "assertions": [
        "Respects Google RSA character limits (30 char headlines, 90 char descriptions)",
        "Generates 10-15 headline variations",
        "Generates 4+ description variations",
        "Includes target keyword in headlines",
-        "Suggests pinning strategy",
+        "Notes pinning should be used sparingly per skill guidance",
        "Suggests ad extensions",
        "Uses angle-based variation approach"
      ],
--- a/skills/free-tool-strategy/evals/evals.json
+++ b/skills/free-tool-strategy/evals/evals.json
@@ -76,13 +76,13 @@
    },
    {
      "id": 6,
-      "prompt": "Can you help me build the actual code for a website grader tool? I want it in Next.js with a form and results page.",
+      "prompt": "How do I optimize the landing page for our free tool to get more signups? The tool itself is great but nobody finds it.",
-      "expected_output": "Should recognize this is a development/coding task, not a free tool strategy task. This skill covers strategy, evaluation, and planning of free tools — not the actual development. Should suggest that the user work on the implementation as a separate coding task. May provide strategic guidance on what the tool should include and how to structure it from a marketing perspective.",
+      "expected_output": "Should recognize this is a landing page conversion optimization task, not a free tool strategy task. Should defer to or cross-reference the page-cro skill for optimizing the tool's landing page conversion rate. May provide free-tool-specific context (gating strategy, value demonstration) but should make clear that page-cro is the right skill for page conversion optimization.",
      "assertions": [
-        "Recognizes this as a development task, not strategy",
+        "Recognizes this as page CRO, not free tool strategy",
-        "Does not attempt to write full application code",
+        "References or defers to page-cro skill",
-        "May provide strategic guidance on tool requirements",
+        "May provide free-tool-specific context",
-        "Suggests implementation as a separate task"
+        "Does not attempt full page CRO using free tool strategy patterns"
      ],
      "files": []
    }
--- a/skills/marketing-psychology/evals/evals.json
+++ b/skills/marketing-psychology/evals/evals.json
@@ -33,7 +33,7 @@
    {
      "id": 3,
      "prompt": "what psychological principles should I use to write better marketing copy?",
-      "expected_output": "Should trigger on casual phrasing. Should recommend copy-relevant mental models: social proof, reciprocity, loss aversion, anchoring, specificity bias, the power of 'because' (reason-giving), storytelling/narrative transportation, IKEA effect, endowment effect. For each principle, should explain what it is and provide a specific copywriting application. Should reference the quick reference table by challenge. Should organize by where in the copy each principle applies (headlines, body, CTAs, testimonials).",
+      "expected_output": "Should trigger on casual phrasing. Should recommend copy-relevant mental models from the skill's taxonomy: social proof, reciprocity, loss aversion, anchoring, scarcity, IKEA Effect, Endowment Effect, Commitment & Consistency. For each principle, should explain what it is and provide a specific copywriting application. Should reference the quick reference table by challenge. Should organize by where in the copy each principle applies (headlines, body, CTAs, testimonials).",
      "assertions": [
        "Triggers on casual phrasing",
        "Recommends copy-relevant mental models",
@@ -47,7 +47,7 @@
    {
      "id": 4,
      "prompt": "I'm designing an onboarding flow and want to use behavioral psychology to increase activation. What models should I apply?",
-      "expected_output": "Should apply design and behavioral models: BJ Fogg Behavior Model (Motivation × Ability × Prompt), Hick's Law (reduce choices), IKEA Effect (let users build something), Endowment Effect (let them experience ownership), Variable Rewards, Zeigarnik Effect (incomplete tasks drive completion), Commitment and Consistency (small asks first). Should explain how each applies to onboarding specifically. Should provide actionable recommendations for each model.",
+      "expected_output": "Should apply design and behavioral models from the skill's taxonomy: Goal-Gradient Effect (motivation increases near goal), Hick's Law (reduce choices), IKEA Effect (let users build something), Endowment Effect (let them experience ownership), Zeigarnik Effect (incomplete tasks drive completion), Commitment & Consistency (small asks first). Should explain how each applies to onboarding specifically. Should provide actionable recommendations for each model.",
      "assertions": [
        "Applies BJ Fogg Behavior Model",
        "Applies Hick's Law",
--- a/skills/paywall-upgrade-cro/evals/evals.json
+++ b/skills/paywall-upgrade-cro/evals/evals.json
@@ -80,10 +80,10 @@
    {
      "id": 6,
      "prompt": "Can you help me optimize our public pricing page? We want more visitors to choose the Pro plan over the Basic plan.",
-      "expected_output": "Should recognize this is a public pricing page optimization task, not an in-app paywall task. Should defer to or cross-reference the page-cro skill (for pricing page CRO) or pricing-strategy skill (for plan structure). Paywall-upgrade-cro specifically handles in-app upgrade prompts for existing users, not public-facing pricing pages.",
+      "expected_output": "Should recognize this is a public pricing page optimization task, not an in-app paywall task. Should defer to or cross-reference the page-cro skill for pricing page CRO. Paywall-upgrade-cro specifically handles in-app upgrade prompts for existing users, not public-facing pricing pages.",
      "assertions": [
        "Recognizes this as public pricing page optimization",
-        "References or defers to page-cro or pricing-strategy skill",
+        "References or defers to page-cro skill",
        "Explains that paywall-upgrade-cro is for in-app upgrade prompts",
        "Does not attempt public pricing page optimization"
      ],
--- a/skills/product-marketing-context/evals/evals.json
+++ b/skills/product-marketing-context/evals/evals.json
@@ -4,12 +4,12 @@
    {
      "id": 1,
      "prompt": "I want to set up my product marketing context. We're a B2B SaaS company that sells a customer feedback platform to product teams.",
-      "expected_output": "Should check if .agents/product-marketing-context.md already exists. If not, should start the step-by-step document creation process covering all 12 sections: Product Overview, Target Audience, Personas, Problems You Solve, Competitive Landscape, Differentiation, Objections, Switching Dynamics, Customer Language, Brand Voice, Proof Points, and Goals. Should guide the user through each section, asking questions to fill in the details. Should create the file at .agents/product-marketing-context.md when complete.",
+      "expected_output": "Should check if .agents/product-marketing-context.md already exists. If not, should offer two options: (1) Auto-draft from codebase (recommended) or (2) Start from scratch. If user chooses start from scratch, should walk through sections conversationally one at a time. Should cover all applicable sections: Product Overview, Target Audience, Personas, Problems You Solve, Competitive Landscape, Differentiation, Objections, Switching Dynamics, Customer Language, Brand Voice, Proof Points, and Goals. Should create the file at .agents/product-marketing-context.md when complete.",
      "assertions": [
        "Checks for existing product-marketing-context.md",
-        "Initiates step-by-step creation process",
+        "Offers two options: auto-draft or start from scratch",
-        "Covers all 12 sections",
+        "Covers applicable sections",
-        "Asks questions to fill in each section",
+        "Walks through sections conversationally one at a time",
        "Creates file at .agents/product-marketing-context.md"
      ],
      "files": []
@@ -17,7 +17,7 @@
    {
      "id": 2,
      "prompt": "Update our product marketing context. We just added a new enterprise tier and our target audience has expanded to include VP of Engineering, not just Product Managers.",
-      "expected_output": "Should check for existing .agents/product-marketing-context.md and read it. Should identify which sections need updating based on the changes: Target Audience (add VP of Engineering), Personas (add new persona), Product Overview (new enterprise tier), possibly Pricing/Packaging, Objections (enterprise-specific), and Competitive Landscape (enterprise competitors). Should update only the relevant sections, preserving existing content that hasn't changed.",
+      "expected_output": "Should check for existing .agents/product-marketing-context.md and read it. Should identify which sections need updating based on the changes: Target Audience (add VP of Engineering), Personas (add new persona), Product Overview (new enterprise tier, including pricing updates within that section), Objections (enterprise-specific), and Competitive Landscape (enterprise competitors). Should update only the relevant sections, preserving existing content that hasn't changed.",
      "assertions": [
        "Reads existing product-marketing-context.md",
        "Identifies sections that need updating",
@@ -31,13 +31,14 @@
    {
      "id": 3,
      "prompt": "create a product context doc for my app. it's a mobile app that helps people find hiking trails. we're just getting started.",
-      "expected_output": "Should trigger on casual phrasing. Should check for existing context doc. Should start the step-by-step creation process, adapting questions for an early-stage mobile app (B2C, outdoor/fitness niche). Should note that some sections may be sparse for an early-stage product and that's okay — they can be filled in as the business matures. Should still cover all 12 sections but accept lighter answers for sections like Proof Points or Competitive Landscape if the company is new.",
+      "expected_output": "Should trigger on casual phrasing. Should check for existing context doc. Should offer auto-draft or start-from-scratch options. Should adapt questions for an early-stage B2C mobile app (outdoor/fitness niche). Should note that some sections may be sparse for an early-stage product and that's okay — they can be filled in as the business matures. Should skip non-applicable sections (e.g., Personas section is B2B-focused) rather than forcing all 12. Should accept lighter answers for sections like Proof Points or Competitive Landscape if the company is new.",
      "assertions": [
        "Triggers on casual phrasing",
        "Checks for existing context doc",
+        "Offers auto-draft or start-from-scratch options",
        "Adapts questions for early-stage B2C mobile app",
        "Notes some sections may be sparse early on",
-        "Still covers all 12 sections",
+        "Skips non-applicable sections rather than forcing all 12",
        "Creates file at .agents/product-marketing-context.md"
      ],
      "files": []