fix: address eval review - assertion mismatches and factual error

- marketing-psychology eval 4: BJ Fogg assertion did not match expected_output, which lists Goal-Gradient Effect. Fixed.
- sales-enablement eval 2: the "all 6 categories" assertion contradicted expected_output, which only categorizes the 3 given objections. Fixed.
- ad-creative eval 5: TikTok limit corrected from a hard limit to a recommended one (80 chars recommended, 100 max) per SKILL.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 15:51:28 -08:00
commit 926c624d07
parent 7e7e7a09d8
3 changed files with 3 additions and 3 deletions
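The ad-creative change distinguishes a recommended length from a hard maximum for TikTok ad text. A minimal sketch of how a checker might treat the two thresholds is below. The function name and return labels are hypothetical and are not part of the skill or its evals; only the 80 and 100 character figures come from the commit message.

```python
# Hypothetical helper, not part of the repository.
# Thresholds are taken from the commit message (80 recommended, 100 max).
TIKTOK_RECOMMENDED_CHARS = 80
TIKTOK_MAX_CHARS = 100


def classify_tiktok_text(text: str) -> str:
    """Classify ad text length against the recommended and maximum limits."""
    length = len(text)
    if length <= TIKTOK_RECOMMENDED_CHARS:
        return "ok"
    if length <= TIKTOK_MAX_CHARS:
        # Allowed, but longer than recommended.
        return "over_recommended"
    return "over_max"
```

Under the old wording (≤80 chars), an 85-character variation would have counted as a failure; under the corrected wording it is merely longer than recommended.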
--- a/skills/ad-creative/evals/evals.json
+++ b/skills/ad-creative/evals/evals.json
@@ -63,7 +63,7 @@
    {
      "id": 5,
      "prompt": "I need to generate a big batch of ad variations for a multi-platform campaign launching next week. We're a meal delivery service targeting busy professionals. Need ads for Google, Meta, and TikTok.",
-      "expected_output": "Should activate the batch generation workflow. Should generate creative for all three platforms respecting each platform's character limits: Google RSA (30/90), Meta (125/40/30), TikTok (≤80 chars). Should identify 3-5 angles that work across platforms (convenience, health, time savings, variety, cost vs eating out). Should generate variations per angle per platform. Should note platform-specific creative considerations (TikTok needs video concepts, not just text). Should organize output clearly by platform.",
+      "expected_output": "Should activate the batch generation workflow. Should generate creative for all three platforms respecting each platform's character limits: Google RSA (30/90), Meta (125/40/30), TikTok (80 chars recommended, 100 max). Should identify 3-5 angles that work across platforms (convenience, health, time savings, variety, cost vs eating out). Should generate variations per angle per platform. Should note platform-specific creative considerations (TikTok needs video concepts, not just text). Should organize output clearly by platform.",
      "assertions": [
        "Activates batch generation workflow",
        "Generates for all three platforms",
--- a/skills/marketing-psychology/evals/evals.json
+++ b/skills/marketing-psychology/evals/evals.json
@@ -49,7 +49,7 @@
      "prompt": "I'm designing an onboarding flow and want to use behavioral psychology to increase activation. What models should I apply?",
      "expected_output": "Should apply design and behavioral models from the skill's taxonomy: Goal-Gradient Effect (motivation increases near goal), Hick's Law (reduce choices), IKEA Effect (let users build something), Endowment Effect (let them experience ownership), Zeigarnik Effect (incomplete tasks drive completion), Commitment & Consistency (small asks first). Should explain how each applies to onboarding specifically. Should provide actionable recommendations for each model.",
      "assertions": [
-        "Applies BJ Fogg Behavior Model",
+        "Applies Goal-Gradient Effect",
        "Applies Hick's Law",
        "Applies IKEA Effect or Endowment Effect",
        "Applies Zeigarnik Effect or commitment principles",
--- a/skills/sales-enablement/evals/evals.json
+++ b/skills/sales-enablement/evals/evals.json
@@ -26,7 +26,7 @@
        "Provides structured response for each (acknowledge, reframe, evidence, bridge)",
        "Provides 2-3 response variations per objection",
        "Organizes for quick reference during calls",
-        "Addresses all 6 objection categories from the skill"
+        "Categorizes objections using the skill's framework (competitor, budget, need/timing)"
      ],
      "files": []
    },
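The marketing-psychology fix addresses an assertion that named a model absent from expected_output. A rough sketch of a check that could flag this kind of mismatch follows. It assumes each eval is a dict with the "expected_output" and "assertions" keys seen in the diff; the function name and the list of terms to look for are hypothetical, and simple substring matching is an illustrative choice rather than how the repository validates evals.

```python
# Hypothetical consistency check, not part of the repository.
def terms_missing_from_expected(eval_case: dict, terms: list[str]) -> list[str]:
    """Return terms that appear in an assertion but not in expected_output."""
    expected = eval_case["expected_output"].lower()
    assertions = " ".join(eval_case["assertions"]).lower()
    return [
        term
        for term in terms
        if term.lower() in assertions and term.lower() not in expected
    ]


# Abbreviated version of marketing-psychology eval 4 before the fix.
before = {
    "expected_output": "Goal-Gradient Effect (motivation increases near goal), "
                       "Hick's Law (reduce choices)",
    "assertions": ["Applies BJ Fogg Behavior Model", "Applies Hick's Law"],
}
# The same eval with the corrected first assertion.
after = {**before, "assertions": ["Applies Goal-Gradient Effect", "Applies Hick's Law"]}

models = ["BJ Fogg", "Goal-Gradient Effect", "Hick's Law"]
print(terms_missing_from_expected(before, models))  # ['BJ Fogg']
print(terms_missing_from_expected(after, models))   # []
```

Substring matching is crude (it would miss paraphrases), so a check like this is best read as a review aid rather than a gate.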