Experimentation Program Toolkit

Templates, frameworks, and scorecards for building and running a systematic experimentation program: hypothesis documentation, ICE scoring, experiment tracking, and program reviews.


1. Experiment Brief Template

Use this for every experiment before launch. No brief = no test.

Experiment Name: ________ (short, descriptive)

Experiment ID: EXP-___

Owner: ________

Date Submitted: ________

Estimated Run Time: ___ weeks

Hypothesis

We believe that [specific change] will cause [primary metric] to [increase/decrease] by [estimated magnitude] because [reasoning based on data or insight].

Experiment Design

Field | Details
What changes | Describe the exact variation being tested
Control | What the current experience looks like
Variant | What the new experience looks like
Primary metric | The one metric that determines win/loss
Secondary metrics | Supporting metrics to watch
Guardrail metrics | Metrics that must NOT degrade (e.g., page load, satisfaction)
Traffic split | e.g., 50/50, or 90/10 for riskier changes
Minimum sample size | Calculated from the MDE and the baseline conversion rate
Run time | Minimum days needed to reach the minimum sample size, rounded up to full weeks

Success Criteria

  • Win: Primary metric improves by ≥___% at the 95% confidence level (p < 0.05) AND guardrail metrics are flat or improved
  • Inconclusive: No statistically significant change detected within the planned run time
  • Lose: Primary metric shows a statistically significant decline OR guardrail metrics degrade
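The criteria above can be written as a small decision function so every experiment is judged the same way. This is a minimal sketch, not a prescribed implementation: the names `ExperimentResult` and `classify_outcome` are hypothetical, and it assumes a decline must be statistically significant to count as a loss. It also assumes that a significant lift smaller than the target is treated as inconclusive, which the criteria do not spell out.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    relative_lift: float   # observed relative change in the primary metric (0.12 = +12%)
    p_value: float         # two-sided p-value for the primary metric
    guardrails_ok: bool    # True if every guardrail metric is flat or improved


def classify_outcome(result, min_lift, alpha=0.05):
    """Map a result onto Win / Lose / Inconclusive."""
    # Any guardrail degradation is a loss, regardless of the primary metric.
    if not result.guardrails_ok:
        return "Lose"
    significant = result.p_value < alpha
    if significant and result.relative_lift >= min_lift:
        return "Win"
    # Assumption: only a statistically significant decline counts as a loss.
    if significant and result.relative_lift < 0:
        return "Lose"
    # Assumption: everything else, including a significant lift below
    # the target, is treated as inconclusive.
    return "Inconclusive"


print(classify_outcome(ExperimentResult(0.12, 0.01, True), min_lift=0.05))
```

Agree on `min_lift` and `alpha` before launch so the call cannot be adjusted after the results are in.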

Decision

Outcome | Action
Win | Ship to 100%, document learning
Inconclusive | Extend test OR kill and move on (decide before launch)
Lose | Kill, document learning, consider inverse test

2. ICE Prioritization Scorecard

Rank experiment ideas before committing resources.

# | Experiment Idea | Impact (1–10) | Confidence (1–10) | Ease (1–10) | ICE Score | Priority
1 | ___ | __ | __ | __ | __ | __
2 | ___ | __ | __ | __ | __ | __
3 | ___ | __ | __ | __ | __ | __
4 | ___ | __ | __ | __ | __ | __
5 | ___ | __ | __ | __ | __ | __

Scoring guide:

  • Impact: If this experiment wins, how much would it move the needle? (10 = transforms a key metric, 1 = marginal)
  • Confidence: How sure are you it will work, based on data or prior tests? (10 = strong evidence, 1 = pure guess)
  • Ease: How quickly and cheaply can you run this test? (10 = ship this week, 1 = months of dev work)

ICE Score = Impact + Confidence + Ease (or the average of the three; both produce the same ranking, so pick one method and use it consistently)

Rule of thumb: Top 3 ICE scores go into this sprint. The rest go to backlog.
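The scoring and the top-3 rule can be sketched in a few lines. This is an illustrative example only: the function names `ice_score` and `prioritize` and the ideas "Idea A" through "Idea D" with their scores are hypothetical placeholders, not data from a real backlog.

```python
def ice_score(impact, confidence, ease, method="sum"):
    """Return the ICE score as a sum (default) or an average."""
    for value in (impact, confidence, ease):
        if not 1 <= value <= 10:
            raise ValueError("each ICE component must be between 1 and 10")
    total = impact + confidence + ease
    return total if method == "sum" else total / 3


def prioritize(ideas, sprint_slots=3, method="sum"):
    """Rank ideas by ICE score; return (this sprint, backlog)."""
    ranked = sorted(
        ideas,
        key=lambda idea: ice_score(
            idea["impact"], idea["confidence"], idea["ease"], method
        ),
        reverse=True,
    )
    return ranked[:sprint_slots], ranked[sprint_slots:]


# Hypothetical ideas, for illustration only.
ideas = [
    {"name": "Idea A", "impact": 8, "confidence": 6, "ease": 9},
    {"name": "Idea B", "impact": 5, "confidence": 5, "ease": 5},
    {"name": "Idea C", "impact": 9, "confidence": 7, "ease": 3},
    {"name": "Idea D", "impact": 3, "confidence": 8, "ease": 10},
]

sprint, backlog = prioritize(ideas)
print("Sprint:", [idea["name"] for idea in sprint])
print("Backlog:", [idea["name"] for idea in backlog])
```

Because the average is just the sum divided by three, switching `method` changes the numbers but never the order.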


3. Experiment Backlog & Pipeline Tracker

Maintain a living pipeline of experiment ideas, in-flight tests, and completed experiments.

Backlog (Ideas)

Idea | Hypothesis (1-line) | ICE Score | Source | Status
___ | ___ | __ | Customer feedback | 🟡 Queued
___ | ___ | __ | Competitor analysis | 🟡 Queued
___ | ___ | __ | Data anomaly | 🟡 Queued

In-Flight

Experiment | Start Date | Est. End | Primary Metric | Current Trend | Status
EXP-001 | __ | __ | ___ | ↑ / → / ↓ | 🟢 Running
EXP-002 | __ | __ | ___ | ↑ / → / ↓ | 🟢 Running

Completed

Experiment | Result | Primary Metric Change | Learning | Shipped?
EXP-001 | Win | +12% conversion | Users prefer shorter forms | ✅ Yes
EXP-002 | Lose | -3% CTR | Urgency messaging backfired | ❌ No
EXP-003 | Inconclusive | +1.2% (not sig) | Need larger sample or bigger change | ❌ No

4. Sample Size Calculator Reference

Use this to estimate how long you need to run each test.

Inputs needed:

  • Baseline conversion rate: Your current metric (e.g., 3.5% trial-to-paid conversion)
  • Minimum Detectable Effect (MDE): The smallest improvement worth detecting (e.g., 10% relative lift = 3.5% → 3.85%)
  • Significance level: Typically α = 0.05, i.e., a 95% confidence level (p < 0.05)
  • Statistical power: Typically 80%
  • Daily traffic/volume: How many users or events hit the test per day

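Quick reference table (two-sided test, 95% confidence, 80% power):

Baseline Rate | MDE (relative) | Sample Size per Variant | Days at 1,000 Users/Day per Variant
2% | 10% | ~78,000 | ~78 days
2% | 20% | ~20,000 | ~20 days
5% | 10% | ~30,000 | ~30 days
5% | 20% | ~7,700 | ~8 days
10% | 10% | ~14,500 | ~15 days
10% | 20% | ~3,700 | ~4 days

The Days column assumes each variant receives 1,000 users per day. If 1,000 is your total daily traffic on a 50/50 split, double the run time.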
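For combinations not in the table, the sample size can be estimated with the standard normal-approximation formula for comparing two proportions. The sketch below is one way to do that with only the Python standard library. The function names `sample_size_per_variant` and `run_days` are illustrative. Calculators differ slightly in how they approximate the variance, so expect results within a few percent of the table rather than identical to it.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def run_days(n_per_variant, daily_users_per_variant, full_weeks=True):
    """Days needed to collect the sample, optionally rounded up to whole weeks."""
    days = math.ceil(n_per_variant / daily_users_per_variant)
    if full_weeks:
        days = math.ceil(days / 7) * 7
    return days


n = sample_size_per_variant(baseline=0.05, relative_mde=0.20)
print(n, "users per variant")
print(run_days(n, daily_users_per_variant=1000), "days, in full weeks")
```

Rounding up to whole weeks reflects the brief's instruction to include full week cycles, so weekday and weekend behavior are both represented.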

Key rule: Never stop a test early because it "looks like it’s winning." Peeking inflates false positive rates. Set your run time upfront and stick to it.
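A small simulation can make the peeking problem concrete. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It compares checking once at the end with checking at every interim look and stopping at the first significant result. All parameters (400 simulated tests, 1,000 users per arm, 10 looks, a 5% rate, the seed) are arbitrary assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic; returns 0.0 if it is undefined."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_b / n_b - conv_a / n_a) / se


def simulate_aa_tests(n_tests=400, users_per_arm=1000, looks=10,
                      rate=0.05, z_crit=1.96, seed=7):
    """Return (false positive rate with one final look, rate with peeking)."""
    rng = random.Random(seed)
    step = users_per_arm // looks
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        ever_significant = False
        significant = False
        for look in range(1, looks + 1):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = look * step
            significant = abs(two_proportion_z(conv_a, n, conv_b, n)) > z_crit
            if significant:
                ever_significant = True  # a peeker would have stopped here
        if significant:                  # result at the planned end of the test
            fixed_hits += 1
        if ever_significant:
            peeking_hits += 1
    return fixed_hits / n_tests, peeking_hits / n_tests


fixed_rate, peeking_rate = simulate_aa_tests()
print(f"One look at the end: {fixed_rate:.1%} false positives")
print(f"Stop at first significant look: {peeking_rate:.1%} false positives")
```

The single-look rate should land near the nominal 5%, while the peeking rate comes out noticeably higher, because each extra look is another chance for noise to cross the threshold.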


5. Experiment Retrospective Template

Complete this after every experiment to capture institutional knowledge.

Experiment: EXP-___ — ________

Result: Win / Lose / Inconclusive

Run dates: ___ to ___

What happened?

Describe the result in plain language. Include the metric change and confidence level.

Why did it happen?

Your best interpretation of why the variant won, lost, or was flat. Reference qualitative data if available (session recordings, surveys, support tickets).

What did we learn?

The transferable insight. Frame it as a principle, not just a finding.

Example: "Users abandon long forms not because of time, but because they don’t understand why each field is needed. Progressive disclosure with context copy outperforms simply reducing fields."

What’s next?

  • If win: When does it ship? Any follow-up experiments?
  • If lose: What’s the counter-hypothesis? Should we test the opposite?
  • If inconclusive: Expand sample? Test a bigger change? Move on?

Knowledge base update

  • Added to Experiment Log
  • Learning shared in team Slack/standup
  • Updated playbook if this changes a best practice

6. Experimentation Program Maturity Scorecard

Assess where your program stands and what to improve.

Dimension | Level 1 (Ad Hoc) | Level 2 (Emerging) | Level 3 (Systematic) | Level 4 (Culture) | Your Score
Frequency | <1 test/month | 1–4 tests/month | 5–15 tests/month | 15+ tests/month, continuous | __
Rigor | No hypothesis; check results when convenient | Hypothesis written; basic significance check | Formal brief with MDE, sample size calc, guardrails | Automated statistical engine with sequential testing | __
Scope | Only landing pages or email subject lines | Website + email + ads | Full funnel: acquisition, activation, retention, pricing | Product + marketing + sales + ops | __
Ownership | One person runs all tests | Small team of testers | Every growth team runs own tests | Everyone in the company can and does experiment | __
Learning | Results stored in someone’s head | Spreadsheet of results | Searchable experiment log with learnings | Institutional knowledge base that compounds | __
Culture | "We should test more" | Tests happen but aren’t celebrated | Failed experiments shared openly | "We don’t launch without testing" is a company value | __

Scoring: Rate yourself 1–4 on each of the six dimensions, then add the ratings for a total out of 24 (minimum 6).

  • 6–12: Early stage — focus on building the habit and basic infrastructure
  • 13–18: Growing — invest in tooling, training, and expanding scope
  • 19–24: Advanced — focus on compounding learnings and organizational scale
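The scoring bands above translate directly into a lookup. This is a minimal sketch: the function name `maturity_stage` is hypothetical and the ratings in `example_scores` are made-up placeholders.

```python
DIMENSIONS = ("Frequency", "Rigor", "Scope", "Ownership", "Learning", "Culture")


def maturity_stage(scores):
    """Sum six 1-4 ratings and return (total, stage)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dimension in DIMENSIONS:
        if scores[dimension] not in (1, 2, 3, 4):
            raise ValueError(f"{dimension} must be rated 1-4")
    total = sum(scores[d] for d in DIMENSIONS)
    if total <= 12:
        stage = "Early stage"
    elif total <= 18:
        stage = "Growing"
    else:
        stage = "Advanced"
    return total, stage


# Hypothetical self-assessment, for illustration only.
example_scores = {
    "Frequency": 2, "Rigor": 3, "Scope": 2,
    "Ownership": 2, "Learning": 3, "Culture": 2,
}
print(maturity_stage(example_scores))
```

Recording the per-dimension ratings each quarter, not just the total, shows which dimension is holding the program back.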

7. Quarterly Experimentation Review Agenda

Run this meeting every quarter to assess program health and plan the next cycle.

Duration: 60 minutes

Attendees: Growth leads, product marketing, RevOps, engineering rep

Agenda:

  1. Scorecard review (10 min): Experiments run, win rate, cumulative impact on key metrics
  2. Top 3 wins (10 min): Share the biggest wins and why they worked
  3. Top 3 learnings from failures (10 min): What surprised us? What did we learn?
  4. Backlog review (15 min): Reprioritize the experiment pipeline using ICE scores
  5. Program health (10 min): Maturity scorecard update — are we improving?
  6. Next quarter goals (5 min): How many experiments? Which areas of the funnel?
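The numbers for the scorecard review in item 1 can be pulled straight from the completed-experiments log. The sketch below uses the three example rows from the Completed table in Section 3. The field names and the function `program_summary` are assumptions for illustration, not a required log format.

```python
from collections import Counter

# Example rows mirroring the Completed table in Section 3.
completed = [
    {"id": "EXP-001", "result": "Win", "shipped": True},
    {"id": "EXP-002", "result": "Lose", "shipped": False},
    {"id": "EXP-003", "result": "Inconclusive", "shipped": False},
]


def program_summary(log):
    """Count experiments by result and compute the win rate."""
    counts = Counter(entry["result"] for entry in log)
    run = len(log)
    return {
        "experiments_run": run,
        "wins": counts["Win"],
        "losses": counts["Lose"],
        "inconclusive": counts["Inconclusive"],
        "win_rate": counts["Win"] / run if run else 0.0,
        "shipped": sum(1 for entry in log if entry["shipped"]),
    }


summary = program_summary(completed)
print(f"{summary['experiments_run']} experiments, "
      f"win rate {summary['win_rate']:.0%}, "
      f"{summary['shipped']} shipped")
```

Cumulative impact is left out on purpose: adding up individual lifts tends to overstate the combined effect, so it is better reported from a holdout or from the metric's trend.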

Capstone Save Point

Save your completed experiment briefs, ICE-scored backlog, and maturity scorecard for Section 10 (Experimentation & Optimization Plan) of your Module 22 Capstone: Annual Growth Marketing Plan.