Experimentation Program Toolkit
Templates, frameworks, and scorecards for hypothesis documentation, ICE scoring, experiment tracking, and running a systematic experimentation program.
1. Experiment Brief Template
Use this for every experiment before launch. No brief = no test.
Experiment Name: ________ (short, descriptive)
Experiment ID: EXP-___
Owner: ________
Date Submitted: ________
Estimated Run Time: ___ weeks
Hypothesis
We believe that [specific change] will cause [primary metric] to [increase/decrease] by [estimated magnitude] because [reasoning based on data or insight].
Experiment Design
| Field | Details |
|---|---|
| What changes | Describe the exact variation being tested |
| Control | What the current experience looks like |
| Variant | What the new experience looks like |
| Primary metric | The one metric that determines win/loss |
| Secondary metrics | Supporting metrics to watch |
| Guardrail metrics | Metrics that must NOT degrade (e.g., page load, satisfaction) |
| Traffic split | e.g., 50/50, or 90/10 for riskier changes |
| Minimum sample size | Calculated based on MDE and baseline conversion |
| Run time | Minimum days needed to reach the required sample size (round up to full week cycles) |
Success Criteria
- Win: Primary metric improves by ≥___% and the result is statistically significant at the 95% confidence level (p < 0.05) AND guardrail metrics are flat or improved
- Inconclusive: No statistically significant change detected within the planned run time
- Lose: Primary metric shows a statistically significant decline OR guardrail metrics degrade
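The success criteria above can be sketched as a small classifier. This is a minimal illustration, not a prescribed method: the pooled two-proportion z-test is one common choice for conversion metrics, and the names `classify_outcome`, `min_lift`, and `guardrails_ok` are hypothetical. It assumes that "declines" means a statistically significant decline, and that a significant lift smaller than the target counts as inconclusive; adjust both assumptions to match the rules written in your own brief.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so nothing to detect
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def classify_outcome(conv_control, n_control, conv_variant, n_variant,
                     min_lift, guardrails_ok, alpha=0.05):
    """Return "Win", "Lose", or "Inconclusive".

    min_lift is the relative improvement required for a win (0.10 = 10%).
    guardrails_ok is True when guardrail metrics are flat or improved.
    Assumes the control group has at least one conversion.
    """
    rate_control = conv_control / n_control
    rate_variant = conv_variant / n_variant
    lift = (rate_variant - rate_control) / rate_control
    p_value = two_proportion_p_value(conv_control, n_control,
                                     conv_variant, n_variant)
    significant = p_value < alpha

    if not guardrails_ok:
        return "Lose"  # a guardrail breach loses regardless of the primary metric
    if significant and lift >= min_lift:
        return "Win"
    if significant and lift < 0:
        return "Lose"
    return "Inconclusive"  # includes significant lifts below the target (assumption)


# Hypothetical counts: 3.5% control vs. 4.2% variant on 10,000 users each.
print(classify_outcome(350, 10_000, 420, 10_000, min_lift=0.10, guardrails_ok=True))
```

Run the classification only once the planned sample size and run time have been reached; applying it to interim data reintroduces the peeking problem.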
Decision
| Outcome | Action |
|---|---|
| Win | Ship to 100%, document learning |
| Inconclusive | Extend test OR kill and move on (decide before launch) |
| Lose | Kill, document learning, consider inverse test |
2. ICE Prioritization Scorecard
Rank experiment ideas before committing resources.
| # | Experiment Idea | Impact (1–10) | Confidence (1–10) | Ease (1–10) | ICE Score | Priority |
|---|---|---|---|---|---|---|
| 1 | ___ | __ | __ | __ | __ | __ |
| 2 | ___ | __ | __ | __ | __ | __ |
| 3 | ___ | __ | __ | __ | __ | __ |
| 4 | ___ | __ | __ | __ | __ | __ |
| 5 | ___ | __ | __ | __ | __ |
Scoring guide:
- Impact: If this experiment wins, how much would it move the needle? (10 = transforms a key metric, 1 = marginal)
- Confidence: How sure are you it will work, based on data or prior tests? (10 = strong evidence, 1 = pure guess)
- Ease: How quickly and cheaply can you run this test? (10 = ship this week, 1 = months of dev work)
ICE Score = Impact + Confidence + Ease (or average them — pick one method and be consistent)
Rule of thumb: Top 3 ICE scores go into this sprint. The rest go to backlog.
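The scoring and the top-3 rule can be sketched in a few lines. This is an illustrative example only: the `Idea` class, the `rank_ideas` helper, and the sample ideas and scores are all hypothetical. It supports both the sum and the average method; whichever you pick, use it for every idea.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10

    def __post_init__(self):
        # Reject scores outside the 1-10 scale used by the scorecard.
        for field in ("impact", "confidence", "ease"):
            value = getattr(self, field)
            if not 1 <= value <= 10:
                raise ValueError(f"{field} must be 1-10, got {value}")

    def ice_score(self, method="sum"):
        total = self.impact + self.confidence + self.ease
        return total if method == "sum" else total / 3


def rank_ideas(ideas, sprint_slots=3, method="sum"):
    """Sort by ICE score (highest first) and split into sprint and backlog."""
    ranked = sorted(ideas, key=lambda idea: idea.ice_score(method), reverse=True)
    return ranked[:sprint_slots], ranked[sprint_slots:]


# Hypothetical ideas and scores, for illustration only.
ideas = [
    Idea("Shorter signup form", 8, 7, 9),
    Idea("Annual pricing toggle", 9, 5, 4),
    Idea("Exit-intent survey", 4, 6, 7),
    Idea("Homepage hero copy", 6, 6, 9),
    Idea("Onboarding checklist", 7, 8, 5),
]
sprint, backlog = rank_ideas(ideas)
for idea in sprint:
    print(f"Sprint:  {idea.name} (ICE {idea.ice_score()})")
for idea in backlog:
    print(f"Backlog: {idea.name} (ICE {idea.ice_score()})")
```

Ties keep their original input order because Python's sort is stable, so decide on a tie-break rule (for example, higher Confidence first) if ties are common in your backlog.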
3. Experiment Backlog & Pipeline Tracker
Maintain a living pipeline of experiment ideas, in-flight tests, and completed experiments.
Backlog (Ideas)
| Idea | Hypothesis (1-line) | ICE Score | Source | Status |
|---|---|---|---|---|
| ___ | ___ | __ | Customer feedback | 🟡 Queued |
| ___ | ___ | __ | Competitor analysis | 🟡 Queued |
| ___ | ___ | __ | Data anomaly | 🟡 Queued |
In-Flight
| Experiment | Start Date | Est. End | Primary Metric | Current Trend | Status |
|---|---|---|---|---|---|
| EXP-004 | __ | __ | ___ | ↑ / → / ↓ | 🟢 Running |
| EXP-005 | __ | __ | ___ | ↑ / → / ↓ | 🟢 Running |
Completed
| Experiment | Result | Primary Metric Change | Learning | Shipped? |
|---|---|---|---|---|
| EXP-001 | Win | +12% conversion | Users prefer shorter forms | ✅ Yes |
| EXP-002 | Lose | -3% CTR | Urgency messaging backfired | ❌ No |
| EXP-003 | Inconclusive | +1.2% (not sig) | Need larger sample or bigger change | ❌ No |
4. Sample Size Calculator Reference
Use this to estimate how long you need to run each test.
Inputs needed:
- Baseline conversion rate: Your current metric (e.g., 3.5% trial-to-paid conversion)
- Minimum Detectable Effect (MDE): The smallest improvement worth detecting (e.g., 10% relative lift = 3.5% → 3.85%)
- Statistical significance level: Typically 95% confidence (α = 0.05, meaning a result counts as significant when p < 0.05)
- Statistical power: Typically 80%
- Daily traffic/volume: How many users or events hit the test per day
Quick reference table (two-sided test, 95% confidence, 80% power; values are rounded approximations):
| Baseline Rate | MDE (relative) | Sample Size per Variant | Days at 1,000 Users/Day per Variant |
|---|---|---|---|
| 2% | 10% | ~78,000 | ~78 days |
| 2% | 20% | ~20,000 | ~20 days |
| 5% | 10% | ~30,000 | ~30 days |
| 5% | 20% | ~7,700 | ~8 days |
| 10% | 10% | ~14,500 | ~15 days |
| 10% | 20% | ~3,700 | ~4 days |
Key rule: Never stop a test early because it "looks like it’s winning." Peeking inflates false positive rates. Set your run time upfront and stick to it.
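For inputs that are not in the quick reference table, the sample size can be estimated directly. The sketch below uses the standard normal-approximation formula for comparing two proportions; it is one common approach, not necessarily the one used to build the table, so its results land close to the table's rounded values rather than exactly on them. The function names and the rounding up to full weeks are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided test.

    baseline: current conversion rate (0.05 = 5%)
    relative_mde: smallest relative lift worth detecting (0.10 = 10%)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def run_time_days(n_per_variant, daily_users_per_variant):
    """Days needed to reach the sample size, rounded up to full weeks."""
    days = math.ceil(n_per_variant / daily_users_per_variant)
    return math.ceil(days / 7) * 7


# Example: 5% baseline, 20% relative MDE, 1,000 users per variant per day.
n = sample_size_per_variant(0.05, 0.20)
print(n, "users per variant;", run_time_days(n, 1_000), "days")
```

Rounding up to whole weeks keeps weekday and weekend behavior equally represented in both groups, which is why the run time in the brief asks for full week cycles.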
5. Experiment Retrospective Template
Complete this after every experiment to capture institutional knowledge.
Experiment: EXP-___ — ________
Result: Win / Lose / Inconclusive
Run dates: ___ to ___
What happened?
Describe the result in plain language. Include the metric change and confidence level.
Why did it happen?
Your best interpretation of why the variant won, lost, or was flat. Reference qualitative data if available (session recordings, surveys, support tickets).
What did we learn?
The transferable insight. Frame it as a principle, not just a finding.
Example: "Users abandon long forms not because of time, but because they don’t understand why each field is needed. Progressive disclosure with context copy outperforms simply reducing fields."
What’s next?
- If win: When does it ship? Any follow-up experiments?
- If lose: What’s the counter-hypothesis? Should we test the opposite?
- If inconclusive: Expand sample? Test a bigger change? Move on?
Knowledge base update
- Added to Experiment Log
- Learning shared in team Slack/standup
- Updated playbook if this changes a best practice
6. Experimentation Program Maturity Scorecard
Assess where your program stands and what to improve.
| Dimension | Level 1 (Ad Hoc) | Level 2 (Emerging) | Level 3 (Systematic) | Level 4 (Culture) | Your Score |
|---|---|---|---|---|---|
| Frequency | <1 test/month | 1–4 tests/month | 5–15 tests/month | 15+ tests/month, continuous | __ |
| Rigor | No hypothesis; check results when convenient | Hypothesis written; basic significance check | Formal brief with MDE, sample size calc, guardrails | Automated statistical engine with sequential testing | __ |
| Scope | Only landing pages or email subject lines | Website + email + ads | Full funnel: acquisition, activation, retention, pricing | Product + marketing + sales + ops | __ |
| Ownership | One person runs all tests | Small team of testers | Every growth team runs own tests | Everyone in the company can and does experiment | __ |
| Learning | Results stored in someone’s head | Spreadsheet of results | Searchable experiment log with learnings | Institutional knowledge base that compounds | __ |
| Culture | "We should test more" | Tests happen but aren’t celebrated | Failed experiments shared openly | "We don’t launch without testing" is a company value | __ |
Scoring: Rate yourself 1–4 on each dimension. Total score out of 24.
- 6–12: Early stage — focus on building the habit and basic infrastructure
- 13–18: Growing — invest in tooling, training, and expanding scope
- 19–24: Advanced — focus on compounding learnings and organizational scale
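If you track the scorecard in a spreadsheet export or script, the total and band can be computed as below. This is a small sketch; the `maturity_band` function and the example scores are hypothetical, while the dimension names and score bands come from the scorecard above.

```python
DIMENSIONS = ("Frequency", "Rigor", "Scope", "Ownership", "Learning", "Culture")


def maturity_band(scores):
    """Return (total, band) for a dict mapping each dimension to a 1-4 score."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"Missing dimensions: {', '.join(missing)}")
    for dimension in DIMENSIONS:
        if not 1 <= scores[dimension] <= 4:
            raise ValueError(f"{dimension} must be scored 1-4")

    total = sum(scores[d] for d in DIMENSIONS)
    if total <= 12:
        band = "Early stage"
    elif total <= 18:
        band = "Growing"
    else:
        band = "Advanced"
    return total, band


# Hypothetical self-assessment.
example = {"Frequency": 3, "Rigor": 3, "Scope": 2,
           "Ownership": 2, "Learning": 3, "Culture": 2}
print(maturity_band(example))
```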
7. Quarterly Experimentation Review Agenda
Run this meeting every quarter to assess program health and plan the next cycle.
Duration: 60 minutes
Attendees: Growth leads, product marketing, RevOps, engineering rep
Agenda:
- Scorecard review (10 min): Experiments run, win rate, cumulative impact on key metrics
- Top 3 wins (10 min): Share the biggest wins and why they worked
- Top 3 learnings from failures (10 min): What surprised us? What did we learn?
- Backlog review (15 min): Reprioritize the experiment pipeline using ICE scores
- Program health (10 min): Maturity scorecard update — are we improving?
- Next quarter goals (5 min): How many experiments? Which areas of the funnel?
Capstone Save Point
Save your completed experiment briefs, ICE-scored backlog, and maturity scorecard for Section 10 (Experimentation & Optimization Plan) of your Module 22 Capstone: Annual Growth Marketing Plan.