Experimentation Program Toolkit · Template

Templates, frameworks, and scorecards for building and running a systematic experimentation program.

1. Experiment Brief Template

Use this for every experiment before launch. No brief = no test.

Experiment Name: ________ (short, descriptive)

Experiment ID: EXP-___

Owner: ________

Date Submitted: ________

Estimated Run Time: ___ weeks

Hypothesis

We believe that [specific change] will cause [primary metric] to [increase/decrease] by [estimated magnitude] because [reasoning based on data or insight].

Experiment Design

Field	Details
What changes	Describe the exact variation being tested
Control	What the current experience looks like
Variant	What the new experience looks like
Primary metric	The one metric that determines win/loss
Secondary metrics	Supporting metrics to watch
Guardrail metrics	Metrics that must NOT degrade (e.g., page load, satisfaction)
Traffic split	e.g., 50/50, or 90/10 for riskier changes
Minimum sample size	Per variant, calculated from baseline conversion rate, MDE, significance level, and power
Run time	Minimum days to reach the minimum sample size, rounded up to full week cycles

Success Criteria

Win: Primary metric improves by ≥___% and the result is statistically significant at the 95% confidence level (p < 0.05) AND guardrail metrics are flat or improved
Inconclusive: No statistically significant change detected within the planned run time
Lose: Primary metric shows a statistically significant decline OR any guardrail metric degrades
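
The criteria above can be sketched as a small decision function. This is a minimal illustration, not a prescribed method: it assumes the primary metric is a conversion rate with a nonzero baseline, uses a pooled two-proportion z-test, treats only a statistically significant decline as a loss, and treats a significant lift below the target as inconclusive. The names classify_outcome, min_lift, and guardrails_ok are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(conv_c: int, n_c: int, conv_v: int, n_v: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    if se == 0:  # nobody converted, or everybody did
        return 1.0
    z = (conv_v / n_v - conv_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def classify_outcome(conv_c: int, n_c: int, conv_v: int, n_v: int,
                     min_lift: float, guardrails_ok: bool,
                     alpha: float = 0.05) -> str:
    """Return "Win", "Lose", or "Inconclusive" for one experiment.

    min_lift is the relative improvement required for a win (0.10 = 10%).
    guardrails_ok is True when no guardrail metric has degraded.
    Assumes the control conversion rate is greater than zero.
    """
    rate_c = conv_c / n_c
    rate_v = conv_v / n_v
    lift = (rate_v - rate_c) / rate_c  # relative change vs. control
    significant = two_sided_p_value(conv_c, n_c, conv_v, n_v) < alpha

    if not guardrails_ok:
        return "Lose"
    if significant and lift < 0:
        return "Lose"
    if significant and lift >= min_lift:
        return "Win"
    # Assumption: significant but below-target lifts are not counted as wins.
    return "Inconclusive"


# Illustrative numbers only: 5.0% control vs. 6.0% variant conversion.
print(classify_outcome(500, 10_000, 600, 10_000,
                       min_lift=0.10, guardrails_ok=True))
```

Writing the rule down this way forces the team to agree on the edge cases (for example, a significant but small lift) before launch rather than after the data arrives.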

Decision

Outcome	Action
Win	Ship to 100%, document learning
Inconclusive	Extend test OR kill and move on (decide before launch)
Lose	Kill, document learning, consider inverse test

2. ICE Prioritization Scorecard

Rank experiment ideas before committing resources.

#	Experiment Idea	Impact (1–10)	Confidence (1–10)	Ease (1–10)	ICE Score
1	___	__	__	__	__
2	___	__	__	__	__
3	___	__	__	__	__
4	___	__	__	__	__
5	___	__	__	__	__

Scoring guide:

Impact: If this experiment wins, how much would it move the needle? (10 = transforms a key metric, 1 = marginal)
Confidence: How sure are you it will work, based on data or prior tests? (10 = strong evidence, 1 = pure guess)
Ease: How quickly and cheaply can you run this test? (10 = ship this week, 1 = months of dev work)

ICE Score = Impact + Confidence + Ease (or the average of the three; both produce the same ranking, so pick one method and use it consistently)

Rule of thumb: Top 3 ICE scores go into this sprint. The rest go to backlog.
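
The scoring and the top-3 rule can be sketched in a few lines. This is a minimal example that assumes the additive version of the score; the Idea class, the plan_sprint helper, and the sample ideas are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10

    @property
    def ice(self) -> int:
        # Additive ICE score; dividing by 3 would not change the ranking.
        return self.impact + self.confidence + self.ease


def plan_sprint(ideas: list[Idea], slots: int = 3) -> tuple[list[Idea], list[Idea]]:
    """Split ideas into (this sprint, backlog) by descending ICE score.

    Ties keep their original order because sorted() is stable.
    """
    ranked = sorted(ideas, key=lambda idea: idea.ice, reverse=True)
    return ranked[:slots], ranked[slots:]


# Made-up ideas for illustration only.
ideas = [
    Idea("Shorter signup form", impact=7, confidence=8, ease=9),
    Idea("Annual pricing toggle", impact=9, confidence=5, ease=3),
    Idea("New hero headline", impact=4, confidence=6, ease=10),
    Idea("Onboarding checklist", impact=8, confidence=7, ease=5),
]
sprint, backlog = plan_sprint(ideas)
for idea in sprint:
    print(idea.ice, idea.name)
```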

3. Experiment Backlog & Pipeline Tracker

Maintain a living pipeline of experiment ideas, in-flight tests, and completed experiments.

Backlog (Ideas)

Idea	Hypothesis (1-line)	ICE Score	Source	Status
___	___	__	Customer feedback	🟡 Queued
___	___	__	Competitor analysis	🟡 Queued
___	___	__	Data anomaly	🟡 Queued

In-Flight

Experiment	Start Date	Est. End	Primary Metric	Current Trend	Status
EXP-001	__	__	___	↑ / → / ↓	🟢 Running
EXP-002	__	__	___	↑ / → / ↓	🟢 Running

Completed

Experiment	Result	Primary Metric Change	Learning	Shipped?
EXP-001	Win	+12% conversion	Users prefer shorter forms	✅ Yes
EXP-002	Lose	-3% CTR	Urgency messaging backfired	❌ No
EXP-003	Inconclusive	+1.2% (not sig)	Need larger sample or bigger change	❌ No

4. Sample Size Calculator Reference

Use this to estimate how long you need to run each test.

Inputs needed:

Baseline conversion rate: Your current metric (e.g., 3.5% trial-to-paid conversion)
Minimum Detectable Effect (MDE): The smallest improvement worth detecting (e.g., 10% relative lift = 3.5% → 3.85%)
Significance level (α): Typically 0.05, which corresponds to a 95% confidence level (declare a result significant when p < 0.05)
Statistical power: Typically 80%
Daily traffic/volume: How many users or events hit the test per day

Quick reference table (two-sided test, α = 0.05, 80% power; values are approximate):

Baseline Rate	MDE (relative)	Sample Size per Variant	Days at 1,000 Users/Day per Variant
2%	10%	~78,000	~78 days
2%	20%	~20,000	~20 days
5%	10%	~30,000	~30 days
5%	20%	~7,700	~8 days
10%	10%	~14,500	~15 days
10%	20%	~3,700	~4 days
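
For combinations not in the table, the sample size can be estimated directly. The sketch below assumes the standard normal-approximation formula for comparing two proportions, n = (z_α/2 + z_β)² · [p1(1 − p1) + p2(1 − p2)] / (p2 − p1)²; the function names are hypothetical. Its results land slightly above the table's rounded figures (within about 10%), so treat both as planning estimates rather than exact requirements.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH variant for a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate if the variant hits the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(n_per_variant: int, daily_users_per_variant: int) -> int:
    """Days needed to reach the sample size, rounded up to full weeks."""
    days = ceil(n_per_variant / daily_users_per_variant)
    return ceil(days / 7) * 7


# Example from the inputs above: 3.5% baseline, 10% relative MDE.
n = sample_size_per_variant(0.035, 0.10)
print(n, "users per variant,", days_to_run(n, 1_000), "days at 1,000/day per variant")
```

Rounding the run time up to whole weeks keeps weekday and weekend traffic equally represented in both variants.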

Key rule: Never stop a test early because it "looks like it’s winning." Peeking inflates false positive rates. Set your run time upfront and stick to it.
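
The effect of peeking can be seen in a small simulation. The sketch below makes simplifying assumptions: it models A/A tests (no real difference) by accumulating one standard-normal increment per look, so the z-statistic at look k is the running sum divided by √k. The function name and parameters are hypothetical. With "peek" enabled, the test stops the first time |z| crosses the 5% threshold; without it, only the final look counts.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_experiments: int, n_looks: int, peek: bool,
                        alpha: float = 0.05, seed: int = 0) -> float:
    """Share of simulated A/A tests that are (wrongly) declared significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_experiments):
        running_sum = 0.0
        z = 0.0
        stopped_early = False
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0, 1)  # one more batch of data
            z = running_sum / look ** 0.5   # z-statistic after this look
            if peek and abs(z) > z_crit:
                stopped_early = True        # "it looks like it's winning"
                break
        if stopped_early or (not peek and abs(z) > z_crit):
            false_positives += 1
    return false_positives / n_experiments


fixed = false_positive_rate(2_000, n_looks=20, peek=False)
peeking = false_positive_rate(2_000, n_looks=20, peek=True)
print(f"fixed horizon: {fixed:.1%}   peeking at every look: {peeking:.1%}")
```

In this simulation the fixed-horizon rate stays near the nominal 5%, while checking at every look and stopping on the first significant reading pushes the false positive rate well above it.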

5. Experiment Retrospective Template

Complete this after every experiment to capture institutional knowledge.

Experiment: EXP-___ — ________

Result: Win / Lose / Inconclusive

Run dates: ___ to ___

What happened?

Describe the result in plain language. Include the metric change and confidence level.

Why did it happen?

Your best interpretation of why the variant won, lost, or was flat. Reference qualitative data if available (session recordings, surveys, support tickets).

What did we learn?

The transferable insight. Frame it as a principle, not just a finding.

Example: "Users abandon long forms not because of time, but because they don’t understand why each field is needed. Progressive disclosure with context copy outperforms simply reducing fields."

What’s next?

If win: When does it ship? Any follow-up experiments?
If lose: What’s the counter-hypothesis? Should we test the opposite?
If inconclusive: Expand sample? Test a bigger change? Move on?

Knowledge base update

[ ] Added to Experiment Log
[ ] Learning shared in team Slack/standup
[ ] Updated playbook if this changes a best practice

6. Experimentation Program Maturity Scorecard

Assess where your program stands and what to improve.

Dimension	Level 1 (Ad Hoc)	Level 2 (Emerging)	Level 3 (Systematic)	Level 4 (Culture)	Your Score
Frequency	<1 test/month	1–4 tests/month	5–15 tests/month	More than 15 tests/month, continuous	__
Rigor	No hypothesis; check results when convenient	Hypothesis written; basic significance check	Formal brief with MDE, sample size calc, guardrails	Automated statistical engine with sequential testing	__
Scope	Only landing pages or email subject lines	Website + email + ads	Full funnel: acquisition, activation, retention, pricing	Product + marketing + sales + ops	__
Ownership	One person runs all tests	Small team of testers	Every growth team runs own tests	Everyone in the company can and does experiment	__
Learning	Results stored in someone’s head	Spreadsheet of results	Searchable experiment log with learnings	Institutional knowledge base that compounds	__
Culture	"We should test more"	Tests happen but aren’t celebrated	Failed experiments shared openly	"We don’t launch without testing" is a company value	__

Scoring: Rate yourself 1–4 on each of the six dimensions, then sum the scores for a total between 6 and 24.

6–12: Early stage — focus on building the habit and basic infrastructure
13–18: Growing — invest in tooling, training, and expanding scope
19–24: Advanced — focus on compounding learnings and organizational scale
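
The scoring bands can be applied mechanically. A minimal sketch, assuming the six dimensions and the three bands listed above; the function name maturity_stage and the example ratings are hypothetical.

```python
DIMENSIONS = ("Frequency", "Rigor", "Scope", "Ownership", "Learning", "Culture")


def maturity_stage(scores: dict[str, int]) -> tuple[int, str]:
    """Return (total score, stage) for a completed maturity scorecard."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dimension in DIMENSIONS:
        if not 1 <= scores[dimension] <= 4:
            raise ValueError(f"{dimension} must be rated 1-4")

    total = sum(scores[d] for d in DIMENSIONS)
    if total <= 12:
        stage = "Early stage"
    elif total <= 18:
        stage = "Growing"
    else:
        stage = "Advanced"
    return total, stage


# Made-up self-assessment for illustration only.
example = {"Frequency": 2, "Rigor": 3, "Scope": 2,
           "Ownership": 1, "Learning": 2, "Culture": 3}
print(maturity_stage(example))
```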

7. Quarterly Experimentation Review Agenda

Run this meeting every quarter to assess program health and plan the next cycle.

Duration: 60 minutes

Attendees: Growth leads, product marketing, RevOps, engineering rep

Agenda:

Scorecard review (10 min): Experiments run, win rate, cumulative impact on key metrics
Top 3 wins (10 min): Share the biggest wins and why they worked
Top 3 learnings from failures (10 min): What surprised us? What did we learn?
Backlog review (15 min): Reprioritize the experiment pipeline using ICE scores
Program health (10 min): Maturity scorecard update — are we improving?
Next quarter goals (5 min): How many experiments? Which areas of the funnel?

Capstone Save Point

Save your completed experiment briefs, ICE-scored backlog, and maturity scorecard for Section 10 (Experimentation & Optimization Plan) of your Module 22 Capstone: Annual Growth Marketing Plan.
