
PUBLISHED: May 27, 2026 12 min read

Designing Defensible Geo Holdout Tests for Incrementality Measurement

Tyler Kochanski

Senior Manager, Data Analysis & Insights

Most incrementality tests don’t survive scrutiny once they’re underway. There’s a wide gap between running a holdout and running a defensible holdout. Most teams discover which side of that gap they’re on only after results come in, which is usually too late to fix anything.

Our previous article on incrementality covered why measurement matters. This article will help teams bridge the gap between deciding to test and actually running a test that holds up. If you’re reading this, you’ve likely already chosen geo holdouts as the right approach for your situation. What you need now is a framework for designing one that produces a confident answer.

We’ll explore how you can build this framework in three steps:

1. Frame the decision and the hypothesis.

Know what you’re trying to learn and what you’ll do with each possible answer before you design anything else.

2. Run the power analysis.

Validate that the test you design can actually answer your specific question at the precision your decision requires.

3. Set decision thresholds and a readout cadence.

Lock in what counts as a result, and when you’ll allow yourself to call it, before any data starts flowing.

Each step has its own failure modes, and we’ll discuss which we see most often in our work.

Decision and question framing when designing incrementality tests

Our previous incrementality article advised starting with a clear business question. We need to go one layer deeper. The question alone isn’t enough. As test designers, we must know what decision the result will trigger. That decision determines the required precision and the minimum detectable effect (MDE), which in turn set the test’s size, duration, and cost.

Decisions that emerge from an incrementality test usually look like specific budget choices. Should we spend more in this channel, and if so, how much more? Did the channel actually drive incremental lift, and to what extent? In absolute terms, how profitable was that lift?

We pull those decisions to the front end of the project. If we aren’t designing the test against the decision, we end up with something that we can’t operationalize into a media flight. Even if the test was technically sound, we risk losing the actionable part.

There’s a difference between asking, ‘Is branded paid search incremental at all?’ and asking, ‘Should we cut branded paid search spend by 50%?’ The first is a simple yes-or-no question. The second is a specific budget decision that requires knowing how much revenue you stand to lose. They require completely different test designs.

Work backwards from the decision. If your team would reallocate budget at a +10% lift but not at a +5% lift, the test needs to distinguish those two outcomes. If the test can only detect lift at or above +25%, it can’t answer either question.

This introduces the concept of decision thresholds. It anchors the power analysis and forms the conceptual bedrock of test design.

The mechanics of a well-designed geo test

Before looking at power analysis or market selection, we need to ground what a well-designed geo test actually consists of. Most failures in geo testing are structural rather than analytical. A test missing a clean pre-period or carrying fuzzy definitions of treatment can’t be saved by complex analysis later.

What is a geo holdout?

A geo holdout splits a brand’s footprint into two groups of markets. The treatment group is where the channel runs as usual. The control group is where the channel is turned off or held at a meaningfully different level. We compare how performance changes in each group from before the channel change to after it. If the two groups were tracking together beforehand, the difference between those changes is attributable to the channel itself.

That core idea is mechanically straightforward. The discipline lives in the parts that surround it.

The components of a geo test

Every geo test is built from the same set of components. Getting each one right at the start makes the analysis possible at the end:

Treatment markets

The cities or designated market areas (DMAs) where the channel continues to run. These represent “business as usual.”

Control markets

The cities or DMAs where the channel is turned off (or held back). These provide the counterfactual.

Holdout share

The percentage of the customer base, or media weight, depending on the question, represented by the control group. Too small and the test is underpowered; too large and the opportunity cost is unacceptable to the business.

Pre-period

The window of time before the channel change, used to establish a baseline and to verify that test and control markets were tracking together. Typically equal to or longer than the test period itself.

Treatment period (or test period)

The window during which the channel change is in effect. Long enough for power to accumulate, short enough that conditions don’t drift materially.

Primary KPI

The single metric the test is designed to detect lift on. Usually a conversion measure tied to the budget decision (orders, sign-ups, donations, qualified leads). Secondary KPIs can support the read but shouldn’t drive it.

Decision thresholds

The lift level at which the test result triggers a specific business action, defined before the test runs, not after. This is the link back to the decision framing covered earlier.

Analysis method

Almost always Difference-in-Differences (DiD), which compares the change in test markets to the change in control markets across the pre and treatment periods.
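To make the DiD comparison concrete, here is a minimal sketch on group-level weekly KPI values. The function name `did_lift` and all the numbers are hypothetical, and the relative lift is expressed against an assumed counterfactual (what the treatment group would have done had it changed the way the control group did). A production analysis would typically use a regression with market and week effects so that it also yields a standard error.

```python
from statistics import mean


def did_lift(test_pre, test_post, control_pre, control_post):
    """Difference-in-Differences on weekly KPI totals per group.

    Returns (absolute_effect, relative_lift).
    """
    # Change in each group from the pre-period to the treatment period
    test_change = mean(test_post) - mean(test_pre)
    control_change = mean(control_post) - mean(control_pre)

    # The DiD estimate: how much more the treatment group changed
    effect = test_change - control_change

    # Counterfactual: treatment-group level without the channel effect
    counterfactual = mean(test_post) - effect
    return effect, effect / counterfactual


# Hypothetical weekly conversions (channel on in treatment, off in control)
effect, lift = did_lift(
    test_pre=[98, 100, 102],
    test_post=[108, 110, 112],
    control_pre=[49, 50, 51],
    control_post=[51, 52, 53],
)
print(f"absolute effect: {effect}, relative lift: {lift:.1%}")
```

Note that the two groups sit at very different levels (about 100 versus about 50 per week). The subtraction removes that gap, which is why matching on movement matters more than matching on size.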

Why each component matters

These components are the actual work. A test with a fuzzy primary KPI produces a fuzzy result. A test with no defined decision threshold produces results that everyone interprets differently. A test with a short pre-period produces a number that looks like a lift estimate but isn’t one.

How to design a well-powered geo test

The four levers of power

Power is the probability that a test will detect a real effect when one exists. It relies on four variables. Turn one, and another has to give.

Sample size

Markets multiplied by days. You need more markets, a longer test, or both to increase it.

Baseline variance

How noisy the metric is week to week. Set by the business and the metric, not by the test designer.

Effect size (MDE)

The smallest lift the test is designed to detect with high confidence. The smaller the MDE you want, the more sample you need.

Significance threshold

How willing you are to accept a false positive. Conventionally 90% or 95% confidence.

Power operates like the pixel resolution of a camera. To detect a small object, you need either higher resolution or a bigger object. You can’t beat the math; you only choose where to spend the cost.

Calculating MDE for a geo test

Here’s the formula at practitioner level:

MDE ≈ (z_{α/2} + z_β) × σ × √(1/n_test + 1/n_control)

Let’s take a look at what each variable means, in plain terms:

z_{α/2}: the confidence level, expressed as a z-score. Translation: how sure you want to be that you didn’t get a false positive.

Example value: 1.645 for 90% confidence, 1.96 for 95% confidence.

z_β: the power level, expressed as a z-score. Translation: how often the test should detect a real effect when one exists.

Example value: 0.842 for 80% power, which is the industry standard.

σ (sigma): the standard deviation of the metric in the pre-period. Translation: how noisy the metric is week to week.

Example value: calculated from your own pre-period data, which is why a clean pre-period matters.

n_test and n_control: the sample sizes of each group. Translation: number of market-weeks in each arm of the test.

Example value: 10 markets × 17 weeks = 170 market-weeks per group.

To put it simply: if your markets are noisy week to week, you’ll need more markets or more weeks. If you have only 20 markets and high variance, you might only be able to detect a 15%+ lift, which means a 5% real lift will look like noise and the test will read as a null result.
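The formula can be sketched in a few lines of Python using only the standard library. The function name `mde`, the σ of 1.1, and the baseline of 2.0 conversions per market-week are all made-up illustration values, not figures from any real test. The formula returns the MDE in the metric’s own units, so the sketch divides by the assumed baseline mean to express it as a percentage.

```python
from math import sqrt
from statistics import NormalDist


def mde(sigma, n_test, n_control, confidence=0.90, power=0.80):
    """Minimum detectable effect, in the same units as sigma."""
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.645 at 90% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.842 at 80% power
    return (z_alpha + z_beta) * sigma * sqrt(1 / n_test + 1 / n_control)


# Hypothetical inputs: 170 market-weeks per arm, sigma of 1.1 conversions
absolute = mde(sigma=1.1, n_test=170, n_control=170)

# Hypothetical baseline of 2.0 conversions per market-week
relative = absolute / 2.0
print(f"MDE: {absolute:.2f} conversions per market-week ({relative:.0%} of baseline)")
```

With these invented inputs the MDE lands at roughly 15% of baseline. Quadrupling the market-weeks in both arms halves it, which is the square-root trade-off behind the sample-size lever.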

Common mistakes that teams make when calculating power

We consistently see the same handful of errors ruin otherwise well-planned tests. When teams calculate power, they often ignore pre-period variance, which inflates their power estimates. Why does this matter? Well, it’s like trying to catch a single missing scoop of coffee on a scale that wobbles by a whole 32 oz bag’s worth every time you read it. The difference is real, but the instrument is too imprecise to prove what that difference is.

But the most frequent trap isn’t mathematical. Teams consistently design their tests around the massive lift they hope to see, rather than the minimum lift required to actually change a budget decision. Even if they get the numbers right, the physical setup often fails because designers ignore geographic spillover, allowing shared media delivery or brand search behavior to quietly contaminate their control markets.

Most paid media tests need four to eight weeks minimum and 20 or more markets per arm for reasonable power on national channels. They also need a pre-period of equal or greater length than the test period for clean DiD analysis. These are starting points.

Put it in practice: a YouTube geo holdout

A client came to us wanting to validate whether YouTube was driving incremental conversions in their main group of markets. The channel represented about 31% of their media mix at roughly $12.4K/month, and leadership wanted hard evidence of return. Last-click attribution was reporting the channel as unprofitable, and they suspected the attribution was wrong. 

The test design we built

| Parameter | Value | Why |
| --- | --- | --- |
| Test type | Sustained geo holdout | YouTube is geo-targetable; clean off/on by market |
| Control markets | 10 cities | Representative across regions, channel mix, and conversion volume |
| % of customer base held out | 7.6% | Small enough to preserve campaign reach, large enough to power the test |
| Duration | 17 weeks | Sized to hit target MDE |
| Pre-period | 9 weeks | For DiD baseline and parallel trends check |
| Primary KPI | Incremental conversions (bottom of funnel) | Tied directly to the budget decision |
| Analysis method | Difference-in-Differences | Controls for seasonality and underlying trend |

Power analysis results 

Here’s how MDE moved with holdout size, holding duration constant at 17 weeks: 

| Holdout size | Control conversions (17 weeks) | MDE at 90% confidence / 80% power |
| --- | --- | --- |
| 4 cities | 170 | 21.5% |
| 6 cities | 227 | 18.5% |
| 8 cities | 264 | 17.0% |
| 10 cities (recommended) | 361 | 15.0% |
| 12 cities | 453 | 13.0% |
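As a rough sanity check on the table, the sketch below assumes that MDE shrinks in proportion to 1/√(control conversions). That scaling rule is an assumption made here for illustration, not a description of how the figures were produced. It anchors on the 4-city row and predicts the others.

```python
from math import sqrt

# (cities, control conversions over 17 weeks, reported MDE %) from the table
rows = [
    (4, 170, 21.5),
    (6, 227, 18.5),
    (8, 264, 17.0),
    (10, 361, 15.0),
    (12, 453, 13.0),
]

# Anchor on the smallest holdout and scale by the square-root rule
_, base_conversions, base_mde = rows[0]


def predicted_mde(conversions):
    """MDE implied by 1/sqrt(n) scaling from the 4-city row (assumed rule)."""
    return base_mde * sqrt(base_conversions / conversions)


for cities, conversions, reported in rows:
    print(
        f"{cities:>2} cities: reported {reported:.1f}%, "
        f"square-root prediction {predicted_mde(conversions):.1f}%"
    )
```

Under that assumption every prediction falls within half a percentage point of the reported value, which illustrates the diminishing return: each added city buys a smaller improvement in MDE than the one before it.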

We decided on 10 cities at a 15% MDE. The implication of that MDE is straightforward, and we want the team to understand it before the test runs: 

If YouTube’s true incremental lift is 10%, this test cannot reliably detect it.
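To put a rough number on that, the sketch below backs out the standard error implied by a 15% MDE at 90% confidence and 80% power, then asks how often a true 10% lift would produce a significant read. It assumes a normally distributed lift estimate, and `power_at` is a hypothetical helper name.

```python
from statistics import NormalDist


def power_at(true_lift, mde, confidence=0.90, design_power=0.80):
    """Approximate chance of a significant result at a given true lift.

    true_lift and mde must be in the same units (here, percentage points).
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - (1 - confidence) / 2)
    z_beta = nd.inv_cdf(design_power)

    # Standard error of the lift estimate implied by the design MDE
    se = mde / (z_alpha + z_beta)

    # Probability the estimate clears the upper critical value; the small
    # chance of a significant result in the wrong direction is ignored
    return nd.cdf(true_lift / se - z_alpha)


for lift in (5.0, 10.0, 15.0, 20.0):
    print(f"true lift {lift:.0f}%: power {power_at(lift, 15.0):.0%}")
```

Under these assumptions a true 10% lift is detected only about half the time, so a null result at that effect size is close to a coin flip rather than evidence of no effect.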

How we set the decision threshold 

We aligned on the decision threshold up front. If the test detects ≥15% lift, the client renews YouTube budget at current levels or higher. If it returns a null result, that’s a verdict that we couldn’t prove the channel works at this spend level — different from a verdict that the channel doesn’t work. Saying that explicitly up front sets expectations for what the test can and can’t tell us, and keeps the team aligned on how to read the result. 

The big picture

Power analysis is a forcing function. It makes us align on what the test can and can’t tell us, and what we’ll do with each possible answer. The teams we’ve seen get reliable answers treat this step as the most important conversation in the project, because it sets the expectations everything else gets measured against.

Market selection and matching: where most tests quietly fail 

Power analysis tells us how many markets the test needs. The next step is choosing which ones. 

Random assignment alone isn’t enough

Random assignment can work in principle, but I don’t recommend it. It leaves more room for error than is necessary at the typical scale of a geo test. Random draws can produce imbalanced groups by chance, and the legwork to verify balance after the fact is more painful than matching upfront.

Matched market selection 

Match on pre-period KPI levels, trends, seasonality patterns, demographics, and channel mix. The goal is to find markets that move together, not markets that look identical on a single dimension. In the YouTube case above, we included explicit market scoring on trend correlation and channel mix distance to select tight pairings. 

Test and control markets don’t need to have the same level of activity, but they need to move together in the pre-period. Levels are removed by the math; trends are not. This is where market-selection diagnostics earn their keep — anomalies in pre-period parallel movement are exactly what I want to catch and remove from the candidate pool before the test starts.
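One way to turn “move together” into a number is to score candidate pairs on the correlation of their week-over-week changes, then penalize differences in channel mix. The sketch below is an illustration only: the function names, the 0.5 weight, and the market data are all hypothetical, and it is not the scoring used in the YouTube case.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def weekly_changes(series):
    """Week-over-week differences, which discard the level of the series."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def pair_score(kpi_a, kpi_b, mix_a, mix_b, mix_weight=0.5):
    """Higher is better: trend correlation minus weighted channel-mix distance."""
    trend_corr = pearson(weekly_changes(kpi_a), weekly_changes(kpi_b))
    mix_distance = sqrt(sum((p - q) ** 2 for p, q in zip(mix_a, mix_b)))
    return trend_corr - mix_weight * mix_distance


# Hypothetical pre-period weekly conversions and channel-mix shares
market_a = ([100, 110, 105, 120, 115], (0.30, 0.50, 0.20))
market_b = ([40, 45, 42, 50, 47], (0.35, 0.45, 0.20))   # smaller, same shape
market_c = ([60, 58, 63, 55, 61], (0.30, 0.50, 0.20))   # moves the other way

score_ab = pair_score(market_a[0], market_b[0], market_a[1], market_b[1])
score_ac = pair_score(market_a[0], market_c[0], market_a[1], market_c[1])
print(f"A-B: {score_ab:.2f}, A-C: {score_ac:.2f}")
```

Market B is less than half the size of market A but scores far higher than market C, because the score looks at movement rather than level.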

Diagnostic plot

Always plot pre-period treatment vs. control on the same axes. If they don’t track each other, the match is bad and the analysis will produce a number that looks like a result and isn’t one. See the example below. 

Spillover and contamination

Watch for adjacent DMAs, national channels bleeding into “holdout” markets, and brand search in geos where other channels are still on. These are the things that look like signal and aren’t.

Pre-period parallel trends: what to look for, and what to throw out

In the left panel, test and control markets sit at very different levels but rise and fall together over the pre-period. That’s the ideal outcome — DiD will be able to isolate the effect of the channel change because everything else affecting the two groups was moving together. In the right panel, test markets are climbing while control markets are falling. If we run a test on this pairing, the analysis will report a “lift” that’s really just the gap that was already opening up before the campaign ever started. That’s a candidate market pairing that should be removed before the test goes live, not flagged after the fact.
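A simple numeric companion to the plot is the slope of the weekly gap between test and control over the pre-period. A flat gap suggests parallel movement; a steadily widening gap is the right-panel problem. The sketch below uses hypothetical data and a hypothetical helper name, and what counts as “too steep” remains a judgment call relative to the lift you hope to measure.

```python
def gap_slope(test, control):
    """Least-squares slope of (test - control) against week index."""
    gap = [t - c for t, c in zip(test, control)]
    n = len(gap)
    weeks = list(range(n))
    mean_week = sum(weeks) / n
    mean_gap = sum(gap) / n
    numerator = sum((w - mean_week) * (g - mean_gap) for w, g in zip(weeks, gap))
    denominator = sum((w - mean_week) ** 2 for w in weeks)
    return numerator / denominator


# Hypothetical pre-period weekly conversions
parallel_test = [200, 210, 205, 220, 215, 225]
parallel_control = [100, 110, 105, 120, 115, 125]   # different level, same moves

diverging_test = [200, 205, 210, 215, 220, 225]     # climbing
diverging_control = [100, 98, 96, 94, 92, 90]       # falling

print("parallel pairing, gap slope per week:", gap_slope(parallel_test, parallel_control))
print("diverging pairing, gap slope per week:", gap_slope(diverging_test, diverging_control))
```

In the diverging pairing the gap grows by 7 conversions every week before any treatment begins, so a DiD read on that pairing would count the pre-existing drift as lift.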

Take the next step: building a disciplined test 

A well-designed geo test requires discipline. The teams that get confident, defensible answers frame the decision before designing the test. They run the power analysis honestly and pick markets that genuinely move together. None of that requires advanced statistical machinery. It requires the patience to do the upfront work most teams skip in their hurry to get something live.

