Before you start — what you need
- A batch of 5–10 ad variants you're evaluating — each with a clear hypothesis label
- Your current champion creative to use as the control (the variant to beat)
- An attention scoring tool (GazeIQ, Attention Insight, or equivalent)
- A spreadsheet for ranking — overall score and 5 sub-scores per variant
- 2–3 humans who are NOT on the creative team for the 1.5-second sanity test
- A phone or timer to enforce the 1.5-second viewing window in filter 3
The 4 filters at a glance
This playbook runs every candidate variant through four sequential filters. Each filter catches a specific failure mode that the previous ones miss; skipping any one lets bad variants reach paid testing and wastes spend.
Attention score threshold
Overall ≥65 AND every sub-score ≥50. Drops variants with hidden failure modes before any manual work begins.
Psychology principle check
Dominance, contrast, legibility, Gestalt grouping, gaze-terminus CTA. Catches variants that score well but violate fundamentals.
1.5-second human sanity test
Show to 2–3 humans outside the team. If fewer than 2 of 3 can say what the ad sells, the variant fails.
Predicted win probability
50% attention score + 30% principle check + 20% human comprehension, normalized. Ranks survivors objectively.
Order matters: filter 1 drops obviously weak variants in seconds, filter 2 catches principle violations the scoring model misses, filter 3 catches well-scored but incomprehensible variants, and filter 4 ranks the survivors with a reproducible formula. Pair this with our pre-test Meta ads playbook to complete the pre-launch cycle.
The 7-step selection sequence
Assemble your candidate set with clear labels
Gather every variant you want to evaluate in one place, and include your current champion as the control. Label each variant with the single hypothesis it represents ('variant B: CTA moved to upper-center,' 'variant C: headline rewritten as specific offer'). Without labels, you lose attribution: when a winner emerges, you need to know what design choice produced it. Aim for 5–10 variants per batch; fewer than 5 doesn't justify the filter work, and more than 10 hits diminishing returns. A minimal record sketch follows this step's checklist.
- Include the current champion as variant 0 — it's your control
- Label each variant with the one thing that makes it different
- 5–10 variants is the sweet spot; more than 10 adds noise without adding learning
This step is done when: Every variant is labeled with its hypothesis and the control champion is in the set.
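If you track the batch in code rather than a sheet, a minimal sketch of one labeled record might look like this. The field names are illustrative, not a required schema:

```python
from dataclasses import dataclass

@dataclass
class Variant:
    """One creative in the candidate set, labeled with its single hypothesis."""
    variant_id: str          # '0' is reserved for the control champion
    hypothesis: str          # the one thing that makes this variant different
    is_control: bool = False

batch = [
    Variant("0", "current champion", is_control=True),
    Variant("B", "CTA moved to upper-center"),
    Variant("C", "headline rewritten as specific offer"),
]
```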
Run attention scoring on every candidate (including the control)
Upload all candidates to GazeIQ (or any attention-scoring tool) and capture each variant's overall 0–100 score plus the five sub-scores: CTA visibility, headline salience, visual hierarchy, edge avoidance, and clutter penalty. This is your quantitative baseline and the input to every downstream filter. Record the scores in a spreadsheet with variants as rows and dimensions as columns; you'll reference this table through all four filters. A code sketch of the same structure follows this step's checklist.
- Use the same tool for every variant — don't mix scoring systems
- Capture all 5 sub-scores, not just the overall — filter 2 depends on the sub-scores
- Score the control champion too, so you have a baseline to beat
This step is done when: Every variant has an overall score and 5 sub-scores in a single ranking sheet.
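If you mirror the ranking sheet in code, one possible shape is a dict per variant keyed by the five sub-score names above. Field names and numbers here are illustrative; capture whatever your tool actually returns:

```python
# Rows = variants, columns = overall score plus the five sub-scores (all 0-100).
SUBSCORES = ["cta_visibility", "headline_salience", "visual_hierarchy",
             "edge_avoidance", "clutter_penalty"]

scores = {
    "0": {"overall": 71, "cta_visibility": 66, "headline_salience": 70,
          "visual_hierarchy": 68, "edge_avoidance": 62, "clutter_penalty": 64},
    "B": {"overall": 82, "cta_visibility": 74, "headline_salience": 81,
          "visual_hierarchy": 77, "edge_avoidance": 48, "clutter_penalty": 70},
}
```

Note that variant B mirrors the cautionary case discussed later: a strong 82 overall hiding a 48 edge-avoidance sub-score.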
Filter 1 — Attention score: drop anyone below threshold
Apply the first filter: drop any variant with an overall score below 65 OR any single sub-score below 50. The threshold isn't arbitrary: in our audits, variants below those numbers almost never outperform a control in live testing. The 65 floor catches generally weak creative; the 50 sub-score floor catches variants with a hidden failure mode (for example, an overall score of 72 dragged down by an edge-avoidance sub-score of 48, meaning the CTA sits in the edge safe-zone trap). A minimal version of this check in code follows the checklist below.
- Double threshold: overall ≥65 AND every sub-score ≥50
- A variant that passes this filter but has any sub-score below 60 goes on the watch list, not the shortlist
- If your control champion fails this filter, you have a bigger problem than variant selection
This step is done when: Your candidate list is narrowed to variants meeting both the overall and sub-score thresholds.
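As a rough sketch, the double threshold reduces to two boolean checks over the score table from step 2. The threshold values come straight from the text; everything else is illustrative:

```python
SUBSCORES = ["cta_visibility", "headline_salience", "visual_hierarchy",
             "edge_avoidance", "clutter_penalty"]

def passes_filter_1(row: dict) -> bool:
    """Double threshold: overall >= 65 AND every sub-score >= 50."""
    return row["overall"] >= 65 and all(row[s] >= 50 for s in SUBSCORES)

def on_watch_list(row: dict) -> bool:
    """Passed the filter, but some sub-score sits below 60: watch list, not shortlist."""
    return passes_filter_1(row) and any(row[s] < 60 for s in SUBSCORES)

# A 72 overall with a 48 edge-avoidance sub-score fails despite clearing the 65 floor:
assert not passes_filter_1({"overall": 72, "cta_visibility": 70, "headline_salience": 75,
                            "visual_hierarchy": 68, "edge_avoidance": 48, "clutter_penalty": 66})
```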
Filter 2 — Psychology principle check on survivors
Walk through each surviving variant against five psychology principles: (1) one dominant focal point (figure-ground), (2) headline contrast of at least 4.5:1 (preattentive processing), (3) mobile-first legibility at a 375px viewport (where roughly 80% of impressions land), (4) Gestalt grouping (related elements close, unrelated elements separate), and (5) CTA on a natural gaze terminus (after the product or headline, not before). A variant that scores well quantitatively but fails a principle often has hidden fragility: the saliency model flagged the hotspot, but the principle violation predicts poor downstream behavior (dwell, recall, conversion). A sketch for recording these checks follows the checklist below.
- Apply all 5 principles to each variant; any fail demotes or drops the variant
- Especially watch for CTA-before-product ordering — a common scoring-model miss
- Mobile legibility check: shrink the preview to 50% and verify the headline is still readable
This step is done when: Each surviving variant has been manually checked against all 5 principles with a pass/fail mark.
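One way to record the manual walkthrough is a pass/fail map per variant; the fraction of principles passed feeds the 30% weight in filter 4. This structure is a suggestion, not a prescribed format:

```python
PRINCIPLES = [
    "one dominant focal point (figure-ground)",
    "headline contrast >= 4.5:1 (preattentive processing)",
    "legible at a 375px mobile viewport",
    "Gestalt grouping (related close, unrelated separate)",
    "CTA on a natural gaze terminus",
]

def principle_score(checks: dict) -> float:
    """Fraction of the 5 principles passed (0.0-1.0); feeds filter 4's 30% weight."""
    return sum(checks[p] for p in PRINCIPLES) / len(PRINCIPLES)

checks_b = {p: True for p in PRINCIPLES}
checks_b["CTA on a natural gaze terminus"] = False  # CTA sits before the product
print(principle_score(checks_b))  # 0.8 -> 4 of 5 passed
```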
Filter 3 — The 1.5-second human sanity check
Show each remaining variant to 2–3 humans who are NOT on the creative team. Show it for 1.5 seconds (use a phone stopwatch), then ask one question: 'What is this ad selling?' If 2 of 3 can answer correctly, the variant passes. If they can't, the variant's attention architecture works but its message delivery doesn't, and it won't convert even if it earns the click. This is the cheapest, most-skipped filter in ad testing: it takes 5 minutes and catches the creatives that game scoring models without communicating the offer. A small helper for recording the results follows the checklist below.
- Use people outside the team — your designer and PM know what you're selling; your roommate doesn't
- 1.5 seconds is the real scroll window; 3 seconds gives you false positives
- A 2-of-3 pass rate is the minimum; a 0-of-3 or 1-of-3 result kills the variant
This step is done when: Every surviving variant has been 1.5-second-tested with 2–3 outside humans and has a pass/fail mark.
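The pass rule and the comprehension rate that filter 4 reuses are one-liners. A sketch, with the 2-of-3 rule taken straight from the text:

```python
def comprehension_rate(correct: int, testers: int = 3) -> float:
    """Fraction of outside testers who named the offer within 1.5 seconds."""
    return correct / testers

def passes_sanity_check(correct: int, testers: int = 3) -> bool:
    """At least 2 of 3 testers must answer correctly; 0-of-3 or 1-of-3 kills the variant."""
    return correct / testers >= 2 / 3

print(passes_sanity_check(2))  # True  (2-of-3: minimum pass)
print(passes_sanity_check(1))  # False (1-of-3: variant dies)
```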
Filter 4 — Compute predicted win probability
For each variant that has passed filters 1–3, compute a predicted win probability: what's the chance this variant beats the control in a live A/B test? A simple method: weight the overall attention score (50%), the psychology check score (30%: number of principles passed out of 5), and the human comprehension rate (20%: 2-of-3 = 67%). Normalize across all survivors so the probabilities sum to 100%. The variant with the highest win probability is your top pick; the next-highest is your second pick. Document the math so anyone on the team can reproduce the ranking; a worked sketch follows the checklist below.
- Weighting: 50% attention score, 30% principle check, 20% human comprehension
- Normalize win probabilities across surviving variants so they sum to 100%
- A variant with <20% predicted win probability rarely earns its test budget
This step is done when: Every surviving variant has a numeric predicted win probability, and those probabilities sum to 100% across survivors.
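Putting the three inputs together, here is a minimal sketch of the weighted composite and the normalization step. The weights come from the text; the variant data is illustrative:

```python
def predicted_win_probabilities(survivors: dict) -> dict:
    """survivors maps variant id -> (attention 0-100, principles passed of 5,
    comprehension rate 0-1). Returns normalized win probabilities in percent."""
    raw = {vid: 0.5 * (att / 100) + 0.3 * (passed / 5) + 0.2 * comp
           for vid, (att, passed, comp) in survivors.items()}
    total = sum(raw.values())
    return {vid: 100 * r / total for vid, r in raw.items()}

probs = predicted_win_probabilities({
    "B": (82, 5, 1.0),    # 82 attention, 5/5 principles, 3-of-3 comprehension
    "D": (70, 4, 2 / 3),  # 70 attention, 4/5 principles, 2-of-3 comprehension
})
print(probs)  # {'B': ~55.7, 'D': ~44.3} -> B is the primary pick
```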
Pick the top 1–2 and log the rejected variants
Promote the top 1–2 variants to paid-testing status: one primary (the highest predicted win probability), plus one challenger if your budget supports it. Archive the rest in a shared creative library with their scores, principle-check results, and human test results attached (a sketch of one log entry follows the checklist below). The log is what compounds: after 5–10 selection cycles, you'll see which sub-scores in your account best predict live CTR for your audience, and the win-probability calculation will sharpen. That compounding is what turns ad-creative testing from an expense into a capability.
- Top 1 = primary paid test; top 2 = challenger at smaller budget share
- Don't soft-launch 4 variants simultaneously — you dilute learning
- Log every rejected variant with the filter that killed it — patterns emerge over cycles
This step is done when: The top 1–2 are queued for paid testing and every rejected variant is logged with its reason for rejection.
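If the creative library lives in code or a database rather than a deck, one possible log-entry shape (entirely illustrative) captures everything the text says to archive:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SelectionLogEntry:
    """One archived variant: scores, check results, and what killed or promoted it."""
    variant_id: str
    hypothesis: str
    overall_score: int
    subscores: dict                   # the five sub-score values
    principles_passed: Optional[int]  # None if killed before filter 2
    comprehension: Optional[float]    # None if killed before filter 3
    outcome: str                      # 'promoted', 'killed_filter_1', ..., 'killed_filter_4'

entry = SelectionLogEntry("C", "headline rewritten as specific offer",
                          72, {"edge_avoidance": 48}, None, None, "killed_filter_1")
```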
Common mistakes to avoid
Picking the highest-scoring variant without sub-score review
An overall score is a composite; it can hide a low sub-score that will surface in live delivery. A variant scoring 82 overall with a 48 edge-avoidance sub-score is worse than a variant scoring 76 overall with every sub-score above 60. Always read the sub-scores.
Skipping the human sanity check because 'the scores look good'
The 1.5-second human test is the cheapest filter you have, and the one that catches the 'scored well, no one understands it' trap. Skipping it saves you five minutes on selection day but guarantees occasional expensive live-testing misses.
Running filters in the wrong order
Attention score first, principle check second, human test third, win probability fourth — in that order. Running the human test first wastes time on variants that quantitative filters would kill in 10 seconds. Running the principle check last means you test humans on variants with hidden contrast or legibility failures.
Launching all surviving variants instead of picking top 1–2
The point of filtering is to concentrate budget on the most-likely winners. If you launch 4 of 6 survivors, you're spreading signal thin and extending the time to significance. Pick top 1–2 and commit — the rest stay in the library for next cycle.
Frequently asked questions
Why not just pick the variant with the highest attention score?
Because attention scores can be manipulated by creative choices that don't translate to downstream conversion. A creative with a giant red arrow pointing at nothing will score well on hot-zone concentration but won't convert. The 4-filter framework combines quantitative scoring (filters 1 + 4) with qualitative checks (filters 2 + 3) specifically to catch variants that game the model. Every filter is cheap; skipping any of them is expensive in the long run.
How is 'predicted win probability' calculated?
A simple weighted composite: 50% from the attention score (normalized 0–1), 30% from the psychology principle check (count of principles passed out of 5, divided by 5), and 20% from human comprehension rate (fraction of human testers who correctly identified the offer). Sum and normalize across surviving variants so probabilities add to 100%. More sophisticated versions train a model on your account's historical pre-test vs live-CTR data, but the simple weighted composite works well from day one and improves as you accumulate cycles.
Can I apply this framework to video ads?
Yes. For video, treat the first frame, a mid-video frame, and the CTA frame as three separate scoring events, then take the minimum of the three (a weak first frame tanks the whole video regardless of later content). For the human sanity check, show the first 3 seconds of the video — that's the real scroll window on Reels and TikTok.
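In code, the min-of-three-frames rule is one line. A sketch, assuming each frame is scored as a static image:

```python
def video_attention_score(first_frame: float, mid_frame: float, cta_frame: float) -> float:
    """Score three frames as separate static images, then take the minimum:
    a weak first frame tanks the video regardless of later content."""
    return min(first_frame, mid_frame, cta_frame)

print(video_attention_score(58, 81, 77))  # 58 -> fails the 65 overall floor from filter 1
```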
How often should I run a selection cycle?
Frequency depends on creative volume. If you're producing 10+ variants per week, run selection weekly. If you're producing 5–10 per month, run bi-weekly. The discipline matters more than the cadence — running selection on Monday every two weeks, even with fewer candidates, outperforms running sporadic big batches.
Related how-tos
Score 5–10 variants in under a minute
Upload your variants and GazeIQ returns the overall attention score plus the five sub-scores you need for filters 1 and 4. The quantitative half of this framework runs in seconds.