Listing Experiments¶

Capability Slug: ab_testing Plan Tier: Command Category: Optimization Lab

Directional · heuristic, not p-value

Listing Experiments is serial variant rotation with a heuristic signal-strength score — not a p-value driven A/B test. AWS Marketplace doesn't expose per-visitor traffic splitting, so we rotate variants in series and rank them by delta + sample size. Every UI surface labels metrics this way; this doc does the same.

What changed (was: "A/B Testing")¶

The feature was previously documented as "A/B Testing" with statistical-confidence framing (95% z-test, p-values, etc.). The renamed feature is honest about how it actually works: variants rotate in series, and the score is a heuristic. URLs still resolve to the legacy ab-testing path; UI, controllers, and copy now use Listing Experiments / Experiment.

How it works¶

Pick versions — choose 2-4 listing-content versions you want to compare. Variants can differ in title, tagline, short description, feature highlights, pricing presentation, or CTA copy.
Configure experiment — set the primary metric, the rotation cadence, and the auto-complete thresholds.
Launch — the system swaps your live listing through Variant A → B → C → A on the configured cadence. Engagement and revenue metrics are collected per active window.
Watch the signal — the index page shows a 5-bar signal-strength indicator per experiment with a heuristic micro-badge. Strength climbs as deltas widen and sample sizes grow.
Promote winner — when an experiment auto-completes (or you stop it), the show page surfaces a Winner Spotlight with the four headline metrics and a one-click Promote Winner button.

Signal strength buckets¶

The 5-bar indicator and tooltip describe each experiment with one of three buckets:

Bucket	Threshold	What it means
Strong	Δ ≥ 10% with ≥ 100 data points	Top variant is comfortably ahead of the others; safe to promote
Moderate	Δ ≥ 5% with ≥ 50 data points	Direction is consistent but the gap is narrow; consider extending
Gathering signal	Below both thresholds	Not enough data yet — keep rotating

The signal-strength threshold you pick at creation time (80% / 90% / 95%) governs auto-completion: the experiment stops automatically when one variant's heuristic strength meets that threshold and the maximum-days cap hasn't fired first.

Configuring an experiment¶

Step 1 — Pick versions¶

The version picker lists every listing version you've created (drafts, published, archived). Select 2-4. The Diff Preview card shows the differences between Variant A and Variant B side-by-side as you select.

Step 2 — Configure experiment¶

Primary metric — tile-grid picker:

Tile	What it tracks	Available
Click-Through Rate	Clicks ÷ Impressions × 100	✅
Conversion Rate	Subscriptions ÷ Page Views × 100	✅
Free Trials Started	Trial signups attributed to each variant	✅
Revenue	MRR generated during the variant window	✅
ACE Acceptance	ACE opportunities accepted	🔜 Soon
Search Rank	Marketplace search-result position	🔜 Soon

Where the numbers come from

AWS does not share listing-page traffic or visitor analytics with partners. Vellocity derives these metrics from the data you connect (AWS co-sell / Seller Central) plus any manual or CSV input — treat click-through and conversion figures as directional signals rather than AWS-reported page analytics.

AI hypothesis suggestion — Step 2 includes an AI-generated hypothesis card that auto-fills from the variant names if you don't have one in mind.

Rotation cadence — 5-button row with a Recommended star:

Daily (high-traffic listings)
2 days
Weekly ⭐ Recommended
2 weeks
Custom

Variants rotate in order (A → B → C → A…). Each variant gets equal active time for fair comparison; the currently active variant is always visible in the experiment dashboard.

Auto-complete (optional):

Stop at signal strength: 80% / 90% / 95% (recommended)
Hard cap: maximum days, regardless of signal strength

Step 3 — Launch¶

Confirm the configuration in the Diff Preview card and click Launch. The experiment goes live on the next cadence boundary.

Reading the show page¶

The experiment show page is built around three columns:

Header — Freshness pill row. Five staleness pills track where each data input came from and how fresh it is:

Source	What it means	Color
Co-sell / Seller Central	Latest pull from your connected AWS co-sell and Seller Central data	Green/amber/red by age
CSV	Last manual CSV upload	Green/amber/red
Manual	Manual data entry	Green/amber/red
Last switch	Most recent variant rotation	Green/amber/red

Top — Winner Spotlight. Brand-purple gradient card with the winning variant's name, a 4-metric grid (CTR, Conversion, Trials, Revenue), a quoted snippet of the winning listing copy, and a Promote Winner CTA.

Left column — Hypothesis card with the experiment's hypothesis text. If no signal has accumulated yet, this column shows an "Awaiting first signal" empty state instead of the spotlight.

Right column (sticky) — Signal panel + reasons. The Signal panel shows the heuristic strength per variant. Below it, a "Why we won't call it yet" card lists auto-computed reasons — for example: "Δ < 5% on primary metric", "fewer than 50 data points", "rotation cycles incomplete".

Data sources card at the bottom of the right sidebar lists what's powering the metrics — useful for debugging when a freshness pill goes red.

Best practices¶

Change one element at a time. Title vs. tagline vs. CTA — keep the variable axis small so the winning variant's lift is interpretable.
Run through a full week. Weekend AWS Marketplace traffic patterns differ materially from weekday. A 7-day minimum captures both.
Don't chase a 95% threshold on a low-traffic listing. Set a maximum-days cap. A "Moderate" call with a directional decision beats waiting forever for "Strong."
Document the hypothesis. The AI suggestion card gives you a starting point — refine it with what you actually expect to learn.
Promote and re-test. A winning variant becomes the new baseline for the next experiment. Iterative wins compound.

When the experiment is inconclusive¶

Some experiments end without a clear winner. The "Why we won't call it yet" card on the show page tells you which check failed:

Δ below threshold on primary metric → variants are too similar; consider more dramatically different content
Sample size below threshold → low-traffic listing; extend the cap or focus on higher-traffic listings first
Rotation cycles incomplete → wait at least one full cadence cycle per variant
Mixed signal across metrics → primary metric is flat but secondaries diverge; pick a different primary metric next time

API access¶

The capability is exposed through the agent runtime as ab_testing. Partner API keys scoped with manage_listings and the agent capability slug can create, read, and complete experiments programmatically. See Partner API Capabilities.

Last Updated: 2026-04-27 Renamed from "A/B Testing" — same URL, new framing, no statistical-confidence claims.