Skip to content

Listing Experiments

Capability Slug: ab_testing Plan Tier: Command Category: Optimization Lab

Directional · heuristic, not p-value

Listing Experiments is serial variant rotation with a heuristic signal-strength score — not a p-value driven A/B test. AWS Marketplace doesn't expose per-visitor traffic splitting, so we rotate variants in series and rank them by delta + sample size. Every UI surface labels metrics this way; this doc does the same.


What changed (was: "A/B Testing")

The feature was previously documented as "A/B Testing" with statistical-confidence framing (95% z-test, p-values, etc.). The renamed feature is honest about how it actually works: variants rotate in series, and the score is a heuristic. URLs still resolve to the legacy ab-testing path; UI, controllers, and copy now use Listing Experiments / Experiment.


How it works

  1. Pick versions — choose 2-4 listing-content versions you want to compare. Variants can differ in title, tagline, short description, feature highlights, pricing presentation, or CTA copy.
  2. Configure experiment — set the primary metric, the rotation cadence, and the auto-complete thresholds.
  3. Launch — the system swaps your live listing through Variant A → B → C → A on the configured cadence. Engagement and revenue metrics are collected per active window.
  4. Watch the signal — the index page shows a 5-bar signal-strength indicator per experiment with a heuristic micro-badge. Strength climbs as deltas widen and sample sizes grow.
  5. Promote winner — when an experiment auto-completes (or you stop it), the show page surfaces a Winner Spotlight with the four headline metrics and a one-click Promote Winner button.

Signal strength buckets

The 5-bar indicator and tooltip describe each experiment with one of three buckets:

Bucket Threshold What it means
Strong Δ ≥ 10% with ≥ 100 data points Top variant is comfortably ahead of the others; safe to promote
Moderate Δ ≥ 5% with ≥ 50 data points Direction is consistent but the gap is narrow; consider extending
Gathering signal Below both thresholds Not enough data yet — keep rotating

The signal-strength threshold you pick at creation time (80% / 90% / 95%) governs auto-completion: the experiment stops automatically when one variant's heuristic strength meets that threshold and the maximum-days cap hasn't fired first.


Configuring an experiment

Step 1 — Pick versions

The version picker lists every listing version you've created (drafts, published, archived). Select 2-4. The Diff Preview card shows the differences between Variant A and Variant B side-by-side as you select.

Step 2 — Configure experiment

Primary metric — tile-grid picker:

Tile What it tracks Available
Click-Through Rate Clicks ÷ Impressions × 100
Conversion Rate Subscriptions ÷ Page Views × 100
Free Trials Started Trial signups attributed to each variant
Revenue MRR generated during the variant window
ACE Acceptance ACE opportunities accepted 🔜 Soon
Search Rank Marketplace search-result position 🔜 Soon

AI hypothesis suggestion — Step 2 includes an AI-generated hypothesis card that auto-fills from the variant names if you don't have one in mind.

Rotation cadence — 5-button row with a Recommended star:

  • Daily (high-traffic listings)
  • 2 days
  • Weekly ⭐ Recommended
  • 2 weeks
  • Custom

Variants rotate in order (A → B → C → A…). Each variant gets equal active time for fair comparison; the currently active variant is always visible in the experiment dashboard.

Auto-complete (optional):

  • Stop at signal strength: 80% / 90% / 95% (recommended)
  • Hard cap: maximum days, regardless of signal strength

Step 3 — Launch

Confirm the configuration in the Diff Preview card and click Launch. The experiment goes live on the next cadence boundary.


Reading the show page

The experiment show page is built around three columns:

Header — Freshness pill row. Five staleness pills track where each data input came from and how fresh it is:

Source What it means Color
AWS API Latest pull from AWS Marketplace metrics API Green/amber/red by age
CSV Last manual CSV upload Green/amber/red
Manual Manual data entry Green/amber/red
Last switch Most recent variant rotation Green/amber/red

Top — Winner Spotlight. Brand-purple gradient card with the winning variant's name, a 4-metric grid (CTR, Conversion, Trials, Revenue), a quoted snippet of the winning listing copy, and a Promote Winner CTA.

Left column — Hypothesis card with the experiment's hypothesis text. If no signal has accumulated yet, this column shows an "Awaiting first signal" empty state instead of the spotlight.

Right column (sticky) — Signal panel + reasons. The Signal panel shows the heuristic strength per variant. Below it, a "Why we won't call it yet" card lists auto-computed reasons — for example: "Δ < 5% on primary metric", "fewer than 50 data points", "rotation cycles incomplete".

Data sources card at the bottom of the right sidebar lists what's powering the metrics — useful for debugging when a freshness pill goes red.


Best practices

  1. Change one element at a time. Title vs. tagline vs. CTA — keep the variable axis small so the winning variant's lift is interpretable.
  2. Run through a full week. Weekend AWS Marketplace traffic patterns differ materially from weekday. A 7-day minimum captures both.
  3. Don't chase a 95% threshold on a low-traffic listing. Set a maximum-days cap. A "Moderate" call with a directional decision beats waiting forever for "Strong."
  4. Document the hypothesis. The AI suggestion card gives you a starting point — refine it with what you actually expect to learn.
  5. Promote and re-test. A winning variant becomes the new baseline for the next experiment. Iterative wins compound.

When the experiment is inconclusive

Some experiments end without a clear winner. The "Why we won't call it yet" card on the show page tells you which check failed:

  • Δ below threshold on primary metric → variants are too similar; consider more dramatically different content
  • Sample size below threshold → low-traffic listing; extend the cap or focus on higher-traffic listings first
  • Rotation cycles incomplete → wait at least one full cadence cycle per variant
  • Mixed signal across metrics → primary metric is flat but secondaries diverge; pick a different primary metric next time

API access

The capability is exposed through the agent runtime as ab_testing. Partner API keys scoped with manage_listings and the agent capability slug can create, read, and complete experiments programmatically. See Partner API Capabilities.


Last Updated: 2026-04-27 Renamed from "A/B Testing" — same URL, new framing, no statistical-confidence claims.