Listing Experiments¶
Capability Slug:
ab_testingPlan Tier: Command Category: Optimization Lab
Directional · heuristic, not p-value
Listing Experiments is serial variant rotation with a heuristic signal-strength score — not a p-value driven A/B test. AWS Marketplace doesn't expose per-visitor traffic splitting, so we rotate variants in series and rank them by delta + sample size. Every UI surface labels metrics this way; this doc does the same.
What changed (was: "A/B Testing")¶
The feature was previously documented as "A/B Testing" with statistical-confidence framing (95% z-test, p-values, etc.). The renamed feature is honest about how it actually works: variants rotate in series, and the score is a heuristic. URLs still resolve to the legacy ab-testing path; UI, controllers, and copy now use Listing Experiments / Experiment.
How it works¶
- Pick versions — choose 2-4 listing-content versions you want to compare. Variants can differ in title, tagline, short description, feature highlights, pricing presentation, or CTA copy.
- Configure experiment — set the primary metric, the rotation cadence, and the auto-complete thresholds.
- Launch — the system swaps your live listing through Variant A → B → C → A on the configured cadence. Engagement and revenue metrics are collected per active window.
- Watch the signal — the index page shows a 5-bar signal-strength indicator per experiment with a
heuristicmicro-badge. Strength climbs as deltas widen and sample sizes grow. - Promote winner — when an experiment auto-completes (or you stop it), the show page surfaces a Winner Spotlight with the four headline metrics and a one-click Promote Winner button.
Signal strength buckets¶
The 5-bar indicator and tooltip describe each experiment with one of three buckets:
| Bucket | Threshold | What it means |
|---|---|---|
| Strong | Δ ≥ 10% with ≥ 100 data points | Top variant is comfortably ahead of the others; safe to promote |
| Moderate | Δ ≥ 5% with ≥ 50 data points | Direction is consistent but the gap is narrow; consider extending |
| Gathering signal | Below both thresholds | Not enough data yet — keep rotating |
The signal-strength threshold you pick at creation time (80% / 90% / 95%) governs auto-completion: the experiment stops automatically when one variant's heuristic strength meets that threshold and the maximum-days cap hasn't fired first.
Configuring an experiment¶
Step 1 — Pick versions¶
The version picker lists every listing version you've created (drafts, published, archived). Select 2-4. The Diff Preview card shows the differences between Variant A and Variant B side-by-side as you select.
Step 2 — Configure experiment¶
Primary metric — tile-grid picker:
| Tile | What it tracks | Available |
|---|---|---|
| Click-Through Rate | Clicks ÷ Impressions × 100 | ✅ |
| Conversion Rate | Subscriptions ÷ Page Views × 100 | ✅ |
| Free Trials Started | Trial signups attributed to each variant | ✅ |
| Revenue | MRR generated during the variant window | ✅ |
| ACE Acceptance | ACE opportunities accepted | 🔜 Soon |
| Search Rank | Marketplace search-result position | 🔜 Soon |
AI hypothesis suggestion — Step 2 includes an AI-generated hypothesis card that auto-fills from the variant names if you don't have one in mind.
Rotation cadence — 5-button row with a Recommended star:
- Daily (high-traffic listings)
- 2 days
- Weekly ⭐ Recommended
- 2 weeks
- Custom
Variants rotate in order (A → B → C → A…). Each variant gets equal active time for fair comparison; the currently active variant is always visible in the experiment dashboard.
Auto-complete (optional):
- Stop at signal strength: 80% / 90% / 95% (recommended)
- Hard cap: maximum days, regardless of signal strength
Step 3 — Launch¶
Confirm the configuration in the Diff Preview card and click Launch. The experiment goes live on the next cadence boundary.
Reading the show page¶
The experiment show page is built around three columns:
Header — Freshness pill row. Five staleness pills track where each data input came from and how fresh it is:
| Source | What it means | Color |
|---|---|---|
| AWS API | Latest pull from AWS Marketplace metrics API | Green/amber/red by age |
| CSV | Last manual CSV upload | Green/amber/red |
| Manual | Manual data entry | Green/amber/red |
| Last switch | Most recent variant rotation | Green/amber/red |
Top — Winner Spotlight. Brand-purple gradient card with the winning variant's name, a 4-metric grid (CTR, Conversion, Trials, Revenue), a quoted snippet of the winning listing copy, and a Promote Winner CTA.
Left column — Hypothesis card with the experiment's hypothesis text. If no signal has accumulated yet, this column shows an "Awaiting first signal" empty state instead of the spotlight.
Right column (sticky) — Signal panel + reasons. The Signal panel shows the heuristic strength per variant. Below it, a "Why we won't call it yet" card lists auto-computed reasons — for example: "Δ < 5% on primary metric", "fewer than 50 data points", "rotation cycles incomplete".
Data sources card at the bottom of the right sidebar lists what's powering the metrics — useful for debugging when a freshness pill goes red.
Best practices¶
- Change one element at a time. Title vs. tagline vs. CTA — keep the variable axis small so the winning variant's lift is interpretable.
- Run through a full week. Weekend AWS Marketplace traffic patterns differ materially from weekday. A 7-day minimum captures both.
- Don't chase a 95% threshold on a low-traffic listing. Set a maximum-days cap. A "Moderate" call with a directional decision beats waiting forever for "Strong."
- Document the hypothesis. The AI suggestion card gives you a starting point — refine it with what you actually expect to learn.
- Promote and re-test. A winning variant becomes the new baseline for the next experiment. Iterative wins compound.
When the experiment is inconclusive¶
Some experiments end without a clear winner. The "Why we won't call it yet" card on the show page tells you which check failed:
- Δ below threshold on primary metric → variants are too similar; consider more dramatically different content
- Sample size below threshold → low-traffic listing; extend the cap or focus on higher-traffic listings first
- Rotation cycles incomplete → wait at least one full cadence cycle per variant
- Mixed signal across metrics → primary metric is flat but secondaries diverge; pick a different primary metric next time
API access¶
The capability is exposed through the agent runtime as ab_testing. Partner API keys scoped with manage_listings and the agent capability slug can create, read, and complete experiments programmatically. See Partner API Capabilities.
Last Updated: 2026-04-27 Renamed from "A/B Testing" — same URL, new framing, no statistical-confidence claims.