
AgentCore Gateway Audit: Build vs Buy Analysis

Date: 2026-02-19
Last Updated: 2026-02-21 (Funding intelligence integration: IP, data moat, and opportunity analysis for AI-enabled funding assistance)
Scope: Vellocity's custom AgentCore orchestration vs Amazon Bedrock AgentCore managed services
Goal: Identify weaknesses vs strengths, determine what to offload to Bedrock AgentCore and what to keep in-house


Table of Contents

  1. Executive Summary
  2. Current Architecture Scorecard
  3. Security Findings
  4. Scalability Findings
  5. Resilience Findings
  6. Observability Findings
  7. Failed Job Analysis
  8. Bedrock AgentCore Service Comparison
  9. Offload vs Keep Recommendations
  10. IP Impact Analysis
  11. Data Moat & Lock-In Assessment
  12. Opportunities
  13. Migration Roadmap
  14. Cost Estimation

Executive Summary

Vellocity's custom AgentCore is a capable workflow orchestration system with strong error classification and stuck-execution recovery. The initial audit identified critical security vulnerabilities (IDOR, prompt injection, unsafe handler loading), no circuit breaker for Bedrock API calls, no database transactions on step execution, and O(n³) dependency resolution that would break at scale.

Phase 0 Status: COMPLETE AND DEPLOYED. All 5 critical security vulnerabilities (SEC-001 through SEC-005), all 4 high security vulnerabilities (SEC-006 through SEC-009), and the critical scalability issue (SCALE-002) have been remediated in application code. Supporting CloudFormation templates have been hardened and deployed to prod on 2026-02-20: WAF rate limiting (prod-api-waf), Bedrock guardrail prompt attack detection (vellocity-bedrock-guardrails), AgentCore observability infrastructure (prod-observability), and a new AgentCore Gateway foundation (prod-agentcore-gateway) with DynamoDB tool registry, S3 schema storage, Lambda sync function, and EventBridge pipeline — preparing the infrastructure for Phase 2 Gateway migration.

Infrastructure Verification (2026-02-20): All 5 CloudFormation stacks healthy, all 22 CloudWatch alarms in OK state, WAF associated with API Gateway prod stage. 3 post-deployment issues found; 2 resolved and 1 deferred: (1) ~~SNS alert subscriptions not confirmed~~: RESOLVED, both confirmed; (2) ~~Tool registry not seeded~~: RESOLVED, 37 capabilities synced to DynamoDB (111 items incl. versioned snapshots) + 37 MCP schemas in S3 across 9 marketplace listings; (3) Sync pipeline untested in prod: LOW priority, deferred to Phase 2.

Amazon Bedrock AgentCore (GA Oct 2025) offers managed Runtime, Gateway, Memory, Identity, Observability, and Policy services that directly address remaining weaknesses — particularly around infrastructure resilience, tool authentication, memory management, and monitoring.

Funding Intelligence Update (2026-02-21): The AI-powered FundingApplicationWriterService (5 AWS Partner Funding programs, KB-grounded generation, dry-run reviewer evaluation, Partner Central Benefits API) is a significant IP asset operating outside the CapabilityRegistry. Integrating it as capability #38 unlocks agent-orchestrated funding workflows, Gateway discoverability, and outcome-driven approval rate intelligence — a new data moat with strong network effects.

Recommendation: Adopt a hybrid approach — offload infrastructure-layer concerns (runtime, memory, identity, observability) to Bedrock AgentCore while retaining ownership of business logic (workflow planning, brand voice, marketplace metering, GTM templates, funding intelligence).


Current Architecture Scorecard

| Category | Score (Initial → Current) | Key Issues |
|---|---|---|
| Architecture | 7/10 → 8/10 | Clean capability-based design, good separation of concerns. Gateway foundation infrastructure added. |
| Security | 4/10 → 8/10 | ~~IDOR, prompt injection, missing auth checks, unsafe handler loading~~: all 5 critical fixes applied (SEC-001–005). ~~Mass assignment, inconsistent auth, credential cache, unsafe JSON parsing~~: all 4 high fixes applied (SEC-006–009). |
| Scalability | 3/10 → 4/10 | ~~O(n³) resolution~~ replaced with Kahn's algorithm. Remaining: No concurrency limits, memory leaks, no backpressure |
| Resilience | 5/10 | Good stuck detection, but no circuit breaker, no transactions, retry storms |
| Observability | 6/10 → 7/10 | ~~No ops alerting~~: CloudWatch alarms, AgentCore log group, X-Ray sampling added. Remaining: No correlation IDs, no OTEL |
| Error Handling | 7/10 | Excellent error classification, but too many silent failures |

Security Findings

CRITICAL

  • SEC-001: IDOR in getExecutionStatus / getExecutionResults — FIXED
  • File: AgentOrchestrator.php:377-438, AgentPolicy.php, AgentController.php
  • Issue: Public methods retrieve execution data by ID with no user ownership check. Any authenticated user can access another user's workflow results, task descriptions, and generated assets.
  • Fix applied: Defense-in-depth across 3 layers:

    1. Added execute(), viewExecution(), and cancel() methods to AgentPolicy
    2. Added $userId parameter + ownership guard to getExecutionStatus(), getExecutionResults(), and cancelExecution() in AgentOrchestrator
    3. Updated AgentController to use dedicated policy methods and pass Auth::id() to orchestrator
  • SEC-002: Unsafe Capability Handler Loading — FIXED

  • File: CapabilityRegistry.php:68-77
  • Issue: getHandler() instantiates handler classes without validating against an allow-list. If an attacker can create a capability record with a malicious handler_class, this is a path to remote code execution.
  • Fix applied: Added ALLOWED_HANDLER_NAMESPACES constant restricting to App\Extensions\ContentManager\System\Services\Capabilities\. Validates namespace prefix, class existence, and CapabilityInterface implementation before instantiation. Critical-level logging on blocked attempts.
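
The three checks in the SEC-002 fix (namespace prefix, class existence, interface implementation) can be sketched as follows. This is a minimal illustration in Python rather than the project's PHP; the `app.capabilities.` prefix, the `registry` dict standing in for the autoloader, and all names here are hypothetical.

```python
from abc import ABC, abstractmethod

# Hypothetical stand-in for the PHP namespace allow-list described above.
ALLOWED_HANDLER_PREFIXES = ("app.capabilities.",)


class Capability(ABC):
    """Stand-in for CapabilityInterface."""

    @abstractmethod
    def execute(self, params: dict) -> dict: ...


class HandlerNotAllowed(Exception):
    pass


def resolve_handler(handler_path: str, registry: dict[str, type]) -> Capability:
    """Instantiate a handler only after all three checks pass."""
    # 1. Namespace prefix must be on the allow-list.
    if not handler_path.startswith(ALLOWED_HANDLER_PREFIXES):
        raise HandlerNotAllowed(f"namespace not allowed: {handler_path}")
    # 2. The class must exist (registry stands in for class lookup).
    cls = registry.get(handler_path)
    if cls is None:
        raise HandlerNotAllowed(f"unknown handler: {handler_path}")
    # 3. The class must implement the capability interface.
    if not (isinstance(cls, type) and issubclass(cls, Capability)):
        raise HandlerNotAllowed(f"not a Capability: {handler_path}")
    return cls()
```

All checks run before the constructor is called, so a rejected class never executes any of its own code.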

  • SEC-003: Prompt Injection in WorkflowPlanner — FIXED

  • File: WorkflowPlanner.php:161-265
  • Issue: User-provided $taskDescription is interpolated directly into the planning prompt with no escaping or boundary markers. Crafted input could manipulate the planner.
  • Fix applied (multi-layer):

    1. Input sanitization: strip control characters, limit to 2,000 chars
    2. XML boundary markers: <system_instructions> wraps planner prompt, <user_task> wraps user input
    3. Output validation: max 50 steps, required fields per step (capability, parameters, depends_on)
    4. CFT defense-in-depth: Bedrock Guardrails PROMPT_ATTACK filter enabled (Prod: HIGH/NONE, Enterprise: HIGH/NONE — Bedrock requires OutputStrength=NONE for PROMPT_ATTACK)
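
A minimal sketch of layers 1 to 3 of the SEC-003 fix, written in Python for illustration rather than taken from WorkflowPlanner.php. The function names are hypothetical, and escaping angle brackets in user input is an added assumption (the audit only states that boundary markers are used).

```python
import re

MAX_TASK_CHARS = 2000
MAX_STEPS = 50
REQUIRED_STEP_FIELDS = {"capability", "parameters", "depends_on"}
# Control characters except tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize_task(task: str) -> str:
    """Strip control characters, cap length, and neutralize tag characters."""
    cleaned = _CONTROL_CHARS.sub("", task)[:MAX_TASK_CHARS]
    # Assumption: escape < and > so input cannot close the <user_task> boundary.
    return cleaned.replace("<", "&lt;").replace(">", "&gt;")


def build_prompt(instructions: str, task: str) -> str:
    """Wrap planner instructions and user input in separate XML boundaries."""
    return (
        f"<system_instructions>\n{instructions}\n</system_instructions>\n"
        f"<user_task>\n{sanitize_task(task)}\n</user_task>"
    )


def validate_plan(steps: list[dict]) -> None:
    """Reject plans that are too long or have steps missing required fields."""
    if len(steps) > MAX_STEPS:
        raise ValueError(f"plan has {len(steps)} steps; maximum is {MAX_STEPS}")
    for i, step in enumerate(steps):
        missing = REQUIRED_STEP_FIELDS - step.keys()
        if missing:
            raise ValueError(f"step {i} missing fields: {sorted(missing)}")
```

The length cap here applies to the raw input before escaping, so the escaped text can be slightly longer than 2,000 characters.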
  • SEC-004: Missing Sort Order Validation — FIXED

  • File: AgentController.php:287-296
  • Issue: $sortOrder parameter is passed directly to orderBy() without validation. Only $sortBy is whitelisted.
  • Fix applied: Added if (!in_array(strtolower($sortOrder), ['asc', 'desc'])) { $sortOrder = 'desc'; } — mirrors existing $sortBy whitelist pattern.

  • SEC-005: Potential SSRF in enrichBrandVoice — FIXED

  • File: AgentController.php:940-945
  • Issue: Company website URL used without validating scheme or blocking internal addresses.
  • Fix applied: Created reusable App\Services\UrlValidator utility. Enforces HTTPS-only, blocks RFC 1918 (10.x, 172.16-31.x, 192.168.x), link-local/AWS metadata (169.254.x), carrier-grade NAT (100.64.x), loopback (127.x), and IPv6 equivalents. DNS resolution check prevents DNS rebinding. Applied at enrichBrandVoice() endpoint.
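
The address classes blocked by the SEC-005 fix can be checked with a short validator. The sketch below uses Python's standard `ipaddress` module and is not the App\Services\UrlValidator source; `is_blocked_ip`, `is_safe_url`, and the injectable `resolve` parameter are hypothetical names.

```python
import ipaddress
import socket
from urllib.parse import urlparse

# Carrier-grade NAT range, listed explicitly rather than relying on is_private.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def is_blocked_ip(ip_text: str) -> bool:
    """True for private, loopback, link-local, CGNAT, and similar ranges."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return True  # unparseable addresses are rejected
    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:10.0.0.1.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
        or (ip.version == 4 and ip in CGNAT)
    )


def is_safe_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """HTTPS only, and every resolved address must be publicly routable."""
    parts = urlparse(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    try:
        infos = resolve(parts.hostname, parts.port or 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    return bool(infos) and all(not is_blocked_ip(info[4][0]) for info in infos)
```

A check at validation time only narrows the rebinding window. To close it, the later fetch should connect to the address that was validated rather than resolving the hostname a second time.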

HIGH

  • SEC-006: Mass Assignment Risk — FIXED
  • File: AgentController.php:735-753
  • Issue: $agent->update($validated) passes full validated array without explicit field whitelisting. The Agent model's $fillable includes sensitive fields (user_id, team_id, guardrail_id, bedrock_agent_id, ai_model, settings) that should not be updatable via this endpoint.
  • Fix applied: Replaced $agent->update($validated) with $agent->update($request->only(['name', 'description', 'is_active', 'capabilities'])) — explicit field whitelist prevents mass assignment of sensitive model attributes regardless of validation rules.

  • SEC-007: Inconsistent Authorization Patterns — FIXED

  • Files: AgentPolicy.php, AgentController.php
  • Issue: Mix of policy-based auth ($this->authorize()) and inline ownership checks ($execution->user_id !== Auth::id()). AgentPolicy missing create() method and execution-level policy methods for delete, rerun, restart, archive, restore, and notification operations.
  • Fix applied (two changes; item 3 records the resulting state):

    1. Added 6 new policy methods to AgentPolicy: create(), deleteExecution(), rerunExecution(), restartExecution(), archiveExecution(), toggleNotification()
    2. Replaced all 9 inline $execution->user_id !== Auth::id() checks in AgentController with $this->authorize() calls using the appropriate policy method
    3. All 17 authorization points in AgentController now consistently use policy-based auth
  • SEC-008: Static Credential Cache Without TTL — FIXED

  • File: BedrockRuntimeService.php:24-80
  • Issue: Shared Bedrock client cached in static variable with no expiration. In queue workers (long-lived processes), stale or revoked credentials continue to be used indefinitely.
  • Fix applied:

    1. Added $sharedClientCreatedAt timestamp tracking to static cache
    2. Added CACHE_TTL_SECONDS = 900 (15-minute) TTL constant
    3. Cache validity now requires both credential hash match AND TTL not expired
    4. Added resetClientCache() static method for explicit invalidation when credentials are known to have changed
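
The cache rule in the SEC-008 fix (reuse the client only while the credential hash matches and the TTL has not expired) can be sketched as below. This is an illustrative Python version, not BedrockRuntimeService.php; the `ClientCache` class, its `factory` argument, and the injectable `clock` are hypothetical.

```python
import hashlib
import time

CACHE_TTL_SECONDS = 900  # 15 minutes, matching the fix described above


class ClientCache:
    def __init__(self, factory, clock=time.monotonic):
        self._factory = factory  # builds a client from a credentials dict
        self._clock = clock      # injectable so tests can control time
        self._client = None
        self._cred_hash = None
        self._created_at = 0.0

    @staticmethod
    def _hash(credentials: dict) -> str:
        raw = repr(sorted(credentials.items())).encode()
        return hashlib.sha256(raw).hexdigest()

    def get(self, credentials: dict):
        """Return the cached client, rebuilding on hash mismatch or TTL expiry."""
        cred_hash = self._hash(credentials)
        fresh = (self._clock() - self._created_at) < CACHE_TTL_SECONDS
        if self._client is None or cred_hash != self._cred_hash or not fresh:
            self._client = self._factory(credentials)
            self._cred_hash = cred_hash
            self._created_at = self._clock()
        return self._client

    def reset(self) -> None:
        """Explicit invalidation, for when credentials are known to have changed."""
        self._client = None
```

A monotonic clock is used so that wall-clock adjustments on a long-lived worker cannot extend the lifetime of a cached client.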
  • SEC-009: Unsafe JSON Plan Parsing — FIXED

  • File: WorkflowPlanner.php:330-455
  • Issue: Plan JSON decoded with minimal structural validation. No validation of step fields, types, or allowed keys. Deeply nested payloads could cause memory issues.
  • Fix applied (5 validations):
    1. JSON decode depth limit (10 levels) prevents stack exhaustion
    2. ALLOWED_STEP_KEYS whitelist strips arbitrary keys from step objects (capability, parameters, depends_on, name, description, step_id, condition, retry, timeout)
    3. Capability slug format validation via regex (/^[a-zA-Z][a-zA-Z0-9_]{0,99}$/)
    4. depends_on values validated as valid step indices (non-negative, within range, no self-reference)
    5. Parameters nesting depth capped at 5 levels via recursive arrayDepth() check
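
Validations 2 to 5 of the SEC-009 fix can be sketched as below, in Python for illustration. The function names are hypothetical. The decode depth limit of item 1 is omitted because Python's `json.loads` has no depth argument comparable to PHP's `json_decode`.

```python
import json
import re

ALLOWED_STEP_KEYS = {"capability", "parameters", "depends_on", "name",
                     "description", "step_id", "condition", "retry", "timeout"}
SLUG_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]{0,99}")
MAX_PARAM_DEPTH = 5


def depth(value) -> int:
    """Nesting depth: scalars are 0, an empty container is 1."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0


def validate_steps(raw_json: str) -> list[dict]:
    steps = json.loads(raw_json)
    if not isinstance(steps, list):
        raise ValueError("plan must be a list of steps")
    clean = []
    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            raise ValueError(f"step {i} is not an object")
        # Strip keys that are not on the whitelist.
        step = {k: v for k, v in step.items() if k in ALLOWED_STEP_KEYS}
        if not SLUG_RE.fullmatch(str(step.get("capability", ""))):
            raise ValueError(f"step {i}: invalid capability slug")
        deps = step.get("depends_on", [])
        if not isinstance(deps, list):
            raise ValueError(f"step {i}: depends_on must be a list")
        for dep in deps:
            is_index = isinstance(dep, int) and not isinstance(dep, bool)
            if not is_index or not 0 <= dep < len(steps) or dep == i:
                raise ValueError(f"step {i}: invalid dependency {dep!r}")
        if depth(step.get("parameters", {})) > MAX_PARAM_DEPTH:
            raise ValueError(f"step {i}: parameters nested too deeply")
        clean.append(step)
    return clean
```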

Scalability Findings

CRITICAL

  • SCALE-001: No Concurrent Execution Limits
  • File: AgentController.php:385-391
  • Issue: Every execution dispatches to queue immediately with zero rate limiting. 100 concurrent users = 100 simultaneous Bedrock API calls = AWS throttling → retry storm → cascading failure.
  • Fix: Add per-user concurrency cap (3-5), implement queue-level backpressure.
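
The per-user cap proposed for SCALE-001 amounts to a counter that is checked before dispatch and released when the execution ends. The Python sketch below keeps the counter in process memory purely for illustration; with several queue workers the count would have to live in a shared store, and the class and method names are hypothetical.

```python
import threading
from collections import defaultdict


class UserConcurrencyLimiter:
    def __init__(self, max_per_user: int = 3):
        self._max = max_per_user
        self._active = defaultdict(int)  # user_id -> running executions
        self._lock = threading.Lock()

    def try_acquire(self, user_id: str) -> bool:
        """Reserve a slot; returns False when the user is at the cap."""
        with self._lock:
            if self._active[user_id] >= self._max:
                return False
            self._active[user_id] += 1
            return True

    def release(self, user_id: str) -> None:
        """Free a slot when an execution completes, fails, or is cancelled."""
        with self._lock:
            if self._active[user_id] > 0:
                self._active[user_id] -= 1
```

A controller would call `try_acquire` before dispatching the job and reject or defer the request when it returns False, which is where backpressure starts.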

  • SCALE-002: O(n³) Dependency Resolution — FIXED

  • File: WorkflowPlanner.php:491-517
  • Issue: Execution order algorithm uses nested loops with maxIterations = count($steps)². A 50-step workflow = 312,500+ operations.
  • Fix applied: Replaced with Kahn's algorithm using SplQueue for BFS traversal and $inDegree[] array for tracking (O(V+E)). Uses associative array for O(1) lookup instead of in_array(). Preserved existing diagnostic error handling for unresolvable dependencies. Pre-validation via existing detectCircularDependencies() DFS retained.
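
For reference, Kahn's algorithm as named in the SCALE-002 fix looks like this in a minimal Python form. It is not the WorkflowPlanner.php code: `execution_order` is a hypothetical name, steps are assumed to be keyed by integer index, and every dependency is assumed to refer to a step that exists.

```python
from collections import deque


def execution_order(depends_on: dict[int, list[int]]) -> list[int]:
    """Topological order via Kahn's algorithm, O(V + E)."""
    # in_degree[s] = number of unmet dependencies of step s.
    in_degree = {step: len(deps) for step, deps in depends_on.items()}
    # dependents[d] = steps that are waiting on d.
    dependents = {step: [] for step in depends_on}
    for step, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(step)

    queue = deque(step for step, degree in in_degree.items() if degree == 0)
    order = []
    while queue:
        step = queue.popleft()
        order.append(step)
        for nxt in dependents[step]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)

    if len(order) != len(depends_on):
        unresolved = sorted(set(depends_on) - set(order))
        raise ValueError(f"unresolvable dependencies among steps {unresolved}")
    return order
```

Each step is enqueued once and each dependency edge is visited once, which is where the O(V+E) bound comes from. Any steps left unscheduled at the end are part of a cycle or depend on one.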

  • SCALE-003: Unbounded Memory in Multi-Step Workflows

  • File: AgentOrchestrator.php:237-309
  • Issue: $stepOutputs and $results arrays accumulate all capability outputs in memory. A 20-step image workflow could exceed PHP's 128MB limit.
  • Fix: Store step outputs in DB/cache, load on-demand for downstream dependencies only.

HIGH

  • SCALE-004: Oversized Response Bodies
  • File: AgentOrchestrator.php:465-484
  • Issue: getExecutionResults() returns entire step_results and generated_assets with no pagination.
  • Fix: Add pagination, or return summary with on-demand detail endpoints.

  • SCALE-005: N+1 Query Patterns

  • File: AgentController.php:43-47, 313-315
  • Issue: Agent lists load without pagination (->get() instead of ->paginate()). Execution results load full JSON blobs without field selection.
  • Fix: Add pagination, use ->select() for list endpoints.

Resilience Findings

CRITICAL

  • RES-001: No Circuit Breaker for Bedrock
  • File: BedrockRuntimeService.php
  • Issue: When Bedrock is down or throttled, every request fails independently. No consecutive-failure tracking, no circuit opening, no adaptive backoff, no fallback to alternative models.
  • Impact: Complete workflow unavailability during any Bedrock outage.
  • Fix: Implement circuit breaker with states (closed → open → half-open), track failure rate, exponential backoff on 429/503.
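
The closed → open → half-open cycle proposed for RES-001 can be sketched as a small wrapper around the Bedrock call. The Python version below is illustrative only: it counts consecutive failures rather than a failure rate, and the class name, thresholds, and injectable `clock` are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half-open"  # let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.state == "half-open" or self._failures >= self._threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self.state = "closed"
        return result
```

While the circuit is open, callers fail immediately instead of adding load to a throttled service, which is also the natural point to try a fallback model.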

  • RES-002: No Database Transactions on Step Execution

  • File: ProcessAgentWorkflowJob.php:67-128
  • Issue: Step results written to DB without transaction wrapping. Job crash leaves execution in ambiguous state with partial results.
  • Fix: Wrap step execution in DB::transaction(), implement idempotency keys.
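
The combination proposed for RES-002, a transaction around each step write plus an idempotency key, can be illustrated with SQLite from the Python standard library. The two-table schema and the `record_step` function are hypothetical and much simpler than the real agent_executions tables.

```python
import sqlite3

SCHEMA = """
CREATE TABLE executions (
    id TEXT PRIMARY KEY,
    completed_steps INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE step_results (
    execution_id TEXT NOT NULL,
    step_index INTEGER NOT NULL,
    result TEXT NOT NULL,
    PRIMARY KEY (execution_id, step_index)  -- acts as the idempotency key
);
"""


def record_step(conn: sqlite3.Connection, execution_id: str,
                step_index: int, result: str) -> bool:
    """Write a step result and bump the counter atomically.

    Returns False when the step was already recorded, so a retried job
    does not double-count it.
    """
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute(
                "INSERT INTO step_results (execution_id, step_index, result) "
                "VALUES (?, ?, ?)",
                (execution_id, step_index, result),
            )
            conn.execute(
                "UPDATE executions SET completed_steps = completed_steps + 1 "
                "WHERE id = ?",
                (execution_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Because both statements commit or roll back together, a crash between them cannot leave a stored result without the matching counter update.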

  • RES-003: Retry Storm Risk

  • File: ProcessAgentWorkflowJob.php:27-32
  • Issue: 3 retries with no exponential backoff configured. All retries fire immediately, compounding AWS rate limiting.
  • Fix: Add backoff() method returning [30, 120, 300] (escalating delays).
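
The escalating schedule proposed for RES-003 maps each retry attempt to a delay. The Python sketch below shows that mapping; the optional jitter is an added assumption that the audit does not propose, and `retry_delay` is a hypothetical name.

```python
import random

BACKOFF_SECONDS = [30, 120, 300]  # delays proposed in the fix above


def retry_delay(attempt: int, jitter: float = 0.0, rng=random.random) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    Attempts beyond the schedule reuse the last delay. `jitter` adds up to
    that fraction of the base delay so simultaneous failures do not all
    retry at the same instant.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    base = BACKOFF_SECONDS[min(attempt, len(BACKOFF_SECONDS)) - 1]
    return base * (1 + jitter * rng())
```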

HIGH

  • RES-004: Silent Service Degradation
  • Files: AgentOrchestrator.php:33-40, AgentMemoryService.php:135-143
  • Issue: Multiple services fail silently at DEBUG/WARNING log levels. Users have no visibility into whether memory, dry-run, or other optional services are functioning.
  • Fix: Add execution-level services_status field, surface degradation in UI.

STRENGTHS (Keep)

  • Stuck Execution Recovery: ExecutionHealthMonitor with three-level escalation is excellent
  • Error Classification: ExecutionErrorAnalyzer categorizes errors with user-friendly guidance
  • Self-Healing Config — Configurable thresholds at global/team/user scope
  • Health Event Audit Trail — Full audit via ExecutionHealthEvent model

Observability Findings

MEDIUM

  • OBS-001: No Correlation IDs
  • Issue: Logs lack end-to-end request/execution correlation IDs. Tracing a failure across orchestrator → planner → Bedrock → capability requires manual log correlation.
  • Fix: Generate UUID at execution start, propagate through all service calls.
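
One way to propagate a correlation ID without threading it through every function signature is a context variable that a log filter reads. The Python sketch below illustrates the idea with the standard `logging` and `contextvars` modules; the logger names and helper functions are hypothetical.

```python
import contextvars
import logging
import uuid

# Holds the ID of the execution currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def start_execution() -> str:
    """Generate a UUID at execution start and make it visible to all loggers."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def make_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(name)s %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With this in place, every line logged by the planner, the orchestrator, or the Bedrock client during one execution carries the same ID and can be found with a single search.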

  • OBS-002: Incomplete Alerting — PARTIALLY ADDRESSED

  • Issue: ExecutionHealthMonitor sends in-app notifications but has no integration with CloudWatch, PagerDuty, Slack, or other ops monitoring.
  • Fix applied (infrastructure layer): Added CloudWatch metric filters for agent_execution_failed and Bedrock throttles. Added alarms: AgentFailureRateAlarm (>10 failures/5-min) and BedrockThrottleAlarm (>20 throttles/5-min) with SNS integration. Agent endpoint WAF rate limiting alarm added.
  • Remaining: Application-level CloudWatch metric emission, Slack/PagerDuty integration.

  • OBS-003: No Distributed Tracing — PARTIALLY ADDRESSED

  • Issue: No OpenTelemetry or X-Ray integration. Cannot trace latency across Bedrock API calls.
  • Fix applied (infrastructure layer): Added X-Ray sampling rule for AgentCore executions at 100% (Priority 50) for */agent* paths. Dedicated log group /vell/{env}/agentcore with 365-day retention. Lambda sync function has TracingConfig: Active.
  • Remaining: Application-level OTEL instrumentation in AgentOrchestrator and BedrockRuntimeService.

STRENGTHS (Keep)

  • Analytics Service: AgentAnalyticsService with P50/P95/P99 percentiles and health scoring
  • Detailed Logging — 17+ structured log calls in AgentOrchestrator alone
  • Usage Tracking — Token/credit metering per step for billing

Failed Job Analysis

Current Infrastructure

Your application has failure tracking infrastructure in place:

| Component | Status | Details |
|---|---|---|
| failed_jobs table | Exists | Standard Laravel DLQ; captures job payload + exception trace |
| agent_executions.status | Exists | Tracks: pending, planning, executing, completed, failed, cancelled, archived |
| agent_executions.error_message | Exists | Stores failure details |
| Self-healing fields | Exists | retry_count, max_retries, retry_history (JSON), is_stuck, health_status |
| execution_health_events | Exists | Full audit trail: stuck_detected, auto_restart, recovery, rate_limited, max_retries_reached |
| MonitorExecutionHealth command | Runs every minute | Detects stuck, auto-restarts, rate-limits |
| MonitorAgentWorker command | Available | Detects orphaned/stale jobs, can rescue |

What's Missing

| Gap | Impact | Priority |
|---|---|---|
| No admin dashboard for failed_jobs | Failures are visible only in the DB and logs, not in any admin UI | HIGH |
| No failure rate SLA tracking | No alerting when failure rate exceeds threshold | HIGH |
| No Slack/PagerDuty alerting | Admin notifications are in-app only | MEDIUM |
| No failed_jobs API endpoint | Cannot query/retry failed jobs from UI | MEDIUM |
| No poison pill detection | Repeated failures from same input not detected | LOW |
| No DLQ depth monitoring | Queue backlog invisible until users complain | MEDIUM |

Job Retry Configuration

ProcessAgentWorkflowJob:
  tries: 3           (max attempts)
  timeout: 600       (10 minutes per attempt)
  backoff: none      (immediate retry — PROBLEM)

Self-Healing:
  stuck_warning: 5 min
  stuck_critical: 30 min
  auto_restart: 60 min
  max_auto_retries: 3
  exponential_backoff: true (for auto-restarts only)
  max_executions_per_hour: 100
  max_executions_per_day: 1000

Bedrock AgentCore Service Comparison

Service-by-Service Mapping

| Bedrock AgentCore Service | Your Current Implementation | Gap Analysis |
|---|---|---|
| Runtime (serverless agent hosting, microVM isolation, 8hr execution windows) | ProcessAgentWorkflowJob on Laravel queue + Redis | Your queue has no session isolation, 10min timeout, no microVM sandboxing |
| Gateway (MCP tool registry, API→MCP transform, semantic tool selection, auth management) | CapabilityRegistry + inline handler loading + DynamoDB tool registry + S3 MCP schemas (Phase 0C) | ~~Unsafe handler loading~~ fixed (SEC-002). Gateway foundation deployed: DynamoDB registry, S3 MCP schemas, Lambda sync, EventBridge pipeline. Remaining: No semantic discovery, no auth management, no live Gateway registration (Phase 2). Gap: FundingApplicationWriterService operates as a standalone service outside CapabilityRegistry: not registered in DynamoDB tool registry, not discoverable via Gateway, not available to agent workflows. |
| Memory (short-term + long-term + episodic memory) | AgentMemoryService wrapping Bedrock agent memory | Already using Bedrock memory partially; local fallback is basic session tracking |
| Identity (agent identity, OAuth flows, credential management) | Mix of Passport, Cognito, ~~inline API key checks~~ standardized policy (SEC-007) | Fragmented; no unified agent identity; ~~static credential cache~~ fixed with TTL (SEC-008) |
| Observability (CloudWatch dashboards, OTEL, latency/error/token metrics) | AgentAnalyticsService + ExecutionHealthMonitor + custom logging + CloudWatch alarms/metrics (Phase 0B) | Good analytics. ~~No CloudWatch integration~~: metric filters, alarms, X-Ray sampling added. Remaining: No OTEL instrumentation in application code, no distributed tracing end-to-end |
| Policy (Cedar-based tool-call interception, natural language policy rules) | AgentPolicy + inline auth checks | Minimal; no tool-call-level policy enforcement |
| Evaluations (13 built-in evaluators, custom scoring, continuous monitoring) | None | No quality evaluation system at all |
| Browser / Code Interpreter (secure browser runtime, code execution sandbox) | None | Not applicable to current use case |

Bedrock AgentCore Pricing Reference

| Service | Metric | Price |
|---|---|---|
| Runtime | vCPU-hour | $0.0895 |
| Runtime | GB-hour (memory) | $0.00945 |
| Gateway | Per 1,000 tool invocations | $0.005 |
| Gateway | Per 1,000 search queries | $0.025 |
| Gateway | Per 100 tools indexed/month | $0.02 |
| Memory (short-term) | Per 1,000 events | $0.25 |
| Memory (long-term) | Per 1,000 memories stored | $0.75 |
| Memory (retrieval) | Per 1,000 retrievals | $0.50 |
| Identity | Per 1,000 requests | $0.01 (free via Runtime/Gateway) |
| Policy | Per 1,000 auth requests | ~$0.025 |

Offload vs Keep Recommendations

OFFLOAD to Bedrock AgentCore

| Component | Why Offload | Bedrock Service | Resolves Issues |
|---|---|---|---|
| Agent Runtime / Execution | Your queue has no isolation, 10min timeout, no backpressure. Bedrock provides microVM isolation, 8hr windows, auto-scaling, and consumption-based billing (only charged for active CPU). | AgentCore Runtime | RES-002, RES-003, SCALE-001 |
| Tool Registry & Authentication | ~~CapabilityRegistry has unsafe handler loading (SEC-002)~~: fixed. ~~Static credential cache (SEC-008)~~: fixed with TTL. ~~Inconsistent auth (SEC-007)~~: fixed with standardized policy. DynamoDB tool registry + S3 MCP schemas deployed (Phase 0C). Gateway provides managed tool registry with semantic discovery, inbound/outbound auth, and 1-click connectors (Salesforce, Slack, Jira, etc). | AgentCore Gateway | ~~SEC-002~~, ~~SEC-008~~, ~~SEC-007~~ |
| Memory Management | AgentMemoryService already wraps Bedrock memory but with basic local fallback. Native AgentCore Memory adds episodic memory, managed short-term/long-term storage, and eliminates the need for your fallback code. | AgentCore Memory | RES-004 (memory degradation) |
| Observability & Monitoring | No OTEL instrumentation in application code and no end-to-end distributed tracing; CloudWatch alarms and X-Ray sampling exist only at the infrastructure layer (OBS-002 and OBS-003 partially addressed). AgentCore Observability provides turnkey dashboards, OTEL compatibility, and integration with Datadog/Dynatrace/LangSmith. | AgentCore Observability | OBS-001, OBS-002, OBS-003 |
| Agent Identity & Auth | Fragmented auth (Passport + Cognito + ~~inline checks~~ standardized policy). ~~Static credential cache (SEC-008)~~: fixed with TTL. AgentCore Identity provides unified agent identity with OAuth flows, token management, and multi-tenant support. | AgentCore Identity | ~~SEC-007~~, ~~SEC-008~~ |
| Tool-Call Policy Enforcement | No tool-call-level authorization. AgentCore Policy intercepts every tool call in real-time with Cedar policies written in natural language. | AgentCore Policy | SEC-007 |

KEEP In-House (Competitive Advantages)

| Component | Why Keep | Current Quality | Notes |
|---|---|---|---|
| WorkflowPlanner (Claude-powered planning) | Core IP: your GTM workflow templates, brand-aware planning prompts, and capability-aware step generation are unique. Bedrock AgentCore has no equivalent planning service. | Good (~~SEC-003, SCALE-002~~ fixed) | Prompt injection fixed (XML boundaries + input sanitization), O(n³) sort replaced with Kahn's algorithm. Planning logic retained. |
| BrandVoiceContextBuilder | Product differentiator: enriches every workflow with company tone, audience, industry context. No managed equivalent exists. | Excellent | Keep and enhance |
| ExecutionErrorAnalyzer | User-facing error classification with actionable recommendations. AgentCore Observability doesn't provide this UX layer. | Excellent | Keep as UI layer on top of AgentCore Observability |
| ExecutionHealthMonitor (self-healing) | Sophisticated stuck detection with three-level escalation. While AgentCore Runtime handles infrastructure-level health, your business-logic-level health monitoring (stuck workflows, rate limiting) is valuable. | Excellent | Adapt to monitor AgentCore Runtime sessions instead of queue jobs |
| SelfHealingConfig (per-user thresholds) | Unique multi-scope configuration (global/team/user). No managed equivalent. | Good | Keep for business-level config |
| AgentAnalyticsService | Business-level metrics (success rate, P50/P95/P99, cost analysis, capability breakdown, health scoring). AgentCore Observability provides infrastructure metrics but not GTM-specific analytics. | Good | Keep, feed data from AgentCore Observability |
| FundingApplicationWriter (5 AWS programs) | Core IP: program-specific AI generation with KB grounding from official AWS docs, dry-run evaluation with reviewer personas, Partner Central Benefits API integration. No managed equivalent. Only standalone AI funding assistant for AWS Partner Funding programs. | Excellent | Keep; register as AgentCore capability (currently standalone, not in CapabilityRegistry). Wire outcome tracking for approval rate intelligence. |
| GTM Workflow Templates (10 pre-built) | Core product value: co-sell partner workflows, marketplace optimization, content generation pipelines. | Good | Keep and expand |
| Marketplace Metering | AWS Marketplace billing integration (credit/token accounting per step). Specific to your ISV business model. | Good | Keep; wire into AgentCore Runtime session metrics |
| Content Tag System | Content organization and tracking for GTM workflows. No managed equivalent. | Good | Keep |

IP Impact Analysis

What You Own vs What You Rent

Offloading to Bedrock AgentCore shifts ownership of infrastructure but does NOT transfer your business logic IP. The key question is: which components contain defensible IP, and which are commodity infrastructure you're maintaining at cost?

IP Value Map

| Component | Lines of Custom Logic | Replication Effort | IP Value | Migration Risk |
|---|---|---|---|---|
| Marketplace SEO Score (MSS) | 1,070 | 6-8 weeks | VERY HIGH | NONE (stays in-house) |
| Capability Registry (37 registered + 5 unregistered capabilities) | 736 | 4-6 weeks | VERY HIGH | ~~MEDIUM~~ LOW. Handler namespace allowlist (SEC-002) + DynamoDB local registry (IP-002/003) ensure handler IP stays in-house. Gateway hosts only schemas/descriptions. |
| FundingApplicationWriter (5 AWS programs, KB-grounded generation, dry-run evaluation) | ~800+ | 4-6 weeks | VERY HIGH | NONE (stays in-house) |
| Co-Sell Matching (ICP overlap, partner intelligence) | 584 | 3-4 weeks | HIGH | NONE (stays in-house) |
| WorkflowPlanner (dependency graph, retry heuristics) | 572 | 3-4 weeks | VERY HIGH | NONE (stays in-house) |
| Deal Influence Tracking (multi-touch attribution) | 400+ | 4-5 weeks | VERY HIGH | NONE (stays in-house) |
| BrandVoiceContextBuilder | 349 | 2-3 weeks | HIGH | NONE (stays in-house) |
| AgentAnalyticsService (P50/P95/P99, health scoring) | 770 | 3 weeks | MEDIUM | LOW (sits on top of AgentCore Observability) |
| Co-Sell Analytics | 453 | 2 weeks | MEDIUM | NONE (stays in-house) |
| Marketplace Metering | 229 | 2-3 days | LOW | NONE (commodity wrapper) |

Total custom business logic: ~5,963+ lines across 10 components

Full system replication effort: 6-12 months (including integration testing, data migration, domain expertise)

What Makes Your IP Defensible

1. Marketplace SEO Score (MSS) Algorithm: patent-worthy

  • 3-component scoring: 40% Listing Quality, 25% Backlink Authority, 35% AI Visibility
  • AI Visibility scoring (tracking LLM mentions of listings) is genuinely novel
  • Bedrock fallback proxy when DataForSEO API is unavailable (clever engineering)
  • Category benchmark medians improve with every customer scored (network effect)

2. WorkflowPlanner Intelligence

  • Temperature reduction on retry (0.3 → 0.1) to produce more deterministic replans
  • Context truncation strategy (2KB limit) prevents prompt bloat while preserving semantics
  • Capability-aware hints: detects available capabilities and adjusts planning prompts dynamically
  • Learns from execution errors to improve subsequent planning

3. AWS Partner Network-Specific Capabilities

  • ACE Opportunity Sync (auto-generates pre-filled ACE briefs)
  • CPPO Proposal Generator (pricing proposals for AWS Marketplace)
  • AWS Clean Rooms integration (privacy-preserving partner overlap analysis)
  • Partner Intelligence scoring (relationship strength + warm intro paths)
  • These require AWS Partner Network access; competitors can't replicate without partnership agreements

4. AI-Powered Funding Application Intelligence

  • Program-specific prompt engineering for 5 AWS Partner Funding programs (Innovation Sandbox, POC, ISV WMP, MDF, MAP)
  • Knowledge Base grounding from official AWS documentation: responses cite real program requirements, not generic AI output
  • Dry-run evaluation with AI personas simulating AWS funding reviewers (funding_reviewer, funding_technical_reviewer)
  • Company profile enrichment pre-fills applications with existing brand/product data
  • Partner Central Benefits API integration creates a submission-to-outcome feedback loop (coming soon)
  • Approval rate data accumulates: more submissions → better program-specific guidance → higher success rates (network effect)

5. Deal Influence Attribution

  • 6 correlation input types: UTM, private offers, metering, CRM stages, email/calendar, content engagement
  • Multi-touch attribution modeling (first-touch vs last-touch)
  • Content-to-conversion lag time tracking (improves with more data)
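
The 40/25/35 weighting listed for the MSS score can be shown as a plain weighted sum. The audit states only the weights, so the linear combination, the 0-100 scale for each component, and the `mss_score` function below are illustrative assumptions, not the production algorithm.

```python
# Weights as listed for the MSS algorithm above.
MSS_WEIGHTS = {
    "listing_quality": 0.40,
    "backlink_authority": 0.25,
    "ai_visibility": 0.35,
}


def mss_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    if set(components) != set(MSS_WEIGHTS):
        raise ValueError(f"expected components {sorted(MSS_WEIGHTS)}")
    for name, value in components.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    total = sum(MSS_WEIGHTS[name] * components[name] for name in MSS_WEIGHTS)
    return round(total, 2)
```

Under these assumptions, component scores of 80, 60, and 50 combine as 0.40 × 80 + 0.25 × 60 + 0.35 × 50 = 64.5.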

IP Risk from Migration

| Migration Phase | IP Risk | Mitigation |
|---|---|---|
| Phase 1: Observability | ZERO | Only adds monitoring layer |
| Phase 2: Gateway | ~~MEDIUM~~ LOW | Your 37 capability handlers become Gateway tools. Handler logic stays yours. Tool metadata mitigated: DynamoDB local registry is source of truth (IP-002), MCP schemas versioned in S3 (IP-003), handler namespace allowlist prevents unauthorized loading (SEC-002). |
| Phase 3: Memory | LOW | Memory content moves to AWS-managed storage. Session metadata stays in your DB. You lose direct access to raw memory vectors. |
| Phase 4: Runtime | LOW | Your orchestrator code runs on AgentCore compute but remains your code. Similar to deploying on EC2: AWS runs it, you own it. |
| Phase 5: Identity | LOW | Auth config moves to AgentCore Identity. Credential mapping is operational, not IP. |

Key IP Protection Actions

  • IP-001: Document MSS algorithm separately — consider provisional patent filing
  • IP-002: Keep capability handler source code in your repo (Gateway only hosts tool schemas/descriptions) — ADDRESSED: Handler source stays in App\Extensions\ContentManager\System\Services\Capabilities\. DynamoDB registry + S3 schemas store only metadata/descriptions, never handler code. ALLOWED_HANDLER_NAMESPACES constant enforces this boundary.
  • IP-003: Maintain local copies of all tool metadata registered with Gateway — ADDRESSED: DynamoDB vell-{env}-gateway-tool-registry is the local source of truth. php artisan agentcore:sync-tools seeds from application DB. Versioned snapshots in DynamoDB + versioned S3 schemas provide full history. Lambda sync is event-driven, not Gateway-dependent.
  • IP-004: Export Bedrock Agent Memory sessions periodically to your own S3 bucket
  • IP-005: Ensure all AgentCore Policy rules are version-controlled in your repo, not only in AWS console

Data Moat & Lock-In Assessment

Data That Accumulates Value Over Time

| Data Category | Stickiness | Portability | Network Effect | Flywheel |
|---|---|---|---|---|
| Marketplace SEO benchmark data | VERY HIGH | Low (AWS-specific) | VERY STRONG | More listings → better category medians → better recommendations |
| Deal influence correlation data | VERY HIGH | Medium | STRONG | More customers → better attribution models → more predictive |
| Execution retry/health patterns | VERY HIGH | High (your DB) | MEDIUM | More executions → better self-healing → fewer failures |
| Keyword gap intelligence | VERY HIGH | Low (AWS-specific) | VERY STRONG | More listings → better competitive landscape maps |
| Funding application outcomes | VERY HIGH | Medium (your DB) | VERY STRONG | More submissions → approval/rejection patterns → better program-specific guidance → higher success rates |
| Funding program intelligence | HIGH | Low (AWS-specific) | STRONG | KB-grounded program requirements + reviewer persona feedback accumulate institutional knowledge of what gets approved |
| Brand voice profiles | HIGH | High (JSON export) | MEDIUM | Every execution refines "what context works" |
| Knowledge base content | HIGH | Medium (S3 docs portable, embeddings not) | HIGH | Quality improves with document volume |
| Compliance history | VERY HIGH | Medium | HIGH | Full audit trail creates switching cost |
| Partner matching patterns | HIGH | Medium | STRONG | More partners matched → better success predictors |
| Agent memory (Bedrock) | MEDIUM | Low (Bedrock-specific) | HIGH | Conversation history compounds |
| Team configuration | MEDIUM | High (portable) | LOW | Organizational inertia |

What You LOSE Control Of With AgentCore Migration

Fully Lost (Bedrock-hosted, no direct access):

  1. Active Bedrock Agent Memory content (session summaries, semantic facts)
  2. Fine-tuned model weights (bedrock_model_arn is AWS-hosted)
  3. Knowledge Base embeddings (Bedrock-specific vectors)

Stays in Your Database (Portable):

  1. All execution history, retry patterns, health events
  2. Brand voice profiles, GTM goals, personas
  3. Compliance reports and rule history
  4. Marketplace metrics, SEO scores, keyword gaps
  5. Agent definitions, capability configurations
  6. Memory session metadata (just not the memory content itself)
  7. All integration credentials (encrypted)
  8. Deal influence correlation data
  9. Team/org configuration

Net Impact: You retain ~90% of your data moat. The 10% you lose (active memory, embeddings) is operational state, not strategic data.

Customer Switching Costs Created

| Feature | What Customer Loses by Leaving | Lock-In Strength |
| --- | --- | --- |
| Marketplace SEO Score trends | Historical score trajectory, benchmark comparisons | HIGH |
| Deal influence models | Years of content-to-conversion correlation data | VERY HIGH |
| Compliance audit trail | Full validation history, rule evolution | HIGH |
| Brand voice configuration | Tuned personas, GTM positioning, competitive differentiators | MEDIUM |
| Workflow execution history | What worked, what failed, optimization patterns | MEDIUM |
| Knowledge bases | Curated document corpus with per-capability tuning | MEDIUM |
| Partner matching history | Relationship strength scores, ICP overlap data | MEDIUM |
| Funding application history | Approval/rejection patterns, program-specific guidance, reviewer feedback, reusable application templates | HIGH |

Opportunities

Opportunity 1: Marketplace Intelligence as a Standalone Product

Your MSS algorithm, keyword gap analysis, and competitive benchmarking could be offered as a standalone analytics product for AWS Marketplace ISVs — even those not using your full GTM platform.

  • Market size: 3,000+ ISV listings on AWS Marketplace
  • Moat: Benchmark data improves with every customer (network effect)
  • Revenue model: Tiered pricing by listing count
  • AgentCore relevance: Gateway enables this as a standalone MCP tool that other agents can invoke

Opportunity 2: AgentCore Gateway as Distribution Channel

By registering your 37 capabilities as Gateway tools with semantic discovery, your capabilities become discoverable by any agent connected to Gateway — not just your own UI.

  • Your co-sell matching, SEO scoring, content generation, and funding application generation become tools other frameworks (CrewAI, LangGraph, LlamaIndex) can invoke
  • Gateway's 1-click connectors (Salesforce, Slack, Jira) replace your custom integration code
  • This shifts your business model from "app you log into" to "capabilities any agent can call"
  • Funding-specific opportunity: A partner's agent discovers your funding_application_writer tool via Gateway, generates a joint POC funding application combining both partners' data, and submits via Partner Central Benefits API — fully automated, agent-to-agent co-funding
  • Phase 0C progress: DynamoDB tool registry seeded (37 capabilities, 9 listings), 37 MCP schemas in S3, EventBridge sync pipeline deployed, agentcore:sync-tools runs automatically on every deploy. Phase 2 will register tools with live Gateway. Action needed: Register FundingApplicationWriterService as capability #38 before Phase 2 Gateway registration.

Opportunity 3: A2A Protocol for Multi-Agent GTM

AgentCore Runtime supports the Agent-to-Agent (A2A) protocol. Your specialized agents (SEO analyzer, co-sell matcher, content generator) could communicate with:

  • Customer's own internal agents
  • Partner agents (ISV-to-ISV collaboration)
  • AWS first-party agents (Marketplace listing optimizer)

This enables agent-mediated co-sell — a partner's agent negotiates joint GTM campaigns with your agent automatically.

Opportunity 4: AgentCore Evaluations for Quality Differentiation

You currently have zero quality evaluation for agent outputs. AgentCore Evaluations provides 13 built-in evaluators. Adding quality scoring to every workflow creates:

  • Customer-visible quality grades (trust signal)
  • Continuous quality monitoring (catch regressions)
  • A/B testing of prompt strategies
  • Data for fine-tuning (reward signal)

High-value use case (funding application quality scoring): Your FundingApplicationWriter already has dry-run evaluation with funding_reviewer and funding_technical_reviewer personas. AgentCore Evaluations could formalize this into continuous scoring: grade applications before submission, track how scores correlate with actual approval outcomes, and use approval data as a reward signal to improve generation quality over time. This is a natural fit because you already have the persona simulation infrastructure; Evaluations adds the scoring framework and regression detection.

Opportunity 5: Convert Self-Healing Into a Feature

Your ExecutionHealthMonitor + SelfHealingConfig + ExecutionHealthEvent stack is genuinely sophisticated. Most SaaS apps don't expose this.

  • Surface health dashboards to customers ("Your agents are 94% healthy")
  • Let customers tune their own self-healing thresholds
  • Create "reliability SLAs" as a premium tier feature
  • Market as "Enterprise-grade agent reliability" differentiator

Opportunity 7: AI-Powered Funding Intelligence Platform

Your FundingApplicationWriterService already generates applications for 5 AWS Partner Funding programs with KB grounding and dry-run evaluation. This is a significant capability that's not yet integrated into the AgentCore orchestration system — it exists as a standalone service outside the CapabilityRegistry.

Immediate integration gap: The funding writer is not registered as one of the 37 CapabilityRegistry capabilities, meaning it's not in the DynamoDB tool registry, not discoverable via Gateway, and not available for agent-orchestrated workflows. Registering it unlocks:

  • Agent-orchestrated funding workflows: WorkflowPlanner can chain funding applications with co-sell matching, deal intelligence, and marketplace optimization into multi-step GTM sequences (e.g., "identify best co-sell partner → generate joint POC funding application → create co-branded content")
  • Outcome-driven intelligence: Once Partner Central Benefits API submission goes live, track approval/rejection outcomes per program. This creates a feedback loop: more submissions → pattern recognition on what gets approved → better program-specific guidance → higher success rates. This data is a defensible moat.
  • Standalone product potential: Similar to Opportunity 1 (Marketplace Intelligence), funding application assistance could be offered standalone to the 3,000+ AWS Marketplace ISVs who need help navigating AWS Partner Funding programs but don't use your full platform
  • Revenue model: Tiered by program complexity — free Innovation Sandbox applications (lead gen), paid for POC/MAP/MDF applications with outcome tracking
  • AgentCore relevance: Register as Gateway tool so other agents (partner agents, AWS first-party agents) can request funding applications via A2A protocol — enables agent-mediated joint funding applications between ISV partners

Action items:

- [ ] Register funding_application_writer as a CapabilityRegistry capability (adds to DynamoDB tool registry + S3 MCP schema)
- [ ] Complete Partner Central Benefits API direct submission (currently "coming soon")
- [ ] Add funding_applications table to track submissions, outcomes, and program-specific patterns
- [ ] Build approval rate analytics that feed back into generation prompts (reward signal)
- [ ] Add funding_reviewer dry-run personas to AgentCore Evaluations (Opportunity 4) when adopted

Opportunity 6: Patent Filing

Four components are novel enough for provisional patent applications:

  1. MSS Algorithm — 3-component marketplace SEO scoring with AI Visibility tracking and Bedrock fallback proxy
  2. Self-Healing Workflow Orchestration — Three-tier stuck detection with exponential backoff auto-restart and per-scope configuration
  3. Capability-Aware Workflow Planning — Dynamic hint injection based on available capability set with temperature-reducing retry strategy
  4. AI-Powered Funding Application Intelligence — Program-specific KB-grounded application generation with simulated reviewer evaluation and outcome-driven feedback loop

Migration Roadmap

Phase 0: Critical Fixes + Gateway Foundation (Week 1-3) — COMPLETE & DEPLOYED

Done regardless of migration decision. Prepares infrastructure for Phase 2 Gateway migration. Deployed: 2026-02-20 — All 4 stacks healthy in us-east-1 (account 253265132499).

Phase 0A: Application Security Fixes — COMPLETE

Critical (SEC-001–005, plus SCALE-002):

- [x] SEC-004: Sort order validation whitelist in AgentController
- [x] SEC-001: IDOR ownership checks (defense-in-depth across AgentPolicy, AgentOrchestrator, AgentController)
- [x] SEC-002: Handler namespace allowlist + CapabilityInterface validation in CapabilityRegistry
- [x] SEC-005: SSRF protection via App\Services\UrlValidator (RFC 1918, metadata endpoint, loopback blocking)
- [x] SEC-003: Prompt injection boundary markers + input sanitization + output validation in WorkflowPlanner
- [x] SCALE-002: Kahn's algorithm topological sort replacing O(n³) dependency resolution in WorkflowPlanner
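The SCALE-002 fix can be illustrated with a minimal sketch of Kahn's algorithm. This is not the WorkflowPlanner code (which is PHP); the function name and the input shape, a mapping from step index to the indices it depends on, are assumptions made for illustration.

```python
from collections import deque


def topological_order(dependencies: dict[int, list[int]]) -> list[int]:
    """Order workflow steps so every step follows its dependencies.

    `dependencies` maps a step index to the step indices it depends on
    (a hypothetical shape chosen for this sketch).
    Raises ValueError on unknown dependencies or cycles.
    """
    # Count unmet dependencies per step and build reverse edges.
    indegree = {step: len(deps) for step, deps in dependencies.items()}
    dependents: dict[int, list[int]] = {step: [] for step in dependencies}
    for step, deps in dependencies.items():
        for dep in deps:
            if dep not in dependencies:
                raise ValueError(f"step {step} depends on unknown step {dep}")
            dependents[dep].append(step)

    # Start with the steps that have no dependencies at all.
    ready = deque(sorted(s for s, n in indegree.items() if n == 0))
    order: list[int] = []
    while ready:
        step = ready.popleft()
        order.append(step)
        # Completing this step releases the steps waiting on it.
        for nxt in dependents[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Any step never released is part of a cycle.
    if len(order) != len(dependencies):
        raise ValueError("dependency cycle detected")
    return order
```

Each step and each dependency edge is visited once, so the work grows as O(steps + edges) rather than cubically, and a cycle surfaces as an explicit error instead of a stalled resolution loop.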

High (SEC-006–009):

- [x] SEC-006: Explicit field whitelist ($request->only()) replacing $validated mass assignment in AgentController::update()
- [x] SEC-007: Standardized policy-based auth: 6 new AgentPolicy methods, 9 inline checks replaced with $this->authorize() in AgentController
- [x] SEC-008: TTL-based credential cache (15-min expiry) + resetClientCache() method in BedrockRuntimeService
- [x] SEC-009: JSON schema validation in WorkflowPlanner::extractJsonFromResponse(): decode depth limit, step key whitelist, capability slug regex, dependency index validation, parameter depth cap
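The idea behind SEC-008 (a cached client or credential that expires after 15 minutes and can be reset on demand) can be sketched as follows. The class and parameter names are hypothetical and this is not the BedrockRuntimeService implementation; the injectable clock exists only to make the expiry testable.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TtlCache(Generic[T]):
    """Cache one value and reload it once the TTL has elapsed."""

    def __init__(
        self,
        loader: Callable[[], T],
        ttl_seconds: float = 15 * 60,  # 15-minute expiry, as in SEC-008
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[T] = None
        self._expires_at = 0.0
        self._loaded = False

    def get(self) -> T:
        now = self._clock()
        # Reload on first use, after expiry, or after an explicit reset.
        if not self._loaded or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
            self._loaded = True
        return self._value  # type: ignore[return-value]

    def reset(self) -> None:
        """Force the next get() to reload (analogous to resetClientCache())."""
        self._loaded = False
```

A monotonic clock is used so that wall-clock adjustments cannot extend the lifetime of a cached credential.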

Phase 0B: CloudFormation Hardening — COMPLETE & DEPLOYED

  • WAF: Agent endpoint rate limiting (100 req/5-min per IP) for /agent, /execute, /workflow paths (vell-api-waf.yaml). Stack: prod-api-waf, UPDATE_COMPLETE (2026-02-20)
  • Observability: AgentCore log group (/vell/{env}/agentcore), metric filters (execution failures, Bedrock throttles), alarms (>10 failures/5-min, >20 throttles/5-min), X-Ray 100% sampling for agent paths (vell-observability.yaml). Stack: prod-observability, CREATE_COMPLETE (2026-02-20)
  • Bedrock Guardrails: PROMPT_ATTACK filter enabled on Prod (HIGH/NONE) and Enterprise (HIGH/NONE) tiers (bedrock-guardrails.yml). Note: Bedrock requires OutputStrength=NONE for the PROMPT_ATTACK filter type. Stack: vellocity-bedrock-guardrails, UPDATE_COMPLETE (2026-02-20)
  • IAM Role: Added AgentCore Gateway (ListTools, GetTool, InvokeTool, SearchTools), Observability (GetAgentCoreMetrics, ListAgentCoreTraces), and Memory (GetMemory, PutMemory, DeleteMemory) permissions (vell-agentcore-bedrock-role.yaml). This is a customer-facing template and is not deployed in Vell's account (customers deploy it in their own accounts)

Phase 0C: AgentCore Gateway Foundation — COMPLETE & DEPLOYED

  • New CFT: cloudformation/application/vell-agentcore-gateway.yaml
  • Stack: prod-agentcore-gateway — CREATE_COMPLETE (2026-02-20)
  • DynamoDB table vell-{env}-gateway-tool-registry with category-index and marketplace-listing-index GSIs, PITR in prod
  • S3 bucket vell-{env}-gateway-tool-schemas with versioning, Glacier lifecycle, HTTPS-only policy
  • SQS sync queue + DLQ (3 retries, 14-day DLQ retention)
  • EventBridge rule for CapabilityRegistryChanged, ToolSchemaUpdated, MarketplaceListingChanged events
  • Lambda sync function (Python 3.11) with SQS event source mapping
  • IAM role with DynamoDB, S3, Bedrock Gateway, SQS, X-Ray, and CloudWatch permissions
  • CloudWatch dashboard (Lambda invocations/errors/duration, DynamoDB read/write, SQS queue depth, S3 operations)
  • 7 SSM parameters for cross-stack resource discovery
  • CloudWatch alarms for sync errors and DLQ depth
  • Artisan command: php artisan agentcore:sync-tools (SyncGatewayToolsCommand.php)
  • Seeds 37 capabilities from CapabilityRegistry::bootstrapDefaults() into DynamoDB
  • Generates MCP-compatible tool schemas to S3 for each capability
  • Flags: --dry-run, --force, --capability={slug}, --listing={id}
  • Marketplace listing mapping for 8 future listings (bundle-first strategy)
  • Versioned snapshots in DynamoDB for tool metadata history
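The seeding behaviour described above (one MCP-compatible schema per capability, plus a "latest" row and versioned snapshots in the registry) can be sketched in a few lines. The MCP tool shape (name, description, inputSchema) follows the MCP specification; the registry key layout, attribute names, and the in-memory `table` dict are hypothetical stand-ins for the real DynamoDB table.

```python
import json


def build_mcp_tool_schema(slug, description, properties, required):
    """Build an MCP-style tool definition: name, description, JSON Schema input."""
    return {
        "name": slug,
        "description": description,
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": sorted(required),
        },
    }


def registry_items(slug, listing, schema, run_id):
    """Rows written by one sync run: the 'latest' row plus one snapshot.

    The (slug, version) key and attribute names are assumptions.
    """
    base = {
        "slug": slug,
        "marketplace_listing": listing,
        "schema_json": json.dumps(schema, sort_keys=True),
    }
    return [
        {**base, "version": "latest"},
        {**base, "version": f"snapshot#{run_id}"},
    ]


def apply_sync(table, items):
    """Upsert rows keyed by (slug, version), mimicking PutItem overwrite semantics."""
    for item in items:
        table[(item["slug"], item["version"])] = item
```

Under these assumptions, two sync runs over 37 capabilities leave 37 "latest" rows and 74 snapshot rows, which matches the 111-item count reported for the seeded registry.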

Phase 0 Files Changed/Created

Created:

  • app/Services/UrlValidator.php: Reusable SSRF protection utility
  • cloudformation/application/vell-agentcore-gateway.yaml: Gateway foundation CFT (DynamoDB, S3, SQS, EventBridge, Lambda, IAM, CloudWatch)
  • app/Extensions/ContentManager/System/Console/Commands/SyncGatewayToolsCommand.php: Artisan agentcore:sync-tools command
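The kind of check an SSRF guard such as UrlValidator performs (blocking RFC 1918 ranges, the instance metadata endpoint, and loopback) can be sketched as below. This is an illustrative Python version, not the PHP utility; the function name and the injectable `resolve` parameter are assumptions.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
METADATA = ipaddress.ip_address("169.254.169.254")  # cloud instance metadata endpoint


def _resolve(host):
    """Resolve a hostname to its IP address strings via the system resolver."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_url_allowed(url, resolve=_resolve):
    """Return True only for http(s) URLs whose host maps to public addresses."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        # Literal IP in the URL: check it directly.
        addresses = [ipaddress.ip_address(parts.hostname)]
    except ValueError:
        # Hostname: check every address it resolves to.
        try:
            addresses = [ipaddress.ip_address(a) for a in resolve(parts.hostname)]
        except (OSError, ValueError):
            return False
    if not addresses:
        return False
    for addr in addresses:
        if addr.is_loopback or addr.is_link_local or addr == METADATA:
            return False
        if addr.version == 4 and any(addr in net for net in RFC1918):
            return False
    return True
```

A check like this is only effective if the subsequent request connects to the same address that was validated; otherwise a DNS answer that changes between check and fetch can bypass it.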

Modified:

  • app/Extensions/ContentManager/System/Policies/AgentPolicy.php: Added execute(), viewExecution(), cancel() methods (SEC-001); added create(), deleteExecution(), rerunExecution(), restartExecution(), archiveExecution(), toggleNotification() methods (SEC-007)
  • app/Extensions/ContentManager/System/Services/AgentCore/AgentOrchestrator.php: Ownership guards on 3 methods
  • app/Extensions/ContentManager/System/Services/AgentCore/CapabilityRegistry.php: Handler namespace allowlist + interface validation
  • app/Extensions/ContentManager/System/Services/AgentCore/WorkflowPlanner.php: Prompt injection protection + Kahn's algorithm (SEC-003/SCALE-002); JSON schema validation with step key whitelist, capability slug regex, dependency index validation, parameter depth cap, decode depth limit (SEC-009)
  • app/Extensions/ContentManager/System/Http/Controllers/AgentController.php: Sort order validation, SSRF protection, policy method updates (SEC-001/004/005); explicit field whitelist for mass assignment (SEC-006); 9 inline auth checks replaced with policy-based $this->authorize() (SEC-007)
  • app/Services/Bedrock/BedrockRuntimeService.php: TTL-based credential cache (15-min expiry), resetClientCache() static method (SEC-008)
  • cloudformation/application/vell-api-waf.yaml: Agent endpoint rate limiting rule + alarm
  • cloudformation/application/vell-observability.yaml: AgentCore log group, metric filters, alarms, X-Ray sampling
  • cloudformation/application/bedrock-guardrails.yml: PROMPT_ATTACK filters on Prod/Enterprise
  • app/CustomExtensions/CloudMarketplace/resources/cloudformation/vell-agentcore-bedrock-role.yaml: Gateway, Observability, Memory IAM permissions

Modified (Infrastructure Health Check + Tool Registry Seeding 2026-02-20):

  • app/Extensions/ContentManager/System/ContentManagerServiceProvider.php: Registered SyncGatewayToolsCommand in registerCommands() (was missing; root cause of the "no commands defined in agentcore namespace" error)
  • cloudformation/stacks/prod/prod-security.yml: Added logs:TagResource to app-ec2-perms policy (AWS requirement for CreateLogGroup with tags). Added dynamodb:PutItem/GetItem/Query on vell-{env}-gateway-tool-registry and s3:PutObject/GetObject on vell-{env}-gateway-tool-schemas-* for agentcore:sync-tools
  • vell/codedeploy/after-install.sh: Added agentcore:sync-tools to CodeDeploy AfterInstall hook (runs as $WEB_USER, non-fatal on failure)
  • app/Extensions/ContentManager/System/Console/Commands/SyncGatewayToolsCommand.php: Replaced app(DynamoDbClient::class) / app(S3Client::class) with direct instantiation using region config (no service container binding existed for AWS SDK clients)

Phase 0 Deployment Log (2026-02-20)

Deployment fixes applied during launch (3 template issues discovered and resolved):

  1. bedrock-guardrails.yml: Enterprise PROMPT_ATTACK OutputStrength constraint
     • Issue: Enterprise tier had OutputStrength: MEDIUM for PROMPT_ATTACK filter
     • Error: Bedrock API rejected with "PROMPT ATTACK content filter strength for response must be NONE"
     • Fix: Changed to OutputStrength: NONE, added clarifying comment
     • Root cause: Bedrock enforces OutputStrength: NONE for PROMPT_ATTACK filter type (output-side prompt attack detection is not supported)

  2. vell-agentcore-gateway.yaml: Lambda ZipFile size limit
     • Issue: Lambda inline code was 4,826 bytes, exceeding CloudFormation's 4,096-byte ZipFile limit
     • Error: PropertyValidation hook rejected changeset
     • Fix: Minified Lambda code from 4,826 to 2,843 bytes while preserving all functionality (handler(), sync_cap(), upd_schema(), upd_listing())

  3. vell-agentcore-gateway.yaml: S3 lifecycle property name
     • Issue: Used NoncurrentVersionTransition (singular) with a list value
     • Error: PropertyValidation hook rejected changeset (cfn-lint: E3012)
     • Fix: Changed to NoncurrentVersionTransitions (plural) for list form
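The ZipFile issue above is the kind of failure a pre-deploy check can catch before a changeset is created. The following is a minimal sketch, assuming the inline Lambda source is available as a string; the helper names are hypothetical and the 4,096 limit is the figure cited in the deployment log.

```python
ZIPFILE_LIMIT = 4096  # inline ZipFile limit cited in the deployment log


def inline_code_size(source: str) -> int:
    """Size of the inline Lambda source when encoded as UTF-8."""
    return len(source.encode("utf-8"))


def fits_inline(source: str, limit: int = ZIPFILE_LIMIT) -> bool:
    """True when the source is small enough to embed as ZipFile."""
    return inline_code_size(source) <= limit
```

With the sizes from the log, 4,826 bytes fails this check and the minified 2,843 bytes passes. Code that keeps growing is usually better packaged as an S3 artifact than minified further.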

Final stack states:

| Stack | Operation | Status | Timestamp (UTC) |
| --- | --- | --- | --- |
| vellocity-bedrock-guardrails | UPDATE | UPDATE_COMPLETE | 2026-02-20T01:23:24 |
| prod-api-waf | UPDATE | UPDATE_COMPLETE | 2026-02-20T01:32:44 |
| prod-observability | CREATE | CREATE_COMPLETE | 2026-02-20T01:37:17 |
| prod-agentcore-gateway | CREATE | CREATE_COMPLETE | 2026-02-20T01:41:01 |

Post-deployment action: Re-subscribed ops@vell.ai to SNS vell-prod-critical-alerts topic (subscription pending email confirmation).

Tool Registry Seeding (2026-02-20):

4 blockers discovered and resolved across 3 CodeDeploy deployments:

| Deployment | Commit | Fix | Result |
| --- | --- | --- | --- |
| d-0WIFWHIXH | 96a9d2f53 | (pre-fix baseline) | AfterInstall succeeded but agentcore:sync-tools not in hook |
| d-7X2THSIXH | f42f46f44 | Service provider registration + IAM permissions + AfterInstall hook + SSM env vars (v49) | Command found but app(DynamoDbClient::class) → BindingResolutionException |
| d-UK7H68JXH | ff5a7cc69 | Direct AWS SDK client instantiation (new DynamoDbClient()/new S3Client()) | SUCCESS: 37 capabilities synced |

Final tool registry state:

  • DynamoDB: 111 items (37 latest + 74 versioned snapshots)
  • S3: 37 MCP tool schemas
  • 9 marketplace listings: brand-knowledge (5), competitive-intelligence (1), content-generation (5), cosell-partner-intelligence (6), deal-intelligence (3), gtm-planning (4), marketplace-intelligence (6), seo-intelligence (6), vell-platform (1)

Infrastructure Health Check (2026-02-19)

Verified all Phase 0 infrastructure from live AWS account 253265132499 in us-east-1.

Stack Status:

| Stack | Status | Last Updated (UTC) |
| --- | --- | --- |
| prod-agentcore-gateway | CREATE_COMPLETE | 2026-02-20T01:41:01 |
| prod-observability | CREATE_COMPLETE | 2026-02-20T01:37:17 |
| prod-api-waf | UPDATE_COMPLETE | 2026-02-20T01:32:44 |
| vellocity-bedrock-guardrails | UPDATE_COMPLETE | 2026-02-20T01:23:24 |

Resource Verification:

| Resource | Status | Details |
| --- | --- | --- |
| DynamoDB vell-prod-gateway-tool-registry | ACTIVE | PITR enabled (35-day recovery). 2 GSIs active (category-index, marketplace-listing-index). 111 items: 37 capabilities × 3 (latest + 2 versioned snapshots from rolling deploy across 2 instances). |
| S3 vell-prod-gateway-tool-schemas-253265132499 | ACTIVE | Versioning enabled. 37 MCP tool schemas uploaded across 9 marketplace listings. |
| Lambda vell-prod-gateway-tool-sync | Active | Python 3.11, 256MB, 60s timeout, X-Ray Active. SQS event source mapping enabled. 0 invocations since deployment. |
| SQS vell-prod-gateway-sync | ACTIVE | 0 messages in-flight. Queue healthy. |
| SQS DLQ vell-prod-gateway-sync-dlq | ACTIVE | 0 messages. No failed sync events. |
| EventBridge vell-prod-gateway-tool-sync | ENABLED | Listening for CapabilityRegistryChanged, ToolSchemaUpdated, MarketplaceListingChanged from vell.agentcore. |
| WAF vell-prod-api-waf | ACTIVE | 7 rules (3 AWS managed + 3 rate limits + 1 count-mode burst detector). Associated with API Gateway stage prod (qkxjis5iel). |
| Bedrock Guardrails | ACTIVE | 3 tiers: Dev (uvishu7ijb29), Prod (5bf7khsguf6i), Enterprise (7s64nv00v8t5). |
| CloudWatch Alarms | ALL OK | 22 alarms verified, all in OK state, including gateway-specific vell-prod-gateway-sync-errors and vell-prod-gateway-sync-dlq-depth. |

Issues Found (ACTION REQUIRED):

| Issue | Severity | Details | Action |
| --- | --- | --- | --- |
| INFRA-001: SNS subscriptions not confirmed | HIGH | Both admin@vell.ai and ops@vell.ai were PendingConfirmation on vell-prod-critical-alerts. CloudWatch alarms fire but no one receives alerts. | RESOLVED (2026-02-19). Both subscriptions confirmed: ops@vell.ai (ccc72725-6796-4e86-95c3-cb5a6e822543), admin@vell.ai (7ce4e5c2-d039-44bd-ae1b-9d40071e1585). |
| INFRA-002: Tool registry not seeded | MEDIUM | ~~DynamoDB table has 0 items.~~ | RESOLVED (2026-02-20). 4 blockers found and fixed across 3 deployments: (1) SyncGatewayToolsCommand not registered in ContentManagerServiceProvider::registerCommands(), (2) EC2 instance role missing logs:TagResource + DynamoDB/S3 permissions; prod-security.yml updated (stack UPDATE_COMPLETE 2026-02-20T03:05:26 UTC), (3) AGENTCORE_GATEWAY_TABLE and AGENTCORE_GATEWAY_SCHEMA_BUCKET env vars missing from SSM /prod/app/.env; added (version 49), (4) AWS SDK clients resolved via app() with no container binding; replaced with direct new DynamoDbClient()/new S3Client() instantiation. Result: 37 capabilities synced, 111 DynamoDB items (37 latest + 74 versioned snapshots), 37 MCP schemas in S3 across 9 marketplace listings. agentcore:sync-tools now runs automatically on every CodeDeploy AfterInstall. |
| INFRA-003: Sync pipeline untested in prod | LOW | Lambda has 0 invocations. The EventBridge → SQS → Lambda pipeline has never been triggered. Will work on first real event, but no smoke test has validated the end-to-end flow. | Publish a test CapabilityRegistryChanged event via EventBridge to validate the pipeline: aws events put-events --entries '[{"Source":"vell.agentcore","DetailType":"CapabilityRegistryChanged","Detail":"{\"capability_slug\":\"test\",\"action\":\"test\"}"}]' |
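The INFRA-003 smoke test can also be scripted, which makes it repeatable after each deploy. The sketch below builds the same event as the CLI command above and publishes it with boto3's `put_events`; the function names are hypothetical, and it assumes the rule listens on the default event bus and that boto3 and AWS credentials are available where `publish` is called.

```python
import json


def build_test_entry(slug: str = "test", action: str = "test") -> dict:
    """Build the PutEvents entry for a CapabilityRegistryChanged smoke test."""
    return {
        "Source": "vell.agentcore",
        "DetailType": "CapabilityRegistryChanged",
        # Detail must be a JSON string, not a nested object.
        "Detail": json.dumps({"capability_slug": slug, "action": action}),
    }


def publish(entries: list) -> dict:
    """Send the entries to EventBridge and fail loudly on partial failure."""
    import boto3  # imported lazily so the builder works without boto3 installed

    response = boto3.client("events").put_events(Entries=entries)
    if response["FailedEntryCount"]:
        raise RuntimeError(f"put_events rejected entries: {response['Entries']}")
    return response
```

A successful `put_events` call only confirms that EventBridge accepted the event. The pipeline is validated once the Lambda invocation count moves off zero and the DLQ depth stays at zero.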

Phase 1: Adopt AgentCore Observability (Week 4-5)

Lowest risk, highest immediate value. OBS-002/OBS-003 infrastructure layer done in Phase 0B; this phase adds application-level instrumentation.

  • Enable AgentCore Observability for existing Bedrock API calls
  • Add OTEL instrumentation to AgentOrchestrator and BedrockRuntimeService
  • Create CloudWatch dashboards for workflow health metrics
  • Set up CloudWatch Alarms → SNS → Slack/PagerDuty for failure rate thresholds
  • Add correlation IDs (execution UUID) to all service calls
  • Build admin dashboard view combining failed_jobs + agent_executions + AgentCore metrics
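The correlation-ID item above can be sketched with the standard library alone. This is not OTEL instrumentation and not the application's PHP code; it only illustrates the pattern of binding an execution UUID once and having every log line carry it. The logger name and log format are assumptions.

```python
import contextvars
import logging

# Holds the execution UUID for the workflow currently being processed.
execution_id = contextvars.ContextVar("execution_id", default="-")


class ExecutionIdFilter(logging.Filter):
    """Attach the current execution ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.execution_id = execution_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    """Create a logger whose lines always include execution_id=<uuid>."""
    logger = logging.getLogger("agentcore.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s execution_id=%(execution_id)s %(message)s")
    )
    handler.addFilter(ExecutionIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def run_step(logger: logging.Logger, name: str) -> None:
    """A stand-in for a service call; it never passes the ID explicitly."""
    logger.info("step=%s started", name)
```

Setting the context variable at the start of an execution and resetting it at the end means downstream calls do not need an extra parameter, and the same ID can later be attached to trace spans so that failed_jobs, agent_executions, and traces can be joined on it.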

Phase 2: Migrate to AgentCore Gateway (Week 6-9)

Tool registry security fixed (SEC-002). DynamoDB registry + S3 MCP schemas deployed (Phase 0C). This phase registers tools with live Gateway.

  • Register capabilities with live AgentCore Gateway using MCP schemas from S3
  • Enable semantic tool selection for capability discovery
  • Migrate outbound auth (HubSpot, LinkedIn, Slack, etc.) to Gateway auth management
  • Implement Gateway interceptors for tool-call-level authorization
  • Transition CapabilityRegistry to read from Gateway (keep local DynamoDB as fallback)
  • Enable AgentCore Policy for tool-call enforcement
  • Configure marketplace listing unbundling (8 listings in DynamoDB, activate per listing)
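The "read from Gateway, keep local DynamoDB as fallback" transition can be sketched as a small wrapper. Both loaders are passed in as callables, so nothing here depends on a real Gateway or DynamoDB client; the function name, the returned source label, and the choice of which exceptions count as "Gateway unavailable" are assumptions.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def list_capabilities(
    from_gateway: Callable[[], Iterable[str]],
    from_local_registry: Callable[[], Iterable[str]],
    on_fallback: Optional[Callable[[Exception], None]] = None,
) -> Tuple[List[str], str]:
    """Return (capability slugs, source), preferring Gateway over the local registry."""
    try:
        return list(from_gateway()), "gateway"
    except (ConnectionError, TimeoutError) as exc:
        # Only availability failures trigger the fallback; other errors
        # (for example bad credentials) should surface to the caller.
        if on_fallback is not None:
            on_fallback(exc)
        return list(from_local_registry()), "local"
```

Returning the source alongside the data lets callers emit a metric whenever the fallback path is used, so a silently degraded Gateway does not go unnoticed.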

Phase 3: Adopt AgentCore Memory (Week 9-10)

Replaces fragile fallback code

  • Migrate AgentMemoryService to native AgentCore Memory API
  • Enable episodic memory for GTM workflow learning
  • Remove local session fallback code
  • Add memory health status to execution UI

Phase 4: Migrate to AgentCore Runtime (Week 11-16)

Largest change — replaces queue-based execution

  • Package AgentOrchestrator as AgentCore Runtime-compatible agent
  • Implement session-based execution (replace ProcessAgentWorkflowJob)
  • Enable microVM isolation per execution
  • Extend timeout from 10min → up to 8hrs for complex workflows
  • Migrate ExecutionHealthMonitor to monitor Runtime sessions instead of queue jobs
  • Implement A2A protocol for multi-agent GTM workflows
  • Decommission ProcessAgentWorkflowJob and self-healing queue infrastructure

Phase 5: Adopt AgentCore Identity (Week 17-18)

Unifies fragmented auth

  • Configure AgentCore Identity with existing Cognito user pool
  • Migrate per-user Bedrock credentials to Identity-managed access
  • Enable custom claims for multi-tenant team isolation
  • Remove static credential cache from BedrockRuntimeService

Cost Estimation

Current Infrastructure Costs (Estimated)

| Component | Cost Driver | Monthly Estimate |
| --- | --- | --- |
| Queue workers (EC2/ECS) | Always-on compute for job processing | $200-500 |
| Redis (ElastiCache) | Queue + cache | $100-300 |
| Health monitoring (scheduler) | EC2 compute for cron | Included above |
| Developer time (maintenance) | Bug fixes, monitoring, on-call | 40-80 hrs/month |

Projected AgentCore Costs (at 10,000 executions/month)

| Service | Usage | Monthly Cost |
| --- | --- | --- |
| Runtime | 10K sessions × ~18s active CPU × 1 vCPU + 60s × 2GB memory | ~$8 |
| Gateway | 10K sessions × 5 tool calls = 50K invocations | ~$0.25 |
| Memory | 50K short-term events + 5K long-term + 10K retrievals | ~$21 |
| Identity | Free via Runtime/Gateway | $0 |
| Observability | CloudWatch log storage (~5GB) | ~$3 |
| Policy | 50K tool calls × 1 policy check | ~$1.25 |
| Total AgentCore | | ~$33/month |
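The arithmetic behind the total and the savings range can be reproduced from the document's own figures. The sketch below uses only the monthly estimates stated in the cost tables and is not a pricing calculator.

```python
# Monthly estimates in USD, taken from the projected AgentCore cost table.
line_items = {
    "Runtime": 8.00,
    "Gateway": 0.25,
    "Memory": 21.00,
    "Identity": 0.00,
    "Observability": 3.00,
    "Policy": 1.25,
}
agentcore_total = sum(line_items.values())  # 33.5, reported as ~$33/month

# Current infrastructure range: queue workers plus Redis.
current_low = 200 + 100   # $300
current_high = 500 + 300  # $800

# Savings range as used in the savings analysis (with the rounded ~$33 figure).
rounded_total = 33
savings_low = current_low - rounded_total    # 267
savings_high = current_high - rounded_total  # 767
```

The line items sum to $33.50, which the table rounds to roughly $33. Memory accounts for about two thirds of the total, so the projection is most sensitive to event and retrieval volume rather than to Runtime or Gateway usage.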

Savings Analysis

| Category | Before | After | Savings |
| --- | --- | --- | --- |
| Infrastructure compute | $300-800/mo | ~$33/mo | $267-767/mo |
| Developer maintenance | 40-80 hrs/mo | 10-20 hrs/mo | 30-60 hrs/mo |
| Security risk exposure | ~~5 critical~~ 0 critical + ~~4 high~~ 0 high vulns (Phase 0A) | Addressed by managed services | Reduced attack surface |
| Incident response | Reactive (no alerting) | Proactive (CloudWatch alarms) | Faster MTTR |

Note: Bedrock model inference costs (Claude tokens) remain the same regardless of migration. These estimates cover only the orchestration infrastructure layer.


Sources