AgentCore Gateway Audit: Build vs Buy Analysis¶
Date: 2026-02-19
Last Updated: 2026-02-21 (Funding intelligence integration: IP, data moat, and opportunity analysis for AI-enabled funding assistance)
Scope: Vellocity's custom AgentCore orchestration vs Amazon Bedrock AgentCore managed services
Goal: Identify strengths and weaknesses, and determine what to offload to Bedrock AgentCore and what to keep in-house
Table of Contents¶
- Executive Summary
- Current Architecture Scorecard
- Security Findings
- Scalability Findings
- Resilience Findings
- Observability Findings
- Failed Job Analysis
- Bedrock AgentCore Service Comparison
- Offload vs Keep Recommendations
- IP Impact Analysis
- Data Moat & Lock-In Assessment
- Opportunities
- Migration Roadmap
- Cost Estimation
Executive Summary¶
Vellocity's custom AgentCore is a capable workflow orchestration system with strong error classification and stuck-execution recovery. The initial audit identified critical security vulnerabilities (IDOR, prompt injection, unsafe handler loading), no circuit breaker for Bedrock API calls, no database transactions on step execution, and O(n³) dependency resolution that would break at scale.
Phase 0 Status: COMPLETE AND DEPLOYED. All 5 critical security vulnerabilities (SEC-001 through SEC-005), all 4 high security vulnerabilities (SEC-006 through SEC-009), and the critical scalability issue (SCALE-002) have been remediated in application code. Supporting CloudFormation templates have been hardened and deployed to prod on 2026-02-20: WAF rate limiting (prod-api-waf), Bedrock guardrail prompt attack detection (vellocity-bedrock-guardrails), AgentCore observability infrastructure (prod-observability), and a new AgentCore Gateway foundation (prod-agentcore-gateway) with DynamoDB tool registry, S3 schema storage, Lambda sync function, and EventBridge pipeline — preparing the infrastructure for Phase 2 Gateway migration.
Infrastructure Verification (2026-02-20): All 5 CloudFormation stacks healthy, all 22 CloudWatch alarms in OK state, WAF associated with API Gateway prod stage. 3 post-deployment issues found, of which 2 are resolved and 1 is deferred: (1) ~~SNS alert subscriptions not confirmed~~ RESOLVED: both confirmed; (2) ~~Tool registry not seeded~~ RESOLVED: 37 capabilities synced to DynamoDB (111 items incl. versioned snapshots) + 37 MCP schemas in S3 across 9 marketplace listings; (3) Sync pipeline untested in prod (LOW, deferred to Phase 2).
Amazon Bedrock AgentCore (GA Oct 2025) offers managed Runtime, Gateway, Memory, Identity, Observability, and Policy services that directly address remaining weaknesses — particularly around infrastructure resilience, tool authentication, memory management, and monitoring.
Funding Intelligence Update (2026-02-21): The AI-powered FundingApplicationWriterService (5 AWS Partner Funding programs, KB-grounded generation, dry-run reviewer evaluation, Partner Central Benefits API) is a significant IP asset operating outside the CapabilityRegistry. Integrating it as capability #38 unlocks agent-orchestrated funding workflows, Gateway discoverability, and outcome-driven approval rate intelligence — a new data moat with strong network effects.
Recommendation: Adopt a hybrid approach — offload infrastructure-layer concerns (runtime, memory, identity, observability) to Bedrock AgentCore while retaining ownership of business logic (workflow planning, brand voice, marketplace metering, GTM templates, funding intelligence).
Current Architecture Scorecard¶
| Category | Score (Initial → Current) | Key Issues |
|---|---|---|
| Architecture | 7/10 → 8/10 | Clean capability-based design, good separation of concerns. Gateway foundation infrastructure added. |
| Security | 4/10 → 8/10 | ~~IDOR, prompt injection, missing auth checks, unsafe handler loading~~ — All 5 critical fixes applied (SEC-001–005). ~~Mass assignment, inconsistent auth, credential cache, unsafe JSON parsing~~ — All 4 high fixes applied (SEC-006–009). |
| Scalability | 3/10 → 4/10 | ~~O(n³) resolution~~ replaced with Kahn's algorithm. Remaining: No concurrency limits, memory leaks, no backpressure |
| Resilience | 5/10 | Good stuck detection, but no circuit breaker, no transactions, retry storms |
| Observability | 6/10 → 7/10 | ~~No ops alerting~~ — CloudWatch alarms, AgentCore log group, X-Ray sampling added. Remaining: No correlation IDs, no OTEL |
| Error Handling | 7/10 | Excellent error classification, but too many silent failures |
Security Findings¶
CRITICAL¶
- SEC-001: IDOR in getExecutionStatus / getExecutionResults (FIXED)
  - File: `AgentOrchestrator.php:377-438`, `AgentPolicy.php`, `AgentController.php`
  - Issue: Public methods retrieve execution data by ID with no user ownership check. Any authenticated user can access another user's workflow results, task descriptions, and generated assets.
  - Fix applied: Defense-in-depth across 3 layers:
    - Added `execute()`, `viewExecution()`, and `cancel()` methods to `AgentPolicy`
    - Added `$userId` parameter + ownership guard to `getExecutionStatus()`, `getExecutionResults()`, and `cancelExecution()` in `AgentOrchestrator`
    - Updated `AgentController` to use dedicated policy methods and pass `Auth::id()` to orchestrator
- SEC-002: Unsafe Capability Handler Loading (FIXED)
  - File: `CapabilityRegistry.php:68-77`
  - Issue: `getHandler()` instantiates handler classes without validating against an allow-list. If an attacker can create a capability record with a malicious `handler_class`, this is a path to remote code execution.
  - Fix applied: Added `ALLOWED_HANDLER_NAMESPACES` constant restricting to `App\Extensions\ContentManager\System\Services\Capabilities\`. Validates namespace prefix, class existence, and `CapabilityInterface` implementation before instantiation. Critical-level logging on blocked attempts.
- SEC-003: Prompt Injection in WorkflowPlanner (FIXED)
  - File: `WorkflowPlanner.php:161-265`
  - Issue: User-provided `$taskDescription` is interpolated directly into the planning prompt with no escaping or boundary markers. Crafted input could manipulate the planner.
  - Fix applied (multi-layer):
    - Input sanitization: strip control characters, limit to 2,000 chars
    - XML boundary markers: `<system_instructions>` wraps planner prompt, `<user_task>` wraps user input
    - Output validation: max 50 steps, required fields per step (`capability`, `parameters`, `depends_on`)
    - CFT defense-in-depth: Bedrock Guardrails PROMPT_ATTACK filter enabled (Prod: HIGH/NONE, Enterprise: HIGH/NONE; Bedrock requires OutputStrength=NONE for PROMPT_ATTACK)
- SEC-004: Missing Sort Order Validation (FIXED)
  - File: `AgentController.php:287-296`
  - Issue: `$sortOrder` parameter is passed directly to `orderBy()` without validation. Only `$sortBy` is whitelisted.
  - Fix applied: Added `if (!in_array(strtolower($sortOrder), ['asc', 'desc'])) { $sortOrder = 'desc'; }`, which mirrors the existing `$sortBy` whitelist pattern.
- SEC-005: Potential SSRF in enrichBrandVoice (FIXED)
  - File: `AgentController.php:940-945`
  - Issue: Company website URL used without validating scheme or blocking internal addresses.
  - Fix applied: Created reusable `App\Services\UrlValidator` utility. Enforces HTTPS-only, blocks RFC 1918 (10.x, 172.16-31.x, 192.168.x), link-local/AWS metadata (169.254.x), carrier-grade NAT (100.64.x), loopback (127.x), and IPv6 equivalents. DNS resolution check prevents DNS rebinding. Applied at `enrichBrandVoice()` endpoint.
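The address checks listed under SEC-005 can be sketched as follows. This is an illustrative Python sketch, not the `App\Services\UrlValidator` source (which is PHP and not shown in this audit). The names `is_safe_url`, `is_blocked_ip`, and `resolve_host` are hypothetical, and the IPv6 ranges are an assumption about what "IPv6 equivalents" covers.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Ranges named in SEC-005; the IPv6 entries are assumed equivalents.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
        "169.254.0.0/16",  # link-local, includes the AWS metadata endpoint
        "100.64.0.0/10",   # carrier-grade NAT
        "127.0.0.0/8",     # loopback
        "::1/128", "fe80::/10", "fc00::/7",  # IPv6 loopback, link-local, unique-local
    )
]


def is_blocked_ip(address):
    """Return True if the address is unparseable or falls in a blocked range."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return True  # fail closed on anything we cannot parse
    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:127.0.0.1.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in network for network in BLOCKED_NETWORKS)


def resolve_host(host):
    """Resolve a hostname to the list of IP address strings it maps to."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def is_safe_url(url, resolve=resolve_host):
    """HTTPS only, and every address the host maps to must be public."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False
    if parts.scheme != "https" or not host:
        return False
    try:
        ipaddress.ip_address(host)
        addresses = [host]  # the URL already contains a literal IP
    except ValueError:
        try:
            addresses = resolve(host)
        except OSError:
            return False
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)
```

The `resolve` parameter is injectable so the check can be exercised without network access. A resolution check at validation time only helps against rebinding if the later HTTP request connects to the same address that was validated.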
HIGH¶
- SEC-006: Mass Assignment Risk (FIXED)
  - File: `AgentController.php:735-753`
  - Issue: `$agent->update($validated)` passes full validated array without explicit field whitelisting. The Agent model's `$fillable` includes sensitive fields (`user_id`, `team_id`, `guardrail_id`, `bedrock_agent_id`, `ai_model`, `settings`) that should not be updatable via this endpoint.
  - Fix applied: Replaced `$agent->update($validated)` with `$agent->update($request->only(['name', 'description', 'is_active', 'capabilities']))`. The explicit field whitelist prevents mass assignment of sensitive model attributes regardless of validation rules.
- SEC-007: Inconsistent Authorization Patterns (FIXED)
  - Files: `AgentPolicy.php`, `AgentController.php`
  - Issue: Mix of policy-based auth (`$this->authorize()`) and inline ownership checks (`$execution->user_id !== Auth::id()`). `AgentPolicy` missing `create()` method and execution-level policy methods for delete, rerun, restart, archive, restore, and notification operations.
  - Fix applied (two layers):
    - Added 6 new policy methods to `AgentPolicy`: `create()`, `deleteExecution()`, `rerunExecution()`, `restartExecution()`, `archiveExecution()`, `toggleNotification()`
    - Replaced all 9 inline `$execution->user_id !== Auth::id()` checks in `AgentController` with `$this->authorize()` calls using the appropriate policy method
    - All 17 authorization points in `AgentController` now consistently use policy-based auth
- SEC-008: Static Credential Cache Without TTL (FIXED)
  - File: `BedrockRuntimeService.php:24-80`
  - Issue: Shared Bedrock client cached in static variable with no expiration. In queue workers (long-lived processes), stale or revoked credentials continue to be used indefinitely.
  - Fix applied:
    - Added `$sharedClientCreatedAt` timestamp tracking to static cache
    - Added `CACHE_TTL_SECONDS = 900` (15-minute) TTL constant
    - Cache validity now requires both credential hash match AND TTL not expired
    - Added `resetClientCache()` static method for explicit invalidation when credentials are known to have changed
- SEC-009: Unsafe JSON Plan Parsing (FIXED)
  - File: `WorkflowPlanner.php:330-455`
  - Issue: Plan JSON decoded with minimal structural validation. No validation of step fields, types, or allowed keys. Deeply nested payloads could cause memory issues.
  - Fix applied (5 validations):
    - JSON decode depth limit (10 levels) prevents stack exhaustion
    - `ALLOWED_STEP_KEYS` whitelist strips arbitrary keys from step objects (`capability`, `parameters`, `depends_on`, `name`, `description`, `step_id`, `condition`, `retry`, `timeout`)
    - Capability slug format validation via regex (`/^[a-zA-Z][a-zA-Z0-9_]{0,99}$/`)
    - `depends_on` values validated as valid step indices (non-negative, within range, no self-reference)
    - Parameters nesting depth capped at 5 levels via recursive `arrayDepth()` check
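The SEC-009 validations can be illustrated with a minimal Python sketch. It is not the `WorkflowPlanner` code: the top-level `steps` key, the function names, and the choice to treat missing `depends_on`/`parameters` as empty are assumptions made for illustration. The decode depth limit is omitted because Python's `json.loads` has no depth argument.

```python
import json
import re

# Key whitelist and limits as listed in SEC-009 and SEC-003.
ALLOWED_STEP_KEYS = {
    "capability", "parameters", "depends_on", "name", "description",
    "step_id", "condition", "retry", "timeout",
}
SLUG_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_]{0,99}")
MAX_STEPS = 50
MAX_PARAM_DEPTH = 5


def nesting_depth(value):
    """Depth of nested dicts/lists; scalars count as 0."""
    if isinstance(value, dict):
        children = list(value.values())
    elif isinstance(value, list):
        children = value
    else:
        return 0
    return 1 + max((nesting_depth(child) for child in children), default=0)


def validate_plan(raw_json):
    """Parse a plan and return its cleaned steps, or raise ValueError."""
    plan = json.loads(raw_json)  # malformed JSON raises a ValueError subclass
    steps = plan.get("steps") if isinstance(plan, dict) else None  # assumed shape
    if not isinstance(steps, list) or not 1 <= len(steps) <= MAX_STEPS:
        raise ValueError("plan must contain between 1 and 50 steps")
    cleaned = []
    for index, raw_step in enumerate(steps):
        if not isinstance(raw_step, dict):
            raise ValueError(f"step {index} is not an object")
        # Drop any key that is not on the whitelist.
        step = {k: v for k, v in raw_step.items() if k in ALLOWED_STEP_KEYS}
        capability = step.get("capability")
        if not isinstance(capability, str) or not SLUG_PATTERN.fullmatch(capability):
            raise ValueError(f"step {index} has an invalid capability slug")
        deps = step.get("depends_on", [])
        if not isinstance(deps, list):
            raise ValueError(f"step {index} depends_on must be a list")
        for dep in deps:
            # Valid index: an int, in range, and not the step itself.
            if type(dep) is not int or not 0 <= dep < len(steps) or dep == index:
                raise ValueError(f"step {index} has an invalid dependency {dep!r}")
        if nesting_depth(step.get("parameters", {})) > MAX_PARAM_DEPTH:
            raise ValueError(f"step {index} parameters are nested too deeply")
        cleaned.append(step)
    return cleaned
```

Stripping unknown keys rather than rejecting the whole plan follows the wording of the fix ("strips arbitrary keys"). The other checks fail hard because a bad dependency index or slug cannot be repaired safely.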
Scalability Findings¶
CRITICAL¶
- SCALE-001: No Concurrent Execution Limits
  - File: `AgentController.php:385-391`
  - Issue: Every execution dispatches to queue immediately with zero rate limiting. 100 concurrent users = 100 simultaneous Bedrock API calls = AWS throttling → retry storm → cascading failure.
  - Fix: Add per-user concurrency cap (3-5), implement queue-level backpressure.
- SCALE-002: O(n³) Dependency Resolution (FIXED)
  - File: `WorkflowPlanner.php:491-517`
  - Issue: Execution order algorithm uses nested loops with `maxIterations = count($steps)²`. A 50-step workflow = 312,500+ operations.
  - Fix applied: Replaced with Kahn's algorithm using `SplQueue` for BFS traversal and `$inDegree[]` array for tracking (O(V+E)). Uses associative array for O(1) lookup instead of `in_array()`. Preserved existing diagnostic error handling for unresolvable dependencies. Pre-validation via existing `detectCircularDependencies()` DFS retained.
- SCALE-003: Unbounded Memory in Multi-Step Workflows
  - File: `AgentOrchestrator.php:237-309`
  - Issue: `$stepOutputs` and `$results` arrays accumulate all capability outputs in memory. A 20-step image workflow could exceed PHP's 128MB limit.
  - Fix: Store step outputs in DB/cache, load on-demand for downstream dependencies only.
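For readers unfamiliar with the SCALE-002 fix, Kahn's algorithm can be sketched in a few lines. This Python version is illustrative only; the production code is PHP and uses `SplQueue`. It assumes each step's dependencies are given as a list of step indices, and `execution_order` is a hypothetical name.

```python
from collections import deque


def execution_order(depends_on):
    """Topologically sort steps, where depends_on[i] lists the steps i needs.

    Runs in O(V + E): every step is enqueued once and every edge visited once.
    """
    step_count = len(depends_on)
    in_degree = [len(set(deps)) for deps in depends_on]
    dependents = [[] for _ in range(step_count)]
    for step, deps in enumerate(depends_on):
        for dep in set(deps):
            dependents[dep].append(step)

    # Start with every step that has no unmet dependencies.
    ready = deque(i for i in range(step_count) if in_degree[i] == 0)
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for follower in dependents[step]:
            in_degree[follower] -= 1
            if in_degree[follower] == 0:
                ready.append(follower)

    if len(order) != step_count:
        # Steps left with a positive in-degree sit on or behind a cycle.
        stuck = [i for i in range(step_count) if in_degree[i] > 0]
        raise ValueError(f"unresolvable dependencies among steps {stuck}")
    return order
```

For a diamond-shaped plan such as `[[], [0], [0], [1, 2]]`, step 0 is released first, then steps 1 and 2, and step 3 only after both have run. A cycle leaves the queue empty early, which is how the unresolvable case is detected.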
HIGH¶
- SCALE-004: Oversized Response Bodies
  - File: `AgentOrchestrator.php:465-484`
  - Issue: `getExecutionResults()` returns entire `step_results` and `generated_assets` with no pagination.
  - Fix: Add pagination, or return summary with on-demand detail endpoints.
- SCALE-005: N+1 Query Patterns
  - File: `AgentController.php:43-47, 313-315`
  - Issue: Agent lists load without pagination (`->get()` instead of `->paginate()`). Execution results load full JSON blobs without field selection.
  - Fix: Add pagination, use `->select()` for list endpoints.
Resilience Findings¶
CRITICAL¶
- RES-001: No Circuit Breaker for Bedrock
  - File: `BedrockRuntimeService.php`
  - Issue: When Bedrock is down or throttled, every request fails independently. No consecutive-failure tracking, no circuit opening, no adaptive backoff, no fallback to alternative models.
  - Impact: Complete workflow unavailability during any Bedrock outage.
  - Fix: Implement circuit breaker with states (closed → open → half-open), track failure rate, exponential backoff on 429/503.
- RES-002: No Database Transactions on Step Execution
  - File: `ProcessAgentWorkflowJob.php:67-128`
  - Issue: Step results written to DB without transaction wrapping. Job crash leaves execution in ambiguous state with partial results.
  - Fix: Wrap step execution in `DB::transaction()`, implement idempotency keys.
- RES-003: Retry Storm Risk
  - File: `ProcessAgentWorkflowJob.php:27-32`
  - Issue: 3 retries with no exponential backoff configured. All retries fire immediately, compounding AWS rate limiting.
  - Fix: Add `backoff()` method returning `[30, 120, 300]` (escalating delays).
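The closed → open → half-open state machine proposed in RES-001 could look roughly like the sketch below. It is a minimal Python illustration, not a proposed implementation for `BedrockRuntimeService`. The class name, thresholds, and injectable clock are assumptions, and it counts consecutive failures rather than a failure rate to stay short.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half-open"  # timeout elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _record_success(self):
        self.failures = 0
        self.state = "closed"
```

While the circuit is open, callers fail immediately instead of adding load to a throttled upstream. That addresses the "every request fails independently" behavior described in RES-001. In a multi-worker deployment the state would need to live in a shared store rather than in process memory.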
HIGH¶
- RES-004: Silent Service Degradation
  - Files: `AgentOrchestrator.php:33-40`, `AgentMemoryService.php:135-143`
  - Issue: Multiple services fail silently at DEBUG/WARNING log levels. Users have no visibility into whether memory, dry-run, or other optional services are functioning.
  - Fix: Add execution-level `services_status` field, surface degradation in UI.
STRENGTHS (Keep)¶
- Stuck Execution Recovery: `ExecutionHealthMonitor` with three-level escalation is excellent
- Error Classification: `ExecutionErrorAnalyzer` categorizes errors with user-friendly guidance
- Self-Healing Config: Configurable thresholds at global/team/user scope
- Health Event Audit Trail: Full audit via `ExecutionHealthEvent` model
Observability Findings¶
MEDIUM¶
- OBS-001: No Correlation IDs
  - Issue: Logs lack end-to-end request/execution correlation IDs. Tracing a failure across orchestrator → planner → Bedrock → capability requires manual log correlation.
  - Fix: Generate UUID at execution start, propagate through all service calls.
- OBS-002: Incomplete Alerting (PARTIALLY ADDRESSED)
  - Issue: `ExecutionHealthMonitor` sends in-app notifications but has no integration with CloudWatch, PagerDuty, Slack, or other ops monitoring.
  - Fix applied (infrastructure layer): Added CloudWatch metric filters for `agent_execution_failed` and Bedrock throttles. Added alarms: `AgentFailureRateAlarm` (>10 failures/5-min) and `BedrockThrottleAlarm` (>20 throttles/5-min) with SNS integration. Agent endpoint WAF rate limiting alarm added.
  - Remaining: Application-level CloudWatch metric emission, Slack/PagerDuty integration.
- OBS-003: No Distributed Tracing (PARTIALLY ADDRESSED)
  - Issue: No OpenTelemetry or X-Ray integration. Cannot trace latency across Bedrock API calls.
  - Fix applied (infrastructure layer): Added X-Ray sampling rule for AgentCore executions at 100% (Priority 50) for `*/agent*` paths. Dedicated log group `/vell/{env}/agentcore` with 365-day retention. Lambda sync function has `TracingConfig: Active`.
  - Remaining: Application-level OTEL instrumentation in `AgentOrchestrator` and `BedrockRuntimeService`.
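The OBS-001 fix (generate a UUID at execution start and carry it through every log line) can be sketched as below. This is a Python illustration of the pattern only. The logger name, format string, and helper names are hypothetical, and a Laravel implementation would more likely attach the ID through shared log context.

```python
import contextvars
import logging
import uuid

# Holds the ID of the execution currently being processed in this context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(stream):
    """Create a logger whose output lines start with the correlation ID."""
    logger = logging.getLogger("agentcore.sketch")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
    handler.addFilter(CorrelationFilter())
    logger.handlers = [handler]
    return logger


def start_execution():
    """Generate a fresh ID at execution start and bind it to the context."""
    new_id = str(uuid.uuid4())
    correlation_id.set(new_id)
    return new_id
```

Once the ID is bound at the start of an execution, the orchestrator, planner, and Bedrock client can log without passing the ID explicitly. A single search on that ID then returns the whole trace.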
STRENGTHS (Keep)¶
- Analytics Service: `AgentAnalyticsService` with P50/P95/P99 percentiles and health scoring
- Detailed Logging: 17+ structured log calls in AgentOrchestrator alone
- Usage Tracking: Token/credit metering per step for billing
Failed Job Analysis¶
Current Infrastructure¶
Your application has failure tracking infrastructure in place:
| Component | Status | Details |
|---|---|---|
| `failed_jobs` table | Exists | Standard Laravel DLQ: captures job payload + exception trace |
| `agent_executions.status` | Exists | Tracks: pending, planning, executing, completed, failed, cancelled, archived |
| `agent_executions.error_message` | Exists | Stores failure details |
| Self-healing fields | Exists | retry_count, max_retries, retry_history (JSON), is_stuck, health_status |
| `execution_health_events` | Exists | Full audit trail: stuck_detected, auto_restart, recovery, rate_limited, max_retries_reached |
| `MonitorExecutionHealth` command | Runs every minute | Detects stuck, auto-restarts, rate-limits |
| `MonitorAgentWorker` command | Available | Detects orphaned/stale jobs, can rescue |
What's Missing¶
| Gap | Impact | Priority |
|---|---|---|
| No admin dashboard for failed_jobs | Failures are visible only in the DB and logs, not in any admin UI | HIGH |
| No failure rate SLA tracking | No alerting when failure rate exceeds threshold | HIGH |
| No Slack/PagerDuty alerting | Admin notifications are in-app only | MEDIUM |
| No failed_jobs API endpoint | Cannot query/retry failed jobs from UI | MEDIUM |
| No poison pill detection | Repeated failures from same input not detected | LOW |
| No DLQ depth monitoring | Queue backlog invisible until users complain | MEDIUM |
Job Retry Configuration¶
ProcessAgentWorkflowJob:
  tries: 3 (max attempts)
  timeout: 600 (10 minutes per attempt)
  backoff: none (immediate retry, PROBLEM)
Self-Healing:
  stuck_warning: 5 min
  stuck_critical: 30 min
  auto_restart: 60 min
  max_auto_retries: 3
  exponential_backoff: true (for auto-restarts only)
  max_executions_per_hour: 100
  max_executions_per_day: 1000
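To make the effect of `backoff: none` concrete, the sketch below computes when each attempt would start, with and without the `[30, 120, 300]` delays proposed in RES-003. It is a simplified Python model, not queue-worker code. `attempt_offsets` is a hypothetical helper, and it assumes each attempt fails after a fixed duration and that a delay applies only between attempts.

```python
def attempt_offsets(tries, backoff, attempt_duration):
    """Return the start time (seconds from first dispatch) of each attempt.

    backoff is a list of delays applied after the 1st, 2nd, ... failure;
    if it is shorter than needed, the last value is reused. An empty list
    means immediate retry.
    """
    offsets = []
    elapsed = 0
    for attempt in range(tries):
        offsets.append(elapsed)
        elapsed += attempt_duration
        if attempt < tries - 1 and backoff:
            elapsed += backoff[min(attempt, len(backoff) - 1)]
    return offsets


# A throttled call that fails after about one second (assumed duration):
immediate = attempt_offsets(3, [], 1)
spaced = attempt_offsets(3, [30, 120, 300], 1)
```

Under these assumptions, the three attempts fire within about two seconds of each other when no backoff is set. With the proposed delays they are spread over roughly two and a half minutes. With `tries: 3` there are only two gaps between attempts, so in this model the third delay value (300) would only come into play if `tries` were raised.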
Bedrock AgentCore Service Comparison¶
Service-by-Service Mapping¶
| Bedrock AgentCore Service | Your Current Implementation | Gap Analysis |
|---|---|---|
| Runtime (serverless agent hosting, microVM isolation, 8hr execution windows) | `ProcessAgentWorkflowJob` on Laravel queue + Redis | Your queue has no session isolation, 10min timeout, no microVM sandboxing |
| Gateway (MCP tool registry, API→MCP transform, semantic tool selection, auth management) | `CapabilityRegistry` + inline handler loading + DynamoDB tool registry + S3 MCP schemas (Phase 0C) | ~~Unsafe handler loading~~ fixed (SEC-002). Gateway foundation deployed: DynamoDB registry, S3 MCP schemas, Lambda sync, EventBridge pipeline. Remaining: No semantic discovery, no auth management, no live Gateway registration (Phase 2). Gap: FundingApplicationWriterService operates as a standalone service outside CapabilityRegistry: not registered in DynamoDB tool registry, not discoverable via Gateway, not available to agent workflows. |
| Memory (short-term + long-term + episodic memory) | `AgentMemoryService` wrapping Bedrock agent memory | Already using Bedrock memory partially; local fallback is basic session tracking |
| Identity (agent identity, OAuth flows, credential management) | Mix of Passport, Cognito, ~~inline API key checks~~ standardized policy (SEC-007) | Fragmented; no unified agent identity; ~~static credential cache~~ fixed with TTL (SEC-008) |
| Observability (CloudWatch dashboards, OTEL, latency/error/token metrics) | `AgentAnalyticsService` + `ExecutionHealthMonitor` + custom logging + CloudWatch alarms/metrics (Phase 0B) | Good analytics. ~~No CloudWatch integration~~: metric filters, alarms, X-Ray sampling added. Remaining: No OTEL instrumentation in application code, no distributed tracing end-to-end |
| Policy (Cedar-based tool-call interception, natural language policy rules) | `AgentPolicy` + inline auth checks | Minimal; no tool-call-level policy enforcement |
| Evaluations (13 built-in evaluators, custom scoring, continuous monitoring) | None | No quality evaluation system at all |
| Browser / Code Interpreter (secure browser runtime, code execution sandbox) | None | Not applicable to current use case |
Bedrock AgentCore Pricing Reference¶
| Service | Metric | Price |
|---|---|---|
| Runtime | vCPU-hour | $0.0895 |
| Runtime | GB-hour (memory) | $0.00945 |
| Gateway | Per 1,000 tool invocations | $0.005 |
| Gateway | Per 1,000 search queries | $0.025 |
| Gateway | Per 100 tools indexed/month | $0.02 |
| Memory (short-term) | Per 1,000 events | $0.25 |
| Memory (long-term) | Per 1,000 memories stored | $0.75 |
| Memory (retrieval) | Per 1,000 retrievals | $0.50 |
| Identity | Per 1,000 requests | $0.01 (free via Runtime/Gateway) |
| Policy | Per 1,000 auth requests | ~$0.025 |
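The unit prices above can be turned into a rough monthly estimate with a small calculator. The sketch below is illustrative Python. The prices are copied from the table, but every usage volume in the example call is a made-up placeholder rather than measured Vellocity traffic. Billing tools-indexed proportionally is also an assumption.

```python
# Unit prices copied from the pricing reference table (USD).
PRICES = {
    "runtime_vcpu_hour": 0.0895,
    "runtime_gb_hour": 0.00945,
    "gateway_per_1k_invocations": 0.005,
    "gateway_per_1k_searches": 0.025,
    "gateway_per_100_tools_month": 0.02,
    "memory_per_1k_events": 0.25,
    "memory_per_1k_stored": 0.75,
    "memory_per_1k_retrievals": 0.50,
}


def monthly_cost(vcpu_hours, gb_hours, tool_invocations, searches,
                 tools_indexed, memory_events, memories_stored, retrievals):
    """Return a per-service cost breakdown plus a total, in USD."""
    runtime = (vcpu_hours * PRICES["runtime_vcpu_hour"]
               + gb_hours * PRICES["runtime_gb_hour"])
    gateway = (tool_invocations / 1000 * PRICES["gateway_per_1k_invocations"]
               + searches / 1000 * PRICES["gateway_per_1k_searches"]
               + tools_indexed / 100 * PRICES["gateway_per_100_tools_month"])
    memory = (memory_events / 1000 * PRICES["memory_per_1k_events"]
              + memories_stored / 1000 * PRICES["memory_per_1k_stored"]
              + retrievals / 1000 * PRICES["memory_per_1k_retrievals"])
    return {"runtime": runtime, "gateway": gateway, "memory": memory,
            "total": runtime + gateway + memory}


# Placeholder volumes for illustration only; 37 is the registered capability count.
estimate = monthly_cost(
    vcpu_hours=200, gb_hours=400, tool_invocations=500_000, searches=50_000,
    tools_indexed=37, memory_events=100_000, memories_stored=20_000,
    retrievals=100_000,
)
```

With these placeholder volumes the total comes to about $115 per month, most of it from Memory. Replacing the inputs with measured execution and token data from `AgentAnalyticsService` would give an estimate that can be compared against current queue infrastructure cost.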
Offload vs Keep Recommendations¶
OFFLOAD to Bedrock AgentCore¶
| Component | Why Offload | Bedrock Service | Resolves Issues |
|---|---|---|---|
| Agent Runtime / Execution | Your queue has no isolation, 10min timeout, no backpressure. Bedrock provides microVM isolation, 8hr windows, auto-scaling, and consumption-based billing (only charged for active CPU). | AgentCore Runtime | RES-002, RES-003, SCALE-001 |
| Tool Registry & Authentication | ~~CapabilityRegistry has unsafe handler loading (SEC-002)~~ (fixed). ~~Static credential cache (SEC-008)~~ (fixed with TTL). ~~Inconsistent auth (SEC-007)~~ (fixed with standardized policy). DynamoDB tool registry + S3 MCP schemas deployed (Phase 0C). Gateway provides managed tool registry with semantic discovery, inbound/outbound auth, and 1-click connectors (Salesforce, Slack, Jira, etc). | AgentCore Gateway | ~~SEC-002~~, ~~SEC-008~~, ~~SEC-007~~ |
| Memory Management | `AgentMemoryService` already wraps Bedrock memory but with basic local fallback. Native AgentCore Memory adds episodic memory, managed short-term/long-term storage, and eliminates the need for your fallback code. | AgentCore Memory | RES-004 (memory degradation) |
| Observability & Monitoring | No OTEL instrumentation and no end-to-end distributed tracing; CloudWatch coverage is limited to the infrastructure-layer metric filters and alarms added in Phase 0B. AgentCore Observability provides turnkey dashboards, OTEL compatibility, and integration with Datadog/Dynatrace/LangSmith. | AgentCore Observability | OBS-001, OBS-002, OBS-003 |
| Agent Identity & Auth | Fragmented auth (Passport + Cognito + ~~inline checks~~ standardized policy). ~~Static credential cache (SEC-008)~~ (fixed with TTL). AgentCore Identity provides unified agent identity with OAuth flows, token management, and multi-tenant support. | AgentCore Identity | ~~SEC-007~~, ~~SEC-008~~ |
| Tool-Call Policy Enforcement | No tool-call-level authorization. AgentCore Policy intercepts every tool call in real time and enforces Cedar policies, which can be authored from natural-language rules. | AgentCore Policy | SEC-007 |
KEEP In-House (Competitive Advantages)¶
| Component | Why Keep | Current Quality | Notes |
|---|---|---|---|
| WorkflowPlanner (Claude-powered planning) | Core IP — your GTM workflow templates, brand-aware planning prompts, and capability-aware step generation are unique. Bedrock AgentCore has no equivalent planning service. | Good — ~~SEC-003, SCALE-002~~ fixed | Prompt injection fixed (XML boundaries + input sanitization), O(n³) sort replaced with Kahn's algorithm. Planning logic retained. |
| BrandVoiceContextBuilder | Product differentiator — enriches every workflow with company tone, audience, industry context. No managed equivalent exists. | Excellent | Keep and enhance |
| ExecutionErrorAnalyzer | User-facing error classification with actionable recommendations. AgentCore Observability doesn't provide this UX layer. | Excellent | Keep as UI layer on top of AgentCore Observability |
| ExecutionHealthMonitor (self-healing) | Sophisticated stuck detection with three-level escalation. While AgentCore Runtime handles infrastructure-level health, your business-logic-level health monitoring (stuck workflows, rate limiting) is valuable. | Excellent | Adapt to monitor AgentCore Runtime sessions instead of queue jobs |
| SelfHealingConfig (per-user thresholds) | Unique multi-scope configuration (global/team/user). No managed equivalent. | Good | Keep for business-level config |
| AgentAnalyticsService | Business-level metrics (success rate, P50/P95/P99, cost analysis, capability breakdown, health scoring). AgentCore Observability provides infrastructure metrics but not GTM-specific analytics. | Good | Keep, feed data from AgentCore Observability |
| FundingApplicationWriter (5 AWS programs) | Core IP — program-specific AI generation with KB grounding from official AWS docs, dry-run evaluation with reviewer personas, Partner Central Benefits API integration. No managed equivalent. Only standalone AI funding assistant for AWS Partner Funding programs. | Excellent | Keep — register as AgentCore capability (currently standalone, not in CapabilityRegistry). Wire outcome tracking for approval rate intelligence. |
| GTM Workflow Templates (10 pre-built) | Core product value — co-sell partner workflows, marketplace optimization, content generation pipelines. | Good | Keep and expand |
| Marketplace Metering | AWS Marketplace billing integration (credit/token accounting per step). Specific to your ISV business model. | Good | Keep; wire into AgentCore Runtime session metrics |
| Content Tag System | Content organization and tracking for GTM workflows. No managed equivalent. | Good | Keep |
IP Impact Analysis¶
What You Own vs What You Rent¶
Offloading to Bedrock AgentCore shifts ownership of infrastructure but does NOT transfer your business logic IP. The key question is: which components contain defensible IP, and which are commodity infrastructure you're maintaining at cost?
IP Value Map¶
| Component | Lines of Custom Logic | Replication Effort | IP Value | Migration Risk |
|---|---|---|---|---|
| Marketplace SEO Score (MSS) | 1,070 | 6-8 weeks | VERY HIGH | NONE — stays in-house |
| Capability Registry (37 registered + 5 unregistered capabilities) | 736 | 4-6 weeks | VERY HIGH | ~~MEDIUM~~ LOW — Handler namespace allowlist (SEC-002) + DynamoDB local registry (IP-002/003) ensure handler IP stays in-house. Gateway hosts only schemas/descriptions. |
| FundingApplicationWriter (5 AWS programs, KB-grounded generation, dry-run evaluation) | ~800+ | 4-6 weeks | VERY HIGH | NONE — stays in-house |
| Co-Sell Matching (ICP overlap, partner intelligence) | 584 | 3-4 weeks | HIGH | NONE — stays in-house |
| WorkflowPlanner (dependency graph, retry heuristics) | 572 | 3-4 weeks | VERY HIGH | NONE — stays in-house |
| Deal Influence Tracking (multi-touch attribution) | 400+ | 4-5 weeks | VERY HIGH | NONE — stays in-house |
| BrandVoiceContextBuilder | 349 | 2-3 weeks | HIGH | NONE — stays in-house |
| AgentAnalyticsService (P50/P95/P99, health scoring) | 770 | 3 weeks | MEDIUM | LOW — sits on top of AgentCore Observability |
| Co-Sell Analytics | 453 | 2 weeks | MEDIUM | NONE — stays in-house |
| Marketplace Metering | 229 | 2-3 days | LOW | NONE — commodity wrapper |
Total custom business logic: ~5,963+ lines across 10 components
Full system replication effort: 6-12 months (including integration testing, data migration, domain expertise)
What Makes Your IP Defensible¶
1. Marketplace SEO Score (MSS) Algorithm (patent-worthy)
- 3-component scoring: 40% Listing Quality, 25% Backlink Authority, 35% AI Visibility
- AI Visibility scoring (tracking LLM mentions of listings) is genuinely novel
- Bedrock fallback proxy when DataForSEO API is unavailable (clever engineering)
- Category benchmark medians improve with every customer scored (network effect)
2. WorkflowPlanner Intelligence
- Temperature reduction on retry (0.3 → 0.1) to produce more deterministic replans
- Context truncation strategy (2KB limit) prevents prompt bloat while preserving semantics
- Capability-aware hints: detects available capabilities and adjusts planning prompts dynamically
- Learns from execution errors to improve subsequent planning
3. AWS Partner Network-Specific Capabilities
- ACE Opportunity Sync (auto-generates pre-filled ACE briefs)
- CPPO Proposal Generator (pricing proposals for AWS Marketplace)
- AWS Clean Rooms integration (privacy-preserving partner overlap analysis)
- Partner Intelligence scoring (relationship strength + warm intro paths)
- These require AWS Partner Network access; competitors can't replicate without partnership agreements
4. AI-Powered Funding Application Intelligence
- Program-specific prompt engineering for 5 AWS Partner Funding programs (Innovation Sandbox, POC, ISV WMP, MDF, MAP)
- Knowledge Base grounding from official AWS documentation — responses cite real program requirements, not generic AI output
- Dry-run evaluation with AI personas simulating AWS funding reviewers (funding_reviewer, funding_technical_reviewer)
- Company profile enrichment pre-fills applications with existing brand/product data
- Partner Central Benefits API integration creates a submission-to-outcome feedback loop (coming soon)
- Approval rate data accumulates: more submissions → better program-specific guidance → higher success rates (network effect)
5. Deal Influence Attribution
- 6 correlation input types: UTM, private offers, metering, CRM stages, email/calendar, content engagement
- Multi-touch attribution modeling (first-touch vs last-touch)
- Content-to-conversion lag time tracking (improves with more data)
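The 40/25/35 weighting listed for the Marketplace SEO Score in item 1 can be illustrated as a plain weighted sum. The actual MSS algorithm is not shown in this audit, so the Python sketch below is an assumption-laden illustration. It supposes each component is already normalized to a 0-100 scale and combined linearly, and `mss_score` and the component keys are hypothetical names.

```python
# Component weights as stated in the audit; they sum to 1.0.
MSS_WEIGHTS = {
    "listing_quality": 0.40,
    "backlink_authority": 0.25,
    "ai_visibility": 0.35,
}


def mss_score(components):
    """Combine 0-100 component scores into a single 0-100 weighted score."""
    missing = set(MSS_WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in MSS_WEIGHTS:
        if not 0 <= components[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(weight * components[name] for name, weight in MSS_WEIGHTS.items())


example = mss_score(
    {"listing_quality": 80, "backlink_authority": 40, "ai_visibility": 60}
)
```

For the example inputs the arithmetic is 0.40 × 80 + 0.25 × 40 + 0.35 × 60 = 63, up to floating-point rounding. The sum shows how a strong listing can still be pulled down by weak backlink authority.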
IP Risk from Migration¶
| Migration Phase | IP Risk | Mitigation |
|---|---|---|
| Phase 1: Observability | ZERO | Only adds monitoring layer |
| Phase 2: Gateway | ~~MEDIUM~~ LOW | Your 37 capability handlers become Gateway tools. Handler logic stays yours. Tool metadata mitigated: DynamoDB local registry is source of truth (IP-002), MCP schemas versioned in S3 (IP-003), handler namespace allowlist prevents unauthorized loading (SEC-002). |
| Phase 3: Memory | LOW | Memory content moves to AWS-managed storage. Session metadata stays in your DB. You lose direct access to raw memory vectors. |
| Phase 4: Runtime | LOW | Your orchestrator code runs on AgentCore compute but remains your code. Similar to deploying on EC2 — AWS runs it, you own it. |
| Phase 5: Identity | LOW | Auth config moves to AgentCore Identity. Credential mapping is operational, not IP. |
Key IP Protection Actions¶
- IP-001: Document MSS algorithm separately; consider provisional patent filing
- IP-002: Keep capability handler source code in your repo (Gateway only hosts tool schemas/descriptions). ADDRESSED: Handler source stays in `App\Extensions\ContentManager\System\Services\Capabilities\`. DynamoDB registry + S3 schemas store only metadata/descriptions, never handler code. `ALLOWED_HANDLER_NAMESPACES` constant enforces this boundary.
- IP-003: Maintain local copies of all tool metadata registered with Gateway. ADDRESSED: DynamoDB `vell-{env}-gateway-tool-registry` is the local source of truth. `php artisan agentcore:sync-tools` seeds from application DB. Versioned snapshots in DynamoDB + versioned S3 schemas provide full history. Lambda sync is event-driven, not Gateway-dependent.
- IP-004: Export Bedrock Agent Memory sessions periodically to your own S3 bucket
- IP-005: Ensure all AgentCore Policy rules are version-controlled in your repo, not only in AWS console
Data Moat & Lock-In Assessment¶
Data That Accumulates Value Over Time¶
| Data Category | Stickiness | Portability | Network Effect | Flywheel |
|---|---|---|---|---|
| Marketplace SEO benchmark data | VERY HIGH | Low (AWS-specific) | VERY STRONG | More listings → better category medians → better recommendations |
| Deal influence correlation data | VERY HIGH | Medium | STRONG | More customers → better attribution models → more predictive |
| Execution retry/health patterns | VERY HIGH | High (your DB) | MEDIUM | More executions → better self-healing → fewer failures |
| Keyword gap intelligence | VERY HIGH | Low (AWS-specific) | VERY STRONG | More listings → better competitive landscape maps |
| Funding application outcomes | VERY HIGH | Medium (your DB) | VERY STRONG | More submissions → approval/rejection patterns → better program-specific guidance → higher success rates |
| Funding program intelligence | HIGH | Low (AWS-specific) | STRONG | KB-grounded program requirements + reviewer persona feedback accumulate institutional knowledge of what gets approved |
| Brand voice profiles | HIGH | High (JSON export) | MEDIUM | Every execution refines "what context works" |
| Knowledge base content | HIGH | Medium (S3 docs portable, embeddings not) | HIGH | Quality improves with document volume |
| Compliance history | VERY HIGH | Medium | HIGH | Full audit trail creates switching cost |
| Partner matching patterns | HIGH | Medium | STRONG | More partners matched → better success predictors |
| Agent memory (Bedrock) | MEDIUM | LOW (Bedrock-specific) | HIGH | Conversation history compounds |
| Team configuration | MEDIUM | High (portable) | LOW | Organizational inertia |
What You LOSE Control Of With AgentCore Migration¶
Fully Lost (Bedrock-hosted, no direct access):
1. Active Bedrock Agent Memory content (session summaries, semantic facts)
2. Fine-tuned model weights (bedrock_model_arn is AWS-hosted)
3. Knowledge Base embeddings (Bedrock-specific vectors)

Stays in Your Database (Portable):
1. All execution history, retry patterns, health events
2. Brand voice profiles, GTM goals, personas
3. Compliance reports and rule history
4. Marketplace metrics, SEO scores, keyword gaps
5. Agent definitions, capability configurations
6. Memory session metadata (just not the memory content itself)
7. All integration credentials (encrypted)
8. Deal influence correlation data
9. Team/org configuration
Net Impact: You retain ~90% of your data moat. The 10% you lose (active memory, embeddings) is operational state, not strategic data.
Customer Switching Costs Created¶
| Feature | What Customer Loses by Leaving | Lock-In Strength |
|---|---|---|
| Marketplace SEO Score trends | Historical score trajectory, benchmark comparisons | HIGH |
| Deal influence models | Years of content-to-conversion correlation data | VERY HIGH |
| Compliance audit trail | Full validation history, rule evolution | HIGH |
| Brand voice configuration | Tuned personas, GTM positioning, competitive differentiators | MEDIUM |
| Workflow execution history | What worked, what failed, optimization patterns | MEDIUM |
| Knowledge bases | Curated document corpus with per-capability tuning | MEDIUM |
| Partner matching history | Relationship strength scores, ICP overlap data | MEDIUM |
| Funding application history | Approval/rejection patterns, program-specific guidance, reviewer feedback, reusable application templates | HIGH |
Opportunities¶
Opportunity 1: Marketplace Intelligence as a Standalone Product¶
Your MSS algorithm, keyword gap analysis, and competitive benchmarking could be offered as a standalone analytics product for AWS Marketplace ISVs — even those not using your full GTM platform.
- Market size: 3,000+ ISV listings on AWS Marketplace
- Moat: Benchmark data improves with every customer (network effect)
- Revenue model: Tiered pricing by listing count
- AgentCore relevance: Gateway enables this as a standalone MCP tool that other agents can invoke
Opportunity 2: AgentCore Gateway as Distribution Channel¶
By registering your 37 capabilities as Gateway tools with semantic discovery, your capabilities become discoverable by any agent connected to Gateway — not just your own UI.
- Your co-sell matching, SEO scoring, content generation, and funding application generation become tools other frameworks (CrewAI, LangGraph, LlamaIndex) can invoke
- Gateway's 1-click connectors (Salesforce, Slack, Jira) replace your custom integration code
- This shifts your business model from "app you log into" to "capabilities any agent can call"
- Funding-specific opportunity: A partner's agent discovers your `funding_application_writer` tool via Gateway, generates a joint POC funding application combining both partners' data, and submits via Partner Central Benefits API. The result is fully automated, agent-to-agent co-funding
- Phase 0C progress: DynamoDB tool registry seeded (37 capabilities, 9 listings), 37 MCP schemas in S3, EventBridge sync pipeline deployed, `agentcore:sync-tools` runs automatically on every deploy. Phase 2 will register tools with live Gateway. Action needed: Register `FundingApplicationWriterService` as capability #38 before Phase 2 Gateway registration.
Opportunity 3: A2A Protocol for Multi-Agent GTM¶
AgentCore Runtime supports the Agent-to-Agent (A2A) protocol. Your specialized agents (SEO analyzer, co-sell matcher, content generator) could communicate with:
- Customer's own internal agents
- Partner agents (ISV-to-ISV collaboration)
- AWS first-party agents (Marketplace listing optimizer)
This enables agent-mediated co-sell — a partner's agent negotiates joint GTM campaigns with your agent automatically.
Opportunity 4: AgentCore Evaluations for Quality Differentiation¶
You currently have no systematic quality evaluation for agent outputs beyond the funding writer's dry-run personas. AgentCore Evaluations provides 13 built-in evaluators. Adding quality scoring to every workflow creates:
- Customer-visible quality grades (trust signal)
- Continuous quality monitoring (catch regressions)
- A/B testing of prompt strategies
- Data for fine-tuning (reward signal)
High-value use case: funding application quality scoring. Your FundingApplicationWriter already has dry-run evaluation with funding_reviewer and funding_technical_reviewer personas. AgentCore Evaluations could formalize this into continuous scoring: grade applications before submission, track score correlation with actual approval outcomes, and use approval data as a reward signal to improve generation quality over time. This is a natural fit: you already have the persona simulation infrastructure, and Evaluations adds the scoring framework and regression detection.
Opportunity 5: Convert Self-Healing Into a Feature¶
Your ExecutionHealthMonitor + SelfHealingConfig + ExecutionHealthEvent stack is genuinely sophisticated. Most SaaS apps don't expose this.
- Surface health dashboards to customers ("Your agents are 94% healthy")
- Let customers tune their own self-healing thresholds
- Create "reliability SLAs" as a premium tier feature
- Market as "Enterprise-grade agent reliability" differentiator
Opportunity 7: AI-Powered Funding Intelligence Platform¶
Your FundingApplicationWriterService already generates applications for 5 AWS Partner Funding programs with KB grounding and dry-run evaluation. This is a significant capability that's not yet integrated into the AgentCore orchestration system — it exists as a standalone service outside the CapabilityRegistry.
Immediate integration gap: The funding writer is not registered as one of the 37 CapabilityRegistry capabilities, meaning it's not in the DynamoDB tool registry, not discoverable via Gateway, and not available for agent-orchestrated workflows. Registering it unlocks:
- Agent-orchestrated funding workflows: WorkflowPlanner can chain funding applications with co-sell matching, deal intelligence, and marketplace optimization into multi-step GTM sequences (e.g., "identify best co-sell partner → generate joint POC funding application → create co-branded content")
- Outcome-driven intelligence: Once Partner Central Benefits API submission goes live, track approval/rejection outcomes per program. This creates a feedback loop: more submissions → pattern recognition on what gets approved → better program-specific guidance → higher success rates. This data is a defensible moat.
- Standalone product potential: Similar to Opportunity 1 (Marketplace Intelligence), funding application assistance could be offered standalone to the 3,000+ AWS Marketplace ISVs who need help navigating AWS Partner Funding programs but don't use your full platform
- Revenue model: Tiered by program complexity — free Innovation Sandbox applications (lead gen), paid for POC/MAP/MDF applications with outcome tracking
- AgentCore relevance: Register as Gateway tool so other agents (partner agents, AWS first-party agents) can request funding applications via A2A protocol — enables agent-mediated joint funding applications between ISV partners
Action items:
- [ ] Register funding_application_writer as a CapabilityRegistry capability (adds to DynamoDB tool registry + S3 MCP schema)
- [ ] Complete Partner Central Benefits API direct submission (currently "coming soon")
- [ ] Add funding_applications table to track submissions, outcomes, and program-specific patterns
- [ ] Build approval rate analytics that feed back into generation prompts (reward signal)
- [ ] Add funding_reviewer dry-run personas to AgentCore Evaluations (Opportunity 4) when adopted
Opportunity 6: Patent Filing¶
Four components are novel enough for provisional patent applications:
- MSS Algorithm — 3-component marketplace SEO scoring with AI Visibility tracking and Bedrock fallback proxy
- Self-Healing Workflow Orchestration — Three-tier stuck detection with exponential backoff auto-restart and per-scope configuration
- Capability-Aware Workflow Planning — Dynamic hint injection based on available capability set with temperature-reducing retry strategy
- AI-Powered Funding Application Intelligence — Program-specific KB-grounded application generation with simulated reviewer evaluation and outcome-driven feedback loop
Migration Roadmap¶
Phase 0: Critical Fixes + Gateway Foundation (Week 1-3) — COMPLETE & DEPLOYED¶
Done regardless of migration decision. Prepares infrastructure for Phase 2 Gateway migration. Deployed: 2026-02-20. All 4 stacks healthy in `us-east-1` (account `253265132499`).
Phase 0A: Application Security Fixes — COMPLETE¶
Critical (SEC-001–005):
- [x] SEC-004: Sort order validation whitelist in AgentController
- [x] SEC-001: IDOR ownership checks — defense-in-depth across AgentPolicy, AgentOrchestrator, AgentController
- [x] SEC-002: Handler namespace allowlist + CapabilityInterface validation in CapabilityRegistry
- [x] SEC-005: SSRF protection via App\Services\UrlValidator (RFC 1918, metadata endpoint, loopback blocking)
- [x] SEC-003: Prompt injection boundary markers + input sanitization + output validation in WorkflowPlanner
- [x] SCALE-002: Kahn's algorithm topological sort replacing O(n³) dependency resolution in WorkflowPlanner
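To illustrate the SCALE-002 change, here is a minimal sketch of Kahn's algorithm, which orders steps in O(V + E) time. It is written in Python rather than the PHP used by WorkflowPlanner and is not the deployed implementation. The `dependencies` mapping shape and the function name are assumptions made for this example.

```python
from collections import deque


def topological_order(dependencies: dict[int, list[int]]) -> list[int]:
    """Return step indices so that each step follows all of its dependencies.

    `dependencies` maps a step index to the indices it depends on
    (hypothetical shape). Raises ValueError on a cycle or unknown index.
    """
    in_degree = {step: 0 for step in dependencies}
    dependents: dict[int, list[int]] = {step: [] for step in dependencies}
    for step, deps in dependencies.items():
        for dep in deps:
            if dep not in dependencies:
                raise ValueError(f"step {step} depends on unknown step {dep}")
            in_degree[step] += 1
            dependents[dep].append(step)

    # Start with every step that has no unmet dependencies.
    ready = deque(sorted(s for s, degree in in_degree.items() if degree == 0))
    order: list[int] = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in dependents[step]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    # Any step left unprocessed is part of a cycle.
    if len(order) != len(dependencies):
        raise ValueError("dependency cycle detected")
    return order


plan = {0: [], 1: [0], 2: [0], 3: [1, 2]}
print(topological_order(plan))
```

Each step and each dependency edge is visited once, which is where the linear bound comes from. The cycle check at the end also gives a cheap rejection path for malformed plans.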
High (SEC-006–009):
- [x] SEC-006: Explicit field whitelist ($request->only()) replacing $validated mass assignment in AgentController::update()
- [x] SEC-007: Standardized policy-based auth — 6 new AgentPolicy methods, 9 inline checks replaced with $this->authorize() in AgentController
- [x] SEC-008: TTL-based credential cache (15-min expiry) + resetClientCache() method in BedrockRuntimeService
- [x] SEC-009: JSON schema validation in WorkflowPlanner::extractJsonFromResponse() — decode depth limit, step key whitelist, capability slug regex, dependency index validation, parameter depth cap
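The SEC-008 fix can be pictured as a small time-bounded cache. The sketch below is a Python analogue under stated assumptions, not the BedrockRuntimeService code: the class name, the `factory` callable, and the injectable `clock` are hypothetical, and only the 15-minute expiry and the reset operation come from the audit.

```python
import time
from typing import Any, Callable

_UNSET = object()  # sentinel so a factory may legitimately return None


class TtlClientCache:
    """Cache one client object and rebuild it once `ttl_seconds` has elapsed."""

    def __init__(
        self,
        factory: Callable[[], Any],
        ttl_seconds: float = 15 * 60,  # 15-minute expiry, as in SEC-008
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._factory = factory
        self._ttl = ttl_seconds
        self._clock = clock
        self._client: Any = _UNSET
        self._expires_at = 0.0

    def get(self) -> Any:
        """Return the cached client, rebuilding it if missing or expired."""
        now = self._clock()
        if self._client is _UNSET or now >= self._expires_at:
            self._client = self._factory()
            self._expires_at = now + self._ttl
        return self._client

    def reset(self) -> None:
        """Drop the cached client, for example after credential rotation."""
        self._client = _UNSET
        self._expires_at = 0.0
```

Using a monotonic clock avoids surprises if the wall clock is adjusted. Making the clock injectable keeps expiry behaviour testable without sleeping.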
Phase 0B: CloudFormation Hardening — COMPLETE & DEPLOYED¶
- WAF: Agent endpoint rate limiting (100 req/5-min per IP) for `/agent`, `/execute`, `/workflow` paths (`vell-api-waf.yaml`)
  - Stack: `prod-api-waf`, UPDATE_COMPLETE (2026-02-20)
- Observability: AgentCore log group (`/vell/{env}/agentcore`), metric filters (execution failures, Bedrock throttles), alarms (>10 failures/5-min, >20 throttles/5-min), X-Ray 100% sampling for agent paths (`vell-observability.yaml`)
  - Stack: `prod-observability`, CREATE_COMPLETE (2026-02-20)
- Bedrock Guardrails: PROMPT_ATTACK filter enabled on Prod (HIGH/NONE) and Enterprise (HIGH/NONE) tiers (`bedrock-guardrails.yml`). Note: Bedrock requires OutputStrength=NONE for PROMPT_ATTACK filter type
  - Stack: `vellocity-bedrock-guardrails`, UPDATE_COMPLETE (2026-02-20)
- IAM Role: Added AgentCore Gateway (ListTools, GetTool, InvokeTool, SearchTools), Observability (GetAgentCoreMetrics, ListAgentCoreTraces), and Memory (GetMemory, PutMemory, DeleteMemory) permissions (`vell-agentcore-bedrock-role.yaml`)
  - Customer-facing template, not deployed in Vell's account (customers deploy in their own accounts)
Phase 0C: AgentCore Gateway Foundation — COMPLETE & DEPLOYED¶
- New CFT: `cloudformation/application/vell-agentcore-gateway.yaml`
  - Stack: `prod-agentcore-gateway`, CREATE_COMPLETE (2026-02-20)
  - DynamoDB table `vell-{env}-gateway-tool-registry` with `category-index` and `marketplace-listing-index` GSIs, PITR in prod
  - S3 bucket `vell-{env}-gateway-tool-schemas` with versioning, Glacier lifecycle, HTTPS-only policy
  - SQS sync queue + DLQ (3 retries, 14-day DLQ retention)
  - EventBridge rule for `CapabilityRegistryChanged`, `ToolSchemaUpdated`, `MarketplaceListingChanged` events
  - Lambda sync function (Python 3.11) with SQS event source mapping
  - IAM role with DynamoDB, S3, Bedrock Gateway, SQS, X-Ray, and CloudWatch permissions
  - CloudWatch dashboard (Lambda invocations/errors/duration, DynamoDB read/write, SQS queue depth, S3 operations)
  - 7 SSM parameters for cross-stack resource discovery
  - CloudWatch alarms for sync errors and DLQ depth
- Artisan command: `php artisan agentcore:sync-tools` (`SyncGatewayToolsCommand.php`)
  - Seeds 37 capabilities from `CapabilityRegistry::bootstrapDefaults()` into DynamoDB
  - Generates MCP-compatible tool schemas to S3 for each capability
  - Flags: `--dry-run`, `--force`, `--capability={slug}`, `--listing={id}`
  - Marketplace listing mapping for 8 future listings (bundle-first strategy)
  - Versioned snapshots in DynamoDB for tool metadata history
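The EventBridge → SQS → Lambda path above can be sketched as a small dispatcher. This is an illustrative outline, not the deployed function. The function names `handler`, `sync_cap`, `upd_schema`, and `upd_listing` match those listed in the deployment log, but their bodies here are placeholders. `route_record` and the `listing_id` field are hypothetical. Returning `batchItemFailures` assumes the event source mapping has partial batch responses enabled.

```python
import json


def sync_cap(detail: dict) -> str:
    # Placeholder: the real function would update the DynamoDB tool registry.
    return f"capability:{detail.get('capability_slug')}"


def upd_schema(detail: dict) -> str:
    # Placeholder: the real function would write an MCP schema to S3.
    return f"schema:{detail.get('capability_slug')}"


def upd_listing(detail: dict) -> str:
    # Placeholder: `listing_id` is an assumed field name.
    return f"listing:{detail.get('listing_id')}"


# One route per event type handled by the EventBridge rule.
ROUTES = {
    "CapabilityRegistryChanged": sync_cap,
    "ToolSchemaUpdated": upd_schema,
    "MarketplaceListingChanged": upd_listing,
}


def route_record(body: str) -> str:
    """Parse one SQS message body (an EventBridge event) and dispatch it."""
    message = json.loads(body)
    route = ROUTES[message["detail-type"]]
    return route(message.get("detail", {}))


def handler(event: dict, context=None) -> dict:
    """Process an SQS batch and report only the records that failed."""
    failures = []
    for record in event.get("Records", []):
        try:
            route_record(record["body"])
        except (KeyError, json.JSONDecodeError):
            failures.append({"itemIdentifier": record.get("messageId")})
    return {"batchItemFailures": failures}
```

Reporting failures per record lets SQS retry only the bad messages. After the configured 3 retries, those messages would land in the DLQ instead of blocking the rest of the batch.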
Phase 0 Files Changed/Created¶
Created:
- app/Services/UrlValidator.php — Reusable SSRF protection utility
- cloudformation/application/vell-agentcore-gateway.yaml — Gateway foundation CFT (DynamoDB, S3, SQS, EventBridge, Lambda, IAM, CloudWatch)
- app/Extensions/ContentManager/System/Console/Commands/SyncGatewayToolsCommand.php — Artisan agentcore:sync-tools command
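As a rough picture of what an SSRF guard like UrlValidator checks, the sketch below rejects URLs that resolve to private (RFC 1918), loopback, or link-local addresses. Link-local covers the 169.254.169.254 metadata endpoint. It is a Python approximation with hypothetical names (`is_safe_url`, `resolve`), not a port of the PHP class.

```python
import ipaddress
import socket
from typing import Callable
from urllib.parse import urlsplit


def _resolve(host: str) -> list[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_safe_url(url: str, resolve: Callable[[str], list[str]] = _resolve) -> bool:
    """Return True only if every address the host resolves to is public."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolve(parts.hostname)
    except OSError:
        return False  # unresolvable hosts are rejected
    if not addresses:
        return False
    for raw in addresses:
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            return False
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_reserved or ip.is_multicast or ip.is_unspecified):
            return False
    return True
```

A check like this is only effective if the HTTP client then connects to the address that was validated. Otherwise a second DNS lookup can return a different, internal address.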
Modified:
- app/Extensions/ContentManager/System/Policies/AgentPolicy.php — Added execute(), viewExecution(), cancel() methods (SEC-001); Added create(), deleteExecution(), rerunExecution(), restartExecution(), archiveExecution(), toggleNotification() methods (SEC-007)
- app/Extensions/ContentManager/System/Services/AgentCore/AgentOrchestrator.php — Ownership guards on 3 methods
- app/Extensions/ContentManager/System/Services/AgentCore/CapabilityRegistry.php — Handler namespace allowlist + interface validation
- app/Extensions/ContentManager/System/Services/AgentCore/WorkflowPlanner.php — Prompt injection protection + Kahn's algorithm (SEC-003/SCALE-002); JSON schema validation with step key whitelist, capability slug regex, dependency index validation, parameter depth cap, decode depth limit (SEC-009)
- app/Extensions/ContentManager/System/Http/Controllers/AgentController.php — Sort order validation, SSRF protection, policy method updates (SEC-001/004/005); Explicit field whitelist for mass assignment (SEC-006); 9 inline auth checks replaced with policy-based $this->authorize() (SEC-007)
- app/Services/Bedrock/BedrockRuntimeService.php — TTL-based credential cache (15-min expiry), resetClientCache() static method (SEC-008)
- cloudformation/application/vell-api-waf.yaml — Agent endpoint rate limiting rule + alarm
- cloudformation/application/vell-observability.yaml — AgentCore log group, metric filters, alarms, X-Ray sampling
- cloudformation/application/bedrock-guardrails.yml — PROMPT_ATTACK filters on Prod/Enterprise
- app/CustomExtensions/CloudMarketplace/resources/cloudformation/vell-agentcore-bedrock-role.yaml — Gateway, Observability, Memory IAM permissions
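The SEC-009 checks listed for WorkflowPlanner (decode depth limit, step key whitelist, capability slug regex, dependency index validation, parameter depth cap) could look roughly like the following. Every concrete value here is an assumption for illustration: the allowed keys, the slug pattern, and both depth limits are not taken from the codebase. Note also that Python's `json.loads` has no depth argument, so depth is measured after decoding.

```python
import json
import re

ALLOWED_STEP_KEYS = {"capability", "parameters", "depends_on"}  # assumed keys
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:[_-][a-z0-9]+)*")       # assumed format
MAX_PLAN_DEPTH = 10   # assumed limit
MAX_PARAM_DEPTH = 5   # assumed limit


def depth(value) -> int:
    """Nesting depth of dicts/lists; scalars have depth 0."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0


def validate_plan(raw: str) -> list:
    """Decode a model-generated plan and reject anything outside the schema."""
    try:
        plan = json.loads(raw)
    except RecursionError as exc:
        raise ValueError("plan is nested too deeply") from exc
    if not isinstance(plan, list):
        raise ValueError("plan must be a JSON array of steps")
    if depth(plan) > MAX_PLAN_DEPTH:
        raise ValueError("plan exceeds maximum nesting depth")
    for index, step in enumerate(plan):
        if not isinstance(step, dict):
            raise ValueError(f"step {index} must be an object")
        unknown = set(step) - ALLOWED_STEP_KEYS
        if unknown:
            raise ValueError(f"step {index} has unexpected keys: {sorted(unknown)}")
        slug = step.get("capability")
        if not isinstance(slug, str) or not SLUG_PATTERN.fullmatch(slug):
            raise ValueError(f"step {index} has an invalid capability slug")
        deps = step.get("depends_on", [])
        if not isinstance(deps, list):
            raise ValueError(f"step {index} depends_on must be a list")
        for dep in deps:
            is_index = isinstance(dep, int) and not isinstance(dep, bool)
            if not is_index or not 0 <= dep < len(plan) or dep == index:
                raise ValueError(f"step {index} has an invalid dependency: {dep!r}")
        if depth(step.get("parameters", {})) > MAX_PARAM_DEPTH:
            raise ValueError(f"step {index} parameters are nested too deeply")
    return plan
```

Rejecting unknown keys, rather than silently dropping them, makes unexpected model output visible in logs. That is useful when tuning the planner prompt.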
Modified (Infrastructure Health Check + Tool Registry Seeding 2026-02-20):
- app/Extensions/ContentManager/System/ContentManagerServiceProvider.php — Registered SyncGatewayToolsCommand in registerCommands() (was missing — root cause of "no commands defined in agentcore namespace" error)
- cloudformation/stacks/prod/prod-security.yml — Added logs:TagResource to app-ec2-perms policy (AWS requirement for CreateLogGroup with tags). Added dynamodb:PutItem/GetItem/Query on vell-{env}-gateway-tool-registry and s3:PutObject/GetObject on vell-{env}-gateway-tool-schemas-* for agentcore:sync-tools
- vell/codedeploy/after-install.sh — Added agentcore:sync-tools to CodeDeploy AfterInstall hook (runs as $WEB_USER, non-fatal on failure)
- app/Extensions/ContentManager/System/Console/Commands/SyncGatewayToolsCommand.php — Replaced app(DynamoDbClient::class) / app(S3Client::class) with direct instantiation using region config (no service container binding existed for AWS SDK clients)
Phase 0 Deployment Log (2026-02-20)¶
Deployment fixes applied during launch (3 template issues discovered and resolved):
1. `bedrock-guardrails.yml`: Enterprise PROMPT_ATTACK OutputStrength constraint
   - Issue: Enterprise tier had `OutputStrength: MEDIUM` for PROMPT_ATTACK filter
   - Error: Bedrock API rejected with "PROMPT ATTACK content filter strength for response must be NONE"
   - Fix: Changed to `OutputStrength: NONE`, added clarifying comment
   - Root cause: Bedrock enforces `OutputStrength: NONE` for `PROMPT_ATTACK` filter type (output-side prompt attack detection is not supported)
2. `vell-agentcore-gateway.yaml`: Lambda ZipFile size limit
   - Issue: Lambda inline code was 4,826 bytes, exceeding CloudFormation's 4,096-byte ZipFile limit
   - Error: `PropertyValidation` hook rejected changeset
   - Fix: Minified Lambda code from 4,826 to 2,843 bytes while preserving all functionality (`handler()`, `sync_cap()`, `upd_schema()`, `upd_listing()`)
3. `vell-agentcore-gateway.yaml`: S3 lifecycle property name
   - Issue: Used `NoncurrentVersionTransition` (singular) with a list value
   - Error: `PropertyValidation` hook rejected changeset (cfn-lint: E3012)
   - Fix: Changed to `NoncurrentVersionTransitions` (plural) for list form
Final stack states:
| Stack | Operation | Status | Timestamp (UTC) |
|---|---|---|---|
| `vellocity-bedrock-guardrails` | UPDATE | UPDATE_COMPLETE | 2026-02-20T01:23:24 |
| `prod-api-waf` | UPDATE | UPDATE_COMPLETE | 2026-02-20T01:32:44 |
| `prod-observability` | CREATE | CREATE_COMPLETE | 2026-02-20T01:37:17 |
| `prod-agentcore-gateway` | CREATE | CREATE_COMPLETE | 2026-02-20T01:41:01 |
Post-deployment action: Re-subscribed ops@vell.ai to SNS vell-prod-critical-alerts topic (subscription pending email confirmation).
Tool Registry Seeding (2026-02-20):
4 blockers discovered and resolved across 3 CodeDeploy deployments:
| Deployment | Commit | Fix | Result |
|---|---|---|---|
| `d-0WIFWHIXH` | `96a9d2f53` | (pre-fix baseline) | AfterInstall succeeded but `agentcore:sync-tools` not in hook |
| `d-7X2THSIXH` | `f42f46f44` | Service provider registration + IAM permissions + AfterInstall hook + SSM env vars (v49) | Command found but `app(DynamoDbClient::class)` → BindingResolutionException |
| `d-UK7H68JXH` | `ff5a7cc69` | Direct AWS SDK client instantiation (`new DynamoDbClient()`/`new S3Client()`) | SUCCESS: 37 capabilities synced |
Final tool registry state:
- DynamoDB: 111 items (37 latest + 74 versioned snapshots)
- S3: 37 MCP tool schemas
- 9 marketplace listings: brand-knowledge (5), competitive-intelligence (1), content-generation (5), cosell-partner-intelligence (6), deal-intelligence (3), gtm-planning (4), marketplace-intelligence (6), seo-intelligence (6), vell-platform (1)
Infrastructure Health Check (2026-02-19)¶
Verified all Phase 0 infrastructure from live AWS account `253265132499` in `us-east-1`.
Stack Status:
| Stack | Status | Last Updated (UTC) |
|---|---|---|
| `prod-agentcore-gateway` | CREATE_COMPLETE | 2026-02-20T01:41:01 |
| `prod-observability` | CREATE_COMPLETE | 2026-02-20T01:37:17 |
| `prod-api-waf` | UPDATE_COMPLETE | 2026-02-20T01:32:44 |
| `vellocity-bedrock-guardrails` | UPDATE_COMPLETE | 2026-02-20T01:23:24 |
Resource Verification:
| Resource | Status | Details |
|---|---|---|
| DynamoDB `vell-prod-gateway-tool-registry` | ACTIVE | PITR enabled (35-day recovery). 2 GSIs active (category-index, marketplace-listing-index). 111 items: 37 capabilities × 3 (latest + 2 versioned snapshots from rolling deploy across 2 instances). |
| S3 `vell-prod-gateway-tool-schemas-253265132499` | ACTIVE | Versioning enabled. 37 MCP tool schemas uploaded across 9 marketplace listings. |
| Lambda `vell-prod-gateway-tool-sync` | Active | Python 3.11, 256MB, 60s timeout, X-Ray Active. SQS event source mapping enabled. 0 invocations since deployment. |
| SQS `vell-prod-gateway-sync` | ACTIVE | 0 messages in-flight. Queue healthy. |
| SQS DLQ `vell-prod-gateway-sync-dlq` | ACTIVE | 0 messages. No failed sync events. |
| EventBridge `vell-prod-gateway-tool-sync` | ENABLED | Listening for CapabilityRegistryChanged, ToolSchemaUpdated, MarketplaceListingChanged from vell.agentcore. |
| WAF `vell-prod-api-waf` | ACTIVE | 7 rules (3 AWS managed + 3 rate limits + 1 count-mode burst detector). Associated with API Gateway stage prod (qkxjis5iel). |
| Bedrock Guardrails | ACTIVE | 3 tiers: Dev (uvishu7ijb29), Prod (5bf7khsguf6i), Enterprise (7s64nv00v8t5). |
| CloudWatch Alarms | ALL OK | 22 alarms verified — all in OK state including gateway-specific vell-prod-gateway-sync-errors and vell-prod-gateway-sync-dlq-depth. |
Issues Found (ACTION REQUIRED):
| Issue | Severity | Details | Action |
|---|---|---|---|
| INFRA-001: SNS subscriptions not confirmed | HIGH | Both admin@vell.ai and ops@vell.ai were PendingConfirmation on vell-prod-critical-alerts. CloudWatch alarms fire but no one receives alerts. | RESOLVED (2026-02-19). Both subscriptions confirmed: ops@vell.ai (ccc72725-6796-4e86-95c3-cb5a6e822543), admin@vell.ai (7ce4e5c2-d039-44bd-ae1b-9d40071e1585). |
| INFRA-002: Tool registry not seeded | MEDIUM | ~~DynamoDB table has 0 items.~~ | RESOLVED (2026-02-20). 4 blockers found and fixed across 3 deployments: (1) SyncGatewayToolsCommand not registered in ContentManagerServiceProvider::registerCommands(), (2) EC2 instance role missing logs:TagResource + DynamoDB/S3 permissions — prod-security.yml updated (stack UPDATE_COMPLETE 2026-02-20T03:05:26 UTC), (3) AGENTCORE_GATEWAY_TABLE and AGENTCORE_GATEWAY_SCHEMA_BUCKET env vars missing from SSM /prod/app/.env — added (version 49), (4) AWS SDK clients resolved via app() with no container binding — replaced with direct new DynamoDbClient()/new S3Client() instantiation. Result: 37 capabilities synced, 111 DynamoDB items (37 latest + 74 versioned snapshots), 37 MCP schemas in S3 across 9 marketplace listings. agentcore:sync-tools now runs automatically on every CodeDeploy AfterInstall. |
| INFRA-003: Sync pipeline untested in prod | LOW | Lambda has 0 invocations. The EventBridge → SQS → Lambda pipeline has never been triggered. It is expected to work on the first real event, but no smoke test has validated the end-to-end flow. | Publish a test CapabilityRegistryChanged event via EventBridge to validate the pipeline: `aws events put-events --entries '[{"Source":"vell.agentcore","DetailType":"CapabilityRegistryChanged","Detail":"{\"capability_slug\":\"test\",\"action\":\"test\"}"}]'` |
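The INFRA-003 smoke-test payload is easy to get wrong by hand because `Detail` is a JSON string nested inside JSON. The sketch below builds the same entry programmatically and prints a shell-safe command. The helper names are hypothetical. The event source, detail type, and detail fields are the ones given in the table above.

```python
import json
import shlex


def build_test_entry(capability_slug: str = "test", action: str = "test") -> dict:
    """Build one put-events entry; Detail must be a JSON-encoded string."""
    return {
        "Source": "vell.agentcore",
        "DetailType": "CapabilityRegistryChanged",
        "Detail": json.dumps({"capability_slug": capability_slug, "action": action}),
    }


def put_events_command(entries: list) -> str:
    """Render the AWS CLI command with the entries safely shell-quoted."""
    return "aws events put-events --entries " + shlex.quote(json.dumps(entries))


print(put_events_command([build_test_entry()]))
```

After publishing, a `FailedEntryCount` of 0 in the response only confirms that EventBridge accepted the event. The Lambda invocation count and the DLQ depth are what show whether the rest of the pipeline ran.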
Phase 1: Adopt AgentCore Observability (Week 4-5)¶
Lowest risk, highest immediate value. OBS-002/OBS-003 infrastructure layer done in Phase 0B; this phase adds application-level instrumentation.
- Enable AgentCore Observability for existing Bedrock API calls
- Add OTEL instrumentation to `AgentOrchestrator` and `BedrockRuntimeService`
- Create CloudWatch dashboards for workflow health metrics
- Set up CloudWatch Alarms → SNS → Slack/PagerDuty for failure rate thresholds
- Add correlation IDs (execution UUID) to all service calls
- Build admin dashboard view combining `failed_jobs` + `agent_executions` + AgentCore metrics
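The correlation-ID item in Phase 1 amounts to attaching the execution UUID to every log line emitted during a workflow. The sketch below shows one way to do that in Python with a context variable and a logging filter. It is a language-neutral illustration, not the planned PHP implementation, and all names in it are hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Holds the current execution UUID; "-" means no execution is active.
execution_id = contextvars.ContextVar("execution_id", default="-")


class ExecutionIdFilter(logging.Filter):
    """Copy the current execution UUID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.execution_id = execution_id.get()
        return True


def build_logger(name: str = "agentcore", stream=None) -> logging.Logger:
    """Create a logger whose lines always carry the execution UUID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s execution=%(execution_id)s %(message)s")
    )
    handler.addFilter(ExecutionIdFilter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def run_execution(logger: logging.Logger, steps: list) -> str:
    """Run steps under a fresh execution UUID and return that UUID."""
    token = execution_id.set(str(uuid.uuid4()))
    try:
        for step in steps:
            logger.info("running step %s", step)
        return execution_id.get()
    finally:
        execution_id.reset(token)  # never leak the ID into the next execution
```

With the ID on every line, a single CloudWatch Logs query on the execution UUID can pull together orchestrator, Bedrock, and queue log entries for one workflow.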
Phase 2: Migrate to AgentCore Gateway (Week 6-9)¶
Tool registry security fixed (SEC-002). DynamoDB registry + S3 MCP schemas deployed (Phase 0C). This phase registers tools with live Gateway.
- Register capabilities with live AgentCore Gateway using MCP schemas from S3
- Enable semantic tool selection for capability discovery
- Migrate outbound auth (HubSpot, LinkedIn, Slack, etc.) to Gateway auth management
- Implement Gateway interceptors for tool-call-level authorization
- Transition `CapabilityRegistry` to read from Gateway (keep local DynamoDB as fallback)
- Enable AgentCore Policy for tool-call enforcement
- Configure marketplace listing unbundling (8 listings in DynamoDB, activate per listing)
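The "read from Gateway, keep local DynamoDB as fallback" step can be sketched as a lookup that consults the local registry only when the Gateway call itself fails. Both lookup callables and the `on_fallback` hook are hypothetical stand-ins. Treating a Gateway miss as authoritative is one possible design choice, not something the roadmap specifies.

```python
from typing import Callable, Optional, Tuple

Lookup = Callable[[str], Optional[dict]]


def find_tool(
    slug: str,
    gateway_lookup: Lookup,
    local_lookup: Lookup,
    on_fallback: Optional[Callable[[str, Exception], None]] = None,
) -> Tuple[dict, str]:
    """Return (tool, source), where source is "gateway" or "local".

    Falls back to the local registry only on connection or timeout errors.
    A tool that Gateway reports as missing is treated as missing.
    """
    try:
        tool = gateway_lookup(slug)
    except (ConnectionError, TimeoutError) as exc:
        if on_fallback is not None:
            on_fallback(slug, exc)  # e.g. emit a metric so fallbacks are visible
        tool = local_lookup(slug)
        if tool is None:
            raise KeyError(slug) from exc
        return tool, "local"
    if tool is None:
        raise KeyError(slug)
    return tool, "gateway"
```

Recording each fallback matters here. Without a metric or log entry, a Gateway outage could be masked indefinitely by the local copy.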
Phase 3: Adopt AgentCore Memory (Week 9-10)¶
Replaces fragile fallback code
- Migrate `AgentMemoryService` to native AgentCore Memory API
- Enable episodic memory for GTM workflow learning
- Remove local session fallback code
- Add memory health status to execution UI
Phase 4: Migrate to AgentCore Runtime (Week 11-16)¶
Largest change — replaces queue-based execution
- Package `AgentOrchestrator` as AgentCore Runtime-compatible agent
- Implement session-based execution (replace `ProcessAgentWorkflowJob`)
- Enable microVM isolation per execution
- Extend timeout from 10min → up to 8hrs for complex workflows
- Migrate `ExecutionHealthMonitor` to monitor Runtime sessions instead of queue jobs
- Implement A2A protocol for multi-agent GTM workflows
- Decommission `ProcessAgentWorkflowJob` and self-healing queue infrastructure
Phase 5: Adopt AgentCore Identity (Week 17-18)¶
Unifies fragmented auth
- Configure AgentCore Identity with existing Cognito user pool
- Migrate per-user Bedrock credentials to Identity-managed access
- Enable custom claims for multi-tenant team isolation
- Remove static credential cache from `BedrockRuntimeService`
Cost Estimation¶
Current Infrastructure Costs (Estimated)¶
| Component | Cost Driver | Monthly Estimate |
|---|---|---|
| Queue workers (EC2/ECS) | Always-on compute for job processing | $200-500 |
| Redis (ElastiCache) | Queue + cache | $100-300 |
| Health monitoring (scheduler) | EC2 compute for cron | Included above |
| Developer time (maintenance) | Bug fixes, monitoring, on-call | 40-80 hrs/month |
Projected AgentCore Costs (at 10,000 executions/month)¶
| Service | Usage | Monthly Cost |
|---|---|---|
| Runtime | 10K sessions × ~18s active CPU × 1 vCPU + 60s × 2GB memory | ~$8 |
| Gateway | 10K sessions × 5 tool calls = 50K invocations | ~$0.25 |
| Memory | 50K short-term events + 5K long-term + 10K retrievals | ~$21 |
| Identity | Free via Runtime/Gateway | $0 |
| Observability | CloudWatch log storage (~5GB) | ~$3 |
| Policy | 50K tool calls × 1 policy check | ~$1.25 |
| Total AgentCore | | ~$33/month |
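The totals can be cross-checked with a few lines of arithmetic. The figures below are copied from the tables in this section. The per-million rates are simply what the Gateway and Policy rows imply at 50K calls, not quoted AWS prices, and the helper names are illustrative.

```python
# Monthly estimates from the projected-cost table (USD).
AGENTCORE_MONTHLY = {
    "Runtime": 8.00,
    "Gateway": 0.25,
    "Memory": 21.00,
    "Identity": 0.00,
    "Observability": 3.00,
    "Policy": 1.25,
}

# Current compute range: queue workers ($200-500) + Redis ($100-300).
CURRENT_COMPUTE_RANGE = (200 + 100, 500 + 300)


def total_agentcore() -> float:
    """Sum the per-service monthly estimates."""
    return sum(AGENTCORE_MONTHLY.values())


def savings_range(total: float) -> tuple:
    """Monthly savings at the low and high ends of current compute cost."""
    low, high = CURRENT_COMPUTE_RANGE
    return low - total, high - total


def implied_rate_per_million(monthly_cost: float, calls: int) -> float:
    """Unit price per million calls implied by a table row."""
    return monthly_cost * 1_000_000 / calls


total = total_agentcore()
print(f"AgentCore total: ${total:.2f}/month")
print("Savings range:", savings_range(total))
print("Implied Gateway rate per 1M calls:", implied_rate_per_million(0.25, 50_000))
print("Implied Policy rate per 1M calls:", implied_rate_per_million(1.25, 50_000))
```

The exact sum is $33.50, which the table rounds to ~$33. The savings figures of $267-767 in the next table use the rounded value.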
Savings Analysis¶
| Category | Before | After | Savings |
|---|---|---|---|
| Infrastructure compute | $300-800/mo | ~$33/mo | $267-767/mo |
| Developer maintenance | 40-80 hrs/mo | 10-20 hrs/mo | 30-60 hrs/mo |
| Security risk exposure | ~~5 critical~~ 0 critical + ~~4 high~~ 0 high vulns (Phase 0A) | Addressed by managed services | Reduced attack surface |
| Incident response | Reactive (no alerting) | Proactive (CloudWatch alarms) | Faster MTTR |
Note: Bedrock model inference costs (Claude tokens) remain the same regardless of migration. These estimates cover only the orchestration infrastructure layer.