VPC Endpoint Cleanup — May 2026

Why

The April 2026 invoice audit found VpcEndpoint-Hours running at $496.80/mo: 49,680 endpoint-hours from 23 Interface Endpoints across prod-vpc, dev-vpc, demo-vpc, and network-hardening. Combined with 5 NAT Gateways at $162/mo, the VPCs were paying for two parallel egress paths to AWS APIs.
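The invoice figures can be reproduced with a little arithmetic. The sketch below assumes the us-east-1 list prices of $0.01 per endpoint-hour per AZ and $0.045 per NAT Gateway-hour, a 720-hour month (30-day April), and three AZ attachments per endpoint. The AZ count is inferred from the billed hours and is not stated in the audit.

```shell
#!/bin/sh
# Assumed inputs: list prices and a 3-AZ footprint inferred from the billed hours.
ENDPOINTS=23
AZS_PER_ENDPOINT=3      # assumption: 49,680 h / 23 endpoints / 720 h = 3
HOURS_PER_MONTH=720     # 30-day month
NAT_GATEWAYS=5

# Interface endpoints are billed per endpoint, per AZ, per hour.
endpoint_hours=$((ENDPOINTS * AZS_PER_ENDPOINT * HOURS_PER_MONTH))
endpoint_cost=$(awk -v h="$endpoint_hours" 'BEGIN { printf "%.2f", h * 0.01 }')

# NAT Gateways are billed per gateway-hour (data processing not included here).
nat_cost=$(awk -v n="$NAT_GATEWAYS" -v h="$HOURS_PER_MONTH" 'BEGIN { printf "%.2f", n * h * 0.045 }')

echo "endpoint-hours: $endpoint_hours"
echo "endpoint cost:  \$$endpoint_cost/mo"
echo "NAT cost:       \$$nat_cost/mo"
```

Under these assumptions the script reproduces the 49,680 endpoint-hours, $496.80/mo, and $162/mo quoted above.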

This cleanup removes only endpoints with verified zero traffic in the preceding 30 days. The original audit recommended a wider 9-endpoint cut on the assumption that Secrets Manager / KMS / CloudWatch Logs payloads were KB-scale. CloudWatch AWS/PrivateLinkEndpoints metrics over 2026-04-04 → 2026-05-04 disproved that assumption — see "What we kept and why" below.

Scope

| Stack | Removed | Kept | Approx. savings |
| --- | --- | --- | --- |
| network-hardening | EpSsmContacts, EpIncidentManager | EpEcrApi, EpEcrDkr, EpEc2Api | ~$44/mo |
| dev-vpc (via env-vpc.yml) | KmsEndpoint | SSM trio, SecretsEndpoint, LogsEndpoint, Gateway endpoints | ~$22/mo |
| demo-vpc (via env-vpc.yml) | KmsEndpoint | same as dev | ~$22/mo |
| prod-vpc (unchanged) | (none) | all 6 | $0 |
| Total | 4 endpoints removed across 3 resource blocks | | ~$88/mo |

Why these specific endpoints (verified zero usage)

30-day CloudWatch metrics (BytesProcessed + NewConnections, summed across all subnet ENIs per endpoint, window 2026-04-04 → 2026-05-04):

| Endpoint | Bytes | Connections | Action |
| --- | --- | --- | --- |
| prod-ssm-contacts (vpce-0bb935982984ce5ee) | 0 | 0 | Remove |
| prod-ssm-incidents (vpce-025391667e496086a) | 0 | 0 | Remove |
| dev-kms (vpce-0604244fbd2c77bd9) | 0 | 0 | Remove |
| demo-kms (vpce-02d28a469f6867a8e) | 0 | 0 | Remove |

Incident Manager features (contacts, incidents) were never configured. The KMS endpoints receive no direct SDK traffic: server-side encryption for Aurora, S3, and Secrets Manager stays AWS-internal and does not transit the endpoint.
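The removal rule is deliberately strict: an endpoint qualifies only when both 30-day sums are zero. A minimal sketch of that rule follows; the function name classify_endpoints and the "name bytes connections" input layout are hypothetical, and the prod-ec2 row uses an approximate byte count (3.8 MB) taken from the kept-endpoints table.

```shell
# Classify endpoints from "name bytes connections" rows on stdin:
# "Remove" only when both 30-day sums are zero, otherwise "Keep".
classify_endpoints() {
  awk '{ action = ($2 == 0 && $3 == 0) ? "Remove" : "Keep"; print $1, action }'
}

# Sample rows: the four zero-traffic endpoints plus one lightly used endpoint.
classify_endpoints <<'EOF'
prod-ssm-contacts 0 0
prod-ssm-incidents 0 0
dev-kms 0 0
demo-kms 0 0
prod-ec2 3800000 300
EOF
```

Requiring both metrics to be zero guards against an endpoint that opens connections but moves almost no data, which is the prod-ec2 case.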

What we kept and why

| Endpoint | Bytes (30d) | Connections (30d) | Reason kept |
| --- | --- | --- | --- |
| prod-ec2 (vpce-0292c6c6ceb26fdb3) | 3.8 MB | 300 | Light but non-zero. Likely SSM agent metadata, CodeDeploy describe-instances hooks, or Image Builder pipeline. Trace caller via CloudTrail before considering removal. |
| dev-secretsmanager | 512 MB | 52K | Heavy use: refresh-secrets.sh cron at */5 × 3 secrets. NAT path works functionally but adds latency on hot path. |
| demo-secretsmanager | 495 MB | 49K | Same pattern as dev. |
| dev-logs | 7.0 GB | 481K | CloudWatch Logs agent + Laravel LOG_CHANNEL=cloudwatch ship through this. |
| demo-logs | 3.6 GB | 231K | Same pattern as dev. |

Future consideration: removing the 5 "kept" endpoints

The dollar math still favors removal — at $0.045/GB NAT processing, the ~12 GB/mo of dev+demo log+secrets traffic costs ~$0.55/mo via NAT, vs ~$108/mo for the 5 endpoints. But three operational concerns warrant deferring:

  1. Latency: Interface endpoints keep traffic on private addresses inside the VPC (~1-2ms). The NAT path exits via an Elastic IP to the service's public endpoint (~5-15ms added). Not user-visible per request, but it accumulates across hundreds of thousands of cron and log-shipping calls.
  2. Posture: The hardening stack is named that for a reason. Removing endpoints means Laravel log shipping and Secrets Manager calls leave the VPC through NAT to public AWS endpoints (still TLS, but publicly addressed). That reverses the original reason these endpoints were added.
  3. NAT SPOF interaction: Workstream #7 in the cost audit proposes moving from OnePerAZ to SingleNat to save $65/mo. Combined with this, all dev/demo egress would route through one NAT in one AZ. If pursuing #7, keep these endpoints to retain a private path for AWS API calls.

Decision: revisit if/when workstream #7 is resolved.
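The dollar comparison above can be checked directly. This sketch sums the dev and demo Secrets Manager and Logs traffic from the kept-endpoints table and assumes the us-east-1 list prices ($0.045/GB NAT data processing, $0.01 per endpoint-hour per AZ), three AZs per endpoint, and a 720-hour month. It ignores the small prod-ec2 traffic and the endpoints' own per-GB processing charge.

```shell
#!/bin/sh
# 30-day traffic through the dev/demo Secrets Manager and Logs endpoints, in GB
# (values from the kept-endpoints table; MB converted at 1000 MB per GB).
TRAFFIC_GB="0.512 0.495 7.0 3.6"

total_gb=$(echo "$TRAFFIC_GB" | awk '{ for (i = 1; i <= NF; i++) s += $i; printf "%.1f", s }')

# Cost of pushing the same bytes through NAT instead (assumed $0.045/GB).
nat_cost=$(awk -v gb="$total_gb" 'BEGIN { printf "%.2f", gb * 0.045 }')

# Hourly cost of the 5 kept endpoints (assumed 3 AZs each, 720 h, $0.01/h).
endpoint_cost=$(awk 'BEGIN { printf "%.2f", 5 * 3 * 720 * 0.01 }')

echo "traffic: ${total_gb} GB  NAT: \$${nat_cost}/mo  endpoints: \$${endpoint_cost}/mo"
```

The summed table values come to 11.6 GB, which the text rounds up to ~12 GB, so the NAT figure lands slightly below the quoted ~$0.55.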

Deploy sequence

Step 0 — pre-flight verification

Pre-change drift baseline is clean as of 2026-05-04. Commit 47a047644 (refresh-secrets cron + chmod fix) is in main and was deployed via prod-pipeline (parent b00aca35d6). The functional fix ships through CodeDeploy after-install.sh, so currently running instances have the fix without an instance refresh.

Step 1 — preview network-hardening with a change-set

CS_NAME="endpoint-cleanup-$(date +%s)"

aws cloudformation create-change-set \
  --profile vell-prod-admin --region us-east-1 \
  --stack-name network-hardening \
  --change-set-name "$CS_NAME" \
  --change-set-type UPDATE \
  --template-body file://cloudformation/hardening/network-hardening.yml \
  --parameters ParameterKey=EnvironmentName,UsePreviousValue=true \
               ParameterKey=VpcId,UsePreviousValue=true \
               ParameterKey=EndpointSecurityGroupId,UsePreviousValue=true \
               ParameterKey=FlowLogRetentionDays,UsePreviousValue=true \
               ParameterKey=FlowLogsRoleName,UsePreviousValue=true \
               ParameterKey=FlowLogsLogGroupName,UsePreviousValue=true \
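  --capabilities CAPABILITY_NAMED_IAM

# create-change-set returns before the change set is computed; wait for it.
aws cloudformation wait change-set-create-complete \
  --profile vell-prod-admin --region us-east-1 \
  --stack-name network-hardening --change-set-name "$CS_NAME"
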
aws cloudformation describe-change-set \
  --profile vell-prod-admin --region us-east-1 \
  --stack-name network-hardening --change-set-name "$CS_NAME" \
  --query 'Changes[].ResourceChange.[Action,LogicalResourceId,ResourceType]' \
  --output table

Expected output: exactly 2 Remove rows (EpSsmContacts, EpIncidentManager). Anything else → stop, investigate.
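The "exactly these Remove rows, nothing else" check can be made mechanical rather than eyeballed. A minimal sketch, assuming the same describe-change-set query is run with --output text (tab-separated rows) and piped in; the function name changes_match_expected is hypothetical.

```shell
# Succeeds only if the change set (read as "Action<TAB>LogicalId<TAB>Type" rows
# on stdin) contains exactly the expected Remove actions and nothing else.
#
# Usage, reusing the query from Step 1 with text output:
#   aws cloudformation describe-change-set ... \
#     --query 'Changes[].ResourceChange.[Action,LogicalResourceId,ResourceType]' \
#     --output text \
#     | changes_match_expected EpSsmContacts EpIncidentManager \
#     || echo "Unexpected changes: stop and investigate" >&2
changes_match_expected() {
  # One "Remove <LogicalId>" line per argument, sorted for comparison.
  expected=$(printf 'Remove %s\n' "$@" | sort)
  # Keep only the action and logical ID columns of each non-empty row.
  actual=$(awk 'NF { print $1, $2 }' | sort)
  [ "$expected" = "$actual" ]
}
```

Any extra row (a Modify, an Add, or a Remove of a different resource) makes the comparison fail, which matches the stop-and-investigate rule.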

Step 2 — execute network-hardening

aws cloudformation execute-change-set \
  --profile vell-prod-admin --region us-east-1 \
  --stack-name network-hardening --change-set-name "$CS_NAME"

aws cloudformation wait stack-update-complete \
  --profile vell-prod-admin --region us-east-1 \
  --stack-name network-hardening

Step 3 — dev-vpc

CS_NAME="endpoint-cleanup-$(date +%s)"

aws cloudformation create-change-set \
  --profile vell-prod-admin --region us-east-1 \
  --stack-name dev-vpc \
  --change-set-name "$CS_NAME" \
  --change-set-type UPDATE \
  --template-body file://cloudformation/stacks/env/env-vpc.yml \
  --parameters ParameterKey=EnvironmentName,UsePreviousValue=true \
               ParameterKey=VpcCidr,UsePreviousValue=true \
               ParameterKey=NatStrategy,UsePreviousValue=true \
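  --capabilities CAPABILITY_IAM

# create-change-set returns before the change set is computed; wait for it.
aws cloudformation wait change-set-create-complete \
  --profile vell-prod-admin --region us-east-1 \
  --stack-name dev-vpc --change-set-name "$CS_NAME"
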
aws cloudformation describe-change-set \
  --profile vell-prod-admin --region us-east-1 \
  --stack-name dev-vpc --change-set-name "$CS_NAME" \
  --query 'Changes[].ResourceChange.[Action,LogicalResourceId,ResourceType]' \
  --output table

Expected output: exactly 1 Remove row (KmsEndpoint).

aws cloudformation execute-change-set \
  --profile vell-prod-admin --region us-east-1 \
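  --stack-name dev-vpc --change-set-name "$CS_NAME"

aws cloudformation wait stack-update-complete \
  --profile vell-prod-admin --region us-east-1 \
  --stack-name dev-vpc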

Step 4 — soak verification on dev (24-48h)

  • /var/log/vell-boot.log shows secrets refresh continuing (we kept SecretsEndpoint)
  • CloudWatch log groups (/dev/laravel, etc.) continue receiving events (we kept LogsEndpoint)
  • Session Manager still works on dev-web instances (we kept SSM trio)
  • App health checks pass; ALB target group reports healthy
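The log-delivery check in the list above can be scripted for the soak window. A minimal sketch covering only that one item, assuming the /dev/laravel log group named above; the function name log_group_is_fresh is hypothetical. CloudWatch may update stream timestamps with some delay, so a threshold of an hour or more is safer than a few minutes.

```shell
# Succeeds if the most recently active stream in a log group ingested events
# within the last MAX_AGE seconds.
# Usage: log_group_is_fresh /dev/laravel 3600 || echo "log shipping stalled" >&2
log_group_is_fresh() {
  group="$1"
  max_age="$2"
  # Newest stream first; lastIngestionTime is in epoch milliseconds.
  last_ms=$(aws logs describe-log-streams \
    --profile vell-prod-admin --region us-east-1 \
    --log-group-name "$group" \
    --order-by LastEventTime --descending --limit 1 \
    --query 'logStreams[0].lastIngestionTime' --output text)
  # "None" or an empty result means no streams were found: treat as stale.
  case "$last_ms" in
    ''|*[!0-9]*) return 1 ;;
  esac
  now_s=$(date +%s)
  [ $((now_s - last_ms / 1000)) -le "$max_age" ]
}
```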

Step 5 — demo-vpc

Repeat Step 3 with --stack-name demo-vpc.
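Because dev-vpc and demo-vpc share env-vpc.yml and differ only in stack name, the preview half of Step 3 can be wrapped in a function. A minimal sketch built from the Step 3 commands; the function name preview_env_vpc_cleanup is hypothetical, and the wait call is an addition that pauses until CloudFormation has finished computing the change set.

```shell
# Create and preview the env-vpc.yml change set for one stack (dev-vpc or demo-vpc).
# Prints the change-set name on the last line so it can be executed after review.
# Usage: preview_env_vpc_cleanup demo-vpc
preview_env_vpc_cleanup() {
  stack="$1"
  cs_name="endpoint-cleanup-$(date +%s)"

  aws cloudformation create-change-set \
    --profile vell-prod-admin --region us-east-1 \
    --stack-name "$stack" --change-set-name "$cs_name" \
    --change-set-type UPDATE \
    --template-body file://cloudformation/stacks/env/env-vpc.yml \
    --parameters ParameterKey=EnvironmentName,UsePreviousValue=true \
                 ParameterKey=VpcCidr,UsePreviousValue=true \
                 ParameterKey=NatStrategy,UsePreviousValue=true \
    --capabilities CAPABILITY_IAM >/dev/null || return 1

  # The change set is computed asynchronously; wait before describing it.
  aws cloudformation wait change-set-create-complete \
    --profile vell-prod-admin --region us-east-1 \
    --stack-name "$stack" --change-set-name "$cs_name" || return 1

  aws cloudformation describe-change-set \
    --profile vell-prod-admin --region us-east-1 \
    --stack-name "$stack" --change-set-name "$cs_name" \
    --query 'Changes[].ResourceChange.[Action,LogicalResourceId,ResourceType]' \
    --output table

  echo "$cs_name"
}
```

Execution is left as a separate manual step so the preview table is always reviewed first; the expected result for demo-vpc is the same single Remove row (KmsEndpoint).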

Rollback

If verification fails on dev, roll back the template change and re-deploy:

git revert <commit-sha>
aws cloudformation deploy --profile vell-prod-admin --region us-east-1 \
  --stack-name dev-vpc \
  --template-file cloudformation/stacks/env/env-vpc.yml \
  --capabilities CAPABILITY_IAM --no-fail-on-empty-changeset

CloudFormation will recreate the Interface Endpoints with new endpoint IDs. Clients need no changes: because PrivateDnsEnabled: true registers the standard AWS service DNS names, those names resolve to the recreated endpoints as before.

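Same procedure for network-hardening if needed, substituting --stack-name network-hardening, the template cloudformation/hardening/network-hardening.yml, and --capabilities CAPABILITY_NAMED_IAM.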

Out of scope

  • prod-vpc interface endpoints — kept as-is for operational hygiene
  • prod-ec2 endpoint (light usage) — trace caller before considering removal
  • dev/demo Secrets Manager + CloudWatch Logs endpoints — actively used; defer pending NAT consolidation decision (workstream #7)
  • NAT Gateway consolidation (5 → 3) — separate workstream
  • Bedrock Knowledge Base migration off OpenSearch Serverless — separate
  • Aurora Serverless v2 ACU tuning — separate
  • AWS Support tier downgrade (console-only, no CFN)

Cross-reference

Audit findings: AWS April 2026 invoice review (cost-reduction-audit, 2026-05-01). Verification methodology: AWS/PrivateLinkEndpoints BytesProcessed + NewConnections summed across subnet ENIs over a 30-day window.