VPC Endpoint Cleanup - May 2026
Why
The April 2026 invoice audit found VpcEndpoint-Hours running at $496.80/mo:
49,680 endpoint-hours from 23 Interface Endpoints across the prod-vpc, dev-vpc,
demo-vpc, and network-hardening stacks. Combined with 5 NAT Gateways at $162/mo,
the account was paying for two parallel egress paths to AWS APIs.
This cleanup removes only endpoints with verified zero traffic in the
preceding 30 days. The original audit recommended a wider 9-endpoint cut on
the assumption that Secrets Manager / KMS / CloudWatch Logs payloads were
KB-scale. CloudWatch AWS/PrivateLinkEndpoints metrics over 2026-04-04 →
2026-05-04 disproved that assumption — see "What we kept and why" below.
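The invoice figures can be sanity-checked with a few lines of shell arithmetic. This is a minimal sketch that assumes the us-east-1 Interface Endpoint rate of $0.01 per endpoint-AZ-hour and a 720-hour billing month for April; neither figure is stated in the audit, and the variable names are illustrative.

```shell
#!/bin/sh
# Reconcile the April 2026 VpcEndpoint-Hours line item.
# ASSUMPTIONS (not from the audit): $0.01 per endpoint-AZ-hour, 720 hours in April.
ENDPOINT_HOURS=49680
ENDPOINT_COUNT=23
HOURS_IN_MONTH=720
RATE_PER_HOUR=0.01

# Monthly charge implied by the billed endpoint-hours.
monthly_cost=$(awk -v h="$ENDPOINT_HOURS" -v r="$RATE_PER_HOUR" \
  'BEGIN { printf "%.2f", h * r }')

# Billed ENIs (one per AZ) per endpoint implied by the hour count.
enis_per_endpoint=$(awk -v h="$ENDPOINT_HOURS" -v n="$ENDPOINT_COUNT" -v m="$HOURS_IN_MONTH" \
  'BEGIN { printf "%d", h / n / m }')

echo "monthly cost: \$${monthly_cost}"
echo "ENIs per endpoint: ${enis_per_endpoint}"
```

Under those assumptions the hour count matches the invoiced amount and implies that each endpoint is billed in three Availability Zones, which is also where the roughly $22/mo per-endpoint saving comes from.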
Scope
| Stack | Removed | Kept | Approx. savings |
|---|---|---|---|
| network-hardening | EpSsmContacts, EpIncidentManager | EpEcrApi, EpEcrDkr, EpEc2Api | ~$44/mo |
| dev-vpc (via env-vpc.yml) | KmsEndpoint | SSM trio, SecretsEndpoint, LogsEndpoint, Gateway endpoints | ~$22/mo |
| demo-vpc (via env-vpc.yml) | KmsEndpoint | same as dev | ~$22/mo |
| prod-vpc | (unchanged) | all 6 | $0 |
| Total | 4 endpoints removed across 3 resource blocks | | ~$88/mo |
Why these specific endpoints (verified zero usage)
30-day CloudWatch metrics (BytesProcessed + NewConnections, summed across
all subnet ENIs per endpoint, window 2026-04-04 → 2026-05-04):
| Endpoint | Bytes | Connections | Action |
|---|---|---|---|
| prod-ssm-contacts (vpce-0bb935982984ce5ee) | 0 | 0 | Remove |
| prod-ssm-incidents (vpce-025391667e496086a) | 0 | 0 | Remove |
| dev-kms (vpce-0604244fbd2c77bd9) | 0 | 0 | Remove |
| demo-kms (vpce-02d28a469f6867a8e) | 0 | 0 | Remove |
Incident Manager features (contacts, incidents) were never configured. The KMS endpoints receive no direct SDK traffic: server-side encryption for Aurora, S3, and Secrets Manager calls KMS inside AWS and does not transit the endpoint.
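The 30-day totals above could be gathered along these lines. This is a sketch, not the exact query used for the audit: the dimension names passed to get-metric-statistics are an assumption and should be confirmed against the output of `aws cloudwatch list-metrics --namespace AWS/PrivateLinkEndpoints`, and both function names are hypothetical.

```shell
#!/bin/sh
# Sum a whitespace-separated list of numbers read from stdin.
sum_values() {
  awk '{ for (i = 1; i <= NF; i++) total += $i } END { printf "%.0f\n", total + 0 }'
}

# Sum one metric for one endpoint over the audit window (daily datapoints).
# ASSUMPTION: these dimension names match what list-metrics reports for your
# endpoints; CloudWatch returns no data if the dimension set is not exact.
# Usage: endpoint_metric_sum BytesProcessed <vpce-id> <vpc-id> <service-name>
endpoint_metric_sum() {
  aws cloudwatch get-metric-statistics \
    --profile vell-prod-admin --region us-east-1 \
    --namespace AWS/PrivateLinkEndpoints \
    --metric-name "$1" \
    --dimensions "Name=VPC Endpoint Id,Value=$2" "Name=VPC Id,Value=$3" \
                 "Name=Service Name,Value=$4" "Name=Endpoint Type,Value=Interface" \
    --start-time 2026-04-04T00:00:00Z --end-time 2026-05-04T00:00:00Z \
    --period 86400 --statistics Sum \
    --query 'Datapoints[].Sum' --output text | sum_values
}
```

An endpoint qualifies for removal only when both BytesProcessed and NewConnections come back as 0 for the whole window.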
What we kept and why
| Endpoint | Bytes (30d) | Connections (30d) | Reason kept |
|---|---|---|---|
| prod-ec2 (vpce-0292c6c6ceb26fdb3) | 3.8 MB | 300 | Light but non-zero. Likely SSM agent metadata, CodeDeploy describe-instances hooks, or Image Builder pipeline. Trace caller via CloudTrail before considering removal. |
| dev-secretsmanager | 512 MB | 52K | Heavy use: refresh-secrets.sh cron at */5 × 3 secrets. NAT path works functionally but adds latency on hot path. |
| demo-secretsmanager | 495 MB | 49K | Same pattern as dev. |
| dev-logs | 7.0 GB | 481K | CloudWatch Logs agent + Laravel LOG_CHANNEL=cloudwatch ship through this. |
| demo-logs | 3.6 GB | 231K | Same pattern as dev. |
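Tracing the prod-ec2 callers through CloudTrail might look like the sketch below. It relies on the vpcEndpointId field that CloudTrail records for requests made through a VPC endpoint. It assumes jq is installed and that the calls are management events from the last 90 days (the only events lookup-events returns); the function names are hypothetical.

```shell
#!/bin/sh
# Count occurrences of each distinct input line, most frequent first.
# Output format: <count><TAB><line>
top_callers() {
  sort | uniq -c | sort -rn |
    awk '{ count = $1; $1 = ""; sub(/^ /, ""); print count "\t" $0 }'
}

# List which principals called EC2 through a given endpoint, and with what API.
# ASSUMPTIONS: jq is installed; lookup-events is rate-limited, so a busy
# account may take a while to page through.
# Usage: ec2_endpoint_callers vpce-0292c6c6ceb26fdb3
ec2_endpoint_callers() {
  aws cloudtrail lookup-events \
    --profile vell-prod-admin --region us-east-1 \
    --lookup-attributes AttributeKey=EventSource,AttributeValue=ec2.amazonaws.com \
    --start-time 2026-04-04T00:00:00Z --end-time 2026-05-04T00:00:00Z \
    --output json |
    jq -r --arg ep "$1" '.Events[].CloudTrailEvent | fromjson
      | select(.vpcEndpointId == $ep)
      | "\(.userIdentity.arn // "unknown") \(.eventName)"' |
    top_callers
}
```

If the output is dominated by one role, that role's owner is the one to consult before removing the endpoint.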
Future consideration: removing the 5 "kept" endpoints
The dollar math still favors removal — at $0.045/GB NAT processing, the ~12 GB/mo of dev+demo log+secrets traffic costs ~$0.55/mo via NAT, vs ~$108/mo for the 5 endpoints. But three operational concerns warrant deferring:
- Latency: Interface endpoints keep traffic on private addresses inside the VPC (~1-2ms). The NAT path leaves via an Elastic IP to the services' public endpoints (~5-15ms added). Not user-visible per request, but it accumulates across hundreds of thousands of cron and log-shipping calls.
- Posture: The hardening stack is named that for a reason. Removing endpoints means Laravel log shipping and Secrets Manager calls go through NAT to public service endpoints (still TLS, but no longer on a private path), which reverses the original reason these endpoints were added.
- NAT SPOF interaction: Workstream #7 in the cost audit wants OnePerAZ → SingleNat for a $65/mo saving. Combined with this, all dev/demo egress would route through one NAT in one AZ. If pursuing #7, keep these endpoints to retain a private path for AWS API calls.
Decision: revisit if/when workstream #7 is resolved.
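The cost comparison in this section can be reproduced from the 30-day traffic table. This sketch reuses the $0.045/GB NAT rate quoted above and assumes $0.01 per endpoint-AZ-hour, three AZs per endpoint, and a 720-hour month, none of which are stated in the audit.

```shell
#!/bin/sh
# Compare monthly NAT data-processing cost against endpoint hourly cost.
# Traffic figures are the 30-day totals from "What we kept and why" (in GB):
# dev-logs, demo-logs, dev-secretsmanager, demo-secretsmanager, prod-ec2.
# ASSUMPTIONS (not from the audit): $0.01 per endpoint-AZ-hour, 3 AZs per
# endpoint, 720-hour month.
TRAFFIC_GB="7.0 3.6 0.512 0.495 0.0038"
KEPT_ENDPOINTS=5

total_gb=$(echo "$TRAFFIC_GB" |
  awk '{ for (i = 1; i <= NF; i++) t += $i } END { printf "%.2f", t }')
nat_cost=$(awk -v gb="$total_gb" 'BEGIN { printf "%.2f", gb * 0.045 }')
endpoint_cost=$(awk -v n="$KEPT_ENDPOINTS" 'BEGIN { printf "%.2f", n * 3 * 720 * 0.01 }')

echo "traffic: ${total_gb} GB  NAT: \$${nat_cost}/mo  endpoints: \$${endpoint_cost}/mo"
```

The NAT figure lands slightly under the ~$0.55 quoted above because the text rounds the traffic up to ~12 GB; either way the gap to the endpoint cost is about two orders of magnitude.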
Deploy sequence
Step 0: pre-flight verification
Pre-drift baseline is clean as of 2026-05-04. Commit 47a047644
(refresh-secrets cron + chmod fix) is in main and was deployed via
prod-pipeline (parent b00aca35d6). The functional fix ships through
CodeDeploy after-install.sh, so currently-running instances have the
fix without an instance refresh.
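To re-confirm the baseline before deploying, one option is to rerun CloudFormation drift detection on each stack. This is a sketch that assumes "baseline is clean" refers to CloudFormation drift status; the function name is hypothetical.

```shell
#!/bin/sh
# Run CloudFormation drift detection on one stack and print its drift status
# (IN_SYNC, DRIFTED, ...). Polls every 5 seconds until detection finishes.
# Usage: stack_drift_status network-hardening
stack_drift_status() {
  detection_id=$(aws cloudformation detect-stack-drift \
    --profile vell-prod-admin --region us-east-1 \
    --stack-name "$1" \
    --query StackDriftDetectionId --output text) || return 1

  while :; do
    result=$(aws cloudformation describe-stack-drift-detection-status \
      --profile vell-prod-admin --region us-east-1 \
      --stack-drift-detection-id "$detection_id" \
      --query '[DetectionStatus,StackDriftStatus]' --output text) || return 1
    detection=$(echo "$result" | awk '{ print $1 }')
    drift=$(echo "$result" | awk '{ print $2 }')
    case "$detection" in
      DETECTION_IN_PROGRESS) sleep 5 ;;
      DETECTION_COMPLETE)    echo "$drift"; return 0 ;;
      *)                     echo "drift detection failed: $result" >&2; return 1 ;;
    esac
  done
}

# The commit can be checked separately from a repository checkout with:
#   git merge-base --is-ancestor 47a047644 main && echo "commit is in main"
```

Proceed to Step 1 only if every stack in scope reports IN_SYNC.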
Step 1: preview network-hardening with a change-set

    # Name the change set uniquely so reruns do not collide.
    CS_NAME="endpoint-cleanup-$(date +%s)"

    aws cloudformation create-change-set \
      --profile vell-prod-admin --region us-east-1 \
      --stack-name network-hardening \
      --change-set-name "$CS_NAME" \
      --change-set-type UPDATE \
      --template-body file://cloudformation/hardening/network-hardening.yml \
      --parameters ParameterKey=EnvironmentName,UsePreviousValue=true \
                   ParameterKey=VpcId,UsePreviousValue=true \
                   ParameterKey=EndpointSecurityGroupId,UsePreviousValue=true \
                   ParameterKey=FlowLogRetentionDays,UsePreviousValue=true \
                   ParameterKey=FlowLogsRoleName,UsePreviousValue=true \
                   ParameterKey=FlowLogsLogGroupName,UsePreviousValue=true \
      --capabilities CAPABILITY_NAMED_IAM

    # Wait until the change set is fully computed; describing it while it is
    # still being created can show an incomplete list of changes.
    aws cloudformation wait change-set-create-complete \
      --profile vell-prod-admin --region us-east-1 \
      --stack-name network-hardening --change-set-name "$CS_NAME"

    aws cloudformation describe-change-set \
      --profile vell-prod-admin --region us-east-1 \
      --stack-name network-hardening --change-set-name "$CS_NAME" \
      --query 'Changes[].ResourceChange.[Action,LogicalResourceId,ResourceType]' \
      --output table

Expected output: exactly 2 Remove rows (EpSsmContacts, EpIncidentManager).
Anything else → stop, investigate.
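The "exactly 2 Remove rows" check can be turned into a scripted gate rather than an eyeball comparison. This is a sketch; both function names are hypothetical, and the wrapper expects CS_NAME to still be set from the create-change-set call.

```shell
#!/bin/sh
# Succeed only if stdin (rows of "Action LogicalResourceId") consists of
# exactly the expected Remove actions: no extra rows, no missing rows.
# Usage: <rows> | only_expected_removals EpSsmContacts EpIncidentManager
only_expected_removals() {
  expected=$(printf '%s\n' "$@" | sort)
  actual=$(awk '
    NF == 0         { next }                                  # skip blank lines
    $1 != "Remove"  { print "UNEXPECTED:" $1 ":" $2; next }   # poison the comparison
                    { print $2 }' | sort)
  [ "$actual" = "$expected" ]
}

# Gate for the network-hardening change set.
check_network_hardening_change_set() {
  aws cloudformation describe-change-set \
    --profile vell-prod-admin --region us-east-1 \
    --stack-name network-hardening --change-set-name "$CS_NAME" \
    --query 'Changes[].ResourceChange.[Action,LogicalResourceId]' \
    --output text | only_expected_removals EpSsmContacts EpIncidentManager
}
```

A typical use would be `check_network_hardening_change_set && echo "safe to execute"`; any Add, Modify, or unexpected Remove makes the check fail.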
Step 2: execute network-hardening

    aws cloudformation execute-change-set \
      --profile vell-prod-admin --region us-east-1 \
      --stack-name network-hardening --change-set-name "$CS_NAME"

    aws cloudformation wait stack-update-complete \
      --profile vell-prod-admin --region us-east-1 \
      --stack-name network-hardening

Step 3: dev-vpc

    CS_NAME="endpoint-cleanup-$(date +%s)"

    aws cloudformation create-change-set \
      --profile vell-prod-admin --region us-east-1 \
      --stack-name dev-vpc \
      --change-set-name "$CS_NAME" \
      --change-set-type UPDATE \
      --template-body file://cloudformation/stacks/env/env-vpc.yml \
      --parameters ParameterKey=EnvironmentName,UsePreviousValue=true \
                   ParameterKey=VpcCidr,UsePreviousValue=true \
                   ParameterKey=NatStrategy,UsePreviousValue=true \
      --capabilities CAPABILITY_IAM

    # Wait for the change set to finish computing before inspecting it.
    aws cloudformation wait change-set-create-complete \
      --profile vell-prod-admin --region us-east-1 \
      --stack-name dev-vpc --change-set-name "$CS_NAME"

    aws cloudformation describe-change-set \
      --profile vell-prod-admin --region us-east-1 \
      --stack-name dev-vpc --change-set-name "$CS_NAME" \
      --query 'Changes[].ResourceChange.[Action,LogicalResourceId,ResourceType]' \
      --output table

Expected output: exactly 1 Remove row (KmsEndpoint).

    aws cloudformation execute-change-set \
      --profile vell-prod-admin --region us-east-1 \
      --stack-name dev-vpc --change-set-name "$CS_NAME"

    # Block until the update finishes, as in Step 2.
    aws cloudformation wait stack-update-complete \
      --profile vell-prod-admin --region us-east-1 \
      --stack-name dev-vpc

Step 4: soak verification on dev (24-48h)
- /var/log/vell-boot.log shows secrets refresh continuing (we kept SecretsEndpoint)
- CloudWatch log groups (/dev/laravel, etc.) continue receiving events (we kept LogsEndpoint)
- Session Manager still works on dev-web instances (we kept SSM trio)
- App health checks pass; ALB target group reports healthy
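The log-group item in the checklist above can be spot-checked from the CLI. This is a sketch with hypothetical function names; the 2-hour freshness threshold is an assumption, chosen generously because lastIngestionTime is updated on an eventually consistent basis.

```shell
#!/bin/sh
# True if a millisecond epoch timestamp is no older than max_age_s seconds.
# Usage: is_recent <epoch_ms> <now_epoch_s> <max_age_s>
is_recent() {
  case "$1" in ''|*[!0-9]*) return 1 ;; esac   # rejects "None" or empty output
  [ $(( $2 - $1 / 1000 )) -le "$3" ]
}

# Check that the most recently active stream in a log group ingested events
# within the last 2 hours (ASSUMED threshold).
# Usage: log_group_is_fresh /dev/laravel
log_group_is_fresh() {
  last_ms=$(aws logs describe-log-streams \
    --profile vell-prod-admin --region us-east-1 \
    --log-group-name "$1" \
    --order-by LastEventTime --descending --limit 1 \
    --query 'logStreams[0].lastIngestionTime' --output text) || return 1
  is_recent "$last_ms" "$(date +%s)" 7200
}
```

For example, `log_group_is_fresh /dev/laravel || echo "no recent events in /dev/laravel"` could run a few times during the soak window.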
Step 5: demo-vpc
Once the dev soak passes, repeat Step 3 with --stack-name demo-vpc. Expect the same single Remove row (KmsEndpoint).
Rollback
If verification fails on dev, revert the template change and redeploy:

    git revert <commit-sha>
    aws cloudformation deploy --profile vell-prod-admin --region us-east-1 \
      --stack-name dev-vpc \
      --template-file cloudformation/stacks/env/env-vpc.yml \
      --capabilities CAPABILITY_IAM --no-fail-on-empty-changeset

CloudFormation will recreate the Interface Endpoints. The endpoint IDs will be
new, but clients keep using the same service DNS names, because
PrivateDnsEnabled: true maps the standard AWS service hostnames to the endpoints.
Same procedure for network-hardening if needed.
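After a rollback, one way to confirm that private DNS points at the recreated endpoint again is to resolve the service hostname from an instance inside the VPC and check that every address is private. This is a sketch with hypothetical function names; it assumes getent is available on the instance (glibc-based AMIs) and that the VPC uses RFC 1918 address space.

```shell
#!/bin/sh
# True if the argument is an RFC 1918 private IPv4 address.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run ON an instance in the VPC. Succeeds only if the hostname resolves and
# every resolved address is private, i.e. traffic would use the endpoint ENIs.
# Usage: service_resolves_privately kms.us-east-1.amazonaws.com
service_resolves_privately() {
  ips=$(getent ahostsv4 "$1" | awk '{ print $1 }' | sort -u)
  [ -n "$ips" ] || return 1
  for ip in $ips; do
    is_private_ipv4 "$ip" || return 1
  done
}
```

The same check run before rollback should fail for a removed endpoint, since the hostname then resolves to the service's public addresses.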
Out of scope
- prod-vpc interface endpoints: kept as-is for operational hygiene
- prod-ec2 endpoint (light usage): trace caller before considering removal
- dev/demo Secrets Manager + CloudWatch Logs endpoints: actively used; defer pending NAT consolidation decision (workstream #7)
- NAT Gateway consolidation (5 → 3): separate workstream
- Bedrock Knowledge Base migration off OpenSearch Serverless: separate workstream
- Aurora Serverless v2 ACU tuning: separate workstream
- AWS Support tier downgrade (console-only, no CFN)
Cross-reference
Audit findings: AWS April 2026 invoice review (cost-reduction-audit, 2026-05-01).
Verification methodology: AWS/PrivateLinkEndpoints BytesProcessed +
NewConnections summed across subnet ENIs over a 30-day window.