CloudFormation Deploy Discipline
A reference for keeping git, CloudFormation, and running infrastructure in agreement. Drift is cheap to prevent and expensive to debug — this doc captures the muscle memory.
Why this exists
Three independent things must stay aligned for "the code is the infrastructure" to be true:
- Local git template (what you edit in code/social/cloudformation/)
- Deployed CFN template (what AWS has on file for the stack)
- Actual running resource state (what's really configured in AWS)
When any two diverge, future changes get harder. Bugs become "is the template wrong, did the deploy fail, or did someone change it via the console?" — and you spend an hour answering that before fixing the bug.
The three flavors of drift
1. Template drift (git vs deployed CFN)
Local edits committed but never deployed. Or the reverse: someone pulled the deployed template into git from another branch.
How to detect:
aws cloudformation get-template --profile vell-prod-admin --region us-east-1 \
--stack-name <stack> --template-stage Original --query 'TemplateBody' \
--output text > /tmp/deployed.yml
diff -u cloudformation/path/to/template.yml /tmp/deployed.yml
Common false positives: AWS mangles non-ASCII characters (em-dash —
becomes ? in the round-tripped template) and may add a trailing
newline. Real drift looks like added/removed resources or changed values.
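One way to keep those false positives out of the diff is to normalize both files before comparing them. The sketch below is only an illustration: `normalize_template` and `template_diff` are hypothetical helper names, and it assumes non-ASCII characters round-trip as `?` as described above. It collapses every run of non-ASCII bytes (and repeated `?`) into a single `?` and forces exactly one trailing newline on both sides.

```shell
# Hypothetical helpers, not part of the existing tooling.

# Print a template with the known round-trip artifacts smoothed over.
normalize_template() {
  # tr -cs: turn each run of bytes outside tab/LF/CR/printable ASCII into one "?".
  # Command substitution drops trailing newlines; printf adds exactly one back.
  printf '%s\n' "$(LC_ALL=C tr -cs '\11\12\15\40-\176' '?' < "$1")"
}

# Diff two templates after normalizing both; exit status is diff's.
template_diff() {
  a=$(mktemp)
  b=$(mktemp)
  normalize_template "$1" > "$a"
  normalize_template "$2" > "$b"
  diff -u "$a" "$b"
  rc=$?
  rm -f "$a" "$b"
  return $rc
}
```

Usage would be `template_diff cloudformation/path/to/template.yml /tmp/deployed.yml`. The trade-off is that a change from one non-ASCII character to another, or a change only in trailing blank lines, is hidden too.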
2. Resource drift (deployed CFN vs actual resources)
Someone changed a resource directly via console / CLI / SDK without going through CFN. The deployed template says X, the resource is Y.
How to detect:
# Trigger detection (async, takes 1-5 min)
aws cloudformation detect-stack-drift --profile vell-prod-admin --region us-east-1 \
--stack-name <stack>
# Read results
aws cloudformation describe-stacks --profile vell-prod-admin --region us-east-1 \
--stack-name <stack> --query 'Stacks[0].DriftInformation'
# Detail per resource
aws cloudformation describe-stack-resource-drifts --profile vell-prod-admin \
--region us-east-1 --stack-name <stack> \
--query 'StackResourceDrifts[?StackResourceDriftStatus!=`IN_SYNC`]'
UNKNOWN ≠ drifted. It just means CFN can't introspect that resource
type. Check the per-resource detail to be sure.
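Because detection is asynchronous, reading the results too early can show a stale status. A minimal sketch of polling until detection finishes, using the detection ID that `detect-stack-drift` returns: `detect_and_wait` and `POLL_SECONDS` are hypothetical names introduced here, and the profile and region defaults simply mirror the commands above.

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}
POLL_SECONDS=${POLL_SECONDS:-15}   # hypothetical knob: delay between polls

# Start drift detection on one stack and block until it completes.
# Prints "<stack>: <StackDriftStatus>" on success.
detect_and_wait() {
  stack=$1
  id=$(aws cloudformation detect-stack-drift --profile "$PROFILE" --region "$REGION" \
        --stack-name "$stack" --query 'StackDriftDetectionId' --output text) || return 1
  while :; do
    result=$(aws cloudformation describe-stack-drift-detection-status \
        --profile "$PROFILE" --region "$REGION" \
        --stack-drift-detection-id "$id" \
        --query '[DetectionStatus,StackDriftStatus]' --output text) || return 1
    set -- $result               # $1 = DetectionStatus, $2 = StackDriftStatus
    case $1 in
      DETECTION_IN_PROGRESS) sleep "$POLL_SECONDS" ;;
      DETECTION_COMPLETE)    echo "$stack: $2"; return 0 ;;
      *)                     echo "$stack: detection ended as $1" >&2; return 1 ;;
    esac
  done
}
```

Calling `detect_and_wait <stack>` replaces the first two commands above; the per-resource detail query is still needed to see what actually drifted.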
3. Instance / runtime drift (deployed config vs what's running)
This is the dangerous one because CFN does not detect it. CFN checks that the Launch Template configuration matches its deployed template — but it does NOT check whether existing instances are running with that template version, nor does it look inside the instance.
Examples CFN won't catch:
- ASG instances were launched with Launch Template v3, but CFN deployed
v5 a month ago and never refreshed the ASG
- Someone SSH'd into an EC2 instance and edited /etc/cron.d/... or
/etc/nginx/... directly
- A package was manually yum install'd on an instance
- chmod / chown changes made by hand
CFN will say IN_SYNC for the Launch Template resource, blissfully
unaware the instance is running stale config. This is the hardest
drift category to detect and the most common in practice.
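The stale Launch Template case, at least, can be checked outside CFN, because the Auto Scaling API reports the Launch Template version each instance was launched from. A rough sketch follows; `stale_instances` is a hypothetical helper name, and it assumes the ASG pins a numeric version rather than `$Latest` or `$Default`.

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}

# List instances whose Launch Template version differs from the ASG's.
# Returns 2 when the ASG version is not a plain number and cannot be compared.
stale_instances() {
  asg=$1
  want=$(aws autoscaling describe-auto-scaling-groups --profile "$PROFILE" --region "$REGION" \
        --auto-scaling-group-names "$asg" \
        --query 'AutoScalingGroups[0].LaunchTemplate.Version' --output text) || return 1
  case $want in
    ''|None|'$Latest'|'$Default')
      echo "$asg: cannot compare, ASG version is '$want'" >&2
      return 2 ;;
  esac
  rows=$(aws autoscaling describe-auto-scaling-groups --profile "$PROFILE" --region "$REGION" \
        --auto-scaling-group-names "$asg" \
        --query 'AutoScalingGroups[0].Instances[].[InstanceId,LaunchTemplate.Version]' \
        --output text) || return 1
  # Each row is "<instance-id> <version>"; keep only the mismatches.
  printf '%s\n' "$rows" |
    awk -v want="$want" 'NF >= 2 && $2 != want { print $1 " is on v" $2 ", ASG expects v" want }'
}
```

Empty output means every instance matches the ASG's version. It says nothing about changes made inside the instance, which still need replacement rather than inspection.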
CFN's blind spots: know them by name
| Blind spot | What it means | How to mitigate |
|---|---|---|
| Stale Launch Template usage | ASG keeps running instances launched from older LT versions | Always pair LT changes with start-instance-refresh |
| Inside-the-instance changes | Files, packages, services modified on the box | Treat instances as cattle; replace, don't edit |
| AMI updates | New AMI in template doesn't replace existing instances | Same — start-instance-refresh |
| Parameter Store / Secrets values | CFN can manage the secret but not the value behind the reference | Audit values separately; rotate via service-specific tools |
| Resources outside CFN | Manually-created VPC endpoints, security groups, IAM roles | Adopt them into CFN or document as "managed elsewhere" |
| RDS parameter group changes | Static parameters take effect only after a manual reboot | After deploys, check for a pending-reboot parameter status and any pending maintenance actions |
The deploy ritual by change type
A. Pure CFN resources (atomic, just deploy)
Examples: VPC endpoints, IAM roles/policies, S3 buckets, Lambda, Route 53 records, CloudFront distributions, KMS keys, SQS queues, SNS topics, EventBridge rules.
Just deploy. No instance refresh, no reboot, no manual steps.
aws cloudformation deploy --profile vell-prod-admin --region us-east-1 \
--stack-name <stack> \
--template-file <path> \
--capabilities CAPABILITY_IAM --no-fail-on-empty-changeset
B. Launch Template / UserData changes (TWO STEPS)
Examples: AMI updates, instance type changes, any UserData script edit (cron files, systemd units, package installs, environment variables).
Step 1 — deploy CFN (creates new Launch Template version):
aws cloudformation deploy --profile vell-prod-admin --region us-east-1 \
--stack-name dev-compute \
--template-file cloudformation/stacks/env/env-compute.yml \
--capabilities CAPABILITY_IAM --no-fail-on-empty-changeset
Step 2 — replace existing instances (forces them to use the new LT):
# Find the ASG
ASG=$(aws cloudformation describe-stack-resources --profile vell-prod-admin \
--region us-east-1 --stack-name dev-compute \
--query 'StackResources[?ResourceType==`AWS::AutoScaling::AutoScalingGroup`].PhysicalResourceId' \
--output text)
# Refresh
aws autoscaling start-instance-refresh --profile vell-prod-admin --region us-east-1 \
--auto-scaling-group-name "$ASG" \
--preferences '{"MinHealthyPercentage": 50, "InstanceWarmup": 180}'
# Watch progress (don't sleep blindly — poll)
aws autoscaling describe-instance-refreshes --profile vell-prod-admin --region us-east-1 \
--auto-scaling-group-name "$ASG" --max-records 1
Schedule Step 2 during a low-traffic window. Instances roll one batch
at a time. MinHealthyPercentage: 50 keeps half the fleet up
throughout. InstanceWarmup: 180 gives new instances 3 minutes to
register healthy with the ALB before counting them.
Skipping Step 2 is the most common drift sin in this codebase. The deploy will succeed, CFN will say IN_SYNC, but instances stay on old config indefinitely.
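To make the "poll, don't sleep blindly" advice concrete, here is a minimal sketch of a loop that watches the most recent refresh until it reaches a terminal state. `wait_for_refresh` and `POLL_SECONDS` are hypothetical names; the set of failure statuses is the one I believe the Auto Scaling API reports, so treat that list as an assumption to verify.

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}
POLL_SECONDS=${POLL_SECONDS:-30}   # hypothetical knob: delay between polls

# Poll the latest instance refresh on an ASG until it succeeds or fails.
wait_for_refresh() {
  asg=$1
  while :; do
    line=$(aws autoscaling describe-instance-refreshes --profile "$PROFILE" --region "$REGION" \
        --auto-scaling-group-name "$asg" --max-records 1 \
        --query 'InstanceRefreshes[0].[Status,PercentageComplete]' --output text) || return 1
    set -- $line                 # $1 = Status, $2 = PercentageComplete
    case $1 in
      Successful)
        echo "$asg: Successful"; return 0 ;;
      Failed|Cancelled|RollbackFailed|RollbackSuccessful)
        echo "$asg: refresh ended as $1" >&2; return 1 ;;
      ''|None)
        echo "$asg: no instance refresh found" >&2; return 1 ;;
      *)
        # Pending, InProgress and other transitional states: keep waiting.
        echo "$asg: $1 (${2}% complete)"
        sleep "$POLL_SECONDS" ;;
    esac
  done
}
```

Run `wait_for_refresh "$ASG"` right after `start-instance-refresh`; a non-zero exit means the fleet may still be on the old Launch Template version.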
C. ASG parameter-only changes (deploy, no refresh)
Examples: MinSize, MaxSize, DesiredCapacity, scaling policies, target group attachments. These take effect on the existing fleet without replacing instances.
Just deploy. Step 1 only.
D. Aurora Serverless v2 / RDS config (deploy, sometimes pending)
Examples: MinCapacity, MaxCapacity, parameter group changes,
backup retention.
Deploy normally. Most changes apply on next maintenance window or immediately depending on the property. Check after:
aws rds describe-db-clusters --profile vell-prod-admin --region us-east-1 \
--db-cluster-identifier <cluster-id> \
--query 'DBClusters[0].[ServerlessV2ScalingConfiguration,PendingModifiedValues]'
If PendingModifiedValues is non-empty, the change is queued for the
maintenance window. Force apply with --apply-immediately on a
direct modify-db-cluster call if needed (be aware of brief downtime).
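The "is anything still queued?" check can be wrapped so it works as a post-deploy gate. This is a sketch under the assumption that an empty or null `PendingModifiedValues` means nothing is pending; `has_pending_changes` is a hypothetical helper name.

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}

# Exit 0 and print the pending values if the cluster has queued changes,
# exit 1 if nothing is pending, exit 2 if the lookup itself failed.
has_pending_changes() {
  cluster=$1
  pending=$(aws rds describe-db-clusters --profile "$PROFILE" --region "$REGION" \
        --db-cluster-identifier "$cluster" \
        --query 'DBClusters[0].PendingModifiedValues' --output json) || return 2
  # Strip whitespace so "{ }" and "{}" compare the same.
  pending=$(printf '%s' "$pending" | tr -d ' \n\t')
  case $pending in
    ''|null|'{}') return 1 ;;
    *)            echo "$pending"; return 0 ;;
  esac
}
```

For example, `has_pending_changes <cluster-id> && echo "changes are waiting for the maintenance window"`.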
E. NAT Gateway / route table changes (deploy, traffic reroutes)
Examples: switching NatStrategy: OnePerAZ ↔ SingleNat, adding
gateway endpoints.
Deploy normally. Routes update atomically — in-flight TCP connections may briefly stall, then reconnect. Schedule during low-traffic if prod.
F. Parameters & Secrets (out of band from CFN)
CFN can manage the resource (AWS::SecretsManager::Secret,
AWS::SSM::Parameter) but the value typically rotates via the
service, not via CFN. Don't try to deploy secret values from CFN —
you'll leak them into CloudFormation events.
Manage values via:
- Secrets Manager rotation (preferred for DB creds)
- Direct aws secretsmanager put-secret-value for manual updates
- Parameter Store via aws ssm put-parameter
If a stack reference like '{{resolve:secretsmanager:...}}' doesn't
update, the issue is almost always that the consumer (e.g., the EC2
instance or Lambda) hasn't picked up the new value — refresh it.
Pre-flight checklist (before any CFN deploy)
- Working tree is clean (git status -s empty for tracked files)
- On the right branch (dev for non-prod, follow promotion process for prod)
- Pulled latest (git pull)
- Template diff against deployed is exactly what you expect (the same get-template diff used under "Template drift" above)
- Created a change-set first; reviewed every Action and ResourceType:
  aws cloudformation create-change-set --profile vell-prod-admin --region us-east-1 \
  --stack-name <stack> --change-set-name "preview-$(date +%s)" \
  --change-set-type UPDATE --template-body file://<path> \
  --parameters ParameterKey=<each>,UsePreviousValue=true ... \
  --capabilities CAPABILITY_IAM
  # Then describe and inspect
- Identified whether this is type A, B, C, D, E, or F above
- If type B (Launch Template / UserData): scheduled an instance refresh window
- If touching prod: confirmed with stakeholder; deploy in dev/demo first
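For the "describe and inspect" step of the change-set item, a small filter can surface the rows most worth reading first. The sketch below is illustrative only: `risky_changes` is a hypothetical helper, and which rows count as "risky" (removals and replacements) is my assumption, not a rule from this runbook.

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}

# Print change-set rows that remove a resource or replace it
# (Replacement is True or Conditional). Columns are tab-separated:
# Action, LogicalResourceId, ResourceType, Replacement.
risky_changes() {
  stack=$1
  changeset=$2
  rows=$(aws cloudformation describe-change-set --profile "$PROFILE" --region "$REGION" \
        --stack-name "$stack" --change-set-name "$changeset" \
        --query 'Changes[].ResourceChange.[Action,LogicalResourceId,ResourceType,Replacement]' \
        --output text) || return 1
  printf '%s\n' "$rows" |
    awk '$1 == "Remove" || $4 == "True" || $4 == "Conditional"'
}
```

Empty output does not mean the change-set is safe; it only means nothing is being removed or replaced, so the full list of changes still deserves a read.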
Post-flight checklist (after any CFN deploy)
- Stack status is CREATE_COMPLETE or UPDATE_COMPLETE
- Template drift is zero (diff against get-template again)
- If type B: instance refresh kicked off and succeeded
- Application health check (ALB target group, app metric, smoke test)
- CloudWatch logs show app starting cleanly on new instances
- If touching cost-relevant resources: note the change in docs/infrastructure/COST_*.md or relevant runbook
Quarterly drift sweep
Run this on the first of every quarter. Catches manual changes you didn't notice, deploys you forgot, and stale Launch Template usage.
# 1. Trigger drift detection on every stack with non-trivial resources
PROFILE=vell-prod-admin
REGION=us-east-1
for s in $(aws cloudformation list-stacks --profile $PROFILE --region $REGION \
--stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
--query 'StackSummaries[?!starts_with(StackName, `StackSet-`) && !starts_with(StackName, `SC-RA-`)].StackName' \
--output text); do
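  # Report failures instead of printing "queued" unconditionally
  if aws cloudformation detect-stack-drift --profile $PROFILE --region $REGION \
  --stack-name "$s" >/dev/null 2>&1; then
    echo "queued: $s"
  else
    echo "FAILED to queue: $s" >&2
  fi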
done
# 2. Wait 5 min for detection to complete
# 3. Pull results
aws cloudformation list-stacks --profile $PROFILE --region $REGION \
--stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
--query 'StackSummaries[?DriftInformation.StackDriftStatus==`DRIFTED`].[StackName,DriftInformation.StackDriftStatus,DriftInformation.LastCheckTimestamp]' \
--output table
# 4. Detail any DRIFTED stacks
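for s in $(aws cloudformation list-stacks --profile $PROFILE --region $REGION \
--stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
--query 'StackSummaries[?DriftInformation.StackDriftStatus==`DRIFTED`].StackName' \
--output text); do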
echo "=== $s ==="
aws cloudformation describe-stack-resource-drifts --profile $PROFILE --region $REGION \
--stack-name "$s" \
--query 'StackResourceDrifts[?StackResourceDriftStatus!=`IN_SYNC`].[LogicalResourceId,ResourceType,StackResourceDriftStatus]' \
--output table
done
# 5. For ASG stacks: check stale Launch Template usage
for asg in $(aws autoscaling describe-auto-scaling-groups --profile $PROFILE --region $REGION \
--query 'AutoScalingGroups[].AutoScalingGroupName' --output text); do
aws autoscaling describe-auto-scaling-groups --profile $PROFILE --region $REGION \
--auto-scaling-group-names "$asg" \
--query 'AutoScalingGroups[0].[AutoScalingGroupName,LaunchTemplate.Version,Instances[0].LaunchTemplate.Version]' \
--output text
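# If column 2 != column 3, instances are running stale LT.
# Caveats: only the first instance is sampled, and an ASG that tracks
# $Latest or $Default prints that literal string in column 2.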
done
Anti-patterns to avoid
| Don't | Because |
|---|---|
| Edit a stack via the AWS console "just this once" | The next CFN update will revert your change AND surprise you |
| aws cloudformation deploy and walk away | Watch the events; rollback failures are quieter than they should be |
| Skip change-set preview on prod | The preview is the difference between "3 Remove rows" and "47 unexpected changes" |
| Deploy a UserData change without instance refresh | CFN says IN_SYNC, instances run last quarter's config |
| Commit local-only template edits and never deploy | Pre-drift accumulates; every future deploy carries surprise changes |
| Mix unrelated changes in one PR | Hard to roll back the bad part without losing the good part |
| Force-deploy through a failing change-set | The change-set is failing for a reason; debug first |
| Use --no-execute-changeset and forget | Orphan change-sets pile up; they expire but waste mental space |
Cross-references
- COST_CLEANUP_VPC_ENDPOINTS.md: applied example of type A (endpoints) sequencing
- AWS docs: Detecting unmanaged config changes
- AWS docs: Instance refresh