
CloudFormation Deploy Discipline

A reference for keeping git, CloudFormation, and running infrastructure in agreement. Drift is cheap to prevent and expensive to debug — this doc captures the muscle memory.

Why this exists

Three independent things must stay aligned for "the code is the infrastructure" to be true:

  1. Local git template (what you edit in code/social/cloudformation/)
  2. Deployed CFN template (what AWS has on file for the stack)
  3. Actual running resource state (what's really configured in AWS)

When any two diverge, future changes get harder. Bugs become "is the template wrong, did the deploy fail, or did someone change it via the console?" — and you spend an hour answering that before fixing the bug.

The three flavors of drift

1. Template drift (git vs deployed CFN)

Local edits committed but never deployed. Or the reverse: someone pulled the deployed template into git from another branch.

How to detect:

aws cloudformation get-template --profile vell-prod-admin --region us-east-1 \
  --stack-name <stack> --template-stage Original --query 'TemplateBody' \
  --output text > /tmp/deployed.yml
diff -u cloudformation/path/to/template.yml /tmp/deployed.yml

Common false positives: AWS mangles non-ASCII characters (em-dash becomes ? in the round-tripped template) and may add a trailing newline. Real drift looks like added/removed resources or changed values.
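To keep those false positives out of the diff, both files can be normalized before comparing. The following is a minimal sketch, not a vetted tool: the function names are hypothetical, and it assumes the round trip turns each non-ASCII character into a single ? and that trailing blank lines are the only whitespace difference.

```shell
# Hypothetical helper: rewrite a template the way the round trip is assumed to,
# so only real drift is left in the diff.
normalize_template() {
  # Build the sed pattern with printf so raw bytes stay out of the script source.
  # It matches one UTF-8 multi-byte character: a lead byte plus continuation bytes.
  pattern=$(printf 's/[\302-\364][\200-\277]*/?/g')
  # $(...) strips trailing newlines; printf adds exactly one back.
  printf '%s\n' "$(LC_ALL=C sed "$pattern" "$1")"
}

# Usage: template_drift <local-template> <deployed-template>
# Returns 0 when the normalized files match, non-zero (and prints a diff) otherwise.
template_drift() {
  left=$(mktemp)
  right=$(mktemp)
  normalize_template "$1" > "$left"
  normalize_template "$2" > "$right"
  rc=0
  diff -u "$left" "$right" || rc=$?
  rm -f "$left" "$right"
  return "$rc"
}
```

For example: template_drift cloudformation/path/to/template.yml /tmp/deployed.yml. An empty result means any remaining difference was only encoding or trailing-newline noise.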

2. Resource drift (deployed CFN vs actual resources)

Someone changed a resource directly via console / CLI / SDK without going through CFN. The deployed template says X, the resource is Y.

How to detect:

# Trigger detection (async, takes 1-5 min)
aws cloudformation detect-stack-drift --profile vell-prod-admin --region us-east-1 \
  --stack-name <stack>

# Read results
aws cloudformation describe-stacks --profile vell-prod-admin --region us-east-1 \
  --stack-name <stack> --query 'Stacks[0].DriftInformation'

# Detail per resource
aws cloudformation describe-stack-resource-drifts --profile vell-prod-admin \
  --region us-east-1 --stack-name <stack> \
  --query 'StackResourceDrifts[?StackResourceDriftStatus!=`IN_SYNC`]'

UNKNOWN ≠ drifted. It just means CFN can't introspect that resource type. Check the per-resource detail to be sure.
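Because detection is asynchronous, the trigger and the read can be joined by a poll on the detection ID instead of a fixed wait. A minimal sketch, assuming the same profile and region as the commands above; the function name and the POLL_SECONDS variable are hypothetical.

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}

# Hypothetical helper: start drift detection on one stack and poll until it finishes.
# Prints "<DetectionStatus><TAB><StackDriftStatus>" when detection completes.
wait_for_drift() {
  stack=$1
  detection_id=$(aws cloudformation detect-stack-drift --profile "$PROFILE" --region "$REGION" \
    --stack-name "$stack" --query 'StackDriftDetectionId' --output text) || return 1

  while :; do
    result=$(aws cloudformation describe-stack-drift-detection-status \
      --profile "$PROFILE" --region "$REGION" \
      --stack-drift-detection-id "$detection_id" \
      --query '[DetectionStatus,StackDriftStatus]' --output text) || return 1
    case $result in
      DETECTION_IN_PROGRESS*) sleep "${POLL_SECONDS:-15}" ;;
      DETECTION_COMPLETE*)    printf '%s\n' "$result"; return 0 ;;
      *)                      printf '%s\n' "$result" >&2; return 1 ;;  # e.g. DETECTION_FAILED
    esac
  done
}
```

Run wait_for_drift <stack>, then use the per-resource command above for anything that does not come back IN_SYNC.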

3. Instance / runtime drift (deployed config vs what's running)

This is the dangerous one because CFN does not detect it. CFN checks that the Launch Template configuration matches its deployed template — but it does NOT check whether existing instances are running with that template version, nor does it look inside the instance.

Examples CFN won't catch:

  • ASG instances were launched with Launch Template v3, but CFN deployed v5 a month ago and never refreshed the ASG
  • Someone SSH'd into an EC2 instance and edited /etc/cron.d/... or /etc/nginx/... directly
  • A package was manually yum install'd on an instance
  • chmod / chown changes made by hand

CFN will say IN_SYNC for the Launch Template resource, blissfully unaware the instance is running stale config. This is the hardest drift category to detect and the most common in practice.

CFN's blind spots — know them by name

| Blind spot | What it means | How to mitigate |
|---|---|---|
| Stale Launch Template usage | ASG keeps running instances launched from older LT versions | Always pair LT changes with start-instance-refresh |
| Inside-the-instance changes | Files, packages, services modified on the box | Treat instances as cattle; replace, don't edit |
| AMI updates | New AMI in template doesn't replace existing instances | Same: start-instance-refresh |
| Parameter Store / Secrets values | CFN can manage the secret but not the value behind the reference | Audit values separately; rotate via service-specific tools |
| Resources outside CFN | Manually-created VPC endpoints, security groups, IAM roles | Adopt them into CFN or document as "managed elsewhere" |
| RDS parameter group changes | Modifying parameters can require manual reboot | Always check pending-maintenance-action after deploys |

The deploy ritual by change type

A. Pure CFN resources (atomic — just deploy)

Examples: VPC endpoints, IAM roles/policies, S3 buckets, Lambda, Route 53 records, CloudFront distributions, KMS keys, SQS queues, SNS topics, EventBridge rules.

Just deploy. No instance refresh, no reboot, no manual steps.

aws cloudformation deploy --profile vell-prod-admin --region us-east-1 \
  --stack-name <stack> \
  --template-file <path> \
  --capabilities CAPABILITY_IAM --no-fail-on-empty-changeset

B. Launch Template / UserData changes (TWO STEPS)

Examples: AMI updates, instance type changes, any UserData script edit (cron files, systemd units, package installs, environment variables).

Step 1 — deploy CFN (creates new Launch Template version):

aws cloudformation deploy --profile vell-prod-admin --region us-east-1 \
  --stack-name dev-compute \
  --template-file cloudformation/stacks/env/env-compute.yml \
  --capabilities CAPABILITY_IAM --no-fail-on-empty-changeset

Step 2 — replace existing instances (forces them to use the new LT):

# Find the ASG
ASG=$(aws cloudformation describe-stack-resources --profile vell-prod-admin \
  --region us-east-1 --stack-name dev-compute \
  --query 'StackResources[?ResourceType==`AWS::AutoScaling::AutoScalingGroup`].PhysicalResourceId' \
  --output text)

# Refresh
aws autoscaling start-instance-refresh --profile vell-prod-admin --region us-east-1 \
  --auto-scaling-group-name "$ASG" \
  --preferences '{"MinHealthyPercentage": 50, "InstanceWarmup": 180}'

# Watch progress (don't sleep blindly — poll)
aws autoscaling describe-instance-refreshes --profile vell-prod-admin --region us-east-1 \
  --auto-scaling-group-name "$ASG" --max-records 1

Schedule Step 2 during a low-traffic window. Instances roll one batch at a time. MinHealthyPercentage: 50 keeps half the fleet up throughout. InstanceWarmup: 180 gives new instances 3 minutes to register healthy with the ALB before counting them.

Skipping Step 2 is the most common drift sin in this codebase. The deploy will succeed, CFN will say IN_SYNC, but instances stay on old config indefinitely.
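The "poll, don't sleep blindly" advice in Step 2 can be wrapped in a loop that stops on a terminal status. A minimal sketch, assuming the most recent refresh is the first record returned; the function name and POLL_SECONDS are hypothetical.

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}

# Hypothetical helper: poll the latest instance refresh on an ASG until it ends.
# Prints "<Status><TAB><PercentageComplete>" on every poll.
wait_for_refresh() {
  asg=$1
  while :; do
    result=$(aws autoscaling describe-instance-refreshes --profile "$PROFILE" --region "$REGION" \
      --auto-scaling-group-name "$asg" --max-records 1 \
      --query 'InstanceRefreshes[0].[Status,PercentageComplete]' --output text) || return 1
    printf '%s\n' "$result"
    case $result in
      Successful*) return 0 ;;
      Pending*|InProgress*|Baking*|Cancelling*|RollbackInProgress*)
        sleep "${POLL_SECONDS:-30}" ;;
      *) return 1 ;;  # Failed, Cancelled, a finished rollback, or no refresh found
    esac
  done
}
```

Usage after Step 2: wait_for_refresh "$ASG" && echo "fleet is on the new Launch Template". A non-zero exit means the refresh did not finish successfully and needs a look before moving on.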

C. ASG parameter-only changes (deploy, no refresh)

Examples: MinSize, MaxSize, DesiredCapacity, scaling policies, target group attachments. These take effect on the existing fleet without replacing instances.

Just deploy. Step 1 only.

D. Aurora Serverless v2 / RDS config (deploy, sometimes pending)

Examples: MinCapacity, MaxCapacity, parameter group changes, backup retention.

Deploy normally. Most changes apply on next maintenance window or immediately depending on the property. Check after:

aws rds describe-db-clusters --profile vell-prod-admin --region us-east-1 \
  --db-cluster-identifier <cluster-id> \
  --query 'DBClusters[0].[ServerlessV2ScalingConfiguration,PendingModifiedValues]'

If PendingModifiedValues is non-empty, the change is queued for the maintenance window. Force apply with --apply-immediately on a direct modify-db-cluster call if needed (be aware of brief downtime).
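The pending check can be turned into a yes/no answer for scripts. A minimal sketch; the function name and its exit-code convention (0 = something is pending, 1 = nothing pending, 2 = the lookup failed) are assumptions for illustration.

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}

# Hypothetical helper: report whether a cluster has changes queued for the
# maintenance window. Prints the pending values as JSON when there are any.
pending_cluster_changes() {
  pending=$(aws rds describe-db-clusters --profile "$PROFILE" --region "$REGION" \
    --db-cluster-identifier "$1" \
    --query 'DBClusters[0].PendingModifiedValues' --output json) || return 2
  case $pending in
    null|'{}'|'') echo "no pending changes"; return 1 ;;
    *)            printf '%s\n' "$pending"; return 0 ;;
  esac
}
```

Usage: pending_cluster_changes <cluster-id> && echo "queued for the maintenance window".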

E. NAT Gateway / route table changes (deploy, traffic reroutes)

Examples: switching NatStrategy from OnePerAZ to SingleNat, adding gateway endpoints.

Deploy normally. Routes update atomically — in-flight TCP connections may briefly stall, then reconnect. Schedule during low-traffic if prod.

F. Parameters & Secrets (out of band from CFN)

CFN can manage the resource (AWS::SecretsManager::Secret, AWS::SSM::Parameter) but the value typically rotates via the service, not via CFN. Don't try to deploy secret values from CFN — you'll leak them into CloudFormation events.

Manage values via:

  • Secrets Manager rotation (preferred for DB creds)
  • Direct aws secretsmanager put-secret-value for manual updates
  • Parameter Store via aws ssm put-parameter

If a stack reference like '{{resolve:secretsmanager:...}}' doesn't update, the issue is almost always that the consumer (e.g., the EC2 instance or Lambda) hasn't picked up the new value — refresh it.

Pre-flight checklist (before any CFN deploy)

  • Working tree is clean (git status -s empty for tracked files)
  • On the right branch (dev for non-prod, follow promotion process for prod)
  • Pulled latest (git pull)
  • Template diff against deployed is exactly what you expect:
    aws cloudformation get-template --profile vell-prod-admin --region us-east-1 \
      --stack-name <stack> --template-stage Original --query 'TemplateBody' \
      --output text > /tmp/deployed.yml
    diff -u cloudformation/path/to/template.yml /tmp/deployed.yml
    
  • Created a change-set first; reviewed every Action and ResourceType:
    aws cloudformation create-change-set --profile vell-prod-admin --region us-east-1 \
      --stack-name <stack> --change-set-name "preview-$(date +%s)" \
      --change-set-type UPDATE --template-body file://<path> \
      --parameters ParameterKey=<each>,UsePreviousValue=true ... \
      --capabilities CAPABILITY_IAM
    # Then describe and inspect
    
  • Identified whether this is type A, B, C, D, E, or F above
  • If type B (Launch Template / UserData): scheduled an instance refresh window
  • If touching prod: confirmed with stakeholder; deploy in dev/demo first
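The change-set step of the checklist can be scripted end to end: create, wait, print, delete. A minimal sketch under stated assumptions: the function name is hypothetical, every existing stack parameter is reused with UsePreviousValue=true, and the new template neither adds a required parameter nor removes an existing one (either would make the create call fail).

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}

# Hypothetical helper. Usage: preview_changes <stack> <template-path>
preview_changes() {
  stack=$1 template=$2
  name="preview-$(date +%s)"

  # Reuse every parameter value currently set on the stack.
  keys=$(aws cloudformation describe-stacks --profile "$PROFILE" --region "$REGION" \
    --stack-name "$stack" --query 'Stacks[0].Parameters[].ParameterKey' \
    --output text) || return 1
  set --
  for key in $keys; do
    [ "$key" = None ] || set -- "$@" "ParameterKey=$key,UsePreviousValue=true"
  done
  if [ $# -gt 0 ]; then set -- --parameters "$@"; fi

  aws cloudformation create-change-set --profile "$PROFILE" --region "$REGION" \
    --stack-name "$stack" --change-set-name "$name" --change-set-type UPDATE \
    --template-body "file://$template" --capabilities CAPABILITY_IAM "$@" \
    >/dev/null || return 1

  # The waiter fails when the change set ends in FAILED, which includes "no changes".
  if aws cloudformation wait change-set-create-complete --profile "$PROFILE" \
      --region "$REGION" --stack-name "$stack" --change-set-name "$name"; then
    aws cloudformation describe-change-set --profile "$PROFILE" --region "$REGION" \
      --stack-name "$stack" --change-set-name "$name" \
      --query 'Changes[].ResourceChange.[Action,LogicalResourceId,ResourceType,Replacement]' \
      --output table
  else
    aws cloudformation describe-change-set --profile "$PROFILE" --region "$REGION" \
      --stack-name "$stack" --change-set-name "$name" \
      --query 'StatusReason' --output text
  fi

  # Delete the preview so it does not linger on the stack.
  aws cloudformation delete-change-set --profile "$PROFILE" --region "$REGION" \
    --stack-name "$stack" --change-set-name "$name"
}
```

Usage: preview_changes <stack> <path>. Read every row; a Replacement value of True on a stateful resource is the one to stop on.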

Post-flight checklist (after any CFN deploy)

  • Stack status is CREATE_COMPLETE or UPDATE_COMPLETE
  • Template drift is zero (diff against get-template again)
  • If type B: instance refresh kicked off and succeeded
  • Application health check (ALB target group, app metric, smoke test)
  • CloudWatch logs show app starting cleanly on new instances
  • If touching cost-relevant resources: note the change in docs/infrastructure/COST_*.md or relevant runbook
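The first post-flight item can be checked by script, with failed events printed when the status is anything else. A minimal sketch; the function name is hypothetical and the event filter simply matches any ResourceStatus containing FAILED.

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}

# Hypothetical helper: succeed only if the stack ended in a clean state.
check_stack() {
  stack=$1
  status=$(aws cloudformation describe-stacks --profile "$PROFILE" --region "$REGION" \
    --stack-name "$stack" --query 'Stacks[0].StackStatus' --output text) || return 1
  case $status in
    CREATE_COMPLETE|UPDATE_COMPLETE)
      echo "ok: $stack is $status"
      return 0 ;;
  esac

  # Anything else (rollbacks included) is a problem: show which resources failed.
  echo "problem: $stack is $status" >&2
  aws cloudformation describe-stack-events --profile "$PROFILE" --region "$REGION" \
    --stack-name "$stack" \
    --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[Timestamp,LogicalResourceId,ResourceStatus,ResourceStatusReason]" \
    --output table >&2
  return 1
}
```

Usage: check_stack <stack> || echo "do not walk away yet".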

Quarterly drift sweep

Run this on the first of every quarter. Catches manual changes you didn't notice, deploys you forgot, and stale Launch Template usage.

# 1. Trigger drift detection on every stack with non-trivial resources
PROFILE=vell-prod-admin
REGION=us-east-1

for s in $(aws cloudformation list-stacks --profile $PROFILE --region $REGION \
  --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
  --query 'StackSummaries[?!starts_with(StackName, `StackSet-`) && !starts_with(StackName, `SC-RA-`)].StackName' \
  --output text); do
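  # Report failures instead of silently printing "queued" for every stack
  if aws cloudformation detect-stack-drift --profile $PROFILE --region $REGION \
    --stack-name "$s" >/dev/null 2>&1; then
    echo "queued: $s"
  else
    echo "FAILED to queue: $s" >&2
  fi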
done

# 2. Wait 5 min for detection to complete

# 3. Pull results
aws cloudformation list-stacks --profile $PROFILE --region $REGION \
  --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
  --query 'StackSummaries[?DriftInformation.StackDriftStatus==`DRIFTED`].[StackName,DriftInformation.StackDriftStatus,DriftInformation.LastCheckTimestamp]' \
  --output table

# 4. Detail any DRIFTED stacks
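for s in $(aws cloudformation list-stacks --profile $PROFILE --region $REGION \
  --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
  --query 'StackSummaries[?DriftInformation.StackDriftStatus==`DRIFTED`].StackName' \
  --output text); do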
  echo "=== $s ==="
  aws cloudformation describe-stack-resource-drifts --profile $PROFILE --region $REGION \
    --stack-name "$s" \
    --query 'StackResourceDrifts[?StackResourceDriftStatus!=`IN_SYNC`].[LogicalResourceId,ResourceType,StackResourceDriftStatus]' \
    --output table
done

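# 5. For ASG stacks: check stale Launch Template usage
for asg in $(aws autoscaling describe-auto-scaling-groups --profile $PROFILE --region $REGION \
  --query 'AutoScalingGroups[].AutoScalingGroupName' --output text); do
  # First line: ASG name and the LT version the ASG is configured with
  aws autoscaling describe-auto-scaling-groups --profile $PROFILE --region $REGION \
    --auto-scaling-group-names "$asg" \
    --query 'AutoScalingGroups[0].[AutoScalingGroupName,LaunchTemplate.Version]' \
    --output text
  # Following lines: every instance (not just the first) and the LT version it launched from
  aws autoscaling describe-auto-scaling-groups --profile $PROFILE --region $REGION \
    --auto-scaling-group-names "$asg" \
    --query 'AutoScalingGroups[0].Instances[].[InstanceId,LaunchTemplate.Version]' \
    --output text
  # Any instance version that differs from the ASG's is running a stale LT.
  # If the ASG shows $Latest or $Default, resolve it to a number before comparing.
done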

Anti-patterns to avoid

| Don't | Because |
|---|---|
| Edit a stack via the AWS console "just this once" | The next CFN update will revert your change AND surprise you |
| aws cloudformation deploy and walk away | Watch the events; rollback failures are quieter than they should be |
| Skip change-set preview on prod | The preview is the difference between "3 Remove rows" and "47 unexpected changes" |
| Deploy a UserData change without instance refresh | CFN says IN_SYNC, instances run last quarter's config |
| Commit local-only template edits and never deploy | Pre-drift accumulates; every future deploy carries surprise changes |
| Mix unrelated changes in one PR | Hard to roll back the bad part without losing the good part |
| Force-deploy through a failing change-set | The change-set is failing for a reason; debug first |
| Use --no-execute-changeset and forget | Orphan change-sets pile up; they expire but waste mental space |
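For the orphan change-set case, leftovers can be listed and deleted by name prefix. A minimal sketch; the function name is hypothetical, and the default prefix preview- only matches change-sets named the way the pre-flight checklist names them. Pass a different prefix for change-sets created by other tooling.

```shell
PROFILE=${PROFILE:-vell-prod-admin}
REGION=${REGION:-us-east-1}

# Hypothetical helper. Usage: delete_change_sets <stack> [name-prefix]
delete_change_sets() {
  stack=$1
  prefix=${2:-preview-}
  names=$(aws cloudformation list-change-sets --profile "$PROFILE" --region "$REGION" \
    --stack-name "$stack" \
    --query "Summaries[?starts_with(ChangeSetName, '$prefix')].ChangeSetName" \
    --output text) || return 1

  for name in $names; do
    [ "$name" = None ] && continue
    echo "deleting change-set: $name"
    aws cloudformation delete-change-set --profile "$PROFILE" --region "$REGION" \
      --stack-name "$stack" --change-set-name "$name"
  done
}
```

Usage: delete_change_sets <stack>. It only removes unexecuted change-sets; the stack itself is not touched.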

Cross-references