The Importance of a Good AWS Auto Scaling Policy

Anosh Billimoria
4 min read · Jul 11, 2021


A brief account of why being thorough with your Auto Scaling strategy can go a long way.

The article assumes some elementary knowledge of AWS, Auto Scaling, EC2 compute, and SQS; if not, the AWS documentation for each of these services is a good place to get up to speed.

The System

As a software engineer, there are times when you come across a bug or flaw that, at first glance, makes you wonder, “how did such an obvious defect get into production?”

Nonetheless, what’s done is done and now you need to fix it.

It is quite common within AWS to have multiple services interact with and complement each other, such as:

S3 → SQS → CloudWatch Alarm → Auto Scaling → EC2

To break this down simply:

  • An object is uploaded to S3.
  • An SQS message is created for that object.
  • The queue's message count acts as a trigger that either raises a CloudWatch Alarm or not.
  • Based on that alarm, Auto Scaling policies act to add or remove EC2 instances.
  • Finally, the created EC2 instances process the messages and get terminated once the queue is empty.

Sounds simple, and it actually is. But like anything in tech, subtle oversights can lead to failures sooner or later.
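
For context, the S3 → SQS link in this chain is just a notification configuration on the bucket. Here is a minimal sketch of that wiring; the UploadBucket and InputQueue logical IDs are assumptions for illustration, and a real template would also need an SQS queue policy allowing S3 to send messages to the queue:

"UploadBucket": {
  "Type": "AWS::S3::Bucket",
  "Properties": {
    "NotificationConfiguration": {
      "QueueConfigurations": [
        {
          "Event": "s3:ObjectCreated:*",
          "Queue": {
            "Fn::GetAtt": ["InputQueue", "Arn"]
          }
        }
      ]
    }
  }
}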

The Problem

Now that we have established how the chain of command works, let’s look at a simple implementation and why sometimes being too simple can be detrimental.

The simple Auto Scaling policy:

"Properties": {
"AdjustmentType": "ExactCapacity",
"AutoScalingGroupName": {
"Ref": "BatchProcessorGroup"
},
"PolicyType": "StepScaling",
"MetricAggregationType": "Average",
"EstimatedInstanceWarmup": "600",
"StepAdjustments": [
{
"MetricIntervalLowerBound": "0",
"MetricIntervalUpperBound": "1",
"ScalingAdjustment": "0"
},
{
"MetricIntervalLowerBound": "1",
"MetricIntervalUpperBound": "3",
"ScalingAdjustment": "1"
},
{
"MetricIntervalLowerBound": "3",
"ScalingAdjustment": "3"
}
]
}

Simplified:

When the alarm fires, the policy sets the number of instances (ExactCapacity) to the ScalingAdjustment of whichever step the SQS message count falls into. The step bounds are offsets from the alarm threshold, which is 0 here, so in effect: 0 visible messages → 0 instances, 1 or 2 messages → 1 instance, 3 or more messages → 3 instances.
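
For reference, the Ref to BatchProcessorGroup above points at the Auto Scaling group being scaled. A minimal sketch of such a group could look like the following, with a MaxSize of 3 matching the largest step above and a MinSize of 0 so the fleet can scale all the way down; the BatchProcessorLaunchTemplate and PrivateSubnet names are assumptions standing in for resources defined elsewhere in the template:

"BatchProcessorGroup": {
  "Type": "AWS::AutoScaling::AutoScalingGroup",
  "Properties": {
    "MinSize": "0",
    "MaxSize": "3",
    "DesiredCapacity": "0",
    "VPCZoneIdentifier": [
      {
        "Ref": "PrivateSubnet"
      }
    ],
    "LaunchTemplate": {
      "LaunchTemplateId": {
        "Ref": "BatchProcessorLaunchTemplate"
      },
      "Version": {
        "Fn::GetAtt": ["BatchProcessorLaunchTemplate", "LatestVersionNumber"]
      }
    }
  }
}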

Alarm Trigger:

"Properties": {
"EvaluationPeriods": "1",
"Statistic": "Average",
"Threshold": "0",
"AlarmDescription": "Alarm if SQS ApproximateNumberOfMessagesVisible > than Threshold",
"Period": "60",
"AlarmActions": [
{
"Ref": "ScalingPolicy"
}
],
"Namespace": "AWS/SQS",
"Dimensions": [
{
"Name": "QueueName",
"Value": {
"Ref": "InputQueueName"
}
}
],
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"MetricName": "ApproximateNumberOfMessagesVisible"
}

Based on the above alarm, the scaling policy is triggered when the number of messages in the queue reaches or exceeds the threshold.

Here, I would like to mention that SQS classifies its messages in two ways:

  • Available (ApproximateNumberOfMessagesVisible)
  • InFlight (ApproximateNumberOfMessagesNotVisible)

The glaring issue in the simple policy is that once the available messages reach zero, the policy sets the instance count to zero. Sounds sane, right? Wrong.

This implementation falls apart if, say, an instance is still processing a message when Auto Scaling signals its termination. The processing is left incomplete and you would not even know why; no logs would indicate what happened, since the instance terminated before it could push any.

This is precisely the issue that came up: a message for a large object was still being processed when the visible messages in the queue reached zero, and the instance was terminated mid-job. The more you think about it, the more obvious the issue becomes.

The Solution

A more robust setup handles the removal of instances explicitly and keeps track of not only the available messages but the in-flight ones as well: one policy to add instances and one to remove them, each driven by its own alarm.

Addition Policy:

  • Changed AdjustmentType from ExactCapacity to ChangeInCapacity: the addition policy now adds the number of instances given in ScalingAdjustment to the current capacity, instead of setting the capacity to that exact value.

"Properties": {
"AdjustmentType": "ChangeInCapacity",
"AutoScalingGroupName": {
"Ref": "BatchProcessorGroup"
},
"PolicyType": "StepScaling",
"MetricAggregationType": "Average",
"EstimatedInstanceWarmup": "600",
"StepAdjustments": [
{
"MetricIntervalLowerBound": "0",
"MetricIntervalUpperBound": "1",
"ScalingAdjustment": "0"
},
{
"MetricIntervalLowerBound": "1",
"MetricIntervalUpperBound": "3",
"ScalingAdjustment": "1"
},
{
"MetricIntervalLowerBound": "3",
"ScalingAdjustment": "3"
}
]
}

Removal Policy:

This takes care of setting the capacity to zero and removing all instances. The best part is that we now choose when this happens, rather than having it baked into a single policy as before.

"Properties": {
"AdjustmentType": "ExactCapacity",
"AutoScalingGroupName": {
"Ref": "BatchProcessorGroup"
},
"PolicyType": "StepScaling",
"MetricAggregationType": "Average",
"EstimatedInstanceWarmup": "600",
"StepAdjustments": [
{
"MetricIntervalUpperBound": "0",
"ScalingAdjustment": "1"
}
]
}

Available messages Alarm:

This alarm triggers when available messages build up in SQS and invokes the addition policy.

"Properties": {
"EvaluationPeriods": "1",
"Statistic": "Average",
"Threshold": "0",
"AlarmDescription": "Alarm if SQS ApproximateNumberOfMessagesVisible >= than Threshold",
"Period": "60",
"AlarmActions": [
{
"Ref": "AddInstancePolicy"
}
],
"Namespace": "AWS/SQS",
"Dimensions": [
{
"Name": "QueueName",
"Value": {
"Ref": "InputQueueName"
}
}
],
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"MetricName": "ApproximateNumberOfMessagesVisible"
}

In-Flight Messages Alarm:

This alarm triggers only when the in-flight messages reach zero and invokes the removal policy.

"Properties": {
"EvaluationPeriods": "1",
"Statistic": "Average",
"Threshold": "0",
"AlarmDescription": "Alarm if SQS ApproximateNumberOfMessagesNotVisible <= than Threshold",
"Period": "60",
"AlarmActions": [
{
"Ref": "RemoveInstancePolicy"
}
],
"Namespace": "AWS/SQS",
"Dimensions": [
{
"Name": "QueueName",
"Value": {
"Ref": "InputQueueName"
}
}
],
"ComparisonOperator": "LessThanOrEqualToThreshold",
"MetricName": "ApproximateNumberOfMessagesNotVisible"
}
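
One caveat for this scale-in signal to be reliable: a message only stays in flight (not visible) for as long as the queue's visibility timeout allows, so that timeout needs to be at least as long as the longest expected job, or the consumer must keep extending it while it works. A minimal sketch of the queue definition, where the InputQueue logical ID and the 3600-second timeout are assumed values:

"InputQueue": {
  "Type": "AWS::SQS::Queue",
  "Properties": {
    "QueueName": {
      "Ref": "InputQueueName"
    },
    "VisibilityTimeout": 3600
  }
}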

Conclusion

This has been my attempt to share an experience with a seemingly simple problem, and how a lack of attention to detail can lead to bugs that are ambiguous and difficult to track down.

I hope this helped you; if so, please share it with someone who might benefit from it. Keep watching this space as I continue to share observations from my work. Thanks for sticking around to the end.
