Most cloud infrastructure environments let you auto-scale based on a variety of metrics. If available, auto-scaling by queue length is most often the right choice in a full-fledged, event-driven architecture – but this can be a little counter-intuitive. If you’re paying for resources like CPU hours, for instance, why would you start scaling out before you have made full use of the resources you are already paying for? Not quite… Here’s why.
Once upon a time in the data center, monitoring and scaling servers was a manual process. An administrator might monitor resource usage for their “mission-critical” servers, a.k.a. pets, to ensure that they weren’t getting overloaded. When the total CPU reached a certain threshold for a set period of time, it might trigger an alert. If the total CPU utilization for the company web portal started to max out, another server could be built and added to the cluster.
In practice, this process was so slow that the whole cluster was usually just over-sized to avoid the situation altogether. Things became more tolerable with machine virtualization, in part because growth and traffic trends for many apps were still reasonably predictable and image-based deployment was automated.
Even then, scaling based on resource consumption was grossly misguided because users often experience slow response times well before server resources are exhausted. Once a server reaches its maximum throughput, any additional load causes throughput to decline and user response latency to skyrocket. And in this regard, things are no different in the cloud.
Just as in the physical world, when new requests are arriving faster than they can be processed, a queue is going to form somewhere and start to back up. Users at the back of the line will experience much longer wait times. This is the iron law of queuing theory.
Unfortunately, you won’t find a “throughput” metric for auto-scaling because the point of maximum throughput is different for every application. In the absence of this metric, however, the queue length is the best indicator that it’s time to scale up or down.
If you imagine that you are a shift manager at a supermarket during peak shopping hours, this idea becomes fairly obvious. Knowing that the average checkout time (throughput) is fairly constant, you will want to open a new register as soon as you see a certain number of shoppers in line. This tells you that shoppers are arriving faster than the existing cashiers can check them out. If you don’t open a new lane (auto-scale) quickly, shoppers may decide to abandon their carts and leave, because they are doing the same calculation themselves. As in the supermarket, if you don’t add capacity to your application before long queues form, waiting users may simply leave in frustration. And if queues grow too long, request timeouts won’t be far behind.
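The shift manager's rule of thumb is really just queuing arithmetic: with a roughly constant service time, the last shopper's wait is the backlog divided by the combined service rate. A minimal sketch of that calculation (all function names and numbers here are illustrative, not from any autoscaler API):

```python
import math

def estimated_wait_seconds(queue_length: int, avg_processing_s: float, workers: int) -> float:
    """Rough wait for the last message in line: each worker drains messages at
    1/avg_processing_s per second, so the backlog clears in
    queue_length * avg_processing_s / workers seconds."""
    return queue_length * avg_processing_s / workers

def workers_needed(queue_length: int, avg_processing_s: float, max_wait_s: float) -> int:
    """Smallest worker count that keeps the estimated wait under max_wait_s."""
    return max(1, math.ceil(queue_length * avg_processing_s / max_wait_s))

# 40 shoppers in line, ~90 s per checkout, and nobody should wait over 10 minutes:
print(estimated_wait_seconds(40, 90.0, 2))   # -> 1800.0 (a 30-minute wait with 2 cashiers)
print(workers_needed(40, 90.0, 600.0))       # -> 6 cashiers needed for a 10-minute cap
```

Notice that the decision depends only on the queue length once the average processing time is known, which is exactly why queue length works as a scaling trigger.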
With most cloud-native apps, especially the B2C variety, traffic peaks and valleys may vary greatly, change quickly, and be difficult to predict. Reacting quickly to this kind of fluctuation is critical for remaining responsive to users when traffic spikes, and then reducing resource costs when it subsides.
In a modern event-driven architecture, a single customer request may involve many different messages, services, and queues. A single order may require a dozen price lookups and inventory checks, several rewards lookups, one or two tax calculations, a credit card validation and charge, and a single receipt. The number of messages that end up on any given queue within such sagas, and the average time it takes to process them, may differ a lot.
One advantage of relying on queue lengths for auto-scaling is that services can scale independently, helping to minimize the “noisy neighbor” issue. You have an opportunity to “tune” the app by adjusting the queue length thresholds. But choosing the right threshold for each queue isn’t as simple as it sounds. The length of the queue is determined by the arrival rate of new messages and the average processing time for each message, both of which are highly dynamic based on runtime conditions. How can you decide what is the best value for each queue length?
There is a fairly rigorous way to answer this question: Transaction Cost Analysis (TCA). In practice, though, TCA can be too costly and time-consuming for the relative value of the answers it provides in this context. A rough-and-ready approach may be just as good, given that traffic patterns are constantly evolving anyway: test your services under sufficient load to cause the queue to grow, and measure both the (input) queue length and the average processing latency for the consuming service. By gradually increasing the arrival rate, you can see what the queue length is when wait times become unacceptable.
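That rough-and-ready analysis can be sketched in a few lines. Assuming you have logged (queue length, observed latency) pairs while gradually ramping up the arrival rate, pick the largest queue length whose latency still met your target (the sample measurements below are made up for illustration):

```python
def threshold_from_load_test(samples, max_acceptable_latency_s):
    """samples: list of (queue_length, observed_latency_s) pairs from a ramp-up test.
    Returns the largest queue length whose latency was still acceptable,
    i.e. a candidate scale-out threshold for that queue."""
    acceptable = [q for q, latency in samples if latency <= max_acceptable_latency_s]
    return max(acceptable) if acceptable else 0

# Hypothetical ramp-up data: latency stays flat, then climbs sharply as the
# service passes its maximum throughput.
ramp = [(10, 0.2), (50, 0.25), (100, 0.4), (200, 0.9), (400, 2.5), (800, 9.0)]
print(threshold_from_load_test(ramp, max_acceptable_latency_s=1.0))  # -> 200
```

Because each queue has its own arrival rate and processing time, you would repeat this exercise per queue rather than reuse one global threshold.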
If you are auto-scaling your services with Azure Monitor, both Azure Storage queues and Azure Service Bus expose a queue length metric that Azure Monitor can use for this auto-scaling strategy.
If your services will be hosted in Kubernetes, you will need a custom metrics adapter to surface queue length to the HorizontalPodAutoscaler (HPA). On Google Kubernetes Engine that role is played by the custom metrics Stackdriver adapter; for Azure queues, the Azure Monitor metrics adapter or KEDA can do the same job.
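Once the metric is visible, the HPA's scaling rule itself is simple: it multiplies the current replica count by the ratio of the observed metric to the target and rounds up, clamped to the configured bounds. A sketch of that calculation with queue length as the metric (the target of 100 messages per replica is just an example threshold):

```python
import math

def desired_replicas(current: int, total_queue_length: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """HPA rule: desired = ceil(current * currentMetric / targetMetric).
    With an AverageValue target, currentMetric is the queue length per replica,
    so this reduces to ceil(total_queue_length / target_per_replica),
    clamped to the configured min/max replica bounds."""
    per_replica = total_queue_length / current
    desired = math.ceil(current * per_replica / target_per_replica)
    return min(max_replicas, max(min_replicas, desired))

# 3 replicas, 850 queued messages, target of 100 messages per replica:
print(desired_replicas(3, 850, 100))  # -> 9
```

The clamp matters in practice: without a sensible max, a sudden backlog spike could request far more replicas than your cluster (or budget) can absorb.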
Even if your services sit behind Azure API Management instances, you can still take partial advantage of this strategy. Azure Monitor can auto-scale your APIM instances based on a composite metric called capacity, which takes the incoming request queue length for each API into account. If in doubt, favor the lower end of the scale, because these instances take some time to spin up.
Finally, for those of you who can do the math, the ideal auto-scaling metric is actually the acceleration of the queue length over a period of time, i.e. the second derivative: not just whether the queue is growing or shrinking, but whether that growth is speeding up or slowing down. This figure could be used to auto-scale predictively, but so far no orchestrators are that smart. Maybe you can drop a note in the suggestion box for your favorite autoscaler!
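For the curious, that second derivative can be estimated from the last three queue-length samples with a finite difference, letting a hypothetical predictive scaler act while the backlog is still accelerating rather than after it is already long. A toy sketch (the sampling interval and threshold are invented for illustration):

```python
def queue_acceleration(samples, interval_s: float) -> float:
    """Second derivative of queue length via finite differences over the last
    three samples: (q[t] - 2*q[t-1] + q[t-2]) / interval^2."""
    q2, q1, q0 = samples[-3], samples[-2], samples[-1]
    return (q0 - 2 * q1 + q2) / (interval_s ** 2)

def should_preemptively_scale_out(samples, interval_s: float, accel_threshold: float = 0.5) -> bool:
    """Scale out early if the queue is both growing and accelerating."""
    growing = samples[-1] > samples[-2]
    return growing and queue_acceleration(samples, interval_s) > accel_threshold

# Queue lengths sampled every 10 s: the growth itself is speeding up.
history = [100.0, 120.0, 180.0, 300.0]
print(should_preemptively_scale_out(history, interval_s=10.0))  # -> True
```

A queue growing at a steady linear rate would yield an acceleration of zero and trigger nothing here, which is the point: plain growth can be handled by the ordinary threshold rule, while acceleration signals trouble ahead.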