Cloud Control for Developers (part two)

Limits, Alarms & Budgets

In the previous post we talked about how the cloud lets us move at a velocity we have not seen before. We also touched on the modern challenges this forces us to confront. To recap: as developers we can create whole proverbial cities of application infrastructure in the cloud in a matter of minutes. These cities can get out of control in a number of ways. First, we might lose track of one because we are building many of them in different locations. Second, the connections between the cities might break because changes have been made. And last, our cities might unexpectedly grow beyond our budget.

For the first two of these problems we have seen approaches we can take. By using Infrastructure-as-Code we can keep track of where cities exist and more importantly, how they were configured. The risk of breaking integrations between them can then be mitigated by creating integration tests powered by the very same IaC templates we write. Only the last challenge remains: how do we prevent our cities from scaling out of control?

The cloud was designed to allow for incredible, often seemingly infinite, scalability. And while this infinity is only an illusion, the truth remains that the cloud will out-scale our budget much earlier than it will encounter any physical server limit. Therefore we must do something to avoid the already infamous bill shock: receiving an unexpectedly large bill from your cloud provider at the end of the month.

We will solve this issue in two ways. First we will define limits on our services wherever possible. This is a robust yet rather absolute way of limiting cloud spend. In situations where we cannot impose a limit, or where we would rather not, we will set up alarms. That way we will be aware of any potential issues with scalability we might face.

As in the previous post we will be applying these principles on AWS and the examples will be given in Java using the Cloud Development Kit.

Limiting scalability

There are different kinds of limits we can talk about. We can limit usage of a particular service in the cloud; such a limit is often relative to time (e.g. throttling). We can also limit the number of resources we can create; such a limit would be, for example, the absolute number of virtual machines we are allowed to have. In this context both limits come down to the same result: limiting spend. Now you might be inclined to assume that we only need to impose a budget of sorts and limit our cloud spend by disabling our account if we cross a boundary. And while some cloud providers actually allow you to do this (see for example Google Cloud), it is important to understand that each of these limits has a very different use case and that they are really not the same.

If your cloud provider allows it, using an absolute spending limit can be a great measure for development and experimental accounts. Being able to cap the maximum spend for each account makes it easy to give out accounts to different people and thereby isolate them on their own little islands. It is not, however, a tool meant to be used on production systems where uptime is more important than avoiding extra cost. In such cases you want to limit the use of your services so that users can neither abuse your system by making too many calls nor overload it during peak moments in traffic. Limiting the number of resources you can create is the last limit we mentioned, and it can be useful in situations where you cannot impose a spending limit but still want to limit cloud spend for development.

Limits on AWS

Let’s immediately discuss the elephant in the AWS room: AWS does not make it easy to limit our cloud spend. Limiting with a budget is straight up not possible. The AWS Budgets service does not function as a limit, but more as a specialized alarm. Setting it up is highly recommended though, so we’ll go ahead and do exactly that later. Furthermore, it is service-dependent whether we can limit the total number of active instances. With EC2 for example, we cannot. This leaves us mostly with service limits that we can apply, but they will take us a long way.

On AWS many of the services can be configured to obey certain scalability limits, for example by configuring throttling on API Gateway or by setting an upper limit on an auto-scaling group. Whenever a service allows for this we can set the limit through Infrastructure-as-Code. In certain cases, however, this might not be straightforward or even possible. An example of this is limiting how often a serverless function can be invoked in a certain time period. In such cases, or alternatively whenever it does not make sense to limit scalability, we will implement alarms.

Limits using CDK

We will apply a throttling limit to our API Gateway and configure an email notification as an alarm for when our lambda function is invoked more often than we expected.

The throttling limit itself can be configured using the deployOptions of our resource. It consists of both a burst limit and a rate limit. More information on how these two interact can be found here. We can then throttle our API to ten requests per second as follows:

public class CodeStack extends Stack {
    public CodeStack(final Construct scope, final String id, final StackProps props) {
        super(scope, id, props);

        // Other resources removed for readability

        LambdaRestApi lambdaRestApi = LambdaRestApi.Builder.create(this, "CloudControlApi")
                .handler(apiFunction)
                .deployOptions(StageOptions.builder()
                        .throttlingBurstLimit(10) // maximum number of requests in a short burst
                        .throttlingRateLimit(10)  // steady-state requests per second
                        .build())
                .build();
    }
}
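
To get a feel for how the burst and rate limit interact, here is a minimal plain-Java sketch of a token bucket, the algorithm AWS documents as the basis for API Gateway throttling. It is a simplified model and not API Gateway's actual implementation: the class name `TokenBucketSketch`, the method `tryAcquire` and the use of caller-supplied nanosecond timestamps are assumptions made purely for illustration. The burst limit is modelled as the bucket capacity and the rate limit as the refill speed.

```java
/** Simplified token bucket: burst = bucket capacity, ratePerSecond = refill speed. */
final class TokenBucketSketch {
    private final double ratePerSecond;
    private final double burst;
    private double tokens;
    private long lastNanos;

    TokenBucketSketch(double ratePerSecond, double burst, long startNanos) {
        this.ratePerSecond = ratePerSecond;
        this.burst = burst;
        this.tokens = burst; // start with a full bucket
        this.lastNanos = startNanos;
    }

    /** Returns true if a request arriving at nowNanos is allowed, false if it is throttled. */
    boolean tryAcquire(long nowNanos) {
        // Refill according to the time passed, but never beyond the burst capacity.
        double elapsedSeconds = (nowNanos - lastNanos) / 1_000_000_000.0;
        tokens = Math.min(burst, tokens + elapsedSeconds * ratePerSecond);
        lastNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0; // each request costs one token
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Same numbers as the stage options above: rate 10 per second, burst 10.
        TokenBucketSketch bucket = new TokenBucketSketch(10, 10, 0L);
        int allowed = 0;
        for (int i = 0; i < 15; i++) {
            if (bucket.tryAcquire(0L)) {
                allowed++;
            }
        }
        System.out.println("Allowed out of 15 simultaneous requests: " + allowed);
    }
}
```

In this model, fifteen requests arriving at the same instant drain the ten available tokens, and the remaining five are throttled until the bucket refills at ten tokens per second.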

Alarms for monitoring scalability

Alarms are an important tool when trying to control risk. They are often used to monitor machine metrics so that failures are detected in a timely manner and upgrades, for example to disk space, are carried out before capacity runs out. This type of alarm is used in operations and relies on machine metrics for availability or health checks. In the modern cloud we are growing less concerned with machine metrics due to the advent of serverless, but the way serverless services scale only makes monitoring their use and scalability more important, not less.

Alarms using CloudWatch on AWS

Alarms on AWS are configured using CloudWatch. CloudWatch is the service that collects all the logs and metrics on an AWS account. The metrics can be used to trigger an action, which can be anything from a notification by email or Slack to automatic resolution of a security risk (more on this in a previous blog here). The exact components of CloudWatch and how it functions are not important to us right now, as alarms can be configured in CDK without much knowledge of it.

CloudWatch and CDK

Defining an alarm is a bit more involved than defining a limit. We set up an SNS topic and subscribe our email address to it for notifications, then configure an alarm on a metric and connect it to our SNS topic. This can look as follows:

public class CodeStack extends Stack {
    public CodeStack(final Construct scope, final String id, final StackProps props) {
        super(scope, id, props);

        // Other resources removed for readability

        Function apiFunction = Function.Builder.create(this, "CloudControlApiFunction")
                .runtime(Runtime.JAVA_11)
                .timeout(Duration.seconds(15))
                .memorySize(512)
                .code(Code.fromAsset("../api/target/api-0.1.jar"))
                .handler("nl.p4c.code.api.AwsHandler")
                .build();
        
        // Create an SNS topic
        Topic alarmTopic = Topic.Builder.create(this, "EmailTopic")
                .topicName("ApiAlarmTopic")
                .displayName("API Alarm")
                .build();
        
        // Register our email
        alarmTopic.addSubscription(EmailSubscription.Builder
                .create("my-email@profit4cloud.nl")
                .build());

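        // Choose the metric to monitor: the number of invocations per 5-minute period.
        // We ask for the Sum statistic explicitly so the threshold counts invocations;
        // the generic metric("Invocations") would default to the Average statistic.
        Metric metric = apiFunction.metricInvocations(MetricOptions.builder()
                .statistic("Sum")
                .period(Duration.minutes(5))
                .build());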

        // Configure the alarm to trigger after 100 invocations in 5 minutes
        Alarm alarm = Alarm.Builder.create(this, "InvocationAlarm")
                .metric(metric)
                .threshold(100)
                .evaluationPeriods(1) // This corresponds to 5 minutes for this metric
                .treatMissingData(TreatMissingData.IGNORE)
                .actionsEnabled(true)
                .build();

        // Register our SNS topic to our alarm
        SnsAction snsAction = new SnsAction(alarmTopic);
        alarm.addAlarmAction(snsAction);
    }
}
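
To make the alarm's behaviour more concrete, the evaluation rule can be sketched in plain Java, independent of CDK. This is a deliberately simplified model and not how CloudWatch is implemented: the names `AlarmSketch` and `inAlarm` are hypothetical, the input is assumed to be the number of invocations summed per 5-minute period, the comparison is assumed to be "greater than or equal to the threshold", and missing data is ignored entirely.

```java
import java.util.List;

/** Simplified model of alarm evaluation over the most recent periods. */
final class AlarmSketch {

    /**
     * Returns true when each of the most recent evaluationPeriods values
     * is greater than or equal to the threshold.
     */
    static boolean inAlarm(List<Integer> invocationsPerPeriod, int threshold, int evaluationPeriods) {
        int n = invocationsPerPeriod.size();
        if (n < evaluationPeriods) {
            return false; // not enough data points yet
        }
        for (int i = n - evaluationPeriods; i < n; i++) {
            if (invocationsPerPeriod.get(i) < threshold) {
                return false; // one period below the threshold keeps the alarm quiet
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Three consecutive 5-minute periods; only the last one counts with evaluationPeriods = 1.
        List<Integer> periods = List.of(20, 35, 120);
        System.out.println("In alarm: " + inAlarm(periods, 100, 1));
    }
}
```

With a single evaluation period, one busy five-minute window is enough to send the email. Raising the number of evaluation periods makes the alarm less sensitive to short spikes, at the cost of a later notification.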

CDK deep-dive: budgets using CDK

One of the most useful things that we always advise doing is to configure budgets on each AWS account. An AWS budget allows you to set up alarms that are triggered when, for example, the estimated monthly spending of an account exceeds your expectations. Configuring one in CDK is more difficult than what we have seen so far, however, and requires us to dive into the lower-level CloudFormation API that CDK exposes.

Not all services are abstracted nicely. Whenever a useful class does not exist in CDK you can always refer back to so-called CloudFormation classes, prefixed with Cfn. These classes are more verbose, require more of an understanding of CloudFormation and often take Strings as inputs instead of descriptive and type-safe Enums. Luckily this all sounds more difficult than it is in practice.

To set up a budget we must define the budget itself, a subscriber, and the trigger that determines when to notify the subscriber. The last component is required because you can notify users when a percentage of the budget is forecasted to be reached, a useful feature that we will be implementing now. A budget of 200 USD per month with a notification at 80% forecasted can look like the following snippet:

public class InfraStack extends Stack {
    public InfraStack(final Construct scope, final String id) {
        this(scope, id, null);
    }

    public InfraStack(final Construct scope, final String id, final StackProps props) {
        super(scope, id, props);

        // Rest of the code removed for clarity
        
        constructForecastedCostBudget();
    }

    private void constructForecastedCostBudget() {
        CfnBudget cfnBudget = CfnBudget.Builder.create(this, "MyBudget")
                .budget(CfnBudget.BudgetDataProperty.builder()
                        .budgetName("MyBudget")
                        .budgetType("COST")
                        .timeUnit("MONTHLY")
                        .costTypes(CfnBudget.CostTypesProperty.builder()
                                .includeCredit(false)
                                .build())
                        .budgetLimit(CfnBudget.SpendProperty.builder()
                                .amount(200)
                                .unit("USD")
                                .build())
                        .build())
                .build();

        CfnBudget.SubscriberProperty subscriberProperty = CfnBudget.SubscriberProperty
                .builder()
                .subscriptionType("EMAIL")
                .address("example@profit4cloud.nl")
                .build();

        CfnBudget.NotificationWithSubscribersProperty notificationWithSubscribersProperty = 
                CfnBudget.NotificationWithSubscribersProperty.builder()
                        .notification(CfnBudget.NotificationProperty.builder()
                                .thresholdType("PERCENTAGE")
                                .threshold(80)
                                .notificationType("FORECASTED")
                                .comparisonOperator("GREATER_THAN")
                                .build())
                        .subscribers(Collections.singletonList(subscriberProperty))
                        .build();

        cfnBudget.setNotificationsWithSubscribers(
                Collections.singletonList(notificationWithSubscribersProperty));
    }
}
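
As a quick sanity check of what this notification means in numbers, the arithmetic can be written out in plain Java. This is only an illustration of the configured values: the names `BudgetSketch`, `thresholdAmount` and `shouldNotify` are hypothetical, and how AWS actually computes the forecast is not modelled here.

```java
/** Worked arithmetic for a percentage threshold on a monthly budget. */
final class BudgetSketch {

    /** Converts a percentage threshold into an absolute amount in the budget's currency. */
    static double thresholdAmount(double budgetLimit, double thresholdPercent) {
        return budgetLimit * thresholdPercent / 100.0;
    }

    /** Mirrors the GREATER_THAN comparison: notify only when the forecast exceeds the threshold. */
    static boolean shouldNotify(double forecastedSpend, double budgetLimit, double thresholdPercent) {
        return forecastedSpend > thresholdAmount(budgetLimit, thresholdPercent);
    }

    public static void main(String[] args) {
        // 80% of a 200 USD budget.
        System.out.println("Threshold in USD: " + thresholdAmount(200, 80));
        System.out.println("Notify at a forecast of 175 USD: " + shouldNotify(175, 200, 80));
    }
}
```

So with the values from the snippet, the email goes out once the forecasted spend for the month rises above 160 USD, well before the 200 USD budget itself is reached.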

Pro tip: if you ever find yourself struggling with CloudFormation classes or CloudFormation definitions in general, have this page open during development. It is the CloudFormation resource and property reference. Any question about resource definitions and acceptable values can be answered with that page.

Summary

To take control over the scalability of our system we have talked about limits and alarms. We have discussed the different types of limits in the cloud and when to use them. For AWS we have implemented a service limit in the form of API Gateway throttling. Some things cannot easily be limited, as is the case with how often a Lambda function is invoked. For those cases we have discussed alarms and implemented one in AWS. Lastly, we have taken a look at how to define an AWS budget in CDK using the lower-level API that is available to us. With this we have mitigated the risk that comes with the scalability that we get in the cloud.

And thus we have finished the journey we set out in part one, but the path to cloud control is longer. A holistic approach to cloud control will involve much more on many different layers, from service to account to organization configuration and security. It will involve everything from RBAC with IAM to automatic scans of our infrastructure. But the good news is that the tools and skills required for this are the same as what we have seen just now. Infrastructure-as-code empowers developers to safeguard the principles we have discussed, so that the metropolis that we build does not collapse under its own weight.

Next time

During this two-part series we have leaned heavily on the Cloud Development Kit to make working with IaC easier. Because we can use existing programming languages to define our infrastructure, we can leverage all of the flexibility, existing libraries and features that these languages bring. One of the things that immediately comes to mind here is whether we can unit test our Infrastructure-as-Code. Now wouldn’t that be amazing? Tune in next time for more on this topic.

Ilia Awakimjan

After obtaining his Master's degree, Ilia Awakimjan has been employed at Profit4Cloud since 2017 as a Software Engineer specializing in AWS. Ilia holds the AWS Certified DevOps Professional, AWS Certified Security Specialty and AWS Certified Networking Specialty certifications.