Leveraging HashiCorp Terraform to Make AWS Spot Instances More Effective
EC2 Spot Instances give organizations the opportunity to leverage EC2 compute at a much lower cost and to scale with greater granularity. Prior to Spot Instances, organizations could only spin up infrastructure based on some estimate of future capacity, which left them paying for unused resources. In addition to reducing unutilized compute, Spot Instances save money directly because they are offered at a steep discount to On-Demand pricing (up to 90 percent off).
However, getting the most out of your Spot Instances, and actually deriving value from them, is not as simple as merely starting a Spot Instance. You need to manage the instance effectively, which is difficult, because configuring Spot Instances can sometimes feel like you’re roaming the frontier of the Wild West, for a few reasons:
- Spot Instances only run as long as the price you bid exceeds the current Spot price, which makes provisioning with certainty tricky
- The workflow involved in provisioning them is cluttered and messy
- Many teams don’t understand how to use Spot Instances effectively
Fortunately, there is a solution: HashiCorp Terraform. You need to automate Spot Instance provisioning; otherwise, your service might go down when AWS terminates an instance. You can't control when an instance will be terminated, but with Terraform you can keep capacity 'always up'. The tool allows organizations to create predictable Spot Instances, and thus more effectively leverage their resources for their applications. This article explains why and how to use Terraform to help manage Spot Instances.
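As a minimal sketch of that idea (the AMI ID, instance type, and maximum price below are placeholders), a persistent Spot Instance request asks AWS to re-open the request whenever the instance is interrupted, which is what keeps capacity "always up":

```hcl
# Minimal sketch of a persistent Spot Instance request; all values are illustrative.
resource "aws_spot_instance_request" "worker" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI
  instance_type = "t3.medium"
  spot_price    = "0.02"       # maximum price you are willing to pay (USD per hour)
  spot_type     = "persistent" # re-open the request if AWS interrupts the instance
}
```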
Why Terraform and Spot Instances?
If you are already leveraging Terraform for your EC2 infrastructure today and are considering Spot Instances, it would be a mistake not to roll them together: in addition to standardizing your orchestration environment on Terraform, you gain consistency and repeatability, visibility, and flexibility.
Consistency/repeatability: Because you are scripting your bid requests, you can build a real strategy for how you will procure Spot Instances, and implement it in a repeatable, versionable way.
Visibility: If you can read the Terraform configuration files, you know what is going on. There is a clear, documented implementation of all infrastructure and Spot Instances for everyone.
Flexibility: By being able to fully automate your requests and the creation of Spot Instances, you have a way to integrate your infrastructure into your application and let the application start to dictate the infrastructure it runs on.
You should know there is a bit of an art to being effective with Spot Instances: you have to bid in a way that saves the most money while still getting the resources you need. Terraform can’t choose that bid for you, but it can help you adjust your bidding strategy programmatically.
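One way to keep that strategy adjustable, sketched below with a hypothetical variable name, is to pull the maximum price out into a Terraform variable so the bid can be tuned per environment or overridden by other tooling without editing the resource itself:

```hcl
# Hypothetical variable; override it via -var or a *.tfvars file.
variable "max_spot_price" {
  description = "Maximum hourly price (USD) to bid for Spot capacity"
  type        = string
  default     = "0.02"
}

resource "aws_spot_instance_request" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI
  instance_type = "c5.large"
  spot_price    = var.max_spot_price # bid comes from the variable, not a hard-coded value
}
```

If you omit spot_price entirely, AWS caps your bid at the On-Demand price, so the variable is only needed when you want a tighter ceiling.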
Creating Responsive EC2 Spot Instances with Terraform
Arguably, one of the biggest benefits of Spot Instances comes when they are treated as temporary infrastructure whose existence spans hours or days, not weeks and months. This runs a little contrary to most infrastructure provisioned and managed by Terraform, which tends to be longer-lived (although infrastructure should never be designed to be long-lived per se; it should be able to come alive, serve its purpose, and disappear when it is no longer required, saving money in the process). Spot Instance workloads can still be production workloads, but they live only as long as needed to satisfy volume or deployment needs.
One use case, for example, would be data-processing fleets that spin up when the Spot price is right, mine the data furiously, then spin down again when the price is unfavorable. This could be done with no manual intervention whatsoever, using infrastructure as code and configuration management.
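A sketch of such a fleet using the aws_spot_fleet_request resource might look like the following; the IAM role ARN, AMI IDs, prices, and expiry date are all placeholders:

```hcl
# Sketch of a data-processing Spot Fleet; all identifiers and prices are placeholders.
resource "aws_spot_fleet_request" "data_processing" {
  iam_fleet_role      = "arn:aws:iam::123456789012:role/spot-fleet-role" # placeholder role
  spot_price          = "0.03"        # fleet-wide maximum price (USD per hour)
  target_capacity     = 4             # number of instances the fleet keeps running
  allocation_strategy = "lowestPrice" # fill capacity from the cheapest pools first
  valid_until         = "2025-12-31T23:59:59Z"
  terminate_instances_with_expiration = true

  launch_specification {
    ami           = "ami-0123456789abcdef0" # placeholder AMI
    instance_type = "m5.large"
  }

  launch_specification {
    ami           = "ami-0123456789abcdef0" # placeholder AMI
    instance_type = "m5.xlarge"
  }
}
```

With two launch specifications and the lowestPrice strategy, the fleet draws capacity from whichever instance type is cheapest at the moment, and valid_until tears everything down without manual intervention.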
The reason short-lived instances are so beneficial is that Spot Instances can be used to respond to fluctuating demand, regional variation, and unique architectures such as microservices. To make this possible in Terraform, there are a few key elements that need to be implemented.
Leverage fleets: Fleets allow you to provision a group of Spot Instances with a single request. This is useful when you are using them to support regions, for example, so that future manipulation can be done at the fleet level rather than on individual instances. When you are creating requests for two or more instances that share a use case (such as more capacity for one region), use a fleet, like the data-processing sketch above, instead of individual requests.
Set wait_for_fulfillment to true: wait_for_fulfillment defaults to false, which means Terraform submits the Spot request without confirming that an instance was actually provided. In most scenarios it should be set to true, so that Terraform waits until the request is fulfilled before continuing. You will have to handle the occasional timeout error when fulfillment takes too long, but it’s better to know for certain whether the resource is there (see the sketch after this list).
Use block_duration_minutes: This parameter sets a defined lifespan for the Spot Instance, in minutes (a multiple of 60, up to 360). This is extremely important because without it you can easily end up with tons of rogue instances running unused, which completely diminishes their value.
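Putting the last two items together, a minimal sketch (placeholder AMI and price) might look like this; the timeouts block simply bounds how long Terraform waits for the request to be fulfilled:

```hcl
# Sketch combining wait_for_fulfillment and block_duration_minutes; values are placeholders.
resource "aws_spot_instance_request" "batch" {
  ami                    = "ami-0123456789abcdef0" # placeholder AMI
  instance_type          = "c5.large"
  spot_price             = "0.05"
  wait_for_fulfillment   = true # fail the apply if the request is not fulfilled in time
  block_duration_minutes = 120  # run for a defined two-hour block, then terminate

  timeouts {
    create = "15m" # how long Terraform waits for fulfillment before erroring out
  }
}
```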
Even though Terraform manages the Spot Instance request rather than the instance itself, the request is where you specify the instance’s lifecycle. With Terraform set up in this way, you can leverage Spot Instances to better support microservices, regional scaling, and scaling based on demand.
Microservices: The number of variables associated with microservice-based applications is already tremendously high. That is why trying to predict infrastructure requirements is difficult and costly. If organizations change their strategies to make infrastructure responsive to the application itself, they have more flexibility, can do less planning, and can prevent costs from going out of control due to rogue infrastructure, too many resources, and too many variables.
Make regions work for you: For applications deployed across the world, and thus using multiple AWS regions, organizations usually have to cookie-cut each region, making them all equal, even though each region has a unique user base and usage pattern. This challenge is amplified for microservices-based applications, where each region has a deployment of all the services. By using Spot Instances, infrastructure can be created that is relative to the region; for example, instances can be timed to match peak usage in that region.
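One way to sketch region-relative capacity (the provider alias, variable name, counts, and AMI below are hypothetical) is to combine provider aliases with per-region variables, so each region gets only the Spot capacity its own usage pattern calls for:

```hcl
# Hypothetical per-region setup; region names, counts, and AMI are placeholders.
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "europe"
  region = "eu-west-1"
}

variable "europe_worker_count" {
  type    = number
  default = 2 # sized for that region's peak usage
}

resource "aws_spot_instance_request" "europe_workers" {
  provider      = aws.europe
  count         = var.europe_worker_count
  ami           = "ami-0123456789abcdef0" # placeholder; AMI IDs are region-specific in practice
  instance_type = "t3.medium"
  spot_price    = "0.02"
}
```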
How to Manage Spot Instances with Terraform
All the standard Terraform functionality for managing infrastructure also applies to Spot Instances. Terraform’s strength is provisioning infrastructure and applying plans to maintain a desired state. Because it is not as well suited to environments that change rapidly, managing Spot Instances takes some extra consideration.
First, plans are going to be very important for ensuring the state of all infrastructure managed by Terraform. Second, you may end up creating and destroying more instances than you expected. Finally, logging is absolutely critical. The HashiCorp team and open source community know this; that is why the AWS provider includes aws_spot_datafeed_subscription, which stores all the data associated with Spot requests, usage, and pricing in an S3 bucket. You can use this to create more automation, or simply to audit the automation you have already created.
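A sketch of wiring up that data feed (the bucket name below is a placeholder) delivers request, usage, and pricing data to an S3 bucket you own:

```hcl
# Sketch of a Spot data feed subscription; the bucket name is a placeholder.
resource "aws_s3_bucket" "spot_datafeed" {
  bucket = "example-spot-datafeed-logs" # bucket names must be globally unique
}

resource "aws_spot_datafeed_subscription" "default" {
  bucket = aws_s3_bucket.spot_datafeed.id
  prefix = "spot-datafeed" # key prefix for the delivered log objects
}
```

Note that an AWS account can have only one Spot data feed subscription at a time.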
For full argument references, see the Terraform AWS provider documentation for the resources used throughout this article: aws_spot_instance_request, aws_spot_fleet_request, and aws_spot_datafeed_subscription.
Spot the Difference
Although Spot Instances have been around since late 2009, they arguably have not been used to their full potential. Many organizations that could benefit from the cost savings have been missing out, and regionally diverse applications with complex architectures such as microservices are becoming a greater compute burden than they need to be. Terraform is already a great tool for managing the creation of infrastructure, and there are plenty of reasons to leverage it to harness the full power of Spot Instances.