Experimenting in Development, Not Production
Moving your infrastructure to one of the mainstream cloud providers, such as Amazon Web Services or Google Cloud, is in most cases a massive cost saving for organizations. The saving comes not only from hardware but, more importantly, from operational resources. Being able to scale compute, storage and other infrastructure resources with a single API call is a massive cost saver for both SMEs and enterprises. Nowadays, however, the goal is no longer just to get to the cloud, but to make the best use of it through the services at your disposal.
The cloud enables organizations to experiment and optimize in ways that were often difficult on physical hardware, but experimentation is not always welcome, especially on 24/7 production systems that drive profitability. Fair enough. A fractional cost saving is not readily traded against the profit lost if the end-user system becomes unstable or unreachable.
But when you have a delivery pipeline that pushes to development and test environments first, experimentation is easily achievable. For a SaaS company, downtime in production means not only lost revenue but also more customer complaints, which can increase operational costs and reduce your billable capacity. On its way to production through a delivery pipeline, however, the product is deployed to development, testing and staging environments, which are not as sensitive to downtime and underperformance as the production environment is.
Following good DevOps and development practices can be expensive
A rapid delivery cycle often relies on a microservices architecture. With this architecture, people are organized into small autonomous development teams, with each team owning one or more product features. Each team has its own development environment, so the total cost of development environments is the number of teams multiplied by the cost of one environment. The convenience of isolated per-team environments comes at a price: higher costs for your network, compute and storage infrastructure.
Here’s where you can get creative with the latest capabilities of cloud providers and use their APIs to cut costs. Assuming that the developers in each team are located in the same geographical region and work 9-5, 5 days a week, the development environments are in use for less than 30% of each week (40 of 168 hours). The remaining idle time is an obvious cost saving opportunity.
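The arithmetic behind that estimate can be sketched in a few lines of Python. The team count and hourly cost below are hypothetical figures chosen only for illustration, and the function names are ours.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def weekly_utilisation(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of the week in which an environment is actually in use."""
    return (hours_per_day * days_per_week) / HOURS_PER_WEEK


def weekly_saving(teams: int, hourly_cost_per_env: float, utilisation: float) -> float:
    """Cost of the hours in which all team environments sit idle each week."""
    total_weekly_cost = teams * hourly_cost_per_env * HOURS_PER_WEEK
    return total_weekly_cost * (1 - utilisation)


# 9-5, 5 days a week
utilisation = weekly_utilisation(hours_per_day=8, days_per_week=5)
print(f"utilisation: {utilisation:.1%}")  # prints "utilisation: 23.8%"

# Hypothetical: 5 teams, each environment costing 2.00 per hour
print(f"idle cost per week: {weekly_saving(5, 2.0, utilisation):.2f}")
```

The saving is an upper bound: it assumes every resource in the environment stops billing when stopped, which is not true of storage volumes, for example.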
How we do it
At Base2Services, as part of our DevOps practice, we use the AWS CloudFormation service to manage the lifecycle of all infrastructure elements, which lets us easily duplicate application infrastructure across different environments (development, UAT, production). With this approach, the problem of saving money when environments are not needed can be solved either by destroying and recreating the whole stack, or by traversing the stack and stopping the resources that can be stopped.
Amazon is making it easier to stop resources rather than just killing entire stacks by exposing functionality such as stopping instances or setting Auto Scaling Group sizing via their web services. With the recently introduced support for starting and stopping RDS instances, most of the stacks for traditional n-tier architecture can be stopped by stopping compute (EC2) and storage (RDS) instances. Additionally, any CloudWatch alarms present in the stack should (and can) be silenced, as it doesn’t make much sense to alarm on infrastructure elements that are intentionally stopped.
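A minimal sketch of the "traverse and stop" approach is shown below. It is not the cfn_manage implementation. It assumes boto3-style clients are passed in and that `resources` is a list of stack resource summaries, as returned by the CloudFormation `list_stack_resources` call; the function name `stop_stack_resources` is hypothetical.

```python
def stop_stack_resources(resources, ec2, autoscaling, rds, cloudwatch):
    """Stop every stoppable resource in a stack; return the (type, id) pairs handled.

    `resources` holds dicts with "ResourceType" and "PhysicalResourceId" keys.
    The four clients are assumed to be boto3 clients (or compatible stand-ins).
    """
    handled = []
    for res in resources:
        rtype, rid = res["ResourceType"], res["PhysicalResourceId"]
        if rtype == "AWS::EC2::Instance":
            ec2.stop_instances(InstanceIds=[rid])
        elif rtype == "AWS::AutoScaling::AutoScalingGroup":
            # Sizing the group to zero terminates its instances
            autoscaling.update_auto_scaling_group(
                AutoScalingGroupName=rid, MinSize=0, MaxSize=0, DesiredCapacity=0
            )
        elif rtype == "AWS::RDS::DBInstance":
            rds.stop_db_instance(DBInstanceIdentifier=rid)
        elif rtype == "AWS::CloudWatch::Alarm":
            # Silence the alarm: it still evaluates, but triggers no actions
            cloudwatch.disable_alarm_actions(AlarmNames=[rid])
        else:
            continue  # resource type that cannot be stopped; leave untouched
        handled.append((rtype, rid))
    return handled
```

Passing the clients in as arguments keeps the traversal logic free of AWS credentials and makes it easy to exercise with stand-in objects before pointing it at a real stack.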
Because we automate most of our daily operations, we developed a Ruby gem called cfn_manage to handle starting and stopping stacks. The gem is a command line tool for managing CloudFormation stacks and their resources. cfn_manage supports both start and stop operations for the following resource types:
1. EC2 Instances
2. Auto Scaling Groups
3. RDS Instances (see notes below)
4. CloudWatch Alarms
For RDS instances, the instance must be deployed in a single AZ before the stop instance API can be called. We work around this limitation by converting Multi-AZ instances to single AZ before stopping them, and converting them back to Multi-AZ after starting them.
For ASGs and RDS instances, the current configuration of the resource is stored in an S3 bucket when the resource is stopped. This allows the ASG sizing and the RDS Multi-AZ setting to be restored when the resources are started again. No configuration is stored for EC2 instances or CloudWatch alarms. Both operations are idempotent, so starting an already started stack, or stopping an already stopped stack, makes no changes to the AWS resources. You can find more details on the tool itself in the GitHub repository.
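The workaround can be sketched as follows. This is an illustration under assumptions, not the code in cfn_manage: `rds` is assumed to be a boto3 RDS client, the function names are hypothetical, and the `db_instance_available` waiter may return early if the instance has not yet entered the modifying state, so production code would need a more careful wait.

```python
def stop_rds_instance(rds, instance_id):
    """Stop an RDS instance, converting it to single AZ first if needed.

    Returns True if the instance was Multi-AZ, so the caller can record
    that fact and restore the setting on start.
    """
    instance = rds.describe_db_instances(DBInstanceIdentifier=instance_id)["DBInstances"][0]
    was_multi_az = instance["MultiAZ"]
    if was_multi_az:
        rds.modify_db_instance(
            DBInstanceIdentifier=instance_id, MultiAZ=False, ApplyImmediately=True
        )
        # Wait for the conversion to finish before asking for a stop
        rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=instance_id)
    rds.stop_db_instance(DBInstanceIdentifier=instance_id)
    return was_multi_az


def start_rds_instance(rds, instance_id, restore_multi_az):
    """Start an RDS instance and convert it back to Multi-AZ if requested."""
    rds.start_db_instance(DBInstanceIdentifier=instance_id)
    if restore_multi_az:
        rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=instance_id)
        rds.modify_db_instance(
            DBInstanceIdentifier=instance_id, MultiAZ=True, ApplyImmediately=True
        )
```

The value returned by `stop_rds_instance` is the piece of state that has to survive between the stop and the later start.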
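To illustrate how saved configuration and idempotency fit together, here is a simplified sketch for an Auto Scaling Group. It is our own approximation rather than the gem's code: `store` is any dict-like object standing in for the S3 bucket, `autoscaling` is assumed to be a boto3 Auto Scaling client, and the key layout is hypothetical.

```python
import json
from typing import MutableMapping

STOPPED = {"MinSize": 0, "MaxSize": 0, "DesiredCapacity": 0}


def _config_key(name: str) -> str:
    # Hypothetical key layout for the saved configuration
    return f"asg/{name}.json"


def _current_size(autoscaling, name: str) -> dict:
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[name]
    )["AutoScalingGroups"][0]
    return {key: group[key] for key in STOPPED}


def stop_asg(autoscaling, store: MutableMapping, name: str) -> bool:
    """Save the group's sizing and scale it to zero. Returns False if already stopped."""
    current = _current_size(autoscaling, name)
    if current == STOPPED:
        return False  # already stopped: do not overwrite the saved sizing
    store[_config_key(name)] = json.dumps(current)
    autoscaling.update_auto_scaling_group(AutoScalingGroupName=name, **STOPPED)
    return True


def start_asg(autoscaling, store: MutableMapping, name: str) -> bool:
    """Restore the saved sizing. Returns False if there is nothing to restore."""
    key = _config_key(name)
    if key not in store:
        return False  # never stopped by this tool, or already started
    saved = json.loads(store[key])
    autoscaling.update_auto_scaling_group(AutoScalingGroupName=name, **saved)
    del store[key]  # remove only after the restore call succeeded
    return True
```

The early return in `stop_asg` is what makes a repeated stop harmless: without it, a second stop would overwrite the saved sizing with zeros and the original sizing would be lost.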
Summary
In summary, experimenting with production environments to save on infrastructure costs carries risks for any business, but development environments are safe to experiment with. With the latest features of cloud offerings, it is easy to save money by simply switching environments off when they are not in use. For Amazon Web Services, we have developed and open-sourced tooling to aid in this task, and it can be found by following the links in this document.
Available resources
- The code is available in our GitHub repository
- The Ruby gem is publicly available: cfn_manage
- Amazon announcement on starting and stopping RDS instances
- More information on Maximising Cloud Efficiency and Strategic Cost Control
- DevOps as a Service