The ZeoTech Series: Cloud Miser

At zeotap, we use AWS for all our cloud processing needs. By early 2018, we found that our AWS expenditure had been increasing at an alarming rate, with a CAGR of 8.05%.

This was worrisome. For almost every startup, cost is one of the top priorities when setting up infrastructure (this is where cloud platforms are a savior), and every big expense saved goes straight to the bottom line. It is a big problem if we cannot attribute the spend and analyze whether it is worth it, and that’s when we decided the problem had to be tackled. After analyzing it closely and implementing a solution, we were able to reduce our expenditure by 50%. We came up with three strategies, and the savings from each can be found in the chart below.

Now, isn’t this intriguing? Let us tell you the whole story of how we managed to cut our spending so drastically.

Prologue

It all started over a beer, during long conversations about AWS costs. We knew something had to be done, and fast. Then we heard that zeotap was announcing its first hackathon [1], and we knew this was just the right project for us.

Problem Statement

Though we were spending hundreds of thousands of dollars, there were some basic questions that we could not answer:

  1. Why is our cost increasing so much?
  2. Who is actually spending so much?
  3. Are we following industry best practices for visibility into our expenditure?

After the questions were clear, we knew the scope of this project should be to provide the following features:

    • Transparency over services: When multiple modules and products run across teams, it is difficult to break down cost per module, product, team, or owner. If we know who is spending how much, we can ask the right questions and make better management calls, such as ROI decisions on projects.
    • Monitoring resource utilization across services: We need to monitor our resources properly to know whether we are over-provisioning.
    • Providing better alternatives with respect to cost: Choose the most cost-effective of the billing models AWS offers (such as Reserved Instances) and the most suitable instance family (e.g. storage- or compute-optimized).

 

Data Analysis and Problem Scoping

We started out by analyzing our AWS billing data. In order for us to limit the scope of our problem, we first identified the cost distribution per AWS service per region. You can find the analysis below:

This graph shows the cost distribution per service and region. Just by looking at the graph, we understood right away what to focus on. For our hackathon, we limited the cost optimization scope to EC2 [2] and RDS [3] alone, which together accounted for 80% of the cost. In this article, we will only talk about EC2, since RDS has a very similar story.

EC2 provides two types of instances: On-Demand [4] and Spot Instances [5]. Out of the total cost of EC2, more than 70% was attributed to On-Demand EC2 Nodes. Hence, we chose to look more closely at on-demand EC2/RDS spend to recommend more optimal instance types in order to reduce our expenditure.

Literature Survey

After reading about other people’s experiences of saving costs on cloud infrastructure and the recommended best practices, we came up with the following points:

  • Right-Sizing: Allocating the best-fitting instance for the task. An instance fits when it is optimally used in terms of memory and CPU (i.e. neither so large that it sits underutilized nor so small that the running job misses its SLA).
  • Monitoring: Right-sizing requires continuously monitoring instance usage at regular intervals and calculating the optimal capacity each instance actually needs.
  • Reserved Instances: AWS offers significant discounts (up to 70%) compared to on-demand pricing, with different purchasing options depending on how much a customer is willing to commit. For more details on RIs, see [6].
  • Dayparting: Another great way to reduce spend. Dayparting means scaling up and down as required, i.e. scaling down, stopping, or terminating machines when they are not in use. For scheduled jobs/services, we can launch the cluster when required and terminate it once the job is done (a minimal sketch of this idea follows the list).
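
To make the dayparting idea concrete, here is a minimal sketch using the AWS SDK for Java (the same SDK we use for data collection). It assumes a hypothetical Schedule=office-hours tag as the selection convention and would be run from a scheduler such as cron in the evening, with a mirror-image job starting the instances in the morning; it is an illustration, not our production tooling.

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.*;

import java.util.ArrayList;
import java.util.List;

public class DaypartingJob {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Find running instances tagged for office-hours-only operation.
        // The "Schedule" tag is our own (hypothetical) convention.
        DescribeInstancesRequest describe = new DescribeInstancesRequest()
                .withFilters(
                        new Filter("tag:Schedule").withValues("office-hours"),
                        new Filter("instance-state-name").withValues("running"));

        List<String> instanceIds = new ArrayList<>();
        for (Reservation reservation : ec2.describeInstances(describe).getReservations()) {
            for (Instance instance : reservation.getInstances()) {
                instanceIds.add(instance.getInstanceId());
            }
        }

        // Stop everything that matched; a morning job would call startInstances.
        if (!instanceIds.isEmpty()) {
            ec2.stopInstances(new StopInstancesRequest().withInstanceIds(instanceIds));
        }
    }
}
```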

 

We picked the top three strategies for our project. To implement them, we needed a better understanding of the AWS pricing model [7] and AWS Reserved Instances [6]. We also needed a way to collect resource utilization data, and we found two viable sources: CloudWatch [8] and Prometheus [9]. CloudWatch, though natively provided by AWS, is a paid solution, while Prometheus is open source. We therefore decided to go ahead with Prometheus.

Data Collection and Analysis Architecture

The diagram above shows the complete stack and architecture we used to come up with our solution. We used the following tools:

  1. Java for data collection and munging the data from AWS and Prometheus
  2. Athena [10] for computations
  3. DOMO [11] for visualization

 

The following datasets were collected in order to arrive at the optimized recommendations.

EC2 Instance Data

Tagging was a prerequisite for this data: as per our process, all EC2 instances should be tagged with Team-Name, Owner, and Project. We collected all the relevant data from the instances using the AWS SDK, comprising Instance ID, Instance Type, Launch Date Time, Availability Zone, Tenancy, Private IP Address, Status, Region, and tag data.
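
A minimal sketch of this collection step with the AWS SDK for Java might look as follows. Pagination and error handling are omitted, and printing to stdout stands in for whatever sink feeds the downstream analysis.

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.DescribeInstancesResult;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.Reservation;
import com.amazonaws.services.ec2.model.Tag;

public class Ec2InventoryCollector {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        DescribeInstancesResult result = ec2.describeInstances();

        for (Reservation reservation : result.getReservations()) {
            for (Instance instance : reservation.getInstances()) {
                // Core attributes used by the recommendation pipeline.
                System.out.printf("%s,%s,%s,%s,%s,%s,%s%n",
                        instance.getInstanceId(),
                        instance.getInstanceType(),
                        instance.getLaunchTime(),
                        instance.getPlacement().getAvailabilityZone(),
                        instance.getPlacement().getTenancy(),
                        instance.getPrivateIpAddress(),
                        instance.getState().getName());
                // Team-Name / Owner / Project tags drive cost attribution.
                for (Tag tag : instance.getTags()) {
                    System.out.printf("  tag %s=%s%n", tag.getKey(), tag.getValue());
                }
            }
        }
    }
}
```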

AWS master pricing data

AWS provides an API to fetch the current on-demand pricing. We extracted this data to check for instances that have a similar configuration but are cheaper. We then checked all the instance types offered by AWS and what their On-Demand and RI (Reserved Instance) prices were under the various models.
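
As a rough sketch, the AWS Price List Query API can be called from the AWS SDK for Java along these lines; the filter fields and values shown are illustrative, and pagination of the result via the next token is omitted.

```java
import com.amazonaws.services.pricing.AWSPricing;
import com.amazonaws.services.pricing.AWSPricingClientBuilder;
import com.amazonaws.services.pricing.model.Filter;
import com.amazonaws.services.pricing.model.GetProductsRequest;
import com.amazonaws.services.pricing.model.GetProductsResult;

public class PriceListFetcher {
    public static void main(String[] args) {
        // The Price List Query API is served from only a couple of regions;
        // us-east-1 is the usual choice.
        AWSPricing pricing = AWSPricingClientBuilder.standard()
                .withRegion("us-east-1")
                .build();

        // Example filters: Linux, shared tenancy, one region's price list.
        GetProductsRequest request = new GetProductsRequest()
                .withServiceCode("AmazonEC2")
                .withFilters(
                        new Filter().withType("TERM_MATCH")
                                .withField("location").withValue("EU (Frankfurt)"),
                        new Filter().withType("TERM_MATCH")
                                .withField("tenancy").withValue("Shared"),
                        new Filter().withType("TERM_MATCH")
                                .withField("operatingSystem").withValue("Linux"));

        GetProductsResult result = pricing.getProducts(request);
        // Each entry is a JSON document describing one SKU: instance type,
        // vCPU, memory, and the OnDemand / Reserved term prices.
        result.getPriceList().forEach(System.out::println);
    }
}
```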

Utilization Data from Prometheus

We found that IO (network and disk) was seldom a problem for us, but CPU and memory were. Hence, we extracted the utilization data for CPU and Memory using Prometheus. Prometheus collects metrics every 5 seconds. To aggregate these metrics, we experimented with max, min, average, and percentile calculations.

Percentile [12] came out as the best option and gave better recommendations; the intuition behind it is explained in the algorithm section below. We took the 80th percentile of the resource utilization data over a 15-day window and pulled the host address, timestamp, and aggregate utilization value per instance. This was straightforward since Prometheus has built-in support for percentile aggregation.
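
Because Prometheus ships the quantile_over_time function (0.8 corresponding to the 80th percentile), pulling this dataset boils down to a single query against its HTTP API. Below is a minimal Java sketch; the Prometheus address and the metric name are placeholders, not our actual setup.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class PrometheusPercentileQuery {
    public static void main(String[] args) throws Exception {
        // 80th percentile of a memory utilization metric over the last 15 days,
        // evaluated per instance. The metric name is a placeholder; in practice
        // it would be derived from node_exporter metrics.
        String promql = "quantile_over_time(0.8, node_memory_utilization[15d])";

        // Placeholder Prometheus endpoint.
        String url = "http://prometheus.internal:9090/api/v1/query?query="
                + URLEncoder.encode(promql, StandardCharsets.UTF_8);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON payload carries one sample per host: instance, timestamp, value.
        System.out.println(response.body());
    }
}
```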

EC2 Recommendation Algorithm

For providing size recommendations for any given instance, we required the following data:

  1. AWS master pricing data
  2. Prometheus data (Usage)
  3. EC2 Instance data

From the AWS master pricing data, we needed to extract the following details for all the available instances on AWS:

  • InstanceType (e.g. m4.2xlarge)
  • Available CPU and Memory (# of cores and memory)
  • Region (as we have different prices in different regions)
  • On-demand price per unit
  • RI price per unit (no-upfront option)
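
For illustration, each row of that dataset can be modelled with a small value type like the one below; the type and field names are ours, chosen for this sketch, not anything defined by AWS.

```java
// One row of the master pricing dataset; the fields mirror the list above.
public record InstanceOffering(
        String instanceType,           // e.g. "m4.2xlarge"
        int vcpu,                      // number of cores
        double memoryGiB,              // memory in GiB
        String region,                 // prices differ per region
        double onDemandPricePerHour,   // on-demand price per unit
        double riNoUpfrontPricePerHour // RI price per unit, no-upfront option
) {
}
```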

 

Using AWS APIs, we can extract the details of all the instances running in our AWS account.

Algorithm Example

Consider we have one EC2 instance running in our AWS account with the following configuration:

Instance type: c3.4xlarge

Available CPU: 16 core

Available Memory: 30GB

Price (per hour): $0.840

 

From the Prometheus data for this instance, we observe the following metrics:

CPU usage: 25% (80th percentile)

Memory usage: 30% (80th percentile)

* The 80th percentile here means that 80% of the time, CPU usage is below 25%.

Using this usage information, the required CPU and Memory for the same instance, based on its usage, would be:

Required CPU: 16 * 25% = 4 cores

Required Memory: 30 * 30% = 9 GB

 

Since we had taken the 80th percentile, we added an additional buffer to the CPU and Memory of the recommended instance, as the formula below shows:

Actual Memory Usage = Instance Memory * Usage

Buffer Memory = sqrt(Actual Memory Usage)

Recommended Memory = Actual Memory Usage + Buffer Memory

The same calculation can be done for CPU cores as well.
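
Expressed as code, the sizing rule is only a couple of lines. The sketch below applies the same rule to memory and CPU and reproduces the numbers from the worked example.

```java
public class SizingRule {
    // Recommended capacity = actual usage + sqrt(actual usage).
    // Applies identically to memory (GB) and CPU (cores); usageFraction is
    // the 80th-percentile utilization expressed as a fraction (e.g. 0.30).
    static double recommendedCapacity(double provisioned, double usageFraction) {
        double actualUsage = provisioned * usageFraction;
        double buffer = Math.sqrt(actualUsage);
        return actualUsage + buffer;
    }

    public static void main(String[] args) {
        System.out.println(recommendedCapacity(30, 0.30)); // 9 + 3  = 12.0 GB
        System.out.println(recommendedCapacity(16, 0.25)); // 4 + 2  = 6.0 cores
    }
}
```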

 

Why is the buffer memory calculated using the Square root function?

Consider the following two examples.

Case 1. The actual memory usage is 4GB. According to the formula, the buffer memory comes out to be sqrt(4) = 2GB, which is more than sufficient in this case.

Case 2. Let’s say the actual memory usage is 128GB. Again, as per the formula, the buffer memory comes out to be approximately sqrt(128) ≈ 11GB.

 

In both cases, this is a reasonable buffer to add.

The square root curve accentuates low values and grows only slowly for high values, giving us an acceptable buffer size.

The same applies to the CPU usage as well, and now we have the following recommended instance properties as per the above formula:

Recommended CPU = 4 + sqrt(4) = 6 cores

Recommended Memory = 9 + sqrt(9) = 12 GB

To visualize this, picture a bubble chart of AWS instances with CPU on the x-axis, Memory on the y-axis, and Price as the size of the bubble. The goal is to find the top N closest and cheapest instances that satisfy the configuration above. The black bubble in the diagram below marks the required instance with the configuration we calculated. We pick the cheapest three instance types that satisfy the required configuration, sorted by price.

As a result, our recommendations would be the following:

  • t3.2xlarge (CPU – 8 cores, Mem – 32 GB, Price: $0.332)
  • c5.2xlarge (CPU – 8 cores, Mem – 16 GB, Price: $0.340)
  • t2.2xlarge (CPU – 8 cores, Mem – 16 GB, Price: $0.371)

 

If we resize our current instance based on the above recommendations, we would end up saving around:

Savings per hour = $0.840 – $0.332 ≈ $0.51

Savings (%) = 0.51 / 0.84 * 100 ≈ 60%

And this saving is for just one instance. We can apply the same algorithm to all on-demand instances (EC2 or RDS) based on their usage and save the company a whole lot of money.
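
Putting the pieces together, the selection step is essentially a filter-and-sort over the master pricing dataset. The sketch below reuses the hypothetical InstanceOffering record and the recommendedCapacity helper sketched earlier; it illustrates the approach rather than our exact production code.

```java
import java.util.Comparator;
import java.util.List;

public class RecommendationEngine {

    /** Cheapest topN offerings that satisfy the recommended CPU and memory. */
    static List<InstanceOffering> recommend(List<InstanceOffering> catalog,
                                            String region,
                                            double recommendedCores,
                                            double recommendedMemoryGiB,
                                            int topN) {
        return catalog.stream()
                // Prices differ per region, so only compare within one region.
                .filter(o -> o.region().equals(region))
                // Must satisfy the recommended capacity (usage + sqrt buffer).
                .filter(o -> o.vcpu() >= recommendedCores)
                .filter(o -> o.memoryGiB() >= recommendedMemoryGiB)
                // Cheapest first.
                .sorted(Comparator.comparingDouble(InstanceOffering::onDemandPricePerHour))
                .limit(topN)
                .toList();
    }
}
```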

Conclusion

AWS provides a variety of ways to reduce infrastructure cost. We have to choose the right model and the right size for each of the AWS services. For example, we should consider using Spot Instances for EMR-related jobs and Reserved Instances (RIs) for long-running instances with known utilization. Further, understanding and analyzing the billing data over a period of time helps identify the optimal mix of pricing models for your infrastructure requirements.

Transparency in the infrastructure is another very important aspect of cost optimization, and it supports ROI decisions for the services deployed. Tagging instances is one of the mandatory practices for transparency. Regular monitoring of the infrastructure, and removing or resizing underutilized resources, goes a long way in reducing spend. We run this recommendation algorithm once a month and are looking at more automated ways of applying the optimization recommendations to our cloud infrastructure.

References

[1] Zeotap Hackathon – https://www.zeotap.com/hackathon

[2] EC2 – https://aws.amazon.com/ec2/details/

[3] RDS – https://aws.amazon.com/rds/

[4] On Demand – https://aws.amazon.com/ec2/pricing/on-demand/

[5] Spot Instances – https://aws.amazon.com/ec2/spot/

[6] AWS Reserved Instances – https://aws.amazon.com/ec2/pricing/reserved-instances/

[7] AWS pricing model – https://aws.amazon.com/pricing/

[8] Cloud Watch – https://aws.amazon.com/cloudwatch/

[9] Prometheus – https://github.com/prometheus/prometheus

[10] Athena – https://aws.amazon.com/athena/

[11] Domo – https://www.domo.com/solution/data-visualization

[12] Percentile – https://en.wikipedia.org/wiki/Percentile

 

About The Authors

Rakesh Sharma and Aman Verma are both Senior Engineers on the Data Engineering team, which handles the massive scale of data at zeotap. They work on the core data platform zeoCore, responsible for ingesting, refining, and distributing data from our multiple data partners. Rakesh studied computer science and engineering and had a stint at Minjar Cloud Solutions (acquired by Nutanix) before joining Zeotap. His areas of expertise include data engineering, distributed systems, and AWS services, among others. Aman studied electronics and communications engineering and worked at Snapdeal, one of the largest e-commerce startups in India, before moving to Zeotap. His areas of expertise include data engineering, distributed systems, and data platforms.
