At zeotap, we use AWS for all our cloud processing needs. By early 2018, we found that our AWS expenditure had been increasing at an alarming rate, with a CAGR of 8.05%.
This was worrisome. For almost every startup, cost is one of the top priorities in infrastructure setup (this is where cloud platforms are a savior), and every big expense saved contributes directly to the bottom line. Not being able to attribute spend or analyze its worthiness is a serious problem, and that's when we decided it had to be tackled. After closely analyzing it and implementing a solution, we were able to reduce our expenditure by 50%. We came up with three strategies, and the savings from each can be found in the chart below.
Now, isn’t this intriguing? Let us tell you the whole story of how we cut down our spending so drastically.
It all started over a beer, during a long conversation about AWS costs. We knew something had to be done, and fast. Then zeotap announced its first hackathon, and we knew this was just the right project for us.
Though we were spending hundreds of thousands of dollars, there were some basic questions that we could not answer:
After the questions were clear, we knew the scope of this project should be to provide the following features:
We started by analyzing our AWS billing data. To limit the scope of the problem, we first identified the cost distribution per AWS service per region. You can find the analysis below:
This graph shows the cost distribution per service and region. Just by looking at it, we understood right away what to focus on. For our hackathon, we limited the cost-optimization scope to EC2 and RDS alone, which together accounted for 80% of the cost. In this article, we will only talk about EC2, since RDS has a very similar story.
EC2 provides two types of instances: On-Demand and Spot Instances. Of the total EC2 cost, more than 70% was attributed to on-demand nodes. Hence, we chose to look more closely at on-demand EC2/RDS spend and recommend more optimal instance types to reduce our expenditure.
After reading through others’ experiences of how they actually save costs on cloud infra, and what best practices are recommended, we came up with the following points:
We picked the top three strategies for our project. To implement them, we needed a better understanding of the AWS pricing model and AWS Reserved Instances. We also needed ways to collect resource-utilization data, and we found two viable sources: CloudWatch and Prometheus. CloudWatch, though natively provided by AWS, is a paid solution, while Prometheus is open source. We therefore decided to go ahead with Prometheus.
The diagram above shows the complete stack and architecture we used to come up with our solution. We used the following tools:
The following datasets were collected in order to arrive at the optimized recommendations.
EC2 Instance Data
Tagging was a prerequisite for this data. As per our process, all EC2 instances should be tagged with Team-Name, Owner, and Project. We collected all the relevant data from the instances using the AWS SDK, comprising Instance ID, Instance Type, Launch Date Time, Availability Zone, Tenancy, Private IP Address, Status, Region, and tag data.
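To make the collection step concrete, here is a minimal sketch of how this can be done with boto3 (the AWS SDK for Python). The field names follow the `describe_instances` response; the record layout and the tag keys are our own conventions.

```python
def parse_instances(reservations):
    """Flatten the Reservations list returned by EC2 describe_instances
    into one record per instance, pulling out our mandatory tags."""
    records = []
    for reservation in reservations:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            records.append({
                "instance_id": inst["InstanceId"],
                "instance_type": inst["InstanceType"],
                "launch_time": inst["LaunchTime"],
                "availability_zone": inst["Placement"]["AvailabilityZone"],
                "tenancy": inst["Placement"]["Tenancy"],
                "private_ip": inst.get("PrivateIpAddress"),
                "state": inst["State"]["Name"],
                "team": tags.get("Team-Name"),
                "owner": tags.get("Owner"),
                "project": tags.get("Project"),
            })
    return records

def collect_instance_data(region):
    """Page through describe_instances and parse every reservation.
    Needs boto3 and AWS credentials to actually run."""
    import boto3
    ec2 = boto3.client("ec2", region_name=region)
    records = []
    for page in ec2.get_paginator("describe_instances").paginate():
        records.extend(parse_instances(page["Reservations"]))
    return records
```

Keeping `parse_instances` a pure function over the response dictionary means it can be unit-tested without AWS access.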
AWS master pricing data
AWS provides an API to fetch the current on-demand pricing. We extracted this data to check for instances that have a similar configuration but are cheaper. We then checked all the instance types AWS provides and what their On-Demand and RI (Reserved Instances) prices were under the various models.
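AWS publishes this pricing data through its Price List offerings (both an API and bulk offer files). A sketch of extracting per-instance-type CPU, memory, and hourly price might look like the following; the document structure here is simplified to just the fields we need, and real offer files are considerably larger.

```python
def parse_offers(offers):
    """Extract (instance type, vCPU, memory, USD/hour) rows from a
    simplified AmazonEC2 Price List offers document: 'products' keyed
    by SKU, with on-demand prices under 'terms'."""
    rows = []
    on_demand = offers.get("terms", {}).get("OnDemand", {})
    for sku, product in offers["products"].items():
        attrs = product.get("attributes", {})
        if "instanceType" not in attrs or sku not in on_demand:
            continue
        for term in on_demand[sku].values():
            for dim in term["priceDimensions"].values():
                rows.append({
                    "instance_type": attrs["instanceType"],
                    "vcpu": int(attrs["vcpu"]),
                    # memory arrives as a string such as "30 GiB"
                    "memory_gib": float(attrs["memory"].split()[0]),
                    "usd_per_hour": float(dim["pricePerUnit"]["USD"]),
                })
    return rows
```

The resulting rows form the master catalog that the recommendation step later searches through.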
Utilization Data from Prometheus
We found that IO (network and disk) was seldom a problem for us, but CPU and memory were. Hence, we extracted the utilization data for CPU and memory using Prometheus. Prometheus collects metrics every 5 seconds. To aggregate these metrics, we experimented with max, min, average, and percentile calculations.
Percentile came out as the best option and gave better recommendations; the intuition behind it is explained in the algorithm section below. We took the 80th percentile of resource-utilization data over a 15-day window, pulling the host address, timestamp, and aggregated utilization value per instance. This was straightforward since Prometheus has built-in support for percentile aggregation.
For providing size recommendations for any given instance, we required the following data:
From the AWS master pricing data, we needed to extract the following details for all the available instances on AWS:
Using AWS APIs, we can extract the details of all instances running in our AWS account.
Consider we have one EC2 instance running in our AWS account with the following configuration:
Instance type: c3.4xlarge
Available CPU: 16 cores
Available Memory: 30 GB
Price (per hour): $0.840
From the Prometheus data for this instance, we observe the following metrics:
CPU usage: 25% (80th percentile, see note below)
Memory usage: 30% (80th percentile, see note below)
* An 80th percentile of 25% means that 80% of the time, CPU usage was below 25%.
Using this usage information, we calculated the CPU and memory this instance actually requires:
Required CPU: 16 * 25% = 4 cores
Required Memory: 30 * 30% = 9 GB
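In code, this first step is just scaling capacity by observed utilization; the numbers below reproduce the worked example.

```python
def required_resources(capacity_cpu, capacity_mem_gb, cpu_usage, mem_usage):
    """Scale the instance's capacity by its observed (80th percentile)
    utilization fractions to get what the workload actually needs."""
    return capacity_cpu * cpu_usage, capacity_mem_gb * mem_usage

# c3.4xlarge from the example: 16 cores at 25% CPU, 30 GB at 30% memory
req_cpu, req_mem = required_resources(16, 30, 0.25, 0.30)
print(req_cpu, req_mem)  # 4.0 cores and 9.0 GB
```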
As we had taken the 80th percentile, we added an additional buffer on top of the recommended instance's CPU and memory, as the formula below shows:
Actual Memory Usage = Instance Memory * Usage
Buffer Memory = sqrt(Actual Memory Usage)
Recommended Memory = Actual Memory Usage + Buffer Memory
The same calculation can be done for CPU cores as well.
Why is the buffer memory calculated using the square root function?
Consider the following two examples.
Case 1. The actual memory usage is 4 GB. According to the formula, the buffer memory comes out to √4 = 2 GB, which is more than sufficient in this case.
Case 2. Let’s say the actual memory usage is 128 GB. Again, as per the formula, the buffer memory comes out to √128 ≈ 11 GB.
In both cases, this is a reasonable buffer to add.
The square-root curve grows quickly for small values and slowly for large ones, so it accentuates low usage figures and penalizes high ones, giving us an acceptable buffer size.
The same applies to the CPU usage as well, so per the above formula we now have the following recommended instance properties:
Recommended CPU = 4 + sqrt(4) = 6 cores
Recommended Memory = 9 + sqrt(9) = 12 GB
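The buffer rule is easy to sanity-check in code against both cases above:

```python
import math

def with_buffer(actual_usage):
    """Recommended capacity = observed usage + sqrt(usage) buffer."""
    return actual_usage + math.sqrt(actual_usage)

print(with_buffer(4))    # 6.0  (CPU cores in the example)
print(with_buffer(9))    # 12.0 (GB of memory in the example)
print(with_buffer(128))  # ~139.3 -- only an ~11 GB buffer on 128 GB
```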
To visualize this, see the bubble chart of AWS instances plotted by CPU (x-axis), Memory (y-axis), and Price (bubble size). The goal is to find the top N closest and cheapest instances available with the above configuration. The black bubble in the diagram below shows the required instance with the configuration we calculated above. We find the three cheapest recommendations that satisfy the required configuration, sorted by the price of the instance.
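This selection step can be sketched as a filter-and-sort over the pricing catalog. The catalog below is purely illustrative (made-up instance names and prices, except the c3.4xlarge from the example), not actual AWS pricing.

```python
def top_n_cheapest(catalog, req_cpu, req_mem_gib, n=3):
    """Keep instances that satisfy both the CPU and memory requirement,
    then return the n cheapest, sorted by hourly price."""
    fits = [i for i in catalog
            if i["vcpu"] >= req_cpu and i["memory_gib"] >= req_mem_gib]
    return sorted(fits, key=lambda i: i["usd_per_hour"])[:n]

# Illustrative catalog entries -- specs and prices are examples only.
catalog = [
    {"instance_type": "c3.4xlarge", "vcpu": 16, "memory_gib": 30, "usd_per_hour": 0.840},
    {"instance_type": "type-a",     "vcpu": 8,  "memory_gib": 16, "usd_per_hour": 0.332},
    {"instance_type": "type-b",     "vcpu": 8,  "memory_gib": 32, "usd_per_hour": 0.420},
    {"instance_type": "type-c",     "vcpu": 4,  "memory_gib": 16, "usd_per_hour": 0.200},
]
# type-c is cheapest but has too few cores, so it is filtered out.
for rec in top_n_cheapest(catalog, req_cpu=6, req_mem_gib=12):
    print(rec["instance_type"], rec["usd_per_hour"])
```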
As a result, our recommendations would be the following:
If we resize our current instance based on the above recommendations, we would end up saving around:
Savings per hour = $0.840 − $0.332 = $0.508 (≈ $0.51)
Savings (%) = 0.508 / 0.840 × 100 ≈ 60%
And this saving is for just one instance. We can apply the same algorithm to all the on-demand instances (EC2 or RDS) based on their usage and save the company a whole lot of money.
AWS provides a variety of ways to reduce infrastructure cost. We have to choose the right model and the right size for each of the AWS services. For example, we should consider using Spot Instances for EMR-related jobs and Reserved Instances (RIs) for long-running instances with known utilization. Further, understanding and analyzing the billing data over a period of time helps identify the optimal mix of pricing models for your infrastructure requirements.
Transparency in infra is another very important aspect of achieving cost optimization, and it assists ROI calls for the services deployed. Tagging instances is one of the mandatory practices for transparency. Regularly monitoring the infra and removing or resizing underutilized resources goes a long way in reducing spend. We run this recommendation algorithm once a month and are looking at more automated ways of applying the optimization recommendations to our cloud infrastructure.
Zeotap Hackathon – https://www.zeotap.com/hackathon
EC2 – https://aws.amazon.com/ec2/details/
RDS – https://aws.amazon.com/rds/
On-Demand – https://aws.amazon.com/ec2/pricing/on-demand/
Spot Instances – https://aws.amazon.com/ec2/spot/
AWS Reserved Instances – https://aws.amazon.com/ec2/pricing/reserved-instances/
AWS pricing model – https://aws.amazon.com/pricing/
CloudWatch – https://aws.amazon.com/cloudwatch/
Prometheus – https://github.com/prometheus/prometheus
Athena – https://aws.amazon.com/athena/
Domo – https://www.domo.com/solution/data-visualization
Percentile – https://en.wikipedia.org/wiki/Percentile
Rakesh Sharma and Aman Verma are both Senior Engineers on the Data Engineering team, which handles the massive scale of data at zeotap. They work on the core data platform, zeoCore, responsible for ingesting, refining, and distributing data from our multiple data partners. Rakesh studied computer science and engineering and had a stint at Minjar Cloud Solutions (acquired by Nutanix) before joining zeotap. His areas of expertise include data engineering, distributed systems, and AWS services, among others. Aman studied electronics and communications engineering and worked at Snapdeal, one of the largest e-commerce startups in India, before moving to zeotap. His areas of expertise include data engineering, distributed systems, and data platforms.