CFD19 - Get Dollars Back in Your Pocket With EKS Optimisation

Most Kubernetes clusters run at 30% utilisation or less, so there’s a lot of resource wastage out there in the wild, and that translates to dollars being spent for no value. Platform9 want to help you save some money by changing the way you run your EKS clusters, and the same pattern could make its way to other managed Kubernetes solutions in the future.

Platform9 were founded in 2013 with a relatively clear mission: Democratize cloud computing. To use a quote they shared with us during their Cloud Field Day 19 presentation:

“We help you scale cloud infrastructure without operational burden or cost concerns”

“We empower DevOps and Platform Engineering teams, to deploy, manage, and run cloud infrastructure anywhere, eliminating operational hurdles and cost concerns, and freeing developers and data scientists.”

Platform9 briefly took us through the backbone of their company, which is built around managing infrastructure across customers' distributed environments. But that’s not really why we’re here today at CFD19. Today we’re here to hear about Kubernetes wastage, specifically with EKS, and how Platform9 can help with their new offering, Elastic Machine Pool (EMP).

What is the issue?

Platform9 informed us that k8s is the most inefficient platform, and that resources are being wasted, with most EKS users rarely achieving 30% utilization. There are several causes of inefficiency, but one example is the bin-packing problem, where pod sizing within EKS is misaligned with the available EC2 sizes, leading to over-provisioning.
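To make the bin-packing point concrete, here is a minimal Python sketch. It is not something Platform9 showed: the pod requests and the node size are made-up numbers, and first-fit decreasing is just one simple placement heuristic standing in for the (far more involved) Kubernetes scheduler.

```python
def first_fit_decreasing(pod_cpu_requests, node_cpu):
    """Place pods on nodes using the first-fit decreasing heuristic.

    Returns a list of nodes, each a list of the pod requests placed on it.
    """
    nodes = []
    for request in sorted(pod_cpu_requests, reverse=True):
        if request > node_cpu:
            raise ValueError(
                f"a pod requesting {request} vCPU cannot fit on a {node_cpu} vCPU node"
            )
        for node in nodes:
            if sum(node) + request <= node_cpu:
                node.append(request)
                break
        else:
            # No existing node has room, so another node is provisioned.
            nodes.append([request])
    return nodes


def requested_fraction(nodes, node_cpu):
    """Fraction of the provisioned vCPU that pods have actually requested."""
    provisioned = len(nodes) * node_cpu
    requested = sum(sum(node) for node in nodes)
    return requested / provisioned


# Hypothetical workload: six pods requesting 3 vCPU each, on 4 vCPU nodes.
pods = [3, 3, 3, 3, 3, 3]
placement = first_fit_decreasing(pods, node_cpu=4)
print(len(placement), "nodes,", f"{requested_fraction(placement, 4):.0%} of vCPU requested")
# prints "6 nodes, 75% of vCPU requested"
```

In this toy case every node strands 1 vCPU that no pod can use, so a quarter of the paid-for capacity is lost before actual usage is even considered.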

Additionally, developers typically resist changes to resource configuration, as they are ultimately responsible for the performance and availability of their applications. Enabling engineers to take action was a top challenge highlighted by Platform9 in their presentation. They also noted that developers will often set high resource “limits”, so as not to interfere with application performance and availability.
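The effect of generous resource settings can be sketched in a few lines of Python. The pod names and numbers below are hypothetical, chosen only to land on the 30% figure quoted above. The sketch compares CPU reserved through requests with CPU actually consumed, because the scheduler places pods based on requests (and when only a limit is set, Kubernetes uses it as the request too).

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cpu_request_m: int  # millicores reserved on the node
    cpu_used_m: int     # millicores actually consumed (hypothetical observation)


def utilisation(pods):
    """Share of the reserved CPU that the pods really use."""
    reserved = sum(p.cpu_request_m for p in pods)
    used = sum(p.cpu_used_m for p in pods)
    return used / reserved


# Hypothetical pods whose owners padded their settings "to be safe".
pods = [
    Pod("api", cpu_request_m=2000, cpu_used_m=500),
    Pod("worker", cpu_request_m=4000, cpu_used_m=1200),
    Pod("cache", cpu_request_m=2000, cpu_used_m=700),
]
print(f"{utilisation(pods):.0%} of reserved CPU is in use")
# prints "30% of reserved CPU is in use"
```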

Elastic Machine Pool (EMP)

Enter Elastic Machine Pool (EMP). What is EMP? Platform9 revealed it as a new engine that helps optimize compute utilization when running Kubernetes in the public cloud, and can save up to 50% of costs.

If you deployed EKS today from AWS, you’d have two choices for the compute engine:

  1. EC2 instances
  2. Fargate

Platform9 implement EMP differently: instead of using EC2 VM worker nodes, they leverage AWS bare metal instances and provision “elastic VMs” (EVMs) on top of that bare metal. The EVMs are then joined to the EKS cluster as worker nodes, as shown in the diagram below. Notice in particular the utilized capacity versus unused capacity on the nodes, and on the bare metal instance.

EMP allows you to maximise your EKS cluster utilization, while not impacting the application SLA. Platform9 tells us this works seamlessly with existing EKS environments and applications, and no configuration changes are required.

EMP isn’t in the business of giving you the best cost visibility. EMP is great at:

  • Optimizing usage via improved utilization
  • Zero pod disruption
  • Eliminating engineering and ops back and forth
  • Optimizing usage via resolving bin-packing

What Will EMP Cost Me?

While we don’t typically get too deep into the commercials during CFD discussions, as those commercials can vary from deal to deal, Platform9’s approach is that they will take 20-25% of the money saved by EMP. Platform9 want you to be successful with your savings and ideally save 50% or more of your current EKS cost, so it’s very much a shared risk, shared reward model.
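As a rough worked example of that model, the Python sketch below applies the 20-25% share to a 50% saving. The current cost of 10,000 is an invented figure for illustration, and real commercials will vary from deal to deal.

```python
def emp_net_savings(current_cost, savings_fraction, fee_fraction):
    """Split gross savings into the vendor's share and what the customer keeps.

    Returns a tuple of (gross_savings, fee, net_savings).
    """
    gross = current_cost * savings_fraction
    fee = gross * fee_fraction
    return gross, fee, gross - fee


# Hypothetical EKS compute bill of 10,000 with a 50% saving.
for fee_fraction in (0.20, 0.25):
    gross, fee, net = emp_net_savings(10_000, 0.50, fee_fraction)
    print(f"fee {fee_fraction:.0%}: gross {gross:.0f}, fee {fee:.0f}, net {net:.0f}")
```

Under these assumptions the customer keeps between 3,750 and 4,000 of every 10,000 previously spent, which works out to a net saving of 37.5% to 40%.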

As expected, step #1 is to do an analysis of the current workloads to identify potential cost savings, which leads into the hands-on action …

Can We See it In Action?

Absolutely! The Platform9 team, specifically Peter, took us through a demonstration of the product, starting with an AWS EKS cluster with a number of EC2 instances running as worker nodes.

He deployed the Platform9 cost analyser via Helm charts, and it identified that we could save up to 66.25% by using EMP.

The idea is to take a number of EC2 nodes running as EKS worker nodes, and replace them with elastic VMs running on a single physical AWS compute node. It’s worth noting that a single physical compute node running all of your worker node VMs is not great from an availability perspective, so this setup was for the purposes of demonstration.

Peter also showed us another example, generating load against the worker nodes, which were now running as elastic VMs across 2 bare metal nodes. A component within EMP monitors the load, adds a new bare metal instance, and rebalances / adds worker EVMs to meet the performance requirements and balance load evenly across the available physical instances. The demo gremlins came out of the closet here, and the addition of a 3rd bare metal node was timing out with an AWS issue, but literally 30 seconds after we stopped the live stream, the issue resolved itself and a 3rd bare metal node was added. Believe me!
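Platform9 did not show us the internals of that component, so the Python sketch below is purely an illustration of the general idea and not how EMP actually works. The 80% threshold, the host capacity, the EVM loads, and the greedy "largest EVM onto the least-loaded host" rule are all assumptions of mine.

```python
def rebalance(evm_loads, host_count):
    """Greedily spread EVMs across hosts: largest EVM onto the least-loaded host."""
    hosts = [[] for _ in range(host_count)]
    totals = [0] * host_count
    for load in sorted(evm_loads, reverse=True):
        target = totals.index(min(totals))
        hosts[target].append(load)
        totals[target] += load
    return hosts


def scale_and_rebalance(evm_loads, host_count, host_capacity, threshold=0.8):
    """Add bare metal hosts until total load is under the threshold, then rebalance."""
    total = sum(evm_loads)
    while total > threshold * host_capacity * host_count:
        host_count += 1  # stands in for provisioning another bare metal instance
    return rebalance(evm_loads, host_count)


# Hypothetical vCPU load per EVM, starting on 2 bare metal hosts of 96 vCPU each.
evm_loads = [30, 30, 25, 25, 20, 20, 20]
hosts = scale_and_rebalance(evm_loads, host_count=2, host_capacity=96)
for index, evms in enumerate(hosts):
    print(f"host {index}: EVM loads {evms}, total {sum(evms)} vCPU")
```

With these numbers the 170 vCPU of load exceeds 80% of two hosts (153.6 vCPU), so a third host is added and the EVMs are spread across all three, mirroring what the demo was meant to show.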

Summary

I understand the problem space, and saving anyone money is a good thing. All discussions with my customers today are centered around cost optimisation in public cloud environments. They’ve all experienced cost blowouts and unreliable forecasting, leading to irregular and unpredictable cloud spend.

While EMP is bound to save you money, and I like the technology supporting the initiative, I think the general consensus among delegates is that it’s not going to help solve the underlying problem of correctly sizing your workloads and implementing resource controls in Kubernetes. That’s not necessarily an easy thing to do - if it were, everyone would have done it and Platform9 wouldn’t have come up with this solution.

