How We Saved ~$100,000 Annually by Running Our Kubernetes Clusters Exclusively on Spot Instances (while achieving “zero-downtime” production!)

March 21, 2024

Intro

Solidus Labs is the leader in crypto-native trade surveillance and risk monitoring. The company plays a pivotal role in safeguarding millions of retail and institutional entities globally. Solidus Labs operates on a robust tech stack that powers its intricate data pipelines, incorporating real-time and batch processes. Apache Spark, Kafka, Airflow, and other cutting-edge data processing tools form the backbone of the organization’s day-to-day operations. Further enhancing its capabilities, Solidus Labs utilizes Amazon Elastic Kubernetes Service (EKS) to manage Kubernetes clusters across multiple regions and environments.

This article delves into Solidus Labs’ transformative journey of optimizing its cloud infrastructure, addressing the dual challenges of soaring AWS costs and the pressing need for performance and reliability.

The problem
Faced with a monthly EC2 bill of tens of thousands of dollars, Solidus set out to optimize its infrastructure, strategically integrating both PerfectScale and cast.ai.

As a crucial first step, Solidus prioritized ensuring that all CPU and memory requirements were configured accurately according to the actual needs of their workloads. To achieve this, we implemented the PerfectScale solution. A case study link shared below (in the conclusion section of this article) provides further insights into Solidus’ transformative journey with PerfectScale. Feel free to check it out for a deeper understanding of their solution.

Additionally, we migrated all of our instances (EC2, RDS, OpenSearch, ElastiCache (Redis)) to AWS Graviton (ARM-based processors).

The challenge
We had to run all (or nearly all) workloads on Spot instances to achieve significant savings on our EC2 expenditure without compromising resilience and uptime. The task required us to keep a fully operational production environment running on Spot instances with zero downtime.

The Solution included

  • Cast.ai autoscaler
  • Spot Instances
  • Topology Spread Constraint
  • Pod Disruption Budget (PDB)

About Cast.ai
Solidus strategically addressed soaring AWS costs by leveraging cast.ai’s cluster autoscaler platform. Adopting Spot Instances, facilitated by cast.ai, significantly reduced monthly expenses.

The platform’s advanced predictive analytics allowed for forecasting potential Spot Instance terminations and proactive planning for workload adjustments.

Furthermore, cast.ai efficiently identified the most stable Spot instances at the most affordable price.

Cast.ai demonstrated swift scaling with outstanding automated cluster autoscaling capabilities, ensuring Solidus Labs’ infrastructure dynamically adjusted to changing demands.

This maintained optimal performance while adapting to cost considerations, with cast.ai handling the scaling process (up or down) very quickly, even outpacing Karpenter.

The platform’s comprehensive cost visibility empowered Solidus to make informed decisions and navigate market challenges with a clear understanding of its AWS EC2 spending.

Now, let’s talk about how we really did that!

Workload analysis
In conducting a detailed workload analysis, Solidus identified specific tasks suitable for Spot Instances, considering each workload’s tolerance for interruptions. Notably, applications categorized as “stateless”, or those capable of completing their in-flight requests within 30 seconds of receiving SIGTERM, were recognized as fitting candidates for running on Spot Instances without requiring modifications. This analysis aligned the nature of these workloads with the inherent characteristics of Spot Instances, optimizing their utilization without needing application adjustments.
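
As a reference, here is a minimal sketch of what such a “spot-ready” stateless deployment can look like, assuming the application finishes its in-flight requests once it receives SIGTERM (the container name, image, and preStop delay below are illustrative, not taken from Solidus’ actual manifests):

deployment.yaml:

spec:
  template:
    spec:
      # Kubernetes sends SIGTERM first and waits this long before SIGKILL;
      # it must cover the time the app needs to drain in-flight requests.
      terminationGracePeriodSeconds: 30
      containers:
        - name: api                      # hypothetical container name
          image: example/api:1.0.0       # hypothetical image
          lifecycle:
            preStop:
              exec:
                # Short delay so the endpoint is removed from the Service
                # before the process begins shutting down.
                command: ["sleep", "5"]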

Spot Instances Configuration
To configure Spot instances correctly, we used a cast.ai add-on called “castai-pod-node-lifecycle”.
Some of our clusters required custom configurations, as some of our deployments had to run only on on-demand instances. With this add-on, we could configure each workload to run on Spot instances only, on on-demand instances only, or on both.
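
For illustration, this is roughly what the equivalent manual scheduling configuration looks like on a single workload; our understanding is that the add-on can apply such preferences automatically based on cluster-wide rules. The node label and taint key below follow cast.ai’s documented spot scheduling convention, but treat them as an assumption and verify them against the add-on version in use:

deployment.yaml:

spec:
  template:
    spec:
      # Request spot capacity explicitly; workloads that must stay on
      # on-demand nodes would simply omit this selector and toleration.
      nodeSelector:
        scheduling.cast.ai/spot: "true"
      tolerations:
        - key: scheduling.cast.ai/spot
          operator: Exists
          effect: NoSchedule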

Pod Disruption Budget (PDB)

Solidus is obligated to maintain a high availability level for its services (an SLA of 99.99% uptime).

Spot instances are cost-effective but come with the risk of being terminated by the cloud provider when resource demand increases. To mitigate the impact of these interruptions on your workloads, you can use PDBs to define the minimum number of pods that must be available for each service, even during disruptions. In the context of running workloads on Spot instances, which are typically less reliable and can be preempted with little notice, PDBs become particularly important.

Pod Disruption Budget (PDB) is a Kubernetes resource that allows you to control the disruption caused by voluntary disruptions, such as those that occur during node draining for maintenance or scaling down the number of replicas in a deployment. Setting a minimum number of pods available during disruptions ensures that your workloads remain available. The PDB allows prioritization of critical services, ensuring higher resilience for essential components.

We implemented a PDB for each of our services to achieve the required HA level, setting minAvailable based on the service’s importance. This minimized the impact of potential interruptions caused by Spot instance terminations.

We enforced this PDB template for our workloads:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ .Chart.Name }}
spec:
  minAvailable: {{ .Values.minAvailable }}
  selector:
    matchLabels:
      app: {{ .Chart.Name }}
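
For instance, with a hypothetical chart named my-service and minAvailable set to 2 in its values file, the template above renders roughly to:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-service

With three replicas, this guarantees that at most one pod of the service can be evicted at a time during voluntary disruptions such as node draining.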

High availability setup (with Topology Spread Constraint)

We deployed a minimum number of replicas of each workload across different Availability Zones (AZs), ensuring redundancy and high availability.

This way, we ensure that the replicas of each of our workloads are spread evenly across the Availability Zones (and, as a result of this distribution, scheduled on different nodes), further enhancing our infrastructure’s resilience and fault tolerance. This strategic deployment across distinct Availability Zones and Spot instance nodes contributes to a robust and reliable operational setup.

Here is the YAML configuration that defines a topology spread constraint for Kubernetes Pods with the specified label selector, ensuring that the pods are spread across the different zones in the cluster. If the constraint cannot be satisfied, the pods will not be scheduled.

deployment.yaml:

spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - {{ template "service.name" . }}

Performance testing

To ensure that our solution is reliable and can stand up to real-world demands (our production environments), we applied the following steps:

Performance environment setup:

We established a dedicated performance environment (Kubernetes cluster mirroring our production cluster) to implement and test all of the above configs.

Synthetic testing:
A custom Golang script written by the team was used to simulate workload failures and evaluate system performance under various conditions. For this synthetic testing, we implemented an alerting mechanism that notified the team via a Slack channel of any disruptions during the testing phase.

[Figure: System performance during synthetic testing before implementing the Pod Disruption Budget (PDB), showing many interruptions as Spot instances came and went.]

After implementing the Pod Disruption Budget (PDB), we didn’t face any interruptions as Spot instances came and went.

One-week testing period:
We conducted a rigorous one-week testing period in the performance environment, continuously monitoring and analyzing system behavior, Spot Instance terminations, and cast.ai’s automated scaling responses.

Moving forward to production:

After all of our careful checks passed successfully, we decided to apply this solution across all of our Kubernetes clusters. The result was monthly savings of $8,000, cutting a significant part of our EC2 spending each month!

Side Effects

Here are some side effects we observed after applying this solution widely across all of our Kubernetes clusters:

Increased Cross-region ECR data transfer costs

We operate multiple clusters across various regions. Before implementing this solution, we stored all of our Docker images in a single centralized region, from which all other regions pulled their images. As a result of exclusively using Spot instances across all of our environments, we encountered a high turnover of nodes. Consequently, we observed a significant increase in cross-region image pulls, leading to higher data transfer costs.

SOLUTION: We began storing images in each region separately to solve this issue and avoid unnecessary expenses.
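
One way to keep those per-region copies in sync without manual pushes is ECR’s cross-region replication. The CloudFormation fragment below is a sketch of that approach rather than a description of how we actually performed the migration; the destination regions and account ID are placeholders:

Resources:
  EcrReplication:
    Type: AWS::ECR::ReplicationConfiguration
    Properties:
      ReplicationConfiguration:
        Rules:
          # Replicate every image pushed to the source registry into the
          # regions where clusters run, so nodes pull from a local copy.
          - Destinations:
              - Region: eu-west-1           # placeholder destination region
                RegistryId: "123456789012"  # placeholder account ID
              - Region: us-east-1
                RegistryId: "123456789012"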

Increased AWS Config costs

The increase in AWS Config billing was directly linked to the substantial turnover of EC2 instances, primarily caused by our use of Spot instances. The frequent termination and replacement of Spot instances led to a higher volume of configuration checks triggered by AWS Security Hub, consequently contributing to the escalated AWS Config billing.

SOLUTION: We tackled this issue by strategically reducing the scan frequency. This mitigated the number of checks performed, aligning with the dynamic nature of Spot instances and effectively resolving the associated cost escalation.
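
The exact mechanism isn’t detailed here, but one option AWS provides for this kind of reduction is switching the Config recorder from continuous to periodic (daily) recording. The CloudFormation fragment below is only a sketch of that option, with ConfigRole as a placeholder IAM role:

Resources:
  ConfigRecorder:
    Type: AWS::Config::ConfigurationRecorder
    Properties:
      RoleARN: !GetAtt ConfigRole.Arn   # ConfigRole is a placeholder IAM role
      RecordingGroup:
        AllSupported: true
      RecordingMode:
        # Record configuration changes once a day instead of continuously,
        # so short-lived Spot nodes generate far fewer configuration items.
        RecordingFrequency: DAILY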

Conclusion

The success of our cost-saving initiative relied not only on the strategic adoption of PerfectScale, Spot Instances, and cast.ai but also on a thorough testing regimen, including implementing Pod Disruption Budgets and high availability setups. This meticulous approach keeps our infrastructure both cost-efficient and resilient in the face of real-world demands and potential disruptions, ensuring uninterrupted service and business continuity.

You’re welcome to read about our journey with PerfectScale (and how they are helping us be much more efficient in utilizing CPU and memory for our workloads) here.

In addition to these measures, we maximized the efficiency of our clusters by leveraging the outstanding “scheduled rebalancing” feature of cast.ai. This innovative capability allowed us to dynamically resize our clusters during idle times, ensuring that resources were allocated optimally. By downsizing clusters during periods of reduced demand, we further optimized our costs without compromising performance.

Further beyond

These days, the Solidus DevOps team is collaborating with R&D teams to identify additional opportunities for cost optimization. We anticipate achieving an additional ~$75,000 in EC2 spending cuts by transitioning specific backend services to a batch processing model using AWS Batch (Serverless). This approach replaces the previous practice of running these workloads as Kubernetes deployments, which resulted in substantial idle time usage on compute resources.
