Transforming Kubernetes Management: Simplify Operations, Optimize Costs, and Scale Effectively

December 23, 2024

#aws #cast.ai #eks #kubernetes #terraform

Managing Kubernetes clusters at scale on AWS is a significant operational challenge: teams must balance performance, scalability, and cost efficiency while provisioning resources, scaling workloads, and maintaining availability. CAST AI is designed to simplify and optimize cloud-native application management. It analyzes workloads in real time to adjust resources dynamically, enables cost-effective use of spot instances, and monitors costs and usage as they occur. By integrating CAST AI with Amazon Elastic Kubernetes Service (EKS) using Terraform, organizations can streamline operations, reduce cloud costs, and maintain control over their infrastructure. This guide walks through using CAST AI and Terraform together to manage Kubernetes on EKS in a scalable, cost-effective way.

Key Features of CAST AI

Automated Scaling: CAST AI dynamically scales resources based on workload demand.
Spot Instance Optimization: CAST AI uses spot instances when available to reduce costs and seamlessly handles instance interruptions.
Real-Time Monitoring and Cost Insights: Users can optimize resource usage and control expenses based on CAST AI’s detailed insights.

Tutorial Overview

This step-by-step tutorial walks you through configuring CAST AI with EKS using Terraform, covering IAM roles, node configuration, node templates, autoscaling, and rebalancing schedules. With this setup, CAST AI manages cluster resources autonomously, balancing performance and cost. The result is a resilient cluster that needs minimal manual intervention for day-to-day capacity management.

Setting Up CAST AI with EKS Using Terraform

Before diving into the configuration, ensure that you have:

Terraform installed and configured for your environment.
AWS CLI with permissions for EKS management.
An active CAST AI account with API access.
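
With the prerequisites in place, both providers need to be declared before any of the resources below will plan. The following is a minimal sketch rather than a definitive setup: the variable name castai_api_token is a hypothetical choice for this example, version constraints are left out on purpose, and var.cluster_region is the same variable the later snippets use.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
    castai = {
      source = "castai/castai"
    }
  }
}

# AWS credentials are picked up from the environment or the AWS CLI profile.
provider "aws" {
  region = var.cluster_region
}

# The CAST AI provider authenticates with an API token from your CAST AI account.
provider "castai" {
  api_token = var.castai_api_token
}

# Hypothetical variable name; marked sensitive so the token is redacted in plan output.
variable "castai_api_token" {
  type      = string
  sensitive = true
}
```

Passing the token through an environment variable such as TF_VAR_castai_api_token keeps it out of version control.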

Step 1: Creating the CAST AI User ARN for EKS Cluster Access

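The castai_eks_user_arn resource in Terraform retrieves the ARN of the CAST AI user associated with your cluster. The IAM role created in Step 2 trusts this ARN, which is what grants CAST AI permission to interact with the EKS cluster. Note that the snippet references castai_eks_clusterid.cluster_id, which is defined in Step 3; Terraform resolves the dependency regardless of the order in which the blocks appear.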


hcl
resource "castai_eks_user_arn" "castai_user_arn" {
  cluster_id = castai_eks_clusterid.cluster_id.id
}

This ARN (Amazon Resource Name) is a secure identity, allowing CAST AI to monitor and manage the EKS cluster. This initial step is essential for enabling CAST AI’s access to the resources it will manage.

Step 2: Setting Up AWS IAM Policies and Roles

CAST AI needs specific AWS Identity and Access Management (IAM) roles and policies to operate within your EKS environment. The following Terraform module automates IAM role and policy setup:



hcl
module "castai-eks-role-iam" {
  source = "git::https://github.com/castai/terraform-castai-eks-role-iam.git?ref=v0.3.1"

  aws_account_id                   = data.aws_caller_identity.current.account_id
  aws_cluster_region               = var.cluster_region
  aws_cluster_name                 = var.cluster_name
  aws_cluster_vpc_id               = var.vpc_id
  castai_user_arn                  = castai_eks_user_arn.castai_user_arn.arn
  create_iam_resources_per_cluster = true
}

This module handles:

Role Creation: Defines the IAM roles needed by CAST AI.
Policy Assignment: Grants CAST AI permissions to manage EKS resources, such as scaling and monitoring.
Security Best Practices: Ensures that only necessary permissions are granted to CAST AI for optimal security.

Using this module saves time and helps prevent potential security misconfigurations by following CAST AI’s recommended IAM setup.
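
The module call above references a data source and several input variables that are not declared anywhere in this guide. A minimal sketch of those declarations, assuming plain string types and the names already used in the snippets, might look like this:

```hcl
# Looks up the AWS account that the current credentials belong to.
data "aws_caller_identity" "current" {}

variable "cluster_region" {
  type        = string
  description = "AWS region of the EKS cluster, for example eu-central-1."
}

variable "cluster_name" {
  type        = string
  description = "Name of the existing EKS cluster to connect to CAST AI."
}

variable "vpc_id" {
  type        = string
  description = "ID of the VPC in which the EKS cluster runs."
}
```

The descriptions are illustrative; adjust them to your own conventions.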

Step 3: Configuring CAST AI’s Connection to the EKS Cluster

With IAM permissions established, we define the cluster configuration for CAST AI.


hcl
resource "castai_eks_clusterid" "cluster_id" {
  account_id   = data.aws_caller_identity.current.account_id
  region       = var.cluster_region
  cluster_name = var.cluster_name
}

resource "castai_eks_cluster" "cluster" {
  account_id                 = data.aws_caller_identity.current.account_id
  region                     = var.cluster_region
  name                       = data.aws_eks_cluster.eks.id
  delete_nodes_on_disconnect = var.delete_nodes_on_disconnect
  assume_role_arn            = module.castai-eks-role-iam.role_arn
}

Cluster ID and Connection: These resources establish CAST AI’s connection to the EKS cluster.
Delete Nodes on Disconnect: When enabled, removes the nodes provisioned by CAST AI if the cluster is disconnected from CAST AI, preventing orphaned nodes from incurring costs.
Assume Role ARN: Provides the necessary permissions for CAST AI to operate within the EKS cluster.
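
The castai_eks_cluster resource reads the cluster name from data.aws_eks_cluster.eks and a flag from var.delete_nodes_on_disconnect, neither of which is shown above. One way to declare them, sketched here under the assumption that the data source is looked up by var.cluster_name and that the flag defaults to false:

```hcl
# Reads the existing EKS cluster; for this data source the id is expected
# to match the cluster name passed in here.
data "aws_eks_cluster" "eks" {
  name = var.cluster_name
}

# Assumed default: keep nodes when disconnecting, so that removing CAST AI
# never deletes capacity unless you opt in explicitly.
variable "delete_nodes_on_disconnect" {
  type    = bool
  default = false
}
```

Defaulting to false is the conservative choice for a first rollout; set it to true once you are comfortable with CAST AI owning node lifecycle.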

Step 4: Setting Node Configuration for CAST AI

Node configuration defines how CAST AI should manage nodes, specifying instance profiles, subnets, and security groups.


hcl
resource "castai_node_configuration" "node_configuration" {
  name       = "default"
  cluster_id = castai_eks_cluster.cluster.id
  subnets    = var.subnets
  tags       = var.tags
  eks {
    instance_profile_arn = module.castai-eks-role-iam.instance_profile_arn
    security_groups      = var.security_group_ids
  }
}

This configuration allows for:

Network Connectivity: Subnets and security groups ensure nodes have proper network access.
Instance Profile: Specifies permissions for nodes to interact with other AWS resources.
Tags: Facilitates resource organization and management across AWS.
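
The node configuration takes its subnets, security groups, and tags from input variables. Their declarations are not part of the original snippets, so the types below are assumptions based on how the values are used:

```hcl
variable "subnets" {
  type        = list(string)
  description = "Subnet IDs in which CAST AI may launch nodes."
}

variable "security_group_ids" {
  type        = list(string)
  description = "Security group IDs attached to CAST AI-provisioned nodes."
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the AWS resources created for nodes."
  default     = {}
}
```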

Step 5: Applying Node Configuration as the Default

To ensure consistent node management, mark this configuration as the cluster's default node configuration.


hcl
resource "castai_node_configuration_default" "node_configuration_default" {
  cluster_id       = castai_eks_cluster.cluster.id
  configuration_id = castai_node_configuration.node_configuration.id
}

This step automates the application of node settings, allowing CAST AI to manage new and existing nodes uniformly.

Step 6: Defining Node Templates for Autoscaling

Node templates specify autoscaling criteria, instance types, and spot instance usage preferences, allowing CAST AI to manage resources according to workload demands.


hcl
resource "castai_node_template" "dynamic_node_template" {
  for_each = local.node_templates_mapping

  depends_on       = [castai_autoscaler.castai_autoscaler_policies]
  cluster_id       = castai_eks_cluster.cluster.id
  configuration_id = castai_node_configuration.node_configuration.id
  name             = each.key
  should_taint     = lookup(each.value, "should_taint", false)

  custom_labels = each.value.custom_labels

  dynamic "custom_taints" {
    for_each = lookup(each.value, "custom_taints", [])

    content {
      key    = custom_taints.value.key
      value  = custom_taints.value.value
      effect = custom_taints.value.effect
    }
  }

  constraints {
    on_demand             = lookup(each.value.constraints, "on_demand", false)
    spot                  = lookup(each.value.constraints, "spot", false)
    enable_spot_diversity = lookup(each.value.constraints, "enable_spot_diversity", false)
    min_cpu               = lookup(each.value.constraints, "min_cpu", 0)
    min_memory            = lookup(each.value.constraints, "min_memory", 0)
  }
}

Key configuration elements:

Instance Preferences: Controls on-demand and spot instance usage.
Custom Labels and Taints: Assign workloads to specific nodes.
Scaling Constraints: Enables CAST AI to manage resource allocation based on real-time demands.
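
The resource iterates over local.node_templates_mapping, whose shape is implied by the lookups but never shown. The sketch below is one possible definition; the template names, labels, taints, and sizes are hypothetical, and the min_memory values assume the provider expects MiB.

```hcl
locals {
  node_templates_mapping = {
    # Spot-only template for interruption-tolerant workloads.
    "spot-general" = {
      should_taint  = false
      custom_labels = { "workload-type" = "general" }
      custom_taints = []
      constraints = {
        spot                  = true
        on_demand             = false
        enable_spot_diversity = true
        min_cpu               = 2
        min_memory            = 4096 # assumed to be MiB
      }
    }

    # On-demand template reserved for critical workloads via a taint.
    "on-demand-critical" = {
      should_taint  = true
      custom_labels = { "workload-type" = "critical" }
      custom_taints = [
        {
          key    = "dedicated"
          value  = "critical"
          effect = "NoSchedule"
        }
      ]
      constraints = {
        spot       = false
        on_demand  = true
        min_cpu    = 4
        min_memory = 8192 # assumed to be MiB
      }
    }
  }
}
```

Keys omitted from an entry, such as enable_spot_diversity in the second template, fall back to the defaults given in the lookup calls.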

Step 7: Configuring Autoscaler Policies

The castai_autoscaler resource enables CAST AI’s autoscaler to adjust cluster size based on workload demands.

hcl
resource "castai_autoscaler" "castai_autoscaler_policies" {
  cluster_id               = castai_eks_cluster.cluster.id
  autoscaler_policies_json = <<-EOT
      {
        "enabled" : true,
        "isScopedMode" : false,
        "unschedulablePods" : {
          "enabled" : true
        },
        "nodeDownscaler" : {
          "enabled" : true,
          "evictor" : {
            "enabled" : true,
            "nodeGracePeriodMinutes" : 5
          }
        }
      }
  EOT
}

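This policy enables two behaviors. With unschedulablePods turned on, CAST AI watches for pods that cannot be scheduled and adds capacity for them. With nodeDownscaler and its evictor turned on, CAST AI removes underused nodes, waiting out the five-minute grace period set by nodeGracePeriodMinutes before a node becomes eligible. Together they maintain a balance between performance and cost.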
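
Hand-written JSON inside a heredoc is easy to break with a stray comma. As an alternative sketch, the same policy document can be built with Terraform's jsonencode function; the local name autoscaler_policies_json is an assumption for this example, and the keys mirror the heredoc above.

```hcl
locals {
  # Same policy as the heredoc version, expressed as an HCL object so that
  # Terraform validates the syntax and produces well-formed JSON.
  autoscaler_policies_json = jsonencode({
    enabled      = true
    isScopedMode = false
    unschedulablePods = {
      enabled = true
    }
    nodeDownscaler = {
      enabled = true
      evictor = {
        enabled                = true
        nodeGracePeriodMinutes = 5
      }
    }
  })
}
```

To use it, set autoscaler_policies_json = local.autoscaler_policies_json in the castai_autoscaler resource in place of the heredoc.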

Step 8: Implementing Rebalancing Schedules for Cost Optimization

In any cloud environment, unused or underutilized resources can quickly add up to significant unnecessary costs. CAST AI’s Rebalancing Schedules allow you to create policies that optimize resource use at scheduled times, often during off-peak hours when workload demands are low. By rebalancing nodes, CAST AI can remove excess capacity or replace high-cost instances with more economical options. The following resources establish nightly and hourly rebalancing schedules with conditions that trigger only when the potential savings reach a set threshold.

Code for Nightly Rebalancing Schedule and Job


hcl
resource "castai_rebalancing_schedule" "nightly" {
  name = format("NIGHTLY-%s", var.account_name)
  schedule {
    cron = var.rebalancing_schedule_cron_nightly
  }
  trigger_conditions {
    savings_percentage = var.savings_percentage_nightly
  }
  launch_configuration {
    execution_conditions {
      enabled                     = true
      achieved_savings_percentage = var.savings_percentage_nightly
    }
  }
}

resource "castai_rebalancing_job" "nightly" {
  for_each                = toset(var.clusters_ids_cast)
  cluster_id              = each.value
  rebalancing_schedule_id = castai_rebalancing_schedule.nightly.id
  enabled                 = true
}

Explanation

Nightly Schedule: This rebalancing schedule triggers according to a custom CRON schedule specified in var.rebalancing_schedule_cron_nightly, typically set for low-traffic periods (e.g., late at night) to minimize the impact on active applications.
Trigger Conditions: The savings_percentage parameter ensures that rebalancing occurs only if a predefined savings threshold (set in var.savings_percentage_nightly) is achievable. This prevents unnecessary rebalancing operations if the potential savings are minimal.
Launch Configuration: Within execution_conditions, achieved_savings_percentage is checked to ensure that any initiated rebalancing provides meaningful cost savings.

The castai_rebalancing_job resource ties this nightly rebalancing configuration to specific clusters by iterating over var.clusters_ids_cast, which contains the IDs of clusters to manage. Setting enabled = true ensures the job is active.
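
The nightly schedule and job depend on four input variables. The declarations below are a sketch: the cron expression (02:00 every day) and the 20 percent threshold are illustrative defaults rather than recommendations, and the types are inferred from how the values are used.

```hcl
variable "account_name" {
  type        = string
  description = "Suffix used to make rebalancing schedule names unique."
}

variable "clusters_ids_cast" {
  type        = list(string)
  description = "CAST AI cluster IDs that the rebalancing jobs should target."
}

variable "rebalancing_schedule_cron_nightly" {
  type        = string
  description = "Cron expression for the nightly rebalancing run."
  default     = "0 2 * * *" # illustrative: every day at 02:00
}

variable "savings_percentage_nightly" {
  type        = number
  description = "Minimum savings, in percent, required to rebalance."
  default     = 20 # illustrative threshold
}
```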

Code for Hourly Rebalancing Schedule and Job


hcl
resource "castai_rebalancing_schedule" "hourly" {
  name = format("HOURLY-%s", var.account_name)
  schedule {
    cron = var.rebalancing_schedule_cron_hourly
  }
  trigger_conditions {
    savings_percentage = var.savings_percentage_hourly
  }
  launch_configuration {
    execution_conditions {
      enabled                     = true
      achieved_savings_percentage = var.savings_percentage_hourly
    }
  }
}

resource "castai_rebalancing_job" "hourly" {
  for_each                = toset(var.clusters_ids_cast)
  cluster_id              = each.value
  rebalancing_schedule_id = castai_rebalancing_schedule.hourly.id
  enabled                 = true
}

Explanation

Hourly Schedule: This resource functions similarly to the nightly schedule but operates at a higher frequency, as dictated by var.rebalancing_schedule_cron_hourly. Hourly rebalancing is ideal for workloads that fluctuate throughout the day, providing more frequent adjustments to adapt to workload changes.
Trigger Conditions: The threshold (var.savings_percentage_hourly) ensures that the job is triggered only when hourly rebalancing can deliver meaningful savings.
Launch Configuration: The execution_conditions in the hourly schedule similarly evaluate if the achieved savings are sufficient to initiate rebalancing.

This hourly rebalancing job also iterates over var.clusters_ids_cast, targeting each cluster for frequent rebalancing, if cost-effective.
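
Because the hourly job runs often, a mistyped threshold has more opportunities to cause churn. The sketch below declares the two hourly variables and adds a validation block to the percentage; the default values are illustrative assumptions, not recommendations.

```hcl
variable "rebalancing_schedule_cron_hourly" {
  type        = string
  description = "Cron expression for the hourly rebalancing run."
  default     = "0 * * * *" # illustrative: at minute 0 of every hour
}

variable "savings_percentage_hourly" {
  type        = number
  description = "Minimum savings, in percent, required to rebalance."
  default     = 15 # illustrative threshold

  # Reject values that cannot be a meaningful percentage.
  validation {
    condition     = var.savings_percentage_hourly > 0 && var.savings_percentage_hourly <= 100
    error_message = "The savings_percentage_hourly value must be greater than 0 and at most 100."
  }
}
```

With the validation in place, terraform plan fails early on an out-of-range value instead of creating a schedule that never triggers or triggers constantly.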

Why Rebalancing Matters

Rebalancing on a regular schedule helps your cloud environment maintain efficiency by:

Reducing Costs: Moves workloads from on-demand to spot instances or scales down when demand is low.
Avoiding Resource Waste: Ensures you’re only using the resources you need, preventing “ghost” infrastructure from idling.
Enhancing Flexibility: Keeps clusters adaptable to real-time demands without human intervention.

By applying both nightly and hourly schedules, CAST AI provides robust automation to consistently keep your Kubernetes clusters optimized for performance and cost efficiency. This configuration enables more frequent monitoring and adjustment of cluster resources while minimizing management overhead.

Conclusion

By integrating CAST AI with Amazon EKS through Terraform, you’ve established an environment where cluster resources scale dynamically, spot instances are used for cost savings, and nodes are rebalanced automatically. This setup automates the balance between cost efficiency and performance for Kubernetes on AWS. With CAST AI’s capabilities in place, you can focus on developing and deploying applications rather than manually managing infrastructure, knowing that the underlying environment remains cost-effective and performant. This guide shows how CAST AI and Terraform simplify Kubernetes operations and help organizations get more from their cloud investment, so DevOps teams can scale with confidence.
