Build a Resilient Containerized Jenkins CloudWatch Disk Monitor with Terraform

December 02, 2024

Monitoring disk usage is crucial when running Jenkins on an EC2 instance. Disk exhaustion can disrupt build processes and affect the performance of the system. AWS CloudWatch provides robust monitoring, and in this guide, we’ll demonstrate how to set up a CloudWatch alarm to monitor disk usage using Terraform. The alarm can also be extended to monitor memory and CPU usage, ensuring your Jenkins server remains stable.

Terraform Setup for Jenkins Server

We’ll start by defining the infrastructure in Terraform for our containerized Jenkins server. The EC2 instance will be created using the terraform-aws-ec2-instance module. Below is the main.tf that provisions the Jenkins server.



# main.tf
module "ec2_jenkins" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "~> 5.7"
  create  = var.ec2["jenkins"]["create"]

  name                        = var.ec2["jenkins"]["name"]
  ami                         = data.aws_ami.ubuntu.id
  instance_type               = var.ec2["jenkins"]["instance_type"]
  key_name                    = var.ec2["jenkins"]["key_name"]
  monitoring                  = var.ec2["jenkins"]["monitoring"]
  vpc_security_group_ids      = var.ec2["jenkins"]["security_groups_list"]
  subnet_id                   = var.ec2["jenkins"]["subnet_id"]
  associate_public_ip_address = var.ec2["jenkins"]["associate_public_ip_address"]

  ebs_optimized     = true
  root_block_device = var.ec2["jenkins"]["volumes"]["root_volume_create"] ? [var.ec2["jenkins"]["volumes"]["root_volume"]] : []
  ebs_block_device  = var.external_ebs_volumes

  user_data                   = data.cloudinit_config.user_data.rendered
  user_data_replace_on_change = true
  create_iam_instance_profile = true
  iam_role_policies = {
    CloudWatchAgentServerPolicy = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy",
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }
}

variables.tf

# variables.tf
variable "ec2" {
  default = {
    jenkins = {
      create        = true
      name          = "Jenkins-Ubuntu24"
      instance_type = "your_instance_type"
      key_name      = "your_key"

      monitoring                  = true
      security_groups_list        = ["sg-***"] # can also come from a data source
      subnet_id                   = "subnet-*" # can also come from a data source
      associate_public_ip_address = false

      # Root volume configuration
      volumes = {
        root_volume_create = true
        root_volume = {
          device_name           = "/dev/sda1"
          delete_on_termination = false  # Option to change later
          encrypted             = true
          volume_size           = 100
          volume_type           = "gp3"
          iops                  = 3000
          throughput            = 125
        }
      }
    }
  }
}

# External EBS volumes configuration, if needed. For my task, the volume had to be external.
variable "external_ebs_volumes" {
  default = [
    {
      device_name           = "/dev/sdf"
      volume_size           = 3000
      snapshot_id           = "snap-XXX"
      volume_type           = "gp3"
      delete_on_termination = false  # Option to change later
      iops                  = 2000
      throughput            = 150
      tags = {
        MountPoint = "/mnt/ext"
      }
    }
  ]
}

variable "cloudwatch" {
  default = {
    cloudwatch = {
      create              = true
      alarm_name          = "disk_used_percent_Jenkins"
      comparison_operator = "GreaterThanThreshold"
      evaluation_periods  = "1"
      metric_name         = "disk_used_percent"
      namespace           = "CWAgent"
      period              = "60"
      actions_enabled     = true
      unit                = "Percent"
      statistic           = "Average"
      threshold           = "80"
      dimensions_device   = "<your device>"
      dimensions_fstype   = "<your fstype>"

      sns = {
        create                     = true
        aws_sns_topic_subscription = ["rotem.kalman@develeap.com"]
        protocol                   = "email"
      }
    }
  }
}



# Provider.tf

terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.20"
    }
  }
  # This S3 bucket is not created by this code; create it before running terraform init.
  backend "s3" {
    bucket = "terraform-state"
    region = "<your_region>"
    key    = "<your_key>"
  }
}

provider "aws" {
  region = "<your_region>"

  default_tags {
    tags = { # Could also be defined once in local.tags and referenced here.
      Name      = "Jenkins_name"
      Owner     = "rotem.kalman"
      Objective = "Testing"
      Made_by   = "terraform"
    }
  }
  ignore_tags {
    keys = ["if needed"]
  }
}

Setting Up the CloudWatch Alarm

Once Jenkins is set up, we can configure CloudWatch alarms to monitor critical system metrics such as disk usage. The following Terraform code defines an alarm that triggers when disk usage exceeds 80%. This can be extended to monitor other metrics such as memory and CPU.



# cloudwatch.tf
# Resources have no "create" argument, so each one is switched on or off with count.
resource "aws_sns_topic" "topic" {
  count = var.cloudwatch["cloudwatch"]["sns"]["create"] ? 1 : 0
  name  = "${var.ec2["jenkins"]["name"]}-Topic-${module.ec2_jenkins.id}"
}

resource "aws_sns_topic_subscription" "topic_email_subscription" {
  # One subscription per address, and none when the topic is disabled.
  count     = var.cloudwatch["cloudwatch"]["sns"]["create"] ? length(var.cloudwatch["cloudwatch"]["sns"]["aws_sns_topic_subscription"]) : 0
  topic_arn = aws_sns_topic.topic[0].arn
  protocol  = var.cloudwatch["cloudwatch"]["sns"]["protocol"]
  endpoint  = var.cloudwatch["cloudwatch"]["sns"]["aws_sns_topic_subscription"][count.index]
}

resource "aws_cloudwatch_metric_alarm" "ec2_disk_used" {
  count               = var.cloudwatch["cloudwatch"]["create"] ? 1 : 0
  alarm_name          = "${var.cloudwatch["cloudwatch"]["alarm_name"]}-${module.ec2_jenkins.private_ip}"
  comparison_operator = var.cloudwatch["cloudwatch"]["comparison_operator"]
  evaluation_periods  = var.cloudwatch["cloudwatch"]["evaluation_periods"]
  metric_name         = var.cloudwatch["cloudwatch"]["metric_name"]
  namespace           = var.cloudwatch["cloudwatch"]["namespace"]
  period              = var.cloudwatch["cloudwatch"]["period"]
  actions_enabled     = var.cloudwatch["cloudwatch"]["actions_enabled"]
  unit                = var.cloudwatch["cloudwatch"]["unit"]
  statistic           = var.cloudwatch["cloudwatch"]["statistic"]
  threshold           = var.cloudwatch["cloudwatch"]["threshold"]

  # The dimensions must match exactly what the CloudWatch agent publishes.
  dimensions = {
    path   = var.external_ebs_volumes[0].tags["MountPoint"]
    host   = "ip-${replace(module.ec2_jenkins.private_ip, ".", "-")}"
    device = var.cloudwatch["cloudwatch"]["dimensions_device"]
    fstype = var.cloudwatch["cloudwatch"]["dimensions_fstype"]
  }

  alarm_description = <<-EOF
    This alarm monitors the disk usage on the Jenkins server with Instance ID: ${module.ec2_jenkins.id}.
    It will trigger when disk usage exceeds ${var.cloudwatch["cloudwatch"]["threshold"]}%, which could lead to performance degradation.
  EOF

  # Empty when the SNS topic is disabled.
  alarm_actions = aws_sns_topic.topic[*].arn
}

CloudWatch Agent Configuration

For the Jenkins server to send its disk usage and other metrics to CloudWatch, we need to configure the CloudWatch Agent. Below is a sample cw_agent_config.json file that tells the agent to collect disk usage for / and the Jenkins home path, plus available memory. The ${jenkins_path} placeholder is filled in by Terraform's templatefile function in cloudinit.tf.



{
  "agent": {
    "metrics_collection_interval": 10
  },
  "metrics": {
    "metrics_collected": {
      "disk": {
        "resources": ["/", "${jenkins_path}"],
        "measurement": ["disk_used_percent"],
        "ignore_file_system_types": ["sysfs", "devtmpfs"]
      },
      "mem": {
        "measurement": ["mem_available_percent"]
      }
    },
    "aggregation_dimensions": [["InstanceId", "InstanceType"], ["InstanceId"]]
  }
}

data.tf

This file sets up the necessary data sources for the AWS environment, such as the AWS caller identity, region, and Ubuntu AMI, which is used to launch EC2 instances with Jenkins and the CloudWatch agent.


# data.tf
data "aws_caller_identity" "current" {}

data "aws_region" "current" {}

# ami for ubuntu 24
data "aws_ami" "ubuntu" {
  most_recent = true
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"]
    # For Ubuntu 20.04
    # values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
    # For Ubuntu 22.04
    # values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["099720109477"] # Canonical's Ubuntu AMI owner ID
}

cloudinit.tf

This file uses the cloudinit_config data source to generate the user data passed to the EC2 instance on launch. It writes the Dockerfile, the CloudWatch agent configuration, the Compose file, and the two install scripts to the instance, then runs the scripts and starts Jenkins with Docker Compose.


# cloudinit.tf
data "cloudinit_config" "user_data" {
  gzip          = false
  base64_encode = false

  part {
    content_type = "text/cloud-config"
    content = yamlencode({
      # Files rendered by Terraform and written to the instance at first boot.
      write_files = [
        {
          content = templatefile("${path.module}/resources/Dockerfile", {
            jenkins_path = "<your jenkins home path>"
          })
          path        = "/Dockerfile"
          permissions = "0666"
        },
        {
          content = templatefile("${path.module}/resources/cw_agent_config.json", {
            jenkins_path = "<your jenkins home path>"
          })
          path        = "/cw_agent_config.json"
          permissions = "0666"
        },
        {
          content = templatefile("${path.module}/resources/docker-compose.yaml", {
            jenkins_path = "<your jenkins home path>"
          })
          path        = "/docker-compose.yaml"
          permissions = "0777"
        },
        {
          content     = file("${path.module}/resources/install-docker.sh")
          path        = "/install-docker.sh"
          permissions = "0777"
        },
        {
          content     = file("${path.module}/resources/install_cloudwatch_agent.sh")
          path        = "/install_cloudwatch_agent.sh"
          permissions = "0777"
        },
      ]
      # Commands run once, in order, after the files above are in place.
      runcmd = [
        "/install-docker.sh",
        "/install_cloudwatch_agent.sh",
        "cd / && docker compose up --build -d"
      ]
    })
  }
}

install-docker.sh

This script ensures that Docker is installed on the instance. It first checks if Docker is already installed; if not, it installs Docker, sets the necessary permissions, and adds the ubuntu user to the docker group.


#!/bin/bash
# install-docker.sh
echo "Installing Docker"

command_exists() {
    command -v "$@" > /dev/null 2>&1
}

if command_exists "docker"; then
    echo "Docker already exists"
else
    curl -fsSL https://get.docker.com -o get-docker.sh
    chmod +x get-docker.sh
    echo "Starting ./get-docker.sh"
    ./get-docker.sh
fi

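# The Docker packages normally create the docker group; create it only if it is missing.
if ! getent group docker > /dev/null; then
    echo "Command: groupadd docker"
    groupadd docker
fi

# Add the ubuntu user on every run, whether or not the group already existed.
# The new membership applies at the user's next login.
echo "usermod ubuntu"
usermod -aG docker ubuntu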

echo "chmod /var/run/docker.sock"
chmod 666 /var/run/docker.sock

install_cloudwatch_agent.sh

This script installs and configures the CloudWatch agent on the EC2 instance, ensuring that it fetches the configuration from the given cw_agent_config.json file.


#!/bin/bash
# install_cloudwatch_agent.sh
# Update system packages
apt update -y
apt upgrade -y

# Install the CloudWatch agent
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
dpkg -i -E ./amazon-cloudwatch-agent.deb

# Fetch CloudWatch agent configuration from local file
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/cw_agent_config.json -s

# Restart the CloudWatch agent service so it runs with the fetched configuration
systemctl restart amazon-cloudwatch-agent.service

# Make sure the service is enabled so it starts again after a reboot
status=$(systemctl is-enabled amazon-cloudwatch-agent.service 2>/dev/null)

if [[ "$status" != "enabled" ]]; then
    echo "CloudWatch Agent is not enabled. Enabling it..."
    /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/cw_agent_config.json -s
    systemctl enable amazon-cloudwatch-agent.service
else
    echo "CloudWatch Agent is enabled and running."
fi

Dockerfile

This Dockerfile builds a Jenkins image and installs the necessary tools such as Docker, GitHub CLI, jq, and yq for automation.


FROM jenkins/jenkins:2.462.2-lts

ARG user=jenkins
ARG group=jenkins
ARG uid=1000
ARG gid=1000

USER root

RUN apt-get update && \
    apt-get -y install apt-transport-https ca-certificates curl software-properties-common vim iputils-ping unzip wget gnupg zip jq yq

RUN curl -fsSL https://get.docker.com -o get-docker.sh && \
    chmod +x get-docker.sh && ./get-docker.sh

# Install GitHub CLI (the keyrings directory is created first in case the base image lacks it)
RUN mkdir -p -m 755 /etc/apt/keyrings && \
    wget -qO- https://cli.github.com/packages/githubcli-archive-keyring.gpg | tee /etc/apt/keyrings/githubcli-archive-keyring.gpg > /dev/null && \
    chmod go+r /etc/apt/keyrings/githubcli-archive-keyring.gpg && \
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | tee /etc/apt/sources.list.d/github-cli.list > /dev/null && \
    apt update && apt install gh -y

RUN usermod -aG docker jenkins
USER jenkins

docker-compose.yaml

This docker-compose.yaml file sets up Jenkins, binds necessary ports, and mounts the Jenkins home directory and Docker socket.


name: jenkins
services:
  jenkins:
    build:
      context: /
      dockerfile: Dockerfile
    restart: always
    privileged: true
    user: root
    ports:
      - 8080:8080
      - 50000:50000
    container_name: jenkins
    environment:
      - "JAVA_OPTS=-Djenkins.install.runSetupWizard=false"
    volumes:
      - ${jenkins_path}:/var/jenkins_home
      - /var/run/docker.sock:/var/run/docker.sock

Testing:

To test your CloudWatch alarm for the Jenkins server (or any EC2 instance), you can simulate conditions that trigger the alarm or manually set up a test environment to monitor certain metrics. Here are several methods to test your CloudWatch alarm:

1. Simulate Disk Usage Increase

Since the alarm is set for disk usage (disk_used_percent), one way to test it is to artificially increase disk usage and observe if the alarm gets triggered.

Steps:
  • Connect to your EC2 instance (Jenkins server):

      ssh -i /path/to/your/key.pem ubuntu@<EC2-Instance-IP>
      # You can also connect with SSM Session Manager or EC2 Instance Connect

  • Fill up the disk: You can use dd, stress (which may need to be installed first), or fallocate to consume disk space. The alarm only reacts to the filesystem named in its path dimension (the MountPoint tag, /mnt/ext in the variables above), so create the test file on that filesystem.

      sudo dd if=/dev/zero of=/tmp/testfile.img bs=1M count=5000
      stress --hdd 1 --timeout 60s
      fallocate -l 10G /path/to/testfile

  • Each of these commands consumes disk space, which can push disk usage over the threshold (80%). Tests like this help confirm that the alarm works under real-world conditions where disk usage might spike unexpectedly.
  • Monitor Disk Usage: Use df -h to check disk usage.
  • Once the disk usage exceeds the threshold set in the CloudWatch alarm, you should receive a notification through your configured SNS topic (e.g., an email alert).
  • Clean Up: After testing, remove the file to free up disk space.
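
The threshold check the alarm performs can be reproduced locally while you wait for the notification. The following is a minimal sketch, not part of the original setup: disk_used_percent and exceeds_threshold are hypothetical helper names, 80 mirrors the alarm threshold, and / is only an example path, so substitute the mount point your alarm watches. df rounds its percentage, so the value may differ slightly from what the CloudWatch agent reports.

```shell
#!/bin/bash
# Hypothetical helpers for a local sanity check; not part of the original scripts.

# Print the used-space percentage (integer, no % sign) of the filesystem holding a path.
disk_used_percent() {
    # -P forces POSIX single-line output; column 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only when usage is strictly above the threshold,
# mirroring the alarm's GreaterThanThreshold comparison.
exceeds_threshold() {
    local used="$1" threshold="$2"
    [ "$used" -gt "$threshold" ]
}

# Example path (assumption): replace / with the mount point the alarm monitors.
used=$(disk_used_percent /)
if exceeds_threshold "$used" 80; then
    echo "/ is at ${used}%: above the 80% alarm threshold"
else
    echo "/ is at ${used}%: below the 80% alarm threshold"
fi
```

If the script reports a value above the threshold but no notification arrives after a few minutes, the cause is more likely the metric dimensions or the SNS subscription than the disk itself.
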

2. Lower the Alarm Threshold Temporarily

A quick way to test the alarm without changing disk usage is to lower the threshold temporarily.

  • Modify your aws_cloudwatch_metric_alarm resource in Terraform:
    
    
    
    resource "aws_cloudwatch_metric_alarm" "ec2_disk_used" {
      threshold = "10"  # Lower threshold for testing
      # Other alarm settings
    }
    
     
  • Apply the changes: terraform apply

  • This will trigger the alarm almost immediately since your current disk usage is likely already above 10%. Once you’ve confirmed that the alarm works, reset the threshold to its original value (e.g., 80%) and apply the change again.

4. Test with the treat_missing_data Feature

If you’ve configured your alarm with the treat_missing_data parameter (for example, to treat missing data as “Breaching”), you can stop the flow of metrics to CloudWatch, causing the alarm to trigger based on missing data.

Steps:
  • Temporarily stop the CloudWatch agent on the Jenkins EC2 instance to stop sending metrics:
    
    sudo systemctl stop amazon-cloudwatch-agent
    
  • If treat_missing_data is set to "breaching", the alarm should trigger after a few evaluation periods without data, because CloudWatch then treats the missing data points as exceeding the threshold.
  • After testing, start the CloudWatch agent again:
    
    sudo systemctl start amazon-cloudwatch-agent
    
5. Check CloudWatch Logs and Metrics
  • Monitor the disk usage metric in the CloudWatch console:
    • Go to CloudWatch Console > Metrics > CWAgent > Per-Instance Metrics and check the disk_used_percent metric to ensure it’s reporting correctly.
  • You can also view the alarm history under CloudWatch Console > Alarms to see if the alarm was triggered and what actions were taken.
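
If the metric appears in the console but the alarm stays in INSUFFICIENT_DATA, a likely cause is a dimension mismatch, since an alarm only sees data points whose dimensions match its own exactly. The alarm in cloudwatch.tf builds its host dimension from the private IP, so one quick check is to compare that value with the instance's hostname. This is a minimal sketch under two assumptions: the agent reports the hostname as the host dimension, and 10.0.1.25 is a made-up example address. host_dimension_for_ip is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: derive the "host" dimension the alarm expects
# from a private IPv4 address, for example 10.0.1.25 -> ip-10-0-1-25.
host_dimension_for_ip() {
    local ip="$1"
    # Replace every dot with a dash, as the replace() call in cloudwatch.tf does.
    echo "ip-${ip//./-}"
}

# Example address (assumption): use your instance's private IP instead.
expected=$(host_dimension_for_ip "10.0.1.25")
actual=$(uname -n)

echo "host dimension expected by the alarm: $expected"
echo "hostname reported by this machine:    $actual"
```

When the two values differ, compare them with the host value shown next to the metric in the console and adjust the alarm's dimensions to match.
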

6. Verify SNS and Email Notifications

Make sure the SNS topic is correctly configured and that the email notifications arrive as expected when the alarm is triggered.
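
One way to check the notification path without touching the disk is to force the alarm into the ALARM state with the AWS CLI command aws cloudwatch set-alarm-state. The sketch below is illustrative only: alarm_name_for and trigger_test_alarm are hypothetical helper names, the IP address is a made-up example, and the alarm name follows the "<alarm_name>-<private_ip>" pattern used in cloudwatch.tf. It assumes the AWS CLI is installed with credentials that are allowed to set alarm states. The script itself only prints the command to run, so nothing is changed until you call trigger_test_alarm yourself.

```shell
#!/bin/bash
# Sketch only: the alarm name and IP below are example values, not a real deployment.

# cloudwatch.tf names the alarm "<alarm_name>-<private_ip>"; rebuild that name here.
alarm_name_for() {
    echo "$1-$2"
}

# Force the alarm into the ALARM state so its SNS action fires.
# Requires the AWS CLI and permission to call cloudwatch:SetAlarmState.
trigger_test_alarm() {
    aws cloudwatch set-alarm-state \
        --alarm-name "$1" \
        --state-value ALARM \
        --state-reason "Manual test of SNS notification"
}

# Example values (assumptions): default alarm_name from variables.tf plus a sample private IP.
ALARM_NAME=$(alarm_name_for "disk_used_percent_Jenkins" "10.0.1.25")
echo "To test notifications, run: trigger_test_alarm \"$ALARM_NAME\""
```

The forced state is temporary: the alarm returns to its real state at the next evaluation, so no cleanup is needed. Remember that an email subscription only delivers messages after the recipient has confirmed it.
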

Conclusion

By using Terraform and AWS CloudWatch, you can effectively monitor disk usage and other critical metrics on your Jenkins server. This setup will ensure that you’re alerted when disk space runs low, preventing disruptions to your CI/CD pipeline. By adjusting the CloudWatch Agent configuration, this approach can also be extended to monitor CPU and memory usage.
