
Build a Resilient Containerized Jenkins CloudWatch Disk Monitor with Terraform
Monitoring disk usage is crucial when running Jenkins on an EC2 instance. Disk exhaustion can disrupt build processes and affect the performance of the system. AWS CloudWatch provides robust monitoring, and in this guide, we’ll demonstrate how to set up a CloudWatch alarm to monitor disk usage using Terraform. The alarm can also be extended to monitor memory and CPU usage, ensuring your Jenkins server remains stable.
Terraform Setup for Jenkins Server
We’ll start by defining the infrastructure in Terraform for our containerized Jenkins server. The EC2 instance will be created using the terraform-aws-ec2-instance module. Below is the main.tf that provisions the Jenkins server.
# main.tf
module "ec2_jenkins" {
  source = "terraform-aws-modules/ec2-instance/aws"
  version = "~> 5.7"
  create = var.ec2["jenkins"]["create"]
  name = var.ec2["jenkins"]["name"]
  ami = data.aws_ami.ubuntu.id
  instance_type = var.ec2["jenkins"]["instance_type"]
  key_name = var.ec2["jenkins"]["key_name"]
  monitoring = var.ec2["jenkins"]["monitoring"]
  vpc_security_group_ids = var.ec2["jenkins"]["security_groups_list"]
  subnet_id = var.ec2["jenkins"]["subnet_id"]
  associate_public_ip_address = var.ec2["jenkins"]["associate_public_ip_address"]
  ebs_optimized = true
  root_block_device = var.ec2["jenkins"]["volumes"]["root_volume_create"] ? [var.ec2["jenkins"]["volumes"]["root_volume"]] : []
  ebs_block_device = length(var.external_ebs_volumes) > 0 ? var.external_ebs_volumes : []
  user_data = data.cloudinit_config.user_data.rendered
  user_data_replace_on_change = true
  create_iam_instance_profile = true
  # Managed policies that let the instance publish metrics and be reached through SSM
  iam_role_policies = {
    CloudWatchAgentServerPolicy = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy",
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }
}
variables.tf
# variables.tf
variable "ec2" {
  default = {
    jenkins = {
      create = true
      name = "Jenkins-Ubuntu24"
      instance_type = "your_instance_type"
      key_name = "your_key"
      monitoring = true
      security_groups_list = ["sg-***"] # can also come from a data source
      subnet_id = "subnet-*" # can also come from a data source
      associate_public_ip_address = false
      # Root volume configuration
      volumes = {
        root_volume_create = true
        root_volume = {
          device_name = "/dev/sda1"
          delete_on_termination = false # Option to change later
          encrypted = true
          volume_size = 100
          volume_type = "gp3"
          iops = 3000
          throughput = 125
        }
      }
    }
  }
}
# External EBS volumes, if needed. For my task the Jenkins data had to live on an external volume.
variable "external_ebs_volumes" {
  default = [
    {
      device_name = "/dev/sdf"
      volume_size = 3000
      snapshot_id = "snap-XXX"
      volume_type = "gp3"
      delete_on_termination = false # Option to change later
      iops = 3000 # gp3 volumes require at least 3000 IOPS
      throughput = 150
      tags = {
        MountPoint = "/mnt/ext"
      }
    }
  ]
}
variable "cloudwatch" {
  default = {
    cloudwatch = {
      create = true
      alarm_name = "disk_used_percent_Jenkins"
      comparison_operator = "GreaterThanThreshold"
      evaluation_periods = "1"
      metric_name = "disk_used_percent"
      namespace = "CWAgent"
      period = "60"
      actions_enabled = true
      unit = "Percent"
      statistic = "Average"
      threshold = "80"
      dimensions_device = "<your device>"
      dimensions_fstype = "<your fstype>"
      sns = {
        create = true
        aws_sns_topic_subscription = ["rotem.kalman@develeap.com"]
        protocol = "email"
      }
    }
  }
}
# Provider.tf
terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source = "hashicorp/aws"
      version = ">= 5.20"
    }
  }
  # This S3 bucket is not created by this code; create it beforehand.
  backend "s3" {
    bucket = "terraform-state"
    region = "<your_region>"
    key = "<your_key>"
  }
}
provider "aws" {
  region = "<your_region>"
  default_tags {
    tags = { # You can also define local.tags and use it here.
      Name = "Jenkins_name"
      Owner = "rotem.kalman"
      Objective = "Testing"
      Made_by = "terraform"
    }
  }
  ignore_tags {
    keys = ["if needed"]
  }
}
Setting Up the CloudWatch Alarm
Once Jenkins is set up, we can configure CloudWatch alarms to monitor critical system metrics such as disk usage. The following Terraform code defines an alarm that triggers when disk usage exceeds 80%. This can be extended to monitor other metrics such as memory and CPU.
# cloudwatch.tf
# Resources have no "create" argument, so count toggles creation instead.
# The sns settings are nested under var.cloudwatch["cloudwatch"] in variables.tf.
resource "aws_sns_topic" "topic" {
  count = var.cloudwatch["cloudwatch"]["sns"]["create"] ? 1 : 0
  name = "${var.ec2["jenkins"]["name"]}-Topic-${module.ec2_jenkins.id}"
  depends_on = [module.ec2_jenkins]
}
resource "aws_sns_topic_subscription" "topic_email_subscription" {
  # One subscription per address, and none when the topic is disabled.
  count = var.cloudwatch["cloudwatch"]["sns"]["create"] ? length(var.cloudwatch["cloudwatch"]["sns"]["aws_sns_topic_subscription"]) : 0
  topic_arn = aws_sns_topic.topic[0].arn
  protocol = var.cloudwatch["cloudwatch"]["sns"]["protocol"]
  endpoint = var.cloudwatch["cloudwatch"]["sns"]["aws_sns_topic_subscription"][count.index]
}
resource "aws_cloudwatch_metric_alarm" "ec2_disk_used" {
  count = var.cloudwatch["cloudwatch"]["create"] ? 1 : 0
  alarm_name = "${var.cloudwatch["cloudwatch"]["alarm_name"]}-${module.ec2_jenkins.private_ip}"
  comparison_operator = var.cloudwatch["cloudwatch"]["comparison_operator"]
  evaluation_periods = var.cloudwatch["cloudwatch"]["evaluation_periods"]
  metric_name = var.cloudwatch["cloudwatch"]["metric_name"]
  namespace = var.cloudwatch["cloudwatch"]["namespace"]
  period = var.cloudwatch["cloudwatch"]["period"]
  actions_enabled = var.cloudwatch["cloudwatch"]["actions_enabled"]
  unit = var.cloudwatch["cloudwatch"]["unit"]
  statistic = var.cloudwatch["cloudwatch"]["statistic"]
  threshold = var.cloudwatch["cloudwatch"]["threshold"]
  # These dimensions must match exactly what the CloudWatch agent publishes.
  dimensions = {
    path = var.external_ebs_volumes[0].tags["MountPoint"]
    host = "ip-${replace(module.ec2_jenkins.private_ip, ".", "-")}"
    device = var.cloudwatch["cloudwatch"]["dimensions_device"]
    fstype = var.cloudwatch["cloudwatch"]["dimensions_fstype"]
  }
  alarm_description = <<-EOF
    This alarm monitors the disk usage on the Jenkins server with Instance ID: ${module.ec2_jenkins.id}.
    It will trigger when disk usage exceeds 80%, which could lead to performance degradation.
  EOF
  # Empty when the SNS topic is disabled.
  alarm_actions = aws_sns_topic.topic[*].arn
}
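The alarm only receives data when its dimensions match what the agent actually publishes. By default the agent reports host as the instance hostname, which with IP-based naming on EC2 typically looks like ip-10-0-1-23, and that is the value the replace() expression reconstructs. Below is a minimal shell sketch of the same transformation, handy for comparing against the output of hostname on the instance. The function name host_dimension and the sample IP are assumptions made for illustration.

```shell
#!/bin/sh
# host_dimension: hypothetical helper that mirrors the Terraform expression
# "ip-${replace(private_ip, ".", "-")}".
host_dimension() {
  printf 'ip-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# Example with a made-up private IP; compare the result with `hostname` on the instance.
host_dimension 10.0.1.23
```

If the values differ, or if you are unsure what to use for the device and fstype placeholders, running aws cloudwatch list-metrics --namespace CWAgent --metric-name disk_used_percent after the agent has reported at least once should list the exact dimension sets to copy.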
CloudWatch Agent Configuration
For the Jenkins server to send its disk usage and other metrics to CloudWatch, we need to configure the CloudWatch Agent. Below is a sample cw_agent_config.json file used for configuring the CloudWatch Agent to track disk usage, memory, and more.
{
  "agent": {
    "metrics_collection_interval": 10
  },
  "metrics": {
    "metrics_collected": {
      "disk": {
        "resources": ["/", "${jenkins_path}"],
        "measurement": ["disk_used_percent"],
        "ignore_file_system_types": ["sysfs", "devtmpfs"]
      },
      "mem": {
        "measurement": ["mem_available_percent"]
      }
    },
    "aggregation_dimensions": [["InstanceId", "InstanceType"], ["InstanceId"]]
  }
}
data.tf
This file sets up the necessary data sources for the AWS environment, such as the AWS caller identity, region, and Ubuntu AMI, which is used to launch EC2 instances with Jenkins and the CloudWatch agent.
# data.tf
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}
# AMI for Ubuntu 24.04
data "aws_ami" "ubuntu" {
  most_recent = true
  filter {
    name = "name"
    values = ["ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"]
    # For Ubuntu 20.04
    # values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
    # For Ubuntu 22.04
    # values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-*"]
  }
  filter {
    name = "virtualization-type"
    values = ["hvm"]
  }
  owners = ["099720109477"] # Canonical's Ubuntu AMI owner ID
}
cloudinit.tf
This file uses the cloudinit_config data source to generate user data scripts, which will be passed to the EC2 instance on launch. These scripts handle the installation of Docker and the CloudWatch agent, as well as provisioning the necessary configuration files.
data "cloudinit_config" "user_data" {
  gzip = false
  base64_encode = false
  part {
    content_type = "text/cloud-config"
    content = yamlencode({
      # Files rendered by Terraform and written to the instance root on first boot
      write_files = [
        {
          content = templatefile("${path.module}/resources/Dockerfile", {
            jenkins_path = "<your jenkins home path>"
          })
          path = "/Dockerfile"
          permissions = "0666"
        },
        {
          content = templatefile("${path.module}/resources/cw_agent_config.json", {
            jenkins_path = "<your jenkins home path>"
          })
          path = "/cw_agent_config.json"
          permissions = "0666"
        },
        {
          content = templatefile("${path.module}/resources/docker-compose.yaml", {
            jenkins_path = "<your jenkins home path>"
          })
          path = "/docker-compose.yaml"
          permissions = "0777"
        },
        {
          content = file("${path.module}/resources/install-docker.sh")
          path = "/install-docker.sh"
          permissions = "0777"
        },
        {
          content = file("${path.module}/resources/install_cloudwatch_agent.sh")
          path = "/install_cloudwatch_agent.sh"
          permissions = "0777"
        },
      ]
      # Commands run in order after the files above are written
      runcmd = [
        "/install-docker.sh",
        "/install_cloudwatch_agent.sh",
        "cd / && docker compose up --build -d"
      ]
    })
  }
}
install-docker.sh
This script ensures that Docker is installed on the instance. It first checks if Docker is already installed; if not, it installs Docker. It then makes sure the docker group exists, adds the ubuntu user to it, and sets the permissions on the Docker socket.
#!/bin/bash
# install-docker.sh
echo "Installing Docker"
command_exists() {
  command -v "$@" > /dev/null 2>&1
}
if command_exists "docker"; then
  echo "Docker already exists"
else
  curl -fsSL https://get.docker.com -o get-docker.sh
  chmod +x get-docker.sh
  echo "Starting ./get-docker.sh"
  ./get-docker.sh
fi
# Create the docker group only if it is missing.
if ! getent group docker > /dev/null; then
  echo "Command: groupadd docker"
  groupadd docker
fi
# Add the ubuntu user to the docker group even when the group already existed.
# The membership applies at the next login; newgrp is omitted because it starts
# a new shell, which does not help in a non-interactive cloud-init run.
echo "usermod ubuntu"
usermod -aG docker ubuntu
echo "chmod /var/run/docker.sock"
chmod 666 /var/run/docker.sock
install_cloudwatch_agent.sh
This script installs and configures the CloudWatch agent on the EC2 instance, loading its configuration from the cw_agent_config.json file written by cloud-init.
#!/bin/bash
# install_cloudwatch_agent.sh
# Update system packages
apt update -y
apt upgrade -y
# Install the CloudWatch agent
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
dpkg -i -E ./amazon-cloudwatch-agent.deb
# Load the CloudWatch agent configuration from the local file and start the agent (-s)
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/cw_agent_config.json -s
# Restart the service, then make sure it is enabled at boot
systemctl restart amazon-cloudwatch-agent.service
status=$(systemctl is-enabled amazon-cloudwatch-agent.service 2>/dev/null)
if [[ "$status" != "enabled" ]]; then
  echo "CloudWatch Agent is not enabled. Attempting to enable it..."
  /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/cw_agent_config.json -s
  systemctl enable amazon-cloudwatch-agent.service
else
  echo "CloudWatch Agent is enabled and running."
fi
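After the script finishes, it can be useful to confirm that the agent is really running before relying on the alarm. The sketch below assumes that amazon-cloudwatch-agent-ctl -a status prints a JSON document containing a "status" field, as current agent versions do; the helper name agent_is_running is hypothetical.

```shell
#!/bin/sh
# agent_is_running: hypothetical helper. Reads the JSON printed by
# `amazon-cloudwatch-agent-ctl -a status` on stdin and succeeds only
# if the top-level status field reports "running".
agent_is_running() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"running"'
}

# Usage on the instance (install path as used in the script above):
#   sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status \
#     | agent_is_running && echo "agent running" || echo "agent not running"
```

The pattern requires a quote directly before status, so a neighbouring field such as configstatus does not produce a false match.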
Dockerfile
This Dockerfile builds a Jenkins image and installs the necessary tools such as Docker, GitHub CLI, jq, and yq for automation.
FROM jenkins/jenkins:2.462.2-lts
ARG user=jenkins
ARG group=jenkins
ARG uid=1000
ARG gid=1000
USER root
RUN apt-get update && \
    apt-get -y install apt-transport-https ca-certificates curl software-properties-common vim iputils-ping unzip wget gnupg zip jq yq
RUN curl -fsSL https://get.docker.com -o get-docker.sh && \
    chmod +x get-docker.sh && ./get-docker.sh
# Install GitHub CLI
RUN wget -qO- https://cli.github.com/packages/githubcli-archive-keyring.gpg | tee /etc/apt/keyrings/githubcli-archive-keyring.gpg > /dev/null && \
    chmod go+r /etc/apt/keyrings/githubcli-archive-keyring.gpg && \
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | tee /etc/apt/sources.list.d/github-cli.list > /dev/null && \
    apt update && apt install gh -y
RUN usermod -aG docker jenkins
USER jenkins
docker-compose.yaml
This docker-compose.yaml file sets up Jenkins, binds the necessary ports, and mounts the Jenkins home directory and Docker socket.
name: jenkins
services:
  jenkins:
    build:
      context: /
      dockerfile: Dockerfile
    restart: always
    privileged: true
    user: root
    ports:
      - 8080:8080
      - 50000:50000
    container_name: jenkins
    environment:
      - "JAVA_OPTS=-Djenkins.install.runSetupWizard=false"
    volumes:
      - ${jenkins_path}:/var/jenkins_home
      - /var/run/docker.sock:/var/run/docker.sock
Testing:
To test your CloudWatch alarm for the Jenkins server (or any EC2 instance), you can simulate conditions that trigger the alarm or manually set up a test environment to monitor certain metrics. Here are several methods to test your CloudWatch alarm:
1. Simulate Disk Usage Increase
Since the alarm is set for disk usage (disk_used_percent), one way to test it is to artificially increase disk usage and observe whether the alarm is triggered.
Steps:
- Connect to your EC2 instance (Jenkins server):
ssh -i /path/to/your/key.pem ubuntu@<EC2-Instance-IP> # You can also connect with SSM or EC2 Instance Connect
- Fill up the disk: You can use dd, stress, or fallocate to consume disk space. Write the test file to the filesystem that the alarm's path dimension points at; otherwise the monitored metric will not move.
sudo dd if=/dev/zero of=/tmp/testfile.img bs=1M count=5000
stress --hdd 1 --timeout 60s
fallocate -l 10G /path/to/testfile
Any of these commands consumes disk space, potentially pushing disk usage over the threshold (80%). These tests help ensure that the alarm will work under real-world conditions where disk usage might spike unexpectedly.
- Monitor Disk Usage: Use df -h to check disk usage.
- Once the disk usage exceeds the threshold set in the CloudWatch alarm, you should receive a notification through your configured SNS topic (e.g., an email alert).
- Clean Up: After testing, remove the file to free up disk space.
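To avoid guessing how large the test file should be, the required size can be estimated from the filesystem's total and used space. The sketch below is a rough calculation only: the helper name mb_needed is hypothetical, and the figure is approximate because filesystems reserve some blocks, so the percentage shown by df or the agent may differ slightly.

```shell
#!/bin/sh
# mb_needed TOTAL_MB USED_MB TARGET_PERCENT
# Hypothetical helper: prints how many MB must still be written so that
# used/total reaches TARGET_PERCENT (prints 0 if usage is already there).
mb_needed() {
  target_mb=$(( ($1 * $3 + 99) / 100 ))  # round up to a whole MB
  need=$(( target_mb - $2 ))
  if [ "$need" -lt 0 ]; then
    need=0
  fi
  echo "$need"
}

# Example: a 100000 MB filesystem with 50000 MB used and an 80% threshold.
mb_needed 100000 50000 80   # prints 30000
```

On the instance, the two inputs can be read with GNU df, for example df -m --output=size,used <mount point>, and the result passed to fallocate with a little headroom added so that usage ends up above the threshold rather than exactly on it.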
2. Lower the Alarm Threshold Temporarily
A quick way to test the alarm without changing disk usage is to lower the threshold temporarily.
- Modify your aws_cloudwatch_metric_alarm resource in Terraform:
resource "aws_cloudwatch_metric_alarm" "ec2_disk_used" {
  threshold = "10" # Lower threshold for testing
  # Other alarm settings
}
- Apply the changes:
terraform apply
This will trigger the alarm almost immediately if your current disk usage is already above 10%. Once you’ve confirmed that the alarm works, reset the threshold to its original value (e.g., 80%) and apply the change again.
4. Test with the treat_missing_data Feature
If you’ve configured your alarm with the treat_missing_data parameter (for example, to treat missing data as “breaching”), you can stop the flow of metrics to CloudWatch, causing the alarm to trigger based on missing data.
Steps:
- Temporarily stop the CloudWatch agent on the Jenkins EC2 instance to stop sending metrics:
sudo systemctl stop amazon-cloudwatch-agent
- If treat_missing_data is set to "breaching", the alarm should trigger after some time, as it won’t receive metrics data.
- After testing, start the CloudWatch agent again:
sudo systemctl start amazon-cloudwatch-agent
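Because the alarm changes state only "after some time", polling its state can be more convenient than refreshing the console. The following is a minimal sketch that assumes the AWS CLI is installed and has credentials allowed to call cloudwatch:DescribeAlarms; the function names, retry count, and pause are all illustrative choices.

```shell
#!/bin/sh
# alarm_state NAME: prints the alarm's current state
# (OK, ALARM, or INSUFFICIENT_DATA) using the AWS CLI.
alarm_state() {
  aws cloudwatch describe-alarms --alarm-names "$1" \
    --query 'MetricAlarms[0].StateValue' --output text
}

# wait_for_state NAME WANTED [TRIES] [PAUSE_SECONDS]
# Hypothetical helper: polls until the alarm reaches WANTED.
# Returns 0 on success and 1 if the state was not reached in time.
wait_for_state() {
  tries=${3:-20}
  pause=${4:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$(alarm_state "$1")" = "$2" ]; then
      return 0
    fi
    sleep "$pause"
    i=$((i + 1))
  done
  return 1
}

# Usage (the alarm name is a placeholder):
#   wait_for_state "<your alarm name>" ALARM && echo "alarm fired"
```

With the defaults the loop gives up after roughly ten minutes; the same call with OK as the wanted state can confirm recovery once the agent is started again.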
5. Check CloudWatch Logs and Metrics
- Monitor the disk usage metric in the CloudWatch console: go to CloudWatch Console > Metrics > CWAgent > Per-Instance Metrics and check the disk_used_percent metric to ensure it’s reporting correctly.
- You can also view the alarm history under CloudWatch Console > Alarms to see if the alarm was triggered and what actions were taken.
6. Verify SNS and Email Notifications
Make sure the SNS topic is correctly configured and that the email notifications arrive as expected when the alarm is triggered.
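One way to exercise only the notification path, without touching the disk or the threshold, is to force the alarm into the ALARM state for a moment with the AWS CLI's set-alarm-state command. The sketch below is an assumption-laden illustration: the helper name force_alarm and the DRY_RUN switch are hypothetical, and by default it only prints the command so that nothing is changed by accident.

```shell
#!/bin/sh
# force_alarm NAME
# Hypothetical helper: prints (or, with DRY_RUN=0, runs) the AWS CLI call that
# temporarily sets the named alarm to ALARM so its SNS action fires.
force_alarm() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "aws cloudwatch set-alarm-state --alarm-name $1 --state-value ALARM --state-reason manual-test"
  else
    aws cloudwatch set-alarm-state --alarm-name "$1" \
      --state-value ALARM --state-reason "manual-test"
  fi
}

# Dry run with a placeholder name; set DRY_RUN=0 to execute for real.
force_alarm "your-alarm-name"
```

The forced state is temporary: CloudWatch overwrites it at the next evaluation of the metric, so the alarm should return to its real state on its own. Note that email subscriptions must be confirmed from the inbox before SNS delivers anything.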
Conclusion
By using Terraform and AWS CloudWatch, you can effectively monitor disk usage and other critical metrics on your Jenkins server. This setup will ensure that you’re alerted when disk space runs low, preventing disruptions to your CI/CD pipeline. By adjusting the CloudWatch Agent configuration, this approach can also be extended to monitor CPU and memory usage.