Zero downtime, max efficiency
Minute Media, a leading media and technology company, not only owns massive content websites (90min.com, 12up, The Big Lead, and FanSided, to name a few) but also offers a unique, flagship content creation platform (Voltax) that allows third-party users to create and publish content to hundreds of millions of unique users around the world each month.
From video creation and monetization solutions across different websites and platforms to podcasts and any other virtual content, users can consume, create, share and profit from their creations.
Minute Media uploads around 10k-20k videos and 600 articles per week — and with this much content traffic and bandwidth, they need reliable, up-to-date technologies to store their content and products and keep them readily available to all users.
The need: Migration from Docker Swarm to Kubernetes
Like many other enterprise organizations, Minute Media wanted faster deployments in a modern, fully supported technological environment. The main goal was therefore straightforward: move all of Minute Media's service workloads from Docker Swarm to Kubernetes in order to improve scalability, availability, and deployment times.
The challenges: Smooth migration, avoiding downtime, and meeting the budget
To put things in perspective, it would take several maintenance sessions to complete a project of this scale. However, for Minute Media and its clients, constant access and availability are crucial.
The estimated damage from a single one-hour downtime maintenance session could easily reach hundreds of thousands of dollars, let alone several such sessions.
Besides that, the image and reputation consequences to the websites and platforms could be severe.
The platform could lose regular media consumers, advertisers, and, of course, third-party content creators, who would in turn suffer from their content being unavailable to consumers.
The damage, in this case, is collateral: creators are left with an unstable, negative user experience; some users may abandon the service, while prospective creators will be deterred.
And so, as a robust media company, Minute Media had to avoid downtime at all costs.
Still, the migration had to be done to ensure future support and stability for the different platforms.
How, then, can the company avoid the millions of dollars in financial damages resulting from downtime maintenance?
As a professional DevOps service provider specializing in cloud services, our team at Develeap took the challenge.
While planning the migration strategy for Minute Media, we made sure to keep in mind the following key goals and concerns:
- Make the transition from Docker Swarm to Kubernetes as smooth as possible.
- Make the new system redundant and highly available.
- Keep track of every change in the infrastructure.
- Make it easy to manage and provision cloud resources.
- Provide visibility into the system's state at any point in time, covering both infrastructure status and application logs and metrics.
- Make the infrastructure as secure as possible.
- Complete the project within the budget.
Planning and execution
From planning to execution, our solution consisted of five parts — planning and creating the infrastructure architecture, managing the infrastructure as code, creating a CD process, designing the Kubernetes clusters, and lastly — performing the migration.
1. Planning and creating infrastructure architecture
While architecting a cloud-based infrastructure, we needed to keep in mind a few principles along the way:
- Scalability and availability — The first principle is keeping the system as scalable and available as possible.
Doing so ensures that the service stays available to customers at all times. This is a fundamental principle in general, and particularly for a system that needs to handle hundreds of millions of users each month.
- Security — The second principle is security; at each step toward the goal, we must keep security in mind, make the system as secure as possible, and run security hardening processes to verify that the infrastructure is secure.
- Budget — The last principle is the budget; we need to make sure that every service and resource we provision is actually used, and avoid provisioning unused resources, thereby keeping the budget under control and spending no more than what is needed to keep the service up and running.
At the starting point, Minute Media's system included two main environments, production and QA:
- Production — A Docker Swarm cluster with a fixed size of dozens of instances as the worker nodes.
- QA —
a. A Kubernetes cluster that was built and managed with KOPS (an automated provisioning system to spin up Kubernetes clusters) with a fixed size of 4 instances.
b. A Swarm cluster running on top of 4 additional instances.
In addition to the complexity introduced by the non-homogeneous infrastructure, we had challenges with managing the related networking and load balancing components, which included:
- Manually pre-configured classic load balancers.
- Manually pre-configured target groups for each of the microservices.
- Manually created DNS records for each one of the microservices.
- Both the production and QA infrastructures were running on the same network (VPC).
- Several dozen subnets (both private and public), separated by logical function.
In our new infrastructure architecture we decided to include three main environments:
- Production environment, which contains all the production infrastructure resources.
- QA environment for all our QA resources.
- Administrator environment, which includes all our administration resources and peripherals, such as bastion hosts, VPN, etc.
In order to run such a large-scale environment, we chose AWS managed services wherever possible — the Kubernetes cluster was set up on EKS, RDS is used as the main managed database service, and Lambda, S3, and AWS's security services are employed throughout.
2. Infrastructure as code — Terraform
Our first decision was that all infrastructure would be managed as code. For this, we selected Terraform, a well-established tool for the purpose.
Our Terraform structure is made of modules and environments. Using Terraform’s excellent support for generalization, all the environments can be provisioned using the exact same code while each has its own settings in the tfvars files.
Each environment uses different modules (Kubernetes cluster, computing layer, networking, etc.) with overridden values; for example, the QA environment provisions a Kubernetes cluster with its own values.
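A minimal sketch of this layout, assuming hypothetical module paths and variable names (the real module and variable names are not shown in this write-up). Each environment would call the same modules, with its own values supplied via its tfvars file:

```hcl
# Hypothetical environment file: the same modules are reused per environment,
# with values overridden in each environment's terraform.tfvars.
module "network" {
  source      = "../modules/networking"
  environment = var.environment        # e.g. "qa" or "production"
  cidr_block  = var.vpc_cidr
}

module "kubernetes" {
  source        = "../modules/kubernetes-cluster"
  environment   = var.environment
  cluster_name  = "${var.environment}-eks"
  instance_type = var.node_instance_type  # QA can use smaller nodes
  min_nodes     = var.min_nodes
  max_nodes     = var.max_nodes
}
```

With this structure, provisioning a new environment is mostly a matter of adding a new tfvars file rather than duplicating infrastructure code.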
3. Creating a CD process
Continuous deployment is a must for modern web services and microservice architectures (and not only for them) that want to stay agile and relevant, so the next step was to build a fully automated CD process for the new Kubernetes environments.
Our CI pipeline includes the following phases: checkout, build, test, and finally publish, which pushes the newly created Docker images of our microservices into our private Docker image registry.
Our new CD process implements the ‘Deploy’ phase of the CI/CD pipeline.
In order to achieve that, we developed an internal deployment system (ADS) based on Helm, the Kubernetes package manager, which helped us package, publish, and deploy our microservices into our Kubernetes clusters.
ADS is responsible for deploying services into our different Kubernetes environments, validating them, and exposing the services via different approaches, such as a Kubernetes Ingress object.
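As a hedged illustration of the Ingress-based exposure approach (the service name, namespace, hostname, and port below are placeholders, not Minute Media's actual configuration), a Kubernetes Ingress object for one microservice might look like:

```yaml
# Hypothetical Ingress manifest exposing a single microservice.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-service          # placeholder name
  namespace: production
spec:
  rules:
    - host: example-service.example.com   # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service
                port:
                  number: 80
```

An ingress controller in the cluster then routes external traffic for that hostname to the service, replacing the manually pre-configured load balancers and DNS records of the old setup.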
4. Designing the Kubernetes clusters
As part of designing our Kubernetes clusters, a dedicated namespace was created to host all the monitoring and logging resources.
Our monitoring and logging stack includes Grafana and Prometheus for infrastructure and application metrics, and EFK (Elasticsearch, Fluentd, and Kibana) as our log collection and visualization tools.
These monitoring tools let us observe the system and get a quick status view, helping us diagnose and debug problems easily.
We implemented both HPA (Horizontal Pod Autoscaler) and the cluster autoscaler, which benefited us along three vectors. The first two are scalability and availability: configuring the cluster autoscaler and HPA ensured that our services stay available at all times. The third is budget: configuring the worker nodes as an auto-scaling group, rather than provisioning a fixed number of nodes, let us pay less while still getting more capacity on demand when needed.
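As a hedged illustration of the HPA side of this setup (the service name, namespace, replica counts, and CPU threshold below are placeholders), a HorizontalPodAutoscaler manifest might look like:

```yaml
# Hypothetical HPA: scale a microservice between 2 and 20 replicas
# based on average CPU utilization across its pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-service          # placeholder name
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The HPA adds or removes pods for a single service, while the cluster autoscaler adds or removes worker nodes when the pods no longer fit (or when nodes sit idle), which is where the budget savings come from.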
5. Migrating the workload to the new Kubernetes clusters
After finishing planning, architecting, and provisioning the new infrastructure, we moved on to migrating the QA and production microservices and workloads from Docker Swarm to the newly created Kubernetes clusters.
We had to ensure that this transition was seamless, since the process affects hundreds of millions of users globally.
In order to ensure that the new infrastructure works properly and handles the high traffic we expect it to receive, we created a “shadow” environment in the new Kubernetes cluster and duplicated the incoming production traffic, delivering it to both the old Swarm cluster and the new Kubernetes cluster. At this stage, responses were returned only from the Docker Swarm cluster.
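The write-up does not name the mirroring mechanism used; one common way to duplicate traffic like this is NGINX's mirror directive. The sketch below is an illustrative assumption with placeholder upstream names, not the actual configuration:

```nginx
# Hypothetical request mirroring: each request is copied to the shadow
# (Kubernetes) backend, whose responses are discarded; clients only ever
# see the response from the Swarm backend.
upstream swarm_backend      { server swarm.internal:8080; }
upstream kubernetes_backend { server k8s.internal:8080; }

server {
    listen 80;

    location / {
        mirror /shadow;                    # duplicate each request
        proxy_pass http://swarm_backend;   # the response actually returned
    }

    location = /shadow {
        internal;                          # not reachable from outside
        proxy_pass http://kubernetes_backend$request_uri;
    }
}
```

Mirroring lets the new cluster be load-tested with real production traffic patterns while carrying zero risk for end users.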
After validating that the new system handled the high traffic properly, our strategy was to hot-swap the services one by one from the Docker Swarm cluster to the Kubernetes production cluster. This meant deploying each service to both clusters, Swarm and Kubernetes; after testing and verifying that the service in the Kubernetes cluster was valid, traffic was routed to the instance running on Kubernetes.
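The write-up does not specify how traffic was switched for each service. As one illustrative possibility (hypothetical names and variables throughout), a pair of weighted Route 53 records managed in Terraform could shift a service's traffic from the Swarm load balancer to the Kubernetes one:

```hcl
# Hypothetical cutover: two weighted records for the same name.
# Flipping the weights moves traffic from Swarm to Kubernetes,
# and flipping them back is an instant rollback.
resource "aws_route53_record" "service_swarm" {
  zone_id        = var.zone_id
  name           = "example-service.internal"
  type           = "CNAME"
  ttl            = 60
  records        = [var.swarm_lb_dns]
  set_identifier = "swarm"
  weighted_routing_policy {
    weight = 0     # drained: all traffic now goes to Kubernetes
  }
}

resource "aws_route53_record" "service_kubernetes" {
  zone_id        = var.zone_id
  name           = "example-service.internal"
  type           = "CNAME"
  ttl            = 60
  records        = [var.kubernetes_lb_dns]
  set_identifier = "kubernetes"
  weighted_routing_policy {
    weight = 100
  }
}
```

Because both backends stay deployed during the swap, a problematic service can be routed back to Swarm immediately, which is what makes the per-service hot-swap safe.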
Results & Conclusions
“Thanks to Develeap’s carefully designed architecture and accurate execution, the transition from the Docker Swarm environment to the Kubernetes ecosystem was successful.”
“It has increased our scalability and availability, and reduced our overall cost at the same time. In addition, with the transition, we increased our deployment rate to around 200 deployments a day” (Eddy Kiselman, Minute Media’s VP R&D and Platform).
In conclusion, even though the transition process is still underway, it has already paid off in several ways: performance improved, the deployment process became more intuitive, and the overall cost of our cloud resources decreased. On the security side, the system has passed security hardening processes that verify it is as secure as possible and meets, or exceeds, top industry standards.
The use of open-source projects such as Terraform, Kubernetes, and Helm made the migration process both simpler and more successful.