Using Kubernetes Probes Correctly
Kubernetes probes are at the heart of how an application behaves, and how it is treated by the cluster, throughout the system's lifetime.
In this article, I want to address the meaning and correct use of the three Kubernetes probes: startupProbe, readinessProbe, and livenessProbe.
Implementing them correctly is usually pretty simple, but developer ignorance often leads to badly implemented probes that adversely affect system stability and performance.
– Always implement both liveness and readiness. More often than not, they should have different logic.
  * Liveness should be lightweight and minimal. Do not check other processes. Do use the initialDelaySeconds field.
  * Readiness should be as minimal as possible BUT must include dependencies. Make sure you set an adequate periodSeconds.
– Implement startup if you have complex startup logic that you do not want to include in your readiness.
Old School Process Orchestration
Well before the days of Kubernetes, multi-process service architectures were known and used. At that time, it was very common for developers to implement a “master service,” often called “Process Manager,” which was tasked to handle correct system startup and (to some extent) ongoing stability.
Typically these services would have a definition of process dependency – some acyclic graph of process names and startup commands – and would focus on starting processes in the right order based on an “isReady” check.
Once a process was ready, the manager would move on to the next process, until the whole system was up.
As soon as a process was up, OTHER mechanisms would regularly poll what was referred to as the “health check” probe. And when a process failed health check for several consecutive calls – all hell would break loose. How can you restart a process correctly? What do you do with its dependents? Should the whole process be re-initiated?
Kubernetes Process Orchestration
Kubernetes took a different stance. Instead of having a pre-defined, centrally managed dependency graph, it puts the responsibility for dependency management on each service. Instead of focusing on “system startup,” it expects every service to assume that the system is always on and consider how it should function.
Here is how it works:
(a) When Kubernetes starts a service, it gives it time to initialize. If needed, it can probe the process until it confirms it has indeed initialized. This is the “startupProbe”.
(b) Once initialized, Kubernetes will regularly check if the process is healthy. If it is not – it will restart it. This is the “livenessProbe”.
(c) A healthy service is periodically checked to see if it is ready to receive traffic. If it is not, then Kubernetes will pass traffic to others. This is the “readinessProbe”.
Readiness and liveness only start after success is returned from the startupProbe, thus they can assume that initialization was successful.
Let’s look at a simple example: a business logic service that depends on a database and a remote foobar service.
During its startup phase, the process sometimes requires database migration. This is a slow process and is unique to the startup phase. Thus, the process implements a “startupProbe” which returns success only after database migration is finished.
This may involve non-trivial logic and queries to various tables.
readinessProbe does not care about the state of migration – it assumes the database is OK as long as it is up. Something like issuing a “SELECT 1;” would do nicely.
And if both the database and the foobar service are up, it will indicate readiness.
Liveness is extremely simple in this example – no checks are needed. If I can answer, reasons the service, then I am surely alive.
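The division of labor in this example can be sketched in a few lines of Python. This is only an illustration under assumptions: the ProbeChecks class, its method names, and the injected migration_done, db_ping and foobar_ping callables are hypothetical, not part of Kubernetes or of any particular framework. In a real service, each method would sit behind whatever endpoint or command the corresponding probe is configured to call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeChecks:
    """Hypothetical holder for the three probe checks of the example service."""

    migration_done: Callable[[], bool]  # slow, startup-only check
    db_ping: Callable[[], bool]         # cheap check, e.g. issuing "SELECT 1;"
    foobar_ping: Callable[[], bool]     # cheap check of the remote foobar service

    def startup(self) -> bool:
        # Succeeds only once the database migration has finished.
        return self.migration_done()

    def readiness(self) -> bool:
        # Ignores migration state; only asks whether both dependencies are up.
        return self.db_ping() and self.foobar_ping()

    def liveness(self) -> bool:
        # No checks at all: being able to answer is the proof of life.
        return True
```

Passing the dependency checks in as callables keeps the sketch independent of any database driver or HTTP client; the only point is which check belongs to which probe.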
Note that given these probes, Kubernetes can start the process and direct traffic only when ready. Should the process experience temporary database or foobar issues – Kubernetes will take it out of the service pool based on the readinessProbe and return it once things are OK.
What if things break further? What if the process WANTS to be re-started?
Well, aside from calling system.exit, a process can always return failure from livenessProbe, and leave it to Kubernetes to restart it after several failures. This is a good practice, as Kubernetes will handle the whole pod re-start.
Let’s take a look at our example from above:
What if the application has a bug which every so often leads to a db deadlock? Or what if it has a leak in the DB connection pool?
From the application's perspective, it looks as if the DB is not working. The readinessProbe will return failure, and the application will be moved out of the service pool – but this will not resolve anything. The correct action is actually to restart the application, but in order for that to happen we need to fail the livenessProbe.
One way to do that is to check the DB as part of the livenessProbe – but that means we are making it slower and less resource efficient. A better alternative is to allow readinessProbe to raise an internal flag (e.g. using a semaphore) when it feels that it is time to restart, and test for that flag in your livenessProbe.
Regardless of the technique, the important point we can learn from this is the functional view of the livenessProbe: anything that a restart can help to fix – should be monitored by liveness.
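A minimal sketch of the flag technique, in Python, might look as follows. Everything named here is an assumption for illustration: the RestartFlag class is hypothetical, the threshold of five consecutive failures is arbitrary, and a threading.Event is used in place of the semaphore mentioned above because a simple set/not-set flag is all that is needed.

```python
import threading


class RestartFlag:
    """Hypothetical helper: readiness raises the flag, liveness only reads it."""

    def __init__(self, max_consecutive_failures: int = 5) -> None:
        self._max = max_consecutive_failures
        self._failures = 0
        self._lock = threading.Lock()
        self._restart_wanted = threading.Event()

    def readiness(self, db_ok: bool) -> bool:
        # Called with the result of the (comparatively expensive) DB check.
        with self._lock:
            self._failures = 0 if db_ok else self._failures + 1
            if self._failures >= self._max:
                # The DB has looked broken for too long: ask for a restart.
                self._restart_wanted.set()
        return db_ok

    def liveness(self) -> bool:
        # Cheap: no I/O, just a flag lookup.
        return not self._restart_wanted.is_set()
```

With this split, the liveness check stays free of I/O, while the decision that "a restart would help" is made where the evidence is already being collected.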
Do’s and Don’ts
livenessProbe should be lightweight, allowing Kubernetes to poll it often.
Never test anything external for liveness – it is just a test to see that your process is up.
If you have situations where you rely on liveness to ask for a re-start, prefer a flag/semaphore that allows for a low-cost check.
If restart can help some failure case, make sure liveness is geared towards testing for that failure.
People often implement readiness and liveness as the same method. This is almost always wrong, as they have completely different semantics: readiness should indicate if the process can handle traffic, but its failure is NOT an indication that the process is unhealthy, just that it needs “a breather” to ready itself.
In your readiness, do check everything that is needed in order for process logic to function correctly, including 3rd party dependencies. On the other hand, do not perform complex tests that are relevant only during the startup phase – there is a startup probe for that.
Like the liveness probe, the readiness probe is called continuously, so it should be lightweight, allowing Kubernetes to poll it often.
Many processes do not need a startupProbe, and if you do not need it – better not implement it. When should you use it? When you have complex startup logic, tests, and calculations that are not relevant once the process finishes initialization.
Unlike the other probes, this probe is not called once it has succeeded and thus can be arbitrarily complex. In practice, we use it as a way to make sure that readinessProbe is kept lightweight – free of tests that are not needed continuously.
And since no article is complete without some snippets of code, let us configure the service we presented earlier:
startupProbe:
  exec:
    command:
      - "check-database-init-done" # Custom command to check latest database migration job finished
  periodSeconds: 10 # Ten seconds are usually enough for my process, so I put it as the period
  failureThreshold: 60 # Allow for a longer initialization time
readinessProbe:
  # ...
  initialDelaySeconds: 10 # Give the microservice time to initialize. Notice it is the same 10 from above.
livenessProbe:
  # ...
  initialDelaySeconds: 5 # Give the microservice time to start
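To get a feel for how much time the startup settings buy: Kubernetes gives up on a container once its startup probe has failed failureThreshold times in a row, so the startup budget is roughly periodSeconds multiplied by failureThreshold, plus any initialDelaySeconds. A tiny sketch of that arithmetic (the function name is hypothetical, and probe timeouts are ignored for simplicity):

```python
def startup_budget_seconds(period_seconds: int, failure_threshold: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets to pass its startup probe."""
    return initial_delay_seconds + period_seconds * failure_threshold


# Values taken from the startupProbe settings above.
print(startup_budget_seconds(period_seconds=10, failure_threshold=60))  # 600
```

In other words, periodSeconds: 10 with failureThreshold: 60 allows about ten minutes for the database migration before the container is restarted.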
What if your system includes services that do not implement the probes?
Maybe they are just behind schedule and will implement them in several sprints. Maybe they are 3rd parties which you do not control. But what can you do?
Use what you have in hand – anything can be used as the probe as long as it aligns with the semantics:
- liveness – repetitive failure indicates I should re-start
- readiness – failure indicates I should not receive traffic
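For example, when a third-party process exposes nothing but a port, "does the port accept connections?" may be the best readiness signal available. Kubernetes can run this kind of check natively with a tcpSocket probe; the Python sketch below only illustrates the same idea for cases where a custom command is preferred. The function name is hypothetical, and the one-second timeout is an arbitrary assumption.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all of these as "not ready".
        return False
```

A command wrapping this function would exit with status 0 on True and non-zero on False, which is the contract an exec-style probe expects.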
Always implement both liveness and readiness, making sure you follow their differing semantics.
Using the same minimal test for both, or worse – not implementing them at all, will lead to a system that accepts traffic it cannot handle and performs badly.
Liveness should be lightweight and minimal. Do not check other processes. Do use the initialDelaySeconds field.
Readiness should be as minimal as possible BUT must include dependencies. Make sure you set an adequate periodSeconds.
Implement startup only when you have complex startup logic that you do not want to include in your readinessProbe.