# Configuring kubriX for High Availability
This document explains how to configure kubriX for high availability (HA).
It outlines how to enable HA and identifies components that are not designed for high availability.
## High Availability vs. Restartability
High Availability (HA) ensures that your service continues to operate during common failure scenarios such as:
- Node drains or crashes
- Rolling updates
- Availability Zone (AZ) outages
With only a single replica, there will always be downtime when that pod is unavailable - for example, during rescheduling, image pulling, initialization, or when the underlying node fails.
However, depending on your service level agreements (SLAs), a single replica might still be sufficient for some components, especially if:
- The service is not required to be continuously available, but only when users actively access it.
- The component performs background or asynchronous processing, where temporary downtime does not impact the user experience.
With these considerations in mind, the following sections describe the HA configuration for the different kubriX capabilities.
## TopologySpreadConstraints and PodDisruptionBudgets

Where possible, the HA setup also adds TopologySpreadConstraints with `topologyKey: kubernetes.io/hostname`, so that pods are spread across nodes and a single node failure cannot take down all replicas. In addition, we add PodDisruptionBudgets so that voluntary disruptions such as node drains keep a minimum number of pods available while replacements are scheduled on other nodes.

If you need zone-aware topologySpreadConstraints, please contact kubriX support.
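As a sketch, a Deployment with such a constraint plus a matching PodDisruptionBudget could look like the following. All names and labels here are illustrative placeholders, not kubriX defaults:

```yaml
# Illustrative only - component name and labels are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-component
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-component
  template:
    metadata:
      labels:
        app: example-component
    spec:
      topologySpreadConstraints:
        # Spread replicas across nodes so one node failure
        # cannot take down all pods of this component.
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example-component
      containers:
        - name: example
          image: example:latest
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-component
spec:
  # Voluntary disruptions (e.g. node drains) must leave
  # at least one pod running.
  minAvailable: 1
  selector:
    matchLabels:
      app: example-component
```

With `whenUnsatisfiable: ScheduleAnyway` the constraint is a soft preference; use `DoNotSchedule` if you would rather leave a pod pending than co-locate replicas on one node.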
## Three or two replicas
There is an excellent blog article explaining why three replicas are better than two: https://sookocheff.com/post/kubernetes/why-three-replicas-are-better-than-two/

Still, to keep the footprint small, we ship values with two replicas for most components out of the box. See the referenced values files for details.
## HA configuration settings
The following tables show which components can be configured for high availability.

You can either extend the `default.valueFiles` list in the target-chart with `values-ha-enabled-prime.yaml`, so that every component gets installed in HA mode, or extend the `valueFiles` list of individual entries in the applications list (`applications[].valueFiles`).
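A minimal sketch of both options in the target-chart values; the surrounding structure is illustrative, so adapt it to your actual target-chart values file:

```yaml
# Option 1: enable HA for every component (illustrative structure).
default:
  valueFiles:
    - values.yaml
    - values-ha-enabled-prime.yaml

# Option 2: enable HA only for selected applications (illustrative).
applications:
  - name: grafana
    valueFiles:
      - values.yaml
      - values-ha-enabled-prime.yaml
```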
## Extend kubriX HA settings
If you want to extend or change the kubriX HA settings, do not change the provided `values-ha-enabled-prime.yaml`; instead, configure them in your overlayed customer-specific values files (e.g. `values-customer.yaml`).
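For example, to raise a replica count beyond the kubriX default, an override in `values-customer.yaml` might look like this. The key path is illustrative and depends on the component's chart, so check the chart's values reference before using it:

```yaml
# values-customer.yaml - illustrative override; verify the actual
# key path in the component's Helm chart before applying.
grafana:
  replicas: 3
```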
### Observe
| Application | Component | HA type | Comments | Docs Link |
|---|---|---|---|---|
| Grafana | Grafana | Active-Active | switch to external DB needed (done implicitly via ha-enabled.yaml) | HA alerting GitHub docs, HA alerting Grafana docs, Grafana HA docs |
| k8s-monitoring | alloy-operator | Active-Passive (Leader-Election) | - | Helm Docs |
| k8s-monitoring | alloy-metrics | Active-Active | - | Grafana Docs |
| k8s-monitoring | kube-state-metrics | Active-Active | as long as you set discoveryType: service (implicitly set via ha-enabled) | Helm Docs |
| k8s-monitoring | alloy-singleton | not supported | otherwise cluster events would be retrieved multiple times | - |
| Loki | every component | Active-Active | `loki.replication_factor` needs to match the number of write replicas | Grafana Docs |
| Mimir | every component with exceptions (see comments) | Active-Active | override-exporter, rollout-operator and alertmanager don't support scale-out; compactor supports it, but is not scaled by default | |
| Tempo | not scaled yet | - | - | - |
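For the Loki constraint above, a hedged sketch of the relevant values; the exact key names vary between Loki chart versions, so verify them against the values reference of your chart before applying:

```yaml
# Illustrative: replication_factor must stay in sync with the
# number of write replicas (key paths depend on chart version).
loki:
  replication_factor: 3
write:
  replicas: 3
```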
### Deliver
| Application | Component | HA type | Comments | Docs Link |
|---|---|---|---|---|
| ArgoCD | redis | Active-Active | switched to redis-ha | |
| ArgoCD | server | Active-Active | ||
| ArgoCD | repo-server | Active-Active | ||
| ArgoCD | dex-server | not supported | not needed when keycloak OIDC is used | |
| ArgoCD | application-controller | not supported | no HA possible, but sharding per cluster for scalability reasons can be implemented if required | |
| ArgoCD | notifications-controller | not supported | ||
| ArgoCD | applicationset-controller | not needed | no real benefit, because applicationset creation is asynchronous anyways | |
| Argo-Rollouts | every component | Active-Active | already upstream default | |
| Kargo | every component with exceptions (see comments) | Active-Active | controller and manager-controller are singletons; switching to distributed configurations can achieve overall scalability | |
| Crossplane | every component (incl. providers) | Active-Passive (Leader-Election) | currently no PodDisruptionBudget implemented. Warning for other Crossplane providers: if leader election is not implemented in the provider or not enabled, the pod will consume 100% CPU! | Crossplane docs |
| KubeVirt | virt-operator | Active-Passive | ||
| KubeVirt | KubeVirt | Active-Active | ||
| KubeVirt | KubeVirt-Manager | not supported | original KubeVirt Deployment spec | |
| KubeVirt | CDI | not needed | Technically it is possible to scale the CDI resources via the CDI CustomResource properties `uploadProxyReplicas`, `apiServerReplicas` and `deploymentReplicas`. However, we currently see no benefit in running multiple replicas. The cdi-operator itself is shipped with one replica out-of-the-box by the original project. | |
### Secure
| Application | Component | HA type | Comments | Docs Link |
|---|---|---|---|---|
| Keycloak | keycloak-operator | not supported | hard-coded to one replica in the provided kubernetes.yaml | |
| Keycloak | kubrix-keycloak | Active-Active | ||
| Vault | every component | Active-Active | ||
| External-Secrets | every component | Active-Passive (Leader-Election) | if you really need an active-active setup, it is still possible, but things get complicated very quickly | |
| Kyverno | every component | Active-Active, some Active-Passive | All Kyverno components can be scaled out. However, some, like the reports-controller and background-controller, are stateful and use leader election (as in external-secrets or crossplane). Within the admission-controller and cleanup-controller, only some functionality, such as certificate and webhook management, uses leader election. | Kyverno Docs |
| Velero | - | not supported | | |
| cert-manager | controller and cainjector | Active-Passive (Leader-Election) | | Certmanager HA Docs |
| cert-manager | webhook | Active-Active | | Certmanager HA Docs |
### Enable
| Application | Component | HA type | Comments | Docs Link |
|---|---|---|---|---|
| Backstage | every component | Active-Active | switch to external DB needed (done implicitly via ha-enabled.yaml) | |
### General
| Application | Component | HA type | Comments | Docs Link |
|---|---|---|---|---|
| Ingress-Nginx | every component | Active-Active | ||
| External-DNS | - | not supported | there are currently some concerns and warnings about running multiple replicas of external-dns, so it is hard-coded to 1 replica | |
| CNPG | every component | Active-Passive (Leader-Election) | | |